Method to restart XcoDiscoveryService on a discovery client?

Feb 13, 2011 at 2:44 AM

I have noticed that an XcoAppSpace configured as a client of an XcoDiscoveryService (e.g. through a config string such as "tcp.port=9001;disco.address=9000") sometimes fails to connect to the discovery server if the server is not running when the client app space is instantiated. The documentation states that the client app space will automatically retry connecting to the server, but this doesn't always seem to work. Is there a way to manually force the client space to "reset" the connection to the discovery worker? I was hoping that if a call to XcoAppSpace.DiscoverWorker<T>() resulted in an XcoDiscoveryException, I could force a connection reset between the discovery client and the discovery server space. Is this possible? Thanks in advance.

Coordinator
Feb 14, 2011 at 2:54 PM

Hi husterk,

Actually the connection retry should happen automatically every time DiscoverWorker is called. I tried it out, and couldn't find any problems.

What could cause problems is when the worker you are trying to reach isn't hosted by the discovery server directly, but by some other space. Let's say there are three spaces: S1 hosts the discovery server, S2 hosts a worker W, and S3 wants to connect to the worker W. If S2 starts up before S1, it cannot connect to the discovery server and cannot announce the presence of the worker W. If S1 then starts up and S3 sends a discovery request, S1 doesn't know of any workers and responds with a discovery exception. Only after S2 has successfully announced its presence to S1 can the worker be discovered (a retry to register at the discovery server happens every 10 seconds). So, unfortunately, S3 can't do anything about that (a restart of the local discovery service wouldn't help).
Could this be your problem?
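
If it is, the only option on the S3 side is to repeat the discovery request until S2 has registered. Below is a minimal sketch of such a retry loop. It is not part of XcoAppSpace: the DiscoveryRetry class, its Retry method, and the parameters are hypothetical names chosen for illustration, and the exception type is left generic so the helper does not depend on the library.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of XcoAppSpace): repeats a discovery call
// until it succeeds or the attempt budget is used up.
public static class DiscoveryRetry
{
    // 'discover' wraps the actual call, e.g. () => space.DiscoverWorker<T>().
    // TException is the exception type to retry on (XcoDiscoveryException
    // in the scenario above); any other exception propagates immediately.
    public static T Retry<T, TException>(Func<T> discover, int maxAttempts, TimeSpan delay)
        where TException : Exception
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException("maxAttempts");

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return discover();
            }
            catch (TException)
            {
                // Out of attempts: rethrow the last discovery failure.
                if (attempt >= maxAttempts)
                    throw;

                // Wait before asking the discovery server again.
                Thread.Sleep(delay);
            }
        }
    }
}
```

Assuming a worker type MyWorker and an app space instance named space (both placeholders), a call could look like DiscoveryRetry.Retry<MyWorker, XcoDiscoveryException>(() => space.DiscoverWorker<MyWorker>(), 6, TimeSpan.FromSeconds(10)). A delay of about 10 seconds matches the registration retry interval mentioned above, so each attempt gives S2 one more chance to have announced itself.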

If that isn't the case, and your worker is actually hosted by the discovery server itself, could you post example code that reproduces the error?

Feb 24, 2011 at 12:55 PM

Is there some type of event or notification that can be captured to detect when the AppSpace announces itself to the discovery service (in order to detect success or failure)? I still seem to be running into problems where the services appear to fail to announce themselves to the discovery server. This is difficult to troubleshoot without any information or feedback from the announcement / discovery process.

For example, suppose I have the following services:

  S1) Hosts the discovery server and W1.

  S2) Hosts W2.

If I start S1 and then S2, W2 successfully discovers W1. If I then shut down S2 properly (ensuring that StopWorker() is called for W2) and restart it, W2 successfully rediscovers W1. However, if I stop S2 without calling StopWorker() for W2, W2 can never reconnect to W1. How can I ensure that I will be able to reconnect? Is it feasible to call StopWorker() each time I start a new service, to ensure that the previous instance was properly shut down before attempting to reconnect a worker?

Feb 24, 2011 at 1:17 PM

I finally found some answers as to what is going on. It looks like the call to RunWorker() sometimes fails if the worker had previously been run and not shut down properly. I receive a System.ArgumentException with the message "An item with the same key has already been added." on line 51 in CausalityContextExtensions.cs. Basically, you are trying to add a causality to the dispatcher, but the dispatcher thinks that the causality already exists. I am going to see if there is a way to recover from this, but it looks like it may be an issue that you will need to tackle.

Feb 24, 2011 at 1:33 PM

Update: this issue causes the XCoordination code to enter an infinite loop (which explains the behavior I have been seeing). The loop occurs in the StartReceive() method on line 141 in TCPServer.cs. Basically, the call to messageReceived.MessageReceived(msg, remoteAddress, commService) attempts to add the causality, which throws the ArgumentException, which is then caught in MessageTransmitter.cs. As a result, messageReceived.MessageReceived() keeps getting called over and over again within the loop. I am going to try handling the ArgumentException to see if this gets around the issue.

Feb 24, 2011 at 2:06 PM

Yep... that was the issue. I added a try/catch block (catching ArgumentException) around the calls to Dispatcher.AddCausality() and Dispatcher.RemoveCausality(), so that if a failure occurs because the causality has already been added or removed, the code continues on as normal. Will you please look into this to make sure the issue is not occurring anywhere else, and that my changes are acceptable in your workflow for processing workers? Thanks!
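
For anyone following along, here is a minimal sketch of the pattern, not the actual change. The message "An item with the same key has already been added." is what Dictionary<TKey, TValue>.Add produces for a duplicate key, so the sketch uses a plain dictionary as a stand-in; whether the CCR Dispatcher keeps its causalities in a dictionary internally is an assumption on my part. The TolerantRegistry class and its method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in only: a plain dictionary plays the role of the
// dispatcher's causality table (an assumption, not the CCR implementation).
public static class TolerantRegistry
{
    // Adds the entry and returns true; returns false instead of failing
    // when the key is already present.
    public static bool AddTolerant<TKey, TValue>(
        IDictionary<TKey, TValue> items, TKey key, TValue value)
    {
        try
        {
            items.Add(key, value); // throws ArgumentException on a duplicate key
            return true;
        }
        catch (ArgumentException)
        {
            // Caveat: ArgumentNullException derives from ArgumentException,
            // so a null key is silently swallowed here as well.
            return false;
        }
    }

    // Remove already reports a missing key by returning false rather than
    // throwing, so no try/catch is needed for the dictionary case.
    public static bool RemoveTolerant<TKey, TValue>(
        IDictionary<TKey, TValue> items, TKey key)
    {
        return items.Remove(key);
    }
}
```

With this guard, a second add for the same key leaves the first entry in place and simply reports that nothing was added. The trade-off is that a genuine double registration is hidden as well, so logging the false case is probably worthwhile.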

Coordinator
Feb 24, 2011 at 4:29 PM

Hi husterk,

Thanks for pointing this out. I fear the issue is more complicated though, as I have seen the error occurring with Dispatcher.AddCausality before. It seems to be related to a bug in the CCR, which has been around for some time now but unfortunately still hasn't been fixed (see here: http://social.msdn.microsoft.com/Forums/eu/roboticsccr/thread/386e224d-954b-4e06-a4d0-3e1b5337ceaf).

I don't know how it can happen at the place where you encountered it, though (at this position the Dispatcher should always have been cleared of any causalities), but it is certainly possible. I'll see if I can somehow prevent this error from happening there.

Coordinator
Feb 24, 2011 at 5:20 PM

I made a small change in the CausalityContextExtensions class (similar to yours), which should prevent the CCR bug from happening, but still allow causalities to work correctly. Please let me know if it helps.

Feb 24, 2011 at 5:42 PM

Thanks Thomass! I will take a look at your changes and let you know how it goes. It sounds like causalities are overly complex animals. I found some other discussions that mention requiring the user to remove the causality from the same thread that it was created on. I didn't realize this when I first started working with them, but I guess it makes sense. I have also noticed that the PostWithCausality() and SendWithCausality() methods will sometimes just hang (i.e. fail to return) when called repeatedly. My guess is that this is related to the link you provided.