As mentioned last time, there is another behavior that may be a surprise (it certainly was to me when I discovered it). Consider the following minimal TCP server code:

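Something along these lines (a minimal sketch, not the exact original – the rcv name comes from later in the post, everything else is illustrative): accept a connection, start an asynchronous read, and in the callback lock on the type, copy the buffer, re-arm the read and only then print what came in.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

class Server
{
    static NetworkStream stream;
    static readonly byte[] buffer = new byte[1024];

    static void Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 11000);
        listener.Start();
        var client = listener.AcceptTcpClient();
        stream = client.GetStream();

        // Kick off the first asynchronous read.
        stream.BeginRead(buffer, 0, buffer.Length, rcv, null);
        Console.ReadLine();
    }

    static void rcv(IAsyncResult ar)
    {
        lock (typeof(Server))
        {
            int read = stream.EndRead(ar);
            if (read == 0) return; // connection closed

            // Copy the data out of the shared buffer...
            var copy = new byte[read];
            Array.Copy(buffer, copy, read);

            // ...re-arm the read before processing...
            stream.BeginRead(buffer, 0, buffer.Length, rcv, null);

            // ...and only then print what was received (illustrative formatting).
            Console.Write(copy[0] + " ");
        }
    }
}
```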
It's as simple as it gets and seems quite correct. The lock on the type is overkill, but it should do the trick of making sure our buffer stays consistent.

The client code just sends single bytes and flushes the stream after every write (200 single-byte messages).
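Presumably something like this (a sketch – the host and port are made up to match the server above):

```csharp
using System.Net.Sockets;

class Client
{
    static void Main()
    {
        using (var client = new TcpClient("localhost", 11000))
        using (var stream = client.GetStream())
        {
            // 200 messages, one byte each, flushing after every write.
            for (byte i = 0; i < 200; i++)
            {
                stream.Write(new[] { i }, 0, 1);
                stream.Flush();
            }
        }
    }
}
```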
What is the output? Here it is (a relevant part of it at least):

Notice lines 3 and 4 – the data is printed in reversed order. How come?!

Let’s look again at the server code:

Console.Write is protected by the lock, so it is guaranteed that only one thread will call it at a time, and the buffer has already been copied – what can be wrong?

The only thing that may raise an eyebrow is the call to BeginRead happening before the processing of the data read from the buffer. It cannot interfere with the data, though, as there is a lock and our data is already in a copied byte array. “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” The only remaining option is that the thread doing the out-of-order write is the same one that was asynchronously called to process the incoming data; instead of BeginRead scheduling the callback and returning, it actually invokes the async callback synchronously. The MSDN documentation fails to mention that, but that's exactly what happens. We can verify it easily by adding the following piece of code:

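In essence, it logs the current thread at the top of the rcv callback – a sketch (the CompletedSynchronously line is an extra, not necessarily part of the original):

```csharp
static void rcv(IAsyncResult ar)
{
    // Which thread is running this callback, and did the read complete synchronously?
    Console.WriteLine("rcv on thread {0} (completed synchronously: {1})",
        System.Threading.Thread.CurrentThread.ManagedThreadId,
        ar.CompletedSynchronously);

    // ... the rest of the callback stays as before ...
}
```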
and here is the relevant output:

Notice lines 1 and 8 – it's the same thread calling the rcv method twice. Since locks in C# are reentrant, the lock does not stop the nested EndReceive from executing (if locks weren't reentrant, this would be a self-deadlock).
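Lock reentrancy itself is easy to demonstrate in isolation – a minimal, self-contained example (nothing here is from the original code):

```csharp
using System;

class ReentrancyDemo
{
    static readonly object sync = new object();

    static void Main()
    {
        lock (sync)
        {
            Console.WriteLine("outer scope acquired the lock");
            Inner(); // the same thread takes the same lock again - no deadlock
        }
    }

    static void Inner()
    {
        lock (sync) // reentrant: the Monitor just bumps a recursion count
        {
            Console.WriteLine("inner scope acquired it, too");
        }
    }
}
```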

In the software where I observed it, it didn't happen a lot – the bigger the load, the more likely it was to surface. When it did happen, though, the recursion depth was around 5-6. The reason is that under the hood an OverlappedAsyncResult is used, which can detect that another async IO is already in progress and piggyback on it.

Hope you'll find this useful – I learned it the hard way.

Here's a scenario – two applications talking to each other over a proprietary TCP binary protocol (I work in a corporation, remember?). Both written in C#. A somewhat atypical client-server setup with a pub-sub mechanism: one side subscribes for events, and then the other initiates the connection and delivers them. If there are no events for some time, the connection is torn down. Every couple of weeks, the application delivering the events hangs – the thread count rises above 500 and no events are sent to the receiver.

This is the situation as it was described to me at some point. I knew one of the applications (and knew it was not written well), so I started digging.

One word about reproducing the problem – while rare, we noticed that a certain pattern of events seemed to trigger it, so after making sure there was nothing special about the events themselves, only about the pattern, we managed to reproduce it much more often (every couple of hours).

The first thing was to use my all-time favourite tool, WinDBG. Often unappreciated, always underestimated, usually little known.

and oddly enough, the clrstack of the sender thread consistently shows something like this:

How can that be? The events are small packets, the network is healthy, etc. Some googling reveals when such a thing can happen – the other side keeps the connection open but doesn't read. This explains the thread count (there is a lock around the write, and the thread pool that receives data and generates the events is set up with a 500 limit), but not the behavior of the whole system.
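That scenario is easy to reproduce in isolation (a standalone sketch, unrelated to the applications in question): a receiver that accepts the connection but never reads, and a sender that keeps writing until both the peer's receive buffer and its own send buffer fill up, at which point Write blocks for good.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;

class BlockedWriteDemo
{
    static void Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 12000);
        listener.Start();

        // "Receiver": accepts the connection and then never reads from it.
        new Thread(() =>
        {
            TcpClient idle = listener.AcceptTcpClient();
            Thread.Sleep(Timeout.Infinite);
            GC.KeepAlive(idle);
        }) { IsBackground = true }.Start();

        var sender = new TcpClient("localhost", 12000);
        var stream = sender.GetStream();
        var chunk = new byte[64 * 1024];
        for (int i = 0; ; i++)
        {
            Console.WriteLine("writing chunk {0}", i);
            stream.Write(chunk, 0, chunk.Length); // eventually blocks here forever
        }
    }
}
```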

The system works fine most of the time (two nines or something ;), so why does the write (or rather read) problem occur so infrequently? We examined the reading code multiple times and it looks good! Some two gallons of coffee later, an epiphany about the nature of the traffic pattern descends on us – it's disconnecting every now and then, and the problem always surfaces about the same number of events (a couple hundred) after the disconnection. The problem has to be in the connection acceptance.

We later found out that it indeed was. There was a race condition between the callback of the network stream's BeginRead and the assignment of the delegates that were supposed to handle the received data. Most of the time everything worked fine, but every now and then, with unlucky thread interleaving, the subsequent BeginRead was not called. What happens in such a case? The connection is kept alive and healthy, and the OS keeps receiving the data into its own buffer (hence the couple of hundred messages that were successfully sent from the event-generating application), but the moment that buffer is filled to the brim, it asks the sender to wait (one can actually see the zero window size reported in Wireshark) and the sender blocks in the Write call.
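The buggy pattern boiled down to something like this sketch (all names are illustrative, not the real code): the read is armed before the handler delegate is assigned, and when the callback wins the race it gives up without issuing the next BeginRead.

```csharp
using System;
using System.Net.Sockets;

// Sketch of the flawed connection-acceptance path.
class Connection
{
    private readonly NetworkStream stream;
    private readonly byte[] buffer = new byte[4096];
    private Action<byte[], int> onData; // assigned only after the read is started

    public Connection(TcpClient client)
    {
        stream = client.GetStream();
        // Bug: the read is armed before onData is wired up...
        stream.BeginRead(buffer, 0, buffer.Length, OnRead, null);
    }

    public void Subscribe(Action<byte[], int> handler)
    {
        onData = handler; // ...so this assignment races with OnRead.
    }

    private void OnRead(IAsyncResult ar)
    {
        int read = stream.EndRead(ar);
        if (read == 0 || onData == null)
            return; // unlucky interleaving: no handler yet, and no further BeginRead is ever issued

        onData(buffer, read);
        stream.BeginRead(buffer, 0, buffer.Length, OnRead, null);
    }
}
```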

TcpClient (or rather the NetworkStream behind it) actually lets you specify a send timeout – if your Write isn't returning after, say, 5 seconds, there most likely is a problem underneath.
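A sketch of what that looks like (the 5-second value and the endpoint are arbitrary):

```csharp
using System;
using System.IO;
using System.Net.Sockets;

class TimeoutDemo
{
    static void Main()
    {
        var client = new TcpClient("localhost", 12000);
        var stream = client.GetStream();
        stream.WriteTimeout = 5000; // sets the underlying socket's send timeout

        var payload = new byte[64 * 1024];
        try
        {
            stream.Write(payload, 0, payload.Length);
        }
        catch (IOException ex)
        {
            // A write that cannot complete within 5 seconds lands here -
            // a strong hint that the other side has stopped reading.
            Console.WriteLine("write timed out: " + ex.Message);
        }
    }
}
```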

Tcp BeginRead and EndRead have yet another quirky behavior, which manifests in a really weird way, but that's a topic for a separate post.