Everyone (ok, nearly everyone or this wouldn’t happen) knows that if you have a multithreaded application and different threads are accessing shared resources (and some of them do writes), you have to synchronize the access somehow.

Microsoft introduced a neat set of Concurrent Collections in the framework to help with that, but before ConcurrentDictionary there were plain old Dictionary. Some time ago I came across a piece of software that was crashing due to lack of proper synchronization (unfortunately this still happens a lot) and it got me thinking, what is the worst that could happen. So i started experimenting with multiple threads and a Dictionary to see in how many ways can it break. What i found surprised me. Outside of Duplicate Key exception and key not found exception which one may argue are not entirely tied to lack of proper synchronization there are two that stood out.

First one is rather trivial (it’s actually the exception I’ve observed in the aforementioned piece of software, so no surprise there):

Collection is being resized, and we’re still trying to do some operations on it.

The second one is more interesting. At some point my small testing program just didn’t end. I looked at the task manager and this is what I’ve seen:

Inline image 1

25% CPU utilization – most likely one thread going crazy. I connected through WinDBG and here’s some things I’ve seen:

So there is one offending thread – let’s see what it’s doing:

So it seems it’s stuck on inserting – actually inside framework Dictionary class – spinning!

Let’s find this dictionary and see if it’s corrupted:

This is my dictionary, it was declared as Dictionary<long,int>. There is four of them, but that’s because the program was running in a loop creating dictionaries and trying to break them. Remaining three just have not yet been garbage collected as can be observed below

Let’s see if that indeed is the case:

And surely that’s the only one with any gc root at all.

Before we look inside, a word about the structure of Dictionary – there are two important structures inside –

  • int[] buckets – it holds the index of the entries table that contains the first Entry for the hashCode that maps to that bucket
  • Entry[] entries – it contains all the entries in a linked-list fashion (Entry class has a next field pointing to the next Entry)

So the way this works is that if you put object A, a hashCode is calculated for it, then a bucket index is selected by calculating hashCode % buckets.length and the int in the buckets table for that index is pointing at the first entry in that ‘bucket’ that is contained in the entries table. If there is none, yours is the first. If there are some, you iterate the whole chain, and insert after the last one, updating the ‘next’ pointer of the previously last element.

Knowing that, we can look at the structure of my Dictionary:

Ok – just 10 elements inside the Dictionary, let’s see what the Entries look like:

 

For brevity I include only two relevant entries (actually only one is relevant). Entry number six – the next entry is – number seven (so far so good). Entry number seven – the next is … number seven ! Aha! That’s really bad. Concurent inserting (and deleting) without any synchronization has corrupted the Dictionary so that in the Entries linklist one of the entries is pointing at itself!

I would imagine Dictionary code for inserting would look something like this:

With such a corrupted structure this Insertion would then fall into infinite loop spinning on next.next.next… (which is what I could observe)

If you know any other way this could go wrong, send me an email or leave a comment. I’m sure there has to be some other ill behavior resulting from lack of proper synchronization.

Save early, save often. Sounds almost like a gospel, and having experienced various software crashes I usually make sure that I am saving often and making backups and so on. The wordpress blog editor is a little different though. Being used to the fact that in a browser CTRL+S saves the HTML document of the page rather than anything else I didn’t use it (Actually in wordpress blog post editor CRTL+S behaves as you would expect it to – saves draft).

So after writing for an hour while working on previous post and jumping between Eclipse and chrome to report on test results I clicked ‘Distraction Free Writing mode’ which basically is a fullscreen mode. Instead of turning on the fullscreen mode, the editor blinked and the edited text disappeared. CRTL+Z didn’t work, there was nothing left from the post in the page source. A brief look into the database showed that there is nothing there for this post. I was doomed – an hour of work gone.

Then I thought – sure, it disappeared, but it has to be somewhere in process memory still (some cache, not yet cleared). So I turned on the WinDbg, connected to the chrome process (chrome runs the tabs in separate processes, and it has a task manager which allows you to see which process corresponds to which tab). Thankfully I remembered some parts of what I was writing so I searched for “ArrayList implementation”:

0:007>s -a 0 L?80000000 “ArrayList implementation”

Search for ascii string (-a) in addresses starting from o for 80000000 bytes (L signifies length) matching  “ArrayList implementation”. And surely enough I got lots of results (~20 of them).

All that was left to do was to trace back to the beginning of the text (“ArrayList implementation” was somewhere in the middle) and dump that memory:

0:007> da 08bdbd10 L40000

0a53d418 “I’ve always thought that Java is”
0a53d438 ” in the wrong by not allowing ge”
0a53d458 “neric collections of primitive t”

I actually managed to recover all of the text thus saving myself an hour of work and 5 tons of frustration. After that I’ve googled on how to turn the auto-save and revisions.

BTW. If you know an easier way to do this – drop me a line, I’ll update the post to include the easiest method.

 

EDIT 1:

Jack Z suggested T-Search which is really a cheating engine for games (one that allows you to freeze you health, ammo or whatever) – It’s got a nice UI and would probably be easier to use and faster than using the WinDbg. It’s got a memory browser with a ASCII search function.

 

Here’s a scenario – two applications talking to each other through TCP binary protocol (proprietary – I work in a corporation, remember?). Both written in C#. A somewhat atypical client server with a pub-sub mechanism. One subscribes for events, and then the other initiates the connection and delivers the events. If there are no events for some time, the connection is torn down. Every once in a couple of weeks, the application delivering the events hangs. The thread count rises to > 500 and there are no events sent to the receiver.

This is the situation as was described to me at some point. I knew one of the applications (and knew it was not written well) so I started digging.

One word about problem reproduction – white rare, we noticed that certain pattern of events seemed to trigger it, so after ensuring there is nothing special about the events themselves, but the pattern, we managed to reproduce it way more often (every couple of hours).

First thing was to use my all time favourite tool WinDBG. Often unappreciated, always underestimated, usually little known.

and oddly enough, the clrstack of sender thread shows constantly something like this:

How can that be ? The events are small packets, network is healthy, etc. Some googling reveals the answer when such thing can happen – the other side keeps the connection open, but doesn’t read. This explains the thread count (there is a lock around the write and the thread pool is receiving some data and generating events is being set up with a 500 limit), but not the behavior of the whole system.

The system works fine most of the time (two nines or something ;), why can the write (or rather read)  problem occur so infrequently, we examined the reading code multiple times and it looks good! Some two gallons of coffee later, an epiphany about the nature of the traffic patter descents on us –  it’s disconnecting every now and then, and the problem surfaces always about the same number of events (couple hundred) after the disconnection. The problem has got to be in the connection acceptance.

We later found out that it indeed was. There was a race condition between the callback of network stream’s BeginRead and the actual assignment of some delegates that were supposed to handle the received data. Most of the time everything worked fine, but every now and then with unlucky thread interleaving the subsequent BeginRead was not called. What happens in such cases ? Connection is kept alive and healthy, the OS is actually receiving the data into its own buffer (hence the couple of hundred messages that were successfully sent from the event generating application), but the moment this buffer is filled to the brim, it asks the sender to wait (one can actually see the zero window size reported in Wireshark) and the sender blocks the Write call.

The TcpClient’s Write call actually has a version which allows you to provide a timeout – if your call isn’t returning after say 5 seconds, there most likely is a problem underneath.

Tcp BeginRead and EndRead have yet another quirky behavior which manifests in really weird way, but that’s a topic for a separate post