I’m back after some time spent on hard work, overtime, double shifts and all that, since my daytime project was heading for a rough landing. On the up side, it gave me a lot to think and write about. Here’s the first topic – queue implementations and thread sleeps.

First of all, what motivated this – quite often (too often) I see a piece of code that looks like this:
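(A sketch of the shape of it – WorkItem, Process, the queue and the stop flag are placeholders for whatever the real code uses.)

```csharp
// Busy-polling a shared queue - the loop never blocks, it just keeps checking.
private static readonly Queue<WorkItem> queue = new Queue<WorkItem>();
private static readonly object sync = new object();
private static volatile bool stop;

static void ConsumerLoop()
{
    while (!stop)
    {
        WorkItem item = null;
        lock (sync)
        {
            if (queue.Count > 0)
                item = queue.Dequeue();
        }

        if (item != null)
            Process(item);
        // queue empty: loop straight back and check again - this is the spin;
        // the frequent 'fix' is a Thread.Yield() or Thread.Sleep(0) right here
    }
}
```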

Where the queue is shared among producers and consumers. The problem with this code is that if you run it, it’s going to spin if the queue is more often empty than not (and it most likely is, unless you have enormous amounts of data flowing through it). A very frequent modification of this ‘pattern’ is one with a Thread.Yield() or Thread.Sleep(0) (they’re equivalent) at the bottom of the loop.

I used to think that it’s not all that bad – the Yield would give back the scheduled time slice without spinning much, the only waste would be on context switching, and it wouldn’t consume that much CPU. Wrong! It does perform slightly better under overall heavy CPU load, but in ‘idle’ mode it still consumes a whole CPU core (with a single consumer).

Another variation of this is something like Thread.Sleep(1) or Thread.Sleep(5). While it helps in idle mode, it has a couple of problems. It still does a lot of completely unnecessary context switching, and it also delays your processing (the bigger the sleep value, the larger the delay and the less context switching).

So how can we do better? I used to think that some form of Monitor.Wait and Monitor.Pulse/PulseAll was a good solution, and while it works fine, it’s usually a bit difficult to get completely right and the consequences of not getting it right are dire. If you miss one Pulse or mishandle some exception, you can stop processing or wait while there are messages in the queue, or a whole bunch of other unwanted stuff can happen.
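For reference, a minimal sketch of that hand-rolled pattern (WorkItem is a placeholder again) – note how every Enqueue has to remember to Pulse:

```csharp
// Hand-rolled blocking queue built on Monitor.
private readonly Queue<WorkItem> queue = new Queue<WorkItem>();
private readonly object sync = new object();

public void Enqueue(WorkItem item)
{
    lock (sync)
    {
        queue.Enqueue(item);
        Monitor.Pulse(sync);   // forget this once and a consumer may wait forever
    }
}

public WorkItem Dequeue()
{
    lock (sync)
    {
        while (queue.Count == 0)   // always re-check the condition after waking up
            Monitor.Wait(sync);
        return queue.Dequeue();
    }
}
```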

Happily, .NET 4.0 introduced BlockingCollection&lt;T&gt;. This code:
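(A minimal sketch – WorkItem and Process are, once again, placeholders.)

```csharp
var collection = new BlockingCollection<WorkItem>();

// consumer side - blocks when the collection is empty, no spinning
foreach (var item in collection.GetConsumingEnumerable())
{
    Process(item);
}
```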

can completely replace what we’ve had above (minus the cancellation, we’ll get there in a moment).

The foreach loop will block until there are elements in the collection and will consume them in a nice thread-safe manner. When a producer is done adding, it can call collection.CompleteAdding() and the loop will end for all consumers.

The BlockingCollection contains a bunch of static methods that allow different behaviors when dealing with multiple collections at once (AddToAny/TakeFromAny) and that can also take cancellation tokens and bail out nicely if you decide to quit processing. One word of caution – it’s not guaranteed to honor any priorities: if you have two blocking collections and you do TakeFromAny, you’ll most likely receive elements from the first one (if there are any) and then from the second one, but that’s not guaranteed behavior (at least I have not found it documented anywhere). In .NET 4 and 4.5 ILSpy can tell us that it will always behave this way, but in 5.0+ – who knows.
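For illustration, a sketch of taking from two collections with TakeFromAny and a cancellation token (WorkItem and Process are placeholders):

```csharp
var high = new BlockingCollection<WorkItem>();
var low = new BlockingCollection<WorkItem>();
var collections = new[] { high, low };
var cts = new CancellationTokenSource();

try
{
    while (true)
    {
        WorkItem item;
        // Returns the index of the collection the item was taken from.
        int index = BlockingCollection<WorkItem>.TakeFromAny(collections, out item, cts.Token);
        Process(item);
    }
}
catch (OperationCanceledException)
{
    // cts.Cancel() was called - bail out of processing
}
```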

One last thing. This behavior brings to mind another cool library for .NET – Reactive Extensions (parts of it – some interfaces – are actually part of the framework). In short, it’s a way of treating your collections (or any data sources) as a sort of ‘data available’ event producers (and more). It’s pretty robust, and it plays nicely with LINQ. Check it out here: http://msdn.microsoft.com/en-us/data/gg577609.aspx.

 

Some time ago, when working on some performance-critical serialization code for a C# project, I was looking deeply into all flavors of available serialization benchmarks. One thing that struck me at the time was that there weren’t that many of them. The most comprehensive one I found was this: http://www.servicestack.net/benchmarks/NorthwindDatabaseRowsSerialization.100000-times.2010-08-17.html Published by ServiceStack, so maybe a little biased, but pretty neat nonetheless.

The tests published were pretty much in line with what I had observed – protobuf was a clear winner, but if you didn’t want to (or couldn’t) use open source, DataContractSerializer was the best available to you.

After some more testing, I started finding some discrepancies between this report and my tests, and they were pretty big – it seemed that DataContractSerializer was better than reported, and the whole order was sometimes shuffled. It all boiled down to two major differences that I’ll explain here.

 

Data Contract Serializer performance

In the quoted tests, DataContractSerializer was almost 7 times slower than protobuf. That’s a lot. In my tests it was 3 times slower. Big difference, huh? When I ran the tests on my own machine (the thing I love about the people at servicestack.net is that they published the source of their tests on Git, so you can reproduce it at home) I found that DataContractSerializer was indeed 7 times slower. So obviously, there had to be a difference in the way we used it! And indeed there was – when I added my version, it was 2,8 times slower than protobuf running on the same test set. Here’s the difference:

Original:
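(From memory, roughly what the benchmark did – NorthwindDto and dto stand for the row type and instance being serialized.)

```csharp
// using System.IO; using System.Runtime.Serialization;
// DataContractSerializer writing plain text XML to a MemoryStream.
var serializer = new DataContractSerializer(typeof(NorthwindDto));
using (var ms = new MemoryStream())
{
    serializer.WriteObject(ms, dto);
    ms.Position = 0;
    var copy = (NorthwindDto)serializer.ReadObject(ms);
}
```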

My version:
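(Same placeholders; the only change is going through the binary XML writer and reader.)

```csharp
// using System.IO; using System.Runtime.Serialization; using System.Xml;
// Same DataContractSerializer, but writing and reading binary XML.
var serializer = new DataContractSerializer(typeof(NorthwindDto));
using (var ms = new MemoryStream())
{
    using (var writer = XmlDictionaryWriter.CreateBinaryWriter(ms, null, null, false)) // false: leave the stream open
    {
        serializer.WriteObject(writer, dto);
    }
    ms.Position = 0;
    using (var reader = XmlDictionaryReader.CreateBinaryReader(ms, XmlDictionaryReaderQuotas.Max))
    {
        var copy = (NorthwindDto)serializer.ReadObject(reader);
    }
}
```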

See the difference?

Sure, it’s binary, so you can’t always use it, but in most cases it will do, and it’s way faster. I also noticed that the payload size is about 1/3 larger than with JsonDataContractSerializer, which is actually pretty good, so if you’re worried about bandwidth usage, it’s an additional plus. Here are the results of running the Northwind test data with just one additional serialization option (the binary one) on my machine:

| Serializer | Larger than best | Serialization | Deserialization | Slower than best |
|---|---|---|---|---|
| MS DataContractSerializer | 5,89x | 10,24x | 5,38x | 7,24x |
| MS DataContractSerializerBinary | 4x | 3,01x | 2,75x | 2,85x |
| MS JsonDataContractSerializer | 2,22x | 8,07x | 9,59x | 9,01x |
| MS BinaryFormatter | 6,44x | 10,35x | 6,23x | 7,81x |
| ProtoBuf.net | 1x | 1x | 1x | 1x |
| NewtonSoft.Json | 2,22x | 2,83x | 3,00x | 2,94x |
| ServiceStack Json | 2,11x | 3,43x | 2,48x | 2,84x |
| ServiceStack TypeSerializer | 1,67x | 2,44x | 2,05x | 2,19x |

 

This difference aside, the results were more consistent with my tests, but there were still big differences in some cases, which leads us to the second important point.

 

It’s all about data

There is nothing that will be best for all cases (not even protobuf), so you should carefully pick your use case. In my tests, the data was larger, and usually more complex, than in the Northwind data set used for benchmarking by servicestack.net. Even if you look at their results, the smaller the data set, the bigger the difference – for RegionDto, which has only one int and one string, DataContractSerializer is 14 times slower (10 times slower on my machine; the binary XML version of DataContractSerializer is 3,71 times slower).

So, does the DataContractSerializer perform better (especially in binary version) if larger objects are involved? It does indeed.

I created an EmployeeComplexDto class that inherits from EmployeeDto and adds some data as follows:
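(A rough reconstruction – the member types are other DTOs from the same Northwind set, and the exact details don’t matter much; the point is a larger, nested object graph.)

```csharp
[DataContract]
public class EmployeeComplexDto : EmployeeDto
{
    [DataMember]
    public List<OrderDto> Orders { get; set; }

    [DataMember]
    public List<CustomerDto> Customers { get; set; }

    [DataMember]
    public List<EmployeeDto> Friends { get; set; }
}
```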

The object being serialized consisted of no more than 10 Orders, 10 Customers and 10 Friends (for each test the same amount and the same data), and here’s what I got serializing and deserializing this 100 000 times:

| Serializer | Payload size | Larger than best | Serialization | Deserialization | Total | Avg per iteration | Slower than best |
|---|---|---|---|---|---|---|---|
| MS DataContractSerializer | 1155 | 1,88x | 6407245 | 6123807 | 12531052 | 125,3105 | 4,11x |
| MS DataContractSerializerBinary | 920 | 1,50x | 2380570 | 3452640 | 5833210 | 58,3321 | 1,91x |
| MS JsonDataContractSerializer | 865 | 1,41x | 7386162 | 14391807 | 21777969 | 217,7797 | 7,14x |
| MS BinaryFormatter | 1449 | 2,36x | 9734509 | 7949369 | 17683878 | 176,8388 | 5,80x |
| ProtoBuf.net | 613 | 1x | 1099788 | 1948251 | 3048039 | 30,4804 | 1x |
| NewtonSoft.Json | 844 | 1,38x | 2844681 | 4272574 | 7117255 | 71,1726 | 2,34x |
| ServiceStack Json | 844 | 1,38x | 4904168 | 5747964 | 10652132 | 106,5213 | 3,49x |
| ServiceStack TypeSerializer | 753 | 1,23x | 3055495 | 4606597 | 7662092 | 76,6209 | 2,51x |

So – are you really ‘stuck’ with DataContractSerializer, or is it actually quite good? The answer is, of course – it depends. Surely, whenever you can, use the binary writer as it is way faster, but even if you do, whether it’s the best choice or not can be answered only with a specific use case in mind – think about what data you will be serializing, how often, and so on.

As always when reasoning about performance, the best option is to test with your data, in close-to-real scenarios, and (preferably) on hardware similar to what the application will run on in production. The closer you get to this, the more accurate your tests will be.

 

About a year ago we had an internal coding challenge in the company, in which I decided to take part. The goal was simple enough – we were given a huge file (a couple of GB) in a specific format where IDs were put on a per-row basis along with other attributes. The goal was to read an ID from standard input and find the associated attributes (or, I think, return an error when the ID did not exist). The entries were sorted by ID.

The language was not specified, you could use any. I used C#.

With the data being more or less random it was very hard to do better than plain binary search; I tried various algorithms (interpolation search with a binary search cutoff was in some cases a little better) but ended up with a straight binary search. I did, however, have a small caching algorithm that built a map of pointers into the original file, so that the initial part of the binary search could be performed in memory rather than by seeking in the heavy file. The cache was kept sparse enough to ensure its size stayed below 512K. This was because the test procedure involved running the program multiple times on single-item inputs rather than once on a multi-item input, so I judged that loading/saving the cache might contribute too much if it were large.
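The file-format details don’t matter here, but the search itself boiled down to something like this (a sketch assuming one sorted “ID attributes” record per line with numeric IDs, ASCII text, and the caching layer left out):

```csharp
// using System.IO; using System.Text;
// Binary search over a sorted text file without loading it into memory.
// Each probe seeks into the file, skips the partial line it landed in,
// and reads the ID of the next full record (roughly one disk read per probe,
// since FileStream buffers the byte-by-byte reads).
static long IdAt(FileStream file, long offset, out string attributes)
{
    attributes = null;
    file.Seek(offset, SeekOrigin.Begin);
    if (offset > 0) SkipToNextLine(file);
    string line = ReadLine(file);
    if (line.Length == 0) return long.MaxValue;          // ran past the last record
    var parts = line.Split(new[] { ' ' }, 2);            // assumed "ID attributes" layout
    attributes = parts.Length > 1 ? parts[1] : string.Empty;
    return long.Parse(parts[0]);
}

static string Find(FileStream file, long targetId)
{
    string attributes;
    long lo = 0, hi = file.Length - 1;
    while (lo < hi)
    {
        long mid = lo + (hi - lo + 1) / 2;               // bias up so the loop terminates
        if (IdAt(file, mid, out attributes) <= targetId) lo = mid;
        else hi = mid - 1;
    }
    // lo is now the largest offset whose "next full record" still has ID <= target
    return IdAt(file, lo, out attributes) == targetId ? attributes : null;
}

static void SkipToNextLine(FileStream f)
{
    int b;
    while ((b = f.ReadByte()) != -1 && b != '\n') { }
}

static string ReadLine(FileStream f)
{
    var sb = new StringBuilder();
    int b;
    while ((b = f.ReadByte()) != -1 && b != '\n')
        if (b != '\r') sb.Append((char)b);
    return sb.ToString();
}
```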

My benchmark was simply the number of disk reads needed before the answer could be given, and I managed to keep it at about 13, which was a bit better than raw binary search (16 without the cache).

Imagine my surprise when, after all the tests were run, my program took twice as long as the best one!

Of course I started analyzing why this happened, and there were two very interesting conclusions:

1. Loading and Saving

Hard disks nowadays have about 32MB of cache. One thing I noticed was that when I ran my program twice on the same input, the second run returned instantly, simply because the data being accessed on the disk was already in the cache and could be returned in no time. I assumed the same (perhaps to an even higher degree) would apply to my cache file being saved and loaded each time the process started.

I was wrong. Loading and especially saving 512KB of data takes time. A lot of time, if you do it frequently. In our test cases, the program was run a couple of hundred times with different inputs. That means a couple of hundred loads and possibly as many saves (there weren’t always changes in the cache, but thanks to some re-balancing there often were).

After removing the cache and reverting to the raw binary search, the program was still 25% slower than the winning one – it still had one major disadvantage, but the cache was no longer it.

2. Process startup time

Now we come to the main point that interested me in this story – .NET process startup time.

In native code, the binary EXE file is simply executed. The binary format (PE) of the EXE defines where the code starts (where the bootstrap code is), so the OS can just go ahead and execute it. In the .NET world, however, all we have is a managed assembly. When you execute a managed EXE, it jumps to an initialization function in mscoree.dll (if I’m not mistaken, it’s a Windows DLL, not a framework DLL, so it’s present even if you don’t have the framework installed at all – and will tell you it’s missing). There it is decided which CLR version should be initialized (1.0/2.0/4.0), and only after this initialization does the CLR JIT and execute your managed code. This initialization has to take time. But how much?

First – some baseline tests:

The most basic program, not doing anything – just a Main function that exits immediately. Executed 500 times using a batch script on my PC.

| .NET version | DEBUG | RELEASE |
|---|---|---|
| 2.0 | 39,08 s | 39,55 s |
| 4.0 | 37,55 s | 36,08 s |
| 4.5 | 36,25 s | 35,51 s |

Note here that 4.0 and 4.5 are not really different CLRs – what differs is just the framework and the compiler.

With this baseline, I tried different things to speed up the startup – first, changing the platform target to x86/x64 instead of Any CPU. Result – nothing. The numbers didn’t differ by more than a single second (and sometimes in the wrong direction).

Next – removing all the referenced assemblies. Result – the same – no change.

After some more unsuccessful attempts at various optimization and compilation options, which yielded nothing, I found the only thing that in my tests made any difference: NGEN.

Pre-generating the native image (thus removing the need for JIT compilation) sped up the startup time by about 10% (the best-in-class 4.5 build reached 32,63 s). All other attempts at optimization didn’t provide significant results.

For comparison, I wrote a similar program in C++ (a main function returning immediately) and used the same test procedure. The result – 26,85 s.

A bonus lesson learned (at least for me): in any competition, before you submit your program, check the impact of the test procedure on it. It may just make all the difference.

 

Stability testing is important, but often overlooked – especially in stable products. We had a saying about one of the products I was working on: that it had two bugs – one known and one hidden. And that every time you fixed either of them, it immediately (and miraculously) returned to the stable state of having two bugs.

After spending about two weeks trying to find and fix one of them, we resolved to find and fix both. Sure enough, there were some signs of the other – hidden – bug, namely the number of handles reported by Task Manager. After a couple of days it was reaching 15 000. That’s a lot of handles, even for a server application. Of course, the worst thing was that after a weekend of idleness the handle count didn’t go down (unless you restarted the application, and then it started growing again from the nominal ~150).

In situations like these WinDbg is an invaluable ally, and we quickly found out that the handles were mostly to stopped threads and were not released because the finalizer thread was stuck.

Not only stuck, but stuck with a very peculiar stack trace:

Wait for single object? No wonder it’s frozen. But what is it waiting on? One thing worth noticing is the last line of the abbreviated stack trace, which seems to be pointing at some COM object problem.

Some digging inside revealed the following piece of code:
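I won’t reproduce it verbatim, but it boiled down to a WMI query for a service’s start mode, something like this (‘SomeService’ is a placeholder):

```csharp
// using System.Management;  (reference System.Management.dll)
using (var searcher = new ManagementObjectSearcher(
    "SELECT StartMode FROM Win32_Service WHERE Name = 'SomeService'"))
{
    foreach (ManagementObject service in searcher.Get())
    {
        var startMode = (string)service["StartMode"];
        // ... feed startMode into the startup configuration
    }
}
```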

I really don’t like WMI management objects – I find them hard to work with and the API somewhat cryptic – but sometimes there is no way around them. This particular code was being executed as part of the startup configuration, from the main thread, which then proceeded to execute the main program’s timer loop. Now, the thing about the management objects is that they sometimes have (or use something that has) destructors which are executed by the finalizer thread.

Say your main method looks like this:
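(Method names here are just placeholders mirroring the description.)

```csharp
[STAThread]
static void Main(string[] args)
{
    readConfiguration();      // somewhere inside, a WMI/COM object gets created

    while (true)
    {
        doProcessing();       // the main program timer loop
        Thread.Sleep(1000);   // the STA thread never gets to service COM calls
    }
}
```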

I see a lot of methods like this (especially in software that is about 5-8 years old 😉). If your readConfiguration() method is using something that will be accessing a COM object in a destructor – you’re in trouble (your finalizer thread will show the same stack trace as the one at the beginning of this post).

Now, why is that? The whole issue boils down to the attribute above the Main method – [STAThread]. Your main thread, while creating the COM object, will associate it with its own single-threaded apartment. Because of this, when the finalizer thread wants to do something with this COM object, it won’t be able to do it directly, but will have to go through the thread that created the COM object – your main thread. Your main thread, however, will be busy doing other things and not willing to proxy calls to the COM object (even if you do a Thread.Sleep). The end result will be your finalizer thread frozen waiting, your handle count growing, and eventually a crash of the application.

How to alleviate this problem? There are multiple ways to avoid it. The easiest fix is to just remove [STAThread], provided you don’t need it for other COM objects. Another is to execute your COM-object-creating code in another thread that is MTA. I chose to avoid using WMI at all – we found that the reading of the service start mode was completely unnecessary and there only for legacy reasons.

One interesting thing I noticed is that if you call GC.WaitForPendingFinalizers() in the main thread, it will indeed wait, but it will also release the finalizer thread from its WaitOne by interacting with the COM object on its behalf.

Everyone (OK, nearly everyone, or this wouldn’t happen) knows that if you have a multithreaded application and different threads are accessing shared resources (and some of them do writes), you have to synchronize the access somehow.

Microsoft introduced a neat set of concurrent collections in the framework to help with that, but before ConcurrentDictionary there was the plain old Dictionary. Some time ago I came across a piece of software that was crashing due to lack of proper synchronization (unfortunately this still happens a lot) and it got me thinking – what is the worst that could happen? So I started experimenting with multiple threads and a Dictionary to see in how many ways it can break. What I found surprised me. Outside of the duplicate-key exception and the key-not-found exception, which one may argue are not entirely tied to lack of proper synchronization, there are two that stood out.
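The experiment itself was nothing fancy – a handful of threads hammering a single unsynchronized Dictionary, along these lines (a sketch):

```csharp
// Deliberately broken: concurrent writes and removes on a plain Dictionary.
var dictionary = new Dictionary<long, int>();

var threads = Enumerable.Range(0, 4).Select(t => new Thread(() =>
{
    for (int i = 0; i < 1000000; i++)
    {
        long key = (t * 31 + i) % 100;
        dictionary[key] = i;        // unsynchronized insert/update
        dictionary.Remove(key);     // unsynchronized remove
    }
})).ToArray();

foreach (var thread in threads) thread.Start();
foreach (var thread in threads) thread.Join();
```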

The first one is rather trivial (it’s actually the exception I observed in the aforementioned piece of software, so no surprise there):

Collection is being resized, and we’re still trying to do some operations on it.

The second one is more interesting. At some point my small testing program just didn’t end. I looked at the Task Manager, and this is what I saw:


25% CPU utilization – most likely one thread going crazy. I connected with WinDbg and here are some of the things I saw:

So there is one offending thread – let’s see what it’s doing:

So it seems it’s stuck on inserting – actually inside the framework’s Dictionary class – spinning!

Let’s find this dictionary and see if it’s corrupted:

This is my dictionary – it was declared as Dictionary<long,int>. There are four of them, but that’s because the program was running in a loop, creating dictionaries and trying to break them. The remaining three just haven’t been garbage collected yet, as can be observed below.

Let’s see if that indeed is the case:

And sure enough, that’s the only one with any GC root at all.

Before we look inside, a word about the structure of Dictionary – there are two important structures inside:

  • int[] buckets – for each bucket it holds the index into the entries table of the first Entry whose hash code maps to that bucket
  • Entry[] entries – it contains all the entries in a linked-list fashion (the Entry struct has a next field pointing to the next Entry)

So the way this works is: when you put in an object, a hash code is calculated for it, a bucket index is selected by computing hashCode % buckets.Length, and the int in the buckets table at that index points at the first entry of that ‘bucket’ within the entries table. If there is none, yours is the first. If there are some, you iterate the whole chain and insert after the last one, updating the ‘next’ pointer of the previously last element.

Knowing that, we can look at the structure of my Dictionary:

OK – just 10 elements inside the Dictionary. Let’s see what the entries look like:

 

For brevity I include only two relevant entries (actually only one is relevant). Entry number six – the next entry is number seven (so far so good). Entry number seven – the next is… number seven! Aha! That’s really bad. Concurrent inserting (and deleting) without any synchronization has corrupted the Dictionary so that in the entries linked list one of the entries is pointing at itself!

I would imagine the Dictionary code for inserting looks something like this:
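(Not the real framework source – just the gist of walking a bucket’s chain; buckets, entries and comparer mirror the fields described above.)

```csharp
int bucket = (hashCode & 0x7FFFFFFF) % buckets.Length;

// Walk the chain for this bucket looking for an existing key.
for (int i = buckets[bucket]; i >= 0; i = entries[i].next)
{
    if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key))
    {
        entries[i].value = value;   // key already present - just overwrite
        return;
    }
    // If entries[i].next ever points back into the part of the chain already
    // visited (for example at itself), this loop never terminates.
}

// ...otherwise allocate a new entry and link it into the bucket's chain.
```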

With such a corrupted structure, the insertion falls into an infinite loop, spinning on next.next.next… (which is exactly what I observed).

If you know any other way this could go wrong, send me an email or leave a comment. I’m sure there have to be other ill behaviors resulting from lack of proper synchronization.

As mentioned last time, there is another behavior that may be a surprise (it certainly was to me when I discovered it). Consider the following minimal TCP server code:
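(Reconstructed – the port number and output format are arbitrary, but the structure is the one discussed below.)

```csharp
// using System; using System.Net; using System.Net.Sockets;
class Server
{
    static NetworkStream stream;
    static readonly byte[] buffer = new byte[1024];
    static int counter;

    static void Main()
    {
        var listener = new TcpListener(IPAddress.Any, 12345);
        listener.Start();
        var client = listener.AcceptTcpClient();
        stream = client.GetStream();
        stream.BeginRead(buffer, 0, buffer.Length, rcv, null);
        Console.ReadLine();
    }

    static void rcv(IAsyncResult ar)
    {
        lock (typeof(Server))
        {
            int read = stream.EndRead(ar);
            if (read == 0) return;                                  // connection closed
            var data = new byte[read];
            Array.Copy(buffer, data, read);                         // copy the data out first
            stream.BeginRead(buffer, 0, buffer.Length, rcv, null);  // re-arm the read...
            Console.Write("{0}: {1}\n", counter++, data[0]);        // ...then process/print
        }
    }
}
```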

It’s as simple as it gets and seems to be quite correct. The lock on the type is overkill, but it should do the trick of making sure our buffer is consistent.

The client code is just sending single bytes and flushing the stream every time (200 single-byte messages).
What is the output? Here it is (a relevant part of it, at least):

Notice lines 3 and 4 – the data is printed in reversed order. How come?!

Let’s look again at the server code:

Console.Write is protected by the lock and it is guaranteed that only one thread will call it at a time, the buffer is already copied – what can be wrong?

The only thing that may raise an eyebrow is the call to BeginRead happening before the processing of the data read from the buffer. It cannot interfere with the data, though, as there is a lock and our data is already in a copied byte array. “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” The only remaining option is that the thread doing the out-of-order write is the same one that is asynchronously called to process the incoming data, and that instead of calling BeginRead and returning, it actually synchronously calls the async callback. The MSDN documentation fails to mention it, but that’s exactly what happens. We can verify it easily by adding the following piece of code:
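Something as simple as logging the thread id at the top of rcv will do (a sketch):

```csharp
// First statement inside rcv() - log which thread is servicing the callback.
Console.WriteLine("rcv called on thread {0}", Thread.CurrentThread.ManagedThreadId);
```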

and here is the relevant output:

Notice lines 1 and 8 – it’s the same thread that is calling the rcv method twice. Since locks in C# are reentrant, the lock doesn’t stop the EndRead from executing (if locks weren’t reentrant, this would be a self-deadlock).

In the software where I originally observed this, it didn’t happen a lot – the bigger the load, the more likely it was to surface. When it did happen, though, the recursion depth was around 5-6. The reason is that under the hood an OverlappedAsyncResult is used, which can detect that there is another async IO in progress and piggyback on it.

Hope you’ll find it useful – I learned it the hard way.

Here’s a scenario – two applications talking to each other over a proprietary binary TCP protocol (I work in a corporation, remember?). Both written in C#. A somewhat atypical client-server setup with a pub-sub mechanism: one subscribes for events, and then the other initiates the connection and delivers the events. If there are no events for some time, the connection is torn down. Every couple of weeks, the application delivering the events hangs – the thread count rises to over 500 and no events are sent to the receiver.

This was the situation as it was described to me at some point. I knew one of the applications (and knew it was not written well), so I started digging.

One word about reproducing the problem – while rare, we noticed that a certain pattern of events seemed to trigger it, so after making sure there was nothing special about the events themselves, only the pattern, we managed to reproduce it far more often (every couple of hours).

The first thing was to use my all-time favourite tool, WinDbg. Often unappreciated, always underestimated, usually little known.

And oddly enough, the clrstack of the sender thread constantly showed something like this:

How can that be? The events are small packets, the network is healthy, etc. Some googling reveals when such a thing can happen – the other side keeps the connection open but doesn’t read. This explains the thread count (there is a lock around the write, and the thread pool that receives data and generates the events is set up with a 500-thread limit), but not the behavior of the whole system.

The system works fine most of the time (two nines or something ;), so why does the write (or rather read) problem occur so infrequently? We examined the reading code multiple times and it looked good! Some two gallons of coffee later, an epiphany about the nature of the traffic pattern descended on us – it’s disconnecting every now and then, and the problem always surfaces about the same number of events (a couple of hundred) after a disconnection. The problem had to be in the connection acceptance.

We later found out that it indeed was. There was a race condition between the callback of the network stream’s BeginRead and the actual assignment of some delegates that were supposed to handle the received data. Most of the time everything worked fine, but every now and then, with unlucky thread interleaving, the subsequent BeginRead was not called. What happens in such a case? The connection is kept alive and healthy, and the OS keeps receiving data into its own buffer (hence the couple of hundred messages that were successfully sent from the event-generating application), but the moment this buffer is filled to the brim, it asks the sender to wait (you can actually see the zero window size reported in Wireshark) and the sender blocks on the Write call.

The TcpClient’s underlying stream actually lets you specify a write timeout – if your Write call isn’t returning after, say, 5 seconds, there most likely is a problem underneath.
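A sketch of what that looks like (host and port are placeholders; a Write that stays blocked past the timeout throws an IOException):

```csharp
var client = new TcpClient("event-receiver-host", 12345);
var stream = client.GetStream();
stream.WriteTimeout = 5000;                 // don't let a Write hang forever on a zero window

stream.Write(payload, 0, payload.Length);   // throws IOException after ~5 s if stuck
```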

TCP BeginRead and EndRead have yet another quirky behavior which manifests in a really weird way, but that’s a topic for a separate post.