Stability testing is important, but often overlooked. Especially in stable products. We had a saying about one of the products I was working on that it had two bugs – one known and one hidden. And that every time you fixed either of them, it immediately (and miraculously) returned to a stable state of having two bugs.

After spending about two weeks trying to find and fix one of them, we resolved to find and fix both. Sure enough, there were some signs of the other – hidden – bug, namely the number of handles reported by Task Manager. After a couple of days it was reaching 15 000. That’s a lot of handles, even for a server application. Of course the worst thing was that after a weekend of idleness the handle count didn’t go down (unless you restarted the application, and then it started growing again from the nominal ~150).

In situations like these WinDbg is an invaluable ally, and we quickly found out that the handles mostly belonged to stopped threads and were not released because the finalizer thread was stuck.

Not only stuck but stuck with a very peculiar stacktrace:

Wait for single object? No wonder it’s frozen. But what is it waiting on? One thing worth noticing is the last line of the abbreviated stacktrace, which seems to point at some COM object problem.

Some digging inside revealed the following piece of code:
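(The exact snippet isn’t reproduced here; a minimal sketch of the kind of WMI call it made – assuming, as mentioned further down, that it was reading a Windows service’s start mode via Win32_Service – could look like this. The service name is illustrative.)

```csharp
using System.Management;   // requires a reference to System.Management.dll

// Hypothetical reconstruction, not the original code: read a service's start
// mode through WMI. The ManagementObject wraps a COM object underneath.
static string ReadServiceStartMode(string serviceName)
{
    using (var service = new ManagementObject($"Win32_Service.Name='{serviceName}'"))
    {
        service.Get();                          // fetch the WMI object
        return (string)service["StartMode"];    // e.g. "Auto", "Manual", "Disabled"
    }
}
```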

I really don’t like WMI management objects; I find them hard to work with and the API somewhat cryptic, but sometimes there is no way around it. This particular code was being executed as part of startup configuration from the main thread, which then proceeded to execute some main program timer loop. Now, the thing about the management objects is that they sometimes have (or use something that has) destructors which are executed by the finalizer thread.

Say your main method looks like this:
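(The original listing isn’t shown here; a sketch of the shape being described – with illustrative names – might be:)

```csharp
using System;
using System.Threading;

class Program
{
    [STAThread]
    static void Main(string[] args)
    {
        readConfiguration();        // somewhere inside, this touches WMI / COM

        while (true)
        {
            doPeriodicWork();       // the "main program timer loop"
            Thread.Sleep(1000);
        }
    }

    // Stubs – in the real application these did the actual work.
    static void readConfiguration() { /* reads config, e.g. the WMI query sketched above */ }
    static void doPeriodicWork()    { /* hypothetical name for the recurring work */ }
}
```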

I see a lot of methods like this (especially in software that is about 5-8 years old 😉). If your readConfiguration() method is using something that will be accessing a COM object in a destructor – you’re in trouble (your finalizer thread will show the same stacktrace as the one at the beginning of this post).

Now, why is that? The whole issue boils down to the attribute above the Main method – [STAThread]. Your main thread, while creating the COM object, will associate it with its own Single Threaded Apartment. Because of this, when the finalizer thread wants to do something with this COM object, it won’t be able to do it directly, but will have to go through the thread that created the COM object – your main thread. Your main thread, however, will be busy doing other things and not willing to proxy calls to the COM object (even if you do the Thread.Sleep). The end result will be your finalizer thread frozen waiting, your handle count growing, and eventually a crash of the application.

How to alleviate this problem? There are multiple ways to avoid it. The easiest fix is to just remove [STAThread], provided you don’t need it for other COM objects. Another is to execute your COM-object-creating code in another thread that is MTA. I chose to avoid using WMI at all – we found that the reading of the service start mode was completely unnecessary and there only for legacy reasons.
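For the second option, a hedged sketch (reusing the hypothetical ReadServiceStartMode helper from above, not the original fix) could look like this:

```csharp
using System.Threading;

// Run the WMI/COM work on an explicitly MTA thread, so the COM object is not
// tied to the STA main thread and the finalizer thread can reach it later
// without having to wait on the busy main thread.
static string ReadStartModeOnMtaThread(string serviceName)
{
    string startMode = null;
    var wmiThread = new Thread(() =>
    {
        startMode = ReadServiceStartMode(serviceName);   // hypothetical helper sketched earlier
    });
    wmiThread.SetApartmentState(ApartmentState.MTA);     // must be set before Start()
    wmiThread.Start();
    wmiThread.Join();                                    // block until the configuration step finishes
    return startMode;
}
```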

One interesting thing I noticed is that if you call GC.WaitForPendingFinalizers() in the main thread, it will indeed wait, but it will also release the finalizer thread from its WaitOne by interacting with the COM object on its behalf.

Here’s a scenario – two applications talking to each other through a TCP binary protocol (proprietary – I work in a corporation, remember?). Both written in C#. A somewhat atypical client-server setup with a pub-sub mechanism: one subscribes for events, and then the other initiates the connection and delivers the events. If there are no events for some time, the connection is torn down. Once every couple of weeks, the application delivering the events hangs. The thread count rises to > 500 and no events are sent to the receiver.

This is the situation as it was described to me at some point. I knew one of the applications (and knew it was not written well), so I started digging.

One word about problem reproduction – while rare, we noticed that a certain pattern of events seemed to trigger it, so after ensuring there was nothing special about the events themselves, only the pattern, we managed to reproduce it way more often (every couple of hours).

The first thing was to use my all-time favourite tool, WinDbg. Often unappreciated, always underestimated, usually little known.

And oddly enough, the !clrstack of the sender thread consistently showed something like this:

How can that be? The events are small packets, the network is healthy, etc. Some googling reveals when such a thing can happen – the other side keeps the connection open but doesn’t read. This explains the thread count (there is a lock around the write, and the thread pool that receives data and generates the events is set up with a 500-thread limit), but not the behavior of the whole system.

The system works fine most of the time (two nines or something ;), so why does the write (or rather read) problem occur so infrequently? We examined the reading code multiple times and it looks good! Some two gallons of coffee later, an epiphany about the nature of the traffic pattern descends on us – it’s disconnecting every now and then, and the problem always surfaces about the same number of events (a couple of hundred) after the disconnection. The problem has got to be in the connection acceptance.

We later found out that it indeed was. There was a race condition between the callback of the network stream’s BeginRead and the actual assignment of some delegates that were supposed to handle the received data. Most of the time everything worked fine, but every now and then, with unlucky thread interleaving, the subsequent BeginRead was not called. What happens in such cases? The connection is kept alive and healthy, and the OS is actually receiving the data into its own buffer (hence the couple of hundred messages that were successfully sent from the event-generating application), but the moment this buffer is filled to the brim, it asks the sender to wait (one can actually see the zero window size reported in Wireshark) and the sender blocks in the Write call.
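To make the race concrete, here is a hedged reconstruction of the shape of the bug – the class and method names are made up, not the real code:

```csharp
using System;
using System.Net.Sockets;

class ConnectionState
{
    public NetworkStream Stream;
    public byte[] Buffer = new byte[4096];
    public Action<byte[], int> DataReceived;   // handler assigned by the accept path
}

class Receiver
{
    // The accept path starts reading *before* the handler delegate is assigned.
    void OnAccept(TcpClient client)
    {
        var state = new ConnectionState { Stream = client.GetStream() };
        state.Stream.BeginRead(state.Buffer, 0, state.Buffer.Length, OnRead, state);
        state.DataReceived = HandleData;       // assigned after BeginRead – the race window
    }

    void OnRead(IAsyncResult ar)
    {
        var state = (ConnectionState)ar.AsyncState;
        int read = state.Stream.EndRead(ar);
        if (state.DataReceived == null)
            return;                            // unlucky interleaving: nobody re-issues BeginRead,
                                               // so the OS buffer slowly fills and the sender blocks
        state.DataReceived(state.Buffer, read);
        state.Stream.BeginRead(state.Buffer, 0, state.Buffer.Length, OnRead, state);
    }

    void HandleData(byte[] buffer, int count)
    {
        // dispatch the received bytes to the pub-sub machinery
    }
}
```

Seen in isolation the fix is straightforward – assign the delegate before the first BeginRead (or re-issue BeginRead regardless of the handler) – but buried inside a real accept path the interleaving is easy to miss.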

The stream you get from TcpClient actually lets you set a write timeout (WriteTimeout on the NetworkStream, or SendTimeout on the client/socket) – if your Write call isn’t returning after, say, 5 seconds, there most likely is a problem underneath.
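A minimal sketch of that guard (the 5-second value comes from the sentence above; everything else is illustrative):

```csharp
using System.Net.Sockets;

// Hedged sketch: make a non-reading peer surface as an IOException instead of a
// thread blocked forever in Write.
static void SendWithTimeout(TcpClient client, byte[] payload)
{
    NetworkStream stream = client.GetStream();
    stream.WriteTimeout = 5000;                 // milliseconds; applies to synchronous Write calls

    // If the receiver keeps the connection open but stops reading (zero TCP window),
    // this throws an IOException after ~5 s instead of hanging the sender thread.
    stream.Write(payload, 0, payload.Length);
}
```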

TCP BeginRead and EndRead have yet another quirky behavior which manifests in a really weird way, but that’s a topic for a separate post.

I recently stumbled upon a Wikipedia page humorously describing various types of computer software bugs (http://en.wikipedia.org/wiki/Heisenbug). One in particular caught my attention: the Schrödinbug. “…a bug that manifests itself in running software after a programmer notices that the code should never have worked in the first place.” It surprised me a little, as I always thought a Schrödinbug is one that lies dormant for a long time but, once it manifests itself, nothing works again until it is fixed.

It reminded me of my favourite bug I introduced in the old days which falls under this second definition of Schrödinbug. Consider the following snippet:
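(The original snippet isn’t reproduced here; based on the description below, it presumably boiled down to something like this hedged reconstruction:)

```sql
-- Hedged reconstruction, not the original code: seed the PL/SQL random number
-- generator with the "current time" – hundredths of a second since an
-- OS-dependent epoch – as returned by dbms_utility.get_time.
BEGIN
  dbms_random.seed(dbms_utility.get_time);
END;
```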

That construct (of using some current-time-related value as a seed) is often used and in general works well (if not used for cryptography). In this instance, however, the unit of the value returned by get_time is peculiar – hundredths of a second (“hsecs”), counted since some arbitrarily chosen epoch (OS dependent, not 1970). According to the documentation (http://docs.oracle.com/cd/B12037_01/server.101/b10755/dynviews_2147.htm), hsecs wraps to zero to prevent integer overflow roughly every 500 days. seed takes an integer as its input value, and all would be well if not for one thing:

  • hsecs datatype: NUMBER
  • seed input datatype: BINARY_INTEGER

Looks good at first glance. NUMBER is an interesting Oracle datatype which, according to the docs (http://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#i22289), can take values in the range of:

  • Positive numbers in the range 1 x 10^-130 to 9.99…9 x 10^125 with up to 38 significant digits
  • Negative numbers from -1 x 10^-130 to 9.99…99 x 10^125 with up to 38 significant digits

but we’re guaranteed that hsecs will wrap to zero after reaching 2^32 – 1.

Can you already see where this is going? BINARY_INTEGER, which is the seed input datatype, is a signed 32-bit integer, so it tops out at 2^31 – 1. Depending on your luck and the choice of a baseline for the epoch, after up to ~250 days (2^31 – 1 hundredths of a second is roughly 248.5 days) this code will crash with an exception, because the hsecs value can no longer be cast to the seed parameter without overflowing.

This happened to me in a real system, fortunately with little damage, and was trivial to fix. Nevertheless, lesson learned – be mindful of whether the integers you’re using are signed, as this can cause very weird things to happen that only come up half a year after deployment.