The other day I was watching .NET cryptography tutorials on Pluralsight (http://pluralsight.com/training/Courses/TableOfContents/cryptography-introduction-dotnet) and one thing struck me – there were some performance tests of the SHA hash algorithms, and the results were counter-intuitive. Namely, SHA512 was faster than SHA256. I’m not an expert in cryptography, but I know a thing or two about how SHA works, and I could see nothing that would justify those results. So I did some testing.
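Here’s roughly what my test looked like – a minimal sketch (the file name and the plumbing are mine, not taken from the course):

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Security.Cryptography;

    class ShaBenchmark
    {
        static void Main()
        {
            byte[] data = File.ReadAllBytes("data.bin"); // the ~10 KB test file

            // Time 1000 hash computations with a given algorithm.
            Func<HashAlgorithm, long> bench = algo =>
            {
                var sw = Stopwatch.StartNew();
                for (int i = 0; i < 1000; i++)
                    algo.ComputeHash(data);
                return sw.ElapsedMilliseconds;
            };

            Console.WriteLine("SHA256: {0} ms", bench(new SHA256Managed()));
            Console.WriteLine("SHA512: {0} ms", bench(new SHA512Managed()));
        }
    }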

The results of 1000 hashings of a 10 KB file on my PC:

SHA 256 – 202 ms.

SHA 512 – 955 ms.


Now that’s not what I saw in the video! I ran ILSpy (http://ilspy.net/) on the libraries, and the thing that immediately stands out as a difference between SHA256 and SHA512 (note: I was using System.Security.Cryptography.SHA256Managed; there is another flavor of SHA256 and I’ll talk about that later) is that where SHA256 uses int, SHA512 uses ulong.

Aha! I quickly recompiled everything as x64 and as expected the results were mighty different:

SHA 256 – 201 ms.

SHA 512 – 130 ms.

Now it makes sense. I guess historically the framework designers thought that by the time SHA512 became a standard, all builds would be x64, so it made sense to speed it up for that kind of compilation.
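For reference, switching the platform target is a one-line change in the project file (or Project properties -> Build -> Platform target in Visual Studio):

    <!-- in the .csproj, inside the active PropertyGroup -->
    <PlatformTarget>x64</PlatformTarget>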


One other interesting thing is that there are (at least) two mechanisms for SHA computation in .NET. The *Managed one (SHA1Managed, SHA512Managed, etc.) is older and is not certified to be FIPS 140-2 compliant (which basically means it’s potentially less secure, and you definitely shouldn’t use it if you’re writing software for government/military or public safety use). The other option is to use the CryptoServiceProviders (SHA512CryptoServiceProvider, SHA256CryptoServiceProvider, etc.). It’s altogether a better option because, compliance aside, those algorithms are faster, as they use native Windows libraries instead of managed implementations (the only drawback is that you need framework 3.5+).
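Since both flavors share the HashAlgorithm base class, switching between them is a drop-in change – a quick sketch:

    using System;
    using System.Security.Cryptography;
    using System.Text;

    class HashFlavors
    {
        static void Main()
        {
            byte[] data = Encoding.UTF8.GetBytes("hello");

            // Managed implementation – not FIPS 140-2 certified.
            using (HashAlgorithm managed = new SHA256Managed())
                Console.WriteLine(BitConverter.ToString(managed.ComputeHash(data)));

            // CSP-backed implementation – FIPS-validated, backed by native
            // Windows crypto (the SHA-2 providers require framework 3.5+).
            using (HashAlgorithm csp = new SHA256CryptoServiceProvider())
                Console.WriteLine(BitConverter.ToString(csp.ComputeHash(data)));
        }
    }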

Here are all the results for the different providers and build targets.

    compilation   SHA256Managed   SHA512Managed   SHA256CSP   SHA512CSP
    x86           202 ms          955 ms          150 ms      256 ms
    x64           201 ms          130 ms          114 ms       73 ms


So, the bottom line – use the CryptoServiceProvider wherever possible, and get ready to move on to 64-bit compilation if you need that extra security and/or extra performance (there are about a gazillion reasons to do it anyway).


About a year ago we had an internal coding challenge at the company, in which I decided to take part. The goal was simple enough – we were given a huge file (a couple of GB) in a specific format, where IDs were put on a per-row basis along with other attributes. The task was to read an ID off of standard input and find the associated attributes (or, I think, return some error when the ID did not exist). The entries were sorted according to the ID.

The language was not specified, you could use any. I used C#.

With the data being more or less random, it was very hard to do better than plain binary search. I tried various algorithms (interpolation search with a binary search cutoff was a little better in some cases) but ended up with straight binary search. I did, however, have a small caching algorithm that built a map of pointers into the original file, so that the initial binary search could be performed in memory rather than by seeking in the heavy file. The cache was kept just sparse enough to ensure its size stayed below 512 KB. This was because the test procedure involved running the program multiple times on single-item inputs rather than once on a multi-item input, so I judged that loading/saving the cache might contribute too much if it was large.
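To give an idea of the approach, here is a very simplified sketch. It assumes fixed-size records (an 8-byte ID followed by attributes) purely for illustration – the real file was line-based, and all the names here are made up:

    using System;
    using System.IO;

    class SparseIndexSearch
    {
        const int RecordSize = 64; // assumed: 8-byte ID + 56 bytes of attributes

        readonly FileStream file;
        readonly long[] cachedIds;     // every Nth ID from the file (record 0 included),
        readonly long[] cachedIndexes; // N chosen so the serialized cache stays < 512 KB

        public SparseIndexSearch(FileStream file, long[] cachedIds, long[] cachedIndexes)
        {
            this.file = file;
            this.cachedIds = cachedIds;
            this.cachedIndexes = cachedIndexes;
        }

        long ReadIdAt(long recordIndex)
        {
            var buffer = new byte[8];
            file.Seek(recordIndex * RecordSize, SeekOrigin.Begin);
            file.Read(buffer, 0, 8); // the expensive part: one disk read per probe
            return BitConverter.ToInt64(buffer, 0);
        }

        public long FindRecordIndex(long id)
        {
            // Phase 1: binary search the sparse cache – pure memory, zero disk
            // reads. Finds the last cached entry whose ID is <= the one sought.
            int lo = 0, hi = cachedIds.Length - 1;
            while (lo < hi)
            {
                int mid = (lo + hi + 1) / 2;
                if (cachedIds[mid] <= id) lo = mid; else hi = mid - 1;
            }

            // Phase 2: ordinary binary search on disk, but only inside the
            // narrow window between two neighboring cache entries.
            long left = cachedIndexes[lo];
            long right = lo + 1 < cachedIndexes.Length
                ? cachedIndexes[lo + 1]
                : file.Length / RecordSize - 1;
            while (left <= right)
            {
                long mid = (left + right) / 2;
                long midId = ReadIdAt(mid);
                if (midId == id) return mid;
                if (midId < id) left = mid + 1; else right = mid - 1;
            }
            return -1; // ID not present
        }
    }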

My benchmark was simply the number of disk reads needed before the answer could be given, and I managed to keep it at about 13, which was a bit better than raw binary search (16 without the cache).

Imagine my surprise when, after all the tests were run, my program took twice as long as the best one!

Of course I started analyzing why this happened, and there were two very interesting conclusions:

1. Loading and Saving

Hard disks nowadays have about 32 MB of cache. One thing I noticed was that when I ran my program twice on the same input, the second run returned instantly – the data being accessed on the disk was already in the cache and could be returned in no time. I assumed the same (perhaps to an even higher degree) would happen with my cache file, which was saved and loaded each time the process started.

I was wrong. Loading and especially saving 512 KB of data takes time. A lot of time, if you do it frequently. In our test cases the program was run a couple of hundred times with different inputs. That means a couple of hundred loads and possibly as many saves (the cache didn’t always change, but thanks to some re-balancing it often did).

After removing the cache and reverting to raw binary search, the program was still 25% slower than the winning one – still having one major disadvantage, but no longer the cache.

2. Process startup time

Now we come to the main point that interested me in this story – .NET process startup time.

In native code, the binary EXE file is simply executed. The binary format (PE) of the EXE defines where the bootstrap code starts, so the OS can just go ahead and execute it. In the .NET world, however, all we have is a managed assembly. When you execute a managed EXE, it jumps to an initialization function in mscoree.dll (if I’m not mistaken, it’s a Windows DLL, not a framework DLL, so it will be there to tell you so even if you don’t have the framework installed at all). There it is decided which CLR version should be initialized (1.0/2.0/4.0), and only after this initialization does the CLR JIT and execute your managed code. This initialization has to take time. But how much?

First – some baseline tests:

The most basic program possible – one that does nothing, just a main function that exits immediately – executed 500 times using a batch script on my PC.
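The whole harness looked roughly like this (file names are mine):

    // Startup.cs – the entire program
    class Program
    {
        static void Main() { }
    }

and the driving script:

    :: run.bat – run the compiled EXE 500 times, time the whole script
    @echo off
    for /L %%i in (1,1,500) do Startup.exe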

    .NET version   DEBUG     RELEASE
    2.0            39,08 s   39,55 s
    4.0            37,55 s   36,08 s
    4.5            36,25 s   35,51 s

Note here that 4.0 and 4.5 are not really different CLRs – only the framework libraries and the compiler differ.

With this baseline, I tried different things to speed up the startup. First, changing the build target to x86/x64 instead of Any CPU. Result – nothing. The numbers didn’t differ by more than a single second (and sometimes in the wrong direction).

Next – removing all the referenced assemblies. Result – the same – no change.

After some more unsuccessful attempts at various optimization and compilation options, which yielded nothing, I found the only thing that made any difference in my tests: NGEN.

Pre-generating the native image (thus removing the need for JIT compilation) sped up the startup time by about 10% (the best-in-class 4.5 compilation reached 32,63 s). None of the other optimization attempts produced significant results.
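For reference, generating the native image is a single command run from an elevated prompt (NGEN ships with the framework itself, so it’s not on the PATH by default – the path and assembly name below are illustrative):

    C:\Windows\Microsoft.NET\Framework64\v4.0.30319\ngen.exe install Startup.exe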

For comparison, I wrote a similar program in C++ (a main function that returns immediately) and used the same test procedure. The result – 26,85 s.

A bonus lesson learned (at least for me): before you submit your program to any competition, check the impact of the test procedure on it. It may just make all the difference.