Not long ago someone sent me an email about my SubsetSum problem post, asking how to reconstruct the set of numbers that were used to create the sum. This is pretty straightforward, until you consider scenarios in which there is more than one correct subset and you want a specific one or all of them.

But, first thing’s first. Let’s consider the easy case – we assume there is only one correct result. For example consider the follwing set { 15, 20, 35, 40 } and the sum of 90 (15+35+40). Using the standard version of the algorithm, we’ll receive the table as follows (top row has values instead of indexes to increase readability, also, I skip non-relevant rows:

X 0 15 20 35 40
0 true true true true true
15 false true true true true
20 false false true true true
35 false false true true true
40 false false false false true
50 false false false true true
55 false false false false true
90 false false false false true

And so, to reconstruct values used, we’re going to determine which number was used to create each subset.

we start with our final value of 90, the first column in which we have ‘true’ is the last one, representing the value ’40’. We add 40 to the result set and substract this value from our final sum.And so recursively, we search for numbers until we reach 0. Starting from 90 we’ll find 40, then search for 90-40 = 50, 50 is true for column representing 35, so we’ll look for 50-35 = 15, we’ll find 15 in the second column and will not search for 15-15 = 0 (and we’re done)

The code can look something like this:


So what will happen if we want to find out what numbers to use to create 35? It can be either 15+20, or just 35. Algorithm in current form will return answer using as little numbers as possible , so it will be just 35. What to do to make it return both ?

Well, we need to remember in how many ways we can create any number from previous numbers. To do this, we’ll use int[][] instead of boolean[][] and increment the value each time we can find a new way to create a number

So the new table will look like this

X 0 15 20 35 40
0 1 1 1 1 1
15 0 1 1 1 1
20 0 0 1 1 1
35 0 0 1 2 2

We can see from this table that 35 can be created from either 35 (leading to 35-35 = 0 and end of algorithm)  or from 20 (leading to 35-20 = 15).

We’ll need to branch on every such occasion, so it’s easiest to use recursion. It may look something like this:


There are two things I believe are worth noticing:

1. It’s relatively easy to apply the same optimizations as in the original post – directly if we have unique sums or more indirectly by creating lists of starting positions if we have multiple subsets giving the same sum.

2.It’s also relatively easy to count all possible subsets without enumerating them (just one line change, instead of table[i][j]++ do table[i][j] += table[i – numbers[j – 1]][j – 1];, table[sum].length – 1] will hold result after algorithm is done)

There is a lot of fun in modifying this to achieve results like ‘subset using least numbers’ or ‘subset minimizing the use of even numbers’, etc. most of which require just small changes. I encourage you to to play with it as a fun exercise.

The Algorithm (the problem)

Yesterday I’ve encountered the following problem – from a set of positive integer numbers, determine all positive integers that can be constructed from any subset of that set.

For example set {1,3,5} can produce the following numbers {1,3,5,4,6,8,9}. In general case it’s a NP complete problem solvable by determining all combinations (this is what I implemented). However for anything larger than ~25 numbers in the set it’s strating to get too long to find the answer. Moreover, a ridiculous set like this {1,1,1,1,1,1,1,1,1,1,1,1…} that can be enumerated by fifth-grader becomes imposible to process.

So of course I went on looking for a smarter solution and quickly enough found this:

Admittedly, yesterday was a long day, but I was sitting in front of the screen staring at this and couldn’t for the life of me understand how and why it works. So – let me put it out here is somewhat more elaborate but kind of step-by-step way (Once rested I understood it today and even managed to improve it a bit).

This isn’t exactly the problem as stated (find if subset sums to X or find all things subsets sum up to), but the algorithm used is the same.


How it works (on example)

Wikipedia states it’s pseudo-polynomial, the reason for it is simply that it doesn’t grow polynomialy with the number of items in the set, but with the range of numbers. So your problem with set like {1,2,3,4} will not take the same amount of time as set {1,2,3,99999999} (we’ll see why)

So – the general idea is to go through all numbers from 1 to sum({set}) (so for set {1,2,3,4} we’ll go through all numbers 1..(1+2+3+4 = 10)) and try to determine if this particular number can be constructed using first 1…X elements of the set. While trying to determine this we’ll construct an array storing the values for each number.

The array to store the results will be of type boolean and sizes [SUM_OF_NUMBERS_IN_SET+1][COUNT_OF_NUMBERS_IN_SET+1]

let’s work on an example – a trivial array of {1,3} the sum of elements is 4, the number of elements is 2.

We’ll produce an empty boolean array that will look like this

X 0 1 2
0 true true true
1 false false false
2 false false false
3 false false false
4 false false false


this is just initially set up table, the true,true in the first row indicates, that with no elements, or with only first item or with all items from the set, we can create a subset that sums up to zero (empty set) .

Now the algorithm kicks in – for each number (1…4) we’ll determine if with first 1…2 elements of the set, we can create this number. So – how can we determine that – there are just two simple cases – either we can create this number from a smaller subset (without including current number) or  we can create this number from currently analysed number and we know that the remainder we can create from previously seen numbers. The boolean array will help us keep track of that. For each element, we’ll check if either an element to the left is true (first case) or an element higher up (by the amount of currently analysed element) and adjacent to left is true (second case).

In pseudocode that will be: table[i][j] = table[i][j-1] | table[i-set[j-1]][j-1] where set[j-1]  is the value of element from the set (indexed from zero, hence j-1)

So – how will our example unfold:

row 1:

X 0 1 2
0 true true true
1 false true true
2 false false false
3 false false false
4 false false false

[1][1]  we set to true because table[1-1][0] (that is table[0][0]) is true (second case)
[1][2]  we set to true because table[1][1] is true (first case)

for next iteration, on row indexed by 2, we’ll not set anything to true:

[2][1] we’ll not set to true because neither [2][0] nor [1][0] is true
[2][2] we’ll not set to true because neither [2][1] nor [-1][0] is true

for the third row:

[3][1] we’ll not set to true because neither [3][0] nor [2][0] is true
[3][2] we’ll set to true because [0][1] is true

the last one is interesting :

[4][1] we’ll not set to true, because neither [4][0] nor [3][0] (this is the equivalent of saying, with just the first element from the set, we can’t construct 4)
[4][2] we’ll set to true because [1][1] is set to true (this is the equivalent of saying, with the first two elements, we can construct 4 – take 3, and somehow constructed 1 from first element from the set)

So the final table will look like this:

X 0 1 2
0 true true true
1 false true true
2 false false false
3 false true true
4 false false true


The answer to question ‘can we construct X’ from the set will be in table[X][set_size -1] (last element of the row)


The actual code looks like this:


And no we can print what values are possible to create from such set:



The improvement isn’t that big (not big-O kind of thing), but in practical applications it matters. The main observation is that we only set the value to true at certain index for a number and then we keep it true. So we only need to track at which index we found we can make this number (if). The improved code (memory footprint is way smaller) can go like this:

That’s it! Hope you’ll find this instructive and won’t waste too much time trying to understand how exactly are those indices set and why on a sunday evening.


“Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky”. This statement is famously attributed to Donald Knuth, and I’ve recently learned the hard way how right he was.

Consider the following problem:

You’re given an inequality |A-X| < P – where A and P are some integer constants. Find how many integers satisfy that inequality. For simplicity let’s limit ourselves to X > – 2^31 and X < 2^31-1 Simple enough, we can divide the function |A-X| to two sections (- 2^31,A> where the function is strictly decreasing and <A,2^31-1) where it’s strictly increasing.

This can be solved easily analytically – X has to be between P-A and P+A, but let’s think about a generic monotonic inequality that can be specified on runtime.

To solve this more general problem,we only need to do a binary search for the minimum X satisfying inequality in the left section and maximum X satisfying inequality in the right one.

With the typical ‘wikipedia’ implementation of binary search by default we won’t do either (it assumes decreasing order, so it won’t do for the second section, and it returns invalid value if there isn’t an exact edge-value satisfying the inequality).

So there are two major considerations for our search function:

  • increasing / decreasing order of our search domain
  • edge values – should the function return the last value that satisfies conditions, the first one that doesn’t or some value indicating error.

For our example we need both increasing / decreasing order of search and last value that satisfies the conditions. Please note that in this case, we’re not really looking for a particular value (we don’t know what that value is). So, how can we achieve it? For example like this:


First a couple of explanations:

BinarySearchOrder is just an enum of Increasing/Decreasing (and is not that necessary as we’ll see later on).  The ISatisfactionValidator is an interface with a single method isValid which takes an index (an X if you will) and returns true if the value produced for that X (whatever that would be) is still satisfies the conditions (whatever they may be) or not. This can easily be adapted to a single element search as we’ll see later on.

A couple of interesting points here:

  • long conversion for the calculation of middle point – this can be easily overlooked with disastrous consequences. M = L + (R-L)/2 – consider R = MAX and L = MIN (negative) – the R-L is way beyond 32-bits.
  • middle point selection depending on ordering – in case of increasing sequences we’re modifying the right edge so we need to make sure that whenever middle point is between two values, we choose the right one (ceil instead of floor)
  • finally – XOR for validity check is just to mirror the search pattern in case of increasing sequence.
  • One more note here – in case of multiple elements with the same value, in the decreasing sequence the algorithm will produce the first index (from the left) and in the increasing the last (first from right) with given value.

That is all well and good, but in this form, the algorithm deals nicely with monotonic functions and we usually deal with arrays. We can easily modify it to suit our needs:

Nothing really tricky here, we just determine the order and write our Stisfaction validator in such a way that it would be usable by the algorithm. The decision to throw an exception when an element is not found may in some cases not be the best, espectially in some performance critical code, but it can easy be modified.

Finally – just noticed right now that in generic search with all elements the same, a Decreasing option will be chosen (and therefore first element returned). It would be more appropriate to name the enum NonIncreasing (or modify the selection and choose Decreasing/NonDecreasing set)





The other day I was watching a new series of lectures about Least significant digit Radix sort given by Robert Sedgewick and Kevin Wayne on Coursera ( and it got me thinking – why don’t I use it anywhere ?

I got so comfy with Java’s mergesort and Comparable<T> that I never thought on optimizing that even in some performance critical code. So i though i should at least check what the performance impact can be.

LSD Radix sort is only useful in cases where you have a reasonable length of the keys on which you are sorting, so for my tests I assumed long as THE key data type. Reasoning behind was the following datatype:

since timestamp (Date object) can be expressed as long, and it’s perfectly reasonable to sort by date, it will do.

There are two additional things I had to define to enable more general sorting:

An interface that allows for extracting the sort key from object (and thus allowing for sorting various object by various keys – an equivalent of Comparator) and implementation of that for my data type.

So now for the sort itself:

So how this works:

We start by extracting the keys to a separate array (to avoid extracting them for every pass). It’s an optimization step, but it’s worth to do it, since calling the extractor method that many times will be expensive (but of course, there is a memory trade-off). Then, for each double octet (16 bits) starting from least significant ones, we do a key-index counting sort on extracted keys also moving the objects themselves into an auxiliary array. The result of each pass is an array sorted by that double byte. Because the sort is stable, we can start from least significant and move ‘up’.

The method is nor overly complicated, there are a couple of subtleties though:

lines 20-21 and 34-35 – the XOR with most significant bit of the most significant byte is required in order to correctly sort the negative numbers (negating the sign bit)

lines 40-46 – exchanging the auxiliary array with initial array (or just two aux arrays in case of keys) is needed in order to avoid copying all elements from aux array which is typically done in one-pass key-index sort.

Also note here, that thanks to the fact we have an even number of passes (4), we use the input array as an auxiliary array in the last pass, so we end up with sorted elements in original array and there is no need to copy elements from aux array to the original array.

So, is this any good ? Here’s results:

Size Comparable LSD Radix
10 0 ms. 10 ms.
100 1 ms. 9 ms.
1_000 3 ms. 6 ms.
10_000 37 ms. 10 ms.
100_000 178 ms. 65 ms.
1_000_000 466 ms. 179 ms.
10_000_000 8882 ms. 2219 ms.

For small number of elements it’s definitely not worth it, but that was expected. Above certain threshold though it definitely makes sense – it’s faster than system sort, and since Java uses mergesort for objects anyway, it doesn’t have that much bigger memory requirements than system sorting (the two aux key arrays, but in performance critical code if you want to sacrifice some purity for performance it too can be avoided).

There is one low-hanging optimization that can double the performance for large arrays – if you can limit your key size to a smaller number of bytes, you can get away with less passes. For example I tested with the Date converted to milliseconds from start of unix epoch which is long datatype, but afterwords did some tests with the key calculated by making the number of milliseconds relative to January 1-st 2000 (since I knew my data set couldn’t contain dates prior to that one) and lowered resolution to seconds which allowed me to fit it in 4 byte int.

It not always can be done, but as with many optimizations – a lot depends on inherent properties of your data.

I’ve always thought that Java is in the wrong by not allowing generic collections of primitive types. I’ve experienced in the past some situations where boxing and unboxing was the reason of a bottleneck in a performance critical code and always felt when working with C# that for example Dictionary<int,int> is very useful and gives that extra edge with performance So today I’ve done some experiments:

I declared ArrayList<Integer> on which I performed 100 000 inserts followed by removing all elements one-by-one (always removing the first element):

  • 6 ms. for inserts (includes amortized array resizing)
  • 3790 ms. for removes

Oddly enough I found that the removals do not cause the shrinking of the array. Canonical implementations resize by a factor of two up whenever the space runs out, but also shrink the underlying array by a factor of two if it’s 3/4 empty
I checked the Java source code and indeed, there is no shrinking of the array with removals (also the increase is not by a factor of 2 but 3)

But, back to the original thought
Here’s the code i used for my resizable list of primitive integer values:

It uses the same System.arraycopy method to move elements around and Arrays.copyOf for growing as the ArrayList implementation

And the results?

  • 3 ms. for inserts
  • 3117 ms. for removes

Clearly the cost is dominated by remove and in particular by shifting all elements by one ‘left’ which each removal (no surprise there). The shrinking of the array doesn’t help since the move is made only for elements up to current size regardless of the actual array size (it actually adds a little)
So, to make the test more about arraylist of primitive types rather than about why I should never remove the first element from an arraylist and use some other structure instead, I modified the test to always remove the last element, thus making shifting of the elements unnecessary.I also increased the number of elements put to the list to 10 000 000 since it was running much faster now. The result startled me a little:

  • ArrayList add: 2096 ms.
  • ArrayList remove: 467 ms.
  • IntList add: 181 ms.
  • IntList remove: 85 ms.

Now that’s a bigger difference (especially for inserts). So I ran the profiler and here’s what I saw (I actually had to copy the ArrayList code to a separate class in order for the visualVM to show the stats):

  • ArrayList add: 461 ms.
  • IntList add: 296 ms.
  • ArrayList remove: 433 ms.
  • IntList remove: 189 ms.

Of course these numbers are ms calibrated to the system and are so-so accurate, but they show the scale. Without boxing taken into account the numbers are significantly different.I actually tested it again with a slightly modified version where the list of integers inserted was pre-generated and generation time was not added to insert time:

  • ArrayList add: 233 ms.
  • IntList add: 77 ms.
  • ArrayList remove: 205 ms.
  • IntList remove: 160 ms.

The differences here can be accounted for by the lack of error checking in the test application.

So again – back to original point – Is the boxing costly ? Yes, very.

This is an example that was shown to me back in the day when I was at the university. I took a great class ‘Anatomy of Java virtual machine’ taught absolutely brilliantly by prof. Grzegorz Czajkowski  (at the time working for SUN but soon after moving to Google) where we got to play with raw java bytecode and write our own (simple) garbage collectors. Quite challenging for a student but a lot of fun nevertheless.

Anyway, here’s a piece of code:

What will be the result of executing it ?

As expected – a lot of console output.

What if we change the value of the integers a bit ?

Will the output change ?

After running this the program terminates instantly with no output!  (By the way, the incrementation  is not necessary, we could simply start with 128)

It’s the only known to me example where the value has such impact on flow control (if you know any other interesting ones, drop me an email).

It’s all got to do with Boxing/Unboxing and a certain specific caching feature of Java. Let’s dissect it a bit. First of all, let’s look at the conditional statement:

i<j || i>j || i==j

remember, those are Integers, not ints i < j and i > j will force auto unboxing so the comparison will happen with unboxed ints. The == equality comparison however will happen for Integer types and will compare the references. That explains why we get no output for 128 – i and j are simply two different Integer objects, so the reference comparison produces false.

Why then if we set the value to 127 (or 10 for example) do we get an infinite loop ? JVM caches the Integers between -128 and 127 and when you use one of those, it uses a cached one instead of creating new Integer object every time. It’s actually a part of JLS (Java Language Specification) – 5.1.7 Boxing Conversion states “If the value p being boxed is truefalse, a byte, a char in the range \u0000 to \u007f, or an int or short number between -128 and 127, then let r1 and r2 be the results of any two boxing conversions of p. It is always the case that r1 == r2.”

Java 1.6 from certain update allows for specifying the upper threshold of the cached range – by using -XX:AutoBoxCacheMax=<size> JVM option

Some use this example as an argument that you should always use Integer.valueOf() instead of autoboxing. I think you just need to be careful and aware of such behaviors.

EDIT: Check out this blog post: It’s explaining how the Integer caching has changed through different versions of Java