A bitset is a data structure designed to store a vector of boolean values very compactly – one bit per value. In practice, they’re a really handy way to save memory. However, we had a situation in one of our extremely memory-intensive applications where a simple bitset wouldn’t cut it. We have over 2500 variables to store bits on, meaning that our bitsets took up over 300 bytes each!
If most or all of the bits were usually set, this would be an unavoidable issue, but in our case, the set bits were very rare – usually no more than 20 or 30. This leads to a large amount of wasted space. The traditional approach to sparse sets like these is to just store the position number of the set variables directly in a collection. To allow for the full 2500 position numbers, we needed a short int, meaning that with this approach, the memory size of the collection is 2 bytes times the number of elements. A sparse set of 30 elements will take a lot less memory than the equivalent bitset (60 bytes versus 313), but in our application, it’s still too much.
What if we could combine the tight-packing benefits of a bitset with the low impact of the sparse set? It turns out that we can. In our application, some of the variables are set pretty frequently (1/4 records have it set) and some are set very rarely (1/10000), with a gradient of various frequencies between. We can exploit this gradient to make really efficient storage decisions. Here’s the graph of frequencies over our set:
It comes down to this: in a bitset, each possible variable in your set costs you one bit, whether or not it’s set. In a sparse set, with our example of 2500 variables, each variable costs 16 bits, but only when it’s set. Comparing these two costs gives a clear tradeoff point. When a given variable is set on at least one of every 16 records, you should store it in a bitset; when it is set on one out of every 17 or more records, then you should store it in the sparse set.
To build this type of set in practice, you just need to get your variables into a list ordered by frequency of occurrence and then apply the cutoff. Everything above should be managed through a bitset, and everything below should be managed by a collection. We found it handy to wrap both of these sets up in a single class so that the user can be indifferent to where the data is stored.