A bit vector (also known as bit set or bit array) is a set data structure which uses only 1 bit per element. It answers the question is X in the set?.
The main advantage is memory efficiency and speed, as the memory is allocated in a single continuous block, making it very cache friendly (unlike some tree based data structures), and requiring only a few bitwise operations to access/modify elements in the set. The disadvantage is that we need to know the size of the bit vector beforehand, and that we might be wasting some of the memory if we only store a few elements in a large vector. Let’s look at this more closely.
For simplicity, we can think of the bit vector as an array of bits. Not boolean true/false values, or bytes, but bits. Our goal is to map the set of all possible values we might want to store (also called the domain) to a unique index in the bit vector. A good example would be if we wanted a set of small integers (say for an algorithm like the prime sieve of Eratosthenes). We would then need as many bits as is the highest integer we might want to store. If the highest number is 1024, our vector would need 1024 bits, or 128 bytes to store all our membership values (flags).
As a small sidenote, you can use the Chrome developer console to run the example code. It was written in a way that you can copy paste each snippet as you go along and everything will work.
Implementation using basic Array
and Number
types
Let’s do a simple implementation first, using bare JavaScript arrays of Number
. We can do this because the JavaScript bitwise operators treat their operands as 32 bit integers. Afterwards, we’ll do the same using the new Uint32Array
type. But first, let’s use a regular Array
to make things simpler.
To create a bit vector, we first need to specify the number of bits we need (which is also the number of possible values we can store membership of). The question becomes, how many 32-bit integers do we need to store N
bits? The answer is unsurprisingly N / 32
.
// A function which takes a number of bits and returns an initialized
// bit vector of given length.
function buildVector(bitCount) {
// The number of bits each `Number` can store.
var BITS_PER_ELEMENT = 32;
// Total number of `Number` values in the vector.
// We round up, because even if we need less than 32 bits, we need at least 1 `Number`.
var elementCount = Math.ceil(bitCount / BITS_PER_ELEMENT);
var vector = new Array(elementCount);
// We initialize our bit vector to all zeros
for (var i = 0; i < elementCount; i++) {
vector[i] = 0;
}
return vector;
}
Now that we have our bit vector, all that is left to do is write the get
and set
methods for manipulating bit values by their respective bit index. We’ll first consider only having a single Number
representing 32 bits.
Short introduction to binary
Binary numbers are represented as a sum of powers of two, for example:
- \(1 = 2^0\)
- \(2 = 2^1\)
- \(3 = 1 + 2 = 2^0 + 2^1\)
- \(4 = 2^2\)
- …
If you don’t have much experience with binary, you might be tempted to think that \(4 = 1 + 3 = 2^0 + (2^0 + 2^1)\), but that wouldn’t work, as we only want one of each power of two. A simple rule to achieve this is that we do a greedy approach, starting from the biggest power of two we can.
- 7 … we can fit a 4 into that, which means it’s \(4 + 3\), and we have to convert the 3, which is \(2 + 1\), resulting in \(4 + 2 + 1\) or \(2^2 + 2^1 + 2^0\)
- 14 … we can fit an 8, leaving us with 6, on which we iterate the same rule and get \(4 + 2\), resulting in \(8 + 4 + 2\) or \(2^3 + 2^2 + 2^1\)
A binary number is then simply a sequence of 0 or 1, stating 1 for each power of 2 we have (going from the lowest from the right), and 0 for each one we don’t have, thus:
- \(7 = 2^2 + 2^1 + 2^0\) which we can write as \(1 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0\), which gives us \(111\)
- \(14 = 2^3 + 2^2 + 2^1\) which we can write as \(1 \cdot 2^3 + 1 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0\), which gives us \(1110\) reading from the right
Note that we can add any number of zeros to the left, so 111 is equivalent to 0111 and 000000111.
A small sidestep here, we can also use hexadecimal numbers to represent binary, as the conversion is rather simple. A hexadecimal digit represents a value from 0 to 15, which is exactly what 4 bits represent. We can thus take any binary number, such as 100000101011111010 and convert it to hex (or back):
- look at it as groups of 4 bits from the right
10 0000 1010 1111 1010
- add leading zeros to the leftmost group if needed
0010 0000 1010 1111 1010
- convert each individual group
2 0 10 15 10
to decimal - write each decimal in hex
2 0 A F A
- put them back together, prefixing with
0x
to get0x20AFA
Converting hexadecimal back to binary is simple, just take these steps backwards. The reason we use hexadecimal numbers instead of binary often is because they are much easier to visually parse, understand and remember. Looking at a number like 0xA1
is much clearer than looking at 10100001
, because you don’t have to count how long is each run of zeros/ones.
Bitwise operators
To manipulate individual bits, we’ll make use of a few simple bitwise operators. They are called bitwise, because they manipulate individual bits. Specifically, we’ll need:
- negation (NOT), using the
~
(tilde) operator, which simply flips all the bits - conjunction (AND), using the
&
operator, which returns 1 when both bits are 1, otherwise 0 - disjunction (OR), using the
|
operator, which returns 0 when both bits are 0, otherwise 1 - left shift, using the
<<
operator, which is shifting all the bits to the left, or semantically multiplying by a given power of 2.
Because these operators are bitwise, they will operate on all the bits in parallel. This is different from the more common logical operators &&
and ||
(note that they’re doubled), which operate on the whole numbers. Here are a few examples (the b
suffix signifies a binary string, this is not proper JavaScript syntax and only used for demonstration purposes):
1 | 2 = 01b | 10b = 11b = 3
1 || 2 = 1
, because both are converted to booleans, and both are true1 & 3 = 01b & 11b = 01b = 2
1 && 3 = 1
, because both are converted to booleans, and both are true1 << 1 = 1b << 1 = 10b = 2
, or \(1 \cdot 2^1\)1 << 3 = 1b << 3 = 1000b = 8
, or \(1 \cdot 2^3\)
To understand the NOT operator, we first need to understand that while mathematically, binary numbers are infinite (or can be), we are only working with 32-bit integers. This means if we start with 0
and do a negation ~0
, we get a 32 bit number with all bits set to 1.
Because JavaScript uses two’s complement, ~0
will actually be -1
(or 11111111111111111111111111111111
in binary). This is because the Number
type behaves as a signed 32-bit number, which means it also has to represent negative values. The important thing here is that two’s complement doesn’t say anything about the actual bits, it only specifies what value those bits represent when doing other mathematical operations (+
, -
, *
, etc.). It also affects how the browser will display each number. If you want to learn more about two’s complement, check out the wikipedia article
or the following online calculator (there are many others) to get an idea for how it works.
But since our bit vector doesn’t need to do arithmetic, we don’t really need to worry about this. We might occasionally want to print out a given number in hex or binary, which can be done using the .toString
function, for example (14).toString(2)
outputs binary 1110
and (14).toString(16)
outputs hex e
.
Knowing how binary and bitwise operators work, we can finally figure out how to set a specific bit in a given 32-bit integer. The OR
operator |
is perfect for this, as it won’t change a 1
to 0
(and thus leaving the existing values alone), but will be able to set a 0
to 1
. Counting from 0, if we want to set the 1st bit (at index 0) to 1
, we simply do num | 1
, as this can also be read as num | 00000000000000000000000000000001b
. If we wanted to set the 2nd bit (at index 1), we’d want num | 10b
, or num | 2
in decimal/hex. The 3rd bit (at index 2) would be num | 100b
or num | 4
, the 4th bit (at index 3) would be num | 1000b
or num | 8
, and so on. We’ll call the number on the right side of the operator a bit mask.
If you look closely, you can probably figure out the pattern. To set the i
-th bit, we need to OR the number with \(2^i\), which can be easily created with a left shift as 1 << i
. The whole operation then becomes num | (1 << i)
. Before moving on, let’s do a more visual example. We’ll start with num = 0xDA
(or 11011010
, or 218
dec), and toggle the 3rd bit (index 2).
num 11011010
mask 00000100
OR | --------
11011110
We can also use the same operation to check if a given bit is set. As all the bits except for the i
-th are zero, we can use the &
operator, which will return a non-zero number if and only if the i
-th bit in num
is 1. The get
operation is then num & (1 << i)
.
Lastly, we might want to remove elements from the bit vector, which means we need the ability to clear a specific bit. A small recap of the &
operator.
| & | 0 | 1 |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 1 |
We can see that if we set the mask to all 1
, doing num & 1111...1111b
doesn’t change the num
value. We also see that no matter what value is in num
, if any of the bits in the mask is 0
, the resulting bit will also be 0
. Thus num & 11101b
will set the 2nd bit from the right (index 1) to 0
and leave all of the other bits intact.
Constructing such mask is simple, since we only need to take our set
mask from before and flip all the bits using the NOT operator ~
. Resulting in num & (~(1 << i))
. I’ve added extra parentheses to make the order of operations clear. Beware that &
and |
have a very low priority, so it might be a good idea to be very explicit with parens around bit operations unless you’re sure what you’re doing is correct.
Here’s a similar example as we did for OR, starting with num = 0xDA
, clearing the 7th bit (index 6). We first construct the mask step by step 1 << 6 = 01000000b
, followed by a negation ~01000000 = 10111111
.
num 11011010
mask 10111111
AND & --------
10011010
Implementing get
, set
and clear
on a 32-bit vector
As mentioned before, let’s first consider only a 32-bit vector stored in a single Number
. The operations would be:
// Set the i-th bit to 1
function set(vec, i) {
return vec | (1 << i);
}
// Clear the i-th bit
function clear(vec, i) {
return vec & (~(1 << i));
}
// Return the value of the i-th bit
function get(vec, i) {
var value = vec & (1 << i);
// we convert to boolean to make sure the result is always 0 or 1,
// instead of what is returned by the mask
return value != 0;
}
Note that all of these functions return a new number as their result. Let’s test to see if it works:
// Since our bit vector is stored in a single number, we simply initialize it as 0.
var vec = 0;
vec = set(vec, 3);
console.log("is 3 in vec? " + get(vec, 3));
// is 3 in vec? true
console.log("is 4 in vec? " + get(vec, 4));
// is 4 in vec? false
vec = clear(vec, 3);
console.log("is 3 in vec? " + get(vec, 3));
// is 3 in vec? false
Remember the number only has 32-bits, so don’t use an index bigger than 31. If you do, it will simply wrap around, so you’ll get set(0, 0) == set(0, 32)
.
Implementing get
, set
and clear
on an arbitrary length bit vector
Now we’re finally ready to create the whole data structure, an arbitrary length bit vector. We need to modify our get
, set
, and clear
to calculate the right Number
within the array first, and then to do the same bit manipulation they did before.
Going again from the right, bits 0 - 31 will be stored in the 1st Number
(at index 0), bits 32 - 63 at index 1, 64 - 95 at index 2, etc. From this, we can see that the index in the bigger array is simply the bit index divided by 32 and rounded down. Simply Math.floor(i / 32)
. This gives us the index of the Number
.
To get the bit index within the number, we simply take the remainder of dividing by 32, or the modulo 32 of the original bit index. This gives us i % 32
for the bit index. Putting this together (note that since we’re using an Array
, the bit vector is now mutable, unlike the previous 32-bit version using only a Number
). I’ve added the original buildVector
function to make it easier to copy paste this code as a whole.
// Set the i-th bit to 1
function set(vec, i) {
var bigIndex = Math.floor(i / 32);
var smallIndex = i % 32;
vec[bigIndex] = vec[bigIndex] | (1 << smallIndex);
}
// Clear the i-th bit
function clear(vec, i) {
var bigIndex = Math.floor(i / 32);
var smallIndex = i % 32;
vec[bigIndex] = vec[bigIndex] & (~(1 << smallIndex));
}
// Return the value of the i-th bit
function get(vec, i) {
var bigIndex = Math.floor(i / 32);
var smallIndex = i % 32;
var value = vec[bigIndex] & (1 << smallIndex);
// we convert to boolean to make sure the result is always 0 or 1,
// instead of what is returned by the mask
return value != 0;
}
// A function which takes a number of bits and returns an initialized
// bit vector of given length.
function buildVector(bitCount) {
// Total number of `Number` values in the vector.
// Adding Math.ceil here to make sure we allocate enough space even if the size
// is not divisible by 32.
var elementCount = Math.ceil(bitCount / 32);
var vector = new Array(elementCount);
for (var i = 0; i < elementCount; i++) {
vector[i] = 0;
}
return vector;
}
We can do a similar test as we did before to test our bit vector:
// Since our bit vector is stored in a single number, we simply initialize it as 0.
var vec = buildVector(64);
set(vec, 30);
console.log("is 30 in vec? " + get(vec, 30));
// is 30 in vec? true
console.log("is 40 in vec? " + get(vec, 40));
// is 40 in vec? false
clear(vec, 30);
console.log("is 30 in vec? " + get(vec, 30));
// is 30 in vec? false
Using Uint32Array instead of an Array
of Number
Modern browsers now provide a better and more efficient variant to an Array
of Number
, which is using Uint32Array
. The difference here is that JavaScript Array
is not exactly the array you would expect if you came out of a computer science class. It behaves more like a hash map with integer keys. You can also store different types in the same array, for example [1, "hello"]
is completely valid JavaScript. Secondly, Number
is not a 32-bit integer. According to the standard, Number
is a IEEE-754 double precision float. The trick here is that the bitwise operators convert their operands to a 32-bit integer before applying the operation. The conversion is defined as an abstract operation, so it most likely comes down to how the implementation chooses to handle things.
The ideal scenario would be that the JIT (just in-time compiler) recognizes that we’re only doing bitwise operations on something that starts out as a constant zero, and thus uses a 32-bit integer as the backing store for our data, and also recognizes that the array doesn’t contain anything else, so it wouldn’t have to use a generic implementation that allows different types, but rather a continuous block of memory. While this might be possible, it’s most likely not what happens, at least not something that can be guarateed to happen 100% of the time, because the JIT would need to understand everything your code is doing to prove that such optimization is possible. The halting problem however proves that the compiler can’t understand any arbitrary code, and as such any optimization could be only based on heuristics.
This is why the Uint32Array
type was added to JavaScript. While the compiler/interpreter/JIT can’t know that we only intend to use 32-bit integers, we as the programmers do know it, so we can choose a more specific data structure that allows for exactly that. Uint32Array
is a type which has only one purpose, to store unsigned 32-bit integers in a continuous block of memory.
Using it is actually even simpler than what we did before, as our buildVector
function turns into a one liner.
function buildVector(bitCount) {
// The constructor accepts a number of 32-bit integers in the array,
// which is simply the number of bits in our bit vector divided by 32.
// We also keep the `Math.ceil` just to make the API more robust.
return new Uint32Array(Math.ceil(bitCount / 32));
}
On the outside, the Uint32Array
behaves just like an Array
, with the exception that the operands to the indexer []
operator get converted to unsigned 32-bit integers.
var arr = new Uint32Array(10);
arr[0] = "foo";
arr[1] = "123";
arr[2] = 3.14;
console.log(arr[0] + " " + arr[1] + " " + arr[2]);
// 0 123 3
Everything else about the bit vector (set
, get
, and clear
) will stay the same, so there isn’t really anything we’re giving up for using the more efficient Uint32Array
version.
Conclusion
If you’ve read this far, you should now feel pretty confident about how the bit vector works, and be able to implement it yourself. A bit vector might not be the most popular data structure, but it can come handy in various different scenarios. A specific example could be using binary frames with the WebSocket
API, in which case you might want to minimize the network traffic as much as possible. When working with binary frames, you will most certainly run into Uint32Array
and bitwise operators, so at least knowing how a bit vector works can help you there. It’s also useful to know that there are other built-in array types with predefined length, such as Uint8Array
, Int32Array
(note the lack of U
, as this is a signed integer version of a 32-bit array), Float64Array
, etc. For more details on these check out the Indexed collections section under Global Objects on MDN. You might also be interested in seeing browser support of typed arrays in the modern browsers and different polyfill options.
Lastly, I’d like to note a little bit about dynamically sized bit vectors. Much like a regular array, a bit vector can also be implemented in a way that allows for resizing. In the Array
variant, we would just need to push a few additional zeroed Number
instances into the array to make the bit vector larger, while the Uint32Array
variant would require us to allocate a new Uint32Array
with larger size and copy things over. At first it might seem like the Array
variant is clearly superior in this regard, but here’s a few thoughts:
- if the JIT recognized our
Array
should use an efficient packed 32-bit integer block of memory to store the data, pushing a new element into it would do exactly the same as if we create a newUint32Array
(there could be more optimizations going on, but the same could be said for the JIT optimizing a resize of theUint32Array
variant) - if the
Array
is backed by a generic array of objects with extended capacity for pushing new elements into it, thepush
itself wouldn’t cost as much, but there could be a price paid in terms of performance of the regularset
,get
andclear
operations
Note that this is mostly food for thought, I haven’t done any benchmarks comparing the two variants, and could be very wrong with regards what happens in actual JavaScript implementations. If I was made to guess, I’d say the Uint32Array
would outperform the Array
even with an occasional resize. But feel free to correct me on this in the comments.
References
- MDN Bitwise Operators
- MDN Uint32Array
- Two’s complement on Wikipedia
- Two’s complement calculator
- Typed Array browser compatibility table
- MDN Global Objects (Indexed collections section)
- Browser support of typed arrays
Algorithms