Why memory allocation matters
This is a topic I have wanted to write about for a while. We all know hash maps offer constant-time lookup, O(1) in big-O lingo. But can we speed up this super useful data structure by another 10x, or sometimes even 100x?
To better understand the problem let’s write a small program. The program will create a hash map (unordered_map in C++) and populate it with 100 million (10^8) elements.
```cpp
#include <unordered_map>
#include <cstdlib>      // std::rand, std::srand
#include <ctime>        // std::time
#include <nmmintrin.h>  // _mm_crc32_u64 (SSE4.2)

int main(int argc, char *argv[]) {
  // Cheap hash built on the hardware CRC32 instruction.
  struct hash {
    size_t operator()(size_t h) const noexcept {
      return _mm_crc32_u64(0, h);
    }
  };
  size_t iters = 1e8;
  std::unordered_map<size_t, size_t, hash> ht;
  ht.reserve(iters);
  std::srand(std::time(nullptr));
  int r = std::rand();
  for (size_t i = 0; i < iters; ++i) {
    ht.try_emplace(i, i * r % iters);
  }
  return 0;
}
```
We can compile the above program and check how long it takes to run:
```shell
g++ -std=c++2b -O3 -g -mavx -DNDEBUG main.cc
```
Running the above program and measuring its wall-clock time gives us our baseline: the program took about 3 minutes and 20 seconds to run.
With a few changes, we can get the runtime of the same program down to ~11 seconds.
To make this improvement we need to change how and where the hash map is stored. It might be obvious that anything a program stores lives in "memory". What is less obvious is that everything lives in "virtual memory": every memory address a program accesses is actually a virtual memory address (VMA), and the CPU has to translate each VMA into a PMA (physical memory address). This translation is expensive, so (to my knowledge) all modern CPUs have hardware circuitry to help with it: the translation lookaside buffer (TLB), a small cache of recent VMA-to-PMA translations.

Virtual memory is managed in chunks, and those chunks are called pages. On Linux, the default page size is 4 KB. Instead of using regular pages to store the map, we can use "huge pages". These are larger than regular pages and come in two sizes on x86-64, 2 MB or 1 GB. Because each TLB entry then covers a much larger region, the program can access a larger memory area without having to re-populate the TLB.
As part of this experiment I tried both 2 MB and 1 GB huge pages, and the difference in runtime was not significant. Having said that, if you decide to go down this route in your own project, you should run the experiment yourself and see whether 2 MB or 1 GB pages make a difference.
Once we start thinking about object placement, it is necessary to take control of memory allocation. This is easy to do in some languages and very hard in others. In C++, for example, every container in the STL can be parameterized with an allocator type.
Here’s an example that shows how to use a custom allocator with an unordered_map (hash map) in C++:
```cpp
#include <unordered_map>
#include <cstdlib>      // std::rand, std::srand
#include <ctime>        // std::time
#include <nmmintrin.h>  // _mm_crc32_u64 (SSE4.2)

int main(int argc, char *argv[]) {
  struct hash {
    size_t operator()(size_t h) const noexcept {
      return _mm_crc32_u64(0, h);
    }
  };
  struct eq {
    bool operator()(size_t a, size_t b) const noexcept {
      return a == b;
    }
  };
  using hpa = huge_page_allocator<std::pair<const size_t, size_t>>;
  size_t iters = 100000000;
  std::unordered_map<size_t, size_t, hash, eq, hpa> ht;
  ht.reserve(iters);
  std::srand(std::time(nullptr));
  int r = std::rand();
  for (size_t i = 0; i < iters; ++i) {
    ht.try_emplace(i, i * r % iters);
  }
  return 0;
}
```
The unordered_map is parameterized with a huge_page_allocator<T>. Full source code for this allocator can be found here. It provides allocate and deallocate methods. allocate returns a pointer to a memory location where new objects can be inserted. The allocator uses the mmap syscall to allocate memory backed by a huge page (instead of a regular memory page). This is a very simple "arena" allocator. In one of my future posts I will talk more about this type of allocator and when to use it. I also plan to talk about one more optimization that I think will become more and more common in the future: it has to do with process scheduling. For now, I'll leave you with the code and results to ponder whether this is something you could use more often. If you have questions or would like to know more about this topic, feel free to reach out via chat.