Arena as the default allocator
Like many people in tech, I started my career as a web developer. I was always curious how exactly the machine works, and my career path follows that curiosity: first a Django developer, then Go, then C++ for a search engine, then a network infrastructure engineer and kernel-adjacent developer.
After doing all that and now making native software (while working at an AI startup as my day job), I think the biggest lie in the web world is the notion of infinite memory. The service uses too much RAM to serve traffic? Throw more hardware at it!
I vividly remember this one time I talked with a Linux kernel contributor about cgroup memory management implementation details and I said something along the lines of "wait, everybody knows that the memory you allocate is not real memory, it's virtual pages" and half of my team audibly laughed. Turns out, that's not as widely known as I thought.
There's an incredible series of blog posts (or video, if you prefer) by Ryan Fleury about arenas. There are many ways to implement one, but basically it's a stack allocator. Since I used Ryan's project, raddbg, as a starting point for the thing I'm making, naturally, I've been using arenas a lot.
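To make the idea concrete, here's a minimal sketch of the stack discipline — not Ryan's implementation, just the core idea. A real arena reserves virtual address space and commits pages lazily; this toy version replaces all that with a single malloc'd block:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Toy arena: a bump allocator over one malloc'd block. Real implementations
// (including raddbg's) reserve address space up front and commit pages as
// needed; this sketch skips that to show only the stack discipline.
struct Arena {
    uint8_t *base;
    size_t   cap;
    size_t   pos; // everything below pos is live
};

Arena arena_make(size_t cap) {
    Arena a;
    a.base = (uint8_t *)malloc(cap);
    a.cap  = cap;
    a.pos  = 0;
    return a;
}

void *arena_push(Arena *a, size_t size, size_t align) {
    // align must be a power of two
    size_t p = (a->pos + align - 1) & ~(align - 1);
    if (p + size > a->cap) return nullptr; // real code would commit more pages
    a->pos = p + size;
    return a->base + p;
}

// Freeing is just rewinding the position; no per-allocation bookkeeping.
void arena_pop_to(Arena *a, size_t pos) { a->pos = pos; }
void arena_release(Arena *a) { free(a->base); a->base = nullptr; a->cap = a->pos = 0; }
```

Temporary allocations become trivial: record the position, allocate freely, then rewind — no per-object frees.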
To make the codebase more approachable, I figured I'd sprinkle in some additional allocator implementations. "Mimalloc is great!" I thought, and sometimes I might want a generic allocator for many short-lived strings (think: a web form). To store messages in an LLM chat app, I did something like:
struct Chat
{
    ChatMetadata metadata;
    Array<Message> msgs;
};

struct Chats
{
    Heap *heap; // mimalloc heap
    Array<Chat> chats;
};
The issue with supporting multiple allocators is that you either need to store function pointers somewhere to make the utility code generic, or use generics everywhere. The generic option is out of the question, because I care deeply about compile times, and the function pointer solution spreads through the codebase quickly.
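For illustration, here's roughly what the function-pointer route looks like — the names below are hypothetical, not from my codebase. Note how even a trivial helper has to thread the allocator through, which is exactly how it spreads:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical vtable-style allocator interface. Every container and utility
// either stores this struct or takes it as a parameter.
struct Allocator {
    void *(*alloc)(void *ctx, size_t size);
    void  (*release)(void *ctx, void *ptr);
    void  *ctx; // e.g. a mimalloc heap or an arena
};

static void *malloc_alloc(void *, size_t size) { return malloc(size); }
static void  malloc_release(void *, void *ptr) { free(ptr); }

Allocator malloc_allocator() {
    Allocator a = {malloc_alloc, malloc_release, nullptr};
    return a;
}

// Even a trivial helper now carries the allocator around, and every call
// site pays for an indirect call.
char *copy_bytes(Allocator *a, const char *src, size_t n) {
    char *dst = (char *)a->alloc(a->ctx, n);
    for (size_t i = 0; i < n; i++) dst[i] = src[i];
    return dst;
}
```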
Another issue is that the described solution doesn't follow the lifetimes of the data the application operates upon. Message history is needed only for tab display; the lifetime of messages is tied to the lifetime of one (or multiple) tabs displaying their history. Even if I manage to create a correct implementation "unloading" messages on tab close, it's error-prone. "I won't unload messages, memory is cheap!" I thought. But that's exactly the lie I described earlier – memory is not infinite. And although for a chat app loading an entire 200MB of history is not an issue, this style of thinking produces applications like Discord, which recently started reloading itself in the background after it reaches 4GB of memory usage. That's just bad design.
A better design builds on a simple realization: a) I don't need to keep every chat's message history in memory to make it fast — I use SQLite, an incredible piece of software, and loading thousands of history entries takes milliseconds — and b) history lifetime is not coupled with metadata lifetime. Since I still need growable arrays for both metadata and messages, I just use chunked arrays: a linked list of fixed-size chunks. I'm a heavy LLM user, so I can get precise statistics from my own Claude data export: 8192-entry chunks for metadata (1–3 chunks cover 99%+ of use cases; I have 10k+ chats), and 128-entry chunks for message arrays (my average conversation is 20 messages). Memory overhead is negligible, giving the following structure:
struct Chat_MetadataEntry
{
Chat_Metadata metadata;
Chat_Status status;
};
struct Chat_MetadataStorage
{
Arena *arena;
ChunkedArray<Chat_MetadataEntry, 8192> chunks;
};
struct Chat_Messages
{
Chat_Messages *prev; // LRU linkage
Chat_Messages *next;
Arena *arena; // owns all message data + chunks (512KB committed, 1MB reserved)
Chat_Handle owner; // which chat this belongs to
ChunkedArray<Chat_Message, 128> chunks;
b32 io_ref; // IO thread is borrowing this, don't evict
u32 view_refs; // number of open views referencing this, don't evict
};
struct Chat_MessagesLru
{
Arena *arena; // owns Chat_Messages structs (not their message data)
Chat_Messages *first; // most recently used
Chat_Messages *last; // eviction candidate
Chat_Messages *free; // singly-linked free list (uses 'next' pointer)
u32 active; // count in LRU list
u32 total; // active + free
};
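A minimal sketch of the chunked array itself, for reference — the real one allocates chunks from the owning arena; malloc is used here only to keep the sketch self-contained:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Minimal chunked array: a singly linked list of fixed-capacity chunks.
// N is the number of entries per chunk; push fills the last chunk and
// links a new one when it's full. Pointers into the array stay stable,
// since chunks never move or reallocate.
template <typename T, size_t N>
struct ChunkedArray {
    struct Chunk {
        Chunk *next;
        size_t count;
        T      items[N];
    };
    Chunk *first = nullptr;
    Chunk *last  = nullptr;
    size_t total = 0;

    void push(const T &v) {
        if (!last || last->count == N) {
            Chunk *c = (Chunk *)calloc(1, sizeof(Chunk));
            if (last) last->next = c; else first = c;
            last = c;
        }
        last->items[last->count++] = v;
        total++;
    }
};
```

With 128-entry chunks, a typical 20-message conversation never leaves its first chunk.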
Twice the code, but a much, much simpler solution.
Funnily enough, before I started spending long nights reading how game engines work, I wondered: how do game engines fit seemingly endless open worlds into a fixed memory budget? The structure above is pretty close to the answer: you pre-allocate fixed-size buffers and use them as a cache for data streamed from disk.
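The eviction half of that cache fits in a few lines. Here's a sketch with the Chat_Messages fields trimmed down to the ones eviction cares about: walk from the least recently used end and skip anything pinned by the IO thread or an open view.

```cpp
#include <cassert>
#include <cstdint>

// Trimmed-down versions of the structs above, keeping only what the
// eviction scan needs.
struct Messages {
    Messages *prev;
    Messages *next;
    int32_t   io_ref;    // IO thread is borrowing this, don't evict
    uint32_t  view_refs; // open views referencing this, don't evict
};

struct Lru {
    Messages *first; // most recently used
    Messages *last;  // eviction candidate
};

// Walk from the LRU end toward the MRU end, skipping pinned entries.
Messages *lru_pick_victim(Lru *lru) {
    for (Messages *m = lru->last; m; m = m->prev) {
        if (!m->io_ref && m->view_refs == 0) return m;
    }
    return nullptr; // everything is pinned; caller must wait or grow the pool
}
```

The caller then unlinks the victim, releases its arena (freeing all message data in one shot), and pushes the struct onto the free list.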
"Ok, but what about the form with many strings?" you might ask. E.g., a property editor with many string fields. The answer is two-level segregated fit. TLSF allocates in O(1), it's trivial to implement (~400 lines of code you can generate with an LLM, or just copy the sources yourself), and it's much better integration-wise, because you can build it on top of an arena yourself, keeping the implementation uniform.
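To show where the O(1) comes from, here's the "two-level" mapping in isolation, with a hypothetical SL_BITS = 4 (a common choice). A size maps to a first-level index (its power-of-two range) and a second-level index (a linear split of that range into 2^SL_BITS buckets); free lists live in a [FL][SL] grid, and a bitmap per level makes finding a non-empty bucket one find-first-set instruction. The bitmaps and the block split/merge logic of a real TLSF are omitted here.

```cpp
#include <cassert>
#include <cstdint>

enum { SL_BITS = 4 }; // 2^4 = 16 second-level buckets per power-of-two range

struct TlsfIndex { uint32_t fl, sl; };

TlsfIndex tlsf_mapping(uint32_t size) {
    TlsfIndex idx;
    if (size < (1u << SL_BITS)) {
        // Sizes below 16 get their own linear buckets.
        idx.fl = 0;
        idx.sl = size;
    } else {
        uint32_t fl = 31 - __builtin_clz(size);        // floor(log2(size))
        idx.fl = fl - SL_BITS + 1;                     // shift past the linear range
        idx.sl = (size >> (fl - SL_BITS)) & ((1u << SL_BITS) - 1);
    }
    return idx;
}
```

Allocation is then: compute the indices, scan the bitmaps for the first non-empty bucket at or above them, pop the head of that free list — no searching, no size-proportional work.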
Another neat trick I've done with arenas is keeping track of what was allocated where, like this:
#if BUILD_DEBUG
#define ArenaAllocSized( res, cmt ) arena_alloc__sized_impl( ( res ), ( cmt ), __FILE__, __LINE__ )
#define ArenaAlloc() arena_alloc_impl( __FILE__, __LINE__ )
#define HeapAlloc() heap_alloc_impl( __FILE__, __LINE__ )
#else
#define ArenaAllocSized( res, cmt ) arena_alloc__sized_impl( ( res ), ( cmt ) )
#define ArenaAlloc() arena_alloc_impl()
#define HeapAlloc() heap_alloc_impl()
#endif
struct ArenaSlot
{
Arena *arena; // pointer to the actual arena (virtual memory mapped separately)
cstring alloc_file; // points to .rodata, no allocation needed
u32 alloc_line;
};
struct ArenaPool
{
ArenaSlot slots[ARENA_POOL_SIZE];
u64 free_bits[ARENA_POOL_WORDS]; // 1 = available, 0 = in use
};
// Global pool - zero-initialized at startup
extern ArenaPool g_arena_pool;
With a simple CAS loop to track arenas:
lo_internal u32 arena_pool_claim_slot()
{
// Lazy init - safe because first allocation is typically single-threaded
arena_pool_init_once();
// Scan bitmask for available slot
for ( u32 i = 0; i < ARENA_POOL_WORDS; i++ )
{
u64 bits = ins_atomic_u64_eval( &g_arena_pool.free_bits[i] );
while ( bits != 0 )
{
// Find first set bit
u32 slot = static_cast<u32>( __builtin_ctzll( bits ) );
u64 mask = 1ULL << slot;
// Try to claim it: clear the bit atomically
u64 old = ins_atomic_u64_fetch_and( &g_arena_pool.free_bits[i], ~mask );
if ( old & mask )
{
// We got it - bit was set and we cleared it
return i * 64 + slot;
}
// Someone else got it, reload and try again
bits = ins_atomic_u64_eval( &g_arena_pool.free_bits[i] );
}
}
// Pool exhausted
AssertMsg( false, "Arena pool exhausted (2048 arenas). Recompile with larger ARENA_POOL_SIZE." );
return UINT32_MAX;
}
This gives a thread-safe, global, and fast way to track arenas and see which ones I forgot to free on shutdown.
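For completeness, the release side is a single atomic OR — sketched here with GCC/Clang builtins rather than the ins_atomic_* wrappers from my codebase, which wrap the same operations:

```cpp
#include <cassert>
#include <cstdint>

enum { ARENA_POOL_WORDS = 32 }; // 32 * 64 = 2048 slots

static uint64_t g_free_bits[ARENA_POOL_WORDS];

// Returning a slot to the pool: set its bit back to "available".
// No loop needed - OR-ing a bit in is unconditionally correct.
void arena_pool_release_slot(uint32_t slot) {
    uint64_t mask = 1ULL << (slot & 63);
    __atomic_fetch_or(&g_free_bits[slot / 64], mask, __ATOMIC_RELEASE);
}
```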
The moral of the story? Memory is not infinite, and software that assumes otherwise is incorrect. Use arenas, always.