malloc(3)

In many foreign cities, far away,
Thy lone voice spoke to me like memory.

— Rainer Maria Rilke (1918)

It’s a sad fact of life that managing dynamic memory is a very difficult programming task.

— Peter van der Linden (1994)

malloc is a function in the C standard library. It takes care of provisioning memory from the operating system and giving it back. Around this ostensibly simple responsibility dwell a number of other activities. malloc must be frugal in its use of memory for bookkeeping and debugging; memory is, after all, the application’s resource. It must guarantee proper alignment, sometimes at the cost of wasting bytes. It must be fast; time is also a resource borrowed from the application. It must make its best effort to reduce fragmentation. Thread-safe implementations must account for concurrent access and provide correct, consistent behavior.

One essential difficulty is that malloc needs to meet these demands, often at odds with each other, without knowing anything about the workload applications will present. That is, malloc is a general purpose allocator. An application may do thousands of small allocations and free them immediately. Another may return them all in batches. Another application may make large allocations, in the order of millions of bytes of memory at a time.

In scenarios where the workload is known in advance it pays to build a dedicated allocator, where choices are made to benefit the patterns of allocation, the types of objects stored, or the needs of locality. An application employs a slab allocator when it knows the kind of object it will store in some chunk of memory. It can answer allocation requests quickly, without being concerned with requests of arbitrary size in between, and can afford to keep a pool of initialized objects. An application that allocates multiple objects and disposes of them as on a stack can afford to employ a bump allocator, thanks to the guarantee that no older objects will be destroyed before more recent ones. Not malloc.

Another difficulty arises from the interface malloc and its friends must satisfy. Two functions are predominant: void *malloc(size_t) and void free(void *). There is a handful of other more specialized functions, unimportant at this point. The issue is not apparent from the perspective of the application. It receives memory, gives it back, Bob’s thy uncle. But malloc must perform all its janitorial duties when the application chooses to call it. There is no framework that calls into the application at specific points; no runtime to stop the presses when it determines it is time to collect. There is no preempting the application in the middle of its activities.

Thus, the situation for malloc is quite miserable. It must serve all applications well, with no prior knowledge, using meager amounts of memory and what little time it can get during the application’s business.

Exercise 1. Read your system’s manual page of malloc(3). Then look up the manual pages of other system’s malloc implementations, for example, those of the BSDs, macOS, Linux, Plan 9. Observe their differences: how do they react when given null pointers; do they make any guarantees about zero-initializing memory; do they set errno, and when.

Exercise 2. Look up a few real malloc implementations. Skim the repository of jemalloc, and compare its apparent complexity with that of the malloc implementation in K&R. Ponder the spectrum of complexity of general purpose memory management.

mmap(2)

All is not lost. One may temper the discomfort by the knowledge that the situation could be much worse. malloc stands between application and operating system; it lets the former operate abstracted from the particulars of the latter. The operating system, in turn, lets malloc function without concern for the technicalities of physical memory and multiplexing it among multiple processes. Thanks to the operating system, malloc need not consider page tables, page faults, swapping, or translation lookaside buffers.

The operating system has its own interfaces for giving memory to the userland. The mmap(2) system call is capable of allocating memory and mapping devices and files into the address space of a process.

Exercise 3. Read the manual page for mmap. Note the behavior of the system call for different values of the addr pointer, the protections that may be specified in the prot argument, and the values that may be combined to form the flags argument. Of these, pay special attention to MAP_ANONYMOUS and MAP_PRIVATE. Consider the qualities an allocation needs to have for malloc to handle memory for a process.

Exercise 4. Suppose you want to write instructions assembled by hand into some buffer, then execute them. Consider the memory protections documented by mmap. A policy called “write xor execute” enforces exactly what it says: a process is allowed to write or execute the contents of a page, but not both simultaneously. The mprotect(2) system call provides a way to change the protection flags of some pages after they have been allocated with mmap. Read Chris Wellons’ article “A Basic Just-In-Time Compiler”. Assemble some instructions by hand or with the help of an assembler, write them to a buffer, make the page executable, take a function pointer to your buffer and call it.

Memory pages

Aided by the hardware, operating systems employ a ruse called virtual memory to present processes with the view that they have a large, uninterrupted address space at their disposal. This abstraction makes memory management easier for processes, as they can operate free from the risk of interfering with the memory of other processes running on the system.

The operating system places boundaries to this abstraction. Programs must not try to read or write arbitrary addresses; they must request memory from the operating system first, and are allowed to make use of that memory only, lest they be terminated. Different sections of the address space allow reading, writing or executing.

The smallest unit of memory the operating system manipulates is a page. It does not allocate portions of a page. This is why mmap returns memory aligned to the page size of the system, and rounds up the size of the allocation to a multiple of the page size. A malloc implementation that gets its memory from mmap must potentially serve small requests from relatively large allocations.

Exercise 5. Read the manual page for sysconf(3). Use it to determine the page size of your system.

Alignment

CPUs have an easier time retrieving a full word from memory; if an object extends over a word boundary, multiple loads are necessary. Thus basic data types have alignment requirements that dictate the addresses at which they may be stored. Compilers then lay out data for efficient access rather than prudent use of memory.

Padding is needed to align objects of arbitrary size, acting as stuffing that puts entities on their marks. It materializes at multiple levels. mmap returns memory in pages, even if the requested size is not a multiple of the page size. malloc returns memory aligned for any kind of object; it does not return arbitrary addresses so it must sacrifice bytes to observe this requirement. C struct members either are basic data types, or are made up of basic data types, and therefore must be aligned to an address appropriate for their size. Compilers leave unused space between members to bring each to its required alignment.

A malloc implementation must supply memory that is suitably aligned for any type of object. With 64 bit systems, alignment to 16 bytes is a safe bet.

Exercise 6. Read “The Lost Art of Structure Packing” by Eric Raymond.

Exercise 7. Suppose M divides N. Convince yourself that alignment to N bytes implies alignment to M bytes.

“Step on a crack, break your mother’s back” has mutated to “dereference nonaligned then cuss, cause an error on the bus.”

— Peter van der Linden (1994)

Free lists

A free list is a linked list of free blocks of memory available to serve allocation requests. If a big enough block is available to serve a malloc call, it can be used without the expense of a system call. If there are no such blocks, a system call must be done to get memory from the operating system. When allocations are returned with free, the space can be added to the free list to serve future requests.

The canonical toy allocator uses sbrk(2) to get memory from the operating system. The heap model is of a simple bump allocator, and the interface of sbrk has a smaller granularity than that of mmap. This affords the programmer a way to construct a single free list and the ability to serve allocations of the requested size simply by moving the break by the necessary offset, albeit at the cost of being unable to return memory to the operating system save when the latest allocation is being freed.

Exercise 8. Read “A Memory Allocator” by Doug Lea.

Exercise 9. Set up a C project with a single file, malloc.c, and a Makefile that builds a shared object, malloc.so. Set the compiler to enforce the C99 standard, add debugging information, turn off all optimizations, and enable all warnings.

Page sizes vary by system, but they are substantially larger than basic types, small structs and arrays. Therefore small allocations must be served from comparatively large spans. This suggests an adjustment to the model of a single free list: obtain spans from mmap and organize them in a doubly linked list. Maintain a free list inside each span, and allocate blocks from those that have enough free space.

The two central types are struct span and struct block. A span has a size, pointers to the previous and next span in the list, a struct block * to access its free list, and a count of allocated blocks. A block has a size, pointers to the previous and next blocks in the free list, and a struct span * to its owner.

Exercise 10. Write the struct span and struct block types.

Exercise 11. Define a constant for ALIGNMENT. Find a way to ensure the span and block types are padded so that their size is a multiple of the alignment.

Exercise 12. Find a way to implement a static assert in C. Add static asserts to ensure the size of your span and block types is evenly divisible by the ALIGNMENT.

Preparations for malloc

When programs call malloc they state a size relevant to their needs. Internally, the allocator must prefix the returned space with a block header. Since the block header size is a multiple of the ALIGNMENT, affixing the payload to the header guarantees the required alignment. Likewise, the payload can be extended to the ALIGNMENT boundary to guarantee a block header placed immediately after will be correctly aligned.

Exercise 13. Write functions align_up(n, a), that returns the next multiple of a larger than or equal to n, and gross_size(s) that computes the total number of bytes required to store a block header, a payload of size s, and any necessary padding for alignment.

Exercise 14. malloc needs a location to store the system’s page size, which it must compute upon the first call. It also needs a fixed entry point into the list of spans. Add global variables page_size and base.

To allocate a block, malloc must traverse the list of spans searching for one whose free list has a block at least as large as the computed gross size. If no existing span has enough space, or if there is no existing span because none has been allocated yet, then memory must be requested from the operating system.

Note that blocks are removed from the free list once allocated, and put back into it when freed. Their presence in the list indicates they are free.

Exercise 15. Write a function block_find(gross_size) that traverses the list of spans, and the free list of each span, and returns a pointer to the first free block it finds that can fit the gross size, or NULL if there is no such block.

It is wise to request more memory with mmap than is immediately necessary to wring the most allocations out of each system call. However, going too far risks waste if the program uses only a fraction of the space. Furthermore, an allocation is tied up as long as it has blocks in use; some workloads could prevent large allocations from being unmapped, whereas smaller ones might be returned piecewise.

Exercise 16. Use your judgement and define MINIMUM_MMAP_ALLOCATION to be a number of bytes at least as large as your system’s page size. Make it a multiple of the page size. Observe that mmap returns memory in multiples of the page size; therefore it is in your best interest to align up your requests to multiples of the page size as well, to get those bytes for yourself.

Span allocation

When a program first calls malloc there are no spans to search. The only thing to do is to request memory for a span. While many small allocations can be served from one span, a program might request an amount of memory larger than MINIMUM_MMAP_ALLOCATION. When this happens, assuming no allocated span can serve the request, a large span with a single block is allocated.

The same reasoning applies for spans as it did for blocks. The allocator must request enough memory from the operating system to fit the entire span and its header.

Exercise 17. Write a function span_alloc(gross_size) that calls mmap for memory. It must request at least MINIMUM_MMAP_ALLOCATION bytes, and at least enough memory for the given gross size and the span header, padded up to a multiple of the page size. Initialize the span header with its size and block count, and prepend it to the list of spans.

Exercise 18. The span must have a block in its free list if it is to be considered for allocations. Write a function span_first_block(span_ptr) that computes a struct block * to the address immediately after the span header. The span header is padded to a multiple of ALIGNMENT, so this address is properly aligned to place a block header.

Exercise 19. Write a function block_init(ptr, span_ptr, size) that initializes a block header at the given location, with the span as its owner. Use block_init to place the first block in a span so as to take the entire space, and place it in its free list.

Block allocation

If block_find finds enough free space, a block can be allocated without making a system call. However, the block may be needlessly large, in which case it would be wasteful to assign it in its entirety. A piece of the desired size can be taken. The rest of the block remains in the free list where it may serve future allocations.

Splitting blocks indiscriminately has the undesirable side effect of hatching small gaps between allocations. These are inadequate to serve most requests, or any, if they are too small to fit even a block header, yet they must remain in the free list, adding unnecessary work to the search for free blocks.

One way to mitigate this fragmentation is to coalesce adjacent free blocks to turn two pieces into one, a sort of reverse mitosis. By joining contiguous blocks larger allocations can again be served, and as an added benefit the free list is reduced. The process is not infallible, and some pathological workloads will prevent coalescing, but it is worth doing.

A second way to alleviate the effect on the search process is to accept the fragmentation and give a program a larger block than it asked for. If the residual space after a split would be smaller than some minimum, the entire free block can be given away.

Exercise 20. Decide on a minimum block size and define a constant MINIMUM_BLOCK_SIZE to that value. It should be at least as large as the size of a block header. Write a function block_sever(block_ptr) that removes a block from its owner’s free list.

An allocation can be split off of a free block from the start or from the end. It is marginally more convenient to split from the end so as to keep the original block in the free list and avoid fiddling with list links. The new block comes into existence as an allocated block, so it never even makes it into the free list.

Consider a function block_split(block_ptr, gross_size) that initializes a block of the given size at the very end of the given block. Recall that a gross size includes the space for the block header and padding up to ALIGNMENT. Likewise, the size of an existing block is aligned to ALIGNMENT, for it spans up to a page boundary, or is otherwise the result of a gross_size() calculation. Taken together, these properties guarantee that it is correct to compute the position of the new block header in the natural way, to wit, gross_size bytes back from the end of the block.

Exercise 21. Write block_split(block_ptr, gross_size). It should return a pointer to a new block header of size gross_size placed at the end of the given block. The original block size must be updated. Consider asserting that the given block is large enough for gross_size, that the pointer that results from the computation is indeed aligned to ALIGNMENT, and that it falls within the span. For convenience, write functions is_ptr_aligned(ptr, a) and is_ptr_in_span(ptr, span_ptr).

The implementation for block_alloc(block_ptr, gross_size) handles two cases. Either the block is big enough to tolerate a split, or it must be taken in its entirety, in which case it needs to be severed from its free list.

Exercise 22. Write block_alloc(block_ptr, gross_size).

Test binary

Tests serve a further purpose than protecting a programmer from themselves. They are also a prime source of delight when they exhibit the code they exercise working as intended. This is especially true during the early stages of a project.

A test harness need not be sophisticated, but some paperwork is required first. The binary needs access to the functions in malloc.c, which currently lacks declarations.

Exercise 23. Add a new Makefile target for a binary tests. It requires two object files to be built: malloc.o and tests.o. tests.o in turn requires tests.c and malloc.h to be built. While at it, add a test target that executes the new binary.

Exercise 24. Create a new source file tests.c with a main() function that calls one initial test function. Have this function execute one of the functions in malloc.c, say, gross_size(). See make test fail because there is no malloc.h, and then because there is no declaration of gross_size. Add the necessary declarations to write tests for gross_size and align_up.

As the entry point into the allocator, malloc takes on the responsibility of storing the page size. Tests that circumvent malloc and execute internal functions will be missing its value. The test binary must call sysconf and set the page_size global itself. The global is defined in malloc.c, so the test binary has to declare it extern.

Testing functions that depend on mmap requires some way of unmapping spans first. The allocator is stateful, and is not reset between individual tests in this modest harness. Since malloc.c declares one global entry point, the lists of allocated spans and blocks are shared between tests, which are regular functions in a regular program. Indeed, this is the first binary to link with and use your allocator, such as it is.

Exercise 25. Go over the functions written so far and assert the properties their arguments must satisfy, as well as intermediate conditions that must be met. It is hard to overstate the importance of these assertions.

munmap(2)

munmap deletes mappings previously arranged with mmap. References to addresses in those pages are subsequently invalid. When every block in a span is freed the allocator has an opportunity to return the pages to the operating system. The simplest approach is to return every span when its allocated block count reaches zero. This amounts to removing the span from the list, and calling munmap on it.

Exercise 26. Implement span_sever(span_ptr) and span_free(span_ptr).

With span_free it is possible for each test function to reset the allocator to its initial state. This unlocks the functions that affect state for testing.

Exercise 27. Test span_alloc and span_free. Test block_find, block_split and block_sever. Test block_alloc.

Block deallocation

A call by the program to free corresponds to a block being put back on its span’s free list. For the block, the operation is no different from initializing a new free block. For its owner, besides the list manipulation, it means an update on the count of allocated blocks.

Exercise 28. Write block_prepend(block_ptr) and block_free(block_ptr).

It is advisable to furnish each function with assertions and write tests for them.

malloc and free

malloc must initialize the page size on first call, compute the gross size the allocator handles internally, try to find a block, allocate a new span if necessary and finally initialize a new block. Once the new block is ready, malloc returns a pointer to its payload.

free receives a pointer previously returned by malloc. It finds the block header and frees the block.

Exercise 29. Write functions block_payload(block_ptr) that computes a pointer to the block’s payload, and payload_block(void_ptr) that does the inverse operation.

Exercise 30. Implement malloc and free.

What is now proved was once only imagined.

— William Blake (1790)

calloc(3)

calloc allocates memory for a number of objects of a given size. It sets the memory to zero bytes.

Exercise 31. Write a function payload_size(block_ptr) that computes the size of the payload of a block. Note that this may be larger than the size originally requested by the program.

Exercise 32. Use memset(3) to implement calloc.

Block footer

Order in the free list does not generally correspond to the physical arrangement of blocks in a span. Blocks enter and leave the free list in a sequence determined by the workload; allocated blocks are not even in it. To be able to coalesce adjacent free blocks it is necessary to locate the headers of the adjacent blocks. The block following some block can be readily found from the latter’s size, but the previous one requires some machinery.

A block footer is the last word in its payload; if the block is free, the footer contains its size. Due to the arrangements made for alignment, this last word precedes a header. Then, given a pointer to a block header, it is in principle possible to find the previous block’s size by looking at its footer. The previous block header can be located from that size.

However, if the previous block is currently allocated it is not safe to interpret its footer, as the word might be part of the payload in use by the program. Some indication of the allocation status of the previously adjacent block is needed.

Exercise 33. Write a function block_next_adjacent(block_ptr) that computes a pointer to the header of the next physically adjacent block.

Bit flags

The size of a block is aligned to a multiple of ALIGNMENT. Depending on its value, a number of the least significant bits of a block’s size is guaranteed to be zero. These bits are the perfect place to encode status information.

Exercise 34. Add bit flags BLOCK_IS_FREE and BLOCK_IS_PREV_FREE with values corresponding to available least significant bits in a block’s size. Implement functions block_is_free(block_ptr), block_is_prev_free(block_ptr) and block_size(block_ptr) that mask the necessary bits to compute their result.

Meddling with bits in this way means it is no longer possible to directly interpret or modify a block’s size. Every access and update must go through functions that abstract away these details.

Exercise 35. Write functions block_set_free, block_set_used, block_set_prev_free, block_set_prev_used and block_set_size that update a block’s size field with care to maintain all the information encoded in it.

Exercise 36. Adjust the functions in the allocator to keep track of this new state. Update the relevant tests.

Previous block

The predicate block_is_prev_free guards against invalid reads of the previous block. A function to find the previous block header must also consider the possibility that the current block is the first in its span. In this case the alleged footer is instead a word in the span header.

Exercise 36. Write a function block_prev_footer(block_ptr) that computes a pointer to the word immediately before the given block header. Since every block is preceded either by another block or by the span header, a read of such a pointer is always legal, if not valid.

Exercise 37. Implement block_prev_adjacent(block_ptr). Assert the appropriate conditions on the given block header, as well as the alignment of the computed pointer.

Coalescing

Two adjacent free blocks are a wasted opportunity. Anything they could serve individually, a single aggregated block could serve as well. However, an allocation that would otherwise fit a larger block cannot necessarily be served if the allocator sees two entries in a free list, even if they are adjacent. This is an effect of external fragmentation. The natural solution is to combine free blocks whenever two are adjacent.

Coalescing need only happen when a free block appears. It is enough to coalesce once in each direction; there cannot be further adjacent free blocks, for they would have been coalesced the moment the last one became free.

Suppose a and b are adjacent free blocks, and furthermore that a is before b, that is, it extends towards b. To coalesce these two blocks means to grow a by the length of b. Its footer must be placed in the new location with the updated size. At this point b is no longer a valid block, and must be removed from the free list.

Exercise 38. Implement a function block_coalesce(block_ptr, block_ptr) that extends the first block over the second, assuming that the first comes before the second in the span.

Exercise 39. Implement a function coalesce(block_ptr) that coalesces a free block with its neighbors, if they are free. Observe that the given block pointer may no longer be valid after this operation.

Exercise 40. Adjust free to coalesce whenever possible.

realloc(3)

Ostensibly, realloc resizes an allocation. The edge cases and the interaction with the design decisions of the allocator make for an intricate implementation.

Exercise 41. Go back to the manual page for malloc(3) and go over the description of realloc.

The simplest scenario is when a null pointer is given to realloc, in which case it behaves like malloc. A close second is when the new size comes up to the same gross size an allocation already has. The caller believes they are extending or reducing their allocation by a handful of bytes, unaware that they are given the same block.

Similar considerations apply when truncating an allocation as when splitting a block. The allocation does not need to be moved; it can be returned with a smaller size, so the allocator must decide what to do with the residual space. A new free block is created if the space is larger than MINIMUM_BLOCK_SIZE. Otherwise, the waste is accepted and the block stays as it is. If a new block is created, it may be adjacent to a free block, so coalescing is necessary.

Exercise 42. Write a function realloc_truncate(block_ptr, size) that reduces the size of an allocation and handles the remaining space.

When extending an allocation there may be enough free space following the block that it does not need to be moved. Once again, the allocator must deal with any leftover space after splitting. If there is no such space, a new allocation needs to be made, the content copied over and the old block freed.

Exercise 43. Write a function realloc_extend(block_ptr, size). It must try to extend the allocation in place if possible, taking care to handle the excess space. Use memcpy(3) to copy the payload if a new block is allocated.

Exercise 44. Implement realloc.

Real programs

The dynamic linker can be instructed to interpose shared libraries by listing them in the LD_PRELOAD environment variable. If it is set, the dynamic linker searches the listed libraries before any other shared libraries for the external symbols in the program.

Exercise 45. Skim the manual page for the dynamic loader in your system. This may be ld.so(8) in Linux or rtld(1) on FreeBSD.

Exercise 46. Ensure your functions get called when running an external binary by triggering a visible side effect on malloc and interposing your malloc.so.

Exercise 47. Run an assortment of commands using your allocator. Mind the fact that it is single threaded. Debug and correct any errors. Some examples:

ls -l / > /dev/null
grep -r blk . > /dev/null
find . -maxdepth 2 -type f | wc -l > /dev/null
sort malloc.c > /dev/null
git status > /dev/null
git log --oneline -n 50 > /dev/null
git diff HEAD~5..HEAD > /dev/null
grep -r blk . 2> /dev/null | sort | uniq -c > /dev/null
diff -r . .. 2> /dev/null | head -100 > /dev/null
find /usr -maxdepth 4 -type f -name "*.h" -exec echo {} \; > /dev/null 2>&1
find . -type f -print0 | xargs -0 -n100 echo > /dev/null 2>&1
tar cf - /etc 2>/dev/null | wc -c > /dev/null

Foreign pointers

At some point free may be given a pointer that does not belong to the addresses covered by the allocator’s spans. This can happen if some part of the C library calls into an internal allocator for memory, bypassing the malloc entry point.

There is nothing the allocator can do with this pointer but to forward it to the real free, its intended destination. This involves detecting such a foreign pointer, and obtaining a pointer to the function that would have been called, had the allocator not been interposed.

Exercise 48. Implement a function payload_is_foreign(void_ptr) that checks if the pointer belongs in the address range of any span.

Exercise 49. Read the manual page for dlsym(3). Pay special attention to RTLD_NEXT.

Exercise 50. Add a global function pointer, initialized to NULL. Write a function init_forward_free; if the global pointer is NULL, it must call dlsym to get a pointer to the next symbol that resolves to "free" and assign it to the global pointer. Wrap free to intercept foreign pointers and route them to resolved lower free.

POSIX and the real world

POSIX is vague at times, and some programs may be coded with assumptions that are only valid for a specific implementation of malloc. Edge cases are notorious in this regard, so it pays to double check the behavior of, say, glibc when malloc is given a zero size, or when realloc is given zero size and a non-NULL pointer, or any other such scenario. It is easy to be misled into a deep debugging session by misplaced assumptions.

Avenues for exploration

The implementation is decidedly a toy allocator. Its most egregious deficiency is the lack of support for multi threaded programs. It is also too liberal about its use of memory. A bitmap for small allocations can salvage multiple block headers as well as the time it takes to traverse a free list to look for space. Block headers only need prev/next links when free; different headers for used and free blocks would save further memory. Segregation of blocks in buckets of specific sizes is an effective measure against fragmentation. Real implementations of malloc provide some knobs and toggles, for example through the non-standard mallopt(3) function on Linux.