malloc(3)
In many foreign cities, far away,
Thy lone voice spoke to me
like memory.
It’s a sad fact of life that managing dynamic memory is a very difficult programming task.
malloc is a function in the C standard library. It
takes care of provisioning memory from the operating system and
giving it back. Around this ostensibly simple responsibility dwell
a number of other activities. malloc must be frugal
in its use of memory for bookkeeping and debugging; memory is,
after all, the application’s resource. It must guarantee proper
alignment, sometimes at the cost of wasting bytes. It must be
fast; time is also a resource borrowed from the application. It
must make its best effort to reduce fragmentation. Thread-safe
implementations must account for concurrent access and provide
correct, consistent behavior.
One essential difficulty is that malloc needs to
meet these demands, often at odds with each other, without knowing
anything about the workload applications will present. That is,
malloc is a general purpose allocator. An application
may do thousands of small allocations and free them immediately.
Another may return them all in batches. Another application may
make large allocations, in the order of millions of bytes of
memory at a time.
In scenarios where the workload is known in advance it pays to
build a dedicated allocator, where choices are made to benefit the
patterns of allocation, the types of objects stored, or the needs
of locality. An application employs a slab allocator when it knows
the kind of object it will store in some chunk of memory. It can
answer allocation requests quickly, without being concerned with
requests of arbitrary size in between, and can afford to keep a
pool of initialized objects. An application that allocates
multiple objects and disposes of them as on a stack can afford to
employ a bump allocator, thanks to the guarantee that no older
objects will be destroyed before more recent ones. Not
malloc.
Another difficulty arises from the interface
malloc and its friends must satisfy. Two functions
are predominant: void *malloc(size_t) and
void free(void *). There is a handful of other more
specialized functions, unimportant at this point. The issue is not
apparent from the perspective of the application. It receives
memory, gives it back, Bob’s thy uncle. But malloc
must perform all its janitorial duties when the application
chooses to call it. There is no framework that calls into the
application at specific points; no runtime to stop the presses
when it determines it is time to collect. There is no preempting
the application in the middle of its activities.
Thus, the situation for malloc is quite miserable.
It must serve all applications well, with no prior knowledge,
using meager amounts of memory and what little time it can get
during the application’s business.
Exercise 1. Read your system’s manual page of
malloc(3). Then look up the manual pages of other
system’s malloc implementations, for example, those
of the BSDs, macOS, Linux, Plan 9. Observe their differences: how
do they react when given null pointers; do they make any
guarantees about zero-initializing memory; do they set
errno, and when.
Exercise 2. Look up a few real
malloc implementations. Skim the repository of
jemalloc, and compare its apparent complexity with
that of the malloc implementation in K&R. Ponder
the spectrum of complexity of general purpose memory
management.
mmap(2)
All is not lost. One may temper the discomfort by the knowledge
that the situation could be much worse. malloc stands
between application and operating system; it lets the former
operate abstracted from the particulars of the latter. The
operating system, in turn, lets malloc function
without concern for the technicalities of physical memory and
multiplexing it among multiple processes. Thanks to the operating
system, malloc need not consider page tables, page
faults, swapping, or translation lookaside buffers.
The operating system has its own interfaces for giving memory
to the userland. The mmap(2) system call is capable
of allocating memory and mapping devices and files into the
address space of a process.
Exercise 3. Read the manual page for
mmap. Note the behavior of the system call for
different values of the addr pointer, the protections
that may be specified in the prot argument, and the
values that may be combined to form the flags
argument. Of these, pay special attention to
MAP_ANONYMOUS and MAP_PRIVATE. Consider
the qualities an allocation needs to have for malloc
to handle memory for a process.
Exercise 4. Suppose you want to write
instructions assembled by hand into some buffer, then execute
them. Consider the memory protections documented by
mmap. A policy called “write xor execute” enforces
exactly what it says: a process is allowed to write or execute the
contents of a page, but not both simultaneously. The
mprotect(2) system call provides a way to change the
protection flags of some pages after they have been allocated with
mmap. Read Chris Wellons’ article “A
Basic Just-In-Time Compiler”. Assemble some instructions by
hand or with the help of an assembler, write them to a buffer,
make the page executable, take a function pointer to your buffer
and call it.
Memory pages
Aided by the hardware, operating systems employ a ruse called virtual memory to present processes with the view that they have a large, uninterrupted address space at their disposal. This abstraction makes memory management easier for processes, as they can operate free from the risk of interfering with the memory of other processes running on the system.
The operating system places boundaries to this abstraction. Programs must not try to read or write arbitrary addresses; they must request memory from the operating system first, and are allowed to make use of that memory only, lest they be terminated. Different sections of the address space allow reading, writing or executing.
The smallest unit of memory the operating system manipulates is
a page. It does not allocate portions of a page. This is why
mmap returns memory aligned to the page size of the
system, and rounds up the size of the allocation to a multiple of
the page size. A malloc implementation that gets its
memory from mmap must potentially serve small
requests from relatively large allocations.
Exercise 5. Read the manual page for
sysconf(3). Use it to determine the page size of your
system.
Alignment
CPUs have an easier time retrieving a full word from memory; if an object extends over a word boundary, multiple loads are necessary. Thus basic data types have alignment requirements that dictate the addresses at which they may be stored. Compilers then lay out data for efficient access rather than prudent use of memory.
Padding is needed to align objects of arbitrary size, acting as
stuffing that puts entities on their marks. It materializes at
multiple levels. mmap returns memory in pages, even
if the requested size is not a multiple of the page size.
malloc returns memory aligned for any kind of object;
it does not return arbitrary addresses so it must sacrifice bytes
to observe this requirement. C struct members either are basic
data types, or are made up of basic data types, and therefore must
be aligned to an address appropriate for their size. Compilers
leave unused space between members to bring each to its required
alignment.
A malloc implementation must supply memory that is
suitably aligned for any type of object. With 64 bit systems,
alignment to 16 bytes is a safe bet.
Exercise 6. Read “The Lost Art of Structure Packing” by Eric Raymond.
Exercise 7. Suppose M divides
N. Convince yourself that alignment to N
bytes implies alignment to M bytes.
“Step on a crack, break your mother’s back” has mutated to “dereference nonaligned then cuss, cause an error on the bus.”
Free lists
A free list is a linked list of free blocks of memory available
to serve allocation requests. If a big enough block is available
to serve a malloc call, it can be used without the
expense of a system call. If there are no such blocks, a system
call must be done to get memory from the operating system. When
allocations are returned with free, the space can be
added to the free list to serve future requests.
The canonical toy allocator uses sbrk(2) to get
memory from the operating system. The heap model is of a simple
bump allocator, and the interface of sbrk has a
smaller granularity than that of mmap. This affords
the programmer a way to construct a single free list and the
ability to serve allocations of the requested size simply by
moving the break by the necessary offset, albeit at the cost of
being unable to return memory to the operating system save when
the latest allocation is being freed.
Exercise 8. Read “A Memory Allocator” by Doug Lea.
Exercise 9. Set up a C project with a single
file, malloc.c, and a Makefile that
builds a shared object, malloc.so. Set the compiler
to enforce the C99 standard, add debugging information, turn off
all optimizations, and enable all warnings.
Page sizes vary by system, but they are substantially larger
than basic types, small structs and arrays. Therefore small
allocations must be served from comparatively large spans. This
suggests an adjustment to the model of a single free list: obtain
spans from mmap and organize them in a doubly linked
list. Maintain a free list inside each span, and allocate blocks
from those that have enough free space.
The two central types are struct span and
struct block. A span has a size, pointers to the
previous and next span in the list, a struct block *
to access its free list, and a count of allocated blocks. A block
has a size, pointers to the previous and next blocks in the free
list, and a struct span * to its owner.
Exercise 10. Write the
struct span and struct block types.
Exercise 11. Define a constant for
ALIGNMENT. Find a way to ensure the span
and block types are padded so that their size is a
multiple of the alignment.
Exercise 12. Find a way to implement a static
assert in C. Add static asserts to ensure the size of your
span and block types is evenly divisible
by the ALIGNMENT.
Preparations for malloc
When programs call malloc they state a size
relevant to their needs. Internally, the allocator must prefix the
returned space with a block header. Since the block header size is
a multiple of the ALIGNMENT, affixing the payload to
the header guarantees the required alignment. Likewise, the
payload can be extended to the ALIGNMENT boundary to
guarantee a block header placed immediately after will be
correctly aligned.
Exercise 13. Write functions
align_up(n, a), that returns the next multiple of
a larger than or equal to n, and
gross_size(s) that computes the total number of bytes
required to store a block header, a payload of size
s, and any necessary padding for alignment.
Exercise 14. malloc needs a
location to store the system’s page size, which it must compute
upon the first call. It also needs a fixed entry point into the
list of spans. Add global variables page_size and
base.
To allocate a block, malloc must traverse the list
of spans searching for one whose free list has a block at least as
large as the computed gross size. If no existing span has enough
space, or if there is no existing span because none has been
allocated yet, then memory must be requested from the operating
system.
Note that blocks are removed from the free list once allocated, and put back into it when freed. Their presence in the list indicates they are free.
Exercise 15. Write a function
block_find(gross_size) that traverses the list of
spans, and the free list of each span, and returns a pointer to
the first free block it finds that can fit the gross size, or
NULL if there is no such block.
It is wise to request more memory with mmap than
is immediately necessary to wring the most allocations out of each
system call. However, going too far risks waste if the program
uses only a fraction of the space. Furthermore, an allocation is
tied up as long as it has blocks in use; some workloads could
prevent large allocations from being unmapped, whereas smaller
ones might be returned piecewise.
Exercise 16. Use your judgement and define
MINIMUM_MMAP_ALLOCATION to be a number of bytes at
least as large as your system’s page size. Make it a multiple of
the page size. Observe that mmap returns memory in
multiples of the page size; therefore it is in your best interest
to align up your requests to multiples of the page size as well,
to get those bytes for yourself.
Span allocation
When a program first calls malloc there are no
spans to search. The only thing to do is to request memory for a
span. While many small allocations can be served from one span, a
program might request an amount of memory larger than
MINIMUM_MMAP_ALLOCATION. When this happens, assuming
no allocated span can serve the request, a large span with a
single block is allocated.
The same reasoning applies for spans as it did for blocks. The allocator must request enough memory from the operating system to fit the entire span and its header.
Exercise 17. Write a function
span_alloc(gross_size) that calls mmap
for memory. It must request at least
MINIMUM_MMAP_ALLOCATION bytes, and at least enough
memory for the given gross size and the span header, padded up to
a multiple of the page size. Initialize the span header with its
size and block count, and prepend it to the list of spans.
Exercise 18. The span must have a block in its
free list if it is to be considered for allocations. Write a
function span_first_block(span_ptr) that computes a
struct block * to the address immediately after the
span header. The span header is padded to a multiple of
ALIGNMENT, so this address is properly aligned to
place a block header.
Exercise 19. Write a function
block_init(ptr, span_ptr, size) that initializes a
block header at the given location, with the span as its owner.
Use block_init to place the first block in a span so
as to take the entire space, and place it in its free list.
Block allocation
If block_find finds enough free space, a block can
be allocated without making a system call. However, the block may
be needlessly large, in which case it would be wasteful to assign
it in its entirety. A piece of the desired size can be taken. The
rest of the block remains in the free list where it may serve
future allocations.
Splitting blocks indiscriminately has the undesirable side effect of hatching small gaps between allocations. These are inadequate to serve most requests, or any, if they are too small to fit even a block header, yet they must remain in the free list, adding unnecessary work to the search for free blocks.
One way to mitigate this fragmentation is to coalesce adjacent free blocks to turn two pieces into one, a sort of reverse mitosis. By joining contiguous blocks larger allocations can again be served, and as an added benefit the free list is reduced. The process is not infallible, and some pathological workloads will prevent coalescing, but it is worth doing.
A second way to alleviate the effect on the search process is to accept the fragmentation and give a program a larger block than it asked for. If the residual space after a split would be smaller than some minimum, the entire free block can be given away.
Exercise 20. Decide on a minimum block size
and define a constant MINIMUM_BLOCK_SIZE to that
value. It should be at least as large as the size of a block
header. Write a function block_sever(block_ptr) that
removes a block from its owner’s free list.
An allocation can be split off of a free block from the start or from the end. It is marginally more convenient to split from the end so as to keep the original block in the free list and avoid fiddling with list links. The new block comes into existence as an allocated block, so it never even makes it into the free list.
Consider a function
block_split(block_ptr, gross_size) that initializes a
block of the given size at the very end of the given block. Recall
that a gross size includes the space for the block header and
padding up to ALIGNMENT. Likewise, the size of an
existing block is aligned to ALIGNMENT, for it spans
up to a page boundary, or is otherwise the result of a
gross_size() calculation. Taken together, these
properties guarantee that it is correct to compute the position of
the new block header in the natural way, to wit,
gross_size bytes back from the end of the block.
Exercise 21. Write
block_split(block_ptr, gross_size). It should return
a pointer to a new block header of size gross_size
placed at the end of the given block. The original block size must
be updated. Consider asserting that the given block is large
enough for gross_size, that the pointer that results
from the computation is indeed aligned to ALIGNMENT,
and that it falls within the span. For convenience, write
functions is_ptr_aligned(ptr, a) and
is_ptr_in_span(ptr, span_ptr).
The implementation for
block_alloc(block_ptr, gross_size) handles two cases.
Either the block is big enough to tolerate a split, or it must be
taken in its entirety, in which case it needs to be severed from
its free list.
Exercise 22. Write
block_alloc(block_ptr, gross_size).
Test binary
Tests serve a further purpose than protecting a programmer from themselves. They are also a prime source of delight when they exhibit the code they exercise working as intended. This is especially true during the early stages of a project.
A test harness need not be sophisticated, but some paperwork is
required first. The binary needs access to the functions in
malloc.c, which currently lacks declarations.
Exercise 23. Add a new Makefile
target for a binary tests. It requires two object
files to be built: malloc.o and tests.o.
tests.o in turn requires tests.c and
malloc.h to be built. While at it, add a
test target that executes the new binary.
Exercise 24. Create a new source file
tests.c with a main() function that
calls one initial test function. Have this function execute one of
the functions in malloc.c, say,
gross_size(). See make test fail because
there is no malloc.h, and then because there is no
declaration of gross_size. Add the necessary
declarations to write tests for gross_size and
align_up.
As the entry point into the allocator, malloc
takes on the responsibility of storing the page size. Tests that
circumvent malloc and execute internal functions will
be missing its value. The test binary must call
sysconf and set the page_size global
itself. The global is defined in malloc.c, so the
test binary has to declare it extern.
Testing functions that depend on mmap requires
some way of unmapping spans first. The allocator is stateful, and
is not reset between individual tests in this modest harness.
Since malloc.c declares one global entry point, the
lists of allocated spans and blocks are shared between tests,
which are regular functions in a regular program. Indeed, this is
the first binary to link with and use your allocator, such as it
is.
Exercise 25. Go over the functions written so far and assert the properties their arguments must satisfy, as well as intermediate conditions that must be met. It is hard to overstate the importance of these assertions.
munmap(2)
munmap deletes mappings previously arranged with
mmap. References to addresses in those pages are
subsequently invalid. When every block in a span is freed the
allocator has an opportunity to return the pages to the operating
system. The simplest approach is to return every span when its
allocated block count reaches zero. This amounts to removing the
span from the list, and calling munmap on it.
Exercise 26. Implement
span_sever(span_ptr) and
span_free(span_ptr).
With span_free it is possible for each test
function to reset the allocator to its initial state. This unlocks
the functions that affect state for testing.
Exercise 27. Test span_alloc and
span_free. Test block_find,
block_split and block_sever. Test
block_alloc.
Block deallocation
A call by the program to free corresponds to a
block being put back on its span’s free list. For the block, the
operation is no different from initializing a new free block. For
its owner, besides the list manipulation, it means an update on
the count of allocated blocks.
Exercise 28. Write
block_prepend(block_ptr) and
block_free(block_ptr).
It is advisable to furnish each function with assertions and write tests for them.
malloc and free
malloc must initialize the page size on first
call, compute the gross size the allocator handles internally, try
to find a block, allocate a new span if necessary and finally
initialize a new block. Once the new block is ready,
malloc returns a pointer to its payload.
free receives a pointer previously returned by
malloc. It finds the block header and frees the
block.
Exercise 29. Write functions
block_payload(block_ptr) that computes a pointer to
the block’s payload, and payload_block(void_ptr) that
does the inverse operation.
Exercise 30. Implement malloc and
free.
What is now proved was once only imagined.
calloc(3)
calloc allocates memory for a number of objects of
a given size. It sets the memory to zero bytes.
Exercise 31. Write a function
payload_size(block_ptr) that computes the size of the
payload of a block. Note that this may be larger than the size
originally requested by the program.
Exercise 32. Use memset(3) to
implement calloc.
Block footer
Order in the free list does not generally correspond to the physical arrangement of blocks in a span. Blocks enter and leave the free list in a sequence determined by the workload; allocated blocks are not even in it. To be able to coalesce adjacent free blocks it is necessary to locate the headers of the adjacent blocks. The block following some block can be readily found from the latter’s size, but the previous one requires some machinery.
A block footer is the last word in its payload; if the block is free, the footer contains its size. Due to the arrangements made for alignment, this last word precedes a header. Then, given a pointer to a block header, it is in principle possible to find the previous block’s size by looking at its footer. The previous block header can be located from that size.
However, if the previous block is currently allocated it is not safe to interpret its footer, as the word might be part of the payload in use by the program. Some indication of the allocation status of the previously adjacent block is needed.
Exercise 33. Write a function
block_next_adjacent(block_ptr) that computes a
pointer to the header of the next physically adjacent block.
Bit flags
The size of a block is aligned to a multiple of
ALIGNMENT. Depending on its value, a number of the
least significant bits of a block’s size is guaranteed to be zero.
These bits are the perfect place to encode status information.
Exercise 34. Add bit flags
BLOCK_IS_FREE and BLOCK_IS_PREV_FREE
with values corresponding to available least significant bits in a
block’s size. Implement functions
block_is_free(block_ptr),
block_is_prev_free(block_ptr) and
block_size(block_ptr) that mask the necessary bits to
compute their result.
Meddling with bits in this way means it is no longer possible to directly interpret or modify a block’s size. Every access and update must go through functions that abstract away these details.
Exercise 35. Write functions
block_set_free, block_set_used,
block_set_prev_free, block_set_prev_used
and block_set_size that update a block’s size field
with care to maintain all the information encoded in it.
Exercise 36. Adjust the functions in the allocator to keep track of this new state. Update the relevant tests.
Previous block
The predicate block_is_prev_free guards against
invalid reads of the previous block. A function to find the
previous block header must also consider the possibility that the
current block is the first in its span. In this case the alleged
footer is instead a word in the span header.
Exercise 36. Write a function
block_prev_footer(block_ptr) that computes a pointer
to the word immediately before the given block header. Since every
block is preceded either by another block or by the span header, a
read of such a pointer is always legal, if not valid.
Exercise 37. Implement
block_prev_adjacent(block_ptr). Assert the
appropriate conditions on the given block header, as well as the
alignment of the computed pointer.
Coalescing
Two adjacent free blocks are a wasted opportunity. Anything they could serve individually, a single aggregated block could serve as well. However, an allocation that would otherwise fit a larger block cannot necessarily be served if the allocator sees two entries in a free list, even if they are adjacent. This is an effect of external fragmentation. The natural solution is to combine free blocks whenever two are adjacent.
Coalescing need only happen when a free block appears. It is enough to coalesce once in each direction; there cannot be further adjacent free blocks, for they would have been coalesced the moment the last one became free.
Suppose a and b are adjacent free
blocks, and furthermore that a is before
b, that is, it extends towards b. To
coalesce these two blocks means to grow a by the
length of b. Its footer must be placed in the new
location with the updated size. At this point b is no
longer a valid block, and must be removed from the free list.
Exercise 38. Implement a function
block_coalesce(block_ptr, block_ptr) that extends the
first block over the second, assuming that the first comes before
the second in the span.
Exercise 39. Implement a function
coalesce(block_ptr) that coalesces a free block with
its neighbors, if they are free. Observe that the given block
pointer may no longer be valid after this operation.
Exercise 40. Adjust free to
coalesce whenever possible.
realloc(3)
Ostensibly, realloc resizes an allocation. The
edge cases and the interaction with the design decisions of the
allocator make for an intricate implementation.
Exercise 41. Go back to the manual page for
malloc(3) and go over the description of
realloc.
The simplest scenario is when a null pointer is given to
realloc, in which case it behaves like
malloc. A close second is when the new size comes up
to the same gross size an allocation already has. The caller
believes they are extending or reducing their allocation by a
handful of bytes, unaware that they are given the same block.
Similar considerations apply when truncating an allocation as
when splitting a block. The allocation does not need to be moved;
it can be returned with a smaller size, so the allocator must
decide what to do with the residual space. A new free block is
created if the space is larger than
MINIMUM_BLOCK_SIZE. Otherwise, the waste is accepted
and the block stays as it is. If a new block is created, it may be
adjacent to a free block, so coalescing is necessary.
Exercise 42. Write a function
realloc_truncate(block_ptr, size) that reduces the
size of an allocation and handles the remaining space.
When extending an allocation there may be enough free space following the block that it does not need to be moved. Once again, the allocator must deal with any leftover space after splitting. If there is no such space, a new allocation needs to be made, the content copied over and the old block freed.
Exercise 43. Write a function
realloc_extend(block_ptr, size). It must try to
extend the allocation in place if possible, taking care to handle
the excess space. Use memcpy(3) to copy the payload
if a new block is allocated.
Exercise 44. Implement
realloc.
Real programs
The dynamic linker can be instructed to interpose shared
libraries by listing them in the LD_PRELOAD
environment variable. If it is set, the dynamic linker searches
the listed libraries before any other shared libraries for the
external symbols in the program.
Exercise 45. Skim the manual page for the
dynamic loader in your system. This may be ld.so(8)
in Linux or rtld(1) on FreeBSD.
Exercise 46. Ensure your functions get called
when running an external binary by triggering a visible side
effect on malloc and interposing your
malloc.so.
Exercise 47. Run an assortment of commands using your allocator. Mind the fact that it is single threaded. Debug and correct any errors. Some examples:
ls -l / > /dev/null
grep -r blk . > /dev/null
find . -maxdepth 2 -type f | wc -l > /dev/null
sort malloc.c > /dev/null
git status > /dev/null
git log --oneline -n 50 > /dev/null
git diff HEAD~5..HEAD > /dev/null
grep -r blk . 2> /dev/null | sort | uniq -c > /dev/null
diff -r . .. 2> /dev/null | head -100 > /dev/null
find /usr -maxdepth 4 -type f -name "*.h" -exec echo {} \; > /dev/null 2>&1
find . -type f -print0 | xargs -0 -n100 echo > /dev/null 2>&1
tar cf - /etc 2>/dev/null | wc -c > /dev/null
Foreign pointers
At some point free may be given a pointer that
does not belong to the addresses covered by the allocator’s spans.
This can happen if some part of the C library calls into an
internal allocator for memory, bypassing the malloc
entry point.
There is nothing the allocator can do with this pointer but to
forward it to the real free, its intended
destination. This involves detecting such a foreign pointer, and
obtaining a pointer to the function that would have been called,
had the allocator not been interposed.
Exercise 48. Implement a function
payload_is_foreign(void_ptr) that checks if the
pointer belongs in the address range of any span.
Exercise 49. Read the manual page for
dlsym(3). Pay special attention to
RTLD_NEXT.
Exercise 50. Add a global function pointer,
initialized to NULL. Write a function
init_forward_free; if the global pointer is
NULL, it must call dlsym to get a
pointer to the next symbol that resolves to "free"
and assign it to the global pointer. Wrap free to
intercept foreign pointers and route them to resolved lower
free.
POSIX and the real world
POSIX is vague at times, and some programs may be coded with
assumptions that are only valid for a specific implementation of
malloc. Edge cases are notorious in this regard, so
it pays to double check the behavior of, say, glibc when
malloc is given a zero size, or when
realloc is given zero size and a
non-NULL pointer, or any other such scenario. It is
easy to be misled into a deep debugging session by misplaced
assumptions.
Avenues for exploration
The implementation is decidedly a toy allocator. Its most
egregious deficiency is the lack of support for multi threaded
programs. It is also too liberal about its use of memory. A bitmap
for small allocations can salvage multiple block headers as well
as the time it takes to traverse a free list to look for space.
Block headers only need prev/next links
when free; different headers for used and free blocks would save
further memory. Segregation of blocks in buckets of specific sizes
is an effective measure against fragmentation. Real
implementations of malloc provide some knobs and
toggles, for example through the non-standard
mallopt(3) function on Linux.