, ,

Virtual Reminiscence Tricks · Our Machinery

Virtual Reminiscence Tricks · Our Machinery

info image

I’ve been planning for a while to blueprint a put up about stuff you would possibly well fabricate with digital memory, so when @jimsagevid tweeted me about digital memory in accordance with my last put up, I felt the time used to be staunch.

Virtual memory is droll. As programmers, we know that it’s there (on all recent CPUs and OSs), nonetheless we have a tendency to ignore it, seemingly because it’s indirectly uncovered in our programming languages. Or, we accurate tell it as the thing that makes our tool flee if truth be told dull in dwelling of crashing, when it runs out of bodily RAM.

But, it looks, while you certainly blueprint relate of what digital memory can fabricate, you would possibly well hang some handsome cool things.

Obscenely mountainous array

To recap — in my last put up I faced the concern of trying a mountainous look up desk from object IDs to object pointers. Furthermore, I wished this desk to preserve mounted in memory, so that writers would possibly well well atomically substitute pointers within the desk with out stressful readers. This fashion that the utilization of something care for a std::vector is no longer any longer that you simply would possibly well imagine, for the explanation that underlying records would possibly be reallocated when the vector grows.

A mounted array works:

object_o *objects[MAX_OBJECTS].

But what size will enjoy to soundless we relate for the array? Settle a mountainous size, and we’re losing memory. Settle a shrimp size, and we’d well well flee out of objects. In the article, I mature a hierarchical capability, nonetheless as @jimsagevid pointed out, we would possibly well well relate digital memory as a substitute, and assist far flung from the complexity of a hierarchical desk structure.

When we allocate memory thru the digital memory blueprint we reserve handle dwelling for the pages we allocate immediately (no loads of code can allocate memory in that handle vary), nonetheless proper bodily memory is no longer any longer allocated till we if truth be told contact the pages.

So for this scream, we can accurate desire a with ease astronomical quantity for our array and with regards to allocate it. Affirm 1 billion of objects, as an illustration:

#relate MAX_OBJECTS 1000000000ULL
object_o **objects = virtual_alloc(MAX_OBJECTS * sizeof(object_o *));

We’re the utilization of Eight GB of handle dwelling and digital memory, nonetheless easiest as important bodily memory as we if truth be told need for our objects. A easy solution that easiest requires one line of code.

Demonstrate: I’m the utilization of virtual_alloc() here as my OS agnostic digital memory allocation call. On Windows which you would possibly if truth be told call VirtualAlloc() and on Linux mmap().

One other demonstrate: Windows separates digital memory allocation into separate MEM_RESERVE and MEM_COMMIT calls. MEM_RESERVE reserves handle dwelling and MEM_COMMIT commits it to bodily memory. That does no longer mean that bodily memory is that if truth be told allocated while you MEM_COMMIT, the bodily memory is soundless no longer allocated till you certainly contact the pages. But MEM_COMMIT reserves memory within the page file, and while you are trying to commit extra memory than you enjoy to your page file, MEM_COMMIT will fail. So on Windows, you in all probability don’t are searching to MEM_COMMIT the total look up desk (because you easiest enjoy a dinky size page file). Instead which you would possibly be searching to MEM_RESERVE the entire desk within the inspiration after which accurate MEM_COMMIT the vary of the desk that you simply certainly relate.

In inequity, Linux permits overcommitting — i.e., you would possibly well commit extra memory than the scale of your page file. This is why, Linux doesn’t need separate reserve/commit operations which makes the digital memory loads extra efficient to make relate of.

Is reserving Eight GB of digital memory for the array a concern? Now no longer if truth be told. There are two limits that concern us here. The first is handle dwelling. In a Sixty four-bit application (which we buy), the handle dwelling is 2Sixty four, a staggeringly astronomical quantity with room for billions of gigabyte-sized arrays. The 2nd restrict is for digital memory. The OS usually doesn’t let us allocate the entire that you simply would possibly well imagine handle dwelling as digital memory. On Sixty four-bit Windows, as an illustration, we can easiest allocate 256 TB of digital memory. Tranquil, that’s room for 32 000 arrays of Eight GB every, so as lengthy as we don’t depart fully loopy, we’re handsome.

Truly, which you would possibly potentially relate this approach for all global or semi-global arrays to your application with out any complications.

Undergo in mind the aged-college C capability of writing video games with static arrays for all objects:

uint32_t num_tanks;
tank_t tanks[MAX_TANKS];

uint32_t num_bullets;
bullet_t bullets[MAX_BULLETS];


If you write code care for this, you would possibly well honest moreover be particular someone will whinge that it’s no longer “future-proof”, because you enjoy caps on the sequence of objects. That is roughly ridiculous, nonetheless to meet the criticism in a much less complex and extra enjoyable capability than the utilization of a std::vector you would possibly well accurate catch rid of the MAX_* defines and allocate 1 GB of digital memory for every array:

#relate GB a thousand million
uint32_t num_tanks;
tank_t *tanks = virtual_alloc(GB);
uint32_t num_bullets;
bullet_t *bullets = virtual_alloc(GB);

Application huge uncommon IDs

A form of sport engine methods need uncommon IDs to title objects. Most ceaselessly the code looks something care for this:

uint64_t allocate_id(system_t *sys)
    return sys->next_free_id++;

But digital memory affords us one more choice. Rather than the utilization of integers as identifiers, we would possibly well well relate addresses reserved from the digital memory blueprint. This permits us to fabricate IDs which would possibly be uncommon, no longer accurate in basically the most recent blueprint, nonetheless all the way thru the total application. And since we never if truth be told relate the memory pointed to by these addresses, we gained’t relate any bodily memory.

It could possibly well well seek for something care for this:

system_id_t *allocate_id(system_t *sys)
    if (!sys->id_block || sys->id_block_used == PAGE_SIZE) {
        sys->id_block = virtual_alloc(PAGE_SIZE);
        sys->id_block_used = Zero;
    return (system_id_t *)(sys->id_block + sys->id_block_used++);

Demonstrate that by the utilization of a pointer to an opaque struct for the ID we moreover catch some style safety, that we didn’t enjoy with the uint64_t capability.

Reminiscence overwrite detection

One amongst the toughest bugs to repair is random memory overwrites, the build some section of code is writing to memory that it shouldn’t contact. Then, later, when some loads of section of code tries to make relate of that memory, this would well honest read that trash records and (in all probability) atomize. The concern here is that while the atomize itself is easy to secure, it is complex to know the build the boring write that corrupted the records to open with came about.

Fortunately, digital memory can encourage us here too.

To depart making an are attempting to secure why, first demonstrate that the timeframe random memory overwrite is that if truth be told a misnomer. Edifying care for typical dwelling, handle dwelling is basically empty. With a Sixty four bit handle dwelling and an application size of declare 2 GB, the handle dwelling is ninety 9.999999988 % empty. This fashion that if the memory overwrites had been in actuality random, in all probability they would hit this empty dwelling and dwelling off a page fault/catch admission to violation. That will give us a atomize at the level of the boring write, in dwelling of the innocent read, which would possibly well well blueprint the malicious program important more uncomplicated to secure and repair.

But in spite of the entirety, the writes are in overall no longer in actuality random. Instead they ceaselessly descend in a single of two categories:

  • Writing to memory that has been freed.
  • Writing previous the allocated memory for an object.

In every cases, it is handsome seemingly that the write will if truth be told hit some loads of object, in dwelling of empty dwelling. In the first case, the memory will potentially had been recycled for something else. In the 2nd case, the write will potentially hit a neighboring object, or allocation block headers.

We will blueprint these cases seek for extra care for the random case by replacing the fashioned blueprint allocator with an cease-of-page allocator. Such an allocator makes relate of the digital memory blueprint to allocate every object by itself dwelling of pages, and furthermore, it aligns the thing so that the head of the thing coincides with the head of the last page.

Break of page block placement.

This accomplishes two things. First, as soon as we free the pages, we free them within the digital memory blueprint. This fashion that making an are attempting to jot down to the pages after they had been freed will end result in an catch admission to violation. Demonstrate that this usually doesn’t happen with a conventional memory allocator — this would well honest put the freed memory in its freelists to be mature for loads of allocations, in dwelling of returning it to the OS.

2nd, as the head of the block coincides with the head of the page, writing previous the head of the block will moreover dwelling off an catch admission to violation. In loads of words, with this means, our conventional boring memory writes will atomize at the level of the write and allow us to with out design back diagnose the concern.

Since this allocator rounds up all allocations to the page size, this would well honest raze a range of memory for shrimp allocations. It will also no longer be something that you simply will want enabled the total time. What I usually fabricate is work with the fashioned allocator. Then, if I suspect boring memory overwrites, I swap it out for the head-of-page allocator. As soon as I’ve mounted the concern, I swap assist to the fashioned allocator. This in spite of the entirety requires you to enjoy an architecture the build you would possibly well with out design back trade which allocator your blueprint is the utilization of.

Writing an cease cease-of-page allocator is no longer any longer complex at all. Here’s what malloc looks care for:

void *eop_malloc(uint64_t size)
    uint64_t pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    char *noxious = virtual_alloc(pages * PAGE_SIZE);
    uint64_t offset = pages * PAGE_SIZE - size;
    return noxious + offset;

Demonstrate: There’s soundless boring writes that will depart undetected with this means. As an example, after we’ve freed the pages, a original page allocation would possibly well well happen within the same memory vary. Also, it is that you simply would possibly well imagine that one more dwelling of pages would possibly be allocated immediately after ours in memory, by which case we wouldn’t detect overwrites previous the last page.

Each and each these complications would possibly well honest moreover be mounted. For the first scream, we can leave the pages reserved, nonetheless no longer committed. That capability, the bodily memory is freed and we’re going to have the selection to catch page faults, nonetheless the addresses are reserved and can’t be mature by loads of objects. For the 2nd scream, we would possibly well well reserve an extra page after our pages, nonetheless no longer commit that page. No loads of object would possibly well well then affirm these addresses and writing to them would soundless dwelling off an catch admission to violation. (Demonstrate: This easiest works on Windows, the build reserve and commit are separate operations.)

In apply even though, I’ve never wished to perceive these extra precautions. For me, the classic cease-of-page allocator has at all times been ample to secure and repair the bugs.

Fragment free memory allocation

On aged console hardware, memory fragmentation would possibly well well give programmers nightmares. But even on recent machines, memory fragmentation is step by step a mountainous concern when making an are attempting to put into effect efficient memory allocators.

A memory allocator works by requesting mountainous chunks of memory from the blueprint and lowering them up into smaller pieces on inquire of of the user. Reminiscence fragmentation happens when easiest some of these objects are freed, leaving holes within the mature memory block:

Reminiscence fragmentation.

The concern here is that if the user requests a mountainous chunk of memory, none of the “holes” would possibly be mountainous ample. This fashion we have to allocate it from the free memory block at the head — increasing the memory relate of the application. In loads of words, we have a range of wasted memory that we can’t if truth be told relate. In the olden days, memory used to be a precious resource. Nowadays, it’s much less so, nonetheless we soundless don’t are searching to raze it.

Reminiscence fragmentation is a fantastic matter, nonetheless in dwelling of diving too deep into it, I’m accurate going to seek for at it from the digital memory level of view, which is handsome easy. If which you would possibly be allocating from digital memory, fragmentation is a non-scream.

The reason is that as soon as we are the utilization of digital memory, every page in handle dwelling is personally mapped to a page in bodily memory. So if we fabricate “holes” in bodily memory by allocating and releasing pages, it doesn’t matter, because these “holes” can later be mature to allocate what to us looks care for contiguous memory. Let me are attempting for example it with a image:

Bodily memory can no longer be fragmented by digital memory allocations.

Here, we have first made the red allocations and freed some of them, leaving holes in handle dwelling and bodily memory. On the different hand, this doesn’t cease us from making the mountainous red allocation. Each and each page within the red allocation would possibly well honest moreover be mapped to one of the precious outlet pages we created earlier, with out the need for allocating to any extent extra bodily memory. So nothing is misplaced to fragmentation.

Demonstrate that we soundless enjoy fragmentation within the handle dwelling. I.e. we can’t blueprint a mountainous contiguous allocation within the handle dwelling the build the red allocations are. But we can’t enjoy any fragmentation in bodily memory. Fragmentation in handle dwelling doesn’t if truth be told effort us, because as I said sooner than, handle dwelling is ceaselessly ninety 9.999999988 % empty anyway, so we enjoy no concern discovering contiguous handle blocks. (On 32 bit methods, it’s a uncommon yarn.)

The design back in contrast with the utilization of an well-liked allocator is that we have to round up the allocations to the page size, which usually is Four K on recent methods. This rounding up way we’re no longer the utilization of all of the accessible memory, a concern which is a cramped little bit of confusingly ceaselessly known as interior fragmentation.

There are a range of methods of going thru this. For mounted size objects we can relate object swimming pools — allocate a page of memory and divide it into as many objects will match on that page.

For dynamically increasing buffers, we can accurate blueprint the buffer size match the page size. This is a straightforward nonetheless provocative approach that I haven’t seen talked about loads. Affirm that you simply enjoy an array of objects which would possibly be 300 bytes mountainous. The mature setup is that as you will want extra entries you grow the array geometrically by declare doubling it in size. So you depart from sixteen to 32 to Sixty four to 128 parts. Geometrical development is important because it way you easiest pay an amortized constant value for along side an ingredient to the array.

On the different hand, sixteen * 300 = 4800. If you allocate that from digital memory, you enjoy to round it up to eight K, losing nearly a entire page. But we can with out design back repair that. Rather than focusing on the sequence of parts, we accurate grow the buffer by multiples of the page size: Four K, Eight K, sixteen K, 32 K, … after which assign as many parts as will slot in there (Thirteen, 27, fifty four, 109, …). This is soundless geometric development, so the value is soundless amortized constant, nonetheless now the interior fragmentation is accurate a hundred and fifty bytes on moderate, in dwelling of 2 K.

I secure it a cramped bit boring that fashioned memory allocators don’t snatch extra perfect thing about digital memory to assist far flung from fragmentation. Most allocators I’ve checked out soundless work by acquiring astronomical chunks from the OS and lowering them up, in dwelling of working with page sized digital memory allocations.

Is getting better chunks from the OS extra efficient? I’m heading into territory the build my records is handsome sketchy here, nonetheless I don’t tell so. Using a better page size would possibly well honest moreover be extra efficient, because it way a smaller page desk and better relate of the TLB cache. But, given a mounted page size, I don’t tell it matters while you enjoy one mountainous or many shrimp digital memory allocations, for the explanation that handle resolve is done page-by-page anyway. And even while you allocate astronomical chunks, consecutive digital pages are in overall no longer consecutive in bodily memory anyway.

There is potentially some memory overhead for the OS to assist word of a astronomical sequence of particular particular person memory allocations. Also we would favor blueprint calls to allocate and free up the pages that will relate a cramped time. Per chance that’s the explanation. Or seemingly it’s accurate that typical allocators are written to flee in a range of environments — on 32 bit methods or methods with astronomical page sizes — to allow them to’t snatch perfect thing about the fragment free allocations you would possibly well catch with Sixty four bit handle dwelling and Four K pages.

Gapless ring buffer

This is a trick I realized out about from Fabian Giesens blog, nonetheless the belief that looks to had been around for longer.

A ring buffer is a if truth be told worthwhile records structure for any scream the build records is produced and consumed at loads of rates. As an example, when reading from a socket, you would possibly well honest collect records in chunks that fabricate no longer match your packet size, so that you simply enjoy to buffer it till you enjoy received a paunchy packet.

A hoop buffer shops the records in an array of mounted size that “wraps around”, i.e. as soon as the array has been stuffed we originate writing records to the inspiration of the array again.

In addition to the array we moreover need read and write pointers to know the build within the buffer to read and write records. I purchase to make relate of uint64_t counters for this, specifying the total sequence of information written and skim, something care for this:

enum {BUFFER_SIZE = Eight*1024};
struct ring_buffer_t {
    uint8_t records[BUFFER_SIZE];
    uint64_t read;
    uint64_t written;

Demonstrate that for the explanation that buffer wraps around, we won’t let the write pointer flee too far earlier than the read pointer or this would well honest originate to trash records that hasn’t been read but. In overall, the BUFFER_SIZE controls how far forward the creator can catch. If the buffer is paunchy, the creator must stall and expect the reader. If the buffer is empty, the reader must stall and expect the creator. (The sequence of bytes accessible to the reader is written - read.)

A if truth be told annoying section of ring buffers is that you simply will want special code to handle the wraparound. If it weren’t for this, writing to the buffer would possibly well well be a easy memcpy, nonetheless as a substitute we’re going to have the selection to enjoy to soundless be cautious that we fabricate the staunch thing when the write straddles the wraparound level:

void write(ring_buffer_t *rb, uint8_t *p, uint64_t n)
    uint64_t offset = rb->written % BUFFER_SIZE;
    uint64_t dwelling = BUFFER_SIZE - offset;
    uint64_t first_write = n < space ? n : space;
    memcpy(rb->records + offset, p, first_write);
    memcpy(rb->records, p + first_write, n - first_write);
    rb->written += n;

As annoying as here is, reading is even worse. If it wasn’t for the wraparound, reading wouldn’t even require any copying at all — we would possibly well well accurate relate the records immediately from the buffer — nonetheless with the wraparound, we would favor two memcpy() calls to pass the records to a contiguous memory block for extra processing.

How can digital memory encourage here? Lets relate the “fantastic array” approach and accurate reserve an ideal array in dwelling of the ring buffer, commit pages as the creator evolved and decommit them as the reader evolved. With this we wouldn’t even must dwelling a mounted size for the array — it would possibly possibly well accurate relate as important memory as wished. Barely ingenious. But demonstrate which which you would possibly desire a fantastic array. To buffer a 1 Gbps network trot with an uptime of a twelve months you would possibly be in a position to must reserve Four PB (petabytes) of memory. Sadly, as we saw above, Sixty four-bit windows caps the digital memory at 256 TB. Also, these commit and decommit calls aren’t free.

But we can repair this in a single more capability, by taking perfect thing about the truth that a couple of digital pages would possibly well honest moreover be setup to resolve to the same bodily page. This is step by step mature to portion memory between loads of processes, nonetheless we can moreover dwelling it up in a single direction of. To relate this for the ring buffer we dwelling up a dwelling of pages, immediately after the ring buffer that expose the same bodily memory:

Ring buffer with page mapping. The murky block looks twice.

As you would possibly well look, with this setup the digital memory blueprint takes care of the wraparound for us. In the handle dwelling we can at all times secure a contiguous reproduction of the memory within the ring buffer, even supposing the buffer is split up physically. The read and write functions change into merely:

void write(ring_buffer_t *rb, uint8_t *p, uint64_t n)
    memcpy(rb->records + (rb->written % BUFFER_SIZE), p, n);
    rb->written += n;

uint8_t *read(ring_buffer_t *rb, uint64_t n)
    uint8_t *p = rb->records + (rb->read % BUFFER_SIZE);
    rb->read += n;
    return p;

Plenty nicer, and we’re soundless the utilization of the same quantity of (bodily) memory.

Demonstrate that establishing this memory structure is step by step a cramped bit complex. On Windows you enjoy to fabricate a file mapping into digital memory with CreateFileMapping() — yes, even supposing no info on disk are involved we soundless desire a “file mapping” because that’s how digital memory is shared. Since we don’t desire a file on disk we relate INVALID_HANDLE_VALUE for the file handle which creates a mapping in opposition to the page file. Then we relate MapViewOfFileEx() to attract that file mapping into two consecutive memory areas. Unfortunately, there is just not any longer any capability of guaranteeing that the memory areas we pass come in. We will reserve them, after which free them accurate sooner than calling MapViewOfFileEx() , nonetheless that soundless leaves a window of time the build if we’re astronomical unlucky, someone else would possibly well well advance and allocate something in that handle dwelling. Lets must retry the mapping a couple of cases sooner than it succeeds, nonetheless after that setup we can relate the buffer with out being concerned about it.

And that’s all from me

Attain you know of any natty digital memory tricks no longer talked about here. Tweet them to me at @niklasfrykholm.

Be taught Extra

What do you think?

0 points
Upvote Downvote

Total votes: 0

Upvotes: 0

Upvotes percentage: 0.000000%

Downvotes: 0

Downvotes percentage: 0.000000%