Discussion:
How do the malloc() and free() algorithms work?
Rick C. Hodgin
2017-05-16 19:39:13 UTC
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?

I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.

Thank you,
Rick C. Hodgin
Rick C. Hodgin
2017-05-16 21:02:23 UTC
I ask because I've written one that is performing 3x to 8x faster than MSVC++'s,
and I can't believe it. I am thinking I am doing something wrong.

Mine is semaphore'd, uses VirtualAlloc() with MEM_COMMIT, and uses a
three-tiered design keyed to the requested block size. It supports
pre- and post-guard buffers of N and M bytes each for memory-corruption
detection, and a slower variant allows uncommitted virtual pages to be
inserted before and after, allowing memory corruption to be detected on
the instruction creating the fault (the reason I wrote it, as an aid in
debugging memory errors).

It consumes more memory than malloc, but with the speedup, I think
most would tolerate the trade.

I'm doing stress testing tomorrow and Thursday. I'm very surprised it can
even compete in the same ballpark as malloc() and free(). I would've
figured they were as optimized as could be due to their oft-use.

Thank you,
Rick C. Hodgin
s***@casperkitty.com
2017-05-16 22:00:05 UTC
Post by Rick C. Hodgin
I ask because I've written one that is performing 3x to 8x faster than MSVC++'s,
and I can't believe it. I am thinking I am doing something wrong.
It's possible to optimize memory management code around different usage
patterns. Code which is optimized for the usage patterns that are
actually encountered may be able to perform much faster than code which
is not. In the era when the first versions of MSVC were created, heavy
use of malloc() and free() was practically an admission that one wasn't
interested in performance, since programs with memory management code
tailored to its actual needs could vastly outperform functions that had
to accommodate arbitrary sequences of acquisitions and releases.
Rick C. Hodgin
2017-05-17 00:56:27 UTC
Post by s***@casperkitty.com
Post by Rick C. Hodgin
I ask because I've written one that is performing 3x to 8x faster than MSVC++'s,
and I can't believe it. I am thinking I am doing something wrong.
It's possible to optimize memory management code around different usage
patterns. Code which is optimized for the usage patterns that are
actually encountered may be able to perform much faster than code which
is not.
The tests I'm doing right now are on general purpose access using the
existing code bases I have. I didn't design the algorithm for a specific
usage pattern. In fact, I didn't design it to be efficient. I simply
wanted a way to have a solid debugging tool that was a drop-in replacement
for malloc(), realloc(), and free(). I didn't care about performance
at all, provided it was not 10x slower than malloc(). I needed a way to
find allocated buffer overruns.

Designing these algorithms, though, I've seen about five different
variations that I could see having advantages in different cases. I
have also considered how some of them could work with a CPU that
had a feature I'd call "stitching": a type of access, enabled by a
special load instruction, that loads a pointer into a register along
with its stitch buffer, allowing fragmented blocks to appear contiguous
and work together as though they were one larger, linear block, all
through a few CPU resources dedicated to that task.
Post by s***@casperkitty.com
In the era when the first versions of MSVC were created, heavy
use of malloc() and free() was practically an admission that one wasn't
interested in performance, since programs with memory management code
tailored to its actual needs could vastly outperform functions that had
to accommodate arbitrary sequences of acquisitions and releases.
I was not aware of that. I figured malloc() and free() would be among
the most heavily optimized algorithms in existence. I've never looked
into them, but I have seen memcpy() come up from time to time, and I've
seen where it is hand-coded assembly. I just assumed the same for the
core portions of malloc() and free().

Thank you,
Rick C. Hodgin
s***@casperkitty.com
2017-05-17 14:56:45 UTC
Post by Rick C. Hodgin
Post by s***@casperkitty.com
In the era when the first versions of MSVC were created, heavy
use of malloc() and free() was practically an admission that one wasn't
interested in performance, since programs with memory management code
tailored to its actual needs could vastly outperform functions that had
to accommodate arbitrary sequences of acquisitions and releases.
I was not aware of that. I figured malloc() and free() would be among
the most heavily optimized algorithms in existence. I've never looked
into them, but I have seen memcpy() come up from time to time, and I've
seen where it is hand-coded assembly. I just assumed the same for the
core portions of malloc() and free().
If code running on a single thread will be churning many objects of a
particular type which contain a pointer to an object of the same type
called "link" [a field that could be used for arbitrary purposes while
the object is allocated], performance may be improved substantially by
using an allocator designed for that purpose:

#include <stdlib.h>

/* Example object type: the "link" field threads the recycle list
   (assumed definition; the original post left THING undefined). */
typedef struct THING THING;
struct THING {
    THING *link;
    /* ... other fields ... */
};

THING *thing_recycle = 0;

THING *thing_alloc(void)
{
    THING *ret;
    if (thing_recycle)
    {
        /* Pop the most recently freed object off the recycle list. */
        ret = thing_recycle;
        thing_recycle = ret->link;
    }
    else
    {
        ret = malloc(sizeof (THING));
        if (ret)
            ret->link = 0;
    }
    return ret;
}

void thing_free(THING *p)
{
    /* Push onto the recycle list instead of calling free(). */
    p->link = thing_recycle;
    thing_recycle = p;
}

void thing_empty_trash(void)
{
    THING *temp;
    while (thing_recycle)
    {
        temp = thing_recycle->link;
        free(thing_recycle);
        thing_recycle = temp;
    }
}

Any storage which was ever allocated for a "THING" would be ineligible for
use as any other type unless/until code calls thing_empty_trash, but the
combined cost to release a thing and reallocate it without an intervening
thing_empty_trash call would, with mild optimization, be three loads,
three stores, and one null check. No general-purpose "malloc" could hope to
come close to that except in cases where a compiler can "see" the entire
lifetime of an allocated object and can see that nothing "unusual" happens
to it.

Incidentally, all that would be necessary to make the code thread-safe
would be to give each thread its own copy of thing_recycle, and make each
thread responsible for calling thing_empty_trash on its own pool of
things. No special handling would be required to accommodate things which
might be allocated on one thread and released on another, since the pools
would have no reason to care about what happens to things after they are
allocated.
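
A minimal sketch of that per-thread variant, assuming C11 thread-local
storage (the names follow the example above):

// Each thread gets its own recycle list, so no locking is needed;
// each thread must drain its own list with thing_empty_trash()
// before it exits. C11 _Thread_local is assumed (a compiler-specific
// equivalent such as __thread also works).
static _Thread_local THING *thing_recycle = 0;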
Chris M. Thomasson
2017-05-16 22:15:21 UTC
Post by Rick C. Hodgin
I ask because I've written one that is performing 3x to 8x faster than MSVC++'s,
and I can't believe it. I am thinking I am doing something wrong.
Mine is semaphore'd,
Fwiw, here is a fast semaphore:

https://groups.google.com/d/topic/comp.lang.c++/60IpFwqbtu8/discussion

Thanks to Joe Seigh, a very smart guy indeed!
Post by Rick C. Hodgin
uses VirtualAlloc() with MEM_COMMIT, and uses a
three-tiered design keyed to the requested block size. It supports
pre- and post-guard buffers of N and M bytes each for memory-corruption
detection, and a slower variant allows uncommitted virtual pages to be
inserted before and after, allowing memory corruption to be detected on
the instruction creating the fault (the reason I wrote it, as an aid in
debugging memory errors).
It consumes more memory than malloc, but with the speedup, I think
most would tolerate the trade.
I'm doing stress testing tomorrow and Thursday. I'm very surprised it can
even compete in the same ballpark as malloc() and free(). I would've
figured they were as optimized as could be due to their oft-use.
Chris M. Thomasson
2017-05-16 22:06:37 UTC
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
The best performers use per-thread memory pools with some
wait-free/lock-free magic to allow memory allocated in one thread to be
freed in another. I have some work on this; so does Intel.

https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/300265
(read all!)

https://www.threadingbuildingblocks.org/tutorial-intel-tbb-scalable-memory-allocator

https://groups.google.com/forum/embed/#!topic/comp.programming.threads/gkX0tIWR_Gk
(very old work!)

:^)
Chris M. Thomasson
2017-05-16 22:13:37 UTC
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
https://www.bsdcan.org/2006/papers/jemalloc.pdf
Chris M. Thomasson
2017-05-16 22:45:41 UTC
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
Also, take a look at:

https://en.wikipedia.org/wiki/Hoard_memory_allocator

;^)
Robert Wessel
2017-05-17 03:31:02 UTC
On Tue, 16 May 2017 12:39:13 -0700 (PDT), "Rick C. Hodgin"
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
If you install MSVS with the right options, most of the source to the
C library is included.

But the short answer is that it mostly punts to the HeapXxxx
functions.
Rick C. Hodgin
2017-05-17 14:22:10 UTC
The algorithm I have is three-tier. It has a root that branches out
to pages, each with a control block indicating what memory is associated
with that page, which in turn branch out to the allocated data.

root ==> page ==> data

There are several roots based on allocation size:

root8 // Anything allocated 8-bytes or smaller
root16 // Anything allocated 16-bytes or smaller
root32 // Anything allocated 32-bytes or smaller
root64 // Anything allocated 64-bytes or smaller
root128 // Anything allocated 128-bytes or smaller
root256 // Anything allocated 256-bytes or smaller
root512 // Anything allocated 512-bytes or smaller
root1024 // Anything allocated 1024-bytes or smaller
root2048 // Anything allocated 2048-bytes or smaller
root4096 // Anything allocated 4096-bytes or smaller
root8192 // Anything allocated 8192-bytes or smaller
root16384 // Anything allocated 16384-bytes or smaller
root_bg // For bigger allocations, or memory with guard bytes

Each root has some entries which allow access:

struct root {
    int   size;
    page* firstPage;
    page* firstFreePage;
    page* lastPage;

    void* (*malloc)(int size);
    void* (*realloc)(void* p, int new_size);
    void  (*free)(void* p);
};

Each allocation page has some allocation slots, and basic fields:

struct slot {
    int size;              // Allocated size of memory block
};

struct page {
    slot  slots[N];        // N is determined by page size

    void* base;            // Allocated memory for this page
    page* prev;            // Prev page in linked list
    page* next;            // Next page in linked list
    int   firstFreeSlot;   // First free slot num on this page
};
// Note: Pages are typically 4KB, with sizeof(slot) being 4,
// giving page::slots[N] of (4096 - 16) / 4 = 1020 slots

Each page allocates the full size of its slot allocation, so a
single allocation of 6 bytes, for example, creates the first
page, which is 4KB, and uses slot 0, leaving 1019 slots still
available before more allocation for small sizes is required.
It allocates base = malloc(N * root->size);, which would be
base = malloc(1020 * 8); in this case. That allocates 8,160
bytes, which is rounded up to the nearest 4KB block to align
with page sizes.

The reallocate and release/free algorithms traverse the base
ranges and determine whether a pointer falls within a page's
min..max range. If it does, they determine the start of the slot
(allowing a pointer into any part of the allocated block to be
used to free the block). For a reallocate, if there's enough
leftover room, it returns immediately after updating the size;
otherwise, it allocates a new block, copies the data, and frees
the original block, returning the new pointer. The free algorithm
simply releases whatever's there.
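
A minimal sketch of that free-path lookup under the structures above
(the function name and typedef usage are assumptions for illustration,
not the actual implementation):

// Hypothetical sketch: resolve a user pointer to its page and slot by
// scanning each page's base range; N is the per-page slot count.
int root_free(root* r, void* p)
{
    page* pg;
    for (pg = r->firstPage; pg; pg = pg->next)
    {
        char* base = (char*)pg->base;
        if ((char*)p >= base && (char*)p < base + N * r->size)
        {
            int slotNum = (int)(((char*)p - base) / r->size);
            pg->slots[slotNum].size = 0;        // Mark the slot free
            if (slotNum < pg->firstFreeSlot)
                pg->firstFreeSlot = slotNum;    // Track first free slot
            return 1;                           // Freed
        }
    }
    return 0;                                   // Pointer not in this root
}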

For big / guard memory it's a little different. It uses more
comprehensive page and slot structures, containing base, size,
preguard, and postguard bytes, along with some callback information,
and it has the option to allocate a virtual page that can capture
invalid reads or writes before and after the memory block. Depending
on the size, this may result in a misaligned pointer when guard
pages are used, but the benefit of trapping the exact instruction
causing the invalid write is well worth it.

That part of the memory management is about 2x to 6x slower than
the traditional malloc() and free() because of the overhead of
creating all of the virtual pages on every call, and the average
misalignment when using guard bytes.

It's less than 2,000 lines long and I'm still stress testing
and debugging it today. This particular version belongs to the
company I work for. I plan on writing an expanded version which
contains the variants I mentioned above, and will release the
source code for it when completed.

It's working quite well ... much to my amazement, to be honest.
In fact, it's turned out far better than I expected. I'm rather
stunned.

Thank you,
Rick C. Hodgin
Rick C. Hodgin
2017-05-17 16:15:21 UTC
This new memory algorithm has significantly increased the performance
of some of my existing and well-debugged applications. I am floored
by the change from simply moving from one memory algorithm to another.
I'm in debug mode and am seeing almost a 10-fold increase in speed
on algorithms that load and parse data, and on algorithms that have
data existing for relatively short lifetimes. The performance
difference is unbelievable. I haven't even compiled in release mode
yet (with optimizations; debug mode is slow, unoptimized code with
all kinds of checks enabled to validate that data isn't being
corrupted accidentally).

-----
Is it common knowledge in 2017 that malloc() and realloc() and free()
are inherently slow in their implementation? I'm thinking that if
some existing common widely-distributed applications switched to a
different heap management subsystem, they too might see staggering
increases in performance with no other changes to their system.

Thank you,
Rick C. Hodgin
s***@casperkitty.com
2017-05-17 17:35:16 UTC
Post by Rick C. Hodgin
Is it common knowledge in 2017 that malloc() and realloc() and free()
are inherently slow in their implementation? I'm thinking that if
some existing common widely-distributed applications switched to a
different heap management subsystem, they too might see staggering
increases in performance with no other changes to their system.
Programmers should be aware that allocation systems that operate on fixed-
sized chunks can vastly outperform those that have to accommodate chunks
of arbitrary size. Unfortunately, there is no standard means by which one
could make a fixed-size block allocator that was usable in type-agnostic
fashion from code that doesn't always write storage before reading it.
Some algorithms can benefit from being able to use a pattern like:

    unsigned index = thing->table1[key];
    if (index < thing->count && thing->table2[index] == key)
        ... Item was found

which would be unaffected by what value was read from any entry of
table1 that it hadn't written with a known value, but such code could
behave in arbitrary fashion if storage was recycled after having been
used as another type. For example, a compiler that could see the
sequence:


    thingAsOtherType->whatever = 12345678;
    free_chunk(thingAsOtherType);
    thing = alloc_chunk();
    thing->count = 0;
    ...
    unsigned index = thing->table1[key];
    if (index < thing->count && thing->table2[index] == key)
        ... Item was found

would be entitled to replace it with:

    thing = (THING_TYPE*)thingAsOtherType;
    thing->count = 0;
    int ok = thing->table1[key] < thing->count;
    thingAsOtherType->whatever = 12345678; // Moved from above
    if (ok && thing->table2[thing->table1[key]] == key)
        ... Item was found

Such substitution could cause the value of thing->table1[key] used for
indexing to be beyond the range of thing->table2, which could in turn
trigger other arbitrary problems.

Unless one knows that a compiler won't perform that style of aggressive
optimizations, or the compiler provides a way of blocking them, the
only safe way to reuse storage as different types without having to
clear it before use is to use malloc/free.
bartc
2017-05-19 11:17:21 UTC
Post by Rick C. Hodgin
This new memory algorithm has significantly increased the performance
of some of my existing and well-debugged applications. I am floored
by the change from simply moving from one memory algorithm to another.
I'm in debug mode and am seeing almost a 10-fold increase in speed
on algorithms that load and parse data, and on algorithms that have
data existing for relatively short lifetimes. The performance
difference is unbelievable. I haven't even compiled in release mode
yet (with optimizations; debug mode is slow, unoptimized code with
all kinds of checks enabled to validate that data isn't being
corrupted accidentally).
-----
Is it common knowledge in 2017 that malloc() and realloc() and free()
are inherently slow in their implementation? I'm thinking that if
some existing common widely-distributed applications switched to a
different heap management subsystem, they too might see staggering
increases in performance with no other changes to their system.
Do you have any benchmark code we can try out? Using regular
malloc/free. Then we can appreciate what a 10x speed increase might be like!

The program below is the Binary Trees benchmark, which is sensitive to
the quality of malloc and free implementations. But it is limited as a
test because it only allocates and frees large numbers of the same small
allocation (the 'treenode' struct).

What differences does your version make with this? (You can change the
n=15 line for a bigger test.)


//-------------------------------------------

#include <stdio.h>
#include <stdlib.h>

int nallocs;

typedef struct _node {
    struct _node *left;
    struct _node *right;
    int item;
} treenode;

treenode* newtreenode(treenode* left, treenode* right, int item) {
    treenode* t;
    t=malloc(sizeof(treenode));
    ++nallocs;
    t->left=left;
    t->right=right;
    t->item=item;
    return t;
}

int itemcheck(treenode* tree) {
    if (tree->left==NULL)
        return tree->item;
    else
        return tree->item+itemcheck(tree->left)-itemcheck(tree->right);
}

treenode* bottomuptree(int item, int depth) {
    if (depth>0)
        return newtreenode(bottomuptree(2*item-1,depth-1),
                           bottomuptree(2*item,depth-1),item);
    else
        return newtreenode(NULL,NULL,item);
}

void deletetree(treenode* tree) {
    if (tree->left) {
        deletetree(tree->left);
        deletetree(tree->right);
    }
    free(tree);
}

int main(void) {
    treenode *stretchtree, *longlivedtree, *temptree;
    int n,depth,mindepth,maxdepth,stretchdepth,check,iterations,i;

    n=15;

    mindepth=4;
    if ((mindepth+2)>n)
        maxdepth=mindepth+1;
    else
        maxdepth=n;

    stretchdepth=maxdepth+1;

    stretchtree=bottomuptree(0,stretchdepth);
    printf("Stretch tree of depth %d\tcheck: %d\n",
           stretchdepth,itemcheck(stretchtree));
    deletetree(stretchtree);

    longlivedtree=bottomuptree(0,maxdepth);

    depth=mindepth;
    while (depth<=maxdepth) {
        iterations=1<<(maxdepth-depth+mindepth);
        check=0;
        for (i=1; i<=iterations; ++i) {
            temptree=bottomuptree(i,depth);
            check=check+itemcheck(temptree);
            deletetree(temptree);

            temptree=bottomuptree(-i,depth);
            check=check+itemcheck(temptree);
            deletetree(temptree);
        }
        printf("%d\tTrees of depth\t%d\tcheck: %d\n",
               iterations*2,depth,check);
        depth=depth+2;
    }
    printf("%s %d %s %d\n","long lived tree of depth",
           maxdepth,"\tcheck:",itemcheck(longlivedtree));
    printf("Nallocs= %d\n",nallocs);
}

//-------------------------------------------
--
bartc
Rick C. Hodgin
2017-05-19 11:56:19 UTC
I will benchmark it later tonight. I created the algorithm, but the
source code belongs to the company I work for. I'll have to create
my own variant before I can publish source code.

Thank you,
Rick C. Hodgin
Rick C. Hodgin
2017-05-19 23:05:51 UTC
Post by bartc
The program below is the Binary Trees benchmark, which is sensitive to
the quality of malloc and free implementations. But it is limited as a
test because it only allocates and frees large numbers of the same small
allocation (the 'treenode' struct).
What differences does your version make with this? (You can change the
n=15 line for a bigger test.)
This test exposes a serious flaw in my implementation. Mine takes
4,264ms compared to malloc()'s 862ms, ~5x slower.

I'll add an optimization over the next few days to remove this flaw
and post new results.

My algorithm is ahead on the first pass by 6%, but falls off sharply
as the allocation count grows.

It's an interesting test. Do you have any more?

Thank you,
Rick C. Hodgin
Mr. Man-wai Chang
2017-05-18 12:48:25 UTC
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
Depends on the implementation of operating systems? Is it really related
to C? :)
--
@~@ Remain silent! Drink, Blink, Stretch! Live long and prosper!!
/ v \ Simplicity is Beauty!
/( _ )\ May the Force and farces be with you!
^ ^ (x86_64 Ubuntu 9.10) Linux 2.6.39.3
No borrowing! No scams! No compensated dating! No fights! No robbery! No suicide! Please consider CSSA:
http://www.swd.gov.hk/tc/index/site_pubsvc/page_socsecu/sub_addressesa
Rick C. Hodgin
2017-05-18 14:19:06 UTC
Post by Mr. Man-wai Chang
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
Depends on the implementation of operating systems? Is it really related
to C? :)
The C Standard defines malloc(), realloc(), free(), and others. How
they mechanically apply to a given OS and its set of low-level function
call APIs is another matter. I'm curious about how they are designed
in general (such as what type of model they use for allocation and
free operations, as I assume there is some core algorithm that's been
developed over time, evolving into the one that's most preferred due
to its general reliability, speed, and minimal fragmentation).

http://www.iso-9899.info/n1570.html

7.22.3 [Memory management functions]
7.22.3.1 [The aligned_alloc function]
7.22.3.2 [The calloc function]
7.22.3.3 [The free function]
7.22.3.4 [The malloc function]
7.22.3.5 [The realloc function]

I think for CAlive I am going to directly define three types of memory:

1) fx-memory (fixed, process-only)
2) mv-memory (movable, process-only)
3) go-memory (global, system-wide)

When allocating with fx_malloc(), you receive a pointer to a fixed
memory block in the local heap. It is pointing directly to the place
where memory is, and you can directly manipulate that data.

When allocating with mv_malloc(), you receive a pointer to a pointer
to a fixed memory block in the local heap. This allows the underlying
memory to be moved around as needed. You reference it the same way in
source code, but a mild performance hit is had because it's going
through indirection for each reference.
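
As a purely hypothetical illustration of that double indirection
(these names are mine, not a real API), the caller holds a stable
handle and the allocator is free to relocate the block behind it:

#include <stdlib.h>
#include <string.h>

// Hypothetical mv-memory sketch: the caller keeps the handle and
// dereferences it twice to reach the data.
void** mv_malloc(size_t n)
{
    void** handle = malloc(sizeof *handle);
    if (handle)
        *handle = malloc(n);
    return handle;
}

// The allocator can move the block without the caller noticing,
// because the caller only ever holds the handle.
void mv_move(void** handle, size_t n)
{
    void* moved = malloc(n);
    if (moved && *handle)
    {
        memcpy(moved, *handle, n);
        free(*handle);
        *handle = moved;
    }
}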

When allocating with go_malloc(), you receive a pointer to a global
memory block in the global heap. Other processes can request access
into this heap, including access to the indicated pointer.

The go-memory pointer points directly to the location where the
memory is, and since it is managed by the OS, the underlying virtual
pages that back the offset allow it to be moved as needed without
application awareness.

Thank you,
Rick C. Hodgin
Malcolm McLean
2017-05-18 14:33:36 UTC
Post by Rick C. Hodgin
The C Standard defines malloc(), realloc(), free(), and others. How
they mechanically apply to a given OS and its set of low-level function
call APIs is another matter. I'm curious about how they are designed
in general (such as what type of model they use for allocation and
free operations, as I assume there is some core algorithm that's been
developed over time, evolving into the one that's most preferred due
to its general reliability, speed, and minimal fragmentation).
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
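
A sketch of the rounding step this describes (my own illustration,
assuming a minimum allocation of 8 bytes):

#include <stddef.h>

// Round a request up to the next power of two, minimum 8 bytes, and
// report the size-class index: 8 -> 0, 16 -> 1, 32 -> 2, and so on.
// (A real allocator caps this at some limit, per the post above.)
static size_t round_up_pow2(size_t n, int* cls)
{
    size_t p = 8;
    int    c = 0;
    while (p < n)
    {
        p <<= 1;
        ++c;
    }
    *cls = c;    // Index into the per-size page lists
    return p;    // Size actually allocated
}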
Mr. Man-wai Chang
2017-05-18 15:32:48 UTC
Post by Malcolm McLean
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
Only very old, primitive operating systems would allow a user
application to manage the kernel's memory. Think memory and stack
protection. MS/DOS maybe? :)
Rick C. Hodgin
2017-05-18 15:45:06 UTC
Post by Mr. Man-wai Chang
Post by Malcolm McLean
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
Only very old, primitive operating systems would allow a user
application to manage the kernel's memory. Think memory and stack
protection. MS/DOS maybe? :)
MS-DOS offered the same features. .COM files were automatically
granted access to all available free memory. The typical procedure
was to release that memory keeping only what you knew you needed
(typically 64KB or less), and then allocate blocks as needed with
the appropriate INT 21h call.

But I would imagine that even back then, the malloc() and realloc()
functions would request larger blocks from MS-DOS, and then divvy
them up themselves for the smaller requests.

It's just a guess, but it would seem to be the most efficient way,
and the only way to ensure reliability across various versions of
DOS.

Thank you,
Rick C. Hodgin
Rick C. Hodgin
2017-05-18 15:47:56 UTC
Post by Rick C. Hodgin
Post by Mr. Man-wai Chang
Post by Malcolm McLean
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
Only very old, primitive operating systems would allow a user
application to manage the kernel's memory. Think memory and stack
protection. MS/DOS maybe? :)
MS-DOS offered the same features. .COM files were automatically
granted access to all available free memory. The typical procedure
was to release that memory keeping only what you knew you needed
(typically 64KB or less), and then allocate blocks as needed with
the appropriate INT 21h call.
I meant to also add here that .EXE files only allocated as much
memory as they needed, with the rest automatically remaining free
when the main() function was hit. As such, all allocations
thereafter proceeded similarly in .COM and .EXE files, with the
only differences being pointer size and the math done on pointers
to access larger blocks in a segmented memory model.
Post by Rick C. Hodgin
But I would imagine that even back then, the malloc() and realloc()
functions would request larger blocks from MS-DOS, and then divvy
them up themselves for the smaller requests.
It's just a guess, but it would seem to be the most efficient way,
and the only way to ensure reliability across various versions of
DOS.
Thank you,
Rick C. Hodgin
Keith Thompson
2017-05-18 16:24:35 UTC
Malcolm McLean <***@gmail.com> writes:
[...]
Post by Malcolm McLean
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
I don't believe that's universally true.
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Malcolm McLean
2017-05-18 19:18:33 UTC
Post by Keith Thompson
[...]
Post by Malcolm McLean
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
I don't believe that's universally true.
I'm sure it's not.
But that's the core of how most modern systems work. Instead of a flat
memory pool from which you dish out portions, you have a relatively
large number of pages, which are a power of two in size, and you then
chop them up into smaller powers of two. Thus fragmentation is kept to
a minimum, and it is fast to search the free list.
a***@gmail.com
2017-05-19 21:37:23 UTC
The one that speeds up free() is the one that first searches the last malloc'd memory released, as with stack memory, because function calls follow a stack structure and they have to use malloc and free.
Joe Pfeiffer
2017-05-19 22:01:35 UTC
Post by a***@gmail.com
The one that speeds up free() is the one that first searches the last
malloc'd memory released, as with stack memory, because function calls
follow a stack structure and they have to use malloc and free.
If you mean the functions call malloc() and free() to create their
activation records, no they don't (at least, not in the case of any
compiler I know of). They just adjust the registers that point into
the stack.
Mr. Man-wai Chang
2017-05-18 15:28:44 UTC
Post by Rick C. Hodgin
The C Standard defines malloc(), realloc(), free(), and others. How
they mechanically apply to a given OS and its set of low-level function
call APIs is another matter. I'm curious about how they are designed
in general (such as what type of model they use for allocation and
free operations, as I assume there is some core algorithm that's been
developed over time, evolving into the one that's most preferred due
to its general reliability, speed, and minimal fragmentation).
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.

You can, of course, create your own garbage memory management system ON
TOP OF the operating systems. ;)
Rick C. Hodgin
2017-05-18 15:42:43 UTC
Post by Mr. Man-wai Chang
Post by Rick C. Hodgin
The C Standard defines malloc(), realloc(), free(), and others. How
they mechanically apply to a given OS and its set of low-level function
call APIs is another matter. I'm curious about how they are designed
in general (such as what type of model they use for allocation and
free operations, as I assume there is some core algorithm that's been
developed over time, evolving into the one that's most preferred due
to its general reliability, speed, and minimal fragmentation).
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know.
Unless I'm mistaken, requests are made from the malloc() and realloc()
functions to allocate memory, but those algorithms decide internally
how that larger allocated block is divvied up to fulfill individual
requests: 11 bytes for this request, 9 bytes for that one, 337 bytes
for another, and so on.

The OS does not handle those allocations, but only larger blocks. If
you had to call into the OS to fulfill every malloc() request, it would
be a great performance hit as traversal from user space to kernel space
is relatively expensive in terms of clock cycles and context switches.
Post by Mr. Man-wai Chang
The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.
As I understand it, there is more than one level of management, broken
down by granularity. The OS fulfills large requests (multiples of 4KB
or 4MB blocks as per virtual page sizes), and then that large allocation
goes through the C standard library algorithm which then breaks the
large allocation into many smaller pieces per user app requests.

Thank you,
Rick C. Hodgin
Joe Pfeiffer
2017-05-18 15:46:39 UTC
Post by Mr. Man-wai Chang
Post by Rick C. Hodgin
The C Standard defines malloc(), realloc(), free(), and others. How
they mechanically apply to a given OS and its set of low-level function
call APIs is another matter. I'm curious about how they are designed
in general (such as what type of model they use for allocation and
free operations, as I assume there is some core algorithm that's been
developed over time, evolving into the one that's most preferred due
to its general reliability, speed, and minimal fragmentation).
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however,
could be written in C, as for example, Unix and Linux. But the memory
is still managed by the operating systems.
You can, of course, create your own garbage memory management system
ON TOP OF the operating systems. ;)
Exactly, which is how malloc() works. It manages a free space pool in
user space, using something like the power-of-two algorithm described by
another poster. When that pool is exhausted it requests a large new
chunk of space from the OS, and then allocates from within that. Note
this doesn't necessarily have anything to do with garbage collection.

It's true that this really isn't C specific, but not because it's all
managed by the kernel.
s***@casperkitty.com
2017-05-18 16:08:54 UTC
Post by Mr. Man-wai Chang
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.
Unless an optimizer can somehow manage to break it, one can implement
malloc/free without any reliance upon the OS by simply saying:

#include <stddef.h>                  /* size_t */

/* Illustrative definitions (assumed; the post left these open): */
#define SIZE_OF_POOL 65536           /* pool size, known in advance */
typedef long double coarsestType;    /* most coarsely aligned type  */

typedef struct {
    void *next;
    size_t size;
} MEM_BLOCK;

union {
    MEM_BLOCK free_area;
    coarsestType arr[1 + (SIZE_OF_POOL-1) / sizeof (coarsestType)];
} mem_pool_obj = { { 0, 1 + (SIZE_OF_POOL-1) / sizeof (coarsestType) } };
void *volatile first_free_block = &mem_pool_obj;

and then have malloc/free return pointers to portions of mem_pool_obj. One
would have to know in advance what size the pool should be, and systems may
limit static allocations to a small portion of total RAM (many MS-DOS
and classic-Macintosh implementations were like that) but if a compiler is
required to presume that a volatile void* might identify a union containing
any arbitrary combination of types (as would have been the case under C89)
there would be no need for malloc() to use the OS at all unless the storage
it needs to handle exceeds the allowable size for static objects.
Chris M. Thomasson
2017-05-18 22:48:58 UTC
Post by s***@casperkitty.com
Post by Mr. Man-wai Chang
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.
Unless an optimizer can somehow manage to break it, one can implement
typedef struct {
void *next;
size_t size } MEM_BLOCK;
There is a way to combine the next pointer into the memory block itself.
Virtually zero overhead. This uses tricks to gain a pointer to the
"super-block" without having to create an extra pointer per block.
s***@casperkitty.com
2017-05-18 23:07:34 UTC
Post by Chris M. Thomasson
Post by s***@casperkitty.com
Post by Mr. Man-wai Chang
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.
Unless an optimizer can somehow manage to break it, one can implement
typedef struct {
void *next;
size_t size } MEM_BLOCK;
There is a way to combine the next pointer into the memory block itself.
Virtually zero overhead. This is using tricks to gain a pointer to the
"super-block" without having to create an extra pointer per block.
If a memory manager requires applications to specify the size of blocks
being freed, and block sizes have sufficient granularity to hold a pointer
and a size, it's pretty easy to reduce per-block overhead to zero by only
keeping track of free blocks. Allocation with zero per-block overhead can
also be accomplished if the granularity only allows space for a pointer,
but at the cost of more code (keep separate free lists for minimal-sized
blocks and larger blocks). If applications are not required to specify
the size of blocks being released (e.g. when implementing a conforming
version of "free()"), it will be necessary to prepend each block with a
header giving its size, thus adding overhead.
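
A minimal sketch of the zero-overhead case (the caller supplies the
size class when freeing, and the link pointer lives inside the freed
storage itself; the names and class count are assumptions):

#define NUM_CLASSES 12                 // Example class count

// The link lives inside the freed block, so live blocks carry no
// header at all; the caller must free into the class it allocated from.
typedef struct FreeBlock { struct FreeBlock *next; } FreeBlock;

static FreeBlock *free_lists[NUM_CLASSES];

void chunk_free(void *p, int cls)
{
    FreeBlock *b = p;
    b->next = free_lists[cls];
    free_lists[cls] = b;
}

void *chunk_alloc(int cls)
{
    FreeBlock *b = free_lists[cls];
    if (b)
        free_lists[cls] = b->next;
    // else: carve a fresh block out of a new page (not shown)
    return b;
}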
Chris M. Thomasson
2017-05-20 22:17:54 UTC
Post by s***@casperkitty.com
Post by Chris M. Thomasson
Post by s***@casperkitty.com
Post by Mr. Man-wai Chang
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.
Unless an optimizer can somehow manage to break it, one can implement
typedef struct {
void *next;
size_t size } MEM_BLOCK;
There is a way to combine the next pointer into the memory block itself.
Virtually zero overhead. This is using tricks to gain a pointer to the
"super-block" without having to create an extra pointer per block.
If a memory manager requires applications to specify the size of blocks
being freed, and block sizes have sufficient granularity to hold a pointer
and a size, it's pretty easy to reduce per-block overhead to zero by only
keeping track of free blocks. Allocation with zero per-block overhead can
also be accomplished if the granularity only allows space for a pointer,
but at the cost of more code (keep separate free lists for minimal-sized
blocks and larger blocks). If applications are not required to specify
the size of blocks being released (e.g. when implementing a conforming
version of "free()"), it will be necessary to prepend each block with a
header giving its size, thus adding overhead.
The technique basically involves rounding a block's address down to
reach the superblock structure.
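
A sketch of that rounding, assuming every superblock is allocated on a
SUPERBLOCK_SIZE boundary (e.g. via VirtualAlloc or posix_memalign);
the names are illustrative:

#include <stdint.h>

#define SUPERBLOCK_SIZE 65536   // Example; must be a power of two

struct superblock;              // Header at the start of each superblock

// Round a block's address down to its superblock header; no per-block
// back-pointer is needed because superblocks are alignment-addressable.
static struct superblock* super_of(void* block)
{
    return (struct superblock*)
        ((uintptr_t)block & ~(uintptr_t)(SUPERBLOCK_SIZE - 1));
}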
Siri Cruise
2017-05-20 23:06:13 UTC
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
For the allocator I use, free does nothing, and malloc is nontrivial and can
trigger a garbage collection. There is no single strategy 'in general'. If
allocation interests you, you can look up different strategies and actual
implementations.
Post by Rick C. Hodgin
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
The documentation of whatever allocator you use should explain everything
visible to you.