Discussion:
How do the malloc() and free() algorithms work?
Rick C. Hodgin
2017-05-16 19:39:13 UTC
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?

I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.

Thank you,
Rick C. Hodgin
Rick C. Hodgin
2017-05-16 21:02:23 UTC
I ask because I've written one that is performing 3x to 8x faster than MSVC++'s,
and I can't believe it. I am thinking I am doing something wrong.

Mine is semaphore'd, uses VirtualAlloc() with MEM_COMMIT, and is a
three-tiered design to various requested block sizes, supports pre- and
post-guard buffers of N,M bytes each for memory corruption
detection, and a slower variant allows un-committed virtual pages to be
inserted before and after, allowing for memory corruption detection on
the instruction creating the fault (the reason I wrote it, as an aid in
debugging memory errors).

It consumes more memory than malloc, but with the speedup, I think
most would tolerate the trade.

I'm doing stress testing tomorrow and Thursday. I'm very surprised it can
even compete in the same ballpark as malloc() and free(). I would've
figured they were as optimized as could be due to their oft-use.

Thank you,
Rick C. Hodgin
s***@casperkitty.com
2017-05-16 22:00:05 UTC
Post by Rick C. Hodgin
I ask because I've written one that is performing 3x to 8x faster than MSVC++'s,
and I can't believe it. I am thinking I am doing something wrong.
It's possible to optimize memory management code around different usage
patterns. Code which is optimized for the usage patterns that are
actually encountered may be able to perform much faster than code which
is not. In the era when the first versions of MSVC were created, heavy
use of malloc() and free() was practically an admission that one wasn't
interested in performance, since programs with memory management code
tailored to its actual needs could vastly outperform functions that had
to accommodate arbitrary sequences of acquisitions and releases.
Rick C. Hodgin
2017-05-17 00:56:27 UTC
Post by s***@casperkitty.com
Post by Rick C. Hodgin
I ask because I've written one that is performing 3x to 8x faster than MSVC++'s,
and I can't believe it. I am thinking I am doing something wrong.
It's possible to optimize memory management code around different usage
patterns. Code which is optimized for the usage patterns that are
actually encountered may be able to perform much faster than code which
is not.
The tests I'm doing right now are on general-purpose access using the
existing code bases I have. I didn't design the algorithm for a specific
usage pattern. In fact, I didn't design it to be efficient. I simply
wanted a solid debugging tool that was a drop-in replacement for
malloc(), realloc(), and free(). I didn't care about performance
at all, provided it was not 10x slower than malloc(). I needed a way to
find allocated-buffer overruns.

Designing these algorithms, though, I've seen about five different
variations that I could see having advantages in different cases. I
have also considered how some of them could work with a CPU that
had a feature I'd call "stitching": a type of access enabled by a
special load instruction that loads a pointer into a register along
with its stitch buffer, which allows fragmented blocks to appear
contiguous and work together as though they were one larger, linear
block, all through a few CPU resources dedicated to that task.
Post by s***@casperkitty.com
In the era when the first versions of MSVC were created, heavy
use of malloc() and free() was practically an admission that one wasn't
interested in performance, since programs with memory management code
tailored to its actual needs could vastly outperform functions that had
to accommodate arbitrary sequences of acquisitions and releases.
I was not aware of that. I figured malloc() and free() would be among
the most heavily optimized algorithms in existence. I've never looked
into them, but I have seen memcpy() come up from time to time, and I've
seen where it is hand-coded assembly. I just assumed the same for the
core portions of malloc() and free().

Thank you,
Rick C. Hodgin
s***@casperkitty.com
2017-05-17 14:56:45 UTC
Post by Rick C. Hodgin
Post by s***@casperkitty.com
In the era when the first versions of MSVC were created, heavy
use of malloc() and free() was practically an admission that one wasn't
interested in performance, since programs with memory management code
tailored to its actual needs could vastly outperform functions that had
to accommodate arbitrary sequences of acquisitions and releases.
I was not aware of that. I figured malloc() and free() would be among
the most heavily optimized algorithms in existence. I've never looked
into them, but I have seen memcpy() come up from time to time, and I've
seen where it is hand-coded assembly. I just assumed the same for the
core portions of malloc() and free().
If code running on a single thread will be churning many objects of a
particular type which contains a pointer to something of the same type
called "link" [a field that could be used for arbitrary purposes in
allocated objects], performance may be improved substantially by using
an allocator designed for that purpose.

#include <stdlib.h>

/* THING is assumed to be a struct with a pointer to its own type named
   "link" (reused as the free-list link while an object sits in the
   recycle list). */
typedef struct THING THING;
struct THING {
    THING *link;
    /* ...other fields... */
};

THING *thing_recycle = 0;

THING *thing_alloc(void)
{
    THING *ret;
    if (thing_recycle)
    {
        ret = thing_recycle;
        thing_recycle = ret->link;
    }
    else
    {
        ret = malloc(sizeof (THING));
        if (ret)
            ret->link = 0;
    }
    return ret;
}

void thing_free(THING *p)
{
    p->link = thing_recycle;
    thing_recycle = p;
}

void thing_empty_trash(void)
{
    THING *temp;
    while (thing_recycle)
    {
        temp = thing_recycle->link;
        free(thing_recycle);
        thing_recycle = temp;
    }
}

Any storage which was ever allocated for a "THING" would be ineligible for
use as any other type unless/until code calls thing_empty_trash, but the
combined cost to release a thing and reallocate it without an intervening
thing_empty_trash call would, with mild optimization, be three loads,
three stores, and one null check. No general-purpose "malloc" could hope to
come close to that except in cases where a compiler can "see" the entire
lifetime of an allocated object and can see that nothing "unusual" happens
to it.

Incidentally, all that would be necessary to make the code thread-safe
would be to give each thread its own copy of thing_recycle, and make each
thread responsible for calling thing_empty_trash on its own pool of
things. No special handling would be required to accommodate things which
might be allocated on one thread and released on another, since the pools
would have no reason to care about what happens to things after they are
allocated.
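
To illustrate that last point, here is a minimal per-thread sketch of the
same recycler, assuming C11's _Thread_local (the THING payload and exact
layout are my assumptions, not anything from the code above):

#include <stdlib.h>

typedef struct THING THING;
struct THING { THING *link; /* ...payload fields... */ };

/* One recycle list per thread: thing_alloc/thing_free need no locking,
   as long as each thread empties only its own trash. */
static _Thread_local THING *thing_recycle = 0;

THING *thing_alloc(void)
{
    THING *ret = thing_recycle;
    if (ret)
        thing_recycle = ret->link;
    else
        ret = malloc(sizeof *ret);
    return ret;
}

void thing_free(THING *p)
{
    p->link = thing_recycle;
    thing_recycle = p;
}

void thing_empty_trash(void)
{
    while (thing_recycle)
    {
        THING *next = thing_recycle->link;
        free(thing_recycle);
        thing_recycle = next;
    }
}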
Chris M. Thomasson
2017-05-16 22:15:21 UTC
Post by Rick C. Hodgin
I ask because I've written one that is performing 3x to 8x faster than MSVC++'s,
and I can't believe it. I am thinking I am doing something wrong.
Mine is semaphore'd,
Fwiw, here is a fast semaphore:

https://groups.google.com/d/topic/comp.lang.c++/60IpFwqbtu8/discussion

Thanks to Joe Seigh, a very smart guy indeed!
Post by Rick C. Hodgin
uses VirtualAlloc() with MEM_COMMIT, and is a
three-tiered design to various requested block sizes, supports pre- and
post-guard buffers of N,M bytes each for memory corruption
detection, and a slower variant allows un-committed virtual pages to be
inserted before and after, allowing for memory corruption detection on
the instruction creating the fault (the reason I wrote it, as an aid in
debugging memory errors).
It consumes more memory than malloc, but with the speedup, I think
most would tolerate the trade.
I'm doing stress testing tomorrow and Thursday. I'm very surprised it can
even compete in the same ballpark as malloc() and free(). I would've
figured they were as optimized as could be due to their oft-use.
Chris M. Thomasson
2017-05-16 22:06:37 UTC
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
The best-performing allocators use per-thread memory pools with some
wait-free/lock-free magic to allow memory allocated in one thread to be
freed in another. I have some work on this; so does Intel.

https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/300265
(read all!)

https://www.threadingbuildingblocks.org/tutorial-intel-tbb-scalable-memory-allocator

https://groups.google.com/forum/embed/#!topic/comp.programming.threads/gkX0tIWR_Gk
(very old work!)

:^)
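
For what it's worth, a rough sketch of the "freed on another thread" part
using C11 atomics -- a per-thread pool with a lock-free remote-free list
that the owning thread drains. The names and layout are mine, not taken
from any of the allocators linked above:

#include <stdatomic.h>
#include <stdlib.h>

typedef struct block { struct block *next; } block;

typedef struct pool {
    block *local_free;            /* touched only by the owning thread */
    _Atomic(block *) remote_free; /* pushed to by any thread, LIFO     */
} pool;

/* A non-owning thread frees a block by pushing it onto the owner's
   remote list with a compare-and-swap loop. */
static void pool_remote_free(pool *p, block *b)
{
    block *head = atomic_load_explicit(&p->remote_free, memory_order_relaxed);
    do {
        b->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &p->remote_free, &head, b,
                 memory_order_release, memory_order_relaxed));
}

/* The owner atomically takes the whole remote list and splices it into
   its private free list; no locks anywhere. */
static void pool_drain_remote(pool *p)
{
    block *b = atomic_exchange_explicit(&p->remote_free, NULL,
                                        memory_order_acquire);
    while (b) {
        block *next = b->next;
        b->next = p->local_free;
        p->local_free = b;
        b = next;
    }
}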
Chris M. Thomasson
2017-05-16 22:13:37 UTC
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
https://www.bsdcan.org/2006/papers/jemalloc.pdf
Chris M. Thomasson
2017-05-16 22:45:41 UTC
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
Also, take a look at:

https://en.wikipedia.org/wiki/Hoard_memory_allocator

;^)
Robert Wessel
2017-05-17 03:31:02 UTC
On Tue, 16 May 2017 12:39:13 -0700 (PDT), "Rick C. Hodgin"
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
If you install MSVS with the right options, most of the source to the
C library is included.

But the short answer is that it mostly punts to the HeapXxxx
functions.
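
Not the actual CRT source, but a sketch of the general shape of that punt,
using the documented Win32 heap functions:

#include <windows.h>
#include <stddef.h>

/* A malloc/realloc/free that simply forwards to the process heap.
   The real CRT adds bookkeeping, but this is the basic idea. */
void *my_malloc(size_t size)
{
    return HeapAlloc(GetProcessHeap(), 0, size ? size : 1);
}

void *my_realloc(void *p, size_t size)
{
    if (!p)
        return my_malloc(size);
    return HeapReAlloc(GetProcessHeap(), 0, p, size ? size : 1);
}

void my_free(void *p)
{
    if (p)
        HeapFree(GetProcessHeap(), 0, p);
}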
Rick C. Hodgin
2017-05-17 14:22:10 UTC
The algorithm I have is three-tier. It has a root that branches out
to pages, each with a control block indicating what memory is associated
with that page, and those pages in turn branch out to the allocated data.

root ==> page ==> data

There are several roots based on allocation size:

root8 // Anything allocated 8-bytes or smaller
root16 // Anything allocated 16-bytes or smaller
root32 // Anything allocated 32-bytes or smaller
root64 // Anything allocated 64-bytes or smaller
root128 // Anything allocated 128-bytes or smaller
root256 // Anything allocated 256-bytes or smaller
root512 // Anything allocated 512-bytes or smaller
root1024 // Anything allocated 1024-bytes or smaller
root2048 // Anything allocated 2048-bytes or smaller
root4096 // Anything allocated 4096-bytes or smaller
root8192 // Anything allocated 8192-bytes or smaller
root16384 // Anything allocated 16384-bytes or smaller
root_bg // For bigger allocations, or memory with guard bytes

Each root has some entries which allow access:

struct root {
    int   size;
    page* firstPage;
    page* firstFreePage;
    page* lastPage;

    void* (*malloc)(int size);
    void* (*realloc)(void* p, int new_size);
    void  (*free)(void* p);
};

Each allocation page has some allocation slots and basic fields:

struct slot {
    int size;               // Allocated size of memory block
};

struct page {
    slot* slots[N];         // N is determined by page size

    void* base;             // Allocated memory for this page
    page* prev;             // Prev page in linked list
    page* next;             // Next page in linked list
    int   firstFreeSlot;    // First free slot num on this page
};
// Note: Pages are typically 4KB, with sizeof(slot) being 4,
// so page::slots[N] holds (4096 - 16) / 4 = 1020 slots

Each page allocates the full size of its slot allocation, so a
single allocation of 6 bytes, for example, creates the first
page, which is 4KB, and it uses slot 0, leaving 1019 slots still
available for small sizes before more allocation is required.
The page allocates base = malloc(N * root->size);, which would
be base = malloc(1020 * 8); in this case. That allocates 8,160
bytes, which is rounded up to the nearest 4KB block to align
with page sizes.
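
For illustration, here is roughly what that bucket selection and page-size
math looks like in code (the names are mine and purely illustrative, not
the actual implementation):

#include <stddef.h>

#define PAGE_SIZE    4096
#define SMALLEST_BIN 8
#define LARGEST_BIN  16384

// Map a requested size to its root bucket: 0 => root8, 1 => root16, ...
// or -1 for the big/guard root.
static int root_index(size_t size)
{
    size_t bin = SMALLEST_BIN;
    int    idx = 0;
    if (size > LARGEST_BIN)
        return -1;                        // goes to root_bg
    while (bin < size)
    {
        bin <<= 1;
        ++idx;
    }
    return idx;
}

// Bytes to request for one page's worth of slots, rounded up to 4KB.
static size_t page_alloc_size(size_t slot_size, size_t slots_per_page)
{
    size_t raw = slot_size * slots_per_page;   // e.g. 8 * 1020 = 8160
    return (raw + PAGE_SIZE - 1) & ~(size_t)(PAGE_SIZE - 1);   // -> 8192
}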

The reallocate and release/free algorithms traverse the base
ranges and determine whether a pointer is within a page's min..max
range. If it is, they determine the start of the slot (allowing a
pointer to any area of the allocated block to be used to free
the block). If it's a reallocate and there's enough leftover
room, it returns immediately after updating the size. Otherwise,
it allocates a new block, copies the data, and frees the original
block, returning the new pointer. The free algorithm simply releases
whatever's there.
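
A sketch of that ownership lookup (again, illustrative structures and
names only -- not the company code):

#include <stddef.h>

typedef struct page_s {
    struct page_s *prev, *next;
    void          *base;        // start of this page's data block
    size_t         bytes;       // total bytes allocated at base
    size_t         slot_size;   // fixed allocation size for this root
} page_s;

// Walk a root's page list and return the start of the slot owning p,
// so any pointer into a block can be used to free that block.
static void *find_slot_start(page_s *first_page, const void *p)
{
    for (page_s *pg = first_page; pg; pg = pg->next)
    {
        const char *lo = (const char *)pg->base;
        const char *hi = lo + pg->bytes;
        if ((const char *)p >= lo && (const char *)p < hi)
        {
            size_t offset = (size_t)((const char *)p - lo);
            return (char *)pg->base + (offset / pg->slot_size) * pg->slot_size;
        }
    }
    return NULL;    // not owned by this root
}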

For big / guard memory it's a little different. It uses more
comprehensive page and slot structures, containing the base, size,
and pre-guard and post-guard bytes, along with some callback
information, and it has the option to allocate a virtual page that
can be used to capture invalid reads or writes before and after
the memory block. Depending on the size, this may result in a
misaligned pointer when guard pages are used, but the benefit of
being able to trap the exact instruction that is causing the
invalid write is well worth it.

That part of the memory management is about 2x to 6x slower than
the traditional malloc() and free() because of the overhead of
creating all of the virtual pages on every call and the average
misalignment when using guard bytes.
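
For reference, the uncommitted-guard-page trick can be sketched with
VirtualAlloc like this (my illustration of the general technique, not the
post's actual code):

#include <windows.h>
#include <stddef.h>

// Reserve one extra page after the usable block but never commit it, so
// the first read or write past the end faults at the offending instruction.
void *alloc_with_tail_guard(size_t size)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    size_t page   = si.dwPageSize;
    size_t usable = (size + page - 1) & ~(page - 1);

    // Reserve the usable pages plus one guard page of address space.
    char *base = VirtualAlloc(NULL, usable + page, MEM_RESERVE, PAGE_NOACCESS);
    if (!base)
        return NULL;

    // Commit only the usable part; the trailing page stays uncommitted.
    if (!VirtualAlloc(base, usable, MEM_COMMIT, PAGE_READWRITE))
    {
        VirtualFree(base, 0, MEM_RELEASE);
        return NULL;
    }

    // Place the caller's block flush against the guard page, so overrunning
    // even one byte touches uncommitted memory (this is what can leave the
    // returned pointer misaligned, as noted above).
    return base + (usable - size);
}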

It's less than 2,000 lines long, and I'm still stress testing
and debugging it today. This particular version belongs to the
company I work for. I plan on writing an expanded version which
contains the variants I mentioned above, and I will release the
source code for it when completed.

It's working quite well ... much to my amazement. In fact, it's
turned out far better than I expected; I'm rather stunned, to be
honest.

Thank you,
Rick C. Hodgin
Rick C. Hodgin
2017-05-17 16:15:21 UTC
This new memory algorithm has significantly increased the performance
of some of my existing and well-debugged applications. I am floored
by the change from simply moving from one memory algorithm to another.
I'm in debug mode and am seeing almost a 10-fold increase in speed
on algorithms that load and parse data, and on algorithms whose data
exists for relatively short lifetimes. The performance difference is
unbelievable. I haven't even compiled in release mode yet with
optimizations (debug mode is slow, unoptimized code with all kinds
of checks enabled to validate that data isn't being corrupted
accidentally).

-----
Is it common knowledge in 2017 that malloc() and realloc() and free()
are inherently slow in their implementation? I'm thinking that if
some existing common widely-distributed applications switched to a
different heap management subsystem, they too might see staggering
increases in performance with no other changes to their system.

Thank you,
Rick C. Hodgin
s***@casperkitty.com
2017-05-17 17:35:16 UTC
Post by Rick C. Hodgin
Is it common knowledge in 2017 that malloc() and realloc() and free()
are inherently slow in their implementation? I'm thinking that if
some existing common widely-distributed applications switched to a
different heap management subsystem, they too might see staggering
increases in performance with no other changes to their system.
Programmers should be aware that allocation systems that operate on fixed-
sized chunks can vastly outperform those that have to accommodate chunks
of arbitrary size. Unfortunately, there is no standard means by which one
could make a fixed-size block allocator that was usable in type-agnostic
fashion from code that doesn't always write storage before reading it.
Some algorithms can benefit from being able to use a pattern like:

    unsigned index = thing->table1[key];
    if (index < thing->count && thing->table2[index] == key)
        ... Item was found

which would be unaffected by what value was read from any entry of
table1 that it hadn't written with a known value, but such code could
behave in arbitrary fashion if storage was recycled after having been
used as another type. For example, a compiler that could see the
sequence:

    thingAsOtherType->whatever = 12345678;
    free_chunk(thingAsOtherType);
    thing = alloc_chunk();
    thing->count = 0;
    ...
    unsigned index = thing->table1[key];
    if (index < thing->count && thing->table2[index] == key)
        ... Item was found

would be entitled to replace it with:

    thing = (THING_TYPE*)thingAsOtherType;
    thing->count = 0;
    int ok = thing->table1[key] < thing->count;
    thingAsOtherType->whatever = 12345678;   // Moved from above
    if (ok && thing->table2[thing->table1[key]] == key)
        ... Item was found

Such substitution could cause the value of thing->table1[key] used for
indexing to be beyond the range of thing->table2, which could in turn
trigger other arbitrary problems.

Unless one knows that a compiler won't perform that style of aggressive
optimizations, or the compiler provides a way of blocking them, the
only safe way to reuse storage as different types without having to
clear it before use is to use malloc/free.
bartc
2017-05-19 11:17:21 UTC
Post by Rick C. Hodgin
This new memory algorithm has significantly increased the performance
of some of my existing and well-debugged applications. I am floored
by the change from simply moving from one memory algorithm to another.
I'm in debug mode and am seeing almost a 10-fold increase in speed
on algorithms that load and parse data, and on algorithms have data
existing for relatively short lifetimes. It's unbelievable the
performance difference. I haven't even compiled into release mode
yet (with optimizations, debug mode is slow unoptimized code with
all kinds of checks enabled to validate data isn't being corrupted
accidentally).
-----
Is it common knowledge in 2017 that malloc() and realloc() and free()
are inherently slow in their implementation? I'm thinking that if
some existing common widely-distributed applications switched to a
different heap management subsystem, they too might see staggering
increases in performance with no other changes to their system.
Do you have any benchmark code we can try out, using regular
malloc/free? Then we can appreciate what a 10x speed increase might be like!

The program below is the Binary Trees benchmark, which is sensitive to
the quality of malloc and free implementations. But it is limited as a
test because it only allocates and frees large numbers of the same small
allocation (the 'treenode' struct).

What differences does your version make with this? (You can change the
n=15 line for a bigger test.)


//-------------------------------------------

#include <stdio.h>
#include <stdlib.h>

int nallocs;

typedef struct _node {
    struct _node *left;
    struct _node *right;
    int item;
} treenode;

treenode* newtreenode(treenode* left, treenode* right, int item) {
    treenode* t;
    t = malloc(sizeof(treenode));
    ++nallocs;
    t->left = left;
    t->right = right;
    t->item = item;
    return t;
}

int itemcheck(treenode* tree) {
    if (tree->left == NULL)
        return tree->item;
    else
        return tree->item + itemcheck(tree->left) - itemcheck(tree->right);
}

treenode* bottomuptree(int item, int depth) {
    if (depth > 0)
        return newtreenode(bottomuptree(2*item-1, depth-1),
                           bottomuptree(2*item, depth-1), item);
    else
        return newtreenode(NULL, NULL, item);
}

void deletetree(treenode* tree) {
    if (tree->left) {
        deletetree(tree->left);
        deletetree(tree->right);
    }
    free(tree);
}

int main(void) {
    treenode *stretchtree, *longlivedtree, *temptree;
    int n, depth, mindepth, maxdepth, stretchdepth, check, iterations, i;

    n = 15;

    mindepth = 4;
    if ((mindepth+2) > n)
        maxdepth = mindepth+1;
    else
        maxdepth = n;

    stretchdepth = maxdepth+1;

    stretchtree = bottomuptree(0, stretchdepth);
    printf("Stretch tree of depth %d\tcheck: %d\n",
           stretchdepth, itemcheck(stretchtree));
    deletetree(stretchtree);

    longlivedtree = bottomuptree(0, maxdepth);

    depth = mindepth;
    while (depth <= maxdepth) {
        iterations = 1 << (maxdepth-depth+mindepth);
        check = 0;
        for (i = 1; i <= iterations; ++i) {
            temptree = bottomuptree(i, depth);
            check = check + itemcheck(temptree);
            deletetree(temptree);

            temptree = bottomuptree(-i, depth);
            check = check + itemcheck(temptree);
            deletetree(temptree);
        }
        printf("%d\tTrees of depth\t%d\tcheck: %d\n",
               iterations*2, depth, check);
        depth = depth+2;
    }
    printf("%s %d %s %d\n", "long lived tree of depth",
           maxdepth, "\tcheck:", itemcheck(longlivedtree));
    printf("Nallocs= %d\n", nallocs);
}

//-------------------------------------------
--
bartc
Rick C. Hodgin
2017-05-19 11:56:19 UTC
I will benchmark it later tonight. I created the algorithm, but the
source code belongs to the company I work for. I'll have to create
my own variant before I can publish source code.

Thank you,
Rick C. Hodgin
Rick C. Hodgin
2017-05-19 23:05:51 UTC
Post by bartc
The program below is the Binary Trees benchmark, which is sensitive to
the quality of malloc and free implementations. But it is limited as a
test because it only allocates and frees large numbers of the same small
allocation (the 'treenode' struct).
What differences does your version make with this? (You can change the
n=15 line for a bigger test.)
This test exposes a serious flaw in my implementation. Mine takes
4,264ms compared to malloc()'s 862ms, ~5x slower.

I'll add an optimization over the next few days to remove this flaw
and post new results.

My algorithm is ahead on the first pass by 6%, but falls off sharply
as the allocation count grows.

It's an interesting test. Do you have any more?

Thank you,
Rick C. Hodgin
Rick C. Hodgin
2017-05-30 21:08:47 UTC
Post by Rick C. Hodgin
Post by bartc
The program below is the Binary Trees benchmark, which is sensitive to
the quality of malloc and free implementations. But it is limited as a
test because it only allocates and frees large numbers of the same small
allocation (the 'treenode' struct).
What differences does your version make with this? (You can change the
n=15 line for a bigger test.)
This test exposes a serious flaw in my implementation. Mine takes
4,264ms compared to malloc()'s 862ms, ~5x slower.
I'll add an optimization over the next few days to remove this flaw
and post new results.
My algorithm is ahead on the first pass by 6%, but falls off sharply
as the allocation count grows.
I haven't forgotten about this. I've coded the change I needed, but
it's also brought up another performance bottleneck that I'm going to
try to code around as well.

Thank you,
Rick C. Hodgin
Mr. Man-wai Chang
2017-05-18 12:48:25 UTC
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
Depends on the implementation of operating systems? Is it really related
to C? :)
--
@~@ Remain silent! Drink, Blink, Stretch! Live long and prosper!!
/ v \ Simplicity is Beauty!
/( _ )\ May the Force and farces be with you!
^ ^ (x86_64 Ubuntu 9.10) Linux 2.6.39.3
No borrowing or lending! No scams! No compensated dating! No fighting! No robbery! No suicide! Please consider CSSA (Comprehensive Social Security Assistance):
http://www.swd.gov.hk/tc/index/site_pubsvc/page_socsecu/sub_addressesa
Rick C. Hodgin
2017-05-18 14:19:06 UTC
Post by Mr. Man-wai Chang
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
Depends on the implementation of operating systems? Is it really related
to C? :)
The C Standard defines malloc(), realloc(), free(), and others. How
they mechanically apply to a given OS and its set of low-level function
call APIs is another matter. I'm curious about how they are designed
in general (such as what type of model do they use for allocation and
free operations, as I assume there is some core algorithm that's been
developed over time, evolving into the one that's most preferred due
to its general reliability, speed, and minimal fragmentation).

http://www.iso-9899.info/n1570.html

7.22.3 [Memory management functions]
7.22.3.1 [The aligned_alloc function]
7.22.3.2 [The calloc function]
7.22.3.3 [The free function]
7.22.3.4 [The malloc function]
7.22.3.5 [The realloc function]

I think for CAlive I am going to directly define three types of memory:

1) fx-memory (fixed, process-only)
2) mv-memory (movable, process-only)
3) go-memory (global, system-wide)

When allocating with fx_malloc(), you receive a pointer to a fixed
memory block in the local heap. It is pointing directly to the place
where memory is, and you can directly manipulate that data.

When allocating with mv_malloc(), you receive a pointer to a pointer
to a fixed memory block in the local heap. This allows the underlying
memory to be moved around as needed. You reference it the same way in
source code, but you take a mild performance hit because every
reference goes through an extra level of indirection.

When allocating with go_malloc(), you receive a pointer to a global
memory block in the global heap. Other processes can request access
into this heap, including access to the indicated pointer.

The go-memory pointer points directly to the location where the
memory is, and since it is managed by the OS, the underlying virtual
pages behind that address allow the memory to be moved as needed
without the application being aware.
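
As a toy illustration of the mv-memory idea (fx_malloc/mv_malloc/go_malloc
are the proposed CAlive names; the implementation below is purely
hypothetical):

#include <stdlib.h>

// mv-memory: the caller holds a handle (a pointer to a pointer); the
// allocator may move the underlying block and just update the handle.
typedef struct mv_handle { void *block; } mv_handle;

mv_handle *mv_malloc(size_t size)
{
    mv_handle *h = malloc(sizeof *h);
    if (!h) return NULL;
    h->block = malloc(size);
    if (!h->block) { free(h); return NULL; }
    return h;
}

// Every access pays one extra indirection through h->block.
#define MV_DEREF(h, type) ((type *)(h)->block)

// The block may move, but the handle the application holds stays valid.
void *mv_resize(mv_handle *h, size_t new_size)
{
    void *moved = realloc(h->block, new_size);
    if (moved)
        h->block = moved;
    return moved;
}

void mv_free(mv_handle *h)
{
    free(h->block);
    free(h);
}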

Thank you,
Rick C. Hodgin
Malcolm McLean
2017-05-18 14:33:36 UTC
Post by Rick C. Hodgin
The C Standard defines malloc(), realloc(), free(), and others. How
they mechanically apply to a given OS and its set of low-level function
call APIs is another matter. I'm curious about how they are designed
in general (such as what type of model do they use for allocation and
free operations, as I assume there is some core algorithm that's been
developed over time, evolving into the one that's most preferred due
to its general reliability, speed, and minimal fragmentation).
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
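
A small sketch of that scheme -- round the request up to the next power of
two, and on free, mask the pointer down to its page to recover the size
class (a generic illustration, not any particular system):

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

// Round a request up to the next power-of-two size class (minimum 8).
static size_t next_pow2(size_t n)
{
    size_t p = 8;
    while (p < n)
        p <<= 1;
    return p;
}

// Each page serves exactly one size class and records it in a header at
// the start of the page, so free() can recover the allocation size from
// the page address alone.
typedef struct page_header { size_t size_class; } page_header;

static size_t alloc_size_of(void *p)
{
    uintptr_t page = (uintptr_t)p & ~(uintptr_t)(PAGE_SIZE - 1);
    return ((page_header *)page)->size_class;
}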
Mr. Man-wai Chang
2017-05-18 15:32:48 UTC
Post by Malcolm McLean
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
Only very old, primitive operating systems would allow a user
application to manage the kernel's memory. Think memory and stack
protection. MS/DOS maybe? :)
--
@~@ Remain silent! Drink, Blink, Stretch! Live long and prosper!!
/ v \ Simplicity is Beauty!
/( _ )\ May the Force and farces be with you!
^ ^ (x86_64 Ubuntu 9.10) Linux 2.6.39.3
No borrowing or lending! No scams! No compensated dating! No fighting! No robbery! No suicide! Please consider CSSA (Comprehensive Social Security Assistance):
http://www.swd.gov.hk/tc/index/site_pubsvc/page_socsecu/sub_addressesa
Rick C. Hodgin
2017-05-18 15:45:06 UTC
Post by Mr. Man-wai Chang
Post by Malcolm McLean
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
Only very old, primitive operating systems would allow a user
application to manage the kernel's memory. Think memory and stack
protection. MS/DOS maybe? :)
MS-DOS offered the same features. .COM files were automatically
granted access to all available free memory. The typical procedure
was to release that memory keeping only what you knew you needed
(typically 64KB or less), and then allocate blocks as needed with
the appropriate INT 21h call.

But I would imagine that even back then, the malloc() and realloc()
functions would request larger blocks from MS-DOS, and then divvy
them up themselves for the smaller requests.

It's just a guess, but it would seem to be the most efficient way,
and the only way to ensure reliability across various versions of
DOS.

Thank you,
Rick C. Hodgin
Rick C. Hodgin
2017-05-18 15:47:56 UTC
Post by Rick C. Hodgin
Post by Mr. Man-wai Chang
Post by Malcolm McLean
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
Only very old, primitive operating systems would allow a user
application to manage the kernel's memory. Think memory and stack
protection. MS/DOS maybe? :)
MS-DOS offered the same features. .COM files were automatically
granted access to all available free memory. The typical procedure
was to release that memory keeping only what you knew you needed
(typically 64KB or less), and then allocate blocks as needed with
the appropriate INT 21h call.
I meant to also add here that .EXE files only allocated as much
memory as they needed, with the rest automatically remaining free
when the main() function was hit. As such, all allocations
thereafter proceeded similarly in .COM files and .EXE files, with
the only differences being the pointer size and the math done on
pointers to access larger blocks in a segmented memory model.
Post by Rick C. Hodgin
But I would imagine that even back then, the malloc() and realloc()
functions would request larger blocks from MS-DOS, and then divvy
them up themselves for the smaller requests.
It's just a guess, but it would seem to be the most efficient way,
and the only way to ensure reliability across various versions of
DOS.
Thank you,
Rick C. Hodgin
Keith Thompson
2017-05-18 16:24:35 UTC
Malcolm McLean <***@gmail.com> writes:
[...]
Post by Malcolm McLean
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
I don't believe that's universally true.
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Malcolm McLean
2017-05-18 19:18:33 UTC
Post by Keith Thompson
[...]
Post by Malcolm McLean
Modern systems use a power of two pages algorithm.
Each allocation (up to a limit) is rounded up to the nearest power of
two. You then have pages for 8, 16, 32, 64 etc bytes, and you allocate
from that page. To free, you take the page address to tell you how
big the allocation was.
I don't believe that's universally true.
I'm sure it's not.
But that's the core of how most modern systems work. Instead of a flat
memory pool from which you dish out portions, you have a relatively
large number of pages, which are a power of two in size, and you then
chop them up into smaller powers of two. Thus fragmentation is kept to
a minimum, and it is fast to search the free list.
a***@gmail.com
2017-05-19 21:37:23 UTC
The implementation that speeds up free() is the one that first searches the most recently malloc'd memory that was released, as with stack memory, because function calls follow a stack structure and they have to use malloc and free.
Joe Pfeiffer
2017-05-19 22:01:35 UTC
Post by a***@gmail.com
the one that speed free is the one it search first the last malloc
memory released as the stack memory, because functions call follow as
a stack structure and they have to use malloc free
If you mean the functions call malloc() and free() to create their
activation records, no they don't (at least, not in the case of any
compiler I know of). They just move the registers pointing into the
stack.
a***@gmail.com
2017-05-31 07:05:19 UTC
When one has a program with a loop, for example:

for (;;)
{
    a = f(x);
    b = g(a);
    ...
}

and we assume that f, g, ... use malloc()/free(), then what happens is:
malloc() reserves memory for f's variables, and free() frees the memory
for f's variables; malloc() then reserves memory for g's variables, and
free() frees the memory for g's variables; etc. Everyone can see that
free() has to free the last memory that malloc() returned.
Joe Pfeiffer
2017-05-31 16:11:38 UTC
Post by a***@gmail.com
When one has a program with a loop, for example:
for (;;)
{
    a = f(x);
    b = g(a);
    ...
}
and we assume that f, g, ... use malloc()/free(), then what happens is:
malloc() reserves memory for f's variables, and free() frees the memory
for f's variables; malloc() then reserves memory for g's variables, and
free() frees the memory for g's variables; etc. Everyone can see that
free() has to free the last memory that malloc() returned.
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Malcolm McLean
2017-05-31 16:20:09 UTC
Post by Joe Pfeiffer
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Apparently some mainframes do. There is no stack as such, just a series
of linked list memory chunks taken from the heap. Presumably it makes it
easier to have only one type of memory management over the whole machine.
Jerry Stuckle
2017-05-31 17:03:59 UTC
Post by Malcolm McLean
Post by Joe Pfeiffer
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Apparently some mainframes do. There is no stack as such, just a series
of linked list memory chunks taken from the heap. Presumably it makes it
easier to have only one type of memory management over the whole machine.
As well as some other processors which don't have a hardware stack. Not
everything has push/pop stacks.
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
David Kleinecke
2017-05-31 19:14:06 UTC
Post by Jerry Stuckle
Post by Malcolm McLean
Post by Joe Pfeiffer
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Apparently some mainframes do. There is no stack as such, just a series
of linked list memory chunks taken from the heap. Presumably it makes it
easier to have only one type of memory management over the whole machine.
As well as some other processors which don't have a hardware stack. Not
everything has push/pop stacks.
One doesn't need explicit stacks. All one needs is a
big contiguous block of unallocated memory. A software
stack is easy. It is also easy to use one end of the big block
for a stack and the other for a heap. It seems strange that
any other tactic would be used.

Perhaps a segmented memory model.
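
For concreteness, a tiny sketch of that arrangement -- a heap bumping up
from the bottom of one block while a software stack grows down from the
top (purely illustrative):

#include <stddef.h>

static unsigned char arena[64 * 1024];
static size_t heap_top  = 0;              // next free byte from the bottom
static size_t stack_bot = sizeof arena;   // next free byte from the top

void *heap_alloc(size_t n)
{
    if (stack_bot - heap_top < n)
        return NULL;                      // heap met the stack
    void *p = &arena[heap_top];
    heap_top += n;
    return p;
}

void *stack_push(size_t n)
{
    if (stack_bot - heap_top < n)
        return NULL;                      // stack met the heap
    stack_bot -= n;
    return &arena[stack_bot];
}

void stack_pop(size_t n)
{
    stack_bot += n;
}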
Jerry Stuckle
2017-05-31 20:11:22 UTC
Post by David Kleinecke
Post by Jerry Stuckle
Post by Malcolm McLean
Post by Joe Pfeiffer
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Apparently some mainframes do. There is no stack as such, just a series
of linked list memory chunks taken from the heap. Presumably it makes it
easier to have only one type of memory management over the whole machine.
As well as some other processors which don't have a hardware stack. Not
everything has push/pop stacks.
One doesn't need explicit stacks. All one needs is a
big contiguous block of unallocated memory. A software
stack is easy. It also easy to use one end of the big block
for a stack and the other for a heap. It seems strange why
any other tactic would be used.
Perhaps a segmented memory model.
Well, among other reasons, having just one contiguous block of memory
can waste memory if it's not needed - important in memory-constrained
systems. And if you make it too small, it may not be large enough,
especially if you might have deep recursion.

Not every system is blessed with gigabytes of memory. Many have only a
few K or tens of Ks.
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
David Kleinecke
2017-06-01 03:23:20 UTC
Post by Jerry Stuckle
Post by David Kleinecke
Post by Jerry Stuckle
Post by Malcolm McLean
Post by Joe Pfeiffer
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Apparently some mainframes do. There is no stack as such, just a series
of linked list memory chunks taken from the heap. Presumably it makes it
easier to have only one type of memory management over the whole machine.
As well as some other processors which don't have a hardware stack. Not
everything has push/pop stacks.
One doesn't need explicit stacks. All one needs is a
big contiguous block of unallocated memory. A software
stack is easy. It also easy to use one end of the big block
for a stack and the other for a heap. It seems strange why
any other tactic would be used.
Perhaps a segmented memory model.
Well, among other reasons, having just one contiguous block of memory
can waste memory if it's not needed - important in memory-constrained
systems. And if you make it too small, it may not be large enough,
especially if you might have deep recursion.
Not every system is blessed with gigabytes of memory. Many have only a
few K or tens of Ks.
Well, the usual situation is that that memory-block is all the
memory the computer offers (after loading everything). If one is
mono-processing (which I expect in small systems) nothing is
wasted.
Jerry Stuckle
2017-06-01 15:13:35 UTC
Post by David Kleinecke
Post by Jerry Stuckle
Post by David Kleinecke
Post by Jerry Stuckle
Post by Malcolm McLean
Post by Joe Pfeiffer
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Apparently some mainframes do. There is no stack as such, just a series
of linked list memory chunks taken from the heap. Presumably it makes it
easier to have only one type of memory management over the whole machine.
As well as some other processors which don't have a hardware stack. Not
everything has push/pop stacks.
One doesn't need explicit stacks. All one needs is a
big contiguous block of unallocated memory. A software
stack is easy. It also easy to use one end of the big block
for a stack and the other for a heap. It seems strange why
any other tactic would be used.
Perhaps a segmented memory model.
Well, among other reasons, having just one contiguous block of memory
can waste memory if it's not needed - important in memory-constrained
systems. And if you make it too small, it may not be large enough,
especially if you might have deep recursion.
Not every system is blessed with gigabytes of memory. Many have only a
few K or tens of Ks.
Well, the usual situation is that that memory-block is all the
memory the computer offers (after loading everything). If one is
mono-processing (which I expect in small systems) nothing is
wasted.
No, it isn't. Even small systems often have multiple programs running.
And the OS will allocate and free memory for its needs, independent of
what other programs are doing.
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
Joe Pfeiffer
2017-06-01 03:50:45 UTC
Post by David Kleinecke
Post by Jerry Stuckle
Post by Malcolm McLean
Post by Joe Pfeiffer
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Apparently some mainframes do. There is no stack as such, just a series
of linked list memory chunks taken from the heap. Presumably it makes it
easier to have only one type of memory management over the whole machine.
As well as some other processors which don't have a hardware stack. Not
everything has push/pop stacks.
One doesn't need explicit stacks. All one needs is a
big contiguous block of unallocated memory. A software
stack is easy. It also easy to use one end of the big block
for a stack and the other for a heap. It seems strange why
any other tactic would be used.
Perhaps a segmented memory model.
Glad I'm not the only one. This notion of "well, there's no hardware
stack so we'll jump right over to allocating everything on the heap"
sounds crazy to me. I note that no one suggesting that mainframes and
"other processors" do this has presented a concrete example.
Ben Bacarisse
2017-06-01 10:28:39 UTC
Post by Joe Pfeiffer
Post by David Kleinecke
Post by Jerry Stuckle
Post by Malcolm McLean
Post by Joe Pfeiffer
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Apparently some mainframes do. There is no stack as such, just a series
of linked list memory chunks taken from the heap. Presumably it makes it
easier to have only one type of memory management over the whole machine.
As well as some other processors which don't have a hardware stack. Not
everything has push/pop stacks.
One doesn't need explicit stacks. All one needs is a
big contiguous block of unallocated memory. A software
stack is easy. It also easy to use one end of the big block
for a stack and the other for a heap. It seems strange why
any other tactic would be used.
Perhaps a segmented memory model.
Glad I'm not the only one. This notion of "well, there's no hardware
stack so we'll jump right over to allocating everything on the heap"
sounds crazy to me. I note that no one suggesting mainframes and "other
processors" do this haven't presented a concrete example.
See "IBM System/360 Principles of Operation" or the same for S/370 for a
concrete example. It's crazy, but that's the way the world was for some
systems in the 1960. The design then hangs around like a bad smell
because compatibility. I've had nothing to with IBM mainframes for
decades, but I would not be surprised to find modern IBM hardware that
will at least emulate this design.

There is a long-running war (well, more like a series of skirmishes
really) in the industry between "real programming" and "academic
nonsense". Having an actual stack was, once, academic nonsense. No
"real programmer" wrote recursive functions so you might as well have a
static area save area for registers and local variables. This can be
used to link to the previous save area with IBM's classic "Branch and
Link" instruction. If this chain really must be dynamic, then you can
allocate these saves areas. Virtual memory can take care of shared
code.

Obviously not all academic ideas deserve to be taken seriously, but I am
struck, time and time again, how often this story plays out. C and Unix
were once used almost exclusively in academia or at least in research
labs. Now they are for "real programming". There's a lot of money to
be made by know where the font line will be moving to next. Google was
once a student's PhD thesis.
--
Ben.
s***@casperkitty.com
2017-06-01 14:40:36 UTC
Post by Ben Bacarisse
There is a long-running war (well, more like a series of skirmishes
really) in the industry between "real programming" and "academic
nonsense".
On a related front, there was a similar war between microprocessors and
real computers. The issue wasn't merely one of pride--the transistors on
ICs had really anemic current-handling ability, and could not switch as
fast as discrete transistors. The PDP-11 (the machine for which C was
first designed) was interesting because it wasn't a microprocessor, but
many design aspects were imitated by the vast majority of microprocessors
since (data types with a power-of-two number of octets, math which operated
on a linear 2^N basis, stack-based subroutines, etc.)
Scott Lurndal
2017-06-01 15:36:24 UTC
Post by s***@casperkitty.com
Post by Ben Bacarisse
There is a long-running war (well, more like a series of skirmishes
really) in the industry between "real programming" and "academic
nonsense".
On a related front, there was a similar war between microprocessors and
real computers. The issue wasn't merely one of pride--the transistors on
ICs had really anemic current-handling ability, and could not switch as
fast as discrete transistors. The PDP-11 (the machine for which C was
first designed) was interesting because it wasn't a microprocessor, but
many design aspects were imitated by the vast majority of microprocessors
since (data types with a power-of-two number of octets, math which operated
on a linear 2^N basis, stack-based subroutines, etc.)
Most of those features weren't unique to the PDP-11, at the time it
was designed.
Jerry Stuckle
2017-06-01 15:14:16 UTC
Post by Joe Pfeiffer
Post by David Kleinecke
Post by Jerry Stuckle
Post by Malcolm McLean
Post by Joe Pfeiffer
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Apparently some mainframes do. There is no stack as such, just a series
of linked list memory chunks taken from the heap. Presumably it makes it
easier to have only one type of memory management over the whole machine.
As well as some other processors which don't have a hardware stack. Not
everything has push/pop stacks.
One doesn't need explicit stacks. All one needs is a
big contiguous block of unallocated memory. A software
stack is easy. It also easy to use one end of the big block
for a stack and the other for a heap. It seems strange why
any other tactic would be used.
Perhaps a segmented memory model.
Glad I'm not the only one. This notion of "well, there's no hardware
stack so we'll jump right over to allocating everything on the heap"
sounds crazy to me. I note that no one suggesting mainframes and "other
processors" do this haven't presented a concrete example.
IBM Mainframes, for example. No hardware stack.
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
Robert Wessel
2017-06-01 16:36:12 UTC
On Wed, 31 May 2017 21:50:45 -0600, Joe Pfeiffer
Post by Joe Pfeiffer
Post by David Kleinecke
Post by Jerry Stuckle
Post by Malcolm McLean
Post by Joe Pfeiffer
I am not aware of any compiler that uses malloc() to allocate stack
space for local variables.
Apparently some mainframes do. There is no stack as such, just a series
of linked list memory chunks taken from the heap. Presumably it makes it
easier to have only one type of memory management over the whole machine.
As well as some other processors which don't have a hardware stack. Not
everything has push/pop stacks.
One doesn't need explicit stacks. All one needs is a
big contiguous block of unallocated memory. A software
stack is easy. It also easy to use one end of the big block
for a stack and the other for a heap. It seems strange why
any other tactic would be used.
Perhaps a segmented memory model.
Glad I'm not the only one. This notion of "well, there's no hardware
stack so we'll jump right over to allocating everything on the heap"
sounds crazy to me. I note that no one suggesting mainframes and "other
processors" do this haven't presented a concrete example.
IBM's "Metal C" variant of their mainframe C/C++ compiler uses a
modified version of that by default. Somewhat oversimplified, rather
than allocating the activation record on the heap for each function,
it allocates small areas that can be individually be used as a fairly
normal stack, but extends those as a linked list as needed. This
produces a hybrid between conventional C, stack-like, allocations, and
the normal system (linked-list style) calling conventions - any call
using the normal system conventions ends the use of the current
"stack-let", but if you use the Metal C convention, allocation
frequency is much reduced.

But the main thing is that the stack is *not* contiguous.
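
Not IBM's actual Metal C machinery, but a rough C sketch of the "stack-let"
idea: bump-allocate frames inside a segment, and chain in a new
heap-allocated segment when the current one runs out, so the stack as a
whole is a linked list rather than one contiguous block:

#include <stdlib.h>

// A discontiguous stack built from heap-allocated segments ("stack-lets").
typedef struct stacklet {
    struct stacklet *prev;
    size_t           used;
    size_t           size;
    unsigned char    data[];      // C99 flexible array member
} stacklet;

static stacklet *current = NULL;

// Allocate a frame, starting a new segment if the current one is full.
static void *push_frame(size_t frame_size)
{
    if (!current || current->used + frame_size > current->size)
    {
        size_t seg_size = frame_size > 4096 ? frame_size : 4096;
        stacklet *s = malloc(sizeof *s + seg_size);
        if (!s)
            return NULL;
        s->prev = current;
        s->used = 0;
        s->size = seg_size;
        current = s;
    }
    void *frame = current->data + current->used;
    current->used += frame_size;
    return frame;
}

// Frames must be popped in LIFO order with the same size they were pushed.
static void pop_frame(size_t frame_size)
{
    current->used -= frame_size;
    if (current->used == 0 && current->prev)      // segment empty: unlink it
    {
        stacklet *dead = current;
        current = current->prev;
        free(dead);
    }
}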
Gordon Burditt
2017-06-05 06:11:12 UTC
Post by Joe Pfeiffer
Glad I'm not the only one. This notion of "well, there's no hardware
stack so we'll jump right over to allocating everything on the heap"
sounds crazy to me. I note that no one suggesting mainframes and "other
processors" do this haven't presented a concrete example.
Ok, here's an example for IBM/360 MVS circa 1974. C didn't exist
yet (at least not outside Bell Labs) but the linkage conventions
for the OS were very established. Things like formatted core dumps
depended on them. One of the big problems with trying to use a
register as a hardware stack is that the "register plus offset"
addressing mode only allowed for 12 bits of offset, allowing for a
stack frame of only 4096 bytes. I suppose you could compute the
address with code, but it slows down the code and makes it bigger.

Save areas were kept in double-linked lists. If you wanted, you could
use register 13 (the save area pointer) as a base register to address
your extensions to the save area; however, if you allocate one static
save area for each function, it's easier and possibly a few clock
cycles faster to just address it as SAVEAREA+N.

Dynamic save area allocation is required if the subroutine is not
a "leaf" (calling no other routines) and the subroutine might call
itself recursively (which also includes recursion through signal
handlers). Otherwise, each subroutine allocating its own save area
is sufficient.

Back when I was doing some OS/360 assembly-language coding, there
were REENTER and REEXIT macros that handled all this for
possibly-reentrant calls. I do not know whether these were official
IBM macros or whether these were local to my college data center.

Linkage conventions:

A subroutine call was usually done with code like this:

LA 1, ... Load optional parameters
LA 0, ... Load optional parameters
LA 15,MYSUBR Load routine address
BALR 14,15 Call it, putting return address in register 14
... Control returns here.

At entry to a subroutine, the registers should be:

0,1: Optional parameters
13: Address of caller save area
14: Return address to calling routine
15: Entry point of called routine (can be used as base register)

Inside the subroutine (assuming it's not a "leaf" routine that calls nothing
else), after the initial linkage, the registers should be:
13: Address of routine's own save area
14: Return address to calling routine
15: Entry point of called routine
Links between the routine's own save area and the caller's save area
should be set up, in both directions.

At exit from a subroutine, the registers should be:
Registers except 15, and possibly 0 and 1, should be restored.
0,1: Optional returned results
13: Address of caller save area
14: Return address to calling routine
15: Exit status. Typically 0 = OK, 4 = Warning, 8 = Error, 12 = Severe Error
In terms of a compiler, 0 is a successful compile, 4 is a successful compile
with warnings, 8 is a syntax error, and 12 is an "INTERNAL COMPILER ERROR"
message: pieces of the compiler are not installed, memory allocation failed,
out of disk space, etc.

The entry prologue has the jobs of:
- Save registers 14-15, 0-12 in the caller's save area.
- If the function is using dynamic save area allocation:
- GETMAIN a new save area (at least 18*4 bytes long)
As I recall, there was some form of GETMAIN that did not
require a save area already be set up, so it wouldn't clobber
the registers you just saved.
- Load the address of the new save area into a register.
- else if the function is using static save area allocation:
- Load the address of the save area into a register.
- endif
- Set up links in both directions between the caller's save area and
the newly allocated one.
- Put the new save area address in register 13.

The exit routine has the job of:
- Restore the save area pointer to the caller's save area.
- If the function is using dynamic save area allocation:
- FREEMAIN the save area
- endif
- Restore registers 14-15, 0-12 from the caller's save area.
- Load the exit status into register 15.
- Optionally load return results into registers 0 and 1.
- Branch indirect through register 14 to return to the caller.


Save area format - a standard save area is 18 4-byte words long
Save area + 0: Not sure what this was used for
Save area + 4: Pointer to upper (caller) save area
Save area + 8: Pointer to lower (called from this routine) save area
Save area + 12: Saved R14 (return address to calling routine)
Save area + 16: Saved R15 (entry point to called routine)
Save area + 20: Saved R0
Save area + 24: Saved R1
Save area + 28: Saved R2
Save area + 32: Saved R3
Save area + 36: Saved R4
Save area + 40: Saved R5
Save area + 44: Saved R6
Save area + 48: Saved R7
Save area + 52: Saved R8
Save area + 56: Saved R9
Save area + 60: Saved R10
Save area + 64: Saved R11
Save area + 68: Saved R12
Save area + 72: end (You could add C local variables here)


And here's the subroutine:

MYSUBR CSECT
STM 14,12,12(13) Save registers 14-15,0-12 in caller's save area
LR 11, 15 Establish a base register for MYSUBR's code
USING MYSUBR,11

If you're using static save areas:
ST 13,MYSAVE+4 Link save areas
LA 10,MYSAVE
ST 10, 8(0,13)
LR 13,10
If you're using dynamic save areas:
GETMAIN LV=72,SP=12 Allocate a new save area (abort on fail)
I don't recall the exact details of this call.
LR 10,1 Get new save area address
ST 13, 4(10) Link save areas
ST 10,8(0,13)
LR 13,10
End conditional

... code to do something useful ...


DONE L 10,4(13) Get back caller's save area
If you're using dynamic save areas:
LR 1,13
FREEMAIN R=(1),SP=12 Free the save area
I don't recall the exact details of this call
End conditional
LR 13,10
LM 14, 12, 12(13) Restore registers
SR 15,15 Return successful status
BR 14 Return
s***@casperkitty.com
2017-06-05 15:01:26 UTC
Post by Gordon Burditt
Ok, here's an example for IBM/360 MVS circa 1974. C didn't exist
yet (at least not outside Bell Labs) but the linkage conventions
for the OS were very established. Things like formatted core dumps
depended on them. One of the big problems with trying to use a
register as a hardware stack is that the "register plus offset"
addressing mode only allowed for 12 bits of offset, allowing for a
stack frame of only 4096 bytes. I suppose you could compute the
address with code, but it slows down the code and makes it bigger.
How often would one need more than 4096 bytes of non-array objects within
a stack frame? If a compiler identifies everything that will be in a
function's stack frame before it generates code for it and places small
objects first, it should be practical to support things beyond that.

Incidentally, the first compiler bug I ran into was in Macintosh
Programmer's Workshop, when using a mode to allow stack frames over 32K
(same issue). The generated code was almost fine, except that frames
between 32768 and 65535 bytes were initiated with an SUBA.W instruction
which would subtract a *sign-extended* 16-bit value from the stack pointer
(oops).
Gordon Burditt
2017-06-06 04:49:50 UTC
Post by s***@casperkitty.com
Post by Gordon Burditt
Ok, here's an example for IBM/360 MVS circa 1974. C didn't exist
yet (at least not outside Bell Labs) but the linkage conventions
for the OS were very established. Things like formatted core dumps
depended on them. One of the big problems with trying to use a
register as a hardware stack is that the "register plus offset"
addressing mode only allowed for 12 bits of offset, allowing for a
stack frame of only 4096 bytes. I suppose you could compute the
address with code, but it slows down the code and makes it bigger.
How often would one need more than 4096 bytes of non-array objects within
a stack frame?
You need to consider structs with arrays inside them, which is
almost the same case as arrays, except structs don't allow re-ordering
the members. (Did the DCB macro expand to more than 4096 bytes in
any commonly-used situation? )

You still lose a couple of clock cycles (which is a VERY BIG DEAL!)
every time you refer to an array element past the 4096 offset with
a compile-time-constant subscript. The same applies to any
non-array variables that end up past the 4096 offset, but I don't
expect that to happen often - which opens up the possibility of
untested special cases (with bugs) in the generated code, like
the one you found in the MAC compiler below.
Post by s***@casperkitty.com
If a compiler identifies everything that will be in a
function's stack frame before it generates code for it and places small
objects first, if should be practical to support things beyond.
If it is considered *politically* practical to shoot (or at least
gag) the guy who says "but it still takes 200% more instructions"
about the same 2 extra instructions for the 10th time, to charge a
lot extra for seeing the compiler-generated code so no one can
afford to see it (so they can't complain about it), or ban counting
clock cycles or instructions, it might be practical to support
things beyond. Otherwise, you may never get to run it.

I am reminded that one instructor took off points for using any but
his favorite instruction (might have been XR 2,2 - bitwise XOR
register 2 with itself) for the *FIRST* exam for an "Assembly
Language 101" (introductory) course, after a class discussion about
the fastest ways to zero a register, with the conclusion that which
instruction was fastest depended on the CPU model, which wasn't
specified on the exam. Further, "the system our computer lab has"
was somewhat up in the air as it was being upgraded during the week
of the exam. Interestingly, the class textbook didn't mention clock
counts of instructions except for occasional comments like "using
an index register may take more clock cycles than not using one".

Some of the students taking that class were in it only because the
prerequisites had been rearranged and they suddenly needed it for
the advanced courses. Some of them were really good with assembly
language and machine language: They could disassemble code in their
head from core dumps (and often find bugs). They could generate
hex object code faster than the assembler could (running the assembler
was a batch job and could take hours to come back). Patch object decks
(punch cards! Did you know you could represent 1 byte with 1 column
of a punch card?) with a keypunch. Disassemble code in an object
deck (punch cards!). Scan 20 pages of core dump in about 20 seconds
and spot the one instruction that's not supported on the machine
it was run on, even though that part of the code hadn't been
executed yet.
Post by s***@casperkitty.com
Incidentally, the first compiler bug I ran into was in Macintosh
Programmer's Workshop, when using a mode to allow stack frames over 32K
(same issue). The generated code was almost fine, except that frames
between 32768 and 65535 bytes were initiated with an SUBA.W instruction
which would subtract a *sign-extended* 16-bit value from the stack pointer
(oops).
s***@casperkitty.com
2017-06-06 17:13:07 UTC
Reply
Permalink
Raw Message
Post by Gordon Burditt
Post by s***@casperkitty.com
How often would one need more than 4096 bytes of non-array objects within
a stack frame?
You need to consider structs with arrays inside them, which is
almost the same case as arrays, except structs don't allow re-ordering
the members. (Did the DCB macro expand to more than 4096 bytes in
any commonly-used situation? )
Such things are sometimes used, but the members of such structures tend not
to be accessed as heavily as "standalone" scalar objects.
Post by Gordon Burditt
You still lose a couple of clock cycles (which is a VERY BIG DEAL!)
every time you refer to an array element past the 4096 offset with
a compile-time-constant subscript. The same applies to any
non-array variables that end up past the 4096 offset, but I don't
expect that to happen often - which opens up the possibility of
untested special cases (with bugs) in the generated code, like
the one you found in the MAC compiler below.
Lowercase "AC". The Macintosh Programmer's Workshop compiler that my
employer acquired around 1990. Obviously if a compiler is going to try
to support such constructs, it needs to be tested better than the MPW
one was. On the other hand, unless a direct-addressing instruction can
access a wider range of addresses than a register-displacement instruction
*of the same length*, use of stack-frame variables will often be no less
efficient than use of absolute addresses, and may be more efficient than
using absolute addresses if the latter require bigger instructions (as on
the 68000) or extra instructions (as on the ARM).
Post by Gordon Burditt
If it is considered *politically* practical to shoot (or at least
gag) the guy who says "but it still takes 200% more instructions"
about the same 2 extra instructions for the 10th time, to charge a
lot extra for seeing the compiler-generated code so no one can
afford to see it (so they can't complain about it), or ban counting
clock cycles or instructions, it might be practical to support
things beyond. Otherwise, you may never get to run it.
Most functions have a few variables that get accessed *a lot* more than
the rest. If a function makes 10,000 non-variably-subscripted accesses
to variables within the first 4K, and 1,000 such accesses to everything
else combined, an extra couple of cycles added to each of those 1,000
accesses isn't going to matter. On platforms where the register-
displacement mode was limited to a much smaller displacement (like 15
bytes, or 0 bytes), the use of absolute addressing for automatic variables
would make sense. I would think that if a platform supports register-
displacement addressing with a 4K range for the same cost as absolute
addressing, and would be efficient at setting up stack frames, use of
such addressing would generally be advantageous.
Post by Gordon Burditt
Did you know you could represent 1 byte with 1 column
of a punch card?) with a keypunch. Disassemble code in an object
deck (punch cards!). Scan 20 pages of core dump in about 20 seconds
and spot the one instruction that's not supported on the machine
it was run on, even though that part of the code hadn't been
executed yet.
I've used 80-column cards. I found it curious that keypunches didn't
offer any remotely-convenient means of moving the current card without
moving the previous card, or vice versa. I appreciate that for many
purposes one would want to ensure that they stay in sync, but an
ability to insert text by locking the "previous" card while typing on
the current card would have been handy. Insertion could be accommodated,
IIRC, by holding the previous card with a death grip, but only if one
was planning to discard that previous card afterward (since it would often
get a bit mangled).
David Kleinecke
2017-06-06 23:50:30 UTC
Reply
Permalink
Raw Message
Post by Gordon Burditt
Post by s***@casperkitty.com
Post by Gordon Burditt
Ok, here's an example for IBM/360 MVS circa 1974. C didn't exist
yet (at least not outside Bell Labs) but the linkage conventions
for the OS were very established. Things like formatted core dumps
depended on them. One of the big problems with trying to use a
register as a hardware stack is that the "register plus offset"
addressing mode only allowed for 12 bits of offset, allowing for a
stack frame of only 4096 bytes. I suppose you could compute the
address with code, but it slows down the code and makes it bigger.
How often would one need more than 4096 bytes of non-array objects within
a stack frame?
You need to consider structs with arrays inside them, which is
almost the same case as arrays, except structs don't allow re-ordering
the members. (Did the DCB macro expand to more than 4096 bytes in
any commonly-used situation? )
You still lose a couple of clock cycles (which is a VERY BIG DEAL!)
every time you refer to an array element past the 4096 offset with
a compile-time-constant subscript. The same applies to any
non-array variables that end up past the 4096 offset, but I don't
expect that to happen often - which opens up the possibility of
untested special cases (with bugs) in the generated code, like
the one you found in the MAC compiler below.
Post by s***@casperkitty.com
If a compiler identifies everything that will be in a
function's stack frame before it generates code for it and places small
objects first, if should be practical to support things beyond.
If it is considered *politically* practical to shoot (or at least
gag) the guy who says "but it still takes 200% more instructions"
about the same 2 extra instructions for the 10th time, to charge a
lot extra for seeing the compiler-generated code so no one can
afford to see it (so they can't complain about it), or ban counting
clock cycles or instructions, it might be practical to support
things beyond. Otherwise, you may never get to run it.
I am reminded that one instructor took off points for using any but
his favorite instruction (might have been XR 2,2 - bitwise XOR
register 2 with itself) for the *FIRST* exam for an "Assembly
Language 101" (introductory) course, after a class discussion about
the fastest ways to zero a register, with the conclusion that which
instruction was fastest depended on the CPU model, which wasn't
specified on the exam. Further, "the system our computer lab has"
was somewhat up in the air as it was being upgraded during the week
of the exam. Interestingly, the class textbook didn't mention clock
counts of instructions except for occasional comments like "using
an index register may take more clock cycles than not using one".
Some of the students taking that class were in it only because the
pre-requesites had been re-arranged and they suddenly needed it for
the advanced courses. Some of them were really good with assembly
language and machine language: They could disassemble code in their
head from core dumps (and often find bugs). They could generate
hex object code faster than the assembler could (running the assembler
was a batch job and could take hours to come back). Patch object decks
(punch cards! Did you know you could represent 1 byte with 1 column
of a punch card?) with a keypunch. Disassemble code in an object
deck (punch cards!). Scan 20 pages of core dump in about 20 seconds
and spot the one instruction that's not supported on the machine
it was run on, even though that part of the code hadn't been
executed yet.
Post by s***@casperkitty.com
Incidentally, the first compiler bug I ran into was in Macintosh
Programmer's Workshop, when using a mode to allow stack frames over 32K
(same issue). The generated code was almost fine, except that frames
between 32768 and 65535 bytes were initiated with an SUBA.W instruction
which would subtract a *sign-extended* 16-bit value from the stack pointer
(oops).
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
Joe Pfeiffer
2017-06-07 03:33:01 UTC
Reply
Permalink
Raw Message
Post by David Kleinecke
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
That struck you as a better idea than retyping the card?
Robert Wessel
2017-06-07 05:37:32 UTC
Reply
Permalink
Raw Message
On Tue, 06 Jun 2017 21:33:01 -0600, Joe Pfeiffer
Post by Joe Pfeiffer
Post by David Kleinecke
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
That struck you as a better idea than retyping the card?
Especially if the card had binary data on it, sometimes people found
it simpler to cover the hole and dup the card, rather than trying to
dup the card column-by-column and then get the punches for the
modified column just right. OTOH, punching a binary column was not
*that* hard, your trusty green or yellow card had the punch
combinations for all 256 values on it (you just had to remember that
the 11 and 12 punches were the dash and ampersand).
David Kleinecke
2017-06-08 00:13:00 UTC
Reply
Permalink
Raw Message
Post by Joe Pfeiffer
Post by David Kleinecke
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
That struck you as a better idea than retyping the card?
Yes. There are a lot of holes on an IBM card, and this
being a binary deck there was no easy way to make the holes because the
keypunch machines expected character input.
Joe Pfeiffer
2017-06-08 03:35:33 UTC
Reply
Permalink
Raw Message
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
That struck you as a better idea than retyping the card?
Yes. There are a lot of holes on an IBM card and this
being a binary no easy way to make the holes because the
key punch machines expected character input.
I'll take your word for it. It wouldn't take needing to tape much chad
back in before repunching (remember, there was the multipunch key) would
sound like a better bet to me...
David Kleinecke
2017-06-08 05:06:58 UTC
Reply
Permalink
Raw Message
Post by Joe Pfeiffer
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
That struck you as a better idea than retyping the card?
Yes. There are a lot of holes on an IBM card and this
being a binary no easy way to make the holes because the
key punch machines expected character input.
I'll take your word for it. It wouldn't take needing to tape much chad
back in before repunching (remember, there was the multipunch key) would
sound like a better bet to me...
You sound like you never worked with IBM cards.
Joe Pfeiffer
2017-06-13 14:45:48 UTC
Reply
Permalink
Raw Message
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
That struck you as a better idea than retyping the card?
Yes. There are a lot of holes on an IBM card and this
being a binary no easy way to make the holes because the
key punch machines expected character input.
I'll take your word for it. It wouldn't take needing to tape much chad
back in before repunching (remember, there was the multipunch key) would
sound like a better bet to me...
You sound like you never worked with IBM cards.
Never binary, but yes, source code. Fortunately I'm young enough that
it was only for a few years, though.
Jerry Stuckle
2017-06-08 12:36:46 UTC
Reply
Permalink
Raw Message
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
That struck you as a better idea than retyping the card?
Yes. There are a lot of holes on an IBM card and this
being a binary no easy way to make the holes because the
key punch machines expected character input.
IBM keypunches also had a "hold" button which would allow you to punch
multiple holes in a column. You could also punch individual rows in a
column, i.e. '0' would punch row 0, '1' would punch row 1, etc.

They also had a feature to copy as much or as little of a card as you
want, correct the error, and continue copying (if you wish).

With this combination you could punch any combination of holes in any
column. Filling the hole with a chad could cause problems with the card
reader due to the tape or glue used to hold the chad in.
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
s***@casperkitty.com
2017-06-08 14:51:43 UTC
Reply
Permalink
Raw Message
Post by Jerry Stuckle
IBM Keypunches also a "hold" button which would allow you to punch
multiple holes in a column. You could also punch individual rows in a
column, i.e. '0' would punch row 0, '1' would punch row 1, etc.
They also had a feature to copy as much or as little of a card as you
want, correct the error, and continue copying (if you wish).
With this combination you could punch any combination of holes in any
column. Filling the hole with a chad could cause problems with the card
reader due to the tape or glue used to hold the chad in.
I wouldn't expect people to fill in individual holes. What I would expect
would be more common would be to blank out rows or half-rows. Text
characters are punched vertically, so each column of twelve positions would
represent one character. Machine code, however, is read horizontally, with
(IIRC) one instruction in columns 1-36 and a second in columns 37-72 (the
machine code loader ignores columns 73-80, allowing them to be used for
card-identification or sorting purposes). Running a strip of tape so it
covers up the first 72 columns of a row, duplicating the contents of the
other 11 rows, and then punching the new contents for that row using one
keystroke per bit, seems much easier than having to manually duplicate the
contents of all rows.
Jerry Stuckle
2017-06-08 15:19:40 UTC
Reply
Permalink
Raw Message
Post by s***@casperkitty.com
Post by Jerry Stuckle
IBM Keypunches also a "hold" button which would allow you to punch
multiple holes in a column. You could also punch individual rows in a
column, i.e. '0' would punch row 0, '1' would punch row 1, etc.
They also had a feature to copy as much or as little of a card as you
want, correct the error, and continue copying (if you wish).
With this combination you could punch any combination of holes in any
column. Filling the hole with a chad could cause problems with the card
reader due to the tape or glue used to hold the chad in.
I wouldn't expect people to fill in individual holes. What I would expect
would be more common would be to blank out rows or half-rows. Text
characters are punched vertically, so each column of twelve positions would
represent one character. Machine code, however, is read horizontally, with
(IIRC) one instruction in columns 1-36 and a second in columns 37-72 (the
machine code loader ignores columns 73-80, allowing them to be used for
card-identification or sorting purposes). Running a strip of tape so it
covers up the first 72 columns of a row, duplicating the contents of the
other 11 rows, and then punching the new contents for that row using one
keystroke per bit, seems much easier than having to manually duplicate the
contents of all rows.
No, I believe the card was read as one byte per column. Card readers
(at least those from IBM) didn't know or care if the data was binary or
character; they just took the contents of a column, converted it to a value
and sent it to the processor. One column could theoretically hold 1.5
bytes, but not all values were used, so there would be no confusion
between binary and character data.

Additionally, the IBM 360/370 instruction set had instruction lengths of
2, 4 and 6 bytes (16, 32 and 48 bits). Those would not work well with
a 36 bit row. Plus, they may not have been instructions; binary data,
which could be of any length, was also fed in using punch cards.

In any case, tape over the holes would have the tendency to gum up the
works. As the card is fed, the tape could catch and tear off. Gum can
leak out from under the tape - all kinds of things happen when you run
tape through machinery which isn't meant to have taped items go through it.
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
Robert Wessel
2017-06-08 16:39:03 UTC
Reply
Permalink
Raw Message
On Thu, 8 Jun 2017 11:19:40 -0400, Jerry Stuckle
Post by Jerry Stuckle
Post by s***@casperkitty.com
Post by Jerry Stuckle
IBM Keypunches also a "hold" button which would allow you to punch
multiple holes in a column. You could also punch individual rows in a
column, i.e. '0' would punch row 0, '1' would punch row 1, etc.
They also had a feature to copy as much or as little of a card as you
want, correct the error, and continue copying (if you wish).
With this combination you could punch any combination of holes in any
column. Filling the hole with a chad could cause problems with the card
reader due to the tape or glue used to hold the chad in.
I wouldn't expect people to fill in individual holes. What I would expect
would be more common would be to blank out rows or half-rows. Text
characters are punched vertically, so each column of twelve positions would
represent one character. Machine code, however, is read horizontally, with
(IIRC) one instruction in columns 1-36 and a second in columns 37-72 (the
machine code loader ignores columns 73-80, allowing them to be used for
card-identification or sorting purposes). Running a strip of tape so it
covers up the first 72 columns of a row, duplicating the contents of the
other 11 rows, and then punching the new contents for that row using one
keystroke per bit, seems much easier than having to manually duplicate the
contents of all rows.
No, I believe the card was read as one byte per column. Card readers
(at least those from IBM) didn't know or care if the data was binary or
character; the just took the contents of a column, convert it to a value
and sends it to the processor. One column could theoretically hold 1.5
bytes, but not all values were used, so there would not be a confusion
between binary and character data.
Additionally, the IBM 360/370 instruction set had instruction lengths of
2, 4 and 6 bytes (16, 32 and 48 bytes). Those would not work well with
a 36 bit row. Plus, they may not have been instructions; binary data,
which could be of any length, was also fed in using punch cards.
Most IBM readers and punches supported column binary, although the
system did not make much use of that (and provided pretty limited
support, so you'd often be doing your own EXCPs if you wanted to read
or punch such a thing). Even old-style object and executable decks
just used the ordinary 8-bits-per-column scheme. Other vendors *did*
make use of the extra storage on the card.
s***@casperkitty.com
2017-06-08 17:39:23 UTC
Reply
Permalink
Raw Message
Post by Jerry Stuckle
No, I believe the card was read as one byte per column. Card readers
(at least those from IBM) didn't know or care if the data was binary or
character; the just took the contents of a column, convert it to a value
and sends it to the processor. One column could theoretically hold 1.5
bytes, but not all values were used, so there would not be a confusion
between binary and character data.
From what I've read, high-speed readers moved the cards sideways, and would
thus at the hardware level process a card as 12 groups of 80 bits. If one
needed to convert a stack of punched cards into magnetic tape, using a bunch
of hardware shift registers to physically rearrange the bits so as to yield
a tape with one character per column might make sense, but if a reader is
directly connected to a computer, and if the computer isn't going to be
doing anything else while it's inputting the cards, rearranging the bits in
software should be cheaper than doing it in hardware.

I don't have a link handy anymore, but the process of starting up one of
those machines involved using switches to enter a short program (I think
it was about three instructions) that could read a few instructions of a
card and execute it. Those instructions could then contain a loop that
would read a few more cards into consecutive words of memory and execute
them. Once those were loaded, the software on them could perform any
desired translations on data read from the remainder of the deck.
Jerry Stuckle
2017-06-08 18:01:58 UTC
Reply
Permalink
Raw Message
Post by s***@casperkitty.com
Post by Jerry Stuckle
No, I believe the card was read as one byte per column. Card readers
(at least those from IBM) didn't know or care if the data was binary or
character; the just took the contents of a column, convert it to a value
and sends it to the processor. One column could theoretically hold 1.5
bytes, but not all values were used, so there would not be a confusion
between binary and character data.
From what I've read, high-speed readers moved the cards sideways, and would
thus at the hardware level process a card as 12 groups of 80 bits. If one
needed to convert a stack of punched cards into magnetic tape using a bunch
of hardware shift registers to physically rearrange the bits so as to yield
a tape with one character per column might make sense, but if a reader is
directly connected to a computer, and if the computer isn't going to be
doing anything else while it's inputting the cards, rearranging the bits in
software should be cheaper than doing it in hardware.
I don't have a link handy anymore, but the process of starting up one of
those machines involved using switches to enter a short program (I think
it was about three instructions) that could read a few instructions of a
card and execute it. Those instructions could then contain a loop that
would read a few more cards into consecutive words of memory and execute
them. Once those were loaded, the software on them could perform any
desired translations on data read from the remainder of the deck.
Some of them physically read the cards sideways. But data was sent to
the computer as individual columns across 8 bit wide channels (actually
the channels were bidirectional, 8 bits each way with additional wires
for control). There were no switches (at least as of the IBM 1401 days,
which was the first machine I used) for bootstrapping the system - only
switches on the front which would select the boot device.
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
Robert Wessel
2017-06-08 18:06:41 UTC
Reply
Permalink
Raw Message
Post by s***@casperkitty.com
Post by Jerry Stuckle
No, I believe the card was read as one byte per column. Card readers
(at least those from IBM) didn't know or care if the data was binary or
character; the just took the contents of a column, convert it to a value
and sends it to the processor. One column could theoretically hold 1.5
bytes, but not all values were used, so there would not be a confusion
between binary and character data.
From what I've read, high-speed readers moved the cards sideways, and would
thus at the hardware level process a card as 12 groups of 80 bits. If one
needed to convert a stack of punched cards into magnetic tape using a bunch
of hardware shift registers to physically rearrange the bits so as to yield
a tape with one character per column might make sense, but if a reader is
directly connected to a computer, and if the computer isn't going to be
doing anything else while it's inputting the cards, rearranging the bits in
software should be cheaper than doing it in hardware.
I don't have a link handy anymore, but the process of starting up one of
those machines involved using switches to enter a short program (I think
it was about three instructions) that could read a few instructions of a
card and execute it. Those instructions could then contain a loop that
would read a few more cards into consecutive words of memory and execute
them. Once those were loaded, the software on them could perform any
desired translations on data read from the remainder of the deck.
It depends on which way "sideways" is...

The big fast system reader/punches usually read the card along the
top-to-bottom axis, but smaller, slower, devices often read them a
column at a time.

While the hardware could read all 80x12 holes, what those holes meant,
and how that got to the CPU varied.

For example, for normal IBM compatible 80 column punch cards, there
are defined punch combinations for each column for all 256 points in
the collating sequence. None of which used more than five or six
punches. Of course that was a problem if you tried translating the
card input to ASCII.

So with typical IBM cards, you really had 80 binary (8-bit) bytes on
the card in the normal format. The normal "read" channel command word
would cause the controller and card reader to translate the patterns
of holes in the standard fashion to those 80 bytes.

If you IPL'd (booted) a machine from card reader, the cards just
contained 80 binary bytes of program (executable) text. The way that
worked on S/360 (and still does on Z - although finding a real card
reader, as opposed to an emulated one, will likely be a challenge) is
that the first card had an initial PSW (flags register, instruction
pointer) and two additional channel command words on it. The IPL
would load those 24 bytes from the first card to location zero, and
then jump to the first CCW from that card, and continue executing the
channel program from there. So that first card might have two read
CCWs there, that each read an additional card to (say) locations
0x1000 and 0x1050, for a total of 160 bytes. After the channel
program ended, the machine would load the PSW from the first card, and
you'd have set that to start executing at location 0x1000, or
something like that.
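
In C terms, that first card boils down to something like the sketch below,
assuming the usual 8-byte PSW and 8-byte CCWs (24 bytes total loaded to
location zero; the names are just for illustration):

    #include <stdint.h>

    /* Sketch of the 24 bytes the IPL pulls from the first card into
       location 0: an initial PSW followed by two channel command words. */
    struct ipl_first_card {
        uint64_t initial_psw;  /* loaded as the PSW once the channel program ends */
        uint64_t ccw1;         /* first CCW the channel executes next             */
        uint64_t ccw2;         /* second CCW (e.g. another read, or a TIC)        */
    };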

Note that the IPL worked more-or-less the same way if you were
starting from cards, tape, disk, etc.

The IPL program could be considerably more complex. For example, you
could load one additional card with the first CCW, and then use the
second CCW to jump ("Transfer In Channel") to another CCW on that card
(which had room for 10 additional CCWs). And you could chain many of
those together.

Now most IBM card readers also supported reading the card in "column
binary" format*, in which mode they read, and returned to the CPU,
*all* of the 12x80 holes on the card. So you could read or punch
arbitrary cards, but at least on IBM mainframes, you didn't do it very
often (usually only when exchanging data from machines where such
*was* common practice).


*There was a special "read-column-binary" CCW for that. If a very dim
memory serves, it actually returned 160 bytes, treating the card as
two rows of six-bit binary values.
Robert Wessel
2017-06-08 18:18:41 UTC
Reply
Permalink
Raw Message
On Thu, 08 Jun 2017 13:06:41 -0500, Robert Wessel
Post by Robert Wessel
Post by s***@casperkitty.com
Post by Jerry Stuckle
No, I believe the card was read as one byte per column. Card readers
(at least those from IBM) didn't know or care if the data was binary or
character; the just took the contents of a column, convert it to a value
and sends it to the processor. One column could theoretically hold 1.5
bytes, but not all values were used, so there would not be a confusion
between binary and character data.
From what I've read, high-speed readers moved the cards sideways, and would
thus at the hardware level process a card as 12 groups of 80 bits. If one
needed to convert a stack of punched cards into magnetic tape using a bunch
of hardware shift registers to physically rearrange the bits so as to yield
a tape with one character per column might make sense, but if a reader is
directly connected to a computer, and if the computer isn't going to be
doing anything else while it's inputting the cards, rearranging the bits in
software should be cheaper than doing it in hardware.
I don't have a link handy anymore, but the process of starting up one of
those machines involved using switches to enter a short program (I think
it was about three instructions) that could read a few instructions of a
card and execute it. Those instructions could then contain a loop that
would read a few more cards into consecutive words of memory and execute
them. Once those were loaded, the software on them could perform any
desired translations on data read from the remainder of the deck.
It depends on which way "sideways" is...
The big fast system reader/punches usually read the card along the
top-to-bottom axis, but smaller, slower, devices often read them a
column at a time.
While the hardware could read all 80x12 holes, what those holes meant,
and how that got to the CPU varied.
For example, for normal IBM compatible 80 column punch cards, there
are defined punch combinations for each column for all 256 points in
the collating sequence. None of which used more than five or six
punches. Of course that was a problem if you tried translating the
card input to ASCII.
So with typical IBM cards, you really had 80 binary (8-bit) bytes on
the card in the normal format. The normal "read" channel command word
would cause the controller and card reader to translate the patterns
of holes in the standard fashion to those 80 bytes.
If you IPL'd (booted) a machine from card reader, the cards just
contained 80 binary bytes of program (executable) text. The way that
worked on S/360 (and still does on Z - although finding a real card
reader, as opposed to an emulated one, will likely be a challenge) is
that the first card had an initial PSW (flags register, instruction
pointer) and two additional channel command words on it. The IPL
would load those 24 bytes from the first card to location zero, and
then jump to the first CCW from that card, and continue executing the
channel program from there. So that first card might have two read
CCWs there, that each read an additional card to (say) locations
0x1000 and 0x1050, for a total of 160 bytes. After the channel
program ended, the machine would load the PSW from the first card, and
you'd have set that to start executing at location 0x1000, or
something like that.
Note that the IPL worked more-or-less the same way if you were
starting from cards, tape, disk, etc.
The IPL program could be considerable more complex. For example, you
could load one additional card with the first CCW, and then use the
second CCW to jump ("Transfer In Channel") to another CCW on that card
(which had room for 10 additional CCWs). And you could chain many of
those together.
Now most IBM card reads also supported reading the card in "column
binary" format*, in which mode they read, and returned to the CPU,
*all* of the 12x80 holes on the card. So you could read or punch
arbitrary cards, but at least on IBM mainframes, you didn't do it very
often (usually only when exchanging data from machines where such
*was* common practice).
*There was a special "read-column-binary" CCW for that. If a very dim
memory serves, it actually returned 160 bytes, treating the card as
two rows of six-bit binary values.
I slightly mis-remembered that. In column binary mode, you'd get 160
bytes, but the sequence was a byte with the six bits from the top six
rows of column 1, then a byte with the lower six rows of column one,
then the top six of column two... See page 8 of:

https://ia601904.us.archive.org/32/items/bitsavers_ibm35xxGA2ardReaderPunchSubsystemOct74_7156231/GA21-9124-5_3504_3505_3525_Card_Reader_Punch_Subsystem_Oct74.pdf
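
In code, that unpacking amounts to roughly the following (the bit ordering
inside a 12-bit column value is my assumption for illustration, not something
taken from the manual):

    #include <stdint.h>

    /* Sketch of the column-binary sequence described above: each 12-row column
       yields two bytes, the top six rows first, then the bottom six. A column
       is assumed here to be a 12-bit value with the card's 12-row in bit 11
       and row 9 in bit 0. */
    void columns_to_column_binary(const uint16_t columns[80], uint8_t out[160])
    {
        for (int c = 0; c < 80; c++) {
            out[2*c]     = (columns[c] >> 6) & 0x3F;  /* rows 12, 11, 0, 1, 2, 3 */
            out[2*c + 1] =  columns[c]       & 0x3F;  /* rows 4 through 9        */
        }
    }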
Scott Lurndal
2017-06-08 18:33:08 UTC
Reply
Permalink
Raw Message
Post by s***@casperkitty.com
Post by Jerry Stuckle
No, I believe the card was read as one byte per column. Card readers
(at least those from IBM) didn't know or care if the data was binary or
character; the just took the contents of a column, convert it to a value
and sends it to the processor. One column could theoretically hold 1.5
bytes, but not all values were used, so there would not be a confusion
between binary and character data.
From what I've read, high-speed readers moved the cards sideways, and would
thus at the hardware level process a card as 12 groups of 80 bits.
Here's the burroughs tabletop 300 cards-per-minute reader:

https://umedia.lib.umn.edu/node/30986

It read 80 groups of 12-bit Hollerith codes. The host interface
converted the 12-bit hollerith to Binary, EBCDIC, BCL or BULL coding.

Every card reader[*] I've ever used processed a card column by column, never
row-by-row.

[*] Burroughs 300, 600 and 1200CPM readers, IBM readers (models unknown).

Here's a description of the host interface for above card reader:

http://vseries.lurndal.org/doku.php?id=dlps:card_reader
Robert Wessel
2017-06-08 19:21:51 UTC
Reply
Permalink
Raw Message
Post by Scott Lurndal
Post by s***@casperkitty.com
Post by Jerry Stuckle
No, I believe the card was read as one byte per column. Card readers
(at least those from IBM) didn't know or care if the data was binary or
character; the just took the contents of a column, convert it to a value
and sends it to the processor. One column could theoretically hold 1.5
bytes, but not all values were used, so there would not be a confusion
between binary and character data.
From what I've read, high-speed readers moved the cards sideways, and would
thus at the hardware level process a card as 12 groups of 80 bits.
https://umedia.lib.umn.edu/node/30986
It read 80 groups of 12-bit Hollerith codes. The host interface
converted the 12-bit hollerith to Binary, EBCDIC, BCL or BULL coding.
Every card reader[*] I've ever used processed a card column by column, never
row-by-row.
[*] Burroughs 300, 600 and 1200CPM readers, IBM readers (models unknown).
http://vseries.lurndal.org/doku.php?id=dlps:card_reader
As I mentioned in my prior post, it varied based on the card reader.

If you look on page 3 of:

https://ia601904.us.archive.org/32/items/bitsavers_ibm35xxGA2ardReaderPunchSubsystemOct74_7156231/GA21-9124-5_3504_3505_3525_Card_Reader_Punch_Subsystem_Oct74.pdf

"Card Sensing

The subsystem reads cards optically as the cards move past the read
station during card feed cycles. On the 3504 and 3505, data is read
serially by column, starting at column 1 of the card. On a 3525 with
the read feature, data is read parallel by row, 12-row first."

The older IBM 2540 used a set of 80 brushes instead of an optical
sensing systems, and also read the columns in parallel. 12-row first
as well (the top of the card for the folks who weren't there: the rows
were, top to bottom, 12, 11, then 0 to 9).

Of course the resulting data was always ultimately returned in
column order, no matter how it was physically read.
s***@casperkitty.com
2017-06-13 17:37:19 UTC
Reply
Permalink
Raw Message
Post by Robert Wessel
Of course there resulting data was always ultimately returned in
column order, no matter how it was physically read.
I'm not sure why you say "of course". I can't think of any way of doing bit
reordering in hardware that could have been implemented cheaply in the days
before MSI logic (e.g. 74164 or 74165 shift registers), and even using those
it would be hard to get the chip count down below 50. The code required to
perform the task would seem much cheaper.
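
For what it's worth, the software rearrangement in question is just a bit
transpose; a minimal sketch, assuming the reader hands back 12 rows of 80 bits
packed 10 bytes per row, MSB first, with the card's 12-row first (all of which
are assumptions for illustration):

    #include <stdint.h>

    /* Sketch only: turn a card image read row-by-row (12 rows x 80 bits) into
       80 per-column 12-bit values, with the card's 12-row (rows[0]) ending up
       in bit 11 of each column value. */
    void rows_to_columns(const uint8_t rows[12][10], uint16_t columns[80])
    {
        for (int col = 0; col < 80; col++) {
            uint16_t v = 0;
            for (int row = 0; row < 12; row++) {
                int bit = (rows[row][col / 8] >> (7 - (col % 8))) & 1;
                v = (uint16_t)((v << 1) | bit);
            }
            columns[col] = v;
        }
    }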
Scott Lurndal
2017-06-13 18:08:05 UTC
Reply
Permalink
Raw Message
Post by s***@casperkitty.com
Post by Robert Wessel
Of course there resulting data was always ultimately returned in
column order, no matter how it was physically read.
I'm not sure why you say "of course". I can't think of any of doing bit
reordering in hardware that could have been implemented cheaply in the days
before MSI logic (e.g. 74164 or 74165 shift registers), and even using those
it would be hard to get the chip count down below 50. The code required to
perform the task would seem much cheaper.
You've obviously never used, worked with, worked on, or designed such kit.

It was always returned in column order. On Burroughs systems, I/O was
pretty simple from the programmer model standpoint. When the operating
system wanted to read a card, for example, it issued a single processor
instruction (Initiate I/O - IIO). This instruction had three operands.
The A (first) operand was the I/O descriptor (3 bytes) and the B and
C operand were the start and end+1 addresses in memory that the card
data would be read into.

The I/O descriptor fell into four categories: Echo(1), Test(2), Write(3), Read(3)
which consumed the second four bits of the I/O descriptor. The trailing two
bytes of the I/O descriptor denoted the unit number (for multiunit channels
such as disk, pack or tape) and other sub-options (such as the translation
mode from hollerith).

http://vseries.lurndal.org/doku.php?id=dlps:card_reader

Neither the card-reader nor the DLP's were inexpensive - the DLP itself
which took the 12-bit hollerith codes and did the translation was two
cards (each 16"x24") full of SSI logic linked with card-edge connectors.

The code to do this for every card, on a system that didn't have instructions
for even a simple bit shift or rotate, would be ridiculously inefficient.
Robert Wessel
2017-06-15 01:15:38 UTC
Reply
Permalink
Raw Message
Post by s***@casperkitty.com
Post by Robert Wessel
Of course there resulting data was always ultimately returned in
column order, no matter how it was physically read.
I'm not sure why you say "of course". I can't think of any of doing bit
reordering in hardware that could have been implemented cheaply in the days
before MSI logic (e.g. 74164 or 74165 shift registers), and even using those
it would be hard to get the chip count down below 50. The code required to
perform the task would seem much cheaper.
In the context of a card reader attached to a S/360 channel,
supporting the normal set of I/O commands, there's really no question:
the data must be returned in column order. Having it not happen that
way would be as odd as something purporting to be a SCSI disk drive
respond with something other than a sector (block) image in response
to a read command. A S/360 channel is conceptually quite similar to a
SCSI channel, and just like a device on a SCSI channel, it required a
channel controller for (groups) of devices (consider the classic model
where a SCSI disk controller in an external box drove ESDI disk
drives). For a 2540 card reader on a S/360, that would usually have
been an IBM 2821 controller (which would also drive punches and some
line printers, and up to four devices could be attached to the
controller). The 2821 was a refrigerator-sized box:

http://bitsavers.trailing-edge.com/pdf/ibm/28xx/2821/A24-3312-7_2821_Component_Descr_Nov69.pdf

Picture of a 2821 on page 4, a 2540* card reader/punch on page 12.

Now it's certainly possible that there were card readers driven by
systems at much lower levels than via a S/360 channel (in fact I know
there were, see below). Even in those cases, what ultimately got
returned to whatever program was trying to read the card, would have
been a "normal" card image. The fact that the control of the card
reader got bit-banged in the device driver (taking the term "device
driver" broadly) is not really relevant.

OTOH, it's unlikely that many card readers that took the "read
sideways" approach would have taken such a low level approach. The
main advantage of going sideways was performance, and so only fast
(and expensive) card readers would have gotten such treatment, and for
such devices a low-level interface would not have made much sense.

Note that the interface between the 2821 control unit and the 2540
itself *was* very low level, and only modest amounts of electronics
were in the card reader itself, most of the work actually being done
in the controller. There were some rather fat cables, with many
conductors, running between the two.

Note that just like more modern SCSI disk drives, the channel
controller got integrated into more recent card readers (not to
mention printers, etc.), and no separate control unit was required.

I've seen at least one low cost card reader (not S/360 channel
attached), that was basically an upgraded keypunch machine, where the
host system could control a few of the functions to cause cards to
feed and read (basically the host could trigger some of what were
normally keyboard functions), and while cards were reading the output
was (roughly) a 12-bit parallel signal plus a clock pulse that the
host needed to interpret. Of course this was *physically* reading
cards in column order. I don't remember the exact speed, but I doubt
it was much more than about 10-15 cards per minute (IOW, 100 times
slower than a "real" 2540 reader).

I've also seen low cost card reader/punches based on keypunch machines
with actual channel interfaces. These were popular in smaller
mainframe shops in the late seventies and early eighties, when most
card processing had gone away, but you still needed to deal with the
occasional deck of cards. For a long time, a deck of cards was also
used for the rough equivalent of starting the OS in "safe mode" or
"recovery mode".


*You can visualize the card path pretty easily from that picture. The
reader is on the right, and the hopper is the thing sticking up. You
could load an entire box of cards on there. The cards were in the
hopper facedown and sideways/12-edge-forward (IOW, from the camera's
view point, we'd be looking at the skinny column-80 edges of the
cards, and the 12-edge would be facing the bulk of the 2540). Cards
fed (sideways) through the internal read station (mostly behind the
upper half of the dark panel on the right front of the machine), and
then were kicked straight forward into one of the three rightmost
stackers you see in the middle of the machine.

The left side was the card punch (that included three left most
stackers - the center stacker could be used for both the reader and
punch).

It's not a great scan, but you can see the innards of the card path of
a 2540 on page 12 of:

http://bitsavers.informatik.uni-stuttgart.de/pdf/ibm/25xx/A21-9033-1_2540_Card_Punch_Component_Description_1965.pdf
Scott Lurndal
2017-06-08 15:44:24 UTC
Reply
Permalink
Raw Message
Post by s***@casperkitty.com
Post by Jerry Stuckle
IBM Keypunches also a "hold" button which would allow you to punch
multiple holes in a column. You could also punch individual rows in a
column, i.e. '0' would punch row 0, '1' would punch row 1, etc.
They also had a feature to copy as much or as little of a card as you
want, correct the error, and continue copying (if you wish).
With this combination you could punch any combination of holes in any
column. Filling the hole with a chad could cause problems with the card
reader due to the tape or glue used to hold the chad in.
I wouldn't expect people to fill in individual holes. What I would expect
would be more common would be to blank out rows or half-rows. Text
characters are punched vertically, so each column of twelve positions would
represent one character. Machine code, however, is read horizontally, with
(IIRC) one instruction in columns 1-36 and a second in columns 37-72 (the
machine code loader ignores columns 73-80, allowing them to be used for
card-identification or sorting purposes). Running a strip of tape so it
covers up the first 72 columns of a row, duplicating the contents of the
other 11 rows, and then punching the new contents for that row using one
keystroke per bit, seems much easier than having to manually duplicate the
contents of all rows.
Firstly, how binary data is represented on the card depends entirely
on the host that is reading the card. For example, Burroughs systems
when punching binary would pack three bytes into two columns (24 bits).
Given that it was a BCD machine, the probability of punching a lace card
was in the noise.

Other vendors' systems used variants of the above scheme, although
on binary systems like the 360 family, they'd take care to avoid the
potential of generating lace cards for memory regions set to all @***@.
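
Roughly, that three-bytes-into-two-columns packing looks like this in C; which
bits land in which rows is illustrative only, since only the ratio is being
described:

    #include <stdint.h>

    /* Sketch: spread 24 bits (three bytes) across two 12-bit column codes. */
    void pack_three_bytes(const uint8_t in[3], uint16_t col[2])
    {
        uint32_t bits = ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
        col[0] = (uint16_t)((bits >> 12) & 0xFFF);  /* first column: high 12 bits */
        col[1] = (uint16_t)( bits        & 0xFFF);  /* second column: low 12 bits */
    }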

Secondly, In over 14 years of mainframe OS development, I never saw someone punch
a binary deck by hand - they were always generated using the host-connected
card punch. Using tape to block a hole is a non-starter with any of the
equipment used in the card days, it was hard enough to keep the readers
running with _good_ decks (dust, hanging chad, vibration, dirty optical
sensors, bent fingers, et alia). The readers varied from 300 cards-per-minute
to over 1200 cards-per-minute when not waiting for maintenance.

Thirdly, if you really needed to change a column, it could be done easily
on an 029 keypunch by duping the first N columns, hand-punching the bad column,
then duping the next 80[*] - N + 1 (dup was a function of the keypunch).

[*] Not a viable procedure with 96-column (Univac) cards, which were less than
half the size of the standard 80-column IBM card and had round holes.
David Kleinecke
2017-06-08 18:34:58 UTC
Reply
Permalink
Raw Message
Post by Jerry Stuckle
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
That struck you as a better idea than retyping the card?
Yes. There are a lot of holes on an IBM card and this
being a binary no easy way to make the holes because the
key punch machines expected character input.
IBM Keypunches also a "hold" button which would allow you to punch
multiple holes in a column. You could also punch individual rows in a
column, i.e. '0' would punch row 0, '1' would punch row 1, etc.
They also had a feature to copy as much or as little of a card as you
want, correct the error, and continue copying (if you wish).
With this combination you could punch any combination of holes in any
column. Filling the hole with a chad could cause problems with the card
reader due to the tape or glue used to hold the chad in.
There were many different models of IBM keypunches. It's
been fifty-odd years but I don't remember most of the features
you mention as being on the keypunch machines I used.

I didn't use tape or glue. I just stuck the chad back in the hole.
It usually stayed in place long enough to be duplicated.
Scott Lurndal
2017-06-08 12:52:21 UTC
Reply
Permalink
Raw Message
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
That struck you as a better idea than retyping the card?
Yes. There are a lot of holes on an IBM card and this
being a binary no easy way to make the holes because the
key punch machines expected character input.
It was easy with any 029 keypunch to make a hole
in any column of any row, or any combination of holes
within a column.

One could also dup a card up to a column, punch the column
manually, then dup the rest of the card.
Robert Wessel
2017-06-08 15:48:52 UTC
Reply
Permalink
Raw Message
Post by Scott Lurndal
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
How about filling some of the holes in a card with chad
to reverse their meaning then duplicating the card to
insure the patch wouldn't fall out? We did desperate
things when working at midnight on a program that had
to run tomorrow.
That struck you as a better idea than retyping the card?
Yes. There are a lot of holes on an IBM card and this
being a binary no easy way to make the holes because the
key punch machines expected character input.
It was easy with any 029 keypunch to make a hole
in any column of any row, or any combination of holes
within a column.
One could also dup a card up to a column, punch the column
manually, then dup the rest of the card.
It was also easy on the 026 and 029 to insert or delete columns - you
just put a finger on the card in either the read or punch station (so
it would not advance), while advancing the other card.

For example, if you wanted to insert something after the 20th column,
you put the original card in the read station, a new card in the punch
station, hit the dup (column) key 20 times, then prevented the card in
the read station from advancing while you typed the insertion, and
then dup'd the rest of the card.

To delete (say) three positions after column 20, you'd again load the
cards, dup 20 columns, then hold the card to in *punch* station so it
doesn't advance, and hit the spacebar three times (which would cause
the card in the read station to advance three positions), and finally
dup the remainder of the card.

Unfortunately that capability was lost on the (otherwise much nicer)
129 (which stored the card to be duplicated and/or punched in
electronic memory, rather than doing a read/punch on a mechanical
column-by-column basis).
a***@gmail.com
2017-05-31 20:10:44 UTC
Reply
Permalink
Raw Message
I'd say the free() function could search for the pointer to be freed starting from the most recent pointers that malloc returned...
Jerry Stuckle
2017-05-31 20:12:29 UTC
Reply
Permalink
Raw Message
Post by a***@gmail.com
I say free() function could search for free the pointer at first from the last pointers malloc return...
Which may not be the block you want to free!
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
a***@gmail.com
2017-05-31 20:25:29 UTC
Reply
Permalink
Raw Message
If it is not the first, I'd say it would be the second, the third, the fourth...
If n is the number of pointers returned from malloc,
it is near O(1) for free() in the everyday program...
Joe Pfeiffer
2017-06-01 03:51:32 UTC
Reply
Permalink
Raw Message
Post by a***@gmail.com
I say free() function could search for free the pointer at first from the last pointers malloc return...
Why would free() search from anything other than the pointer that was
passed to it?
Chris M. Thomasson
2017-06-01 04:05:54 UTC
Reply
Permalink
Raw Message
Post by Joe Pfeiffer
Post by a***@gmail.com
I say free() function could search for free the pointer at first from the last pointers malloc return...
Why would free() search from anything other than the pointer that was
passed to it?
Fwiw, there are allocators that simply round the address down to the
superblock and link the memory into a singly linked list. The free
operation is pretty fast. The superblock is aligned on a largish
boundary such that rounding any pointer within it down to said boundary
will gain a pointer to the superblock's data.
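
A bare-bones sketch of that trick, with made-up names and a made-up superblock
size; the interesting part is only the rounding and the list push (a real
allocator would also need atomics or a lock around it):

    #include <stdint.h>

    #define SUPERBLOCK_SIZE (64 * 1024)  /* assumed power of two; superblocks
                                            are allocated on this boundary   */

    struct free_node  { struct free_node *next; };
    struct superblock { struct free_node *free_list; /* ...other metadata... */ };

    void my_free(void *p)
    {
        /* Round the pointer down to the superblock that contains it. */
        struct superblock *sb = (struct superblock *)
            ((uintptr_t)p & ~(uintptr_t)(SUPERBLOCK_SIZE - 1));

        /* Push the freed block onto the superblock's singly linked free list. */
        struct free_node *n = (struct free_node *)p;
        n->next       = sb->free_list;
        sb->free_list = n;
    }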
Mr. Man-wai Chang
2017-05-18 15:28:44 UTC
Reply
Permalink
Raw Message
Post by Rick C. Hodgin
The C Standard defines malloc(), realloc(), free(), and others. How
they mechanically apply to a given OS and its set of low-level function
call APIs is another matter. I'm curious about how they are designed
in general (such as what type of model do they use for allocation and
free operations, as I assume there is some core algorithm that's been
developed over time, evolving into the one that's most preferred due
to its general reliability, speed, and minimal fragmentation).
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.

You can, of course, create your own garbage memory management system ON
TOP OF the operating systems. ;)
--
@~@ Remain silent! Drink, Blink, Stretch! Live long and prosper!!
/ v \ Simplicity is Beauty!
/( _ )\ May the Force and farces be with you!
^ ^ (x86_64 Ubuntu 9.10) Linux 2.6.39.3
No borrowing or lending! No scams! No compensated dating! No fighting! No robbery! No suicide! Please consider CSSA (Comprehensive Social Security Assistance):
http://www.swd.gov.hk/tc/index/site_pubsvc/page_socsecu/sub_addressesa
Rick C. Hodgin
2017-05-18 15:42:43 UTC
Reply
Permalink
Raw Message
Post by Mr. Man-wai Chang
Post by Rick C. Hodgin
The C Standard defines malloc(), realloc(), free(), and others. How
they mechanically apply to a given OS and its set of low-level function
call APIs is another matter. I'm curious about how they are designed
in general (such as what type of model do they use for allocation and
free operations, as I assume there is some core algorithm that's been
developed over time, evolving into the one that's most preferred due
to its general reliability, speed, and minimal fragmentation).
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know.
Unless I'm mistaken, requests are made from the malloc() and realloc()
functions to allocate memory, but those algorithms decide internally
how that larger allocated block is divvied up to fulfill requests to
allocate 11 bytes for this request, 9 bytes on this request, 337 bytes
on that request, and so on.

The OS does not handle those allocations, but only larger blocks. If
you had to call into the OS to fulfill every malloc() request, it would
be a great performance hit as traversal from user space to kernel space
is relatively expensive in terms of clock cycles and context switches.
Post by Mr. Man-wai Chang
The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.
As I understand it, there is more than one level of management, broken
down by granularity. The OS fulfills large requests (multiples of 4KB
or 4MB blocks as per virtual page sizes), and then that large allocation
goes through the C standard library algorithm which then breaks the
large allocation into many smaller pieces per user app requests.
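To make the two levels concrete, here is a deliberately tiny sketch of the
idea: one large block from the OS, many cheap carves in user space. Here
os_alloc_pages() is only a hypothetical stand-in for VirtualAlloc() with
MEM_COMMIT (or mmap() elsewhere), and there is no free() and no locking:

#include <stddef.h>

#define CHUNK_SIZE (64 * 1024)

extern void *os_alloc_pages(size_t size);   /* hypothetical OS call */

static unsigned char *chunk;   /* current large block from the OS */
static size_t         used;    /* bytes already handed out from it */

void *tiny_alloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;        /* keep 16-byte alignment */
    if (size > CHUNK_SIZE)
        return NULL;                         /* big requests not handled here */
    if (chunk == NULL || used + size > CHUNK_SIZE) {
        chunk = os_alloc_pages(CHUNK_SIZE);  /* one trip to the kernel */
        if (chunk == NULL)
            return NULL;
        used = 0;
    }
    void *p = chunk + used;                  /* many cheap user-space carves */
    used += size;
    return p;
}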

Thank you,
Rick C. Hodgin
Joe Pfeiffer
2017-05-18 15:46:39 UTC
Reply
Permalink
Raw Message
Post by Mr. Man-wai Chang
Post by Rick C. Hodgin
The C Standard defines malloc(), realloc(), free(), and others. How
they mechanically apply to a given OS and its set of low-level function
call APIs is another matter. I'm curious about how they are designed
in general (such as what type of model do they use for allocation and
free operations, as I assume there is some core algorithm that's been
developed over time, evolving into the one that's most preferred due
to its general reliability, speed, and minimal fragmentation).
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however,
could be written in C, as for example, Unix and Linux. But the memory
is still managed by the operating systems.
You can, of course, create your own garbage memory management system
ON TOP OF the operating systems. ;)
Exactly, which is how malloc() works. It manages a free space pool in
user space, using something like the power-of-two algorithm described by
another poster. When that pool is exhausted it requests a large new
chunk of space from the OS, and then allocates from within that. Note
this doesn't necessarily have anything to do with garbage collection.
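A rough sketch of that segregated power-of-two idea -- one free list per
size class, refilled from a large OS chunk when it runs dry. get_os_chunk()
is hypothetical, and the size header needed by free(), error paths, and
locking are all omitted:

#include <stddef.h>

#define NUM_CLASSES 12              /* size classes 16 .. 32768 bytes */
#define OS_CHUNK (256 * 1024)

extern void *get_os_chunk(size_t);  /* hypothetical "ask the OS" call */

struct free_node { struct free_node *next; };
static struct free_node *classes[NUM_CLASSES];

static int class_of(size_t size)
{
    int c = 0;
    while (((size_t)16 << c) < size)
        c++;
    return c;                       /* caller must check c < NUM_CLASSES */
}

void *p2_alloc(size_t size)
{
    int c = class_of(size);
    if (c >= NUM_CLASSES)
        return NULL;                          /* punt on large requests */
    size_t block = (size_t)16 << c;
    if (classes[c] == NULL) {                 /* pool exhausted: refill */
        char *chunk = get_os_chunk(OS_CHUNK);
        if (chunk == NULL)
            return NULL;
        for (size_t off = 0; off + block <= OS_CHUNK; off += block) {
            struct free_node *n = (struct free_node *)(chunk + off);
            n->next = classes[c];
            classes[c] = n;
        }
    }
    struct free_node *n = classes[c];         /* pop one block off the list */
    classes[c] = n->next;
    return n;
}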

It's true that this really isn't C specific, but not because it's all
managed by the kernel.
s***@casperkitty.com
2017-05-18 16:08:54 UTC
Reply
Permalink
Raw Message
Post by Mr. Man-wai Chang
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.
Unless an optimizer can somehow manage to break it, one can implement
malloc/free without any reliance upon the OS by simply saying:

typedef struct {
    void *next;
    size_t size;
} MEM_BLOCK;

union {
    MEM_BLOCK free_area;
    coarsestType arr[1 + (SIZE_OF_POOL-1) / sizeof (coarsestType)];
} mem_pool_obj = { { 0, 1 + (SIZE_OF_POOL-1) / sizeof (coarsestType) } };
void *volatile first_free_block = &mem_pool_obj;

and then have malloc/free return pointers to portions of mem_pool_obj. One
would have to know in advance what size the pool should be, and systems may
limit static allocations to a small portion of total RAM (many MS-DOS
and classic-Macintosh implementations were like that) but if a compiler is
required to presume that a volatile void* might identify a union containing
any arbitrary combination of types (as would have been the case under C89)
there would be no need for malloc() to use the OS at all unless the storage
it needs to handle exceeds the allowable size for static objects.
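To show the same point in a smaller, self-contained form, here is a bare
first-fit walk over a static pool -- not the exact scheme above, just its
shape. There is no free(), no coalescing, and the size field counts bytes:

#include <stddef.h>

#define POOL_BYTES 65536

typedef struct blk {
    struct blk *next;   /* next free block */
    size_t size;        /* usable bytes after this header */
} blk;

static union { blk b; long double align; unsigned char raw[POOL_BYTES]; } pool;
static blk *free_list;

void *pool_alloc(size_t want)
{
    /* round up so the next header stays aligned (sizeof(blk) is a power of two) */
    want = (want + sizeof(blk) - 1) & ~(sizeof(blk) - 1);

    if (free_list == NULL) {                     /* lazy one-time setup */
        free_list = &pool.b;
        free_list->next = NULL;
        free_list->size = POOL_BYTES - sizeof(blk);
    }
    for (blk **pp = &free_list; *pp != NULL; pp = &(*pp)->next) {
        blk *b = *pp;
        if (b->size >= want + sizeof(blk)) {     /* big enough to split */
            blk *rest = (blk *)((unsigned char *)(b + 1) + want);
            rest->next = b->next;
            rest->size = b->size - want - sizeof(blk);
            *pp = rest;
            b->size = want;
            return b + 1;                        /* memory right after the header */
        } else if (b->size >= want) {            /* hand over the whole block */
            *pp = b->next;
            return b + 1;
        }
    }
    return NULL;                                 /* pool exhausted */
}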
Chris M. Thomasson
2017-05-18 22:48:58 UTC
Reply
Permalink
Raw Message
Post by s***@casperkitty.com
Post by Mr. Man-wai Chang
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.
Unless an optimizer can somehow manage to break it, one can implement
typedef struct {
void *next;
size_t size; } MEM_BLOCK;
There is a way to combine the next pointer into the memory block itself,
with virtually zero overhead. It uses a trick to gain a pointer to the
"super-block" without having to create an extra pointer per block.
s***@casperkitty.com
2017-05-18 23:07:34 UTC
Reply
Permalink
Raw Message
Post by Chris M. Thomasson
Post by s***@casperkitty.com
Post by Mr. Man-wai Chang
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.
Unless an optimizer can somehow manage to break it, one can implement
typedef struct {
void *next;
size_t size; } MEM_BLOCK;
There is a way to combine the next pointer into the memory block itself.
Virtually zero overhead. This is using tricks to gain a pointer to the
"super-block" without having to create an extra pointer per block.
If a memory manager requires applications to specify the size of blocks
being freed, and block sizes have sufficient granularity to hold a pointer
and a size, it's pretty easy to reduce per-block overhead to zero by only
keeping track of free blocks. Allocation with zero per-block overhead can
also be accomplished if the granularity only allows space for a pointer,
but at the cost of more code (keep separate free lists for minimal-sized
blocks and larger blocks). If applications are not required to specify
the size of blocks being released (e.g. when implementing a conforming
version of "free()"), it will be necessary to prepend each block with a
header giving its size, thus adding overhead.
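A tiny illustration of that last point: a conforming malloc()/free() pair can
hide a size header in front of each block so free() can recover the size from
the pointer alone. raw_alloc() and raw_release() are hypothetical lower-level
routines that do want to be told the size:

#include <stddef.h>

extern void *raw_alloc(size_t size);             /* hypothetical */
extern void  raw_release(void *p, size_t size);  /* hypothetical */

typedef union { size_t size; long double pad; } header;  /* pad keeps alignment */

void *sized_malloc(size_t size)
{
    header *h = raw_alloc(sizeof(header) + size);
    if (h == NULL)
        return NULL;
    h->size = sizeof(header) + size;   /* remember the whole block's size */
    return h + 1;                      /* caller sees the memory after the header */
}

void sized_free(void *p)
{
    if (p == NULL)
        return;
    header *h = (header *)p - 1;       /* step back to the hidden header */
    raw_release(h, h->size);
}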
Chris M. Thomasson
2017-05-20 22:17:54 UTC
Reply
Permalink
Raw Message
Post by s***@casperkitty.com
Post by Chris M. Thomasson
Post by s***@casperkitty.com
Post by Mr. Man-wai Chang
Memory is always managed by the operating system and/or kernels. C has
no say on these, as far as I know. The operating systems, however, could
be written in C, as for example, Unix and Linux. But the memory is still
managed by the operating systems.
Unless an optimizer can somehow manage to break it, one can implement
typedef struct {
void *next;
size_t size; } MEM_BLOCK;
There is a way to combine the next pointer into the memory block itself.
Virtually zero overhead. This is using tricks to gain a pointer to the
"super-block" without having to create an extra pointer per block.
If a memory manager requires applications to specify the size of blocks
being freed, and block sizes have sufficient granularity to hold a pointer
and a size, it's pretty easy to reduce per-block overhead to zero by only
keeping track of free blocks. Allocation with zero per-block overhead can
also be accomplished if the granularity only allows space for a pointer,
but at the cost of more code (keep separate free lists for minimal-sized
blocks and larger blocks). If applications are not required to specify
the size of blocks being released (e.g. when implementing a conforming
version of "free()"), it will be necessary to prepend each block with a
header giving its size, thus adding overhead.
The technique basically involves rounding a block's address down to
the superblock structure.
Siri Cruise
2017-05-20 23:06:13 UTC
Reply
Permalink
Raw Message
Post by Rick C. Hodgin
Can someone point me to a website or explain to me how malloc() and
free() work in a general sense on Windows?
I'd like to know how it propagates its structures from front-door
requests, through any layers, to the actual memory locations, and
For the allocator I use, free does nothing, and malloc is nontrivial and can
trigger a garbage collection. There is no single strategy 'in general'. If
allocation interests you, you can look up different strategies and actual
implementations.
Post by Rick C. Hodgin
what it uses to store data beyond that which is visible to us as
a buffer in an application program.
The documentation of whatever allocator you use should explain everything
visible to you.
--
:-<> Siri Seal of Disavowal #000-001. Disavowed. Denied. Deleted. @
'I desire mercy, not sacrifice.' /|\
Free the Amos Yee one. This post / \
Yeah, too bad about your so-called life. Ha-ha. insults Islam. Mohammed