*Post by Janis Papanagnou*In a recent thread, realloc() was a substantial part of the discussion.

"Occasionally" the increased data storage will be relocated along

with the previously stored data. On huge data sets that might be a

performance factor. Is there any experience or are there any concrete

factors about the conditions when this relocation happens? - I could

imagine that it's no issue as long as you're in some kB buffer range,

but if, say, we're using realloc() to substantially increase buffers

often it might be an issue to consider. It would be good to get some

feeling for those internals.

There is obviously a cost, but there is (usually) no alternative if

contiguous storage is required. In practice, the cost is usually

moderate and can be very effectively managed by using an exponential

allocation scheme: at every reallocation multiply the storage space by

some factor greater than 1 (I often use 3/2, but doubling is often used

as well). This results in O(log(N)) rather than O(N) allocations as in

your code that added a constant to the size. Of course, some storage is

wasted (that /might/ be retrieved by a final realloc down to the final

size) but that's rarely significant.

So can we work it out?

What is "it"?

*Post by Malcolm McLean*Let's assume for the moment that the allocations have a semi-normal

distribution,

What allocations? The allocations I talked about don't have that

distribution.

*Post by Malcolm McLean*with negative values disallowed. Now ignoring the first few

values, if we have allocated, say, 1K, we ought to be able to predict the

value by integrating the distribution from 1k to infinity and taking the

mean.

I have no idea what you are talking about. What "value" are you looking

to calculate?

We have a continuously growing buffer, and we want the best strategy for

reallocations as the stream of characters comes at us. So, given we know how

many characters have arrived, can we predict how many will arrive, and

therefore ask for the best amount when we reallocate, so that we neither

make too many reallocations (reallocate on every byte received) nor ask for

too much (demand SIZE_MAX memory when the first byte is received)?

Obviously not, or we'd use the prediction. Your question was probably

rhetorical, but it didn't read that way.

*Post by Malcolm McLean*Your strategy for avoiding these extremes is exponential growth.

It's odd to call it mine. It's very widely known and used. "The one I

mentioned" might be a less confusing description.

*Post by Malcolm McLean*You

allocate a small amount for the first few bytes. Then you use exponential

growth, with a factor of either 2 or 1.5. My question is whether or not we

can be cuter. And of course we need to know the statistical distribution of

the input files. And I'm assuming a semi-normal distribution, ignoring the

files with small values, which we will allocate enough for anyway.

And so we integrate the distribution between the point we are at and

infinity. Then we take the mean. And that gives us a best estimate of how

many bytes are to come, and therefore how much to grow the buffer by.

I would be surprised if that were worth the effort at run time. A

static analysis of "typical" input sizes might be interesting as that

could be used to get an estimate of good factors to use, but anything

more complicated than maybe a few factors (e.g. doubling up to 1MB then

3/2 thereafter) is likely to be too messy to be useful.

There's virtually no run-time effort, unless you ask the caller to pass in a

customised distribution, which you analyse on the fly, which would be

quite a bit of work.

All the work is done beforehand. We need a statistical distribution of

the file sizes of the files we are interested in. So, probably, text

files on personal computers. Then we'll exclude the small files, say

under 1KB, which will have an odd distribution for various reasons, and

which we are not interested in as we can easily afford 1k as an initial

buffer.

And we're probably looking at a semi-normal, maybe log-normal

distribution. There's no reason to suspect it would be anything odd. And

with the normal distribution there is no closed form integral, but

tables of integrals are published.

So we convert 1K to a Z-score, integrate from that to infinity, halve

the result, and that gives us an estimate of the most likely file size -

having established that the file is over 1k, half will be below and half

above this size. So that's the next amount to realloc. Say, for the sake

of argument, 4K. Then we do the same thing, starting from 4k, and

working out the most likely file size, given that the file is over 4K.

Now the distribution tends to flatten out towards the tail, so if the best

guess, given at least 1K, was 4K, the best guess given 4K won't be 8K. It

will be 10k, maybe 12k. Do the same again for 12k. And we'll get a

series of numbers like this.

1k, 4k, 10k, 20k, 50k, 120k, 500k, 2MB, 8MB ...

and so on, rapidly increasing to SIZE_MAX. And then at runtime we just

hardcode those in, it's a lookup table with not too many entries.

Because we've chosen the median, half the time you will reallocate. You

can easily fine tune the strategy by choosing a proportion other than

0.5, depending on whether saving memory or reducing allocations is more

important to you.

And the hard part is getting some real statistics to work on.

*Post by Ben Bacarisse*Also, the cost of reallocations is not constant. Larger ones are

usually more costly than small ones, so if one were going to a lot of

effort to make run-time guesses, that cost should be factored in as

well.

Unfortunately yes. Real optimisation problems can be almost impossible

for reasons like this. If the cost of reallocations isn't constant,

you've got to put in correcting factors, and then what was a fairly

simple procedure becomes difficult.

--

Check out my hobby project.

http://malcolmmclean.github.io/babyxrc