Post by Ben Bacarisse
Post by Malcolm McLean
Post by Ben Bacarisse
Post by Malcolm McLean
Post by Ben Bacarisse
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience, or are there any concrete
factors that determine when this relocation happens? I could imagine
that it's no issue as long as you're in some kB buffer range, but if,
say, we're using realloc() to substantially increase buffers often, it
might be an issue to consider. It would be good to get some feeling
for those internals.
There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.
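In C that might look something like the following. It is a sketch
only: the struct, the initial capacity of 16 and the 3/2 factor are
arbitrary choices, not anything from the original post.

#include <stdlib.h>

/* A byte buffer that multiplies its capacity by 3/2 whenever it
   fills up.  Start from a zero-initialised struct buf. */
struct buf {
    char *data;
    size_t len;
    size_t cap;
};

static int buf_push(struct buf *b, char c)
{
    if (b->len == b->cap) {
        size_t newcap = b->cap ? b->cap + b->cap / 2 : 16;
        char *p = realloc(b->data, newcap);
        if (p == NULL)
            return -1;          /* old block is still valid */
        b->data = p;
        b->cap = newcap;
    }
    b->data[b->len++] = c;
    return 0;
}

A final realloc(b->data, b->len) can trim the slack once the input
has all arrived.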
So can we work it out?
What is "it"?
Post by Malcolm McLean
Let's assume for the moment that the allocations have a semi-normal
distribution,
What allocations? The allocations I talked about don't have that
distribution.
Post by Malcolm McLean
with negative values disallowed. Now ignoring the first few values, if
we have allocated, say, 1K, we ought to be able to predict the value
by integrating the distribution from 1K to infinity and taking the
mean.
I have no idea what you are talking about. What "value" are you looking
to calculate?
We have a continuously growing buffer, and we want the best strategy for
reallocations as the stream of characters comes at us. So, given we know
how many characters have arrived, can we predict how many will arrive,
and therefore ask for the best amount when we reallocate, so that we
neither make too many reallocations (reallocate on every byte received)
nor ask for too much (demand SIZE_MAX memory when the first byte is
received)?
Obviously not, or we'd use the prediction. Your question was probably
rhetorical, but it didn't read that way.
Post by Malcolm McLean
Your strategy for avoiding these extremes is exponential growth.
It's odd to call it mine. It's very widely known and used. "The one I
mentioned" might be a less confusing description.
Post by Malcolm McLean
You allocate a small amount for the first few bytes. Then you use
exponential growth, with a factor of either 2 or 1.5. My question is
whether or not we can be cuter. And of course we need to know the
statistical distribution of the input files. And I'm assuming a
semi-normal distribution, ignoring the files with small values, which
we will allocate enough for anyway.
And so we integrate the distribution between the point we are at and
infinity. Then we take the mean. And that gives us a best estimate of
how many bytes are to come, and therefore how much to grow the buffer
by.
I would be surprised if that were worth the effort at run time. A
static analysis of "typical" input sizes might be interesting, as that
could be used to get an estimate of good factors to use, but anything
more complicated than maybe a few factors (e.g. doubling up to 1MB then
3/2 thereafter) is likely to be too messy to be useful.
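For example, the "few factors" idea could be nothing more than a pure
capacity function like this; the 1MB switch-over is just the example
figure above, not a recommendation.

#include <stddef.h>

/* "Double up to 1MB, then grow by 3/2" as a single function. */
static size_t next_capacity(size_t cap)
{
    if (cap < 16)
        return 16;
    if (cap < (size_t)1 << 20)   /* below 1MB: double */
        return cap * 2;
    return cap + cap / 2;        /* at or above 1MB: grow by 3/2 */
}

The run-time cost is a couple of comparisons per reallocation.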
There's virtually no run-time effort, unless you ask the caller to pass
in a customised distribution which you analyse on the fly, which would
be quite a bit of work.
All the work is done beforehand. We need a statistical distribution of
the sizes of the files we are interested in. So, probably, text files
on personal computers. Then we'll exclude the small files, say under
1KB, which will have an odd distribution for various reasons, and
which we are not interested in anyway, as we can easily afford 1K as
an initial buffer.
And we're probably looking at a semi-normal, maybe log-normal
distribution. There's no reason to suspect it would be anything odd.
And while the normal distribution has no closed-form integral, tables
of its integrals are published.
So we convert 1K to a Z-score, integrate from that to infinity, halve
the result, and find the size whose tail probability equals that halved
value. That gives us an estimate of the most likely file size: having
established that the file is over 1K, half will be below and half above
this size. So that's the next amount to realloc. Say, for the sake of
argument, 4K. Then we do the same thing, starting from 4K, and working
out the most likely file size, given that the file is over 4K.
Now the distribution tends to flatten out towards the tail, so if the
best guess, given at least 1K, was 4K, the best guess given 4K won't be
8K. It will be 10K, maybe 12K. Do the same again for 12K. And we'll get
a series of numbers like this.
1k, 4k, 10k, 20k, 50k, 120k, 500k, 2MB, 8MB ...
and so on, rapidly increasing to SIZE_MAX. And then at runtime we just
hardcode those in; it's a lookup table with not too many entries.
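Something like this could generate the table offline. The log-normal
parameters here are made up (real values would have to come from
measured file sizes); it uses the survival function
S(x) = erfc((ln x - MU)/(SIGMA*sqrt(2)))/2 and bisection to find the
size m with S(m) = S(t)/2. Compile with -lm.

#include <math.h>
#include <stdio.h>

#define MU    8.3      /* made up: median file size e^8.3, about 4K */
#define SIGMA 1.5      /* made up spread */

/* P(file size > x) for the assumed log-normal. */
static double survival(double x)
{
    return 0.5 * erfc((log(x) - MU) / (SIGMA * sqrt(2.0)));
}

/* Find m > t with S(m) = S(t)/2: the size that half of the files
   larger than t will stay under. */
static double conditional_median(double t)
{
    double target = survival(t) / 2.0;
    double lo = t, hi = t;

    while (survival(hi) > target)    /* bracket the answer */
        hi *= 2.0;
    for (int i = 0; i < 100; i++) {  /* bisect */
        double mid = (lo + hi) / 2.0;
        if (survival(mid) > target)
            lo = mid;
        else
            hi = mid;
    }
    return hi;
}

int main(void)
{
    double t = 1024.0;               /* start at the 1K initial buffer */
    for (int i = 0; i < 10; i++) {
        t = conditional_median(t);
        printf("%.0f\n", t);         /* next entry for the table */
    }
    return 0;
}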
Because we've chosen the median, half the time you will reallocate
again. You can easily fine-tune the strategy by choosing a proportion
other than 0.5, depending on whether saving memory or reducing
allocations is more important to you.
And the hard part is getting some real statistics to work on.
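The run-time side is then just a table walk over hardcoded figures;
the ones below are the illustrative numbers from above, not real
statistics.

#include <stddef.h>

/* Precomputed thresholds, hardcoded (1K ... 500K, 2MB, 8MB). */
static const size_t grow_table[] = {
    1024, 4096, 10240, 20480, 51200, 122880,
    512000, 2097152, 8388608,
};

/* Return the first table entry above the current capacity; once past
   the end of the table, fall back to doubling. */
static size_t next_size(size_t cap)
{
    size_t n = sizeof grow_table / sizeof grow_table[0];
    for (size_t i = 0; i < n; i++)
        if (grow_table[i] > cap)
            return grow_table[i];
    return cap * 2;
}

Choosing a proportion other than 0.5 only changes the numbers in the
table; the lookup itself stays the same.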
Post by Ben Bacarisse
Also, the cost of reallocations is not constant. Larger ones are
usually more costly than small ones, so if one were going to a lot of
effort to make run-time guesses, that cost should be factored in as
well.
Unfortunately yes. Real optimisation problems can be almost impossible
for reasons like this. If the cost of reallocations isn't constant,
you've got to put in correcting factors, and then what was a fairly
simple procedure becomes difficult.
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc