Discussion:
realloc() - frequency, conditions, or experiences about relocation?
Janis Papanagnou
2024-06-17 06:08:07 UTC
Permalink
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience, or are there any concrete
facts, about the conditions under which this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often, it might be an issue to consider. It would be good to get some
feeling for these internals.

Janis
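
One crude way to get a feeling for it is to watch whether realloc()
hands back a different address as a buffer grows. A minimal sketch
(the 4 KB step and the iteration count are arbitrary choices; what you
actually observe depends entirely on the allocator):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
    size_t size = 4096;                      /* arbitrary starting size */
    char *buf = malloc(size);
    if (buf == NULL)
        return EXIT_FAILURE;

    for (int i = 0; i < 20; i++) {
        uintptr_t before = (uintptr_t)buf;   /* address before growing */
        size += 4096;                        /* grow by a fixed 4 KB step */
        char *tmp = realloc(buf, size);
        if (tmp == NULL) {                   /* old block is still valid */
            free(buf);
            return EXIT_FAILURE;
        }
        buf = tmp;
        printf("%zu bytes: %s\n", size,
               (uintptr_t)buf == before ? "grown in place" : "relocated");
    }
    free(buf);
    return EXIT_SUCCESS;
}
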
Chris M. Thomasson
2024-06-17 06:34:49 UTC
Permalink
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor.
[...]

It sure can be a factor to consider. Usually it's pretty nice, but not so
much for certain workloads.
Ben Bacarisse
2024-06-17 09:18:40 UTC
Permalink
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.
--
Ben.
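
A minimal sketch of that scheme (the function name, the 64-byte initial
capacity, and the 3/2 factor are illustrative choices, and overflow
checks are omitted):

#include <stdlib.h>

/* Grow *cap to hold at least 'needed' bytes, multiplying by 3/2 each
 * time.  Returns the new pointer, or NULL on failure (the old block is
 * then left intact). */
static void *grow_buffer(void *buf, size_t *cap, size_t needed)
{
    if (*cap >= needed)
        return buf;
    size_t newcap = *cap < 64 ? 64 : *cap;
    while (newcap < needed)
        newcap += newcap / 2;        /* 3/2; change to *= 2 for doubling */
    void *tmp = realloc(buf, newcap);
    if (tmp != NULL)
        *cap = newcap;
    return tmp;
}

Because the capacity grows geometrically, filling a buffer to N bytes one
chunk at a time needs only O(log(N)) realloc() calls, and the total bytes
copied by relocations stays O(N).
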
Malcolm McLean
2024-06-17 09:31:59 UTC
Permalink
Post by Ben Bacarisse
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.
So can we work it out?

Let's assume for the moment that the allocations have a semi-normal
distribution, with negative values disallowed. Now ignoring the first
few values, if we have allocated, say, 1K, we ought to be able to
predict the value by integrating the distribution from 1k to infinity
and taking the mean.
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
Ben Bacarisse
2024-06-17 09:55:33 UTC
Permalink
Post by Malcolm McLean
Post by Ben Bacarisse
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.
So can we work it out?
What is "it"?
Post by Malcolm McLean
Let's assume for the moment that the allocations have a semi-normal
distribution,
What allocations? The allocations I talked about don't have that
distribution.
Post by Malcolm McLean
with negative values disallowed. Now ignoring the first few
values, if we have allocated, say, 1K, we ought to be able to predict the
value by integrating the distribution from 1k to infinity and taking the
mean.
I have no idea what you are talking about. What "value" are you looking
to calculate?
--
Ben.
Malcolm McLean
2024-06-17 12:45:19 UTC
Permalink
Post by Ben Bacarisse
Post by Malcolm McLean
Post by Ben Bacarisse
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.
So can we work it out?
What is "it"?
Post by Malcolm McLean
Let's assume for the moment that the allocations have a semi-normal
distribution,
What allocations? The allocations I talked about don't have that
distribution.
Post by Malcolm McLean
with negative values disallowed. Now ignoring the first few
values, if we have allocated, say, 1K, we ought to be able to predict the
value by integrating the distribution from 1k to infinity and taking the
mean.
I have no idea what you are talking about. What "value" are you looking
to calculate?
We have a continuously growing buffer, and we want the best strategy for
reallocations as the stream of characters comes at us. So, given we know
how many characters have arrived, can we predict how many will arrive,
and therefore ask for the best amount when we reallocate, so that we
neither make too many reallocations (reallocate on every byte received)
nor ask for too much (demand SIZE_MAX memory when the first byte is
received)?
Your strategy for avoiding these extremes is exponential growth. You
allocate a small amount for the first few bytes. Then you use
exponential growth, with a factor of either 2 or 1.5. My question is
whether or not we can be cuter. And of course we need to know the
statistical distribution of the input files. And I'm assuming a
semi-normal distribution, ignoring the files with small values, which we
will allocate enough for anyway.

And so we integrate the distribution from the point we are at to
infinity and take the mean of that tail - that is, the conditional
expectation E[X | X > bytes received so far]. That gives us a best
estimate of how many bytes are to come, and therefore how much to grow
the buffer by.
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
Ben Bacarisse
2024-06-17 14:33:47 UTC
Permalink
Post by Malcolm McLean
Post by Ben Bacarisse
Post by Malcolm McLean
Post by Ben Bacarisse
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.
So can we work it out?
What is "it"?
Post by Malcolm McLean
Let's assume for the moment that the allocations have a semi-normal
distribution,
What allocations? The allocations I talked about don't have that
distribution.
Post by Malcolm McLean
with negative values disallowed. Now ignoring the first few
values, if we have allocated, say, 1K, we ought to be able to predict the
value by integrating the distribution from 1k to infinity and taking the
mean.
I have no idea what you are talking about. What "value" are you looking
to calculate?
We have a continuously growing buffer, and we want the best strategy for
reallocations as the stream of characters comes at us. So, given we now how
many characters have arrived, can we predict how many will arrive, and
therefore ask for the best amount when we reallocate, so that we neither
make too many reallocation (reallocate on every byte received) or ask for
too much (demand SIZE_MAX memory when the first byte is received).?
Obviously not, or we'd use the prediction. Your question was probably
rhetorical, but it didn't read that way.
Post by Malcolm McLean
Your strategy for avoiding these extremes is exponential growth.
It's odd to call it mine. It's very widely known and used. "The one I
mentioned" might be a less confusing description.
Post by Malcolm McLean
You
allocate a small amount for the first few bytes. Then you use exponential
growth, with a factor of ether 2 or 1.5. My question is whether or not we
can be cuter. And of course we need to know the statistical distribution of
the input files. And I'm assuming a semi-normal distribution, ignoring the
files with small values, which we will allocate enough for anyway.
And so we integrate the distribution between the point we are at and
infinity. Then we tkae the mean. And that gives us a best estimate of how
many bytes are to come, and therefore how much to grow the buffer by.
I would be surprised if that were worth the effort at run time. A
static analysis of "typical" input sizes might be interesting as that
could be used to get an estimate of good factors to use, but anything
more complicated than maybe a few factors (e.g. doubling up to 1MB then
3/2 thereafter) is likely to be too messy to be useful.

Also, the cost of reallocations is not constant. Larger ones are
usually more costly than small ones, so if one were going to a lot of
effort to make run-time guesses, that cost should be factored in as
well.
--
Ben.
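
For what it's worth, that "few factors" idea can be as small as this
(the 1 MB threshold is just the example figure from above):

#include <stddef.h>

/* Two-tier growth: double small buffers, grow large ones by 3/2. */
static size_t next_cap(size_t cap)
{
    if (cap < 1024 * 1024)
        return cap * 2;          /* below 1 MB: double */
    return cap + cap / 2;        /* 1 MB and up: multiply by 3/2 */
}
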
Anton Shepelev
2024-06-17 15:10:34 UTC
Permalink
Post by Ben Bacarisse
Post by Malcolm McLean
We have a continuously growing buffer, and we want the
best strategy for reallocations as the stream of
characters comes at us. So, given we now how many
characters have arrived, can we predict how many will
arrive, and therefore ask for the best amount when we
reallocate, so that we neither make too many
reallocation (reallocate on every byte received) or ask
for too much (demand SIZE_MAX memory when the first byte
is received).?
Obviously not, or we'd use the prediction.
Not so obvious to me, for the exponential algorithm may be
the best when the distribution of buffer size is /not/
known, whereas Malcolm is interested in the cases when we
know it.
Post by Ben Bacarisse
Post by Malcolm McLean
Your strategy for avoiding these extremes is exponential
growth.
It's odd to call it mine. It's very widely know and used.
"The one I mentioned" might be less confusing description.
I think it is a modern English idiom, which I dislike as
well. StackOverflow is full of questions starting like:
"How do you do this?" and "How do I do that?" They are
informal ways of the more literary "How does one do this?"
or "What is the way to do that?"
--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments
Tim Rentsch
2024-06-18 07:09:24 UTC
Permalink
Post by Anton Shepelev
[next is a comment from Malcolm]
Post by Malcolm McLean
Your strategy for avoiding these extremes is exponential
growth.
It's odd to call it mine. It's very widely know and used.
"The one I mentioned" might be less confusing description.
I think it is a modern English idiom, which I dislike as
"How do you do this?" and "How do I do that?" They are
informal ways of the more literary "How does one do this?"
or "What is the way to do that?"
I have a different take here. First the "your" of "your
strategy" reads as a definite pronoun, meaning it refers
specifically to Ben and not to some unknown other party.
(And incidentally is subtly insulting because of that,
whether it was meant that way or not.)

Second the use of "you" to mean an unspecified other person
is not idiom but standard usage. The word "you" is both a
definite pronoun and an indefinite pronoun, depending on
context. The word "they" also has this property. Consider
these two examples:

The bank downtown was robbed. They haven't been caught
yet.

They say the sheriff isn't going to run for re-election.

In the first example "they" is a definite pronoun, referring
to the people who robbed the bank. In the second example,
"they" is an indefinite pronoun, referring to unspecified
people in general (perhaps but not necessarily everyone).
The word "you" is similar: it can mean specifically the
listener, or it can mean generically anyone in a broader
audience, even those who never hear or read the statement
with "you" in it.

The word "one" used as a pronoun is more formal, and to me
at least often sounds stilted. In US English "one" is most
often an indefinite pronoun, either second person or third
person. But "one" can also be used as a first person
definite pronoun (referring to the speaker), which an online
reference tells me is chiefly British English. (I would
guess that this usage predominates in "the Queen's English"
dialect of English, but I have very little experience in
such things.)

Finally I would normally read "I" as a first person definite
pronoun, and not an indefinite pronoun. So I don't have any
problem with someone saying "how should I ..." when asking
for advice. They aren't asking how someone else should ...
but how they should ..., and what advice I might give could
very well depend on who is doing the asking.
Malcolm McLean
2024-06-18 10:19:12 UTC
Permalink
Post by Tim Rentsch
Post by Malcolm McLean
Your strategy for avoiding these extremes is exponential
growth.
[...]
I have a different take here. First the "your" of "your
strategy" reads as a definite pronoun, meaning it refers
specifically to Ben and not to some unknown other party.
(And incidentally is subtly insulting because of that,
whether it was meant that way or not.)
[...]
Ben said
Post by Tim Rentsch
Restore snipped Ben upthread
"In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well)."

So it's open and shut, and no two ways about it. Ben's strategy is
exponential growth. And to be fair I use that strategy myself in
functions like fslutp(). It's only not Ben's strategy if we mean to
imply that Ben was the first person to use exponential growth, or the
first to understand the mathematical implications, and of course that's
not the case. It was all worked out by Euler long before any of us were
born.

The question is whether we can be a bit more rigorous than "we need
exponential growth, let's double on each reallocation. Actually, that
looks a bit greedy. Try 3/2". And we can do that. We can put it on a
sounder statistical footing. Whether it's actually worth it or not is a
different matter.
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
Malcolm McLean
2024-06-18 10:46:36 UTC
Permalink
Post by Malcolm McLean
[...]
The question is whether we can be a bit more rigorous than "we need
exponential growth, let's double on each reallocation. Actually, that
looks a bit greedy. Try 3/2". And we can do that. We can put it on a
sounder statistical footing. Whether it's actually worth it or not is a
different matter.
Here are some real stats on file sizes, in case anyone is interested.

Data set / OS     Log-normal median   Log-normal mean   Arithmetic mean   50% occupied by
whole data set    9.0 KB              730 KB            1.5 MB            < 5.4 KB
Mac OS            8.0 KB              533 KB            1.4 MB            < 4.9 KB
Windows           11.5 KB             1.0 MB            1.7 MB            < 8.3 KB
GNU/Linux         10.8 KB             1.7 MB            2.2 MB            < 4.8 KB

https://www.researchgate.net/publication/353066615_How_Big_Are_Peoples%27_Computer_Files_File_Size_Distributions_Among_User-managed_Collections
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
Lawrence D'Oliveiro
2024-06-29 00:14:46 UTC
Permalink
Post by Malcolm McLean
Here are some real stats on file sizes, in case anone is interested.
Data set, / OS Log-normal median & mean, Arithmetic mean, 50% occupied
by (< mean)
whole data set, 9.0 KB, 730 KB, 1.5 MB < 5.4 KB
Mac OS 8.0 KB, 533 KB, 1.4 MB < 4.9 KB
Windows 11.5 KB, 1.0 MB, 1.7 MB < 8.3 KB
GNU/Linux 10.8 KB, 1.7MB, 2.2 MB < 4.8 KB
https://www.researchgate.net/publication/353066615_How_Big_Are_Peoples%27_Computer_Files_File_Size_Distributions_Among_User-managed_Collections
I don’t see any error bars. Without those, it's hard to attach any
significance to the differences in figures.
Malcolm McLean
2024-07-02 09:18:32 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Malcolm McLean
Here are some real stats on file sizes, in case anone is interested.
Data set, / OS Log-normal median & mean, Arithmetic mean, 50% occupied
by (< mean)
whole data set, 9.0 KB, 730 KB, 1.5 MB < 5.4 KB
Mac OS 8.0 KB, 533 KB, 1.4 MB < 4.9 KB
Windows 11.5 KB, 1.0 MB, 1.7 MB < 8.3 KB
GNU/Linux 10.8 KB, 1.7MB, 2.2 MB < 4.8 KB
https://www.researchgate.net/publication/353066615_How_Big_Are_Peoples%27_Computer_Files_File_Size_Distributions_Among_User-managed_Collections
I don’t see any error bars. Without those, it hard to attach any
significance to the differences in figures.
You don't need error bars because those figures indicate a distribution.
The files are log-normally distributed with given means and medians. So
the spread is part of that data.

(Strictly you do need N to assess the significance, but it's got to be
one in a gazillion; the null hypothesis that there is no difference at all
in file sizes between the platforms doesn't have a prayer. Unless you argue
that because the files came from one machine each, N is actually 1.)
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
Ben Bacarisse
2024-07-02 15:39:21 UTC
Permalink
Post by Malcolm McLean
Post by Lawrence D'Oliveiro
Post by Malcolm McLean
Here are some real stats on file sizes, in case anone is interested.
Data set, / OS Log-normal median & mean, Arithmetic mean, 50% occupied
by (< mean)
whole data set, 9.0 KB, 730 KB, 1.5 MB < 5.4 KB
Mac OS 8.0 KB, 533 KB, 1.4 MB < 4.9 KB
Windows 11.5 KB, 1.0 MB, 1.7 MB < 8.3 KB
GNU/Linux 10.8 KB, 1.7MB, 2.2 MB < 4.8 KB
https://www.researchgate.net/publication/353066615_How_Big_Are_Peoples%27_Computer_Files_File_Size_Distributions_Among_User-managed_Collections
I don’t see any error bars. Without those, it hard to attach any
significance to the differences in figures.
You don't need error bars becuase those fugures indicate a
distribution. The file are log-normally distributed wth given means
and median. So the spread is part of that data.
There are (or should be) two different distributions. Error bars are
intended to show the spread within the data. The log normal
distribution is across the data.

Now I suspect they didn't do it this way and just amalgamated all the
file size data into one, but that explains rather than excuses the lack
of error bars!
--
Ben.
Lawrence D'Oliveiro
2024-07-03 23:48:41 UTC
Permalink
Post by Malcolm McLean
The file are log-normally distributed wth given means and median. So the
spread is part of that data.
That’s an assumption of the parametric fit, not a fact of the data. Error
bars would indicate how close the fit is.
Tim Rentsch
2024-06-24 16:32:40 UTC
Permalink
Post by Malcolm McLean
Post by Tim Rentsch
[...]
Ben said
Post by Tim Rentsch
Restore snipped Ben upthread
"In practice, the cost is usually moderate and can be very
effectively managed by using an exponential allocation scheme: at
every reallocation multiply the storage space by some factor greater
than 1 (I often use 3/2, but doubling is often used as well)."
So it's open and shut, and no two ways about it. Ben's strategy is
exponential growth. And to be fair I use that strategy myself in
functions like fslutp(). It's only not Ben's strategy if we mean to
imply that Ben was the first person to use exponential growth, or the
first to understand the mathematical implications, and of course
that's not the case. It was all worked out by Euler long before any
of us were born. [...]
You have an annoying habit. Your writing often comes across as
authoritarian and somewhat condescending. Furthermore you tend not
to listen very well. Your response above is a case in point. You
ignore what I'm talking about (which is not whether Ben uses an
exponential growth strategy, or whether such a strategy is "Ben's"
or not), and instead talk about something that is irrelevant to what
I was saying. You have completely missed the point. Your comments
do nothing to extend the conversation. From where I sit all they do
is cause irritation and illustrate how muddled your thinking is.
I'm sure this isn't the first time you've heard comments along these
lines. It would be nice if you would make an effort to improve
your behavior in light of these repeated comments.
David Brown
2024-06-24 17:19:38 UTC
Permalink
Post by Tim Rentsch
You have an annoying habit. Your writing often comes across as
authoritarian and somewhat condescending. Furthermore you tend not
to listen very well.
The irony of that post is /astounding/.

I have met few people with a greater knowledge and insight in the C
language than you. And I have met few with less self-insight.
Anton Shepelev
2024-06-19 23:51:56 UTC
Permalink
Cross-posting to alt.english.usage .
Post by Tim Rentsch
Post by Anton Shepelev
I think it is a modern English idiom, which I dislike as
"How do you do this?" and "How do I do that?" They are
informal ways of the more literary "How does one do
this?" or "What is the way to do that?"
I have a different take here. First the "your" of "your
strategy" reads as a definite pronoun, meaning it refers
specifically to Ben and not to some unknown other party.
And I am /sure/ it is intended in the general (indefinite) sense,
Post by Tim Rentsch
You allocate a small amount for the first few bytes. Then
you use exponential growth, with a factor of ether 2 or
1.5.
This is the typical wording of impersonal instruction in
modern English.
Post by Tim Rentsch
(And incidentally is subtly insulting because of that,
whether it was meant that way or not.)
Yes! My first impulse is always to interpret those pronouns
according to their literal (definite) meanings, which gives
the text an insulting (because presumptuous) taint. This is
why I wince at the indefinite usage of first- and second-
person pronouns.
Post by Tim Rentsch
Second the use of "you" to mean an unspecified other
person is not idiom but standard usage.
`Idiomatic' can mean `standard':

Of or pertaining to, or conforming to, the mode of
expression peculiar to a language; as, an idiomatic
meaning; an idiomatic phrase.
Post by Tim Rentsch
The word "you" is both a definite pronoun and an
indefinite pronoun, depending on context.
It /is/ used as an indefinite pronoun, but is not widely
recognised as capable of that function:

https://eslgrammar.org/indefinite-pronouns/
https://dictionary.cambridge.org/grammar/british-grammar/indefinite-pronouns

And when it is, only as an informal relaxation:

https://www.englishclub.com/grammar/pronouns-indefinite.php

These recent informal usages can be ugly.
Post by Tim Rentsch
The word "they" also has this property.
I know it, and agree:

They took some honey from a tree,
Dressed it up and then called it me.
<https://on.soundcloud.com/NGz7nZgzm4hiQQbd6>

The indefinite `they' can be used formally as well.
Post by Tim Rentsch
The word "you" is similar: it can mean specifically the
listener, or it can mean generically anyone in a broader
audience, even those who never hear or read the statement
with "you" in it.
Modern teenagers definitely see it that way, and I have to
clench my teeth and adapt.
Post by Tim Rentsch
The word "one" used as a pronoun is more formal, and to me
at least often sounds stilted. In US English "one" is
most often an indefinite pronoun, either second person or
third person.
How can it be a second-person pronoun? The famous phrase
"One never knows, do one?" features a third-person `one'
with a dialectal third-person `do'.
Post by Tim Rentsch
But "one" can also be used as a first person definite
pronoun (referring to the speaker), which an online
reference tells me is chiefly British English.
I had no idea it could, nor does Wikipedia. Can you share
an example of a definite first-person `one'?
Post by Tim Rentsch
Finally I would normally read "I" as a first person
definite pronoun, and not an indefinite pronoun.
And so would I, and I hate the indefinte usage.
Post by Tim Rentsch
So I don't have any problem with someone saying "how
should I ..." when asking for advice. They aren't asking
how someone else should ... but how they should ..., and
what advice I might give could very well depend on who is
doing the asking.
The problem is, in 99% of cases no personal information is
given that could possibly justify the personal wording of
the question.
--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments
vallor
2024-06-20 00:34:43 UTC
Permalink
Post by Anton Shepelev
Cross-posting to alt.english.usage .
I've set the followup-to: same
Post by Anton Shepelev
But "one" can also be used as a first person definite pronoun
(referring to the speaker), which an online reference tells me is
chiefly British English.
I had no idea it could, nor does Wikipedia. Can you share an example of
a definite first-person `one'?
I think that would go something like:

"One wonders how..."

(Also, in some writing, "this one" can replace "I" -- this one used to do
that on Usenet, but was told it was pretentious.)
--
-v
David Brown
2024-06-21 10:59:43 UTC
Permalink
Post by vallor
Post by Anton Shepelev
Cross-posting to alt.english.usage .
I've set the followup-to: same
I've put it back to comp.lang.c. It's off-topic for that group, but
that's where people were discussing it, and it seems to be of at least
some interest to some regulars here. (I suppose maybe those regulars
are also followers of alt.english.usage.)
Post by vallor
Post by Anton Shepelev
But "one" can also be used as a first person definite pronoun
(referring to the speaker), which an online reference tells me is
chiefly British English.
I had no idea it could, nor does Wikipedia. Can you share an example of
a definite first-person `one'?
"One wonders how..."
It is /possible/, but archaic and very upper-class. You'd immediately
suspect the speaker is wearing a top hat.

It is much more common to say simply "I wonder how...", followed by "You
wonder how..." (with "you" being generic rather than specifically the
person listening). Such usage will be more common in some dialects than
others.
Post by vallor
(Also, in some writing, "this one" can replace "I" -- this one used to do
that on Usenet, but was told it was pretentious.)
That would be even less common. The form, again archaic, would be more
usually written as "This writer used to do that..." (or "This
programmer", or however the person wanted to style himself/herself).
Keith Thompson
2024-06-21 17:31:22 UTC
Permalink
Post by David Brown
Post by vallor
Post by Anton Shepelev
Cross-posting to alt.english.usage .
I've set the followup-to: same
I've put it back to comp.lang.c. It's off-topic for that group,
Please don't do that.

[...]
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Kenny McCormack
2024-06-20 12:10:53 UTC
Permalink
Post by Anton Shepelev
Post by Tim Rentsch
Post by Anton Shepelev
I think it is a modern English idiom, which I dislike as
"How do you do this?" and "How do I do that?" They are
informal ways of the more literary "How does one do
this?" or "What is the way to do that?"
I have a different take here. First the "your" of "your
strategy" reads as a definite pronoun, meaning it refers
specifically to Ben and not to some unknown other party.
And I am /sure/ it is intended in the general (indefinite)
This sub-thread is certainly interesting, but it ultimately smacks of
people looking for ways to feel insulted. Victimhood complex, and all that.

But, it makes me think that the problem is the basic paradigm of newsgroup
(i.e., online forum) communication being thought of as personalized. I.e.,
as in direct person-to-person communication - as if it was being spoken in
a real room with real people (face-to-face). The fact is, it is not. It
would be better if we didn't think of it that way. Rather, it should be
thought of as communication between the speaker and an anonymous void.
I.e., I'm not talking to you - I am talking to the anonymous void.
Everybody is.

Sort of like in the (US) House of Representatives - members are not ever
supposed to be talking to each other. Rather, they are always speaking to
the void.

Like I am doing now.

This is also why it is good (And, yes, I know this goes against the CW) to
drop attributions, as I have done here. Keep it anonymous.
--
First of all, I do not appreciate your playing stupid here at all.

- Thomas 'PointedEars' Lahn -
Janis Papanagnou
2024-06-20 13:04:00 UTC
Permalink
Post by Kenny McCormack
This sub-thread is certainly interesting, but it ultimately smacks of
people looking for ways to feel insulted. Victimhood complex, and all that.
But, it makes me think that the problem is the basic paradigm of newsgroup
(i.e., online forum) communication being thought of as personalized. I.e.,
as in direct person-to-person communication - as if it was being spoken in
a real room with real people (face-to-face). The fact is, it is not. It
would be better if we didn't think of it that way. Rather, it should be
thought of as communication between the speaker and an anonymous void.
I.e., I'm not talking to you - I am talking to the anonymous void.
Everybody is.
Sort of like in the (US) House of Representatives - members are not ever
supposed to be talking to each other. Rather, they are always speaking to
the void.
Like I am doing now.
This is also why it is good (And, yes, I know this goes against the CW) to
drop attributions, as I have done here. Keep it anonymous.
Part 1

This is hard to achieve given that the technical NG infrastructure and
functions "logically" connect the articles; it's only a little burden
to identify (if not already obvious) the addressee.

I think it would be better to try to stay on the issue as opposed to
reply (as so often done) ad hominem (where arguments don't seem to
exist or don't help anymore). This is of course yet more difficult to
achieve and will in practice [also] not work in Usenet (I'm sure).

Language can be used or interpreted in personal or impersonal forms.
Some communication forms - and more so their semantical contents! -
are (beyond the "you" vs. "one" dichotomy) inherently [set up to be]
personal.

-- Anonymous
:-)

Part 2

That all said, I think it's important to know who said/posted what.
It allows one to associate personal context/background information when
replying. You can also be more assured about the quality of contents
(to the good or bad) or even ignore certain posts. It saves time and
protects one's health and mental sanity.[*]

There are of course also posts (or threads) that just exchange opinions,
and "we" know everyone [typically] has an opinion (and often even in
cases where they are put up against facts). Some folks are known to
post a lot, respond to every thread that appears, contributing facts
(sometimes) but also opinions (or personal offenses); this may be a
nuisance (or just ignored, unless "anonymously" posted).

So far my opinion on this non-technical meta-topic subthread.

Janis
(Darn, I disclosed my identity!)

[*] Wasn't that the inherent problem of all those "social media"
platforms? - Where anonymous posts - and some say that anonymity
does negatively contribute to the language and contents of such
disturbing posts - lead to barbarian communication conditions.
(I know that only from hearsay but it seems common perception.)
Cri-Cri
2024-06-21 00:55:09 UTC
Permalink
But "one" can also be used as a first person definite pronoun
(referring to the speaker), which an online reference tells me is
chiefly British English.
We have the same construct in Swedish: 'en', as in "en kanske skulle kunna
ta en banan", meaning "one could perhaps have a banana." Referring to
oneself from an outside perspective, in, for instance, a situation where
there used to be several items on offer to guests, but now there are only
bananas and some dry sponge cake left. IOW, of two less desirable items,
one could accept a banana.
--
Cri-Cri
Tim Rentsch
2024-06-21 06:23:29 UTC
Permalink
Post by Anton Shepelev
Cross-posting to alt.english.usage .
Post by Tim Rentsch
Post by Anton Shepelev
I think it is a modern English idiom, which I dislike as
"How do you do this?" and "How do I do that?" They are
informal ways of the more literary "How does one do
this?" or "What is the way to do that?"
I have a different take here. First the "your" of "your
strategy" reads as a definite pronoun, meaning it refers
specifically to Ben and not to some unknown other party.
And I am /sure/ it is intended in the general (indefinite)
sense,
I don't know why you think that. I don't know any native
English speakers who would read it as other than referring
to the person being responded to (who was Ben in this case).
Responding to Malcolm's statement, Ben said "It's odd to
call it mine", so it seems Ben also read it as a definite
pronoun, referring to himself.

As for the rest of your comments, you're reaching bad
conclusions because you're not looking in the right places.
To investigate the meaning and usage of words and phrases,
the best first place to look is always a dictionary. In
the process of writing my earlier response, I consulted
roughly half a dozen different online dictionaries, reading
definitions for 'one', 'you', 'they', 'indefinite pronoun',
'definite pronoun', and probably some other terms. I also
looked at non-dictionary sources if they looked relevant,
but my starting point was dictionaries. Oh, I also looked
up both 'idiom' and 'idiomatic' (which even though they are
related don't mean the same thing).

Incidentally, on the three pages you gave links for,
all of them had at least one example that used "you"
as an indefinite pronoun.

One point I want to clear up. A couple of times you
characterize the indefinite pronoun usage of "you"
as "modern". It is not in any sense modern. I am
a native speaker of English, speaking and reading
the language for well over 60 years. Furthermore I
was raised by a grammarian. In all of that time and
experience there was never any hint that "you" as an
indefinite pronoun was new or unusual or considered
substandard or slang or unacceptable. It is simply
standard usage, and has been for longer than my
lifetime.
Post by Anton Shepelev
Post by Tim Rentsch
But "one" can also be used as a first person definite
pronoun (referring to the speaker), which an online
reference tells me is chiefly British English.
I had no idea it could, nor does Wikipedia. Can you share
an example of a definite first-person `one'?
(A) Pick a good search engine (I use duckduckgo.com).
(B) Search for the two words one definition.
(C) Read the entries for every online dictionary that
is found, or at least the top five or six.

You should find a couple of examples. It was by going
through this process myself that I discovered "one"
is sometimes used as a first person definite pronoun.
Richard Harnden
2024-06-17 15:15:27 UTC
Permalink
Post by Ben Bacarisse
[...]
I would be surprised if that were worth the effort at run time. A
static analysis of "typical" input sizes might be interesting as that
could be used to get an estimate of good factors to use, but anything
more complicated than maybe a few factors (e.g. doubling up to 1MB then
3/2 thereafter) is likely to be too messy to be useful.
Also, the cost of reallocations is not constant. Larger ones are
usually more costly than small ones, so if one were going to a lot of
effort to make run-time guesses, that cost should be factored in as
well.
I usually keep track:

struct
{
size_t used;
size_t allocated;
void *data;
};

Then, if used + new_size is more than what's already been allocated, a
realloc will be required.

Start with an initial allocated size that's 'reasonable' - the happy path
will never need any reallocs.

Otherwise multiply by some factor. Typically I just double it.
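
Fleshed out a little, that bookkeeping might look like this (the names
are made up; the doubling and the caller-chosen initial size are as
described above, and the initial size is assumed to be non-zero):

#include <stdlib.h>
#include <string.h>

struct dynbuf {
    size_t used;
    size_t allocated;
    void *data;
};

/* Start with a 'reasonable' initial allocation. */
static int dynbuf_init(struct dynbuf *b, size_t initial)
{
    b->data = malloc(initial);
    if (b->data == NULL)
        return -1;
    b->used = 0;
    b->allocated = initial;
    return 0;
}

/* Append: realloc only when used + new_size exceeds what's allocated. */
static int dynbuf_append(struct dynbuf *b, const void *src, size_t new_size)
{
    if (b->used + new_size > b->allocated) {
        size_t want = b->allocated;
        while (want < b->used + new_size)
            want *= 2;                       /* double each time */
        void *tmp = realloc(b->data, want);
        if (tmp == NULL)
            return -1;                       /* old block still valid */
        b->data = tmp;
        b->allocated = want;
    }
    memcpy((char *)b->data + b->used, src, new_size);
    b->used += new_size;
    return 0;
}
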
Chris M. Thomasson
2024-06-17 20:05:23 UTC
Permalink
Post by Richard Harnden
[...]
I usually keep track:
struct
{
    size_t used;
    size_t allocated;
    void *data;
};
Then, if used + new_size is more than what's already been allocated then
a realloc will be required.
Fwiw, I remember way back using an n-ary tree of regions to accomplish
it. The trigger for creating a new region was very similar to your
logic. It performed pretty good and did not use realloc. Indeed it was
for a special use case. I remember having region partitions that would
link to other regions. Actually, it kind of reminded me of a strange
version of ropes:

https://en.wikipedia.org/wiki/Rope_(data_structure)

Fwiw, here is my old C code for a region:

https://pastebin.com/raw/f37a23918
(raw text, no ads)

Iirc, I first mentioned it in:

https://groups.google.com/g/comp.lang.c/c/7oaJFWKVCTw/m/sSWYU9BUS_QJ
Post by Richard Harnden
Start with an initial allocated size that's 'resonable' - the happy path
will never need any reallocs.
Otherwise multiply by some factor.  Typicall I just double it.
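
Not Chris's actual code (that is behind the pastebin link above), but a
rough sketch of the general idea: when the current region fills, link a
new one on instead of calling realloc(), so bytes already written never
move. Names and the doubling policy are illustrative:

#include <stdlib.h>
#include <string.h>

/* One region in a singly linked chain; data never moves once written. */
struct region {
    struct region *next;
    size_t used;
    size_t cap;
    unsigned char data[];        /* C99 flexible array member */
};

static struct region *region_new(size_t cap)
{
    struct region *r = malloc(sizeof *r + cap);
    if (r != NULL) {
        r->next = NULL;
        r->used = 0;
        r->cap = cap;
    }
    return r;
}

/* Append into the tail region, chaining a larger one when it fills.
 * Returns the (possibly new) tail, or NULL on allocation failure. */
static struct region *region_append(struct region *tail,
                                    const void *src, size_t n)
{
    if (tail->used + n > tail->cap) {
        size_t newcap = tail->cap * 2 > n ? tail->cap * 2 : n;
        struct region *next = region_new(newcap);
        if (next == NULL)
            return NULL;
        tail->next = next;
        tail = next;
    }
    memcpy(tail->data + tail->used, src, n);
    tail->used += n;
    return tail;
}
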
Malcolm McLean
2024-06-17 15:36:57 UTC
Permalink
Post by Ben Bacarisse
[...]
I would be surprised if that were worth the effort at run time. A
static analysis of "typical" input sizes might be interesting as that
could be used to get an estimate of good factors to use, but anything
more complicated than maybe a few factors (e.g. doubling up to 1MB then
3/2 thereafter) is likely to be too messy to useful.
There's virtually no run-time effort, unless you ask the caller to pass in
a customised distribution which you analyse on the fly, which would be
quite a bit of work.
All the work is done beforehand. We need a statistical distribution of
the sizes of the files we are interested in. So, probably, text
files on personal computers. Then we'll exclude the small files, say
under 1 kB, which will have an odd distribution for various reasons, and
which we are not interested in anyway as we can easily afford 1 kB as an
initial buffer.

And we're probably looking at a semi-normal, maybe log-normal
distribution. There's no reason to suspect it would be anything odd. And
with the normal distribution there is no closed-form integral, but
tables of integrals are published.

So we convert 1K to a Z-score, integrate from that to infinity, halve
the result, and that gives us an estimate of the most likely file size -
having established that the file is over 1k, half will be below and half
above this size. So that's the next amount to realloc. Say, for the sake
of argument, 4K. Then we do the same thing, starting from 4k, and
working out the most likely file size, given that the file is over 4K.
Now the distribution tends to flatten out towards the tail, so if the best
guess, given at least 1K, was 4K, the best guess given 4K won't be 8K. It
will be 10K, maybe 12K. Do the same again for 12K. And we'll get a
series of numbers like this.

1k, 4k, 10k, 20k, 50k, 120k, 500k, 2MB, 8MB ...

and so on, rapidly increasing to SIZE_MAX. And then at runtime we just
hardcode those in, it's a lookup table with not too many entries.

Because we've chosen the mean, half the time you will reallocate. You
can easily fine-tune the strategy by choosing a proportion other than
0.5, depending on whether saving memory or reducing allocations is more
important to you.

and the hard part is getting some real statistics to work on.
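
At run time the whole scheme might boil down to something like this
(the sizes in the table are placeholders, not real statistics):

#include <stddef.h>

/* Capacities precomputed offline from the file-size distribution;
   placeholder numbers only. */
static const size_t grow_table[] = {
    1u << 10, 4u << 10, 10u << 10, 20u << 10, 50u << 10,
    120u << 10, 500u << 10, 2u << 20, 8u << 20
};

/* Return the capacity to request once 'needed' bytes are required. */
static size_t next_capacity(size_t needed)
{
    for (size_t i = 0; i < sizeof grow_table / sizeof grow_table[0]; i++)
        if (grow_table[i] >= needed)
            return grow_table[i];
    return needed + needed / 2;   /* past the table, fall back to 3/2 growth */
}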
Post by Ben Bacarisse
Also, the cost of reallocations is not constant. Larger ones are
usually more costly than small ones, so if one were going to a lot of
effort to make run-time guesses, that cost should be factored in as
well.
Unfortunately yes. Real optimisation problems can be almost impossible
for reasons like this. If the cost of reallocations isn't constant,
you've got to put in correcting factors, and then what was a fairly
simple procedure becomes difficult.
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
Anton Shepelev
2024-06-17 15:02:49 UTC
Permalink
[cross-posted to: ci.stat.math]
Post by Malcolm McLean
We have a continuously growing buffer, and we want the
best strategy for reallocations as the stream of
characters comes at us. So, given we now how many
characters have arrived, can we predict how many will
arrive,
Do you mean in the next bunch, or in total (till the end of
the buffer's lifetime)?
Post by Malcolm McLean
and therefore ask for the best amount when we reallocate,
so that we neither make too many reallocation (reallocate
on every byte received) or ask for too much (demand
SIZE_MAX memory when the first byte is received).?
Your strategy for avoiding these extremes is exponential
growth. You allocate a small amount for the first few
bytes. Then you use exponential growth, with a factor of
ether 2 or 1.5.
This strategy ensures a constant ratio of the amount of
reallocated data to the length of the buffer by making
reallocations less frequent as the buffer grows.
Post by Malcolm McLean
And so we integrate the distribution between the point we
are at and infinity. Then we tkae the mean. And that gives
us a best estimate of how many bytes are to come, and
therefore how much to grow the buffer by.
You have an a priori distribution of the buffer size (can be
tracked on-the-fly, if unknown beforehand) and a partially
filled buffer. The task is to calculate the a-posteriori
distribution of /that/ buffer's final size, and then to
allocate the predicted value based on a good percentile.

How about using a percentile instead of the mean, e.g. if
the current size corresponds to percentile p, you allocate a
capacity corresponding to percentile 1-(1-p)/k , where k>1
denotes the balance between space and time efficiency. For
example, if the 60th percentile of the buffer is required
and k=2, you allocate a capacity sufficient to hold
100-(100-60)/2=80% of buffers.
--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments
unknown
2024-06-18 17:59:30 UTC
Permalink
Post by Anton Shepelev
[cross-posted to: ci.stat.math]
Post by Malcolm McLean
We have a continuously growing buffer, and we want the
best strategy for reallocations as the stream of
characters comes at us. So, given we now how many
characters have arrived, can we predict how many will
arrive,
Do you mean in the next bunch, or in total (till the end of
the buffer's lifetime)?
Post by Malcolm McLean
and therefore ask for the best amount when we reallocate,
so that we neither make too many reallocation (reallocate
on every byte received) or ask for too much (demand
SIZE_MAX memory when the first byte is received).?
Your strategy for avoiding these extremes is exponential
growth. You allocate a small amount for the first few
bytes. Then you use exponential growth, with a factor of
ether 2 or 1.5.
This strategy ensures a constant ratio between the amount of
reallocated data to the length of the buffer by making
reallocations less frequent as the buffer grows.
Post by Malcolm McLean
And so we integrate the distribution between the point we
are at and infinity. Then we tkae the mean. And that gives
us a best estimate of how many bytes are to come, and
therefore how much to grow the buffer by.
You have an apriori distribution of the buffer size (can be
tracked on-the-fly, if unknown beforehand) and a partially
filled buffer. The task is to calculate the a-posteriori
distribution of that buffer's final size, and then to
allocate the predicted value based on a good percentile.
How about using a percentile instead of the mean, e.g. if
the current size corresponds to percentile p, you allocate a
capacity corresponding to percentile 1-(1-p)/k , where k>1
denotes the balance between space and time efficency. For
example, if the 60th percentile of the buffer is required
and k=2, you allocate a capacity sufficient to hold
100-(100-60)/2=80% of buffers.
Based on essentially no background to this question, not much can be
said. However, if one starts from the suggestion above to use the mean
of some distribution (or later some percentile), one notes that the
"mean" is just the minimum of a quadratic cast function ,,, so an
improvement would be to base the choice on some more realistic cost
function, chosen for the actual application. Given that the scenario
apparently involves a sequence of such decisions, the obvious extension
of the cost-based approach would be to employ some form of dynamic
programming. Of course, this might not be appealing, in which case one
might choose the theoretically-simple approach of tuning a policy based
on good stochastic simulations of the situation.
David Duffy
2024-06-19 06:48:17 UTC
Permalink
Post by Anton Shepelev
[cross-posted to: ci.stat.math]
Post by Malcolm McLean
We have a continuously growing buffer, and we want the
best strategy for reallocations as the stream of
characters comes at us. So, given we now how many
characters have arrived, can we predict how many will
arrive,
Do you mean in the next bunch, or in total (till the end of
the buffer's lifetime)?
Isn't this a halting problem? Aren't the more important data:
how much memory the user is allowed to allocate, the properties of
the current system's memory allocation algorithm, when your stream
will have to go to disc or other slow large volume storage, how
the stream can be compressed on the fly (the latter might well give
strong predictions for future storage requirements based on what
has been read to date).
Malcolm McLean
2024-06-19 09:12:22 UTC
Permalink
Post by David Duffy
Post by Anton Shepelev
[cross-posted to: ci.stat.math]
Post by Malcolm McLean
We have a continuously growing buffer, and we want the
best strategy for reallocations as the stream of
characters comes at us. So, given we now how many
characters have arrived, can we predict how many will
arrive,
Do you mean in the next bunch, or in total (till the end of
the buffer's lifetime)?
how much memory the user is allowed to allocate, the properties of
the current system's memory allocation algorithm, when your stream
will have to go to disc or other slow large volume storage, how
the stream can be compressed on the fly (the latter might well give
strong predictions for future storage requirements based on what
has been read to date).
No. We have to have some knowledge. And what we probably know is that the
input is a file stored on someone's personal computer. And someone has
published on the statistical distribution of such files, and they have a
log-normal distribution with a mean and a median which he gives. So with
that information, we can work out, given that a file is at least N
characters, what is the probability that an allocation of any size will
contain the whole file, and how many bytes, on average, will be wasted.

Statistical analysis can't tell us what an allocation will cost versus the
cost of hogging memory we don't need, however. It can't tell us what to
do; it just puts us in the picture.
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
Ben Bacarisse
2024-06-19 15:36:01 UTC
Permalink
Post by Malcolm McLean
No. We have to have some knowledge. And what we probaby know is that the
input is a file stored on someone's personal computer. And someone has
published on the statistical distribution of such files
That's not the case that matters (to me at least). If the input is a
file, we have a much better way of "guessing" the size than guessing and
growing -- just ask for the size. Sure, we might need to make
adjustments if the file is changing, but there is always a better
measure than any statistical analysis.

To some extent this seems like a solution in search of a problem.
Growing the buffer exponentially is simple and effective.
--
Ben.
David Brown
2024-06-19 17:41:49 UTC
Permalink
Post by Ben Bacarisse
Post by Malcolm McLean
No. We have to have some knowledge. And what we probaby know is that the
input is a file stored on someone's personal computer. And someone has
published on the statistical distribution of such files
That's not the case that matters (to me at least). If the input is a
file, we have a much better way of "guessing" the size than guessing and
growing -- just ask for the size. Sure, we might need to make
adjustments if the file is changing, but there is always a better
measure than any statistical analysis.
To some extent this seems like a solution in search of a problem.
It seems more like a solution that doesn't exist in search of a problem
with absurdly unrealistic requirements. And even if Malcolm's solution
existed, and the problem existed, it /still/ wouldn't work - knowing the
distribution of file sizes tells us nothing about the size of any given
file.
Post by Ben Bacarisse
Growing the buffer exponentially is simple and effective.
Yes, that's the general way to handle buffers when you don't know what
size they should be.

A better solution for this sort of program is usually, as you say,
asking the OS for the file size (there is no standard library function
for getting the file size, but it's not hard to do for any realistic
target OS). And then for big files, prefer mmap to reading the file
into a buffer.

It's only really for unsized "files" such as piped input that you have
no way of getting the size, and then exponential growth is the way to
go. Personally, I'd start with a big size (perhaps 10 MB) that is
bigger than you are likely to need in practice, but small enough that it
is negligible on even vaguely modern computers. Then the realloc code is
unlikely to be used (but it can still be there for completeness).
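
For the file case, "asking for the size" might look something like this on
a POSIX system (a sketch only; fstat() and fileno() are POSIX rather than
standard C, and the file can of course change between the stat and the read):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Read a whole file into one malloc'd buffer sized from fstat().
   Returns NULL on failure; *out_size receives the number of bytes read. */
static char *read_whole_file(const char *path, size_t *out_size)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    struct stat st;
    if (fstat(fileno(f), &st) != 0) {
        fclose(f);
        return NULL;
    }
    char *buf = malloc((size_t)st.st_size + 1);   /* +1 for an optional NUL */
    if (buf == NULL) {
        fclose(f);
        return NULL;
    }
    size_t n = fread(buf, 1, (size_t)st.st_size, f);
    fclose(f);
    buf[n] = '\0';
    *out_size = n;
    return buf;
}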
Ben Bacarisse
2024-06-19 21:24:35 UTC
Permalink
Post by Ben Bacarisse
Growing the buffer exponentially is simple and effective.
Yes, that's the general way to handle buffers when you don't know what size
they should be.
A better solutions for this sort of program is usually, as you say, asking
the OS for the file size (there is no standard library function for getting
the file size, but it's not hard to do for any realistic target OS). And
then for big files, prefer mmap to reading the file into a buffer.
It's only really for unsized "files" such as piped input that you have no
way of getting the size, and then exponential growth is the way to go.
Personally, I'd start with a big size (perhaps 10 MB) that is bigger than
you are likely to need in practice, but small enough that it is negligible
on even vaguely modern computers. Then the realloc code is unlikely to be
used (but it can still be there for completeness).
There are other uses that have nothing to do with files. I have a small
dynamic array library (just a couple of functions) that I use for all
sorts of things. I can read a file or parse tokens or input a line just
by adding characters. Because of its rather general use, I don't start
with a large buffer (though the initial size can be set).
--
Ben.
David Brown
2024-06-20 11:22:31 UTC
Permalink
Post by Ben Bacarisse
Post by Ben Bacarisse
Growing the buffer exponentially is simple and effective.
Yes, that's the general way to handle buffers when you don't know what size
they should be.
A better solutions for this sort of program is usually, as you say, asking
the OS for the file size (there is no standard library function for getting
the file size, but it's not hard to do for any realistic target OS). And
then for big files, prefer mmap to reading the file into a buffer.
It's only really for unsized "files" such as piped input that you have no
way of getting the size, and then exponential growth is the way to go.
Personally, I'd start with a big size (perhaps 10 MB) that is bigger than
you are likely to need in practice, but small enough that it is negligible
on even vaguely modern computers. Then the realloc code is unlikely to be
used (but it can still be there for completeness).
There are other uses that have nothing to do with files.
Of course. This comment was for the specific purposes being discussed
here. For other uses, there can be many other structures and algorithms
that fit better. Exponentially increasing the size when needed is a
good general-purpose method.
Post by Ben Bacarisse
I have a small
dynamic array library (just a couple of function) that I use for all
sorts of things. I can read a file or parse tokens or input a line just
by adding characters. Because of its rather general use, I don't start
with a large buffer (though the initial size can be set).
Anton Shepelev
2024-06-19 22:53:47 UTC
Permalink
Post by Malcolm McLean
We have to have some knowledge. And what we probaby know
is that the input is a file stored on someone's personal
computer. And someone has published on the statistical
distribution of such files And they have a log-normal
distribution with a mean and a median which he gives. So
with that informaton, we can work out, given that a file
is at least N characters, what is the prbablity that an
allocation of any size will contain the whole file, and
how many bytes, on average will be wasted.
Observe that the standard algorithm of exponential growth is
memoryless and self-similar in that it does not depend on
context, or the history of previous reallocations. These
properties belong to (or even identify?) the exponential
distribution. We can therefore assume that the exponential-
growth strategy is ideal for exponentially distributed
buffer sizes, and under that assumption determine the
relation between the CDF values (p) corresponding to
consecutive reallocations:

p  = 1-e^(-L*x)  (the CDF),
p0 = 1-e^(-L*x0) ,
p1 = 1-e^(-L*x1) ,
x1 = k*x0 (by our strategy), =>
p1 = 1-(1-p0)^k .

which does not depend on the distribution and lets us
generalise this approach for any distribution:

x1 = Q( 1 - ( 1 - CDF(x0) )^k )

where:

x0 : the required size
x1 : the new recommended capacity
Q(p) : the p-Quantile of the given distribution
CDF(x): the CDF of the given distribution
k>1 : balance between speed and space efficiency
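
For the exponential distribution the rule can be sketched in a
few lines of C; any other distribution would simply supply its
own CDF and quantile function (the names below are illustrative):

#include <math.h>

/* Exponential distribution with rate L. */
static double cdf_exp(double x, double L) { return 1.0 - exp(-L * x); }
static double q_exp(double p, double L)   { return -log(1.0 - p) / L; }

/* x1 = Q( 1 - ( 1 - CDF(x0) )^k ); for the exponential case this
   reduces to k * x0, i.e. plain exponential growth. */
static double recommended_capacity(double x0, double L, double k)
{
    double p0 = cdf_exp(x0, L);
    return q_exp(1.0 - pow(1.0 - p0, k), L);
}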
--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments
Anton Shepelev
2024-07-08 16:34:56 UTC
Permalink
Post by Anton Shepelev
p0 = 1-e^(L*x0) ,
p1 = 1-e^(L*x1) ,
x1 = k*x0 (by our strategy), =>
p1 = 1-(1-p0)^k .
which does not depend on the distribution and lets us
x1 = Q( 1 - ( 1 - CDF(x0) )^k )
x0 : the required size
x1 : the new recommended capacity
Q(p) : the p-Quantile of the given distribution
CDF(x): the CDF of the given distribution
k>1 : balance between speed and space efficiency
Let us test it with the exponential distribution, for which:

Q (p) = -Ln( 1 - p )/L
CDF(x) = 1 - e^(-Lx)

Substituting these into the equation for x1:

x1 = Q ( 1 - ( 1 - ( 1 - e^(-Lx0) ) )^k ) =
Q ( 1 - ( e^(-Lx0) )^k ) =
Q ( 1 - e^(-kLx0) ) =
-Ln( e^(-kLx0) )/L = k*x0 (QED)

That is, my solution is a/the generalisation of the
exponential growth strategy.
--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments
Anton Shepelev
2024-06-19 12:20:00 UTC
Permalink
Malcolm McLean writes that, given the log-normal distribution
of file sizes with known parameters,
Post by Malcolm McLean
we can work out, given that a file is at least N
characters, what is the prbablity that an allocation of
any size will contain the whole file, and how many bytes,
on average will be wasted.
This is why I thought statisticians might help him: Malcolm
wants to find the a-posteriori distribution of the size of a
file, after it has been found to exceed N bytes. Am I right
that if we take the remaining (N>20) part of the density
function and re-normalise it, we shall obtain the desired
distribution?

My proposition was as follows:

1. Find quantile q0 corresponding to the buffer size
currently requested.

2. Calculate new quantile q1 = 1-(1-q0)/k, where k>1 is
an adjustable parameter, and use its corresponding
value as the new allocation size.

For example, assuming for simplicity a uniform [0,20]
distribution of file sizes and k=2, a sequence of allocations
may look like this:

requested allocated
2 20-(20- 2)/2 = 11
12 20-(20-12)/2 = 16
18 20-(20-18)/2 = 19
--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments
Rich Ulrich
2024-07-02 04:51:33 UTC
Permalink
On Mon, 17 Jun 2024 18:02:49 +0300, Anton Shepelev
Post by Anton Shepelev
[cross-posted to: ci.stat.math]
Anton,

The post being responded to was originally to comp.lang.c
which I don't subscribe to.

I have a question that I suppose reflects on my news source,
GigaNews, or else on my reader, Forte Agent.

Was this thread something posted 15 or 20 years ago?

I tried to call up the original post by clicking on the Message
ID when looking at headers; nothing comes up when Agent goes
online to look. The header shows multiple earlier messages;
none of them come up for me.

My clicking on Message ID works elsewhere. The logical and
simple explanation is that this is a thread old enough that
GigaNews does not have it.

I suppose that someone else might be able to tell me, if their
supplier goes back further or if GigaNews is somehow failing
to show me something that is recent.
--
Rich Ulrich
Keith Thompson
2024-07-02 05:10:00 UTC
Permalink
Post by Rich Ulrich
On Mon, 17 Jun 2024 18:02:49 +0300, Anton Shepelev
Post by Anton Shepelev
[cross-posted to: ci.stat.math]
Anton,
The post being responded to was originally to comp.lang.c
which I don't subscribe to.
I have a question that I suppose reflects on my news source,
GigaNews, or else on my reader, Forte Agent.
Was this thread something posted 15 or 20 years ago?
I tried to call up the original post by clicking on the Message
ID when looking at headers; nothing comes up when Agent goes
online to look. The header shows multiple earlier messages;
none of them come up for me.
My clicking on Message ID works elsewhere. The logical and
simple explanation is that this is a thread old enough that
GigaNews does not have it.
I suppose that someone else might be able to tell me, if their
supplier goes back further or if GigaNews is somehow failing
to show me something that is recent.
The first article in this thread was posted to comp.lang.c by Janis
Papanagnou on 17 Jun 2024.

There were several followups on the same day. The direct parent of your
article was cross-posted to comp.lang.c and sci.stat.math by Anton
Shepelev (his was the first cross-posted article in the thread).
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Rich Ulrich
2024-07-02 15:45:25 UTC
Permalink
On Mon, 01 Jul 2024 22:10:00 -0700, Keith Thompson
Post by Keith Thompson
Post by Rich Ulrich
On Mon, 17 Jun 2024 18:02:49 +0300, Anton Shepelev
Post by Anton Shepelev
[cross-posted to: ci.stat.math]
Anton,
The post being responded to was originally to comp.lang.c
which I don't subscribe to.
I have a question that I suppose reflects on my news source,
GigaNews, or else on my reader, Forte Agent.
Was this thread something posted 15 or 20 years ago?
I tried to call up the original post by clicking on the Message
ID when looking at headers; nothing comes up when Agent goes
online to look. The header shows multiple earlier messages;
none of them come up for me.
My clicking on Message ID works elsewhere. The logical and
simple explanation is that this is a thread old enough that
GigaNews does not have it.
I suppose that someone else might be able to tell me, if their
supplier goes back further or if GigaNews is somehow failing
to show me something that is recent.
The first article in this thread was posted to comp.lang.c by Janis
Papanagnou on 17 Jun 2024.
There were several followups on the same day. The diret parent of your
article was cross-posted to comp.lang.c and sci.stat.math by Anton
Shepelev (his was the first cross-posted article in the thread).
Thanks, so it looks like a failure by GigaNews to retrieve the
recent posts.

I did see a bunch of cross-posted followups. I don't know C,
and I thought there could be more context.

Looking at original, absent posts is something I've done dozens
of times over the years. Never a problem, except for a time or two
with posts from the 1990s. I've used GigaNews since my regular
ISP stopped providing Usenet access, maybe 15 years ago.
--
Rich Ulrich
Anton Shepelev
2024-07-08 17:01:21 UTC
Permalink
Post by Rich Ulrich
Thanks, so it looks like a failure by GigaNews to retrieve
the recent posts.
I did see a bunch of cross-posted followups. I don't know
C, and I thought there could be more context.
Characters are being read in (say, from a file) sequentially
and stored in computer memory (RAM) in an "array" -- a
linear data structure storing elements (our characters) in a
sequential order, one after the other at addresses with
increasing indexes -- somewhat like a mathematical vector.

In order to store a character in an array, sufficient memory
has to be "allocated" for it, but while reading we do not
know beforehand the size of the file (or the total length of
the sequence), and therefore increase the allocated array
size prospectively once the previous allocation is filled.
This operation is called `realloc' and frequently involves
the tedious copying of the entire array onto a new location
in memory, taking time in proportion to the number of
elements so far allocated.

The question is to develop an optimal allocation strategy for
a given distribution of file sizes. The fastest solution is
to allocate a gigantic array beforehand, but it is a
terrible waste of memory. The slowest solution is to
reallocate for each single character read, but that is a
terrible waste of CPU time. As I understand the problem, a
strategy is needed that manifests some compromise between
the extremes.
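
To make the two extremes concrete, here is a tiny count of how
many reallocations each strategy would perform while a buffer
grows to one million bytes (an illustration with assumed
numbers, not anyone's production code):

#include <stdio.h>

int main(void)
{
    const size_t n = 1000000;   /* final buffer size, one byte at a time */
    size_t cap, reallocs;

    /* Slowest strategy: reallocate on every byte. */
    reallocs = 0;
    for (cap = 0; cap < n; cap++)
        reallocs++;
    printf("per-byte growth:    %zu reallocations\n", reallocs);

    /* Exponential strategy: multiply the capacity by 1.5 when exhausted. */
    reallocs = 0;
    for (cap = 16; cap < n; cap += cap / 2)
        reallocs++;
    printf("exponential growth: %zu reallocations\n", reallocs);
    return 0;
}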
Post by Rich Ulrich
Looking at original, absent posts is something I've done
dozens of times over the years. Never a problem, except
fora time or two with posts from the 1990s. I've used
GigaNews since my regular ISP stopped providing Usenet
access, maybe 15 years ago.
Just in case, there are many totally free Usenet servers,
e.g.:
http://www.eternal-september.org/
https://www.i2pn2.org/

and even a web interface:

https://www.novabbs.com
--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments
Rich Ulrich
2024-07-21 23:40:16 UTC
Permalink
On Mon, 8 Jul 2024 20:01:21 +0300, Anton Shepelev
Post by Rich Ulrich
Thanks, so it looks like a failure by GigaNews to retrieve
the recent posts.
I did see a bunch of cross-posted followups. I don't know
C, and I thought there could be more context.
<snip. Thanks for the details.>

Okay, today I discovered that there was no failure by Giganews
or by Forte Agent -- Instead, there was BEHAVIOR by Agent
that I was not aware of.

Today, I was looking at All Desks (Agent terminology) to see what
was in Sent, and I noticed there were messages in Inbox -- Those
were the messages that I thought Agent had failed to retrieve.
(Nice discussion there.)

I guess - every other time I've clicked on Message-ID, the old
message was in the group where I was reading and the old one
showed up where I was reading. I've read the Agent group for
ages, and I don't remember this feature ever being mentioned.

Live and learn.
--
Rich Ulrich
Anton Shepelev
2024-07-23 13:47:39 UTC
Permalink
Post by Rich Ulrich
Today, was looking at All Desks (Agent terminology) to see
what was in Sent, and I noticed there were messages in
Inbox -- Those were the messages that I thought Agent had
failed to retrieve. (Nice discussion there.)
Looking forward to your take on the problem. Mine is rather
simplistic, but can be easily tested with several
distributions. This is like extrapolation: we know the
optimal solution for the given distribution (exponential)
and want to devise a general method to get the optimal
solution for any given distribution.

Malcolm, if you are still interested, can you provide a test
program that measures some statistics for various allocation
strategies on various distributions, including the
exponential?
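
Something along these lines would do as a starting point: a
rough sketch, with made-up parameters, that measures the
average number of reallocations and the average wasted space
for one growth factor under an exponential distribution of
final sizes:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Draw a final buffer size from an exponential distribution with the
   given mean. */
static double draw_exponential(double mean)
{
    double u = (rand() + 1.0) / (RAND_MAX + 2.0);   /* u in (0,1) */
    return -mean * log(u);
}

int main(void)
{
    const int trials = 100000;
    const double mean = 64 * 1024.0;   /* assumed mean buffer size */
    const double factor = 1.5;         /* growth factor under test */
    double total_reallocs = 0.0, total_waste = 0.0;

    for (int t = 0; t < trials; t++) {
        double final_size = draw_exponential(mean);
        double cap = 1024.0;           /* initial allocation */
        int reallocs = 0;
        while (cap < final_size) {
            cap *= factor;
            reallocs++;
        }
        total_reallocs += reallocs;
        total_waste += cap - final_size;
    }
    printf("factor %.2f: %.2f reallocs per buffer, %.0f bytes wasted on average\n",
           factor, total_reallocs / trials, total_waste / trials);
    return 0;
}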
--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments
Paul
2024-07-02 07:02:07 UTC
Permalink
Post by Rich Ulrich
On Mon, 17 Jun 2024 18:02:49 +0300, Anton Shepelev
Post by Anton Shepelev
[cross-posted to: ci.stat.math]
Anton,
The post being responded to was originally to comp.lang.c
which I don't subscribe to.
I have a question that I suppose reflects on my news source,
GigaNews, or else on my reader, Forte Agent.
Was this thread something posted 15 or 20 years ago?
I tried to call up the original post by clicking on the Message
ID when looking at headers; nothing comes up when Agent goes
online to look. The header shows multiple earlier messages;
none of them come up for me.
My clicking on Message ID works elsewhere. The logical and
simple explanation is that this is a thread old enough that
GigaNews does not have it.
I suppose that someone else might be able to tell me, if their
supplier goes back further or if GigaNews is somehow failing
to show me something that is recent.
MID: <v4ojs8$gvji$***@dont-email.me>

http://al.howardknight.net/

That gives this URL, as a copy of the message kicking off the thread.

http://al.howardknight.net/?STYPE=msgid&MSGI=%3Cv4ojs8%24gvji%241%40dont-email.me%3E

Some USENET News clients can work from the MID directly, but Thunderbird does not.
A bare MID does not work for everyone.

Paul
Rich Ulrich
2024-07-02 15:52:56 UTC
Permalink
Post by Paul
Post by Rich Ulrich
On Mon, 17 Jun 2024 18:02:49 +0300, Anton Shepelev
Post by Anton Shepelev
[cross-posted to: ci.stat.math]
Anton,
The post being responded to was originally to comp.lang.c
which I don't subscribe to.
I have a question that I suppose reflects on my news source,
GigaNews, or else on my reader, Forte Agent.
Was this thread something posted 15 or 20 years ago?
I tried to call up the original post by clicking on the Message
ID when looking at headers; nothing comes up when Agent goes
online to look. The header shows multiple earlier messages;
none of them come up for me.
My clicking on Message ID works elsewhere. The logical and
simple explanation is that this is a thread old enough that
GigaNews does not have it.
I suppose that someone else might be able to tell me, if their
supplier goes back further or if GigaNews is somehow failing
to show me something that is recent.
http://al.howardknight.net/
That gives this URL, as a copy of the message kicking off the thread.
http://al.howardknight.net/?STYPE=msgid&MSGI=%3Cv4ojs8%24gvji%241%40dont-email.me%3E
Yes, that's the message I see when I plug the Message ID
into the program at http://al.howardknight.net/

Thanks. I'm saving that.
Post by Paul
Some USENET News clients can work from the MID directly, but Thunderbird does not.
A bare MID does not work for everyone.
Forte Agent invites me to click on the MID; asks if it is a
mail or MID; asks if it should search the net. It still works
when I test it on an old message in another group.
--
Rich Ulrich
Rich Ulrich
2024-07-02 15:58:11 UTC
Permalink
On Tue, 02 Jul 2024 11:52:56 -0400, Rich Ulrich
Post by Rich Ulrich
Forte Agent invites me to click on the MID; asks if it is a
mail or MID; asks if it should search the net. It still works
when I test it on an old message in another group.
Now it occurs to me -- It actually does make sense,
economically, if what is searched online is limited to the
groups I subscribe to, or (even) only the group that
is currently active.
--
Rich Ulrich
Paul
2024-07-02 19:09:25 UTC
Permalink
Post by Rich Ulrich
On Tue, 02 Jul 2024 11:52:56 -0400, Rich Ulrich
Post by Rich Ulrich
Forte Agent invites me to click on the MID; asks if it is a
mail or MID; asks if it should search the net. It still works
when I test it on an old message in another group.
Now it occurs to me -- It actually does make sense,
economically, if what is searched online is limited the
groups I subscribe to, or (even) only the group that
is currently active.
Every device has "retention", but retention is limited.

Whether it's a search site, or a USENET server (even Forte had
their own news server, at one time), you need retention for
older articles to be search-able either as body text, or as
a <mid>.

Now that Google Groups is no longer connected to USENET,
that's one fewer places with a decent-sized archive. Google closed
their service, after the ThaiSpam incident. The Eternal-September
server, changed from one server to a two-server setup. The
Transit Server had Spam Assassin loaded on it, removing ThaiSpam,
and the second server continued to offer normal ("filtered") service.

The comp.lang.c group was one of the groups under attack. Since
Google was letting the spam in, now that Google is disconnected,
the spam is gone, and the readership on CLC has gone up.

Paul
James Kuyper
2024-07-02 20:54:46 UTC
Permalink
On 7/2/24 15:09, Paul wrote:
...
Post by Paul
Now that Google Groups is no longer connected to USENET,
I just checked, and as I had expected, Google is still connected.
Post by Paul
that's one fewer places with a decent-sized archive. Google closed
their service, after the ThaiSpam incident.
They didn't close their service. They just stopped adding new messages
to their archives. The messages that were stored prior to the closing
are still available for searching.
James Kuyper
2024-07-02 20:58:14 UTC
Permalink
On 7/2/24 15:09, Paul wrote:
...
Post by Paul
Now that Google Groups is no longer connected to USENET,
...
Post by Paul
that's one fewer places with a decent-sized archive. Google closed
their service, after the ThaiSpam incident.
While they no longer store new messages, they still have one of the
largest archives of old messages, and it's still available for searching.
Scott Lurndal
2024-06-17 16:58:52 UTC
Permalink
Post by Malcolm McLean
Post by Ben Bacarisse
I have no idea what you are talking about. What "value" are you looking
to calculate?
We have a continuously growing buffer,
At this point, you should be asking yourself
if there are better alternatives for storing
the incoming data than to a continuously growing
dynamically allocated piecemeal buffer.

C character stdio tends to work well for streaming applications
(i.e. pipelines where the input is (minimally) processed and forwarded
to the output), but not so efficiently for applications that need to
look at the data en masse.

Personally, I'd mmap the input file and eschew stdio completely
and just walk through memory with the appropriate pointer.

(mmap showed up in the late 80s, so you can pretend it
is C90 if you like).
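
For example (POSIX mmap, with error handling kept minimal):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only and walk it with a pointer, here just counting lines. */
int main(int argc, char **argv)
{
    if (argc != 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return 1;
    }
    const char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return 1;
    }
    size_t lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (p[i] == '\n')
            lines++;
    printf("%zu lines\n", lines);
    munmap((void *)p, (size_t)st.st_size);
    close(fd);
    return 0;
}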
Chris M. Thomasson
2024-06-17 20:12:54 UTC
Permalink
Post by Scott Lurndal
Post by Malcolm McLean
Post by Ben Bacarisse
I have no idea what you are talking about. What "value" are you looking
to calculate?
We have a continuously growing buffer,
At this point, you should be asking yourself
if there are better alternatives for storing
the incoming data than to a continuously growing
dynamically allocated piecemeal buffer.
C character stdio tends to work well for streaming applications
(i.e. pipelines where the input is (minimally) processed and forwarded
to the output), but not so efficiently for applications that need to
look at the data en masse.
Indeed.
Post by Scott Lurndal
Personnally, I'd mmap the input file and eschew stdio completely
and just walk through memory with the appropriate pointer.
That works for sure. Actually, it's been a while since I have used
memory-mapped files. I made a little database thing with them that multiple
processes could use, lock-free... ;^)

Also, I remember a nifty trick to use memory mapped files with certain
lock-free algorithms. I need to find that old post over on
comp.programming.threads. Have a feeling that its going to be hard to find!
Post by Scott Lurndal
(mmap showed up in the late 80s, so you can pretend it
is C90 if you like).
Keith Thompson
2024-06-17 23:39:38 UTC
Permalink
Malcolm McLean <***@gmail.com> writes:
[...]
Post by Malcolm McLean
We have a continuously growing buffer, and we want the best strategy
for reallocations as the stream of characters comes at us. So, given
we now how many characters have arrived, can we predict how many will
arrive, and therefore ask for the best amount when we reallocate, so
that we neither make too many reallocation (reallocate on every byte
received) or ask for too much (demand SIZE_MAX memory when the first
byte is received).?
[...]

That's going to depend on the behavior of the implementation's
realloc() function.

malloc() and realloc() typically allocate more memory than they're
asked for, so that another realloc() call can often expand the
allocated memory in place.

I have a small test program that uses realloc() to expand a 1-byte
allocated buffer to 2 bytes, then 3, then 4, and so on. In one
implementation, in one test run, it reallocates at 25 bytes, and
then not again until just over 128 kbytes. Other implementations
behave differently.

A realloc() call that's able to return its first argument is likely
to be much quicker than one that does an actual reallocation.
If you want to fine tune for performance, you'll have to allow
for the implementation you're using, either by performing run-time
measurements or by analyzing the implementation (which could change
in the next release).

Multiplying the size by 2 or 1.5 as needed is likely to be good enough.
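
The test program is essentially this (a reconstruction of the idea rather
than Keith's exact code; strictly speaking the old pointer value is
indeterminate after a successful realloc(), but comparing it works well
enough for observation):

#include <stdio.h>
#include <stdlib.h>

/* Grow an allocation one byte at a time and report each size at which
   realloc() returned a different address, i.e. the block actually moved. */
int main(void)
{
    const size_t limit = 200000;
    char *p = malloc(1);
    if (p == NULL)
        return 1;
    for (size_t size = 2; size <= limit; size++) {
        char *q = realloc(p, size);
        if (q == NULL) {
            free(p);
            return 1;
        }
        if (q != p)
            printf("moved at %zu bytes\n", size);
        p = q;
    }
    free(p);
    return 0;
}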
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Malcolm McLean
2024-06-18 05:12:36 UTC
Permalink
Post by Keith Thompson
[...]
Post by Malcolm McLean
We have a continuously growing buffer, and we want the best strategy
for reallocations as the stream of characters comes at us. So, given
we now how many characters have arrived, can we predict how many will
arrive, and therefore ask for the best amount when we reallocate, so
that we neither make too many reallocation (reallocate on every byte
received) or ask for too much (demand SIZE_MAX memory when the first
byte is received).?
[...]
That's going to depend on the behavior of the implementations
realloc() function.
malloc() and realloc() typically allocate more memory than they're
asked for, so that another realloc() call can often expand the
allocated memory in place.
I have a small test program that uses realloc() to expand a 1-byte
allocated buffer to 2 bytes, then 3, then 4, and so on. In one
implementation, in one test run, it reallocates at 25 bytes, and
then not again until just over 128 kbytes. Other implementations
behave differently.
A realloc() call that's able to return its first argument is likely
to be much quicker than one that does an actual reallocation.
If you want to fine tune for performance, you'll have to allow
for the implementation you're using, either by performing run-time
measurements or by analyzing the implementation (which could change
in the next release).
Multiplying the size by 2 or 1.5 as needed is likely to be good enough.
A small test program can hog all of memory for itself, and so isn't a
good test. What matters is when you're running something like a video
game, and the artists stuff the computer's memory full of artwork until
it runs out, and then slowly and reluctantly remove their masterworks
until the other routines have enough memory to run. So the game won't
run out of memory, but only just. It's very tight.
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
Bonita Montero
2024-06-18 07:56:07 UTC
Permalink
Post by Keith Thompson
I have a small test program that uses realloc() to expand a 1-byte
allocated buffer to 2 bytes, then 3, then 4, and so on. In one
implementation, in one test run, it reallocates at 25 bytes, and
then not again until just over 128 kbytes. Other implementations
behave differently.
Usually you don't expand a dynamic array byte by byte. Normally you
expand it like a C++ vector whose capacity doubles with libstdc++
and libc++ (MSVC adds 50% to the capacity before).
David Brown
2024-06-17 18:11:48 UTC
Permalink
Post by Malcolm McLean
Post by Ben Bacarisse
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required.  In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well).  This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size.  Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.
So can we work it out?
Let's assume for the moment that the allocations have a semi-normal
distribution, with negative values disallowed. Now ignoring the first
few values, if we have allocated, say, 1K, we ought to be able to
predict the value by integrating the distribution from 1k to infinity
and taking the mean.
First, there is no reason for assuming such a distribution, other than
saying "lots of things are roughly normal".

Secondly, knowing the distribution gives you /no/ information about any
given particular case. You know the distribution for the results of
rolling two dice - does that mean you can predict the next roll?

Thirdly, not all distributions have a mean (look up the Cauchy
distribution if you like).

Fourthly, even if you know the mean, it tells you nothing of use.


Knowing a bit about the distribution of file sizes can be useful, but
not nearly in the way you describe here. If you know that the files are
rarely or never bigger than 10 MB, malloc 10 MB and forget the realloc.
If you know they are often bigger than that, mmap the file and forget
the realloc.
Bonita Montero
2024-06-17 09:22:31 UTC
Permalink
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
Janis
realloc() is just a convenience function. Usually the reallocation
can't happen in-place and a second malloc() followed by a copy and
a free() does the same.
For large data it would be nice if the pages being deallocated later
would be incrementally marked as discardable after copying a portion.
This would result in only a small portion of additional physical
memory being allocated since the newly allocated pages become
associated with physical pages when they're touched first. Windows has
VirtualAlloc() with MEM_RESET for that, Linux has madvise() with
MADV_DONTNEED.
Vir Campestris
2024-06-20 20:08:00 UTC
Permalink
Post by Bonita Montero
realloc() is just a convenience funciton. Usually the reallocation
can't happen in-place and a second malloc() followed by a copy and
a free() does the same.
For large data it would be nice if the pages being deallocated later
would be incrementally marked as discardable after copying a portion.
This would result in only a small portion of additional physical
memory being allocated since the newly allocated pages become asso-
ciated with phyiscal pages when they're touched first. Windows has
VirtualAlloc() with MEM_RESET for that, Linux has madvise() with
MADV_DONTNEED.
"Usually can't happen in place"?

Really? It's not something I use a lot, but when it's appropriate I
will. It's got the advantage over doing this myself that for some
portion of calls all the run time library needs to do is change the size
field in the structure.

Nothing else.

No copying, and no duplicate allocations.

What proportion of calls can be managed by changing the size field alone
depends on your workload and the platform. But I doubt there are many
cases where it is 0%.

Andy
Bonita Montero
2024-06-21 19:12:12 UTC
Permalink
Post by Vir Campestris
Post by Bonita Montero
realloc() is just a convenience funciton. Usually the reallocation
can't happen in-place and a second malloc() followed by a copy and
a free() does the same.
For large data it would be nice if the pages being deallocated later
would be incrementally marked as discardable after copying a portion.
This would result in only a small portion of additional physical
memory being allocated since the newly allocated pages become asso-
ciated with phyiscal pages when they're touched first. Windows has
VirtualAlloc() with MEM_RESET for that, Linux has madvise() with
MADV_DONTNEED.
"Usually can't happen in place"?
Usually you don't resize the block by a few bytes, and if you have
a large memory area it's unlikely that you can allocate a few additional
pages in-place.
C++23 has a nice function in the allocator interface called
allocate_at_least. It returns the pointer to the allocated memory along
with the size actually allocated. This fits with modern allocators like
mimalloc, jemalloc or tcmalloc, which round up the allocation to the next
2^N boundary up to 8kB. So a vector's capacity directly includes the slack.
Lawrence D'Oliveiro
2024-06-24 08:40:03 UTC
Permalink
Usually you don't resize the block with a few bytes ...
The usual way I use realloc is to maintain separate counts of the number
of array elements I have allocated, and the number I am actually using. A
realloc call is only needed when the latter hits the former. Every time I
call realloc, I will extend by some minimum number of array elements (e.g.
128), roughly comparable to the sort of array size I typically end up
with.

And then when the structure is complete, I do a final realloc call to
shrink it down so the size is actually that used. Is it safe to assume
such a call will never fail? Hmm ...
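
In code the pattern is roughly this (a sketch; the element type and the
chunk size of 128 are arbitrary, and the final shrink simply keeps the old
block if realloc() refuses):

#include <stdlib.h>

#define CHUNK 128   /* grow by this many elements at a time */

struct list {
    size_t used;        /* elements in use */
    size_t allocated;   /* elements allocated */
    int *items;
};

/* Append one element, extending the array by CHUNK elements when it is
   full.  Returns 0 on success, -1 on allocation failure. */
static int list_append(struct list *l, int value)
{
    if (l->used == l->allocated) {
        size_t newcap = l->allocated + CHUNK;
        int *p = realloc(l->items, newcap * sizeof *p);
        if (p == NULL)
            return -1;
        l->items = p;
        l->allocated = newcap;
    }
    l->items[l->used++] = value;
    return 0;
}

/* Final shrink to the size actually used; keep the old block if the
   shrinking realloc() fails. */
static void list_trim(struct list *l)
{
    if (l->used == 0 || l->used == l->allocated)
        return;
    int *p = realloc(l->items, l->used * sizeof *p);
    if (p != NULL) {
        l->items = p;
        l->allocated = l->used;
    }
}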
Keith Thompson
2024-06-24 09:55:39 UTC
Permalink
Post by Lawrence D'Oliveiro
Usually you don't resize the block with a few bytes ...
The usual way I use realloc is to maintain separate counts of the number
of array elements I have allocated, and the number I am actually using. A
realloc call is only needed when the latter hits the former. Every time I
call realloc, I will extend by some minimum number of array elements (e.g.
128), roughly comparable to the sort of array size I typically end up
with.
And then when the structure is complete, I do a final realloc call to
shrink it down so the size is actually that used. Is it safe to assume
such a call will never fail? Hmm ...
It's not safe to assume that a shrinking realloc call will never fail.
It's possible that it will never fail in any existing implementation,
but the standard makes no such guarantee.

A (perhaps) plausible way it can fail is if allocations of different
sizes are allocated from different pools. Trying to shrink a 1000-byte
object to 100 bytes might fail if the 100-byte pool has been exhausted.
On the other hand, realloc() could just return its argument in such a
case and continue treating the object as a 1000-byte allocation.
(The allocation functions don't have to keep track of how much memory
you asked for, only how much they actually allocated, which is >= what
you asked for.)

Arguably only a poor implementation would fail on a shrinking realloc(),
but the standard permits poor implementations.

Having said all that, if realloc fails (indicated by returning a null
pointer), you still have the original pointer to the object.

Something else that occurs to me: If a shrinking realloc() never fails
in practice, then any code you write to handle a failure won't be
tested.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
David Brown
2024-06-24 11:40:08 UTC
Permalink
Post by Keith Thompson
Something else that occurs to me: If a shrinking realloc() never fails
in practice, then any code you write to handle a failure won't be
tested.
That is always a problem with allocation functions. Have you ever known
a non-pathological malloc() to fail?

I think, in fact, there's a good argument for ignoring the possibility
of malloc (and calloc and realloc) failures for most PC code. There is
virtually no chance of failure in reality, and if you get one, there is
almost never a sensible way to deal with it - you just kick the can down
the road by having functions return NULL until something gives up and
stops the program with an error message. You might as well just let the
OS kill the program when you try to access memory at address 0.

I've seen more than enough error handling code that has never been
tested in practice - including error handling code with bugs that lead
to far worse problems than just killing the program.

Of course such treatment is not appropriate for all allocations (or
other functions that could fail). But often I think it is better to
write clearer and fully testable (and tested!) code which ignores
hypothetical errors, rather than some of the untestable and untested
jumbles that are sometimes seen in an attempt to "handle" allocation
failures.
Malcolm McLean
2024-06-24 17:50:15 UTC
Permalink
Post by Keith Thompson
Something else that occurs to me: If a shrinking realloc() never fails
in practice, then any code you write to handle a failure won't be
tested.
That is always a problem with allocation functions.  Have you ever known
a non-pathological malloc() to fail?
I think, in fact, there's a good argument for ignoring the possibility
of malloc (and calloc and realloc) failures for most PC code.  There is
virtually no chance of failure in reality, and if you get one, there is
almost never a sensible way to deal with it - you just kick the can down
the road by having functions return NULL until something gives up and
stops the program with an error message.  You might as well just let the
OS kill the program when you try to access memory at address 0.
I've seen more than enough error handling code that has never been
tested in practice - including error handling code with bugs that lead
to far worse problems than just killing the program.
Of course such treatment is not appropriate for all allocations (or
other functions that could fail).  But often I think it is better to
write clearer and fully testable (and tested!) code which ignores
hypothetical errors, rather than some of the untestable and untested
jumbles that are sometimes seen in an attempt to "handle" allocation
failures.
Baby X has bbx_malloc() which is guaranteed never to return NULL, and
never to return a pointer to an allocation which cannot be indexed by an
int.

I use a "goto out_of_memory" in regular code, however.
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
Scott Lurndal
2024-06-24 18:20:32 UTC
Permalink
Post by Malcolm McLean
Post by David Brown
Of course such treatment is not appropriate for all allocations (or
other functions that could fail).  But often I think it is better to
write clearer and fully testable (and tested!) code which ignores
hypothetical errors, rather than some of the untestable and untested
jumbles that are sometimes seen in an attempt to "handle" allocation
failures.
Baby X has bbx_malloc() which is guaranteed never to return NULL, and
never to return a pointer to an allocation which cannot be indexed by an
int.
What do you mean by 'indexed by an int'? So, what happens if I index
your allocation with -109235?

Or did you mean to say unsigned (or positive) int less than the
size of the allocation?
Keith Thompson
2024-06-24 21:28:47 UTC
Permalink
[...]
Post by Scott Lurndal
Post by Malcolm McLean
Baby X has bbx_malloc() which is guaranteed never to return NULL, and
never to return a pointer to an allocation which cannot be indexed by an
int.
What do you mean by 'indexed by an int'? So, what happens if I index
your allocation with -109235?
Or did you mean to say unsigned (or positive) int less than the
size of the allocation?
If I recall correctly, Malcolm hates size_t. I presume that
bbx_malloc() will never allocate more than INT_MAX bytes.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Malcolm McLean
2024-06-24 23:22:04 UTC
Permalink
Post by Keith Thompson
[...]
Post by Scott Lurndal
Post by Malcolm McLean
Baby X has bbx_malloc() which is guaranteed never to return NULL, and
never to return a pointer to an allocation which cannot be indexed by an
int.
What do you mean by 'indexed by an int'? So, what happens if I index
your allocation with -109235?
Or did you mean to say unsigned (or positive) int less than the
size of the allocation?
If I recall correctly, Malcolm hates size_t. I presume that
bbx_malloc() will never allocate more than INT_MAX bytes.
Exactly.
Currently bbx_malloc() does take a size_t, but there is a version in which
it takes an int. However, you get warnings if you try something like
bbx_malloc(10 * sizeof(int)), on the basis that that expression is size_t.

But if you pass either a negative number or more than INT_MAX, it
assumes the request is invalid and terminates with an error message.
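
A wrapper in that spirit might look roughly like this (an illustration
only, not the actual bbx_malloc()):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* "Never fails": reject requests over INT_MAX and terminate with a
   message instead of returning NULL.  Illustrative sketch only. */
void *xmalloc(size_t size)
{
    if (size > (size_t)INT_MAX) {
        fprintf(stderr, "xmalloc: invalid request of %zu bytes\n", size);
        exit(EXIT_FAILURE);
    }
    void *p = malloc(size ? size : 1);   /* sidestep the malloc(0) corner case */
    if (p == NULL) {
        fprintf(stderr, "xmalloc: out of memory (%zu bytes requested)\n", size);
        exit(EXIT_FAILURE);
    }
    return p;
}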
--
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
Chris M. Thomasson
2024-06-25 01:31:48 UTC
Permalink
Post by Keith Thompson
[...]
Post by Scott Lurndal
Post by Malcolm McLean
Baby X has bbx_malloc() which is guaranteed never to return NULL, and
never to return a pointer to an allocation which cannot be indexed by an
int.
What do you mean by 'indexed by an int'? So, what happens if I index
your allocation with -109235?
Or did you mean to say unsigned (or positive) int less than the
size of the allocation?
If I recall correctly, Malcolm hates size_t. I presume that
bbx_malloc() will never allocate more than INT_MAX bytes.
He still has to deal with bbx_malloc failing. Don't like automatically
aborting...

bbx_malloc(INT_MAX);
bbx_malloc(INT_MAX);
bbx_malloc(INT_MAX);
...

;^)
Lawrence D'Oliveiro
2024-06-25 07:06:41 UTC
Permalink
Baby X has bbx_malloc() which is guaranteed never to return NULL ...
Does it actually allocate the (physical) memory?

I wrote a memory-hog app for Android once, and found that allocating large
amounts of memory space had very little impact on the system. Then when I
added code to actually write data into those allocated pages, that’s when
it really started to break into a sweat ...
Bonita Montero
2024-06-25 08:38:28 UTC
Permalink
Post by Lawrence D'Oliveiro
I wrote a memory-hog app for Android once, and found that allocating large
amounts of memory space had very little impact on the system. Then when I
added code to actually write data into those allocated pages, that’s when
it really started to break into a sweat ...
Then Android is also doing overcommit.
Lawrence D'Oliveiro
2024-06-26 00:51:40 UTC
Permalink
Post by Bonita Montero
Post by Lawrence D'Oliveiro
I wrote a memory-hog app for Android once, and found that allocating
large amounts of memory space had very little impact on the system.
Then when I added code to actually write data into those allocated
pages, that’s when it really started to break into a sweat ...
Then android is also doing overcommit.
It is running a Linux kernel, and that tends to be the default setup in
Linux.
Chris M. Thomasson
2024-06-24 19:33:04 UTC
Permalink
Post by Keith Thompson
Something else that occurs to me: If a shrinking realloc() never fails
in practice, then any code you write to handle a failure won't be
tested.
That is always a problem with allocation functions.  Have you ever known
a non-pathological malloc() to fail?
I think, in fact, there's a good argument for ignoring the possibility
of malloc (and calloc and realloc) failures for most PC code.  There is
virtually no chance of failure in reality, and if you get one, there is
almost never a sensible way to deal with it -
[...]

I have had to deal and roll with malloc failures. It put the server in
panic mode and it started killing connections that were already timed
out, dumping some caches (freelists of regions, etc.), etc... There is a
way to recover.
Bonita Montero
2024-06-24 19:48:16 UTC
Permalink
Post by Chris M. Thomasson
Post by Keith Thompson
Something else that occurs to me: If a shrinking realloc() never fails
in practice, then any code you write to handle a failure won't be
tested.
That is always a problem with allocation functions.  Have you ever
known a non-pathological malloc() to fail?
I think, in fact, there's a good argument for ignoring the possibility
of malloc (and calloc and realloc) failures for most PC code.  There
is virtually no chance of failure in reality, and if you get one,
there is almost never a sensible way to deal with it -
[...]
I have had to deal and roll with malloc failures. It put the server in
panic mode and it started killing connections that were already timed
out, dumping some caches (freelists of regions, etc.), etc... There is a
way to recover.
There are applications where you can ignore malloc() failures /
bad_alloc since they only allocate a small amount of memory, or
make a lot of small allocations which sum up to a small amount of
memory.
Lawrence D'Oliveiro
2024-06-24 22:59:23 UTC
Permalink
Have you ever known a non-pathological malloc() to fail?
I was once commissioned, many decades ago, to write a multispectral image
viewer to run on old MacOS. I followed my usual memory-allocation
discipline. The client reported how he tried to open too many images at
once, and ran out of memory; my program reported one out-of-memory error,
gave up trying to open the rest of the files, and gracefully recovered
without crashing.

The program that had been supplied to him for Microsoft Windows, however,
gave an error for *each* file it failed to open.
Bonita Montero
2024-06-24 14:19:22 UTC
Permalink
Post by Keith Thompson
It's not safe to assume that a shrinking realloc call will never fail.
It's possible that it will never fail in any existing implementation,
but the standard makes no such guarantee.
With modern memory allocators a shrink can easily fail if the shrunken
block falls into a different size class and there's no memory for a
block in this class.
mimalloc, jemalloc and TCMalloc round up the size to the next power of
two up to 8kB. If you have an 8kB-class block and it shrinks below 4kB,
and there's no page (6kB for all mentioned allocators) for this class
and no additional pool can be allocated, the shrink fails.
Lawrence D'Oliveiro
2024-06-25 07:02:39 UTC
Permalink
Post by Keith Thompson
Post by Lawrence D'Oliveiro
The usual way I use realloc is to maintain separate counts of the
number of array elements I have allocated, and the number I am actually
using. A realloc call is only needed when the latter hits the former.
Every time I call realloc, I will extend by some minimum number of
array elements (e.g. 128), roughly comparable to the sort of array size
I typically end up with.
And then when the structure is complete, I do a final realloc call to
shrink it down so the size is actually that used. Is it safe to assume
such a call will never fail? Hmm ...
It's not safe to assume that a shrinking realloc call will never fail.
It's possible that it will never fail in any existing implementation,
but the standard makes no such guarantee.
...
Having said all that, if realloc fails (indicated by returning a null
pointer), you still have the original pointer to the object.
In other words, it’s safe to ignore any error from that last shrinking
realloc? That’s good enough for me. ;)
Keith Thompson
2024-06-25 10:05:16 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Keith Thompson
Post by Lawrence D'Oliveiro
The usual way I use realloc is to maintain separate counts of the
number of array elements I have allocated, and the number I am actually
using. A realloc call is only needed when the latter hits the former.
Every time I call realloc, I will extend by some minimum number of
array elements (e.g. 128), roughly comparable to the sort of array size
I typically end up with.
And then when the structure is complete, I do a final realloc call to
shrink it down so the size is actually that used. Is it safe to assume
such a call will never fail? Hmm ...
It's not safe to assume that a shrinking realloc call will never fail.
It's possible that it will never fail in any existing implementation,
but the standard makes no such guarantee.
...
Having said all that, if realloc fails (indicated by returning a null
pointer), you still have the original pointer to the object.
In other words, it’s safe to ignore any error from that last shrinking
realloc? That’s good enough for me. ;)
What? No, that's not what I said at all.

Suppose you do something like:

some_type *p = malloc(BIG_VALUE);
// ...
p = realloc(p, SMALL_VALUE);

If the realloc() succeeds and doesn't relocate and copy the object,
you're fine. If realloc() succeeds and *does* relocate the object, p
still points to memory that has now been deallocated, and you don't have
a pointer to the newly allocated memory. If realloc() fails, it returns
a null pointer, but the original memory is still valid -- but again, the
assignment clobbers your only pointer to it.

I presume you can write code that handles all three possibilities, but
you can't just ignore any errors.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Richard Damon
2024-06-25 11:21:42 UTC
Permalink
Post by Keith Thompson
Post by Lawrence D'Oliveiro
Post by Keith Thompson
Post by Lawrence D'Oliveiro
The usual way I use realloc is to maintain separate counts of the
number of array elements I have allocated, and the number I am actually
using. A realloc call is only needed when the latter hits the former.
Every time I call realloc, I will extend by some minimum number of
array elements (e.g. 128), roughly comparable to the sort of array size
I typically end up with.
And then when the structure is complete, I do a final realloc call to
shrink it down so the size is actually that used. Is it safe to assume
such a call will never fail? Hmm ...
It's not safe to assume that a shrinking realloc call will never fail.
It's possible that it will never fail in any existing implementation,
but the standard makes no such guarantee.
...
Having said all that, if realloc fails (indicated by returning a null
pointer), you still have the original pointer to the object.
In other words, it’s safe to ignore any error from that last shrinking
realloc? That’s good enough for me. ;)
What? No, that's not what I said at all.
some_type *p = malloc(BIG_VALUE);
// ...
p = realloc(p, SMALL_VALUE);
If the realloc() succeeds and doesn't relocate and copy the object,
you're fine. If realloc() succeeds and *does* relocate the object, p
still points to memory that has now been deallocated, and you don't have
a pointer to the newly allocated memory. If realloc() fails, it returns
a null pointer, but the original memory is still valid -- but again, the
assignment clobbers your only pointer to it.
I presume you can write code that handles all three possibilities, but
you can't just ignore any errors.
The idiom I always learned for realloc was something like:


some_type *p = malloc(size);
if (!p) {
// allocation failed, do something about it. (might be just abort)
}

...

some_type *np = realloc(p, new_size);
if (np) {
p = np;
} else {
// p still points to old buffer, but you didn't get the new size
// so do what you can to handle the situation.
}

// p here points to the current buffer,
// might be the old size or the new.
Phil Carmody
2024-06-28 08:01:38 UTC
Permalink
Post by Keith Thompson
some_type *p = malloc(BIG_VALUE);
// ...
p = realloc(p, SMALL_VALUE);
... If realloc() succeeds and *does* relocate the object, p
still points to memory that has now been deallocated, and you don't have
a pointer to the newly allocated memory.
Surely some mistake?

However, such self-assignments are bad for the reasons you state later;
verify, then update.

Phil
--
We are no longer hunters and nomads. No longer awed and frightened, as we have
gained some understanding of the world in which we live. As such, we can cast
aside childish remnants from the dawn of our civilization.
-- NotSanguine on SoylentNews, after Eugen Weber in /The Western Tradition/
Keith Thompson
2024-06-28 11:04:48 UTC
Permalink
Post by Phil Carmody
Post by Keith Thompson
some_type *p = malloc(BIG_VALUE);
// ...
p = realloc(p, SMALL_VALUE);
... If realloc() succeeds and *does* relocate the object, p
still points to memory that has now been deallocated, and you don't have
a pointer to the newly allocated memory.
Surely some mistake?
Yes, and thanks for catching it.

If realloc() succeeds, p points to the newly allocated memory; you no
longer have a pointer to the old deallocated memory, but that's a good
thing.

I think I was describing the behavior of calling realloc(p, SMALL_VALUE)
without assigning the result to anything.
Post by Phil Carmody
However, such self-assignments are bad for the reasons you state later;
verify, then update.
Right.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
James Kuyper
2024-06-28 10:37:49 UTC
Permalink
On 6/25/24 06:05, Keith Thompson wrote:
...
Post by Keith Thompson
some_type *p = malloc(BIG_VALUE);
// ...
p = realloc(p, SMALL_VALUE);
If the realloc() succeeds and doesn't relocate and copy the object,
you're fine. If realloc() succeeds and *does* relocate the object, p
still points to memory that has now been deallocated, and you don't have
a pointer to the newly allocated memory. ...
? I believe that, in that case, p does point to the newly allocated memory.
Keith Thompson
2024-06-28 11:05:40 UTC
Permalink
Post by James Kuyper
...
Post by Keith Thompson
some_type *p = malloc(BIG_VALUE);
// ...
p = realloc(p, SMALL_VALUE);
If the realloc() succeeds and doesn't relocate and copy the object,
you're fine. If realloc() succeeds and *does* relocate the object, p
still points to memory that has now been deallocated, and you don't have
a pointer to the newly allocated memory. ...
? I believe that, in that case, p does point to the newly allocated memory.
Yes, my mistake.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
James Kuyper
2024-06-28 10:36:45 UTC
Permalink
...
Post by Lawrence D'Oliveiro
Post by Keith Thompson
Having said all that, if realloc fails (indicated by returning a null
pointer), you still have the original pointer to the object.
In other words, it’s safe to ignore any error from that last shrinking
realloc? That’s good enough for me. ;)
No, you misunderstand:

q = realloc(p, SMALL_VALUE);

Then if q is null, p still points at the originally allocated memory. If
q is not null, then it may point at newly allocated memory, and p has an
indeterminate value. You cannot go forward ignoring the possibility that
no new object was allocated, because if you do, you have no way of
knowing which of the two pointers you can safely dereference. You need,
at least,

if(q)
p = q;

then you can safely use p, regardless of whether realloc() allocated new
memory.
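
Pulling this subthread together, a minimal sketch of the pattern under
discussion: grow the array in fixed-size chunks, check every realloc
through a temporary pointer, do one shrinking realloc at the end, and
simply keep the larger block if that final shrink fails.

#include <stdio.h>
#include <stdlib.h>

#define CHUNK 128   /* grow by a fixed number of elements */

int main(void)
{
    double *data = NULL, *tmp;
    size_t used = 0, allocated = 0;
    double x;

    while (scanf("%lf", &x) == 1)
    {
        if (used == allocated)              /* out of room: extend */
        {
            tmp = realloc(data, (allocated + CHUNK) * sizeof *data);
            if (!tmp)                       /* old block is still valid */
            {
                free(data);
                return EXIT_FAILURE;
            }
            data = tmp;
            allocated += CHUNK;
        }
        data[used++] = x;
    }

    if (used && used < allocated)           /* final shrink to fit */
    {
        tmp = realloc(data, used * sizeof *data);
        if (tmp)                            /* on failure just keep the */
            data = tmp;                     /* larger block; it stays valid */
    }

    printf("read %zu values\n", used);
    free(data);
    return 0;
}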
Bonita Montero
2024-06-24 14:16:07 UTC
Permalink
Post by Lawrence D'Oliveiro
Usually you don't resize the block with a few bytes ...
The usual way I use realloc is to maintain separate counts of the number
of array elements I have allocated, and the number I am actually using. A
realloc call is only needed when the latter hits the former. Every time I
call realloc, I will extend by some minimum number of array elements (e.g.
128), roughly comparable to the sort of array size I typically end up
with.
And then when the structure is complete, I do a final realloc call to
shrink it down so the size is actually that used. Is it safe to assume
such a call will never fail? Hmm ...
In C++ you don't have to think about that. Do an emplace_back and the
capacity() doubles with libstdc++ and libc++ if the vector becomes full.
Kenny McCormack
2024-06-24 15:44:56 UTC
Permalink
In article <v5bv37$vejk$***@raubtier-asyl.eternal-september.org>,
...
Post by Bonita Montero
In C++ you don't have to think about that.
You don't have to worry about it in Fortran, either.
--
"He is exactly as they taught in KGB school: an egoist, a liar, but talented - he
knows the mind of the wrestling-loving, under-educated, authoritarian-admiring
white male populous."
- Malcolm Nance, p59. -
Bonita Montero
2024-06-24 18:16:45 UTC
Permalink
Post by Kenny McCormack
...
Post by Bonita Montero
In C++ you don't have to think about that.
You don't have to worry about it in Fortran, either.
But Fortran is surely not flexible enough that you could build your
own containers based on the same logic, partially with low-level
facilities derived from C. I implemented a parallel hash map with
reader-writer locks that feels rather like a native component of the
language; try that with Fortran.
David Brown
2024-06-17 12:15:11 UTC
Permalink
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
Janis
Consider your target audience and their hardware, the target OS, and the
realistic size of your data. If the target is a PC, you can happily
malloc tens of MB at the start without a care, and for systems that do
not actually allocate system memory until you try to access the area,
there is no cost to this.

So in many situations where you are reading and parsing data from a
file, you can just do the initial malloc with more than enough space for
any realistic input file. You might still implement a realloc solution
for occasional extreme uses, and because it is nice to avoid artificial
limits for programs, but efficiency matters a lot less in those cases.

It may also be the case that even if realloc returns a different address
and logically copies a lot of data, this is done by smarter virtual
memory mapping so that only the mapping changes, and the underlying
physical RAM does not need to be copied. But I don't know if OSes and
realloc implementations are smart enough to do that.
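
On Linux the kernel does expose such a mechanism: mremap() can grow
(and, if necessary, move) a mapping without copying the underlying
pages. Whether a given realloc() implementation uses it for a particular
block is an implementation detail; this sketch only shows the syscall
itself:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t oldsz = 64UL << 20, newsz = 256UL << 20;   /* 64 MiB -> 256 MiB */
    char *p = mmap(NULL, oldsz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 0x55, oldsz);                 /* put some data in the pages */

    /* Grow the mapping; the data is not copied even if the address moves. */
    char *q = mremap(p, oldsz, newsz, MREMAP_MAYMOVE);
    if (q == MAP_FAILED)
        return 1;

    printf("old %p new %p first byte %#x\n", (void *)p, (void *)q,
           (unsigned)q[0]);
    munmap(q, newsz);
    return 0;
}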
Bonita Montero
2024-06-17 12:18:17 UTC
Permalink
Post by David Brown
Consider your target audience and their hardware, the target OS, and the
realistic size of your data.  If the target is a PC, you can happily
malloc tens of MB at the start without a care, and for systems that do
not actually allocate system memory until you try to access the area,
there is no cost to this.
If you have a memcpy() from a to b at the end, there's twice the memory
allocated, even if the destination pages are only assigned as they are
touched.
Janis Papanagnou
2024-06-17 13:21:24 UTC
Permalink
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
Let me add...

I'd assume that there's some basic allocation size defined; some
simple test sample with a handful of bytes didn't relocate the data.
Yet I don't know whether allocated memory is managed sequentially
or has linked blocks. A peek into the source code might help. What
I found is this comment for extending chunks:[*]
* Extending forward into following adjacent free chunk.
* Shifting backwards, joining preceding adjacent space
* Both shifting backwards and extending forward.
* Extending into newly sbrked space
Going to investigate that source code[*] later...

Janis

[*] https://elixir.bootlin.com/glibc/glibc-2.1.2/source/malloc/malloc.c
(line 3077 ff)
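
One way to get a feel for how often relocation actually happens is simply
to count it. A small probe (the growth step and iteration count are
arbitrary; the old address is captured as an integer before the call, so
the stale pointer value itself is never used afterwards):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    size_t size = 4096, step = 4096, moves = 0, i, iters = 10000;
    char *p = malloc(size);

    if (!p)
        return EXIT_FAILURE;

    for (i = 0; i < iters; i++)
    {
        uintptr_t before = (uintptr_t)p;    /* remember the old address */
        char *np = realloc(p, size + step);
        if (!np)
        {
            free(p);
            return EXIT_FAILURE;
        }
        p = np;
        size += step;
        if ((uintptr_t)p != before)
            moves++;                        /* the block was relocated */
    }

    printf("%zu reallocations, %zu relocations, final size %zu bytes\n",
           iters, moves, size);
    free(p);
    return 0;
}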
Scott Lurndal
2024-06-17 16:50:07 UTC
Permalink
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
I've not found a use for realloc in the last forty five years, myself.

I suspect that the performance issues are not an issue for relatively
small datasets, and are often exhibited during the non-performance critical
'setup' phase of an algorithm.
Michael S
2024-06-17 17:20:57 UTC
Permalink
On Mon, 17 Jun 2024 16:50:07 GMT
Post by Scott Lurndal
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the
discussion. "Occasionally" the increased data storage will be
relocated along with the previously stored data. On huge data sets
that might be a performance factor. Is there any experience or are
there any concrete factors about the conditions when this relocation
happens? - I could imagine that it's no issue as long as you're in
some kB buffer range, but if, say, we're using realloc() to
substantially increase buffers often it might be an issue to
consider. It would be good to get some feeling about that internal.
I've not found a use for realloc in the last forty five years, myself.
Did you find use for std::vector::resize()?
If yes, that could be major reason behind not finding use for realloc().
Another possible reason is coding for environments where dynamic
allocation either not used at all or used only during start up.

At least for me those are major reasons why I very rarely used realloc
since beginning of programming as a pro.
Post by Scott Lurndal
I suspect that the performance issues are not an issue for relatively
small datasets, and are often exhibited during the non-performance
critical 'setup' phase of an algorithm.
Scott Lurndal
2024-06-17 19:02:13 UTC
Permalink
Post by Michael S
On Mon, 17 Jun 2024 16:50:07 GMT
Post by Scott Lurndal
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the
discussion. "Occasionally" the increased data storage will be
relocated along with the previously stored data. On huge data sets
that might be a performance factor. Is there any experience or are
there any concrete factors about the conditions when this relocation
happens? - I could imagine that it's no issue as long as you're in
some kB buffer range, but if, say, we're using realloc() to
substantially increase buffers often it might be an issue to
consider. It would be good to get some feeling about that internal.
I've not found a use for realloc in the last forty five years, myself.
Did you find use for std::vector::resize()?
I'm pretty sure (checks) that I posted this reply to comp.lang.c.

std::vector::resize() doesn't work well from C (well, I can mangle
the names and use an explicit this pointer, but why bother?).
Post by Michael S
If yes, that could be major reason behind not finding use for realloc().
Another possible reason is coding for environments where dynamic
allocation either not used at all or used only during start up.
Or because the algorithms used don't call for realloc. Or there
are better alternatives (like mmap).
Rosario19
2024-06-18 09:50:48 UTC
Permalink
On Mon, 17 Jun 2024 08:08:07 +0200, Janis Papanagnou
Post by Janis Papanagnou
In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.
Janis
the only problem I see is that the memory that is freed should be the
first to be used, i.e. returned again from malloc or realloc, because
that memory is already in a good position near the CPU
Bonita Montero
2024-06-25 08:48:33 UTC
Permalink
Test this code with your Linux installation. For my installation
glibc does all realloc()ations in-place. Really surprising for me.

#include <stdio.h>
#include <stdlib.h>

int main()
{
void *p = malloc( 0x100000000 );
printf( "%p\n", p );
p = realloc( p, 1 );
printf( "%p\n", p );
malloc( 0x100000000 - 0x10000 );
p = realloc( p, 0x100000000 );
printf( "%p\n", p );
}
Vir Campestris
2024-06-25 10:55:02 UTC
Permalink
Post by Bonita Montero
Test this code with your Linux installation. For my installation
glibc does all realloc()ations in-place. Really surprising for me.
#include <stdio.h>
#include <stdlib.h>
int main()
{
    void *p = malloc( 0x100000000 );
    printf( "%p\n", p );
    p = realloc( p, 1 );
    printf( "%p\n", p );
    malloc( 0x100000000 - 0x10000 );
    p = realloc( p, 0x100000000 );
    printf( "%p\n", p );
}
Try allocating a bunch of little items, and looking at where they are.
They'll likely be contiguous, or evenly spaced, depending on your
implementation and what "little" is.

Then resize them all. Some will move.

Andy.
--
Your C++ comment up-thread BTW is off-topic here. My favourite C++
container is vector, and that has a reserve call so you can keep growing
the container without lots of reallocations.
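
A rough version of that experiment, in C; the block count and sizes are
arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define N 16

int main(void)
{
    char *blocks[N];
    int i;

    /* Allocate a handful of small blocks and print where they land. */
    for (i = 0; i < N; i++)
    {
        blocks[i] = malloc(32);
        if (!blocks[i])
            return EXIT_FAILURE;
        printf("block %2d at %p\n", i, (void *)blocks[i]);
    }

    /* Grow each one and report which had to move. */
    for (i = 0; i < N; i++)
    {
        uintptr_t before = (uintptr_t)blocks[i];
        char *np = realloc(blocks[i], 512);
        if (!np)
            return EXIT_FAILURE;
        blocks[i] = np;
        printf("block %2d %s\n", i,
               (uintptr_t)np == before ? "stayed in place" : "moved");
    }

    for (i = 0; i < N; i++)
        free(blocks[i]);
    return 0;
}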
Bonita Montero
2024-06-25 11:28:51 UTC
Permalink
Post by Vir Campestris
Post by Bonita Montero
Test this code with your Linux installation. For my installation
glibc does all realloc()ations in-place. Really surprising for me.
#include <stdio.h>
#include <stdlib.h>
int main()
{
     void *p = malloc( 0x100000000 );
     printf( "%p\n", p );
     p = realloc( p, 1 );
     printf( "%p\n", p );
     malloc( 0x100000000 - 0x10000 );
     p = realloc( p, 0x100000000 );
     printf( "%p\n", p );
}
Try allocating a bunch of little items, and looking at where they are.
They'll likely be contiguous, or evenly spaced, depending on your
implementation and what "little" is.
Then resize them all. Some will move.
Andy.
The interesting part is that after doing the first realloc()
the memory being freed isn't reused for the next malloc().
Vir Campestris
2024-06-26 11:15:33 UTC
Permalink
Post by Bonita Montero
The interesting part is that after doing the first realloc()
the memory being freed isn't reused for the next malloc().
That's entirely implementation dependent.

Andy
Bonita Montero
2024-06-26 13:01:21 UTC
Permalink
Post by Vir Campestris
Post by Bonita Montero
The interesting part is that after doing the first realloc()
the memory being freed isn't reused for the next malloc().
That's entirely implementation dependent.
Of course, but that's what I wanted to test.
I traced the allocations with strace, and after the single-byte step the
memory really is freed and not just marked as discardable.
DFS
2024-06-25 13:56:58 UTC
Permalink
Post by Bonita Montero
Test this code with your Linux installation. For my installation
glibc does all realloc()ations in-place. Really surprising for me.
#include <stdio.h>
#include <stdlib.h>
int main()
{
    void *p = malloc( 0x100000000 );
    printf( "%p\n", p );
    p = realloc( p, 1 );
    printf( "%p\n", p );
    malloc( 0x100000000 - 0x10000 );
    p = realloc( p, 0x100000000 );
    printf( "%p\n", p );
}
$ gcc -Wall montera_test.c -o mt
montera_test.c: In function ‘main’:
montera_test.c:10:9: warning: ignoring return value of ‘malloc’ declared
with attribute ‘warn_unused_result’ [-Wunused-result]
10 | malloc( 0x100000000 - 0x10000 );
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


$ ./mt
0x7fb976f12010
0x7fb976f12010
0x7fb876f11010
Bonita Montero
2024-06-25 14:16:22 UTC
Permalink
Post by DFS
Post by Bonita Montero
Test this code with your Linux installation. For my installation
glibc does all realloc()ations in-place. Really surprising for me.
#include <stdio.h>
#include <stdlib.h>
int main()
{
     void *p = malloc( 0x100000000 );
     printf( "%p\n", p );
     p = realloc( p, 1 );
     printf( "%p\n", p );
     malloc( 0x100000000 - 0x10000 );
     p = realloc( p, 0x100000000 );
     printf( "%p\n", p );
}
$ gcc -Wall montera_test.c -o mt
montera_test.c:10:9: warning: ignoring return value of ‘malloc’ declared
with attribute ‘warn_unused_result’ [-Wunused-result]
   10 |         malloc( 0x100000000 - 0x10000 );
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This actually isn't a problem because I just wanted to test whether the
tail of the p-block is immediately released so that the next grow can't
recycle the memory. But interestingly all steps are done in-place.