Discussion:
Buffer contents well-defined after fgets() reaches EOF ?
Add Reply
Janis Papanagnou
2025-02-09 05:59:00 UTC
Reply
Permalink
To get the last line of a text file I'm using

char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line

If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).

But the man page does not say anything whether this is guaranteed;
it says: "Reading stops after an EOF or a newline.", but it says
nothing about [not] writing to or [not] resetting the buffer.

Is that simple construct safe to get the last line of a text file?

Thanks.

Janis
Kaz Kylheku
2025-02-09 06:23:33 UTC
Reply
Permalink
Post by Janis Papanagnou
To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
Whenever fgets successfully reads one or more characters, and
adds them to the array (followed by a null terminator), it
returns the pointer it was given.

fgets only returns null when:

- it hits EOF when trying to obtain the first character.

- it hits an I/O error.
Post by Janis Papanagnou
But the man page does not say anything whether this is guaranteed;
it says: "Reading stops after an EOF or a newline.", but it says
nothing about [not] writing to or [not] resetting the buffer.
But of course, ISO C has the requirements nailed down. e.g. C99:

"The fgets function returns s if successful. If end-of-file is
encountered and no characters have been read into the array, the
contents of the array remain unchanged and a null pointer is returned.
If a read error occurs during the operation, the array contents are
indeterminate and a null pointer is returned."

Beware of man pages identifying themselves as "Linux Programmer's
Manual". Their quality is all over the place, and rarely hits
a high note.
Post by Janis Papanagnou
Is that simple construct safe to get the last line of a text file?
While fgets returns a pointer, you have a good line of input.
The terminating newline is included, unless it's the last line
and the file is missing it.

Some C newbies make this mistake:

    while (!feof(stdin)) {
        fgets(...);
        /* process line */
    }

Their code ends up processing the last line twice. On the last byte
of input, which is usually the terminating newline of the last line,
fgets returns without having reached EOF. The loop spins around one more
time. This time fgets returns NULL, not having read a single byte. The
code doesn't check this and processes the buffer again.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Andrey Tarasevich
2025-02-09 06:23:49 UTC
Reply
Permalink
Post by Janis Papanagnou
To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
But the man page does not say anything whether this is guaranteed;
it says: "Reading stops after an EOF or a newline.", but it says
nothing about [not] writing to or [not] resetting the buffer.
Is that simple construct safe to get the last line of a text file?
What situation exactly are you talking about? When end-of-file is
encountered _immediately_, before reading the very first character? Or
when end-of-file is encountered after reading something (i.e. when the
last line in the file does not end with a new-line character)?

The former situation is covered by the spec: "If end-of-file is
encountered and no characters have been read into the array, the
contents of the array remain unchanged and a null pointer is returned".

The second situation needs no additional clarification. Per the
general spec, as many characters as are available before the end-of-file
will be read and then terminated with '\0'. In that case there will be no
new-line character in the buffer.

So, in both cases we are perfectly safe when reading the last line of a
text file, as long as you don't forget to check the return value of `fgets`.

(This is all under the assumption that the size limit does not kick in.
I believe your question is not about that.)

Note also that `fgets` is not permitted to assume that the limit value
(the second parameter) correctly describes the accessible size of the
buffer. For this reason, e.g., it is not permitted to zero out the buffer
before reading. For example, this code is valid and has defined behavior

  char buffer[10];
  fgets(buffer, 1000, f);

provided the current line of the file fits into `char[10]`. I.e. even
though we "lied" to `fgets` about the limit, it is still required to
work correctly if the actual data fits into the actual buffer.

So, why do you care that "the previous contents of 'buf' are still
existing"?
--
Best regards,
Andrey
Andrey Tarasevich
2025-02-09 07:12:44 UTC
Reply
Permalink
Post by Andrey Tarasevich
Note also that `fgets` is not permitted to assume that the limit value
(the second parameter) correctly describes the accessible size of the
buffer. E.g. for this reason it is not permitted to zero-out the buffer
before reading. For example, this code is valid and has defined behavior
  char buffer[10];
  fgets(buffer, 1000, f);
provided the current line of the file fits into `char[10]`. I.e. even
though we "lied" to `fgets` about the limit, it is still required to
work correctly if the actual data fits into the actual buffer.
... and this part of the specification effectively guarantees that any
[tail] portion of the buffer not overwritten by the characters obtained
from the file will remain unchanged. If `fgets` reads 5 characters from the
file, only the first 6 characters of the buffer will be overwritten, while
the rest is guaranteed to remain untouched. If `fgets` reads nothing
(instant end-of-file), the entire buffer remains untouched.
--
Best regards,
Andrey
Lawrence D'Oliveiro
2025-02-09 23:52:17 UTC
Reply
Permalink
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
Andrey Tarasevich
2025-02-10 01:06:02 UTC
Reply
Permalink
Post by Lawrence D'Oliveiro
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
--
Best regards,
Andrey
Andrey Tarasevich
2025-02-10 01:22:43 UTC
Reply
Permalink
Post by Andrey Tarasevich
Post by Lawrence D'Oliveiro
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
... which actually raises an interesting quiz/puzzle/question:

Under what circumstances is `fgets` expected to return an empty string
(i.e. set the [0] entry of the buffer to '\0' and return non-null)?

The only answer I can see right away is:

When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to read 0
characters.

This is under the assumption that asking `fgets` to read 0 characters is
supposed to prevent it from detecting an end-of-file condition or I/O error
condition. One can probably do some nitpicking at the current wording...
but I believe the above is the intent.
--
Best regards,
Andrey
Michael S
2025-02-10 10:49:11 UTC
Reply
Permalink
On Sun, 9 Feb 2025 17:22:43 -0800
Post by Andrey Tarasevich
Post by Andrey Tarasevich
Post by Lawrence D'Oliveiro
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
Under what circumstances `fgets` is expected to return an empty
string? (I.e. set the [0] entry of the buffer to '\0' and return non-null)?
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters is
supposed to prevent it from detecting end-of-file condition or I/O
error condition. One can probably do some nitpicking at the current
wording... but I believe the above is the intent.
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. It is not horrendous, like gets(), so I
personally would not suggest deprecation. Instead, I would suggest
adding another function with similar goals, but a better-thought-out
API, to the standard library.
Tim Rentsch
2025-02-13 15:14:28 UTC
Reply
Permalink
Post by Michael S
On Sun, 9 Feb 2025 17:22:43 -0800
Post by Andrey Tarasevich
Post by Lawrence D'Oliveiro
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
Under what circumstances `fgets` is expected to return an empty
string? (I.e. set the [0] entry of the buffer to '\0' and return
non-null)?
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters is
supposed to prevent it from detecting end-of-file condition or I/O
error condition. One can probably do some nitpicking at the current
wording... but I believe the above is the intent.
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. [...]
What about the fgets() function do you think is poorly defined?

Second question: by "poorly defined" do you mean "defined
wrongly" or "defined ambiguously" (or both)?
Michael S
2025-02-14 14:51:08 UTC
Reply
Permalink
On Thu, 13 Feb 2025 07:14:28 -0800
Post by Tim Rentsch
Post by Michael S
On Sun, 9 Feb 2025 17:22:43 -0800
Post by Andrey Tarasevich
Post by Lawrence D'Oliveiro
If `fgets` reads nothing (instant end-of-file), the entire
buffer remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
Under what circumstances `fgets` is expected to return an empty
string? (I.e. set the [0] entry of the buffer to '\0' and return
non-null)?
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters
is supposed to prevent it from detecting end-of-file condition or
I/O error condition. One can probably do some nitpicking at the
current wording... but I believe the above is the intent.
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. [...]
What about the fgets() function do you think is poorly defined?
Second question: by "poorly defined" do you mean "defined
wrongly" or "defined ambiguously" (or both)?
For starters, it looks like the designers of fgets() did not believe in
their own motto about files being just streams of bytes.
I don't know the history, so maybe the function was defined this way
for portability with systems where text files have a special record-based
structure?

Then, everything about it feels inelegant.
The return value carries just 1 bit of information, success or failure.
So why did they encode this information in a baroque way instead of
something obvious, 0 and 1?
Appending a zero at the end also feels like a hack, but it is necessary
because of the main problem. And the main problem is: how is the user
supposed to figure out how many bytes were read?
In a well-designed API this question should be answered in O(1) time.
With fgets(), it can be answered in O(N) time when the input is trusted
to contain no zeros. When the input is arbitrary, finding out the answer
is even harder and requires quirks.

What is my suggestion for an alternative?
Without thinking too deeply, I'd suggest (ignoring issues of restrict
for the sake of brevity) a function that gives the same answer as foo()
below, but hopefully does it faster:

char *foo(FILE *fp, char *dst, int count, char last_c)
{
    while (count > 0) {
        int ch = fgetc(fp);
        if (ch == EOF) {
            if (ferror(fp))
                dst = NULL;
            break;
        }
        *dst++ = ch;
        if (ch == last_c)
            break;
        --count;
    }
    return dst;
}

The function foo() is more generic than fgets(). For use instead of
fgets() it should be accompanied by a standard constant EOL_CHAR.

I am not completely satisfied with the proposed solution. The API is
still less obvious than it could be. But it is much better than fgets().
Scott Lurndal
2025-02-14 15:10:50 UTC
Reply
Permalink
Post by Michael S
On Thu, 13 Feb 2025 07:14:28 -0800
Post by Tim Rentsch
Post by Michael S
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. [...]
What about the fgets() function do you think is poorly defined?
Second question: by "poorly defined" do you mean "defined
wrongly" or "defined ambiguously" (or both)?
For starter, it looks like designers of fgets() did not believe in
their own motto about files being just streams of bytes.
How so? The 's' in fgets is for 'string'. File Get String (fgets).

If there is no string, it returns NULL, just like other string
functions. Seems quite logical to me.
Post by Michael S
I don't know the history, so, may be, the function was defined this way
for portability with systems where text files have special record-based
structure?
No, fgets was defined long before C portability to anything other than
unix was considered. I'd guess it was originally a convenience function
used in several utilities before being moved to libc.
Post by Michael S
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or failure.
So why did they encode this information in baroque way instead of
something obvious, 0 and 1?
Because it is a string function.
Post by Michael S
Appending zero at the end also feels like a hack, but it is necessary
Because it is a string function, and strings are terminated with a
NUL byte.
Post by Michael S
because of the main problem. And the main problem is: how the user is
supposed to figure out how many bytes were read?
Most simply (albeit less performant) by using strlen on the result.
Post by Michael S
In well-designed API this question should be answered in O(1) time.
With fgets(), it can be answered in O(N) time when input is trusted to
contain no zeros.
Which is a prerequisite for using file-get-string (fgets) in the
first place - strings cannot have embedded NUL-bytes.

If you're reading non-string data use read/pread/mmap.
Michael S
2025-02-14 15:23:58 UTC
Reply
Permalink
On Fri, 14 Feb 2025 15:10:50 GMT
Post by Scott Lurndal
If you're reading non-string data use read/pread/mmap.
I don't know about you, but in decades of practice I have not yet
encountered a situation where I could trust file input with 100%
certainty.
Scott Lurndal
2025-02-14 16:46:08 UTC
Reply
Permalink
Post by Michael S
On Fri, 14 Feb 2025 15:10:50 GMT
Post by Scott Lurndal
If you're reading non-string data use read/pread/mmap.
I don't know about you, but in decades of practice I didn't yet
encounter a situation when I can trust a file input with 100%
certainty.
It may be a language difference, but I don't understand what
you're saying here.

If I read a data file with a particular format (e.g. ELF), I
can trust that when I read the header, I'll read the header.

I would never use fgets for that. fread, perhaps, but I'd
more likely use mmap when processing a structured binary
like an ELF or COFF file.

The header may be garbage, but any file is allowed to have
corrupt content. In the case of ELF, the magic numbers
provide some assurance that the rest of the header is
trustworthy.
Kaz Kylheku
2025-02-14 17:28:30 UTC
Reply
Permalink
Post by Michael S
On Fri, 14 Feb 2025 15:10:50 GMT
Post by Scott Lurndal
If you're reading non-string data use read/pread/mmap.
I don't know about you, but in decades of practice I didn't yet
encounter a situation when I can trust a file input with 100%
certainty.
With a little care, you can cheerfully process through a binary
executable with fgets, if you open the stream in binary mode (or on
Unixes, where you don't have to).
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Kaz Kylheku
2025-02-14 17:22:59 UTC
Reply
Permalink
Post by Michael S
For starter, it looks like designers of fgets() did not believe in
their own motto about files being just streams of bytes.
They obviously did, which is exactly why they painstakingly preserved
the annoying line terminators in the returned data.
Post by Michael S
I don't know the history, so, may be, the function was defined this way
for portability with systems where text files have special record-based
structure?
You are sliding into muddled thinking here.
Post by Michael S
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or failure.
Why would you assert a claim for which the standard library alone
is replete with counterexamples: getchar, malloc, getenv, pow, sin.

Did you mean /the/ return value (of fgets)?
Post by Michael S
So why did they encode this information in baroque way instead of
something obvious, 0 and 1?
Because you can express this concept:

    char work_area[SIZE];
    char *line;

    while ((line = fgets(work_area, sizeof work_area, stream)))
    {
        /* process line */
    }

The work_area just provides storage for the operation: line is the
returned line.

The loop would work even if fgets sometimes returned pointers that
are not to the first byte of work_area. It just so happens that
they always are.

It is meaningful to capture the returned value and work with
it as if it were distinct from the buffer.
Post by Michael S
Appending zero at the end also feels like a hack, but it is necessary
because of the main problem.
Appending zero is necessary so that the result meets the definition
of a C character string, without which it cannot be passed into
string-manipulating functions like strlen.

Home-grown functions that resemble fgets, but forget to add a null
byte sometimes, are the subjects of security CVEs.
Post by Michael S
And the main problem is: how the user is
supposed to figure out how many bytes were read?
Yes, how are they, if you take away the null byte?
Post by Michael S
In well-designed API this question should be answered in O(1) time.
In the context of C strings, that buys you almost nothing.
Even if you know the length, it's going to get measured numerous
more times.

It would be good if fgets nuked the terminating newline.

Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.

There is a nice idiom for that, by the way, which avoids a
temporary variable and an if test:

line[strcspn(line, "\n")] = 0;

strcspn(line, "\n") calculates the length of the prefix of line
which consists of non-newlines. That value is precisely the
array index of the first newline, if there is one, or else
of the terminating null, if there isn't a newline. Either
way, you can clobber that with a null byte.

Once you see the above, you will never do this again:

    newline = strchr(line, '\n');
    if (newline)
        *newline = 0;
Post by Michael S
With fgets(), it can be answered in O(N) time when input is trusted to
contain no zeros.
We have decided in the C world that text does not contain zeros.

This has become so pervasive that the remaining naysayers can safely
be regarded as part of a lunatic fringe.

Software that tries to support the presence of raw nulls in text is
actively harmful for security.

For instance, a piece of text with embedded nulls might have valid
overall syntax which makes it immune to an injection attack.

But when it is sent to another piece of software which interprets
the null as a terminator, the syntax is chopped in half, allowing
it to be completed by a malicious actor.
Post by Michael S
When input is arbitrary, finding out the answer is
even harder and requires quirks.
When input is arbitrary, don't use fgets? It's for text.
Post by Michael S
The function foo() is more generic than fgets(). For use instead of
fgets() it should be accompanied by standard constant EOL_CHAR.
I am not completely satisfied with proposed solution. The API is
still less obvious than it could be. But it is much better than fgets().
If last_c is '\n', you're still writing the pesky newline that
the caller will often want to remove.

Adding a terminating null and returning a pointer to that null
would be better.

You could then call the operation again with the returned dst
pointer, and it would continue extending the string,
without obliterating the last character.

I'm sure I've seen a foo-like function in software before:
reading delimited by an arbitrary byte, with length signaling.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Keith Thompson
2025-02-14 19:03:20 UTC
Reply
Permalink
Kaz Kylheku <643-408-***@kylheku.com> writes:
[...]
Post by Kaz Kylheku
It would be good if fgets nuked the terminating newline.
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
line[strcspn(line, "\n")] = 0;
[...]

Then how do you detect a partial line? That can occur either if
the last line doesn't have a terminating newline (on systems that
permit it) or if a line is too long to fit in the array.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Kaz Kylheku
2025-02-14 19:34:59 UTC
Reply
Permalink
Post by Keith Thompson
[...]
Post by Kaz Kylheku
It would be good if fgets nuked the terminating newline.
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
line[strcspn(line, "\n")] = 0;
[...]
Then how do you detect a partial line? That can occur either if
the last line doesn't have a terminating newline (on systems that
permit it) or a line that's too long to fit in the array.
I've seen many programs like this that don't care. They have some
'char buf[4096]' and that's that.

In a program not required or designed to handle arbitrarily
long lines, you can do something very simple (prior to the
above line[strcspn(line, "\n")] = 0 expression).

- zero-initialize the buffer.

- after every call to fgets, inspect the value of the second-to-last
array element. If the value is neither zero, nor '\n', then somehow
diagnose that a too-long line has been presented to the program,
contrary to its documented limitations.

This will yield a false positive on an unterminated last line. That
issue can be added as a documented limitation, or else the buffer can be
sized one greater than what the documented line length limit requires,
so that the program allows inner lines to be one character longer than
the documented limit, but is strict with regard to an unterminated last
line.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Michael S
2025-02-15 18:06:02 UTC
Reply
Permalink
On Fri, 14 Feb 2025 19:34:59 -0000 (UTC)
Post by Kaz Kylheku
Post by Keith Thompson
[...]
Post by Kaz Kylheku
It would be good if fgets nuked the terminating newline.
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
line[strcspn(line, "\n")] = 0;
[...]
Then how do you detect a partial line? That can occur either if
the last line doesn't have a terminating newline (on systems that
permit it) or a line that's too long to fit in the array.
I've seen many programs like this don't care. They have some
'char buf[4096]' and that's that.
IMHO, even a program that is not designed to handle long lines should
give an informative error diagnostic when it encounters one.
Your trick described below is good for that, but one has to be a rather
good programmer in order to invent such a trick. I believe that if one
has to be good in order to use an API, it's a clear indication that the
API is *not* good.

Similarly, IMHO, programs not designed to handle the presence of null
characters in text should give an informative error diagnostic when
they are encountered. And in that, fgets() is especially unhelpful.
Post by Kaz Kylheku
In a program not required or designed to handle arbitrarily
long lines, you can do something very simple (prior to the
above line[strcspn(line, "\n")] = 0 expression).
- zero-initialize the buffer.
- after every call to fgets, inspect the value of the second-to-last
array element. If the value is neither zero, nor '\n', then somehow
diagnose that a too-long line has been presented to the program,
contrary to its documented limitations.
This will yield a false positive on an unterminated last line. That
issue can be added as a documented limitation, or else the buffer can
be sized one greater than what the documented line length limit
requires, so that the program allows inner lines to be one character
longer than the documented limit, but is strict with regard to an
unterminated last line.
Scott Lurndal
2025-02-14 20:01:16 UTC
Reply
Permalink
Post by Keith Thompson
[...]
Post by Kaz Kylheku
It would be good if fgets nuked the terminating newline.
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
line[strcspn(line, "\n")] = 0;
[...]
Then how do you detect a partial line? That can occur either if
the last line doesn't have a terminating newline (on systems that
permit it) or a line that's too long to fit in the array.
I terminate the line at the first newline, if present.

    while ((line = fgets(buffer, sizeof(buffer), f)) != NULL) {
        char *cp;

        if ((cp = strchr(line, '\n')) != NULL) {
            *cp = '\0';
        }

        history_set_pos(0);
        if (history_search(line, 1) == -1) {
            add_history(line);
        }

        commands.parse_and_execute(line);
    }

Not the most efficient, perhaps, but better than using
getchar to read the line.
Janis Papanagnou
2025-02-14 19:51:38 UTC
Reply
Permalink
[ ... ]
It would be good if fgets nuked the terminating newline.
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
line[strcspn(line, "\n")] = 0;
This is nice.

In the test code which was the base of this thread I'm relying
on the existing '\n' and use buf[strlen(buf)-1] = '\0'; to
remove the last character.
[...]
We have decided in the C world that text does not contain zeros.
This has become so pervasive that the remaining naysayers can safely
regarded as part of a lunatic fringe.
Software that tries to support the presence of raw nulls in text is
actively harmful for security.
Actually, in the same code, I'm also using the strtok() function
to iterate over the buffer to get pointers to the separate tokens;
if I'm not mistaken, that function places '\0' characters in the
buffer to separate the string tokens. This is very efficient and
(since the original buffer data isn't necessary any more) there are
no problems (here) with its data interspersed with '\0'; the strings
(the tokens) get accessed through the returned pointers, and the
buffer is just the physical (now sort of "binary") storage.

Janis
[...]
Michael S
2025-02-15 17:29:11 UTC
Reply
Permalink
On Fri, 14 Feb 2025 20:51:38 +0100
Post by Janis Papanagnou
Actually, in the same code, I'm also using the strtok() function
strtok() is one of the relatively small set of more problematic
functions in the C library that are not thread-safe.
If you only care about POSIX targets, then I'd recommend avoiding
strtok and using strtok_r().
Janis Papanagnou
2025-02-16 03:29:20 UTC
Reply
Permalink
Post by Michael S
On Fri, 14 Feb 2025 20:51:38 +0100
Post by Janis Papanagnou
Actually, in the same code, I'm also using the strtok() function
strtok() is one of the relatively small set of more problematic
functions in the C library that are not thread-safe.
I know that it's not thread-safe. (You can't miss that information
if you look up the man page to inspect the function interface.)
Post by Michael S
If you only care about POSIX targets, then I'd recommend avoiding
strtok and using strtok_r().
But since I don't use threads - neither here nor have I ever needed
them generally in my "C" contexts - that's unnecessary. Isn't it?

Moreover, I prefer functions with a simpler interface to functions
with a clumsier one (I mean the 'char **saveptr' part); so why
use the complex one in the first place if it just complicates its
use and reduces code clarity unnecessarily?

Re "more problematic functions in the C library"...
I had to chuckle at that; if you're coming from other languages,
most "C" functions - especially the low-level "C" functions that
operate on memory with pointers - don't look "unproblematic". :-)

Janis
James Kuyper
2025-02-16 06:04:11 UTC
Reply
Permalink
Post by Janis Papanagnou
Post by Michael S
On Fri, 14 Feb 2025 20:51:38 +0100
Post by Janis Papanagnou
Actually, in the same code, I'm also using the strtok() function
strtok() is one of the relatively small set of more problematic
functions in the C library that are not thread-safe.
I know that it's not thread-safe. (You can't miss that information
if you look up the man page to inspect the function interface.)
Post by Michael S
If you only care about POSIX target, the I'd reccomend to avoid strtok
and to use strtok_r().
If you cannot assume POSIX, but can assume C2011 or later, you might be
able to use strtok_s() instead. You need to add

#ifdef __STDC_LIB_EXT1__
    #define __STDC_WANT_LIB_EXT1__ 1
    // strtok_s() will be declared in <string.h>
#endif
#include <string.h>
Post by Janis Papanagnou
But since I don't use threads - neither here nor did I ever needed
them generally in my "C" contexts - that's unnecessary. Isn't it?
No. What makes strtok() problematic can come up without any use of
threads. Consider for the moment a bug I had to investigate. A function
that was looping through strtok() calls to parse a string called a
utility function during each pass through the loop. The utility function
also called strtok() in a loop to parse an entirely different string for
a different purpose. Exercise for the student: figure out what the
consequences were.
Kaz Kylheku
2025-02-16 07:37:09 UTC
Reply
Permalink
Post by James Kuyper
No. What makes strtok() problematic can come up without any use of
threads. Consider for the moment a bug I had to investigate. A function
that was looping through strtok() calls to parse a string called a
utility function during each pass through the loop. The utility function
also called strtok() in a loop to parse an entirely different string for
a different purpose. Exercise for the student: figure out what the
consequences were.
Moreover, if strtok is thread-safe thanks to using thread-specific
storage for the context, that will not make it recursion-safe. It will
make the bug behave predictably for the same inputs, no matter how other
threads use strtok.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Janis Papanagnou
2025-02-16 17:59:31 UTC
Reply
Permalink
Post by James Kuyper
Post by Janis Papanagnou
But since I don't use threads - neither here nor have I ever needed
them generally in my "C" contexts - that's unnecessary. Isn't it?
No. What makes strtok() problematic can come up without any use of
threads. Consider for the moment a bug I had to investigate. A function
that was looping through strtok() calls to parse a string called a
utility function during each pass through the loop. The utility function
also called strtok() in a loop to parse an entirely different string for
a different purpose. [...]
You can always construct situations that don't apply to my application.

All I can really infer about strtok() is that it has to use
static state information (which naturally doesn't support re-entrant
code). That obviously also implies that it's not thread-safe and
that you cannot nest calls as you depicted above (and I think this
is even documented for those to whom it may not be obvious).

So again: if it's unnecessary here, why should I prefer code that
is clumsier and less clear than necessary?

If I'd write (for example) a library function to parse tokens then
I'd certainly not use this function because I don't want conflicts
and dependencies on the surrounding context of other code that uses
this library function.

But, again, in my application context it makes no sense.

Janis
Michael S
2025-02-16 08:48:44 UTC
Reply
Permalink
On Sun, 16 Feb 2025 04:29:20 +0100
Post by Janis Papanagnou
Post by Michael S
On Fri, 14 Feb 2025 20:51:38 +0100
Post by Janis Papanagnou
Actually, in the same code, I'm also using the strtok() function
strtok() is one of the relatively small set of more problematic
functions in C library that are not thread-safe.
I know that it's not thread-safe. (You can't miss that information
if you look up the man page to inspect the function interface.)
Post by Michael S
If you only care about POSIX targets, then I'd recommend avoiding
strtok and using strtok_r() instead.
But since I don't use threads - neither here nor have I ever needed
them generally in my "C" contexts - that's unnecessary. Isn't it?
Moreover, I prefer functions with a simpler interface to functions
with a more clumsy one (I mean the 'char **saveptr' part); so why
use the complex one in the first place if it just complicates its
use and reduces the code clarity unnecessarily.
I don't see how explicit context variable can be considered less clear
than context hidden within library in non-obvious way (see post of Kaz
that points out that there are at least two options of how exactly it
could be handled, with different semantics).
Post by Janis Papanagnou
Re "more problematic functions in C library"...
I had to chuckle on that; if you're coming from other languages
most "C" functions - especially the low-level "C" functions that
operate on memory with pointers - don't look "unproblematic". :-)
Janis
I tend to have no problems with low-level C RTL functions, in
particular those whose names start with 'mem'. More problems with some
of those that try to be "higher level", for example, strcat(). Even more
with those whose designers probably considered them 'object-oriented',
like strtok().
Janis Papanagnou
2025-02-16 18:14:31 UTC
Reply
Permalink
Post by Michael S
On Sun, 16 Feb 2025 04:29:20 +0100
Post by Janis Papanagnou
Moreover, I prefer functions with a simpler interface to functions
with a more clumsy one (I mean the 'char **saveptr' part); so why
use the complex one in the first place if it just complicates its
use and reduces the code clarity unnecessarily.
I don't see how explicit context variable can be considered less clear
than context hidden within library in non-obvious way [...]
Explicitly maintaining an unnecessary parameter, and providing
additional code to handle it unnecessarily, is not obviously less
clear to you? - Then I cannot help you, sorry.
Post by Michael S
Post by Janis Papanagnou
Re "more problematic functions in C library"...
I had to chuckle on that; if you're coming from other languages
most "C" functions - especially the low-level "C" functions that
operate on memory with pointers - don't look "unproblematic". :-)
I tend to have no problems with low-level C RTL functions, in
particular those whose names start with 'mem'.
*shrug*

I recall (in early C++ days, when there wasn't yet a string type)
having based a set of string functions on the mem...() type functions
(as opposed to the str...() type functions); it wasn't more difficult.
Rather, the effects had been (a) that we could operate on binary strings,
(b) that the code was (slightly) faster, and (c) that some code could
get even simpler.
Post by Michael S
More problems with some
of those that try to be "higher level", for example, strcat(). Even more
with those that their designers probably considered 'object-oriented',
like strtok().
I don't consider strtok() to be 'object-oriented'; rather the opposite,
because of the globally static state it has. OO objects typically
carry their own state (unless you deliberately implement a Singleton
pattern).

Janis
Michael S
2025-02-17 09:54:24 UTC
Reply
Permalink
On Sun, 16 Feb 2025 19:14:31 +0100
Post by Janis Papanagnou
Post by Michael S
On Sun, 16 Feb 2025 04:29:20 +0100
*shrug*
I recall (in early C++ days, when there wasn't yet a string type)
having based a set of string functions on the mem...() type functions
(as opposed to the str...() type functions); it wasn't more difficult.
Rather, the effects had been (a) that we could operate on binary strings,
(b) that the code was (slightly) faster, and (c) that some code could
get even simpler.
Janis
Your first hand experience appears to match mine.
Then, why *shrug*? Shouldn't you say *nod* or *noddle* ?
Kaz Kylheku
2025-02-16 07:32:23 UTC
Reply
Permalink
Post by Michael S
On Fri, 14 Feb 2025 20:51:38 +0100
Post by Janis Papanagnou
Actually, in the same code, I'm also using the strtok() function
strtok() is one of the relatively small set of more problematic
functions in C library that are not thread-safe.
The design of the strtok() API is not inherently unsafe against threads;
but it requires thread-local storage to be safe.

Since ISO C has threads now, it takes the opportunity to
explicitly remove any requirement for thread safety in strtok.

However, it is possible for an implementation to step forward and
make it thread safe. For instance, in a POSIX system, a thread-specific
key can be allocated for strtok on library initialization,
or the first use of strtok (via pthread_once).

static pthread_key_t strtok_key;

// ...

if (pthread_key_create(&strtok_key, NULL))
...

Then strtok does

char *strtok (char * restrict str, const char * restrict delim)
{
if (str == NULL)
str = pthread_getspecific(strtok_key);

...

// all return paths do this, if str has changed:
pthread_setspecific(strtok_key, str);
return ...;
}

Only problem is that this will not perform anywhere near as well as
strtok_r, which specifies an inexpensive location for the context
pointer.
Post by Michael S
If you only care about POSIX targets, then I'd recommend avoiding strtok
and using strtok_r() instead.
I would recommend learning about strspn and strcspn, and writing
your own tokenizing loop:

/* strtok-like loop: input variables are str and delim */

for (;;) {
/* skip delim chars to find start of tok */
char *tok = str + strspn(str, delim);

/* tokens must be nonempty; stop if none remains */
if (*tok == 0)
break;

/* OK; tok points to non-delim char.
Find end of token: skip span of non-delim chars. */
char *end = tok + strcspn(tok, delim);

/* Record whether the end of the token is the end
of the string. */
char more = *end;

/* null-terminate token */
*end = 0;

{ /* process tok here */ }

if (!more)
break;

/* If there is more material after the tok, point
str there and continue */
str = end + 1;
}

The strtok function is ill-suited to many situations. For instance,
there are situations in which you do want empty tokens, like CSV, such
that ",abc,def," shows four tokens, two of them empty.

With the strspn and strcspn building blocks, you can easily whip up a
custom tokenizing loop that has the right semantics for the situation.

We can also write our loop such that it restores the original
character that was overwritten in order to null-terminate the token,
simply by adding *end = more. Thus when the loop ends, the string
is restored to its original state.

I can understand code like that above without having to look up
anything, but if I see strtok or strtok_r code after many years of not
working with strtok, I will need a refresher on how exactly they define
a token.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Michael S
2025-02-16 09:05:46 UTC
Reply
Permalink
On Sun, 16 Feb 2025 07:32:23 -0000 (UTC)
Post by Kaz Kylheku
Post by Michael S
On Fri, 14 Feb 2025 20:51:38 +0100
Post by Janis Papanagnou
Actually, in the same code, I'm also using the strtok() function
strtok() is one of the relatively small set of more problematic
functions in C library that are not thread-safe.
The design of the strtok() API is not inherently unsafe against
threads; but it requires thread-local storage to be safe.
Since ISO C has threads now, it takes the opportunity to
explicitly remove any requirement for thread safety in strtok.
However, it is possible for an implementation to step forward and
make it thread safe. For instance, in a POSIX system, a
thread-specific key can be allocated for strtok on library
initialization, or the first use of strtok (via pthread_once).
static pthread_key_t strtok_key;
// ...
if (pthread_key_create(&strtok_key, NULL))
...
Then strtok does
char *strtok (char * restrict str, const char * restrict delim)
{
if (str == NULL)
str = pthread_getspecific(strtok_key);
...
pthread_setspecific(strtok_key, str);
return ...;
}
Only problem is that this will not perform anywhere near as well as
strtok_r, which specifies an inexpensive location for the context
pointer.
Post by Michael S
If you only care about POSIX targets, then I'd recommend avoiding
strtok and using strtok_r() instead.
I would recommend learning about strspn and strcspn, and writing
/* strtok-like loop: input variables are str and delim */
for (;;) {
/* skip delim chars to find start of tok */
char *tok = str + strspn(str, delim);
/* tokens must be nonempty; stop if none remains */
if (*tok == 0)
break;
/* OK; tok points to non-delim char.
Find end of token: skip span of non-delim chars. */
char *end = tok + strcspn(tok, delim);
/* Record whether the end of the token is the end
of the string. */
char more = *end;
/* null-terminate token */
*end = 0;
{ /* process tok here */ }
if (!more)
break;
/* If there is more material after the tok, point
str there and continue */
str = end + 1;
}
The strtok function is ill-suited to many situations. For instance,
there are situations in which you do want empty tokens, like CSV, such
that ",abc,def," shows four tokens, two of them empty.
With the strspn and strcspn building blocks, you can easily whip up a
custom tokenizing loop that has the right semantics for the situation.
We can also write our loop such that it restores the original
character that was overwritten in order to null-terminate the token,
simply by adding *end = more. Thus when the loop ends, the string
is restored to its original state.
I can understand code like that above without having to look up
anything, but if I see strtok or strtok_r code after many years of not
working with strtok, I will need a refresher on how exactly they
define a token.
For parsing of something important and relatively well-defined, like
CSV, I'd very seriously consider the option of not using the standard
str* utilities at all, with the exception of those where coding your
own requires special expertise, primarily strtod(). BTW, even strtod()
can't be blindly relied on for .csv, because it accepts hex floats,
while a standard CSV parser has to reject them.
Most likely, avoiding fgets() is also a good idea in this case.
Janis Papanagnou
2025-02-16 18:25:46 UTC
Reply
Permalink
Post by Michael S
On Sun, 16 Feb 2025 07:32:23 -0000 (UTC)
Post by Kaz Kylheku
[...]
The strtok function is ill-suited to many situations. For instance,
there are situations in which you do want empty tokens, like CSV, such
that ",abc,def," shows four tokens, two of them empty.
With the strspn and strcspn building blocks, you can easily whip up a
custom tokenizing loop that has the right semantics for the situation.
We can also write our loop such that it restores the original
character that was overwritten in order to null-terminate the token,
simply by adding *end = more. Thus when the loop ends, the string
is restored to its original state.
I can understand code like that above without having to look up
anything, but if I see strtok or strtok_r code after many years of not
working with strtok, I will need a refresher on how exactly they
define a token.
For parsing of something important and relatively well-defined, like
CSV, I'd very seriously consider the option of not using the standard
str* utilities at all, with the exception of those where coding your
own requires special expertise, primarily strtod(). BTW, even strtod()
can't be blindly relied on for .csv, because it accepts hex floats,
while a standard CSV parser has to reject them.
Most likely, avoiding fgets() is also a good idea in this case.
I certainly wouldn't call the CSV format "well-defined". But
CSV is certainly nasty enough that one should use some existing CSV
library and not re-invent the wheel in the first place.

Janis
Janis Papanagnou
2025-02-16 18:21:02 UTC
Reply
Permalink
Post by Kaz Kylheku
I would recommend learning about strspn and strcspn, and writing
Incidentally, in a recent toy project, I used it for parsing simple
syntax.

For the code of this thread, strtok() was simpler to use, though.
Post by Kaz Kylheku
The strok function is ill-suited to many situations. For instance,
there are situations in which you do want empty tokens, like CSV, such
that ",abc,def," shows four tokens, two of them empty.
Sure. (Always use an appropriate solution for any given task.)

Janis
Post by Kaz Kylheku
[...]
Scott Lurndal
2025-02-16 20:26:53 UTC
Reply
Permalink
Post by Janis Papanagnou
Post by Kaz Kylheku
I would recommend learning about strspn and strcspn, and writing
Incidentally, in a recent toy project, I used it for parsing simple
syntax.
For the code of this thread the strtok() was simpler to use, though.
lex/flex isn't that difficult to learn or use, and it's quite a bit
more flexible than hand-rolled tokenizers using str* functions.
Michael S
2025-02-15 17:41:11 UTC
Reply
Permalink
On Fri, 14 Feb 2025 17:22:59 -0000 (UTC)
Post by Kaz Kylheku
Did you mean /the/ return value (of fgets)?
Yes, I did.
Michael S
2025-02-15 18:29:15 UTC
Reply
Permalink
On Fri, 14 Feb 2025 17:22:59 -0000 (UTC)
Post by Kaz Kylheku
Post by Michael S
For starters, it looks like the designers of fgets() did not believe in
their own motto about files being just streams of bytes.
They obviously did, which is exactly why they painstakingly preserved
the annoying line terminators in the returned data.
Post by Michael S
I don't know the history, so maybe the function was defined this
way for portability with systems where text files have a special
record-based structure?
You are sliding into muddled thinking here.
Post by Michael S
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or
failure.
Why would you assert a claim for which the standard library alone
is replete with counterexamples: getchar, malloc, getenv, pow, sin.
Did you mean /the/ return value (of fgets)?
Post by Michael S
So why did they encode this information in a baroque way instead of
something obvious, 0 and 1?
char work_area[SIZE];
char *line;
while ((line = fgets(work_area, sizeof work_area, stream)))
{
/* process line */
}
The work_area just provides storage for the operation: line is the
returned line.
The loop would work even if fgets sometimes returned pointers that
are not to the first byte of work_area. It just so happens that
they always are.
It is meaningful to capture the returned value and work with
it as if it were distinct from the buffer.
Post by Michael S
Appending zero at the end also feels like a hack, but it is
necessary because of the main problem.
Appending zero is necessary so that the result meets the definition
of a C character string, without which it cannot be passed into
string-manipulating functions like strlen.
Home-grown functions that resemble fgets, but forget to add a null
byte sometimes, are the subjects of security CVEs.
Post by Michael S
And the main problem is: how the user is
supposed to figure out how many bytes were read?
Yes, how are they, if you take away the null byte?
Post by Michael S
In a well-designed API, this question should be answered in O(1) time.
In the context of C strings, that buys you almost nothing.
Even if you know the length, it's going to get measured numerous
more times.
It would be good if fgets nuked the terminating newline.
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
if test:

line[strcspn(line, "\n")] = 0;
strcspn(line, "\n") calculates the length of the prefix of line
which consists of non-newlines. That value is precisely the
array index of the first newline, if there is one, or else
of the terminating null, if there isn't a newline. Either
way, you can clobber that position with a null byte. Compare the
more verbose alternative, which needs a test:

newline = strchr(line, '\n');
if (newline)
*newline = 0;
Post by Michael S
With fgets(), it can be answered in O(N) time when input is trusted
to contain no zeros.
We have decided in the C world that text does not contain zeros.
Yes, for internal data.
External input has to be sanitized.
Post by Kaz Kylheku
This has become so pervasive that the remaining naysayers can safely
regarded as part of a lunatic fringe.
Software that tries to support the presence of raw nulls in text is
actively harmful for security.
For instance, a piece of text with embedded nulls might have valid
overall syntax which makes it immune to an injection attack.
But when it is sent to another piece of software which interprets
the null as a terminator, the syntax is chopped in half, allowing
it to be completed by a malicious actor.
I don't quite understand. In particular, I don't understand whether you
argue in favor of fgets() or against it.
Post by Kaz Kylheku
Post by Michael S
When input is arbitrary, finding out the answer is
even harder and requires quirks.
When input is arbitrary, don't use fgets? It's for text.
Post by Michael S
The function foo() is more generic than fgets(). For use instead of
fgets() it should be accompanied by standard constant EOL_CHAR.
I am not completely satisfied with the proposed solution. The API is
still less obvious than it could be. But it is much better than fgets().
If last_c is '\n', you're still writing the pesky newline that
the caller will often want to remove.
Adding a terminating null and returning a pointer to that null
would be better.
If the caller wants it, it can easily do it by itself.
OTOH, if we follow your proposal, we lose information about the
presence/absence of an EOL at the end of the file. I think, for a
generic function, it's better not to lose any information, even
information that is not useful for 99.99% of the callers.
Post by Kaz Kylheku
You could then call the operation again with the returned dst
pointer, and it would continue extending the string,
without obliterating the last character.
reading delimited by an arbitrary byte, with length signaling.
I certainly do not pretend that I invented anything new here.
Nor do I pretend that it's the best possible.
More so, I'd like it even more mundane. I just can't figure out how to
do it without adding one more [pointer] parameter.

One obvious possibility is to return the # of characters read instead of
a pointer. Then 0 can mean EOF and negative values can mean I/O errors.
But that is also not sufficiently boring.
Janis Papanagnou
2025-02-16 03:33:17 UTC
Reply
Permalink
Post by Michael S
One obvious possibility is to return # of characters read instead of
pointer. Then 0 can mean EOF and negative values can mean I/O errors.
But that is also not sufficiently boring.
But isn't that the (already existing) interface that 'fread()' had
been designed for?

Janis
Janis Papanagnou
2025-02-14 19:23:50 UTC
Reply
Permalink
Post by Michael S
[ fgets() poorly defined? ]
[...]
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or failure.
So why did they encode this information in a baroque way instead of
something obvious, 0 and 1?
I see it differently; it basically returns a pointer
for working with the data, and the special NULL pointer value is
just the often-seen hack where a special pointer value provides
an error indication.

Typical application (for me) is

if ((line = fgets (buf, BUFSIZ, fd)) == NULL)
// handle error...
else
// process data

Moreover, returning the pointer to the data makes it possible
(e.g.) to nest string processing functions (including 'fgets'),
to chain processing, or to immediately access/dereference the string
contents.

IMO the 'fgets' function matches the typical interface for such
string functions (in C) allowing such programming language idioms
like the two or three mentioned.

I think it is generally arguable whether code patterns like
if ((line = fgets (buf, BUFSIZ, fd)) == NULL)
can be considered clean code with clean syntax and a clean design.
But not in a "C" language newsgroup where such things are typical
(given this function design) as a language-specific code pattern.

Janis
Post by Michael S
[...]
James Kuyper
2025-02-14 19:38:40 UTC
Reply
Permalink
On 2/14/25 14:23, Janis Papanagnou wrote:
...
Post by Janis Papanagnou
I see it differently; it basically returns a pointer
for working with the data, and the special NULL pointer value is
just the often-seen hack where a special pointer value provides
an error indication.
Typical application (for me) is
if ((line = fgets (buf, BUFSIZ, fd)) == NULL)
// handle error...
else
// process data
Moreover, returning the pointer to the data makes it possible
(e.g.) to nest string processing functions (including 'fgets'),
to chain processing, or to immediately access/dereference the string
contents.
IMO the 'fgets' function matches the typical interface for such
string functions (in C) allowing such programming language idioms
like the two or three mentioned.
I think it is generally arguable whether code patterns like
if ((line = fgets (buf, BUFSIZ, fd)) == NULL)
can be considered clean code with clean syntax and a clean design.
But not in a "C" language newsgroup where such things are typical
(given this function design) as a language-specific code pattern.
As with several of the string processing functions, I think fgets()
would be better if it returned a pointer to the end of the data that was
read in, rather than to the beginning. The chaining you talk about does
not, in general, work properly if the return value from fgets() is NULL,
or the entire buffer was filled without writing a null character.
Janis Papanagnou
2025-02-14 20:02:19 UTC
Reply
Permalink
Post by James Kuyper
As with several of the string processing functions, I think fgets()
would be better if it returned a pointer to the end of the data that was
read in, rather than to the beginning.
Yes, that's another option. The language designers have to decide which
behavior is more useful. There are pros and cons, IMO. On the minus side
would be that the origin of the string gets lost that way. (Of course
you can adjust your code then to keep a copy. But any way implemented,
you need to adjust your code according to how the function is defined.)
Post by James Kuyper
The chaining you talk about does
not, in general, work properly if the return value from fgets() is NULL,
Yes.
Post by James Kuyper
or the entire buffer was filled without writing a null character.
I read the man page as saying that this much, at least, is guaranteed:

"A terminating null byte ('\0') is stored
after the last character in the buffer."

(also in cases where no EOL or EOF is read).

Janis
Michael S
2025-02-15 17:53:10 UTC
Reply
Permalink
On Fri, 14 Feb 2025 21:02:19 +0100
Post by Janis Papanagnou
Post by James Kuyper
As with several of the string processing functions, I think fgets()
would be better if it returned a pointer to the end of the data
that was read in, rather than to the beginning.
Yes, that's another option. The language designers have to decide which
behavior is more useful. There are pros and cons, IMO.
IMHO, there are no cons.
Returning pointer to the end of data is very obviously superior.
Post by Janis Papanagnou
On the minus side
would be that the origin of the string gets lost that way.
Huh?
How could you lose something you just passed to the function?
In most typical code, it's not even a complex expression or pointer,
but the name of an array.
Post by Janis Papanagnou
(Of course
you can adjust your code then to keep a copy. But any way implemented,
you need to adjust your code according to how the function is
defined.)
Janis Papanagnou
2025-02-16 03:48:20 UTC
Reply
Permalink
Post by Michael S
On Fri, 14 Feb 2025 21:02:19 +0100
Post by Janis Papanagnou
Post by James Kuyper
As with several of the string processing functions, I think fgets()
would be better if it returned a pointer to the end of the data
that was read in, rather than to the beginning.
Yes, that's another option. The language designers have to decide which
behavior is more useful. There are pros and cons, IMO.
IMHO, there are no cons.
If you think so.
Post by Michael S
Returning pointer to the end of data is very obviously superior.
I seem to recall having had uses for both variants in the past. (Not
that it would have made a big difference.)
Post by Michael S
Post by Janis Papanagnou
On the minus side
would be that the origin of the string gets lost that way.
Huh?
How could you lose something you just passed to the function?
For example if you use idioms like s = str...(t++, ...) .
Post by Michael S
In most typical code, it's not even a complex expression or pointer,
but name of array.
I really don't want to argue with you about what is "The Best" design.

Personally, I took advantage of how it's actually defined in "C", and I
also occasionally missed the other design variant in some other cases.
(No design variant would have prevented me from "working around" the
effects of the missing one.)

And the "C" lib designers are not stupid; I'd think they considered
what interface to implement. (Just speculating here, of course.)

Janis
Post by Michael S
Post by Janis Papanagnou
(Of course
you can adjust your code then to keep a copy. But any way implemented,
you need to adjust your code according to how the function is
defined.)
Tim Rentsch
2025-02-15 16:37:20 UTC
Reply
Permalink
Post by Michael S
On Thu, 13 Feb 2025 07:14:28 -0800
Post by Tim Rentsch
Post by Michael S
On Sun, 9 Feb 2025 17:22:43 -0800
Post by Andrey Tarasevich
Post by Lawrence D'Oliveiro
If `fgets` reads nothing (instant end-of-file), the entire
buffer remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
Under what circumstances `fgets` is expected to return an empty
string? (I.e. set the [0] entry of the buffer to '\0' and return
non-null)?
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters
is supposed to prevent it from detecting end-of-file condition or
I/O error condition. One can probably do some nitpicking at the
current wording... but I believe the above is the intent.
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. [...]
What about the fgets() function do you think is poorly defined?
Second question: by "poorly defined" do you mean "defined
wrongly" or "defined ambiguously" (or both)?
For starters, it looks like the designers of fgets() did not believe in
their own motto about files being just streams of bytes.
I don't know the history, so maybe the function was defined this way
for portability with systems where text files have a special record-based
structure?
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or failure.
So why did they encode this information in a baroque way instead of
something obvious, 0 and 1?
Appending zero at the end also feels like a hack, but it is necessary
because of the main problem. And the main problem is: how the user is
supposed to figure out how many bytes were read?
In a well-designed API, this question should be answered in O(1) time.
With fgets(), it can be answered in O(N) time when input is trusted to
contain no zeros. When input is arbitrary, finding out the answer is
even harder and requires quirks.
If I understand you correctly your complaint is that the existing
semantics are not as useful as you would like them to be, even
though the current definition does make the behavior well defined.
Is that right?

Clearly using fgets() is problematic when the input stream might
contain null characters. To me it seems obvious that the original
implementors expected that fgets() would not be used in such cases,
perhaps with the less severe restriction that the presence of
embedded nulls could be detected and simply rejected as bad input,
much the same as overly long lines or a final line without a
terminating newline character.
Michael S
2025-02-15 18:08:56 UTC
Reply
Permalink
On Sat, 15 Feb 2025 08:37:20 -0800
Post by Tim Rentsch
Post by Michael S
On Thu, 13 Feb 2025 07:14:28 -0800
Post by Tim Rentsch
Post by Michael S
On Sun, 9 Feb 2025 17:22:43 -0800
Post by Andrey Tarasevich
Post by Lawrence D'Oliveiro
If `fgets` reads nothing (instant end-of-file), the entire
buffer remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
Under what circumstances `fgets` is expected to return an
empty string? (I.e. set the [0] entry of the buffer to '\0' and
return non-null)?
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters
is supposed to prevent it from detecting end-of-file condition or
I/O error condition. One can probably do some nitpicking at the
current wording... but I believe the above is the intent.
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. [...]
What about the fgets() function do you think is poorly defined?
Second question: by "poorly defined" do you mean "defined
wrongly" or "defined ambiguously" (or both)?
For starters, it looks like the designers of fgets() did not believe in
their own motto about files being just streams of bytes.
I don't know the history, so maybe the function was defined this
way for portability with systems where text files have a special
record-based structure?
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or
failure. So why did they encode this information in a baroque way
instead of something obvious, like 0 and 1?
Appending zero at the end also feels like a hack, but it is necessary
because of the main problem. And the main problem is: how the user is
supposed to figure out how many bytes were read?
In a well-designed API this question should be answerable in O(1) time.
With fgets(), it can be answered in O(N) time when the input is trusted
to contain no zeros. When the input is arbitrary, finding out the
answer is even harder and requires workarounds.
If I understand you correctly your complaint is that the existing
semantics are not as useful as you would like them to be, even
though the current definition does make the behavior well defined.
Is that right?
Yes.
Post by Tim Rentsch
Clearly using fgets() is problematic when the input stream might
contain null characters. To me it seems obvious that the original
implementors expected that fgets() would not be used in such cases,
perhaps with the less severe restriction that the presence of
embedded nulls could be detected and simply rejected as bad input,
much the same as overly long lines or a final line without a
terminating newline character.
My impression is that they didn't spend much time thinking.
Tim Rentsch
2025-02-19 04:17:00 UTC
Reply
Permalink
Post by Michael S
On Sat, 15 Feb 2025 08:37:20 -0800
[...]
Post by Michael S
Post by Tim Rentsch
Clearly using fgets() is problematic when the input stream might
contain null characters. To me it seems obvious that the original
implementors expected that fgets() would not be used in such cases,
perhaps with the less severe restriction that the presence of
embedded nulls could be detected and simply rejected as bad input,
much the same as overly long lines or a final line without a
terminating newline character.
My impression is that they didn't spend much time thinking.
I have no idea how much time was spent designing the fgets()
interface, nor do I think it's important to know. I understand the
limitations of fgets() and don't mind using it in circumstances
where it provides a net positive value.
Lawrence D'Oliveiro
2025-02-21 05:58:19 UTC
Reply
Permalink
Post by Michael S
Appending zero at the end also feels like a hack, but it is necessary
because of the main problem. And the main problem is: how the user is
supposed to figure out how many bytes were read?
You already have a function that answers that question.

<https://manpages.debian.org/fread(3)>

Tim Rentsch
2025-02-15 16:12:33 UTC
Reply
Permalink
Post by Andrey Tarasevich
Post by Lawrence D'Oliveiro
If `fgets` reads nothing (instant end-of-file), the entire
buffer remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
Under what circumstances `fgets` is expected to return an empty
string? (I.e. set the [0] entry of the buffer to '\0' and return
non-null)?
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters
is supposed to prevent it from detecting end-of-file condition or
I/O error condition. One can probably do some nitpicking at the
current wording... but I believe the above is the intent.
Clearly there are more than a few C implementors who agree with that.
Janis Papanagnou
2025-02-10 01:44:27 UTC
Reply
Permalink
Post by Andrey Tarasevich
Post by Lawrence D'Oliveiro
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
This was actually what I feared some implementation might do
(unless it's specified by the "C" standard, which luckily is,
as has been shown and got quoted in this thread).
Post by Andrey Tarasevich
No. The buffer is not changed at all in such case.
Which had been the good news.

Janis
Lawrence D'Oliveiro
2025-02-10 03:28:01 UTC
Reply
Permalink
Post by Andrey Tarasevich
Post by Lawrence D'Oliveiro
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
From the man page <https://manpages.debian.org/fgets(3)>:

fgets() reads in at most one less than size characters from stream and
stores them into the buffer pointed to by s. Reading stops after an
EOF or a newline. If a newline is read, it is stored into the buffer.
A terminating null byte ('\0') is stored after the last character in
the buffer.

Note there is no qualification like “a terminating null byte is stored
after the last character if EOF was not reached”. It’s clear the
terminating null byte is *always* stored.
Andrey Tarasevich
2025-02-10 04:11:22 UTC
Reply
Permalink
Post by Lawrence D'Oliveiro
fgets() reads in at most one less than size characters from stream and
stores them into the buffer pointed to by s. Reading stops after an
EOF or a newline. If a newline is read, it is stored into the buffer.
A terminating null byte ('\0') is stored after the last character in
the buffer.
Note there is no qualification like “a terminating null byte is stored
after the last character if EOF was not reached”. It’s clear the
terminating null byte is *always* stored.
Well, the language standard says differently.

You are referring to a specific manpage that follows POSIX. When taken
literally it seems to contradict the standard specification for `fgets`,
but I highly doubt this was the intent. Apparently someone tried to
re-word the spec for better readability, but managed to botch it.

This manpage, for one example, is in full agreement with the standard

https://www.man7.org/linux/man-pages/man3/fgets.3p.html

A practical experiment demonstrates that [supposedly] POSIX-obeying
implementations do not write '\0' into the buffer in "immediate
end-of-file" situations:
https://coliru.stacked-crooked.com/a/3e672e6718dd388b
--
Best regards,
Andrey
Lawrence D'Oliveiro
2025-02-10 07:21:11 UTC
Reply
Permalink
Post by Andrey Tarasevich
This manpage, for one example, is in full agreement with the standard
https://www.man7.org/linux/man-pages/man3/fgets.3p.html
Notice these two sentences would seem to contradict one another:

A null byte shall be written immediately after the last byte read
into the array. If the end-of-file condition is encountered before
any bytes are read, the contents of the array pointed to by s
shall not be changed.
Post by Andrey Tarasevich
A practical experiment demonstrates that [supposedly] POSIX-obeying
implementations do not write '\0' into the buffer in "immediate
https://coliru.stacked-crooked.com/a/3e672e6718dd388b
My test program does the same. I would say that settles it.
Scott Lurndal
2025-02-10 16:39:44 UTC
Reply
Permalink
Post by Andrey Tarasevich
This manpage, for one example, is in full agreement with the standard
https://www.man7.org/linux/man-pages/man3/fgets.3p.html
That manual page is not definitive, or a standard.

The ISO C standard is definitive, and parroted here:

https://pubs.opengroup.org/onlinepubs/9799919799/functions/fgets.html
James Kuyper
2025-02-10 18:58:05 UTC
Reply
Permalink
Post by Lawrence D'Oliveiro
Post by Andrey Tarasevich
This manpage, for one example, is in full agreement with the standard
https://www.man7.org/linux/man-pages/man3/fgets.3p.html
A null byte shall be written immediately after the last byte read
into the array. If the end-of-file condition is encountered before
any bytes are read, the contents of the array pointed to by s
shall not be changed.
Note: this wording is almost identical to relevant wording in the
current C standard:

"A null character is written immediately after the last character read
into the array." (7.23.7p2).
"If end-of-file is encountered and no characters have been read into the
array, the contents of the array remain unchanged and a null pointer is
returned." (7.23.7p3)

I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
Lawrence D'Oliveiro
2025-02-11 01:03:21 UTC
Reply
Permalink
Post by James Kuyper
I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
What if the array is only big enough for one byte? In this case, no
characters can be read into it. Is a trailing null inserted in this case?
James Kuyper
2025-02-11 03:33:03 UTC
Reply
Permalink
Post by Lawrence D'Oliveiro
Post by James Kuyper
I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
What if the array is only big enough for one byte? In this case, no
characters can be read into it. Is a trailing null inserted in this case?
"The fgets function reads at most one less than the number of characters
specified by n from the stream pointed to by stream into the array
pointed to by s." (7.32.7.2p2)

If the buffer length is 1, "at most one less than the number ...
specified" is 0. Therefore, fgets() cannot read any characters into the
buffer, no matter what the contents of the input stream are. Again,
since there is no "last byte read into the array", there is no location
where a null byte should be written.
Kaz Kylheku
2025-02-11 03:42:03 UTC
Reply
Permalink
Post by James Kuyper
Post by Lawrence D'Oliveiro
What if the array is only big enough for one byte? In this case, no
characters can be read into it. Is a trailing null inserted in this case?
"The fgets function reads at most one less than the number of characters
specified by n from the stream pointed to by stream into the array
pointed to by s." (7.32.7.2p2)
If the buffer length is 1, "at most one less than the number ...
specified" is 0. Therefore, fgets() cannot read any characters into the
buffer, no matter what the contents of the input stream are. Again,
since there is no "last byte read into the array", there is no location
where a null byte should be written.
If the array consists of two bytes, then it's possible to use the fgets
function to carry out the job of processing input in a line-wise fashion,
using fragments of lines that are one character wide. For instance
"foo\n" may be read in four parts: "f\0", "o\0", "o\0", "\n\0".

If the buffer is one byte wide, then it's not possible for the loop
around fgets to meaningfully process the file.

Therefore, it might as well just return null on the first call
and every subsequent one.

It simply doesn't make sense to use an array of one byte.

A one-byte area is too small, since it can only hold a string of zero
length, and a non-zero-length file cannot be expressed as a catenation
of strings of zero length.

The fgets function /could/ null terminate always, even when returning
null, and even in the one-byte-buffer case. But what would be the point;
there is no need for code to rely on the buffer when null has been
returned.

When we use fgets, we can (and probably should) pretend that the buffer
is just a work area or context buffer for the function, and the return
value is the real data (which happens to point to the context buffer).
When we get null, the operation yielded no data. We "got something"
only when fgets returns a pointer to it.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Lawrence D'Oliveiro
2025-02-11 04:54:23 UTC
Reply
Permalink
Post by James Kuyper
Post by Lawrence D'Oliveiro
Post by James Kuyper
I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
What if the array is only big enough for one byte? In this case, no
characters can be read into it. Is a trailing null inserted in this case?
"The fgets function reads at most one less than the number of characters
specified by n from the stream pointed to by stream into the array
pointed to by s." (7.32.7.2p2)
If the buffer length is 1, "at most one less than the number ...
specified" is 0. Therefore, fgets() cannot read any characters into the
buffer, no matter what the contents of the input stream are. Again,
since there is no "last byte read into the array", there is no location
where a null byte should be written.
Have you tried it? I have.
James Kuyper
2025-02-11 18:07:53 UTC
Reply
Permalink
...
Post by Lawrence D'Oliveiro
Post by James Kuyper
"The fgets function reads at most one less than the number of characters
specified by n from the stream pointed to by stream into the array
pointed to by s." (7.32.7.2p2)
If the buffer length is 1, "at most one less than the number ...
specified" is 0. Therefore, fgets() cannot read any characters into the
buffer, no matter what the contents of the input stream are. Again,
since there is no "last byte read into the array", there is no location
where a null byte should be written.
Have you tried it? I have.
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform to
the requirements of the standard. That's not particularly surprising -
calling fgets with useless arguments isn't something that I'd expect to
be a high priority on their pre-delivery tests.
Lawrence D'Oliveiro
2025-02-11 21:47:21 UTC
Reply
Permalink
Post by James Kuyper
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform to
the requirements of the standard.
GCC is, however, the closest thing we have to a de-facto standard for C.

Is there another C compiler/runtime that behaves different?
James Kuyper
2025-02-11 22:44:30 UTC
Reply
Permalink
Post by Lawrence D'Oliveiro
Post by James Kuyper
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform to
the requirements of the standard.
GCC is, however, the closest thing we have to a de-facto standard for C.
I've no interest in de-facto standards. I'm only interested in de-jure
standards such as ISO/IEC 9899:2023. Feel free to have different
preferences.
Lawrence D'Oliveiro
2025-02-12 06:16:22 UTC
Reply
Permalink
Post by James Kuyper
Post by Lawrence D'Oliveiro
Post by James Kuyper
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform
to the requirements of the standard.
GCC is, however, the closest thing we have to a de-facto standard for C.
I've no interest in de-facto standards. I'm only interested in de-jure
standards such as ISO/IEC 9899:2023.
Where is there an implementation that conforms to that?
Keith Thompson
2025-02-11 21:59:33 UTC
Reply
Permalink
Post by James Kuyper
...
Post by Lawrence D'Oliveiro
Post by James Kuyper
"The fgets function reads at most one less than the number of characters
specified by n from the stream pointed to by stream into the array
pointed to by s." (7.32.7.2p2)
If the buffer length is 1, "at most one less than the number ...
specified" is 0. Therefore, fgets() cannot read any characters into the
buffer, no matter what the contents of the input stream are. Again,
since there is no "last byte read into the array", there is no location
where a null byte should be written.
Have you tried it? I have.
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform to
the requirements of the standard. That's not particularly surprising -
calling fgets with useless arguments isn't something that I'd expect to
be a high priority on their pre-delivery tests.
As you know, gcc doesn't implement fgets(). Were you using GNU libc?
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
James Kuyper
2025-02-11 22:58:43 UTC
Reply
Permalink
...
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform to
the requirements of the standard. That's not particularly surprising -
calling fgets with useless arguments isn't something that I'd expect to
be a high priority on their pre-delivery tests.
As you know, gcc doesn't implement fgets(). Were you using GNU libc?
Yes. To be specific, Ubuntu GLIBC 2.35-0ubuntu3.9.

Here's my test code:

#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
char fill = 1;
char buffer = fill;
char *retval = NULL;
FILE *infile;
if(argc < 2)
infile = stdin;
else{
infile = fopen(argv[1], "r");
if(!infile)
{
perror(argv[1]);
return EXIT_FAILURE;
}
}

while((retval = fgets(&buffer, 1, infile)) == &buffer)
{
printf("%ld:'%u'\n", ftell(infile), (unsigned)buffer);
buffer = fill++;
}
if(ferror(infile))
perror("fgets");

printf("%p!=%p ferror:%d feof:%d '%c'\n",
(void*)&buffer, (void*)retval,
ferror(infile), feof(infile), buffer);
}

Note that if fgets() works as it should, that's an infinite loop, since
no data is read in, and therefore there's no movement through the input
file. I wrote code that executes after the infinite loop just to cover
the possibility that it doesn't work that way.
Keith Thompson
2025-02-11 23:40:15 UTC
Reply
Permalink
Post by James Kuyper
...
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform to
the requirements of the standard. That's not particularly surprising -
calling fgets with useless arguments isn't something that I'd expect to
be a high priority on their pre-delivery tests.
As you know, gcc doesn't implement fgets(). Were you using GNU libc?
Yes. To be specific, Ubuntu GLIBC 2.35-0ubuntu3.9.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
char fill = 1;
char buffer = fill;
char *retval = NULL;
FILE *infile;
if(argc < 2)
infile = stdin;
else{
infile = fopen(argv[1], "r");
if(!infile)
{
perror(argv[1]);
return EXIT_FAILURE;
}
}
while((retval = fgets(&buffer, 1, infile)) == &buffer)
{
printf("%ld:'%u'\n", ftell(infile), (unsigned)buffer);
buffer = fill++;
}
if(ferror(infile))
perror("fgets");
printf("%p!=%p ferror:%d feof:%d '%c'\n",
(void*)&buffer, (void*)retval,
ferror(infile), feof(infile), buffer);
}
Note that if fgets() works as it should, that's an infinite loop, since
no data is read in, and therefore there's no movement through the input
file. I wrote code that executes after the infinite loop just to cover
the possibility that it doesn't work that way.
I get an infinite loop with both glibc and musl on Ubuntu, and under
Termux on Android (Bionic library implementation):

$ ./jk < /dev/null | head -n 3
0:'0'
0:'0'
0:'0'
$ echo hello | ./jk | head -n 3
-1:'0'
-1:'0'
-1:'0'
$

With newlib on Cygwin, there is no infinite loop:

$ ./jk.exe < /dev/null
0x7ffffcc17!=0x0 ferror:0 feof:0 ''
$ echo hello | ./jk.exe
0x7ffffcc17!=0x0 ferror:0 feof:0 ''
$
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Tim Rentsch
2025-02-18 05:11:29 UTC
Reply
Permalink
Post by Keith Thompson
...
Post by Keith Thompson
I just tried it, using gcc and found that fgets() does set the
first byte of the buffer to a null character. Therefore, it
doesn't conform to the requirements of the standard. That's not
particularly surprising - calling fgets with useless arguments
isn't something that I'd expect to be a high priority on their
pre-delivery tests.
As you know, gcc doesn't implement fgets(). Were you using GNU libc?
Yes. To be specific, Ubuntu GLIBC 2.35-0ubuntu3.9.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
char fill = 1;
char buffer = fill;
char *retval = NULL;
FILE *infile;
if(argc < 2)
infile = stdin;
else{
infile = fopen(argv[1], "r");
if(!infile)
{
perror(argv[1]);
return EXIT_FAILURE;
}
}
while((retval = fgets(&buffer, 1, infile)) == &buffer)
{
printf("%ld:'%u'\n", ftell(infile), (unsigned)buffer);
buffer = fill++;
}
if(ferror(infile))
perror("fgets");
printf("%p!=%p ferror:%d feof:%d '%c'\n",
(void*)&buffer, (void*)retval,
ferror(infile), feof(infile), buffer);
}
Note that if fgets() works as it should, that's an infinite loop,
since no data is read in, and therefore there's no movement through
the input file. I wrote code that executes after the infinite loop
just to cover the possibility that it doesn't work that way.
I get an infinite loop with both glibc and musl on Ubuntu, and under
$ ./jk < /dev/null | head -n 3
0:'0'
0:'0'
0:'0'
$ echo hello | ./jk | head -n 3
-1:'0'
-1:'0'
-1:'0'
$
$ ./jk.exe < /dev/null
0x7ffffcc17!=0x0 ferror:0 feof:0 ''
$ echo hello | ./jk.exe
0x7ffffcc17!=0x0 ferror:0 feof:0 ''
$
I have an amusing footnote to these trials.

I wrote a short program to test fgets() under varying length
arguments. Compiling with gcc on Ubuntu, I was surprised to
discover the behavior of fgets() with a length argument of 1
depended on the the optimization setting of the compiler -
using -O0 gave a different result than -O1. Compiling with clang
gave the same result under both optimization settings.
Andrey Tarasevich
2025-02-11 06:32:12 UTC
Reply
Permalink
Post by Lawrence D'Oliveiro
Post by James Kuyper
I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
What if the array is only big enough for one byte? In this case, no
characters can be read into it. Is a trailing null inserted in this case?
If by that you mean, "what if the value of 1 is passed as second
argument", then, as I stated in one of my previous messages:

No attempt to read anything from the stream is made, which means that
end-of-file or I/O error conditions do not arise (unless, perhaps, the
stream was already in error condition) and the [0] byte of the buffer is
simply set to '\0'.
--
Best regards,
Andrey
Andrey Tarasevich
2025-02-11 06:38:53 UTC
Reply
Permalink
Post by Andrey Tarasevich
Post by Lawrence D'Oliveiro
Post by James Kuyper
I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
What if the array is only big enough for one byte? In this case, no
characters can be read into it. Is a trailing null inserted in this case?
If by that you mean, "what if the value of 1 is passed as second
No attempt to read anything from the stream is made, which means that
end-of-file or I/O error conditions do not arise (unless, perhaps, the
stream was already in error condition) and the [0] byte of the buffer is
simply set to '\0'.
... and non-null pointer (pointer to the buffer) is returned.

For what it is worth, an experiment shows that if the stream is already
in end-of-file state at the moment of the call, `fgets` still behaves as
if the call was successful - the buffer is modified as described above,
non-null pointer is returned:

https://coliru.stacked-crooked.com/a/a68382afcf4ff155
--
Best regards,
Andrey
Kaz Kylheku
2025-02-11 12:04:22 UTC
Reply
Permalink
Post by Andrey Tarasevich
If by that you mean, "what if the value of 1 is passed as second
No attempt to read anything from the stream is made, which means that
end-of-file or I/O error conditions do not arise (unless, perhaps, the
stream was already in error condition) and the [0] byte of the buffer is
simply set to '\0'.
ISO C: "If a read error occurs during the operation, the members of the
array have unspecified values and a null pointer is returned."

(I think that stretches to the situation when the error has happened
already, but clearerr(stream) has not been called to remove the
condition.)

Whenever fgets returns null due to not being able to read any characters
into the array, it should not change the value of the elements of the
array, even if the reason is that the array has no room.

We can think about the possibility of fgets returning a pointer
to a null string when an array of size 1 is used, without advancing
the stream.

I find it not so easy to argue that it would not be /conforming/. The
behavior can be regarded as a straightforward special case of the
ordinary behavior, when fgets adds one or more characters to the array,
runs out of room, and then null terminates and exits.

I find it easy to argue that it's a bad idea for fgets to
ever return an empty string.

The way fgets is defined, it provides a single clear termination
signal for loops: the null pointer.

If an implementation of fgets may return an empty string (only
conceivably allowed in the size 1 array case), then that constitutes an
additional new termination signal. A program not looking for this
additional termination signal shall loop indefinitely over a finite
stream.

While in that situation, the implementation might be conforming, and be
processing a strictly conforming program, even if so, the infinite
looping is a needlessly poor situation which can be avoided by not
taking that interpretation: i.e. if no characters are added to the array
for any reason, then have fgets always return NULL, rather than an empty
string.

It would be a good idea to add the requirement "fgets shall not
return a pointer to an empty string" to its description to codify that.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Tim Rentsch
2025-02-13 15:29:44 UTC
Reply
Permalink
Post by Andrey Tarasevich
I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
What if the array is only big enough for one byte? In this case, no
characters can be read into it. Is a trailing null inserted in this case?
If by that you mean, "what if the value of 1 is passed as second
No attempt to read anything from the stream is made, which means that
end-of-file or I/O error conditions do not arise (unless, perhaps, the
stream was already in error condition) and the [0] byte of the buffer
is simply set to '\0'.
You're in good company. Doing a web search turned up this description --

fgets() - Read a String from a Stream
Last Updated: 2024-09-20

#include <stdio.h>
char *fgets(char *string, int n, FILE *stream);

General Description

Reads bytes from a stream pointed to by stream into an array
pointed to by string, starting at the position indicated by the
file position indicator. Reading continues until the number of
characters read is equal to n-1, or until a new-line character
(\n), or until the end of the stream, whichever comes first. The
fgets() function stores the result in string and adds a null
character (\0) to the end of the string. The string includes the
new-line character, if read.

The fgets() function is not supported for files opened with
type=record.

The fgets() function has the same restriction as any read
operation for a read immediately following a write or a write
immediately following a read. Between a write and a subsequent
read, an intervening flush or reposition must occur. Between a
read and a subsequent write, an intervening flush or reposition
must also occur, unless an EOF has been reached.

Returned Value

If successful, fgets() returns a pointer to the string buffer.

If unsuccessful, fgets() returns NULL.

If n is less than or equal to 0, it indicates a domain error;
errno is set to EDOM to indicate the cause of the failure.

When n equals 1, it indicates a valid result. It means that the
string buffer has only room for the null terminator; nothing is
physically read from the file. (Such an operation is still
considered a read operation, so it cannot immediately follow a
write operation unless an intervening flush or reposition
operation occurs first.)

If n is greater than 1, fgets() will only fail if an I/O error
occurs or if EOF is reached and no data is read from the file.

Note: You should use ferror() and feof() to determine whether an
error or an EOF condition occurred. An EOF is only reached when
an attempt is made to read "past" the last byte of data. Reading
up to and including the last byte of data does not turn on the EOF
indicator.

If EOF is reached after data has already been read into the string
buffer, fgets() returns a pointer to the string buffer to indicate
success. A subsequent call would return NULL since fgets() would
reach EOF without reading any data.


That description was found on this page:

https://www.ibm.com/docs/en/zvm/7.4?topic=descriptions-fgets-read-string-from-stream
Janis Papanagnou
2025-02-09 07:13:10 UTC
Reply
Permalink
First; thanks Kaz and Andrey for the replies. - As so often answering
more than I asked or needed. :-)

The provided C standard quote answers my question. - Thanks!
Post by Andrey Tarasevich
Post by Janis Papanagnou
To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
But the man page does not say anything whether this is guaranteed;
it says: "Reading stops after an EOF or a newline.", but it says
nothing about [not] writing to or [not] resetting the buffer.
Is that simple construct safe to get the last line of a text file?
What situation exactly are you talking about? When end-of-file is
encountered _immediately_, before reading the very first character? Or
when end-of-file is encountered after reading something (i.e. when the
last line in the file does not end with a new-line character)?
I have a _coherent_ file, with a few NL terminated lines of text.

Usually I use fgets() in contexts where I process every line, like

while (fgets (buf, BUFSIZ, fd) != NULL) {
operate_on (buf);
}
// here the status of buf[] is usually not important any more

My actual context was different, like

while (fgets (buf, BUFSIZ, fd) != NULL) {
// buf[] contents are ignored here
}
operate_on (buf); // which I assumed contains the last line
Post by Andrey Tarasevich
The former situation is covered by the spec: "If end-of-file is
encountered and no characters have been read into the array, the
contents of the array remain unchanged and a null pointer is returned".
The second situation does not need additional clarifications. Per
general spec as many characters as available before the end-of-file will
be read and then terminated with '\0'. In such case there will be no
new-line character in the buffer.
So, in both cases we are perfectly safe when reading the last line of a
text file, if you don't forget to check the return value of `fgets`.
I suppose you mean what I already had in my code above: ... != NULL
Post by Andrey Tarasevich
(This is all under assumption that size limit does not kick in. I
believe your question is not about that.)
Yes, it was just the one posted question. (No incoherent text files,
no error conditions, no signals, no buffer size mistakes, etc.)
Post by Andrey Tarasevich
Note also that `fgets` is not permitted to assume that the limit value
(the second parameter) correctly describes the accessible size of the
buffer. E.g. for this reason it is not permitted to zero-out the buffer
before reading. For example, this code is valid and has defined behavior:
    char buffer[10];
    fgets(buffer, 1000, f);
provided the current line of the file fits into `char[10]`. I.e. even
though we "lied" to `fgets` about the limit, it is still required to
work correctly if the actual data fits into the actual buffer.
So, why do you care that "the previous contents of 'buf' are still
existing"?
I hope it got clear by the two code snippets I posted above...

Usually I read and process the data that I got in buf from fgets()
while there *is* data (fgets() != NULL), and I thus don't care any
more about buffer contents validity after the loop (fgets() == NULL).

But now I wanted to ignore all data that I got for fgets() != NULL
in the loop. And I hoped that *after* the loop the last read data is
still valid.

Janis
Michael S
2025-02-09 10:50:46 UTC
Reply
Permalink
On Sun, 9 Feb 2025 08:13:10 +0100
Post by Janis Papanagnou
First; thanks Kaz and Andrey for the replies. - As so often answering
more than I asked or needed. :-)
The provided C standard quote answers my question. - Thanks!
Post by Andrey Tarasevich
Post by Janis Papanagnou
To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
But the man page does not say anything whether this is guaranteed;
it says: "Reading stops after an EOF or a newline.", but it says
nothing about [not] writing to or [not] resetting the buffer.
Is that simple construct safe to get the last line of a text file?
What situation exactly are you talking about? When end-of-file is
encountered _immediately_, before reading the very first character?
Or when end-of-file is encountered after reading something (i.e.
when the last line in the file does not end with new-line
character)?
I have a _coherent_ file, with a few NL terminated lines of text.
I wonder what you mean by "coherent".
Post by Janis Papanagnou
Usually I use fgets() in contexts where I process every line, like
    while (fgets (buf, BUFSIZ, fd) != NULL) {
        operate_on (buf);
    }
    // here the status of buf[] is usually not important any more
My actual context was different, like
    while (fgets (buf, BUFSIZ, fd) != NULL) {
        // buf[] contents are ignored here
    }
    operate_on (buf);  // which I assumed contains last line
It depends on definition of "last line".
What do you consider "last line" of the file in which last character is
not LF? The one before the last LF or one after? Your code would get
the latter.
Janis Papanagnou
2025-02-09 17:29:01 UTC
Reply
Permalink
Post by Michael S
On Sun, 9 Feb 2025 08:13:10 +0100
[...]
Post by Michael S
Post by Janis Papanagnou
I have a _coherent_ file, with a few NL terminated lines of text.
I wonder what you mean by "coherent".
A badly chosen word; I noticed it too late only after posting.

I meant consistent with respect to the line terminators (i.e. none
missing, each line [including the last one] has one, no mixture of
LF, CR, CR-LF, etc.), and also no fixed-sized unterminated lines as
we may still know from mainframes, just to be complete.
Post by Michael S
Post by Janis Papanagnou
Usually I use fgets() in contexts where I process every line, like
    while (fgets (buf, BUFSIZ, fd) != NULL) {
        operate_on (buf);
    }
    // here the status of buf[] is usually not important any more
My actual context was different, like
    while (fgets (buf, BUFSIZ, fd) != NULL) {
        // buf[] contents are ignored here
    }
    operate_on (buf);  // which I assumed contains last line
It depends on definition of "last line".
What do you consider "last line" of the file in which last character is
not LF?
I consider missing newlines at the end of any text line as a bug.
(And I'm not inclined to use a weaker word than "bug".) YMMV.
Post by Michael S
The one before the last LF or one after? Your code would get
the latter.
It's a non-issue (for me), as should have got obvious.

Janis
Mark Bourne
2025-02-10 21:57:29 UTC
Reply
Permalink
Post by Janis Papanagnou
Post by Michael S
What do you consider "last line" of the file in which last character is
not LF?
I consider missing newlines at the end of any text line as a bug.
(And I'm not inclined to use a weaker word than "bug".) YMMV.
I think I once saw somewhere that utilities originating on Unix
typically consider \n to be a line terminator, so include it at the end
of every line including the last, whereas those originating on
DOS/Windows typically consider \n to be a line separator, so don't
include it at the end of the last line. So Unix-originated utilities
might not behave as expected if the file doesn't end with \n, whereas
Windows-originated utilities might treat the file as having an extra
blank line at the end of the file if it does end with \n. Utilities
ported from one system to the other sometimes continue following the
convention of their origin, rather than the system they're running on.

I'm not sure where I originally saw that, but for what it's worth the
following Stack Overflow answer makes a similar claim:
<https://stackoverflow.com/a/729795>. Most of the answer discusses
POSIX, with a "line" defined as ending with a terminating newline, hence
every line including the last ends with a newline, while a final
footnote notes that doesn't necessarily apply to non-POSIX systems,
particularly Windows.
--
Mark.
Andrey Tarasevich
2025-02-09 15:27:26 UTC
Reply
Permalink
Post by Janis Papanagnou
But now I wanted to ignore all data that I got for fgets() != NULL
in the loop. And I hoped that *after* the loop the last read data is
still valid.
As Michael already noted it depends on what you consider as the last
piece of valid data in your file. Say, what do you want to see as "the
last line" in a file that ends with

abracadabra\n<EOF here>

?

Is "abracadabra" the last line? Or is the last line supposed to be empty
in this case?
--
Best regards,
Andrey
Janis Papanagnou
2025-02-09 17:34:55 UTC
Reply
Permalink
Post by Andrey Tarasevich
Post by Janis Papanagnou
But now I wanted to ignore all data that I got for fgets() != NULL
in the loop. And I hoped that *after* the loop the last read data is
still valid.
As Michael already noted it depends on what you consider as the last
piece of valid data in your file.
I have a strong opinion of a text file concerning line terminators;
I answered that in my reply to Michael.
Post by Andrey Tarasevich
Say, what do you want to see as "the
last line" in a file that ends with
abracadabra\n<EOF here>
?
Is "abracadabra" the last line? Or is the last line supposed to be empty
in this case?
If "\n" is a string literal (2 characters, '\' and 'n') then it's an
incomplete line (as to my standards), if it's meant as a <LF> control
character then it's complete. (Similar with <CR> on old Apple/Macs and
<CR><LF> on DOS-alike systems.)

Janis
Keith Thompson
2025-02-10 00:57:02 UTC
Reply
Permalink
[...]
Post by Janis Papanagnou
Post by Andrey Tarasevich
Say, what do you want to see as "the
last line" in a file that ends with
abracadabra\n<EOF here>
?
Is "abracadabra" the last line? Or is the last line supposed to be empty
in this case?
If "\n" is a string literal (2 characters, '\' and 'n') then it's an
incomplete line (as to my standards), if it's meant as a <LF> control
character then it's complete. (Similar with <CR> on old Apple/Macs and
<CR><LF> on DOS-alike systems.)
It seems obvious to me that Andrey intended the \n to be a new-line
character (which is almost always LF in modern C implementations).

Here's (some of) what the C standard says about text streams:

A text stream is an ordered sequence of characters composed into
lines, each line consisting of zero or more characters plus a
terminating new-line character. Whether the last line requires
a terminating new-line character is implementation-defined.

For an implementation that *doesn't* require a new-line on the
last line, a stream without a trailing new-line is valid. For an
implementation that *does* require it, such a stream is invalid,
and a program that attempts to process it can have undefined behavior.

Most modern implementations don't require that trailing new-line.
For example, `echo -n hello > hello.txt` creates a valid text file.
Of course a C program that deals with text files can impose any
additional restrictions its author likes.

The above describes how a text stream looks to a C program. The
external representation can be quite different, with transformations
to map between them. The most common such transformation is
mapping the DOS/Windows CR-LF line terminator to LF on input, and
vice versa on output. Or the external representation might store
each line as a fixed-length character sequence padded with spaces.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Janis Papanagnou
2025-02-10 01:35:01 UTC
Reply
Permalink
Post by Keith Thompson
[...]
A text stream is an ordered sequence of characters composed into
lines, each line consisting of zero or more characters plus a
terminating new-line character. Whether the last line requires
a terminating new-line character is implementation-defined.
For an implementation that *doesn't* require a new-line on the
last line, a stream without a trailing new-line is valid. For an
implementation that *does* require it, such a stream is invalid,
and a program that attempts to process it can have undefined behavior.
This is what "C" accepts (or tolerates), yes.

Given that some folks, with the aid of some fancy editors, can
suppress (or not create) the final line ending - bytes are still
expensive, it seems - I suppose it's a sensible requirement for
"C" compilers to be tolerant here.
Post by Keith Thompson
Most modern implementations don't require that trailing new-line.
For example, `echo -n hello > hello.txt` creates a valid text file.
Of course a C program that deals with text files can impose any
additional restrictions its author likes.
And cat alpha.c beta.c > gamma.c will create inconsistent texts if
there's no line terminator on the last lines of some files.
Post by Keith Thompson
The above describes how a text stream looks to a C program. The
external representation can be quite different, with transformations
to map between them.
(Concerning this thread; I'm anyway operating on custom data files
in plain text format, so I'm less concerned about how "C" compilers
expect their "C" source.)
Post by Keith Thompson
The most common such transformation is
mapping the DOS/Windows CR-LF line terminator to LF on input, and
vice versa on output. Or the external representation might store
each line as a fixed-length character sequence padded with spaces.
I appreciate that the editor I use keeps data consistent but allows
an explicit change between Unix and DOS text modes (where necessary
or if desired).

The most extreme context I had worked in was a company that allowed
(for every employee) a free choice of used computer technology; that
led to program text files that literally had all the inconsistencies.
Since many files were edited by different folks there were all sorts
of line terminators mixed even in the same one file, and there either
were complete last lines or not. The (some?) IDEs used were tolerant
WRT line terminators and their mixing. Other tools reacted sensibly.
The first thing I've done was to write a "C" tool to detect and fix
these sorts of inconsistencies.

Janis
Keith Thompson
2025-02-10 04:37:34 UTC
Reply
Permalink
[...]
Post by Janis Papanagnou
Post by Keith Thompson
The above describes how a text stream looks to a C program. The
external representation can be quite different, with transformations
to map between them.
(Concerning this thread; I'm anyway operating on custom data files
in plain text format, so I'm less concerned about how "C" compilers
expect their "C" source.)
The requirements for text streams are distinct from the requirements
for C source files. For example, you might have a cross-compiler
where C source files follow the rules of the OS where the compiler
runs, and text files processed via stdio follow the rules of
the target system. And a C compiler might not use stdio to read
source files. It might not even be implemented in C.

In particular, the standard has this specific requirement for source
files (this is from the "Translation phases" section):

A source file that is not empty shall end in a new-line
character, which shall not be immediately preceded by a backslash
character before any such splicing takes place.

(This is in translation phase 2; any new-line characters might be
the result of a transformation during phase 1.)

So a non-empty file not ending in a new-line character might be a
valid text file for use with stdio, but is not a valid C source file.
On the other hand, the mapping described in translation phase 1
might add a new-line character to such a file, so a conforming
compiler could accept such a source file without complaint.

Of the compilers I've tried, gcc and tcc quietly accept a source
file with no trailing newline, and clang rejects it with the right
options (-std=c?? -pedantic-errors).

[...]
Post by Janis Papanagnou
The most extreme context I had worked in was a company that allowed
(for every employee) a free choice of used computer technology; that
led to program text files that literally had all the inconsistencies.
Since many files were edited by different folks there were all sorts
of line terminators mixed even in the same one file, and there either
were complete last lines or not. The (some?) IDEs used were tolerant
WRT line terminators and their mixing. Other tools reacted sensibly.
The first thing I've done was to write a "C" tool to detect and fix
these sorts of inconsistencies.
Been there, done that. There seems to be a tendency in the Windows
world to create text files with no terminator on the last line.
In some cases I've been able to translate the source files to a
consistent format. In others, doing so would have created huge
diffs in the source control system, so I left well enough alone.

My preferred editor, vim, handles files with either LF or CRLF line
endings gracefully, but if there's a mix it shows "^M" at the end of
each line that has a Windows-style CRLF ending. I found a possible
solution, but I haven't bothered using it since I'm not currently
dealing with such files.

<https://vi.stackexchange.com/q/39297/2380>

This is already off-topic, so I won't even mention tabs vs. spaces.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Janis Papanagnou
2025-02-10 06:08:22 UTC
Reply
Permalink
Post by Keith Thompson
Post by Janis Papanagnou
The most extreme context I had worked in was a company that allowed
(for every employee) a free choice of used computer technology; that
led to program text files that literally had all the inconsistencies.
Since many files were edited by different folks there were all sorts
of line terminators mixed even in the same one file, and there either
were complete last lines or not. The (some?) IDEs used were tolerant
WRT line terminators and their mixing. Other tools reacted sensibly.
The first thing I've done was to write a "C" tool to detect and fix
these sorts of inconsistencies.
Been there, done that. There seems to be a tendency in the Windows
world to create text files with no terminator on the last line.
Yep.
Post by Keith Thompson
In some cases I've been able to translate the source files to a
consistent format. In others, doing so would have created huge
diffs in the source control system, so I left well enough alone.
Yes, that is what you buy with a fix. But it pays, IME. What I had
done was to provide scripts to automate the transformation, I did
a short-term code freeze on a whole project, transformed the data,
and since that point we had a consistent base. The good thing was
that the single CR terminators (old Apple/Macs, pre OS-X) were only
in older code. And the handling of LF vs. CR-LF was okay once the
script streamlined the data, either converting everything to LF or
keeping any consistent variant (whether it was LF or CR-LF).
Post by Keith Thompson
My preferred editor, vim, handles files with either LF or CRLF line
endings gracefully, but if there's a mix it shows "^M" at the end of
each line that has a Windows-style CRLF ending.
Yes. With the change I described we got rid of this issue.
Post by Keith Thompson
I found a possible
solution, but I haven't bothered using it since I'm not currently
dealing with such files.
Same here. For me it was just a historic little episode.
Post by Keith Thompson
<https://vi.stackexchange.com/q/39297/2380>
This is already off-topic, so I won't even mention tabs vs. spaces.
But as Vim users we don't have any issues here; as long as the
indentation is _visibly_ consistent we can fix any tab/space-mix
on the fly and easily with Vim.

Janis
Keith Thompson
2025-02-10 06:41:36 UTC
Reply
Permalink
[...]
Post by Janis Papanagnou
Post by Keith Thompson
This is already off-topic, so I won't even mention tabs vs. spaces.
But as Vim users we don't have any issues here; as long as the
indentation is _visibly_ consistent we can fix any tab/space-mix
on the fly and easily with Vim.
Yes, *if* the indentation is visibly consistent.

At a previous job, I reviewed an update whose apparent meaning
differed depending on whether the editor was configured with 4- or
8-column tabstops. I don't remember the exact details, but the code
looked like either:

if (condition)
    statement1;
statement2;

or:

if (condition)
    statement1;
    statement2;

depending on the reader's settings. Of course they're semantically
equivalent, but the first is the way the developer saw it, and the
second is misleading and is the way it looked to me.

This kind of thing is why I use only spaces for indentation and curly
braces even when there's only one statement in the block (unless I'm
working under a coding standard that says otherwise).
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Janis Papanagnou
2025-02-10 06:54:07 UTC
Reply
Permalink
Post by Keith Thompson
[...]
Post by Janis Papanagnou
Post by Keith Thompson
This is already off-topic, so I won't even mention tabs vs. spaces.
But as Vim users we don't have any issues here; as long as the
indentation is _visibly_ consistent we can fix any tab/space-mix
on the fly and easily with Vim.
Yes, *if* the indentation is visibly consistent.
At a previous job, I reviewed an update whose apparent meaning
differed depending on whether the editor was configured with 4- or
8-column tabstops. I don't remember the exact details, but the code
if (condition)
    statement1;
statement2;
if (condition)
    statement1;
    statement2;
depending on the reader's settings. Of course they're semantically
equivalent, but the first is the way the developer saw it, and the
second is misleading and is the way it looked to me.
Yeah, misleading code is a pain, especially if you have got the job
to fix some error in these incoherently formatted modules. (I suppose
that case is more than merely misleading if you are programming in
Python, where indentation even carries semantics.)
Post by Keith Thompson
This kind of thing is why I use only spaces for indentation
I think it's personal preference. Mine is to use only tabs (with a
data type specific width) for the indentation.

Janis
Post by Keith Thompson
and curly
braces even when there's only one statement in the block (unless I'm
working under a coding standard that says otherwise).
Mark Bourne
2025-02-10 23:22:54 UTC
Reply
Permalink
Post by Janis Papanagnou
Post by Keith Thompson
At a previous job, I reviewed an update whose apparent meaning
differed depending on whether the editor was configured with 4- or
8-column tabstops. I don't remember the exact details, but the code
if (condition)
    statement1;
statement2;
if (condition)
    statement1;
    statement2;
depending on the reader's settings. Of course they're semantically
equivalent, but the first is the way the developer saw it, and the
second is misleading and is the way it looked to me.
Yeah, misleading code is a pain, especially if you have got the job
to fix some error in these incoherently formatted modules. (I suppose
that case is more than merely misleading if you are programming in
Python, where indentation even carries semantics.)
That was an issue in Python 2 where, I think, a single tab was treated
as equivalent to 8 spaces for the purposes of block scoping. Depending
on editor settings, that may or may not match how it visually appears.
For that reason, Python 3 makes it an error to mix tabs and spaces in
ways that would be misleading, i.e. if the meaning would depend on the
size of a tab. Even before Python 3, the issue was generally avoided by
coding conventions, e.g. using only spaces and not tabs. Not intending
to go into any more detail than that, this being comp.lang.c not
comp.lang.python ;)
--
Mark.
Kaz Kylheku
2025-02-11 00:59:56 UTC
Reply
Permalink
Post by Janis Papanagnou
Post by Keith Thompson
[...]
Post by Janis Papanagnou
Post by Keith Thompson
This is already off-topic, so I won't even mention tabs vs. spaces.
But as Vim users we don't have any issues here; as long as the
indentation is _visibly_ consistent we can fix any tab/space-mix
on the fly and easily with Vim.
Yes, *if* the indentation is visibly consistent.
At a previous job, I reviewed an update whose apparent meaning
differed depending on whether the editor was configured with 4- or
8-column tabstops. I don't remember the exact details, but the code
if (condition)
    statement1;
statement2;
if (condition)
    statement1;
    statement2;
depending on the reader's settings. Of course they're semantically
equivalent, but the first is the way the developer saw it, and the
second is misleading and is the way it looked to me.
Yeah, misleading code is a pain, especially if you have got the job
to fix some error in these incoherently formatted modules.
Turning on gcc -Wmisleading-indentation could go a long way toward
hunting down the trouble spots.

Not sure how well that deals with inconsistent mixtures of tabs and
spaces. The man page documentation (I realize there is also the real
GCC manual) says that the amount of indentation is determined by
the -ftabstop=N option, where N defaults to 8.

So I'm guessing that you may have to compile the code at least
two ways---with -ftabstop=4, -ftabstop=8, and possibly other
choices---to get all the bad spots to look like your second example and
be diagnosed.

They should support a nondeterministic behavior:

-ftabstop=2,4,8

fork reality into all those tabstops and diagnose in all of them
in the same pass.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Lawrence D'Oliveiro
2025-02-17 02:50:45 UTC
Reply
Permalink
Post by Keith Thompson
if (condition)
    statement1;
    statement2;
This is why I got into the habit of writing it like

if (condition)
{
    statement1;
} /*if*/
statement2;

By the way, the braces are mandatory in Perl. Wonder why?
Mark Bourne
2025-02-10 22:57:25 UTC
Reply
Permalink
Post by Janis Papanagnou
I have a _coherent_ file, with a few NL terminated lines of text.
Usually I use fgets() in contexts where I process every line, like
    while (fgets (buf, BUFSIZ, fd) != NULL) {
        operate_on (buf);
    }
    // here the status of buf[] is usually not important any more
My actual context was different, like
    while (fgets (buf, BUFSIZ, fd) != NULL) {
        // buf[] contents are ignored here
    }
    operate_on (buf);  // which I assumed contains last line
...
Post by Janis Papanagnou
Usually I read and process the data that I got in buf from fgets()
while there *is* data (fgets() != NULL), and I thus don't care any
more about buffer contents validity after the loop (fgets() == NULL).
But now I wanted to ignore all data that I got for fgets() != NULL
in the loop. And I hoped that *after* the loop the last read data is
still valid.
What does fgets do if the file is completely empty? I may be wrong
(more familiar with Python than C these days), but it doesn't look like
that should be any different from any other end-of-file condition, so
presumably the first call to fgets would return NULL, without ever
modifying the buffer. Unless the buffer is initialised (e.g. to an
empty string) before the while loop, that would result in an
uninitialised buffer being passed to operate_on.
--
Mark.
Mark Bourne
2025-02-10 23:24:51 UTC
Reply
Permalink
Post by Janis Papanagnou
I have a _coherent_ file, with a few NL terminated lines of text.
Usually I use fgets() in contexts where I process every line, like
     while (fgets (buf, BUFSIZ, fd) != NULL) {
         operate_on (buf);
     }
     // here the status of buf[] is usually not important any more
My actual context was different, like
     while (fgets (buf, BUFSIZ, fd) != NULL) {
         // buf[] contents are ignored here
     }
     operate_on (buf);  // which I assumed contains last line
...
Post by Janis Papanagnou
Usually I read and process the data that I got in buf from fgets()
while there *is* data (fgets() != NULL), and I thus don't care any
more about buffer contents validity after the loop (fgets() == NULL).
But now I wanted to ignore all data that I got for fgets() != NULL
in the loop. And I hoped that *after* the loop the last read data is
still valid.
What does fgets do if the file is completely empty?  I may be wrong
(more familiar with Python than C these days), but it doesn't look like
that should be any different from any other end-of-file condition, so
presumably the first call to fgets would return NULL, without ever
modifying the buffer.  Unless the buffer is initialised (e.g. to an
empty string) before the while loop, that would result in an
uninitialised buffer being passed to operate_on.
Ah, I see Ben covered that earlier today elsewhere in the thread.
--
Mark.
Ben Bacarisse
2025-02-10 01:32:16 UTC
Reply
Permalink
Post by Janis Papanagnou
To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
Something that has not yet come up (as far as I can see) is that you
might need to handle an empty file. In such a case, nothing gets
written and fgets returns NULL right away. Processing buf in this
situation is then undefined.

One way to handle this is to put into buf something that can't get read
by fgets. Two newlines is a good candidate:

    char buf[BUFSIZ] = "\n\n";

You can then test for that if need be, though of course it all depends
on what your application is doing.
--
Ben.
Janis Papanagnou
2025-02-10 01:40:28 UTC
Reply
Permalink
Post by Ben Bacarisse
Post by Janis Papanagnou
To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
Something that has not yet come up (as far as I can see) is that you
might need to handle an empty file. In such a case, nothing gets
written and fgets returns NULL right away. Processing buf in this
situation is then undefined.
I haven't considered that at all because my context is very specific;
it's just three text lines (a comment line, an empty separator line,
and a line with the payload data that I'm interested in). If there's
a file it will have exactly these three lines (and all correctly and
consistently terminated).
Post by Ben Bacarisse
One way to handle this is to put into buf something that can't get read
char buf[BUFSIZE] = "\n\n";
You can then test for that if need be, though of course it all depends
on what your application is doing.
Thanks for pointing it out and for the suggestion.

Janis