Discussion:
Implicit String-Literal Concatenation
Lawrence D'Oliveiro
2024-02-24 23:05:47 UTC
I like using this for long strings:

    fputs
      (
        "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
        "its own, it takes on a backward lading. When it loses one or\n"
        "more, it takes on a forward lading. Such a mote is called a\n"
        "*farer*, for that the drag between unlike ladings flits it. When\n"
        "bernstonebits flit by themselves, it may be as a bolt of\n"
        "lightning, a spark off some faststanding chunk, or the everyday\n"
        "flow of bernstoneness through wires.\n",
        stdout
      );

Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.
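
For anyone who has not met the feature: adjacent string literals are joined during translation (phase 6), so the pieces end up as one array with a single terminating null. A minimal sketch, where the names `joined`, `single` and `show_joined` are made up purely for illustration:

```c
#include <stdio.h>
#include <string.h>

/* Adjacent string literals are merged into one literal at compile time. */
static const char joined[] =
    "first line\n"
    "second line\n";

/* The same text written as one literal, for comparison. */
static const char single[] = "first line\nsecond line\n";

/* Prints the joined text and returns 1 if both spellings are identical. */
int show_joined(void)
{
    fputs(joined, stdout);
    return sizeof joined == sizeof single
        && memcmp(joined, single, sizeof joined) == 0;
}
```

Calling `show_joined()` from `main` should print the two lines and return 1, since no separator or extra null is inserted between the pieces.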
Janis Papanagnou
2024-02-25 16:38:38 UTC
Post by Lawrence D'Oliveiro
fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);
I also liked to be able to use this feature _in some cases_ in C++.

Not in the given case, though, where I like to more clearly see the
newlines, so I'd prefer cout << "..." << endl
Post by Lawrence D'Oliveiro
Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.
In Java you have at least the string concatenation operator + which
is, IMO, pretty good for that line structuring across source lines.

In Awk (another C-like language), string concatenation has no visible
operator, so we can write, for example, print "Hell" "o " "world".
But since Awk's definition of a line is much more restrictive, you
cannot spread these strings across several lines without a
line-continuation escape. (It's not too bad to add a terminating '\'
where desired.)

As for your "for some reason", I could only speculate (so I'll
abstain).

Janis
Lawrence D'Oliveiro
2024-02-25 20:43:31 UTC
In Java you have at least the string concatenation operator + which is,
IMO, pretty good for that line structuring across source lines.
Implicit concatenation works well in Python because you also have the
“%” operator overloaded to perform printf-style formatting with a
string. If you had to use “+” then, because that binds less tightly
than “%”, you would have to have parentheses as well, which are
unnecessary with implicit concatenation. E.g.

# depreciation entries
sql.cursor.execute \
  (
        "insert into payments set when_made = %(when_made)s,"
        " description = %(description)s, other_party_name = \"\","
        " amount = %(amount)d, kind = \"D\", tax_year = %(tax_year)d"
    %
        {
            "when_made" : end_for_tax_year(tax_year) - 1,
            "description" :
                sql_string
                  (
                        "%s: %s $%s at %d%% from %s"
                    %
                        (
                            entry["description"],
                            entry["method"],
                            format_amount(entry["initial_value"]),
                            entry["rate"],
                            format_date(entry["when_purchased"]),
                        )
                  ),
            "amount" : - entry["amount"],
            "tax_year" : tax_year,
        }
  )

Or, for added fun, how about parameterizing a format:

num_format = "%%.%dg" % nr_digits
...
for axis in range(3) :
    out.write \
      (
            " (%s, %s),\n"
        %
            (num_format, num_format)
        %
            (
                min(v.co[axis] for v in the_mesh.vertices),
                max(v.co[axis] for v in the_mesh.vertices)
            )
      )
#end for
bart
2024-02-25 21:20:13 UTC
Post by Lawrence D'Oliveiro
In Java you have at least the string concatenation operator + which is,
IMO, pretty good for that line structuring across source lines.
Implicit concatenation works well in Python because you also have the
“%” operator overloaded to perform printf-style formatting with a
string. If you had to use “+” then, because that binds less tightly
than “%”,
You mean it binds less tightly than implicit concatenation? So that:

    "abc" % "def" "ghi"      means   "abc" % ("def" "ghi")
    "abc" % "def" + "ghi"    means   ("abc" % "def") + "ghi"
Post by Lawrence D'Oliveiro
you would have to have parentheses as well, which are
unnecessary with implicit concatenation. E.g.
# depreciation entries
sql.cursor.execute \
(
"insert into payments set when_made = %(when_made)s,"
" description = %(description)s, other_party_name = \"\","
" amount = %(amount)d, kind = \"D\", tax_year = %(tax_year)d"
%
{
"when_made" : end_for_tax_year(tax_year) - 1,
"description" :
sql_string
(
"%s: %s $%s at %d%% from %s"
%
(
entry["description"],
entry["method"],
format_amount(entry["initial_value"]),
entry["rate"],
format_date(entry["when_purchased"]),
)
),
"amount" : - entry["amount"],
"tax_year" : tax_year,
}
)
num_format = "%%.%dg" % nr_digits
...
for axis in range(3) :
out.write \
(
" (%s, %s),\n"
%
(num_format, num_format)
%
(
min(v.co[axis] for v in the_mesh.vertices),
max(v.co[axis] for v in the_mesh.vertices)
)
)
#end for
Although I can't see that it made much difference here. Is this an
example of how bad it can be without implicit concatenation, or is it
this complicated despite that?

I ask because I can't see any "+" operators between strings, and what
follows "%" is usually something starting with "(" or "{", not a string
constant.
Blue-Maned_Hawk
2024-02-25 16:45:20 UTC
I've used this to make strings with embedded newlines look in the source
file closer to how they'd look on output.
--
Blue-Maned_Hawk│shortens to Hawk│/blu.mɛin.dʰak/│he/him/his/himself/Mr.
blue-maned_hawk.srht.site
2017 called, but i couldn't understand what they were saying over all the
screams.
Lawrence D'Oliveiro
2024-02-25 20:25:09 UTC
Post by Lawrence D'Oliveiro
Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.
I’d forgotten to check Perl; it doesn’t have implicit concatenation
either.
Łukasz 'Maly' Ostrowski
2024-02-26 20:12:39 UTC
Post by Lawrence D'Oliveiro
Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.
Java (Text Blocks):
String s = """
multi
line
string""";

JavaScript (Template Literal):
let s = `multi
line
string`;

Still more convenient than C.

PHP? Don't care about PHP, it's shit, not even checking, most likely
some kind of a Perl-ish <<<EOF expression.
--
Kindest regards,
Łukasz 'Mały' Ostrowski.
Lawrence D'Oliveiro
2024-02-26 20:31:18 UTC
Post by Łukasz 'Maly' Ostrowski
String s = """
multi line string""";
Python has those, too. I use them sometimes. Generally I’m not fond of
them, because I think they’re wrongly defined.
Post by Łukasz 'Maly' Ostrowski
let s = `multi line string`;
I think Python has something like that now, too. F-strings?
Post by Łukasz 'Maly' Ostrowski
Still more convenient than C.
I still like having the choice of implicit concatenation, because then I
fully control what appears in the string.

Tip: I have Emacs macros defined to strip and add the quoting/escaping,
because I find the strings are easier to edit without that.
Post by Łukasz 'Maly' Ostrowski
PHP? Don't care about PHP, it's shit, not even checking, most likely
some kind of a Perl-ish <<<EOF expression.
PHP is shit, not because of what it copied from Perl, but because of what
it didn’t copy. Nowadays it is trying to copy from Python, and it is
making the same mistake.

The <<EOD construct that Perl has comes from POSIX shells, and it is very
useful in both places. Bash also adds a <<<-construct.

Question: How would you do two separate <<-strings in the same shell
command?
Janis Papanagnou
2024-02-27 12:18:20 UTC
Post by Lawrence D'Oliveiro
The <<EOD construct that Perl has comes from POSIX shells, and it is very
useful in both places. Bash also adds a <<<-construct.
Yes, bash adopted the '<<<' "here-strings".
Post by Lawrence D'Oliveiro
Question: How would you do two separate <<-strings in the same shell
command?
Can you give an example what you intend here? (With what semantics?)

Since '<<' is redirecting the here-document text to stdin of the
command you can have only one channel.

Janis
Lawrence D'Oliveiro
2024-02-27 23:10:17 UTC
Post by Janis Papanagnou
Post by Lawrence D'Oliveiro
Question: How would you do two separate <<-strings in the same shell
command?
Can you give an example what you intend here? (With what semantics?)
Since '<<' is redirecting the here-document text to stdin of the command
you can have only one channel.
Perl lets you do something like

func(<<EOD1, <<EOD2);
... contents of first string ...
EOD1
... contents of second string ...
EOD2

But this doesn’t work in Bash. However, in a Posix shell, remember you can
specify the number of the file descriptor you want to redirect, e.g.

diff -u /dev/fd/8 /dev/fd/9 8<<'EOD1' 9<<'EOD2'
... contents of first string ...
EOD1
... contents of second string ...
EOD2

Note I add the single quotes to prevent expansion of “$”-sequences within
the strings. (I think this might be needed in Perl, too.)
Janis Papanagnou
2024-02-27 23:50:46 UTC
Post by Lawrence D'Oliveiro
Post by Janis Papanagnou
Post by Lawrence D'Oliveiro
Question: How would you do two separate <<-strings in the same shell
command?
Can you give an example what you intend here? (With what semantics?)
Since '<<' is redirecting the here-document text to stdin of the command
you can have only one channel.
Perl lets you do something like
func(<<EOD1, <<EOD2);
... contents of first string ...
EOD1
... contents of second string ...
EOD2
But this doesn’t work in Bash. However, in a Posix shell, remember you can
specify the number of the file descriptor you want to redirect, e.g.
diff -u /dev/fd/8 /dev/fd/9 8<<'EOD1' 9<<'EOD2'
... contents of first string ...
EOD1
... contents of second string ...
EOD2
Note I add the single quotes to prevent expansion of “$”-sequences within
the strings. (I think this might be needed in Perl, too.)
I see. - Yes, you can do that in POSIX shells as well. - Note that I set
F'up-to CUS. And post the response there as a f'up to this post.

Janis
Kaz Kylheku
2024-02-26 20:42:42 UTC
Post by Lawrence D'Oliveiro
fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);
Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.
Implicit string catenation means you need punctuation to separate
elements that are not catenated.

It's a nonstarter in Lisp where you want

("ab" "cd" "ef")

to have three elements, not one. So if it worked that way, we would
need

("ab", "cd", "ef")

which is too horrible a price to pay for string literal catenation.

ANSI Lisp just allows line breaks in strings. However, all the
whitespace, indentation included, becomes part of the string.

Allowing line breaks in string literals means that if you forget to
close a quote, it might not be diagnosed until the end of file!
The strictness of having to close a string in the same line is
worthwhile for diagnosis.

In TXR Lisp, I solved multiple problems with a backslash continuation.

"abc \
def"

encodes the string "abcdef". All unescaped whitespace around the
backslash is deleted. If you want "abc def", you can plant an escaped
space in there:

"abc \ \
def"

or

"abc \
\ def"

Unfortunately, it does mean we have the run of backslashes down the
right side:

"abc \
def \
ghi \
... "

I can live with that.
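
For comparison with C, which this thread started from: C's own backslash-newline splice also works inside a string literal, but it removes only the backslash and the newline, so any indentation on the continuation line stays in the string. A small sketch (the identifiers are illustrative only):

```c
#include <string.h>

/* Backslash-newline is deleted in translation phase 2, so the
   continuation line must start in column 1 to avoid extra spaces. */
static const char spliced[] = "abc \
def";

/* Here the continuation line is indented by four spaces, and those
   spaces become part of the string. */
static const char indented[] = "abc \
    def";

/* Adjacent literals give the intended text with free indentation. */
static const char adjacent[] =
    "abc "
    "def";

/* Returns 1 when the unindented splice and the adjacent form agree. */
int splice_matches_adjacent(void)
{
    return strcmp(spliced, adjacent) == 0;
}
```

That indentation problem is presumably one reason the adjacent-literal form is the usual choice in C for long strings.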
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Mike Sanders
2024-02-26 22:03:11 UTC
Post by Lawrence D'Oliveiro
fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);
Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.
Easy solution Lawrence. Why not use something like bin2c:

<https://www.segger.com/free-utilities/bin2c/>

void Usage() {

    #include "my_text"

    printf("%s\n", my_var);

}
--
:wq
Mike Sanders
Lawrence D'Oliveiro
2024-02-26 23:17:36 UTC
My tool for easy editing of such embedded text is the Emacs macros in
multiquote.el, here <https://gitlab.com/ldo/emacs-prefs>.
Mike Sanders
2024-02-27 17:27:51 UTC
Post by Lawrence D'Oliveiro
My tool for easy editing of such embedded text is the Emacs macros in
multiquote.el, here <https://gitlab.com/ldo/emacs-prefs>.
Neato-burritto, built his own tool chain, 'atta-boy. Interesting page.
--
:wq
Mike Sanders
David Brown
2024-02-27 08:36:38 UTC
Post by Mike Sanders
Post by Lawrence D'Oliveiro
fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);
Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.
Easy solution Lawrence. Why not use something like bin2c:
<https://www.segger.com/free-utilities/bin2c/>
Because it generates files that have Segger copyright notices stamped on
them? At least, that's how it appears from that web page.

There are lots of open source alternatives that do similar things, with
different variations in the way they generate the output. Or you can
write your own in about 10 lines of Python, which of course makes it a
lot easier to customise to fit your own styles and requirements.

And with C23, we will get #embed, though it is not yet supported by
major tools.

<https://en.cppreference.com/w/c/preprocessor/embed>
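
As a rough idea of how little such a converter needs, here is a sketch in C rather than Python. The function name `emit_c_array` and the output layout are my own assumptions, not what bin2c or any other particular tool produces:

```c
#include <stdio.h>

/* Writes `len` bytes from `data` to `out` as a C array definition named
   `name`, twelve values per line. Returns 0 on success, -1 on write error. */
int emit_c_array(FILE *out, const char *name,
                 const unsigned char *data, size_t len)
{
    fprintf(out, "static const unsigned char %s[] = {", name);
    for (size_t i = 0; i < len; i++) {
        /* Start a new indented line every twelve values. */
        fputs(i % 12 == 0 ? "\n    " : " ", out);
        fprintf(out, "0x%02x,", (unsigned)data[i]);
    }
    fputs("\n};\n", out);
    return ferror(out) ? -1 : 0;
}
```

A real tool would read the input file in chunks and call something like this on each chunk. Note that an empty input would yield an empty initializer, which is only valid from C23 on, so that case would need special handling.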
Mike Sanders
2024-02-27 17:31:36 UTC
Post by David Brown
Because it generates files that have Segger copyright notices stamped on
them? At least, that's how it appears from that web page.
Then we build our own...
Post by David Brown
There are lots of open source alternatives that do similar things, with
different variations in the way they generate the output. Or you can
write your own in about 10 lines of Python, which of course makes it a
lot easier to customise to fit your own styles and requirements.
Yeah even simpler ways too, sed/awk/etc
Post by David Brown
And with C23, we will get #embed, though it is not yet supported by
major tools.
<https://en.cppreference.com/w/c/preprocessor/embed>
Did not know that was coming down the pike, thanks for sharing the info
David.
--
:wq
Mike Sanders
bart
2024-02-27 18:56:26 UTC
Post by David Brown
Post by Mike Sanders
    fputs
      (
        "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
        "its own, it takes on a backward lading. When it loses one or\n"
        "more, it takes on a forward lading. Such a mote is called a\n"
        "*farer*, for that the drag between unlike ladings flits it. When\n"
        "bernstonebits flit by themselves, it may be as a bolt of\n"
        "lightning, a spark off some faststanding chunk, or the everyday\n"
        "flow of bernstoneness through wires.\n",
        stdout
      );
Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.
Easy solution Lawrence. Why not use something like bin2c:
<https://www.segger.com/free-utilities/bin2c/>
Because it generates files that have Segger copyright notices stamped on
them? At least, that's how it appears from that web page.
There are lots of open source alternatives that do similar things, with
different variations in the way they generate the output. Or you can
write your own in about 10 lines of Python, which of course makes it a
lot easier to customise to fit your own styles and requirements.
And with C23, we will get #embed, though it is not yet supported by
major tools.
<https://en.cppreference.com/w/c/preprocessor/embed>
Actually I've had such a feature, for text files, for some years in my
older compiler:

#include <stdio.h>

int main(void) {
    /* strinclude is an extension of this compiler, not standard C */
    puts(strinclude(__FILE__));
}

This prints out the contents of this sourcefile. Binary files don't work
because of embedded zeros, but could have been made to.

Some stuff is just very easy to do; other stuff like designator chains
less easy and also less useful.
David Brown
2024-02-27 22:21:28 UTC
Post by bart
Post by David Brown
Post by Mike Sanders
    fputs
      (
        "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
        "its own, it takes on a backward lading. When it loses one or\n"
        "more, it takes on a forward lading. Such a mote is called a\n"
        "*farer*, for that the drag between unlike ladings flits it. When\n"
        "bernstonebits flit by themselves, it may be as a bolt of\n"
        "lightning, a spark off some faststanding chunk, or the everyday\n"
        "flow of bernstoneness through wires.\n",
        stdout
      );
Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.
Easy solution Lawrence. Why not use something like bin2c:
<https://www.segger.com/free-utilities/bin2c/>
Because it generates files that have Segger copyright notices stamped
on them? At least, that's how it appears from that web page.
There are lots of open source alternatives that do similar things,
with different variations in the way they generate the output. Or you
can write your own in about 10 lines of Python, which of course makes
it a lot easier to customise to fit your own styles and requirements.
And with C23, we will get #embed, though it is not yet supported by
major tools.
<https://en.cppreference.com/w/c/preprocessor/embed>
Actually I've had such a feature, for text files, for some years in my
older compiler:
    #include <stdio.h>
    int main(void) {
        puts(strinclude(__FILE__));
    }
This prints out the contents of this sourcefile. Binary files don't work
because of embedded zeros, but could have been made to.
Some stuff is just very easy to do; other stuff like designator chains
less easy and also less useful.
The #embed pre-processor directive turns the file into a list of integer
constants, one per byte (unless an implementation offers other options).
That makes it a little less convenient for strings than your solution,
but more convenient for data files. There's no harm in supporting both!
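
To make the string-versus-data point concrete, here is a sketch of what the directive logically expands to. It assumes a hypothetical three-byte file `hello.txt` holding `Hi` and a newline in ASCII, and spells the expansion out by hand so that it compiles without C23 support:

```c
#include <string.h>

/* With C23 this could be written as:
 *     static const unsigned char blob[] = {
 *     #embed "hello.txt"
 *     };
 * For the assumed file contents the directive expands to the list below. */
static const unsigned char blob[] = { 72, 105, 10 };

/* The string-literal spelling carries a terminating null; the blob does not. */
static const char text[] = "Hi\n";

/* Returns 1 if the blob holds the same bytes as the text, minus the null. */
int blob_matches_text(void)
{
    return sizeof blob == sizeof text - 1
        && memcmp(blob, text, sizeof blob) == 0;
}
```

The missing terminator is what makes the result awkward to pass to `puts` and friends; one would have to append a `0` to the initializer list by hand.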
Lawrence D'Oliveiro
2024-02-27 22:52:33 UTC
Post by David Brown
The #embed pre-processor directive turns the file into a list of integer
constants, one per byte (unless an implementation offers other options).
What a waste of time.
Kaz Kylheku
2024-02-28 01:09:46 UTC
Post by Lawrence D'Oliveiro
Post by David Brown
The #embed pre-processor directive turns the file into a list of integer
constants, one per byte (unless an implementation offers other options).
What a waste of time.
Plus easily doable in 1970's Lisp.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
David Brown
2024-02-28 11:50:10 UTC
Post by Kaz Kylheku
Post by Lawrence D'Oliveiro
Post by David Brown
The #embed pre-processor directive turns the file into a list of integer
constants, one per byte (unless an implementation offers other options).
What a waste of time.
Plus easily doable in 1970's Lisp.
That would be useful, if we were living in the 1970's or if anyone had
wanted to learn Lisp this side of the millennium bug.

As I mentioned before, it's not particularly difficult to do this kind
of manipulation, and people write utilities for them in a variety of
languages, or download a variety of free tools for the job.

But it will often be more convenient to have it built into the language
and compiler. And for those interested in speed, the test
implementations have handled this far more efficiently than other
techniques. Logically, #embed turns the file into a list of numbers. In
practice, if you use it for the common case of initialising a const
array of unsigned char, the compiler simply copies and pastes the file
into the output as a binary blob.

It would, IMHO, have been useful also to have had an "embed operator" in
the manner of the "pragma operator", so that it could be used in a macro
definition.
Lawrence D'Oliveiro
2024-02-28 20:56:28 UTC
... people write utilities for them in a variety of languages ...
But it will often be more convenient to have it built into the language
and compiler.
What can be built into the language can only ever be a small subset of
the many and varied ways that people have incorporated data blobs into
their programs. Often these will need to have custom structures with
computed header fields, that kind of thing. So you will need custom
build tools to construct these structures, and then you might as well
include those blobs directly into the final build, rather than go
through some extra step of pretending to turn them back into some
source form.

For example, here’s an old Android project of mine (OK, so the app is
Java code, but the same principle applies)
<https://bitbucket.org/ldo17/unicode_browser_android/src/master/>
where I wrote a custom Python script to read a Nameslist.txt file
downloaded from unicode.org to generate a table which could be loaded
into memory quickly for easy searching.
bart
2024-02-28 21:34:14 UTC
Post by Lawrence D'Oliveiro
... people write utilities for them in a variety of languages ...
But it will often be more convenient to have it built into the language
and compiler.
What can be built into the language can only ever be a small subset of
the many and varied ways that people have incorporated data blobs into
their programs. Often these will need to have custom structures with
computed header fields, that kind of thing. So you will need custom
build tools to construct these structures, and then you might as well
include those blobs directly into the final build, rather than go
through some extra step of pretending to turn them back into some
source form.
For example, here’s an old Android project of mine (OK, so the app is
Java code, but the same principle applies)
<https://bitbucket.org/ldo17/unicode_browser_android/src/master/>
where I wrote a custom Python script to read a Nameslist.txt file
downloaded from unicode.org to generate a table which could be loaded
into memory quickly for easy searching.
I can see now where you get your coding style from. You seem to like
stretching things out vertically as much as possible:

    public void Add
      (
        int CategoryCode,
        ItemType Item
      )
      /* Use this instead of add to populate CodeToIndex table. */
      {
        CodeToIndex.put(CategoryCode, getCount());
        add(Item);
      } /*Add*/

In C:

    void Add(int CategoryCode, ItemType Item) {
        CodeToIndex_put(CategoryCode, getCount());
        add(Item);
    }

4 non-comment lines versus 9. I know Java needs tons of boilerplate, but
it is not all the language's fault.
Lawrence D'Oliveiro
2024-02-28 23:52:55 UTC
Post by bart
void Add(int CategoryCode, ItemType Item) {
CodeToIndex_put(CategoryCode, getCount());
add(Item);
}
4 non-comment lines versus 9. I know Java needs tons of boilerplate, but
it is not all the language's fault.
Or how about

void Add(int CategoryCode, ItemType Item) {CodeToIndex_put(CategoryCode, getCount());add(Item);}

Wow! I never realized you could do that in C!! I thought it was an
error to put stuff after column 72 or something. Thanks for the tip!!!
bart
2024-02-29 00:15:17 UTC
Post by Lawrence D'Oliveiro
Post by bart
void Add(int CategoryCode, ItemType Item) {
CodeToIndex_put(CategoryCode, getCount());
add(Item);
}
4 non-comment lines versus 9. I know Java needs tons of boilerplate, but
it is not all the language's fault.
Or how about
void Add(int CategoryCode, ItemType Item) {CodeToIndex_put(CategoryCode, getCount());add(Item);}
Wow! I never realized you could do that in C!! I thought it was an
error to put stuff after column 72 or something. Thanks for the tip!!!
Well, you could write an entire program on one line.

Or you can write an entire program in one thin column:

v\
o\
i\
d\
....

Or you can use common sense and avoid writing code which is either
too compact or so spread out vertically that you have to hunt for the
actual code. Like trying to find the bits of meat in a thin soup.

That's what I took away from your Java code, which looks remarkably like
the spaced-out examples you posted recently.
Lawrence D'Oliveiro
2024-02-29 02:53:33 UTC
Post by bart
Or you can use common sense and avoid writing code which is either
too compact or so spread out vertically that you have to hunt for the
actual code. Like trying to find the bits of meat in a thin soup.
Terribly sorry about that. I wonder if you could look at this part of the
same code file:

final android.util.SparseArray<Integer> CodeToIndex =
new android.util.SparseArray<Integer>();

and show me how to thicken that part of my humble, tasteless gruel? Maybe
using that same “_” trick you used to do OO in C in your previous example?
bart
2024-02-29 09:20:30 UTC
Post by Lawrence D'Oliveiro
Post by bart
Or you can use common sense and avoid writing code which is either
too compact or so spread out vertically that you have to hunt for the
actual code. Like trying to find the bits of meat in a thin soup.
Terribly sorry about that. I wonder if you could look at this part of the
same code file:
final android.util.SparseArray<Integer> CodeToIndex =
new android.util.SparseArray<Integer>();
and show me how to thicken that part of my humble, tasteless gruel? Maybe
using that same “_” trick you used to do OO in C in your previous example?
You've shown an example of a piece of meat. In main.java, 70% of the
lines are either blanks or contain only an opening or closing bracket.
Scott Lurndal
2024-02-29 15:48:51 UTC
Post by bart
Post by Lawrence D'Oliveiro
Post by bart
void Add(int CategoryCode, ItemType Item) {
CodeToIndex_put(CategoryCode, getCount());
add(Item);
}
4 non-comment lines versus 9. I know Java needs tons of boilerplate, but
it is not all the language's fault.
Or how about
void Add(int CategoryCode, ItemType Item) {CodeToIndex_put(CategoryCode, getCount());add(Item);}
Wow! I never realized you could do that in C!! I thought it was an
error to put stuff after column 72 or something. Thanks for the tip!!!
Well, you could write an entire program on one line.
int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

(A winner from the obfuscated C contest).
Janis Papanagnou
2024-02-29 16:03:41 UTC
Post by Scott Lurndal
int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}
What does it do?

What preconditions must be fulfilled or what additions
does it need to compile?
Post by Scott Lurndal
(A winner from the obfuscated C contest).
(Are non-compiling C sources allowed in the contest?)

Janis
Scott Lurndal
2024-02-29 16:17:44 UTC
Post by Janis Papanagnou
Post by Scott Lurndal
int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}
What does it do?
What preconditions must be fulfilled or what additions
does it need to compile?
Post by Scott Lurndal
(A winner from the obfuscated C contest).
(Are non-compiling C sources allowed in the contest?)
https://www.ioccc.org/years.html

The above is from 'burton'.
Janis Papanagnou
2024-02-29 17:12:20 UTC
Post by Scott Lurndal
Post by Janis Papanagnou
Post by Scott Lurndal
int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}
What does it do?
What preconditions must be fulfilled or what additions
does it need to compile?
With the link below I see it "needs" a 600+ lines long Makefile.

And I see there's some variable definitions with magic numbers
passed.
Post by Scott Lurndal
Post by Janis Papanagnou
Post by Scott Lurndal
(A winner from the obfuscated C contest).
(Are non-compiling C sources allowed in the contest?)
https://www.ioccc.org/years.html
The above is from 'burton'.
Thanks.

Janis
Scott Lurndal
2024-02-29 17:30:00 UTC
Post by Janis Papanagnou
Post by Janis Papanagnou
Post by Scott Lurndal
int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}
What does it do?
What preconditions must be fulfilled or what additions
does it need to compile?
With the link below I see it "needs" a 600+ lines long Makefile.
The readme simply says compile it and run it
as ./prog <value between 1 and 512>.
Keith Thompson
2024-02-29 21:20:19 UTC
Post by Scott Lurndal
Post by Janis Papanagnou
Post by Janis Papanagnou
Post by Scott Lurndal
int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}
What does it do?
What preconditions must be fulfilled or what additions
does it need to compile?
With the link below I see it "needs" a 600+ lines long Makefile.
The readme simply says compile it and run it
as ./prog <value between 1 and 512>.
No, you have to compile it with specific command-line arguments to
define B and I. The Makefile does that (don't ask me why it's so long),
but you can do it manually.

From hint.txt:
"""
On a little-endian machine:

clang -include stdio.h -include stdlib.h -Wall -Weverything -pedantic -DB=6945503773712347754LL -DI=5859838231191962459LL -DT=0 -DS=7 -o prog prog.c

On a big-endian machine:

clang -include stdio.h -include stdlib.h -Wall -Weverything -pedantic -DB=7091606191627001958LL -DI=6006468689561538903LL -DT=1 -DS=0 -o prog.be prog.c
"""
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
bart
2024-02-29 21:44:20 UTC
Post by Keith Thompson
Post by Scott Lurndal
Post by Janis Papanagnou
Post by Janis Papanagnou
Post by Scott Lurndal
int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}
What does it do?
What preconditions must be fulfilled or what additions
does it need to compile?
With the link below I see it "needs" a 600+ lines long Makefile.
The readme simply says compile it and run it
as ./prog <value between 1 and 512>.
No, you have to compile it with specific command-line arguments to
define B and I. The Makefile does that (don't ask me why it's so long),
but you can do it manually.
From hint.txt:
"""
On a little-endian machine:
clang -include stdio.h -include stdlib.h -Wall -Weverything -pedantic -DB=6945503773712347754LL -DI=5859838231191962459LL -DT=0 -DS=7 -o prog prog.c
On a big-endian machine:
clang -include stdio.h -include stdlib.h -Wall -Weverything -pedantic -DB=7091606191627001958LL -DI=6006468689561538903LL -DT=1 -DS=0 -o prog.be prog.c
"""
Isn't it cheating when half the program is part of the build
instructions? Here is a complete standalone program:

----------------
#include <stdio.h>
#include <stdlib.h>
#define B 6945503773712347754LL
#define I 5859838231191962459LL
#define T 0
#define S 7
int main(int b,char**i){long long
n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}
----------------

It's 261 bytes. The 'one-liner' that was posted was 134 bytes.

(It appears to print an input of 0 to 255 as binary.)
Keith Thompson
2024-02-29 22:06:50 UTC
Permalink
bart <***@freeuk.com> writes:
[...]
Post by bart
Isn't it cheating when half the program is part of the build
instructions?
Apparently not. If it were, the judges of the IOCCC would not have
accepted it.

The most recent rules are here:

https://www.ioccc.org/2020/rules.txt

I have not studied them to verify that the entry in question obeys them,
but the judges know what they're doing.

One of the winners of the 1988 contest was:
```
#include "/dev/tty"
```

This won a "Best Abuse of the Rules" award and resulted in a change in
the 1989 rules to forbid doing the same thing again.
Janis Papanagnou
2024-03-01 17:09:05 UTC
Permalink
Post by Keith Thompson
[...]
Post by bart
Isn't it cheating when half the program is part of the build
instructions?
I recall from decades ago (when I looked into this contest) that they
even had a contribution that fed the whole C program into the compiler
through compiler options. (I think it even got a prize.)

"Is it cheating?" - I'd say no, since it was accepted.

Is it really about an "obfuscated C code"? - I'd say no. (But it was
anyway a curiosity.)
Post by Keith Thompson
Apparently not. If it were, the judges of the IOCCC would not have
accepted it.
[...]
```
#include "/dev/tty"
```
This is great! :-)

Janis
Keith Thompson
2024-03-01 18:49:46 UTC
Permalink
Post by Janis Papanagnou
[...]
Post by bart
Isn't it cheating when half the program is part of the build
instructions?
I recall from decades ago (when I looked into this contest) that they
even had a contribution that fed the whole C program into the compiler
through compiler options. (I think it even got a prize.)
1990/stig.c:
```
c
```

stig.ksh provides a ksh alias that compiles it with a long "-D..."
option. (The resulting program prints a message indicating whether the
compiler supports nested /**/ comments.)

Like most Abuse of the Rules winners, it resulted in a rule change for
the following years.

https://www.ioccc.org/years.html#1990
Janis Papanagnou
2024-03-01 21:06:03 UTC
Permalink
Post by Keith Thompson
Like most Abuse of the Rules winners, it resulted in a rule change for
the following years.
Makes sense.

Janis
Keith Thompson
2024-02-29 17:20:18 UTC
Permalink
Post by Scott Lurndal
Post by Janis Papanagnou
Post by Scott Lurndal
int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}
What does it do?
What preconditions must be fulfilled or what additions
does it need to compile?
Post by Scott Lurndal
(A winner from the obfuscated C contest).
(Are non-compiling C sources allowed in the contest?)
https://www.ioccc.org/years.html
The above is from 'burton'.
"burton" has submitted a number of entries over the years. This one is
https://www.ioccc.org/2020/burton/index.html

The program by itself does not compile. The provided Makefile passes
additional arguments, with B and I being defined differently in
big-endian vs. little-endian machines.
David Brown
2024-02-29 07:58:40 UTC
Permalink
Post by Lawrence D'Oliveiro
... people write utilities for them in a variety of languages ...
But it will often be more convenient to have it built into the language
and compiler.
What can be built into the language can only ever be a small subset of
the many and varied ways that people have incorporated data blobs into
their programs.
Of course. But that doesn't mean that a language should not include a
feature that makes it easy for a lot of people to get some data blobs
into their code. Maybe /you/ won't find it useful, but other people will.
Lawrence D'Oliveiro
2024-02-29 21:05:02 UTC
Permalink
Post by David Brown
Post by Lawrence D'Oliveiro
... people write utilities for them in a variety of languages ...
But it will often be more convenient to have it built into the
language and compiler.
What can be built into the language can only ever be a small subset of
the many and varied ways that people have incorporated data blobs into
their programs.
Of course. But that doesn't mean that a language should not include a
feature that makes it easy for a lot of people to get some data blobs
into their code.
Maybe the C compiler should concentrate on compiling C code, and leave it
to the rest of the build toolchain to deal with other data.
David Brown
2024-03-01 08:16:00 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by David Brown
Post by Lawrence D'Oliveiro
... people write utilities for them in a variety of languages ...
But it will often be more convenient to have it built into the
language and compiler.
What can be built into the language can only ever be a small subset of
the many and varied ways that people have incorporated data blobs into
their programs.
Of course. But that doesn't mean that a language should not include a
feature that makes it easy for a lot of people to get some data blobs
into their code.
Maybe the C compiler should concentrate on compiling C code, and leave it
to the rest of the build toolchain to deal with other data.
It is possible to be actively involved in the development of the
standards - preparing and discussing proposals, joining committees, or
at least joining mailing lists for the discussions. If you are not
doing the work and showing the interest /before/ decisions are made, you
don't get a say afterwards. It is more productive to discuss what you
can do with the features C has, than to wish it never had them.

Oh, and the reason C23 has #embed is that people want it. It is
something C developers have asked for for many years. /You/ might not
have use of it, but that's true of lots of features of C for all
programmers - no one needs everything in the language and standard library.
Kaz Kylheku
2024-03-01 16:55:51 UTC
Permalink
Post by David Brown
It is possible to be actively involved in the development of the
standards - preparing and discussing proposals, joining committees, or
at least joining mailing lists for the discussions. If you are not
doing the work and showing the interest /before/ decisions are made, you
don't get a say afterwards. It is more productive to discuss what you
can do with the features C has, than to wish it never had them.
Also, if you don't join the gang that breaks windows and spray
paints walls, you don't get to say afterward which windows are broken
and what is scribbled on what wall.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
David Brown
2024-03-01 17:28:05 UTC
Permalink
Post by Kaz Kylheku
Post by David Brown
It is possible to be actively involved in the development of the
standards - preparing and discussing proposals, joining committees, or
at least joining mailing lists for the discussions. If you are not
doing the work and showing the interest /before/ decisions are made, you
don't get a say afterwards. It is more productive to discuss what you
can do with the features C has, than to wish it never had them.
Also, if you don't join the gang that breaks windows and spray
paints walls, you don't get to say afterward which windows are broken
and what is scribbled on what wall.
A slightly closer version of that feeble analogy would be that you don't
get to say they should have used a different colour, or broken doors
instead of windows.

It's okay for Lawrence (or anyone else) to say that they don't approve of
#embed, or don't think they will use it themselves. But like most
(probably all) features in newer C standards, it was added because
enough people wanted it for the committee and connected developers to do
the work designing and documenting the features, and testing prototypes
in practice.

There are procedures in place for people to have an influence on the
future of C. If you want to have your say, you can have it. But
waiting until a new standard version is solidified and then complaining
that you don't like the direction it is taking, is too late. Whining
about things here afterwards doesn't do anyone any good.

That's different from saying you don't like the feature, or you don't
like the way C is heading, or you won't use it yourself. And it's
different from talking about it, trying to learn how a new feature works
and how to make the best of it. Such discussions are great, and I'd
love to see more of them here in c.l.c.
Lawrence D'Oliveiro
2024-02-27 20:25:27 UTC
Permalink
Post by David Brown
And with C23, we will get #embed, though it is not yet supported by
major tools.
More and more hacks on the preprocessor. Why not just get rid of it and
replace it with something like m4?

Because then you will discover that string-based macros are inherently an
unmanageable problem.
Keith Thompson
2024-02-27 20:35:35 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by David Brown
And with C23, we will get #embed, though it is not yet supported by
major tools.
More and more hacks on the preprocessor. Why not just get rid of it and
replace it with something like m4?
Because it would invalidate 99% or more of existing C code.

(m4? Seriously?)
Post by Lawrence D'Oliveiro
Because then you will discover that string-based macros are inherently an
unmanageable problem.
The C preprocessor operates on preprocessor tokens, not just strings.
Lawrence D'Oliveiro
2024-02-27 23:03:08 UTC
Permalink
Post by Keith Thompson
(m4? Seriously?)
Do you know of any more powerful string-based macro processor?
Post by Keith Thompson
The C preprocessor operates on preprocessor tokens, not just strings.
Think of “hygienic” macros in the Lisps, and why that is just impossible
in any string-based preprocessor.
bart
2024-02-27 22:12:23 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by David Brown
And with C23, we will get #embed, though it is not yet supported by
major tools.
More and more hacks on the preprocessor. Why not just get rid of it and
replace it with something like m4?
Because then you will discover that string-based macros are inherently an
unmanageable problem.
I hadn't noticed that #embed was a preprocessor directive. But that is
not the problem here, it is this:

"The expansion of a #embed directive is a token sequence formed from the
list of integer constant expressions described below."

If a string like "ABC" really is converted to the five tokens 'A' comma
'B' comma 'C', then it's going to make long strings and binary files
inefficient.

Embedding a 100KB file will result in a 100KB bigger executable, but
along the way it may have to generate 200,000 tokens within the
compiler, half of them commas. Which in turn will need to be turned into
100,000 integer expressions.

I would hope that implementations find some way of streamlining that
process, perhaps by turning that 100KB of data directly into a 100KB string.
David Brown
2024-02-28 11:54:06 UTC
Permalink
Post by bart
Post by Lawrence D'Oliveiro
Post by David Brown
And with C23, we will get #embed, though it is not yet supported by
major tools.
More and more hacks on the preprocessor. Why not just get rid of it and
replace it with something like m4?
Because then you will discover that string-based macros are inherently an
unmanageable problem.
I hadn't noticed that #embed was a preprocessor directive. But that is
not the problem here, it is this:
"The expansion of a #embed directive is a token sequence formed from the
list of integer constant expressions described below."
If a string like "ABC" really is converted to the five tokens 'A' comma
'B' comma 'C', then it's going to make long strings and binary files
inefficient.
Embedding a 100KB file will result in a 100KB bigger executable, but
along the way it may have to generate 200,000 tokens within the
compiler, half of them commas. Which in turn will need to be turned into
100,000 integer expressions.
I would hope that implementations find some way of streamlining that
process, perhaps by turning that 100KB of data directly into a 100KB string.
They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).

Tests with prototype implementations gave extremely fast results.
bart
2024-02-28 13:13:13 UTC
Permalink
Post by bart
Post by Lawrence D'Oliveiro
Post by David Brown
And with C23, we will get #embed, though it is not yet supported by
major tools.
More and more hacks on the preprocessor. Why not just get rid of it and
replace it with something like m4?
Because then you will discover that string-based macros are
inherently an
unmanageable problem.
I hadn't noticed that #embed was a preprocessor directive. But that is
not the problem here, it is this:
"The expansion of a #embed directive is a token sequence formed from
the list of integer constant expressions described below."
If a string like "ABC" really is converted to the five tokens 'A'
comma 'B' comma 'C', then it's going to make long strings and binary
files inefficient.
Embedding a 100KB file will result in a 100KB bigger executable, but
along the way it may have to generate 200,000 tokens within the
compiler, half of them commas. Which in turn will need to be turned
into 100,000 integer expressions.
I would hope that implementations find some way of streamlining that
process, perhaps by turning that 100KB of data directly into a 100KB string.
They won't use strings, they will use data blobs - binary data.  Then
there is no issue with null bytes.
AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:

char s[]={1,2,3,0,4,5,6};

s will have a length of 7.
  And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).
What happens if you do -E to preprocess only?
David Brown
2024-02-28 14:08:52 UTC
Permalink
Post by bart
Post by bart
Post by Lawrence D'Oliveiro
Post by David Brown
And with C23, we will get #embed, though it is not yet supported by
major tools.
More and more hacks on the preprocessor. Why not just get rid of it and
replace it with something like m4?
Because then you will discover that string-based macros are
inherently an
unmanageable problem.
I hadn't noticed that #embed was a preprocessor directive. But that is
not the problem here, it is this:
"The expansion of a #embed directive is a token sequence formed from
the list of integer constant expressions described below."
If a string like "ABC" really is converted to the five tokens 'A'
comma 'B' comma 'C', then it's going to make long strings and binary
files inefficient.
Embedding a 100KB file will result in a 100KB bigger executable, but
along the way it may have to generate 200,000 tokens within the
compiler, half of them commas. Which in turn will need to be turned
into 100,000 integer expressions.
I would hope that implementations find some way of streamlining that
process, perhaps by turning that 100KB of data directly into a 100KB string.
They won't use strings, they will use data blobs - binary data.  Then
there is no issue with null bytes.
AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:
    char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
That's not a string, it's an array of char. A "string" in C is "a
contiguous sequence of characters terminated by and including the first
null character". The difference is crucial in respect to the handling
of null bytes. And it is the main reason for #embed generating a
comma-separated sequence of integer constants rather than a string. (It
also avoids messy hex character sequences if you show the output of
#embed somewhere.)
Post by bart
  And yes, implementations will skip the token generation (unless you
are doing something weird, such as using #embed to read the parameters
to a function call).
What happens if you do -E to preprocess only?
That's something weird :-)

I guess you get the integer list.
Keith Thompson
2024-02-28 21:36:40 UTC
Permalink
bart <***@freeuk.com> writes:
[...]
Post by bart
AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:
char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
Strings *by definition* cannot have embedded zeros. A null character
terminates a string.

A string literal can have embedded \0 characters, but if you're
suggesting that #embed should expand to a string literal, I can see
several disadvantages and no significant advantages. For one thing, the
data may or may not end with a null character; string literals always
do.

[...]
Malcolm McLean
2024-02-29 11:56:47 UTC
Permalink
Post by Keith Thompson
[...]
Post by bart
AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:
char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
Strings *by definition* cannot have embedded zeros. A null character
terminates a string.
C strings. Not strings in other programming languages. And only if you
define "C strings" in a rather restrictive but, to be fair, totally
legitimate way. So I wouldn't have put in the asterisks.
--
Check out Basic Algorithms and my other books:
https://www.lulu.com/spotlight/bgy1mm
David Brown
2024-02-29 15:19:45 UTC
Permalink
Post by Malcolm McLean
[...]
Post by bart
AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:
     char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
Strings *by definition* cannot have embedded zeros.  A null character
terminates a string.
C strings. Not strings in other programming languages.
Let me point you to the name of this Usenet group.

And strings in any programming language have either:

1. A string of characters and a terminating null, thus no embedded null
characters.
2. A starting length (such as Pascal strings).
3. A fixed size.
4. A more advanced structure.

An array of bytes is not a "string".
Post by Malcolm McLean
And only if you
define "C strings" in a rather restrictive but, to be fair, totally
legitimate way. So I wouldn't have put in the asterisks.
The definition of "C string" is given in section 7.1.1p1 of the C
standards. There is only one definition of "C string".
Lawrence D'Oliveiro
2024-02-29 21:36:05 UTC
Permalink
Post by David Brown
An array of bytes is not a "string".
It is in PHP, I think also in Perl, and also in (obsolete) Python 2.

And what about C string functions that take explicit lengths?
Keith Thompson
2024-02-29 21:53:04 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by David Brown
An array of bytes is not a "string".
It is in PHP, I think also in Perl, and also in (obsolete) Python 2.
I'm fairly sure that strings are of a distinct non-array type in all
three of those languages. If you're curious about the details, consult
the relevant documentation or post in an appropriate newsgroup.
Post by Lawrence D'Oliveiro
And what about C string functions that take explicit lengths?
What about them?

The C standard defines the term "string" (7.1.1 paragraph 1). Do you
dispute that definition?
Richard Harnden
2024-03-01 12:59:43 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by David Brown
An array of bytes is not a "string".
It is in PHP, I think also in Perl, and also in (obsolete) Python 2.
And what about C string functions that take explicit lengths?
You mean: There's a danger that a function that returns a 'string', but
truncates it to n chars, might not be returning a string at all ?
Lawrence D'Oliveiro
2024-03-01 20:59:11 UTC
Permalink
Post by Richard Harnden
You mean: There's a danger that a function that returns a 'string', but
truncates it to n chars, might not be returning a string at all ?
If it’s not NUL-terminated, then it’s not a “string”, right?
Keith Thompson
2024-02-29 16:08:23 UTC
Permalink
Post by Malcolm McLean
Post by Keith Thompson
[...]
Post by bart
AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:
char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
Strings *by definition* cannot have embedded zeros. A null
character
terminates a string.
C strings. Not strings in other programming languages. And only if you
define "C strings" in a rather restrictive but, to be fair, totally
legitimate way. So I wouldn't have put in the asterisks.
Yes, C strings. Or, as we call them here in comp.lang.c, strings.

Sheesh.
bart
2024-02-29 14:31:11 UTC
Permalink
Post by Keith Thompson
[...]
Post by bart
AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:
char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
Strings *by definition* cannot have embedded zeros. A null character
terminates a string.
A string literal can have embedded \0 characters, but if you're
suggesting that #embed should expand to a string literal, I can see
several disadvantages and no significant advantages. For one thing, the
data may or may not end with a null character; string literals always
do.
Not here:

char s[] = "ABC";
char t[3] = "DEF";

The "DEF" string doesn't end with a zero.

Is 'string' given a special meaning in the standard?

/That/ would seem to me to be too restrictive. Does this:

char *s;

define a pointer to a such string, or can it be any kind of data? For
example, `char*` is used by the GetOpenFileName WinAPI function for a
/series/ of zero-terminated strings which itself is terminated with two
zero bytes.

So it is some property that is attributed to the data that will be stored.

I normally use `cstring` or `stringz` outside the language when referring
to a zero-terminated sequence of characters, which implies that
embedded zeros aren't allowed.
Richard Harnden
2024-02-29 15:22:18 UTC
Permalink
[...]
Post by bart
AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:
     char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
Strings *by definition* cannot have embedded zeros.  A null character
terminates a string.
A string literal can have embedded \0 characters, but if you're
suggesting that #embed should expand to a string literal, I can see
several disadvantages and no significant advantages.  For one thing, the
data may or may not end with a null character; string literals always
do.
    char s[]  = "ABC";
    char t[3] = "DEF";
The "DEF" string doesn't end with a zero.
And is, therefore, not a string.
Is 'string' given a special meaning in the standard?
Yes. Things that work with the strX functions. Which means they are
'\0' terminated.
   char *s;
define a pointer to a such string, or can it be any kind of data? For
That points to a char. That could be followed by more chars and if one
of those is a '\0', then it's a string. You know this.
example, `char*` is used by the GetOpenFileName WinAPI function for a
/series/ of zero-terminated strings which itself is terminated with two
zero bytes.
That is a windowsism, then.

Why didn't they use the NULL terminated char **argv kind of thing?
So it is some property that is attributed to the data that will be stored.
I normally use `cstring` or `stringz` outside the language when referring
to a zero-terminated sequence of characters, which implies that
embedded zeros aren't allowed.
Chris M. Thomasson
2024-02-29 21:10:54 UTC
Permalink
Post by Richard Harnden
[...]
Post by bart
AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:
     char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
Strings *by definition* cannot have embedded zeros.  A null character
terminates a string.
A string literal can have embedded \0 characters, but if you're
suggesting that #embed should expand to a string literal, I can see
several disadvantages and no significant advantages.  For one thing, the
data may or may not end with a null character; string literals always
do.
     char s[]  = "ABC";
     char t[3] = "DEF";
The "DEF" string doesn't end with a zero.
And is, therefore, not a string.
[...]

Right. However, is this a string or two embedded strings:

char x[] = "ABC\0DEF"
_____________________
#include <stdio.h>

int main()
{
char x[] = "ABC\0DEF";

printf("%s\n", x);
printf("%s", x + 4);

return 0;
}
_____________________

Any undefined behavior here?
Keith Thompson
2024-02-29 21:45:26 UTC
Permalink
[...]
Post by Chris M. Thomasson
Post by Richard Harnden
     char s[]  = "ABC";
     char t[3] = "DEF";
The "DEF" string doesn't end with a zero.
And is, therefore, not a string.
[...]
char x[] = "ABC\0DEF"
Is *what* a string or two embedded strings?

x as a whole does not contain a string, but there are 8 strings within it.
Post by Chris M. Thomasson
_____________________
#include <stdio.h>
int main()
{
char x[] = "ABC\0DEF";
printf("%s\n", x);
printf("%s", x + 4);
return 0;
}
_____________________
Any undefined behavior here?
No, and none in this:

for (size_t i = 0; i < sizeof x; i ++) {
printf("%s\n", x+i);
}
Chris M. Thomasson
2024-02-29 22:03:23 UTC
Permalink
Post by Keith Thompson
[...]
Post by Chris M. Thomasson
Post by Richard Harnden
     char s[]  = "ABC";
     char t[3] = "DEF";
The "DEF" string doesn't end with a zero.
And is, therefore, not a string.
[...]
char x[] = "ABC\0DEF"
^^^^^^^^^^^^^^^^^^^^^
Post by Keith Thompson
Is *what* a string or two embedded strings?
For some reason, I see two "embedded" "C strings" wrt null termination here.
Post by Keith Thompson
x as a whole does not contain a string, but there are 8 strings within it.
Well, I was referring to that null terminator. So I see two "C strings"
separated by an offset? Wrt using std functions that terminate on a '\0'?

I think I am misunderstanding you. Sorry... ;^o
Post by Keith Thompson
Post by Chris M. Thomasson
_____________________
#include <stdio.h>
int main()
{
char x[] = "ABC\0DEF";
printf("%s\n", x);
printf("%s", x + 4);
return 0;
}
_____________________
Any undefined behavior here?
for (size_t i = 0; i < sizeof x; i ++) {
printf("%s\n", x+i);
}
Keith Thompson
2024-02-29 22:14:52 UTC
Permalink
Post by Chris M. Thomasson
Post by Keith Thompson
[...]
Post by Chris M. Thomasson
Post by Richard Harnden
     char s[]  = "ABC";
     char t[3] = "DEF";
The "DEF" string doesn't end with a zero.
And is, therefore, not a string.
[...]
char x[] = "ABC\0DEF"
^^^^^^^^^^^^^^^^^^^^^
Post by Keith Thompson
Is *what* a string or two embedded strings?
For some reason, I see two "embedded" "C strings" wrt null termination here.
I was asking you to clarify what you're referring to. What you posted
is a declaration, not a string. You could have been referring to the
object x, to the string literal, to subsets of either, or to something else.
Post by Chris M. Thomasson
Post by Keith Thompson
x as a whole does not contain a string, but there are 8 strings within it.
Well, I was referring to that null terminator. So I see two "C
strings" separated by an offset? Wrt using std functions that
terminate on a '\0'?
I think I am misunderstanding you. Sorry... ;^o
I thought I explained it, but here's a more straightforward explanation.
And the term "string" is defined by 7.1.1p1, not by the behavior of
standard functions: "A *string* is a contiguous sequence of characters
terminated by and including the first null character."

At run time, the array `x` contains:
- A string of length 3 starting at index 0.
- A string of length 2 starting at index 1.
- A string of length 1 starting at index 2.
- A string of length 0 starting at index 3.
- A string of length 3 starting at index 4.
- A string of length 2 starting at index 5.
- A string of length 1 starting at index 6.
- A string of length 0 starting at index 7.

#include <stdio.h>
int main(void) {
char x[] = "ABC\0DEF";
for (size_t i = 0; i < sizeof x; i ++) {
printf("x[%zu] is a pointer to the string \"%s\"\n", i, x+i);
}
}

Output:
x[0] is a pointer to the string "ABC"
x[1] is a pointer to the string "BC"
x[2] is a pointer to the string "C"
x[3] is a pointer to the string ""
x[4] is a pointer to the string "DEF"
x[5] is a pointer to the string "EF"
x[6] is a pointer to the string "F"
x[7] is a pointer to the string ""
Chris M. Thomasson
2024-03-02 21:48:52 UTC
Permalink
Post by Keith Thompson
Post by Chris M. Thomasson
Post by Keith Thompson
[...]
Post by Chris M. Thomasson
Post by Richard Harnden
     char s[]  = "ABC";
     char t[3] = "DEF";
The "DEF" string doesn't end with a zero.
And is, therefore, not a string.
[...]
char x[] = "ABC\0DEF"
^^^^^^^^^^^^^^^^^^^^^
Post by Keith Thompson
Is *what* a string or two embedded strings?
For some reason, I see two "embedded" "C strings" wrt null termination here.
I was asking you to clarify what you're referring to. What you posted
is a declaration, not a string. You could have been referring to the
object x, to the string literal, to subsets of either, or to something else.
Post by Chris M. Thomasson
Post by Keith Thompson
x as a whole does not contain a string, but there are 8 strings within it.
Well, I was referring to that null terminator. So I see two "C
strings" separated by an offset? Wrt using std functions that
terminate on a '\0'?
I think I am misunderstanding you. Sorry... ;^o
I thought I explained it, but here's a more straightforward explanation.
And the term "string" is defined by 7.1.1p1, not by the behavior of
standard functions: "A *string* is a contiguous sequence of characters
terminated by and including the first null character."
[...]

Ahhh yes. Sorry for glossing over that point.
Lawrence D'Oliveiro
2024-03-05 04:48:38 UTC
Permalink
Post by Keith Thompson
"A *string* is a contiguous sequence of characters
terminated by and including the first null character."
So how come strlen(3) does not include the null?

David Brown
2024-02-29 15:30:05 UTC
Permalink
[...]
Post by bart
AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:
     char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
Strings *by definition* cannot have embedded zeros.  A null character
terminates a string.
A string literal can have embedded \0 characters, but if you're
suggesting that #embed should expand to a string literal, I can see
several disadvantages and no significant advantages.  For one thing, the
data may or may not end with a null character; string literals always
do.
    char s[]  = "ABC";
"ABC" is a "string literal". Once things like concatenation of adjacent
strings, macro expansion, etc., is complete, a null character is
appended to it. Then it is used as an initialiser for the array "s".
After initialisation, "s" is an array of 4 chars and contains a string.

(Note - a "string literal" might not be a "string", because string
literals can contain embedded nulls. This is a footnote in 6.4.5
describing string literals.)
    char t[3] = "DEF";
The "DEF" string doesn't end with a zero.
"DEF" is not a "string" - it is a "string literal". It does get a null
character appended during translation phase 7. But only the first three
characters - therefore not including the null character - get copied to
"t" during the initialisation of "t". "t" is an array of 3 chars, and
it does not contain a string.
Is 'string' given a special meaning in the standard?
Yes. See 7.1.1p1.
   char *s;
define a pointer to a such string, or can it be any kind of data? For
example, `char*` is used by the GetOpenFileName WinAPI function for a
/series/ of zero-terminated strings which itself is terminated with two
zero bytes.
"char *s;" declares a pointer to a char, or a pointer to the start of an
array of char. It is /not/ a string, or a pointer to a string. C
strings are values, and exist at run-time - they are not types. So "s"
can point to a string, or a char (which will be a string if and only if
it is a null character), or an array of chars (which may or may not
contain a string), or it could point to anything else.
So it is some property that is attributed to the data that will be stored.
Yes.
I normally use `cstring` or `stringz` outside the language when referring
to zero-terminated sequences of characters, which implies that
embedded zeros aren't allowed.
That makes sense. Different languages have different ways of holding
sequences of characters (generic "string" data), so you need to qualify
the term if it is not clear from the context.

But when we are discussing C, and there is no other qualification,
"string" means "C string" - the definition of "string" given in the C
standards.
Keith Thompson
2024-02-29 16:25:19 UTC
Permalink
David Brown <***@hesbynett.no> writes:
[...]
Post by David Brown
"char *s;" declares a pointer to a char, or a pointer to the start of
an array of char.
"char *s;" defines a pointer to char. At runtime, it may or may not
point to the initial element of an array object. (And for most
purposes, a char object is treated as 1-element array of char.)
Post by David Brown
It is /not/ a string, or a pointer to a string.
It may be a pointer to a string at run time, depending on what it
currently points to.
Post by David Brown
C strings are values, and exist at run-time - they are not types. So
"s" can point to a string, or a char (which will be a string if and
only if it is a null character), or an array of chars (which may or
may not contain a string), or it could point to anything else.
"s" can point to *the initial element of* an array of chars. If that
array contains a string, then (the value of) s is by definition a
pointer to a string.

[...]
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Keith Thompson
2024-02-29 16:18:52 UTC
Permalink
Post by bart
Post by Keith Thompson
[...]
Post by bart
AFAIK strings in C can have embedded zeros when not assumed to be
char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
Strings *by definition* cannot have embedded zeros. A null
character
terminates a string.
A string literal can have embedded \0 characters, but if you're
suggesting that #embed should expand to a string literal, I can see
several disadvantages and no significant advantages. For one thing, the
data may or may not end with a null character; string literals always
do.
char s[] = "ABC";
char t[3] = "DEF";
The "DEF" string doesn't end with a zero.
That's not strictly true, nor is it relevant.

For #embed, you generally don't know the length of the resulting
sequence of values, so you can't usually write something like:

const unsigned char data[N] = {
#embed "foo.dat"
};

And the way it's described in the standard is that "DEF" refers to an
array object that does include a trailing null byte, but only the first
3 characters are used to initialize t. Of course a compiler can
optimize that one byte away.
Post by bart
Is 'string' given a special meaning in the standard?
Of course it is. Surely you know that. 7.1.1 paragraph 1.

"abc\0def" is a valid string literal, but its value is not a string.
(No, the standard doesn't say that the value of a string literal is a
string.)

[...]
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Janis Papanagnou
2024-02-29 17:17:19 UTC
Permalink
Post by Keith Thompson
"abc\0def" is a valid string literal, but its value is not a string.
(No, the standard doesn't say that the value of a string literal is a
string.)
This sounds somewhat strange in my ears. Usually a literal for a type
will constitute an instance of the type. - I suppose the irregularity
stems from the fact that there's no explicit string object type in C.

Janis
Keith Thompson
2024-02-29 17:22:28 UTC
Permalink
Post by Janis Papanagnou
Post by Keith Thompson
"abc\0def" is a valid string literal, but its value is not a string.
(No, the standard doesn't say that the value of a string literal is a
string.)
This sounds somewhat strange in my ears. Usually a literal for a type
will constitute an instance of the type. - I suppose the irregularity
stems from the fact that there's no explicit string object type in C.
Exactly, "string" is not a type.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Kaz Kylheku
2024-02-29 19:26:42 UTC
Permalink
Post by Keith Thompson
Post by Janis Papanagnou
Post by Keith Thompson
"abc\0def" is a valid string literal, but its value is not a string.
(No, the standard doesn't say that the value of a string literal is a
string.)
This sounds somewhat strange in my ears. Usually a literal for a type
will constitute an instance of the type. - I suppose the irregularity
stems from the fact that there's no explicit string object type in C.
Exactly, "string" is not a type.
It is a type in the broader sense, in that is a logical proposition
about the attributes of an object that is true or false.

It's just not a type in the C static type system.

What that means is that there does not exist a constraint rule in
standard C requiring some expression or object to conform to the string
type. The concept "string" is not represented in the constraint system.
But it is a type concept.

(There are rules that require a string, but they are not constraint
rules. E.g. if strlen is given an argument which isn't a string, the
behavior is undefined.)

Consider:

char a[3] = "abc";
size_t l = strlen(a);

In the unlikely event that this example would capture the attention of a
computer scientist who researches type systems, he or she would identify
that as having a type error. (One that the C type system is too weak to
model.)

"Upper case letter" is also a type; that's why the header is called
<ctype.h>.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
James Kuyper
2024-02-29 19:45:44 UTC
Permalink
...>> Exactly, "string" is not a type.
Post by Kaz Kylheku
It is a type in the broader sense, in that is a logical proposition
about the attributes of an object that is true or false.
If I defined something to be a sequence of floating point numbers
terminated by a NaN, would that thing also qualify as a type, according
to the definition you're using?

Could you give a source for the definition of "type" that you're using?
Can you use the word "type" in a statement whose truth relies upon the
difference between that definition and the way that "type" is defined by
the C standard? Preferably it would be a useful statement that applies to C.
Keith Thompson
2024-02-28 21:41:44 UTC
Permalink
bart <***@freeuk.com> writes:
[...]
Post by bart
AFAIK strings in C can have embedded zeros when not assumed to be
char s[]={1,2,3,0,4,5,6};
s will have a length of 7.
s will have a *size* of 7. Its length, as a string, is 3. The
distinction between "length" and "size" is particularly important in
this case.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Keith Thompson
2024-02-28 21:57:09 UTC
Permalink
David Brown <***@hesbynett.no> writes:
[...]
Post by David Brown
They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).
Tests with prototype implementations gave extremely fast results.
I'm not sure how that would work. #embed is a preprocessor directive,
and at least in the abstract model it has to expand to valid C code.

I would have expected that it would simply generate the list of
comma-separated integer constants described in the standard; later
phases would simply parse that list and generate code as if that
sequence had been written in the original source file. Do you know of
an implementation that does something else?

For example, say you have a file "foo.dat" containing 4 bytes with
values 0, 1, 2, and 3. This would be perfectly valid:

struct foo {
unsigned char a;
unsigned short b;
unsigned int c;
double d;
};

struct foo obj = {
#embed "foo.dat"
};

#embed isn't defined to translate an input file to a sequence of bytes.
It's defined to translate an input file to a sequence of integer
constant expressions.

*Maybe* a compiler could optimize for the case where it knows that it's
being used to initialize an array of unsigned char, but (a) that would
require the preprocessor to have information that normally doesn't exist
until later phases, and (b) I'm not convinced it would be worth the
effort.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
bart
2024-02-28 23:01:22 UTC
Permalink
Post by Keith Thompson
[...]
Post by David Brown
They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).
Tests with prototype implementations gave extremely fast results.
I'm not sure how that would work. #embed is a preprocessor directive,
and at least in the abstract model it has to expand to valid C code.
I would have expected that it would simply generate the list of
comma-separated integer constants described in the standard; later
phases would simply parse that list and generate code as if that
sequence had been written in the original source file. Do you know of
an implementation that does something else?
For example, say you have a file "foo.dat" containing 4 bytes with
struct foo {
unsigned char a;
unsigned short b;
unsigned int c;
double d;
};
struct foo obj = {
#embed "foo.dat"
};
It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.

Certainly if you were to write it out to disk as binary, it would need
more than 4.
Post by Keith Thompson
#embed isn't defined to translate an input file to a sequence of bytes.
It's defined to translate an input file to a sequence of integer
constant expressions.
Maybe it should be defined exactly like that, because that is what
people might expect. Your example is better off using a normal text file
which contains an actual comma-delimited list (and which can mix ints
and floats), and a regular #include.
Keith Thompson
2024-02-28 23:31:17 UTC
Permalink
Post by bart
Post by Keith Thompson
[...]
Post by David Brown
They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).
Tests with prototype implementations gave extremely fast results.
I'm not sure how that would work. #embed is a preprocessor
directive,
and at least in the abstract model it has to expand to valid C code.
I would have expected that it would simply generate the list of
comma-separated integer constants described in the standard; later
phases would simply parse that list and generate code as if that
sequence had been written in the original source file. Do you know of
an implementation that does something else?
For example, say you have a file "foo.dat" containing 4 bytes with
struct foo {
unsigned char a;
unsigned short b;
unsigned int c;
double d;
};
struct foo obj = {
#embed "foo.dat"
};
It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.
Depending on the sizes and alignments of the various types, sure.
So what?

N3096 is the latest public C23 draft.

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf

#embed is defined in section 6.10.3.

The expansion of a #embed directive is a token sequence
formed from the list of integer constant expressions described
below. The group of tokens for each integer constant expression
in the list is separated in the token sequence from the group
of tokens for the previous integer constant expression in the
list by a comma. The sequence neither begins nor ends in a
comma. If the list of integer constant expressions is empty,
the token sequence is empty. The directive is replaced by its
expansion and, with the presence of certain embed parameters,
additional or replacement token sequences.

It's a preprocessor directive. The preprocessor operates on text and
preprocessing tokens, not on raw data. There is no way to directly
represent raw data in C source code. (I suppose string literals do so
to some extent, but they can't represent generalized raw data.)

The usage I described above is allowed. I see nothing unfortunate about
it. If you only want to use #embed with arrays of unsigned char, then
do that.

Its primary intended use is to read binary file contents at compile time
and allow a program to treat those contents as a raw representation,
particularly as the initialization for an array of unsigned char. There
was no reason to impose arbitrary restrictions to make it impossible to
use for any other purposes.

I suppose it would have been possible for #embed to expand to the raw
data itself, a binary copy of the input file. That would require C
source code, which currently is plain text, to be able to support
delimited chunks of binary data. It would require changes to portions
of the compiler after the preprocessor. Presumably you'd be able to
write the same representation directly in a C source file, which means
that C source files would no longer necessarily be representable as
text. I can see that causing all kinds of problems.

Fortunately, none of that was necessary, since the authors came up with
a way to define #embed in the preprocessor without making any other
changes to how C source code is represented. The fact that it can be
used in other odd ways doesn't bother me. The code I wrote above is
valid; I never said it was acceptable style.
Post by bart
Certainly if you were to write it out to disk as binary, it would need
more than 4.
Yes. So what?
Post by bart
Post by Keith Thompson
#embed isn't defined to translate an input file to a sequence of bytes.
It's defined to translate an input file to a sequence of integer
constant expressions.
Maybe it should be defined exactly like that, because that is what
people might expect. Your example is better off using a normal text
file which contains an actual comma-delimited list (and which can mix
ints and floats), and a regular #include.
I certainly wouldn't advocate writing code like the above. My point is
that, given the definition of #embed in the C23 standard, it's valid and
has well defined semantics.

If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
bart
2024-02-29 00:47:25 UTC
Permalink
Post by Keith Thompson
Post by bart
It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.
Depending on the sizes and alignments of the various types, sure.
So what?
If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.
My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.

DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.

They might even be mixed type. Or it might be an example like this:

A binary file contains 8 bytes representing one IEEE754 float value. It
is desired to use that to initialise a double array of one element.

However #embed will turn that into 8 integer values of 0 to 255 each (I assume).

It's not clear either what happens when one of the integers has the
value 150, say, but it is used to initialise an element of type (signed)
char. It sounds like it would make it hard to initialise a char[] array,
when char is signed, from a file of UTF8 text.

Basically, #embed is dumb.

For flexibility, I wouldn't use #embed at all. Just have an actual
comma-separated set of values in a text file, and use #include instead.
Keith Thompson
2024-02-29 01:12:03 UTC
Permalink
Post by bart
Post by Keith Thompson
Post by bart
It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.
Depending on the sizes and alignments of the various types, sure.
So what?
If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.
My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as
tokens, and to then parse those 100,000 expressions into AST nodes
etc.
I suggest that (a) parsing those 100,000 byte values isn't likely to be
a huge deal (if you have actual performance figures that contradict
that, feel free to present them), and (b) any solution that doesn't
involve expanding to C source code would require a lot more work to
implement for very little benefit.
Post by bart
DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.
Neither gcc nor clang implements #embed yet. DB mentioned prototype
implementations. I've asked him for more information.
Post by bart
A binary file contains 8 bytes representing one IEEE754 float
value. It is desired to use that to initialise a double array of one
element.
However #embed will turn that into 8 integer values of 0 to 255 each (I assume).
Assuming CHAR_BIT==8, yes. You can use it to initialize a union, or use
memcpy() to copy from an array of unsigned char into a double object.
(Storing double values in binary files is uncommon, but it's certainly
possible.)
Post by bart
It's not clear either what happens when one of the integers has the
value 150, say, but it is used to initialise an element of type
(signed) char. It sounds like it would make it hard to initialise a
char[] array, when char is signed, from a file of UTF8 text.
Say you have a binary file containing a single byte with the value 150
(when interpreted as an 8-bit unsigned char). Then
#embed "file.dat"
will expand to something like
150
or
0x96

So if you write:

char array[] = {
#embed "file.dat"
};

then it's treated exactly the same as
char array[] = { 150 };

If plain char is signed, then the result of the conversion is
implementation-defined, but is very very likely to result in a value of
-106.

I expect that 99% of the uses of #embed will be to initialize arrays of
unsigned char (or uint8_t). For that purpose, it should work just fine.
Post by bart
Basically, #embed is dumb.
Do you object to the fact that the authors didn't add additional
arbitrary restrictions to forbid uses that you don't like?
Post by bart
For flexibility, I wouldn't use #embed at all. Just have an actual
comma-separated set of values in a text file, and use #include
instead.
And you can still do that.

If you have a png image file and you want to include its contents in
your C program, you can use a separate program to translate the file to
C source and #include the result, or you can use `#embed "foo.png"`.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
tTh
2024-02-29 15:29:24 UTC
Permalink
Post by bart
My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.
But you HAVE to do that if #embed is in the preprocessor,
because its job is to give compilable text to the real
compiler. No other way is possible.
Post by bart
Basically, #embed is dumb.
No.
--
+---------------------------------------------------------------------+
| https://tube.interhacker.space/a/tth/video-channels |
+---------------------------------------------------------------------+
Scott Lurndal
2024-02-29 16:15:47 UTC
Permalink
Post by tTh
Post by bart
My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.
But you HAVE to do that if #embed is in the preprocessor,
because its job is to give compilable text to the real
compiler. No other way is possible.
The standard does not require the preprocessor to be
separate from a 'real' compiler. It's acceptable for an implementation
to implement both in a single executable. Absent -E, the
preprocessor and compiler can cooperate to efficiently handle
#embed without generating parseable C code.
Scott Lurndal
2024-02-29 15:53:04 UTC
Permalink
Post by bart
Post by Keith Thompson
Post by bart
It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.
Depending on the sizes and alignments of the various types, sure.
So what?
If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.
My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.
DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.
An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.
Keith Thompson
2024-02-29 17:06:23 UTC
Permalink
Post by Scott Lurndal
Post by bart
Post by Keith Thompson
Post by bart
It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.
Depending on the sizes and alignments of the various types, sure.
So what?
If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.
My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.
DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.
An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.
Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Scott Lurndal
2024-02-29 17:28:33 UTC
Permalink
Post by Keith Thompson
Post by Scott Lurndal
Post by bart
Post by Keith Thompson
Post by bart
It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.
Depending on the sizes and alignments of the various types, sure.
So what?
If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.
My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.
DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.
An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.
Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.
Would it? Or could it simply assume that the binary blob
is already in the same binary format that writing an instance
of the structure from a C application on the same host would have created?
David Brown
2024-02-29 17:58:49 UTC
Permalink
Post by Scott Lurndal
Post by Keith Thompson
Post by Scott Lurndal
Post by bart
Post by Keith Thompson
Post by bart
It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.
Depending on the sizes and alignments of the various types, sure.
So what?
If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.
My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.
DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.
An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.
Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.
Would it? Or could it simply assume that the binary blob
is already in the same binary format that writing an instance
of the structure from a C application on the same host would have created?
That would depend on the sizes of the fields in the struct, and the size
of the integer constants in the #embed. The norm for #embed will be
unsigned char integer constants, so it will only be a direct fit for the
binary representation of the struct if all the struct fields are
compatible with that. But a compiler could have vendor parameters on
the #embed to change those sizes.
Scott Lurndal
2024-02-29 18:05:50 UTC
Permalink
Post by David Brown
Post by Scott Lurndal
Post by Keith Thompson
Post by Scott Lurndal
Post by bart
Post by Keith Thompson
Post by bart
It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.
Depending on the sizes and alignments of the various types, sure.
So what?
If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.
My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.
DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.
An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.
Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.
Would it? Or could it simply assume that the binary blob
is already in the same binary format that writing an instance
of the structure from a C application on the same host would have created?
That would depend on the sizes of the fields in the struct, and the size
of the integer constants in the #embed.
I'm embedding a binary file. I want the representation in memory
to be _exactly_ the same as in the file, regardless of how it is
defined in the C code (array of char, array of int, array of long, struct whatever).
Scott Lurndal
2024-02-29 18:09:52 UTC
Permalink
Post by Scott Lurndal
Post by David Brown
Post by Scott Lurndal
Post by Keith Thompson
Post by Scott Lurndal
An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.
Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.
Would it? Or could it simply assume that the binary blob
is already in the same binary format that writing an instance
of the structure from a C application on the same host would have created?
That would depend on the sizes of the fields in the struct, and the size
of the integer constants in the #embed.
I'm embedding a binary file. I want the representation in memory
to be _exactly_ the same as in the file, regardless of how it is
defined in the C code (array of char, array of int, array of long, struct whatever).
I have an actual use case today where #embed of a (C++) std::map binary
object created by separate tool would be very useful. I'm
planning on using mmap to load it at runtime at the moment.
Lawrence D'Oliveiro
2024-02-29 21:27:52 UTC
Permalink
Post by Scott Lurndal
I have an actual use case today where #embed of a (C++) std::map binary
object created by separate tool would be very useful. I'm planning on
using mmap to load it at runtime at the moment.
Why not convert it to a .o file and statically link it into your program
as part of the build process?
bart
2024-03-01 11:52:16 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Scott Lurndal
I have an actual use case today where #embed of a (C++) std::map binary
object created by separate tool would be very useful. I'm planning on
using mmap to load it at runtime at the moment.
Why not convert it to a .o file and statically link it into your program
as part of the build process?
That's exactly what #embed will enable.
Lawrence D'Oliveiro
2024-03-05 04:47:18 UTC
Permalink
Post by bart
Post by Lawrence D'Oliveiro
Post by Scott Lurndal
I have an actual use case today where #embed of a (C++) std::map
binary object created by separate tool would be very useful. I'm
planning on using mmap to load it at runtime at the moment.
Why not convert it to a .o file and statically link it into your
program as part of the build process?
That's exactly what #embed will enable.
You can call it a toy version of objcopy
<https://manpages.debian.org/1/objcopy.1.html>.
David Brown
2024-02-29 19:51:11 UTC
Permalink
Post by Scott Lurndal
Post by David Brown
Post by Scott Lurndal
Post by Keith Thompson
Post by Scott Lurndal
Post by bart
Post by Keith Thompson
Post by bart
It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.
Depending on the sizes and alignments of the various types, sure.
So what?
If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.
My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.
DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.
An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.
Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.
Would it? Or could it simply assume that the binary blob
is already in the same binary format that writing an instance
of the structure from a C application on the same host would have created?
That would depend on the sizes of the fields in the struct, and the size
of the integer constants in the #embed.
I'm embedding a binary file. I want the representation in memory
to be _exactly_ the same as in the file, regardless of how it is
defined in the C code (array of char, array of int, array of long, struct whatever).
Then you would want a union of the struct type and an appropriately
sized unsigned char array, and initialise the unsigned char area with
the bytes of the file using #embed.
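A minimal sketch of that union approach, assuming a hypothetical 8-byte record layout (struct record and its fields are invented for illustration, not taken from the thread). A literal byte list stands in for the #embed expansion so that the sketch compiles without C23 support; the comment shows where the directive would go.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical on-disk record; this field layout is an assumption. */
struct record {
    uint8_t  tag;
    uint8_t  flags;
    uint16_t count;
    uint32_t value;
};

/* The byte array is the first member, so it is the one that gets initialised. */
union record_image {
    unsigned char bytes[sizeof(struct record)];
    struct record rec;
};

static_assert(sizeof(union record_image) == sizeof(struct record),
              "the byte view covers the struct exactly");

/* With C23 the byte list below would be written as:
 *     .bytes = {
 *     #embed "record.bin"
 *     }
 * The literal values are placeholders for the file contents. */
const union record_image image = {
    .bytes = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08 }
};
```

Reading image.rec then gives the struct view of exactly those bytes. How the multi-byte fields come out depends on the host's byte order and padding, which is the same caveat that applies to loading the file with mmap.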
David Brown
2024-02-29 09:10:10 UTC
Permalink
Post by Keith Thompson
[...]
Post by David Brown
They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).
Tests with prototype implementations gave extremely fast results.
I'm not sure how that would work. #embed is a preprocessor directive,
and at least in the abstract model it has to expand to valid C code.
I would have expected that it would simply generate the list of
comma-separated integer constants described in the standard; later
phases would simply parse that list and generate code as if that
sequence had been written in the original source file. Do you know of
an implementation that does something else?
The key thing, as I understand it, is that the compiler gets to know
that the integers in the list are all "nice". And since the
preprocessor and the compiler are part of the same implementation (even
if they are separate programs communicating with pipes or temporary
files), the preprocessor could pass on the binary blob in a pre-parsed form.

Think about what a preprocessor and compiler do with the initialisers
in an array, written in normal text (such as by using "xxd -i" or
another external script). For each integer, it has to divide up the
tokens, identify the comma, parse the integer, check that it is a valid
integer, figure out its type based on the size (and suffix, if any). It
needs to record the line number and column number for possible later
reference in error or warning messages. It has to check the value of
the integer against the type for the array elements, and possibly change
the value to suit, or issue warnings for out-of-range values. It has to
allocate all the space to store this information as it goes along,
without knowing the size of the array - so it will be lots of small
mallocs and/or wasted space. It's a /lot/. (Simpler compilers can get
away with a bit less effort, especially if they have more limited warnings.)

With #embed, the preprocessor can generate a compiler-specific "start of
embed" informational directive (much like "#line" directives and such
things generated by preprocessors today), then the data in a very
specific format, then an "end of embed" directive. It could, for
example, generate all the integers in the format "0xAB, " with 16
elements per line. The compiler wouldn't need to parse the data
normally - it knows exactly how many elements there are (from the "start
of embed" directive), it knows exactly where to find each entry (as each
is 6 characters long), it only needs to look at two of these characters,
there are never any errors, the source line number is fixed (at the #embed
line), and so on.
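To make the fixed-format idea concrete, here is a rough sketch of how a compiler front end might decode such text. The exact layout (entries of the form "0xAB, ", 16 per line, one newline after each full line) and the function names are assumptions for illustration; no real compiler is claimed to work this way.

```c
#include <assert.h>
#include <stddef.h>

enum { ENTRY_WIDTH = 6, ENTRIES_PER_LINE = 16 };

/* Value of one hex digit. The input is trusted because it is generated
 * by the preprocessor, so there is no error path. Assumes ASCII. */
static unsigned hex_digit(char c) {
    if (c >= '0' && c <= '9') return (unsigned)(c - '0');
    if (c >= 'a' && c <= 'f') return (unsigned)(c - 'a') + 10u;
    return (unsigned)(c - 'A') + 10u;
}

/* Decode `count` entries without tokenising: the position of every
 * entry is computed directly, and only two characters are examined. */
void decode_fixed_entries(const char *text, size_t count, unsigned char *out) {
    for (size_t i = 0; i < count; i++) {
        /* i / ENTRIES_PER_LINE newlines precede entry i */
        const char *entry = text + i * ENTRY_WIDTH + i / ENTRIES_PER_LINE;
        out[i] = (unsigned char)((hex_digit(entry[2]) << 4) | hex_digit(entry[3]));
    }
}
```

Because the element count is known up front, the output buffer can be allocated once, which avoids the many small allocations described earlier.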

A more tightly coupled preprocessor and compiler can do even better -
for array initialisation, the binary blob could be used directly without
ever generating integer constants or parsing them.

The results of testing are that #embed is /massively/ faster and lower
memory compared to external generators, especially for larger files.
And it gives you the data on-hand for optimisation purposes, unlike
external direct linking of binary blobs. (So you can get the size of
the array, or use values from it as compile-time known values.)
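A small sketch of the difference, with a literal list standing in for the #embed expansion. The names table, linked_blob and linked_blob_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for:
 *     const unsigned char table[] = {
 *     #embed "table.bin"
 *     };
 * The initialiser is visible in this translation unit, so the array
 * type is complete and the optimiser can see every value. */
const unsigned char table[] = { 0x10, 0x20, 0x30, 0x40 };

/* The length is a constant expression, usable in static_assert. */
static_assert(sizeof table == 4, "unexpected table size");

size_t table_length(void) {
    return sizeof table / sizeof table[0];
}

/* A blob linked in from a separate object file is visible only as an
 * incomplete array type: `sizeof linked_blob` would not compile here,
 * so the size has to be supplied separately (hypothetical symbols,
 * declared but never referenced in this sketch). */
extern const unsigned char linked_blob[];
extern const size_t linked_blob_size;
```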
Post by Keith Thompson
For example, say you have a file "foo.dat" containing 4 bytes with
struct foo {
unsigned char a;
unsigned short b;
unsigned int c;
double d;
};
struct foo obj = {
#embed "foo.dat"
};
#embed isn't defined to translate an input file to a sequence of bytes.
It's defined to translate an input file to a sequence of integer
constant expressions.
Yes. But the main speed (and memory usage) gains are for large files,
and that means array initialisers. That does not conflict with using
it for cases like yours.
Post by Keith Thompson
*Maybe* a compiler could optimize for the case where it knows that it's
being used to initialize an array of unsigned char, but (a) that would
require the preprocessor to have information that normally doesn't exist
until later phases, and (b) I'm not convinced it would be worth the
effort.
Look at
<https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1040r6.html#design-practice-speed>.

In those tests, for a 40 MB file gcc #embed is 200 times faster than
"xxd -i" generated files, and takes about 2.5% of the memory. It scales
to 1 GB files. And that's just a proof-of-concept implementation.
bart
2024-02-29 10:18:21 UTC
Permalink
Post by David Brown
Post by Keith Thompson
*Maybe* a compiler could optimize for the case where it knows that it's
being used to initialize an array of unsigned char, but (a) that would
require the preprocessor to have information that normally doesn't exist
until later phases, and (b) I'm not convinced it would be worth the
effort.
Look at
<https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1040r6.html#design-practice-speed>.
In those tests, for a 40 MB file gcc #embed is 200 times faster than
"xxd -i" generated files, and takes about 2.5% of the memory.  It scales
to 1 GB files.  And that's just a proof-of-concept implementation.
I've just done my own tests, with a 40MB data file containing random
A..Z letters (so can be processed as a text file).

This was also converted to a 120MB text file containing a list of numbers
("65,66,73,...", 3 characters for each data byte).

Using 'strinclude' in my old C compiler, it took about 1 second to build
this program:

#include <stdio.h>
#include <string.h>

char* s=strinclude("data");

int main(void) {
printf("%zu\n", strlen(s));
}

(Running it shows '40000000'.) The same test in my language (which has
no intermediate ASM stage) took 0.3 seconds.

Next I tried instead that 120MB text file containing the same data but
as text, initialising a char[] array using #include.

Tcc took 12 seconds. Bcc took 56 seconds (via ASM etc).

gcc got up to about 3GB memory usage then 'cc1' failed trying to
allocate 0.5GB, after about a minute.

Processing a long list of numbers DOES use considerable resources. Bear in
mind that #embed also needs to take binary data and generate tokens,
possibly converting each binary number to text.
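For reference, the byte-to-text conversion being discussed can be sketched as below. This is only an illustration of the "65,66,73,..." form used in the test above; the function name and the trailing comma after the last value are assumptions, and it is not how any particular preprocessor is known to do it.

```c
#include <stddef.h>
#include <stdio.h>

/* Write each byte as a decimal number followed by a comma.
 * Returns the number of characters written, excluding the terminator.
 * Stops early if the next entry would not fit in the buffer. */
size_t bytes_to_decimal_list(const unsigned char *data, size_t n,
                             char *out, size_t out_size) {
    size_t used = 0;
    if (out_size > 0) out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(out + used, out_size - used, "%u,", (unsigned)data[i]);
        if (w < 0 || (size_t)w >= out_size - used) {
            if (out_size > 0) out[used] = '\0';  /* drop the partial entry */
            break;
        }
        used += (size_t)w;
    }
    return used;
}
```

Letters A..Z are two-digit values, so each data byte becomes three characters of text, which matches the 40MB to 120MB growth reported above.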
tTh
2024-02-29 15:34:24 UTC
Permalink
Post by bart
Using 'strinclude' in my old C compiler, it took about 1 second to build
  #include <stdio.h>
  #include <string.h>
  char* s=strinclude("data");
  int main(void) {
     printf("%zu\n", strlen(s));
 }
***@redlady:~/Desktop$ man strinclude
No manual entry for strinclude
***@redlady:~/Desktop$
--
+---------------------------------------------------------------------+
| https://tube.interhacker.space/a/tth/video-channels |
+---------------------------------------------------------------------+
bart
2024-03-01 11:58:49 UTC
Permalink
Post by tTh
Post by bart
Using 'strinclude' in my old C compiler, it took about 1 second to
   #include <stdio.h>
   #include <string.h>
   char* s=strinclude("data");
   int main(void) {
      printf("%zu\n", strlen(s));
  }
No manual entry for strinclude
'strinclude' is an extension I made for that compiler.

#embed is the new feature of C23. Although I'm not sure how it would be
used to initialise a char* pointer. Perhaps like this:

char dummy[] = {
#embed "data"
,0};
char* s = dummy;

(I've added a 0-terminator here; I don't know if #embed will take care
of that.)

My 'strinclude' produces a zero-terminated string, but it is done within
the parser rather than lexer.
David Brown
2024-03-01 12:17:01 UTC
Permalink
Post by bart
Post by tTh
Post by bart
Using 'strinclude' in my old C compiler, it took about 1 second to
   #include <stdio.h>
   #include <string.h>
   char* s=strinclude("data");
   int main(void) {
      printf("%zu\n", strlen(s));
  }
No manual entry for strinclude
'strinclude' is an extension I made for that compiler.
#embed is the new feature of C23. Although I'm not sure how it would be
    char dummy[]  {
    #embed "data"
    ,0};
    char* s = dummy;
(I've added a 0-terminator here; I don't know if #embed will take care
of that.)
#embed very specifically does not add anything. So you would do:

const char s[] = {
#embed "data" suffix(,)
0
};

The "suffix" parameter adds a comma if "data" is not empty, and does
nothing if "data" is empty. Writing it as you did would work fine for
non-empty "data" but give the nonsensical result {,0} if "data" is
empty. (You might not care about such cases and prefer to write the
simpler version, but now you also know about "suffix".)
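The two cases can be illustrated without a C23 compiler by writing out by hand what the directive would expand to. The macros below are stand-ins invented for this sketch; they are not part of C23.

```c
#include <assert.h>

/* What `#embed "data" suffix(,)` would expand to in each case. */
#define EMBED_NONEMPTY_WITH_SUFFIX 104, 105,  /* data holds "hi": suffix is appended */
#define EMBED_EMPTY_WITH_SUFFIX               /* data is empty: suffix is dropped    */

const char from_nonempty[] = { EMBED_NONEMPTY_WITH_SUFFIX 0 };  /* { 104, 105, 0 } */
const char from_empty[]    = { EMBED_EMPTY_WITH_SUFFIX 0 };     /* { 0 }           */

/* Both forms are valid initialisers and both end in a terminator. */
static_assert(sizeof from_nonempty == 3, "two data bytes plus terminator");
static_assert(sizeof from_empty == 1, "terminator only");
```

Writing the comma outside the directive instead would give { 104, 105, 0 } in the first case but {,0} in the second, which does not compile.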

There is no need to have a separate character pointer variable - the
const char array can be used directly in most circumstances.
Post by bart
My 'strinclude' produces a zero-terminated string, but it is done within
the parser rather than lexer.
Keith Thompson
2024-02-29 17:03:45 UTC
Permalink
Post by David Brown
Post by Keith Thompson
[...]
Post by David Brown
They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).
Tests with prototype implementations gave extremely fast results.
I'm not sure how that would work. #embed is a preprocessor
directive,
and at least in the abstract model it has to expand to valid C code.
I would have expected that it would simply generate the list of
comma-separated integer constants described in the standard; later
phases would simply parse that list and generate code as if that
sequence had been written in the original source file. Do you know of
an implementation that does something else?
The key thing, as I understand it, is that the compiler gets to know
that the integers in the list are all "nice". And since the
preprocessor and the compiler are part of the same implementation
(even if they are separate programs communicating with pipes or
temporary files), the preprocessor could pass on the binary blob in a
pre-parsed form.
[...]

Sure, an implementation *could* optimize #embed so it expands to some
implementation-defined nonstandard form that later phases can treat as
raw data. But since it's defined as a preprocessor directive, it's
difficult to see how it could do so while covering all cases.

[...]
Post by David Brown
The results of testing are that #embed is /massively/ faster and lower
memory compared to external generators, especially for larger files.
And it gives you the data on-hand for optimisation purposes, unlike
external direct linking of binary blobs. (So you can get the size of
the array, or use values from it as compile-time known values.)
What testing? The very latest versions of gcc and clang (I checked both
their git repos yesterday) do not yet implement #embed.
Post by David Brown
Post by Keith Thompson
For example, say you have a file "foo.dat" containing 4 bytes with
struct foo {
unsigned char a;
unsigned short b;
unsigned int c;
double d;
};
struct foo obj = {
#embed "foo.dat"
};
#embed isn't defined to translate an input file to a sequence of bytes.
It's defined to translate an input file to a sequence of integer
constant expressions.
Yes. But the main speed (and memory usage) gains are for large files,
and that means array initialisers. That does not conflict with using
it for cases like yours.
So a compiler that does this would have to be able to handle

struct foo obj = {
#blob
<binary data>
#endblob
};

and initialize a, b, c, and d to 0, 1, 2, and 3.0, respectively from
successive bytes of the binary data. Either that, or the preprocessor
would have to use information it doesn't have to determine how to expand
#embed.
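For comparison, this is what the integer-constant model gives for that struct, with the list 0, 1, 2, 3 written out by hand in place of the #embed expansion (an assumption made so the sketch compiles without C23 support).

```c
#include <assert.h>

struct foo {
    unsigned char a;
    unsigned short b;
    unsigned int c;
    double d;
};

/* Stand-in for:
 *     struct foo obj = {
 *     #embed "foo.dat"
 *     };
 * Each constant initialises one member by value, with the usual
 * conversions; the file's bytes are not copied into the object. */
const struct foo obj = { 0, 1, 2, 3 };

/* Holds on typical hosts: four file bytes cannot be the object's image. */
static_assert(sizeof(struct foo) > 4, "struct is larger than the file");
```

So a four-byte file fills an object that is typically 16 bytes, and obj.d ends up as 3.0 rather than as whatever bit pattern the raw bytes would form.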
Post by David Brown
Post by Keith Thompson
*Maybe* a compiler could optimize for the case where it knows that it's
being used to initialize an array of unsigned char, but (a) that would
require the preprocessor to have information that normally doesn't exist
until later phases, and (b) I'm not convinced it would be worth the
effort.
Look at
<https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1040r6.html#design-practice-speed>.
In those tests, for a 40 MB file gcc #embed is 200 times faster than
"xxd -i" generated files, and takes about 2.5% of the memory. It
scales to 1 GB files. And that's just a proof-of-concept
implementation.
That's for std::embed, a proposed C++ feature that's *not* defined as a
preprocessor directive. Sample usage from the paper:

constexpr std::span<const std::byte> fxaa_binary =
std::embed( "fxaa.spirv" );

So the compiler knows the type of the object being initialized.

(Note that the author of that C++ paper is also the editor for the C
standard.)

I'm still skeptical that C's #embed will actually be implemented other
than as expanding to a sequence of integer constants.

On the other hand, C23 allows for additional implementation-defined
parameters to #embed (as well as the standard embed parameters limit,
prefix, suffix, and if_empty). Such a parameter could specify how it's
expanded, perhaps to some implementation-defined blob format. *If*
compilers optimize #embed to something other than a sequence of integer
constant expressions, that's probably how it would be done. But since
neither gcc nor clang implements #embed at all, it may be too early to
speculate.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
David Brown
2024-02-29 18:08:22 UTC
Permalink
Post by Keith Thompson
Post by David Brown
Post by Keith Thompson
[...]
Post by David Brown
They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).
Tests with prototype implementations gave extremely fast results.
I'm not sure how that would work. #embed is a preprocessor
directive,
and at least in the abstract model it has to expand to valid C code.
I would have expected that it would simply generate the list of
comma-separated integer constants described in the standard; later
phases would simply parse that list and generate code as if that
sequence had been written in the original source file. Do you know of
an implementation that does something else?
The key thing, as I understand it, is that the compiler gets to know
that the integers in the list are all "nice". And since the
preprocessor and the compiler are part of the same implementation
(even if they are separate programs communicating with pipes or
temporary files), the preprocessor could pass on the binary blob in a
pre-parsed form.
[...]
Sure, an implementation *could* optimize #embed so it expands to some
implementation-defined nonstandard form that later phases can treat as
raw data. But since it's defined as a preprocessor directive, it's
difficult to see how it could do so while covering all cases.
It would require a strong link between the compiler and the preprocessor
- as you know, these don't have to be separate programs. In a more
weakly coupled system, there could still be a method for passing a
binary blob to the compiler in addition to the integer data, and let the
compiler use whichever form it preferred (based on what your code does
with the data).
Post by Keith Thompson
[...]
Post by David Brown
The results of testing are that #embed is /massively/ faster and lower
memory compared to external generators, especially for larger files.
And it gives you the data on-hand for optimisation purposes, unlike
external direct linking of binary blobs. (So you can get the size of
the array, or use values from it as compile-time known values.)
What testing? The very latest versions of gcc and clang (I checked both
their git repos yesterday) do not yet implement #embed.
I believe prototypes, tests, or proofs of concept have been made for
gcc, clang and perhaps other tools. I posted a link to some results -
more are floating around the internet if you want to look for them.
Post by Keith Thompson
Post by David Brown
Post by Keith Thompson
For example, say you have a file "foo.dat" containing 4 bytes with
struct foo {
unsigned char a;
unsigned short b;
unsigned int c;
double d;
};
struct foo obj = {
#embed "foo.dat"
};
#embed isn't defined to translate an input file to a sequence of bytes.
It's defined to translate an input file to a sequence of integer
constant expressions.
Yes. But the main speed (and memory usage) gains are for large files,
and that means array initialisers. That does not conflict with using
it for cases like yours.
So a compiler that does this would have to be able to handle
struct foo obj = {
#blob
<binary data>
#endblob
};
and initialize a, b, c, and d to 0, 1, 2, and 3.0, respectively from
successive bytes of the binary data. Either that, or the preprocessor
would have to use information it doesn't have to determine how to expand
#embed.
I think I've covered how that could be handled. (And I don't know how
it /will/ be handled. But I am sure compiler implementers will figure a
way to make it work correctly for any use of the integer constant list,
while also making it as efficient as they reasonably can for the common
case of initialising an unsigned char array.)
Post by Keith Thompson
Post by David Brown
Post by Keith Thompson
*Maybe* a compiler could optimize for the case where it knows that it's
being used to initialize an array of unsigned char, but (a) that would
require the preprocessor to have information that normally doesn't exist
until later phases, and (b) I'm not convinced it would be worth the
effort.
Look at
<https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1040r6.html#design-practice-speed>.
In those tests, for a 40 MB file gcc #embed is 200 times faster than
"xxd -i" generated files, and takes about 2.5% of the memory. It
scales to 1 GB files. And that's just a proof-of-concept
implementation.
That's for std::embed, a proposed C++ feature that's *not* defined as a
constexpr std::span<const std::byte> fxaa_binary =
std::embed( "fxaa.spirv" );
So the compiler knows the type of the object being initialized.
(Note that the author of that C++ paper is also the editor for the C
standard.)
The work on #embed is being done simultaneously for C and C++.
std::embed() gives you a slightly different way to write it, but the
implementation is the same. (Not unlike _Pragma and #pragma in C.)

Other pages I have seen with speed tests show the same pattern while
referring explicitly to #embed.
Post by Keith Thompson
I'm still skeptical that C's #embed will actually be implemented other
than as expanding to a sequence of integer constants.
We'll see when it all hits the mainline compilers!
Post by Keith Thompson
On the other hand, C23 allows for additional implementation-defined
parameters to #embed (as well as the standard embed parameters limit,
prefix, suffix, and if_empty). Such a parameter could specify how it's
expanded, perhaps to some implementation-defined blob format. *If*
compilers optimize #embed to something other than a sequence of integer
constant expressions, that's probably how it would be done. But since
neither gcc nor clang implements #embed at all, it may be too early to
speculate.