Discussion:
multi-byte character - how to make it defined behavior?
Thiago Adams
2024-08-13 14:45:42 UTC
Permalink
static_assert('×' == 50071);

GCC - warning multi byte
CLANG - error character too large

I think instead of "multi-byte" we need "multi-character" - characters, not bytes.

We decode the UTF-8 first; then we have the character and can decide whether
the constant is multi-character or not.

Decoding '×' would consume the bytes 195 and 151; the result is the decoded
Unicode value 215.

It would not be treated as multi-byte (256*195 + 151 = 50071).

On the other hand, 'ab' is "multi-character", resulting in

256 * 'a' + 'b' = 256*97+98= 24930

One consequence is that

'ab' == '𤤰'

But I don't think this is a problem. At least everything is defined.
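
A small sketch of the two interpretations (illustrative only; the "multibyte"
value is what GCC happens to produce, nothing mandated by the standard):

#include <stdio.h>

int main(void)
{
    unsigned char b1 = 0xC3, b2 = 0x97;             /* the UTF-8 bytes of '×' */
    int multibyte = 256 * b1 + b2;                  /* GCC-style packing: 50071 */
    int decoded = ((b1 & 0x1F) << 6) | (b2 & 0x3F); /* UTF-8 decode: 215 (U+00D7) */
    printf("%d %d\n", multibyte, decoded);
    return 0;
}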
Bart
2024-08-13 23:52:13 UTC
Permalink
Post by Thiago Adams
static_assert('×' == 50071);
GCC -  warning multi byte
CLANG - error character too large
I think instead of "multi bytes" we need "multi characters" - not bytes.
We decode utf8 then we have the character to decide if it is multi char
or not.
decoding '×' would consume bytes 195 and 151 the result is the decoded
Unicode value of 215.
It is not multi byte : 256*195 + 151 = 50071
O the other hand 'ab' is "multi character" resulting
256 * 'a' + 'b' = 256*97+98= 24930
One consequence is that
'ab' == '𤤰'
But I don't think this is a problem. At least everything is defined.
What exactly do you mean by multi-byte characters? Is it a literal such
as 'ABCD'?

I've no idea what C makes of that, so you will first have to specify
what it might represent:

* Is it a single character represented by multiple bytes?

* If so, do those multiple bytes specify a Unicode number (2-3 bytes),
or a UTF8 sequence (up to 4 bytes, maybe more)?

* If such multi-byte sequences are allowed, could you have more than one of
them, mixing ASCII/Unicode/UTF8 characters?

One problem with UTF8 in C character literals is that I believe those
are limited to an 'int' type, so 32 bits. You can't fit much in there.
And once you have such a value, how do you print it?

Some of this you can take care of in your 'cake' product, and
superimpose a particular spec on top of C (maybe they can be extended to
64 bits) but you probably can't do much about 'printf'.

(In my language, I overhauled this part of it earlier this year. There
it works like this:

* Character literals can be 64 bits

* They can represent up to 8 ASCII characters: 'ABCDEFGH'

* They can include escape codes for both Unicode and UTF8, and multiple
such characters can be specified:

'A\u20ACB' # All represent A€B; this is Unicode
'A\h E2 82 AC\B' # This is UTF8
'A\xE2\x82\xACB' # C-style escape

Internally they are stored as UTF8, so the 20AC is converted to UTF8

* The ordering of the characters matches that of the equivalent
"A\e20ACB" string when stored in memory; but this applies only to
little-endian

* Print routines have options to print the first character (which can be
a Unicode one), or the whole sequence)

Another aspect is when typing Unicode text directly via your text editor
instead of using escape codes; will the C source be UTF8, or some other
encoding? This will affect how the text is represented, and how much you
can fit into one 32/64-bit literal.
Keith Thompson
2024-08-14 00:33:46 UTC
Permalink
Bart <***@freeuk.com> writes:
[...]
Post by Bart
What exactly do you mean by multi-byte characters? Is it a literal
such as 'ABCD'?
I've no idea what C makes of that,
It's a character constant of type int with an implementation-defined
value. Read the section on "Character constants" in the C standard
(6.4.4.4 in C17).

(With gcc, its value is 0x41424344, but other compilers can and do
behave differently.)
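
For example (a sketch; since the value is implementation-defined, other
compilers may print something different or reject the constant):

#include <stdio.h>

int main(void)
{
    /* gcc prints 41424344; other compilers may differ or warn/error
       on the multi-character constant. */
    printf("%X\n", 'ABCD');
    return 0;
}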

We discussed this at some length several years ago.

[...]
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Thiago Adams
2024-08-14 11:41:01 UTC
Permalink
Post by Keith Thompson
[...]
Post by Bart
What exactly do you mean by multi-byte characters? Is it a literal
such as 'ABCD'?
I've no idea what C makes of that,
It's a character constant of type int with an implementation-defined
value. Read the section on "Character constants" in the C standard
(6.4.4.4 in C17).
(With gcc, its value is 0x41424344, but other compilers can and do
behave differently.)
We discussed this at some length several years ago.
[...]
"An integer character constant has type int. The value of an integer
character constant containing
a single character that maps to a single value in the literal encoding
(6.2.9) is the numerical value
of the representation of the mapped character in the literal encoding
interpreted as an integer.
The value of an integer character constant containing more than one
character (e.g. ’ab’), or
containing a character or escape sequence that does not map to a single
value in the literal encoding,
is implementation-defined. If an integer character constant contains a
single character or escape
sequence, its value is the one that results when an object with type
char whose value is that of the
single character or escape sequence is converted to type int."


I am suggesting that we define this:

"The value of an integer character constant containing more than one
character (e.g. ’ab’), or containing a character or escape sequence that
does not map to a single value in the literal encoding, is
implementation-defined."

How?

First, all source code should be utf8.

Then I am suggesting we first decode the bytes.

For instance, '×' is encoded with 195 and 151. We consume these 2 bytes
and the utf8 decoded value is 215.

Then this is the defined behavior

static_assert('×' == 215)

In case we have 'ab' for instance:
First we decode 'a' (97), then 'b' (98). We consume one byte each.
Then we have two characters. In this case we do

256 * 'a' + 'b' = 256*97+98= 24930

static_assert('ab' == 24930)

I believe this static_assert('ab' == 24930) matches the way it is used
today.

In case the value is bigger than INT_MAX, I think the type should be unsigned int.

Why?

Adding fixes on top of fixes makes the language bigger and more complex,
like adding U'', L'', u8'', etc.

In my source code I use only UTF-8; everything just works without any
u8"" etc.
Bart
2024-08-14 13:05:22 UTC
Permalink
Post by Thiago Adams
Post by Keith Thompson
[...]
Post by Bart
What exactly do you mean by multi-byte characters? Is it a literal
such as 'ABCD'?
I've no idea what C makes of that,
It's a character constant of type int with an implementation-defined
value.  Read the section on "Character constants" in the C standard
(6.4.4.4 in C17).
(With gcc, its value is 0x41424344, but other compilers can and do
behave differently.)
We discussed this at some length several years ago.
[...]
"An integer character constant has type int. The value of an integer
character constant containing
a single character that maps to a single value in the literal encoding
(6.2.9) is the numerical value
of the representation of the mapped character in the literal encoding
interpreted as an integer.
The value of an integer character constant containing more than one
character (e.g. ’ab’), or
containing a character or escape sequence that does not map to a single
value in the literal encoding,
is implementation-defined. If an integer character constant contains a
single character or escape
sequence, its value is the one that results when an object with type
char whose value is that of the
single character or escape sequence is converted to type int."
"The value of an integer character constant containing more than one
character (e.g. ’ab’), or containing a character or escape sequence that
does not map to a single value in the literal encoding, is
implementation-defined."
How?
First, all source code should be utf8.
Then I am suggesting we first decode the bytes.
For instance, '×' is encoded with 195 and 151. We consume these 2 bytes
and the utf8 decoded value is 215.
By that you mean the Unicode index. But you say elsewhere that
everything in your source code is UTF8.

Where then does the 215 appear? Do your char* strings use 215 for ×, or
do they use 195 and 151?

I think this is why C requires those prefixes like u8'...'.
Post by Thiago Adams
Then this is the defined behavior
static_assert('×' == 215)
This is where you need to decide whether the integer value within '...',
AT RUNTIME, represents the Unicode index or the UTF8 sequence.

(In my language, though I do very little with Unicode ATM, I decided
that everything is UTF8 both at compile time and runtime. Unless I
explicitly expand a UTF8 u8[] string to a u32[] or i32[] array (either
will work), which contains 21-bit Unicode index values.)

I get the impression that C's wide characters are intended for those
Unicode indices, but that's not going to work well on Windows with its
16-bit wide character type.
Thiago Adams
2024-08-14 13:31:59 UTC
Permalink
Post by Bart
Post by Thiago Adams
Post by Keith Thompson
[...]
Post by Bart
What exactly do you mean by multi-byte characters? Is it a literal
such as 'ABCD'?
I've no idea what C makes of that,
It's a character constant of type int with an implementation-defined
value.  Read the section on "Character constants" in the C standard
(6.4.4.4 in C17).
(With gcc, its value is 0x41424344, but other compilers can and do
behave differently.)
We discussed this at some length several years ago.
[...]
"An integer character constant has type int. The value of an integer
character constant containing
a single character that maps to a single value in the literal encoding
(6.2.9) is the numerical value
of the representation of the mapped character in the literal encoding
interpreted as an integer.
The value of an integer character constant containing more than one
character (e.g. ’ab’), or
containing a character or escape sequence that does not map to a
single value in the literal encoding,
is implementation-defined. If an integer character constant contains a
single character or escape
sequence, its value is the one that results when an object with type
char whose value is that of the
single character or escape sequence is converted to type int."
"The value of an integer character constant containing more than one
character (e.g. ’ab’), or containing a character or escape sequence
that does not map to a single value in the literal encoding, is
implementation-defined."
How?
First, all source code should be utf8.
Then I am suggesting we first decode the bytes.
For instance, '×' is encoded with 195 and 151. We consume these 2
bytes and the utf8 decoded value is 215.
By that you mean the Unicode index. But you say elsewhere that
everything in your source code is UTF8.
215 is the unicode number of the character '×'.
Post by Bart
Where then does the 215 appear? Do your char* strings use 215 for ×, or
do they use 195 and 215?
215 is the result of decoding two utf8 encoded bytes. (195 and 151)
Post by Bart
I think this is why C requires those prefixes like u8'...'.
Post by Thiago Adams
Then this is the defined behavior
static_assert('×' == 215)
This is where you need to decide whether the integer value within '...',
AT RUNTIME, represents the Unicode index or the UTF8 sequence.
why runtime? It is compile time. This is why source code must be
universally encoded (utf8)
Post by Bart
(In my language, though I do very little with Unicode ATM, I decided
that everything is UTF8 both at compile time and runtime. Unless I
explicitly expand a UTF8 u8[] string to a u32[] or i32[] array (either
will work), which contains 21-bit Unicode index values.)
I get the impression that C's wide characters are intended for those
Unicode indices, but that's not going to work well on Windows with its
16-bit wide character type.
Nowadays wide characters are just for Windows API compatibility.
Bart
2024-08-14 15:34:07 UTC
Permalink
Post by Thiago Adams
Post by Bart
Post by Thiago Adams
Post by Keith Thompson
[...]
Post by Bart
What exactly do you mean by multi-byte characters? Is it a literal
such as 'ABCD'?
I've no idea what C makes of that,
It's a character constant of type int with an implementation-defined
value.  Read the section on "Character constants" in the C standard
(6.4.4.4 in C17).
(With gcc, its value is 0x41424344, but other compilers can and do
behave differently.)
We discussed this at some length several years ago.
[...]
"An integer character constant has type int. The value of an integer
character constant containing
a single character that maps to a single value in the literal
encoding (6.2.9) is the numerical value
of the representation of the mapped character in the literal encoding
interpreted as an integer.
The value of an integer character constant containing more than one
character (e.g. ’ab’), or
containing a character or escape sequence that does not map to a
single value in the literal encoding,
is implementation-defined. If an integer character constant contains
a single character or escape
sequence, its value is the one that results when an object with type
char whose value is that of the
single character or escape sequence is converted to type int."
"The value of an integer character constant containing more than one
character (e.g. ’ab’), or containing a character or escape sequence
that does not map to a single value in the literal encoding, is
implementation-defined."
How?
First, all source code should be utf8.
Then I am suggesting we first decode the bytes.
For instance, '×' is encoded with 195 and 151. We consume these 2
bytes and the utf8 decoded value is 215.
By that you mean the Unicode index. But you say elsewhere that
everything in your source code is UTF8.
215 is the unicode number of the character '×'.
Post by Bart
Where then does the 215 appear? Do your char* strings use 215 for ×,
or do they use 195 and 215?
215 is the result of decoding two utf8 encoded bytes. (195 and 151)
Post by Bart
I think this is why C requires those prefixes like u8'...'.
Post by Thiago Adams
Then this is the defined behavior
static_assert('×' == 215)
This is where you need to decide whether the integer value within
'...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.
why runtime? It is compile time. This is why source code must be
universally encoded (utf8)
In that case I don't understand what you are testing for here. Is it an
error for '×' to be 215, or an error for it not to be?

And what is the test for, to ensure encoding is UTF8 in this ... source
file? ... compiler?

Where would the 'decoded 215' come into it?
Thiago Adams
2024-08-14 16:10:01 UTC
Permalink
Post by Bart
Post by Thiago Adams
Post by Bart
Post by Thiago Adams
Post by Keith Thompson
[...]
Post by Bart
What exactly do you mean by multi-byte characters? Is it a literal
such as 'ABCD'?
I've no idea what C makes of that,
It's a character constant of type int with an implementation-defined
value.  Read the section on "Character constants" in the C standard
(6.4.4.4 in C17).
(With gcc, its value is 0x41424344, but other compilers can and do
behave differently.)
We discussed this at some length several years ago.
[...]
"An integer character constant has type int. The value of an integer
character constant containing
a single character that maps to a single value in the literal
encoding (6.2.9) is the numerical value
of the representation of the mapped character in the literal
encoding interpreted as an integer.
The value of an integer character constant containing more than one
character (e.g. ’ab’), or
containing a character or escape sequence that does not map to a
single value in the literal encoding,
is implementation-defined. If an integer character constant contains
a single character or escape
sequence, its value is the one that results when an object with type
char whose value is that of the
single character or escape sequence is converted to type int."
"The value of an integer character constant containing more than one
character (e.g. ’ab’), or containing a character or escape sequence
that does not map to a single value in the literal encoding, is
implementation-defined."
How?
First, all source code should be utf8.
Then I am suggesting we first decode the bytes.
For instance, '×' is encoded with 195 and 151. We consume these 2
bytes and the utf8 decoded value is 215.
By that you mean the Unicode index. But you say elsewhere that
everything in your source code is UTF8.
215 is the unicode number of the character '×'.
Post by Bart
Where then does the 215 appear? Do your char* strings use 215 for ×,
or do they use 195 and 215?
215 is the result of decoding two utf8 encoded bytes. (195 and 151)
Post by Bart
I think this is why C requires those prefixes like u8'...'.
Post by Thiago Adams
Then this is the defined behavior
static_assert('×' == 215)
This is where you need to decide whether the integer value within
'...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.
why runtime? It is compile time. This is why source code must be
universally encoded (utf8)
In that case I don't understand what you are testing for here. Is it an
error for '×' to be 215, or an error for it not to be?
GCC handles this as multibyte. Without decoding.

The result of GCC is 50071
static_assert('×' == 50071);

The explanation is that GCC is doing:

256*195 + 151 = 50071

(Remember the utf8 bytes were 195 151)

The way 'ab' is handled is the same as '×' on GCC. Clang has an error
for that. The standard just says the value is implementation-defined.
Post by Bart
And what is the test for, to ensure encoding is UTF8 in this ... source
file? ... compiler?
MSVC has some checks; I don't know what the logic is.
Post by Bart
Where would the 'decoded 215' come into it?
215 is the value after decoding utf8 and producing the unicode value.

So my suggestion is to decode first.

The bad part of my suggestion is that we may have two different ways of
producing the same value.

For instance, the number generated by 'ab' is the same as another character's:

'ab' == '𤤰'

The advantage is to converge on UTF-8/Unicode and make the behavior specified.
Thiago Adams
2024-08-14 16:27:26 UTC
Permalink
Post by Thiago Adams
Post by Bart
Post by Thiago Adams
Post by Bart
Post by Thiago Adams
Post by Keith Thompson
[...]
Post by Bart
What exactly do you mean by multi-byte characters? Is it a literal
such as 'ABCD'?
I've no idea what C makes of that,
It's a character constant of type int with an implementation-defined
value.  Read the section on "Character constants" in the C standard
(6.4.4.4 in C17).
(With gcc, its value is 0x41424344, but other compilers can and do
behave differently.)
We discussed this at some length several years ago.
[...]
"An integer character constant has type int. The value of an
integer character constant containing
a single character that maps to a single value in the literal
encoding (6.2.9) is the numerical value
of the representation of the mapped character in the literal
encoding interpreted as an integer.
The value of an integer character constant containing more than one
character (e.g. ’ab’), or
containing a character or escape sequence that does not map to a
single value in the literal encoding,
is implementation-defined. If an integer character constant
contains a single character or escape
sequence, its value is the one that results when an object with
type char whose value is that of the
single character or escape sequence is converted to type int."
"The value of an integer character constant containing more than
one character (e.g. ’ab’), or containing a character or escape
sequence that does not map to a single value in the literal
encoding, is implementation-defined."
How?
First, all source code should be utf8.
Then I am suggesting we first decode the bytes.
For instance, '×' is encoded with 195 and 151. We consume these 2
bytes and the utf8 decoded value is 215.
By that you mean the Unicode index. But you say elsewhere that
everything in your source code is UTF8.
215 is the unicode number of the character '×'.
Post by Bart
Where then does the 215 appear? Do your char* strings use 215 for ×,
or do they use 195 and 215?
215 is the result of decoding two utf8 encoded bytes. (195 and 151)
Post by Bart
I think this is why C requires those prefixes like u8'...'.
Post by Thiago Adams
Then this is the defined behavior
static_assert('×' == 215)
This is where you need to decide whether the integer value within
'...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.
why runtime? It is compile time. This is why source code must be
universally encoded (utf8)
In that case I don't understand what you are testing for here. Is it
an error for '×' to be 215, or an error for it not to be?
GCC handles this as multibyte. Without decoding.
The result of GCC is 50071
static_assert('×' == 50071);
256*195 + 151 = 50071
(Remember the utf8 bytes were 195 151)
The way 'ab' is handled is the same of '×' on GCC. Clang have a error
for that. The standard just says the value is implementation defined.
Post by Bart
And what is the test for, to ensure encoding is UTF8 in this ...
source file? ... compiler?
MSVC has some checks, I don't know that is the logic.
Post by Bart
Where would the 'decoded 215' come into it?
215 is the value after decoding utf8 and producing the unicode value.
So my suggestion is decode first.
The bad part of my suggestion we may have two different ways of
producing the same value.
For instance the number generated by ab is the same of
'ab' == '𤤰'
The advantage is to converge to utf8 unicode and make it specified.
I use multi-character constants in my code.

For instance:
enum token { TK_EQUAL = '==' };

I prefer to write and read token.type == '==' rather than
token.type == TK_EQUAL.

An alternative for me could also be a macro:

if (token.type == MC('=', '=')) {...}

but then it's worse than token.type == TK_EQUAL.
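
A minimal sketch of what such an MC macro could look like (just one possible
definition; only the name comes from the text above), packing two ASCII
characters the same way GCC packs a two-character constant:

/* Hypothetical helper: pack two ASCII characters like GCC packs '=='. */
#define MC(a, b) (((a) << 8) | (b))

enum token { TK_EQUAL = MC('=', '=') }; /* same value as '==' under GCC */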
Bart
2024-08-14 17:07:26 UTC
Permalink
Post by Thiago Adams
Post by Bart
In that case I don't understand what you are testing for here. Is it
an error for '×' to be 215, or an error for it not to be?
GCC handles this as multibyte. Without decoding.
The result of GCC is 50071
static_assert('×' == 50071);
256*195 + 151 = 50071
So the 50071 is the 2-byte UTF8 sequence.
Post by Thiago Adams
(Remember the utf8 bytes were 195 151)
The way 'ab' is handled is the same of '×' on GCC.
I don't understand. 'a' and 'b' each occupy one byte. Together they need
two bytes.

Where's the problem? Are you perhaps confused as to what UTF8 is?

The 50071 above is much better expressed as hex: C397, which is two
bytes. Since both values are in 128..255, they are UTF8 codes, here
expressing a single Unicode character.

Given any two bytes in UTF8, it is easy to see whether they are two
ASCII characters, or one (or part of one) Unicode character, or one ASCII
character followed by the first byte of a UTF8 sequence, or whether they are
malformed (e.g. the middle of a UTF8 sequence).

There is no confusion.
Post by Thiago Adams
Post by Bart
And what is the test for, to ensure encoding is UTF8 in this ...
source file? ... compiler?
MSVC has some checks, I don't know that is the logic.
Post by Bart
Where would the 'decoded 215' come into it?
215 is the value after decoding utf8 and producing the unicode value.
Who or what does that, and for what purpose? From what I've seen, only
you have introduced it.
Post by Thiago Adams
So my suggestion is decode first.
Why? What are you comparing? Both sides of == must use UTF8 or Unicode,
but why introduce Unicode at all if apparently everything in source code
and at compile time, as you yourself have stated, is UTF8?
Post by Thiago Adams
The bad part of my suggestion we may have two different ways of
producing the same value.
For instance the number generated by ab is the same of
'ab' == '𤤰'
I don't think so. If I run this program:

#include <stdio.h>
#include <string.h>

int main() {
    printf("%u\n", '×');
    printf("%04X\n", '×');
    printf("%u\n", 'ab');
    printf("%04X\n", 'ab');
    printf("%u\n", '𤤰');
    printf("%04X\n", '𤤰');
}


I get this output (I've left out the decimal versions for clarity):

C397 ×

6162 ab

F0A4A4B0 𤤰

That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
clash with some other Unicode character.
Thiago Adams
2024-08-14 17:40:04 UTC
Permalink
Post by Bart
Post by Thiago Adams
Post by Bart
In that case I don't understand what you are testing for here. Is it
an error for '×' to be 215, or an error for it not to be?
GCC handles this as multibyte. Without decoding.
The result of GCC is 50071
static_assert('×' == 50071);
256*195 + 151 = 50071
So the 50071 is the 2-byte UTF8 sequence.
50071 is the result of multiplying the first byte, 195, by 256 and adding the
second byte, 151. (This is NOT UTF8-related; this is just the way C compilers
generate the value.)

On the other hand, DECODING bytes 195 and 151 as UTF8 gives us the
result 215, which is the Unicode value.
Post by Bart
Post by Thiago Adams
(Remember the utf8 bytes were 195 151)
The way 'ab' is handled is the same of '×' on GCC.
I don't understand. 'a' and 'b' each occupy one byte. Together they need
two bytes.
Where's the problem? Are you perhaps confused as to what UTF8 is?
I am not confused.

The problem is that the value of 'ab' is implementation-defined in C. I want
to use this, but it produces a warning.
Post by Bart
The 50071 above is much better expressed as hex: C397, which is two
bytes. Since both values are in 128..255, they are UTF8 codes, here
expressing a single Unicode character.
I am using '==' etc. to represent token numbers.
Post by Bart
Given any two bytes in UTF8, it is easy to see whether they are two
ASCII character, or one (or part of) a Unicode characters, or one ASCII
character followed by the first byte of a UTF8 sequence, or if they are
malformed (eg. the middle of a UTF8 sequence).
There is no confusion.
Post by Thiago Adams
Post by Bart
And what is the test for, to ensure encoding is UTF8 in this ...
source file? ... compiler?
MSVC has some checks, I don't know that is the logic.
Post by Bart
Where would the 'decoded 215' come into it?
215 is the value after decoding utf8 and producing the unicode value.
Who or what does that, and for what purpose? From what I've seen, only
you have introduced it.
?
Any modern language will treat '×' as 215 (the Unicode value). But these
languages don't allow multi-character constants like 'ab'.
New languages behave like U'×' in C.
Post by Bart
Post by Thiago Adams
So my suggestion is decode first.
Why? What are you comparing? Both sides of == must use UTF8 or Unicode,
but why introduce Unicode at all if apparently everything in source code
and at compile time, as you yourself have stated, is UTF8?
Post by Thiago Adams
The bad part of my suggestion we may have two different ways of
producing the same value.
For instance the number generated by ab is the same of
'ab' == '𤤰'
 #include <stdio.h>
 #include <string.h>
 int main() {
   printf("%u\n", '×');
   printf("%04X\n", '×');
   printf("%u\n", 'ab');
   printf("%04X\n", 'ab');
   printf("%u\n", '𤤰');
   printf("%04X\n", '𤤰');
 }
This is not running the algorithm I am suggesting! This 'ab' == '𤤰'
happens only in the way I am suggesting. No compiler is doing that today.
(I never imagined this would cause so much confusion.)
Post by Bart
C397                ×
6162                ab
F0A4A4B0            𤤰
That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
clash with some other Unicode character.
My suggestion again. I am using a string here, but imagine this working with
bytes from the source file.


#include <stdio.h>
#include <assert.h>

const unsigned char* utf8_decode(const unsigned char* s, int* c)
{
    if (s[0] == '\0')
    {
        *c = 0;
        return NULL; /* end */
    }

    const unsigned char* next = NULL;
    if (s[0] < 0x80)
    {
        *c = s[0];
        assert(*c >= 0x0000 && *c <= 0x007F);
        next = s + 1;
    }
    else if ((s[0] & 0xe0) == 0xc0)
    {
        *c = ((int)(s[0] & 0x1f) << 6) |
             ((int)(s[1] & 0x3f) << 0);
        assert(*c >= 0x0080 && *c <= 0x07FF);
        next = s + 2;
    }
    else if ((s[0] & 0xf0) == 0xe0)
    {
        *c = ((int)(s[0] & 0x0f) << 12) |
             ((int)(s[1] & 0x3f) << 6) |
             ((int)(s[2] & 0x3f) << 0);
        assert(*c >= 0x0800 && *c <= 0xFFFF);
        next = s + 3;
    }
    else if ((s[0] & 0xf8) == 0xf0 && (s[0] <= 0xf4))
    {
        *c = ((int)(s[0] & 0x07) << 18) |
             ((int)(s[1] & 0x3f) << 12) |
             ((int)(s[2] & 0x3f) << 6) |
             ((int)(s[3] & 0x3f) << 0);
        assert(*c >= 0x10000 && *c <= 0x10FFFF);
        next = s + 4;
    }
    else
    {
        *c = -1;      // invalid
        next = s + 1; // skip this byte
    }

    if (*c >= 0xd800 && *c <= 0xdfff)
    {
        *c = -1; // surrogate half
    }

    return next;
}

int get_value(const char* s0)
{
    const unsigned char* s = (const unsigned char*)s0;
    int value = 0;
    int uc;
    s = utf8_decode(s, &uc);
    while (s)
    {
        if (uc < 0x007F)
        {
            // multichar formula
            value = value * 256 + uc;
        }
        else
        {
            // single char
            value = uc;
            break; // check if there is more, then error..
        }
        s = utf8_decode(s, &uc);
    }
    return value;
}

int main(void)
{
    printf("%d\n", get_value(u8"×"));  // 215
    printf("%d\n", get_value(u8"ab")); // 24930
}
Bart
2024-08-14 18:12:34 UTC
Permalink
Post by Thiago Adams
Post by Bart
That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
clash with some other Unicode character.
My suggestion again. I am using string but imagine this working with
bytes from file.
#include <stdio.h>
#include <assert.h>
...
Post by Thiago Adams
int get_value(const char* s0)
{
   const char * s = s0;
   int value = 0;
   int  uc;
   s = utf8_decode(s, &uc);
   while (s)
   {
     if (uc < 0x007F)
     {
        //multichar formula
        value = value*256+uc;
     }
     else
     {
        //single char
        value = uc;
        break; //check if there is more then error..
     }
     s = utf8_decode(s, &uc);
   }
   return value;
}
int main(){
  printf("%d\n", get_value(u8"×"));
  printf("%d\n", get_value(u8"ab"));
}
I see your problem. You're mixing things up.

gcc will combine BYTE values together (by shifting by 8 bits or
multiplying by 256), including the individual bytes that represent UTF8.

You are combining ONLY ASCII bytes, and comparing the results with
21-bit Unicode values.

That is meaningless. I'm not surprised you get a clash between A*256+B,
and some arbitrary Unicode index.
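
To make the clash concrete (a sketch using the decode-first rule proposed
upthread, not anything a current compiler does):

#include <assert.h>

int main(void)
{
    /* Under the proposed rule, 'ab' would evaluate to 256*'a' + 'b' = 0x6162,
       and U+6162 is an assigned CJK ideograph, so the two-ASCII constant
       collides with a genuine single character. */
    assert(256 * 'a' + 'b' == 0x6162);
    return 0;
}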
Thiago Adams
2024-08-14 18:28:10 UTC
Permalink
Post by Bart
Post by Thiago Adams
Post by Bart
That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
clash with some other Unicode character.
My suggestion again. I am using string but imagine this working with
bytes from file.
#include <stdio.h>
#include <assert.h>
...
Post by Thiago Adams
int get_value(const char* s0)
{
    const char * s = s0;
    int value = 0;
    int  uc;
    s = utf8_decode(s, &uc);
    while (s)
    {
      if (uc < 0x007F)
      {
         //multichar formula
         value = value*256+uc;
      }
      else
      {
         //single char
         value = uc;
         break; //check if there is more then error..
      }
      s = utf8_decode(s, &uc);
    }
    return value;
}
int main(){
   printf("%d\n", get_value(u8"×"));
   printf("%d\n", get_value(u8"ab"));
}
I see your problem. You're mixing things up.
The objective is:
 - make single characters have the Unicode value without having to use U''
 - allow more than one character, like 'ab', in cases where each
character is less than 0x007F. This can break code, for instance '¼¼',
but I suspect people are not using it that way (I hope).
Post by Bart
gcc will combine BYTE values together (by shifting by 8 bits or
multiplying by 256), including the individual bytes that represent UTF8.
You are combining ONLY ASCII bytes, and comparing the results with
21-bit Unicode values.
That is meaningless. I'm not surprised you get a clash between A*256+B,
and some arbitrary Unicode index.
In any case, my suggestion looks dangerous. But meanwhile this is not
well specified in the standard.
Bart
2024-08-14 19:32:31 UTC
Permalink
Post by Bart
Post by Thiago Adams
Post by Bart
That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
clash with some other Unicode character.
My suggestion again. I am using string but imagine this working with
bytes from file.
#include <stdio.h>
#include <assert.h>
...
Post by Thiago Adams
int get_value(const char* s0)
{
    const char * s = s0;
    int value = 0;
    int  uc;
    s = utf8_decode(s, &uc);
    while (s)
    {
      if (uc < 0x007F)
      {
         //multichar formula
         value = value*256+uc;
      }
      else
      {
         //single char
         value = uc;
         break; //check if there is more then error..
      }
      s = utf8_decode(s, &uc);
    }
    return value;
}
int main(){
   printf("%d\n", get_value(u8"×"));
   printf("%d\n", get_value(u8"ab"));
}
I see your problem. You're mixing things up.
 - make single characters have the Unicode value without  having to use
U''
 - allow more than one chars like 'ab' in some cases where each
character is less than 0x007F. This can break code for instance '¼¼'.
but I am suspecting people are not using in this way (I hope)
Obviously that can't work, for example because two printable ASCII
characters with codes 32 to 96 will have combined values from 8224 to 24672
when packed into a character literal. Those are going to clash with Unicode
characters with those values.

It won't work either at compile-time or runtime.

You need to choose between Unicode representation and UTF8. Either that
or use some prefix to disambiguate in source code, but you still need to
decide whether '€' in source code is represented as the Unicode bytes 20
AC (or maybe 00 20 AC) or the UTF8 sequence E2 82 AC, and further decide
which end of those sequences will be the least significant byte.
In any case..my suggestion looks dangerous. But meanwhile this is not
well specified in the standard.
It wasn't well-specified even when dealing with 100% ASCII. For example,
'AB' might have the hex value 0x4142 on one compiler, 0x4241 on another,
maybe just 0x41 or 0x42 on a third, or even 0x41420000.
Lawrence D'Oliveiro
2024-08-15 02:43:03 UTC
Permalink
The result of GCC is 50071 static_assert('×' == 50071);
256*195 + 151 = 50071
(Remember the utf8 bytes were 195 151)
That would be an endian-dependent interpretation.

Lawrence D'Oliveiro
2024-08-15 02:41:48 UTC
Permalink
Post by Thiago Adams
215 is the unicode number of the character '×'.
Be careful about the use of the term “character” in Unicode.

Unicode defines “code points”. A “grapheme” (which I think is their term
for “character”) can be made up of one or more “code points”, with no
upper limit on their number.
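
For example (a sketch of my own, in C string form): the user-perceived
character "é" can be one code point or two, with different UTF-8 byte counts.

/* Both strings display as "é", but they contain different code points. */
const char *precomposed = "\xC3\xA9";  /* U+00E9: one code point, 2 bytes  */
const char *combining   = "e\xCC\x81"; /* U+0065 U+0301: two code points, 3 bytes */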
Lawrence D'Oliveiro
2024-08-15 01:39:27 UTC
Permalink
Post by Bart
I get the impression that C's wide characters are intended for those
Unicode indices, but that's not going to work well on Windows with its
16-bit wide character type.
Unfortunately, Windows (like Java) is shackled to the UTF-16 Albatross,
owing to embracing Unicode at exactly the wrong time.
Ben Bacarisse
2024-08-14 00:32:14 UTC
Permalink
Post by Thiago Adams
static_assert('×' == 50071);
static_assert(U'×' == 215);

works, but then I don't know what you were trying to do.
Post by Thiago Adams
GCC - warning multi byte
CLANG - error character too large
I think instead of "multi bytes" we need "multi characters" - not bytes.
We decode utf8 then we have the character to decide if it is multi char or
not.
These terms can be confusing and I don't know exactly how you are using
them. Basically I simply don't know what that second sentence is
saying.
Post by Thiago Adams
decoding '×' would consume bytes 195 and 151 the result is the decoded
Unicode value of 215.
Yes, Unicode 215 is UTF-8 encoded as two bytes with values 195 and 151.
Post by Thiago Adams
It is not multi byte : 256*195 + 151 = 50071
If that × is UTF-8 encoded then it might look, to the compiler, just
like an old-fashioned multi-character character constant, as 'ab'
does. Then again, it might not. gcc and clang take different views on
the matter.

You can get clang to take the same view as gcc by writing

static_assert('\xC3\x97' == 50071);

instead. Now both gcc and clang see it for what it is: an old-fashioned
multi-character character constant.
Post by Thiago Adams
O the other hand 'ab' is "multi character" resulting
The term for these things used to be "multi-byte character constant" and
they were highly non-portable. The trouble is that the term "multi-byte
character" now refers to highly portable encodings like UTF-8. Maybe
that's why gcc seems to have changed its warning from what you gave to:

warning: multi-character character constant [-Wmultichar]
Post by Thiago Adams
256 * 'a' + 'b' = 256*97+98= 24930
One consequence is that
'ab' == '𤤰'
But I don't think this is a problem. At least everything is defined.
--
Ben.
Richard Damon
2024-08-14 03:44:24 UTC
Permalink
Post by Thiago Adams
static_assert('×' == 50071);
GCC -  warning multi byte
CLANG - error character too large
I think instead of "multi bytes" we need "multi characters" - not bytes.
We decode utf8 then we have the character to decide if it is multi char
or not.
decoding '×' would consume bytes 195 and 151 the result is the decoded
Unicode value of 215.
It is not multi byte : 256*195 + 151 = 50071
O the other hand 'ab' is "multi character" resulting
256 * 'a' + 'b' = 256*97+98= 24930
One consequence is that
'ab' == '𤤰'
But I don't think this is a problem. At least everything is defined.
When you use the single quotes by themselves ('), you are specifying
characters in the narrow character set, typically ASCII, but it might be
some other 8-bit character encoding. It cannot specify extended
characters beyond those.

You can (if the implementation allows it) place multiple characters in
the constant to get an integer value with those characters packed.

When you use the double quotes by themselves ("), you are specifying a
string of these narrow characters, although this form might allow for
multi-byte encodings of some characters, like is done with UTF-8.

You can specify wide character constants with the syntax L'x', u'x',
or U'x'.

L'x' will give you whatever the implementation calls its "wide
character set". This MIGHT be UCS-2/UTF-16 or UCS-4/UTF-32 encoded, but
doesn't need to be.

The u'x' form will always be UCS-2/UTF-16, and U'x' will always be
UCS-4/UTF-32.

Like the plain 'x' form, the result from a single character cannot be
a multi-unit value, so u'x' can't generate a surrogate pair for a
single source character.

Change the ' to a " and you get wide strings, just like the characters,
but now u"xx" and L"xx" can generate characters that use surrogate pairs
(or other multi-part encodings for L"xxx").
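
A small sketch pulling those forms together (assuming the source file itself
is UTF-8 and a C11-or-later compiler; the printed values are typical, the
wide one is implementation-defined):

#include <stdio.h>
#include <uchar.h>   /* char16_t, char32_t */
#include <wchar.h>   /* wchar_t */

int main(void)
{
    int      a = 'x';   /* plain: int, value from the narrow (literal) encoding */
    wchar_t  b = L'×';  /* wide character set, encoding implementation-defined  */
    char16_t c = u'×';  /* UTF-16 code unit: 0x00D7 */
    char32_t d = U'×';  /* UTF-32 code point: 0x000000D7 */
    printf("%X %lX %X %lX\n", a, (unsigned long)b, (unsigned)c, (unsigned long)d);
    return 0;
}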