Post by Bart
Post by Thiago Adams
Post by Bart
In that case I don't understand what you are testing for here. Is it
an error for '×' to be 215, or an error for it not to be?
GCC handles this as multibyte, without decoding. The result with GCC
is 50071:
static_assert('×' == 50071);
256*195 + 151 = 50071
So 50071 comes from the 2-byte UTF-8 sequence: it is the result of
multiplying the first byte (195) by 256 and adding the second byte
(151). (This is not UTF-8 related; it is simply the way C compilers
generate the value of a multi-character constant.)
On the other hand, DECODING the bytes 195 and 151 as UTF-8 gives us
215, which is the Unicode value.
Post by Bart
Post by Thiago Adams
(Remember the UTF-8 bytes were 195 151)
GCC handles 'ab' the same way as '×'.
I don't understand. 'a' and 'b' each occupy one byte. Together they need
two bytes.
Where's the problem? Are you perhaps confused as to what UTF8 is?
I am not confused.
The problem is that the value of 'ab' is not defined by C (it is
implementation-defined). I want to use this construct, but it produces
a warning.
Post by Bart
The 50071 above is much better expressed as hex: C397, which is two
bytes. Since both values are in 128..255, they are UTF8 codes, here
expressing a single Unicode character.
I am using '==' etc. to represent token numbers.
Post by Bart
Given any two bytes in UTF-8, it is easy to see whether they are two
ASCII characters, one (or part of one) Unicode character, one ASCII
character followed by the first byte of a UTF-8 sequence, or whether
they are malformed (eg. the middle of a UTF-8 sequence).
There is no confusion.
Post by Thiago Adams
Post by Bart
And what is the test for, to ensure the encoding is UTF-8 in this ...
source file? ... compiler?
MSVC has some checks; I don't know what the logic is.
Post by Bart
Where would the 'decoded 215' come into it?
215 is the value obtained after decoding the UTF-8 bytes, i.e. the
Unicode value.
Who or what does that, and for what purpose? From what I've seen, only
you have introduced it.
?
Any modern language will make '×' be 215 (the Unicode value). But
those languages don't allow multi-character constants like 'ab'.
New languages behave like U'×' in C.
Post by Bart
Post by Thiago Adams
So my suggestion is decode first.
Why? What are you comparing? Both sides of == must use UTF8 or Unicode,
but why introduce Unicode at all if apparently everything in source code
and at compile time, as you yourself have stated, is UTF8?
Post by Thiago Adams
The bad part of my suggestion is that we may have two different ways
of producing the same value.
For instance, the number generated for 'ab' is the same as the one
generated for '𤤰', so
'ab' == '𤤰'
#include <stdio.h>

int main() {
    printf("%u\n", '×');     // GCC: 50071
    printf("%04X\n", '×');   // GCC: C397
    printf("%u\n", 'ab');    // GCC: 24930
    printf("%04X\n", 'ab');  // GCC: 6162
    printf("%u\n", '𤤰');
    printf("%04X\n", '𤤰');  // GCC: F0A4A4B0
}
This is not running the algorithm I am suggesting! The 'ab' == '𤤰'
collision happens only in the scheme I am suggesting. No compiler does
that today.
(I never imagined this would cause such confusion.)
Post by Bart
C397     ×
6162     ab
F0A4A4B0 𤤰
That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
clash with some other Unicode character.
My suggestion again. I am using a string here, but imagine this
working on bytes read from the source file.
#include <stdio.h>
#include <assert.h>

const unsigned char* utf8_decode(const unsigned char* s, int* c)
{
    if (s[0] == '\0')
    {
        *c = 0;
        return NULL; /* end */
    }
    const unsigned char* next = NULL;
    if (s[0] < 0x80)
    {
        *c = s[0];
        assert(*c >= 0x0000 && *c <= 0x007F);
        next = s + 1;
    }
    else if ((s[0] & 0xe0) == 0xc0)
    {
        *c = ((int)(s[0] & 0x1f) << 6) |
             ((int)(s[1] & 0x3f) << 0);
        assert(*c >= 0x0080 && *c <= 0x07FF);
        next = s + 2;
    }
    else if ((s[0] & 0xf0) == 0xe0)
    {
        *c = ((int)(s[0] & 0x0f) << 12) |
             ((int)(s[1] & 0x3f) << 6) |
             ((int)(s[2] & 0x3f) << 0);
        assert(*c >= 0x0800 && *c <= 0xFFFF);
        next = s + 3;
    }
    else if ((s[0] & 0xf8) == 0xf0 && (s[0] <= 0xf4))
    {
        *c = ((int)(s[0] & 0x07) << 18) |
             ((int)(s[1] & 0x3f) << 12) |
             ((int)(s[2] & 0x3f) << 6) |
             ((int)(s[3] & 0x3f) << 0);
        assert(*c >= 0x10000 && *c <= 0x10FFFF);
        next = s + 4;
    }
    else
    {
        *c = -1;      // invalid lead byte
        next = s + 1; // skip this byte
    }
    if (*c >= 0xd800 && *c <= 0xdfff)
    {
        *c = -1; // surrogate half
    }
    return next;
}

int get_value(const char* s0)
{
    const unsigned char* s = (const unsigned char*)s0;
    int value = 0;
    int uc;
    s = utf8_decode(s, &uc);
    while (s)
    {
        if (uc <= 0x007F)
        {
            // ASCII: combine with the traditional multi-char formula
            value = value * 256 + uc;
        }
        else
        {
            // non-ASCII: use the decoded Unicode value directly
            value = uc;
            break; // if there is more input after this, it is an error
        }
        s = utf8_decode(s, &uc);
    }
    return value;
}

int main(void)
{
    printf("%d\n", get_value(u8"×"));  // 215
    printf("%d\n", get_value(u8"ab")); // 24930 (0x6162)
}