Discussion:
Simple string conversion from UCS2 to ISO8859-1
pozz
2025-02-21 11:40:06 UTC
I want to write a simple function that converts UCS2 string into ISO8859-1:

void ucs2_to_iso8859p1(char *ucs2, size_t size);

ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing
size because ucs2 isn't null terminated.

I know I can use iconv() feature, but I'm on an embedded platform
without an OS and without iconv() function.

It is trivial to convert "0000"-"007F" chars: it's a simple cast from
unsigned int to char.

It isn't so simple to convert higher codes. For example, the small e
with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial
again. But I saw the code "2019" (apostrophe) that can be rendered as
0x27 in ISO8859-1.

Is there a simplified mapping table that can be written with if/switch?

if (code < 0x80) {
    *dst++ = (char)code;
} else {
    switch (code) {
    case 0x2019: *dst++ = 0x27; break; // Apostrophe
    case 0x...: *dst++ = ...; break;
    default: *dst++ = ' '; break;
    }
}

I'm not searching a very detailed and correct mapping, but just a
"sufficient" implementation.
Richard Damon
2025-02-21 12:05:57 UTC
Post by pozz
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing
size because ucs2 isn't null terminated.
Typically UCS2 strings ARE null terminated; it's just that the null is
two bytes long.
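In C, with raw big-endian UCS-2 in a byte buffer, scanning for that two-byte terminator is a sketch like this (assuming the buffer holds a whole number of byte pairs):

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Count code units in a raw big-endian UCS-2 buffer terminated by a
 * two-byte null (0x00 0x00). */
size_t ucs2_len(const uint8_t *p)
{
    size_t n = 0;
    while (p[0] != 0x00 || p[1] != 0x00) {  /* stop at the 16-bit null */
        ++n;
        p += 2;
    }
    return n;
}
```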
Post by pozz
I know I can use iconv() feature, but I'm on an embedded platform
without an OS and without iconv() function.
It is trivial to convert "0000"-"007F" chars: it's a simple cast from
unsigned int to char.
Note, I think you will find that it is the range 0000-00FF that matches
(as I remember, ISO8859-1 was the basis from which Unicode started).
Post by pozz
It isn't so simple to convert higher codes. For example, the small e
with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial
again. But I saw the code "2019" (apostrophe) that can be rendered as
0x27 in ISO8859-1.
To be correct, u2019 isn't 0x27; it's just a character that looks a lot
like it.
Post by pozz
Is there a simplified mapping table that can be written with if/switch?
if (code < 0x80) {
  *dst++ = (char)code;
} else {
  switch (code) {
    case 0x2019: *dst++ = 0x27; break;  // Apostrophe
    case 0x...: *dst++ = ...; break;
    default: *dst++ = ' ';
  }
}
I'm not searching a very detailed and correct mapping, but just a
"sufficient" implementation.
Then you have to decide which are sufficient mappings. No character
above FF *IS* the character below, but some have a close approximation,
so you will need to decide what to map.
pozz
2025-02-21 12:42:13 UTC
Post by Richard Damon
Post by pozz
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
passing size because ucs2 isn't null terminated.
Typically UCS2 strings ARE null terminated; it's just that the null is
two bytes long.
Sure, but this isn't an issue here.
Post by Richard Damon
Post by pozz
I know I can use iconv() feature, but I'm on an embedded platform
without an OS and without iconv() function.
It is trivial to convert "0000"-"007F" chars: it's a simple cast from
unsigned int to char.
Note, I think you will find that it is the range 0000-00FF that matches
(as I remember, ISO8859-1 was the basis from which Unicode started).
Post by pozz
It isn't so simple to convert higher codes. For example, the small e
with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
trivial again. But I saw the code "2019" (apostrophe) that can be
rendered as 0x27 in ISO8859-1.
To be correct, u2019 isn't 0x27; it's just a character that looks a lot
like it.
Yes, but as a first approximation, 0x27 is much better than '?' for u2019.
Post by Richard Damon
Post by pozz
Is there a simplified mapping table that can be written with if/switch?
if (code < 0x80) {
   *dst++ = (char)code;
} else {
   switch (code) {
     case 0x2019: *dst++ = 0x27; break;  // Apostrophe
     case 0x...: *dst++ = ...; break;
     default: *dst++ = ' ';
   }
}
I'm not searching a very detailed and correct mapping, but just a
"sufficient" implementation.
Then you have to decide which are sufficient mappings. No character
above FF *IS* the character below, but some have a close approximation,
so you will need to decide what to map.
Yes, I have to decide, but it is a very big problem (there are thousands
of Unicode symbols that could be approximated by an ISO8859-1 code).
I'm wondering if such an approximation is already implemented somewhere.

For example, what iconv() does in this case?
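(For reference: on a hosted system, glibc's iconv() fails on unmappable characters by default, but accepts a "//TRANSLIT" suffix on the target encoding that asks for approximations; what u2019 becomes then depends on the implementation's transliteration tables and the locale. A hedged sketch, assuming a glibc-style iconv; the wrapper function name is made up here.)

```c
#include <iconv.h>
#include <stddef.h>
#include <assert.h>

/* Convert big-endian UCS-2 to ISO 8859-1 using iconv.  "//TRANSLIT"
 * (a glibc extension) asks for approximations of unmappable
 * characters; what it substitutes is implementation-dependent.
 * Returns the number of output bytes, or 0 on error. */
size_t ucs2be_to_latin1(const char *in, size_t inlen, char *out, size_t outcap)
{
    iconv_t cd = iconv_open("ISO-8859-1//TRANSLIT", "UCS-2BE");
    if (cd == (iconv_t)-1)
        return 0;

    char *inp = (char *)in, *outp = out;
    size_t inleft = inlen, outleft = outcap;
    iconv(cd, &inp, &inleft, &outp, &outleft);  /* an error leaves a short result */
    iconv_close(cd);
    return (size_t)(outp - out);
}
```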
Janis Papanagnou
2025-02-21 13:06:03 UTC
Post by pozz
Post by Richard Damon
Post by pozz
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
passing size because ucs2 isn't null terminated.
[...]
Post by pozz
Post by Richard Damon
Post by pozz
It is trivial to convert "0000"-"007F" chars: it's a simple cast from
unsigned int to char.
Note, I think you will find that it is the range 0000-00FF that matches
(as I remember, ISO8859-1 was the basis from which Unicode started).
I second that.
Post by pozz
Post by Richard Damon
Post by pozz
It isn't so simple to convert higher codes. For example, the small e
with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
trivial again. But I saw the code "2019" (apostrophe) that can be
rendered as 0x27 in ISO8859-1.
To be correct, u2019 isn't 0x27; it's just a character that looks a lot
like it.
Yes, but as a first approximation, 0x27 is much better than '?' for u2019.
Note that there are _standard names_ assigned to the characters.
These are normative for what the characters represent. - I strongly
suggest not twisting these standards by assigning different
characters; you will do no one a favor but only inflict confusion
and harm.
Post by pozz
Post by Richard Damon
Post by pozz
Is there a simplified mapping table that can be written with if/switch?
if (code < 0x80) {
*dst++ = (char)code;
} else {
switch (code) {
case 0x2019: *dst++ = 0x27; break; // Apostrophe
case 0x...: *dst++ = ...; break;
default: *dst++ = ' ';
}
}
I'm not searching a very detailed and correct mapping, but just a
"sufficient" implementation.
Then you have to decide which are sufficient mappings. No character
above FF *IS* the character below, but some have a close
approximation, so you will need to decide what to map.
Yes, I have to decide, but it is a very big problem (there are thousands
of Unicode symbols that can be approximated to another ISO8859-1 code).
I'm wondering if such an approximation is just implemented somewhere.
I've just made a run across the names of UCS-2 and ISO 8859-1, based
on their normative names, and, as mentioned above already, they match
one-to-one in the ranges 0000-00FF and 00-FF respectively.

BTW, you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible in your context, you only have to map a handful of characters.
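If Latin 9 is an option, the whole difference amounts to eight repurposed byte positions. A sketch of the mapping, with the caveat that the eight Latin 1 characters displaced from those slots become unmappable:

```c
#include <stdint.h>
#include <assert.h>

/* Map a UCS-2 code point to ISO 8859-15 (Latin-9).  Latin-9 reuses
 * eight Latin-1 byte positions; everything else below U+0100 passes
 * through unchanged.  Returns '?' for anything unmappable. */
unsigned char ucs2_to_latin9(uint16_t code)
{
    switch (code) {
    case 0x20AC: return 0xA4;   /* EURO SIGN */
    case 0x0160: return 0xA6;   /* LATIN CAPITAL LETTER S WITH CARON */
    case 0x0161: return 0xA8;   /* LATIN SMALL LETTER S WITH CARON */
    case 0x017D: return 0xB4;   /* LATIN CAPITAL LETTER Z WITH CARON */
    case 0x017E: return 0xB8;   /* LATIN SMALL LETTER Z WITH CARON */
    case 0x0152: return 0xBC;   /* LATIN CAPITAL LIGATURE OE */
    case 0x0153: return 0xBD;   /* LATIN SMALL LIGATURE OE */
    case 0x0178: return 0xBE;   /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
    /* The Latin-1 characters displaced from those eight slots
     * (U+00A4, U+00A6, U+00A8, U+00B4, U+00B8, U+00BC-00BE) are no
     * longer representable. */
    case 0x00A4: case 0x00A6: case 0x00A8: case 0x00B4:
    case 0x00B8: case 0x00BC: case 0x00BD: case 0x00BE:
        return '?';
    default:
        return code < 0x100 ? (unsigned char)code : '?';
    }
}
```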

Janis
Post by pozz
For example, what iconv() does in this case?
Janis Papanagnou
2025-02-21 13:17:32 UTC
Post by Janis Papanagnou
Post by pozz
Post by Richard Damon
[...] But I saw the code "2019" (apostrophe) that can be
rendered as 0x27 in ISO8859-1.
To be correct, u2019 isn't 0x27; it's just a character that looks a lot
like it.
Yes, but as a first approximation, 0x27 is much better than '?' for u2019.
Note that there are _standard names_ assigned to the characters.
These are normative for what the characters represent. - I strongly
suggest not twisting these standards by assigning different
characters; you will do no one a favor but only inflict confusion
and harm.
I want to amend the standard names to make it clear...

0027 APOSTROPHE
2019 RIGHT SINGLE QUOTATION MARK

27 APOSTROPHE

Hope that helps to understand the standard names.


(You should also be aware that a glyph for a character may be
depicted differently depending on the source of the respective
documents.)

Janis
Keith Thompson
2025-02-21 19:40:38 UTC
Janis Papanagnou <janis_papanagnou+***@hotmail.com> writes:
[...]
Post by Janis Papanagnou
BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible for your context you have to map a handful of characters.
Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
not, which would make the translation more difficult.

<https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
the 8 characters that differ between Latin-1 and Latin-9.

If at all possible, it would be better to convert to UTF-8. The
conversion is exact and reversible, and UTF-8 has largely superseded the
various Latin-* character encodings. I'm curious why the OP needs
ISO8859-1 and can't use UTF-8.
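For comparison, the UCS-2 to UTF-8 direction needs no tables at all; every BMP code point encodes in one to three bytes. A minimal sketch:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Encode one UCS-2 code point as UTF-8 (1..3 bytes); returns the byte
 * count.  UCS-2 has no surrogate pairs, so three bytes always
 * suffice. */
size_t ucs2_to_utf8(uint16_t code, unsigned char out[3])
{
    if (code < 0x80) {                              /* ASCII: 1 byte */
        out[0] = (unsigned char)code;
        return 1;
    }
    if (code < 0x800) {                             /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (code >> 6));
        out[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (code >> 12));  /* 3 bytes */
    out[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (code & 0x3F));
    return 3;
}
```

The variable length is exactly the memory-predictability cost discussed downthread: a buffer for n UCS-2 characters needs up to 3n bytes.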
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
Janis Papanagnou
2025-02-21 22:35:57 UTC
Post by Keith Thompson
[...]
Post by Janis Papanagnou
BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible for your context you have to map a handful of characters.
Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
not, which would make the translation more difficult.
Yes, that had already been pointed out upthread.

The (open) question is whether it makes sense to convert to "Latin 1"
only because it has a one-to-one mapping for the first 256 UCS-2
characters, or whether the underlying application of the OP wants
support for contemporary information, e.g. by providing the € (Euro)
sign with "Latin 9".
Post by Keith Thompson
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
the 8 characters that differ between Latin-1 and Latin-9.
If at all possible, it would be better to convert to UTF-8. The
conversion is exact and reversible, and UTF-8 has largely superseded the
various Latin-* character encodings.
Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
while the ISO 8859-X family uses single-octet representations.
Post by Keith Thompson
I'm curious why the OP needs ISO8859-1 and can't use UTF-8.
I think this, or why he can't use "Latin 9", are essential questions.

It seems to have got clear after a subsequent post of the OP; some
message/data source seems to provide characters from the upper planes
of Unicode and the OP has to (or wants to) somehow map them to some
constant octet character set. - Yet there's no information provided
what Unicode characters - characters that don't have a representation
in Latin 1 or Latin 9 - the OP will encounter or not from that source.

As it sounds it all seems to make little sense.

Janis
David Brown
2025-02-22 11:29:59 UTC
Post by Janis Papanagnou
Post by Keith Thompson
[...]
Post by Janis Papanagnou
BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible for your context you have to map a handful of characters.
Latin-1 maps exactly to Unicode for the first 256 values. Latin-9 does
not, which would make the translation more difficult.
Yes, that had already been pointed out upthread.
The (open) question is whether it makes sense to convert to "Latin 1"
only because it has a one-to-one mapping concerning the first UCS-2
characters, or if the underlying application of the OP wants support
of contemporary information by e.g. providing the € (Euro) sign with
"Latin 9".
Post by Keith Thompson
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
the 8 characters that differ between Latin-1 and Latin-9.
If at all possible, it would be better to convert to UTF-8. The
conversion is exact and reversible, and UTF-8 has largely superseded the
various Latin-* character encodings.
Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
while the ISO 8859-X family uses single-octet representations.
Post by Keith Thompson
I'm curious why the OP needs ISO8859-1 and can't use UTF-8.
I think this, or why he can't use "Latin 9", are essential questions.
It seems to have got clear after a subsequent post of the OP; some
message/data source seems to provide characters from the upper planes
of Unicode and the OP has to (or wants to) somehow map them to some
constant octet character set. - Yet there's no information provided
what Unicode characters - characters that don't have a representation
in Latin 1 or Latin 9 - the OP will encounter or not from that source.
As it sounds it all seems to make little sense.
Janis
As the OP explained in a reply to one of my posts, he is getting data
in UCS-2 format from SMS's from a modem. Somewhere along the line,
either the firmware in the modem or in the code sending the SMS's,
characters beyond the BMP are being used needlessly. So it looks like
his first idea of manually handling a few cases (like code 0x2019) seems
like the right approach.

Whether Latin-1 or Latin-9 is better will depend on his application.
The additional characters in Latin-9, with the exception of the Euro
symbol, are pretty obscure - it's unlikely that you'd need them and not
need a good deal more other characters (i.e., supporting much more of
Unicode).

As for why not use UTF-8, the answer is clearly simplicity. The OP is
working with a resource-constrained embedded system. I don't know what
he is doing with the characters after converting them from UCS-2, but it
is massively simpler to use an 8-bit character set if they are going to
be used for display on a small system. It also keeps memory management
simpler, and that is essential on such systems - one UCS-2 character
maps to one code unit with Latin-9 here. The space needed for UTF-8 is
much harder to predict, and the OP will want to avoid any kind of
malloc() or dynamic allocation where possible.

If the incoming SMS's are just being logged, or passed out in some other
way, then UTF-8 may be a convenient alternative.
Janis Papanagnou
2025-02-22 12:11:34 UTC
Post by David Brown
As the OP explained in a reply to one of my posts, he is getting data
in UCS-2 format from SMS's from a modem. [...]
(Yes. I wrote: "have got clear after a subsequent post".)
Post by David Brown
Whether Latin-1 or Latin-9 is better will depend on his application.
(Was also my stance upthread; "If that is possible for your context")
Post by David Brown
The
additional characters in Latin-9, with the exception of the Euro symbol,
are pretty obscure
ISTR they are some language specific symbols, so probably less obscure
to someone from those countries.
Post by David Brown
- it's unlikely that you'd need them and not need a
good deal more other characters (i.e., supporting much more of Unicode).
As for why not use UTF-8, the answer is clearly simplicity.
This was not my point (someone else suggested that). To me that was
clear; UTF-8 is an _encoding_ (as I wrote), as opposed to a direct
representation of a fixed width character (either 8 bit width ISO
8859-X or 16 bit with UCS-2). Conversions to/from UTF-8 are not as
straightforward as fixed width character representations are.
Post by David Brown
The OP is
working with a resource-constrained embedded system. I don't know what
he is doing with the characters after converting them from UCS-2, but it
is massively simpler to use an 8-bit character set if they are going to
be used for display on a small system. It also keeps memory management
simpler, and that is essential on such systems - one UCS-2 character
maps to one code unit with Latin-9 here. The space needed for UTF-8 is
much harder to predict, and the OP will want to avoid any kind of
malloc() or dynamic allocation where possible.
You should address that to the other poster. :-)

Janis
Post by David Brown
If the incoming SMS's are just being logged, or passed out in some other
way, then UTF-8 may be a convenient alternative.
David Brown
2025-02-22 13:11:11 UTC
Post by Janis Papanagnou
Post by David Brown
As the OP explained in a reply to one of my posts, he is getting data
in UCS-2 format from SMS's from a modem. [...]
(Yes. I wrote: "have got clear after a subsequent post".)
Post by David Brown
Whether Latin-1 or Latin-9 is better will depend on his application.
(Was also my stance upthread; "If that is possible for your context")
Post by David Brown
The
additional characters in Latin-9, with the exception of the Euro symbol,
are pretty obscure
ISTR they are some language specific symbols, so probably less obscure
to someone from those countries.
The point (as I said below) is that adding these letters (š, ž, œ) makes
very little difference to anyone because they are not enough to let them
write their language properly. Sure, someone writing Czech might have
regular use of the letter ž - but with Latin-9 they can't write the
letters ť, ř, ď or several other Czech letters. So it provides little
benefit to most people who have those letters in their alphabet. If you
want to let people write their languages properly (something I strongly
support), you need much fuller Unicode support - unless you are working
specifically with Sami, Finnish or Estonian, the only benefit of moving
from Latin-1 to Latin-9 is for the Euro symbol.
Post by Janis Papanagnou
Post by David Brown
- it's unlikely that you'd need them and not need a
good deal more other characters (i.e., supporting much more of Unicode).
As for why not use UTF-8, the answer is clearly simplicity.
This was not my point (someone else suggested that).
<snip>
Post by Janis Papanagnou
You should address that to the other poster. :-)
I was making a single reply that covered both parts - I know you didn't
write the bits you quoted from further up-thread.
Lawrence D'Oliveiro
2025-02-22 21:23:42 UTC
Post by Janis Papanagnou
UTF-8 is an _encoding_ (as I wrote), as opposed to a direct
representation of a fixed width character (either 8 bit width ISO 8859-X
or 16 bit with UCS-2). Conversions to/from UTF-8 are not as
straightforward as fixed width character representations are.
Unicode is not, and never has been, a fixed-width character set.

UCS-2 was a fixed-width set of code points. Even that idea has been
abandoned.
Richard Damon
2025-02-22 12:15:09 UTC
Post by David Brown
Post by Janis Papanagnou
[...]
Post by Janis Papanagnou
BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible for your context you have to map a handful of characters.
Latin-1 maps exactly to Unicode for the first 256 values.  Latin-9 does
not, which would make the translation more difficult.
Yes, that had already been pointed out upthread.
The (open) question is whether it makes sense to convert to "Latin 1"
only because it has a one-to-one mapping concerning the first UCS-2
characters, or if the underlying application of the OP wants support
of contemporary information by e.g. providing the € (Euro) sign with
"Latin 9".
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
the 8 characters that differ between Latin-1 and Latin-9.
If at all possible, it would be better to convert to UTF-8.  The
conversion is exact and reversible, and UTF-8 has largely superseded the
various Latin-* character encodings.
Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
while the ISO 8859-X family uses single-octet representations.
I'm curious why the OP needs ISO8859-1 and can't use UTF-8.
I think this, or why he can't use "Latin 9", are essential questions.
It seems to have got clear after a subsequent post of the OP; some
message/data source seems to provide characters from the upper planes
of Unicode and the OP has to (or wants to) somehow map them to some
constant octet character set. - Yet there's no information provided
what Unicode characters - characters that don't have a representation
in Latin 1 or Latin 9 - the OP will encounter or not from that source.
As it sounds it all seems to make little sense.
Janis
As the OP explained in a reply to one of my posts, he is getting data
in UCS-2 format from SMS's from a modem.  Somewhere along the line,
either the firmware in the modem or in the code sending the SMS's,
characters beyond the BMP are being used needlessly.  So it looks like
his first idea of manually handling a few cases (like code 0x2019) seems
like the right approach.
Small nit: not outside the BMP, just outside the ASCII/Latin-1 set.
Like so many other programs, it did a "pretty" transformation of a
simple single quotation mark into a fancy version.
Post by David Brown
Whether Latin-1 or Latin-9 is better will depend on his application. The
additional characters in Latin-9, with the exception of the Euro symbol,
are pretty obscure - it's unlikely that you'd need them and not need a
good deal more other characters (i.e., supporting much more of Unicode).
As for why not use UTF-8, the answer is clearly simplicity.  The OP is
working with a resource-constrained embedded system.  I don't know what
he is doing with the characters after converting them from UCS-2, but it
is massively simpler to use an 8-bit character set if they are going to
be used for display on a small system.  It also keeps memory management
simpler, and that is essential on such systems - one UCS-2 character
maps to one code unit with Latin-9 here.  The space needed for UTF-8 is
much harder to predict, and the OP will want to avoid any kind of
malloc() or dynamic allocation where possible.
If the incoming SMS's are just being logged, or passed out in some other
way, then UTF-8 may be a convenient alternative.
I would say the big difference is that an 8-bit character set needs to
store only 256 glyphs for its font. Converting to UTF-8 would still
require storing some massive font, and deciding exactly how massive it
needs to be.
David Brown
2025-02-22 13:12:28 UTC
Post by Richard Damon
Post by David Brown
As for why not use UTF-8, the answer is clearly simplicity.  The OP is
working with a resource-constrained embedded system.  I don't know
what he is doing with the characters after converting them from UCS-2,
but it is massively simpler to use an 8-bit character set if they are
going to be used for display on a small system.  It also keeps memory
management simpler, and that is essential on such systems - one UCS-2
character maps to one code unit with Latin-9 here.  The space needed
for UTF-8 is much harder to predict, and the OP will want to avoid any
kind of malloc() or dynamic allocation where possible.
If the incoming SMS's are just being logged, or passed out in some
other way, then UTF-8 may be a convenient alternative.
I would say the big difference is that an 8-bit character set needs to
store only 256 glyphs for its font. Converting to UTF-8 would still
require storing some massive font, and deciding exactly how massive it
needs to be.
Yes, exactly. A key point is what the OP is going to do with the text.
Janis Papanagnou
2025-02-22 15:43:55 UTC
[...] Like so many other programs, it did a "pretty" transformation of
a simple single quotation mark into a fancy version.
Good to put the "pretty" in quotes; I've seen so many "fancy versions",
one worse than the other. They are culture specific and on a terminal
they often look bad even in their native target form. For example “--”
in a man page (say, 'man awk') has a left and right slant respectively
and they are linear, but my newsreader shows them both in the same
direction but the one thicker at the bottom the other at the top. It's
similar with single quotes; here we often see accents used at one side
and a regular single quote at the other side. In 'man man' for example
we find even a comment on that in the description of option '--ascii'.
There's *tons* of such quoting characters for the various languages,
in my mother tongue there's even _more than one_ type used in printed
media. Single or double and left or right and bottom or top or mixed
or double or single angle brackets in opening and closing form, plus
the *misused* accent characters (which look worst, IMO, especially if
combined inconsistently with other forms).

I'm glad that in programming there's a bias on symmetric use of the
neutral forms " and ' (for strings and characters and other quoting)
and that things like accents ` and ´ *seem* to gradually vanish for
quoting purposes; e.g. shell `...` long superseded by $(...). Only
document contents occasionally still adhere to trashy use.

One thing I'd really like to understand is why folks have been mixing
accents with quotes, as in ``standard'' (also taken from 'man awk').

They may look acceptable in one display or printing device but become
a typographical catastrophe when viewed on another device type.

</rant>

Janis
--
0022;QUOTATION MARK
00AB;LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
00BB;RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
2018;LEFT SINGLE QUOTATION MARK
2019;RIGHT SINGLE QUOTATION MARK
201A;SINGLE LOW-9 QUOTATION MARK
201B;SINGLE HIGH-REVERSED-9 QUOTATION MARK
201C;LEFT DOUBLE QUOTATION MARK
201D;RIGHT DOUBLE QUOTATION MARK
201E;DOUBLE LOW-9 QUOTATION MARK
201F;DOUBLE HIGH-REVERSED-9 QUOTATION MARK
2039;SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A;SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
275B;HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
275C;HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
275D;HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
275E;HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
275F;HEAVY LOW SINGLE COMMA QUOTATION MARK ORNAMENT
2760;HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
276E;HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT
276F;HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT
2E42;DOUBLE LOW-REVERSED-9 QUOTATION MARK
301D;REVERSED DOUBLE PRIME QUOTATION MARK
301E;DOUBLE PRIME QUOTATION MARK
301F;LOW DOUBLE PRIME QUOTATION MARK
FF02;FULLWIDTH QUOTATION MARK
1F676;SANS-SERIF HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
1F677;SANS-SERIF HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
1F678;SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
E0022;TAG QUOTATION MARK

0027;APOSTROPHE
02BC;MODIFIER LETTER APOSTROPHE
02EE;MODIFIER LETTER DOUBLE APOSTROPHE
055A;ARMENIAN APOSTROPHE
FF07;FULLWIDTH APOSTROPHE
E0027;TAG APOSTROPHE
Richard Damon
2025-02-23 03:38:45 UTC
Post by Janis Papanagnou
[...] Like so many other programs, it did a "pretty" transformation of
a simple single quotation mark into a fancy version.
Good to put the "pretty" in quotes; I've seen so many "fancy versions",
one worse than the other. They are culture specific and on a terminal
they often look bad even in their native target form. For example “--”
in a man page (say, 'man awk') has a left and right slant respectively
and they are linear, but my newsreader shows them both in the same
direction but the one thicker at the bottom the other at the top. It's
similar with single quotes; here we often see accents used at one side
and a regular single quote at the other side. In 'man man' for example
we find even a comment on that in the description of option '--ascii'.
There's *tons* of such quoting characters for the various languages,
in my mother tongue there's even _more than one_ type used in printed
media. Single or double and left or right and bottom or top or mixed
or double or single angle brackets in opening and closing form, plus
the *misused* accent characters (which look worst, IMO, especially if
combined inconsistently with other forms).
I'm glad that in programming there's a bias on symmetric use of the
neutral forms " and ' (for strings and characters and other quoting)
and that things like accents ` and ´ *seem* to gradually vanish for
quoting purposes; e.g. shell `...` long superseded by $(...). Only
document contents occasionally still adhere to trashy use.
One thing I'd really like to understand is why folks have been mixing
accents with quotes, as in ``standard'' (also taken from 'man awk').
They may look acceptable in one display or printing device but become
a typographical catastrophe when viewed on another device type.
</rant>
Janis
I have more often seen not the "accents" but the curly quotes (open
and closed) that look more like elevated commas flipped around.

When used to "escape" a character (or quote a string as an extended
escape), people come up with all sorts of ideas, and sometimes strange
characters were chosen to minimize the need for ways to escape the
escape character.
Lawrence D'Oliveiro
2025-02-22 21:24:28 UTC
Post by Richard Damon
I would say the big difference is that an 8-bit character set needs to
only store 256 glyphs for its font.
Note that glyphs are not characters.
Kaz Kylheku
2025-02-23 00:02:32 UTC
Post by Lawrence D'Oliveiro
Post by Richard Damon
I would say the big difference is that an 8-bit character set needs to
only store 256 glyphs for its font.
Note that glyphs are not characters.
Unemployable shithead, note that Richard said "font". A font does assign
glyphs to abstract characters. The sentence is not the most precise we
can imagine, since character sets are not containers that store, but
it's not important here.

Gee, what are the odds you would fuck up an attempt to nit-pick someone
ten times your brain size?
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Waldek Hebisch
2025-02-22 23:44:49 UTC
Post by Richard Damon
Post by David Brown
Post by Janis Papanagnou
[...]
Post by Janis Papanagnou
BTW; you may want to consider using ISO 8859-15 (Latin 9) instead
of ISO 8859-1 (Latin 1); Latin 1 is widely outdated, and Latin 9
contains a few other characters like the € (Euro Sign). If that is
possible for your context you have to map a handful of characters.
Latin-1 maps exactly to Unicode for the first 256 values.  Latin-9 does
not, which would make the translation more difficult.
Yes, that had already been pointed out upthread.
The (open) question is whether it makes sense to convert to "Latin 1"
only because it has a one-to-one mapping concerning the first UCS-2
characters, or if the underlying application of the OP wants support
of contemporary information by e.g. providing the € (Euro) sign with
"Latin 9".
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15> includes a table showing
the 8 characters that differ between Latin-1 and Latin-9.
If at all possible, it would be better to convert to UTF-8.  The
conversion is exact and reversible, and UTF-8 has largely superseded the
various Latin-* character encodings.
Well, UTF-8 is a multi-octet _encoding_ for all Unicode characters,
while the ISO 8859-X family uses single-octet representations.
I'm curious why the OP needs ISO8859-1 and can't use UTF-8.
I think this, or why he can't use "Latin 9", are essential questions.
It seems to have got clear after a subsequent post of the OP; some
message/data source seems to provide characters from the upper planes
of Unicode and the OP has to (or wants to) somehow map them to some
constant octet character set. - Yet there's no information provided
what Unicode characters - characters that don't have a representation
in Latin 1 or Latin 9 - the OP will encounter or not from that source.
As it sounds it all seems to make little sense.
Janis
As the OP explained in a reply to one of my posts, he is getting data
in UCS-2 format from SMS's from a modem.  Somewhere along the line,
either the firmware in the modem or in the code sending the SMS's,
characters beyond the BMP are being used needlessly.  So it looks like
his first idea of manually handling a few cases (like code 0x2019) seems
like the right approach.
Small nit: not outside the BMP, just outside the ASCII/Latin-1 set.
Like so many other programs, it did a "pretty" transformation of a
simple single quotation mark into a fancy version.
Post by David Brown
Whether Latin-1 or Latin-9 is better will depend on his application. The
additional characters in Latin-9, with the exception of the Euro symbol,
are pretty obscure - it's unlikely that you'd need them and not need a
good deal more other characters (i.e., supporting much more of Unicode).
As for why not use UTF-8, the answer is clearly simplicity.  The OP is
working with a resource-constrained embedded system.  I don't know what
he is doing with the characters after converting them from UCS-2, but it
is massively simpler to use an 8-bit character set if they are going to
be used for display on a small system.  It also keeps memory management
simpler, and that is essential on such systems - one UCS-2 character
maps to one code unit with Latin-9 here.  The space needed for UTF-8 is
much harder to predict, and the OP will want to avoid any kind of
malloc() or dynamic allocation where possible.
If the incoming SMS's are just being logged, or passed out in some other
way, then UTF-8 may be a convenient alternative.
I would say the big difference is that an 8-bit character set needs to
store only 256 glyphs for its font. Converting to UTF-8 would still
require storing some massive font, and deciding exactly how massive it
will be.
Most European characters are an ASCII letter plus accents, which can
be stored quite efficiently. Korean requires a handful of basic
characters; the rest can be synthesised from them.

Full Unicode certainly requires a massive font, but a selected subset
may be possible with modest resources (though probably more than 256
positions).
--
Waldek Hebisch
Richard Damon
2025-02-22 01:05:22 UTC
Reply
Permalink
Post by pozz
Post by Richard Damon
Post by pozz
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
passing size because ucs2 isn't null terminated.
Typically UCS2 strings ARE null terminated; it's just that the null is
two bytes long.
Sure, but this isn't an issue here.
Post by Richard Damon
Post by pozz
I know I can use iconv() feature, but I'm on an embedded platform
without an OS and without iconv() function.
It is trivial to convert "0000"-"007F" chars: it's a simple cast from
unsigned int to char.
Note, I think you will find that it is 0000-00FF that match (as I
remember, ISO8859-1 was the base for starting Unicode).
Post by pozz
It isn't so simple to convert higher codes. For example, the small e
with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
trivial again. But I saw the code "2019" (apostrophe) that can be
rendered as 0x27 in ISO8859-1.
To be correct, U+2019 isn't 0x27; it's just a character that looks a
lot like it.
Yes, but as a first approximation, 0x27 is much better than '?' for U+2019.
And, as such is a subjective decision that you need to make.
Post by pozz
Post by Richard Damon
Post by pozz
Is there a simplified mapping table that can be written with if/switch?
if (code < 0x80) {
   *dst++ = (char)code;
} else {
   switch (code) {
     case 0x2019: *dst++ = 0x27; break;  // Apostrophe
     case 0x...: *dst++ = ...; break;
     default: *dst++ = ' ';
   }
}
I'm not searching a very detailed and correct mapping, but just a
"sufficient" implementation.
Then you have to decide which mappings are sufficient. No character
above 0xFF *IS* the character below it, but some have close
approximations, so you will need to decide what to map.
Yes, I have to decide, but it is a very big problem (there are thousands
of Unicode symbols that can be approximated to another ISO8859-1 code).
I'm wondering if such an approximation is just implemented somewhere.
For example, what does iconv() do in this case?
Just look at its code, there will be open source versions of it.

The two real options are to just reject anything above 0xFF, or to
have a big table/switch handling some determined list of things that
are "close enough".
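For concreteness, the reject-or-table choice can be sketched as a single mapping function. The transliterations below (curly quotes and dashes) are an illustrative selection, not a vetted list, and the '?' fallback is just one possible rejection policy:

```c
/* Map one UCS-2 code unit to ISO 8859-1.  Codes 0x00-0xFF translate
 * directly; a few punctuation characters are transliterated to a
 * nearby ASCII look-alike; everything else is rejected as '?'. */
static unsigned char ucs2_code_to_latin1(unsigned code)
{
    if (code <= 0xFF)              /* Latin-1 is Unicode's first 256 */
        return (unsigned char)code;

    switch (code) {
    case 0x2018:                   /* left single quotation mark  */
    case 0x2019: return 0x27;      /* right single quotation mark */
    case 0x201C:                   /* left double quotation mark  */
    case 0x201D: return 0x22;      /* right double quotation mark */
    case 0x2013:                   /* en dash                     */
    case 0x2014: return 0x2D;      /* em dash                     */
    default:     return '?';       /* no sensible approximation   */
    }
}
```

Extending the switch is then just a matter of deciding, case by case, which characters are "close enough".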
Lawrence D'Oliveiro
2025-02-22 03:00:31 UTC
Reply
Permalink
Post by pozz
Yes, I have to decide, but it is a very big problem (there are thousands
of Unicode symbols that can be approximated to another ISO8859-1 code).
I'm wondering if such an approximation is just implemented somewhere.
If you look at NamesList.txt, you will see, next to each character,
references to others that might be similar or related in some way.

They say not to try to parse that file automatically, but I’ve had some
success doing exactly that ... so far ...
Janis Papanagnou
2025-02-22 04:29:14 UTC
Reply
Permalink
Post by Lawrence D'Oliveiro
Post by pozz
Yes, I have to decide, but it is a very big problem (there are thousands
of Unicode symbols that can be approximated to another ISO8859-1 code).
I'm wondering if such an approximation is just implemented somewhere.
If you look at NamesList.txt, you will see, next to each character,
references to others that might be similar or related in some way.
They say not to try to parse that file automatically, but I’ve had some
success doing exactly that ... so far ...
I wonder why they say so, given that there's a syntax description
available on their pages (see the respective HTML file[*]).


BTW; curious about that [informal] part of the syntax description

LF: <any sequence of a single ASCII 0A or 0D, or both>

It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
Is the latter of any practical relevance?

Janis

[*] https://www.unicode.org/Public/UCD/latest/ucd/NamesList.html
Lawrence D'Oliveiro
2025-02-22 06:13:27 UTC
Reply
Permalink
Post by Janis Papanagnou
Post by Lawrence D'Oliveiro
They say not to try to parse that file automatically, but I’ve had some
success doing exactly that ... so far ...
I wonder why they say so, given that there's a syntax description
available on their pages (see the respective HTML file[*]).
The file itself says different
<https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt>:

This file is semi-automatically derived from UnicodeData.txt and a
set of manually created annotations using a script to select or
suppress information from the data file. The rules used for this
process are aimed at readability for the human reader, at the
expense of some details; therefore, this file should not be parsed
for machine-readable information.
Janis Papanagnou
2025-02-22 08:11:02 UTC
Reply
Permalink
Post by Lawrence D'Oliveiro
Post by Janis Papanagnou
Post by Lawrence D'Oliveiro
They say not to try to parse that file automatically, but I’ve had some
success doing exactly that ... so far ...
I wonder why they say so, given that there's a syntax description
available on their pages (see the respective HTML file[*]).
The file itself says different
This file is semi-automatically derived from UnicodeData.txt and a
set of manually created annotations using a script to select or
suppress information from the data file. The rules used for this
process are aimed at readability for the human reader, at the
expense of some details; therefore, this file should not be parsed
for machine-readable information.
I see, but I certainly wouldn't refrain from parsing it. (In the past
I had parsed much worse data; irregular HTML stuff and the like.)
OTOH, there's also the CSV data file available, which is even simpler
to parse with standard tools and no effort.

Janis
Lawrence D'Oliveiro
2025-02-22 21:22:20 UTC
Reply
Permalink
Post by Janis Papanagnou
Post by Lawrence D'Oliveiro
Post by Janis Papanagnou
Post by Lawrence D'Oliveiro
They say not to try to parse that file automatically, but I’ve had some
success doing exactly that ... so far ...
I wonder why they say so, given that there's a syntax description
available on their pages (see the respective HTML file[*]).
The file itself says different
This file is semi-automatically derived from UnicodeData.txt and a
set of manually created annotations using a script to select or
suppress information from the data file. The rules used for this
process are aimed at readability for the human reader, at the
expense of some details; therefore, this file should not be parsed
for machine-readable information.
I see, but I certainly wouldn't refrain from parsing it.
Particularly since the information on related code points doesn’t seem to
be available anywhere else.
James Kuyper
2025-02-23 05:01:37 UTC
Reply
Permalink
On 2/21/25 23:29, Janis Papanagnou wrote:
...
Post by Janis Papanagnou
BTW; curious about that [informal] part of the syntax description
LF: <any sequence of a single ASCII 0A or 0D, or both>
It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
Is the latter of any practical relevance?
According to <https://en.wikipedia.org/wiki/Newline#Representation>,
LF-CR is used by "Acorn BBC and RISC OS spooled text output". I presume
you would not consider that to be of any practical importance.
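As an aside, accepting all four conventions (LF, CR, CR-LF and LF-CR) needs only one character of lookahead; a minimal sketch, counting logical line breaks:

```c
#include <stddef.h>

/* Count logical line breaks, accepting LF, CR, CR-LF and LF-CR.
 * A CR or LF followed immediately by the *other* character is one
 * break; two of the same character in a row are two breaks. */
static size_t count_lines(const char *s)
{
    size_t lines = 0;
    size_t i = 0;
    while (s[i] != '\0') {
        char c = s[i++];
        if (c == '\n' || c == '\r') {
            if ((s[i] == '\n' || s[i] == '\r') && s[i] != c)
                i++;               /* swallow the paired character */
            lines++;
        }
    }
    return lines;
}
```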
Lawrence D'Oliveiro
2025-02-23 05:53:59 UTC
Reply
Permalink
Post by Janis Papanagnou
It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
Is the latter of any practical relevance?
Not to answer the question, but just to add to it; from the XML 1.1 spec
<https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-xml11>:

In addition, XML 1.0 attempts to adapt to the line-end conventions
of various modern operating systems, but discriminates against the
conventions used on IBM and IBM-compatible mainframes. As a
result, XML documents on mainframes are not plain text files
according to the local conventions. XML 1.0 documents generated on
mainframes must either violate the local line-end conventions, or
employ otherwise unnecessary translation phases before parsing and
after generation. Allowing straightforward interoperability is
particularly important when data stores are shared between
mainframe and non-mainframe systems (as opposed to being copied
from one to the other). Therefore XML 1.1 adds NEL (#x85) to the
list of line-end characters. For completeness, the Unicode line
separator character, #x2028, is also supported.
Kaz Kylheku
2025-02-23 07:03:04 UTC
Reply
Permalink
Post by Janis Papanagnou
LF: <any sequence of a single ASCII 0A or 0D, or both>
It looks like they accept not only LF, CR, CR-LF, but also LF-CR.
Is the latter of any practical relevance?
Because if Unicode people spot the slightest opportunity to add
pointless complexity to anything, they tend to pounce on it.

Why just specify one line ending convention, when you can require the
processor of the file to watch out for four different tokens denoting
the line break?
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
David Brown
2025-02-21 14:23:51 UTC
Reply
Permalink
Post by pozz
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing
size because ucs2 isn't null terminated.
I know I can use iconv() feature, but I'm on an embedded platform
without an OS and without iconv() function.
It is trivial to convert "0000"-"007F" chars: it's a simple cast from
unsigned int to char.
It isn't so simple to convert higher codes. For example, the small e
with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's trivial
again. But I saw the code "2019" (apostrophe) that can be rendered as
0x27 in ISO8859-1.
Is there a simplified mapping table that can be written with if/switch?
if (code < 0x80) {
  *dst++ = (char)code;
} else {
  switch (code) {
    case 0x2019: *dst++ = 0x27; break;  // Apostrophe
    case 0x...: *dst++ = ...; break;
     default: *dst++ = ' ';
  }
}
I'm not searching a very detailed and correct mapping, but just a
"sufficient" implementation.
<https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>

As has been mentioned by others, 0 - 0xff should be a direct translation
(with the possible exception of Latin-9 differences).

<https://en.wikipedia.org/wiki/ISO/IEC_8859-15>
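The eight differing positions are few enough to handle with a switch; a sketch, assuming the mapping from that table (note the Latin-1 characters displaced by Latin-9 get no output byte here):

```c
/* The eight positions where ISO 8859-15 (Latin-9) differs from
 * ISO 8859-1, per the table linked above.  Given a Unicode code
 * point, return its Latin-9 byte, or -1 if it has none. */
static int unicode_to_latin9(unsigned cp)
{
    switch (cp) {
    case 0x20AC: return 0xA4;  /* EURO SIGN                    */
    case 0x0160: return 0xA6;  /* S WITH CARON                 */
    case 0x0161: return 0xA8;  /* s with caron                 */
    case 0x017D: return 0xB4;  /* Z WITH CARON                 */
    case 0x017E: return 0xB8;  /* z with caron                 */
    case 0x0152: return 0xBC;  /* LIGATURE OE                  */
    case 0x0153: return 0xBD;  /* ligature oe                  */
    case 0x0178: return 0xBE;  /* Y WITH DIAERESIS             */
    /* The Latin-1 characters displaced by the above (currency
     * sign, broken bar, diaeresis, ...) have no Latin-9 slot. */
    case 0xA4: case 0xA6: case 0xA8: case 0xB4:
    case 0xB8: case 0xBC: case 0xBD: case 0xBE:
        return -1;
    default:
        return cp <= 0xFF ? (int)cp : -1;
    }
}
```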


When you look at the BMP blocks above the first two blocks (0 - 0x7f, 0x80
- 0xff), you will quickly see that virtually none of them make any sense
to support in the way you are thinking. Just because a couple of the
characters in the Thaana block look a bit like quotation marks, does not
mean it makes any sense to try to transliterate them. Realistically,
you can at most make use of a few punctuation symbols (like 0x2019
above), and maybe approximate forms for some extended Latin alphabet
characters that you will never see in practice. Oh, and you might be
able to support those spam emails that use Greek and other letters that
look like Latin letters such as "ՏΡ𐊠Ꮇ" to fool filters. And that's
assuming you have output support for the full Latin-1 or Latin-9 range.


Unicode is rarely much use unless you want and can provide good support
for non-Latin alphabets. Otherwise your translations are going to be so
limited and simple that they are barely worth the effort and won't cover
anything useful.


So here I would say that whoever provides the text should provide it in
Latin-9 encoding.  There's no point in allowing external translators to
use whatever characters they feel is best in their language, and then
your code makes some kind of odd approximation giving results that look
different. If someone really wants to use the letter "ā" that is found
in the Latin Extended A block, how do /you/ know whether the best
Latin-9 match is "a", "ã", "ä", or something different like "aa" or an
alternative spelling of the word? Maybe the rules are different for
Latvian and Anglicised Mandarin.


When we have worked with multiple languages on small embedded systems
(too small for big fonts and UTF-8), we have used one of three techniques:

1. Insist that the external translators provide strings in Latin-9 only
(or even just ASCII when the system was more restricted).

2. Use primarily ASCII, with a few user-defined characters per language
(that's useful for old-style character displays with space for perhaps 8
user-defined characters).

3. Use a PC program to figure out the characters actually used in the
strings, and put them into a single table indexing a generated list of
bitmap glyphs, also generated by the program (from freely available
fonts). The source is, naturally, UTF-8 - the strings stored in the
embedded system are not in any standard encoding representing
characters, but now hold glyph table indices.


Your idea here sounds to me like a lot of work for virtually no benefit.
pozz
2025-02-21 14:53:02 UTC
Reply
Permalink
Post by David Brown
Post by pozz
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
passing size because ucs2 isn't null terminated.
I know I can use iconv() feature, but I'm on an embedded platform
without an OS and without iconv() function.
It is trivial to convert "0000"-"007F" chars: it's a simple cast from
unsigned int to char.
It isn't so simple to convert higher codes. For example, the small e
with grave "00E8" can be converted to 0xE8 in ISO8859-1, so it's
trivial again. But I saw the code "2019" (apostrophe) that can be
rendered as 0x27 in ISO8859-1.
Is there a simplified mapping table that can be written with if/switch?
if (code < 0x80) {
   *dst++ = (char)code;
} else {
   switch (code) {
     case 0x2019: *dst++ = 0x27; break;  // Apostrophe
     case 0x...: *dst++ = ...; break;
     default: *dst++ = ' ';
   }
}
I'm not searching a very detailed and correct mapping, but just a
"sufficient" implementation.
<https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>
As has been mentioned by others, 0 - 0xff should be a direct translation
(with the possible exception of Latin-9 differences).
<https://en.wikipedia.org/wiki/ISO/IEC_8859-15>
When you look at the BMP blocks above the first two blocks (0 - 0x7f, 0x80
- 0xff), you will quickly see that virtually none of them make any sense
to support in the way you are thinking.  Just because a couple of the
characters in the Thaana block look a bit like quotation marks, does not
mean it makes any sense to try to transliterate them.  Realistically,
you can at most make use of a few punctuation symbols (like 0x2019
above), and maybe approximate forms for some extended Latin alphabet
characters that you will never see in practice.  Oh, and you might be
able to support those spam emails that use Greek and other letters that
look like Latin letters such as "ՏΡ𐊠Ꮇ" to fool filters.  And that's
assuming you have output support for the full Latin-1 or Latin-9 range.
Unicode is rarely much use unless you want and can provide good support
for non-Latin alphabets.  Otherwise your translations are going to be so
limited and simple that they are barely worth the effort and won't cover
anything useful.
So here I would say that whoever provides the text, provides it in
Latin-9 encoding.  There's no point in allowing external translators to
use whatever characters they feel is best in their language, and then
your code makes some kind of odd approximation giving results that look
different.  If someone really wants to use the letter "ā" that is found
in the Latin Extended A block, how do /you/ know whether the best
Latin-9 match is "a", "ã", "ä", or something different like "aa" or an
alternative spelling of the word?  Maybe the rules are different for
Latvian and Anglicised Mandarin.
When we have worked with multiple languages on small embedded systems
1. Insist that the external translators provide strings in Latin-9 only
(or even just ASCII when the system was more restricted).
2. Use primarily ASCII, with a few user-defined characters per language
(that's useful for old-style character displays with space for perhaps 8
user-defined characters).
3. Use a PC program to figure out the characters actually used in the
strings, and put them into a single table indexing a generated list of
bitmap glyphs, also generated by the program (from freely available
fonts).  The source is, naturally, UTF-8 - the strings stored in the
embedded system are not in any standard encoding representing
characters, but now hold glyph table indices.
Your idea here sounds to me like a lot of work for virtually no benefit.
Yes, you're right. My question comes from an SMS text received by a 4G
network modem. The reply to the AT+CMGR command for a specific SMS
reported the text in UCS2. The SMS was one sent by the mobile operator
with the balance of the prepaid SIM card.

The text included the apostrophe coded as U+2019 instead of U+0027. I
suspect the developer who wrote the text in the mobile operator's
systems was using UTF-8 (or UTF-16) and inserted exactly U+2019 (maybe
by mistake).

Anyway I think I can live without that.
Keith Thompson
2025-02-21 19:45:36 UTC
Reply
Permalink
Post by pozz
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
passing size because ucs2 isn't null terminated.
Is the UCS-2 really represented as a sequence of ASCII hex digits?

In actual UCS-2, each character is 2 bytes. The representation for
"Hello" would be 10 bytes, either "\0H\0e\0l\0l\0o" or
"H\0e\0l\0l\0o\0", depending on endianness. (UCS-2 is a subset of
UTF-16; the latter uses longer sequences to represent characters
outside the Basic Multilingual Plane.)
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
David Brown
2025-02-22 13:18:03 UTC
Reply
Permalink
Post by Keith Thompson
Post by pozz
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm
passing size because ucs2 isn't null terminated.
Is the UCS-2 really represented as a sequence of ASCII hex digits?
In actual UCS-2, each character is 2 bytes. The representation for
"Hello" would be 10 bytes, either "\0H\0e\0l\0l\0o" or
"H\0e\0l\0l\0o\0", depending on endianness. (UCS-2 is a subset of
UTF-16; the latter uses longer sequences to represent characters
outside the Basic Multilingual Plane.)
My understanding here is that the OP is getting the UCS-2 encoded string
in from a modem, almost certainly on a serial line. The UCS-2 encoded
data is itself a binary sequence of 16-bit code units, and the modem
firmware is sending those as four hex digits. This is a very common way
to handle transmission of binary data in such systems - there is no need
for escapes or other complications to delimit the binary data. I would
expect that the entire incoming message will be comma-separated fields
with the time and date, sender's telephone number, and so on, as well as
the text itself as this long hex string.
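Decoding that hex string back into 16-bit code units is then straightforward; a minimal sketch, assuming big-endian digit order per code unit as in the "Hello" example upthread ("0048" -> U+0048):

```c
#include <stddef.h>

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode a hex-encoded UCS-2 field ("00480065..." for "He...") into
 * 16-bit code units.  Returns the number of code units written, or
 * -1 on malformed input or overflow of the output buffer. */
static int hex_to_ucs2(const char *hex, size_t len, unsigned *out, size_t max)
{
    size_t n = 0;
    if (len % 4 != 0)
        return -1;                 /* each code unit is 4 hex digits */
    for (size_t i = 0; i < len; i += 4) {
        int d0 = hexval(hex[i]),     d1 = hexval(hex[i + 1]);
        int d2 = hexval(hex[i + 2]), d3 = hexval(hex[i + 3]);
        if (d0 < 0 || d1 < 0 || d2 < 0 || d3 < 0 || n >= max)
            return -1;
        out[n++] = (unsigned)(d0 << 12 | d1 << 8 | d2 << 4 | d3);
    }
    return (int)n;
}
```

The decoded units can then be fed one at a time into whatever Latin-1/Latin-9 mapping is chosen.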
Kaz Kylheku
2025-02-22 01:20:20 UTC
Reply
Permalink
Post by pozz
void ucs2_to_iso8859p1(char *ucs2, size_t size);
ucs2 is a string of type "00480065006C006C006F" for "Hello". I'm passing
size because ucs2 isn't null terminated.
This kind of normalizing is a good way of introducing injection
exploits.

Suppose the input is some syntax that has been validated; the decision
is trusted after that. The normalization to the 8-bit character set can
produce characters which are special in the syntax, changing its
meaning.

In Microsoft Windows, there is an example of such a problem. Programs
which use GetCommandLineA to get the argument string before parsing it
into arguments are vulnerable to argument injection. The attacker
specifies a piece of datum to be used by program A as an argument in
calling program B such that when the datum is decimated to the 8 bit
character set, quotes appear in it, creating additional arguments to
program B.
Post by pozz
again. But I saw the code "2019" (apostrophe) that can be rendered as
0x27 in ISO8859-1.
... and that's a common quoting character in various data syntaxes, oops!
What could go wrong?

I think in 2025 we shouldn't have to be crippling Unicode data to fit
some ISO Latin (or any other 8 bit) character set; we should be rooting
out technologies and situations which do that.
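The hazard can be shown in a few lines: a validator that checks the
UCS-2 data for the ASCII apostrophe passes a string containing U+2019,
and the narrowing step then manufactures exactly the character the
validator looked for. The function names here are invented for
illustration:

```c
#include <stddef.h>

/* Narrow one UCS-2 code unit to an 8-bit set, with the 0x2019
 * transliteration discussed upthread. */
static unsigned char narrow(unsigned code)
{
    if (code <= 0xFF) return (unsigned char)code;
    if (code == 0x2019) return 0x27;
    return '?';
}

/* A validator that runs *before* narrowing: does the UCS-2 string
 * contain an ASCII apostrophe? */
static int has_ascii_quote(const unsigned *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (s[i] == 0x27) return 1;
    return 0;
}
```

For input { 'a', 0x2019, 'b' } the validator reports no quote, yet the
narrowed output contains 0x27 - the injected character appears only
after the trust decision has been made.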
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca