Discussion:
C' (C Prime) - C language with extensions
Thiago Adams
2017-07-21 13:33:40 UTC
I have seen some topics about C x C++.

I would like to hear your thoughts about a new language, derived from
and compatible with C11, called C' that I am creating.

The output is C89 code. It's open source.

Basic concepts:
I have a new type of function-definition.

Original C grammar:

function-definition:
declaration-specifiers declarator declaration-list_opt compound-statement

C'

function-definition:
user-function-definition
default-function-definition

user-function-definition:
declaration-specifiers declarator declaration-list_opt compound-statement

default-function-definition:
declaration-specifiers declarator declaration-list_opt default ;

-------

Sample:

void F(X* pX) default;

This is a "default-function-definition"

When the compiler parses a default-function-definition, it starts the
process of instantiation (similar to C++ explicit template instantiation).
The decision about what is instantiated is based on the function's result type, name and arguments.

Unlike C++, the information about how to generate code
is not inside the source code; it works as compiler plugins.

There are some built-in generators.

typedef struct
{
    int i;
} X;

//normal function declaration (e.g. in a header)
X* X_Create();

//a "default-function-definition" can be placed in the .c file
X* X_Create() default;

X_Create will generate the function needed to instantiate X on the heap.

The built-in functions are (so far): _Init, _Destroy, _Create, _Delete.
These functions are very similar to the C++ constructor, destructor, new and delete.
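
For illustration only - this is hand-written here, not the exact generator output - the C89 code instantiated for X_Create above could look roughly like this:

#include <stdlib.h> /* malloc */

X* X_Create(void)
{
    X* pX = (X*) malloc(sizeof(X));
    if (pX != NULL)
    {
        pX->i = 0; /* the _Init part: each declarator gets a default value */
    }
    return pX;
}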

More functions could be added as built-ins, for instance _Compare.
_Compare is a function that compares each declarator of the struct.
At this point it is already possible to notice some differences from C++.
It's impossible to create a function template Compare in C++ because we cannot iterate over data types.
The same goes for the destructor, but the C++ destructor is something built into C++.
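
To illustrate, a _Compare instantiation for the struct X above could emit something along these lines (hand-written sketch, not the real generator output):

int X_Compare(const X* pA, const X* pB)
{
    /* compare each declarator of the struct, in declaration order */
    if (pA->i < pB->i) return -1;
    if (pA->i > pB->i) return 1;
    return 0;
}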

The advantage of instantiation by plugins is that you can write your
own code that has access to the AST. For instance, we can create a function to serialize objects, and more.
(There are some disadvantages, of course.)
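
For example, a serialization plugin walking the AST of the struct X above could emit something like this (the name X_Serialize is just an illustration, not a defined built-in):

#include <stdio.h>

void X_Serialize(const X* pX, FILE* f)
{
    /* one line per declarator found in the AST */
    fprintf(f, "i=%d\n", pX->i);
}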

Another feature added is the 'auto' type qualifier.
This qualifier instructs the code generator for Destroy to include the declarator.

For instance:

typedef struct
{
    Y* auto pY;
} X;

void X_Destroy(X* pX) default;

//the generated code will call

Y_Delete(pX->pY); //because pY is auto.

(This qualifier will probably be used in more cases in the future, for instance for static analysis.)
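
To make the example concrete, the generated X_Destroy could look roughly like this (hand-written sketch, not the exact output):

void X_Destroy(X* pX)
{
    Y_Delete(pX->pY); /* called because pY is declared 'auto' */
}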


The other added concept is for types.

C grammar

struct-or-union-specifier:
struct-or-union identifier_opt { struct-declaration-list }
struct-or-union identifier

C' grammar

struct-or-union-specifier:
struct-or-union identifier_opt { struct-declaration-list }
struct-or-union identifier struct-or-union-tag_opt

struct-or-union-tag:
identifier ( struct-or-union-tag-arguments )

struct-or-union-tag-arguments:
struct-or-union-tag-argument
struct-or-union-tag-argument , struct-or-union-tag-arguments

struct-or-union-tag-argument:
type-name
constant
string-literal
(I will probably add more as necessary.)


Samples:

struct Tokens List(Token);
typedef struct List(Token) Tokens;

The 'List' identifier is the key for the type instantiation.
At the beginning, type instantiations will focus on structs, to help with the creation
of containers. (I have considered doing the same for other types, like int.)

(I am also considering a way to add more struct declarators after type instantiation; this could be used to create derived structs.)


typedef struct List(Token* auto) Tokens;
void Tokens_Add(Tokens* p, Token* pItem) default;

Now the instantiation of Tokens_Add can read from the AST all the information in the
struct-or-union-tag-arguments and generate the function.
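
To give an idea of the result, the List(Token* auto) instantiation plus Tokens_Add could expand to C89 code along these lines (hand-written sketch; the member names pItems, Size and Capacity are only illustrative, and Token is assumed to be defined elsewhere):

#include <stdlib.h> /* realloc */

typedef struct
{
    Token** pItems; /* element type comes from the tag argument Token* auto */
    int Size;
    int Capacity;
} Tokens;

void Tokens_Add(Tokens* p, Token* pItem)
{
    if (p->Size == p->Capacity)
    {
        int n = p->Capacity == 0 ? 4 : p->Capacity * 2;
        Token** pNew = (Token**) realloc(p->pItems, n * sizeof(Token*));
        if (pNew == NULL)
        {
            return; /* allocation failed; real generated code may handle this differently */
        }
        p->pItems = pNew;
        p->Capacity = n;
    }
    p->pItems[p->Size] = pItem; /* 'auto': the container now owns pItem */
    p->Size++;
}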

I am planning to have some built-in containers (array, map and list) at 1.0.



Another built-in concept is Union types.

typedef struct Union(Box, Circle) Shape;

The 'Union' identifier is the key for instantiating a type that has the same beginning layout as Box and Circle.

Let's say Box and Circle each have an int as their first declarator; then Shape will have it too. (The first declarators must have initialization values different from each other - initialization for struct declarators is also an extension.)

When function instantiation is applied to Union types, the code generation will generate a type selection and call the appropriate function on Box or Circle.

void Shape_Draw(Shape*p) default;
//will call Box_Draw or Circle_Draw.
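
Sketching the generated dispatch by hand (the field name Type and the values BoxTypeId/CircleTypeId are made-up names for the first declarator and its distinct initialization values; the real output may differ):

void Shape_Draw(Shape* p)
{
    /* the first declarator works as the type selector */
    switch (p->Type)
    {
        case BoxTypeId: Box_Draw((Box*) p); break;
        case CircleTypeId: Circle_Draw((Circle*) p); break;
    }
}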


Another difference from C++ template instantiation is that instantiations in C' don't trigger other instantiations.

In C++, if the instantiation of F1() depends on F2(), which depends on F3(), then F2 and F3 will be instantiated automatically when F1 is instantiated.

In C', if F1 depends on F2, the compiler will ask the user to instantiate F2 manually. Some built-in instantiations, like Delete and Create, depend on Init and Destroy. But in this case, if Init or Destroy is missing, I can generate the code 'in-line' and also warn the user to add the instantiation, because that can decrease code size.
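
For example, if X_Destroy had not been instantiated, the generated X_Delete could in-line the destroy code, roughly like this (hand-written sketch):

#include <stdlib.h> /* free */

void X_Delete(X* pX)
{
    if (pX != NULL)
    {
        Y_Delete(pX->pY); /* Destroy generated in-line because X_Destroy is missing */
        free(pX);
    }
}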


--
The areas I have tried to improve so far are:
- Constructor/destructor/Create/Delete generation
- Containers
- One form of polymorphism (more can be added)

Having these features working fine, I am planning to bootstrap the
compiler (which is written in C) and compare the number of lines saved.
This will be version 0.9.

Some videos:




The project is open for anyone to join.
GOTHIER Nathan
2017-07-21 14:49:24 UTC
On Fri, 21 Jul 2017 06:33:40 -0700 (PDT)
Post by Thiago Adams
I would like to hear your thoughts about a new language derived
and compatible with from C11 called C' that I am creating.
Yet another counterfeit C... you should call it BL, for Bad Language.

In my view, your BL gives no solution to C's issues but only mimics C++.
Chad
2017-07-21 16:35:45 UTC
And don't forget that we have CAlive.
Rick C. Hodgin
2017-07-21 16:45:42 UTC
Post by Chad
And don't forget that we have CAlive.
Not yet. But, with your help ... it can be reality.

https://groups.google.com/forum/#!forum/caliveprogramminglanguage

http://www.libsf.org

"In God's sight, we've come together.
We've come together to help each other.
Let's grow this project up ... together!
In service and Love to The Lord, forever!"

We are all a team, Chad. We are just not yet all realizing this to
be true.

Thank you,
Rick C. Hodgin
Chris M. Thomasson
2017-07-21 23:22:21 UTC
Post by Rick C. Hodgin
Post by Chad
And don't forget that we have CAlive.
Not yet. But, with your help ... it can be reality.
I wonder how you can work with somebody who might use a curse word, like
"damn"? Afaict, it makes you angry.
Post by Rick C. Hodgin
https://groups.google.com/forum/#!forum/caliveprogramminglanguage
http://www.libsf.org
"In God's sight, we've come together.
We've come together to help each other.
Let's grow this project up ... together!
In service and Love to The Lord, forever!"
We are all a team, Chad. We are just not yet all realizing this to
be true.
Thank you,
Rick C. Hodgin
Thiago Adams
2017-07-21 17:00:18 UTC
Post by GOTHIER Nathan
On Fri, 21 Jul 2017 06:33:40 -0700 (PDT)
Post by Thiago Adams
I would like to hear your thoughts about a new language derived
and compatible with from C11 called C' that I am creating.
Yet another counterfeited C... you should call it BL for the Bad Language.
In my view, your BL give no solution to C issues but only mimic C++.
It is important to understand that C' can be seen as a tool.
If your input is C, your output is C.
You can write your own 'default' implementation for a function.
The built-in instantiations are something that I consider helpful
for myself. Even if you dislike the way some default instantiation works, you can ignore it and create another one.

It doesn't require decades of implementation and testing.
As soon as the output helps you, the result is positive.

Your code formatting, your comments and style will still be there.
If you leave C', you still have your C code.

You don't replace C with something else. You have a mechanism to write
your patterns.
b***@gmail.com
2017-07-22 03:06:23 UTC
What problem are you trying to solve? What issues are you trying to resolve?

I mean I'd be interested, but only if it went in the opposite direction of C++ and Rust.

C could certainly use a few new features, like UTF-8 support, localization, expanded generics, returning multiple variables at once, and a few more I haven't thought of.

But here's what C++ and Rust get wrong: simplicity is C's best feature, and they go whole hog adding stupid shit that people can make themselves.

The language is just a framework (like a legit scaffold, not a library "framework") to allow others to build whatever they can imagine and more; it's not some toolkit to be used directly, like most languages are going for these days.
Chad
2017-07-22 09:43:26 UTC
UTF-8 blows. And not that well either.
Bill Carson
2017-07-22 09:49:07 UTC
Post by Chad
UTF-8 blows. And not that well either.
That's a bold statement. It's pretty much the standard these days. What
would you propose instead?
Malcolm McLean
2017-07-22 10:40:18 UTC
Post by Bill Carson
Post by Chad
UTF-8 blows. And not that well either.
That's a bold statement. It's pretty much the standard these days. What
would you propose instead?
I don't think anyone would have invented utf-8, if early computers hadn't
used dumb terminals and character-mapped raster displays which made it
natural to use an 8 bit code (and of course since computers were invented
in the English-speaking world, originally only English was supported).
Ben Bacarisse
2017-07-22 12:54:47 UTC
Post by Malcolm McLean
Post by Bill Carson
Post by Chad
UTF-8 blows. And not that well either.
That's a bold statement. It's pretty much the standard these days. What
would you propose instead?
I don't think anyone would have invented utf-8, if early computers hadn't
used dumb terminals and character-mapped raster displays which made it
natural to use an 8 bit code
It was natural to use the shortest practical code. Many early machines
had 6-bit character codes. I'd say it was the severe limits on storage
space that determined the character size though existing teletype codes
must have been a factor too.
Post by Malcolm McLean
(and of course since computers were invented
in the English-speaking world, originally only English was supported).
At least one (the 1940s EDSAC) was invented in a world where a classical
education was the norm, because it had Theta, Phi, Delta and Pi in its
otherwise very limited character set!
--
Ben.
bartc
2017-07-22 13:04:48 UTC
Post by Ben Bacarisse
Post by Malcolm McLean
Post by Bill Carson
Post by Chad
UTF-8 blows. And not that well either.
That's a bold statement. It's pretty much the standard these days. What
would you propose instead?
I don't think anyone would have invented utf-8, if early computers hadn't
used dumb terminals and character-mapped raster displays which made it
natural to use an 8 bit code
It was natural to use the shortest practical code. Many early machines
had 6-bit character codes. I'd say it was the severe limits on storage
space that determined the character size though existing teletype codes
must have been a factor too.
Post by Malcolm McLean
(and of course since computers were invented
in the English-speaking world, originally only English was supported).
At least one (the 1940s EDSAC) was invented in a world where a classical
education was the norm because it had Theta, Phi Delta and Pi in it's
otherwise very limited character set!
Probably mathematical rather than classical...

And actually, fully representing A-Z may not even have been a priority
if the purpose was to solve equations.
Bill Carson
2017-07-22 14:44:15 UTC
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers
hadn't used dumb terminals and character-mapped raster displays which
made it natural to use an 8 bit code (and of course since computers were
invented in the English-speaking world, originally only English was
supported).
But early machines used 6 and 7 bit codes; even the 36-bit minicomputers
did. I don't see what's particularly wrong with UTF-8. It offers many of
the same benefits as ASCII (easy to filter alphanumeric characters, easy
to map a char to the number it represents or do simple arithmetic, easy
to map from lowercase to uppercase, etc.). It does favor Latin-script
languages, but that is what more than 70% of the world population uses
(if I can quote Wikipedia on that).
Most of my woes are related to Unicode rather than its encoding, like
having multiple representations for the same string.
GOTHIER Nathan
2017-07-22 15:26:14 UTC
On Sat, 22 Jul 2017 14:44:15 -0000 (UTC)
Post by Bill Carson
But early machines used 6 and 7 bit codes, even the 36-bit minicomputers
did. I don't see what's particularly wrong with utf8? It offers many of
the benefits as ascii does (easy to filter alphanumeric characters, easy
to map a char to the number it represents or do simple arithmetic, easy
to map from lowercase to uppercase, etc.). It does favor latin script
languages, but that is what more than 70% of the world population uses
(if I can quote Wikipedia on that).
Most of my woes are related to unicode rather than its encoding, like
having multiple representations for the same string.
Indeed. I am more bitter about current pixel encodings.

Actually, there is no portable encoding for easily processing graphics.

If someone has an interest in the field, I am open to discussion...
Jerry Stuckle
2017-07-22 21:26:47 UTC
Post by Bill Carson
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers
hadn't used dumb terminals and character-mapped raster displays which
made it natural to use an 8 bit code (and of course since computers were
invented in the English-speaking world, originally only English was
supported).
But early machines used 6 and 7 bit codes, even the 36-bit minicomputers
did. I don't see what's particularly wrong with utf8? It offers many of
the benefits as ascii does (easy to filter alphanumeric characters, easy
to map a char to the number it represents or do simple arithmetic, easy
to map from lowercase to uppercase, etc.). It does favor latin script
languages, but that is what more than 70% of the world population uses
(if I can quote Wikipedia on that).
Most of my woes are related to unicode rather than its encoding, like
having multiple representations for the same string.
There was even a 5 bit code in use on very early systems.
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
Chad
2017-07-23 02:15:35 UTC
Post by Jerry Stuckle
Post by Bill Carson
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers
hadn't used dumb terminals and character-mapped raster displays which
made it natural to use an 8 bit code (and of course since computers were
invented in the English-speaking world, originally only English was
supported).
But early machines used 6 and 7 bit codes, even the 36-bit minicomputers
did. I don't see what's particularly wrong with utf8? It offers many of
the benefits as ascii does (easy to filter alphanumeric characters, easy
to map a char to the number it represents or do simple arithmetic, easy
to map from lowercase to uppercase, etc.). It does favor latin script
languages, but that is what more than 70% of the world population uses
(if I can quote Wikipedia on that).
Most of my woes are related to unicode rather than its encoding, like
having multiple representations for the same string.
There was even a 5 bit code in use on very early systems.
I thought you got banned for being you.
Jerry Stuckle
2017-07-23 02:25:02 UTC
Post by Chad
Post by Jerry Stuckle
Post by Bill Carson
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers
hadn't used dumb terminals and character-mapped raster displays which
made it natural to use an 8 bit code (and of course since computers were
invented in the English-speaking world, originally only English was
supported).
But early machines used 6 and 7 bit codes, even the 36-bit minicomputers
did. I don't see what's particularly wrong with utf8? It offers many of
the benefits as ascii does (easy to filter alphanumeric characters, easy
to map a char to the number it represents or do simple arithmetic, easy
to map from lowercase to uppercase, etc.). It does favor latin script
languages, but that is what more than 70% of the world population uses
(if I can quote Wikipedia on that).
Most of my woes are related to unicode rather than its encoding, like
having multiple representations for the same string.
There was even a 5 bit code in use on very early systems.
I thought you got banned for being you.
Naw, that's just you being your normal asshole. Hasn't SalesFarce wised
up and fired your ass yet?
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
Chad
2017-07-23 13:12:52 UTC
Post by Jerry Stuckle
Post by Chad
Post by Jerry Stuckle
Post by Bill Carson
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers
hadn't used dumb terminals and character-mapped raster displays which
made it natural to use an 8 bit code (and of course since computers were
invented in the English-speaking world, originally only English was
supported).
But early machines used 6 and 7 bit codes, even the 36-bit minicomputers
did. I don't see what's particularly wrong with utf8? It offers many of
the benefits as ascii does (easy to filter alphanumeric characters, easy
to map a char to the number it represents or do simple arithmetic, easy
to map from lowercase to uppercase, etc.). It does favor latin script
languages, but that is what more than 70% of the world population uses
(if I can quote Wikipedia on that).
Most of my woes are related to unicode rather than its encoding, like
having multiple representations for the same string.
There was even a 5 bit code in use on very early systems.
I thought you got banned for being you.
Naw, that's just you being your normal asshole. Hasn't SalesFarce wised
up and fired your ass yet?
--
Nope. Unlike some people, I know enough about what is going on in order to avoid the firing squad.
Jerry Stuckle
2017-07-23 13:26:33 UTC
Post by Chad
Post by Jerry Stuckle
Post by Chad
Post by Jerry Stuckle
Post by Bill Carson
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers
hadn't used dumb terminals and character-mapped raster displays which
made it natural to use an 8 bit code (and of course since computers were
invented in the English-speaking world, originally only English was
supported).
But early machines used 6 and 7 bit codes, even the 36-bit minicomputers
did. I don't see what's particularly wrong with utf8? It offers many of
the benefits as ascii does (easy to filter alphanumeric characters, easy
to map a char to the number it represents or do simple arithmetic, easy
to map from lowercase to uppercase, etc.). It does favor latin script
languages, but that is what more than 70% of the world population uses
(if I can quote Wikipedia on that).
Most of my woes are related to unicode rather than its encoding, like
having multiple representations for the same string.
There was even a 5 bit code in use on very early systems.
I thought you got banned for being you.
Naw, that's just you being your normal asshole. Hasn't SalesFarce wised
up and fired your ass yet?
--
Nope. Unlike some people, I know enough about what is going on in order to avoid the firing squad.
ROFLMAO! Yup, on your knees and his zipper is open.
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
Chad
2017-07-24 00:15:23 UTC
Post by Jerry Stuckle
Post by Chad
Post by Jerry Stuckle
Post by Chad
Post by Jerry Stuckle
Post by Bill Carson
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers
hadn't used dumb terminals and character-mapped raster displays which
made it natural to use an 8 bit code (and of course since computers were
invented in the English-speaking world, originally only English was
supported).
But early machines used 6 and 7 bit codes, even the 36-bit minicomputers
did. I don't see what's particularly wrong with utf8? It offers many of
the benefits as ascii does (easy to filter alphanumeric characters, easy
to map a char to the number it represents or do simple arithmetic, easy
to map from lowercase to uppercase, etc.). It does favor latin script
languages, but that is what more than 70% of the world population uses
(if I can quote Wikipedia on that).
Most of my woes are related to unicode rather than its encoding, like
having multiple representations for the same string.
There was even a 5 bit code in use on very early systems.
I thought you got banned for being you.
Naw, that's just you being your normal asshole. Hasn't SalesFarce wised
up and fired your ass yet?
--
Nope. Unlike some people, I know enough about what is going on in order to avoid the firing squad.
ROFLMAO! Yup, on your knees and his zipper is open.
You appear to know something about this. Speaking from personal experience?
Jerry Stuckle
2017-07-24 00:24:59 UTC
Post by Chad
Post by Jerry Stuckle
Post by Chad
Post by Jerry Stuckle
Post by Chad
Post by Jerry Stuckle
Post by Bill Carson
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers
hadn't used dumb terminals and character-mapped raster displays which
made it natural to use an 8 bit code (and of course since computers were
invented in the English-speaking world, originally only English was
supported).
But early machines used 6 and 7 bit codes, even the 36-bit minicomputers
did. I don't see what's particularly wrong with utf8? It offers many of
the benefits as ascii does (easy to filter alphanumeric characters, easy
to map a char to the number it represents or do simple arithmetic, easy
to map from lowercase to uppercase, etc.). It does favor latin script
languages, but that is what more than 70% of the world population uses
(if I can quote Wikipedia on that).
Most of my woes are related to unicode rather than its encoding, like
having multiple representations for the same string.
There was even a 5 bit code in use on very early systems.
I thought you got banned for being you.
Naw, that's just you being your normal asshole. Hasn't SalesFarce wised
up and fired your ass yet?
--
Nope. Unlike some people, I know enough about what is going on in order to avoid the firing squad.
ROFLMAO! Yup, on your knees and his zipper is open.
You appear to know something about this. Speaking from personal experience?
Just what I've been told, Chad. Just what I've been told. I'll let you
figure out who told me.
--
==================
Remove the "x" from my email address
Jerry Stuckle
***@attglobal.net
==================
s***@casperkitty.com
2017-07-22 22:07:39 UTC
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers hadn't
used dumb terminals and character-mapped raster displays which made it
natural to use an 8 bit code (and of course since computers were invented
in the English-speaking world, originally only English was supported).
The majority of text processed by computers is ASCII, even in countries that
use non-Latin alphabets, because the majority of text processed by computers
is *for consumption by other computers*. I don't really have any major squawk
with UTF-8 itself. Unicode, however, has many problems far more fundamental
than its encoding.

The designers of Unicode simultaneously try to insist that it is a character
encoding rather than a markup language, while trying to kludge it into all
sorts of special-case behaviors which should really be handled via limited
forms of markup.

A good form of textual representation should make it easy to take a piece of
text, determine how much will fit in a given amount of space, and produce a
list of graphemes/clusters to display, ordered by layout position, without
having to have specialized knowledge of all the characters involved. It
should also ensure that no slice taken from a string could appear to be a
semantically unrelated but seemingly-valid string. Unicode fails in such
regards, and seems to constantly be getting worse. Rather than saying that
certain combinations of Man, Woman, Boy, and Girl should be combined in a
2x2 block if connector characters appear between them, it would be easier
and more useful to define a general "combine up to four things in a block"
markup and then have it work on any combination of four characters. That
would entail defining "markup", though, rather than just defining an absurd
number of character-joining rules.
Öö Tiib
2017-07-22 23:06:42 UTC
Post by s***@casperkitty.com
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers hadn't
used dumb terminals and character-mapped raster displays which made it
natural to use an 8 bit code (and of course since computers were invented
in the English-speaking world, originally only English was supported).
The majority of text processed by computers is ASCII, even in countries that
use non-Latin alphabets, because the majority of text processed by computers
is *for consumption by other computers*. I don't really have any major squawk
with UTF-8 itself. Unicode, however, has many problems far more fundamental
than its encoding.
Citation, please?
I have the impression that the vast majority of text processing that
is going on is people browsing, and various services processing texts that
are up on the internet. Do you suggest something else?
That site says that close to 90% of the text that is up on the internet right now
is UTF-8. https://w3techs.com/technologies/details/en-utf8/all/all
bartc
2017-07-23 00:45:10 UTC
Post by Öö Tiib
Post by s***@casperkitty.com
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers hadn't
used dumb terminals and character-mapped raster displays which made it
natural to use an 8 bit code (and of course since computers were invented
in the English-speaking world, originally only English was supported).
The majority of text processed by computers is ASCII, even in countries that
use non-Latin alphabets, because the majority of text processed by computers
is *for consumption by other computers*. I don't really have any major squawk
with UTF-8 itself. Unicode, however, has many problems far more fundamental
than its encoding.
Citation please?
I have impression that vast majority of text processing that
is going on is people browsing and various services processing texts that
are up in internet. Do you suggest something else?
That site says that close to 90 of text what is up in internet right now
is UTF-8. https://w3techs.com/technologies/details/en-utf8/all/all
I wonder what that means?

It might simply be that the char-encoding attribute of a web page
happens to be set to UTF-8 whether or not it uses any code points
outside of ASCII.

Anyway, I wouldn't find it surprising that most text processing - if it
can somehow be measured - is in fact of ASCII. Text processing can
include compiling C source code, which itself might include piping a
text/ASM intermediate representation to different parts of a compiler.

And then there are newer formats that make heavy use of text, such as XML,
which are generally processed by programs.

(I just looked at a random web page (from imdb.com), and it looked
mostly ASCII to me. Actually most of it was Javascript and HTML.

I did a histogram of the byte values, and counts from 126 to 255 were
all zeros.

I then did the same thing on a long article in Arabic about Unicode. The
page source still used ASCII for nearly 80% of the bytes (and presumably
for more than 80% of the characters since Unicode sequences are
multi-byte). So ASCII still seems to be used significantly.)
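
(For what it's worth, the histogram itself is only a few lines of C; roughly something like this, reading the saved page source on stdin - a rough sketch, not necessarily the exact program I used:)

#include <stdio.h>

int main(void)
{
    unsigned long counts[256] = {0};
    int c, i;

    /* count how often each byte value occurs in the input */
    while ((c = getchar()) != EOF)
        counts[c]++;

    /* print the non-zero counts */
    for (i = 0; i < 256; i++)
        if (counts[i] != 0)
            printf("%3d: %lu\n", i, counts[i]);

    return 0;
}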
--
bartc
Öö Tiib
2017-07-23 07:05:12 UTC
Post by bartc
Post by Öö Tiib
Post by s***@casperkitty.com
Post by Malcolm McLean
I don't think anyone would have invented utf-8, if early computers hadn't
used dumb terminals and character-mapped raster displays which made it
natural to use an 8 bit code (and of course since computers were invented
in the English-speaking world, originally only English was supported).
The majority of text processed by computers is ASCII, even in countries that
use non-Latin alphabets, because the majority of text processed by computers
is *for consumption by other computers*. I don't really have any major squawk
with UTF-8 itself. Unicode, however, has many problems far more fundamental
than its encoding.
Citation please?
I have impression that vast majority of text processing that
is going on is people browsing and various services processing texts that
are up in internet. Do you suggest something else?
That site says that close to 90 of text what is up in internet right now
is UTF-8. https://w3techs.com/technologies/details/en-utf8/all/all
I wonder what that means?
Is might simply be that the char-encoding attribute of a web-page
happens to be set to UTF-8 whether or not it uses any code points
outside of ASCII.
Yes, documents whose character encoding was set to US-ASCII were only
0.2% in 2010 already.
https://w3techs.com/technologies/history_overview/character_encoding/ms/y
Ben Bacarisse
2017-07-23 11:59:46 UTC
Öö Tiib <***@hot.ee> writes:
<snip>
Post by Öö Tiib
Yes, documents whose character encoding was set US-ASCII were only
0.2% in 2010 already.
https://w3techs.com/technologies/history_overview/character_encoding/ms/y
Slightly better to say documents whose transfer encoding is US-ASCII.
The documents themselves are Unicode documents.

An alternative would be to avoid the word document since you are really
talking about the byte streams rather than the end documents.

BTW, I'm surprised there are as many as 0.2% that have a declared
encoding of US-ASCII!
--
Ben.
GOTHIER Nathan
2017-07-23 11:07:01 UTC
On Sun, 23 Jul 2017 01:45:10 +0100
Post by bartc
I then did the same thing on a long article in Arabic about Unicode. The
page source still used ASCII for nearly 80% of the bytes (and presumably
for more than 80% of the characters since Unicode sequences are
multi-byte). So ASCII still seems to be used significantly.)
Indeed. It is obvious, since the web is written in HTML with English tags and
attributes. You should try to evaluate the proportion of ISO/IEC 646 code between
the HTML tags in order to get a more relevant picture of Unicode use in the
world, especially in Asia, which represents a third of the world population and
half of the world's internet users.
Ben Bacarisse
2017-07-23 11:54:15 UTC
<snip>
Post by bartc
Post by Öö Tiib
That site says that close to 90 of text what is up in internet right now
is UTF-8. https://w3techs.com/technologies/details/en-utf8/all/all
I wonder what that means?
That's an important question, but the page does answer it.
Post by bartc
Is might simply be that the char-encoding attribute of a web-page
happens to be set to UTF-8
It is, as you suppose, the declared transfer encoding that is being
counted in these statistics.
Post by bartc
whether or not it uses any code points
outside of ASCII.
And conversely, many HTML documents use Unicode code points outside of
0-127 without transmitting anything other than bytes in the range 0-127.
(All HTML documents are Unicode documents no matter how the data are
transferred.)

<snip>
Post by bartc
(I just looked at a random web page (from imdb.com), and it looked
mostly ASCII to me. Actually most of it was Javascript and HTML.
I did a histogram of the byte values, and counts from 126 to 255 were
all zeros.
I then did the same thing on a long article in Arabic about
Unicode. The page source still used ASCII for nearly 80% of the bytes
(and presumably for more than 80% of the characters since Unicode
sequences are multi-byte). So ASCII still seems to be used
significantly.)
I'd prefer a more careful use of words. The page uses Unicode
regardless of how the transmission is encoded. All HTML and XML
processing is Unicode text processing. The only formal role for ASCII
is as a very rarely used transmission encoding. You are talking about a
rather odd measure -- you are counting the bytes that have the high bit
set in the transmission of Unicode documents.

A percentage count of bytes transmitted that are outside of 0-127 is
going to be biased by pages that use a lot of Chinese characters (for
example) and it will be biased the other way by pages that use lots of
character entity encodings.

But most of all I wonder what's the point of all the counting. HTML
documents are Unicode documents and that's a good thing. How they are
transmitted is not very interesting. (Yes, I know you did not bring
that up.)
--
Ben.
bartc
2017-07-23 12:31:46 UTC
Post by Ben Bacarisse
Post by bartc
I then did the same thing on a long article in Arabic about
Unicode. The page source still used ASCII for nearly 80% of the bytes
(and presumably for more than 80% of the characters since Unicode
sequences are multi-byte). So ASCII still seems to be used
significantly.)
I'd prefer a more careful use of words. The page uses Unicode
regardless of how the transmission is encoded. All HTML and XML
processing is Unicode text processing. The only formal role for ASCII
is as a very rarely used transmission encoding. You are talking about a
rather odd measure -- you are counting the bytes that have the high bit
set in the transmission of Unicode documents.
In my two examples, it was clear from visual examination that codes in
the low range were predominantly used as actual ASCII (in tag names and
their parameters for example, or in bits of script code).

They didn't just happen to look like ASCII because an arbitrary byte
value in that range rendered by an editor necessarily had to be
displayed as an ASCII character.

Nor was ASCII just being used as an encoding medium like Mime. Otherwise
the character sequences would be arbitrary.

But even if the purpose of the ASCII is as a framing format for actual
content that is predominantly non-ASCII, that is still what is largely
being transmitted and what is being processed.
Post by Ben Bacarisse
A percentage count of bytes transmitted that are outside of 0-127 is
going to be biased by pages that use a lot of Chinese characters (for
example) and it will be biased the other way by pages that use lots of
character entity encodings.
But most of all I wonder what's the point of all the counting. HTML
documents are Unicode documents and that's a good thing.
By that measure everything I've ever typed and read can be considered
Unicode.

It's just that English (and maybe a few other languages, such as
Italian) can usually be adequately represented using just the first 128
Unicode characters, and many of those are useless control codes.

(That's excluding typographical aspects that I agree with supercat don't
belong in Unicode.)
--
bartc
Ben Bacarisse
2017-07-23 14:26:01 UTC
Post by bartc
Post by Ben Bacarisse
Post by bartc
I then did the same thing on a long article in Arabic about
Unicode. The page source still used ASCII for nearly 80% of the bytes
(and presumably for more than 80% of the characters since Unicode
sequences are multi-byte). So ASCII still seems to be used
significantly.)
I'd prefer a more careful use of words. The page uses Unicode
regardless of how the transmission is encoded. All HTML and XML
processing is Unicode text processing. The only formal role for ASCII
is as a very rarely used transmission encoding. You are talking about a
rather odd measure -- you are counting the bytes that have the high bit
set in the transmission of Unicode documents.
In my two examples, it was clear from visual examination that codes in
the low range were predominantly used as actual ASCII (in tag names
and their parameters for example, or in bits of script code).
Yes, though the correct wording is Unicode. You can't use ASCII in an
HTML document. You can transfer a document as ASCII (but that is very
rare indeed) but the document itself uses Unicode.

I know this seems like a non-distinction to you, but I am only making it
to reinforce the point that Unicode is widely used -- it is the standard
imposed by HTML even if only a few pages have a few codes outside of the
0-127 range.
Post by bartc
They didn't just happen to look like ASCII because an arbitrary byte
value in that range rendered by an editor necessarily had to be
displayed as an ASCII character.
Nor was ASCII just being used as an encoding medium like
Mime. Otherwise the character sequences would be arbitrary.
I don't understand either of these two paragraphs. For one thing, I no
longer know what you are looking at. Did you capture the transfer and
look at that, or did you do something else? All of these things will
involve some sort of encoding.
Post by bartc
But even if the purpose of the ASCII is as a framing format for actual
content that is predominantly non-ASCII, that is still what is largely
being transmitted and what is being processed.
I am making a technical point about names, not about byte values. When
an SMS, using GSM 03.38, sends a space as 0x20 do you call that "using
ASCII"? Does an MS Windows file using Windows 1252 "use ASCII" for most
of it because a lot of the code points are the same?

If so, then ASCII will appear to be everywhere -- no question.
Post by bartc
Post by Ben Bacarisse
A percentage count of bytes transmitted that are outside of 0-127 is
going to be biased by pages that use a lot of Chinese characters (for
example) and it will be biased the other way by pages that use lots of
character entity encodings.
But most of all I wonder what's the point of all the counting. HTML
documents are Unicode documents and that's a good thing.
By that measure everything I've ever typed and read can be considered
Unicode.
That's good, isn't it?
Post by bartc
It's just that English (and maybe a few other other languages such as
Italian), can usually be adequately represented using just the first
128 Unicode characters, and many of those are useless control codes.
Provided you never want to write an exposé of the rôle played by an
émigré in an attempted coup d'état! Many English speakers are rather
too blasé about the spelling of words like café and déjà vu for my
taste. Is it too much of a cliché to mention it again?

<snip>
--
Ben.
bartc
2017-07-23 16:33:15 UTC
Post by Ben Bacarisse
Post by bartc
Nor was ASCII just being used as an encoding medium like
Mime. Otherwise the character sequences would be arbitrary.
I don't understand ether of these two paragraphs. For one thing, I no
longer know what you are looking at. Did you capture the transfer and
look at that, of did you do something else? All of these things will
involve some sort of encoding.
I used 'Page Source' on Firefox then copy&paste to get it into a file.
Post by Ben Bacarisse
Provided you never want to write an exposé of the rôle played by an an
émigré in an attempted cout d'état! Many English speakers are rather
too blasé about the spelling of words like café and déjà vu for my
taste. It too much of cliché to mention it again?
None of the English typewriters I've seen deal with those either. It
hasn't stopped people using them for everything including writing novels
and plays.
--
bartc
David Kleinecke
2017-07-23 18:19:45 UTC
Post by bartc
Post by Ben Bacarisse
Post by bartc
Nor was ASCII just being used as an encoding medium like
Mime. Otherwise the character sequences would be arbitrary.
I don't understand ether of these two paragraphs. For one thing, I no
longer know what you are looking at. Did you capture the transfer and
look at that, of did you do something else? All of these things will
involve some sort of encoding.
I used 'Page Source' on Firefox then copy&paste to get it into a file.
Post by Ben Bacarisse
Provided you never want to write an exposé of the rôle played by an an
émigré in an attempted cout d'état! Many English speakers are rather
too blasé about the spelling of words like café and déjà vu for my
taste. It too much of cliché to mention it again?
None of the English typewriters I've seen deal with those either. It
hasn't stopped people using them for everything including writing novels
and plays.
None of these bother me but lack of 'ñ' really hurts.
Ben Bacarisse
2017-07-23 19:00:03 UTC
Post by bartc
Post by Ben Bacarisse
Post by bartc
Nor was ASCII just being used as an encoding medium like
Mime. Otherwise the character sequences would be arbitrary.
I don't understand ether of these two paragraphs. For one thing, I no
longer know what you are looking at. Did you capture the transfer and
look at that, of did you do something else? All of these things will
involve some sort of encoding.
I used 'Page Source' on Firefox then copy&paste to get it into a file.
OK. Still not sure what the point of the two paragraphs was, and one's
been cut anyway.
Post by bartc
Post by Ben Bacarisse
Provided you never want to write an exposé of the rôle played by an an
émigré in an attempted cout d'état! Many English speakers are rather
too blasé about the spelling of words like café and déjà vu for my
taste. It too much of cliché to mention it again?
None of the English typewriters I've seen deal with those either. It
hasn't stopped people using them for everything including writing
novels and plays.
You can use l for 1 and O for 0 too (at least I had to the last time I
wrote on a typewriter). That was 37 years ago. Don't you like being
all modern and international? In short, what's the problem?
--
Ben.
s***@casperkitty.com
2017-07-23 13:31:12 UTC
Post by Ben Bacarisse
And conversely, many HTML documents use Unicode code points outside of
0-127 without transmitting anything other than bytes in the range 0-127.
(All HTML documents are Unicode documents no matter how the data are
transferred.)
Outside of URLs, UTF7 is very rare. In UTF8, byte values 0-127 are used
absolutely 100% exclusively for the purpose of sending code points 0-127.
Many other encodings represent some characters using a byte outside the
0-127 range followed by a byte within that range, but UTF-8 guarantees
that the byte representation of any sequence of code points will not appear
inside the representation of any other sequence [it does not, however, extend that guarantee to graphemes].
Post by Ben Bacarisse
Post by bartc
I then did the same thing on a long article in Arabic about
Unicode. The page source still used ASCII for nearly 80% of the bytes
(and presumably for more than 80% of the characters since Unicode
sequences are multi-byte). So ASCII still seems to be used
significantly.)
I'd prefer a more careful use of words. The page uses Unicode
regardless of how the transmission is encoded. All HTML and XML
processing is Unicode text processing. The only formal role for ASCII
is as a very rarely used transmission encoding. You are talking about a
rather odd measure -- you are counting the bytes that have the high bit
set in the transmission of Unicode documents.
That's actually a good measure. Within a UTF-8 document, the number of code
points in the range 0-127 will be equal to the number of bytes with the high
bit clear. The number of code points total will equal the number of bytes
that are not in the range 128-191.
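
That rule is also trivial to express in C; a rough sketch (assuming the buffer holds well-formed UTF-8):

#include <stddef.h>

/* Count code points in a UTF-8 buffer: every byte except a
   continuation byte (10xxxxxx, i.e. values 128-191) starts a
   new code point. */
size_t count_code_points(const unsigned char *s, size_t len)
{
    size_t i, n = 0;
    for (i = 0; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            n++;
    return n;
}
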
Post by Ben Bacarisse
A percentage count of bytes transmitted that are outside of 0-127 is
going to be biased by pages that use a lot of Chinese characters (for
example) and it will be biased the other way by pages that use lots of
character entity encodings.
But most of all I wonder what's the point of all the counting. HTML
documents are Unicode documents and that's a good thing. How they are
transmitted is not very interesting. (Yes, I know you did not bring
that up.)
Some people regard the popularity of UTF8 vs UTF16 as the result of a bias
toward the Latin alphabet. Machine-readable text overwhelmingly uses a
very limited character set because doing so is easier and simpler than trying
to use a larger one, and even within many documents whose purpose is to
convey human-readable information, the amount of machine-readable text is
sufficiently large that use of UTF8 ends up being a net "win" versus UTF16.
Representing all text using a fixed two bytes per character might have made
sense if it would have been sufficient, but using two bytes per character
for machine-readable text while still needing four bytes per character for
other kinds makes far less sense.
GOTHIER Nathan
2017-07-22 23:30:54 UTC
On Sat, 22 Jul 2017 15:07:39 -0700 (PDT)
Post by s***@casperkitty.com
The majority of text processed by computers is ASCII, even in countries that
use non-Latin alphabets, because the majority of text processed by computers
is *for consumption by other computers*. I don't really have any major squawk
with UTF-8 itself. Unicode, however, has many problems far more fundamental
than its encoding.
Wrong. Only the United States uses ASCII. Most countries use ISO/IEC 646 (aka
ECMA-6).
s***@casperkitty.com
2017-07-23 00:15:24 UTC
Post by GOTHIER Nathan
Post by s***@casperkitty.com
The majority of text processed by computers is ASCII, even in countries that
use non-Latin alphabets, because the majority of text processed by computers
is *for consumption by other computers*. I don't really have any major squawk
with UTF-8 itself. Unicode, however, has many problems far more fundamental
than its encoding.
Wrong. Only the United States use ASCII. Most countries uses ISO/IEC 646 (aka
ECMA-6).
What I meant to say is that the majority of text characters processed by
computers are within the portion of the character set which is coded the
same as ASCII (which also happens to be the portion which UTF-8 encodes as
one byte per character). Although UTF-8 only makes 128 code points smaller
than they would be in UTF-16, while making more than 50,000 code points
larger, those 128 code points are so much more common than the others
that UTF-8 ends up being a win versus UTF-16 in almost every way.
Chad
2017-07-22 10:46:13 UTC
I have no idea.
GOTHIER Nathan
2017-07-22 11:02:38 UTC
On Fri, 21 Jul 2017 20:06:23 -0700 (PDT)
C could certainly use a few new features, like UTF-8 support, localization,
expanded generics, returning multiple variables at once, and a few more i
haven't thought of.
IMHO, the first thing that should be fixed in C is the type definitions, before
possibly extending the standard library - which I would remove from the standard
to clearly limit the bounds of the language.
Thiago Adams
2017-07-22 14:52:35 UTC
Post by b***@gmail.com
What problem are you trying to solve? What issues are you trying to resolve?
The short answer is: I am trying to automate some patterns.

You can think of it as something like an "auto pilot" for C. Auto pilot for cars didn't require new roads. (Roads can be adapted in the future, but it works now.)
Instead of creating a new infrastructure, new libraries, compilers etc., I am trying to use the current infrastructure we have in C and create something to automate code creation for some patterns.
Like an auto pilot, the user is always in control and can do things manually or in
collaboration.

The first things I chose to automate are:

- destruction/creation (stack and heap)
- static initialization
- one type of polymorphism (more types of vtable can be added)
- containers (map, array and list)
Post by b***@gmail.com
I mean I'd be interested, but only if it went in the opposite direction of C++ and Rust.
The design decisions will focus on staying very close to C.
Post by b***@gmail.com
C could certainly use a few new features, like UTF-8 support, locialization, expanded generics, returning multiple variables at once, and a few more i haven't thought of.
Another possible objective (not the main focus now) is to compile C11 into C89. For instance, it is very easy to transform u8"literal" into its C89 version.
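
For example (hand-written illustration; the exact output format may differ), a C11 declaration such as

const char* s = u8"café";

can be emitted in C89 as

const char* s = "caf\xC3\xA9";

since the UTF-8 bytes of the literal can simply be written out as hex escape sequences.
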
Post by b***@gmail.com
But here's what C++ and Rust get wrong; Simplicity is C's best feature, and going whole hog adding stupid shit that people can make themselves.
I am trying to minimize new concepts; this is the objective.
Everything is built around data types and functions (no new operators, for instance).
Mr. Man-wai Chang
2017-07-23 15:52:53 UTC
Post by Thiago Adams
I have seen some topics about C x C++.
I would like to hear your thoughts about a new language derived
and compatible with from C11 called C' that I am creating. .
....
The output is C89 code. It's open source.
Having these features working fine, I am planning to boostrap the
compiler (that is in C) and compare the number of lines saved)
This will be version 0.9.
....
The project is open for anyone to join.
Shouldn't you just build a development framework, like WordPress, that
eases programming? I think the existing C and C++ syntax is more than enough
to write applications.
--
@~@ Remain silent! Drink, Blink, Stretch! Live long and prosper!!
/ v \ Simplicity is Beauty!
/( _ )\ May the Force and farces be with you!
^ ^ (x86_64 Ubuntu 9.10) Linux 2.6.39.3
Don't borrow or lend! Don't defraud! No compensated dating! No fighting! No robbery! No suicide! Please consider welfare (CSSA):
http://www.swd.gov.hk/tc/index/site_pubsvc/page_socsecu/sub_addressesa
Thiago Adams
2017-07-24 12:25:10 UTC
Post by Mr. Man-wai Chang
Post by Thiago Adams
I have seen some topics about C x C++.
I would like to hear your thoughts about a new language derived
and compatible with from C11 called C' that I am creating. .
....
The output is C89 code. It's open source.
Having these features working fine, I am planning to boostrap the
compiler (that is in C) and compare the number of lines saved)
This will be version 0.9.
....
The project is open for anyone to join.
Shouldn't you just build a development framework like WordPress that
ease programming? I think existing C and C++ syntax is more than enough
to write applications.
I am not planning to add new syntax to C' to make
it easier or higher level by adding new concepts to the language.

The syntax added so far was created to support the automation
of patterns. These patterns can be created in C by hand using
the same abstraction mechanism used by C' - functions and
data structures.
Tim Rentsch
2017-07-24 13:51:47 UTC
Post by Thiago Adams
I have seen some topics about C x C++.
I would like to hear your thoughts about a new language derived
and compatible with from C11 called C' that I am creating.
I have a comment. This topic belongs in a different newsgroup.
Thiago Adams
2017-07-28 04:14:46 UTC
Post by Tim Rentsch
Post by Thiago Adams
I have seen some topics about C x C++.
I would like to hear your thoughts about a new language derived
and compatible with from C11 called C' that I am creating.
I have a comment. This topic belongs in a different newsgroup.
I changed the syntax to allow C' to be compatible with C parsing,
using some macros. We can use current IDEs, with auto-complete etc.
I consider the tool very useful, and it was created to solve
problems (like container creation).
So, if container libs are relevant in this group, I think this tool can be relevant as well. The generated code can be very efficient, and it can be tailored for specific types.

(But I am open to accepting your opinion if this is off-topic, to avoid bothering others.)

In this video I show auto-generation of initialization and destroy code (stack, heap and static),
the "auto" qualifier, which so far is only used in destroy code generation,
and an "array" container.



(Sorry for the English mistakes in this video.)


Suggestions and criticism are welcome.
