Discussion:
Compiler Elegance
(too old to reply)
David Kleinecke
2017-07-20 19:40:45 UTC
Permalink
Raw Message
This is about a point where it seems a C compiler must
abandon elegance and use a kludge. If all you know about
how compilers work is "use YACC or an equivalent" then
please explain for me how YACC does it.

The problem is differentiating between statement labels
and identifier-initial expressions. When the parser
reads the first token of a new statement and recognizes
an identifier how, if at all, does it know whether that
identifier is a variable name or label?

This can be solved by using semantic data if one
accepts a reasonable limitation: If the identifier is a
variable it must have already been declared (here is the
limitation - implicitly int functions must not be
accepted) otherwise it is a label (sometimes already
known as such, sometimes not). I don't want to use this
particular kludge because I want to put a barrier
between syntax and semantics (why is a different
question). In any case it forces me to test in a given
order which is something I want to avoid.

There are a number of other possible kludges but the
one I grit my teeth and favor is modifying the parse
of expressions. I don't specifically test for labels.
If a possible expression begins with an identifier I
look for a colon and if there is one I enter the
identifier as a label and call statement again.
Otherwise I jump to the right place in the interior
of the expression parse. I dislike that internal jump
but it only a small mis-elegance.

Anybody got any better ideas?
Rick C. Hodgin
2017-07-20 19:53:12 UTC
Permalink
Raw Message
Post by David Kleinecke
This is about a point where it seems a C compiler must
abandon elegance and use a kludge. If all you know about
how compilers work is "use YACC or an equivalent" then
please explain for me how YACC does it.
The problem is differentiating between statement labels
and identifier-initial expressions. When the parser
reads the first token of a new statement and recognizes
an identifier how, if at all, does it know whether that
identifier is a variable name or label? ...
Anybody got any better ideas?
In all honesty, my advice would be to forego the C compiler
restriction of requiring everything to be defined before
being referenced, and make your compiler an N-pass compiler
able to resolve forward-declared things without error.

I do not see any reason in this day and age to have a
compiler bound by limitations present in prior generations
of hardware.

You could parse out everything in the first pass, and those
things you know at that time go ahead and tag, but those
things you do not know let them be resolved in a follow-up
pass. If they fail after that second pass, then report them
as errors.

CAlive's design does this, making it an N-pass compiler. The
current model is using 7 passes, with only one pass being the
true compilation pass, which would logically be the second
compiler pass on the file as the others are for lexing, macro
expansion, token tagging, and then optimization, and generating
the output assembly.

Thank you,
Rick C. Hodgin
Joe Pfeiffer
2017-07-20 19:56:56 UTC
Permalink
Raw Message
Post by David Kleinecke
This is about a point where it seems a C compiler must
abandon elegance and use a kludge. If all you know about
how compilers work is "use YACC or an equivalent" then
please explain for me how YACC does it.
The problem is differentiating between statement labels
and identifier-initial expressions. When the parser
reads the first token of a new statement and recognizes
an identifier how, if at all, does it know whether that
identifier is a variable name or label?
This can be solved by using semantic data if one
accepts a reasonable limitation: If the identifier is a
variable it must have already been declared (here is the
limitation - implicitly int functions must not be
accepted) otherwise it is a label (sometimes already
known as such, sometimes not). I don't want to use this
particular kludge because I want to put a barrier
between syntax and semantics (why is a different
question). In any case it forces me to test in a given
order which is something I want to avoid.
There are a number of other possible kludges but the
one I grit my teeth and favor is modifying the parse
of expressions. I don't specifically test for labels.
If a possible expression begins with an identifier I
look for a colon and if there is one I enter the
identifier as a label and call statement again.
Otherwise I jump to the right place in the interior
of the expression parse. I dislike that internal jump
but it only a small mis-elegance.
Anybody got any better ideas?
A C statement label must be followed by a colon. That is sufficient for
a context free grammar to distinguish between an identifier used as a
label from one used as the beginning of an expression.
Thiago Adams
2017-07-21 12:07:04 UTC
Permalink
Raw Message
Post by Joe Pfeiffer
Post by David Kleinecke
This is about a point where it seems a C compiler must
abandon elegance and use a kludge. If all you know about
how compilers work is "use YACC or an equivalent" then
please explain for me how YACC does it.
The problem is differentiating between statement labels
and identifier-initial expressions. When the parser
reads the first token of a new statement and recognizes
an identifier how, if at all, does it know whether that
identifier is a variable name or label?
This can be solved by using semantic data if one
accepts a reasonable limitation: If the identifier is a
variable it must have already been declared (here is the
limitation - implicitly int functions must not be
accepted) otherwise it is a label (sometimes already
known as such, sometimes not). I don't want to use this
particular kludge because I want to put a barrier
between syntax and semantics (why is a different
question). In any case it forces me to test in a given
order which is something I want to avoid.
There are a number of other possible kludges but the
one I grit my teeth and favor is modifying the parse
of expressions. I don't specifically test for labels.
If a possible expression begins with an identifier I
look for a colon and if there is one I enter the
identifier as a label and call statement again.
Otherwise I jump to the right place in the interior
of the expression parse. I dislike that internal jump
but it only a small mis-elegance.
Anybody got any better ideas?
A C statement label must be followed by a colon. That is sufficient for
a context free grammar to distinguish between an identifier used as a
label from one used as the beginning of an expression.
I have two cases for look ahead:

* Labels
Look ahead to see ':'

* cast expressions

( typename
Look ahead to see if the next token is first of token of typename

Because my parser keep comments and preprossessor information
I have as many look ahead as necessary.

For instance:

label /*my comment*/ :

In this case the look ahead for the parser point of view is ':'. But I keep in memory the previous tokens. (spaces and comment in this case)

It's possible to change the grammar to avoid these look ahead. But I found
better to keep the grammar close to the standard.
Keith Thompson
2017-07-20 20:06:52 UTC
Permalink
Raw Message
Post by David Kleinecke
This is about a point where it seems a C compiler must
abandon elegance and use a kludge. If all you know about
how compilers work is "use YACC or an equivalent" then
please explain for me how YACC does it.
The problem is differentiating between statement labels
and identifier-initial expressions. When the parser
reads the first token of a new statement and recognizes
an identifier how, if at all, does it know whether that
identifier is a variable name or label?
When the parser sees an identifier, it doesn't know whether it's a
statement label, or an object name, or an enumeration constant, or a
function name, or .... Fortunately, it doesn't need to.

Given the grammar, it does know what tokens to look for next. The
identifier is a label *if and only if* the following token is a colon.
Not all grammatical rules are that simple, but any parser that can parse
C can deal with the complexity.

Typedef names are an issue, requiring some feedback from the symbol
table to the parser.
Post by David Kleinecke
This can be solved by using semantic data if one
accepts a reasonable limitation: If the identifier is a
variable it must have already been declared (here is the
limitation - implicitly int functions must not be
accepted) otherwise it is a label (sometimes already
known as such, sometimes not). I don't want to use this
particular kludge because I want to put a barrier
between syntax and semantics (why is a different
question). In any case it forces me to test in a given
order which is something I want to avoid.
No, label names are in their own independent namespace. A label can
use the same identifier as a variable name, a type name, a function
name, or anything else. It's recognized as a label name purely from
context (preceding a colon at the beginning of a labeled-statement,
or following the keyword "goto").

[...]
Post by David Kleinecke
Anybody got any better ideas?
Yes, just parse the input using the grammar -- plus a bit of extra work
to account for typedef names. Recognition of labels will fall out of
that automatically.
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
bartc
2017-07-20 20:35:41 UTC
Permalink
Raw Message
Post by David Kleinecke
This is about a point where it seems a C compiler must
abandon elegance and use a kludge. If all you know about
how compilers work is "use YACC or an equivalent" then
please explain for me how YACC does it.
The problem is differentiating between statement labels
and identifier-initial expressions. When the parser
reads the first token of a new statement and recognizes
an identifier how, if at all, does it know whether that
identifier is a variable name or label?
I checked my own parser (which is manually written recursive descent).

If parsing a new statement and it's an identifier, then there is one of
three possibilities:

(1) A colon follows, so it's a label (and it must be followed by a fresh
nested statement, possibly another label)

(2) If this is a typedef (requires some name resolving), then it's a
declaration

(3) Otherwise it's probably the start of an expression.
--
bartc
David Kleinecke
2017-07-20 21:49:58 UTC
Permalink
Raw Message
Post by bartc
Post by David Kleinecke
This is about a point where it seems a C compiler must
abandon elegance and use a kludge. If all you know about
how compilers work is "use YACC or an equivalent" then
please explain for me how YACC does it.
The problem is differentiating between statement labels
and identifier-initial expressions. When the parser
reads the first token of a new statement and recognizes
an identifier how, if at all, does it know whether that
identifier is a variable name or label?
I checked my own parser (which is manually written recursive descent).
If parsing a new statement and it's an identifier, then there is one of
(1) A colon follows, so it's a label (and it must be followed by a fresh
nested statement, possibly another label)
(2) If this is a typedef (requires some name resolving), then it's a
declaration
(3) Otherwise it's probably the start of an expression.
I think Bart has an inkling of what I am talking about. But ...

I asking how the parse proceeds. If I must use the following colon
to detect a label I have to get the non-label back into the parse
somehow to allow it to be recognized in an expression. I could
just stick it back into the token stream (like an unegtc) but
then I would have to spend a check on every next token call to
see whether I was a token ahead. I could just demand that the
token be a array tokens in member so I could just drop the
token pointer back one.

Conversely if I test for expressions first I have to make an error
recovery when the colon failed the expression scan.

This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token. (The pre-processor must be able to read
three characters - but that's a different story).

I handle typedefs by turning every typedef name into what is
effectively a keyword. This is also a kludge but not as blatant
a kludge as label handling.
bartc
2017-07-20 22:21:33 UTC
Permalink
Raw Message
Post by David Kleinecke
Post by bartc
Post by David Kleinecke
This is about a point where it seems a C compiler must
abandon elegance and use a kludge. If all you know about
how compilers work is "use YACC or an equivalent" then
please explain for me how YACC does it.
The problem is differentiating between statement labels
and identifier-initial expressions. When the parser
reads the first token of a new statement and recognizes
an identifier how, if at all, does it know whether that
identifier is a variable name or label?
I checked my own parser (which is manually written recursive descent).
If parsing a new statement and it's an identifier, then there is one of
(1) A colon follows, so it's a label (and it must be followed by a fresh
nested statement, possibly another label)
(2) If this is a typedef (requires some name resolving), then it's a
declaration
(3) Otherwise it's probably the start of an expression.
I think Bart has an inkling of what I am talking about. But ...
I asking how the parse proceeds. If I must use the following colon
to detect a label I have to get the non-label back into the parse
somehow to allow it to be recognized in an expression. I could
just stick it back into the token stream (like an unegtc) but
then I would have to spend a check on every next token call to
see whether I was a token ahead. I could just demand that the
token be a array tokens in member so I could just drop the
token pointer back one.
OK, I think I see your problem: you don't have the ability to peek at
the next token, without it affecting the current one.

There are few ways to solve this, the one I use is to work a token in
hand. That is, it's always reading the following token, but returns the
previous one.

(My technique is to have the parser call a function lex() that sets up
info about the current token in global 'lx' (a struct), and about the
next token in 'nextlx'.

On the next read to lex(), nextlx is copied to lx, and a lower level
function is called to get a fresh token from the source into nextlx.)
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token.
There are extra places if you want to optimise certain constructs, and
others where it makes the parsing simpler.
Post by David Kleinecke
(The pre-processor must be able to read
three characters - but that's a different story).
My parser is totally decoupled from the character processing in the
tokeniser.

(And lookahead in the preprocessor is another story. For example, in the
macro invocation:

M(x,y)

between the "M" and the "(", there can be any amount of white-space:
spaces, tabs and newlines, or any number of comments. And M and "("
could be split across include files, or across #if blocks. Maybe there
could even be another macro invocation in there provided it yields no
tokens (I've no idea what is allowed and what isn't).

However, I only look ahead two characters, so that it must be either
M(x,y) or M (x,y) with one space.)
Post by David Kleinecke
I handle typedefs by turning every typedef name into what is
effectively a keyword. This is also a kludge but not as blatant
a kludge as label handling.
I don't think this can work in general. A keyword would have global
scope across the file, and cannot be shadowed by any other identifier.

A typedef name obeys normal scope rules, and can be shadowed by other names:

typedef int T; // T is a typedef
{ T a; // typedef
int T; // Now T is a variable name
T .... // Start of expression not declaration
T: // But this is still a label

Labels have their own namespace so can coexist with typedefs and
variables. But if you make a typedef into a keyword, you won't be able
to use the same name for a label.

(Yes, the language is crazy. They must have been smoking something good
back in the late 60s!)

--
bartc
David Kleinecke
2017-07-21 02:45:16 UTC
Permalink
Raw Message
On Thursday, July 20, 2017 at 3:21:48 PM UTC-7, Bart wrote:

Snipping some obvious stuff
Post by bartc
OK, I think I see your problem: you don't have the ability to peek at
the next token, without it affecting the current one.
Correct. I get the next token by calling a function named
"next_token" which (1) I treat as though it were opaque and
(2) want to keep as simple as possible because it is called
so often. In fact next_token is the last step of the
pre-processor.
Post by bartc
There are few ways to solve this, the one I use is to work a token in
hand. That is, it's always reading the following token, but returns the
previous one.
That is essentially what I called read-ahead. Easy enough to
implement but a constant small cost and only rarely used.
Post by bartc
(My technique is to have the parser call a function lex() that sets up
info about the current token in global 'lx' (a struct), and about the
next token in 'nextlx'.
My tokens are simply ints which are indices into a token
data array.
Post by bartc
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token.
There are extra places if you want to optimise certain constructs, and
others where it makes the parsing simpler.
I have no interest in optimizing the parser. I'm curious where
you think look-ahead makes parsing easier. I see none.
Post by bartc
Post by David Kleinecke
I handle typedefs by turning every typedef name into what is
effectively a keyword. This is also a kludge but not as blatant
a kludge as label handling.
I don't think this can work in general. A keyword would have global
scope across the file, and cannot be shadowed by any other identifier.
typedef int T; // T is a typedef
{ T a; // typedef
int T; // Now T is a variable name
T .... // Start of expression not declaration
T: // But this is still a label
Labels have their own namespace so can coexist with typedefs and
variables. But if you make a typedef into a keyword, you won't be able
to use the same name for a label.
They are not precisely keywords. They are only like keywords when
they are not shadowed and they are only like the type specifier
keywords. I don't think labels have their own namespace and I don't
think they shadow.
Post by bartc
(Yes, the language is crazy. They must have been smoking something good
back in the late 60s!)
--
bartc
bartc
2017-07-21 10:37:12 UTC
Permalink
Raw Message
Post by David Kleinecke
Post by bartc
There are few ways to solve this, the one I use is to work a token in
hand. That is, it's always reading the following token, but returns the
previous one.
That is essentially what I called read-ahead. Easy enough to
implement but a constant small cost and only rarely used.
Post by bartc
(My technique is to have the parser call a function lex() that sets up
info about the current token in global 'lx' (a struct), and about the
next token in 'nextlx'.
My tokens are simply ints which are indices into a token
data array.
In that case it should be very easy. (I used that method once but the
cost was allocating the storage and populating the array. My compilers
now are written to be very fast.)
Post by David Kleinecke
Post by bartc
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token.
There are extra places if you want to optimise certain constructs, and
others where it makes the parsing simpler.
I have no interest in optimizing the parser. I'm curious where
you think look-ahead makes parsing easier. I see none.
A couple of places. I do string concatenation (ie. "ABC" "DEF" becomes
"ABCDEF") in the parser. Peeking to see if the next token is a string
makes that a little easier.

C allows a struct initialiser with a trailing comma: '{10,20,30,}'. It's
sometimes useful to peek at the "}" rather than checking it in the
course of reading the next token anyway, but the logic can become more
fiddly.

When reading an identifier and the next token is '(', which is likely to
mean a function call, this can be dealt with immediately (rather than
resolving to a function name which then decays into a function pointer
when then requires a complex type-specification to be constructed for).

So it's convenient rather than essential.
--
bartc
David Brown
2017-07-21 09:57:08 UTC
Permalink
Raw Message
Post by bartc
Post by David Kleinecke
I asking how the parse proceeds. If I must use the following colon
to detect a label I have to get the non-label back into the parse
somehow to allow it to be recognized in an expression. I could
just stick it back into the token stream (like an unegtc) but
then I would have to spend a check on every next token call to
see whether I was a token ahead. I could just demand that the
token be a array tokens in member so I could just drop the
token pointer back one.
OK, I think I see your problem: you don't have the ability to peek at
the next token, without it affecting the current one.
I haven't done much work in parsing, but my understanding is that C
requires looking ahead to be able to completely identify the current
token. And I believe there is theoretically no limit to how many tokens
ahead you might have to look (due to typedefs, IIRC).
Post by bartc
There are few ways to solve this, the one I use is to work a token in
hand. That is, it's always reading the following token, but returns the
previous one.
(My technique is to have the parser call a function lex() that sets up
info about the current token in global 'lx' (a struct), and about the
next token in 'nextlx'.
On the next read to lex(), nextlx is copied to lx, and a lower level
function is called to get a fresh token from the source into nextlx.)
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token.
There are extra places if you want to optimise certain constructs, and
others where it makes the parsing simpler.
Post by David Kleinecke
(The pre-processor must be able to read
three characters - but that's a different story).
My parser is totally decoupled from the character processing in the
tokeniser.
That is the way it is usually done, AFAIK.
Post by bartc
Post by David Kleinecke
I handle typedefs by turning every typedef name into what is
effectively a keyword. This is also a kludge but not as blatant
a kludge as label handling.
I don't think this can work in general. A keyword would have global
scope across the file, and cannot be shadowed by any other identifier.
Yes.
Post by bartc
typedef int T; // T is a typedef
{ T a; // typedef
int T; // Now T is a variable name
T .... // Start of expression not declaration
T: // But this is still a label
Labels have their own namespace so can coexist with typedefs and
variables. But if you make a typedef into a keyword, you won't be able
to use the same name for a label.
(Yes, the language is crazy. They must have been smoking something good
back in the late 60s!)
Most languages need at least /some/ look-ahead - Pascal was designed to
be simple to parse with at most one token look-ahead being required.
Alain Ketterlin
2017-07-20 22:54:44 UTC
Permalink
Raw Message
[...]
Post by David Kleinecke
Post by bartc
I checked my own parser (which is manually written recursive descent).
If parsing a new statement and it's an identifier, then there is one of
(1) A colon follows, so it's a label (and it must be followed by a fresh
nested statement, possibly another label)
(2) If this is a typedef (requires some name resolving), then it's a
declaration
(3) Otherwise it's probably the start of an expression.
I think Bart has an inkling of what I am talking about. But ...
I asking how the parse proceeds. If I must use the following colon
to detect a label I have to get the non-label back into the parse
somehow to allow it to be recognized in an expression. I could
just stick it back into the token stream (like an unegtc) but
then I would have to spend a check on every next token call to
see whether I was a token ahead.
You have one token in hand (the identifier). There is no reason to throw
it back at the lexer and read it back later. At this point, the state of
your parser must reflect the fact that you have just read an identifier.
The next token is called the "lookahead" token, and is the one you check
to decide whether you start a label, a declaration, or an expression.

From what you write, I think your question is more on programming and
data-structures than conceptual.

Bart's description above is, I think, as clear as possible: The three
possibilities he mentions typically correspond to three parsing
functions, taking as argument the state of your parser (which includes
the identifier you just read).

By the way, since you consider only one lookahead token, this is called
a LL(1) parser: the "1" means that you never (need to) look more than 1
token ahead to "predict" the rule you fire.

Wikipedia has a good example. See
https://en.wikipedia.org/wiki/Recursive_descent_parser
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token.
I have no precise counter-example to provide, but I doubt it: any
non-terminal that has at least two productions will require lookahead to
decide which production to fire (and the test is usually on sets on
token types, which you have to pre-compute for productions starting with
non-terminals). It also heavily depends on the grammar, *not* the
language. You can have differents grammars for the same language, i.e.,
different grammars for C.

And since you asked: Yacc/Bison takes a grammar, and builds a finite
automaton that "contains" all these tests. Feeding the automaton with
the token stream will then fire productions bottom-up. This is called
LALR(1) parsing (Bison can handle others classes of grammars as well).
And LL(1) parsers (that proceed top-down) can also be automatically
generated (ANTLR is a well-known LL(k) parser generator).

-- Alain.
David Kleinecke
2017-07-21 02:53:59 UTC
Permalink
Raw Message
Post by Alain Ketterlin
[...]
Post by David Kleinecke
Post by bartc
I checked my own parser (which is manually written recursive descent).
If parsing a new statement and it's an identifier, then there is one of
(1) A colon follows, so it's a label (and it must be followed by a fresh
nested statement, possibly another label)
(2) If this is a typedef (requires some name resolving), then it's a
declaration
(3) Otherwise it's probably the start of an expression.
I think Bart has an inkling of what I am talking about. But ...
I asking how the parse proceeds. If I must use the following colon
to detect a label I have to get the non-label back into the parse
somehow to allow it to be recognized in an expression. I could
just stick it back into the token stream (like an unegtc) but
then I would have to spend a check on every next token call to
see whether I was a token ahead.
You have one token in hand (the identifier). There is no reason to throw
it back at the lexer and read it back later. At this point, the state of
your parser must reflect the fact that you have just read an identifier.
The next token is called the "lookahead" token, and is the one you check
to decide whether you start a label, a declaration, or an expression.
From what you write, I think your question is more on programming and
data-structures than conceptual.
Bart's description above is, I think, as clear as possible: The three
possibilities he mentions typically correspond to three parsing
functions, taking as argument the state of your parser (which includes
the identifier you just read).
By the way, since you consider only one lookahead token, this is called
a LL(1) parser: the "1" means that you never (need to) look more than 1
token ahead to "predict" the rule you fire.
Wikipedia has a good example. See
https://en.wikipedia.org/wiki/Recursive_descent_parser
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token.
I have no precise counter-example to provide, but I doubt it: any
non-terminal that has at least two productions will require lookahead to
decide which production to fire (and the test is usually on sets on
token types, which you have to pre-compute for productions starting with
non-terminals). It also heavily depends on the grammar, *not* the
language. You can have differents grammars for the same language, i.e.,
different grammars for C.
And since you asked: Yacc/Bison takes a grammar, and builds a finite
automaton that "contains" all these tests. Feeding the automaton with
the token stream will then fire productions bottom-up. This is called
LALR(1) parsing (Bison can handle others classes of grammars as well).
And LL(1) parsers (that proceed top-down) can also be automatically
generated (ANTLR is a well-known LL(k) parser generator).
-- Alain.
I know about LL(1) and the rest. I just like to avoid using
any more jargon than is necessary. My compilers are unlike
usual compilers because of the state machine intermediate
step.
Alain Ketterlin
2017-07-21 09:55:40 UTC
Permalink
Raw Message
David Kleinecke <***@gmail.com> writes:

[...]
Post by David Kleinecke
I know about LL(1) and the rest.
Your naive question suggests that you don't know as much as you think.
Post by David Kleinecke
I just like to avoid using any more jargon than is necessary.
Concepts != jargon.
Post by David Kleinecke
My compilers are unlike usual compilers because of the state machine
intermediate step.
Good luck with that.

-- Alain.
Ben Bacarisse
2017-07-21 10:43:49 UTC
Permalink
Raw Message
David Kleinecke <***@gmail.com> writes:
<snip>
... My compilers are unlike
usual compilers because of the state machine intermediate
step.
That does not seem at all unusual. Many parsers use a state machine.
--
Ben.
Joe Pfeiffer
2017-07-21 17:56:28 UTC
Permalink
Raw Message
Post by David Kleinecke
I know about LL(1) and the rest. I just like to avoid using
any more jargon than is necessary. My compilers are unlike
usual compilers because of the state machine intermediate
step.
Normally the tokenizer is a state machine, and the parser is a state
machine plus a stack. Are you saying you've got another state machine
in between the two? Whatever for?
David Kleinecke
2017-07-21 20:44:33 UTC
Permalink
Raw Message
Post by Joe Pfeiffer
Post by David Kleinecke
I know about LL(1) and the rest. I just like to avoid using
any more jargon than is necessary. My compilers are unlike
usual compilers because of the state machine intermediate
step.
Normally the tokenizer is a state machine, and the parser is a state
machine plus a stack. Are you saying you've got another state machine
in between the two? Whatever for?
I have posted a couple of examples of what I mean here in
past. I could repeat them if they are needed. A parser is
not necessarily a state machine. And I think one needs two
stacks (at least) - one for operators and one for objects.
I use several more stacks - for instance one for break
targets and another for continue targets.

My preprocessor uses stacks but is not a state machine in
the sense I would use "state machine" (but usage of "state
machine" can differ). My preprocessor is a cascade of
seven functions corresponding to the description in the
Standard.
Keith Thompson
2017-07-21 20:56:45 UTC
Permalink
Raw Message
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
I know about LL(1) and the rest. I just like to avoid using
any more jargon than is necessary. My compilers are unlike
usual compilers because of the state machine intermediate
step.
Normally the tokenizer is a state machine, and the parser is a state
machine plus a stack. Are you saying you've got another state machine
in between the two? Whatever for?
I have posted a couple of examples of what I mean here in
past. I could repeat them if they are needed. A parser is
not necessarily a state machine. And I think one needs two
stacks (at least) - one for operators and one for objects.
I use several more stacks - for instance one for break
targets and another for continue targets.
Why would a parser need to distinguish between operators and objects,
and what exactly do you mean by "objects"? Objects in the C sense
of the word don't exist yet as far as the parser is concerned.
Post by David Kleinecke
My preprocessor uses stacks but is not a state machine in
the sense I would use "state machine" (but usage of "state
machine" can differ). My preprocessor is a cascade of
seven functions corresponding to the description in the
Standard.
You said you're having trouble distinguishing label names from
other identifiers. An ordinary parser, using methods that have
been in common use for decades (e.g., yacc) will have no trouble at
all with that; the distinction will fall out automatically given a
correct grammar definition. You'll trigger the production for a
"labeled-statement", and you'll have enough information at that
point to determine which of the three kinds of labeled-statement
you have, and the identifier for the label, if any.

If your parser is having trouble with this, I suspect you're using
ad-hoc methods that are never going to work as well as the usual
approach. (Unless there's some specific advantage to your approach?)
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
David Kleinecke
2017-07-22 02:31:09 UTC
Permalink
Raw Message
Post by Keith Thompson
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
I know about LL(1) and the rest. I just like to avoid using
any more jargon than is necessary. My compilers are unlike
usual compilers because of the state machine intermediate
step.
Normally the tokenizer is a state machine, and the parser is a state
machine plus a stack. Are you saying you've got another state machine
in between the two? Whatever for?
I have posted a couple of examples of what I mean here in
past. I could repeat them if they are needed. A parser is
not necessarily a state machine. And I think one needs two
stacks (at least) - one for operators and one for objects.
I use several more stacks - for instance one for break
targets and another for continue targets.
Why would a parser need to distinguish between operators and objects,
and what exactly do you mean by "objects"? Objects in the C sense
of the word don't exist yet as far as the parser is concerned.
The distinction is fundamental to my approach. The
objects are data the operators act upon. I can't call
them variables because they include consonants and
string literals.
Post by Keith Thompson
Post by David Kleinecke
My preprocessor uses stacks but is not a state machine in
the sense I would use "state machine" (but usage of "state
machine" can differ). My preprocessor is a cascade of
seven functions corresponding to the description in the
Standard.
You said you're having trouble distinguishing label names from
other identifiers. An ordinary parser, using methods that have
been in common use for decades (e.g., yacc) will have no trouble at
all with that; the distinction will fall out automatically given a
correct grammar definition. You'll trigger the production for a
"labeled-statement", and you'll have enough information at that
point to determine which of the three kinds of labeled-statement
you have, and the identifier for the label, if any.
If your parser is having trouble with this, I suspect you're using
ad-hoc methods that are never going to work as well as the usual
approach. (Unless there's some specific advantage to your approach?)
I'm not having trouble. I mentioned a half dozen possible
solutions all of which I dislike because I see them as
inelegant. I feel sure that whatever YACC does is one of
these but I don't know that much about the innards of YACC.

I only see one kind of labeled-statement in the C syntax.
But I use C89 and GOK what the later C might add.
Keith Thompson
2017-07-22 02:59:58 UTC
Permalink
Raw Message
[...]
Post by David Kleinecke
Post by Keith Thompson
Why would a parser need to distinguish between operators and objects,
and what exactly do you mean by "objects"? Objects in the C sense
of the word don't exist yet as far as the parser is concerned.
The distinction is fundamental to my approach. The
objects are data the operators act upon. I can't call
them variables because they include consonants and
string literals.
You mean operands, then. That would apply only while parsing
expressions. I wouldn't expect a C parser to have to treat expression
specially. It's all one big consistent grammar for the entire language
(with the previously mentioned special case for typedef names).

[...]
Post by David Kleinecke
Post by Keith Thompson
If your parser is having trouble with this, I suspect you're using
ad-hoc methods that are never going to work as well as the usual
approach. (Unless there's some specific advantage to your approach?)
I'm not having trouble. I mentioned a half dozen possible
solutions all of which I dislike because I see them as
inelegant. I feel sure that whatever YACC does is one of
these but I don't know that much about the innards of YACC.
I only see one kind of labeled-statement in the C syntax.
But I use C89 and GOK what the later C might add.
You know, it's really easy to check what the current standard says.
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
But the definition of a labeled-statement hasn't changed:

labeled-statement:
identifier : statement
case constant-expression : statement
default : statement
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
David Kleinecke
2017-07-22 03:25:38 UTC
Permalink
Raw Message
Post by Keith Thompson
[...]
Post by David Kleinecke
Post by Keith Thompson
Why would a parser need to distinguish between operators and objects,
and what exactly do you mean by "objects"? Objects in the C sense
of the word don't exist yet as far as the parser is concerned.
The distinction is fundamental to my approach. The
objects are data the operators act upon. I can't call
them variables because they include consonants and
string literals.
You mean operands, then. That would apply only while parsing
expressions. I wouldn't expect a C parser to have to treat expression
specially. It's all one big consistent grammar for the entire language
(with the previously mentioned special case for typedef names).
[...]
Post by David Kleinecke
Post by Keith Thompson
If your parser is having trouble with this, I suspect you're using
ad-hoc methods that are never going to work as well as the usual
approach. (Unless there's some specific advantage to your approach?)
I'm not having trouble. I mentioned a half dozen possible
solutions all of which I dislike because I see them as
inelegant. I feel sure that whatever YACC does is one of
these but I don't know that much about the innards of YACC.
I only see one kind of labeled-statement in the C syntax.
But I use C89 and GOK what the later C might add.
You know, it's really easy to check what the current standard says.
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
identifier : statement
case constant-expression : statement
default : statement
Oh. Yes that's what the standard says. I don't think of case
or default statements as labeled because they start with a
keyword and can be handled routinely. But I should have
looked and used a different phrase. Oh Well.
Noob
2017-07-21 22:21:50 UTC
Permalink
Raw Message
Post by David Kleinecke
I have posted a couple of examples of what I mean here in
past. I could repeat them if they are needed. A parser is
not necessarily a state machine. And I think one needs two
stacks (at least) - one for operators and one for objects.
I use several more stacks - for instance one for break
targets and another for continue targets.
My preprocessor uses stacks but is not a state machine in
the sense I would use "state machine" (but usage of "state
machine" can differ). My preprocessor is a cascade of
seven functions corresponding to the description in the
Standard.
"When I use a word,' Humpty Dumpty said in rather a scornful tone,
'it means just what I choose it to mean — neither more nor less."
Joe Pfeiffer
2017-07-22 02:08:26 UTC
Permalink
Raw Message
Post by David Kleinecke
Post by Joe Pfeiffer
Post by David Kleinecke
I know about LL(1) and the rest. I just like to avoid using
any more jargon than is necessary. My compilers are unlike
usual compilers because of the state machine intermediate
step.
Normally the tokenizer is a state machine, and the parser is a state
machine plus a stack. Are you saying you've got another state machine
in between the two? Whatever for?
I have posted a couple of examples of what I mean here in
past. I could repeat them if they are needed. A parser is
not necessarily a state machine. And I think one needs two
stacks (at least) - one for operators and one for objects.
I use several more stacks - for instance one for break
targets and another for continue targets.
You clearly have a different definition of "elegance" than I do.

When you say you think it "needs" two stacks (as opposed to you deciding
to use two for some reason), you need to go reread the Dragon book.
Joe Pfeiffer
2017-07-21 02:02:10 UTC
Permalink
Raw Message
Post by David Kleinecke
Post by bartc
Post by David Kleinecke
This is about a point where it seems a C compiler must
abandon elegance and use a kludge. If all you know about
how compilers work is "use YACC or an equivalent" then
please explain for me how YACC does it.
The problem is differentiating between statement labels
and identifier-initial expressions. When the parser
reads the first token of a new statement and recognizes
an identifier how, if at all, does it know whether that
identifier is a variable name or label?
I checked my own parser (which is manually written recursive descent).
If parsing a new statement and it's an identifier, then there is one of
(1) A colon follows, so it's a label (and it must be followed by a fresh
nested statement, possibly another label)
(2) If this is a typedef (requires some name resolving), then it's a
declaration
(3) Otherwise it's probably the start of an expression.
I think Bart has an inkling of what I am talking about. But ...
I asking how the parse proceeds. If I must use the following colon
to detect a label I have to get the non-label back into the parse
somehow to allow it to be recognized in an expression. I could
just stick it back into the token stream (like an unegtc) but
then I would have to spend a check on every next token call to
see whether I was a token ahead. I could just demand that the
token be a array tokens in member so I could just drop the
token pointer back one.
Conversely if I test for expressions first I have to make an error
recovery when the colon failed the expression scan.
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token. (The pre-processor must be able to read
three characters - but that's a different story).
I handle typedefs by turning every typedef name into what is
effectively a keyword. This is also a kludge but not as blatant
a kludge as label handling.
There is a terrific book called "Crafting a Compiler", by Fischer and
LeBlanc (there's a later edition by Fischer, Cytron, and LeBlanc, but
I'm not familiar with it). It goes into considerable detail on how one
goes about writing a parser, either bottom-up or top-down, given a
grammar. You can save yourself a lot of needless head-scratching by
learning how this stuff is done before you start writing.
David Kleinecke
2017-07-21 02:53:26 UTC
Permalink
Raw Message
Post by Joe Pfeiffer
Post by David Kleinecke
Post by bartc
Post by David Kleinecke
This is about a point where it seems a C compiler must
abandon elegance and use a kludge. If all you know about
how compilers work is "use YACC or an equivalent" then
please explain for me how YACC does it.
The problem is differentiating between statement labels
and identifier-initial expressions. When the parser
reads the first token of a new statement and recognizes
an identifier how, if at all, does it know whether that
identifier is a variable name or label?
I checked my own parser (which is manually written recursive descent).
If parsing a new statement and it's an identifier, then there is one of
(1) A colon follows, so it's a label (and it must be followed by a fresh
nested statement, possibly another label)
(2) If this is a typedef (requires some name resolving), then it's a
declaration
(3) Otherwise it's probably the start of an expression.
I think Bart has an inkling of what I am talking about. But ...
I asking how the parse proceeds. If I must use the following colon
to detect a label I have to get the non-label back into the parse
somehow to allow it to be recognized in an expression. I could
just stick it back into the token stream (like an unegtc) but
then I would have to spend a check on every next token call to
see whether I was a token ahead. I could just demand that the
token be a array tokens in member so I could just drop the
token pointer back one.
Conversely if I test for expressions first I have to make an error
recovery when the colon failed the expression scan.
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token. (The pre-processor must be able to read
three characters - but that's a different story).
I handle typedefs by turning every typedef name into what is
effectively a keyword. This is also a kludge but not as blatant
a kludge as label handling.
There is a terrific book called "Crafting a Compiler", by Fischer and
LeBlanc (there's a later edition by Fischer, Cytron, and LeBlanc, but
I'm not familiar with it). It goes into considerable detail on how one
goes about writing a parser, either bottom-up or top-down, given a
grammar. You can save yourself a lot of needless head-scratching by
learning how this stuff is done before you start writing.
I've been writing compilers for a long time. I learned
many things from the Dragon Book. But I haven't looked at
it for a long time. Is it obsolete today?
Malcolm McLean
2017-07-21 10:54:18 UTC
Permalink
Raw Message
Post by David Kleinecke
I've been writing compilers for a long time. I learned
many things from the Dragon Book. But I haven't looked at
it for a long time. Is it obsolete today?
Books like that never really become obsolete.
Joe Pfeiffer
2017-07-21 17:52:58 UTC
Permalink
Raw Message
Post by David Kleinecke
I've been writing compilers for a long time. I learned
many things from the Dragon Book. But I haven't looked at
it for a long time. Is it obsolete today?
No, but I found the one I mentioned is a lot easier reading.
Malcolm McLean
2017-07-21 10:52:39 UTC
Permalink
Raw Message
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token. (The pre-processor must be able to read
three characters - but that's a different story).
So change the lexer. alpha + {alphanumeric} + colon = label,
Post by David Kleinecke
I handle typedefs by turning every typedef name into what is
effectively a keyword. This is also a kludge but not as blatant
a kludge as label handling.
There's a subtle problem with C typedefs which I keep on understanding
then forgetting.
Malcolm McLean
2017-07-21 12:46:47 UTC
Permalink
Raw Message
Post by Malcolm McLean
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token. (The pre-processor must be able to read
three characters - but that's a different story).
So change the lexer. alpha + {alphanumeric} + colon = label,
Rely to own post: that won't work because of the ? : if ... else shorthand
syntax.
bartc
2017-07-21 13:53:19 UTC
Permalink
Raw Message
Post by Malcolm McLean
Post by Malcolm McLean
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token. (The pre-processor must be able to read
three characters - but that's a different story).
So change the lexer. alpha + {alphanumeric} + colon = label,
Rely to own post: that won't work because of the ? : if ... else shorthand
syntax.
I don't think it would work anyway, as C allows any amount of white
space and comments between a label and its colon. Plus directives. Plus
the label and/or colon could be isolated within conditional blocks:

fred
/* comment
.... */
// line \
comment
#include <stdio.h>

#if 1
:
#endif

And the label and/or colon could be in separate include files.

A token normally expects consecutive characters.

(Of course, you might need to do a lot of searching for source code
where there was a gap between a label and the following colon.)
--
bartc
j***@verizon.net
2017-07-21 18:38:01 UTC
Permalink
Raw Message
Post by bartc
Post by Malcolm McLean
Post by Malcolm McLean
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token. (The pre-processor must be able to read
three characters - but that's a different story).
So change the lexer. alpha + {alphanumeric} + colon = label,
Rely to own post: that won't work because of the ? : if ... else shorthand
syntax.
I don't think it would work anyway, as C allows any amount of white
space and comments between a label and its colon.
You're supposed to remove the white space before performing the syntactic analysis that would determine that it's a label. (5.1.1.2p7)
Post by bartc
... Plus directives. Plus
You're supposed remove all directives and implement conditional inclusion during translation phase 4 (5.1.1.2p4), long before you need to decide whether or not the identifier is a label, during translation phase 7. You're not required to do those translation phases in the specified order, so long as your implementation produces the same effect as if it had performed them in order. However, if you're finding these issues problematic, then maybe you should reconsider the design of your compiler. Actually performing those steps in precisely the specified order will completely resolve these issues (and many other similar ones).

...
Post by bartc
And the label and/or colon could be in separate include files.
The fact that they came from different include files might be something you keep track of for the purpose of producing more meaningful error messages, but otherwise all traces of that fact should have disappeared by the end of translation phase 4.
Post by bartc
A token normally expects consecutive characters.
All the characters that make up the identifier token that will be identified as a label during translation phase 7, should have been consecutive characters by the end of translation phase 4, when they were parsed as a single pre-processing token. They might not have been consecutive at the start of that phase, of course.
bartc
2017-07-21 18:56:36 UTC
Permalink
Raw Message
Post by j***@verizon.net
Post by bartc
Post by Malcolm McLean
Post by Malcolm McLean
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token. (The pre-processor must be able to read
three characters - but that's a different story).
So change the lexer. alpha + {alphanumeric} + colon = label,
Rely to own post: that won't work because of the ? : if ... else shorthand
syntax.
I don't think it would work anyway, as C allows any amount of white
space and comments between a label and its colon.
You're supposed to remove the white space before performing the syntactic analysis that would determine that it's a label. (5.1.1.2p7)
Post by bartc
... Plus directives. Plus
You're supposed remove all directives and implement conditional inclusion during translation phase 4 (5.1.1.2p4), long before you need to decide whether or not the identifier is a label, during translation phase 7. You're not required to do those translation phases in the specified order, so long as your implementation produces the same effect as if it had performed them in order. However, if you're finding these issues problematic, then maybe you should reconsider the design of your compiler. Actually performing those steps in precisely the specified order will completely resolve these issues (and many other similar ones).
...
Post by bartc
And the label and/or colon could be in separate include files.
The fact that they came from different include files might be something you keep track of for the purpose of producing more meaningful error messages, but otherwise all traces of that fact should have disappeared by the end of translation phase 4.
Post by bartc
A token normally expects consecutive characters.
All the characters that make up the identifier token that will be identified as a label during translation phase 7, should have been consecutive characters by the end of translation phase 4, when they were parsed as a single pre-processing token. They might not have been consecutive at the start of that phase, of course.
The bit that builds an identifier will expect the incoming sequence of
alphanumeric and _ characters to be consecutive. If necessary after
#include and #if processing, macro replacement, comment and white space
reduction, and token pasting.

If identifier: is to be considered one token as Malcolm suggested, then
the : should either follow immediately or be trivially separated by
white space, eg. one space, tab or newline.

Otherwise identifier : needs to be recognised from two lower level
tokens, requiring an extra tokenising step. But this would be a lot more
work than just doing this detection in the parser (and where it would
work properly even with ?: operators).

The OP doesn't like doing this in the parser, even though one-token
lookahead is trivially possible, because it's untidy. In his opinion.
--
bartc
David Kleinecke
2017-07-21 20:50:16 UTC
Permalink
Raw Message
Post by bartc
Post by j***@verizon.net
Post by bartc
Post by Malcolm McLean
Post by Malcolm McLean
Post by David Kleinecke
This is, so far as I can tell, the only place in the C syntax
where it is potentially necessary to read two tokens ahead rather
than just the next token. (The pre-processor must be able to read
three characters - but that's a different story).
So change the lexer. alpha + {alphanumeric} + colon = label,
Rely to own post: that won't work because of the ? : if ... else shorthand
syntax.
I don't think it would work anyway, as C allows any amount of white
space and comments between a label and its colon.
You're supposed to remove the white space before performing the syntactic analysis that would determine that it's a label. (5.1.1.2p7)
Post by bartc
... Plus directives. Plus
You're supposed remove all directives and implement conditional inclusion during translation phase 4 (5.1.1.2p4), long before you need to decide whether or not the identifier is a label, during translation phase 7. You're not required to do those translation phases in the specified order, so long as your implementation produces the same effect as if it had performed them in order. However, if you're finding these issues problematic, then maybe you should reconsider the design of your compiler. Actually performing those steps in precisely the specified order will completely resolve these issues (and many other similar ones).
...
Post by bartc
And the label and/or colon could be in separate include files.
The fact that they came from different include files might be something you keep track of for the purpose of producing more meaningful error messages, but otherwise all traces of that fact should have disappeared by the end of translation phase 4.
Post by bartc
A token normally expects consecutive characters.
All the characters that make up the identifier token that will be identified as a label during translation phase 7, should have been consecutive characters by the end of translation phase 4, when they were parsed as a single pre-processing token. They might not have been consecutive at the start of that phase, of course.
The bit that builds an identifier will expect the incoming sequence of
alphanumeric and _ characters to be consecutive. If necessary after
#include and #if processing, macro replacement, comment and white space
reduction, and token pasting.
If identifier: is to be considered one token as Malcolm suggested, then
the : should either follow immediately or be trivially separated by
white space, eg. one space, tab or newline.
Otherwise identifier : needs to be recognised from two lower level
tokens, requiring an extra tokenising step. But this would be a lot more
work than just doing this detection in the parser (and where it would
work properly even with ?: operators).
The OP doesn't like doing this in the parser, even though one-token
lookahead is trivially possible, because it's untidy. In his opinion.
There is also a cost in implementing look-ahead I would
like to avoid.

I handle the "tertiary" operation as though
QUESTION_MARK - expression - COLON
were a simple binary operation.
bartc
2017-07-21 22:07:46 UTC
Permalink
Raw Message
Post by David Kleinecke
Post by bartc
The OP doesn't like doing this in the parser, even though one-token
lookahead is trivially possible, because it's untidy. In his opinion.
There is also a cost in implementing look-ahead I would
like to avoid.
What sort of cost: runtime, lines of code, development...? Because I
thought you said you weren't interested in optimising the parser. All
the other options are more work than just doing that one look-ahead.
Post by David Kleinecke
I handle the "tertiary" operation as though
QUESTION_MARK - expression - COLON
were a simple binary operation.
The ?: handler in my parser is about 45 lines of code, but most of it is
dealing with the expression types and reducing ?: terms with constant
conditions. Stripped down it's barely a dozen non-blank lines [code
translated to C]:

static unitrec* readcondexpr (void) {
unitrec *x, *y, *pcond;

pcond = readorlexpr();

if (lx.symbol == questionsym) {
lex();
x = readexpression();
skipsymbol(colonsym);
y = readcondexpr();
pcond = createunit3(j_ifx, pcond, x, y);
}
return pcond;
}

This pretty much directly implements the description of conditional
expressions in the grammar:

conditional-expression:
logical-OR-expression
logical-OR-expression ? expression : conditional-expression

The point is that parsing C (just parsing) shouldn't present any
problems (unlike macro processing, or dealing with everything else that
comes after the parsing).
--
bartc
jacobnavia
2017-07-21 22:57:52 UTC
Permalink
Raw Message
dealing with everything else that comes after the parsing
Yes. My back end for the arm is starting to compile big programs now,
and it compiles itself.

I made a table of all stdlib functions, and if you use standard C, I
link the program and after successful compilation I jump into main()

The program executes on the spot.

In later versions I will start it under the debugger. I have to figure
out how to tell gdb to read the debug info not from a file but from a
buffer.

Or port my debugger to linux as I started several years ago with the
ptrace interface.

What debugger do you use for your language?
bartc
2017-07-22 00:37:11 UTC
Permalink
Raw Message
Post by jacobnavia
dealing with everything else that comes after the parsing
Yes. My back end for the arm is starting to compile big programs now,
and it compiles itself.
I made a table of all stdlib functions, and if you use standard C, I
link the program and after successful compilation I jump into main()
The program executes on the spot.
In later versions I will start it under the debugger. I have to figure
out how to tell gdb to read the debug info not from a file but from a
buffer.
Or port my debugger to linux as I started several years ago with the
ptrace interface.
What debugger do you use for your language?
Trial and error. Plus any good ideas I can think of to narrow down a
problem.

(Debugging language-related programs present special challenges anyway,
as any problem is not necessarily within the code I'm trying to execute.
But within the code that generated that code. Or in the interpreter than
executed the code that generated the code.

So a normal sort of debugger, even if I used one, wouldn't be much help.
Besides I wouldn't want to have to step through some of the other
people's C code that is giving problems. I rather not even have to look
through the source.)
--
bartc
David Kleinecke
2017-07-22 02:59:24 UTC
Permalink
Raw Message
Post by bartc
Post by David Kleinecke
Post by bartc
The OP doesn't like doing this in the parser, even though one-token
lookahead is trivially possible, because it's untidy. In his opinion.
There is also a cost in implementing look-ahead I would
like to avoid.
What sort of cost: runtime, lines of code, development...? Because I
thought you said you weren't interested in optimising the parser. All
the other options are more work than just doing that one look-ahead.
You can't eat one potato chip. You can't execute just
one look-ahead. You have to check every access to assure
it is not a look ahead. I don't consider avoiding that
optimizing - just good sense.
Post by bartc
Post by David Kleinecke
I handle the "tertiary" operation as though
QUESTION_MARK - expression - COLON
were a simple binary operation.
The ?: handler in my parser is about 45 lines of code, but most of it is
dealing with the expression types and reducing ?: terms with constant
conditions. Stripped down it's barely a dozen non-blank lines [code
static unitrec* readcondexpr (void) {
unitrec *x, *y, *pcond;
pcond = readorlexpr();
if (lx.symbol == questionsym) {
lex();
x = readexpression();
skipsymbol(colonsym);
y = readcondexpr();
pcond = createunit3(j_ifx, pcond, x, y);
}
return pcond;
}
This pretty much directly implements the description of conditional
logical-OR-expression
logical-OR-expression ? expression : conditional-expression
The point is that parsing C (just parsing) shouldn't present any
problems (unlike macro processing, or dealing with everything else that
comes after the parsing).
My version looks something like this (simplified to remove
confusing context):

if (is(QUESTION_MARK)) {
*prefix++ = QUESTION_MARK;
output (comp @ 0);
x = new_label();
output (jtest x);
*label++ = x;
goto _P;}
. . .
_P: if (expression()) {
x = --label*;
y = new_label();
output (jump y);
*label++ = y;
place (x);
goto _Q;}
else ERROR
_Q: if (is(COLON)) {
place (--label*);
goto ...}
else ERROR

To answer a different query: I think my code is "closer to the
iron".
else ERROR
bartc
2017-07-22 10:18:53 UTC
Permalink
Raw Message
Post by David Kleinecke
Post by bartc
Post by David Kleinecke
Post by bartc
The OP doesn't like doing this in the parser, even though one-token
lookahead is trivially possible, because it's untidy. In his opinion.
There is also a cost in implementing look-ahead I would
like to avoid.
What sort of cost: runtime, lines of code, development...? Because I
thought you said you weren't interested in optimising the parser. All
the other options are more work than just doing that one look-ahead.
You can't eat one potato chip. You can't execute just
one look-ahead. You have to check every access to assure
it is not a look ahead. I don't consider avoiding that
optimizing - just good sense.
Then your tokenising scheme isn't quite the pre-prepared array of tokens
that you suggested.

Otherwise it could simply work like this, if 'tokenlist' was an array of
token codes (with other parallel arrays for other info, or it could be
an array of structs):

if (tokenlist[currtokenno] == NAME_TOKEN) {
if (tokenlist[currtokenno + 1] == COLON_TOKEN) {
....
Post by David Kleinecke
Post by bartc
Post by David Kleinecke
I handle the "tertiary" operation as though
QUESTION_MARK - expression - COLON
were a simple binary operation.
My version looks something like this (simplified to remove
if (is(QUESTION_MARK)) {
*prefix++ = QUESTION_MARK;
x = new_label();
output (jtest x);
*label++ = x;
goto _P;}
. . .
_P: if (expression()) {
x = --label*;
y = new_label();
output (jump y);
*label++ = y;
place (x);
goto _Q;}
else ERROR
_Q: if (is(COLON)) {
place (--label*);
goto ...}
else ERROR
To answer a different query: I think my code is "closer to the
iron".
This looks like it's also generating code of some sort. So is perhaps
unconventional. (I generate an AST, on top of which I have to do name
resolution, type-checking (with promotions, conversions etc) and
constant expression reduction. Then eventually code generation.

[From another post]
Post by David Kleinecke
Oh. Yes that's what the standard says. I don't think of case
or default statements as labeled because they start with a
keyword and can be handled routinely. But I should have
looked and used a different phrase. Oh Well.
If you're not attempting to compile existing C code, you can also
introduce a Label keyword for labels:

Label L1:

Then you don't need lookahead in the parser. You just need to arrange
for this macro to be present:

#define Label

when compiling the code with conventional tools.
--
bartc
David Kleinecke
2017-07-22 16:30:05 UTC
Permalink
Raw Message
Post by bartc
Post by David Kleinecke
Post by bartc
Post by David Kleinecke
Post by bartc
The OP doesn't like doing this in the parser, even though one-token
lookahead is trivially possible, because it's untidy. In his opinion.
There is also a cost in implementing look-ahead I would
like to avoid.
What sort of cost: runtime, lines of code, development...? Because I
thought you said you weren't interested in optimising the parser. All
the other options are more work than just doing that one look-ahead.
You can't eat one potato chip. You can't execute just
one look-ahead. You have to check every access to assure
it is not a look ahead. I don't consider avoiding that
optimizing - just good sense.
Then your tokenising scheme isn't quite the pre-prepared array of tokens
that you suggested.
Otherwise it could simply work like this, if 'tokenlist' was an array of
token codes (with other parallel arrays for other info, or it could be
if (tokenlist[currtokenno] == NAME_TOKEN) {
if (tokenlist[currtokenno + 1] == COLON_TOKEN) {
....
There are, assuming I counted right, 78 pre-defined tokens
all the rest are defined on the fly as encountered by the
pre-procesor.
Post by bartc
Post by David Kleinecke
Post by bartc
Post by David Kleinecke
I handle the "tertiary" operation as though
QUESTION_MARK - expression - COLON
were a simple binary operation.
My version looks something like this (simplified to remove
if (is(QUESTION_MARK)) {
*prefix++ = QUESTION_MARK;
x = new_label();
output (jtest x);
*label++ = x;
goto _P;}
. . .
_P: if (expression()) {
x = --label*;
y = new_label();
output (jump y);
*label++ = y;
place (x);
goto _Q;}
else ERROR
_Q: if (is(COLON)) {
place (--label*);
goto ...}
else ERROR
To answer a different query: I think my code is "closer to the
iron".
This looks like it's also generating code of some sort. So is perhaps
unconventional. (I generate an AST, on top of which I have to do name
resolution, type-checking (with promotions, conversions etc) and
constant expression reduction. Then eventually code generation.
It's the assembly language of an imaginary machine,
Post by bartc
[From another post]
Post by David Kleinecke
Oh. Yes that's what the standard says. I don't think of case
or default statements as labeled because they start with a
keyword and can be handled routinely. But I should have
looked and used a different phrase. Oh Well.
If you're not attempting to compile existing C code, you can also
Then you don't need lookahead in the parser. You just need to arrange
#define Label
when compiling the code with conventional tools.>
Had I been there I would have required a "label" keyword:
label foo;
and a separate statement. But I want to follow the standard.

h***@gmail.com
2017-07-22 01:37:12 UTC
Permalink
Raw Message
On Friday, July 21, 2017 at 11:56:49 AM UTC-7, Bart wrote:

(snip on compiler technology and token lookahead.)
Post by bartc
The OP doesn't like doing this in the parser, even though one-token
lookahead is trivially possible, because it's untidy. In his opinion.
I tend not to use ungetc() in text processing programs,
though I don't mind if other I/O functions call it.

But yes, lookahead is common in compiler technology.

In the case of yacc, you don't have to think about it, just know
that yacc does what it is supposed to do.

I haven't thought about this recently, but I believe that
yacc will refactor the grammar such that it doesn't need so
much lookahead, even if you don't write it that way.

You don't (simplifying the notation somewhat, but I hope
you get the idea:

ifstatement := ifthen | ifthenelse;

where it would need a lot of lookahead, but instead:

ifstatement := ifthen [ else ];

Lookahead or not, there is internal state of the parsed
but not yet reduced input. Tokens don't take up all that
much space.

Writing parsers for languages without reserved words is
just a little more interesting. Even more for fixed form
Fortran, where spaces are not significant. There are fun
examples in Fortran, some with interesting bug history.

The assignment statement

DO 1 I=1.23

and the loop statement

DO1I = 1,23

Fortran parser technology is well understood, but somewhat
different from other languages.

PL/I, where spaces are significant, but keywords are not
reserved, has some interesting examples like:

if if=then then if=else; else else=then;

You might confuse the reader, but not the compiler.
Loading...