Post by bart
Post by BGB
Post by James Kuyper
struct vector;
struct scenet;
struct vector {
double x;
double y;
double z;
};
struct scenet {
struct vector center;
double radius;
struct scenet (*child)[];
};
6.7.6.2p2: "The element type shall not be an incomplete or function type."
I have many draft versions of the C standard. n2912.pdf, dated
2022-06-08, says in 6.7.2.1.p3 about struct types that "... the type is
incomplete144) until immediately after the closing brace of the list
defining the content, and complete thereafter."
Therefore, struct scenet is not a complete type until the closing brace
of its declaration.
struct scenet *child;
};
The struct is incomplete, but it still knows how to do pointer
arithmetic with that member. The calculation is not that different
from the array version (actually, the code from my compiler is
identical).
Difference is, in this case, "sizeof(struct scenet)" is not relevant
to "sizeof(struct scenet *)".
No, both of these need to know the size of the struct when accessing the
....
struct scenet *childp;
struct scenet (*childa)[];
};
The only thing you can't do with x->childa is perform pointer arithmetic
on the whole pointer-to-array, since the array size is zero. But doing
(*x->childa)[i] should be fine.
As is clear since other compilers (excluding those that slavishly copy
gcc's behaviour) have no problem with it.
I think it is more a case of formal definitions here...
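For reference, a minimal sketch of the two forms that are not in dispute: a plain pointer member inside the struct, and a pointer-to-array type declared after the closing brace, where struct scenet is complete. The names scenet_array_ptr, radius_at and radius_at_ptr are made up for illustration and do not come from the thread.

```c
#include <stddef.h>

struct vector {
    double x;
    double y;
    double z;
};

struct scenet {
    struct vector center;
    double radius;
    struct scenet *child;   /* pointer to the still-incomplete struct: accepted */
};

/* Declared after the closing brace, so the element type is complete here. */
typedef struct scenet (*scenet_array_ptr)[];

/* Pointer-to-array: dereference first, then index the array. */
double radius_at(scenet_array_ptr p, size_t i)
{
    return (*p)[i].radius;
}

/* Plain element pointer: index directly. */
double radius_at_ptr(const struct scenet *p, size_t i)
{
    return p[i].radius;
}
```

Both accessors end up doing the same address calculation; only the declaration inside the struct is what the quoted constraint is about.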
Post by bart
Post by BGB
Formally, with the parentheses and array, the size of the struct is
considered relevant (even if not strictly so), but is also unknown at
that point.
This seems like obscure edge case territory.
It's a 'pointer to array'; it might be uncommon in C (because of its
fugly syntax), but it is hardly obscure!
In my own use, excluding function pointers, I almost never have a need
to use parentheses with declarations.
Post by bart
Post by BGB
Alas, if I could have my way, I might define a simplified subset which
drops some of these sorts of edge cases (the form with parentheses
would simply become disallowed), but, likely, this wouldn't amount to
much.
T(*)[] is a perfectly valid type; there is no reason to exclude it from
struct members.
It is unambiguous in my original language, and can also be in C.
I have a slight difference of opinion in that, if I were designing C, it
would not be allowed.
The merit of C is, in a way, that it has almost just what is needed, little
more, and little less.
Unlike, say, C++, which went down the rabbit hole of ever-increasing
complexity. None the less, it has still gained some complexities beyond
the bare minimum, and still has some weak points. Such as lacking a
standardized form of vector/SIMD extensions, or any way to have
customizable types (though, the latter point risks getting dangerously
close to C++ territory, so dunno).
Post by bart
Post by BGB
Mostly backwards compatible with existing C code;
Allows for smaller and simpler compilers;
Uses some C# like rules to eliminate the need for checking for
typedefs to parse stuff.
Though, one can't go entirely over to C# like behavior if one still
wants to support traditional separate compilation (so one would still
have a need for things like function prototypes, header files, and a
traditional preprocessor).
But, then one would basically just end up with C but with people being
confused about why things like "unsigned x;" no longer work (making it
kinda moot).
And, most people continue to swear by GCC and Clang, unconcerned with
their multi MLOC codebases, and the overly long time it takes to
recompile the compiler from source...
Yeah. I can choose to run my compiler from source each time it is
invoked; you barely notice the difference! (It adds 70-80ms.)
This cuts no ice here however.
Partial reason BGBCC still exists:
GCC and Clang are monstrosities (huge and slow to compile);
LCC offered very little over what I had already at the time;
TinyC didn't look like a particularly attractive starting point either.
However, as it is (having expanded significantly over the past some-odd
years), it can still be recompiled from source in a few seconds...
Whereas, rebuilding GCC is a good part of an hour, and LLVM+Clang
somehow manages to have build times measured in multiple hours (and, the
build times for Clang seem to get slower faster than computers are
getting faster).
Granted, one can speed it up some by temporarily disabling one's
antivirus software, but that one needs to start caring about things
like disabling AV software for faster build times in the first place
is still a problem...
Post by bart
Post by BGB
Granted, my existing compiler is a bit bigger; sadly, its code
footprint is more on par with Quake3, and its memory footprint
generally a bit steep (well, if one wants to run it on an FPGA board
with 128MB of total RAM; ideally one wants to keep the memory
footprint needed to compile a moderate size program in under around
50MB or so; which is an epic fail for my compiler as it is...).
And, as-is, compiling stuff takes a painfully long time on a 50MHz CPU
(even a moderately small program might take several minutes or more).
You can't cross-compile on a PC?
That is what I normally do, but it would be "nice" to have the option to
compile stuff natively from within the FPGA soft-processor or emulator.
But, to make this more practical would need a faster and lighter weight
compiler than what I have already.
Seemingly big issues:
Parsing an AST for a whole translation unit, eats a lot of RAM;
Decoding stuff into the internal 3AC IR, for a whole program at a time,
also eats a lot of RAM.
I had tried to look into designing a compiler with the preprocessor and
parser overlaid via a linked-list "line buffer" where, the preprocessor
would preprocess lines, put them in a linked list, and the parser would
consume them (freeing up each line once all tokens were consumed), and
then trying to drive the middle part of the compilation process one
top-level declaration at a time.
This turned into more of a mess than I would have hoped.
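As a rough illustration of the "line buffer" idea only (not the actual attempt described above), the hand-off between preprocessor and parser could look something like this. The names line_node, line_queue, lq_push and lq_pop are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* One preprocessed line, stored inline after the link pointer. */
struct line_node {
    struct line_node *next;
    char text[];
};

/* FIFO of lines: the preprocessor pushes at the tail, the parser pops at the head. */
struct line_queue {
    struct line_node *head;
    struct line_node *tail;
};

/* Preprocessor side: append a copy of one line. Returns 0 on success, -1 on OOM. */
int lq_push(struct line_queue *q, const char *text)
{
    size_t len = strlen(text) + 1;
    struct line_node *n = malloc(sizeof *n + len);
    if (!n)
        return -1;
    memcpy(n->text, text, len);
    n->next = NULL;
    if (q->tail)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    return 0;
}

/* Parser side: detach the oldest line, or NULL if the queue is empty.
   The caller calls free() on the node once all its tokens are consumed. */
struct line_node *lq_pop(struct line_queue *q)
{
    struct line_node *n = q->head;
    if (!n)
        return NULL;
    q->head = n->next;
    if (!q->head)
        q->tail = NULL;
    n->next = NULL;
    return n;
}
```

The point of the structure is that only the lines between the two stages are ever resident, rather than the whole preprocessed translation unit.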
My existing compiler runs the preprocessor first, and generates a text
buffer containing the entire preprocessed output, but this can sometimes
reach sizes in MB territory (mostly with all of the stuff pulled in from
headers, which will often dwarf the actual code in each translation unit).
Then, the parser is left churning through large numbers of things like
structs, typedefs, and function prototypes, before getting to the actual
code. Parsing all these into an AST eats time and memory.
While the AST is arguably very bulky, one can at least entirely discard
it after each translation unit (this is one use case for a zone
allocator; where one can allocate AST related memory in an AST zone and
free all of it after each translation unit). The steep up-front cost of
the preprocessor output can also be reduced slightly by "chunking" the
buffering, say, into multiples of 32kB or similar (as opposed to trying
to "malloc()" the whole 1MB or so in a single large buffer).
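A minimal sketch of the zone-allocator idea, assuming a simple bump allocator over a list of chunks. The names zone, zone_alloc and zone_free_all are invented here, and the 32kB chunk size is just borrowed from the chunking figure above; this is not BGBCC's actual allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ZONE_CHUNK_SIZE (32u * 1024u)   /* assumed chunk size */

/* Chunk header; the union forces max alignment so the payload that
   follows the header is suitably aligned for any object type. */
typedef union zone_chunk {
    struct {
        union zone_chunk *next;
        size_t used;
        size_t cap;
    } h;
    max_align_t align_;
} zone_chunk;

typedef struct zone {
    zone_chunk *head;   /* most recently allocated chunk */
} zone;

/* Bump-allocate n bytes from the zone; returns NULL on failure. */
void *zone_alloc(zone *z, size_t n)
{
    size_t a = _Alignof(max_align_t);
    if (n > SIZE_MAX - sizeof(zone_chunk) - a)
        return NULL;
    n = (n + a - 1) & ~(a - 1);          /* round up to the alignment */

    zone_chunk *c = z->head;
    if (!c || c->h.cap - c->h.used < n) {
        /* Start a new chunk; oversized requests get their own chunk. */
        size_t cap = n > ZONE_CHUNK_SIZE ? n : ZONE_CHUNK_SIZE;
        c = malloc(sizeof *c + cap);
        if (!c)
            return NULL;
        c->h.next = z->head;
        c->h.used = 0;
        c->h.cap = cap;
        z->head = c;
    }
    void *p = (unsigned char *)(c + 1) + c->h.used;
    c->h.used += n;
    return p;
}

/* Release everything at once, e.g. after each translation unit. */
void zone_free_all(zone *z)
{
    zone_chunk *c = z->head;
    while (c) {
        zone_chunk *next = c->h.next;
        free(c);
        c = next;
    }
    z->head = NULL;
}
```

Individual AST nodes are never freed; the whole zone goes away in one pass, which is what makes the per-translation-unit discard cheap.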
Ideally, one then wants to leave the IL in a form where the compiler
doesn't need to load everything into 3AC form all at once, but my
existing IL design left little choice here. It was designed in a purely
linear structure with symbols managed by a sort of sliding array with an
LZ compression scheme, which means effectively the bytecode needs to be
decoded linearly and all at once.
Too many things that eat RAM.
Better would have been a structure where only a high-level view of the
metadata needs to be decoded up-front (and then possibly in a way that
allows a cache-like approach), and similarly allowed for decoding the
Stack-IL into 3AC incrementally (say, when we are actually compiling the
function in question).
But, it is also a question of how to pull things off in a memory-compact
way without re-introducing a lot of the limitations that existed in
1980s era compilers (say, for example, the compiler having no way to
know whether or not a given function is reachable within the call graph).
Say, if you decode the entire program into 3AC form all at once, it is
possible to do things like walk the entire program as a graph and trace
out what functions are reachable (and determine things like local vs
external visibility, etc). This sort of a thing would be much less
viable if one could only look at a single function at a time.
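The whole-program walk can be sketched as a plain depth-first marking pass over a call graph. This assumes each function already has a list of callee indices available; cg_func and cg_mark_reachable are hypothetical names, not anything from BGBCC.

```c
#include <stdbool.h>
#include <stddef.h>

/* One function in the call graph; callees are indices into the same array. */
struct cg_func {
    const size_t *callees;
    size_t n_callees;
    bool reachable;
};

/* Mark funcs[idx] and everything it can transitively call as reachable.
   The reachable flag doubles as the visited flag, so cycles terminate. */
void cg_mark_reachable(struct cg_func *funcs, size_t n_funcs, size_t idx)
{
    if (idx >= n_funcs || funcs[idx].reachable)
        return;
    funcs[idx].reachable = true;
    for (size_t i = 0; i < funcs[idx].n_callees; i++)
        cg_mark_reachable(funcs, n_funcs, funcs[idx].callees[i]);
}
```

Anything still unmarked after walking from the entry point(s) is a candidate for being dropped, which is exactly the information a one-function-at-a-time compiler does not have.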
But, then, if one needs to burn, say, 64 bytes per 3AC operation (and
one may have on average several hundred 3AC ops per function, and
several thousand functions in a program), RAM cost adds up quickly.
Where, in BGBCC, generally each function would have a dense array of 3AC
operator structs, and another array of "traces" which give the starting
and ending index of each basic block, and some flags and similar.
Things like 3AC nodes and string tables eating up lots of RAM.
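To put rough numbers on this, here is a sketch of the layout described above (a dense op array plus a trace array per function) together with a back-of-envelope estimate. The field names and the exact field split are guesses for illustration; only the 64-byte op size comes from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 3AC operation record, padded out to 64 bytes. */
struct tac_op {
    uint16_t opcode;
    uint16_t type;
    uint32_t flags;
    uint64_t dst;
    uint64_t src_a;
    uint64_t src_b;
    uint64_t imm;
    uint8_t  pad[24];
};
_Static_assert(sizeof(struct tac_op) == 64, "tac_op is assumed to be 64 bytes");

/* A "trace": start and end index of one basic block, plus flags. */
struct tac_trace {
    uint32_t first_op;
    uint32_t last_op;
    uint32_t flags;
};

/* Per-function storage: dense array of ops and array of traces. */
struct tac_function {
    struct tac_op *ops;
    size_t n_ops;
    struct tac_trace *traces;
    size_t n_traces;
};

/* Bytes needed for the op arrays alone when every function is resident. */
size_t tac_ops_bytes(size_t functions, size_t ops_per_function)
{
    return functions * ops_per_function * sizeof(struct tac_op);
}
```

Taking, say, 3000 functions at 300 ops each (assumed values within the "several thousand" and "several hundred" ranges), tac_ops_bytes(3000, 300) comes to 57,600,000 bytes, so the op arrays alone would already exceed a 50MB budget.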
But, the partial result of all of this is a compiler that has an
impractical memory footprint for an FPGA based soft processor (and is
also impractically slow).
Then again, my compiler is pretty slow even on my main PC. The amount of
time it takes being similar to that taken by GCC; which is kinda dead
slow if compared with MSVC. Seemingly, MSVC is somehow a very fast
compiler, with Clang sort of in-between (slower than MSVC, but still
faster than GCC).
Though, for actual compiled program performance, GCC tends to do pretty
well, and MSVC often worse. But, for some things, the reverse is true
(where the MSVC output is a lot faster than the GCC output).
...
But, as for ISA support on my processor (and supported by BGBCC), there
are currently several options:
BJX2 Baseline
Original form of my custom ISA;
Primarily, it is a 32-register design, with 16/32/64/96 bit ops;
XG2:
Newer variant of my ISA;
Drops 16-bit ops, moves over to 6-bit register fields;
Natively uses 64 GPRs;
Has 32/64/96 bit encodings.
RISC-V (RV64G)
Uses 5 bit register fields, with 32 GPRs;
And, another 32 FPU registers.
The CPU supports the 16-bit "C" extension, but BGBCC does not.
With my design, the "C" ops come with a performance penalty.
I have a jumbo-prefix extension that adds 64 and 96 bit encodings.
Largely to improve performance.
It works in essentially the same way as in my own ISA,
and does similar things.
Among a few other custom extensions.
XG3:
Bit-repacked and modified version of my ISA;
Can be "crazy glued" onto RV64G to make a sort of hybrid ISA.
It implicitly "re-merges" the X and F registers,
which were split in RV64G.
But, more just that it goes back to what XG2 did...
Currently, performance:
Plain RV64G is slower than both XG2 and XG3,
including when compiled with "gcc -O3"
Though, GCC is faster than BGBCC when targeting bare RV64G.
BGBCC targeting plain RV64G: Kinda sucks...
If I trick out the ISA, BGBCC is faster than GCC targeting RV64G.
Dunno what would happen if GCC could use my ISA extensions...
XG2 currently holds the speed prize...
XG3 isn't quite as fast as XG2 at present, but has promise.
In theory, XG2 and XG3 should be basically equivalent, as they are (more
or less) the same ISA just with the bits shuffled around (mostly this
was to allow XG3 to coexist in the same opcode space as RV64G, replacing
the "C" extension's encoding space). In the process, I did slightly
improve the "aesthetics" of the encoding scheme.
There are some minor differences between them, mostly related to how
BGBCC is using the ISA, and the ABI (with XG3, it is using the RISC-V
ABI rules).
One thing that does minorly hurt BGBCC here is that it primarily uses
callee-save registers for local variables, where:
My native ABI is balanced in favor of slightly more callee save
registers than scratch registers;
The RISC-V ABI has more scratch registers than callee save registers;
So, when using the RISC-V ABI, there is more register pressure (and more
register spills).
Ironically, at the same time, the RISC-V ABI has fewer function argument
registers vs XG2 (8 vs 16), and lacks argument spill space, which in
turn contributes towards making things "slightly less efficient".
Can't really "fix" the ABI for XG3 without causing binary compatibility
issues with calls to/from RV64G code, which would defeat the whole point
of why XG3 exists.
Having more scratch registers (and fewer callee save) is better for leaf
functions, but implicitly assumes that one is spending more of their
time in leaf functions (and comparably hurts performance if the program
is dominated more by going up and down the call stack in non-leaf
functions).
Though, arguably, one could make more use of scratch-registers within
non-leaf functions if one could be strategic about where the function
calls occur and when/where register spills are needed (but, my compiler
is not that clever; and mostly treats the scratch-registers as
off-limits for local variables within non-leaf functions).
Then again, debatably, it could all be for nothing:
My fastest case is only around 40% faster than the "gcc -O3" output (for
programs like Doom and similar);
And, maybe, 40% isn't really enough to be worth the issues of a
non-standard ISA variant.
But, granted, it is closer to around 500% for OpenGL (trying to build my
OpenGL implementation with RV64 and GCC performs horribly). But, I kinda
needed to SIMD the crap out of this (plain RV64G lacks any form of SIMD
support).
...