Post by bart
Post by BGB
Post by Ben Bacarisse
Post by BGB
Post by Ben Bacarisse
While eliminating structs could also simplify things, structs also tend to
be a lot more useful.
Indeed. And I'd have to use them for this!
int a[4][4];
...
l=a[j][k];
int a[16];
...
l=a[j*4+k];
That's what you want to force me to write, but I can use an array of
arrays despite your arbitrary ban on them by simply putting the array in
a struct.
...
Post by BGB
In most contexts, I don't really see how a struct is preferable to a
multiply, but either way...
And I can't see how an array of arrays is harder for your compiler than
an array of structs. C's indexing requires the compiler to know the
size of the items pointed to.
I suspect that there is something amiss with your design if you are
considering this limiting in order to simplify the compiler. A simple
compiler should not care what kind of thing p points to in
p[i]
only what size of object p points to.
When I designed the compiler code, the initial approach for internal
type layout was to bit-pack it into 32 bits, say:
Basic1
(31:28): Layout of Type (0=Basic)
(27:16): Array Size
(15:12): Pointer Level Count
(11: 0): Base Type
Basic2
(31:28): Layout of Type (1=Basic2)
(27: 8): Array Size
( 7: 6): Pointer Level Count
( 5: 0): Base Type
Basic3
(31:28): Layout of Type (2=Basic3)
(27:24): Array Size
(23:20): Pointer Level Count
(19: 0): Base Type
Overflow
(31:28): Layout of Type (3=Overflow)
(27:24): MBZ
(23: 0): Index into Type-Overflow Table
And, a few other cases...
Basic1 was the default, able to express arrays from 0..4095 elements,
with 0..7 levels of pointer indirection, and 0..4095 for the base type.
Where, 0=T, 1=T*, 2=T**, ..., 7=T*******
8=T[], 9=T[][], A=T*[], B=T*[*], C=&T, ...
That's quite a level of detail. This looks more like an encoding you
might devise for a CPU instruction set. And yet you say elsewhere that
the whole compiler is 250K lines? So you're not trying to squeeze it
into a 16KB ROM or something.
These values are packed inside of register descriptors, of which there
are multiple in each 3AC op. And, there may be a whole lot of 3AC ops in
a program; a much bulkier encoding would cause the compiler to use a lot
more RAM.
Though, a likely RAM-saving option would be to redesign the compiler
such that only one function needs to have 3AC loaded at a time. But, for
some tasks (such as tracing out everything that is reachable), this
would mean repetitively re-decoding the stack IR for various functions.
Well, and/or building more metadata, such as (for every function)
building a table of any other functions called, any global variables
accessed, etc. But, this is just trading one type of memory bulk for
another.
Like, say, if there are 4000 functions in a program, any metadata that
needs to be stored per function adds up quickly.
Post by bart
Post by BGB
It could also be used to encode another type, which was needed for
things like multidimensional arrays and some other complex types. But,
this seemed like an ugly hack... (And was at odds with how I imagined
types working, but seemed like a technical necessity).
I can see how multi-dim arrays would be troublesome with such a scheme.
Yeah.
It doesn't express them naturally, so I have to hack it.
It can, however, fairly easily express "FooStruct *x;".
But, if one wants, say:
FooStruct *x[8192];
Then it needs to put it into the type-overflow table.
Post by bart
My own approach is to use a bunch of parallel arrays (a table of structs
didn't appeal). They are populated with the standard types at the lower
end, then new types can be added.
A C type is represented by an index into those arrays. One of those
arrays:
[0:maxtype]int tttarget # All names start with 'tt'
For a pointer, it contains the index of the target type. For arrays,
it is the index of the element type. There is no restriction on nesting,
other than 'maxtype' (set to some large number; one day I'll make the
limit flexible).
Yeah, dunno...
I guess simpler could have been to do everything via the type-overflow
table and not have bothered with bit-twiddled type descriptors.
Originally though, the type overflow table was a response to noting that
there were a lot of situations that could not fit into a bit-packed format.
Post by bart
Post by BGB
One downside as-is, is that if a given variable is assigned more than
4096 times in a given function, it can no longer be given a unique ID.
Though uncommon, this is not entirely implausible (with sufficiently
large functions), and there isn't currently any good way to deal with
this (apart from raising a compiler error).
#include <stdio.h>
int main(void) {
int a, b=2, c=3, d=4;
a=b+c*d;
printf("%d\n", a);
}
Except that the 'a=b+c*d;' line is repeated 1M or 2M times, usually
within an included file.
Some compilers (including for other languages) find this challenging.
As-is, if the line were repeated more than 4096 times, it would break my
compiler, unless an overflow case were added for the local variable
descriptor.
Though, one option would be to drop the type field and expand the ID and
sequence numbers.
Since, generally, for local variables the type is also encoded in a
descriptor struct for the variable (though, the inline type is useful
for things like "soft casting", etc).
Also relevant for things like temporaries, where the descriptor only
contains an abstract type (temporaries primarily rely on sequence
numbers for identity, with only a small number of "canonical" temporary
variables representing every temporary in the function).
For temporaries, would need some sort of sequence-number overflow case
that does not lose the ability to encode the type.
But, yeah, compiler breaking because a given variable is assigned too
many times in a function is kinda lame.
Post by bart
Post by BGB
But, as noted, the 3AC IR only exists in memory.
In the IR, the operations are expressed as a sort of linear bytecode
operating on a virtual stack machine; with types expressed as ASCII strings.
Logically, the stack holds "ccxl_register" values, and the number of
3AC ops is typically less than the number of stack-machine operations
(those which exist simply to shuffle registers around essentially
disappear in the translation process).
LOAD x
LOAD y
ADD
STORE z
Might turn into a single 3AC operation.
I've used 3AC as an IL (several attempts), and also a stack VM. The 3AC
always looked so simple, but it also was a lot of hard work to turn it
into decent register-based code.
Now I use a stack-based IL, which is far easier to work with.
The problem with generating code directly from a stack IL is that the
generated code tends to be rather inefficient.
But, going directly from an AST to 3AC or SSA is a pain...
And storing and reloading SSA as a "good" IR is a pain (and something
that just sorta serializes the internal structures, like what LLVM
bitcode does, is not ideal).
Say, in effect, any compiler which uses LLVM bitcode, would also need to
do things the same way that LLVM does them. And, a structural flattening
of my compiler's internal 3AC form, would require any other compiler to
do things the same way.
Luckily at least, a stack IR/IL can avoid this.
But, annoyingly, my existing IL has some design issues that would limit
its "general use"; mostly in that it is a monolithic linear bytecode
format that also expresses all of the metadata using bytecode ops.
The likely fix would be to go over to a structure more like I used for
the BS2VM, of basically representing everything inside a TLV format,
with each function and variable being individually wrapped (with offset
indices mapping table numbers to their corresponding TLV lumps).
Though, the actual contents of the bytecode wouldn't change that much,
and I had gone back and forth on whether it would be better to use
structural representation for metadata, or continue the hack of using
bytecode ops to encode the metadata.
But, attempts to rework BGBCC to use such a structure have thus far
tended to fizzle out.
I had a few times also considered trying to adopt .NET CIL, except that
it doesn't express C very well, and it represents metadata very
differently than my existing compiler (and its way of expressing
metadata isn't very flexible, *).
*: Basically, it uses a bunch of tables organized in a way almost like
in a relational database, but with a bit-packed representation (with the
bit-width of various members also depending on the number of rows or
range of primary keys within the other tables, etc). The table layouts
are effectively fixed, so one would be SOL if they want to express
anything that doesn't fit into the existing tables.
Post by bart
I wouldn't start off with turning stack code into 3AC! (SSA is supposed
to be /the/ way to generate code in compilers; they can keep it.)
My 3AC format isn't pure 3AC, but rather an intermediate between 3AC and
SSA (hence the whole sequence number issue...).
But, I ended up going this direction (at first in my BGBScript VM), as
it allowed for significant performance improvements over operating in
terms of a stack machine.
Can note that I was able (at the time) to make BGBScript and BGBScript2,
when mostly using a static-typed language variant and JIT compiled,
reasonably performance competitive with native C. BGBScript2 (beyond the
syntax changes) was also in some ways somewhat simplified vs what its
predecessor had become (its scope model and type-inference was a
significant source of pain; and effectively turning it into C# made this
part simpler).
A naive purely dynamic interpreter is a lot simpler, but then one is
hard-pressed to get much better than around 50x to 100x slower versus
native C (and, if running any non-trivial code in the VM, something with
a 100x slowdown is undesirable; and one needs either static typing or
type inference to have any hope of decent performance).
Post by bart
Those 4 stack instructions might turn into 3/4 machine instructions on
x64 (perhaps fewer if register-resident). But if converting them to one
3AC instruction, you then have to expand them again.
FWIW:
No current plans to target BGBCC to x86-64, as MSVC and GCC serve this
use case well enough.
Post by bart
Perhaps 3AC suits another architecture better (I think ARM uses
3-register ops; x64 is 2-register ops).
It makes sense on my own BJX2 ISA, which at this point is primarily
3-register ops (with 64 GPRs, etc).
I might have almost been tempted to consider switching more over to
RISC-V and GCC, except that (even with superscalar, etc), RISC-V
performance is a little weak.
Granted, "real world" it may not matter, because RISC-V is starting to
appear in ASICs, and a 1GHz CPU is going to beat a 50MHz CPU (in an
FPGA) fairly easily even if it takes more clock cycles to run similar code.
But, had noted that in some contrived benchmarks (software OpenGL and
neural-nets), my CPU core (at 50MHz) was able to be performance
competitive with an early 2000s laptop running at 1.4 GHz, but seemingly
mostly because x87 sucks in the NN case, ...
Well, and on some metrics the difference between them isn't as huge as
the difference in clock-speed would seem to imply. The laptop was
significantly faster at plain integer code though (for code that is not
memory bound).
The laptop seemed to be (relative to clock speed) bogged down in a few
areas:
The x87 FPU ops are slow;
Memory is also relatively slow.
Like, despite the 28x difference in clock speed, there was only around a
3.5x difference in memory bandwidth.
Technically though the laptop is slower than the newer RasPi models (or
running code in Termux on my cellphone...).
But, despite my efforts, getting Quake and similar running at decent
framerates on a 50MHz CPU remains elusive.
But, if I could bump it to 100MHz while keeping everything else the
same; Quake would generally be running at around 20fps.
Had ended up mostly switching my efforts to GLQuake as, ironically, my
software OpenGL rasterizer had tended to be a little faster than Quake's
software renderer (at least, on my own ISA; on RISC-V, the situation is
very different).
Had also noted that, vs my Ryzen, my software GL rasterizer had tended
to do around 5x better in a "clock cycles per pixel" sense (though, the
Ryzen still wins in terms of raw performance due to having a 74x
clock-speed advantage).
Much of this difference is because for my ISA, I had various "helper
instructions" for things like packing and unpacking RGB555 pixels into
packed-integer SIMD vectors, etc.
I suspect this sort of thing wrecks performance when building for the
other ISAs (turns into a bunch of shifting and masking).
Technically was using RGB555 and Z12.S4 buffers mostly because it needs
too much memory bandwidth to do RGBA32 and Z24.S8, ...
Though, my GL implementation currently still suffers from a relatively
slow transform/projection stage (an eventual TODO being likely to redo
this part).
As-is:
Pop a primitive off the stack;
Project vertices to screen space;
Check for culling;
If it can be culled, do so.
Check if primitive can be drawn as-is:
If yes, draw it.
Else, fragment primitive, and push fragments back onto stack.
Generally, the fragmentation is being done in world-space, so each
fragmented primitive requires any new vertices to be projected (it can
cache the previously projected vertices from the parent primitive).
Generally, the projection does the XYZ/W conversion upfront.
A likely "better" option:
Run polygon vertices through modelview and projection matrices;
Clip polygon against frustum planes (possibly in XYZW space);
Check and subdivide (similar to before, just in XYZW space);
Split into final triangles and quads and draw.
Theoretically, interpolating in XYZW space should work, with the XYZ/W
step being used for drawing the primitives (though, one still needs to
divide by W to know whether the primitive needs to be subdivided).
Though, this is because the final rasterization stages (and generally
also my experimental hardware rasterizer) work in terms of affine
texturing; where larger on-screen primitives need to be subdivided to
avoid obvious texture warping.
Basically, if one has seen the 3D rendering on the original PlayStation,
they should probably know what sort of warping I am dealing with...