Discussion: LLVM and Forth
Spiros Bousbouras
2017-04-06 12:33:44 UTC
Since this has become about C now , I'm crossposting to comp.lang.c
and setting follow-ups to the same.

On Wed, 5 Apr 2017 20:14:27 -0700 (PDT)
It's a matter of trusting the compiler will code what you ask it to do,
and understanding the cases of undefined behavior when the compiler may
not generate expected code. It's not clear to me that mere mortals can
keep track of all of the cases of UB in the C standard, and the latitude
given to compilers to do different things in those cases.
I'm a mere mortal and I can certainly keep track of UB relevant to the code
I write. In most cases it's actually common sense. For example I don't need
to consult the C standard to know that trying to read beyond the bounds of
an array will probably not result in anything useful because, even *conceptually*
(i.e. without referring to the standard of any specific language), it doesn't
make sense. Perhaps the confusion arises because some people, when programming
in C, are not thinking in terms of C but in terms of assembly. So they mentally
translate the C code into some generic assembly instructions and then try to decide
whether the C code does something meaningful based on what those hypothetical
assembly instructions would do.

Example: the expression array[i] where i is beyond the end of the array.

Thinking in terms of C: not defined, which means you aren't going to get anything
useful, so you shouldn't use it. Pondering or testing what's the most crazy thing
which might happen may be a bit of fun every now and again but is not relevant to
practical programming.

Thinking in terms of assembly: "array[i] will be translated into an assembly
instruction which calculates an address of the form base + i * constant.
This will always point to some address in memory, so I should get some value
or at least a segmentation fault; so not a totally unpredictable result."

Note first that not thinking in terms of C is more complicated and requires more
knowledge than thinking in terms of C. Second, if one wants to think in terms of
assembly then, from a practical point of view, it makes sense to program in
assembly, not C.
Certainly coding guidelines, e.g. NASA's C coding guidelines (ref. 1) can
help avoid such difficulties. But, it's also clear that mere mortal
programmers may be easily confused. See the old thread on comp.std.c (ref. 2).
Even some of the experts had problems with the interpretation of the standard;
but, to be fair, that's an old thread.
[ References are
1. http://sdtimes.com/nasas-10-rules-developing-safety-critical-code
and
2. https://groups.google.com/d/msg/comp.std.c/9lYbeLVaZCA/__behqP09dIJ
]

Regarding the NASA guidelines, they are unsuitable for a lot of desktop software.
For example "Do not use dynamic memory allocation after initialization." How
are you going to implement, for example, the Unix sort utility without using
realloc() (depending on the size of the input)?
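
As a rough illustration of the difficulty (a minimal sketch, not taken from any
actual sort implementation; the function name and buffer sizes are made up),
reading input of unknown size pretty much forces a realloc() pattern like this:

#include <stdio.h>
#include <stdlib.h>

/* Read all of a stream into one heap buffer, growing it as needed.
   Returns the buffer (caller frees it) and stores its length in *len. */
static char *slurp(FILE *in, size_t *len)
{
    size_t cap = 4096, used = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    for (;;) {
        size_t n = fread(buf + used, 1, cap - used, in);
        used += n;
        if (n == 0)
            break;                      /* EOF or read error */
        if (used == cap) {              /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    *len = used;
    return buf;
}

Sorting the lines afterwards would typically grow an array of line pointers the
same way.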

Your second reference has
char *index;
....
*index++ = toupper(*index);

He doesn't explain why he decided to write code like this as opposed to
*index = toupper(*index);
index++;

The latter seems to me more readable. So unless he had good reasons to prefer the
first version then worrying whether it's undefined behaviour is irrelevant for
programming; one simply shouldn't write code like this. Do you have an example
of C code which satisfies the following 3 criteria?

1. Is not contrived (according to your judgement).
2. You are not sure whether it has defined behaviour.
3. You do not see an obvious and easy way to change it so that it does what you
want *and* doesn't look contrived (according to your judgement) *and* you are
sure it is not undefined behaviour.
I intend to continue using the GNU C++ and C compilers, but with -O0 and
-Wall flags turned on. Some of the code it generates at -O0 is fairly ugly,
but I expect it's reliable when the source is not UB.
I don't think the -O0 part makes sense. If the compiler has bugs and miscompiles
correct code then you want to find this out as early as possible rather than at a
later point where you may need a higher optimisation level because otherwise the
code does not run fast enough. If your code has undefined behaviour then, if it does
what you want, it's only by coincidence and your luck may run out even if you
always use -O0. So in my opinion, one should at least conduct some tests with
other optimisation levels (and preferably using multiple compilers) to see if the
code continues to run correctly.
--
Every application has an inherent amount of irreducible complexity. The only
question is: Who will have to deal with it -- the user, the application
developer, or the platform developer?
http://www.nomodes.com/Larry_Tesler_Consulting/Complexity_Law.html
s***@casperkitty.com
2017-04-06 14:37:53 UTC
Post by Spiros Bousbouras
Note first that not thinking in terms of C is more complicated and requires more
knowledge than thinking in terms of C. Second , if one wants to think in terms of
assembly then , from a practical point of view , it makes sense to program in
assembly , not C.
Why?

If I want to grab eight pairs of uint16_t, flip all the bits, and write them
back, the same piece of code:

uint16_t *p = ...;
if ((uintptr_t)p & 2)
{
    *(uint32_t*)(p+1) ^= 0xFFFFFFFF;
    *(uint32_t*)(p+3) ^= 0xFFFFFFFF;
    *(uint32_t*)(p+5) ^= 0xFFFFFFFF;
    p[0] ^= 0xFFFF;
    p[7] ^= 0xFFFF;
}
else
{
    *(uint32_t*)p ^= 0xFFFFFFFF;
    *(uint32_t*)(p+2) ^= 0xFFFFFFFF;
    *(uint32_t*)(p+4) ^= 0xFFFFFFFF;
    *(uint32_t*)(p+6) ^= 0xFFFFFFFF;
}

will work on almost all microcomputers for which there are any C compilers
that can be configured to *behave* as "mechanical translators" [note that
such behavior does not preclude efficient code generation for most purposes].
It would likely be less efficient than ideal for some, but if that's the best
approach for the processor the code needs to run on today, it will likely be
fast enough on future processors by the time it's necessary to migrate the
code.

I could write separate versions of the code for every instruction set, and
even several different versions for some processors which have multiple
popular but incompatible assembly-language formats, but what advantage would
there be to doing that versus simply writing the above code and noting that
it will only work with compilers that use the casts as a signal that objects
of type uint32_t whose address has been taken should not be cached across
the above code, rather than as a sign that objects of type uint16_t may be
cached across it (at least in the even-address case)?
Ben Bacarisse
2017-04-06 15:02:51 UTC
Post by s***@casperkitty.com
Post by Spiros Bousbouras
Note first that not thinking in terms of C is more complicated and requires more
knowledge than thinking in terms of C. Second , if one wants to think in terms of
assembly then , from a practical point of view , it makes sense to program in
assembly , not C.
Why?
Because the point of writing C is to let the compiler do the work.
Post by s***@casperkitty.com
If I want to grab eight pairs of uint16_t, flip all the bits, and write them
uint16_t *p = ...;
if ((uintptr_t)p & 2)
{
*(uint32_t*)(p+1) ^= 0xFFFFFFFF;
*(uint32_t*)(p+3) ^= 0xFFFFFFFF;
*(uint32_t*)(p+5) ^= 0xFFFFFFFF;
p[0] ^= 0xFFFF;
p[7] ^= 0xFFFF;
}
else
{
*(uint32_t*)p ^= 0xFFFFFFFF;
*(uint32_t*)(p+2) ^= 0xFFFFFFFF;
*(uint32_t*)(p+4) ^= 0xFFFFFFFF;
*(uint32_t*)(p+6) ^= 0xFFFFFFFF;
}
will work on almost all microcomputers for which there are any C compilers
that can be configured to *behave* as a "mechanical translators" [note that
such behavior does not preclude efficient code generation for most purposes].
But why not write the obvious C (eight p[x] ^= 0xFFFF lines)? That
compiles to four instructions with gcc on my laptop and is likely to
compile to the best instructions on other machines too. There may be
some systems where your "C as assembler" approach can be compiled into
better code than the obvious C, but how often is that worth the
bugs/risk/complexity?

<snip>
--
Ben.
s***@casperkitty.com
2017-04-06 20:52:55 UTC
Post by Ben Bacarisse
Post by s***@casperkitty.com
Post by Spiros Bousbouras
Note first that not thinking in terms of C is more complicated and requires more
knowledge than thinking in terms of C. Second , if one wants to think in terms of
assembly then , from a practical point of view , it makes sense to program in
assembly , not C.
Why?
Because the point of writing C is to let the compiler do the work.
The reason I write C is to let the compiler do *whatever portion of the work
I don't want to do myself*. Why should a desire to control some aspects of
code generation myself imply a desire to do *all* the work myself?
Post by Ben Bacarisse
Post by s***@casperkitty.com
If I want to grab eight pairs of uint16_t, flip all the bits, and write them
uint16_t *p = ...;
if ((uintptr_t)p & 2)
{
*(uint32_t*)(p+1) ^= 0xFFFFFFFF;
*(uint32_t*)(p+3) ^= 0xFFFFFFFF;
*(uint32_t*)(p+5) ^= 0xFFFFFFFF;
p[0] ^= 0xFFFF;
p[7] ^= 0xFFFF;
}
else
{
*(uint32_t*)p ^= 0xFFFFFFFF;
*(uint32_t*)(p+2) ^= 0xFFFFFFFF;
*(uint32_t*)(p+4) ^= 0xFFFFFFFF;
*(uint32_t*)(p+6) ^= 0xFFFFFFFF;
}
will work on almost all microcomputers for which there are any C compilers
that can be configured to *behave* as a "mechanical translators" [note that
such behavior does not preclude efficient code generation for most purposes].
But why not write the obvious C (eight p[x] ^= 0xFFFF lines)? That
compiles to four instructions with gcc on my laptop and is likely to
compile to the best instructions on other machines too. There may be
some systems where your "C as assembler" approach can be compiled into
better code than the obvious C, but how often is that worth the
bugs/risk/complexity?
If the performance of the code matters on the target system, and that system
is one where alternative techniques can improve performance, that would
suggest the trade-off should be worthwhile.

If the current platform will be usable for five years and code won't need
to be migrated to anything else until then, it's entirely possible that
code might get moved to a system where the code is sub-optimal *but* it is
likely that the only platforms where it is sub-optimal would be fast enough
that the performance of the code wouldn't matter.

Further, I would suggest that in many cases predictability and stability of
performance is often more important than optimality. Before hardware can
be built for an embedded system, it's necessary to know what will be needed
to make a program run acceptably fast. Since it's not possible to benchmark
real-world code in real-world usage before a system is built, it's often
necessary to design some benchmarks whose performance should correlate with
real-world performance. Having a compiler generate wonderfully fast code
during the stages of a product where e.g. a compiler infers that some variable
will always be zero because the code to control it hadn't been written yet
may be decidedly less helpful than having it generate code which does the
same checks as will be necessary later on, even if the optimized version of
the code behaved identically (except for execution time) to the unoptimized
one.
Scott Lurndal
2017-04-06 15:04:31 UTC
Post by s***@casperkitty.com
Post by Spiros Bousbouras
Note first that not thinking in terms of C is more complicated and requires more
knowledge than thinking in terms of C. Second , if one wants to think in terms of
assembly then , from a practical point of view , it makes sense to program in
assembly , not C.
Why?
If I want to grab eight pairs of uint16_t, flip all the bits, and write them
uint16_t *p = ...;
if ((uintptr_t)p & 2)
{
*(uint32_t*)(p+1) ^= 0xFFFFFFFF;
*(uint32_t*)(p+3) ^= 0xFFFFFFFF;
*(uint32_t*)(p+5) ^= 0xFFFFFFFF;
p[0] ^= 0xFFFF;
p[7] ^= 0xFFFF;
}
else
{
*(uint32_t*)p ^= 0xFFFFFFFF;
*(uint32_t*)(p+2) ^= 0xFFFFFFFF;
*(uint32_t*)(p+4) ^= 0xFFFFFFFF;
*(uint32_t*)(p+6) ^= 0xFFFFFFFF;
}
uint16_t *p = ...;
for(size_t i=0; i < 8*2; i++) *p++ ^= 0xffffu;

A good compiler will likely produce as good or better code for
this than the above, and it's far easier to read,
understand and maintain (and for 8 pairs, the run-time cost
of either source-code representation is in the noise).
Ike Naar
2017-04-06 15:04:38 UTC
Post by s***@casperkitty.com
Post by Spiros Bousbouras
Note first that not thinking in terms of C is more complicated and requires more
knowledge than thinking in terms of C. Second , if one wants to think in terms of
assembly then , from a practical point of view , it makes sense to program in
assembly , not C.
Why?
If I want to grab eight pairs of uint16_t, flip all the bits, and write them
uint16_t *p = ...;
if ((uintptr_t)p & 2)
Should the 2 be a 1 ?
Post by s***@casperkitty.com
{
*(uint32_t*)(p+1) ^= 0xFFFFFFFF;
*(uint32_t*)(p+3) ^= 0xFFFFFFFF;
*(uint32_t*)(p+5) ^= 0xFFFFFFFF;
p[0] ^= 0xFFFF;
p[7] ^= 0xFFFF;
}
else
{
*(uint32_t*)p ^= 0xFFFFFFFF;
*(uint32_t*)(p+2) ^= 0xFFFFFFFF;
*(uint32_t*)(p+4) ^= 0xFFFFFFFF;
*(uint32_t*)(p+6) ^= 0xFFFFFFFF;
}
will work on almost all microcomputers for which there are any C compilers
that can be configured to *behave* as a "mechanical translators" [note that
such behavior does not preclude efficient code generation for most purposes].
It would likely be less efficient than ideal for some, but if that's the best
approach for the processor the code needs to run on today, it will likely be
fast enough on future processors by the time it's necessary to migrate the
code.
I could write separate versions of the code for every instruction set, and
even several different versions for some processors which have multiple
popular but incompatible assembly-language formats, but what advantage would
there be to doing that versus simply writing the above code and noting that
it will only work with compilers that use the casts as a signal that objects
of type uint32_t whose address has been taken should not be cached across
the above code, rather than as a sign that objects of type uint16_t may be
cached across it (at least in the even-address case)?
Scott Lurndal
2017-04-06 15:43:09 UTC
Post by Ike Naar
Post by s***@casperkitty.com
Post by Spiros Bousbouras
Note first that not thinking in terms of C is more complicated and requires more
knowledge than thinking in terms of C. Second , if one wants to think in terms of
assembly then , from a practical point of view , it makes sense to program in
assembly , not C.
Why?
If I want to grab eight pairs of uint16_t, flip all the bits, and write them
uint16_t *p = ...;
if ((uintptr_t)p & 2)
Should the 2 be a 1 ?
No, assuming that the uint16_t array is properly 2-byte aligned,
he's checking whether it is four-byte aligned; if it is, he
doesn't offset the 32-bit accesses by one sizeof(uint16_t).
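
To make the test concrete, a tiny illustrative snippet (the addresses are made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Both addresses are 2-byte aligned; only the first is also 4-byte aligned. */
    uintptr_t four_aligned = 0x1000;
    uintptr_t two_aligned  = 0x1002;

    printf("%d\n", (int)(four_aligned & 2));  /* 0 -> takes the aligned branch   */
    printf("%d\n", (int)(two_aligned  & 2));  /* 2 -> takes the unaligned branch */
    return 0;
}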
David Brown
2017-04-06 19:26:20 UTC
Post by s***@casperkitty.com
Post by Spiros Bousbouras
Note first that not thinking in terms of C is more complicated and requires more
knowledge than thinking in terms of C. Second , if one wants to think in terms of
assembly then , from a practical point of view , it makes sense to program in
assembly , not C.
Why?
If I want to grab eight pairs of uint16_t, flip all the bits, and write them
uint16_t *p = ...;
if ((uintptr_t)p & 2)
{
*(uint32_t*)(p+1) ^= 0xFFFFFFFF;
*(uint32_t*)(p+3) ^= 0xFFFFFFFF;
*(uint32_t*)(p+5) ^= 0xFFFFFFFF;
p[0] ^= 0xFFFF;
p[7] ^= 0xFFFF;
}
else
{
*(uint32_t*)p ^= 0xFFFFFFFF;
*(uint32_t*)(p+2) ^= 0xFFFFFFFF;
*(uint32_t*)(p+4) ^= 0xFFFFFFFF;
*(uint32_t*)(p+6) ^= 0xFFFFFFFF;
}
will work on almost all microcomputers for which there are any C compilers
that can be configured to *behave* as a "mechanical translators" [note that
such behavior does not preclude efficient code generation for most purposes].
It would likely be less efficient than ideal for some, but if that's the best
approach for the processor the code needs to run on today, it will likely be
fast enough on future processors by the time it's necessary to migrate the
code.
That is an absolutely stupid idea. The code is an unmaintainable mess,
as well as being incorrect C. The behaviour is undefined - and always
has been undefined - in standard C. Some implementations (possibly with
specific compiler flags) might accept it, but it is far from being
universal.

Compare it to the code:

void flip1(uint16_t *p) {
    for (int i = 0; i < 8; i++) {
        *p++ ^= 0xffff;
    }
}

/That/ code does exactly what it says it does - it is simple, clear, and
easy to modify. It is easy for a compiler to optimise in various ways.

If you have particular reason to want to have somewhat faster code, you
can manually unroll the loop while maintaining most of the readability:

void flip2(uint16_t *p) {
    *p++ ^= 0xffff;
    *p++ ^= 0xffff;
    *p++ ^= 0xffff;
    *p++ ^= 0xffff;
    *p++ ^= 0xffff;
    *p++ ^= 0xffff;
    *p++ ^= 0xffff;
    *p++ ^= 0xffff;
}

And note that for /many/, perhaps even most, targets, the sensible C
versions give /better/ code - smaller and faster - than your
monstrosity. On some processors, the conditional will lead to a
pipeline flush - losing far more cycles than you save by using 32-bit
accesses. On some processors, the sensible versions can be turned into
SIMD instructions (with unaligned access to memory). On some
processors, the looped version will give such a tight loop that the
instructions can be held in a pre-fetch buffer for faster execution. On
some processors, smaller registers mean that 32-bit operations are a lot
slower than 16-bit ones. On some processors, the clearer code can give
better scheduling of register loads and saves, leading to greater
throughput. On some processors, smaller code in a loop means better use
of caches and limited memory bandwidth for faster execution.

On a few processors, your version will be faster - assuming you have
checked that your implementation allows it or you have picked the right
flags or implementation-specific extension (such as the "may_alias"
attribute in newer gcc versions), or you have written the correct ugly
mess of unions. But even then, it will rarely be more than a small
percentage speedup - seldom worth the effort for such an obscure function.
Post by s***@casperkitty.com
I could write separate versions of the code for every instruction set, and
even several different versions for some processors which have multiple
popular but incompatible assembly-language formats, but what advantage would
there be to doing that versus simply writing the above code and noting that
it will only work with compilers that use the casts as a signal that objects
of type uint32_t whose address has been taken should not be cached across
the above code, rather than as a sign that objects of type uint16_t may be
cached across it (at least in the even-address case)?
So rather than writing one single, clear C version that works fine on
anything, or writing multiple assembly versions that are optimised for
each target and actually correct, you would prefer to write multiple
possibly incorrect C versions for different targets?

In my line of work, efficient code is important - that is why we put an
emphasis on clear, simple code that does exactly what it is supposed to
do, and we let the compiler generate tight object code from it.
/Occasionally/ the source code needs tuning or re-arranging to get
better object code for a particular target - such as choosing between
the two versions I gave above. But if you turned up to a code review
with your ugly hack, you'd be sent back to redo it from scratch.
s***@casperkitty.com
2017-04-06 20:17:04 UTC
Post by David Brown
That is an absolutely stupid idea. The code is an unmaintainable mess,
as well as being incorrect C. The behaviour is undefined - and always
has been undefined - in standard C. Some implementations (possibly with
specific compiler flags) might accept it, but it is far from being
universal.
I'm not claiming it's universal among all computers--merely among
microcomputers and microcontrollers that have ever been remotely popular
and have any conforming compiler available. Are there any such devices
for which there does *not* exist a compiler that can be configured to
behave as though all aligned accesses operate on raw storage?
Post by David Brown
void flip1(uint16_t *p) {
for (int i = 0; i < 8; i++) {
*p++ ^= 0xffff;
}
}
/That/ code does exactly what it says it does - it is simple, clear, and
easy to modify. It is easy for a compiler to optimise in various ways.
On a system which requires aligned memory accesses, it's not so easy.
Also, I tried to make the example simple and probably over-simplified it.
The real pay-off for chunking optimizations comes when a large amount of
stuff can be done with known alignment. Hoist the alignment check out
of an outer loop [e.g. flipping a rectangular group of pixels on a bitmap
with a 4-byte-aligned stride] and the chunking version offers a 2:1 speed
boost that compilers on aligned-memory machines would be unlikely to find.
Post by David Brown
If you have particular reason to want to have somewhat faster code, you
void flip2(uint16_t *p) {
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
}
Without hoisting the alignment check, my version of the code would on an
ARM7-TDMI take 34-35 cycles in the unaligned case and 28-29 cycles in
the aligned case (depending upon how the compiler arranged the branches),
compared with your unrolled version at 48. If the alignment check were
hoisted, mine would take 24 or 30 cycles for inner-loop executions, versus
48.
Post by David Brown
And note that for /many/, perhaps even most, targets, the sensible C
versions give /better/ code - smaller and faster - than your
monstrosity. On some processors, the conditional will lead to a
pipeline flush - losing far more cycles than you save by using 32-bit
accesses. On some processors, the sensible versions can be turned into
SIMD instructions (with unaligned access to memory). On some
processors, the looped version will give such a tight loop that it the
instructions can be held in a pre-fetch buffer for faster execution. On
some processors, smaller registers mean that 32-bit operations are a lot
slower than 16-bit ones. On some processors, the clearer code can give
better scheduling of register loads and saves, leading to greater
throughput. On some processors, smaller code in a loop means better use
of caches and limited memory bandwidth for faster execution.
When embedded code is being written for a particular machine, what's
important is that it run well on that target machine. If that platform
becomes unavailable in 5 years later but there's still a need to use the
code, then it may be worthwhile to port it, but by then faster platforms
will likely be available. If code runs 30% slower on a new platform than
it would otherwise need to, but the new platform is twice as fast, the
sub-optimal performance on the new platform won't prevent it from serving
as an effective replacement for the old one.
Post by David Brown
On a few processors, your version will be faster - assuming you have
checked that your implementation allows it or you have picked the right
flags or implementation-specific extension (such as the "may_alias"
attribute in newer gcc versions), or you have written the correct ugly
mess of unions. But even then, it will rarely be more than a small
percentage speedup - seldom worth the effort for such an obscure function.
I wouldn't generally use such techniques in places where it wouldn't offer
much of a performance boost, but in some such places they can offer a 2x
or even 4x boost. If in places where performance matters, it's dominated
by places that can be sped up 2x, targeting those places can make what used
to be a sluggish program much more responsive.
Post by David Brown
Post by s***@casperkitty.com
I could write separate versions of the code for every instruction set, and
even several different versions for some processors which have multiple
popular but incompatible assembly-language formats, but what advantage would
there be to doing that versus simply writing the above code and noting that
it will only work with compilers that use the casts as a signal that objects
of type uint32_t whose address has been taken should not be cached across
the above code, rather than as a sign that objects of type uint16_t may be
cached across it (at least in the even-address case)?
So rather than writing one single, clear C version that works fine on
anything, or writing multiple assembly versions that are optimised for
each target and actually correct, you would prefer to write multiple
possibly incorrect C versions for different targets?
In my line of work, efficient code is important - that is why we put an
emphasis on clear, simple code that does exactly what it is supposed to
do, and we let the compiler generate tight object code from it.
/Occasionally/ the source code needs tuning or re-arranging to get
better object code for a particular target - such as choosing between
the two versions I gave above. But if you turned up to a code review
with your ugly hack, you'd be sent back to redo it from scratch.
And if someone complained that some graphics code was too slow using your
approach, while versions done my way ran 30% faster, what then?
David Brown
2017-04-06 21:08:33 UTC
Post by s***@casperkitty.com
Post by David Brown
That is an absolutely stupid idea. The code is an unmaintainable mess,
as well as being incorrect C. The behaviour is undefined - and always
has been undefined - in standard C. Some implementations (possibly with
specific compiler flags) might accept it, but it is far from being
universal.
I'm not claiming it's universal among all computers--merely among
microcomputers and microcontrollers that have ever been remotely popular
and have any conforming compiler available. Are there any such devices
for which there does *not* exist a compiler that can be configured to
behave as though all aligned accesses operate on raw storage?
I have no idea. In all the programming I have done with lots of
compilers and lots of microcontrollers, I have never felt the need to
use such a feature. I don't remember ever seeing such an option on a
compiler other than gcc (although I remember one having a "treat all
memory as volatile" option), nor do I remember seeing any compiler
giving such a guarantee explicitly. But while I have read most of the
manuals of most of the compilers I have used, I can't claim to remember
all the details.
Post by s***@casperkitty.com
Post by David Brown
void flip1(uint16_t *p) {
for (int i = 0; i < 8; i++) {
*p++ ^= 0xffff;
}
}
/That/ code does exactly what it says it does - it is simple, clear, and
easy to modify. It is easy for a compiler to optimise in various ways.
On a system which requires aligned memory accesses, it's not so easy.
Compilers generally have a better chance of optimising simple code than
complex code. And the more significant optimisations often come when
code can be inlined - clear code is far easier to handle then.
Post by s***@casperkitty.com
Also, I tried to make the example simple and probably over-simplified it.
The real pay-off for chunking optimizations comes when a large amount of
stuff can be done with known alignment. Hoist the alignment check out
of an outer loop [e.g. flipping a rectangular group of pixels on a bitmap
with a 4-byte-aligned stride] and the chunking version offers a 2:1 speed
boost that compilers on aligned-memory machines would be unlikely to find.
You mean, change the problem when you are shown to be wrong?

I'll happily agree that it can be faster to do some operations 32-bit at
a time than 16-bit at a time, on 32-bit devices. And I will agree that
in a few cases, it may be worth the effort. But you will then be
writing target-optimised code for a specific purpose where your need for
the fastest possible implementation is more important than writing clear
and maintainable code. And in such cases, implementation-specific
extensions are fine.
Post by s***@casperkitty.com
Post by David Brown
If you have particular reason to want to have somewhat faster code, you
void flip2(uint16_t *p) {
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
}
Without hoisting the alignment check, my version of the code would on an
ARM7-TDMI take 34-35 cycles in the unaligned case and 28-29 cycles in
the aligned case (depending upon how the compiler arranged the branches),
compared with your unrolled version at 48. If the alignment check were
hoisted, mine would take 24 or 30 cycles for inner-loop executions, versus
48.
I tested your code briefly on a real Cortex-M4 (being a modern device,
unlike the ARM7-TDMI). Your version took 325 ns on my board for the
unaligned case, and slightly less for the aligned case. My unrolled
version took 350 ns. It is hard to imagine how anyone would think this
was worth doing, except in the most extreme of circumstances.
Post by s***@casperkitty.com
Post by David Brown
And note that for /many/, perhaps even most, targets, the sensible C
versions give /better/ code - smaller and faster - than your
monstrosity. On some processors, the conditional will lead to a
pipeline flush - losing far more cycles than you save by using 32-bit
accesses. On some processors, the sensible versions can be turned into
SIMD instructions (with unaligned access to memory). On some
processors, the looped version will give such a tight loop that it the
instructions can be held in a pre-fetch buffer for faster execution. On
some processors, smaller registers mean that 32-bit operations are a lot
slower than 16-bit ones. On some processors, the clearer code can give
better scheduling of register loads and saves, leading to greater
throughput. On some processors, smaller code in a loop means better use
of caches and limited memory bandwidth for faster execution.
When embedded code is being written for a particular machine, what's
important is that it run well on that target machine. If that platform
becomes unavailable in 5 years later but there's still a need to use the
code, then it may be worthwhile to port it, but by then faster platforms
will likely be available. If code runs 30% slower on a new platform than
it would otherwise need to, but the new platform is twice as fast, the
sub-optimal performance on the new platform won't prevent it from serving
as an effective replacement for the old one.
Post by David Brown
On a few processors, your version will be faster - assuming you have
checked that your implementation allows it or you have picked the right
flags or implementation-specific extension (such as the "may_alias"
attribute in newer gcc versions), or you have written the correct ugly
mess of unions. But even then, it will rarely be more than a small
percentage speedup - seldom worth the effort for such an obscure function.
I wouldn't generally use such techniques in places where it wouldn't offer
much of a performance boost, but in some such places they can offer a 2x
or even 4x boost. If in places where performance matters, it's dominated
by places that can be sped up 2x, targeting those places can make what used
to be a sluggish program much more responsive.
But your techniques /don't/ give a 2x performance boost. My real-world
tests on a 32-bit microcontroller gave about 15% performance boost. And
since you need to hinder the compiler's optimiser with -fno-strict-aliasing
flags, how do you know there is a speed improvement overall?

And even if it really was a 2x performance boost, that will make no
practical difference on a short and fast function.

So before you start doing this sort of thing, you need to have a
function that is used often, in time-critical code, /measured/ to be a
bottleneck for the system's performance. And then you write an
implementation- and target-specific version of the code in a way that you
can be sure is correct and not fragile in the face of compiler options
(which are easily changed). In that case, a technique like yours might
be workable.
Post by s***@casperkitty.com
Post by David Brown
Post by s***@casperkitty.com
I could write separate versions of the code for every instruction set, and
even several different versions for some processors which have multiple
popular but incompatible assembly-language formats, but what advantage would
there be to doing that versus simply writing the above code and noting that
it will only work with compilers that use the casts as a signal that objects
of type uint32_t whose address has been taken should not be cached across
the above code, rather than as a sign that objects of type uint16_t may be
cached across it (at least in the even-address case)?
So rather than writing one single, clear C version that works fine on
anything, or writing multiple assembly versions that are optimised for
each target and actually correct, you would prefer to write multiple
possibly incorrect C versions for different targets?
In my line of work, efficient code is important - that is why we put an
emphasis on clear, simple code that does exactly what it is supposed to
do, and we let the compiler generate tight object code from it.
/Occasionally/ the source code needs tuning or re-arranging to get
better object code for a particular target - such as choosing between
the two versions I gave above. But if you turned up to a code review
with your ugly hack, you'd be sent back to redo it from scratch.
And if someone complained that some graphics code was too slow using your
approach, while versions done my way ran 30% faster, what then?
I would ask them to show the measurements, which would almost certainly
prove that your ugly code gave negligible benefits. But if it turned up
to be absolutely necessary, then I would write code that was correct -
using implementation-specific extensions as needed rather than making
requirements about compiler flags. For example:

void flipZ(uint16_t *p) {
    typedef uint32_t __attribute__((__may_alias__)) uint32_a;

    if ((uintptr_t) p & 2) {
        uint32_a *q = __builtin_assume_aligned(p + 1, 4);
        *q++ ^= 0xffffffff;
        *q++ ^= 0xffffffff;
        *q++ ^= 0xffffffff;
        p[0] ^= 0xFFFF;
        p[7] ^= 0xFFFF;
    } else {
        uint32_a *q = __builtin_assume_aligned(p, 4);
        *q++ ^= 0xffffffff;
        *q++ ^= 0xffffffff;
        *q++ ^= 0xffffffff;
        *q++ ^= 0xffffffff;
    }
}
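
For what it's worth, a fully portable alternative (a sketch only, with hypothetical
helper names; it uses no gcc extensions and is valid under strict aliasing) is to
go through memcpy() and let the compiler collapse the fixed-size copies into plain
32-bit loads and stores:

#include <stdint.h>
#include <string.h>

/* Flip 32 bits at byte address b using memcpy-based type punning;
   compilers typically reduce this to a single load, xor, and store. */
static void flip32(unsigned char *b)
{
    uint32_t w;
    memcpy(&w, b, sizeof w);
    w ^= 0xFFFFFFFFu;
    memcpy(b, &w, sizeof w);
}

static void flipM(uint16_t *p)
{
    unsigned char *b = (unsigned char *)p;
    if ((uintptr_t)p & 2) {
        flip32(b + 2);          /* p[1]..p[2] */
        flip32(b + 6);          /* p[3]..p[4] */
        flip32(b + 10);         /* p[5]..p[6] */
        p[0] ^= 0xFFFF;
        p[7] ^= 0xFFFF;
    } else {
        flip32(b);
        flip32(b + 4);
        flip32(b + 8);
        flip32(b + 12);
    }
}

gcc and clang normally recognise the fixed-size memcpy calls at -O1 and above, so
this usually generates much the same object code as the attribute-based version.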
s***@casperkitty.com
2017-04-06 22:24:35 UTC
Post by David Brown
Compilers generally have a better chance of optimising simple code than
complex code. And the more significant optimisations often come when
code can be inlined - clear code is far easier to handle then.
Compilers will try to optimize for what they perceive as typical trade-offs
of time versus code space, or performance with many items versus few items.
Compilers may sometimes know better than programmers what things can be
done, but programmers often know better than compilers about what aspects
of behavior are more or less important in given situations.

On a processor where a load and a branch have comparable cost, saving three
loads, three stores, and three xors in exchange for a shift and branch taken,
or shift, branch not taken, and unconditional jump, is a slight but noticeable
win. Probably not enough to be worth pursuing in this particular case unless
the comparison can be hoisted, but add a loop to flip the pixels of a
rectangle rather than a stripe and then it would be worthwhile.
Post by David Brown
You mean, change the problem when you are shown to be wrong?
As I said I over-simplified to a point where chunking yields a measurable
but not huge benefit, but not so far as to presume known alignment. If I'd
simply said the function is limited to aligned rectangles then the unrolled
loop with chunking would be smaller and faster than the one without,
and a loop that isn't unrolled would likely be twice as fast but the same size.
Post by David Brown
I'll happily agree that it can be faster to do some operations 32-bit at
a time than 16-bit at a time, on 32-bit devices. And I will agree that
in a few cases, it may be worth the effort. But you will then be
writing target-optimised code for a specific purpose where your need for
the fastest possible implementation is more important than writing clear
and maintainable code. And in such cases, implementation-specific
extensions are fine.
Having something that can easily be made to work on other platforms would
seem a plus in my book. Using implementation-specific extensions in a
#ifdef along with a note that aliasing would need to be disabled when using
any implementation that didn't support the extensions might be another good
alternative, though it's unclear what code should do when using a compiler
whose present version neither supports nor requires such directives because
it doesn't do aggressive inter-procedural aliasing analysis.
Post by David Brown
I tested your code briefly on a real Cortex-M4 (being a modern device,
unlike the ARM7-TDMI). Your version took 325 ns on my board for the
unaligned case, and slightly less for the aligned case. My unrolled
version took 350 ns. It is hard to imagine how anyone would think this
was worth doing, except in the most extreme of circumstances.
What if you throw in a for loop within each branch?

uint16_t *p = ...;
uint32_t stride = ...;
uint32_t i;
if ((uintptr_t)p & 2)
{
    i = 8;
    do
    {
        *(uint32_t*)(p+1) ^= 0xFFFFFFFF;
        *(uint32_t*)(p+3) ^= 0xFFFFFFFF;
        *(uint32_t*)(p+5) ^= 0xFFFFFFFF;
        p[0] ^= 0xFFFF;
        p[7] ^= 0xFFFF;
        p += stride;
    } while(--i);
}
else
{
    i = 8;
    do
    {
        *(uint32_t*)p ^= 0xFFFFFFFF;
        *(uint32_t*)(p+2) ^= 0xFFFFFFFF;
        *(uint32_t*)(p+4) ^= 0xFFFFFFFF;
        *(uint32_t*)(p+6) ^= 0xFFFFFFFF;
        p += stride;
    } while(--i);
}

How does that affect the timings?
Post by David Brown
And even if it really was a 2x performance boost, that will make no
practical difference on a short and fast function.
That depends how often it's called. A lot of programs spend the majority
of their time running "small and fast" functions.
Post by David Brown
I would ask them to show the measurements, which would almost certainly
prove that your ugly code gave negligible benefits. But if it turned up
to be absolutely necessary, then I would write code that was correct -
using implementation-specific extensions as needed rather than making
void flipZ(uint16_t *p) {
typedef uint32_t __attribute__((__may_alias__)) uint32_a;
if ((uintptr_t) p & 2) {
uint32_a *q = __builtin_assume_aligned(p + 1, 4);
*q++ ^= 0xffffffff;
*q++ ^= 0xffffffff;
*q++ ^= 0xffffffff;
p[0] ^= 0xFFFF;
p[7] ^= 0xFFFF;
} else {
uint32_a *q = __builtin_assume_aligned(p, 4);
*q++ ^= 0xffffffff;
*q++ ^= 0xffffffff;
*q++ ^= 0xffffffff;
*q++ ^= 0xffffffff;
}
}
As I asked above, what if the present compiler doesn't use such directives
but merely makes pessimistic aliasing presumptions (and still produces code
that's better than gcc *with aliasing turned on* on the targets I use)?
Ian Collins
2017-04-07 00:39:59 UTC
Post by s***@casperkitty.com
Post by David Brown
void flipZ(uint16_t *p) {
typedef uint32_t __attribute__((__may_alias__)) uint32_a;
if ((uintptr_t) p & 2) {
uint32_a *q = __builtin_assume_aligned(p + 1, 4);
*q++ ^= 0xffffffff;
*q++ ^= 0xffffffff;
*q++ ^= 0xffffffff;
p[0] ^= 0xFFFF;
p[7] ^= 0xFFFF;
} else {
uint32_a *q = __builtin_assume_aligned(p, 4);
*q++ ^= 0xffffffff;
*q++ ^= 0xffffffff;
*q++ ^= 0xffffffff;
*q++ ^= 0xffffffff;
}
}
As I asked above, what if the present compiler doesn't use such directives
but merely makes pessimistic aliasing presumptions (and still produces code
that's better than gcc *with aliasing turned on* on the targets I use)?
Your original ugly code generates more verbose output than the simple
loop David posted with gcc on x86...

Simple loop:

flip1:
.LFB0:
pcmpeqd %xmm0, %xmm0
movdqu (%rdi), %xmm1
pxor %xmm1, %xmm0
movups %xmm0, (%rdi)
ret

Your version:


flip3:
.LFB2:
testb $2, %dil
jne .L7
pcmpeqd %xmm0, %xmm0
movdqu (%rdi), %xmm1
pxor %xmm1, %xmm0
movups %xmm0, (%rdi)
ret
.p2align 4,,10
.p2align 3
.L7:
notl 2(%rdi)
notl 6(%rdi)
notl 10(%rdi)
notw (%rdi)
notw 14(%rdi)
ret

Same core, but with an additional (probably unnecessary) test and branch.

Compiled with gcc -O3 -S -m64 x.c -std=c11 -pedantic
--
Ian
s***@casperkitty.com
2017-04-07 16:35:34 UTC
Post by Ian Collins
Your original ugly code generates more verbose output than the simple
loop David posted with gcc on x86...
Optimal machine code for some processors would be to blindly use a wider
access and if the address is unaligned let the hardware handle it. On
other processors, unaligned accesses will trap. If something like a
simple graphics routine for a CPU without a cache would represent 1% of
a program's code size but represent 90% of execution time during an
animation sequence, a version which is even 5 times as large but runs 30%
faster would represent a 25% performance boost in exchange for a 5%
increase in program size. As I've said many times that particular example
didn't do enough with each unaligned address to allow a real performance
win, but make it flip an 8x8 group of pixels rather than just a single
stripe of 8 pixels and things would shift.

I'm not sure about instruction timings on the Cortex-M4, but I would expect
that on that platform performance might be improved by changing the unaligned-
case code slightly. Adding in an 8-row loop:

uint32_t *p2 = (uint32_t*)(p-1);
for (int i=0; i<8; i++)
{
    p2[0] ^= 0xFFFF0000 >> (16*IS_BIG_ENDIAN);
    p2[1] ^= 0xFFFFFFFF;
    p2[2] ^= 0xFFFFFFFF;
    p2[3] ^= 0xFFFFFFFF;
    p2[4] ^= 0x0000FFFF << (16*IS_BIG_ENDIAN);
}

With that adjustment, I would expect optimal code to look like:

lsrs r2,r0,#30
bcc aligned
sub r0,r0,#2
add r1,r0,#stride*7
mvn r2,#0
asl r2,r2,#16 ; Get 0xFFFF0000 into R2
movn r3,r2 ; Get 0x0000FFFF into R3
unaligned_lp:
ldrm r0,{r4-r8}
cmp r0,r1
xor r4,r4,r3
mvn r5,r5
mvn r6,r6
mvn r7,r7
xor r8,r8,r2
strm r0!,{r4-r8}
add r0,#stride-20
bne unaligned_lp
b done
aligned:
add r1,r0,#stride*7
aligned_lp:
ldrm r0,{r4-r7}
cmp r0,r2
mvn r4,r4
mvn r5,r5
mvn r6,r6
mvn r7,r7
strm r0!,{r4-r7}
add r0,#stride-16
bne aligned_lp
done:

Aligning the pointer and using 32 bit xors for everything would allow the
Cortex-M4 to use ldrm/strm for all memory accesses; even though the normal
ldr/str instructions can handle unaligned loads, ldrm/strm cannot. I don't
know how easy it would be to coax gcc into giving the optimal code, but for
the case of handling an 8x8 box I would expect a better than 2x speedup
versus code that uses 128 16-bit accesses, and a substantial speedup versus
code that uses 64 random-alignment 32-bit accesses in the unaligned case
(64 penalty cycles), and significant speedup versus code that uses 64 random-
alignment 32-bit accesses in the aligned case (since individual loads cost
more per load than ldrm/strm).
Ian Collins
2017-04-06 21:05:57 UTC
Post by David Brown
void flip1(uint16_t *p) {
for (int i = 0; i < 8; i++) {
*p++ ^= 0xffff;
}
}
/That/ code does exactly what it says it does - it is simple, clear, and
easy to modify. It is easy for a compiler to optimise in various ways.
If you have particular reason to want to have somewhat faster code, you
void flip2(uint16_t *p) {
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
*p++ ^= 0xffff;
}
And note that for /many/, perhaps even most, targets, the sensible C
versions give /better/ code - smaller and faster - than your
monstrosity.
gcc on x64 is a good example; both versions reduce to

flip1:
.LFB0:
pcmpeqd %xmm0, %xmm0
movdqu (%rdi), %xmm1
pxor %xmm1, %xmm0
movups %xmm0, (%rdi)
ret

Ian
a***@gmail.com
2017-04-10 06:56:50 UTC
gcc on x64 is a good example, both version reduce to

flip1:
.LFB0:
pcmpeqd %xmm0, %xmm0
movdqu (%rdi), %xmm1
pxor %xmm1, %xmm0
movups %xmm0, (%rdi)
ret

So nobody don't know the last CPU can understand translation...
Rod Pemberton
2017-04-07 03:59:22 UTC
On Thu, 06 Apr 2017 12:33:44 GMT
Post by Spiros Bousbouras
Example : the expression array[i] where i is beyond the end of the array.
FYI, C explicitly allows you to access one element past the end of the
array. This is to allow you to access the nul character when comparing
strings to check for string termination. If you go beyond that though,
it's undefined ... However, many C implementations use the ability to
access outside a C object behind the scenes without any issues, e.g.,
to access a memory header prior to malloc'd memory blocks.
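
To make that last point concrete, here is a minimal sketch of the kind of layout
being described; it is purely illustrative (real allocators are more elaborate),
and the names are made up:

#include <stdlib.h>

/* A toy allocator that stores a size header immediately before the block it
   hands out, so a matching free() can find it by stepping back one header. */
struct hdr { size_t size; };

static void *toy_malloc(size_t n)
{
    struct hdr *h = malloc(sizeof *h + n);
    if (h == NULL)
        return NULL;
    h->size = n;
    return h + 1;                        /* user pointer is just past the header */
}

static void toy_free(void *p)
{
    if (p != NULL)
        free((struct hdr *)p - 1);       /* step back to the header */
}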
Post by Spiros Bousbouras
Thinking in terms of assembly : "array[i] will be translated into an
assembly instruction which calculates an address of the form base +
i * constant .
This is a requirement of the subscript operator [] for K&R, ANSI, and
ISO C. It's in all the standards. They define "E1[E2]" to be
equivalent to "*((E1)+(E2))". The subscript operator takes a pointer
to a type and also an index. The index is scaled by the size of the
data type pointed to. The pointer cannot be a pointer-to-void
('void *') since 'void' has no data type, i.e., no size. The pointer
used in the subscript operator can be any pointer to any valid object
type, not just a pointer to an array element. The arguments to the
subscript operator can be in either order, e.g., "0123456789"[i] is
equivalent to i["0123456789"]. Technically, you should think of the
subscript operator as just another operator in C, like arithmetic +,
the -> component selection operator, or whatever.

BTW, the other required equivalence is that "E1->MOS" is equivalent to
"(*E1).MOS". This is the equivalence between the two component
selection operators . and ->
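
A small illustrative snippet of both equivalences (the struct and variable names
are made up):

#include <stdio.h>

struct point { int x; };

int main(void)
{
    int i = 3;
    struct point pt = { 42 };
    struct point *pp = &pt;

    /* E1[E2] is defined as *((E1)+(E2)), so the operands commute. */
    printf("%c %c\n", "0123456789"[i], i["0123456789"]);   /* prints: 3 3 */

    /* E1->MOS is equivalent to (*E1).MOS */
    printf("%d %d\n", pp->x, (*pp).x);                      /* prints: 42 42 */
    return 0;
}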
Post by Spiros Bousbouras
This will always point into some address in memory so
I should get some value or at least a segmentation fault ; so not a
totally unpredictable result."
C's objects don't have bounds checks built in. It's not an
object-oriented language. So, it's up to the programmer to make sure
that they're within range.
Post by Spiros Bousbouras
Note first that not thinking in terms of C is more complicated and
requires more knowledge than thinking in terms of C. Second , if one
wants to think in terms of assembly then , from a practical point of
view , it makes sense to program in assembly , not C.
Non sequitur.

IMO, the key reasons to use C instead of assembly are the ability to
use variables and the portability of the code. Even so, you need to
understand how your code is converted to assembly or at least check the
compiler output.
Post by Spiros Bousbouras
How are you going to implement for example
the Unix sort utility without using realloc() (depending on the
size of the input) ?
I don't follow. I've never had to use realloc() for anything, ever.
If you're manipulating large data sets in C, it's better to use files
instead of memory.
Post by Spiros Bousbouras
Your second reference has
char *index;
....
*index++ = toupper(*index);
.He doesn't explain why he decided to write code like this as opposed
to *index = toupper(*index);
index++ ;
I'm guessing that this is most likely just pre-ANSI C. Most C style
guides etc are older and are for K&R C. E.g., most will recommend
that you use #defines and mask with logical-and instead of using
bit-fields, since many earlier C compilers were broken. They also
recommend avoiding enums for similar reasons. They also recommend
placing numbers in comparisons as "if(0==value)" instead of
"if(value==0)" to detect incorrect assignments such as "if(value=0)".
etc. They'll also say not to use the address of a struct by itself,
i.e., as a pointer, since an early, widespread C compiler, PCC, didn't
keep track of the struct's name or somesuch issue ... ANSI C added
other problems, e.g., obsoleted implicit int's since they caused a
syntax conflict with typedef's, messed up pre- and post- increment and
decrement operators, etc. Of course, C has some required behavior,
which is not part of some of the standards, such as the C struct hack
and Duff's device, which must work correctly.
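
For readers who haven't seen it, Duff's device (mentioned above) interleaves a
switch with a do-while to unroll a copy loop; a minimal sketch, with hypothetical
names, assuming count > 0 and a memory-mapped output register as the destination:

/* Duff's device: 8-way unrolled copy to a non-incrementing destination
   (originally an output port). The case labels fall into the loop body. */
void duff_copy(volatile short *to, const short *from, int count)
{
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}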


Rod Pemberton
--
All it takes for humanity to conquer the world's toughest problems
is to hope, to believe, to share, and to do, in cooperation.
Robert Wessel
2017-04-07 04:09:57 UTC
On Thu, 6 Apr 2017 23:59:22 -0400, Rod Pemberton
Post by Rod Pemberton
On Thu, 06 Apr 2017 12:33:44 GMT
Post by Spiros Bousbouras
Example : the expression array[i] where i is beyond the end of the array.
FYI, C explicitly allows you to access one element past the end of the
array. This is to allow you to access the nul character when comparing
strings to check for string termination. If you go beyond that though,
it's undefined ... However, many C implementations use the ability to
access outside a C object behind the scenes without any issues, e.g.,
to access a memory header prior to malloc'd memory blocks.
No it does not. C allows you to take the address of an object one
past the end of an array, but you may *not* access it. IOW, you may
do:

char *p;
char a[5];
...
p = &a[5]; //points one past the end of the array

But you may not apply the dereference operator to p at that point.

You may also not do "p = &a[6];".

And the nul at the end of a string is very much part of the array
object, so there's no special dispensation needed to access that. So
if you had:

char a[5] = "abc";

The nul is simply what is in a[3]. You can also access a[4], even though
that's past the end of the "string". OTOH, if you have:

char a[5] = "abcde";

Then there is no nul at all, since there's no room in the string, and
attempting to *access* a[5] is not allowed.
Keith Thompson
2017-04-07 16:28:12 UTC
Post by Rod Pemberton
On Thu, 06 Apr 2017 12:33:44 GMT
Post by Spiros Bousbouras
Example : the expression array[i] where i is beyond the end of the array.
FYI, C explicitly allows you to access one element past the end of the
array.
No, it does not. It allows you to compute an address one past the end
of an array, but attempting to dereference such an address has undefined
behavior.

int arr[10];
arr + 10; // valid
arr[10]; // undefined behavior
Post by Rod Pemberton
This is to allow you to access the nul character when comparing
strings to check for string termination.
No, it is not. If an array contains a string, then the nul character
must be an element of the array.
Post by Rod Pemberton
If you go beyond that though,
it's undefined ... However, many C implementations use the ability to
access outside a C object behind the scenes without any issues, e.g.,
to access a memory header prior to malloc'd memory blocks.
Implementations can internally do whatever they like, as long as the
behavior is correct. But if, for example, free() accesses memory before
the location pointed to by its argument, it's likely (if malloc and free
are implemented in C) that malloc actually allocated a larger object
behind the scenes. Examining memory before the allocated object is just
accessing that larger object.
Post by Rod Pemberton
Post by Spiros Bousbouras
Thinking in terms of assembly : "array[i] will be translated into an
assembly instruction which calculates an address of the form base +
i * constant .
This is a requirement of the subscript operator [] for K&R, ANSI, and
ISO C. It's in all the standards.
No C standard says anything about assembly instructions.
Post by Rod Pemberton
They define "E1[E2]" to be
equivalent to "*((E1)+(E2))". The subscript operator takes a pointer
to a type and also an index. The index is scaled by the size of the
data type pointed to. The pointer which cannot be a pointer-to-void
'void *' since 'void' has no data type, i.e., no size, pointer only.
The pointer used in the subscript operator can be any pointer to any
valid type, not just pointer for array types. The arguments to the
subscript operator can be either order, e.g., "0123456789"[i] is
equivalent to i["0123456789"] . Technically, you should think of the
subscript operator as just the left-brace, then you'll understand it's
like any other operator in C, e.g., arithmetic +, -> component
selection, or whatever.
Yes, indexing is defined in terms of pointer arithmetic -- and both
indexing and pointer arithmetic have undefined behavior if a pointer
value is constructed that's outside the bounds of the underlying array
object (with the same special case for a pointer just past the end of
the array object). (A single object is treated as a single-element
array.)
Post by Rod Pemberton
BTW, the other required equivalence is "E1->MOS" is equivalent to
"(*E1).MOS". This is the equivalence between the two component
selection operators . and ->
Yes -- but what the standard actually says (in a footnote) is that
(&E)->MOS is equivalent to E.MOS. (A minor point.)
Post by Rod Pemberton
Post by Spiros Bousbouras
I should get some value or at least a segmentation fault ; so not a
totally unpredictable result."
C's objects don't have bounds checks built in. It's not an
object-oriented language. So, it's up to the programmer to make sure
that they're within range.
I'm not sure that there's a connection between built-in bounds checks
and object orientation. But it's true that C doesn't require bounds
checking. Bounds violations are not necessarily caught; they cause
undefined behavior. (A conforming implementation *can* provide bounds
checking.)

[...]
Post by Rod Pemberton
I'm guessing that this is most likely just pre-ANSI C. Most C style
guides etc are older and are for K&R C. E.g., most will recommend
that you use #defines and mask with logical-and instead of using
bit-fields, since many earlier C compilers were broken. They also
recommend avoiding enums for similar reasons. They also recommend
placing numbers in comparisons as "if(0==value)" instead of
"if(value==0)" to detect incorrect assignments such as "if(value=0)".
etc.
What style guides are you reading?

(We just had a lengthy discussion about (0==value) vs. (value==0), and
I'd prefer not to revisit that particular issue just now.)
Post by Rod Pemberton
They'll also say not to use the address of a struct by itself,
i.e., as a pointer, since an early, widespread C compiler, PCC, didn't
keep track of the struct's name or somesuch issue ...
I've never seen such a recommendation.
Post by Rod Pemberton
ANSI C added
other problems, e.g., obsoleted implicit int's since they caused a
syntax conflict with typedef's, messed up pre- and post- increment and
decrement operators, etc.
Implicit int was removed by C99.

The term "ANSI C" is ambiguous. It almost always refers to the language
described by the 1989 ANSI C standard and by the equivalent 1990 ISO C
standard, but strictly speaking ANSI only recognizes the current 2011
ISO C standard. I strongly suggest referring to C89 or C90, C99, and
C11 rather than "ANSI C".
Post by Rod Pemberton
Of course, C has some required behavior,
which is not part of some of the standards, such as the C struct hack
and Duff's device, which must work correctly.
It's not clear that the struct hack is legal, which is why C99
introduced flexible array members to replace it. See question 2.6
of the comp.lang.c FAQ, <http://www.c-faq.com/>. Duff's Device is,
I believe, valid and portable.
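
For illustration, the C99 replacement mentioned above looks roughly like this
(hypothetical struct and function names):

#include <stdlib.h>
#include <string.h>

/* C99 flexible array member: the last member has no declared size, and
   space for it is allocated together with the struct in one malloc call. */
struct message {
    size_t len;
    char text[];          /* the "struct hack" done legally */
};

static struct message *make_message(const char *s)
{
    size_t len = strlen(s);
    struct message *m = malloc(sizeof *m + len + 1);
    if (m == NULL)
        return NULL;
    m->len = len;
    memcpy(m->text, s, len + 1);
    return m;
}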
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
s***@casperkitty.com
2017-04-07 16:42:47 UTC
Post by Keith Thompson
Yes -- but what the standard actually says (in a footnote) is that
(&E)->MOS is equivalent to E.MOS. (A minor point.)
What about

typeOfE *p = &E;
...
p->MOS?

Does the Standard make clear in what cases p->MOS would or would not be
required to behave equivalently to E.MOS? For example, is:

void ActOnPMOS(typeOfE *p) { p->MOS += 23; }
...
ActOnPMOS(&E);

equivalent to:

(&E)->MOS += 23;

and in turn to

E.MOS += 23;

I don't see anything in the Standard that would suggest that they should not
be equivalent, but gcc disagrees.
Spiros Bousbouras
2017-04-10 00:38:53 UTC
[ Crossposted to comp.lang.c since we need some C lawyers ;-) ]

On Sun, 09 Apr 2017 14:53:01 GMT
On Wed, 05 Apr 2017 15:01:45 GMT
Translating Forth into traditional C (that compiles as intended by PCC
and early versions of GCC) is indeed not difficult; after all,
traditional C and Forth are semantically close to each other. Take a
look at Gforth or forth2c to get an idea on how to do it.
The resulting code is, of course, full of undefined behaviour, like
most of the other C code around; and in this case I think it's
impossible to get rid of the undefined behaviour without losing
functionality and performance, and even if you are willing to pay that
price, it requires significantly more effort, and in the end, you will
probably still have some undefined behaviour somewhere.
Here we need to make a distinction : I will term "weak undefined behaviour"
(WUB) C code which is not defined according to the C standard and "strong
undefined behaviour" (SUB) C code which is not defined according to
*anything* where "anything" could mean some standard which enhances the C
standard (like POSIX) and also relevant documentation of the compiler or
libraries one uses or whatever. Our main concern here is predictable behaviour
of the code so I believe the deciding concept is SUB rather than WUB.
I don't think that this distinction helps in this discussion, because
1) Different people consider different things to be "not defined
according to *anything*". E.g., assembly language and most Forth
programmers consider signed integer overflow to have a pretty
natural definition (wraparound aka modulo arithmetics), some more
mathematically-inclined people think they deserve nasal demons.
I guess I wasn't clear enough. By "anything" I mean anything related to the context.
So if I'm translating some C code (note that "translating" doesn't necessarily mean
compiling ; C is usually compiled but there are C interpreters) relevant stuff are

- C standard.
- Translator documentation. For example , by using the appropriate options one may get
  additional guarantees beyond the C standard , like the -fwrapv option which gcc
  supports and which makes signed integer overflow defined (see the short sketch
  after this list).
- If one uses headers or functions or macros , etc. which do not appear in the C
standard then these must be defined somewhere else like in some other standard
or they must be part of some library and be defined in the documentation of the
library.
- There may be more which is relevant. I don't see how Forth or assembly are relevant
  unless one calls an external process which is written in one of these languages.
Mathematics is relevant only to the extent that the related documentation or
standard(s) says that it's relevant. For example it follows from the C standard
that for integers x and y of the same type T , if the mathematically correct result
of x+y fits into T then that's what x+y evaluates to. But if it didn't specify
this then one should not have assumed that the mathematical definition of + is
relevant even if they felt that it would be the natural way to define C's +
operator.
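
As a concrete illustration of the kind of extra guarantee mentioned in the list
above, here is a small sketch (my example, not from the thread). The wraparound
it relies on is left undefined by the C standard and is only something one may
depend on when the translator documents it, e.g. gcc invoked with -fwrapv :

#include <limits.h>
#include <stdio.h>

/* Relies on signed wraparound; only predictable if the implementation
   documents it, e.g. when compiled with `gcc -fwrapv`. */
static int next_id(int n)
{
    return n + 1 ;   /* wraps to INT_MIN when n == INT_MAX under -fwrapv */
}

int main(void)
{
    printf("%d\n", next_id(INT_MAX)) ;
    return 0 ;
}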

The underlying principle is that one must have some rational reason to expect
code to behave in a certain way. For example if I have C code which includes
frooboz(123) ;

I must have some rational reason to expect this to translate without errors and
do something specific I have in mind. Since the C standard does not define frooboz()
, this reason must come from somewhere else. So we definitely need some stronger
concept than WUB. What counts as "pretty natural definition" for signed integer
overflow (or anything else) seems subjective to me and I don't even see how it is
relevant whether someone is a Forth programmer or assembly programmer (of which
architecture ?) or mathematically inclined. Even in mathematics , what x+y means
depends on the context. The definition is different if x and y are elements of ℤ
(integers) as opposed to say ℤ/5ℤ .So what is defined in some context depends
on the context rather than the overall knowledge and experiences one has. To get to
one of your favorite examples , if your C code has signed integer overflow what's
your rational reason to expect that it will do something specific (like wraparound)?
The reason cannot be the C standard because the C standard says it's undefined
behaviour. If your compiler documentation does not say that it will produce assembly
which does wraparound then you have no reason to expect the code to behave in a
predictable manner so that code needs to be changed. You say above "to get rid of
the undefined behaviour without losing functionality and performance". Without
rational reason to believe that your code will do what you want then you don't
actually have functionality ; if it seems to behave the way you want then it is just
a coincidence and the behaviour may go away without warning like if you use a
different version of compiler. So there's nothing to lose (apart perhaps from
occasions where you urgently want to achieve something and you're willing to kludge
it) , either you fix the code or you have nothing you can depend on.
2) Nasal demon fans want license to produce nasal demons for every
undefined behaviour. E.g., in the general case both gcc and
Clang/LLVM generate code for 2s-complement arithmetic with
wraparound for signed integer addition, subtraction, and
multiplication, so you might classify it as WUB. Nasal demon fans
still want to keep signed integer overflow undefined even in those
compilers so that they make use of it for "optimizations".
As I said in a previous post , I don't believe there are "nasal demon fans". As for
wanting license I find this puzzling because I don't believe they need a license.
For example , one might decide to write a C compiler which supports all other parts
of the C standard but not the for statement. One does not need any kind of license
to do this , one simply needs (from a moral point of view) to be honest about what
they offer. And then other people can decide if they want to use the compiler or
not. But if they decide to use it then their expectations must be based on what the
compiler does claim to offer , not on what they would wish it to offer. They might
wish that it also supported the for statement but since it doesn't claim to do so
then they shouldn't use for in their code (if they want to translate their code
with this compiler).

By the way , regarding gcc and signed integer operations , what do you mean "general
case" ? My impression is that without using -ftrapv or -fwrapv then oveflow for
signed integer operations is undefined.
3) People aware of the evils of nasal demons don't want them anywhere,
no matter what undefined behaviour there is. E.g., for an
out-of-bounds array read, I expect either an unspecified, but
stable value, or a segmentation violation/general protection fault;
that's not fully defined, but it excludes nasal demons. E.g., I
don't expect it to turn a counted loop into an endless loop (and
this is actually a nasal demon "optimization" that occurs in
practice).
In any
case , if code has SUB then I cannot imagine on what basis one may think it
works and it will reliably continue to do so in the future : even if it
passes testing , then that's by coincidence. So if code exhibits SUB then
it's buggy and it needs to be fixed.
Of course that depends on what you consider SUB. Anyway, here's an
example.
int d[16];

int SATD (void)
{
    int satd = 0, dd, k;
    for (dd=d[k=0]; k<16; dd=d[++k]) {
        satd += (dd < 0 ? -dd : dd);
    }
    return satd;
}
If there is some other global data defined earlier and later in the
program, then d[16] is a mapped (i.e., readable) memory location. And
if it is an unmapped memory location, you get a SIGSEGV, not some
nasal demon.
Are you translating this with a C translator which defines in its documentation
things like "mapped memory location" ? If not then these terms are meaningless (in
the specific context of using this specific code with this specific translator) , it
is SUB and you can't have any expectations whatsoever regarding how this code will
behave. I certainly wouldn't. I take it you would ? If yes on what basis ?
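
For what it's worth, here is one way the loop above could be restructured so that
it never reads d[16] in the first place (my sketch, not from the thread), which
removes the question of what lies beyond the array entirely :

extern int d[16] ;   /* the array declared in the example above */

int SATD_no_oob(void)
{
    int satd = 0 ;
    int k ;
    for (k = 0; k < 16; k++) {
        int dd = d[k] ;
        satd += (dd < 0 ? -dd : dd) ;
    }
    return satd ;
}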
3) Complete correctness proofs of programs are outside the competence
of nearly all programmers (and pretty pointless, because they would
prove only the correctness wrt the specification, which itself may be
buggy).
Now, to write a C program that is guaranteed to never exercise
undefined behaviour requires competences on the level 3.
int is_leap_year(unsigned int year) {
return 1 ;
}
It's trivial to see that this is defined but , judging from the function's
name , it's almost certainly buggy. A complete correctness proof is many
orders of magnitude harder than simply ensuring defined behaviour.
I don't think it is trivial to see that this is defined.
Well , I'm not going to locate and quote all the parts of the C standard which
support my view but I have no doubts that it is defined.
I once
|It seems to me that
|
| if ((size_t)(dest-src) < len)
|
|is a more efficient way to test whether dest is inside [src, src+len).
|
|Of course, since this code contains more than three tokens, it is
|likely to contain undefined behaviour
That was mostly in jest, but a post or two later, someone explained
that yes, this code is undefined. So while I do not see anything
undefined about the code you posted, just let the language lawyers at
it, and they will point it out.
Ok , we'll see what the language lawyers say.
In any case, in your example it is trivial to see that it does not
satisfy the requirement that is implied by the name, so even if it
really is trivial to see that it is defined, it only shows that both
correctness and definedness are trivial to determine for trivial
programs.
My point was simply to provide a counterexample to your statement
Now, to write a C program that is guaranteed to never exercise
undefined behaviour requires competences on the level 3.
So you will find undefined behaviours in the cases exercised by normal
users, but you will tend not to find undefined behaviour in those
cases exercised by an attacker; it's as difficult as finding all
the vulnerabilities in the first place. E.g., in the OpenSSH case, if
you thought about the vulnerability, you would fix it right away,
rather than writing a test case that exploited it, and waiting for the
sanitizer to report that freed memory was accessed (if the sanitizer
actually does check that), and then fixing it.
The best write-up I could find on the vulnerability is at
https://www.qualys.com/2016/01/14/cve-2016-0777-cve-2016-0778/openssh-cve-2016-0777-cve-2016-0778.txt .
The crucial bug occurred much earlier than compiler optimisations eliminating
code which wrote 0's on some memory. From the previous link
- "out_start == out_last" (lines 205-206): no data was ever written to
out_buf (and both out_start and out_last are still equal to 0) because
no data was ever sent to the server after roaming_reply() was called,
but the client sends (leaks) the entire uninitialized out_buf to the
server (line 214), as if out_buf_size bytes of data were available.
This would still be a bug (albeit a less serious one) even if the memory had
been zeroed because it sends bogus information to the server. An analogous
bug would be sending over the network unencrypted data which should have been
encrypted. In both cases , to see the bugs one needs to think on the level of
what the application wants and does not want to do , not on the low level
details of the programming language. Some low level detail may accidentally
eliminate some bug caused by a high level thinking oversight but this is
entirely random so it could just as well make it worse.
Another bug was with not handling correctly C arithmetic (see the page).
Finally the not zeroing of memory happened because (from the link)
Unfortunately, an optimizing compiler may remove this
memset() or bzero() call, because the buffer is written to, but never
again read from (an optimization known as Dead Store Elimination).
so it's not due to undefined behaviour.
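
A common mitigation for exactly this dead-store-elimination problem (my sketch,
not something the advisory proposes) is to route the call through a volatile
function pointer so the compiler cannot prove the zeroing store is dead; C11
Annex K also offers memset_s where it is available :

#include <string.h>

/* The volatile-qualified function pointer keeps the compiler from proving
   that the zeroing store is dead and eliminating it. */
static void *(*const volatile memset_ptr)(void *, int, size_t) = memset ;

static void secure_zero(void *p, size_t n)
{
    memset_ptr(p, 0, n) ;
}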
But actually the bug resulted in reading from the buffer. So the
assumption by the compiler that the buffer is never again read from is
wrong, and applying dead-store elimination to this memset() is wrong;
in other words, the compiler is buggy.
But of course the GCC maintainers don't accept this as a bug: They say
that reading from the buffer after the free() is undefined behaviour,
therefore does not happen, and therefore they are in their rights to
conjure up nasal demons and "optimize" the memset()/bzero() away. And
this is not an issue that could turn out one way or the other, this
memset()/bzero() is there exactly for mitigating information leaks,
such as the one that is the main reason of the CVE.
I provided a reference above and it does not support your claims that this
is what caused the bug. Do you have a reference which says otherwise ?
It is telling that you do not understand how the nasal-demon
interpretation of C allowed the compiler to "optimize" the
memset()/bzero() away. So much for Andrew Haley's claim that it is
"not difficult" to write "well-defined" C, and your defense of that
stance.
I understand what you're claiming but you haven't provided any reference that this
is what caused the bug in the case of OpenSSH.
I couldn't tell you off the top of my head but I'm sure you could get the
answer on comp.lang.c very quickly. Looking at the big picture , it could
be that C is not a suitable back-end for Forth after all even if you got
lucky for a number of years and your code (which presumably was exhibiting
SUB) worked as intended.
I think that nasal-demon C is not a suitable language for anything.
I use it all the time with perfectly predictable results. I have my complaints
but they are not caused by undefined behaviour.
Note that this may turn out to be an opportunity for
Forth : it may be that Forth itself is a more suitable back-end than C for other
programming languages
Yes.
A
*constructive* thing to do (as opposed to complaining) would be to describe
what guarantees with regard to C pointer comparisons on top of the C standard
would be needed to do what you want to do. Then you could ask on comp.lang.c
or some compiler mailing list whether these additional guarantees can be
offered by using some appropriate flag and if not how easy would be to implement
them as an extension.
If you think that's a promising approach, go ahead. My experience is
that they always tell you to go to the C standards committee.
Before you go to the C standards committee you need to have a specific proposal. I
take it that for C to be a suitable back end for implementing Forth , you want
comparison of C pointers pointing to different objects to be defined. Could you
provide a set of axioms that such comparisons ought to satisfy to be adequate for
your purposes ? If not then even if C compiler writers or the C standards committee
wanted to accommodate you , they have no way to do so. You can't expect them to read
your mind , can you ? "Do the natural thing" is not a meaningful definition. I don't
think that the following from your paper is a meaningful definition either :

C* A language (or family of languages) where language constructs correspond
directly to things that the hardware does. E.g., * corresponds to what a
hardware multiply instruction does. In terms of the C standard, conforming
programs are written in C .

I couldn't provide such a set of axioms because I've never implemented a Forth so I
don't know in what ways standard C as a back end is lacking.
1. The desire to allow implementors optimisations.
2. Inability to reconcile previous (i.e. before the standard) existing practices
therefore any attempt to define the behaviour more narrowly would unfairly
privilege some implementations (or architectures) over others.
3. The desire to not make the standard too large.
That was all fine and dandy in the good old days of PCC and early GCC,
but in recent years the compiler maintainers went insane and use them
as justification for miscompiling (they call it "optimizing") as many
programs as possible, with some exceptions, in particular, standard
benchmarks. Paying customers are probably also not treated as badly
as the rest of us; hmm, so nasal demons may be part of the business
models.
If it's SUB then it's impossible to miscompile it because correct compilation is not
defined.
But no one *likes* undefined behaviour , it is just a trade-off.
The GCC and Clang compiler maintainers love it. For them reason 2 and
3 don't apply, yet with every release they miscompile more and more
cases that are undefined in the standard for reasons 2 and 3; e.g.,
signed integer overflow.
gcc provides 2 different options for getting predictable behaviour for signed
integer overflow. So this contradicts your statement.
If
programmers get convinced to avoid undefined behaviour then this ultimately
will make code more predictable not less.
So you love nasal demons, too.
I avoid SUB so they are indifferent to me. For the point I was making , you snipped
relevant context. I said

The talk of nasal demons is to drive home the point (especially to new
programmers) that when dealing with undefined behaviour (that is SUB) there's no
"reasonable" behaviour one can expect therefore *** Just say no *** .If
programmers get convinced to avoid undefined behaviour then this ultimately will
make code more predictable not less.

To add to the above , as a metaphor which is used as an educational tool , I
consider it unfortunate because it ignores the fact that a computer exists and
operates in the real world so it can only cause things which are physically
possible. There are other ways to drive home the point that SUB is unpredictable
without mentioning physical impossibilities.
Many programmers are convinced. But because there is actually no good
way to write code without undefined behaviour, we will get programs by
convinced programmers that have fewer undefined behaviours, but it
will still be there; and the next release of the compiler breaks some
code, the convinced programmers get busy fixing it, and the next
release breaks some more code. Bottom line: slightly faster
benchmarks, more programs that break, and more work for the C
programmers without any benefit.
There is a way to write code without SUB and that is to be familiar with the
standards and documentation which is relevant to the code one writes. Bugs may exist
of course but this would be the case just as much even with no undefined behaviour.

[...]
But then there are always some
stubborn ones like yourself or
http://www.open-chess.org/viewtopic.php?f=5&t=2519 where Robert Hyatt
insists that SUB should still do what he wants just because it did in the past.
So you think that using strcpy to do an overlapping backwards copy is
SUB, i.e., "not defined according to *anything*". Interesting.
I read the thread a while ago and I can't be bothered to reread it but I don't think
that Hyatt said that he was using a C library which , as an extension to the C
standard , defined strcpy() for overlapping objects. If that's the case then the
behaviour was not defined according to anything relevant to the code he was writing
so I would classify it as SUB and I don't believe that he had any rational reason to
expect a specific behaviour (but he thought otherwise). What one might consider
"natural" or useful behaviour for strcpy() does not affect the definition of SUB. I
will present a more general principle :

For the tools you use , you cannot expect what has not been promised to you.

A C compiler is a tool and so is a C library (whether it's the standard C library or
some other) ; they were written by humans. If those humans do not tell you for
example "we made sure that strcpy() copies correctly even overlapping objects" then
you have no rational reason to expect it.
--
This, of course, is a straw man. As the Stanford Encyclopedia of Philosophy observes
Moral relativism has the unusual distinction - both within philosophy and outside
it - of being attributed to others, almost always as a criticism, far more often than
it is explicitly professed by anyone.
http://www.mathpages.com/home/kmath557/kmath557.htm
James Kuyper
2017-04-10 02:46:04 UTC
Permalink
Raw Message
Post by Spiros Bousbouras
[ Crossposted to comp.lang.c since we need some C lawyers ;-) ]
On Sun, 09 Apr 2017 14:53:01 GMT
...
Post by Spiros Bousbouras
I once
|It seems to me that
|
| if ((size_t)(dest-src) < len)
|
|is a more efficient way to test whether dest is inside [src, src+len).
|
|Of course, since this code contains more than three tokens, it is
|likely to contain undefined behaviour
That was mostly in jest, but a post or two later, someone explained
that yes, this code is undefined. So while I do not see anything
undefined about the code you posted, just let the language lawyers at
it, and they will point it out.
Ok , we'll see what the language lawyers say.
It would help to know what the relevant data types are, but the comment
about "dest is inside [src, src+len] suggests that dest and src are both
pointers. The C standard defines subtraction of pointers in terms of
positions in the SAME array. If src and dest don't both point into or
one past the end of the same array, subtraction of those pointers has
undefined behavior (6.5.6p9).
Consider an implementation of C targeting a machine with a
segment-offset architecture, where every pointer into a given object
(including pointers to sub-objects) has the same segment value (which
implies that no object can be bigger than a single segment), while
pointers into completely unrelated objects will, in general, have
different segments. On such a machine, the fact that this behavior is
undefined means that a fully conforming implementation of C is allowed
to implement pointer subtraction by simply subtracting the offsets and
IGNORING the segment numbers. Feel free to figure out the implications
that would have for your dest-size test.
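
For completeness, a fully defined (if slower) way to perform the test that
(size_t)(dest-src) < len was aiming at is to compare for equality against each
position, since equality comparison of pointers to distinct objects is defined
(my sketch, assuming src really does point to a buffer of at least len chars) :

#include <stddef.h>

/* Returns 1 if dest points into [src, src+len), 0 otherwise.  Only pointer
   equality is used, so the behaviour is defined even when dest and src
   point into different objects. */
static int dest_inside(const char *dest, const char *src, size_t len)
{
    size_t i ;
    for (i = 0; i < len; i++)
        if (dest == src + i)
            return 1 ;
    return 0 ;
}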

...
Post by Spiros Bousbouras
A
*constructive* thing to do (as opposed to complaining) would be to describe
what guarantees with regard to C pointer comparisons on top of the C standard
would be needed to do what you want to do. Then you could ask on comp.lang.c
or some compiler mailing list whether these additional guarantees can be
offered by using some appropriate flag and if not how easy would be to implement
them as an extension.
If you think that's a promising approach, go ahead. My experience is
that they always tell you to go to the C standards committee.
Redirecting you to the standards committee would be an inappropriate
response to an inquiry about the existence or feasibility of an extension
to C. Are you sure you didn't word your inquiry in such a way that it
sounded like you were asking for a change to the standard itself?
Post by Spiros Bousbouras
Before you go to the C standards committee you need to have a specific proposal. I
take it that for C to be a suitable back end for implementing Forth , you want
comparison of C pointers pointing to different objects to be defined. Could you
Note: C++ requires that std::less<T*> must provide a total ordering of
all pointers of type T*, even if the "<" operator does not provide such
an ordering (20.9.5p14). C++ has a sufficiently strong shared heritage
with C that the fact that the C++ standard can successfully impose this
requirement strongly suggests that the C standard could impose a similar
requirement.
However, it's telling that C++ did NOT impose this requirement on the
"<" operator itself, but only on std::less<T*>. That's because there are
some systems where meeting this requirement makes std::less<T*>(p,q)
substantially slower than p<q.
Robert Wessel
2017-04-10 03:27:59 UTC
Permalink
Raw Message
On Sun, 9 Apr 2017 22:46:04 -0400, James Kuyper
Post by James Kuyper
Post by Spiros Bousbouras
[ Crossposted to comp.lang.c since we need some C lawyers ;-) ]
On Sun, 09 Apr 2017 14:53:01 GMT
...
Post by Spiros Bousbouras
I once
|It seems to me that
|
| if ((size_t)(dest-src) < len)
|
|is a more efficient way to test whether dest is inside [src, src+len).
|
|Of course, since this code contains more than three tokens, it is
|likely to contain undefined behaviour
That was mostly in jest, but a post or two later, someone explained
that yes, this code is undefined. So while I do not see anything
undefined about the code you posted, just let the language lawyers at
it, and they will point it out.
Ok , we'll see what the language lawyers say.
It would help to know what the relevant data types are, but the comment
about "dest is inside [src, src+len] suggests that dest and src are both
pointers. The C standard defines subtraction of pointers in terms of
positions in the SAME array. If src and dest don't both point into or
one past the end of the same array, subtraction of those pointers has
undefined behavior (6.5.6p9).
Consider an implementation of C targeting a machine with a
segment-offset architecture, where every pointer into a given object
(including pointers to sub-objects) has the same segment value (which
implies that no object can be bigger than a single segment), while
pointers into completely unrelated objects will, in general, have
different segments. On such a machine, the fact that this behavior is
undefined means that a fully conforming implementation of C is allowed
to implement pointer subtraction by simply subtracting the offsets and
IGNORING the segment numbers. Feel free to figure out the implications
that would have for your dest-size test.
...
Post by Spiros Bousbouras
A
*constructive* thing to do (as opposed to complaining) would be to describe
what guarantees with regard to C pointer comparisons on top of the C standard
would be needed to do what you want to do. Then you could ask on comp.lang.c
or some compiler mailing list whether these additional guarantees can be
offered by using some appropriate flag and if not how easy would be to implement
them as an extension.
If you think that's a promising approach, go ahead. My experience is
that they always tell you to go to the C standards committee.
Redirecting you to the standards committee would be an inappropriate
response to a inquiry about the existence or feasibility of an extension
to C. Are you sure you didn't word your inquiry in such a way that it
sounded like you were asking for a change to the standard itself?
Post by Spiros Bousbouras
Before you go to the C standards committee you need to have a specific proposal. I
take it that for C to be a suitable back end for implementing Forth , you want
comparison of C pointers pointing to different objects to be defined. Could you
Note: C++ requires that std::less<T*> must provide a total ordering of
all pointers of type T*, even if the "<" operator does not provide such
an ordering (20.9.5p14). C++ has a sufficiently strong shared heritage
with C that the fact that the C++ standard can successfully impose this
requirement strongly suggests that the C standard could impose a similar
requirement.
However, it's telling that C++ did NOT impose this requirement on the
"<" operator itself, but only on std::less<T*>. That's because there are
some systems where meeting this requirement makes std::less<T*>(p,q)
substantially slower than p<q.
It also does not (necessarily) produce the same ordering as does the
built-in operator <. It does if the built-in < comparison is between
objects that can be compared without undefined behavior, but if not,
the two can produce different results.

It also does not require any particular order for those objects, just
a consistent one - that might become visible if the objects overlap in
memory. IOW, if you have two overlapping objects of 10 bytes each, A
and B, with an overlap of five bytes, such that A[5] is the same
position in storage as B[0], there's no requirement that std::less
actually order A before B. Of course, that kind of overlap is
undefined behavior itself.
Spiros Bousbouras
2017-04-10 12:36:00 UTC
Permalink
Raw Message
On Sun, 9 Apr 2017 22:46:04 -0400
Post by James Kuyper
Post by Spiros Bousbouras
[ Crossposted to comp.lang.c since we need some C lawyers ;-) ]
On Sun, 09 Apr 2017 14:53:01 GMT
...
Post by Spiros Bousbouras
I once
|It seems to me that
|
| if ((size_t)(dest-src) < len)
|
|is a more efficient way to test whether dest is inside [src, src+len).
|
|Of course, since this code contains more than three tokens, it is
|likely to contain undefined behaviour
That was mostly in jest, but a post or two later, someone explained
that yes, this code is undefined. So while I do not see anything
undefined about the code you posted, just let the language lawyers at
it, and they will point it out.
Ok , we'll see what the language lawyers say.
It would help to know what the relevant data types are, but the comment
about "dest is inside [src, src+len] suggests that dest and src are both
pointers.
Actually the code in question was

int is_leap_year(unsigned int year) {
return 1 ;
}

.It was for this code that Anton posted

I don't think it is trivial to see that this is defined.
and
So while I do not see anything undefined about the code you posted,
just let the language lawyers at it, and they will point it out.
Ben Bacarisse
2017-04-10 13:09:17 UTC
Permalink
Raw Message
Spiros Bousbouras <***@gmail.com> writes:
<snip>
Post by Spiros Bousbouras
Actually the code in question was
int is_leap_year(unsigned int year) {
return 1 ;
}
.It was for this code that Anton posted
I don't think it is trivial to see that this is defined.
and
So while I do not see anything undefined about the code you posted,
just let the language lawyers at it, and they will point it out.
It's fine. The only possible question mark is the name since some
identifiers with external linkage that start with "is" are reserved, but
the reserved names are those where the following letter is lower-case.
is_<anything> is fine.
--
Ben.
Spiros Bousbouras
2017-04-10 13:07:51 UTC
Permalink
Raw Message
On Sun, 9 Apr 2017 22:46:04 -0400
Post by James Kuyper
Post by Spiros Bousbouras
[ Crossposted to comp.lang.c since we need some C lawyers ;-) ]
On Sun, 09 Apr 2017 14:53:01 GMT
[...]
Post by James Kuyper
Post by Spiros Bousbouras
Before you go to the C standards committee you need to have a specific proposal. I
take it that for C to be a suitable back end for implementing Forth , you want
comparison of C pointers pointing to different objects to be defined. Could you
Note: C++ requires that std::less<T*> must provide a total ordering of
all pointers of type T*, even if the "<" operator does not provide such
an ordering (20.9.5p14). C++ has a sufficiently strong shared heritage
with C that the fact that the C++ standard can successfully impose this
requirement strongly suggests that the C standard could impose a similar
requirement.
However, it's telling that C++ did NOT impose this requirement on the
"<" operator itself, but only on std::less<T*>. That's because there are
some systems where meeting this requirement makes std::less<T*>(p,q)
substantially slower than p<q.
I wonder if for Anton's purposes and for a big endian machine the following
would do

int compare_pointers_to_void(void **p1 , void **p2) {
    return memcmp(*p1 , *p2 , sizeof(void *)) ;
}

For a little endian machine you would have to write it by hand :

int compare_pointers_to_void(void **p1 , void **p2) {
    size_t i = sizeof(void *) - 1 ;
    const unsigned char * const pc1 = *p1 , * const pc2 = *p2 ;
    while (1) {
        if (pc1[i] > pc2[i]) return 1 ;
        if (pc1[i] < pc2[i]) return -1 ;
        if (i == 0) return 0 ;
        i-- ;
    }
}
Ben Bacarisse
2017-04-10 15:18:18 UTC
Permalink
Raw Message
Post by Robert Wessel
On Sun, 9 Apr 2017 22:46:04 -0400
Post by James Kuyper
Post by Spiros Bousbouras
[ Crossposted to comp.lang.c since we need some C lawyers ;-) ]
On Sun, 09 Apr 2017 14:53:01 GMT
[...]
Post by James Kuyper
Post by Spiros Bousbouras
Before you go to the C standards committee you need to have a specific proposal. I
take it that for C to be a suitable back end for implementing Forth , you want
comparison of C pointers pointing to different objects to be defined. Could you
Note: C++ requires that std::less<T*> must provide a total ordering of
all pointers of type T*, even if the "<" operator does not provide such
an ordering (20.9.5p14). C++ has a sufficiently strong shared heritage
with C that the fact that the C++ standard can successfully impose this
requirement strongly suggests that the C standard could impose a similar
requirement.
However, it's telling that C++ did NOT impose this requirement on the
"<" operator itself, but only on std::less<T*>. That's because there are
some systems where meeting this requirement makes std::less<T*>(p,q)
substantially slower than p<q.
I wonder if for Anton's purposes and for a big endian machine the following
would do
int compare_pointers_to_void(void **p1 , void **p2) {
return memcmp(*p1 , *p2 , sizeof(void *)) ;
}
That seems unlikely. You might have meant

int compare_pointers_to_void(void **p1, void **p2) {
return memcmp(p1, p2, sizeof (void *));
}

or, possibly,

int compare_pointers_to_void(void *p1, void *p2) {
return memcmp(&p1, &p2, sizeof (void *));
}

Note that using the idiom:

memcmp(<exp1>, <exp2>, sizeof *<exp1>);

can highlight indirection issues like this. Your example written in
this style is

int compare_pointers_to_void(void **p1, void **p2) {
return memcmp(*p1, *p2, sizeof **p1);
}

and sizeof **p1 is sizeof a void object -- undefined.

But even fixed, I doubt that comparing the representation of the
pointers is a good idea. If the system uses some peculiar segmented
architecture, that's going to go wrong. Even though it's not portable,
the best solution is usually to convert to an integer type. Where that
is meaningful, C99 implementations provide the type uintptr_t in
stdint.h.
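
Along the lines Ben suggests, a non-portable but commonly workable sketch
(my example, not from his post) :

#include <stdint.h>

/* uintptr_t is optional in C99, and nothing in the standard says the
   resulting integers order pointers consistently; that guarantee has to
   come from the implementation's documentation, not from the standard. */
static int ptr_less(const void *p, const void *q)
{
    return (uintptr_t)p < (uintptr_t)q ;
}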

<snip>
--
Ben.
James R. Kuyper
2017-04-10 15:51:37 UTC
Permalink
Raw Message
Post by Robert Wessel
On Sun, 9 Apr 2017 22:46:04 -0400
...
Post by Robert Wessel
Post by James Kuyper
Note: C++ requires that std::less<T*> must provide a total ordering of
all pointers of type T*, even if the "<" operator does not provide such
an ordering (20.9.5p14). C++ has a sufficiently strong shared heritage
with C that the fact that the C++ standard can successfully impose this
requirement strongly suggests that the C standard could impose a similar
requirement.
However, it's telling that C++ did NOT impose this requirement on the
"<" operator itself, but only on std::less<T*>. That's because there are
some systems where meeting this requirement makes std::less<T*>(p,q)
substantially slower than p<q.
I wonder if for Anton's purposes and for a big endian machine the following
would do
int compare_pointers_to_void(void **p1 , void **p2) {
return memcmp(*p1 , *p2 , sizeof(void *)) ;
}
<sidetrack>
That seems unnecessarily complicated. To use it, if the pointers you
want to compare are the values of expressions, you first have to store
those values in void*. The same is true if you want to compare pointers
that are not of type "void*". Then you would have to pass pointers to
those objects to compare_pointers_to_void(). Wouldn't it be simpler to
define:

int compare_pointers_to_void(void *p1, void *p2)
{
return memcmp(&p1, &p2, sizeof (void *));
}

This works on pointers of arbitrary types, which would all be implicitly
converted to void* before the function call, and requires no extra work
to apply it to a pointer-valued expression, rather than a pointer stored
in an object.
</sidetrack>

It's unlikely that such a work-around would actually work as desired on
any system where "<" would not - if the above code would in fact work,
that's precisely how "<" should be implemented by the compiler.
Implementations where "<" doesn't provide a total ordering generally use
a pointer representation for which your function would sometimes give
either the wrong order, or an order that is inconsistent. On some
systems, pointers to the same location in memory can have different
representations, so your function could even return a non-zero value for
equivalent pointers.
Spiros Bousbouras
2017-04-12 21:32:26 UTC
Permalink
Raw Message
On Mon, 10 Apr 2017 11:51:37 -0400
Post by James R. Kuyper
Post by Robert Wessel
On Sun, 9 Apr 2017 22:46:04 -0400
...
Post by Robert Wessel
Post by James Kuyper
Note: C++ requires that std::less<T*> must provide a total ordering of
all pointers of type T*, even if the "<" operator does not provide such
an ordering (20.9.5p14). C++ has a sufficiently strong shared heritage
with C that the fact that the C++ standard can successfully impose this
requirement strongly suggests that the C standard could impose a similar
requirement.
However, it's telling that C++ did NOT impose this requirement on the
"<" operator itself, but only on std::less<T*>. That's because there are
some systems where meeting this requirement makes std::less<T*>(p,q)
substantially slower than p<q.
I wonder if for Anton's purposes and for a big endian machine the following
would do
int compare_pointers_to_void(void **p1 , void **p2) {
return memcmp(*p1 , *p2 , sizeof(void *)) ;
}
<sidetrack>
That seems unnecessarily complicated. To use it, if the pointers you
want to compare are the values of expressions, you first have to store
those values in void*. The same is true if you wan to compare pointers
that are not of type "void*". Then you would have to pass pointers to
those objects to compare_pointers_to_void(). Wouldn't it be simpler to
int compare_pointers_to_void(void *p1, void *p2)
{
return memcmp(&p1, &p2, sizeof (void *));
}
This works on pointers of arbitrary types, which would all be implicitly
converted to void* before the function call, and requires no extra work
to apply it to a pointer-valued expression, rather than a pointer stored
in an object.
</sidetrack>
Yes , your version is better.
Post by James R. Kuyper
It's unlikely that such a work-around would actually work as desired on
any system where "<" would not - if the above code would in fact work,
that's precisely how "<" should be implemented by the compiler.
Isn't it more likely that the compiler would compare pointers by a single
assembly instruction as opposed to a call to memcmp() ? In any case , the
whole point of my code is that it's defined even if the pointers point to
different objects. Whereas using < , if the pointers point to different
objects then the behaviour is undefined. So if the compiler can prove they
point to different objects then instead of emitting the "obvious" assembly
instruction which presumably would do what Anton wants , it may decide
instead to be a smart ass and do something totally unpredictable.
[ If I ever write a C compiler I will consider adding a -smartass option.
Precise functionality to be decided. ]
Post by James R. Kuyper
Implementations where "<" doesn't provide a total ordering generally use
a pointer representation for which your function would sometimes give
either the wrong order, or an order that is inconsistent.
When you say "wrong order" do you mean that for pointers pointing to the
same object , compare_pointers_to_void() might not give the same result
as < ? And what about inconsistent ? Do you mean not satisfy transitivity ?
Post by James R. Kuyper
On some
systems, pointers to the same location in memory can have different
representations, so your function could even return a non-zero value for
equivalent pointers.
Fascinating. Can you give an example ?
--
Apple Computer reported today that it has developed computer chips that
can store and play music inside women's breasts. This is considered to
be a major breakthrough because women are always complaining about men
staring at their breasts and not listening to them.
<***@netfunny.com>
James R. Kuyper
2017-04-12 22:29:26 UTC
Permalink
Raw Message
Post by Spiros Bousbouras
On Mon, 10 Apr 2017 11:51:37 -0400
Post by James R. Kuyper
Post by Robert Wessel
On Sun, 9 Apr 2017 22:46:04 -0400
...
Post by Robert Wessel
Post by James Kuyper
Note: C++ requires that std::less<T*> must provide a total ordering of
all pointers of type T*, even if the "<" operator does not provide such
an ordering (20.9.5p14). C++ has a sufficiently strong shared heritage
with C that the fact that the C++ standard can successfully impose this
requirement strongly suggests that the C standard could impose a similar
requirement.
However, it's telling that C++ did NOT impose this requirement on the
"<" operator itself, but only on std::less<T*>. That's because there are
some systems where meeting this requirement makes std::less<T*>(p,q)
substantially slower than p<q.
I wonder if for Anton's purposes and for a big endian machine the following
would do
...
Post by Spiros Bousbouras
Post by James R. Kuyper
<sidetrack>
...
Post by Spiros Bousbouras
Post by James R. Kuyper
int compare_pointers_to_void(void *p1, void *p2)
{
return memcmp(&p1, &p2, sizeof (void *));
}
...
Post by Spiros Bousbouras
Post by James R. Kuyper
</sidetrack>
Yes , your version is better.
Post by James R. Kuyper
It's unlikely that such a work-around would actually work as desired on
any system where "<" would not - if the above code would in fact work,
that's precisely how "<" should be implemented by the compiler.
Isn't it more likely that the compiler would compare pointers by a single
assembly instruction as opposed a call to memcmp() ?
I should have said "that's precisely equivalent to how "<" should be
implemented by the compiler". If the above code would work as desired,
then the assembly language instruction implementing p<q must perform
essentially the same job as memcmp(&p,&q,sizeof p)<0. To put it another
way, on such a platform, a sufficiently aggressively optimizing compiler
might optimize "memcmp(&p, &q, sizeof p) < 0" into that assembly
language instruction.
Post by Spiros Bousbouras
In any case , the
whole point of my code is that it's defined even if the pointers point to
different objects. Whereas using < , if the pointers point to different
objects then the behaviour is undefined. So if the compiler can prove they
point to different objects then instead of emitting the "obvious" assembly
instruction which presumably would do what Anton wants , it may decide
instead to be a smart ass and do something totally unpredictable.
That's a legitimate point; if I needed the ability to compare pointers
to unrelated objects, I'd probably use C++ and std::less<T*> rather than
writing such a function. I'd expect the resulting program to execute
exactly as fast as direct use of "<", on any platform where
compare_pointers_to_void() would work as desired.

However, if you can't use C++, and you know that, for all of the
implementations where you need your code to work,
compare_pointers_to_void() does produce the results that you need, it's
a valid way of preventing dangerous optimizations that use of "<" would
allow.

Our client imposes stricter portability requirements than that on my
code. It must work correctly even on implementations where
compare_pointers_to_void() would not work as desired, so that wouldn't
be an option for me. This is not stated explicitly, but is implied by a
more general prohibition on writing code that works as intended only
with specific compilers.
Post by Spiros Bousbouras
Post by James R. Kuyper
Implementations where "<" doesn't provide a total ordering generally use
a pointer representation for which your function would sometimes give
either the wrong order, or an order that is inconsistent.
When you say "wrong order" do you mean that for pointers pointing to the
same object , compare_pointers_to_void() might not give the same result
as < ? ...
Correct.
Post by Spiros Bousbouras
... And what about inconsistent ? Do you mean not satisfy transitivity ?
That's part of it. The following paragraph is about one of the others :
Post by Spiros Bousbouras
Post by James R. Kuyper
On some
systems, pointers to the same location in memory can have different
representations, so your function could even return a non-zero value for
equivalent pointers.
Fascinating. Can you give an example ?
The case I'm most familiar with is fairly out-of-date: a segment:offset
architecture where the segment size was only 64 bytes, so that a pointer
represented as 0x1234:0x7890 would point at the exact same location as
one represented by 0x1235:0x7850. Whether any more modern architecture
has multiple valid representations for the same memory location, I have
no idea. But since the C standard allows it, and has always allowed it,
I wouldn't recommend assuming that no modern implementation takes
advantage of that fact. As I mentioned above, I'm actually prohibited
from making such assumptions.
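
(Spelling out the arithmetic under the 64-byte granule that example implies:
0x1234 * 0x40 + 0x7890 = 0x50590 and 0x1235 * 0x40 + 0x7850 = 0x50590, so the
two representations name the same byte even though they compare unequal
byte-wise.)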
s***@casperkitty.com
2017-04-12 22:34:53 UTC
Permalink
Raw Message
Post by Spiros Bousbouras
Isn't it more likely that the compiler would compare pointers by a single
assembly instruction as opposed a call to memcmp() ? In any case , the
whole point of my code is that it's defined even if the pointers point to
different objects. Whereas using < , if the pointers point to different
objects then the behaviour is undefined. So if the compiler can prove they
point to different objects then instead of emitting the "obvious" assembly
instruction which presumably would do what Anton wants , it may decide
instead to be a smart ass and do something totally unpredictable.
[ If I ever write a C compiler I will consider adding a -smartass option.
Precise functionality to be decided. ]
Unless a system uses big-endian byte ordering, using memcmp on pointers
within an object would often yield a different ordering from using "<" on
those pointers.
Post by Spiros Bousbouras
Post by David Brown
On some
systems, pointers to the same location in memory can have different
representations, so your function could even return a non-zero value for
equivalent pointers.
Fascinating. Can you give an example ?
The most likely scenario where that would arise would be on systems where
pointer types may include padding bits. If an 80386 compiler stored each
pointer as a 32-bit offset, 16-bit segment, and 16 padding bits, for
example, a pointer assignment might use a 32-bit store to write the offset,
a 16-bit store to write the segment, and not do anything to write the
padding bits. Depending upon what the padding bits were used for, a memcmp
upon two equal pointers might yield arbitrary results.
Spiros Bousbouras
2017-04-12 23:41:41 UTC
Permalink
Raw Message
On Wed, 12 Apr 2017 15:34:53 -0700 (PDT)
Post by s***@casperkitty.com
Post by Spiros Bousbouras
Isn't it more likely that the compiler would compare pointers by a single
assembly instruction as opposed a call to memcmp() ? In any case , the
whole point of my code is that it's defined even if the pointers point to
different objects. Whereas using < , if the pointers point to different
objects then the behaviour is undefined. So if the compiler can prove they
point to different objects then instead of emitting the "obvious" assembly
instruction which presumably would do what Anton wants , it may decide
instead to be a smart ass and do something totally unpredictable.
[ If I ever write a C compiler I will consider adding a -smartass option.
Precise functionality to be decided. ]
Unless a system uses big-endian byte ordering, using memcmp on pointers
within an object would often yield a different ordering from using "<" on
those pointers.
Yes. As I said in <HsLGA.931050$***@fx46.am4> {

I wonder if for Anton's purposes and for a big endian machine the following
would do

int compare_pointers_to_void(void **p1 , void **p2) {
return memcmp(*p1 , *p2 , sizeof(void *)) ;
}

For a little endian machine you would have to write it by hand :

int compare_pointers_to_void(void **p1 , void **p2) {
size_t i = sizeof(void *) - 1 ;
const unsigned char * const pc1 = *p1 , * const pc2 = *p2 ;
while (1) {
if (pc1[i] > pc2[i]) return 1 ;
if (pc1[i] < pc2[i]) return -1 ;
if (i == 0) return 0 ;
i-- ;
}
}


}
Albert van der Horst
2017-04-13 00:55:42 UTC
Permalink
Raw Message
Post by Spiros Bousbouras
On Mon, 10 Apr 2017 11:51:37 -0400
<SNIP>
Post by Spiros Bousbouras
Post by David Brown
On some
systems, pointers to the same location in memory can have different
representations, so your function could even return a non-zero value for
equivalent pointers.
Fascinating. Can you give an example ?
Come on, can't you come up with an example yourself?

An example is with the segmented addressing on the 8088.
0 1A0A FAR@
and
1 19FA FAR@
access the same location.
So on a Forth with a particular memory model e.g. a 32 bit Forth
with 1 Mbyte memory you'd want a special address comparision where
01A0A 119FA ADDRESS=
gives TRUE.
The memory compare solution doesn't work and neither does the
32 bit = that works for numbers.
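
(For reference: on the 8088 the linear address is segment * 16 + offset, so
0:1A0A resolves to 0x1A0A and 1:19FA resolves to 0x10 + 0x19FA = 0x1A0A as
well, which is why neither a byte-wise compare nor a plain 32-bit = treats
the two cells as the same address.)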

Groetjes Albert
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
***@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
s***@casperkitty.com
2017-04-10 22:12:56 UTC
Permalink
Raw Message
Post by Spiros Bousbouras
The underlying principle is that one must have some rational reason to expect
code to behave in a certain way. For example if I have C code which includes
frooboz(123) ;
I must have have some rational reason to expect this to translate without errors and
do something specific I have in mind. Since the C standard does not define frooboz()
, this reason must come from somewhere else. So we definitely need some stronger
concept than WUB.
On an implementation which allows C programs to invoke, or be invoked by,
code written in other languages the compiler knows nothing about, the
code:

void write_ushort(void*p, unsigned value)
{
    *(unsigned short*)p = value;
}

would be required to export the address of a function which performed some
rather specific actions at the hardware level. On a compiler targeting any
32-bit ARM platform, for example, the function would be required to store
the value held in the lower 16 bits of R1 into two bytes starting at the
the address held in R0 and then jump to the address in R14; when the
function returns, R0-R3 may hold any arbitrary bit patterns, but all other
registers must have the same values as when the function was called.

If some other program were to include the code:

extern void write_ushort(void*p, unsigned value);
extern unsigned x;

unsigned test(void *p)
{
    x = 0x12345678;
    write_ushort(p, 0x4321);
    return test;
}

and the compiler didn't know anything about write_ushort() other than its
prototype, the function "test" would be required to store 0x12345678 into
four bytes at external symbol x, load R1 with the value 0x4321 and R0 with
the passed-in pointer, and then generate a call to write_ushort, placing the
return address in R14 (the return address is normally placed in R14 by
the same BL instruction that does the call). After doing that, the code
must read four bytes from external symbol x into R0 and return to the
calling code.

If either function were replaced by a function, not in C, whose machine-
level behavior precisely matched what would be necessary when the functions
interact with outside code, behavior of the pair of functions would be
defined. The $50,000 question is whether a definition of how a C function
must behave when calling a function the compiler knows nothing about should
be regarded as defining the behavior in cases where both functions are
written in C.
Post by Spiros Bousbouras
To get to
one of your favorite examples , if your C code has signed integer oveflow what's
your rational reason to expect that it will do something specific (like wraparound)?'
The authors of the C89 Standard noted in their rationale that the majority
of then-existing implementations used silent wraparound overflow semantics,
and that on such implementations, promotion of short unsigned types to signed
int would yield the same behavior as promotion to unsigned, *even in cases
involving overflow*, unless the results were used in certain ways. What
reason could the authors of the Standard have for mentioning that, if they
didn't think it reasonable to expect that commonplace implementations would
behave in such fashion in the relevant cases?
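
To make the scenario concrete, here is the classic case (my example, not from
the rationale), assuming an implementation with 16-bit unsigned short and
32-bit int :

/* Both operands promote to signed int, so the multiplication is performed
   in int and can overflow even though the programmer was thinking in
   unsigned terms; e.g. a == b == 0xFFFF gives 0xFFFE0001, which exceeds
   INT_MAX on a 32-bit int. */
unsigned mul16(unsigned short a, unsigned short b)
{
    return a * b ;   /* a*b is computed in (signed) int after promotion */
}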
Post by Spiros Bousbouras
But of course the GCC maintainers don't accept this as a bug: They say
that reading from the buffer after the free() is undefined behaviour,
therefore does not happen, and therefore they are in their rights to
conjure up nasal demons and "optimize" the memset()/bzero() away. And
this is not an issue that could turn out one way or the other, this
memset()/bzero() is there exactly for mitigating information leaks,
such as the one that is the main reason of the CVE.
I provided a reference above and it does not support your claims that this
is what caused the bug. Do you have a reference which says otherwise ?
Systems programming code often needs to guard against a variety of security
risks which are not defined by the C Standard. For a compiler to undermine
that would be like a lock manufacturer saying that the purpose of a lock is
simply to prevent the cylinder from rotating when an incorrect key is
inserted, and there's no need to spend money on pick-resistant pins since
it's not possible to pick a lock while an incorrect key is inserted, and
since the lock's behavior isn't specified if someone inserts something that
isn't a key.
Post by Spiros Bousbouras
I think that nasal-demon C is not a suitable language for anything.
I use it all the time with perfectly predictable results. I have my complaints
but they are not caused by undefined behaviour.
It's fine for some kinds of programs whose input only comes from trusted
sources. It can make certain kinds of programming needlessly difficult,
however, since it tends to undermine code which is designed to include
redundant layers of protection against certain threats. Code might be
engineered so that a certain bad event could only be triggered by an
impossible set of coincidences, but having a compiler decide that certain
things "can't" happen, and thus code which would trap them can be eliminated,
may result in all the safety-checking code getting "optimized out".
Post by Spiros Bousbouras
"Do the natural thing" is not a meaningful definition. I don't
C* A language (or family of languages) where language constructs correspond
directly to things that the hardware does. E.g., * corresponds to what a
hardware multiply instruction does. In terms of the C standard, conforming
programs are written in C .
I couldn't provide such a set of axioms because I've never implemented a Forth so I
don't know in what ways standard C as a back end is lacking.
Most C implementations target platforms which have hardware concepts
corresponding to concepts in C; the vast majority of microcomputer
platforms would, from a C standpoint, differ only in the sizes and
endianness of their various data types. Because some platforms have
various kinds of non-linear addressing schemes, the Standard refrains
from defining a particular mapping between numbers and pointers. On
many platforms, however, there will be only one logical thing that something
like:

*(unsigned int volatile*)0x12345678 = 0x1234;

could possibly mean, and there should be a way of saying that "This code
is only suitable for compilers that target a platform where there is a
clear and obvious mapping between numbers and pointers, and which perform
conversions involving those types using that mapping." While there might
in theory be some squabbling about what is "clear and obvious", the vast
majority of platforms would have one overwhelmingly clear and obvious
mapping; only in rare cases where code targets a specific platform for
which multiple plausible mappings exist should code need to be more specific
than "use the obvious one".
Spiros Bousbouras
2017-04-12 23:34:31 UTC
Permalink
Raw Message
On Mon, 10 Apr 2017 15:12:56 -0700 (PDT)
Post by s***@casperkitty.com
Post by Spiros Bousbouras
The underlying principle is that one must have some rational reason to expect
code to behave in a certain way. For example if I have C code which includes
frooboz(123) ;
I must have have some rational reason to expect this to translate without errors and
do something specific I have in mind. Since the C standard does not define frooboz()
, this reason must come from somewhere else. So we definitely need some stronger
concept than WUB.
On an implementation which allows C programs to invoke, or be invoked by,
code written in other languages the compiler knows nothing about, the
void write_ushort(void*p, unsigned value)
{
*(unsigned short*)p = value;
}
would be required to export the address of a function which performed some
rather specific actions at the hardware level. On a compiler targeting any
32-bit ARM platform, for example, the function would be required to store
the value held in the lower 16 bits of R1 into two bytes starting at the
the address held in R0 and then jump to the address in R14; when the
function returns, R0-R3 may hold any arbitrary bit patterns, but all other
registers must have the same values as when the function was called.
extern void write_ushort(void*p, unsigned value);
extern unsigned x;
unsigned test(void *p)
{
x = 0x12345678;
write_ushort(p, 0x4321);
return test;
}
You probably mean
return x;
Post by s***@casperkitty.com
and the compiler didn't know anything about write_ushort() other than its
prototype, the function "test" would be required to store 0x12345678 into
four bytes at external symbol x, load R1 with the value 0x4321 and R0 with
the passed-in pointer, and then generate a call to write_ushort, placing the
return address in R14 (the return address is normally placed in R14 by
the same BL instruction that does the call). After doing that, the code
must read four bytes from external symbol x into R0 and return to the
calling code.
If either function were replaced by a function, not in C, whose machine-
level behavior precisely matched what would be necessary when the functions
interact with outside code, behavior of the pair of functions would be
defined. The $50,000 question is whether a definition of how a C function
must behave when calling a function the compiler knows nothing about should
be regarded as defining the behavior in cases where both functions are
written in C.
I don't know why you had to include all those details about ARM to ask the
question but in any case the answer seems to me almost tautologically to be
yes. To be sure one would need to see context but in general , definition is
different than implementation. From the point of view of C , one only needs
to know the signature of the function and its behaviour ; whether it is
implemented in C or something else is irrelevant. Even the standard C
functions could be implemented in any language.
Post by s***@casperkitty.com
Post by Spiros Bousbouras
To get to
one of your favorite examples , if your C code has signed integer overflow what's
your rational reason to expect that it will do something specific (like wraparound)?
The authors of the C89 Standard noted in their rationale that the majority
of then-existing implementations used silent wraparound overflow semantics,
and that on such implementations, promotion of short unsigned types to signed
int would yield the same behavior as promotion to unsigned, *even in cases
involving overflow*, unless the results were used in certain ways. What
reason could the authors of the Standard have for mentioning that, if they
didn't think it reasonable to expect that commonplace implementations would
behave in such fashion in the relevant cases?
Well , an implementation could be "commonplace" but not be among the majority
of implementations which "used silent wraparound overflow semantics" (I presume
for signed integers as well as unsigned). Then you also add the caveat
"unless the results were used in certain ways". Does the C89 rationale say
in which ways ? If not then the rationale doesn't give any practically useful
information. Finally , does it give a way to know if the implementation one
uses is among the "majority" which behave in that way ? If not , then that's
another reason the information is not useful.
Post by s***@casperkitty.com
Post by Spiros Bousbouras
But of course the GCC maintainers don't accept this as a bug: They say
that reading from the buffer after the free() is undefined behaviour,
therefore does not happen, and therefore they are in their rights to
conjure up nasal demons and "optimize" the memset()/bzero() away. And
this is not an issue that could turn out one way or the other, this
memset()/bzero() is there exactly for mitigating information leaks,
such as the one that is the main reason of the CVE.
I provided a reference above and it does not support your claims that this
is what caused the bug. Do you have a reference which says otherwise ?
Systems programming code often needs to guard against a variety of security
risks which are not defined by the C Standard. For a compiler to undermine
that would be like a lock manufacturer saying that the purpose of a lock is
simply to prevent the cylinder from rotating when an incorrect key is
inserted, and there's no need to spend money on pick-resistant pins since
it's not possible to pick a lock while an incorrect key is inserted, and
since the lock's behavior isn't specified if someone inserts something that
isn't a key.
This is hopelessly vague. The reference I provided (and my own remarks) describe
much more specific reasons why the code was buggy and these reasons have nothing
to do with compiler optimisations.
Post by s***@casperkitty.com
Post by Spiros Bousbouras
I think that nasal-demon C is not a suitable language for anything.
I use it all the time with perfectly predictable results. I have my complaints
but they are not caused by undefined behaviour.
It's fine for some kinds of programs whose input only comes from trusted
sources. It can make certain kinds of programming needlessly difficult,
however, since it tends to undermine code which is designed to include
redundant layers of protection against certain threats. Code might be
engineered so that a certain bad event could only be triggered by an
impossible set of coincidences, but having a compiler decide that certain
things "can't" happen, and thus code which would trap them can be eliminated,
may result in all the safety-checking code getting "optimized out".
I'd love to see an example.
Post by s***@casperkitty.com
Post by Spiros Bousbouras
"Do the natural thing" is not a meaningful definition. I don't
C* A language (or family of languages) where language constructs correspond
directly to things that the hardware does. E.g., * corresponds to what a
hardware multiply instruction does. In terms of the C standard, conforming
programs are written in C .
I couldn't provide such a set of axioms because I've never implemented a Forth so I
don't know in what ways standard C as a back end is lacking.
Most C implementations target platforms which have hardware concepts
corresponding to concepts in C; the vast majority of microcomputer
platforms would, from a C standpoint, differ only in the sizes and
endianness of their various data types.
Perhaps they have hardware concepts corresponding to some concepts in C but
not all. Obviously they won't have hardware concepts corresponding to C
declarations. Also , likely several assembly instructions will be needed to
implement a for .But let's take those C constructs for which there is an
"obvious" assembly instruction. Even for those , a straightforward translation
from C to the assembly instructions would be unnecessarily slow ; for example
a straightforward translation for something like (i + j + k) * (i + j + k)
would emit twice the assembly instructions for i + j + k instead of reusing
the previous value. I don't think most programmers would want this. Some
optimisations *are* desirable. But because optimisations use heuristics there
is always the possibility that applying the heuristic on some occasion would
make the code slower or , in the case of undefined behaviour , make it do
something unexpected.
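As a concrete illustration (a sketch only, with a hypothetical function name;
no compiler is obliged to do this), common subexpression elimination lets the
compiler evaluate i + j + k once:

unsigned square_of_sum(unsigned i, unsigned j, unsigned k)
{
    /* A compiler may compute i + j + k a single time and multiply the
       result by itself, rather than emitting the additions twice. */
    return (i + j + k) * (i + j + k);
}

With unsigned operands the arithmetic is fully defined, so reusing the
intermediate value cannot change the observable result; that is why this kind
of optimisation is generally uncontroversial.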
Post by s***@casperkitty.com
Because some platforms have
various kinds of non-linear addressing schemes, the Standard refrains
from defining a particular mapping between numbers and pointers. On
many platforms, however, there will be only one logical thing that something
*(unsigned int volatile*)0x12345678 = 0x1234;
could possibly mean, and there should be a way of saying that "This code
is only suitable for compilers that target a platform where there is a
clear and obvious mapping between numbers and pointers, and which perform
conversions involving those types using that mapping." While there might
in theory be some squabbling about what is "clear and obvious", the vast
majority of platforms would have one overwhelmingly clear and obvious
mapping; only in rare cases where code targets a specific platform for
which multiple plausible mappings exist should code need to be more specific
than "use the obvious one".
There will most definitely be squabbling. Unless you have specific properties
such a mapping would satisfy , what you're proposing has no practical value
whatsoever. You seem familiar with the ARM architecture. Is there such an
obvious mapping for ARM ? If yes , is there a way to express this mapping in
terms of only C , meaning it would be understandable even by someone who
doesn't know ARM assembly ? It's not enough to present some simple cases and
say "the generated assembly instructions should do the obvious thing for this
case". An actual *definition* which extends standard C must specify what
should be done for arbitrarily complicated C constructs otherwise a compiler
writer cannot implement the extension.
--
The paper decided early on that 2016 was to be a coronation, and that all attempts
to derail Hillary's ascent to the presidency (or even to point out that it wasn't
going according to plan) would be mocked, ignored, or treated as failures to
acknowledge Empirical Reality.
https://www.currentaffairs.org/2017/01/how-the-times-failed-you
s***@casperkitty.com
2017-04-13 00:25:42 UTC
Permalink
Raw Message
Post by Spiros Bousbouras
I don't know why you had to include all those details about ARM to ask the
question but in any case the answer seems to me almost tautologically to be
yes. To be sure one would need to see context but in general , definition is
different than implementation. From the point of view of C , one only needs
to know the signature of the function and its behaviour ; whether it is
implemented in C or something else is irrelevant. Even the standard C
functions could be implemented in any language.
On many compilers, the machine-level behavior for C functions would be
defined if called from code written in other languages, and their behavior
would be defined if they called functions in other languages whose machine-
level behavior was defined as above, *but* some implementations may behave
differently when a function written in C is called from another function
also written in C.

I gave the ARM example to make clear how rigidly the hardware-level behaviors
will often be defined by platform specifications.
Post by Spiros Bousbouras
Post by s***@casperkitty.com
The authors of the C89 Standard noted in their rationale that the majority
of then-existing implementations used silent wraparound overflow semantics,
and that on such implementations, promotion of short unsigned types to signed
int would yield the same behavior as promotion to unsigned, *even in cases
involving overflow*, unless the results were used in certain ways. What
reason could the authors of the Standard have for mentioning that, if they
didn't think it reasonable to expect that commonplace implementations would
behave in such fashion in the relevant cases?
Well , an implementation could be "commonplace" but not be among the majority
of implementations which "used silent wraparound overflow semantics" (I presume
for signed integers as well as unsigned). Then you also add the caveat
"unless the results were used in certain ways". Does the C89 rationale say
in which ways ? If not then the rationale doesn't give any practically useful
information. Finally , does it give a way to know if the implementation one
uses is among the "majority" which behave in that way ? If not , then that's
another reason the information is not useful.
The rationale does specify the ways. In the document at

http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf

Look on page 44, especially the paragraph starting at line 20.

Note that the authors of the Standard did not mention that even on an
implementation targeting a two's-complement silent-wraparound platform
the behavior of signed and unsigned types might diverge in cases where
the result of an expression isn't used in any of the ways listed in
point 2, but a compiler decides to "optimize" based upon an assumption
that multiplying two values of type "unsigned short" will not yield a
value larger than "INT_MAX". I doubt they could have imagined that
anyone trying to write a quality compiler for such a system would have
it make such assumptions when the result of a multiply is converted to
an unsigned type of the same size.
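To make that concrete (a hedged sketch; the function name is mine, and it
assumes a typical platform with 16-bit unsigned short and 32-bit int):

unsigned short_product(unsigned short a, unsigned short b)
{
    /* a and b are promoted to signed int, so a * b can exceed INT_MAX
       (e.g. 65535 * 65535) and is then formally undefined, even though
       the result is immediately converted back to unsigned.  Writing
       (unsigned)a * b sidesteps the issue instead of relying on
       silent-wraparound behaviour. */
    return a * b;
}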
Post by Spiros Bousbouras
Post by s***@casperkitty.com
Post by Spiros Bousbouras
But of course the GCC maintainers don't accept this as a bug: They say
that reading from the buffer after the free() is undefined behaviour,
therefore does not happen, and therefore they are in their rights to
conjure up nasal demons and "optimize" the memset()/bzero() away. And
this is not an issue that could turn out one way or the other, this
memset()/bzero() is there exactly for mitigating information leaks,
such as the one that is the main reason of the CVE.
I provided a reference above and it does not support your claims that this
is what caused the bug. Do you have a reference which says otherwise ?
Systems programming code often needs to guard against a variety of security
risks which are not defined by the C Standard. For a compiler to undermine
that would be like a lock manufacturer saying that the purpose of a lock is
simply to prevent the cylinder from rotating when an incorrect key is
inserted, and there's no need to spend money on pick-resistant pins since
it's not possible to pick a lock while an incorrect key is inserted, and
since the lock's behavior isn't specified if someone inserts something that
isn't a key.
This is hopelessly vague. The reference I provided (and my own remarks) describe
much more specific reasons why the code was buggy and these reasons have nothing
to do with compiler optimisations.
The optimizations were applied to one part of the program and bypassed code
whose purpose was, in significant measure, to *guard against possible
erroneous exposure of data by other parts of the program*. The fact that
those other parts of the program shouldn't have exposed the contents of
storage they received doesn't change the fact that they were not supposed
to have received storage with any confidential information in the first
place.

Good security design requires having multiple largely-redundant layers,
to minimize the number of single points of failure that could create a
vulnerability. Having a compiler bypass such redundancy in the name of
"efficiency" will increase the number of dangerous single points of
failure.
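The usual shape of the problem, sketched here with hypothetical names (this is
not the code from the CVE under discussion):

#include <stdlib.h>
#include <string.h>

void discard_secret(unsigned char *buf, size_t len)
{
    /* The memset() is meant as a redundant safety layer: even if some
       other part of the program later obtains this storage, it should
       find zeros.  Because no defined execution reads buf after free(),
       a compiler performing dead-store elimination may remove the
       memset() entirely, defeating that layer. */
    memset(buf, 0, len);
    free(buf);
}

This is the pattern that motivated memset_s() in C11's Annex K and functions
such as explicit_bzero() on some platforms.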
Post by Spiros Bousbouras
Post by s***@casperkitty.com
It's fine for some kinds of programs whose input only comes from trusted
sources. It can make certain kinds of programming needlessly difficult,
however, since it tends to undermine code which is designed to include
redundant layers of protection against certain threats. Code might be
engineered so that a certain bad event could only be triggered by an
impossible set of coincidences, but having a compiler decide that certain
things "can't" happen, and thus code which would trap them can be eliminated,
may result in all the safety-checking code getting "optimized out".
I'd love to see an example.
For an example, of gcc's creativity, consider the following, tested using
the version of gcc7 hosted by godbolt.org:

unsigned mul(unsigned short x, unsigned short y)
{
    return x*y;
}

unsigned mulAddTest(unsigned short x)
{
    unsigned sum = 0;
    x |= 32768;
    for (int i = 32768; i <= x; i++)
    {
        sum += mul(i, 65535);
    }
    return sum;
}

GCC's generated code for mulAddTest will unconditionally return 2147450880
since the loop will always get to run for one pass, and the second pass of
the loop, if there were one, would attempt to multiply 32769*65535, yielding
a value 32768 larger than INT_MAX.
Post by Spiros Bousbouras
Post by s***@casperkitty.com
Most C implementations target platforms which have hardware concepts
corresponding to concepts in C; the vast majority of microcomputer
platforms would, from a C standpoint, differ only in the sizes and
endianness of their various data types.
Perhaps they have hardware concepts corresponding to some concepts in C but
not all. Obviously they won't have hardware concepts corresponding to C
declarations. Also , likely several assembly instructions will be needed to
implement a for .But let's take those C constructs for which there is an
"obvious" assembly instruction. Even for those , a straightforward translation
from C to the assembly instructions would be unnecessarily slow ; for example
a straightforward translation for something like (i + j + k) * (i + j + k)
would emit twice the assembly instructions for i + j + k instead of reusing
the previous value. I don't think most programmers would want this. Some
optimisations *are* desirable. But because optimisations use heuristics there
is always the possibility that applying the heuristic on some occasion would
make the code slower or , in the case of undefined behaviour , make it do
something unexpected.
On a 16-bit system, given something like:

long mul(int x, int y) { return x*y; }

it may be faster on some platforms for a compiler to return the numerically-
correct 32-bit result than for it to sign-extend a 16-bit result, and I would
suggest that programmers should not view numerically-correct results as
"astonishing". If what I wanted was a 16-bit result that was truncated and
sign-extended, I would have made that intention clear via:

long mul(int x, int y) { return (int)(x*y); }

The notion that a compiler might use the multiplication to make assumptions
about x and y would never have occurred to me.
Post by Spiros Bousbouras
Post by s***@casperkitty.com
While there might
in theory be some squabbling about what is "clear and obvious", the vast
majority of platforms would have one overwhelmingly clear and obvious
mapping; only in rare cases where code targets a specific platform for
which multiple plausible mappings exist should code need to be more specific
than "use the obvious one".
There will most definitely be squabbling. Unless you have specific properties
such a mapping would satisfy , what you're proposing has no practical value
whatsoever. You seem familiar with the ARM architecture. Is there such an
obvious mapping for ARM ? If yes , is there a way to express this mapping in
terms of only C , meaning it would be understandable even by someone who
doesn't know ARM assembly ? It's not enough to present some simple cases and
say "the generated assembly instructions should do the obvious thing for this
case". An actual *definition* which extends standard C must specify what
should be done for arbitrarily complicated C constructs otherwise a compiler
writer cannot implement the extension.
On the vast majority of processors, there will be some integer type such that
given a pointer p and an unsigned integer x with the same bit pattern, along
with any integer n, the pointer n+(char*)p and unsigned integer n+x will both
have the same bit pattern. I would suggest that having each integer be mapped
to the pointer with the same bit pattern is a simple and obvious choice, and
no other mapping could be reasonably called simple and obvious.
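Stated in C, the property being claimed for such platforms might be sketched
as follows (an assumption about the platform, not a guarantee from the
Standard; uintptr_t is simply the most convenient spelling of "an unsigned
integer with the same bit pattern"):

#include <stdint.h>

int mapping_is_obvious(char *p, uintptr_t n)
{
    uintptr_t x = (uintptr_t)p;   /* assumed to have the same bit pattern as p */
    char *q = (char *)(x + n);    /* offset applied to the integer form */
    /* On the platforms being described, q compares equal to p + n; the
       Standard itself only promises anything when p + n stays within the
       object that p points into. */
    return q == p + n;
}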
Keith Thompson
2017-04-13 03:04:45 UTC
Permalink
Raw Message
***@casperkitty.com writes:
[...]
Post by s***@casperkitty.com
long mul(int x, int y) { return x*y; }
it may be faster on some platforms for a compiler to return the
numerically-correct 32-bit result than for it to sign-extend a 16-bit
result, and I would suggest that programmers should not view
numerically-correct results as "astonishing". If what I wanted was a
16-bit result that was truncated and sign-extended, I would have made that
intention clear via:
long mul(int x, int y) { return (int)(x*y); }
The notion that a compiler might use the multiplication to make
assumptions about x and y would never have occurred to me.
The notion that casting an expression that's already of type int to type
int would have some non-trivial effect would never have occurred to me.

An expression of type int yielding a "numerically-correct" result that's
outside the range INT_MIN..INT_MAX is certainly something I would find
astonishing -- until I realized that the evaluation has undefined
behavior.

And even if x*y somehow yielded a result outside the range
of int, casting that result to int would at best yield an
implementation-defined result (even ignoring the fact that the cast
logically does nothing).

If I wanted a 16-by-16-to-32 bit multiplication function, given 16-bit
int and 32-bit long, I wouldn't write either version of your mul
function. I'd probably write:

long mul(int x, int y) { return (long)x * y; }

(Or I might cast both operands just for symmetry, but the other
one will be promoted anyway.)
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"