Post by Stefan Ram
A double has at least 14 significant digits (about 14-16) and
is the largest double value that can be added to 1.0
giving the "next" double value
Do you really mean "next double value" or "next integer"? The next
integer is 1 higher than Omega. The "next double value" will, in
general, be Omega+FLT_RADIX. I'm going to explain below why "next double
value" is problematic. Then I'll go ahead on the assumption that you
meant "next integer".
Post by Stefan Ram
(when one adds 1.0 again, the value will not change
I don't know whether there is a common name for such a
value. I will call it (just for this post) "omega".
(In the above example, the omega is 9007199254740991.)
Now, the standard does not require 14 significant
digits, but 10 (DBL_DIG).
I wonder whether one can somehow find the smallest
omega for the type double that is required by the
Clearly, omega >= 0 and omega <= DBL_MAX.
You've defined Omega in terms of what happens when you add 1.0 to the
number. However, that will depend upon the rounding mode.
Consider the smallest integer for which the next integer has no
representation as a double, lets call it Omega1, to keep it distinct
from your Omega.
Omega1 and Omega1+FLT_RADIX will both be representable. If FLT_RADIX==2,
those values will both be equally far from the mathematical value of
Omega1 + 1. The C standard imposes NO requirements of its own on the
accuracy of such floating point expressions (22.214.171.124.2p6). None -
whatsoever. In particular, this means that a fully conforming
implementation of C is allowed to implement floating point math so
inaccurately that DBL_MAX-DBL_MIN < DBL_MIN - DBL_MAX.
However, if __STDC_IEC_559__ is pre#defined by the implementation, then
the requirements of annex F apply (126.96.36.199p1), which are essentially
equivalent to IEC 60559:1989, which is equivalent ANSI/IEEE 754-1985
(F1p1). In that case, if FLT_RADIX == 2, and you add 1.0 to Omega1, the
result will be rounded to a value of either Omega1 or Omega1+2,
depending upon the current rounding mode. If the rounding mode is
FE_TONEAREST, FE_DOWNWARD, or FE_TOWARDZERO, then it will round to
Omega1, and that will be true of every value up to 2*Omega1, so 2*Omega1
will be the quantity you define as Omega. On the other hand, if the
rounding mode is FE_UPWARD, then Omega is the same as Omega1.
If you meant "next integer", that ambiguity disappears. Regardless of
rounding mode, Omega1 is the the smallest integer to which you can add
1.0 without getting the value of the next integer - because, by
definition, that next integer cannot be represented.
Section 188.8.131.52.2 specifies a model for how floating point types are
represented. That model need not actually be followed, but the required
behavior of floating point operations, and the ranges of representable
values are specified in terms of that model; a floating point
representation for which that model is sufficiently bad might not have
any way of correctly implementing the standard's requirements. So I'll
restrict my discussion to implementations for which that model is
The standard uses subscripts, superscripts, and greek letters in its
description of the model, which don't necessarily display meaningfully
in a usenet message. I'll indicate subscripts and superscripts by
preceeding them with _ and ^, respectively, and I'll replace the
summation with Sum(variable, lower limit, upper limit, expression).
Post by Stefan Ram
The following parameters are used to
s sign (+/-1)
b base or radix of exponent representation (an integer > 1)
e exponent (an integer between a minimum e_min
and a maximum e_max )
p precision (the number of base-b digits in the significand)
f_k nonnegative integers less than b (the significand digits)
x = s b^e Sum(k, 1, p, f_k b^-k) e_min <= e <= e_max
The term in that sum with the smallest value is the one for k==p. f_p
gets multiplied by both b^e and b^-p, and therefore by b^(e-p). The
lowest integer that cannot be represented exactly is one which would
require a precision p+1 to represented, because b^(e-p) is greater than
1. Therefore, the Omega1 must have e == p+1, f_1 == 1, and f_k == 0 for
all other values of k. Therefore,
Omega1 = (+1) * b^(p+1) * 1*b^(-1) == b^p
== b/b^(1-p) == FLT_RADIX/DBL_EPSILON
Note: even on implementations that pre#define __STDC_IEC_559__, a
certain amount of inaccuracy is still permitted in both the
interpretation of floating point constants and in floating point
expressions. Jut barely enough, in fact, to allow FLT_RADIX/DBL_EPSILON
to come out as Omega1-1. However, that shouldn't happen if those macros
are defined as hexadecimal floating point constants, which must be
converted exactly if they can be represented exactly.
The smallest permitted value for FLT_RADIX is 2, and the maximum
permitted value for DBL_EPSILON is 1e-9, so omega1-1 cannot be less than
2/1e-9. A value for DBL_EPSILON of exactly 1e-9 is not possible if
FLT_RADIX is 2; the closest you can get is with p=29, so DBL_EPSILON =
2^(-30) == 1.0/(1,073,741,824), so the smallest permitted value for
omega1-1 is 2/(2^-30) == 2,147,483,648.
I believe this analysis is not correct. The Standard gives a
minimum value for DBL_DIG of 10. Any implementation that uses a
represent them exactly). Unless you are using some idea of omega