*Post by Stefan Ram*Newsgroups: comp.lang.c,comp.lang.c++

#include <float.h>

A double has at least 14 significant digits (about 14-16) and

9007199254740991.

is the largest double value that can be added to 1.0

giving the "next" double value

Do you really mean "next double value" or "next integer"? The next

integer is 1 higher than Omega. The "next double value" will, in

general, be Omega+FLT_RADIX. I'm going to explain below why "next double

value" is problematic. Then I'll go ahead on the assumption that you

meant "next integer".

*Post by Stefan Ram*9007199254740992.

(when one adds 1.0 again, the value will not change

anymore).

I don't know whether there is a common name for such a

value. I will call it (just for this post) "omega".

(In the above example, the omega is 9007199254740991.)

Now, the standard does not require 14 significant

digits, but 10 (DBL_DIG).

I wonder whether one can somehow find the smallest

omega for the type double that is required by the

standard.

Clearly, omega >= 0 and omega <= DBL_MAX.

You've defined Omega in terms of what happens when you add 1.0 to the

number. However, that will depend upon the rounding mode.

Consider the smallest integer for which the next integer has no

representation as a double, lets call it Omega1, to keep it distinct

from your Omega.

Omega1 and Omega1+FLT_RADIX will both be representable. If FLT_RADIX==2,

those values will both be equally far from the mathematical value of

Omega1 + 1. The C standard imposes NO requirements of its own on the

accuracy of such floating point expressions (5.2.4.2.2p6). None -

whatsoever. In particular, this means that a fully conforming

implementation of C is allowed to implement floating point math so

inaccurately that DBL_MAX-DBL_MIN < DBL_MIN - DBL_MAX.

However, if __STDC_IEC_559__ is pre#defined by the implementation, then

the requirements of annex F apply (6.10.8.3p1), which are essentially

equivalent to IEC 60559:1989, which is equivalent ANSI/IEEE 754-1985

(F1p1). In that case, if FLT_RADIX == 2, and you add 1.0 to Omega1, the

result will be rounded to a value of either Omega1 or Omega1+2,

depending upon the current rounding mode. If the rounding mode is

FE_TONEAREST, FE_DOWNWARD, or FE_TOWARDZERO, then it will round to

Omega1, and that will be true of every value up to 2*Omega1, so 2*Omega1

will be the quantity you define as Omega. On the other hand, if the

rounding mode is FE_UPWARD, then Omega is the same as Omega1.

If you meant "next integer", that ambiguity disappears. Regardless of

rounding mode, Omega1 is the the smallest integer to which you can add

1.0 without getting the value of the next integer - because, by

definition, that next integer cannot be represented.

Section 5.2.4.2.2 specifies a model for how floating point types are

represented. That model need not actually be followed, but the required

behavior of floating point operations, and the ranges of representable

values are specified in terms of that model; a floating point

representation for which that model is sufficiently bad might not have

any way of correctly implementing the standard's requirements. So I'll

restrict my discussion to implementations for which that model is

perfectly accurate.

The standard uses subscripts, superscripts, and greek letters in its

description of the model, which don't necessarily display meaningfully

in a usenet message. I'll indicate subscripts and superscripts by

preceeding them with _ and ^, respectively, and I'll replace the

summation with Sum(variable, lower limit, upper limit, expression).

*Post by Stefan Ram*The following parameters are used to

s sign (+/-1)

b base or radix of exponent representation (an integer > 1)

e exponent (an integer between a minimum e_min

and a maximum e_max )

p precision (the number of base-b digits in the significand)

f_k nonnegative integers less than b (the significand digits)

x = s b^e Sum(k, 1, p, f_k b^-k) e_min <= e <= e_max

The term in that sum with the smallest value is the one for k==p. f_p

gets multiplied by both b^e and b^-p, and therefore by b^(e-p). The

lowest integer that cannot be represented exactly is one which would

require a precision p+1 to represented, because b^(e-p) is greater than

1. Therefore, the Omega1 must have e == p+1, f_1 == 1, and f_k == 0 for

all other values of k. Therefore,

Omega1 = (+1) * b^(p+1) * 1*b^(-1) == b^p

== b/b^(1-p) == FLT_RADIX/DBL_EPSILON

Note: even on implementations that pre#define __STDC_IEC_559__, a

certain amount of inaccuracy is still permitted in both the

interpretation of floating point constants and in floating point

expressions. Jut barely enough, in fact, to allow FLT_RADIX/DBL_EPSILON

to come out as Omega1-1. However, that shouldn't happen if those macros

are defined as hexadecimal floating point constants, which must be

converted exactly if they can be represented exactly.

The smallest permitted value for FLT_RADIX is 2, and the maximum

permitted value for DBL_EPSILON is 1e-9, so omega1-1 cannot be less than

2/1e-9. A value for DBL_EPSILON of exactly 1e-9 is not possible if

FLT_RADIX is 2; the closest you can get is with p=29, so DBL_EPSILON =

2^(-30) == 1.0/(1,073,741,824), so the smallest permitted value for

omega1-1 is 2/(2^-30) == 2,147,483,648.

I believe this analysis is not correct. The Standard gives a

minimum value for DBL_DIG of 10. Any implementation that uses a

represent them exactly). Unless you are using some idea of omega