Re: Bug report: double floating point arithmetic errors under Linux

From: Warwick Harvey <wh_at_icparc.ic.ac.uk>
Date: Tue 22 Oct 2002 01:02:10 PM GMT
Message-ID: <20021022140207.H9607@tempest.icparc.ic.ac.uk>

Hi Bruno,

First, may I ask why you are surprised that different processors handle
floating-point arithmetic slightly differently, particularly when pushing the
limits of precision?

Solaris SPARC:
> [eclipse 1]: N is breal(0_1).
> 
> N = -4.94065645841247e-324__4.94065645841247e-324
> Yes (0.00s cpu)

Linux P4:
> [eclipse 1]: N is breal(0_1).
> 
> N = -2.2250738585072014e-308__2.2250738585072014e-308
> Yes (0.00s cpu)

This difference is a result of Solaris defining MINDOUBLE to be the smallest
denormalised double, while Linux defines it to be the smallest normalised
double, and only crops up in the special case where a zero bound is being
"widened" for numerical safety.  We will look into whether this should be
made consistent.
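For what it's worth, the two limits are easy to see in any language with IEEE 754 doubles. A quick Python sketch (not ECLiPSe code; `sys.float_info.min` is the smallest normalised double, and stepping up from 0.0 gives the smallest denormalised one):

```python
import math
import sys

# Smallest positive *normalised* double -- what Linux's MINDOUBLE gives here.
smallest_normal = sys.float_info.min           # 2.2250738585072014e-308

# Smallest positive *denormalised* (subnormal) double, i.e. 2^-1074 --
# what Solaris's MINDOUBLE gives.
smallest_subnormal = math.nextafter(0.0, 1.0)

print(smallest_normal)     # 2.2250738585072014e-308
print(smallest_subnormal)  # 5e-324
```

The two values differ by a factor of roughly 2^52, which is exactly the discrepancy between the two zero-bound widenings above.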

Solaris SPARC:
> [eclipse 2]: N is breal(rational(16.0)),  %% Interval containing 16.0
> 	breal_max(N,SN),                  %% Successor float of 16.0
> 	Delta is (SN - 16.0)/2.0,         %% Right error of 16.0
> 	Sixteen is 16.0 + Delta,          %% Are we still on 16.0 ?
> 	BrDelta is breal(rational(Delta)),%% Interval enclosing Delta
> 	breal_max(BrDelta,SDelta),        %% Successor of Delta
> 	NotSixteen is 16.0 + SDelta.
> 
> N = 15.999999999999998__16.000000000000004
> SN = 16.000000000000004
> Sixteen = 16.0
> Delta = 1.7763568394002505e-15
> BrDelta = 1.7763568394002503e-15__1.7763568394002509e-15
> NotSixteen = 16.000000000000004
> SDelta = 1.7763568394002509e-15
> Yes (0.00s cpu)

Linux P4:
> [eclipse 2]: N is breal(rational(16.0)),  %% Interval containing 16.0
> 	breal_max(N,SN),                  %% Successor float of 16.0
> 	Delta is (SN - 16.0)/2.0,         %% Right error of 16.0
> 	Sixteen is 16.0 + Delta,          %% Are we still on 16.0 ?
> 	BrDelta is breal(rational(Delta)),%% Interval enclosing Delta
> 	breal_max(BrDelta,SDelta),        %% Successor of Delta
> 	NotSixteen is 16.0 + SDelta.
> 
> N = 15.999999999999998__16.000000000000004
> SN = 16.000000000000004
> Sixteen = 16.0
> Delta = 1.7763568394002505e-15
> BrDelta = 1.7763568394002503e-15__1.7763568394002509e-15
> NotSixteen = 16.0
> SDelta = 1.7763568394002509e-15
> Yes (0.00s cpu)

This is purely down to how the processor implements floating point
arithmetic.  The difference is that the i386-based processors actually use
extra precision internally (floating-point registers are 80 bits, whereas
stored doubles and SPARC floating-point registers are 64 bits).  Consider
some of the numbers written in fixed-point hexadecimal:

	16.0			10.000000000000
SN	16.000000000000004	10.000000000001
Delta	1.7763568394002505e-15	 0.0000000000008
SDelta	1.7763568394002509e-15   0.00000000000080000000000008
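You can check these bit patterns yourself; Python's `float.hex()` prints doubles in a normalised hex form (mantissa times a power of two, rather than the fixed-point layout above), but the mantissa bits agree, and in particular SDelta visibly differs from Delta only in the very last bit:

```python
# The four values from the table, in normalised hex floating point.
print((16.0).hex())                    # 0x1.0000000000000p+4
print((16.000000000000004).hex())      # 0x1.0000000000001p+4   (SN)
print((1.7763568394002505e-15).hex())  # 0x1.0000000000000p-49  (Delta)
print((1.7763568394002509e-15).hex())  # 0x1.0000000000001p-49  (SDelta)
```

So Delta is exactly 2^-49 (half a unit in the last place of 16.0), and SDelta is its successor, 2^-49 + 2^-101.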

16.0 + Delta is 10.0000000000008.  Internal to the i386 processor, this can
be represented exactly.  On the SPARC, it cannot and must be rounded: the
default when it's exactly half-way between two representable doubles is to
make the last binary digit 0, the result being 10.000000000000.  On the
i386, when it comes time to write the double out to memory, the same
rounding operation is required, with the same result, 10.000000000000.

Now, 16.0 + SDelta is 10.00000000000080000000000008.  On the SPARC, this is
rounded up, to 10.000000000001, since that is the closer representable
double.  On the i386, the result is first rounded to the internal register's
precision, resulting in 10.0000000000008000 (the closest internally
representable number).  When it comes time to write this out to memory, a
further rounding must occur, except that the rounded result is, like the
previous case, half-way between two representable doubles, so it rounds to
the "even" one again: 10.000000000000.  Hence the difference in the
results.
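If you want to reproduce the single-rounding (SPARC-like) behaviour without a SPARC, any implementation that rounds doubles exactly once to 64 bits will do. A Python sketch (Python on SSE hardware rounds each operation once to double precision, so it follows the SPARC path here, not the 80-bit x87 path):

```python
import math

# Successor of 16.0, and half the gap up to it (exactly half a ulp of 16.0).
sn = math.nextafter(16.0, math.inf)   # 16.000000000000004
delta = (sn - 16.0) / 2.0             # 1.7763568394002505e-15, i.e. 2^-49

# Exactly half-way between 16.0 and its successor: round-to-nearest-even
# keeps the last mantissa bit 0, so we stay on 16.0.
print(16.0 + delta)                   # 16.0

# One step above delta: now strictly past the half-way point, so a single
# 64-bit rounding goes up to the successor -- the SPARC result.
sdelta = math.nextafter(delta, math.inf)
print(16.0 + sdelta)                  # 16.000000000000004
```

The i386's 80-bit path loses the tie-breaking bit of SDelta in the first (register-precision) rounding, which is why it lands back on 16.0 in the second.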

If you think this is wrong, complain to Intel.  :)  Perhaps they could have
recorded which way an internal double was rounded, and taken that into
account when deciding which way to round the internal double when converting
it to the 64-bit format.  (I haven't thought this through enough; there may
be some reasons why this isn't practical.)

One can put the i386 into a 64-bit internal double mode, which would
presumably give the same result as for the SPARC, but this is a trade-off:
the default 80-bit mode can be expected to have lower average error over a
sequence of operations, but the 64-bit mode has (apparently)
ever-so-slightly better worst-case error.

Is this level of accuracy important to you?  If so, why?  What is it you are
trying to do?

I'm also not sure why you're converting these things to rationals and back.
What is it you are trying to achieve by doing this?

Cheers,
Warwick

P.S.  You're lucky I've just been doing some work with floats and rounding
modes and so on, so have learnt about all this stuff --- a month ago I
probably wouldn't have been able to help you!
Received on Tue Oct 22 14:02:16 2002
