Hi Bruno,

First, may I ask why you are surprised that different processors do floating point numbers slightly differently, particularly when pushing the limits of precision?

Solaris SPARC:

> [eclipse 1]: N is breal(0_1).
>
> N = -4.94065645841247e-324__4.94065645841247e-324
> Yes (0.00s cpu)

Linux P4:

> [eclipse 1]: N is breal(0_1).
>
> N = -2.2250738585072014e-308__2.2250738585072014e-308
> Yes (0.00s cpu)

This difference is a result of Solaris defining MINDOUBLE to be the smallest denormalised double, while Linux defines it to be the smallest normalised double. It only crops up in the special case where a zero bound is being "widened" for numerical safety. We will look into whether this should be made consistent.

Solaris SPARC:

> [eclipse 2]: N is breal(rational(16.0)),       %% Interval containing 16.0
>         breal_max(N,SN),                       %% Successor float of 16.0
>         Delta is (SN - 16.0)/2.0,              %% Right error of 16.0
>         Sixteen is 16.0 + Delta,               %% Are we still on 16.0?
>         BrDelta is breal(rational(Delta)),     %% Interval enclosing Delta
>         breal_max(BrDelta,SDelta),             %% Successor of Delta
>         NotSixteen is 16.0 + SDelta.
>
> N = 15.999999999999998__16.000000000000004
> SN = 16.000000000000004
> Sixteen = 16.0
> Delta = 1.7763568394002505e-15
> BrDelta = 1.7763568394002503e-15__1.7763568394002509e-15
> NotSixteen = 16.000000000000004
> SDelta = 1.7763568394002509e-15
> Yes (0.00s cpu)

Linux P4:

> [eclipse 2]: N is breal(rational(16.0)),       %% Interval containing 16.0
>         breal_max(N,SN),                       %% Successor float of 16.0
>         Delta is (SN - 16.0)/2.0,              %% Right error of 16.0
>         Sixteen is 16.0 + Delta,               %% Are we still on 16.0?
>         BrDelta is breal(rational(Delta)),     %% Interval enclosing Delta
>         breal_max(BrDelta,SDelta),             %% Successor of Delta
>         NotSixteen is 16.0 + SDelta.
>
> N = 15.999999999999998__16.000000000000004
> SN = 16.000000000000004
> Sixteen = 16.0
> Delta = 1.7763568394002505e-15
> BrDelta = 1.7763568394002503e-15__1.7763568394002509e-15
> NotSixteen = 16.0
> SDelta = 1.7763568394002509e-15
> Yes (0.00s cpu)

This is purely down to how the processor implements floating point arithmetic. The difference is that the i386-based processors actually use extra precision internally: their floating-point registers are 80 bits, whereas stored doubles and SPARC floating-point registers are 64 bits.

Consider some of the numbers written in fixed-point hexadecimal:

                (decimal)                 (hexadecimal)
  16.0          16.0                      10.000000000000
  SN            16.000000000000004        10.000000000001
  Delta         1.7763568394002505e-15     0.0000000000008
  SDelta        1.7763568394002509e-15     0.00000000000080000000000008

16.0 + Delta is 10.0000000000008 (hex). Internal to the i386 processor, this can be represented exactly. On the SPARC, it cannot, and must be rounded: the default when the result is exactly half-way between two representable doubles is to make the last binary digit 0, the result being 10.000000000000. On the i386, when it comes time to write the double out to memory, the same rounding operation is required, with the same result, 10.000000000000.

Now, 16.0 + SDelta is 10.00000000000080000000000008 (hex). On the SPARC, this is rounded up, to 10.000000000001, since that is the closer representable double. On the i386, the result is first rounded to the internal register's precision, resulting in 10.0000000000008000 (the closest internally representable number). When it comes time to write this out to memory, a further rounding must occur, except that the already-rounded result is, like the previous case, exactly half-way between two representable doubles, so it rounds to the "even" one again: 10.000000000000. Hence the difference in the results.

If you think this is wrong, complain to Intel.
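Both effects can be reproduced with ordinary IEEE 754 double arithmetic. Here is a minimal sketch (not from the original exchange) in Python, which assumes Python 3.9+ for math.ulp/math.nextafter and a platform doing correctly rounded 64-bit arithmetic, i.e. the SPARC-style behaviour described above:

```python
import math
import sys

# Smallest normalised vs smallest denormalised double -- the two values
# that Linux and Solaris (respectively) used for MINDOUBLE:
print(sys.float_info.min)   # 2.2250738585072014e-308 (smallest normalised)
print(5e-324)               # 5e-324 (smallest denormalised)

# Delta is half an ulp of 16.0, so 16.0 + Delta is exactly half-way between
# 16.0 and its successor; round-half-to-even rounds back down to 16.0:
delta = math.ulp(16.0) / 2
print(delta)                # 1.7763568394002505e-15
print(16.0 + delta)         # 16.0

# The successor of Delta tips the sum just past half-way, so a correctly
# rounded 64-bit addition rounds up to the successor of 16.0:
sdelta = math.nextafter(delta, math.inf)
print(16.0 + sdelta)        # 16.000000000000004
```

On a P4 running the addition in 80-bit x87 registers, the last sum comes out as 16.0 instead, via the double rounding described above; modern x86 chips doing double arithmetic in SSE registers behave like the SPARC here.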
:)

Perhaps they could have recorded which way an internal double was rounded, and taken that into account when deciding which way to round the internal double when converting it to the 64-bit format. (I haven't thought this through enough; there may be some reasons why this isn't practical.)

One can put the i386 into a 64-bit internal double mode, which would presumably give the same result as for the SPARC, but this is a trade-off: the default 80-bit mode can be expected to have lower average error over a sequence of operations, but the 64-bit mode has (apparently) ever-so-slightly better worst-case error.

Is this level of accuracy important to you? If so, why? What is it you are trying to do? I'm also not sure why you're converting these things to rationals and back. What is it you are trying to achieve by doing this?

Cheers,
Warwick

P.S. You're lucky I've just been doing some work with floats and rounding modes and so on, so have learnt about all this stuff --- a month ago I probably wouldn't have been able to help you!

Received on Tue Oct 22 14:02:16 2002
This archive was generated by hypermail 2.1.8 : Wed 16 Nov 2005 06:08:18 PM GMT