
# Floating point quirks

In this post, we’ll explore a scenario where the non-associativity of floating point arithmetic can lead us into trouble.

Let $n = 10^6$ and $S_n = \sum_{k=1}^{n} \frac{1}{k(k+1)}$. Write a computer program to compute this sum.

## Table of Contents

1. A closed form equation for $S_n$

2. Uh oh

3. An explanation

### 1. A closed form equation for $S_n$

function S = S(n)
S = 1 - 1/(1+n);
end


It’s a really simple problem, and if you do a little bit of algebra, you can make the following simplification: $\frac{1}{k(k+1)} = \frac{1}{k} - \frac{1}{k+1}$. From this, it follows immediately that the sum telescopes, so $S_n = \sum_{k=1}^{n} \left(\frac{1}{k} - \frac{1}{k+1}\right) = 1 - \frac{1}{n+1}$, or $S_n = \frac{n}{n+1}$. Great, now that we have a closed form expression for $S_n$, what else do we need to do? Of course, just to make sure that my derivation is sound, I wrote out the full sum program anyways. (Note: this behavior will only happen when you use single precision floating point storage.)

n = 1e6;
S = 0;
for k=1:n % sum the series in the forward direction
S = single(double(S) + 1/(k*(k+1))); % accumulate, truncating to single precision
end
(1 - 1/(n+1)) - S % error against the closed form
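For readers without MATLAB handy, here’s a rough Python/NumPy equivalent of the forward loop (NumPy’s `float32` standing in for `single`; variable names are mine, not from the original post):

```python
import numpy as np

n = 10**6
exact = 1.0 - 1.0 / (n + 1)  # closed form S_n = 1 - 1/(n+1)

# Forward summation: accumulate in double, truncate back to single each
# step, mimicking single(double(S) + 1/(k*(k+1))) in the MATLAB snippet.
s_fwd = np.float32(0.0)
for k in range(1, n + 1):
    s_fwd = np.float32(float(s_fwd) + 1.0 / (k * (k + 1)))

err_fwd = abs(exact - float(s_fwd))
print(err_fwd)  # noticeably large: roughly 1e-4
```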


After computing the sum for a few small values of $n$ and comparing them to the value of $1 - \frac{1}{n+1}$, I found that the loop seems to compute the same value as my expression; but for large values of $n$, the loop’s result rapidly gets farther and farther away from the value computed by the closed form. I just attributed this to roundoff error; after all, this is roundoff. I concluded that the more expensive summation isn’t a good solution at all and thought that was the end of it.

### 2. Uh oh

Later that day, I showed my friend this question and he told me that while he agrees that the closed form formula is indeed better, he sees no significant loss of accuracy in his implementation. For $n$ equal to a million, his solution only has around $10^{-8}$ of error (on par with the machine epsilon for single precision) using only single precision storage types.

“What? No, that’s not possible, I already tried it on mine and I got an error of around $10^{-4}$.”

“Well, try it yourself.”

So I ran his code in MATLAB. At first I saw no difference in our two implementations, but then it hit me: “he’s computing the series backwards.”

n = 1e6;
S = 0;
for k=n:-1:1 % sum the series in reverse, smallest terms first
S = single(double(S) + 1/(k*(k+1))); % accumulate, truncating to single precision
end
(1 - 1/(n+1)) - S % error against the closed form
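Again for those without MATLAB, a Python/NumPy sketch of the reversed loop (names are my own):

```python
import numpy as np

n = 10**6
exact = 1.0 - 1.0 / (n + 1)  # closed form S_n = 1 - 1/(n+1)

# Reverse summation: the same single-precision truncation each step,
# but adding the smallest terms first (k = n down to 1).
s_rev = np.float32(0.0)
for k in range(n, 0, -1):
    s_rev = np.float32(float(s_rev) + 1.0 / (k * (k + 1)))

err_rev = abs(exact - float(s_rev))
print(err_rev)  # tiny: comparable to single-precision representation error
```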


and sure enough, I get the same error he does. What the hell is going on?

### 3. An explanation

It turns out that the problem stems from how roundoff works under round-to-nearest addition. First, IEEE 754 guarantees that floating point addition is correctly rounded: each individual sum is the representable value closest to the exact result.

Now, a single precision floating point number can store about 8 decimal digits accurately; therefore, adding $10^{-8}$ to 1 will only return 1, since the exact sum would take more than 8 digits to store and hence the lower few digits get truncated away. In general, you can assume that if two numbers differ by 8 orders of magnitude, then adding them will have no effect.
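This absorption effect is easy to see directly; a quick sketch with NumPy’s `float32` (my example, not from the post):

```python
import numpy as np

one = np.float32(1.0)

# 1e-8 is far below the half-ulp of 1.0f (~6e-8), so it is absorbed entirely.
assert one + np.float32(1e-8) == one

# 1e-7 is above the half-ulp, so the sum does move to the next float.
assert one + np.float32(1e-7) > one
```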

Now, when we compute the series in the forward direction, what we’re really doing is computing an approximation $\hat{S}_k$ at each step such that $\hat{S}_k = \hat{S}_{k-1} \oplus \frac{1}{k(k+1)}$, where $\oplus$ denotes floating point addition. As we’ve already shown, $S_{k-1} = 1 - \frac{1}{k}$, so at each iteration, we’re making the floating point computation $\left(1 - \frac{1}{k}\right) \oplus \frac{1}{k(k+1)}$. Relatively speaking, for large values of $k$ the expression tends towards just $1 \oplus \frac{1}{k(k+1)}$; furthermore, $\frac{1}{k(k+1)}$ will become more than 8 orders of magnitude away from 1 when $k$ is merely 10-thousand (however, roundoff will be noticeable much earlier); therefore, for the last 990,000 operations, our method is effectively doing nothing. You can actually verify this in MATLAB by seeing that the forward loop outputs the same long-formatted float for $n = 10^4$ and $n = 10^6$, even though the true answers are about $10^{-4}$ apart.
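The “effectively doing nothing” claim can also be checked without MATLAB; this Python/NumPy sketch compares the forward single-precision sum at $n = 10^4$ and $n = 10^6$ (names mine):

```python
import numpy as np

s = np.float32(0.0)
s_at_1e4 = None

# Forward sum to 1e6, snapshotting the accumulator at k = 1e4.
for k in range(1, 10**6 + 1):
    s = np.float32(float(s) + 1.0 / (k * (k + 1)))
    if k == 10**4:
        s_at_1e4 = s

# The terms soon fall below the accumulator's half-ulp, so the additions
# past that point change nothing at all: both snapshots are bit-identical.
print(s_at_1e4 == s)  # True
```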

Okay, so how does doing this summation in reverse help in any way or form? Well, suppose that the reverse method produces a number $\hat{T}_k$ after adding the term for step $k$ (summing from $k = n$ down to 1). Then it’s obvious that $T_k = \sum_{j=k}^{n} \frac{1}{j(j+1)} = \frac{1}{k} - \frac{1}{n+1}$; furthermore, for large enough values of $n$, $\frac{1}{n+1} \approx 0$, so we can just take $T_k \approx \frac{1}{k}$. The inductive definition also states that $T_k = T_{k+1} + \frac{1}{k(k+1)}$. Now, there’s something intriguing about this expression: $T_{k+1} \approx \frac{1}{k+1}$, so the relative difference between $T_{k+1}$ and $\frac{1}{k(k+1)}$ is around $\frac{1/(k+1)}{1/(k(k+1))} = k$, which can never exceed $10^6$ for $n = 10^6$, i.e. the two operands are always within about 6 orders of magnitude of each other. (So the roundoff is at its worst in the beginning, but that’s also where this difference formula makes assumptions that no longer hold, namely that the linear term $\frac{1}{k}$ dominates: near $k = n$, $T_k$ behaves like the quadratic term $\frac{1}{k(k+1)}$ instead, and the accumulator is comparable in size to the term being added, so the relative magnitude between the two terms isn’t really that big of a deal. Anyways, the roundoff will progressively become better and better as $k$ shrinks.) Therefore, rather than having a vast majority of the computations do nothing, every step now counts. This is largely the reason why summing the series backwards gives less roundoff overall than doing the “same” set of operations in the other direction. This is one case where the non-associativity of arithmetic in floating point will give you trouble.
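One way to see that “every step now counts”: in the reverse direction the accumulator never exceeds about $k \le 10^6$ times the incoming term, well inside the ~8-orders-of-magnitude absorption threshold, so no addition is a no-op. A quick Python/NumPy check (my sketch):

```python
import numpy as np

n = 10**6
s = np.float32(0.0)
noops = 0  # additions that left the single-precision accumulator unchanged

for k in range(n, 0, -1):
    new_s = np.float32(float(s) + 1.0 / (k * (k + 1)))
    if new_s == s:
        noops += 1
    s = new_s

print(noops)  # 0: every single addition moved the accumulator
```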

neat   : )

##### Posted By Lee

Woosh

• James Jones

Beware of not absolutely convergent series; you can’t always shuffle the order of evaluation–in fact, there’s a theorem that shows you can make them add up to whatever the heck you want by reordering the terms.

• SirClueless

This series *is* absolutely convergent. Every term is positive, and as is shown the sum of this series is less than one, and therefore finite. (And anyways, the theorem only applies to infinite series. This is a finite sum.)

• Philip Smith

Floating point addition is commutative. The problem is that it is not *associative*
commutative: a + b = b + a
associative: (a + b) + c = a + (b + c)
This is an example of the latter.
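A minimal concrete instance of the non-associativity Philip describes, using NumPy `float32` (my example): at $2^{24}$ the spacing between adjacent single-precision floats is 2, so adding 1 twice is absorbed, while adding 2 once is not.

```python
import numpy as np

a = np.float32(2**24)  # 16777216: the ulp here is 2
b = np.float32(1.0)
c = np.float32(1.0)

left = (a + b) + c   # each +1 is a half-ulp tie, rounded to even: stays 16777216
right = a + (b + c)  # +2 is exactly representable: 16777218

print(float(left), float(right))  # 16777216.0 16777218.0
```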

• You’re right, thank you for pointing this out to me

• Prinz Rowan

This is the reason for having the function scalarproduct in some libs. You give it two vectors to multiply and sum up. If it is carefully programmed, the function first multiplies the pairs of elements, then sorts the products in a second step. Then it builds the sum from the smallest summands to the biggest.
This could be done as fast as the “plain” scalar product in hardware. You just need an adder register which is 2048 bits long, to represent the whole range of floating point values. This is not a big deal in contemporary billion-transistor CPUs.
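Python’s standard library ships a related tool: `math.fsum` tracks exact partial sums (Shewchuk’s algorithm) rather than sorting or using a wide accumulator, but it serves the same goal of removing accumulation error. A small illustration (my example, not the hardware scheme the comment describes):

```python
import math

vals = [0.1] * 10  # 0.1 is not exactly representable in binary

naive = sum(vals)        # accumulates a little roundoff at each step
exact = math.fsum(vals)  # exactly rounded sum of the given doubles

print(naive == 1.0, exact == 1.0)  # False True
```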

But game performance is just more important than correct calculations in engineering. Losing in Quake hurts so much more than faults in bridges, cars or planes….

The problem is attacked by interval arithmetic, which calculates the error interval with each operation. Unfortunately interval arithmetic has not been implemented in HW. It would show the bigger error interval in the case of summing forward.

• Dan Sutton

These things become more readily apparent if you understand what’s going on in machine code — this way you can allow for it, as your solution ultimately does: never, ever trust a floating point calculation; always round to a known level of predictable precision… and always write your code so that it understands that this is the limit of accuracy you’re dealing with.

• I’ve heard that in the days of old, it was easy to manually check for a rough estimate of how bad the roundoff for a particular set of calculations is by redoing the computation using all 4 available rounding modes and taking the maximum of the difference between the $\binom{4}{2} = 6$ different pairs. If the difference is significant, you have pretty solid evidence that your code is running into roundoff trouble. Unfortunately, most of the newer languages starting with Java no longer let you set the rounding mode.

• Charlie Tanner

Great post, very informative and interesting.

Quick question, how did your friend get 10^-8 error? Wouldn’t the minimum observable error for single precision be 2^-24 = 10^-7.2ish?

• Good question; the closed form true/expected value was computed and stored in double precision, and the truncated single precision experimental value is then promoted to double precision during the calculation of the error. If we were to calculate the error with a single precision expected value, i.e. (S - single(1 - 1/(n+1))), we would get 0 for the difference.

I thought that it might be a little bit misleading to claim that the second method is no longer subject to roundoff, so I instead opted to demonstrate that the error is no longer within the representable range of single precision (the absolute error for a result close to 1 is less than the machine epsilon). Thanks.
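The ~$10^{-8}$ figure is just the representation error of storing the true answer in single precision, which this Python/NumPy check illustrates (my sketch):

```python
import numpy as np

n = 10**6
exact = 1.0 - 1.0 / (n + 1)  # closed form in double precision

# Rounding the true answer to single precision already costs up to
# half an ulp, which near 1.0 is about 3e-8.
repr_err = abs(exact - float(np.float32(exact)))
print(repr_err)  # below 6e-8
```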

• Charlie Tanner

It all makes sense now. Thank you :)

• Gary Knott

Dear Sanity, For another floating point story, see http://www.civilized.com/files/intpow.c

• Gyro Gearloose

Be aware that single precision floating point numbers have a 23-bit mantissa, while double precision floating point numbers have a 52-bit mantissa. See IEEE 754.

• Yep, which, along with the implicit first bit, is why we’re allowed to represent roughly 7 to 8 decimal digits ($\log_{10}(2^{24}) \approx 7.2$) of accuracy in single precision and around 16 in double precision.
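The digit counts can be read straight off NumPy’s float metadata (a quick check, not from the thread):

```python
import math
import numpy as np

# Explicit mantissa bits (the implicit leading 1 is extra).
assert np.finfo(np.float32).nmant == 23
assert np.finfo(np.float64).nmant == 52

# 24 and 53 significant bits translate to ~7.2 and ~16 decimal digits.
print(math.log10(2.0**24))  # ~7.22
print(math.log10(2.0**53))  # ~15.95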