In this post, we’ll explore a scenario where the non-
commutativityassociativity of floating point arithmetic can lead us into trouble.
Let and . Write a computer program to compute this sum.
1. A closed form equation for
function S = S(n) S = 1 - 1/(1+n); end
It’s a really simple problem, and if you do a little bit of algebra, you can make the following simplification
From this, it follows immediately that , or . Great, now that we have a closed form expression for , what else do we need to do? Of course, just to make sure that my derivation is sound, I wrote out the full sum program anyways. (Note: this behavior will only happen when you use single precision floating point storage)
n = 1e6; S = 0; for k=1:n S = single(double(S) + 1/(k*(k+1))); end (1 - 1/(n+1)) - S
After computing for a few small values and comparing them to the value of , I found that seems to compute the same value as my expression; but for large values of , rapidly gets farther and farther away from the value computed by . I just attributed this to roundoff error, afterall, this is roundoff. I concluded that the more expensive summation isn’t a good solution at all and thought that was the end of it.
2. Uh oh
Later that day, I showed my friend this question and he told me that while he agrees that the closed form formula is indeed better, he sees no significant loss of accuracy in his implementation. For equals to a million, his solution only has amount of error (on par with for single precision) using only single precision storage types.
“What? No, that’s not possible, I already tried it on mine and I got an error of around ”
“Well, try it yourself”
So I ran his code in . At first I saw no difference in our two implementations, but then it hit me; “he’s computing the series backwards”
n = 1e6; S = 0; for k=n:-1:1 S = single(double(S) + 1/(k*(k+1))); end (1 - 1/(n+1)) - S
and sure enough, I get the same error he does. What the hell is going on?
3. An explanation
It turns out that the problem stems from how roundoff works under certain rounding mode for addition. First, the IEEE guarantees that addition for floating point numbers that are within of each other will be perfectly computed.
Now, a single precision floating point number can store about 8 digits accurately, therefore, adding 1 to will only return 1 since the number will take more than 8 digits to store and hence will truncate the lower few digits. In general, you can assume that if two numbers differ by 8 orders of magnitude, then adding them will have no effect.
Now, when we compute the series in the forward direction, what we’re really doing is computing an approximation such that.
As we’ve already shown, , so at each iteration, we’re making the floating point computation . Relatively speaking, for large values of the expression tends towards just ; futhermore, will become more than 8 orders of magnitude away from 1 when is merely 10-thousand (however round off will be noticeable much earlier) therefore, for the last 999000 operations, our method is effectively doing nothing. You can actually verify this in by seeing that both and output the same long-formatted float when the true answer is about away.
Okay, so how does doing this summation in reverse help in any way or form? Well, suppose that each iteration of the reverse method produces a number at step , then it’s obvious that ; furthermore, for large enough value of , , so we can just have . The inductive definition also states that . Now, there’s something intriguing about this expression: so that the relative difference between and is around which is and can never exceed for (so the roundoff is at the worst in the beginning, but that’s also where this difference formula make assumptions that no longer hold, namely that the linear factor dominates the quadratic factor in , so the relative magnitude between the two terms isn’t really that big of a deal. Anyways, the roundoff will progressively become better and better as grows). Therefore, rather than having a significant vast majority of the computations do nothing, every step now counts. This is largely the reason why summing the series backwards gives less roundoff overall than doing the “same” set of operations in the other direction. This is one case where the “non-associativity” of arithmetic in floating point will give you trouble.
neat : )