Add Me!Close Menu Navigation

Floating point quirks

In this post, we’ll explore a scenario where the non-commutativityassociativity of floating point arithmetic can lead us into trouble.

Let a_k = \frac{1}{k(k+1)} and S_n = \sum_{k=1}^n a_k. Write a computer program to compute this sum.

1. A closed form equation for S_n

function S = S(n)
    S = 1 - 1/(1+n);

It’s a really simple problem, and if you do a little bit of algebra, you can make the following simplification

    \[S_n = \sum_k^n \frac{1}{k(k+1)} = \sum_k^n \frac{k+1 - k}{k(k+1)} = \sum_k^n \frac{1}{k} - \frac{1}{k+1}\]

From this, it follows immediately that S_n = 1 - \frac{1}{2} + \frac{1}{2} - \frac{1}{3} + \frac{1}{3} - \dots - \frac{1}{n} + \frac{1}{n} - \frac{1}{n+1}, or S_n = 1 - \frac{1}{n+1}. Great, now that we have a closed form expression for S_n, what else do we need to do? Of course, just to make sure that my derivation is sound, I wrote out the full sum program anyways. (Note: this behavior will only happen when you use single precision floating point storage)

n = 1e6;
S = 0;
for k=1:n
    S = single(double(S) + 1/(k*(k+1)));
(1 - 1/(n+1)) - S

After computing S_n for a few small values and comparing them to the value of 1 - \frac{1}{n+1}, I found that S' seems to compute the same value as my expression; but for large values of n, S'_n rapidly gets farther and farther away from the value computed by S_n. I just attributed this to roundoff error, afterall, this is roundoff. I concluded that the more expensive summation isn’t a good solution at all and thought that was the end of it.

2. Uh oh

Later that day, I showed my friend this question and he told me that while he agrees that the closed form formula is indeed better, he sees no significant loss of accuracy in his implementation. For n equals to a million, his solution only has 10^{-8} amount of error (on par with \epsilon_{mach} for single precision) using only single precision storage types.

“What? No, that’s not possible, I already tried it on mine and I got an error of around 10^{-4}
“Well, try it yourself”

So I ran his code in \textsc{Matlab}. At first I saw no difference in our two implementations, but then it hit me; “he’s computing the series backwards”

n = 1e6;
S = 0;
for k=n:-1:1
    S = single(double(S) + 1/(k*(k+1)));
(1 - 1/(n+1)) - S

and sure enough, I get the same error he does. What the hell is going on?

3. An explanation

It turns out that the problem stems from how roundoff works under certain rounding mode for addition. First, the IEEE guarantees that addition for floating point numbers that are within \frac{1}{2} of each other will be perfectly computed.

Now, a single precision floating point number can store about 8 digits accurately, therefore, adding 1 to 10^{-10} will only return 1 since the number 1+10^{-10} will take more than 8 digits to store and hence will truncate the lower few digits. In general, you can assume that if two numbers differ by 8 orders of magnitude, then adding them will have no effect.

Now, when we compute the series in the forward direction, what we’re really doing is computing an approximation \hat S_n \approx S_n such that.

    \begin{align*} \hat S_0 &= 0 \\ \hat S_1 &\approx \hat S_0 + \frac{1}{2} \\ \hat S_2 &\approx \hat S_1 + \frac{1}{6} \\ \hat S_3 &\approx \hat S_2 + \frac{1}{12} \\ \hat S_4 &\approx \hat S_3 + \frac{1}{20} \\ &\vdots \\ \hat S_n &\approx \hat S_{n-1} + \frac{1}{n(n+1)} \end{align*}

As we’ve already shown, \hat S_k \approx S_k = 1 - \frac{1}{k+1} = \frac{(k-1)(k+1)}{k(k+1)}, so at each iteration, we’re making the floating point computation \textsc{Flt}\left(\frac{k^2 - 1}{k(k+1)} + \frac{1}{k(k+1)}\right). Relatively speaking, for large values of k the expression k^2 - 1 tends towards just k^2; futhermore, k^2 will become more than 8 orders of magnitude away from 1 when k is merely 10-thousand (however round off will be noticeable much earlier) therefore, for the last 999000 operations, our method is effectively doing nothing. You can actually verify this in \textsc{Matlab} by seeing that both S(1e4) and S(1e6) output the same long-formatted float when the true answer is about 10^{-4} away.

Okay, so how does doing this summation in reverse help in any way or form? Well, suppose that each iteration of the reverse method produces a number R_k at step k, then it’s obvious that R_k = S_n - S_{n-k-1}; furthermore, for large enough value of n, S_n \approx 1, so we can just have R_k = 1 - S_{n-k-1} \approx 1 - \left(1 - \frac{1}{n-k-1+1}\right) = \frac{1}{n-k}. The inductive definition also states that R_k = R_{k-1} + a_{n-k+1} = R_{k-1} + \frac{1}{(n-k+1)^2 + (n-k+1)}. Now, there’s something intriguing about this expression: (n-k+1)^2 + (n-k+1) \approx (n-k+1)^2 so that the relative difference between R_{k-1} and a_{n-k+1} is around (n-k+1) which is O(n-k) and can never exceed 10^{8} for k < n < 10^{6} (so the roundoff is at the worst in the beginning, but that’s also where this difference formula make assumptions that no longer hold, namely that the linear factor dominates the quadratic factor in (n-k+1)^2 + (n-k+1), so the relative magnitude between the two terms isn’t really that big of a deal. Anyways, the roundoff will progressively become better and better as k grows). Therefore, rather than having a significant vast majority of the computations do nothing, every step now counts. This is largely the reason why summing the series backwards gives less roundoff overall than doing the “same” set of operations in the other direction. This is one case where the “non-associativity” of arithmetic in floating point will give you trouble.

neat   : )

Next: How to fix floating point error

Posted By Lee


  • James Jones

    Beware of not absolutely convergent series; you can’t always shuffle the order of evaluation–in fact, there’s a theorem that shows you can make them add up to whatever the heck you want by reordering the terms.

    • SirClueless

      This series *is* absolutely convergent. Every term is positive, and as is shown the sum of this series is less than one, and therefore finite. (And anyways, the theorem only applies to infinite series. This is a finite sum.)

  • Philip Smith

    Floating point addition is commutative. The problem is that it is not *associative*
    commutative: a + b = b + a
    associative: (a + b) + c = a + (b + c)
    This is an example of the latter.

    • Phailure

      You’re right, thank you for pointing this out to me

  • Pingback: Curiosidades do ponto flutuante | Blog de Programação Literati

  • Prinz Rowan

    This is the reason for having the function scalarproduct in some libs. You give it to vectors to multiply and sum up. If it is carefully programmed, the function first multiplies the summands and sorts them in a second step. Then build the sum from the smallest summands to the biggest.
    This could be done as fast as the “plain” scalar product in hardware. You just need an adder register, which is 2048 bits long to represent the whole range of floating point values. This is not a big deal in contemporary billion transistor cpu’s.

    But game performance is just more important than correct calculations in engineering. Loosing in quake hurts so much more than faults in bridges, cars or plains….

    The problem is attacted by intervall arithmetic, which calculates the error intervall with each operation. Unfortunately intervall arithmetic has not been implemented in HW. It would show the bigger error intervall in the case of summing forward.

  • Dan Sutton

    These things become more readily apparent if you understand what’s going on in machine code — this way you can allow for it, as your solution ultimately does: never, ever trust a floating point calculation; always round to a known level of predictable precision… and always write your code so that it understands that this is the limit of accuracy you’re dealing with.

    • Phailure

      I’ve heard that in the days of the old, it was easy to manually check for a rough estimate of how bad the roundoff for a particular set of calculations is by redoing the computation using all 4 available rounding modes and taking the maximum of the difference between the $4 choose 2$ different pairs. If the difference is significant, you have pretty solid evidence that your code is running into roundoff trouble. Unfortunately, most of the newer languages starting with Java no longer lets you set the rounding mode.

  • Charlie Tanner

    Great post, very informative and interesting.

    Quick question, how did your friend get 10^-8 error? Wouldn’t the minimum observable error for single precision be 2^-24 = 10^-7.2ish?

    • Phailure

      Good question; the closed form true/expected value was computed and stored in double precision, the truncated single precision experimental value is then promoted to double precision during the calculation of the error. If we are to calculate the error with a single precision expected value, i.e. (S-single(1 – 1/(n+1))), we would get a 0 for the difference.

      I thought that it might be a little bit misleading to claim that the second method is no longer subjected to roundoff, so I instead opted to demonstrate that the error is no longer within representable range of single precision (absolute error for a term close to 1 is less than the machine epsilon). Thanks.

      • Charlie Tanner

        It all makes sense now. Thank you :)

  • Gary Knott

    Dear Sanity, For another floating point story, see

  • Gyro Gearloose

    Be aware that single floating point number have a 23 bit mantissa, double floating point number have a 52 bit mantissa. See IEEE754

    • Phailure

      Yep, which, along with the implicit first digit, is why we’re allowed to represent up to ~8 decimal digits (log_10(2^24) ~= 8) of accuracy in single precision and around 16 in double precision.