
Fixing floating point cancellation I

How can you compute \sqrt{x^2+1}-x to 14 digits of accuracy within the domain |x| < 10^8?

The graph of f(x) is not smooth around 1e7
This sounds like an easy problem, but the solution turns out not to be so simple. If we plot f(x) = \sqrt{x^2+1}-x, we see that its graph is fairly smooth early on, but once x approaches 10^{7}, it is no longer smooth.
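As a minimal sketch (not from the original post), the jaggedness is easy to reproduce numerically: evaluating the expression directly at nearby points around x = 10^7 gives values quantized into coarse steps rather than a smooth decrease.

```python
import math

def f_naive(x):
    # direct evaluation of sqrt(x^2 + 1) - x; suffers cancellation for large x
    return math.sqrt(x * x + 1.0) - x

# Near x = 1e7 the true function decreases smoothly (f(x) is roughly 1/(2x)),
# but the computed values can only move in steps of 2**-29 (about 1.9e-9),
# the spacing of double precision numbers around 1e7 -- so nearby inputs
# collapse onto the same output.
for i in range(5):
    x = 1.0e7 + 1000.0 * i
    print(f"{x:.1f}  {f_naive(x):.17e}")
```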

1. Uh Oh

This is puzzling, since it’s fairly obvious that f(x) is smooth everywhere; we can only conclude that floating point roundoff is to blame. Seeing the error manifest while the function values are still only of order 10^{-8} means that evaluating this expression directly isn’t going to solve our problem. In fact, using arbitrary precision floating point arithmetic, we see that f(x) is indeed smooth within this region, so we know who the culprit is.


We’re plagued by roundoff because \sqrt{x^2+1} and x are extremely close together: since they are within a factor of two of each other, the IEEE standard guarantees that their subtraction is exact (this is Sterbenz’s lemma). Well, if this subtraction doesn’t incur any error whatsoever, why is there round-off? While the subtraction itself is error-free, the same cannot be said of \sqrt{x^2+1}. A crude bound analysis uses the standard model of floating point arithmetic, in which every operation and every representation carries a small relative error:

    \[\textsc{Flt}(\textsf{expr}_1 \circ \textsf{expr}_2) = \left(\textsc{Flt}(\textsf{expr}_1) \circ \textsc{Flt}(\textsf{expr}_2)\right)(1 + \delta_{\circ})\]


    \[\textsc{Flt}(\textsf{num}) = \textsf{num}(1 + \delta_{num})\]

where the \textsc{Flt} function says that whatever is in its argument will be evaluated under floating point arithmetic, \circ represents one of the arithmetic operations or standard transcendental functions, and \delta_{\cdot} represents the relative error associated with an operation or with the representation of a number. Let’s apply this model to the example in the problem. We’re going to assume that the subtraction gives no error, i.e. that \delta_{minus} = 0 (justified by the argument above). Furthermore, for the sake of simplicity, let’s assume that x is perfectly represented (so \delta_x = 0 as well) and that \textsc{Flt}(x^2) is computed exactly too, since x^2 stays within the integer range of doubles for most of our domain.

    \begin{align*} \textsc{Flt}\left(\sqrt{x^2+1}-x\right) &= \textsc{Flt}\left(\sqrt{x^2+1}\right) - \textsc{Flt}\left(x\right) \\ &= \sqrt{\textsc{Flt}\left(x^2+1\right)}(1+\delta_{\sqrt{}}) - x \\ &= \sqrt{\left(\textsc{Flt}(x^2)+\textsc{Flt}(1)\right)(1+\delta_+)}(1+\delta_{\sqrt{}}) - x \\ &\approx \sqrt{\left(\textsc{Flt}(x^2)+\textsc{Flt}(1)\right)}\left(1+\frac{\delta_+}{2}\right)(1+\delta_{\sqrt{}}) - x \\ &\approx \sqrt{x^2 + 1}\left(1+\frac{\delta_+}{2} + \delta_{\sqrt{}}\right) - x \\ &= \sqrt{x^2 + 1} - x + \sqrt{x^2 + 1}\left(\frac{\delta_+}{2} + \delta_{\sqrt{}}\right) \\ &= \left(\sqrt{x^2 + 1} - x\right)\left(1 + \frac{\sqrt{x^2 + 1}\left(\frac{\delta_+}{2} + \delta_{\sqrt{}}\right)}{\sqrt{x^2 + 1} - x}\right) \\ &\implies \textsf{relative error is bounded by }\frac{\sqrt{x^2 + 1}\left(\frac{\delta_+}{2} + \delta_{\sqrt{}}\right)}{\sqrt{x^2 + 1} - x} \end{align*}

This analysis is intriguing: it tells us that at the very worst, the relative error of the entire computation is bounded by \frac{\sqrt{x^2 + 1}\left(\frac{\delta_+}{2} + \delta_{\sqrt{}}\right)}{\sqrt{x^2 + 1} - x}, where each of the \delta_{\cdot} terms is bounded above by \epsilon_{mach}, so the whole \frac{\delta_+}{2} + \delta_{\sqrt{}} factor is of order 10^{-16}. This spells out why the error gets blown up around x=10^7. Our relative error is basically \epsilon \frac{\sqrt{x^2 + 1}}{\sqrt{x^2 + 1} - x}, and the absolute error is then just \epsilon \frac{\sqrt{x^2 + 1}}{\sqrt{x^2 + 1} - x} \left(\sqrt{x^2 + 1} - x\right) = \epsilon\sqrt{x^2 + 1} \approx \epsilon \times x; when x \to 10^7, the absolute error is around 10^{-16}\times 10^7 \approx 10^{-9}, which is definitely not good enough.

Hold on: the computation of \sqrt{x^2 + 1} only incurred a relative error of about 10^{-16}, so why did the final relative error get blown up by so many orders of magnitude? (If you do the calculation, \sqrt{x^2 + 1} - x \approx \frac{1}{2x}.) Here’s a more intuitive way of seeing this: when we computed \sqrt{x^2 + 1}, we got a small \epsilon amount of relative error on a large number, which amounts to roughly \epsilon x of absolute error. When we subtracted off x, we were still left with an absolute error of \epsilon x, but now it sits on a much smaller number (about \frac{1}{2x}), so the original absolute error becomes a rather large relative error.
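As a quick sanity check (a sketch I’m adding, not from the post), we can measure this blow-up directly by comparing the naive double precision evaluation against a high precision reference computed with Python’s decimal module:

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50  # plenty of guard digits for a reference value

def f_naive(x):
    return math.sqrt(x * x + 1.0) - x

def f_reference(x):
    d = Decimal(x)  # Decimal(float) is exact: d holds the same value as x
    return (d * d + 1).sqrt() - d

x = 1.0e7
abs_err = abs(Decimal(f_naive(x)) - f_reference(x))
rel_err = abs_err / f_reference(x)
print(f"absolute error ~ {float(abs_err):.3e}")
print(f"relative error ~ {float(rel_err):.3e}")
# the relative error lands many orders of magnitude above 1e-16,
# consistent with the eps * x / (1/(2x)) = 2 * eps * x^2 estimate above
```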

2. The Fix?

Intuition plays a large role here, since finding workarounds to floating point issues is pretty much a dark art. Here, the subtraction is actively amplifying our error, so what if we find some equivalent form of this expression that doesn’t subtract two large but close numbers? Through simple algebra, we see that by multiplying by the conjugate we get

    \begin{align*} \sqrt{x^2+1}-x &= \left(\sqrt{x^2+1}-x\right)\frac{\sqrt{x^2+1}+x}{\sqrt{x^2+1}+x} \\ &= \frac{x^2 + 1 - x^2}{\sqrt{x^2+1}+x} \\ &= \frac{1}{\sqrt{x^2+1}+x} \end{align*}

Is this form better? Let’s try it out.
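Here is a minimal comparison (my sketch, not the post’s original code) of the two forms against the same high precision reference:

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50  # high precision reference

def f_naive(x):
    return math.sqrt(x * x + 1.0) - x

def f_stable(x):
    # conjugate form: the sum sqrt(x^2 + 1) + x involves no cancellation
    return 1.0 / (math.sqrt(x * x + 1.0) + x)

x = 1.0e7
ref = (Decimal(x) ** 2 + 1).sqrt() - Decimal(x)
for name, fn in (("naive", f_naive), ("stable", f_stable)):
    err = abs(Decimal(fn(x)) - ref)
    print(f"{name:6s} {fn(x):.17e}  |error| ~ {float(err):.3e}")
```

The stable form’s absolute error sits near the rounding floor of the result itself, many orders of magnitude below the naive form’s.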



In this case, getting rid of the subtraction got our method’s error down to just 10^{-20}, which is much better than what we’re looking for.

Posted By Lee


  • Mike Riley

A long time ago (8086/80286 era) the company I worked for made our own system and they were too cheap to pay for a floating point coprocessor, which was expensive at that time. We did financial loan calculations for car dealerships and had to be accurate. We used a trick based on how FP math was done on the x86 platform. Internally it represents things as 10-byte “temp” floating point values in an 8-number “stack” of registers. We stored our data in that 10-byte format throughout the calculation and only converted it back to 8-byte (we never used the 4-byte format) at the end when the result was stored.

The advantage of doing that is that most errors in a calculation ended up being in the part of the mantissa that we threw away when converting from the 10-byte to the 8-byte format. Usually you would have only 1 or 2 bits at the end that might be off in the 8-byte format when you did the conversion. That way, when we converted to ASCII, we got very accurate values.

We even had to convince some banks that we were right by doing the calculations on calculators, which do math using BCD, because their IBM mainframes were coming up with different answers. In all the cases I know of, we always matched the calculator results using this method.


    • That’s actually quite ingenious, thank you for sharing!

  • flood

    sqrt(x^2 + 1) - x = x*((1+1/x^2)^0.5 - 1) = x*(0.5/x^2 + O(x^-4)) = 0.5/x + O(x^-3)

    so the 1st order approximation gives you something on the order of log_10(x^2) digits of accuracy.

    • My god I thought you’d died.

      Btw, I’m finally going to be working this summer

      • flood

        where u working?
        i’m going to grad school in fall

        btw here’s a math problem

        show that the root(s) of the quadratic equation x^2 + a*x + b = 0 both have magnitude greater than 1 if and only if all of the following 3 conditions hold: a+b<1, b-a<1, |b|<1

        • flood

          oops wrong equation
          should be
          1 – a*x – b*x^2 = 0