
# Fixing floating point cancellation I

How can you compute $\sqrt{x^2 + 1} - x$ to 14 digits of accuracy throughout its domain? This sounds like an easy problem, but it turns out that the solution isn't so simple. If we plot $f(x) = \sqrt{x^2 + 1} - x$, you'll see that its graph is fairly smooth early on, but once $x$ grows large, it is no longer smooth.

### 1. Uh Oh

This is puzzling, since it's fairly obvious that $f(x) = \sqrt{x^2 + 1} - x$ is smooth everywhere; we can only conclude that floating point roundoff is to blame, and seeing the error manifest already at this modest order of magnitude means that this formula by itself isn't enough to solve our problem. In fact, using arbitrary precision floating point arithmetic, we see that $f$ is indeed smooth within this region, so we know who the culprit is.

We're plagued by roundoff even though $\sqrt{x^2 + 1}$ and $x$ are extremely close together, so close that the IEEE standard guarantees that their subtraction will be flawless. Well, if this subtraction doesn't incur any error whatsoever, why is there roundoff? While the subtraction itself is error-free, the same cannot be said of the $\sqrt{x^2 + 1}$ term.

A crude bound analysis works with relations of the form $\mathrm{fl}(a \circ b) = (a \circ b)(1 + \varepsilon)$ and $\mathrm{fl}(x) = x(1 + \varepsilon)$, where the function $\mathrm{fl}(\cdot)$ says whatever is in its argument will be evaluated under floating point arithmetic, $\circ$ represents one of the arithmetic operations or standard transcendental functions, and $\varepsilon$ represents the relative error associated with an operation or with the representation of a number.

Let's look at the expression in the problem. We're going to assume that the subtraction gives no error, or that $\varepsilon_{-} = 0$. Furthermore, just for the sake of simplicity, let's assume that $x$ is perfectly represented (so that $\varepsilon_x = 0$ as well, and since the integers involved never overflow the mantissa, $x^2 + 1$ is computed exactly too). What remains is

$$\mathrm{fl}\big(\sqrt{x^2 + 1} - x\big) = \sqrt{x^2 + 1}\,(1 + \varepsilon_{\sqrt{}}) - x.$$

This analysis is intriguing: it tells us that at the very worst, we can expect the relative error of the entire computation to be bounded by

$$\left|\frac{\mathrm{fl}(f(x)) - f(x)}{f(x)}\right| \le \frac{\sqrt{x^2 + 1}}{\sqrt{x^2 + 1} - x}\,|\varepsilon_{\sqrt{}}| \approx 2x^2\,|\varepsilon_{\sqrt{}}|,$$

where each $\varepsilon$ term is bounded above by machine epsilon, about $10^{-16}$ for doubles, so the entire bound is of an order of magnitude of $2x^2 \cdot 10^{-16}$. This spells out why the error gets blown up for large $x$. Our relative error is basically $2x^2 \cdot 10^{-16}$, and the absolute error is then just that times $f(x) \approx \tfrac{1}{2x}$, i.e. about $x \cdot 10^{-16}$; at $x \sim 10^8$ that is an absolute error of around $10^{-8}$ on a value that is itself only about $5 \times 10^{-9}$, which is definitely not good enough.
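To make the blow-up concrete, here is a small sketch (mine, not from the original post; the function names are mine too): evaluate the naive formula in double precision and compare it against a 50-digit `decimal` reference.

```python
# Sketch: watch the naive formula lose digits as x grows.
# Reference values come from 50-digit decimal arithmetic.
from decimal import Decimal, getcontext
import math

getcontext().prec = 50  # plenty of guard digits for a reference value

def naive(x: float) -> float:
    """Direct evaluation of sqrt(x^2 + 1) - x in double precision."""
    return math.sqrt(x * x + 1.0) - x

def reference(x: int) -> Decimal:
    """Same expression, evaluated with 50 significant decimal digits."""
    d = Decimal(x)
    return (d * d + 1).sqrt() - d

for x in (10, 10**3, 10**5, 10**7):
    exact = reference(x)
    rel = abs((Decimal(naive(x)) - exact) / exact)
    print(f"x = {x:>10}: relative error ~ {float(rel):.1e}")
```

The printed relative errors grow roughly like $2x^2 \varepsilon$, in line with the bound above.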

Hold on: the square root only incurred a relative error of about $10^{-16}$, so why did the final error get blown up by that many orders of magnitude (if you do the calculation, the amplification factor is roughly $2x^2$)? Here's a more intuitive way of seeing this: when we computed $\sqrt{x^2 + 1}$, we got a small amount of relative error on a large number. However, this amounts to roughly $x \cdot 10^{-16}$ of absolute error. When we subtracted off the $x$, we were still left with an absolute error of $x \cdot 10^{-16}$, but this is now the error on a much smaller number ($f(x) \approx \tfrac{1}{2x}$), so the original modest absolute error becomes a rather large relative error.
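This conservation-of-absolute-error effect can be seen directly. The following sketch (assuming $x = 10^7$ and 64-bit doubles; the `Decimal` values stand in for exact arithmetic) shows the same absolute error attached first to a number around $10^7$ and then to a number around $5 \times 10^{-8}$:

```python
# Sketch: the absolute error from rounding the square root survives the
# exact subtraction unchanged, but it is huge relative to the tiny result.
from decimal import Decimal, getcontext
import math

getcontext().prec = 50
x = 10**7

big = math.sqrt(x * x + 1.0)           # one rounding error, on a number ~1e7
big_true = (Decimal(x) ** 2 + 1).sqrt()

small = big - x                        # this subtraction is exact
small_true = big_true - x

abs_err_big = abs(Decimal(big) - big_true)
abs_err_small = abs(Decimal(small) - small_true)

print("absolute errors identical:", abs_err_big == abs_err_small)
print("relative error of sqrt term:", float(abs_err_big / big_true))
print("relative error of result:   ", float(abs_err_small / small_true))
```

The two absolute errors are bit-for-bit identical; only the denominators differ, by about fifteen orders of magnitude.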

### 2. The Fix?

Intuition plays a large role here, since finding workarounds to floating point issues is pretty much a dark art. Here, the subtraction is actively amplifying our error, so what if we find some equivalent form of this expression that doesn't subtract two large but close numbers? Through simple algebra, multiplying by the conjugate gives

$$\sqrt{x^2 + 1} - x = \frac{\big(\sqrt{x^2 + 1} - x\big)\big(\sqrt{x^2 + 1} + x\big)}{\sqrt{x^2 + 1} + x} = \frac{1}{\sqrt{x^2 + 1} + x}.$$

Is this form better? Let's try it out. In this case, getting rid of the subtraction brings the method's relative error down to the order of machine epsilon, which is much better than what we're looking for.
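The two forms can be compared side by side. A sketch (function names are mine), again using a 50-digit `Decimal` value as the reference:

```python
# Sketch: the conjugate form adds two positive numbers instead of
# subtracting two nearly equal ones, so no cancellation occurs.
from decimal import Decimal, getcontext
import math

getcontext().prec = 50

def naive(x: float) -> float:
    return math.sqrt(x * x + 1.0) - x          # cancels catastrophically

def stable(x: float) -> float:
    return 1.0 / (math.sqrt(x * x + 1.0) + x)  # conjugate form

x = 10**7
exact = 1 / ((Decimal(x) ** 2 + 1).sqrt() + x)
for f in (naive, stable):
    rel = abs((Decimal(f(x)) - exact) / exact)
    print(f"{f.__name__:>6}: relative error ~ {float(rel):.1e}")
```

At $x = 10^7$ the naive form has already lost most of its digits, while the conjugate form stays within a few ulps of the true value.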

##### Posted By Lee

Woosh

• Mike Riley

A long time ago (8086/80286 era) the company I worked for made our own system and they were too cheap to pay for a floating point coprocessor, which was expensive at that time. We did financial loan calculations for car dealerships and had to be accurate. We used a trick based on how FP math was done on the x86 platform. Internally it represents things as 10-byte "temp" floating point values in an 8-number "stack" of registers. We stored our data in that 10-byte format throughout the calculation and only converted it back to 8-byte (we never used the 4-byte format) at the end when the result was stored.

The advantage of doing that is that most errors in a calculation ended up being in the part of the mantissa that we threw away when converting from the 10-byte to 8-byte format. Usually you would have only 1 or 2 bits at the end that might be off in the 8-byte format when you did the conversion. That way, when we converted to ASCII, we got very accurate values.

We even had to convince some banks that we were right by doing the calculations on calculators, which do math using BCD, because their IBM mainframes were coming up with different answers. In all cases I know of we always matched the calculator results using this method.

Mike

• That’s actually quite ingenious, thank you for sharing!

• flood

sqrt(x^2 + 1) - x = x*((1+1/x^2)^0.5-1) = x*(0.5/x^2 + O(x^-4)) = 0.5/x + O(x^-3)

so the 1st order approximation gives you something on the order of log_10(x^2) digits of accuracy.
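(Editor's note: a quick sketch, mine rather than flood's, checking that estimate numerically: the relative error of the first-order approximation 0.5/x falls off like 0.25/x^2, i.e. roughly log10(x^2) digits of agreement.)

```python
# Sketch: compare the series approximation 0.5/x against a 50-digit value
# of sqrt(x^2 + 1) - x; the relative error shrinks like 0.25/x^2.
from decimal import Decimal, getcontext

getcontext().prec = 50

def exact(x: int) -> Decimal:
    d = Decimal(x)
    return (d * d + 1).sqrt() - d

for x in (10**2, 10**4, 10**6):
    rel = abs((Decimal(1) / (2 * x) - exact(x)) / exact(x))
    print(f"x = {x:>8}: relative error ~ {float(rel):.1e}")
```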

• My god I thought you’d died.

Btw, I’m finally going to be working this summer

• flood

where u working?
i’m going to grad school in fall

btw here’s a math problem

show that the root(s) of the quadratic equation x^2 + a*x + b = 0 both have magnitude greater than 1 if and only if all of the following 3 conditions hold: a+b<1, b-a<1, |b|<1

• flood

oops wrong equation
should be
1 – a*x – b*x^2 = 0