How can you compute to 14 digits of accuracy within the domain ?
1. Uh Oh
This is puzzling, since it’s fairly obvious that is smooth everywhere; we can only conclude that floating point roundoff is to blame, and seeing the error manifest already on just an order of means that this equation itself isn’t enough to solve our problem. In fact, using arbitrary precision floating point arithmetic, we see that is indeed smooth within this region, so we know who the culprit is.
We’re plagued by roundoff because and are extremely close together, so the IEEE standard guarantees that their subtraction will be flawless. Well, if this subtraction doesn’t incur any error whatsoever, why is there round-off? While the subtraction itself is error-free, the same cannot be said of the . A crude bound analysis on the relative error of the form
where the function says whatever is in its argument will be evaluated under floating point arithmetic, represents one of the arithmetic operations or standard transcendental functions, and represents the relative error associated with an operation or with the representation of a number. Let’s look at the example in the problem. We’re going to assume that the subtraction gives no error, or that . Furthermore, let’s just for the sake of simplicity assume that is perfectly represented (so that as well, and since integers will never overflow, we have is perfectly computed as well)
This analysis is intriguing: it tells us that at the very worst, we can expect our relative error of the entire computation to be bounded by where each of the term is bounded above by , so the entire is bounded to an order of magnitude of . This spells out why the error gets blown up at around . Our relative error is basically , and the absolute error is then just , where when , the absolute error is around , which is definitely not good enough.
Hold on, the original relative error of the computation only incurred a relative error of just , why did the final error get blown up by that many orders of magnitude? (if you do the calculation, )? Here’s a more intuitive way of seeing this: when we computed , we got a small amount of relative error on a large number. However, this amounts to amount of absolute error. When we subtracted off the rest of the , we’re still left with an absolute error of , but this is the error on a much smaller number (), so the original absolute error now becomes a rather large amount of relative error.
2. The Fix?
Intuition plays a large role here since finding workarounds to floating point issues is pretty much a dark art. Here, the subtraction is actively amplifying our error, so what if we find some equivalent form of this expression that doesn’t subtract two large but close numbers? Through simple algebra, we see that by multiplying the conjugate we get
Is this form better? Let’s try it out.
In this case, getting rid of the subtraction got our method’s error down to just , which is much better than what we’re looking for.