# Somewhat Fast Square Root

Written for a challenge from /r/dailyprogrammer, I present to you an algorithm to approximate square roots that is still unfortunately slower than the _mm_sqrt_ps primitive offered by Intel. The following article is largely inspired by the brilliantly written Fast InvSqrt algorithm from id software.

Disclaimer: I’m no expert, in fact I’m a lowly undergrad who is very prone to making mistakes, so take what I have written with a healthy dose of skepticism.

I’m aware of this wikipedia algorithm. It performs as accurately as the non-compensated version of qsqrt(x) exactly half of the time, but have the advantage of not having to use up an extra cycle on the FPU for the multiplication by . Can you figure out the cases where the wikipedia method doesn’t perform as well?

### 1. To whet your appetite

I will explain why the scary looking code listing presented below approximates to 7 digits of precision. To see it in action, see http://codepad.org/s28WBi5y. (The reason that the relative errors are all 0 is because each of the relative error associated with are subnormal, or underflows values representable by single precision floating point numbers.)

#define SQRT2 1.4142135623730951

float qsqrt(float x){
// x = [0 eeeeeeeeeee mmmmmmmmmmmmmmmmmmmmmmm]
int i = *(int*)&x;
int j = (((i/2-0x1fc00000)+0x3f800000)&0x3ff800000)+((i&0x7fffff)/5*2);

float y = (*(float*)&j) * ((i-0x3f800000)&(0x800000) ? SQRT2 : 1); // 2-3 sig figs of significance

// two iterations of newton: y = y - (y^2 - x)/2y yields 2^-18 points of precision
y = (y + x/y)/2; // this yields 2^-8 points of precision
return (y + x/y)/2;
}


### 2. A brief introduction to the IEEE standard

Say we stored a number as a single precision floating bit number; as a bit vector consisting of 32 bits (4 bytes), it will look something like

The first 23 bits, from bit 0 to bit 22, are reserved for the mantissa, which represents the binary rational number such that

where each of can either be the binary bit 0 or 1. You can easily show that , which means that must be strictly less than 1.

Furthermore, the next 8 bits, bits 23 to 30, are reserved for the biased exponent, which represents the binary integer such that

where each of is also either 0 or 1. Now, the decimal x is represented as

Here, is notation for the biased exponent of the floating point number , not the bit within the exponent’s bit vector; similarly, is the notation for the mantissa of the floating point number .

Let’s look at an example before moving on: Say we are given the following floating point number, and we are asked to find its decimal point equivalent.

We start by finding out what is. From the above formula, we have

Note that in the above diagram is all the way at the right while is at the leftmost.

Similarly, we need to figure out what is. We start off with the formula given above

So then this means that

### 3. How to square root

Taking the above definition of a floating point number, let’s see how the square root affects its components

now, itself can be stored as a floating point number

If you’ve taken a physics course on E&M of particles and waves, you’ll probably remember having to use the following approximation

This is taken from the first order taylor expansion of the square root function, let

So we can simplify y further

By matching up the ‘s and ‘s, the above suggests that for

and

That is, the mantissa bits (bit 0 to 22) of y should be approximately half that of the mantissa for x, and the “biased exponent” of y () should be about half of (the biased exponent of x – 127), and all of that added together with 127, so this is what we will attempt to write.

### 4. Onto the code

In this section, we’ll try to fit the above logic into C from scratch, then figure out how to optimize it

We can’t access the bits of a floating point number conveniently in C, so we need to find a way to convert it into an underlying type where it is easy to work with the bits. You can’t just cast it to an int as the compiler will actually just truncate the number to an integer and discard all of the underlying information about how the number was originally represented.

Instead, we can put it in memory, and then dereference that location as an integer pointer to get the actual bits representing the floating point number.

    float x;
int i = *(int*)&x; // this gives you the bit array representing x


Alternatively, you can also use the following

    float x;
union{
float x;
int i;
} u;
u.x = x;
int i = u.i; // this gives you the bit array representing x


On gcc4.7 default optimization level, both compiles to a single SSE mov instruction.

Once we figure out how to get the bit array of a floating point number into a variable i, we need to be able to isolate out the “biased exponent” bits (23 to 30) and the exponent bits. We can do this by creating a bit mask and bitwise anding the bit array with the mask to extract out the required parts

// exp is between 23 and 30
//      bits  10987654321098765432109876543210
#define EXP 0b01111111100000000000000000000000
#define MAN 0b00000000011111111111111111111111
#define BIAS (1<<23)

int m_x = i&MAN;
int e_x = (i&EXP) - BIAS; // this is the unbiased exponent


Next, we find an approximation for y as from above

    // m_y = m_x/2
int m_y = m_x/2;
// e_y = e_x/2, where e_y and e_x are the unbiased exponents
// be careful, if e_x is odd, e_x/2 will get truncated
int e_y = e_x/2;
// however, we need to turn the unbiased exponent into the biased one, we do this by adding the bias
e_y += BIAS


Next, we need to concatenate the exponent and the mantissa together, we can do this by oring the two together, which is just the addition operator

    int j = (e_y&EXP)+(m_y&MAN);
float y = *(float*)&j;


Finally, if the unbiased exponent for x is odd, then the division of the exponent in the previous piece of code will truncate the exponent by an error of , which is just . Hence, we need to test to see if is odd, and if it is, we need to multiply y by

#define SQRT2 1.4142135623730951
if (e_x & (1<<23)){
y *= SQRT2;
}

return y;


However, we can rewrite this more succintly. Convince yourself that the above code can be more succintly written as

    int j = (((i/2-BIAS/2)+BIAS)&EXP)+((i&MAN)/2);


### 5. Crude Error compensation

As we’ve seen previously, the first order taylor approximation of has an error that can be approximated by . Now there’s no easy way of calculating in integer arithmetic, so what we do instead is bound to something reasonable. Since around any reasonable neighborhoods, , we compromise and decide experimentally that , which means that . This is how we got the expression

    int j = (((i/2-BIAS/2)+BIAS)&EXP)+((i&MAN)*2/5);

##### Posted By Lee

Woosh

• maister

Please don’t cast float* to int* like that. It is illegal and is *very* likely to break on high levels of optimization. While not guaranteed by the standard, going through a union instead will work on pretty much any compiler.

• Pingback: すばらしいビット | プログラミング | POSTD()