## 0x5f400000: Understanding Fast Inverse Sqrt the Easy(ish) Way!

#### TL;DR – Start Here!

1. Let f2b be the float-to-bits function (which takes in a floating point number and outputs a 32-bit long representing its IEEE 754 encoding, as used on your everyday computer) and b2f be the bits-to-float function; then the fast inverse square root method can be rewritten as y ≈ b2f(0x5f3759df - f2b(x)/2).

2. It turns out that b2f(f2b(x)*n) ≈ C * x^n for some ratio C that doesn't vary a lot with x, so it becomes an optimization game to find a constant b such that b2f(f2b(x)*n + b) ≈ x^n.
3. We know that x^n = 1 when x = 1 no matter what n is, so if we want to find b such that b2f(f2b(x)*n + b) ≈ x^n, all we need to do is set b = (1 - n)*f2b(1) = (1 - n)*0x3f800000, because then

   b2f(f2b(1)*n + b) = b2f(f2b(1)) = 1

so x^n ≈ b2f(f2b(x)*n + (1 - n)*0x3f800000), which lets us make fast power methods for arbitrary exponents on demand.
4. ???
5. Profit!

### 1. Introduction

The first time I saw the magical fast inverse square root method, I was a freshman in college and the code pretty much killed my brain. I’d just learned about algorithms and data structures and I thought that binary search trees were like the coolest things ever. Turns out I didn’t know them well enough to tackle this problem back then.

#### WTF?

If you’re not familiar with the fast inverse square root method, here’s what it looks like in C (source: Wikipedia):

```c
float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;                       // evil floating point bit level hacking
	i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//	y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

	return y;
}
```


[Note: The use of pointer aliasing here is illegal, as Maristic pointed out in the followup discussion on reddit. Use unions instead.]
Okay, so first you declare i, x2, y, and a constant threehalfs. Next, you set x2 to be number * 0.5 and set y = number. So far so good!

Okay, next, i gets the dereferenced value of the float* pointer to y, cast into a long* pointer instead?? Okay…

Next, i = 0x5f3759df – (i/2)??? Huh?

y = *(float*)&i????

y = y * ( threehalfs – ( x2 * y * y ) )??????

What. The. Fucking. Fuck.

#### Obsession

Before we go on, I should probably warn you that this code was a constant obsession of mine for the better part of my adult life. Because of this, a lot of what I’m about to say may sound preachy, cheesy, or even outright annoying. I’ll try to tone it down, but truth be told, I am proud of this work; I’ve come to see the fast inverse square root method as my baby and because of that, I’ve become quite possessive of it. As much as this is a post on the possible derivation of 0x5f3759df, this is also a memoir of my college life (minus the fun parts ;).

From the very beginning, I tried to wrap my head around this strange game of algebraic reasoning, and for the next 4 years, my academic life largely revolved around this mystery: I spent my winter break freshman year learning C and figured out that the “bit level” hack gave you the array of bits that your computer encodes floating point numbers as; sophomore year, I took a first course on numerical methods and understood how iterating the sequence

    y_{k+1} = y_k * (3/2 - (x/2) * y_k^2)

ad infinitum will eventually converge to 1/sqrt(x); then, during the summer before my junior year, I worked through a lot of tedious algebra dealing with the quirks of IEEE 754 and found another bit-level hack to quickly compute the square root a la the spirit of the fast inverse square root method, but this time exploiting the distribution of floats and a convenient analogy to second-order Taylor expansions.
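That iteration is just Newton's method applied to f(y) = 1/y^2 - x; here's a quick sketch of my own (plain Python, no bit tricks) showing how fast it converges:

```python
import math

def newton_step(x, y):
    # one Newton iteration for f(y) = 1/y**2 - x;
    # the same update that appears in Q_rsqrt as y*(threehalfs - x2*y*y)
    return y * (1.5 - 0.5 * x * y * y)

x = 2.0
y = 0.5            # a rough initial guess
for _ in range(6):
    y = newton_step(x, y)

print(abs(y - 1 / math.sqrt(x)))   # the error shrinks quadratically toward 0
```

Each step roughly squares the error, which is why a single iteration suffices once the bit hack supplies a good starting guess.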

Indeed, my obsession with this method has given me an appetite for numerical analysis, programming languages, software verification, compilers, and even combinatorics and dynamical systems. Without it, I would have never explored most of the areas of computer science that I currently consider my home.

But through it all, there was always a constant question at the back of my mind that I’ve failed to answer time and time again: where did the 0x5f3759df come from?

I never got to a point where I can claim that I have a satisfactory answer to that question, but I think I did come to discover something along the road: a different magical constant, 0x5f400000, that I believe lies at the very beginning of the history of this method, and it is the joy of discovering this constant that I would like to gift to my readers.

### 2. Switching Strategy

Towards the end of my college years, I realized that there was a big problem with the way I approached 0x5f3759df. Most of the traditional research into the subject has been done from the perspective that whoever wrote the algorithm went after the optimal method possible. In the process of hunting for the jackpot, a lot of great intuition was left behind because those ideas weren’t efficient or effective enough. This is what led me down the path of looking for inefficient but elegant methods that could have inspired the fast inverse square root method.

But first, let’s ramp up on some preliminaries. A lot of the work and papers on this matter jump immediately to the bit-level representation of floating point numbers; I will completely forgo that, instead relying on mathematical intuition to do the job. (This is in part why I didn’t become a mathematician :P)

#### Converting between a Float and its Bits in memory

Let’s first consider two functions: f2b, which stands for float-to-bits and returns the IEEE 754 bit pattern of the floating point number y, effectively performing *(long*)&y; and b2f, which stands for bits-to-float and takes a bit pattern and turns it back into a float. Note that b2f(f2b(x)) = x and vice versa, so that each function is the other’s inverse.
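These two helpers are easy to mimic outside of C. Here's a sketch in Python (my own illustration; struct reinterprets the bytes, mirroring the pointer cast):

```python
import struct

def f2b(y):
    # float-to-bits: the IEEE 754 single-precision pattern of y, as an int
    return struct.unpack('<I', struct.pack('<f', y))[0]

def b2f(b):
    # bits-to-float: reinterpret a 32-bit pattern as a float
    return struct.unpack('<f', struct.pack('<I', b & 0xFFFFFFFF))[0]

print(hex(f2b(1.0)))        # 0x3f800000, the encoding of 1.0
print(b2f(f2b(0.15625)))    # round-trips exactly: 0.15625
```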

It turns out that the magical fast inverse square root method is equivalent to

    y = b2f(0x5f3759df - f2b(x)/2)

(plus the Newton polish at the end).

#### Looking at b2f(f2b(x)*n)

With the conversion functions out of the way, I next turned my attention to everything but the magic constant. Specifically, what are the effects of multiplying by some number n inside the b2f? I fired up my favorite plotting library and looked at

    b2f(f2b(x)*n)

After a little bit of wrestling around with the plot, I realized that when plotted in log-log style, it has the same shape as x^n. This means that

    b2f(f2b(x)*n) ≈ C * x^n

for some constant C.

I fitted a “regression” line (this isn’t a real regression however, so don’t be misled) through the data onto C * x^n and found that the relative error is quite small, as can be seen in the figure below.

In fact, if you tick through the slider at the bottom of the figure, you’ll come to find that for any arbitrary n, it seems to be the case that

    b2f(f2b(x)*n) ≈ C(n) * x^n

for some constant C(n) depending on n.
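You don't need the plot to believe this: sample the ratio b2f(f2b(x)*n) / x^n at a few scattered points and watch it barely move (a self-contained sketch of mine; the bit-reinterpretation helpers are restated inline):

```python
import struct

f2b = lambda y: struct.unpack('<I', struct.pack('<f', y))[0]
b2f = lambda b: struct.unpack('<f', struct.pack('<I', int(b) & 0xFFFFFFFF))[0]

n = 0.5
ratios = [b2f(f2b(x) * n) / x ** n for x in (1.0, 2.5, 7.0, 64.0, 1000.0)]
print(min(ratios), max(ratios))   # the ratios cluster around a single constant C(n)
```

(The constant itself is wildly small for n = 1/2, on the order of 1e-19, but the point is that it stays put while x sweeps three orders of magnitude.)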

### 3. Homing In on 0x5F400000

Let’s now consider the following function:

    h(x) = b2f(f2b(x)*n + b)

Notice here that we have now added an additive term b to the equation. Again, we can rewrite the fast inverse square root method as h with n = -1/2 and b = 0x5f3759df.

A natural question one might consider is the effect of varying b on the behavior of h. It turns out that on the log-log graph of h, varying b does not change the slope of the line, but only its y-intercept. In other words, varying b will only affect the magnitude of the constant C, as we can see below.

This then suggests that n determines the degree of h while b determines its magnitude. Symbolically:

    b2f(f2b(x)*n + b) ≈ C(n, b) * x^n

where C(n, b) is some constant depending only on n and b.

#### 0x3F800000 – Generalized fast power method

We are now left at an interesting point in the puzzle. We have on one hand the knowledge that

    b2f(f2b(x)*n + b) ≈ C(n, b) * x^n

for some function C(n, b), and we want to determine b such that

    b2f(f2b(x)*n + b) ≈ x^n

Canceling out the x^n on either side, we are left with the functional equation

    C(n, b) ≈ 1

So right now we need to find b such that C(n, b) ≈ 1. But x^n = 1 for x = 1, for any n, so the naive thing to do here is to make the method exact at x = 1: setting b2f(f2b(1)*n + b) = 1 means f2b(1)*n + b = f2b(1), i.e. b = (1 - n)*f2b(1) = (1 - n)*0x3f800000 (recall that f2b(1) = 0x3f800000). This reduces our approximation to

    x^n ≈ b2f(f2b(x)*n + (1 - n)*0x3f800000)

Let’s take a moment and appreciate this statement:

    x^n ≈ b2f(n*f2b(x) + (1 - n)*0x3f800000)

This says that for any exponent n (positive or negative, integer or fractional), we can approximate x^n with one multiplication and one addition on the bits of x.

That’s mind blowing, because we can immediately generate “fast power” methods using nothing but 0x3f800000!

```c
float fast_pow(float x, float n) {
    long bit = (*((long*)&x))*n + 0x3f800000*(1 - n);
    return *((float*)&bit);
}
```


Going back to inverse square roots, we have n = -1/2, so b = (1 - (-1/2))*0x3f800000 = (3/2)*0x3f800000 = 0x5f400000, which is the namesake of this article.
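That little derivation is easy to double-check numerically (a sketch; the struct helpers are my stand-ins for the C casts):

```python
import struct

f2b = lambda y: struct.unpack('<I', struct.pack('<f', y))[0]
b2f = lambda b: struct.unpack('<f', struct.pack('<I', int(b) & 0xFFFFFFFF))[0]

n = -0.5
b = int((1 - n) * 0x3f800000)        # 1.5 * f2b(1)
print(hex(b))                        # 0x5f400000

# the resulting "vanilla" fast inverse square root, no Newton polish
rsqrt = lambda x: b2f(b - f2b(x) // 2)
print(rsqrt(4.0))                    # close to 0.5
```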

### 4. Epilogue – The Hard Parts

#### Closing In On 0x5f3759df

After discovering the existence of 0x5f400000, things became much clearer. However, it was still quite far from SGI’s 0x5f3759df or Lomont’s 0x5f375a86, and so the question of how this constant came to be still nagged at the back of my head. Nevertheless, I am certain that 0x5f400000 appeared, at least in some shape or form, in the earlier versions of this numerical recipe.

Many others have previously discussed the possible origins of 0x5f3759df, and in the rest of this post, I will offer my own views on the matter.

For the most part, I believe that the discovery of 0x5f3759df was an iterative process: it starts with 0x5f40, then it gets refined to 0x5f375X, and finally someone must have decided on 0x5f3759df. As engineers are practitioners by nature, I don’t believe that anyone who worked on 0x5f3759df cared for its optimality; therefore, I believe that intuition and visualization played the most important part in its discovery.

In this section, I will first outline one path to refine 0x5f40 to 0x5f375X through intuition about the maximum error bounds of the recipe, followed by a discussion of the final discovery as a product of analyzing one step of Newton’s iteration.

#### Refinement – 0x5f376000

If we look at the graph of the relative error of b2f(m - f2b(x)/2) on a logarithmic x-axis below, you will notice a couple of peculiar things.

First and foremost, changing the value of the magic constant m by small amounts seems to lift the relative error up or down as a whole. Next, the error is periodic with respect to log x, with a period of log 4 (so that err(x) = err(4x)). Finally, the positions of the maximum and the minimum of the relative error are largely unchanged when we vary the magic constant.

We can use these observations to our advantage. Notice that the error for 0x5f400000 is completely negative. However, if we were to use a smaller magic constant, we can shift part of the error up into the positive zone, which reduces the maximum absolute error, as can be seen below.

We are trying to reduce the error bounds as much as possible (optimizing under the infinity norm, as we would say). In order to do so, we need to ensure that the maximum point plus the minimum point in the (non-absolute) relative error plot is as close to 0 as possible.

Here’s what we’re going to do. First, we’re going to zoom in on the section of the graph within the interval [1, 4], because the error is periodic. Next, we will look at the maximum point and the minimum point and look at their sum.

You will see that the optimal constant here lies somewhere close to 0x5f376000, which is only 0x621 ≈ 1600 away from the SGI constant.
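If you'd rather not trust my sliders, a small sweep over one period reproduces the refinement (a sketch; the grid density is an arbitrary choice of mine):

```python
import struct

f2b = lambda y: struct.unpack('<I', struct.pack('<f', y))[0]
b2f = lambda b: struct.unpack('<f', struct.pack('<I', b & 0xFFFFFFFF))[0]

def max_rel_err(magic, samples=4000):
    # worst |1 - y*sqrt(x)| for x in [1, 4), one full period of the error
    worst = 0.0
    for i in range(samples):
        x = 1.0 + 3.0 * i / samples
        y = b2f(magic - f2b(x) // 2)
        worst = max(worst, abs(1.0 - y * x ** 0.5))
    return worst

print(max_rel_err(0x5f400000))   # the vanilla constant
print(max_rel_err(0x5f376000))   # a noticeably smaller worst case
```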

#### Newton Iteration – 0x5f375a1a

In our final trick tonight, we will refine this constant down further, to within 60 of SGI’s 0x5f3759df.

One of the important things to notice from the Quake code is that the second Newton iteration was commented out. This suggests that whoever worked on the latest version of the fast inverse square root method must have optimized the constant with the effects of Newton iteration applied. Let’s look at how well the constants do with the help of an iteration.

Wow, using 0x5f3759df as the magic constant improved the accuracy nearly tenfold! This likely suggests that whoever worked on Quake’s version of the fast inverse square root method sought 4 digits of accuracy. In order to achieve this with the vanilla constant 0x5f400000, they would need to run two Newton iterations.
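Here's that comparison in code form (a sketch; the sampling grid is my own choice):

```python
import struct

f2b = lambda y: struct.unpack('<I', struct.pack('<f', y))[0]
b2f = lambda b: struct.unpack('<f', struct.pack('<I', b & 0xFFFFFFFF))[0]

def max_err_one_newton(magic, samples=4000):
    worst = 0.0
    for i in range(samples):
        x = 1.0 + 3.0 * i / samples        # one period of the error
        y = b2f(magic - f2b(x) // 2)
        y = y * (1.5 - 0.5 * x * y * y)    # the single Newton iteration
        worst = max(worst, abs(1.0 - y * x ** 0.5))
    return worst

print(max_err_one_newton(0x5f400000))
print(max_err_one_newton(0x5f3759df))   # several times smaller
```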

Let’s dig in a little bit and try to find the magic constant that minimizes the maximum error after one newton iteration.

Incidentally, this occurs when the rightmost two peaks in the above graph collide. We find that the magic constant 0x5f375a1a works best here, which is only 59 away from 0x5f3759df.

We can look at the mean error as well (the one-norm):

but as the bulk of the mass of the error is tucked under the two humps, the 1-norm/mean metric (and in fact, the first few reasonable p-norms) will tend to disregard the significance of the outer edges and their maximum bounds. Therefore, these metrics will attempt to decrease the constant by as much as possible.

Finally, let’s look at the maximum relative error after two iterations:

This ended up giving us the same magic constant of 0x5f375a1a, which I guess makes sense.

#### Closing Thoughts

In this article, we’ve come across a couple of different constants: 0x5f400000, 0x5f376000, 0x5f375a1a. None of these is exactly the 0x5f3759df that we’ve been searching for, but the last one is really close, which leads me to believe that the discovery of 0x5f3759df went through a similar process: 1. intuition, and 2. refinement.

The dark art of fast reciprocal square roots has been largely done away with nowadays, thanks to faster hardware intrinsics. If anything, this article contributes little technical value besides proposing a half-assed solution to a great puzzle. Nevertheless, I sincerely hope that, having made it this far, you’ve also found some value in this research. Good luck!

#### Requested by averageisnothalfway: Differences between a1a and df


*[The interactive Highcharts/jQuery plotting code for the figures above lived here; the final figure plotted the relative error of 0x5f375a1a and 0x5f3759df after one Newton iteration, together with their difference (positive segments are where 0x5f3759df wins).]*



## Counting Binary Trees (The Hard Way)

Suppose I have a binary tree of n internal nodes; we can count all possible such configurations C_n by looking at the tree as a root vertex plus two subtrees,

where each internal subtree with k nodes is a vertex plus two trees whose node-sum is k - 1, or

    C_n = sum_{k=0}^{n-1} C_k C_{n-1-k},    C_0 = 1

This is pretty fucking hard to solve. Here’s a way where we do not need to appeal to generating functions.

Let’s begin with a reasonable hypothesis: C_n satisfies a first order recurrence. After groping around for a while, we find that this is actually a nonlinear homogeneous recurrence:

    C_n = f(n) C_{n-1}    for some coefficient f(n) depending on n

Here are the first couple of terms:

    1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, ...
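These terms come straight from evaluating the convolution (a quick sketch):

```python
def catalan_terms(m):
    # C[n] = sum of C[k]*C[n-1-k] over k, with C[0] = 1
    C = [1]
    for n in range(1, m):
        C.append(sum(C[k] * C[n - 1 - k] for k in range(n)))
    return C

print(catalan_terms(8))   # [1, 1, 2, 5, 14, 42, 132, 429]
```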

No obvious pattern yet, but if we look at the sequence of ratios C_n / C_{n-1}, we see that it looks like

    1, 2, 5/2, 14/5, 3, 22/7, 13/4, 10/3, 17/5, 38/11, ...

Notice how the primes in the denominator are exactly the right amount of spacing apart (7 is 2 away from 5, 11 is 4 away from 7, etc). Furthermore, the nonprime denominators are always a divisor of what the expected value there is supposed to be (for example, the 4 in the 13/4 right before the 10/3 divides the expected 8); this seems to suggest that there’s always an inverse affine factor between two elements of the sequence. Since the denominator of C_n / C_{n-1} looks like n + 1, let’s conjecture that

    C_n / C_{n-1} = g(n) / (n + 1)

In fact, the sequence g(n) = (n + 1) C_n / C_{n-1} looks really nice!

    2, 6, 10, 14, 18, 22, 26, 30, ...

why, this is just the sequence 4n - 2, so we can pretty accurately guess that

    C_n = ((4n - 2)/(n + 1)) C_{n-1}

Warning: This is a really tedious proof; the point is to construct a series that cancels out the n + 1 on the denominator at each step of the induction.

Proof: By strong induction, for some n, suppose that we have already shown that C_k = ((4k - 2)/(k + 1)) C_{k-1} for every k ≤ n.
Consider the series

    S_n = sum_{k=0}^{n} (k + 1) C_k C_{n-k}

In the n = 3 case, this looks like

    S_3 = 1*C_0*C_3 + 2*C_1*C_2 + 3*C_2*C_1 + 4*C_3*C_0 = 5 + 4 + 6 + 20 = 35

notice that we can flip the order of this series (swapping the weight k + 1 for n - k + 1) and add it with itself to get

    2 S_n = sum_{k=0}^{n} (n + 2) C_k C_{n-k} = (n + 2) C_{n+1}

By appealing to the induction hypothesis, in the form (k + 1) C_k = (4k - 2) C_{k-1}, we also have

    S_n = C_n + sum_{k=1}^{n} (4k - 2) C_{k-1} C_{n-k}
        = C_n + sum_{j=0}^{n-1} (4(j + 1) - 2) C_j C_{n-1-j}
        = C_n + 4 S_{n-1} - 2 C_n

and since the flip identity (applied at n - 1) gives S_{n-1} = (n + 1) C_n / 2, this is

    S_n = C_n + 2(n + 1) C_n - 2 C_n = (2n + 1) C_n

So, above, we’ve derived the following:

    S_n = (n + 2) C_{n+1} / 2    and    S_n = (2n + 1) C_n

Substituting the left into the right, we get

    C_{n+1} = (2(2n + 1)/(n + 2)) C_n = ((4(n + 1) - 2)/((n + 1) + 1)) C_n

which concludes the proof.

This recurrence allows us to write

    C_n = prod_{k=1}^{n} (4k - 2)/(k + 1)

Notice that this is the same as

    C_n = binomial(2n, n) / (n + 1)
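The product telescopes into the familiar closed form, and the two agree (a sketch; math.comb supplies the binomial coefficient):

```python
from math import comb

def catalan_product(n):
    # C_n = prod_{k=1..n} (4k - 2)/(k + 1), kept exact in integer arithmetic
    num, den = 1, 1
    for k in range(1, n + 1):
        num *= 4 * k - 2
        den *= k + 1
    return num // den   # the quotient is exact: C_n is an integer

for n in range(12):
    assert catalan_product(n) == comb(2 * n, n) // (n + 1)
print(catalan_product(9))   # 4862
```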

Phew.

## Calculating Modified Fibonacci Numbers, Quickly and Exactly

### 1. Properties of c_n

The sequence c_n, defined by c_0 = 2, c_1 = 1, and c_n = c_{n-1} + c_{n-2}, is essentially the same as the Fibonacci sequence, with the exception that the initial conditions are altered slightly (so that c_n = F_{n-1} + F_{n+1}, where F_n is the Fibonacci sequence). This similarity lets us use much of the same algebraic machinery that can be used to tackle the Fibonacci numbers.

#### Closed Form

I will defer the derivation until the appendix, but it can be quickly shown using a little bit of linear algebra that, letting φ = (1 + sqrt(5))/2 be the golden ratio and ψ = (1 - sqrt(5))/2 its “conjugate” (the other root of the quadratic), we have

    c_n = φ^n + ψ^n

which has the same form as the closed form for the Fibonacci numbers.

#### Analytic Property

The numerical value of φ is about 1.618 and that of ψ is about -0.618; since |ψ| < 1, as n grows large the contribution of ψ^n becomes insignificant, serving only to ensure that c_n is a whole number. Therefore c_n ≈ φ^n; in fact, from simple observation, it seems that for n ≥ 2, c_n is exactly φ^n rounded to the nearest integer.
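A quick check of the rounding claim (a sketch; plain floating point φ is accurate enough at these sizes):

```python
phi = (1 + 5 ** 0.5) / 2

def c(n):
    a, b = 2, 1          # c_0, c_1
    for _ in range(n):
        a, b = b, a + b
    return a

assert all(c(n) == round(phi ** n) for n in range(2, 30))
print(c(10), round(phi ** 10))   # both 123
```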

#### Computation by Reduction to Fibonacci Number

Unfortunately, we don’t quite have a good way to compute with irrational objects: if we use floating point arithmetic, we quickly lose information in the trailing terms; if we use some algebraic package, then simplifying expressions with irrational numbers may become computationally infeasible.

However, we can exploit the connection between c_n and the Fibonacci numbers to help us out. Suppose we have some software that can efficiently compute F_k for any arbitrary k; can we possibly use that to our advantage in computing c_k?

It can be shown (similar to the steps used in the appendix) that F_n = (φ^n - ψ^n)/sqrt(5). Notice that

    F_{n-1} + F_{n+1} = (φ^(n-1) - ψ^(n-1) + φ^(n+1) - ψ^(n+1)) / sqrt(5)
                      = (φ^n (φ + 1/φ) - ψ^n (ψ + 1/ψ)) / sqrt(5)

and φ + 1/φ = sqrt(5) while ψ + 1/ψ = -sqrt(5), therefore

    c_n = φ^n + ψ^n = F_{n-1} + F_{n+1}
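With any Fibonacci routine in hand (a plain loop below stands in for the “efficient software”), the reduction is a one-liner (a sketch):

```python
def fib(n):
    # stand-in for whatever fast Fibonacci implementation is available
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def c_from_fib(n):
    # c_n = F_{n-1} + F_{n+1}, valid here for n >= 1
    return fib(n - 1) + fib(n + 1)

print([c_from_fib(n) for n in range(1, 8)])   # [1, 3, 4, 7, 11, 18, 29]
```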

#### Direct Computation

Alternatively, we can also compute this sequence directly in O(log k) time. Recall that φψ = -1; then

    c_k^2 = (φ^k + ψ^k)^2 = φ^(2k) + ψ^(2k) + 2(φψ)^k = c_{2k} + 2(-1)^k

so c_{2k} = c_k^2 - 2(-1)^k. Unfortunately, there’s no elegant solution for all odd terms, but recall that within the consecutive triple (k, k+1, k+2), at least one of these must be a multiple of three. Now, consider

    c_k^3 = (φ^k + ψ^k)^3 = φ^(3k) + ψ^(3k) + 3(φψ)^k (φ^k + ψ^k) = c_{3k} + 3(-1)^k c_k

Therefore, we can compute c_k using

    c_{2k} = c_k^2 - 2(-1)^k
    c_{3k} = c_k (c_k^2 - 3(-1)^k)
    c_k   = c_{k-1} + c_{k-2}    (otherwise)

Memoizing gives a solution

```python
C = {0: 2, 1: 1}
d = lambda k: -1 if k % 2 else 1      # d(k) = (-1)^k

def c(k):
    if k in C:
        return C[k]
    if k % 2 == 0:
        r = c(k // 2)
        C[k] = r * r - 2 * d(k // 2)          # c_{2j} = c_j^2 - 2(-1)^j
    elif k % 3 == 0:
        r = c(k // 3)
        C[k] = r * (r * r - 3 * d(k // 3))    # c_{3j} = c_j(c_j^2 - 3(-1)^j)
    else:
        C[k] = c(k - 1) + c(k - 2)
    return C[k]
```


### 2. Example Problem: Fibonacci Graph

Suppose we have the Fibonacci sequence

    F_0, F_1, F_2, ...

defined by

    F_0 = 0,    F_1 = 1,    F_n = F_{n-1} + F_{n-2}

Let’s construct a sequence of points P such that

    P_n = (F_n, F_{n+1})

so that it looks like the sequence

    (0, 1), (1, 1), (1, 2), (2, 3), (3, 5), (5, 8), ...

Let’s connect all of the points in P together (so that it forms a jagged line) and look at the area under the curve; we will call this quantity A_n.

Give an algorithm that computes A_n in O(log n) time.

#### Solution Sketch

If you look at the figure above, you’ll notice that the total region is made up of these trapezoids. Let’s call T_n the area of the trapezoid defined by P_n and P_{n+1}; then, as illustrated below,

we know that

    T_n = (F_{n+1} - F_n) F_{n+1} + (F_{n+1} - F_n)(F_{n+2} - F_{n+1}) / 2

which is just the area of the rectangle on the bottom plus the triangle on the top. From the relation F_{n+1} = F_n + F_{n-1}, we know that F_{n+1} - F_n = F_{n-1}, and similarly F_{n+2} - F_{n+1} = F_n, so that we have

    T_n = F_{n-1} F_{n+1} + F_{n-1} F_n / 2

Let’s make a connection to the exponential form of the Fibonacci sequence:

    F_n = (φ^n - ψ^n) / sqrt(5)

where φ, ψ are the golden ratio and its conjugate respectively. If we actually substitute this equation into the definition of T_n and expand everything out, we will find that

    T_n = (φ^(2n) + ψ^(2n) - φ^(n-1)ψ^(n+1) - φ^(n+1)ψ^(n-1)) / 5
        + (φ^(2n-1) + ψ^(2n-1) - φ^(n-1)ψ^n - φ^n ψ^(n-1)) / 10

Since ψ is φ’s conjugate, we have φψ = -1, so each of the mixed terms φ^a ψ^b can be rewritten as (-1)^b φ^(a-b). Grouping related terms together, we get

    T_n = (c_{2n} + 3(-1)^n) / 5 + (c_{2n-1} + (-1)^n) / 10

simplifying, we find that

    T_n = (c_{2n+2} + 7(-1)^n) / 10

This is awesome! But we need to find the sum of the T_n, so let’s do just that:

    A_m = sum_{k=1}^{m} T_k
        = ( sum_{k=1}^{m} c_{2k+2} + 7 sum_{k=1}^{m} (-1)^k ) / 10
        = ( c_{2m+3} - 4 + 7 s_m ) / 10

where s_m is -1 when m is odd and 0 when m is even (using the telescoping identity sum_{k=1}^{M} c_{2k} = c_{2M+1} - 1).

Sweet!

Since I was a little vague on specifying the range of trapezoids to consider for , I will accept both (the literalist interpretation) and (the more casual interpretation).
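Since the exact symbols above were lost, here is one way to sanity-check the algebra numerically, assuming (as an illustrative convention) that the points are (k, F_k) with unit spacing and F_1 = F_2 = 1. Under that convention the trapezoid sum agrees with the identity F_1 + ... + F_n = F_{n+2} - 1, which gives an area of (F_{n+3} - 3)/2 over [1, n]:

```python
def fib_upto(n):
    # F[0] = 0, F[1] = 1, F[2] = 1, ... up to index n
    F = [0, 1]
    while len(F) <= n:
        F.append(F[-1] + F[-2])
    return F

def area(n):
    # brute-force area under the jagged line through (k, F_k), k = 1..n;
    # each trapezoid between x = k and x = k + 1 has area (F_k + F_{k+1}) / 2
    F = fib_upto(n)
    return sum((F[k] + F[k + 1]) / 2 for k in range(1, n))

# consistency check against F_1 + ... + F_n = F_{n+2} - 1,
# which rearranges to area(n) = (F_{n+3} - 3) / 2
F = fib_upto(23)
for n in range(2, 20):
    assert area(n) == (F[n + 3] - 3) / 2
```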

### 3. Appendix

#### Deriving Closed Form of

Consider the sequence ; we can represent this sequence as an infinite-dimensional vector called (so that indexing is the same as finding the element of the sequence). Now, consider the vector space that inhabits: we can define a transformation (that I will call the right-shift operator) that takes in and outputs shifted one index over, so that . e.g.

It’s clear that this is a linear transform, because we can inductively define as the limit of a sequence of matrices whose first upper diagonal is constant 1:

Consider the subspace spanned by all of the shifted sequences:

the second equality comes from the fact that because . Now, because we don’t actually know what is (only that it exists), we can alternatively look at it as the kernel of the system

(because we’ve already established above that is a linear operator). Furthermore, and all scalar matrices commute. Appealing to a little bit of algebra, we know that, letting be the characteristic and be scalar roots of the quadratic such that , then

Therefore, let and , then the solution must be some linear combination of the solutions and :

for some scalar , because then

as expected.

Now, we know that , but this is the same as saying that letting , then , so . Using a similar argument, we can also argue that , so

We can then use our initial values that to solve for

so
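The closed form being derived here is Binet’s formula. Assuming the standard indexing F_0 = 0, F_1 = 1, a quick numerical check (exact for moderate n once rounded):

```python
import math

# Binet's formula: F_n = (phi**n - psi**n) / sqrt(5),
# where phi is the golden ratio and psi its conjugate
sqrt5 = math.sqrt(5)
phi = (1 + sqrt5) / 2
psi = (1 - sqrt5) / 2

def fib_closed(n):
    return round((phi ** n - psi ** n) / sqrt5)

# compare against the defining recurrence
a, b = 0, 1
for n in range(30):
    assert fib_closed(n) == a
    a, b = b, a + b
```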

## Counting Words

### 1. The Combinatorial (Easy) Approach

Consider the placement of the zeros in an -word: this is equivalent to the problem where we choose an arbitrary word prefixed with zeros, like

and permuting it in every possible way, excluding overlaps. Now, there are permutations of , but of those are zero-overlapped and of those are overlapped, so there are

unique permutations of a single word . How many possible s can we possibly make? Well, there are ‘s that we can fill out and each of these can take any value besides , so there are such words, therefore there are

words of length with exactly zeros.
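The symbols in the formula were lost in transcription, but the count being derived is presumably C(n, m) · (σ − 1)^(n−m): choose the m zero positions, then fill each remaining slot with any of the σ − 1 nonzero letters. A brute-force check, assuming for illustration an alphabet of size σ = 3:

```python
from itertools import product
from math import comb

# Assumption: alphabet {0, 1, 2}, so sigma = 3. The claimed count of
# length-n words with exactly m zeros is C(n, m) * (sigma - 1)**(n - m).
sigma, n = 3, 6
for m in range(n + 1):
    brute = sum(1 for w in product(range(sigma), repeat=n)
                if w.count(0) == m)
    assert brute == comb(n, m) * (sigma - 1) ** (n - m)
```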

### 2. The Computer Scientist’s (Convoluted) Approach

We can boil this down into a decision problem. Suppose we use the decision variable to denote the total number of words of length with exactly zeros, then we can consider a component-by-component decision problem. At each cell of the word, we can either fill in a , or a . Suppose we begin at the left end of the word, if we fill in a at the current slot, then there are words available to us (because we used up one slot, so decrease by 1, and we used up one zero, so decrease by 1), and if we fill in either a or a , then we still have zeros to choose from, so there are such words available to us. Combining the two together (in the spirit of dynamic programming), we get the recurrence
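The recurrence can be sketched directly in code; the alphabet-size symbol was lost above, so SIGMA = 3 here is an assumption for illustration:

```python
from functools import lru_cache
from math import comb

SIGMA = 3  # assumed alphabet size, for illustration only

@lru_cache(maxsize=None)
def f(n, m):
    # f(n, m) = number of length-n words with exactly m zeros
    if m < 0 or m > n:
        return 0
    if n == 0:
        return 1  # the empty word, which has zero zeros
    # either place a 0 (using up a zero), or one of the SIGMA - 1 nonzero letters
    return f(n - 1, m - 1) + (SIGMA - 1) * f(n - 1, m)

# the recurrence agrees with the combinatorial closed form
for n in range(10):
    for m in range(n + 1):
        assert f(n, m) == comb(n, m) * (SIGMA - 1) ** (n - m)
```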

If we make a table of this

Hmm, there doesn’t seem to be an obvious pattern here. Since we know that , let’s attempt to do this inductively on :

We know that

Great, let’s go on to the case

For the case, we would get

and inductively, because

we have that

which we derived earlier.

## Annoying Maclaurin Series

We can employ Taylor’s theorem, but to my knowledge, there’s no easy way to construct the derivatives of [that’s left to the more ambitious reader ;)]. Instead, we focus on classical algebraic techniques to carry us through the day.

Notice that

recall that all terms of the form , so this becomes

Okay, let’s suppose that , then the above tells us that

If we compute the first few out, we will see that it looks like

Why, this is the Fibonacci sequence! It seems that for the first few terms we looked at, obeys the Fibonacci recurrence

#### Lemma (Fibonacci Sequence)

For ,

Proof: Substituting the definition of in, we have

Finally, recall that , so

so we would expect

If we want a closed form, we can find that, letting

then

Finally, we can compute

## Infinity Norm of the Inverse of Lower Bidiagonal Matrices

### 1. Introduction

The infinity norm of a matrix is just the maximum absolute row sum of that matrix, or alternatively

so computing the norm shouldn’t be all that difficult. For a full-matrix , it should take approximately time to compute the sum of a single row, and such computations to get the maximum row sum for a total of time.
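In code, the definition really is this simple; a minimal sketch (the function name is mine):

```python
def inf_norm(A):
    # ||A||_inf is the maximum absolute row sum of the matrix
    return max(sum(abs(v) for v in row) for row in A)

# the second row has absolute sum |3| + |4| = 7, the first only |1| + |-2| = 3
assert inf_norm([[1, -2],
                 [3, 4]]) == 7
```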

If is also lower bidiagonal, then it would have the shape

that is, the diagonal and the first lower subdiagonal are filled, but everything else is zero. Being bidiagonal grants the matrix a lot of nice properties; for one thing, it only takes operations to get the row sum now, so it only takes time to find the infinity norm of a bidiagonal matrix. Similarly, as we shall see later, it also only takes time to solve a system of linear equations of the form

if is bidiagonal.

### 2. The Problem

Suppose you’re given a lower-bidiagonal matrix (defined above). How do you find the infinity norm of its inverse

One naive solution is to first compute the inverse explicitly, and then take the infinity norm of it. As we’ve hinted above, it takes time to solve a single system, then the system

can be solved as

where is the solution to the system

or, letting denote the solution of the system

and then we would look at the absolute row sum. The bottleneck here is the solves w.r.t the columns of the identity at cost each, so the entire process is expected to take time. This is actually pretty good, because the inverse takes to print out, so finding its maximum row-sum in this amount of time is pretty reasonable. But the world is a cruel place, and somewhere out there, a billion-variable dynamic program awaits us that needs to shave off every second possible.

Can we do better than time and space?

### 3. Solution Attempt

#### Exploring the Inverse

Let’s begin with a reasonable direction: let’s figure out what the inverse of a bidiagonal matrix looks like. We’ll mainly be looking at the case but generalizing for any . Suppose that looks like

Now, if we take the inverse of , we’re not guaranteed that its bidiagonal structure will be preserved, but we know that it will at least be lower triangular. Let’s, for the sake of being punny, denote , then we know that

let’s start writing out the individual equations generated by this system:

Inductively, we can show that

furthermore, the zero equations can be inductively generalized as

Suppose that , then we know that . We also have an equation:

so that

and so on, we can compute every gamma with the recurrence

Great, this gives us a nice algorithm to fill out the entries of the inverse in time, but that doesn’t really help us.

Now,

so it doesn’t matter what the sign of is. Now, the s are entirely multiplicative, and hence their magnitudes are unchanged if the signs of any of or are flipped. As such, we get the following lemma:

##### (3.1.1): Invariance

Suppose that such that , then

this basically says that we are allowed to arbitrarily flip the signs of the entries of without changing the value of . This follows from the fact that the entries of the inverse, , depend only multiplicatively on the entries of so flipping the signs of the elements of will only flip the signs of , but the magnitudes of each element of will remain the same.

In other words, is invariant under transformations during which the signs of are flipped. This is going to come in very handy, and the above recurrence for is constructive in that it gives an algorithm for which elements of to flip to reduce the problem into an easier space.

#### Row Summation as a Linear Operation

Now, consider the operation

it seems that multiplying any matrix by the column of ones ( from here on) will return a column of row sums. Great! Shouldn’t it then be the case that we just need to find the maximum column of ? Unfortunately, what we really want is the sum of the absolute values of each row, so if ever contains a negative number, this will break :(

Some of you may already see how to resolve this now; but for the moment, let’s just check whether it’s actually possible to at least do this row-sum operation in linear time! The problem is solving , or equivalently, finding an such that

here

so it takes time to solve this system. If only we could just do this…

#### A Final Observation

By now, you might be a little perplexed by lemma , which we took a lot of pain and agony to develop but never used. The problem with maximizing the column of is that the elements of aren’t guaranteed to be positive. However, and its construction seem to say that if we use the correct signs in the elements of , then we might be able to manipulate the signs of the elements of without disturbing its final infinity norm. At this point, I had an idea:
can I somehow force all of the entries of the inverse to be positive by flipping the signs of the right elements? If we can, then from we can generate a whose inverse only contains positive numbers, so

which can be done in time! This is quite exciting, but how do we do it?

Recall that in the proof of , we used the recurrence for the elements of the inverse as

Now, suppose that for some , we know immediately that , so the diagonals of must be positive. By the strong induction hypothesis, let’s assume that for any , , then for to also be positive, it better be the case that

so

so must be negative in , which gives a construction of

### 4. Solution!

Taking the above into consideration, the following algorithm can be used to compute the infinity norm of the inverse.

diagonal of L = abs(diag(L));
subdiagonal of L = -abs(diag(L, -1));
x(1) = 1/L(1,1);
for k = 2 to n
    x(k) = (1 - L(k,k-1)*x(k-1))/L(k,k);
end
return the maximum of x
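As a sanity check of the whole pipeline, here is a Python sketch (the function names are mine) of the O(n) method tested against a naive O(n²) construction of the inverse:

```python
import random

def inv_inf_norm_fast(d, s):
    """O(n): ||inv(L)||_inf for lower-bidiagonal L with diagonal d and
    subdiagonal s, via the sign-flipping trick plus one forward solve."""
    d = [abs(v) for v in d]        # flipped so every entry of inv(L) > 0
    s = [-abs(v) for v in s]
    x = [0.0] * len(d)
    x[0] = 1.0 / d[0]
    for k in range(1, len(d)):
        x[k] = (1.0 - s[k - 1] * x[k - 1]) / d[k]
    return max(x)

def inv_inf_norm_naive(d, s):
    """O(n^2): build inv(L) column by column by forward substitution,
    then take the maximum absolute row sum."""
    n = len(d)
    rowsum = [0.0] * n
    for j in range(n):
        col = [0.0] * n
        col[j] = 1.0 / d[j]
        for i in range(j + 1, n):
            col[i] = -s[i - 1] * col[i - 1] / d[i]
        for i in range(n):
            rowsum[i] += abs(col[i])
    return max(rowsum)

random.seed(0)
for _ in range(100):
    n = random.randrange(1, 8)
    d = [random.choice([-1, 1]) * random.uniform(0.5, 2) for _ in range(n)]
    s = [random.uniform(-2, 2) for _ in range(n - 1)]
    assert abs(inv_inf_norm_fast(d, s) - inv_inf_norm_naive(d, s)) < 1e-9
```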


## Limit Preserving Functions in CPOs

A complete partial order (hereon cpo) is a partial order such that all monotone chains

have a least upper bound

A continuous function between two cpos and is one that preserves the limit/least upper bound on all monotone chains:

#### Theorem (Monotonicity)

All continuous functions are monotone in the sense that .

Proof: For the sake of deriving a contradiction, suppose that there exists some non-monotone continuous function for some pair of cpos and . Then it must be the case that for some pair of elements , but . Now, consider an -chain

that is -terminal and monotone. Obviously, its least upper bound , but

where

because otherwise . Hence, we can show that

and hence cannot be continuous; a contradiction! Therefore, all continuous functions are monotone.

## Introduction to Scientific Computing: Error Propagation

There was recently a good article on scientific computing, loosely defined as the dark art (as it may seem to the uninitiated) of deriving solutions to equations, dynamical systems, and whatnot that would have made your Mechanics professor scream in horror at the thought of having to somehow “solve” these systems. Of course, in the rich computer world of today, almost any problem imaginable (exaggeration, of course!) can already be solved by some existing tool. Hence, more and more, the focus shifts from “how do I solve this differential equation” to “what do I ask google?”

My dad once told me of a glorious time “back in [his] days” when every respectable engineering institution would have a crash course on this dark art on top of Fortran. Of course, my dad is only 43, and that was only 19 years ago.

Even now, when computer science departments everywhere no longer believe in the necessity of forcing all of their graduates to have a basic grasp of numerical analysis, there is still some draw to the subject that makes people either deathly afraid of it or embrace it as their life’s ambition. I am personally deathly afraid of it. Even so, there are quite a few cute gems in the field, and as such, I am still very much attracted to it.

Scientific computing is the all-encompassing field involving the design and analysis of numerical methods. I intend to start a survey of some of the basic (but also most useful) tools, such as methods that solve linear and nonlinear systems of equations, interpolate data, compute integrals, and solve differential equations. We will often do this on problems for which there exists no “analytical” solution (in terms of the common transcendental functions that we’re all used to).

### 1. Safety First

In an ideal world, there would be a direct correspondence between numerical algorithms and their implementations. Everything would work out of the box, and there would be no need to worry that, even if you’ve implemented the on-paper algorithm correctly, it would somehow behave “differently”.

Of course, this isn’t the case. We’ve all heard the age-old saying that computers are finitary, and therefore cannot represent all real numbers; specifically, there’s no way to represent the irrationals, and in most cases we do not fully represent the rationals either. Instead, we round numbers to a certain digit.

On the surface, this doesn’t seem too unfortunate. You probably did the same in your Chemistry lab report, to even more horrendous precision than what your computer will likely do for you. If you aced your Chemistry lab, then this will likely seem like a perfectly good scheme. But if you, like me, were troubled by the “sig fig” rules, then you are probably a little wary.

In fact, more often than not, you will not be bothered by the lack of a full spectrum of real numbers to choose from. However, when these nasty little “roundoff” errors are the culprit, they are often resolved through hours upon hours of debugging and a general sense of hopelessness. To inherit a roundoff bug from someone else is like contracting the Spanish flu: either you know what you’re doing and your system (with a little bit of agonizing) successfully resolves it, or you just won’t have any idea what to do (and die, in the Spanish flu metaphor). Think of this article as a vaccination against the roundoff bugs :)

### 2. A Modern Day Little Gauss Story

Suppose little Gauss lived in the modern age. little Gauss’ teacher wanted to surf the internet, so he assigned all of his students the following integral to evaluate:

Being the clever alter-ego of the boy who immediately saw , little Gauss constructed a sequence

and with a little bit of clever manipulation (integration by parts), he found that, using

and

Why, this is a simple recursive function! Little Gauss was absolutely thrilled; he had at his disposal a programmable calculator capable of running Python (because he’s Gauss, he can have whatever the fuck he wants), and he quickly coded up the recurrence:

Python Code

import math
def s(k):
    if k == 0:
        return 1 - math.exp(-1)
    else:
        return 1 - k*s(k-1)


He verified the first few cases by hand, and upon finding his code satisfactory, he computed the 100th element of the sequence as -1.1599285429663592e+141 and turned in the first few digits.

His teacher glanced at his solution, and knowing that there’s no way little Gauss could have done his school work with such proficiency, immediately declared it wrong. Upon repeated appeal, the teacher finally relented and looked up the solution in his solution manual and, bewildered… again told little Gauss that he was WAAAAY off. Unfortunately for our tragic hero, this would not be a modern day retelling of a clever little boy who outsmarted his teacher. No, this is a tragic story of a clever little boy who succumbed to a fatal case of the roundoff bugs.

#### At a Brief Glance

In fact, had his teacher been a little bit more clever, he would have been able to immediately see why Gauss’ solution wouldn’t have worked. It’s immediately obvious that between and , is always positive.

Furthermore, if a function is positive inside an interval, and suppose is also positive inside the same interval but everywhere smaller than , then obviously the area under must be bigger than that of . Similarly, if is everywhere larger than , then the area under must also be larger than that of . Now suppose , then because inside , then inside , and

or

and in general

of course can’t be on the order of !

So what went wrong? Well, whenever you’re in trouble, just make a plot!

Hmm. Everything seems to be going fine until around .

It turns out that there was nothing wrong with little Gauss’ method and the integral is perfectly well-behaved. The culprit lies in the fact that can only be represented approximately on his calculator. As we know already, is irrational, and cannot be represented in a finite amount of memory. is only represented approximately, slightly perturbed, so that to the computer, we’re actually giving it an initial perturbed by some small amount (think of it as a really really really tiny number). Now, let’s see what Gauss’ calculator is computing once we unravel the recursion (we’ll use the notation to mean the calculated value of on the calculator):

Oh god! No wonder the method produced the wrong answer: the slight perturbation in the computed value of “propagates” throughout the computation and at the step, manifests itself as -factorial times that original perturbation . Even at , we will see around fudged into the calculation. For , the answer will have an additional factor of ! Little Gauss definitely should have learned about round-off errors.
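We can watch this blow-up happen numerically. A short sketch (the recurrence is repeated so the snippet is self-contained):

```python
import math

# little Gauss' recurrence, exactly as before
def s(k):
    if k == 0:
        return 1 - math.exp(-1)
    return 1 - k * s(k - 1)

# The true integral always lies strictly between 0 and 1, yet the computed
# sequence escapes that interval once k! has amplified the initial
# representation error of exp(-1) past unity.
assert 0 < s(5) < 1          # k! * eps is still around 1e-14 here, fine
assert not (0 < s(30) < 1)   # 30! * eps is astronomically large: garbage
```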

### 3. Some Basics – Errors

Before we dig into the floating point encoding underlying most modern computing platforms, let’s talk about errors. Suppose that we’re computing the value of something and the true value of that computation is “stored” in a variable , but our computation is inexact, and in the end, we shall put that inexact result in a “hatted” version of the known result, that is .

Since there’s this notion of inexactness, we can talk about the “error” of our computation. There are two kinds of errors:

Absolute Error
This is just the difference between the true value of the computation and the inexact value . We shall define this as

note that ( subscript ) can be taken to mean the error of the computation of .
Relative Error
This is the ratio of the absolute error to the true value of the computation, or in other words

we read to mean the relative error in the computation of .

First, we can’t compute the absolute or relative errors, because if we could, then we would have known the true value of the computation already! Errors are purely analytic objects that help us determine how well-behaved our computations are. Second, in the analysis of floating point roundoff, we will typically exclusively use relative error. This may seem counter-intuitive, but it has a few nice properties that simplify error analysis, as we will see.

One more thing to add: if we allow to have any sign, then through some simple algebra, we will find that

#### Error Propagation

Suppose that, through some series of computations, we have the computed values of and . Now, in the next instruction, we wish to compute the value of . Because we already have the (potentially inexact) computed values and , it seems natural that to compute , we would just add :

Now, suppose that and similarly for , then it seems that

now, if we were doing error analysis, then we would want the relative error of , which from our earlier notation we already called . To find this, we merely need to solve the above equation for :

distributing both sides, we get

canceling the from both sides, we end up with

which gives .

Wow, what a mouthful. This defies intuition, as you would expect error to accumulate additively. Of course, this is true of the absolute errors:

but this no longer holds when you consider the relative error.

So why use relative error at all for analysis? It turns out that while absolute error is convenient here, it becomes less tractable when reasoning about roundoff. Whenever you do an addition operation in floating point, you accumulate a small bit of absolute error from that operation itself! This absolute error unfortunately depends on the value of the computation itself, much like relative error does, so sooner or later you’re going to need to start reasoning about relative error. Worry not, we will develop a systematic formula for reasoning about the propagation of relative error that boils down to high-school-level algebra (and some calculus).

Now, we’ve done addition. What about subtraction? You should easily verify for yourself that

where the relative error is defined as

Let’s now derive the propagated relative error of multiplication:

again, solving for boils down to solving the equation above

assuming that neither nor are 0, we can divide out the to get

which tells us that

It is traditional to assume that the relative errors of your computations are small (), for otherwise, there’s no point in computing a value if you can’t even get one digit correct! Therefore, if is tiny, and is tiny, then is insignificant in comparison. Therefore, we typically discard “higher order” terms. In effect, we just say that

I’ll leave it to you to verify that division propagates as
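All three propagation rules are easy to verify numerically by injecting known relative errors; this sketch assumes the standard first-order forms (addition: (x·δx + y·δy)/(x + y) exactly; multiplication: δx + δy; division: δx − δy), and the tolerances leave room for the discarded second-order terms:

```python
x, y = 3.0, 2.0
dx, dy = 1e-8, -2e-8                 # injected relative errors
xh, yh = x * (1 + dx), y * (1 + dy)  # "computed" values

def rel_err(approx, true):
    return (approx - true) / true

# addition: the exact formula (x*dx + y*dy)/(x + y)
assert abs(rel_err(xh + yh, x + y) - (x*dx + y*dy)/(x + y)) < 1e-13

# multiplication: dx + dy to first order (the dx*dy term is ~1e-16)
assert abs(rel_err(xh * yh, x * y) - (dx + dy)) < 1e-13

# division: dx - dy to first order
assert abs(rel_err(xh / yh, x / y) - (dx - dy)) < 1e-13
```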

#### Arbitrary Differentiable Function

The propagation schemes we’ve talked about so far are all fine and dandy, but what if we already have and we want to compute for some arbitrary but [twice] differentiable function ?

Well, let’s just call with as the argument then!

now, we can do some algebra and get

but we can no longer use our typical algebraic tools to solve the above equation for , since could be anything! Whatever will we do?

If you’ve had a little bit of calculus, then the section heading should probably give the answer away. We’ve already reasoned earlier that we don’t care about second-order error terms because they’re so insignificant. But then, we can express as a polynomial in the error term through a Taylor series centered around !

if is twice differentiable, then is bounded by some constant, and that second-order term tends to , which we can disregard (with some skepticism about how nonlinear/curvy is around the neighborhood of ). Using the first-order Taylor approximation as the right hand side, we can rewrite the above equation as

which, as long as , gives

Now, the restriction might seem a bit unfair, but recall that if , then we won’t have any digits of accuracy, so in effect, the relative error is in fact .
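We can check the first-order rule numerically. This sketch assumes the derived form (relative error of f(x) is approximately (x·f'(x)/f(x)) times the relative error of x), using f = sin as an example:

```python
import math

x, dx = 2.0, 1e-6              # true value and injected relative error
f, fprime = math.sin, math.cos

xh = x * (1 + dx)              # "computed" input
actual = (f(xh) - f(x)) / f(x)
predicted = x * fprime(x) / f(x) * dx

# agreement up to the discarded second-order (curvature) term, ~1e-12 here
assert abs(actual - predicted) < 1e-10
```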

#### Table of Error Propagation

Summarized, suppose that we want to run the computation , but we have inexact computed values and each with relative error and respectively, then the computation will also be inexact, and its relative error will be

### 4. Example

Suppose that we’ve computed with relative error and with no error. Furthermore, suppose that the true value of and . What will be the computed error of ?

We’re looking to compute

Now, we need to figure out a few things:

#### 1.

We can simplify this to , but even then, we’re still going to take a first-order Taylor expansion to get

Since we’re looking for the relative error, we need to factor a out to get

Now, we’re just left with

#### 2.

From our table of error propagation above, we see that

so we’re just left with

#### 3.

From our table of error propagation, we see that

which finally simplifies our computed value as

so that the final relative error is . Note that the final relative error isn’t just , because we need to also take into account the error of computing . Also note that we are not working with above. To see this more concretely, we are essentially looking for in the following system

which gives the same solution .

This concludes our brief overview of error propagation. We’ll start on IEEE floating point encoding of rational numbers and how to avoid errors in computer arithmetic next time.

## Subtype Ambiguity in Java

true ? 1 : "Hello"


Take a look at the Java expression above. Off the top of your head, can you tell me if it will type check? If so, what will be its type?

If you had asked me this question out of the blue, I would have probably said no; I would reason it as follows: this is a Java expression, so it must be assigned a type, meaning that we’re given a system of equations saying that the type of 1 and the type of "Hello" must be equal; this is clearly not the case as one is an Integer and the other a String, so this can’t possibly type check!

Of course, I’m actually just suffering from a common case of overlooking the obvious. It turns out that I’m forgetting about the whole subtyping thing built into Java and that I can actually assign Object o = true ? 1 : "Hello";, because the type Object sits at the top of the type ordering.

### 1. Subtyping as Inequalities

Since we’re speaking of subtyping as a type relation anyways, let’s give it a symbolic name to make it easier for us: we’ll let and range over types in Java, and we’ll say

to mean that is a subtype of . Under one interpretation, we can think of this as saying that if we’re expecting a somewhere, we should theoretically be allowed to plug in a in its place and not expect weird or undefined behaviors as a result. Another perhaps more practical view is to claim that there exists some function called the coercion function (which we will denote, for the sake of historical consistency, ) that, given some subtype relation , will be able to translate an object of type into an object of type that behaves “similarly”.

Anyways, it’s obvious that both and (since everything’s a subtype of Object), so we can draw a subtyping hierarchy as a tree:

and we can type check true?1:"Hello" : Object. This seems to suggest that we can typecheck all instances of to some type if we can typecheck to , to , and . In other words, is an upper bound of the types of and .

Of course, upon closer inspection, it turns out that both Integer and String share another common supertype: Serializable.

So just as well, we could claim that the ternary expression above extends Serializable. But if we want to “assign” a type to an expression, we don’t just want its supertypes.

### 2. Most Precise Supertype

Looking at the type hierarchy above, it’s clear that the lower a type is on that hierarchy, the more “precise” it is. When we say that an expression has a type, we don’t mean that it is bounded above by some set of super types; we actually mean that its most “precise” type is so and so. For example, we could just as well draw the type hierarchy for Integers as

and claim that a is a Serializable; but when I typically talk about the type of the expression , I’m usually talking about Integer, not Serializable.

From this, it’s clear that we also want a way to find the most precise (smallest) type of true ? 1 : "Hello" such that it is also an upper bound of both 1 and “Hello”; this is called the least upper bound, and we will for historical reasons call this the “join” of two types.

Symbolically, if we want to find the most precise upper bound (join) type of two types and , we will denote it as such that

the reason that the least upper bound is usually denoted with a square union symbol is actually somewhat natural: the union of two sets is the least upper bound of those two sets in the partial order of set inclusion; convince yourself that this is actually the case.
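The set-union claim is easy to check exhaustively on a small universe:

```python
from itertools import combinations

A, B = {1, 2}, {2, 3}
U = A | B          # the union, claimed to be the least upper bound
universe = {1, 2, 3, 4}

# U is an upper bound of both A and B under set inclusion...
assert A <= U and B <= U

# ...and every other upper bound contains U, so U is the *least* one
for r in range(len(universe) + 1):
    for C in combinations(universe, r):
        C = set(C)
        if A <= C and B <= C:
            assert U <= C
```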

Note that whenever we join two types, the resulting type is “less precise”. We are after all finding an upper bound!

Going back to the problem at hand, the entire point of this exercise is then very simple: we want to find the least upper bound of Integer and String:

We’ve already seen earlier that both Object and Serializable are upper bounds of both Integer and String, but we also know that , so Serializable is a more precise (less) upper bound than Object is, but is Serializable the most precise?

In order to answer this question and find the least upper bound, we’re going to need the complete subtyping hierarchy. Consulting the Java documentation, we find that Integer is a subtype of Number, which is Serializable, and Integers are themselves Comparable with other Integers. Furthermore, Strings are also Serializable and Comparable with other Strings! This gives the slightly more convoluted typing hierarchy:

### 3. Problems

At first glance, it doesn’t seem possible that there’s a least upper bound: it seems like both Comparable and Serializable are the most precise upper bounds. Furthermore, this isn’t even necessarily true, as Comparable itself might be too imprecise.
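To see the two incomparable minimal upper bounds fall out mechanically, here is a sketch over a simplified, hypothetical slice of the hierarchy (generics erased; the edges are read off the discussion above, not the real JDK):

```python
# simplified subtype edges: type -> list of direct supertypes (an assumption)
SUPER = {
    "Integer": ["Number", "Comparable"],
    "String": ["Serializable", "Comparable"],
    "Number": ["Serializable"],
    "Serializable": ["Object"],
    "Comparable": ["Object"],
    "Object": [],
}

def ancestors(t):
    # t together with all of its transitive supertypes
    seen, stack = {t}, [t]
    while stack:
        for s in SUPER[stack.pop()]:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return seen

def minimal_upper_bounds(a, b):
    ub = ancestors(a) & ancestors(b)
    # keep only upper bounds with no other upper bound strictly below them
    return {t for t in ub if not any(u != t and t in ancestors(u) for u in ub)}

# two incomparable minimal upper bounds -- no least upper bound in this poset
assert minimal_upper_bounds("Integer", "String") == {"Serializable", "Comparable"}
```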

#### Wildcards and Bounded Quantifiers

Consider the constraint

now, we know that

so we can rewrite this as

by plugging in into the in . (Wildcards, amiright?) Similarly, we can say the same about integers, so we get

but we , so a more precise upper bound of is . But now, we can construct a sequence

each of which is a more precise approximation than the previous, which undoubtedly converges to the most precise type; and here’s the kicker: that type cannot be expressed finitely!

So how does Java handle these types? Let’s try it out. I wanted Java to output the internal type representation, so I needed to trigger a type-mismatch error at compile time.

Comparable Type

It seems as if Java is reporting the type of the two as an unbounded quantification: just Comparable&lt;?&gt;. This is intentional: Java treats the ? extends ... as a constraint (or a system of inequalities) and sets it aside during type checking and coercion in order to use it later solely for typechecking, so we seem to have solved the problem for now (so even though , we’re going to treat them as if they weren’t).

#### No least upper bound?

But we still have another problem: there is no least upper bound in the above hierarchy! Again, let’s see what Java does:

Intersection Type

It turns out that the least upper bound is just the “intersection” over the three types, which should then be the LEAST upper-bound of all three.

## Neat way of counting OCaml objects satisfying certain types

Full Disclosure: this isn’t mine. I got the idea from http://www.youtube.com/watch?v=YScIPA8RbVE and the book Analytic Combinatorics by Philippe Flajolet and Robert Sedgewick. I just thought the entire concept is really really neat.

If you’re like me, you spend your flight pondering the deepest and darkest corners of human knowledge, like

1. was the cashier flirting with me? (she wasn’t)
2. which one is better, the integral or the derivative? (definitely the derivative)
3. are we there yet? (no.)
4. are airlines the mortal enemies of tall people? (they are)

Of course, after a day of mentally exhausting travel, I’d like to sit back, relax, and ask myself how, given a specific type declaration in OCaml, I can figure out the number of all possible instances of that type.

For example, there is just one instance of the type unit (kind of funny because its name is unit) corresponding to the object in . There are two distinct instances of the type bool corresponding to the set , and about distinct instances of the type int.

### 1. Types in

Now, before we go on any further, let’s first take a look at the syntax of types.

 type suit = Diamond | Club | Heart | Spade 

this type declaration says that we want to create a new type whose instances are either labeled Diamond, Club, Heart, or Spade. The labels are capitalized to mean that they can serve as constructors of the type. So if you type Diamond into the OCaml toplevel, it’ll construct an object of type suit corresponding to the Diamond label. Note that Diamond isn’t itself a type here.

It’s quite obvious that there are 4 distinct instances of the type suit that we can construct, namely .

We can also give each type constructor some extra stuff. For example, if we want types corresponding to the individual cards rather than suits, we can write something like

 type card = Diamond of int | Club of int | Heart of int | Spade of int 

so that we can construct an instance of the card Jack of Clubs as Club 11. Here, there are different instances of the type (we’re only familiar with 52 of these). Alternatively, we can also write each card as a pair of a suit and an integer.

 type card = Card of (suit * int) 

here, the notation suit * int stands for types whose objects are pairs, the first element of which is an instance of a suit and the second an integer; for example, Card (Club, 11) would be the jack of clubs.

One really interesting thing to observe here is that the two different ways of writing the card type above should be equivalent. In each, we are specifying a suit and also a value. Therefore, if there are $n$ distinct instances of the first type, there should also be $n$ instances of the second type, and it turns out, there are.
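As a quick sanity check, here is a Python sketch that enumerates both encodings, using a 13-value range as a stand-in for int so the sets stay finite:

```python
from itertools import product

suits = ["Diamond", "Club", "Heart", "Spade"]
values = range(1, 14)  # a small stand-in for int, so we can enumerate

# First encoding: Diamond of int | Club of int | ... (a tagged sum of four copies)
tagged = [(s, v) for s in suits for v in values]

# Second encoding: Card of (suit * int) (a single pair type)
pairs = list(product(suits, values))

print(len(tagged), len(pairs))  # both 4 * 13 = 52
```

Both encodings enumerate to exactly the same count, as the argument above predicts.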

Furthermore, if we want to “parameterize” types to make them more general, we can do so by introducing type variables (variables that correspond to types) such as 'x, written with a leading single apostrophe. So, for example, if we want to build a linked list of whatever type we want, we could do something like

 type 'x list = Null | Cons of ('x * 'x list) 

where the list [1; 2; 3] of type int list (so we make the substitution 'x = int) is constructed from Cons(1, Cons(2, Cons(3, Null))).

A binary tree can be built up like

 type 'x tree = Leaf | Node of ('x tree * 'x * 'x tree) 

where the tree with 1 at the root, a left child holding 2, and a right child holding 3 (which in turn has children holding 4 and 5) has type int tree (so we make the substitution 'x = int) and is constructed from

Node(
  Node(Leaf, 2, Leaf),
  1,
  Node(
    Node(Leaf, 4, Leaf),
    3,
    Node(Leaf, 5, Leaf)
  )
)


### 2. Counting instances

#### One or unit

Now, we get to the interesting part. As discussed previously, we have already established that there’s only one object of the unit type (just ()) and two (True and False) of the boolean type. Of course, there’s also only one object of the type

 type one = One 

and in fact, we can rename the constructor to anything (assuming that we don’t care about it being valid OCaml code), including the unit object (), so it turns out that one is equivalent to all types with a single constructor, and equivalence isn’t affected by renaming of the constructor (well, there are certain rules to this renaming/substitution, and we’ll call this property equivalence). But is this the only operation that preserves the count of the number of instances? Well, no; as it turns out, the following types all have exactly one instance each.

 type one' = One' of unit 
 type one'' = One'' of one' 
 ... 

this is intuitive: if we have only one possible object for the argument of the constructor, and we have only that one constructor, then we can only construct one object of that type. In fact, from now on, we will say that an argument-less type constructor C is just syntactic sugar for C of unit, because desugaring this way does not change the number of instances of the type.

#### Two or bool

We’ve also established that there are two instances of the booleans, True and False. In type parlance, this would be

 type bool = True | False 

but remember earlier that we said if we give a type constructor not expecting any argument the unit argument, the type’s size is preserved. Therefore, we can rewrite the above as

 type bool = True of unit | False of unit 

and since renaming things doesn’t affect the size of the set of objects satisfying this type, this is also equivalent to

 type two = One of unit | Two of unit 

Neat. Let’s count up one more.

#### Three

There’s no native type in OCaml that contains only three instances, but it’s not too hard to guess how we’re going to construct one.

 type three = One of unit | Two of unit | Three of unit 

Here, we can have One (), Two (), or Three () as instances of the type three. This gives a natural meaning to the notion of “or”. If I have types $a$ and $b$, then the type

$a + b$ (in OCaml, type sum = A of a | B of b)

can either be one of the $a$s or one of the $b$s, but since the two cases are necessarily distinct (because the labels $A$ and $B$ differ), then if there are $m$ objects of type $a$ and $n$ objects of type $b$, there must be $m + n$ objects of type $a + b$.
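To see the $m + n$ count concretely, here is a small Python sketch that builds a sum of a three-instance type and a two-instance type by tagging each side (the tags play the role of the constructor labels):

```python
a = ["a1", "a2", "a3"]  # a type with m = 3 instances
b = ["b1", "b2"]        # a type with n = 2 instances

# Tag each side so the two cases stay distinct, just as constructor labels do.
a_plus_b = [("Left", x) for x in a] + [("Right", y) for y in b]

print(len(a_plus_b))  # m + n = 5
```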

It’s no coincidence either that the type $a + b$ is called a sum type of $a$ and $b$ (albeit for different reasons). Now that we’re warming up to the notion that we can use these types almost as if we were doing high school algebra, let’s make some simplifications:

Since we do not ever care about what the labels themselves are, only that they are distinct, we can leave the labels of the type constructors out entirely. Therefore, we can write suit (type suit = Diamond of unit | Club of unit | ...) as

$1 + 1 + 1 + 1$

and the card type as either

$\text{int} + \text{int} + \text{int} + \text{int}$

or as

$\text{suit} \times \text{int}$

where the $\times$ means the objects are of a pair type (with both a first and a second element). Note that we can also substitute suit directly into this type, as in the second declaration of card:

$(1 + 1 + 1 + 1) \times \text{int}$

it’s no coincidence that there are also $4 \times 2^{63}$ objects of type $(1 + 1 + 1 + 1) \times \text{int}$, but it’s also obvious that the two types are not equal syntactically. That is, $\text{int} + \text{int} + \text{int} + \text{int} \ne (1 + 1 + 1 + 1) \times \text{int}$.

Of course, it’s useful to construct another equivalence relation whereby type $a$ is translationally equivalent to type $b$ if both have the same number of instances. In that case, we’ll say $a \cong b$. As you have probably already guessed, we’re going to define some kind of a translation from these types into numbers denoting the number of objects of that type. We’ll go into more detail later, but needless to say, we would translate unit as $1$ and bool as $2$.

#### Four

Okay, at this point, we kind of already know what we are after. But first, I want to establish some basic laws seen in school algebra.

Recall that

$\text{bool} \cong 1 + 1 \cong 2$

and that

$\text{three} \cong 1 + 1 + 1 \cong 3.$

Now, obviously,

$\text{four} \cong 1 + 1 + 1 + 1 \cong 4,$

but let’s look at something else. Suppose we have a pair type four' such that

$\text{four}' = \text{bool} \times \text{bool};$

now, there are again four instances of four'; they are

(True, True), (True, False), (False, True), (False, False),

so it seems as if $2 \times 2 \cong 4$ as well. In fact, it’s not hard to show that product types correspond to multiplication in the same way that cartesian products correspond to multiplication of cardinalities. (In fact, this question reduces to the cardinality-of-sets case.)
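A quick Python check of the product rule: the pairs drawn from two two-instance types enumerate to exactly $2 \times 2 = 4$:

```python
from itertools import product

bools = [True, False]

# Instances of bool * bool are exactly the elements of the cartesian product.
four_prime = list(product(bools, bools))

print(four_prime)       # the four pairs
print(len(four_prime))  # 2 * 2 = 4
```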

#### Eight

Just to get to the point: how many functions are there whose domain is of size $3$ and whose codomain is of size $2$? For example, suppose that I have the type

$\text{three} \to \text{two}$

where $a \to b$ is taken to mean the type of functions that map a type $a$ into a type $b$. Well, in general, suppose I have a function $f : a \to b$ where $a$ has $m$ instances and $b$ has $n$; then it turns out that there are $n^m$ such functions: reduce this problem to counting in base $n$.

Suppose I have $m$ elements in $a$ that need to be mapped into $b$. Suppose both $a$ and $b$ are finite and are ordered into the sequences $a_1, \dots, a_m$ and $b_1, \dots, b_n$ respectively; then I can encode one such function as

$\{a_1 \mapsto b_1,\ a_2 \mapsto b_1,\ \dots,\ a_m \mapsto b_1\}$

and a second function as

$\{a_1 \mapsto b_2,\ a_2 \mapsto b_1,\ \dots,\ a_m \mapsto b_1\},$

but since we don’t really want to clutter up the notation with all of the braces, the subscripts, and everything, we could just encode each function as a stream of digits, where the $k$-th digit $d_k = i$ records the element $b_i$ to which $a_k$ maps. Then, we can just write each function as

$d_1 d_2 \cdots d_m, \qquad d_k \in \{1, \dots, n\},$

but this is the same as counting up to $n^m$ in base $n$.

Anyways, from this argument, we would expect $\text{three} \to \text{two}$ to have $2^3 = 8$ distinct objects, and hence $a \to b \cong b^a$.
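We can enumerate these functions directly in Python: each function is one choice of output per input, i.e. a 3-digit number in base 2:

```python
from itertools import product

three = ["One", "Two", "Three"]
two = ["True", "False"]

# A function three -> two assigns one element of `two` to each element of
# `three`; enumerating all assignments is exactly counting in base 2.
functions = [dict(zip(three, outputs))
             for outputs in product(two, repeat=len(three))]

print(len(functions))  # 2 ** 3 = 8
```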

### 3. The Translation

Let’s now define a translation between a simplified type declaration and a number denoting the total number of objects with that type declaration. We’ll write this relation $\tau \leadsto n$. For now, pretend that the codomain $\mathbb{D}$ of this translation is the naturals $\mathbb{N}$, but this will cause some subtle issues later on, because we might not be able to translate certain types (like the list) directly into $\mathbb{N}$. Furthermore, $\leadsto$ is almost a function, but as discussed a few seconds ago, if we let $\mathbb{D}$ be the naturals, it will not be able to translate certain types, so at best, it is a partial function $T$. In that case, we will use the syntax $\tau \leadsto n$ to mean that $T(\tau) = n$.

To define this translation, let’s do it “recursively”. The validity of such a technique is beyond my scope, but convince yourself that because the right hand side is always translating smaller and smaller pieces, the translation of any arbitrary type will eventually terminate. Here, I will use $\tau$ to mean some type expression; that is, we will use it as a metavariable over the domain of types. Assume the same precedence levels for $+$, $\times$, and powers as in school algebra:

$\text{unit} \leadsto 1$

$\tau_1 \times \tau_2 \leadsto n_1 \times n_2$ whenever $\tau_1 \leadsto n_1$ and $\tau_2 \leadsto n_2$

$\tau_1 + \tau_2 \leadsto n_1 + n_2$ whenever $\tau_1 \leadsto n_1$ and $\tau_2 \leadsto n_2$

$\tau_1 \to \tau_2 \leadsto n_2^{n_1}$ whenever $\tau_1 \leadsto n_1$ and $\tau_2 \leadsto n_2$

Next, we define a notion of translational equivalence.

#### Translational Equivalence

note that $\cong$ is a weaker notion than the syntactic equality $=$; therefore, if $a = b$, then so too must $a \cong b$, owing to the fact that the translation is a (partial) function. Furthermore, because of the construction of the sequence

$1,\ 1 + 1,\ 1 + 1 + 1,\ \dots$

$\leadsto$ is obviously onto $\mathbb{N}$. However, because we need to have this notion of $\cong$ at all, it’s also very obvious that two different type expressions can have the same number of instances, and hence $\leadsto$ is not one-to-one.

### 4. Examples

We’re going to take some extraordinary risks here (because we’re not mathematicians) and assume that this translation is correct (there are, in fact, many problems). Let’s do something with this new translation of ours.

#### Distributivity

in fact, under translational equality, we can manipulate types algebraically as if in school algebra; for example, distributivity gives

$a \times (b + c) \cong a \times b + a \times c.$

More amazingly, that means that we can encode values either way. Therefore, if we wanted to pass around an object of type $a \times (b + c)$, we can pass around an object of type $a \times b + a \times c$ instead!
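Here’s a Python sketch of the distributivity claim, enumerating both encodings over small stand-in types to confirm they have the same number of instances:

```python
from itertools import product

a = [1, 2]
b = ["x", "y", "z"]
c = [True]

# a * (b + c): a pair of an `a` and a tagged choice between b and c
lhs = list(product(a, [("L", v) for v in b] + [("R", v) for v in c]))

# a * b + a * c: a tagged choice between an (a, b) pair and an (a, c) pair
rhs = [("L", p) for p in product(a, b)] + [("R", p) for p in product(a, c)]

print(len(lhs), len(rhs))  # both 2 * (3 + 1) = 8
```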

#### Parameterized Types (Advanced, skip if you want)

Suppose we wanted to translate the list type above:

$\text{'x list} \leadsto\ ?$

unfortunately, we don’t have a direct definition of 'x list. Rather, we merely have an equation that is expected to be satisfied in order for an object to be considered an instance of the type. This equation is

$L = 1 + x \times L$

(one case for Null, plus a head of type $x$ paired with a tail of type $L$). It turns out that $L$ is the least fixed point of the function $f(L) = 1 + x \times L$. This is going a bit beyond the scope of this post, but that least fixed point can be computed on the ordering of immediate sub-type expressions, which turns the sequence $\bot \sqsubseteq f(\bot) \sqsubseteq f(f(\bot)) \sqsubseteq \cdots$ into a complete partial order (complete because all subsets contain a least upper bound). Therefore, by the fixed point theorem, that least fixed point is the mysterious expression (where $\bigsqcup$ means the least upper bound)

$L = \bigsqcup_{k \ge 0} f^k(\bot).$

now, the translation function is “continuous” in the sense that it preserves least upper bounds on the CPO $(\mathbb{D}, \sqsubseteq)$. (Of course, this is why we can’t have $\mathbb{D} = \mathbb{N}$.) Loosely speaking, the operations can be substituted as

$\bigsqcup_{k \ge 0} f^k(\bot) \leadsto \sum_{k \ge 0} x^k,$

therefore,

$\text{'x list} \leadsto 1 + x + x^2 + x^3 + \cdots.$

unfortunately, if $x \ge 1$, the above sum diverges to the top element in $\mathbb{D}$, $\infty$.

However, an interesting question arises that this type of analysis can readily answer: how many instances of the list type are translationally equivalent to the $k$-tuple? In this case, it turns out that for whatever base type $x$, there are exactly $x^k$ such lists.
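This is easy to confirm in Python: the lists of length $k$ over a three-instance base type are exactly the length-$k$ tuples, of which there are $3^k$:

```python
from itertools import product

x = ["a", "b", "c"]  # a base type with 3 instances

for k in range(4):
    lists_of_length_k = list(product(x, repeat=k))
    print(k, len(lists_of_length_k))  # 3 ** k lists of each length
```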

Of course, there’s a slightly easier hack, which comes from the above derivation directly (namely because the translation preserves monotonicity; verify this!).

#### Lists – the easier way

Why don’t we just translate the equation $L = 1 + x \times L$ directly?

We’ll get

$L = 1 + xL \implies L = \frac{1}{1 - x} = 1 + x + x^2 + \cdots$

In fact, we could even get this by solving the equation $L = 1 + xL$ for $L$ and expanding the resulting generating function.
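We can also run the fixed-point construction numerically: iterating $L \mapsto 1 + xL$ on truncated power series (a Python sketch, representing a series by its list of coefficients) converges to the all-ones series, i.e. one list shape of each length:

```python
N = 8  # keep coefficients up to degree N - 1

def shift(p):
    """Multiply a truncated power series by x (shift coefficients up)."""
    return ([0] + p)[:N]

one = [1] + [0] * (N - 1)

L = [0] * N  # start from the bottom (empty) approximation
for _ in range(N):
    # one application of f(L) = 1 + x * L
    L = [a + b for a, b in zip(one, shift(L))]

print(L)  # [1, 1, 1, 1, 1, 1, 1, 1]: the series 1 + x + x^2 + ...
```

Each iteration pins down one more coefficient, exactly mirroring the chain $\bot \sqsubseteq f(\bot) \sqsubseteq f(f(\bot)) \sqsubseteq \cdots$.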

Neat. However, this brings up a good question. Can all such equations be solved? Unfortunately, no.

#### Unheaded List – unsolvable

Let’s define a type

 type 'x u = Nil | Cons of 'x u 

(an “unheaded” list: a Cons cell that carries no head element). If we translate it, we get

$u = 1 + u;$

now, if our codomain of the translation is just $\mathbb{N}$, then we would be forced to abandon hope: no natural number satisfies this equation. However, what we really did was pack that codomain with an extra element $\infty$, which you can think of as $\lim_{n \to \infty} n$. What this can be interpreted as is that no matter what the type variable 'x becomes, the type 'x u is going to diverge.

However, using the above trick, we still get the expansion

$u = 1 + u = 1 + 1 + u = 1 + 1 + 1 + \cdots,$

which says that, at each size, there is exactly one instance of 'x u; each is translationally equivalent to $1$.

#### Binary Trees

Let

 type 'x tree = Leaf | Node of ('x tree * 'x * 'x tree) 

which under translation gives

$T = 1 + x \times T^2,$

and expanding the solution as a power series,

$T = 1 + x + 2x^2 + 5x^3 + 14x^4 + \cdots = \sum_{k \ge 0} C_k x^k,$

where $C_k$ corresponds to the Catalan numbers (i.e., we’re generating the sequence that counts the number of binary trees translationally equivalent to the $k$-tuple, which is the number of binary trees with $k$ nodes).
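As a final check, we can iterate $T \mapsto 1 + xT^2$ on truncated power series in Python (the same coefficient-list trick as before) and watch the Catalan numbers appear:

```python
N = 8  # keep coefficients up to degree N - 1

def mul(p, q):
    """Truncated power-series product."""
    r = [0] * N
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if i + j < N:
                r[i + j] += a * b
    return r

def shift(p):
    """Multiply a truncated power series by x."""
    return ([0] + p)[:N]

one = [1] + [0] * (N - 1)

T = [0] * N  # start from the bottom approximation
for _ in range(N):
    # one application of f(T) = 1 + x * T^2
    T = [a + b for a, b in zip(one, shift(mul(T, T)))]

print(T)  # [1, 1, 2, 5, 14, 42, 132, 429]: Catalan numbers C_0 .. C_7
```

The coefficient of $x^k$ counts the binary trees with $k$ nodes, just as the translation promised.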