0x5f400000: Understanding Fast Inverse Sqrt the Easy(ish) Way!

TL;DR – Start Here!

1. Let \g{f2b} be the float-to-bit function (that takes in a floating point number and outputs a 32-bit long representing its IEEE 754 encoding used on your everyday computer) and \g{b2f} be the bit-to-float function, then the fast inverse square root method can be rewritten as

    \[\sqrt{\tilde x^{-1}} \approx \g{b2f}(\g{0x5f3759df} - \g{f2b}(x)/2)\]

2. It turns out that \g{b2f}(\alpha + \g{f2b}(x)\times \beta) = C(x,\alpha, \beta) \times x^{\beta} for some non-linear function C(x,\alpha,\beta) \approx C(\alpha,\beta) that doesn’t vary a lot with x, so it becomes an optimization game to find \alpha such that C(\alpha, \beta) = 1.
3. We know that x^{\beta} = 1 when x = 1 no matter what \beta is, so if we want to find \alpha such that \g{b2f}(\alpha + \beta\g{f2b}(x)) \approx x^{\beta}, all we need to do is set x = 1 because then

    \[\g{b2f}(\alpha + \beta\g{f2b}(1)) = 1 = \g{b2f}(\g{f2b}(1))\]

so \alpha = \underbrace{\g{f2b}(1)}_{\g{0x3f800000}}(1-\beta), which lets us make arbitrary fast \beta^{th} power methods on demand.
4. ???
5. Profit!
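The f2b/b2f pair and the one-liner in step 1 are easy to sketch in Python. This is a sketch, not the original code: the struct-based bit-punning is my stand-in for the C pointer tricks, and the function names simply follow the text.

```python
import struct

def f2b(x):
    # float -> the 32-bit integer holding its IEEE 754 encoding
    return struct.unpack('<I', struct.pack('<f', x))[0]

def b2f(b):
    # 32-bit integer -> the float it encodes
    return struct.unpack('<f', struct.pack('<I', b & 0xFFFFFFFF))[0]

def qsqrt(x):
    # step 1 of the TL;DR: b2f(0x5f3759df - f2b(x)/2)
    return b2f(0x5f3759df - f2b(x) // 2)
```

Even without any Newton polishing, `qsqrt(4.0)` already lands within a few percent of 0.5.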

1. Introduction

The first time I saw the magical fast inverse square root method, I was a freshman in college and the code pretty much killed my brain. I’d just learned about algorithms and data structures and I thought that binary search trees were like the coolest things ever. Turns out I didn’t know them well enough to tackle this problem back then.


If you’re not familiar with the fast inverse square root method, here’s what it looks like in C (source: Wikipedia):

float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;                        // evil floating point bit level hacking
	i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//	y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

	return y;
}

[Note: The use of pointer aliasing here is illegal, as Maristic pointed out in the followup discussion on reddit. Use unions instead.]
Okay, so first you declare i, x2, y, and a constant threehalfs. Next, you set x2 to be \frac{x}{2} and set y = x. So far so good!

Okay, next, i gets the dereferenced value of the float* pointer to y casted into a long* pointer instead?? Okay…

Next, i = 0x5f3759df - (i/2)??? Huh?

y = *(float*)&i????

y = y * ( threehalfs - ( x2 * y * y ) )??????

What. The. Fucking. Fuck.
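Line by line, the same dance can be transliterated into Python, with struct standing in for the pointer pun (a sketch for following along, not the original code):

```python
import struct

def Q_rsqrt(number):
    x2 = number * 0.5
    y = number
    i = struct.unpack('<i', struct.pack('<f', y))[0]  # "evil bit level hacking"
    i = 0x5f3759df - (i >> 1)                         # the magic line
    y = struct.unpack('<f', struct.pack('<i', i))[0]  # back to a float
    y = y * (1.5 - (x2 * y * y))                      # 1st Newton iteration
    return y
```

With the single Newton iteration included, the result is good to roughly 0.2% everywhere.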


Before we go on, I should probably warn you that this code was a constant obsession of mine for the better part of my adult life. Because of this, a lot of what I’m about to say may sound preachy, cheesy, or even outright annoying. I’ll try to tone it down, but truth be told, I am proud of this work; I’ve come to see the fast inverse square root method as my baby and because of that, I’ve become quite possessive of it. As much as this is a post on the possible derivation of 0x5f3759df, this is also a memoir of my college life (minus the fun parts ;).

From the very beginning, I tried to wrap my head around this strange game of algebraic reasoning, and for the next 4 years, my academic life largely revolved around this mystery: I spent my winter break freshman year learning C and figured out that the “bit level” hack gave you the array of bits that your computer encoded floating point numbers as; sophomore year, I took a first course on numerical methods and understood how iterating the sequence

    \[y_{k+1} = y_k \pa{\frac{3}{2} - \frac{x}{2}y_k^2}\]

ad infinitum will eventually converge to y_{\infty} = \frac{1}{\sqrt{x}}; then, during the summer before my junior year, I worked through a lot of tedious algebra dealing with the quirks of IEEE 754 and found another bit-level hack to quickly compute the square root a la the spirit of the fast inverse square root method, but this time exploiting the distribution of floats and a convenient analogy to second order Taylor expansions.
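That iteration is easy to watch converge numerically. A small sketch (the starting guess y0 is arbitrary, as long as it lies in the basin of convergence):

```python
def newton_rsqrt(x, y0, iters=6):
    # y_{k+1} = y_k * (3/2 - (x/2) * y_k^2), converging to 1/sqrt(x)
    y = y0
    for _ in range(iters):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

Convergence is quadratic: each pass roughly squares the relative error, which is exactly why one iteration suffices once the magic-constant guess is already good.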

Indeed, my obsession with this method has given me an appetite for numerical analysis, programming languages, software verification, compilers, and even combinatorics and dynamical systems. Without it, I would have never explored most of the areas of computer science that I currently consider my home.

But through it all, there was always a constant question at the back of my mind that I’ve failed to answer time and time again: where did the 0x5f3759df come from?

I never got to a point where I can claim that I have a satisfactory answer to that question, but I think I did come to discover something along the road: a different magical constant 0x5f400000 that I believe lay at the very beginning of the history of this method, and it is the joy of discovering this constant that I would like to gift to my readers.

2. Switching Strategy

Towards the end of my college years, I realized that there was a big problem in the way I approached the problem of looking at 0x5f3759df. Most of the traditional research into the subject has been through the perspective that whoever wrote the algorithm went after the most optimal method possible. In the process of hunting for the jackpot, a lot of great intuition was left behind because those ideas weren’t efficient or effective enough. This is what led me down the path of looking for inefficient but elegant methods that could inspire the fast inverse square root method.

But first, let’s ramp up on some preliminaries. A lot of the work and papers on this matter jump immediately to the representation of floating point numbers; I will completely forgo that, instead relying on mathematical intuition to do the job. (This is in part why I didn’t become a mathematician :P)

Converting between a Float and its Bits in memory

Let’s first consider two functions: \g{f2b}(n), which stands for float-to-bits and returns the IEEE 754 bit representation of the floating point number n, effectively performing *(long*)&y; and \g{b2f}(b), which stands for bits-to-float and takes a bit array and turns it back into a float. Note that \g{f2b} = \g{b2f}^{-1} and vice versa, so that \g{b2f}(\g{f2b}(n)) = n.

It turns out that the magical fast inverse square root method is equivalent to

    \[\g{qsqrt}(n) = \g{b2f}\pa{\g{0x5f3759df} - \frac{\g{f2b}(n)}{2}}\]

Looking at b2f(f2b(x)*b)

With the conversion functions out of the way, I next turned my attention to everything but the magic constant. Specifically, what are the effects of multiplying inside the \g{b2f}. I fired up my favorite plotting library and looked at

    \[\g{b2f}(\g{f2b}(x) \times 0.5).\]

After a little bit of wrestling around with the plot, I realized that when plotting \g{b2f}(\g{f2b}(x) \times 0.5) in log-log style, it has the same shape as x^{0.5} = \sqrt{x}. This means that

    \[\g{b2f}(\g{f2b}(x) \times 0.5) \approx C \times \sqrt{x}\]

for some constant C.

I fitted a “regression” line (this isn’t a real regression however, so don’t be misled) through the data onto \sqrt{x} and found that the relative error is quite small, as can be seen in the figure below.

    [Interactive figure: drag the slider to change \beta.]

In fact, if you tick through the slider at the bottom of the figure, you’ll come to find that for any arbitrary \beta, it seems to be the case that

    \[\g{b2f}(\g{f2b}(x) \times \beta) \approx C_{\beta} \times x^{\beta}\]

for some constant C_{\beta}.
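This claim is easy to spot-check numerically. A sketch (the struct-based f2b/b2f are my stand-in for the bit pun; C_\beta is measured as the ratio to x^{\beta}, which should be near-constant in x):

```python
import struct

def f2b(x): return struct.unpack('<I', struct.pack('<f', x))[0]
def b2f(b): return struct.unpack('<f', struct.pack('<I', int(b) & 0xFFFFFFFF))[0]

def C(x, beta):
    # if b2f(f2b(x) * beta) ~ C_beta * x**beta, this ratio barely varies with x
    return b2f(int(f2b(x) * beta)) / x ** beta
```

Sweeping x over a few orders of magnitude, the ratio stays within a few percent of a fixed value for each \beta (keep \beta small enough that the scaled bits don't spill into the sign bit).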

3. Homing In on 0x5F400000

Adding Alpha

Let’s now consider the following function:

    \[\gamma(x, \alpha,\beta) = \g{b2f}\pa{\g{f2b}(x)\times \beta + \alpha}\]

Notice here that we now have added an additive term \alpha to the equation. Again, we can rewrite the fast inverse square root method as

    \[\sqrt{\tilde x^{-1}}= \gamma\pa{x, \g{0x5f3759df}, -\frac{1}{2}}\]

A natural question one might consider is the effect of varying \alpha on the behavior of \gamma. It turns out that on the log-log graph of \gamma(x, \alpha, \beta), varying \alpha does not change the slope of the line, only its y-intercept. In other words, varying \alpha will only affect the magnitude of the constant C_{\beta}, as we can see below.

    [Interactive figure: drag the slider to change \alpha.]

This then suggests that \beta determines the degree of \gamma while \alpha determines its magnitude. Symbolically:

    \[\gamma(x, \alpha, \beta) \approx C_\beta(\alpha) x^{\beta}\]

where C_{\beta}(\alpha) is some constant.

0x3F800000 – Generalized fast power method

We are now left at an interesting point in the puzzle. We have on one hand the knowledge that

    \[\gamma(x, \alpha, \beta) \approx C_\beta(\alpha) x^{\beta}\]

for some function C_\beta(\alpha) and we want to determine \alpha such that

    \[\gamma(x,\alpha,\beta) \approx C_\beta(\alpha) x^{\beta} = x^{\beta}\]

Canceling out the x^{\beta} on either side, we are left with the functional equation

    \[C_\beta(\alpha) = 1\]

So right now we need to find \alpha such that C_\beta(\alpha) = 1. But for x = 1, x^{\beta} = 1 for any \beta, so the naive thing to do here is to reduce our approximation to

    \begin{align*} \gamma(1, \alpha, \beta) &\approx C_\beta(\alpha) = 1 \\ \g{b2f}(\g{f2b}(1)\times \beta + \alpha) &\approx 1 \\ & = \g{b2f}(\g{f2b}(1)) \\ \g{f2b}(1)\times \beta + \alpha &\approx \g{f2b}(1) \\ \alpha &\approx \g{f2b}(1) \times (1-\beta) \\ &= \g{0x3f800000} \times (1-\beta) \end{align*}

Let’s take a moment and appreciate this statement:

    \[\alpha \approx \g{0x3f800000} \times (1-\beta)\]

This says that for any exponent \beta (positive or negative, integer or fractional), we can approximate x^{\beta} with

    \[\g{b2f}\pa{\g{f2b}(x)\times \beta + \g{0x3f800000} \times (1-\beta)}\]

That’s mind blowing because we can immediately generate “fast \beta^{th} power” methods using nothing but \beta!

float fast_pow(float x, float n) {
  long bit = (*((long*)&x))*n + 0x3f800000*(1-n);
  return *((float*)&bit);
}
Going back to inverse square roots, we have \alpha = \g{0x3f800000} \times (1- (-0.5)) = \g{0x5f400000}, which is the namesake of this article.
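Both the generalized formula and that \alpha are easy to sanity-check numerically. A sketch in Python (the struct-based conversions are my stand-in for the C pointer pun):

```python
import struct

def f2b(x): return struct.unpack('<I', struct.pack('<f', x))[0]
def b2f(b): return struct.unpack('<f', struct.pack('<I', int(b) & 0xFFFFFFFF))[0]

def fast_pow(x, beta):
    # alpha = f2b(1) * (1 - beta) = 0x3f800000 * (1 - beta)
    return b2f(f2b(x) * beta + 0x3f800000 * (1 - beta))
```

Plugging \beta = -1/2 in gives \alpha = 0x3f800000 × 1.5 = 0x5f400000 exactly, and fast_pow does a passable square root, cube root, or inverse square root, all from the same one-liner.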

4. Epilogue – The Hard Parts

Closing In On 0x5f3759df

After discovering the existence of 0x5f400000, things became much clearer. However, it was still quite off from SGI’s 0x5f3759df or Lomont’s 0x5f375a86, and so the question of how this constant came to be still nagged at the back of my head. Nevertheless, I am certain that 0x5f400000 appeared at least in some shape or form in the earlier versions of this numerical recipe.

Many others have previously discussed the possible origins of 0x5f3759df, and in the rest of this post, I will offer my own views on the matter.

For the most part, I believe that the discovery of 0x5f3759df was an iterative process: it starts with 0x5f40, then it gets refined to 0x5f375X, and finally someone must have decided on 0x5f3759df. As engineers are practitioners by nature, I don’t believe that anyone who worked on 0x5f3759df cared about its optimality; therefore, I believe that intuition and visualization played the most important part in its discovery.

In this section, I will first outline one path to refine 0x5f40 to 0x5f375X through intuition about the maximum error bounds of the recipe, followed by a discussion of the final discovery as a product of analyzing one step of Newton’s iteration.

Refinement – 0x5f376000

If we look at the graph of the relative error of \gamma\pa{x, \g{0x5f400000}, -\frac{1}{2}} on a logarithmic x-axis below, you will notice a couple of peculiar things.

First and foremost, changing the value of the magic constant by small amounts seems to lift the relative error up. Next, the error is periodic with respect to \log_2(x) with a period of 2 (so that \g{re}(x) = \g{re}(2^{\log_2(x)+2}) = \g{re}(4x)). Finally, the position of the maximum and the minimum of the relative error is largely unchanged when we vary the magic constant.

We can use these observations to our advantage. Notice that the error for \g{magic} = \g{0x5f400000} is completely negative. However, if we were to use a smaller magic constant, we can shift part of the error up into the positive zone, which reduces the absolute error as can be seen below.

We are trying to reduce the error bounds as much as possible (under the infinity norm optimization as we would say). In order to do so, we need to ensure that the maximum point plus the minimum point in the non-absolute relative error plot is as close to 0 as possible.

Here’s what we’re going to do. First, we’re going to zoom in onto the section of the graph within the interval x \in [1,4] because it is periodic. Next, we will look at the maximum point and minimum point and look at their sum.

    [Interactive figure: drag the slider to change the magic constant m.]

You will see that the optimal constant here lies somewhere close to 0x5f376000, which is about 10^3 away from the SGI constant.
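The scan can be reproduced with a crude brute-force over one period (a sketch; the error measure |y\sqrt{x} - 1| and the sampling grid are my choices, not the article's exact interactive setup):

```python
import struct

def f2b(x): return struct.unpack('<I', struct.pack('<f', x))[0]
def b2f(b): return struct.unpack('<f', struct.pack('<I', b & 0xFFFFFFFF))[0]

def max_rel_err(magic, newton_steps=0, samples=4000):
    # worst |y * sqrt(x) - 1| over one period x in [1, 4)
    worst = 0.0
    for i in range(samples):
        x = 1.0 + 3.0 * i / samples
        y = b2f(magic - (f2b(x) >> 1))
        for _ in range(newton_steps):
            y = y * (1.5 - 0.5 * x * y * y)
        worst = max(worst, abs(y * x ** 0.5 - 1.0))
    return worst
```

Scanning magic over a neighborhood of 0x5f400000 reproduces the drift toward 0x5f376000; turning on newton_steps reproduces the later refinements discussed below.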

Newton Iteration – 0x5f375a1a

In our final trick tonight, we will refine this constant down further to within 60 away from SGI’s 0x5f3759df.

One of the important things to notice in the Quake code is that the second Newton iteration was commented out. This suggests that whoever worked on the latest version of the fast inverse square root method must have optimized the constant with the effects of Newton iteration applied. Let’s look at how well the constants do with the help of an iteration.

Wow, using 0x5f3759df as the magic constant improved the accuracy nearly tenfold! This likely suggests that whoever worked on Quake’s version of the fast inverse square root method sought 4 digits of accuracy. In order to achieve this with the vanilla constant 0x5f400000, they would need to run two Newton iterations.

Let’s dig in a little bit and try to find the magic constant that minimizes the maximum error after one newton iteration.

    [Interactive figure: drag the slider to change the magic constant m.]

Incidentally, this occurs when the rightmost two peaks in the above graph collide. We find that the magic constant 0x5f375a1a works best here, which is only 59 away from 0x5f3759df.

We can look at the mean error as well (the one-norm):

    [Interactive figure: drag the slider to change the magic constant m.]

but as the bulk of the error’s mass is tucked under the two humps, the 1-norm/mean metric (and, in fact, the first few reasonable L-norms) will tend to disregard the outer edges and their maximum bounds. As a result, these metrics push the constant down as far as possible.

Finally, let’s look at the maximum relative error after two iterations:

    [Interactive figure: drag the slider to change the magic constant m.]

This ended up giving us the same magic constant of 0x5f375a1a, which I guess makes sense.

Closing Thoughts

In this article, we’ve come across a couple of different constants: 0x5f400000, 0x5f376000, 0x5f375a1a. None of these are exactly the 0x5f3759df that we’ve been searching for, but the last one is really close, which leads me to believe that the discovery of 0x5f3759df went through a similar process: 1. intuition, and 2. refinement.

The dark art of fast reciprocal square roots has largely been done away with nowadays by faster intrinsics. If anything, this article contributes little technical value besides proposing a half-assed solution to a great puzzle. Nevertheless, I sincerely hope that, having made it this far, you’ve also found some value in this research. Good Luck!

Requested by averageisnothalfway: Differences between a1a and df

Counting Binary Trees (The Hard Way)

Suppose I have a binary tree with n internal nodes. We can count all possible such configurations by decomposing the tree: each internal subtree with n nodes is a vertex plus two subtrees whose node-sum is n-1, or

    \[T_n = \sum_{k=0}^{n-1} T_k T_{n-k-1}\]

This is pretty fucking hard to solve. Here’s a way where we do not need to appeal to generating functions.

Let’s begin with a reasonable hypothesis: T_n satisfies a first order recurrence. After groping around for a while, we find that this is a homogeneous first-order recurrence with a nonconstant coefficient:

    \[T_n = \lambda_{n-1} T_{n-1}\]

Here’s the first couple of terms:

    \[1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440\]

No obvious pattern yet, but if we look at the sequence of ratios T_{n+1}/T_n, we see that it looks like

    \[1, ~ 2, ~ \frac{5}{2}, ~ \frac{14}{5}, ~ 3, ~ \frac{22}{7}, ~ \frac{13}{4}, ~ \frac{10}{3}, ~ \frac{17}{5}, ~ \frac{38}{11}, ~ \frac{7}{2}, ~ \dots\]
Notice how the primes in the denominator are exactly the right amount of spacing apart (7 is 2 away from 5, 11 is 4 away from 7, etc). Furthermore, the nonprime denominators are always a divisor of what the expected value there is supposed to be (for example, the 5 in the term right before the 11). This seems to suggest that there’s always an inverse affine factor between two consecutive elements of the sequence. Since the denominator of \lambda_3 is 5, let’s conjecture that

    \[\lambda_n \sim \frac{1}{n+2}\]

In fact, the sequence \frac{T_{n+1}(n+2)}{T_n} looks really nice!

    \[2, ~ 6, ~ 10, ~ 14, ~ 18, ~ 22, ~ 26, ~ \dots\]

Why, this is just the sequence (2 + 4n), so we can pretty accurately guess that

    \[T_0 = 1, ~~~~ T_{n+1} = \underbrace{\frac{2 + 4n}{n + 2}}_{\lambda_n} T_n\]

Warning: This is a really tedious proof; the point is to construct a series that cancels out the (n+2) on the denominator of \lambda_n at each step of the induction.

Proof: We proceed by strong induction: to establish the claim for n+1, suppose that we have already shown that T_m = \lambda_{m-1} T_{m-1} for all m \le n.
Consider the series

    \[S_{n+1} = \sum_{k=0}^n k T_k T_{n-k}\]

In the n=4 case, this looks like

    \[S_5 = 0 T_0 T_4 + 1 T_1 T_3 + 2 T_2 T_2 + 3 T_3 T_1 + 4 T_4 T_0\]

Notice that we can flip the order of this series and add it to itself to get

    \[2S_{n+1} =  \sum_{k=0}^n (k + n - k) T_k T_{n-k} = n \sum_{k=0}^n T_k T_{n-k} = n T_{n+1}\]

By appealing to the induction hypothesis on T_k, we also have

    \begin{align*} S_{n+1} &=  \sum_{k=0}^n kT_k T_{n-k} \\ &= \sum_{k=1}^n k \lambda_{k-1}T_{k-1} T_{n-k} \\ &= \sum_{k=0}^{n-1} (k+1) \lambda_{k}T_{k} T_{n-1-k} \\ S_{n+1} +T_{n+1} &= T_0T_n + \sum_{k=0}^{n-1} \lambda_k T_k T_{n-1-k} + \sum_{k=0}^{n-1} \pa{k+1}\lambda_{k}T_{k} T_{n-1-k} \\ &= T_n + \sum_{k=0}^{n-1} \underbrace{\pa{k+2}\lambda_{k}}_{\g{cancel the denominator}}T_{k} T_{n-1-k} \\ &= T_n + \sum_{k=0}^{n-1} \pa{2 + 4k}T_{k} T_{n-1-k} \\ \intertext{by a simple substitution of $k = n-1-k$, we get} 2S_{n+1} + 2T_{n+1}  &= 2T_n + \sum_{k=0}^{n-1} \pa{2 + 4k + 2 + 4(n-1-k)}T_{k} T_{n-1-k} \\ &= 2T_n + 4n \sum_{k=0}^{n-1}T_{k} T_{n-1-k} \\ &= (2+4n)T_n \end{align*}

So, above, we’ve derived the following:

    \[2S_{n+1} = nT_{n+1}, ~~~~ 2S_{n+1} + 2T_{n+1} = (2+4n)T_n\]

Substituting the left into the right, we get

    \[(n+2)T_{n+1} - (2+4n)T_n = 0 \implies T_{n+1} = \frac{2+4n}{n+2} T_n\]

which concludes the proof. ~~~~\Box

This recurrence allows us to write

    \[T_{n+1} = \prod_{k=0}^n \lambda_k\]

Notice that this is the same as

    \begin{align*} T_{n+1} &= \frac{\prod_{k=0}^n \pa{2 + 4k}}{\prod_{k=0}^n \pa{2+ k}} \\ &= \frac{1}{(n+2)!} \prod_{k=0}^n 2(1 + 2k) \\ &= \frac{2^{n+1} \cdot 1 \cdot 3 \cdot 5 \cdots (2n+1)}{(n+2)!} \frac{n!}{n!} \\ &= 2\frac{1 \cdot 3 \cdot 5 \cdots (2n+1) \cdot 2 \cdot 4 \cdots 2n}{(n+2)!n!} \\ &= \frac{2}{n+2} \frac{(2n+1)!}{(n+1)!n!} \\ &= \frac{2}{n+2} {2n+1 \choose n+1} \sim \frac{4^{n+1}}{\sqrt{\pi (n+1)^3}} \end{align*}
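As a sanity check, the first-order recurrence really does reproduce the defining convolution (a sketch):

```python
def catalan_by_convolution(n):
    # T_m = sum_k T_k * T_{m-1-k}, T_0 = 1
    T = [1]
    for m in range(1, n + 1):
        T.append(sum(T[k] * T[m - 1 - k] for k in range(m)))
    return T

def catalan_by_recurrence(n):
    # T_{m+1} = (2 + 4m)/(m + 2) * T_m; the division is always exact
    T = [1]
    for m in range(n):
        T.append(T[-1] * (2 + 4 * m) // (m + 2))
    return T
```

Both agree with the terms listed earlier (…, 42, 132, …, 2674440).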


Calculating Modified Fibonacci Numbers, Quickly and Exactly

1. Properties of c_k

The sequence c_k is essentially the same as the Fibonacci sequence, with the exception that the initial conditions are altered slightly (so that c_0 \ne f_0, where f_k is the Fibonacci sequence). This similarity lets us use much of the same algebraic machinery that can be used to tackle the Fibonacci numbers.

Closed Form

I will defer the derivation until the appendix, but it can be quickly shown using a little bit of linear algebra that, letting \phi,\psi = \frac{1\pm \sqrt{5}}{2} be the golden ratio and its “conjugate” (the other root), then

    \[c_k = \phi^k + \psi^k\]

which looks to have the same form as the Fibonacci numbers.

Analytic Property

The numerical value of \phi is \approx 1.618\dots and \psi = -\phi^{-1}; therefore, as k grows large, the contribution of \psi^k becomes insignificant, serving only to ensure that c_k is a whole number. It follows that c_k = O(\phi^k); in fact, simple observation suggests that for k > 1, c_k = \left\lfloor\phi^k + \frac{1}{2}\right\rfloor (that is, \phi^k rounded to the nearest integer).

Computation by Reduction to Fibonacci Number

Unfortunately, we don’t have a good way to compute with irrational objects: if we use floating point arithmetic, we quickly lose information about the trailing terms; if we use some algebraic package, simplifying expressions with irrational numbers may become computationally infeasible.

However, we can exploit the connection between c_k and f_k to help us out. Suppose we have some software that can efficiently compute f_k for any arbitrary k, can we possibly use that to our advantage in computing c_k?

It can be shown (similar to the steps used in the appendix) that \sqrt{5}f_k = \phi^k - \psi^k. Notice that

    \[\underbrace{\pa{\phi^k - \psi^k}}_{\sqrt{5}f_k}\underbrace{\pa{\phi^k + \psi^k}}_{c_k} = \phi^{2k} - \psi^{2k} = \sqrt{5}f_{2k}\]

so that

    \[c_k = \frac{f_{2k}}{f_k} = 2f_{k+1} - f_k\]
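Here is the reduction checked numerically for the first few k (a sketch; c_k is the sequence with c_0 = 2, c_1 = 1 as defined later in the text):

```python
def fib(n):
    # f_0 = 0, f_1 = 1, f_{k+1} = f_k + f_{k-1}
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def c(n):
    # c_0 = 2, c_1 = 1, same recurrence as the Fibonacci sequence
    a, b = 2, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

Both identities, c_k f_k = f_{2k} and c_k = 2f_{k+1} - f_k, hold term by term.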

Direct Computation

Alternatively, we can also compute this sequence directly in O(\log(k)) time. Recall that \phi\psi = -1, then

    \[c_k^2 = \phi^{2k} + 2\phi^k\psi^k + \psi^{2k} = c_{2k} + 2(-1)^k\]

so c_{2k} = c_k^2 - 2(-1)^k. Unfortunately, there’s no elegant solution for all odd terms, but recall that within the consecutive triple (k+1,k,k-1), at least one of these must be a multiple of three. Now, consider

    \begin{align*} c_k^3 &= c_k^2(\phi^k + \psi^k) \\  &= \pa{\phi^{2k} + \psi^{2k}}\pa{\phi^k + \psi^k} + 2(-1)^k c_k \\ &= \phi^{3k} + \psi^{3k} + \phi^{2k}\psi^k + \psi^{2k}\phi^k + 2(-1)^k c_k \\ &= c_{3k} + (-1)^k c_k + 2(-1)^k c_k \\ &= c_{3k} + 3(-1)^k c_k \\ c_{3k} &= c_k(c_k^2 - 3(-1)^k) \end{align*}

Therefore, we can compute c_k using

    \[c_0 = 2, c_1 = 1, c_k = \begin{cases} c_{k/2}^2 - 2(-1)^{k/2} & \mbox{if $k$ is even} \\ c_{k/3}(c_{k/3}^2 - 3(-1)^{k/3}) & \mbox{if $k$ is a multiple of three} \\ \underbrace{c_{k-1}}_{\g{even}} + \underbrace{c_{k-2}}_{\g{multiple of 3}} & \mbox{otherwise} \end{cases}\]

Memoizing gives an O(\log(k)) solution:

C = {0: 2, 1: 1}
d = lambda k: -1 if k % 2 else 1
def c(k):
    if k in C:
        return C[k]
    if k % 2 == 0:
        r = c(k // 2)
        C[k] = r*r - 2*d(k // 2)
    elif k % 3 == 0:
        r = c(k // 3)
        C[k] = r*(r*r - 3*d(k // 3))
    else:
        C[k] = c(k - 1) + c(k - 2)
    return C[k]
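To convince ourselves the branchy recurrence is right, we can pit it against the plain O(k) recurrence (a sketch; the fast version is re-stated here so the check is self-contained):

```python
def c_naive(n):
    # c_0 = 2, c_1 = 1, c_k = c_{k-1} + c_{k-2}
    a, b = 2, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def c_fast(k, C={0: 2, 1: 1}):
    d = lambda m: -1 if m % 2 else 1
    if k not in C:
        if k % 2 == 0:
            r = c_fast(k // 2)
            C[k] = r * r - 2 * d(k // 2)
        elif k % 3 == 0:
            r = c_fast(k // 3)
            C[k] = r * (r * r - 3 * d(k // 3))
        else:
            C[k] = c_fast(k - 1) + c_fast(k - 2)
    return C[k]
```

The two agree on every index, which exercises all three branches of the case analysis.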

2. Example Problem: Fibonacci Graph

Suppose we have the Fibonacci sequence

    \[0, 1, 1, 2, 3, 5, 8, 13, 21, \dots\]

defined by

    \[f_0 = 0 , f_1 = 1 , f_{k+1} = f_k + f_{k-1}.\]

Let’s construct a sequence of points P_k = \pa{x_k,y_k} such that

    \[P_0 = (0,0), P_k = (f_{k-1},f_k)\]

so that it looks like the sequence

    \[(0,0), (0,1), (1,1), (1,2), (2,3), (3,5), (5,8), \dots\]

Let’s connect all of the points in (P_k)_k together (so that it forms a jagged line) and look at the area under the curve; we will call this quantity B_n.

Give an algorithm that computes B_n \mod 10^{10} in O(\log(n)) time and compute the area for the case where n = 1234567891011121314.

Solution Sketch

If you look at the figure above, you’ll notice that the total region is made up of trapezoids. Let’s call A_k the area of the trapezoid defined by P_k and P_{k+1}; then we know that

    \[A_k = (x_{k+1} - x_k) y_k + \frac{(y_{k+1} - y_k)(x_{k+1} - x_k)}{2},\]

which is just the area of the rectangle on the bottom plus the triangle on the top. From the relation x_{k+1} = x_k + x_{k-1}, we know that x_{k+1} - x_k = x_{k-1}, and similarly for y, so that we have

    \[A_k = x_{k-1} \pa{y_k + \frac{y_{k-1}}{2}} = f_{k-2}\pa{f_k + \frac{f_{k-1}}{2}}\]

Let’s make a connection to the exponential form of the Fibonacci sequence:

    \[f_k = \frac{\phi^k - \psi^k}{\sqrt{5}}\]

where \phi,\psi = \frac{1 \pm \sqrt{5}}{2} are the golden ratio and its conjugate respectively. If we actually substitute this equation into the definition of A_k and expand everything out, we will find that

    \[A_k = \frac{1}{10} \phi^{2k-3} + \frac{1}{5}\phi^{2k-2} - \frac{1}{10} \phi^{k-1}\psi^{k-2} - \frac{1}{5} \phi^k\psi^{k-2} - \frac{1}{10} \phi^{k-2}\psi^{k-1} - \frac{1}{5}\phi^{k-2}\psi^k + \frac{1}{10} \psi^{2k-3} + \frac{1}{5}\psi^{2k-2}\]

Since \psi is \phi‘s conjugate, we have \phi\psi = -1, so each of the \phi^{k-2+a}\psi^{k-2+b} terms can be rewritten as (-1)^{k-2} \phi^a\psi^b = (-1)^k \phi^a\psi^b. Grouping related terms together, we get

    \[A_k = \phi^{2k} \underbrace{\pa{\frac{1}{10} \phi^{-3} + \frac{1}{5} \phi^{-2}}}_{10^{-1}} + \psi^{2k} \underbrace{\pa{\frac{1}{10} \psi^{-3} + \frac{1}{5} \psi^{-2}}}_{10^{-1}} - (-1)^k \pa{\frac{\overbrace{\phi + \psi}^{1}}{10} + \frac{\overbrace{\phi^2 + \psi^2}^{3}}{5}},\]

simplifying, we find that

    \[A_k = \frac{\phi^{2k} + \psi^{2k} - 7 (-1)^k }{10}, A_0 = 0\]

This is awesome! But we need to find the sum of B_k = A_0 + A_1 + \cdots + A_{k-1}, so let’s do just that

    \begin{align*} B_{k+1} &= \sum_{j=0}^{k} A_j \\ &= \sum_{j=1}^{k} \frac{\phi^{2j} + \psi^{2j} - 7 (-1)^j }{10} \\ \intertext{using the geometric series} &= \frac{1}{10} \pa{ \frac{\phi^{2k+2} - \phi^2}{\phi^2 - 1} + \frac{\psi^2 - \psi^{2k+2}}{1 - \psi^2} + 7 \frac{1 - (-1)^k}{2} } \\ \intertext{let's define $\gamma_k = \begin{cases}0 & k \g{ even} \\ 1 & \g{otherwise}\end{cases}$}  &= \frac{1}{10} \pa{ \frac{\phi^{2k+2} - \phi^2}{\phi^2 - 1} + \frac{\psi^2 - \psi^{2k+2}}{1 - \psi^2} + 7\gamma_k } \\ \intertext{furthermore, notice that $\phi = \phi^2 - 1$ and likewise with $\psi$, so} &= \frac{1}{10} \pa{ \frac{\phi^{2k+2} - \phi^2}{\phi} + \frac{\psi^2 - \psi^{2k+2}}{- \psi} + 7 \gamma_k} \\ &= \frac{ \phi^{2k+1} + \psi^{2k+1} - \overbrace{\pa{\phi + \psi}}^{1} +  7 \gamma_k}{10} \\ &= \frac{\phi^{2k+1} + \psi^{2k+1} - 1 + 7 \gamma_k}{10} \\ &= \frac{c_{2k+1} - 1 + 7\gamma_k}{10} \end{align*}
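The closed form checks out against a direct trapezoid sum (a sketch; both sides are indexed so that B_n sums A_0 through A_{n-1}):

```python
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def B_direct(n):
    # sum of trapezoid areas between consecutive points, P_0 = (0,0), P_k = (f_{k-1}, f_k)
    pts = [(0, 0)] + [(fib(k - 1), fib(k)) for k in range(1, n + 1)]
    total = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        total += (x1 - x0) * y0 + (y1 - y0) * (x1 - x0) / 2
    return total

def B_closed(n):
    # B_{k+1} = (c_{2k+1} - 1 + 7*gamma_k) / 10, c the Lucas-style sequence c_0=2, c_1=1
    k = n - 1
    a, b = 2, 1
    for _ in range(2 * k + 1):
        a, b = b, a + b
    return (a - 1 + 7 * (k % 2)) / 10
```

The O(\log n) version asked for by the problem then follows by computing c_{2k+1} \bmod 10^{11} with the fast doubling recurrence and dividing the (exact) numerator by 10.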


Since I was a little vague on specifying the range of trapezoids to consider for B_n, I will accept both 7997480508.5 (the literalist interpretation) and 7203060060 (the more casual interpretation).

3. Appendix

Deriving Closed Form of c_k

Consider the sequence (c_k). We can represent this sequence as an infinite dimensional vector called c (so that indexing c is the same as finding the k^{th} element of the sequence). Now, consider the vector space that c inhabits; we can define a transformation R (that I will call the right-shift operator) that takes in x and outputs x shifted one index over, so that (Rc)_k = c_{k+1}, e.g.

    \[R\mat{x_0\\x_1\\x_2\\\vdots} = \mat{x_1\\x_2\\x_3\\\vdots}\]

It’s clear that this is a linear transform, because we can define R as the limit of the sequence of matrices whose first superdiagonal is constant 1:

    \[R_k = \mat{0 & I_{k-1} \\ 0 & 0} = \mat{ 0 & 1 & 0 & 0 & 0 \\0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0}\]

Consider the subspace spanned by all of the shifted c sequences:

    \[\mathcal{C} = \g{span}\pa{c, Rc, R^2c, \cdots} = \g{span}(c,Rc)\]

the second equality comes from the fact that c_{k+2} = c_{k+1} + c_k, that is, R^2c = Rc + c. Now, because we don’t actually know what c is (only that it exists), we can alternatively look at it as the kernel of the system

    \[(R - \phi)(R-\psi)c = 0\]

(because we’ve already established above that R is a linear operator). Furthermore, R and all scalar matrices commute. Appealing to a little bit of algebra, we know that, letting \rho(x) = x^2 - x - 1 be the characteristic and \phi,\psi be scalar roots of the quadratic x^2-x-1 such that \rho(\phi) = \rho(\psi) = 0, then

    \[\rho(R) = (R - \phi)(R-\psi) = (R-\psi)(R-\phi)\]

Therefore, let (R-\phi)a = 0 and (R-\psi)b = 0, then the solution c must be some linear combination of the solutions a and b:

    \[c = \alpha a + \beta b\]

for some scalar \alpha, \beta, because then

    \begin{align*} \rho(R)c &= (R^2 - R - 1) c \\ &= \alpha(R - \psi)\underbrace{(R-\phi)a}_0 + \beta (R-\phi)\underbrace{(R-\psi)b}_0 \\ &= 0 \end{align*}

as expected.

Now, we know that Ra = \phi a, but this is the same as saying that letting a_0 = 1, then a_{k+1} = \phi a_k, so a_k = \phi^k. Using a similar argument, we can also argue that b_k = \psi^k, so

    \[c_k = \alpha \phi^k + \beta \psi^k\]

We can then use our initial values that c_0 = 2, c_1 = 1 to solve for \alpha, \beta

    \[\mat{1 & 1\\ \phi & \psi} \mat{\alpha\\\beta} = \mat{2\\1} \implies \mat{\alpha\\\beta} = \frac{1}{\psi-\phi}\mat{\psi & -1 \\ -\phi & 1}\mat{2\\1} = \mat{1\\1}\]


    \[c_k = \phi^k + \psi^k\]

Counting Words

1. The Combinatorial (Easy) Approach

Consider the placement of the zeros in an n-word: this is equivalent to the problem where we choose an arbitrary word prefixed with k zeros, like

    \[w_n = 0\stackrel{k}{\cdots}0\sigma_{k+1}\sigma_{k+2}\cdots\sigma_n\]

and permuting it in every possible way, excluding overlaps. Now, there are n! permutations of w_n, but k! of those are zero-overlapped and (n-k)! of those are \sigma-overlapped, so there are

    \[\frac{n!}{k!(n-k)!} = {n \choose k}\]

unique permutations of a single word w_n. How many possible w_ns can we possibly make? Well, there are n-k \sigma_i‘s that we can fill out and each of these can take any value besides 0, so there are 2^{n-k} such words, therefore there are

    \[{n \choose k}2^{n-k}\]

words of length n with exactly k zeros.
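For small n we can brute-force over all 3^n words to confirm the count (a sketch):

```python
from itertools import product
from math import comb

def count_words(n, k):
    # words over the alphabet {0, 1, 2} of length n with exactly k zeros
    return sum(1 for w in product((0, 1, 2), repeat=n) if w.count(0) == k)
```

For instance, count_words(5, 2) matches C(5,2) · 2^3 = 80.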

2. The Computer Scientist’s (Convoluted) Approach

We can boil this down into a decision problem. Suppose we use the decision variable a_n^k to denote the total number of words of length n with exactly k zeros, then we can consider a component-by-component decision problem. At each cell of the word, we can either fill in a 0, or a 1/2. Suppose we begin at the left end of the word, if we fill in a 0 at the current slot, then there are a_{n-1}^{k-1} words available to us (because we used up one slot, so decrease n by 1, and we used up one zero, so decrease k by 1), and if we fill in either a 1 or a 2, then we still have k zeros to choose from, so there are a_{n-1}^k such words available to us. Combining the two together (in the spirit of dynamic programming), we get the recurrence

    \begin{align*} a_n^k &= 2a_{n-1}^k + a_{n-1}^{k-1} \\ a_n^0 &= 2a_{n-1}^0 = 2^n \\ a_n^n &= 1 \end{align*}

If we make a table of this

    \[\begin{array}{|c|cccccc|} \hline n \backslash k & 0 & 1 & 2 & 3 & 4 & 5\\ \hline 0 & 1 & & & & & \\ 1 & 2 & 1 &&&& \\ 2 & 4 & 4 & 1 &&& \\ 3 & 8 & 12 & 6 & 1 && \\ 4 & 16 & 32 & 24 & 8 & 1 & \\ 5 & 32 & 80 & 80 & 40 & 10 & 1 \\ \hline \end{array}\]

Hmm, there doesn’t seem to be an obvious pattern here. Since we know that a_n^0 = 2^n, let’s attempt to do this inductively on k:

We know that

    \begin{align*} a_n^1 &= 2^{n-1} + 2a_{n-1}^1 \\ \intertext{dividing both sides by $2^n$, we get} \frac{a_n^1}{2^n} &= 2^{-1} + \frac{a_{n-1}^1}{2^{n-1}} \\ \intertext{let $b_n^1 = a_n^12^{-n}$, then} b_n^1 &= 2^{-1} + b_{n-1}^1 \\ b_n^1 &= n2^{-1} \\ a_n^1 &= n2^{n-1} \end{align*}

Great, let’s go on to the k = 2 case

    \begin{align*} a_n^2 &= a_{n-1}^1 + 2a_{n-1}^2 \\ &= (n-1)2^{n-2} + 2a_{n-1}^2 \\ b_n^2 &= a_n^2 2^{-n} \\ b_n^2 &= (n-1)2^{-2} + b_{n-1}^2 \\ b_n^2 &= \sum_{k=1}^n (k-1)2^{-2} \\ &= {n \choose 2} 2^{-2} \\ a_n^2 &= {n \choose 2} 2^{n-2} \end{align*}

For the k=3 case, we would get

    \[b_n^3 = 2^{-3} \sum_{k=2}^{n-1} {k \choose 2} = 2^{-3} {n \choose 3}\]

and inductively, because

    \[b_n^{k+1} = 2^{-(k+1)} \sum_{i=k}^{n-1} {i \choose k} = 2^{-(k+1)} {n \choose k+1} ~~~~ \blacksquare\]

we have that

    \[a_n^k = {n\choose k}2^{n-k}\]

which we derived earlier.
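And the decision recurrence agrees with the closed form when tabulated (a sketch of the DP table):

```python
from math import comb

def dp_table(N):
    # a[n][k] = 2*a[n-1][k] + a[n-1][k-1], with a[0][0] = 1
    a = [[0] * (N + 1) for _ in range(N + 1)]
    a[0][0] = 1
    for n in range(1, N + 1):
        for k in range(n + 1):
            a[n][k] = 2 * a[n - 1][k] + (a[n - 1][k - 1] if k > 0 else 0)
    return a
```

Every entry equals C(n,k) · 2^{n-k}, reproducing the table above (e.g. row 5: 32, 80, 80, 40, 10, 1).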

Annoying Maclaurin Series

We can employ Taylor’s theorem, but to my knowledge, there’s no easy way to construct the derivatives of B(z) [that’s left to the more ambitious reader ;)]. Instead, we focus on classical algebraic techniques to carry us through the day.

Notice that

    \[B(z) = \pa{1+z}\frac{1}{1 - \pa{z\pa{1+z}}},\]

recall that \frac{1}{1-x} = \sum_k x^k, so this becomes

    \begin{align*} B(z) &= \pa{1+z} \underbrace{\sum_{k=0}^\infty z^k (1 + z)^k}_{S(z)} \intertext{let's focus on $S(z)$. Recall that} (1 + z)^k &= \sum_{j = 0}^\infty {k \choose j} z^j \\ S(z) &= \sum_{k=0}^\infty z^k \sum_{j=0}^\infty {k \choose j} z^j \\ \intertext{Let $S_n$ be defined as} S_n &= \sum_{k=0}^n z^k \sum_{j=0}^\infty {k \choose j} z^j \\ S_0 &= 1 \\ S_1 &= S_0 + \pa{{1 \choose 0} z^1 + {1 \choose 1}z^2} \\ S_2 &= S_1 + \pa{{2\choose 0} z^2 + {2\choose 1}z^3 + {2\choose2}z^4} \\ S_3 &= S_2 + \pa{{3\choose0} z^3 + {3\choose1}z^4 + {3\choose2}z^5 + {3\choose3}z^6} \\ \vdots \end{align*}

Okay, let’s suppose that S(z) = \sum_k s_k z^k, then the above tells us that

    \begin{align*} s_0 &= 1 \\ s_1 &= {1\choose0}\\ s_2 &= {2\choose0} + {1\choose1}\\ s_3 &= {3\choose0} + {2\choose1} \\ s_4 &= {4\choose0} + {3\choose1} + {2\choose2} \\ s_k &= \sum_{j=0}^\infty {{k-j}\choose j} \end{align*}

If we compute the first few out, we will see that it looks like

    \[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, \dots\]

Why, this is the Fibonacci sequence! It seems that for the first few terms we looked at, s_k obeys the fibonacci recurrence

    \[s_{k+1} = s_k + s_{k-1} ~~~~ s_{0,1} = 1\]

Lemma (Fibonacci Sequence)

For k \ge 1,

    \[s_{k+1} = s_k + s_{k-1}\]

Proof: Substituting the definition of s_k in, we have

    \begin{align*} s_k + s_{k-1} &= \sum_{j=0}^\infty \left[{k-j \choose j} + {k-j-1\choose j}\right] \\ &= {k\choose 0} + \sum_{j=1}^\infty \left[{k-j \choose j} + {k-j \choose j-1}\right] \\ \intertext{we know that ${a \choose b} + {a \choose b+1} = {a+1\choose b+1}$} &= 1 + \sum_{j=1}^\infty {k-j+1 \choose j} \\ &= \sum_{j=0}^\infty {k-j+1 \choose j} \\ &= s_{k+1} ~~~~ \blacksquare \end{align*}

Finally, recall that B(z) = (1 + z)S(z) = S(z) +zS(z), so

    \[b_k = s_k + s_{k-1} = s_{k+1}\]

so we would expect

    \begin{align*} b_0 &= 1 \\ b_1 &= 2 \\ b_{k+1} &= b_k + b_{k-1} \end{align*}
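We can also check this mechanically by expanding S(z) term by term, exactly as in the sums above; here’s a small sketch (variable names are mine):

```python
from math import comb

N = 20
# s_m = sum_k C(k, m-k): the z^m coefficient of S(z) = sum_k z^k (1+z)^k
s = [sum(comb(k, m - k) for k in range(m + 1)) for m in range(N + 1)]

# B(z) = (1+z) S(z), so b_m = s_m + s_{m-1}
b = [s[0]] + [s[m] + s[m - 1] for m in range(1, N + 1)]

assert b[0] == 1 and b[1] == 2
# b obeys the Fibonacci recurrence b_{k+1} = b_k + b_{k-1}
assert all(b[m] == b[m - 1] + b[m - 2] for m in range(2, N + 1))
```
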

If we want a closed form, we can solve the recurrence à la Binet: letting

    \[\psi_{\pm} = \frac{1 \pm \sqrt{5}}{2}\]

and matching the coefficients against b_0 = 1 and b_1 = 2, we find

    \[b_k = \frac{\psi_{+}^{k+2} - \psi_{-}^{k+2}}{\sqrt{5}}\]

that is, b_k is just the (k+2)^{th} Fibonacci number. Finally, we can compute \sigma_n = \sum_{k=0}^n b_k by summing the two geometric series; using \psi_{+}\psi_{-} = -1 and \psi_{\pm} - 1 = -\psi_{\mp}, everything collapses to

    \[\sigma_n = \frac{\psi_{+}^{n+4} - \psi_{-}^{n+4}}{\sqrt{5}} - 2\]

Infinity Norm of the Inverse of Lower Bidiagonal Matrices

1. Introduction

The infinity norm \|\cdot \|_\infty of a matrix is just the maximum absolute row sum of that matrix, or alternatively

    \[\|A\|_\infty = \max_i \sum_j |a_{ij}|\]

so computing the norm shouldn’t be all that difficult. For a full matrix A\in \mathbb{R}^{n\times n}, it should take O(n) time to compute the sum of a single row, and n such computations to get the maximum row sum, for a total of O(n^2) time.

If A is also lower bidiagonal, then it would have the shape

    \[\mat{x \\ x & x \\ & x & x \\ && x & x}\]

that is, the diagonal and the first lower subdiagonal are filled in, but everything else is zero. Being bidiagonal grants the matrix a lot of nice properties; for one thing, it now takes only 2 operations to compute a row sum, so it only takes O(n) time to find the infinity norm of a bidiagonal matrix. Similarly, as we shall see later, it also only takes O(n) time to solve a system of linear equations of the form

    \[Ax = b\]

if A is bidiagonal.

2. The Problem

Suppose you’re given a lower-bidiagonal matrix L (defined above). How do you find the infinity norm of its inverse?


One naive solution is to first compute the inverse explicitly, and then take the infinity norm of that. As we’ve hinted above, it takes O(n) time to solve a single bidiagonal system, so the system

    \[LL^{-1} = \mat{1 \\ & 1 \\ && 1 \\ &&& 1}\]

can be solved as

    \[L^{-1} = \mat{x_1 & x_2 & x_3 & x_4}\]

where x_i is the solution to the system

    \[Lx_i = e_i ~~~~ e_i \mbox{ is the ith column of the identity matrix}\]

or, letting L\backslash b denote the solution of the system Lx = b

    \[L^{-1} = \mat{L\backslash \mat{1\\0\\0\\0} & L \backslash \mat{0\\1\\0\\0} & L \backslash \mat{0\\0\\1\\0} & L \backslash \mat{0\\0\\0\\1}}\]

and then we would look at the absolute row sums. The bottleneck here is the n solves w.r.t. the columns of the identity at O(n) cost each, so the entire process is expected to take O(n^2) time. This is actually pretty good, because the inverse takes O(n^2) time just to print out, so finding its maximum row sum in this amount of time is pretty reasonable. But the world is a cruel place, and somewhere out there, a billion-variable dynamic program awaits us that needs to shave off every second possible.

Can we do better than O(n^2) time and space?

3. Solution Attempt

Exploring the Inverse

Let’s begin with a reasonable direction: let’s figure out what the inverse of a bidiagonal matrix looks like. We’ll mainly be looking at the n = 4 case but generalizing for any n. Suppose that L looks like

    \[\mat{l_1 \\ r_2 & l_2 \\ & r_3 & l_3 \\ && r_4 & l_4}\]

Now, if we take the inverse of L, we’re not guaranteed that its bidiagonal structure will be preserved, but we know that it will at least be lower triangular. Let’s, for the sake of being punny, denote \Gamma = L^{-1}, then we know that

    \[\mat{l_1 \\ r_2 & l_2 \\ & r_3 & l_3 \\ && r_4 & l_4} \mat{\gamma_{11} \\\gamma_{21} & \gamma_{22} \\ \gamma_{31} &\gamma_{32} &\gamma_{33} \\ \gamma_{41}&\gamma_{42}&\gamma_{43}&\gamma_{44}} = \mat{1 \\ &1 \\ &&1 \\ &&& 1}\]

let’s start writing out the individual equations generated by this system:

    \begin{align*} l_1\gamma_{11} &= 1 \\ r_2\gamma_{11} + l_2\gamma_{21} &= 0 \\ l_2\gamma_{22} &= 1 \\ r_3\gamma_{21} + l_3\gamma_{31} &= 0 \\ r_3\gamma_{22} + l_3\gamma_{32} &= 0 \\ l_3\gamma_{33} &= 1 \\ r_4\gamma_{31} + l_4\gamma_{41} &= 0 \\ r_4\gamma_{32} + l_4\gamma_{42} &= 0 \\ r_4\gamma_{33} + l_4\gamma_{43} &= 0 \\ l_4\gamma_{44} &= 1 \end{align*}

Inductively, we can show that

    \[\gamma_{kk} = \frac{1}{l_k}\]

furthermore, the zero equations can be inductively generalized as

    \[r_k\gamma_{k-1,j} + l_k\gamma_{k,j} = 0 ~~~~ \forall j < k.\]

Suppose that k = 4; then we know that \gamma_{44} = \frac{1}{l_{4}}. For an entry below the diagonal, say \gamma_{31}, we also have an equation:

    \[r_4\gamma_{31} + l_4 \gamma_{41} = 0\]

so that

    \[\gamma_{31} = -\frac{l_4}{r_4} \gamma_{41}\]

and so on, we can compute every gamma with the recurrence

    \[\gamma_{kj} = \begin{cases} 0 & j > k \\ l_k^{-1} & j = k \\ -\frac{l_{k+1}}{r_{k+1}} \gamma_{k+1,j} & j < k \end{cases}\]

Great, this gives us a nice algorithm to fill out the entries of the inverse in O(n^2) time, but that doesn’t really help us. What we actually care about is the maximum absolute row sum


    \[\|L^{-1}\|_\infty = \|\Gamma\|_\infty = \max_k \sum_{j=1}^k |\gamma_{kj}|\]

so it doesn’t matter what the sign of \gamma_{kj} is. Now, the \gammas are entirely multiplicative, and hence their magnitudes do not change if the signs of any of the l_k or r_k are flipped. As such, we get the following lemma:

(3.1.1): Invariance

Suppose that \hat L \in \mathbb{R}^{n\times n} is such that |L| = |\hat{L}|; then

    \[\|\hat L^{-1}\|_\infty = \|L^{-1}\|_\infty\]

this basically says that we are allowed to arbitrarily flip the signs of the entries of L without changing the value of \|L^{-1}\|_\infty. This follows from the fact that the entries of the inverse, \Gamma, depend only multiplicatively on the entries of L so flipping the signs of the elements of L will only flip the signs of \Gamma, but the magnitudes of each element of \Gamma will remain the same.

In other words, \|L^{-1}\|_\infty is invariant under transformations during which the signs of L are flipped. This is going to come in very handy, and the above recurrence for \gamma_{kj} is constructive in that it gives an algorithm for which elements of L to flip to reduce the problem into an easier space.

Row Summation as a Linear Operation

Now, consider the operation

    \[A\mat{1\\1\\1\\1} = \mat{\sum_j a_{1j} \\ \sum_j a_{2j} \\ \sum_j a_{3j} \\ \sum_j a_{4j}}\]

it seems that multiplying any matrix by the column of ones (e from here on) will return a column of row sums. Great! Shouldn’t it then be the case that we just need to find the maximum entry of L^{-1} e? Unfortunately, what we really want is the sum of the absolute values of each row, so if L^{-1} ever contains a negative number, this will break :(

Some of you may already see how to resolve this now; but for the moment, let’s just check whether it’s actually possible to do this row-sum operation in linear time! The problem is computing L^{-1}e, or equivalently, finding an x such that

    \[\mat{l_1 \\ r_2 & l_2 \\ & r_3 & l_3 \\ && r_4 & l_4} \mat{x_1\\x_2\\x_3\\x_4} = \mat{1\\1\\1\\1}\]


forward substitution gives

    \begin{align*} x_1 &= l_1^{-1} \\ x_2 &= \frac{1 - r_2x_1}{l_2} \\ x_3 &= \frac{1 - r_3x_2}{l_3} \\ \vdots \\ x_k &= \frac{1 - r_kx_{k-1}}{l_k} \end{align*}

so it takes O(n) time to solve this system. If only we could just do this…

A Final Observation

By now, you might be a little perplexed by lemma (3.1.1), which we took a lot of pain and agony to develop, but never used. The problem with maximizing over the entries of L^{-1}e is that the elements of L^{-1} aren’t guaranteed to be positive. However, (3.1.1) and its construction seem to say that if we choose the right signs for the elements of L, then we can manipulate the signs of the elements of L^{-1} without disturbing its infinity norm. At this point, I had an idea:
can I somehow force all of the entries of the inverse to be positive by flipping the signs of the right elements? If we can, then from L we can generate a \hat L whose inverse \hat L^{-1} only contains positive numbers, so

    \[\|L^{-1}\|_\infty = \| \hat{L}^{-1}\|_\infty = \max_k \left(\hat{L}^{-1} e\right)_k\]

which can be done in O(n) time! This is quite exciting, but how do we do it?

Recall in the proof of (3.1.1), we used the recurrence for the elements of the inverse \gamma_{kj} as

    \[\gamma_{kj} = \begin{cases} 0 & j > k \\ l_k^{-1} & j = k \\ -\frac{l_{k+1}}{r_{k+1}} \gamma_{k+1,j} & j < k \end{cases}\]

Now, for each k we need \gamma_{kk} \ge 0, and since \gamma_{kk} = l_k^{-1}, the diagonal of \hat L must be positive. By the strong induction hypothesis, let’s assume that for any j < k, \gamma_{k+1,j} \ge 0; then for \gamma_{kj} to also be positive, it better be the case that

    \[\gamma_{kj} = -\frac{\overbrace{l_{k+1}}^{\ge 0}}{r_{k+1}} \overbrace{\gamma_{k+1,j}}^{\ge 0} \ge 0\]


    \[-\frac{(+)}{(?)} \times (+) = (+) \quad\Rightarrow\quad (?) \mbox{ must be } (-)\]

so r_k must be negative in \hat L, which gives us a construction of \hat L.

4. Solution!

Taking the above into consideration, the following algorithm can be used to compute the infinity norm of the inverse.

diagonal of L = abs(diag(L));
subdiagonal of L = -abs(diag(L, -1));
x(1) = 1/L(1,1);
for k = 2 to n
    x(k) = (1 - L(k,k-1)*x(k-1))/L(k,k);
return the maximum of x
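In Python, the whole scheme might look like the sketch below (the function names are mine); it checks the O(n) routine against the naive O(n^2) column-by-column inverse from section 2, where `l` holds the diagonal and `r` the subdiagonal:

```python
import math
import random

def inv_norm_naive(l, r):
    """O(n^2): solve L x_j = e_j for each column, assemble the inverse,
    and take the maximum absolute row sum."""
    n = len(l)
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            rhs = 1.0 if k == j else 0.0
            sub = r[k - 1] * G[k - 1][j] if k > 0 else 0.0
            G[k][j] = (rhs - sub) / l[k]
    return max(sum(abs(v) for v in row) for row in G)

def inv_norm_fast(l, r):
    """O(n): flip signs (diagonal positive, subdiagonal negative),
    solve Lhat x = e by forward substitution, and return max(x)."""
    n = len(l)
    x = [1.0 / abs(l[0])]
    for k in range(1, n):
        x.append((1.0 + abs(r[k - 1]) * x[k - 1]) / abs(l[k]))
    return max(x)

# the two agree on a random bidiagonal matrix with mixed signs
random.seed(0)
n = 12
l = [random.choice([-1, 1]) * random.uniform(0.5, 2) for _ in range(n)]
r = [random.choice([-1, 1]) * random.uniform(0.5, 2) for _ in range(n - 1)]
assert math.isclose(inv_norm_naive(l, r), inv_norm_fast(l, r), rel_tol=1e-9)
```
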

Limit Preserving Functions in CPOs

A complete partial order (hereon cpo) is a partial order (D, \sqsubseteq) such that all monotone chains

    \[x_0 \sqsubseteq x_1 \sqsubseteq \cdots\]

have a least upper bound

    \[\bigsqcup_{k \in \omega} x_k \in D\]

A continuous function f: D \to E between two cpos D and E is one that preserves the limit/least upper bound on all monotone chains:

    \[f\left(\bigsqcup_{k\in \omega} x_k\right) = \bigsqcup_{k \in \omega} f(x_k)\]

Theorem (Monotonicity)

All continuous functions are monotone in the sense that x \sqsubseteq y => f(x) \sqsubseteq f(y).

Proof: For the sake of deriving a contradiction, suppose that there exists some non-monotone continuous function f: D \to E for some pair of cpos D and E. Then it must be the case that for some pair of elements x, y \in D, x \sqsubseteq y but f(x) \not\sqsubseteq f(y). Now, consider the \omega-chain

    \[(s_k)_{k \in \omega} = x \sqsubseteq y \sqsubseteq y \sqsubseteq y \sqsubseteq \cdots\]

that is y-terminal and monotone. Obviously, its least upper bound \bigsqcup s_k = y, but

    \[\bigsqcup f(s_k) = f(x) \sqcup f(y) \sqcup f(y) \sqcup \cdots = f(x) \sqcup f(y)\]


and

    \[f(x) \sqcup f(y) \ne f(y)\]

because otherwise f(x) \sqsubseteq f(y). Hence, we can show that

    \begin{align*} f\left(\bigsqcup s_k\right) &= f(y) \\ &\ne \bigsqcup f(s_k) \\ &= f(x) \sqcup f(y) \end{align*}

and hence f cannot be continuous; a contradiction! Therefore, all continuous functions are monotone. \square
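To make the proof concrete, here is a toy sketch (everything named here is my own invention) of the flat cpo on \{\bot, 0, 1\}, together with a non-monotone function that fails continuity on exactly the kind of chain used above:

```python
BOT = "bot"  # the bottom element of the flat cpo on {bot, 0, 1}

def leq(a, b):
    # bot is below everything; distinct non-bottom elements are incomparable
    return a == BOT or a == b

def lub(chain):
    """Least upper bound of a monotone chain in the flat cpo:
    its first non-bottom element, if any."""
    for v in chain:
        if v != BOT:
            return v
    return BOT

def f(v):
    # non-monotone: BOT is below 0, but f(BOT) = 1 is incomparable with f(0) = 0
    return 1 if v == BOT else 0

chain = [BOT, 0, 0, 0]  # the chain x ⊑ y ⊑ y ⊑ ... with x = BOT, y = 0
assert all(leq(a, b) for a, b in zip(chain, chain[1:]))
assert lub(chain) == 0

image = [f(v) for v in chain]  # [1, 0, 0, 0] -- not even a chain,
assert not leq(image[0], image[1])
# so its lub cannot be f(lub(chain)) = 0, and f is not continuous
assert f(lub(chain)) == 0
```
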

Introduction to Scientific Computing: Error Propagation

There was recently a good article on scientific computing, defined loosely as the dark art (as it may seem to the uninitiated) of deriving solutions to equations, dynamical systems, or what-not that would have made your Mechanics professor scream in horror at the thought of somehow having to “solve” these systems. Of course, in the rich computing world of today, almost any problem imaginable (exaggeration, of course!) can already be solved by some existing tool. Hence, more and more, the focus shifts from “how do I solve this differential equation” to “what do I ask Google?”

My dad once told me of a glorious time “back in [his] days” when every respectable engineering institution would have a crash course on this dark art on top of Fortran. Of course, my dad is only 43, and that was only 19 years ago.

Even now, when computer science departments everywhere no longer believe in the necessity of forcing all of their graduates to have a basic grasp of numerical analysis, there is still some draw to the subject that either makes people deathly afraid of it or makes them embrace it as their life’s ambition. I am personally deathly afraid of it. Even so, there are quite a few cute gems in the field, and as such, I am still very much attracted to it.

Scientific computing is the all-encompassing field involving the design and analysis of numerical methods. I intend to start a survey of some of the most basic (but also most useful) tools: methods that solve linear and nonlinear systems of equations, interpolate data, compute integrals, and solve differential equations. We will often do this on problems for which there exists no “analytical” solution (in terms of the common transcendental functions that we’re all used to).

1. Safety First

In an ideal world, there would be a direct correspondence between numerical algorithms and their implementations. Everything would work out of the box, and there would be no need to worry that, even if you’ve implemented the on-paper algorithm correctly, it would somehow behave “differently”.

Of course, this isn’t the case. We’ve all heard the age-old saying that computers are finitary, and therefore cannot represent all real numbers: there’s no way to represent the irrationals, and in most cases we do not represent all of the rationals exactly either. Instead, we round numbers to a certain digit.

On the surface, this doesn’t seem too unfortunate. You probably did the same in your Chemistry lab report, to even more horrendous precision than what your computer will likely do for you. If you aced your Chemistry lab, then this will likely seem like a perfectly good scheme. But if you, like me, were troubled by the “sig fig” rules, then you are probably a little wary.

In fact, more often than not, you will not be bothered by the lack of a full spectrum of real numbers to choose from. However, when these nasty little “roundoff” errors are the culprit, they are often resolved only through hours upon hours of debugging and a general sense of hopelessness. To inherit a roundoff bug from someone else is like contracting the Spanish flu: either you know what you’re doing and your system (with a little bit of agonizing) successfully resolves it, or you just won’t have any idea what to do (and die, in the Spanish flu metaphor). Think of this article as a vaccination against the roundoff bugs :)

2. A Modern Day Little Gauss Story

Suppose little Gauss lived in the modern age. little Gauss’ teacher wanted to surf the internet, so he assigned all of his students the following integral to evaluate:

    \[\int_0^1 x^{100} e^{x-1} dx\]

Being the clever alter-ego of the boy who immediately saw \sum_{k=1}^n k = {n+1 \choose 2}, little Gauss constructed a sequence

    \[s_k = \int_0^1 x^k e^{x-1} dx\]

and with a little bit of clever manipulation (integration by parts), he found that, using u_k = x^k, dv = e^{x-1}dx, v = e^{x-1}

    \[s_k = \int_0^1 u_kdv = \left.u_kv\right|_0^1 - \int_0^1 vdu_k\]

    \[= \left. x^ke^{x-1}\right|_0^1 - k\int_0^1x^{k-1}e^{x-1}dx\]

    \[= \left. x^ke^{x-1}\right|_0^1 - ks_{k-1}\]

    \[= 1 - ks_{k-1}\]


    \[s_0 = \int_0^1 e^{x-1}dx = 1 - 1/e\]

Why, this is a simple recursive function! Little Gauss was absolutely thrilled; he had at his disposal a programmable calculator capable of running Python (because he’s Gauss, he can have whatever the fuck he wants), and he quickly coded up the recurrence:

Python Code

import math
def s(k):
    if k == 0:
        return 1 - math.exp(-1)
    return 1 - k*s(k-1)

He verified the first few cases by hand, and upon finding his code satisfactory, he computed the 100th element of the sequence as `-1.1599285429663592e+141` and turned in the first few digits.

His teacher glanced at his solution, and knowing that there’s no way little Gauss could have done his school work with such proficiency, immediately declared it wrong. Upon repeated appeal, the teacher finally relented, looked up the solution in his solution manual and, bewildered… again told little Gauss that he was WAAAAY off. Unfortunately for our tragic hero, this would not be a modern day retelling of a clever little boy who outsmarted his teacher. No, this is a tragic story of a clever little boy who succumbed to a fatal case of the roundoff bugs.

At a Brief Glance

In fact, had his teacher been a little bit more clever, he would have been able to immediately see why Gauss’ solution wouldn’t have worked. It’s immediately obvious that between x = 0 and x = 1, x^{100}e^{x-1} is always positive.

Furthermore, if a function f(x) is positive inside an interval, and g(x) is also positive inside the same interval but is everywhere smaller than f(x), then obviously the area under f(x) must be bigger than that of g(x). Similarly, if h(x) is everywhere larger than f(x), then the area under h(x) must also be larger than that of f(x). Now let f(x) = x^{100}e^{x-1}; because \frac{1}{e} \le e^{x-1} \le 1 inside 0 \le x \le 1, we have \frac{x^{100}}{e} \le x^{100}e^{x-1} \le x^{100} inside 0 \le x \le 1, and

    \[\int_0^1 \frac{x^{100}}{e} dx \le \int_0^1 x^{100}e^{x-1} dx \le \int_0^1 x^{100} dx\]


    \[\frac{1}{e(100 + 1)} \le s_{100} \le \frac{1}{100 + 1}\]

and in general

    \[\frac{1}{e(k+1)} \le s_k \le \frac{1}{k+1}\]

of course s_{100} can’t be on the order of -10^{141}!

So what went wrong? Well, whenever you’re in trouble, just make a plot!
[Figure: little Gauss’ computed sequence \hat s_k]

Hmm. Everything seems to be going fine until around k=17.

It turns out that there was nothing wrong with little Gauss’ method, and the integral is perfectly well-behaved. The culprit lies in the fact that s_0 can only be represented approximately on his calculator. As we know already, s_0 = 1 - e^{-1} is irrational, and cannot be represented in a finite amount of memory. s_0 is only represented approximately, slightly perturbed, so that we’re actually giving the computer an initial \hat s_0 = s_0 + \epsilon, where \epsilon is that small perturbation (think of it as a really really really tiny number). Now, let’s see what Gauss’ calculator is computing once we unravel the recursion (we’ll use the notation \hat s_k to mean the calculated value of s_k on the calculator):

    \begin{align*} \hat s_0 &= s_0 + \epsilon \\ \hat s_1 &= 1 - \hat s_0 = s_1 - \epsilon \\ \hat s_2 &= 1 - 2\hat s_1 = s_2 + 2\epsilon \\ \hat s_3 &= 1 - 3\hat s_2 = s_3 - 6\epsilon \\ &\vdots\\ \hat s_{19} &= 1 - 19\hat s_{18} = s_{19} - 19!\epsilon \\ \hat s_k &= s_{k} \pm k!\epsilon \end{align*}

Oh god! No wonder the method produced the wrong answer, the slight perturbation in the computed value of \hat s_0 “propagates” throughout the computation and at the k^{th} step, manifests itself as k-factorial times that original perturbation \epsilon. Even at k=19, we will see around 10^{17}\epsilon fudged into the calculation. For k=100, the answer will have an additional factor of 10^{158}\epsilon! Little Gauss definitely should have learned about round-off errors.
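We can watch the poisoning happen by running the same recurrence twice, once in ordinary floats and once in high-precision decimal arithmetic as a stand-in for the true value (a sketch; the function names are mine):

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 60  # 60 digits: 21! * 10^-60 is still negligible

def s_float(k):
    s = 1 - math.exp(-1)          # s_0, already carrying a tiny epsilon
    for i in range(1, k + 1):
        s = 1 - i * s
    return s

def s_decimal(k):
    s = 1 - Decimal(-1).exp()     # s_0 to ~60 digits
    for i in range(1, k + 1):
        s = 1 - i * s
    return s

k = 21
hi = 1 / (k + 1)
lo = hi / math.e
# the high-precision run respects the bound 1/(e(k+1)) <= s_k <= 1/(k+1)...
assert lo <= float(s_decimal(k)) <= hi
# ...while the float run has been swamped by roughly k! * epsilon
assert not (lo <= s_float(k) <= hi)
```
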

3. Some Basics – Errors

Before we dig into the floating point encoding underlying most modern computing platforms, let’s talk about errors. Suppose that we’re computing the value of something and the true value of that computation is “stored” in a variable x, but our computation is inexact, and in the end, we shall put that inexact result in a “hatted” version of the known result, that is \hat{x}.

Since there’s this notion of inexactness, we can talk about the “error” of our computation. There are two kinds of errors:

Absolute Error
This is just the difference between the true value of the computation x and the inexact value \hat{x}. We shall define this as

    \[e_x = | x - \hat{x}|\]

note that e_x (e subscript x) can be taken to mean the error of the computation of x.
Relative Error
This is the ratio of the absolute error to the true value of the computation, or in other words

    \[\delta_x = \left| \frac{x - \hat{x}}{x} \right|\]

we read \delta_x to mean the relative error in the computation of x.

First, note that we can’t actually compute the absolute or relative errors, because if we could, then we would have known the true value of the computation already! Errors are purely analytic objects that help us determine how well-behaved our computations are. Second, in the analysis of floating point roundoff, we will typically exclusively use relative error. This may seem counter-intuitive, but it has a few nice properties that simplify error analysis, as we will see.

One more thing to add: if we allow \delta_x to have any sign, then through some simple algebra, we will find that

    \begin{align*} \delta_x &= \frac{\hat x - x}{x} \\ x\delta_x &= \hat x - x \\ \hat x &= x + x\delta_x  \\ &= x(1 + \delta_x) \end{align*}

Error Propagation

Suppose that, through some series of computations, we have the computed values \hat x and \hat y. Now, in the next instruction, we wish to compute the value of x + y. Because we already have the (potentially inexact) computed values \hat x and \hat y, it seems natural that, to compute \widehat{x + y}, we would just add \hat x + \hat{y}:

    \[\widehat{x + y} = \hat{x} + \hat{y}\]

Now, suppose that \hat{x} = x(1 + \delta_x) and similarly for \hat{y}, then it seems that

    \[\widehat{x+y} = (x + y)(1 + \delta_{x+y})  = x(1 + \delta_x) + y(1 + \delta_y)\]

now, if we were doing error analysis, then we would want the relative error of \widehat{x + y}, which from our earlier notation we already called \delta_{x+y}. To find this, we merely need to solve the above equation for \delta_{x+y}:

    \[(x + y)(1 + \delta_{x+y})  = x(1 + \delta_x) + y(1 + \delta_y)\]

distributing both sides, we get

    \[x + y + x\delta_{x+y} + y\delta_{x+y} = x + x\delta_x + y + y\delta_y\]

canceling the x+y from both sides, we end up with

    \[\delta_{x+y}(x+y) = x\delta_x + y\delta_y\]

which gives \delta_{x+y} = \frac{x\delta_x + y\delta_y}{x+y}.

Wow, what a mouthful. This defies intuition, as you would expect error to accumulate additively. Of course, this is true of the absolute errors:

    \[\widehat{x+y} = \hat{x} + \hat{y} = x + y + e_x + e_y = (x+y) + e_{x+y}\]

but this no longer holds when you consider the relative error.

So why use relative error at all for analysis? It turns out that while absolute error is convenient here, it becomes less tractable when reasoning about roundoff. Whenever you do an addition operation in floating point, the operation itself introduces a small bit of error, and that error scales with the magnitude of the result, much like relative error does. So sooner or later, you’re going to need to start reasoning about relative error. Worry not, we will develop a systematic formula for reasoning about the propagation of relative error that boils down to high school level algebra (and some calculus).

Now, we’ve done addition. What about subtraction? You should easily verify for yourself that

    \[\widehat{x-y} = (x-y)(1 + \delta_{x-y})\]

where the relative error \delta_{x-y} is defined as

    \[\delta_{x-y} = \frac{x\delta_x - y\delta_y}{x-y}\]

Let’s now derive the propagated relative error of multiplication:

    \[\widehat{x \times y} = (x\times y)(1+\delta_{x\times y}) = \hat{x} \times \hat{y} = x(1+\delta_x)\times y(1+\delta_y)\]

again, solving for \delta_{x\times y} boils down to solving the equation above

    \[(x\times y)(1+\delta_{x\times y}) = \hat{x} \times \hat{y} = x(1+\delta_x)\times y(1+\delta_y)\]

assuming that neither x nor y are 0, we can divide out the x\times y to get

    \[1+\delta_{x\times y} = (1+\delta_x)(1+\delta_y) = 1 + \delta_x + \delta_y + \delta_x\delta_y\]

which tells us that

    \[\delta_{x\times y} = \delta_x + \delta_y + \delta_x\delta_y\]

It is traditional to assume that the relative errors of your computations are small (\delta {\small<\!\!<} 1), for otherwise, there’s no point in computing a value if you can’t even get one digit correct! Therefore, if \delta_x is tiny, and \delta_y is tiny, then \delta_x\delta_y is insignificant in comparison. Therefore, we typically discard “higher order” \delta terms. In effect, we just say that

    \[\delta_{x\times y} \approx \delta_x + \delta_y\]

I’ll leave it to you to verify that division propagates as

    \[\delta_{x/y} \approx \delta_x - \delta_y\]

Arbitrary Differentiable Function

The propagation schemes we’ve talked about so far are all fine and dandy, but what if we already have \hat x = x(1 + \delta_x) and we want to compute f(x) for some arbitrary but [twice] differentiable function f(\cdot)?

Well, let’s just call f with \hat x as the argument then!

    \[\widehat{f(x)} = f(\hat x)\]

now, we can do some algebra and get

    \[f(x)(1 + \delta_{f(x)}) = f(x + x\delta_x) = f(x + e_x)\]

but we can no longer use our typical algebraic tools to solve the above equation for \delta_{f(x)}, since f(\cdot) could be anything! Whatever will we do?

If you’ve had a little bit of calculus, then the section heading probably gives the answer away. We’ve already reasoned earlier that we don’t care about second order error terms because they’re so insignificant. But then, we can express f(x + e_x) as a polynomial in the error term through a Taylor series centered around x!

    \[f(x + e_x) = f(x) + f'(x)e_x + O(e_x^2f''(\xi))\]

if f(\cdot) is twice differentiable, then f''(\xi) is bounded by some constant, so that second order term is O(e_x^2), which we can disregard (with some skepticism about how nonlinear/curvy f is around the neighborhood of x). Using the first order Taylor approximation as the right hand side, we can rewrite the above equation as

    \[f(x) + f(x)\delta_{f(x)} = f(x) + f'(x)(x\delta_x)\]

which, as long as f(x) \ne 0, gives

    \[\delta_{f(x)} =\frac{f'(x)}{f(x)}x\delta_x\]

Now, the f(x)\ne 0 restriction might seem a bit unfair, but recall that if f(x) = 0, then we won’t have any digits of accuracy, so in effect, the relative error is in fact \infty.

Table of Error Propagation

Summarized, suppose that we want to run the computation x \circ y, but we have inexact computed values \hat x and \hat y each with relative error \delta_x and \delta_y respectively, then the computation \widehat{x \circ y} will also be inexact, and its relative error will be

    \[\delta_{x + y} = \frac{x\delta_x + y\delta_y}{x + y}\]

    \[\delta_{x - y} = \frac{x\delta_x - y\delta_y}{x - y}\]

    \[\delta_{x \times y} = \delta_x + \delta_y\]

    \[\delta_{x /y} = \delta_x - \delta_y\]

    \[\delta_{f(x)} = \frac{f'(x)}{f(x)}x\delta_x\]
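The table is easy to spot-check numerically: plant known (signed) relative errors in \hat x and \hat y, do the arithmetic, and compare the measured relative error against the predicted one (a sketch; all names are mine):

```python
x, y = 3.0, 2.0
dx, dy = 1e-8, -2e-8              # known signed relative errors
xh, yh = x * (1 + dx), y * (1 + dy)

def rel(true, approx):
    """Signed relative error of approx with respect to true."""
    return (approx - true) / true

assert abs(rel(x + y, xh + yh) - (x * dx + y * dy) / (x + y)) < 1e-12
assert abs(rel(x - y, xh - yh) - (x * dx - y * dy) / (x - y)) < 1e-12
assert abs(rel(x * y, xh * yh) - (dx + dy)) < 1e-12   # first order only
assert abs(rel(x / y, xh / yh) - (dx - dy)) < 1e-12   # first order only
```
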

4. Example

Suppose that we’ve computed \hat x with relative error \delta_x and \hat y with no error. Furthermore, suppose that the true value of x = 1 and y = \frac{\pi}{2}. What will be the computed error of \cos(x + y)?

We’re looking to compute

    \begin{align*} \widehat{\cos(x+y)} &= \cos(\widehat{x+y})\\ &= \cos((x+y)(1 + \delta_{x+y})) \end{align*}

Now, we need to figure out a few things:

1. \delta_{x+y}

From our table of error propagation above, we see that

    \[\delta_{x+y} = \frac{x\delta_x + y\times 0}{x + y} = \frac{x}{x+y} \delta_x\]

2. \cos((x+y)(1 + \delta_{x+y}))

We can simplify this to \cos(x+y + e_{x+y}), and a first order Taylor expansion gives

    \[\cos(x+y + e_{x+y}) = \cos(x+y) - \sin(x+y)(x+y)\delta_{x+y}\]

Since we’re looking for the relative error, we need to factor a \cos(x+y) out to get

    \[\cos(x+y)\left(1 - \tan(x+y)(x+y)\delta_{x+y}\right)\]

(this is exactly the table entry \delta_{f(x)} = \frac{f'(x)}{f(x)}x\delta_x applied to f = \cos at the point x + y.)

Substituting \delta_{x+y} from step 1 finally simplifies our computed value as

    \begin{align*} \widehat{\cos(x+y)} &\approx \cos(x+y)\left(1 - \tan(x+y)(x+y) \frac{x}{x+y} \delta_x \right)\\ &= \cos(x+y)\left(1 - \tan(x+y)x \delta_x \right)\\ &= \cos(x+y)(1 + \cot(1)\delta_x) \end{align*}

so that the final relative error is \delta = \cot(1)\delta_x (using \tan(1 + \frac{\pi}{2}) = -\cot(1)). Note that this single pass already accounts for both sources of inexactness: the error of computing \widehat{x+y}, and the effect of evaluating \cos at that perturbed argument. To see this more concretely, we are essentially looking for \delta_{t_2} in the following system

    \begin{align*} \hat t_1 &= \widehat{x+y} \\ \hat t_2 &= \cos(\hat{t_1}) \end{align*}

where the table gives \delta_{t_1} = \frac{x}{x+y}\delta_x and then \delta_{t_2} = -\tan(t_1)\, t_1\, \delta_{t_1} = -\tan(x+y)\, x\, \delta_x = \cot(1)\delta_x, the same solution.
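This example can be checked numerically: plant a known relative error in \hat x, evaluate \cos at the perturbed sum, and measure the resulting relative error against the prediction from propagating through the two-step system t_1 = x+y, t_2 = \cos(t_1) (a sketch; variable names are mine):

```python
import math

x, y, dx = 1.0, math.pi / 2, 1e-8   # y is exact; dx is the planted error
xh = x * (1 + dx)

true = math.cos(x + y)
computed = math.cos(xh + y)          # the two-step computation t2 = cos(t1)

measured = (computed - true) / true
predicted = dx / math.tan(1.0)       # cot(1) * delta_x
assert abs(measured - predicted) < 1e-3 * abs(predicted)
```
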

This concludes our brief overview of error propagation. We’ll start on IEEE floating point encoding of rational numbers and how to avoid errors in computer arithmetic next time.

Subtype Ambiguity in Java

true ? 1 : "Hello"

Take a look at the Java expression on the left hand side. Off of the top of your head, can you tell me if it will type check? If so what will be its type?

If you had asked me this question out of the blue, I would have probably said no; I would reason as follows: this is a Java expression, so it must be assigned a type, which gives us a system of equations saying that the type of 1 and the type of "Hello" must be equal; this is clearly not the case, as one is an Integer and the other a String, so this can’t possibly type check!

Of course, I’m actually just suffering from a common case of overlooking the obvious. It turns out that I’m forgetting about the whole subtyping thing built into Java and that I can actually assign Object o = true ? 1 : "Hello";, because the type Object sits at the \top of the type ordering.

1. Subtyping as Inequalities

Since we’re speaking of subtyping as a type relation anyways, let’s give it a symbolic name to make it easier for us: we’ll let T and S range over types in Java, and we’ll say

    \[S \le T\]

to mean that S is a subtype of T. Under one interpretation, we can think of this as saying that if we’re expecting a T somewhere, we should theoretically be allowed to plug in a S in its place and not expect weird or undefined behaviors as a result. Another perhaps more practical view is to claim that there exists some function called the coercion function (which we will denote, for the sake of historical consistency, \Theta) that, given some subtype relation S \le T, will be able to translate an object of type S into an object of type T that behaves “similarly”.

Anyways, it’s obvious that both \g{Integer} \le \g{Object} and \g{String} \le \g{Object} (since everything’s a subtype of Object), so we can draw a subtyping hierarchy as a tree:

(diagram: Integer and String sitting directly under Object)

and we can type check true?1:"Hello" : Object. This seems to suggest that we can typecheck any instance of (b?x:y) to some type R if we can typecheck x to S, y to T, and S,T \le R. In other words, R is an upper bound of the types of x and y.

Of course, upon closer inspection, it turns out that both Integer and String share another common supertype: Serializable.

(diagram: Integer and String under both Object and Serializable)

So just as well, we could claim that the ternary expression above has type Serializable. But if we want to “assign” a type to an expression, we don’t want just any supertype.

2. Most Precise Supertype

Looking at the type hierarchy above, it’s clear that the lower a type is on that hierarchy, the more “precise” it is. When we say that an expression has a type, we don’t mean that it is bounded above by some set of super types; we actually mean that its most “precise” type is so and so. For example, we could just as well draw the type hierarchy for Integers as

(diagram: a type hierarchy for Integer, with Serializable above it)

and claim that a 1 is a Serializable; but when I typically talk about the type of the expression 1, I’m usually talking about Integer, not Serializable.

From this, it’s clear that we also want a way to find the most precise (smallest) type of true ? 1 : "Hello" such that it is also an upper bound of both 1 and “Hello”; this is called the least upper bound, and we will for historical reasons call this the “join” of two types.

Symbolically, if we want to find the most precise upper bound (join) of two types S and T, we will denote it as R such that

    \[R = S \sqcup T\]

The reason that the least upper bound is usually denoted with a square union symbol is actually somewhat natural: the union of two sets is the least upper bound of those two sets in the partial order \subseteq (set inclusion); convince yourself that this is actually the case.

Note that whenever we join two types, the resulting type is “less precise”. We are after all finding an upper bound!

Going back to the problem at hand, the entire point of this exercise is then very simple: we want to find the least upper bound of Integer and String:

    \[\g{Integer} \sqcup \g{String}\]

We’ve already seen earlier that both Object and Serializable are upper bounds of both Integer and String, but we also know that \g{Serializable} \le \g{Object}, so Serializable is a more precise (less) upper bound than Object is, but is Serializable the most precise?
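Indeed, both bounds work as assignment targets for the conditional (a small sketch; the class name is mine):

```java
import java.io.Serializable;

public class UpperBoundDemo {
    public static void main(String[] args) {
        // Both compile: each target type is an upper bound of
        // Integer and String, so the conditional coerces up to it.
        Serializable s = true ? 1 : "Hello";
        Comparable<?> c = false ? 1 : "Hello";
        System.out.println(s);
        System.out.println(c);
    }
}
```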

In order to answer this question and find the least upper bound, we’re going to need the complete subtyping hierarchy. Consulting the Java documentation, we find that Integer is a subtype of Number, which is Serializable, and that Integers are themselves Comparable with other Integers. Furthermore, Strings are also Serializable and Comparable with other Strings! This gives the slightly more convoluted typing hierarchy:

(diagram: Integer and String under Number, Serializable, and their respective Comparable instantiations)

3. Problems

At first glance, it doesn’t seem possible that there’s a least upper bound: it seems like both Comparable and Serializable are maximally precise upper bounds. Furthermore, this isn’t even necessarily true, as Comparable itself might be too imprecise.

Wildcards and Bounded Quantifiers

Consider the constraint

    \[\g{String} \sqcup \g{Integer} \le \g{Comparable}\ba{?}\]

now, we know that

    \[\g{String} \le \g{Comparable}\ba{\g{String}}, \g{Integer} \le \g{Comparable}\ba{\g{Integer}}\]

so we can rewrite this as

    \[\g{String} \le \g{Comparable}\ba{?~\g{extends}~\g{Comparable}\ba{?}}\]

by plugging in \g{String} \le \g{Comparable}\ba{?} into the \g{String} in \g{Comparable}\ba{\g{String}}. (Wildcards, amiright?) Similarly, we can say the same about integers, so we get

    \[\g{String} \sqcup \g{Integer} \le \g{Comparable}\ba{?~\g{extends}~\g{Comparable}\ba{?}}\]

but we know that \g{Comparable}\ba{?~\g{extends}~\g{Comparable}\ba{?}} \le \g{Comparable}\ba{?}, so a more precise upper bound of \g{String}\sqcup \g{Integer} is \g{Comparable}\ba{?~\g{extends}~\g{Comparable}\ba{?}}. But now, we can construct a sequence

    \[\g{Comparable}\ba{?},\quad \g{Comparable}\ba{?~\g{extends}~\g{Comparable}\ba{?}},\quad \g{Comparable}\ba{?~\g{extends}~\g{Comparable}\ba{?~\g{extends}~\g{Comparable}\ba{?}}},\quad \cdots\]

each of which is a more precise approximation than the previous, which undoubtedly converges to the most precise type: and here’s the kicker, that type cannot be expressed finitely!

So how does Java handle these types? Let’s try it out. I wanted Java to output its internal type representation, so I needed to trigger a type-mismatch error during compile time.

(screenshot: compiler error reporting the Comparable type)

It seems as if Java reports the type of the two as an unbounded quantification: just Comparable<?>. This is intentional: Java treats the ? extends ... as a constraint (a system of inequalities), sets it aside during coercion, and uses it solely for typechecking, so we seem, for now, to have solved the problem (so even though A\ba{?~\g{extends}~\g{A}\ba{?}} \le A\ba{?}, we’re going to treat them as if they weren’t).

No least upper bound?

But we still have another problem: there is no least upper bound in the above hierarchy! Again, let’s see what Java does:

(screenshot: compiler error reporting an intersection type)

It turns out that the least upper bound is just the “intersection” over the three types, which is then indeed the LEAST upper bound of all three.
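We can mirror such an intersection with an explicit generic bound (a simplified sketch: the real inferred bound involves wildcards, and pick and the class name are mine):

```java
import java.io.Serializable;

public class IntersectionDemo {
    // T must satisfy the whole intersection: both Serializable
    // and Comparable. Integer and String each qualify.
    static <T extends Serializable & Comparable<T>> T pick(boolean b, T x, T y) {
        return b ? x : y;
    }

    public static void main(String[] args) {
        System.out.println(pick(true, 1, 2));      // T inferred as Integer
        System.out.println(pick(false, "a", "b")); // T inferred as String
    }
}
```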

Neat way of counting OCaml objects satisfying certain types

Full Disclosure: this isn’t mine. I got the idea from http://www.youtube.com/watch?v=YScIPA8RbVE and the book Analytic Combinatorics by Robert Sedgewick and Philippe Flajolet. I just think the entire concept is really, really neat.

If you’re like me, you spend your flight pondering the deepest and darkest corners of human knowledge, like

  1. was the cashier flirting with me? (she wasn’t)
  2. which one is better, the integral or the derivative? (definitely the derivative)
  3. are we there yet? (no.)
  4. are airlines the mortal enemies of tall people? (they are)

Of course, after a day of mentally exhausting travel, I’d like to sit back, relax, and ask myself how, given a specific type declaration in \f{OCaml}, I can figure out the number of all possible instances of that type.

For example, there is just one instance of the type unit (kind of funny because its name is unit) corresponding to the object () in \f{OCaml}. There are two distinct instances of the type bool corresponding to the set \Rec{\True,\False}, and about 2^{32} distinct instances of the type int.

1. Types in \f{OCaml}

Now, before we go on any further, let’s first take a look at the syntax of \f{OCaml} types.

type suit = Diamond | Club | Heart | Spade

this type declaration says that we want to create a new type whose instances are labeled either Diamond, Club, Heart, or Spade. The labels are capitalized to mean that they can serve as constructors of the type. So if you type Diamond into the OCaml toplevel, it’ll construct an object of type suit corresponding to the Diamond label. Note that Diamond isn’t itself a type here.

It’s quite obvious that there are 4 distinct instances of the type suit that we can construct, namely \Rec{\g{Diamond},\g{Club},\g{Heart},\g{Spade}}.

We can also give each type constructor some extra stuff. For example, if we want types corresponding to the individual cards rather than suits, we can write something like

type card = Diamond of int | Club of int | Heart of int | Spade of int

so that we can construct an instance of the card Jack of Clubs as Club 11. Here, there are 4 \times 2^{32} different instances of the type \g{card} (we’re only familiar with 52 of these). Alternatively, we can also write each card as a pair of a suit and an integer.

type card = Card of (suit * int)

here, the notation suit * int stands for types whose objects are pairs, the first element of which is an instance of a suit and the second an integer; for example, Card (Club, 11) would be the jack of clubs.

One really interesting thing to observe here is that the two different ways of writing the card type above should be equivalent. In each, we are specifying a suit and also a value. Therefore, if there are 4 \times 2^{32} distinct instances of the first type, there should also be 4 \times 2^{32} instances of the second type, and it turns out, there are.

Furthermore, if we want to “parameterize” types to make them more general, we can do so by introducing type variables (variables that correspond to types) such as 'x, which start with a single apostrophe. So for example, if we want to build a linked list of whatever type we want, we could do something like

type 'x list = Null | Cons of ('x * 'x list)

where the list [1,2,3] of type int list (so we make the substitution 'x \mapsto \g{int}) is constructed from Cons(1, Cons(2, Cons(3, Null))).

A binary tree can be built up like

type 'x tree = Leaf | Node of ('x tree * 'x * 'x tree)

where the tree

(diagram: a binary tree with root 4, left child 2, and right child 5)

has type int tree (so we make the substitution 'x \mapsto \g{int}) and is constructed from

    Node(Node(Leaf, 2, Leaf),
         4,
         Node(Leaf, 5, Leaf))

2. Counting instances

One or unit

Now, we get to the interesting part. As discussed previously, we have already established that there’s only one object of the unit type (just ()) and two (True and False) of the boolean type. Of course, there’s also only one object of the type

type one = One

and in fact, we can rename the constructor to anything (assuming that we don’t care about it being valid OCaml code), including the unit object (); so it turns out that unit is equivalent to all types with a single constructor, and equivalence isn’t affected by renaming of the constructor (well, there are certain rules to this renaming/substitution, and we’ll call this property \alpha equivalence). But is this the only property that preserves the count of the number of instances? Well, no. As it turns out, the following types all have only one element.

type one' = One' of unit
type one'' = One'' of one'

this is intuitive: if there is only one possible object for the argument of the constructor, and we only have that one constructor, then we can only construct one object of that type. In fact, from now on, we will say that an argument-less type constructor C is just syntactic sugar for C of unit, because doing so does not change the number of instances of that type.

Two or bool

We’ve also established that there are two instances of booleans, \g{true} and \g{false}. In type parlance, this would be

type bool = True | False

but remember, earlier we said that if we give a type constructor not expecting any argument the unit argument, the type’s size is preserved. Therefore, we can rewrite the above as

type bool = True of unit | False of unit

and since renaming things doesn’t affect the size of the set of objects satisfying this type, this is also equivalent to

type two = One of unit | Two of unit

Neat. Let’s count up one more.


Three

There’s no native type in \f{OCaml} that contains only three instances, but it’s not too hard to guess how we’re going to construct one.

type three = One of unit | Two of unit | Three of unit

Here, we can either have \f{One}(),\f{Two}(), or \f{Three}() as instances of the type three. This gives a natural meaning to the notion of “or”. If I have types a and b, then the type

    \[\g{type}~c = \g{A of }a \mid \g{B of } b\]

can either be one of the A~a or one of the B~b, but since the two cases are necessarily distinct (because the labels A \ne B), then if there are n_a objects of type a and n_b objects of type b, there must be n_a + n_b objects of type c = a \mid b.

It’s no coincidence either that the type c = a \mid b is called a sum type of a and b (albeit for different reasons). Now that we’re warming to the notion that we can use these types almost as we did in high school algebra, let’s make some simplifications:

Since we do not ever care about what the labels themselves are, only that they are distinct, we can entirely leave out the labels for the type constructors. Therefore, we can write suit (type suit = Diamond of unit | Club of unit | ...) as

    \[\g{suit} = \g{unit} \mid \g{unit} \mid \g{unit} \mid \g{unit}\]

and the card type as either

    \[\g{card} = \g{int} \mid \g{int} \mid \g{int} \mid \g{int}\]

or as

    \[\g{card'} = \g{suit} \times \g{int}\]

where the \times means the objects are of a pair type (with both a first and a second element). Note that we can also substitute suit directly in this type as in the second declaration of \g{card'}:

    \[\g{card'} = \pa{\g{unit} \mid \g{unit} \mid \g{unit} \mid \g{unit}} \times \g{int}\]

it’s no coincidence that there are also (1 + 1 + 1 + 1) \times 2^{32} objects of type \g{card'}, but it’s also obvious that the two types are not equal syntactically. That is, \g{card'} \ne \g{card}.

Of course, it’s useful to construct another equivalence relation whereby type a is translationally equivalent to type b if both have the same number of instances. In that case, we’ll say a \circeq b. As you probably have already guessed, we’re going to define some kind of translation from these \f{OCaml} types into numbers denoting the number of objects of each type. We’ll go into detail more later, but needless to say, we would translate a \mid b as a + b.


Okay, at this point, we kind of already know what we are after. But first, I want to establish some basic laws seen in school algebra.

Recall that

    \[\g{four} = \g{suit} = \g{unit} \mid \g{unit} \mid \g{unit} \mid \g{unit}\]

and that

    \[\g{two} = \g{bool} = \g{unit} \mid \g{unit}\]

Now, obviously,

    \[\g{four} \circeq \g{two} \mid \g{two}\]

but let’s look at something else. Suppose we have a pair type c such that

    \[c = \g{two} \times \g{two}\]

now, there are again four instances of c; they are

    \[(\f{One}(), \f{One}()), \quad (\f{One}(), \f{Two}()), \quad (\f{Two}(), \f{One}()), \quad (\f{Two}(), \f{Two}())\]

so it seems as if \g{two}\times \g{two} \circeq \g{four} as well. In fact, it’s not hard to show that product types correspond to multiplication in the same way that Cartesian products correspond to multiplication of cardinalities. (In fact, this question reduces to the cardinality-of-sets case.)


Functions

Just to get to the point: how many functions are there whose domain is of size 3 and whose codomain is of size 2? For example, suppose that I have the type

    \[c = \g{three} \to \g{two}\]

where \cdot \to \cdot is taken to mean a function type mapping a type with three elements to a type with two. Well, in general, suppose I have a function type A \to B; then it turns out that there are |B|^{|A|} such functions: this problem reduces to counting in base |B|.

Suppose I have |A| elements in A that need to be mapped into B. Suppose both A and B are finite and are ordered into the sequences a_0, a_1, \cdots, a_{|A|-1} \in A and b_0, b_1, \cdots, b_{|B|-1} \in B respectively; then I can encode one such function

    \[f_0 : A \to B = \Rec{a_0 \mapsto b_0, a_1 \mapsto b_0, \cdots, a_{|A|-1} \mapsto b_0}\]

and a second function

    \[f_1 : A \to B = \Rec{a_0 \mapsto b_0, \cdots, a_{|A|-2} \mapsto b_0, a_{|A|-1} \mapsto b_1}\]

but since we don’t really want to clutter up the notation with all of the braces, the subscripts, and everything, we can just encode each function as a string of digits where the d-th digit being k means that the element a_d maps to b_k. Then, we can just write

    \begin{align*} f_0 &= 00\cdots00 \\ f_1 &= 00\cdots01 \\ f_2 &= 00\cdots02 \\ f_{|B|-1} &= 00\cdots0(|B|-1) \\ f_{|B|} &= 00\cdots10 \\ &\vdots \end{align*}

but this is the same as counting up to |B|^{|A|}-1 in base |B|.

Anyways, from this argument, we would expect c = \g{three} \to \g{two} to have 8 distinct objects, and hence c \circeq \g{eight}.
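We can verify the |B|^{|A|} count by brute-force enumeration (a Java sketch; all the names here are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class CountFunctions {
    // Enumerate every function A -> B (A of size a, B of size b) as a
    // mapping table: table[i] holds the index of the element of B
    // that a_i maps to -- exactly the base-|B| digit string above.
    static List<int[]> allFunctions(int a, int b) {
        List<int[]> fns = new ArrayList<>();
        enumerate(fns, new int[a], 0, b);
        return fns;
    }

    static void enumerate(List<int[]> fns, int[] table, int pos, int b) {
        if (pos == table.length) { fns.add(table.clone()); return; }
        for (int digit = 0; digit < b; digit++) { // count in base b
            table[pos] = digit;
            enumerate(fns, table, pos + 1, b);
        }
    }

    public static void main(String[] args) {
        // three -> two should give 2^3 = 8 distinct functions
        System.out.println(allFunctions(3, 2).size());
    }
}
```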

3. The Translation

Let’s now define a translation from a simplified type declaration to a number denoting the total number of objects of that type. We’ll write this relation \trans{}{\cdot}: \g{type} \times E. For now, pretend that E is the naturals \mathbb{N}, but this will cause some subtle issues later on, because we might not be able to translate certain types (like the list) directly into \mathbb{N}. Furthermore, \trans{}{\cdot} is almost a function, but as discussed a few seconds ago, if we let E be the naturals, it will not be able to translate certain types, so at best it is a partial function. In that case, we will use the syntax \trans{}{c} = n to mean that (c,n) \in \trans{}{\cdot}.

To define this translation, let’s do it “recursively”. The validity of such a technique is beyond my scope, but convince yourself that because the right hand side is always translating smaller and smaller pieces, then eventually, the translation of any arbitrary type will terminate. Here, I will use \tau to mean some type expression; that is, we will use it as a metavariable over the domain of types. Assume the same precedence level of \mid, \times, \to as +,\times, and power in school algebra.

    \begin{align*} \trans{}{\g{unit}} &= 1 \\ \trans{}{\tau_1 \mid \tau_2} &= \trans{}{\tau_1} + \trans{}{\tau_2} \\ \trans{}{\tau_1 \times \tau_2} &= \trans{}{\tau_1} \times \trans{}{\tau_2} \\ \trans{}{\tau_1 \to \tau_2} &= \trans{}{\tau_2}^{\trans{}{\tau_1}} \\ \trans{}{\tau~\g{type}} &= \trans{}{\g{type}\Rec{'x \mapsto \tau}} \end{align*}
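The finite fragment of this translation can be implemented directly as a cross-check (a sketch in Java rather than OCaml; the AST names Unit, Sum, Prod, and Arrow are mine, and it assumes Java 16+ for records):

```java
public class Translate {
    // A tiny AST for the type expressions unit, | , x, and ->
    interface Ty {}
    record Unit() implements Ty {}
    record Sum(Ty l, Ty r) implements Ty {}
    record Prod(Ty l, Ty r) implements Ty {}
    record Arrow(Ty from, Ty to) implements Ty {}

    // The recursive translation: counts instances of a finite type.
    static long count(Ty t) {
        if (t instanceof Unit) return 1;
        if (t instanceof Sum s) return count(s.l()) + count(s.r());
        if (t instanceof Prod p) return count(p.l()) * count(p.r());
        if (t instanceof Arrow a)
            return (long) Math.pow(count(a.to()), count(a.from()));
        throw new IllegalArgumentException("unknown type expression");
    }

    public static void main(String[] args) {
        Ty two = new Sum(new Unit(), new Unit());   // bool
        Ty three = new Sum(two, new Unit());
        System.out.println(count(new Prod(two, two)));   // two x two
        System.out.println(count(new Arrow(three, two))); // three -> two
    }
}
```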

Next, we define a notion of translational equivalence.

Translational Equivalence

    \[\tau_1 \circeq \tau_2 \iff \trans{}{\tau_1} = \trans{}{\tau_2}\]

note that \circeq is a weaker notion than syntactic equality =; therefore, if \tau_1 = \tau_2, then so too \tau_1 \circeq \tau_2, owing to the fact that \trans{}{\cdot} is a partial function. Furthermore, because of the construction of the sequence

    \[\Rec{\g{unit}, \g{unit}\mid\g{unit}, \g{unit}\mid\g{unit}\mid\g{unit},\cdots}\]

then \trans{}{\cdot} is clearly onto E. However, because we need this notion of \circeq, it’s also very obvious that two different type expressions can have the same number of instances, and hence \trans{}{\cdot} is not one-to-one.

4. Examples

We’re going to take some extraordinary risks here (because we’re not mathematicians) and assume that this translation is correct (there are in fact many problems). Let’s do something with this new translation of ours


    \[\trans{}{\tau_1 \mid \tau_2} = \trans{}{\tau_1} + \trans{}{\tau_2} = \trans{}{\tau_2} + \trans{}{\tau_1} = \trans{}{\tau_2 \mid \tau_1}\]


    \[\trans{}{(\tau_1 \mid \tau_2)\times\tau_3} = (\trans{}{\tau_1} + \trans{}{\tau_2})\times \trans{}{\tau_3} = \trans{}{\tau_1}\trans{}{\tau_3} + \trans{}{\tau_2}\trans{}{\tau_3} = \trans{}{\tau_1 \times \tau_3 \mid \tau_2 \times \tau_3}\]

in fact, under translational equality, we can manipulate types algebraically just as in school algebra. More amazingly, that means we can encode translationally equivalent types in terms of one another. Therefore, if we wanted to pass around an object of type \g{eight}, we could instead pass around an object of type \g{three} \to \g{bool}!

Parameterized Types (Advanced, skip if you want)

Suppose we wanted to translate the list type above:

    \[\trans{}{'x~\g{list}}\]

unfortunately, we don’t have a definition of \g{list}. Rather, we merely have an equation that must be satisfied in order for an object to be considered an instance of the \g{list} type. This equation is

    \['x ~ \g{list} = \g{unit} \mid 'x \times 'x~\g{list}\]

it turns out that 'x~\g{list} is the least fixed point of the function R(\tau) = \g{unit}\mid 'x \times \tau. This goes a bit beyond the scope of this post, but that least fixed point can be computed in the ordering of immediate sub-type expressions, which turns the sequence R^n(\g{void}) into a chain in a complete partial order (complete because every chain has a least upper bound). Therefore, by the fixed point theorem, that least fixed point is the mysterious expression (where \bigsqcup means the least upper bound)

    \[R^*(\g{void}) = \bigsqcup_n R^n(\g{void})\]

now, the translation function is “continuous” in the sense that it preserves least upper bounds on the CPO E (of course, this is why we can’t have E = \mathbb{N}); loosely speaking, the operations can be translated as

    \[\trans{}{\bigsqcup} \triangleq \max, \trans{}{\g{void}} = 0\]


    \begin{align*} \trans{}{'x~\g{list}} &= \trans{}{\bigsqcup_n R^n(\g{void})} \\ &= \max\Rec{\trans{}{\g{void}}, \trans{}{\g{unit}\mid 'x \times \g{void}}, \trans{}{\g{unit}\mid 'x \times (\g{unit}\mid 'x \times \g{void})}, \cdots} \\ &= \max\Rec{0, 1, 1+\trans{}{'x}, 1+\trans{}{'x}+\trans{}{'x}^2, \cdots} \\ &= \sum_n \trans{}{'x}^n \end{align*}

unfortunately, if 'x \not\circeq \g{void}, the above sum diverges to the top element in E, \top.

However, an interesting question arises that this type of analysis can readily answer: how many instances of the list type are translationally equivalent to the n-tuple? In this case, it turns out that for whatever base type 'x, there are exactly \trans{}{'x}^n such lists.

Of course, there’s a slightly easier hack which comes from the above derivation directly (namely because the translation preserves monotonicity, verify this!).

Lists – the easier way

Why don’t we just translate the equation directly?

We’ll get

    \begin{align*} \trans{}{l} &= 1 + \trans{}{'x} \times \trans{}{l} \\ \intertext{substituting in $\trans{}{l} = 1 + \trans{}{'x} \times \trans{}{l}$ into the right hand side} &= 1 + \trans{}{'x}(1 + \trans{}{'x} \times \trans{}{l}) \\ &\vdots \\ &= \sum_n \trans{}{'x}^n \end{align*}

In fact, we could even get this by solving the equation and finding its generating function.

    \begin{align*} \trans{}{l} &= \frac{1}{1 - \trans{}{'x}} \\ &= \sum_n \trans{}{'x}^n \end{align*}
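The partial sums of this series have a concrete meaning: summing \trans{}{'x}^n for n \le N counts the 'x lists with at most N elements. A quick sketch (the names are mine):

```java
public class ListCount {
    // Partial sum of the series sum_n k^n: the number of lists
    // over a base type with k instances having at most maxLen elements.
    static long listsUpTo(long k, int maxLen) {
        long sum = 0, pow = 1;
        for (int n = 0; n <= maxLen; n++) {
            sum += pow;  // k^n lists of length exactly n
            pow *= k;
        }
        return sum;
    }

    public static void main(String[] args) {
        // bool lists of length <= 3: 1 + 2 + 4 + 8 = 15
        System.out.println(listsUpTo(2, 3));
    }
}
```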

Neat. However, this brings up a good question: can all such equations be solved? Unfortunately, no.

Unheaded List – unsolvable

Let’s define a type

    \['x ~ \g{u} = \g{unit} \mid 'x ~ \g{u}\]

if we translate it, we get

    \[\trans{}{\g{u}} = 1 + \trans{}{\g{u}}\]

now, if our codomain of the translation were just \mathbb{N}, we would be forced to abandon hope. However, what we really did was extend that codomain to include \top, which you can think of as \infty. This can be interpreted as saying that no matter what the type variable 'x becomes, the type \g{u} is going to diverge.

However, using the above trick, we still get the expansion

    \[\trans{}{\g{u}} = 1 + 1 + 1 + \cdots\]

which says that all instances of \g{u} are translationally equivalent to \g{unit}.

Binary Trees

Finally, recall the binary tree type from before:

    \['x ~ \g{t} = \g{unit} \mid ('x ~ \g{t} \times 'x \times 'x ~ \g{t})\]

which under translation gives

    \[\trans{}{\g{t}} = 1 + \trans{}{'x} \trans{}{\g{t}}^2 = \sum_n C_n \trans{}{'x}^n\]

where C_n are the Catalan numbers (i.e., we’re generating the sequence that counts the number of binary trees translationally equivalent to the n-tuple, which is the number of binary trees with n nodes).
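We can confirm the Catalan connection by counting trees directly from the recursive structure (a sketch; the names are mine):

```java
public class CatalanTrees {
    // Count binary trees with n nodes straight from the type equation
    // t = unit | (t * 'x * t): a nonempty tree uses one node for the
    // root and splits the remaining n-1 nodes between two subtrees.
    static long trees(int n) {
        if (n == 0) return 1; // just Leaf
        long total = 0;
        for (int left = 0; left < n; left++)
            total += trees(left) * trees(n - 1 - left);
        return total;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int n = 0; n <= 5; n++)
            sb.append(trees(n)).append(n < 5 ? " " : "");
        System.out.println(sb); // the first few Catalan numbers
    }
}
```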