Optimizing Performance of Orthogonal Polynomials

In a previous post I talked about a few methods to compute the Zernike polynomials, and alluded to a way to compute them significantly faster.

This post walks through doing that with the Jacobi polynomials, which underpin the Zernikes, and shows the optimization of that routine by multiple orders of magnitude.

To recap, the Jacobi polynomials are orthogonal polynomials on the domain $[-1,1]$ with respect to the weight function $(1-x)^\alpha (1+x)^\beta$, parameterized by $\alpha, \beta$. They can be computed with a recurrence relation:

$$ a \cdot P_n^{(\alpha,\beta)}(x) = b(x) \cdot P_{n-1}^{(\alpha,\beta)}(x) - c \cdot P_{n-2}^{(\alpha,\beta)}(x) $$
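Note that the $b$ coefficient absorbs the $x$ dependence. Written out explicitly (matching the abc function below), the coefficients are:

$$
\begin{aligned}
a &= 2n\,(n+\alpha+\beta)\,(2n+\alpha+\beta-2) \\
b(x) &= (2n+\alpha+\beta-1)\left[(2n+\alpha+\beta)(2n+\alpha+\beta-2)\,x + \alpha^2 - \beta^2\right] \\
c &= 2\,(n+\alpha-1)\,(n+\beta-1)\,(2n+\alpha+\beta)
\end{aligned}
$$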

This recurrence relation removes all powers from the calculation, which yields numerical stability; powers of large or small numbers create catastrophic rounding error. A benefit of the recurrence is that if we already have $P_{n-1}$ and $P_{n-2}$, computing $P_{n}$ takes just three multiplies, a subtract, and a divide.

The simplest possible algorithm computes $a, b, c$ in one step, initializes the recurrence, and then loops from order 2 up to order $n$ to compute $P_{n}$. That looks like this:

function abc(n, α, β, x)
	# the parens emphasize the terms, not PEMDAS
	a = (2n) * (n + α + β) * (2n + α + β - 2)
	b1 = (2n + α + β -1)
	b2 = (2n + α + β)
	b2 = b2 * (b2 - 2)
	b = b1 * (b2 * x + α^2 - β^2)
	c = 2 * (n + α - 1) * (n + β - 1) * (2n + α + β)
	return a, b, c
end

function jacobi(n, α, β, x)
	if n == 0
		return 1
	elseif n == 1
		return (α + 1) + (α + β + 2) * ((x-1)/2)
	end
	Pnm2 = jacobi(0, α, β, x)
	Pnm1 = jacobi(1, α, β, x)
	a, b, c = abc(2, α, β, x)
	Pn = ((b * Pnm1) - (c * Pnm2)) / a
	if n == 2
		return Pn
	end
	for i = 3:n
		Pnm2 = Pnm1
		Pnm1 = Pn
		a, b, c = abc(i, α, β, x)
		Pn = ((b * Pnm1) - (c * Pnm2)) / a
	end
	return Pn
end
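Since the Legendre and Chebyshev polynomials are special cases of the Jacobi polynomials, thin wrappers over jacobi are enough to get them. The wrappers below are a sketch of mine, not from the original code, assuming the jacobi function above is in scope; the Chebyshev case needs a normalization because the Jacobi polynomials use a different convention:

```julia
# assuming the jacobi function defined above is in scope;
# these wrapper names are illustrative, not part of the post
legendre(n, x) = jacobi(n, 0.0, 0.0, x)

# T_n(x) = P_n^(-1/2,-1/2)(x) / P_n^(-1/2,-1/2)(1)
chebyshevT(n, x) = jacobi(n, -0.5, -0.5, x) / jacobi(n, -0.5, -0.5, 1.0)

legendre(2, 0.5)      # (3x^2 - 1)/2 at x = 0.5 -> -0.125
chebyshevT(2, 0.5)    # 2x^2 - 1 at x = 0.5 -> -0.5
```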

In just 31 lines, we have a function which computes the Jacobi polynomials, which means we can compute Jacobi, Chebyshev, Legendre, and Zernike polynomials. This is powerful. Is it fast?

x=-1:0.00001:1;
n=10;
α=0.;
β=0;
@benchmark jacobi.($n, $α, $β, $x)
BenchmarkTools.Trial:
  memory estimate:  1.53 MiB
  allocs estimate:  4
  --------------
  minimum time:     15.138 ms (0.00% GC)
  median time:      15.490 ms (0.00% GC)
  mean time:        15.644 ms (0.60% GC)
  maximum time:     18.700 ms (16.95% GC)
  --------------
  samples:          320
  evals/sample:     1

x has 200001 elements, so this took 75.7 ns per element. My CPU runs at about 0.2 ns per clock, so this is about 380 cycles per element.

Ignoring the difference between the Zernike and Jacobi polynomials, and remembering that Zernike involves extra steps and will be slower, this compares to 3.48 ms for a 256x256 grid, or 53.1 ns per element.

Didn’t I say this recurrence thing would be faster? If you look back at the abc function, it performs a fairly large amount of arithmetic per element. We can see already that $a$ and $c$ do not depend on $x$, so we can hoist them out of the elementwise calculation:

function calcac(n, α, β)
    a = (2n) * (n + α + β) * (2n + α + β - 2)
    c = 2 * (n + α - 1) * (n + β - 1) * (2n + α + β)
    return a, c
end


function calcb(n, α, β, x)
	# the parens emphasize the terms, not PEMDAS
	b1 = (2n + α + β -1)
	b2 = (2n + α + β)
	b2 = b2 * (b2 - 2)
	b = b1 * (b2 * x + α^2 - β^2)
	return b
end

function jacobi2(n, α, β, x)
    if n == 0
        return ones(size(x))
    elseif n == 1
        return (α + 1) .+ (α + β + 2) .* ((x.-1)./2)
    end
    Pnm1 = (α + 1) .+ (α + β + 2) .* ((x.-1)./2)
    a, c = calcac(2, α, β)
    inva = 1 / a
    # Pnm2 == P_0 == 1, so c .* Pnm2 is just c
    Pn = ((calcb.(2, α, β, x) .* Pnm1) .- c) .* inva
    if n == 2
        return Pn
    end
    Pnm2 = similar(Pnm1)
    for i = 3:n
        Pnm2, Pnm1 = Pnm1, Pn
        a, c = calcac(i, α, β)
        inva = 1 / a
        Pn = ((calcb.(i, α, β, x) .* Pnm1) .- (c .* Pnm2)) .* inva
	end
    return Pn
end

By refactoring a and c out of the elementwise calculation we have greatly reduced the work. A trade-off is that the function now operates on arrays instead of scalars; the implications of that will become clear in a moment. Did it get any faster?

# note lack of dot between jacobi2 and the arguments
@benchmark jacobi2($n, $α, $β, $x)
BenchmarkTools.Trial:
  memory estimate:  15.26 MiB
  allocs estimate:  36
  --------------
  minimum time:     7.516 ms (0.00% GC)
  median time:      8.002 ms (0.00% GC)
  mean time:        9.001 ms (11.88% GC)
  maximum time:     14.984 ms (0.00% GC)
  --------------
  samples:          556
  evals/sample:     1

Roughly twice as fast, thanks to the removal of significant wasted work. There is a subtle performance improvement from manually inlining Pnm1 and Pn at initialization. Unfortunately, the change from scalar to array calling has led to the function making more than one allocation per order $n$, due to the repeated creation of new Pn matrices.
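To see what those allocations cost, Base's @allocated macro can measure the difference between allocating a fresh output array and writing into an existing buffer. A small sketch of mine (the helper names f_new and f_inplace! are illustrative, not from the post):

```julia
f_new(x) = 2 .* x .+ 1                      # allocates a fresh output array every call
f_inplace!(y, x) = (y .= 2 .* x .+ 1; y)    # writes into the existing buffer y

x = rand(1000)
y = similar(x)
f_new(x); f_inplace!(y, x)                  # warm up so compilation isn't counted

new_bytes = @allocated f_new(x)             # ~8 KiB for 1000 Float64s
inplace_bytes = @allocated f_inplace!(y, x) # little to none
```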

We can remove most of the allocations by reusing an array, and manually inlining the b term of the recurrence will also boost performance. The Julia compiler seems to make sub-optimal choices here, with or without the @inline macro.
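The next version calls a calcac_startb helper that isn't shown in the post. Reconstructing it from the abc function above (splitting b into its x-independent pieces, so that b = b1 * (b2 * x + b3)), it presumably looks like this:

```julia
# reconstructed helper, not shown in the original post;
# returns the x-independent pieces of the recurrence coefficients
function calcac_startb(n, α, β)
    a = (2n) * (n + α + β) * (2n + α + β - 2)
    c = 2 * (n + α - 1) * (n + β - 1) * (2n + α + β)
    b1 = 2n + α + β - 1
    b2 = (2n + α + β) * (2n + α + β - 2)
    b3 = α^2 - β^2
    return a, c, b1, b2, b3
end
```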

function jacobi3(n, α, β, x)
	if n == 0
		return ones(size(x))
	elseif n == 1
		return (α + 1) .+ (α + β + 2) .* ((x.-1)./2)
	end
	Pnm1 = (α + 1) .+ (α + β + 2) .* ((x.-1)./2)
	a, c, b1, b2, b3 = calcac_startb(2, α, β)
	inva = 1 / a
	# .- c .* Pnm2, but Pnm2 == 1, so drop it
	Pn = similar(Pnm1)
	Pn .= (b1 .* (b2 .* x .+ b3) .* Pnm1 .- c) .* inva
	if n == 2
		return Pn
	end
	Pnm2 = similar(Pn)
	for i = 3:n
		# rotate the buffers; the old Pnm2 storage is reused for the new Pn
		Pnm2, Pnm1, Pn = Pnm1, Pn, Pnm2
		a, c, b1, b2, b3 = calcac_startb(i, α, β)
		inva = 1 / a
		Pn .= (b1 .* (b2 .* x .+ b3) .* Pnm1 .- c .* Pnm2) .* inva
	end
	return Pn
end

This function differs from jacobi2 in that part of the b term of the recurrence relation has been refactored to avoid repeat work, the application of the b term to x has been manually inlined, and Pn is assigned with .= for in-place assignment (“overwrite”) instead of allocating a new array each time.
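As a refresher on what .= means: broadcast assignment writes through to the existing array instead of binding a freshly allocated one. A minimal illustration:

```julia
a = zeros(3)
b = a                  # b and a are the same array, not a copy
b .= [1.0, 2.0, 3.0]   # writes into the shared buffer; no new array is bound
# a now also reads [1.0, 2.0, 3.0]
```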

Before showing the performance of this version, I’ll note that I made a fourth version with the @muladd macro (from the MuladdMacro.jl package), which was about 20% faster, and another version with explicit AVX vectorization:

function jacobi4(n, α, β, x)
	# ..
	for i = 3:n
		# ..
		@avx unroll=4 for i in eachindex(Pn)
			Pn[i] = (b1 * (b2 * x[i] + b3) * Pnm1[i] - c * Pnm2[i]) * inva
		end
	end
	return Pn
end

AVX is an instruction set extension that lets the CPU operate on multiple numbers in a single instruction (SIMD). It is distinct from multithreading. The @avx macro used here comes from the LoopVectorization.jl package.
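If you would rather avoid the LoopVectorization.jl dependency, Base Julia's @simd macro gives a similar (if weaker) hint to the compiler. A sketch of mine, not from the post:

```julia
# a fused multiply-add kernel; @inbounds skips bounds checks and
# @simd invites the compiler to emit vector instructions
function axpy!(out, a, x, y)
    @inbounds @simd for i in eachindex(out, x, y)
        out[i] = a * x[i] + y[i]
    end
    return out
end

axpy!(zeros(4), 2.0, [1.0, 2.0, 3.0, 4.0], ones(4))  # -> [3.0, 5.0, 7.0, 9.0]
```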

The results:

@benchmark jacobi3($n, $α, $β, $x)
BenchmarkTools.Trial:
  memory estimate:  4.58 MiB
  allocs estimate:  6
  --------------
  minimum time:     2.328 ms (0.00% GC)
  median time:      2.604 ms (0.00% GC)
  mean time:        2.873 ms (7.55% GC)
  maximum time:     6.276 ms (33.33% GC)
  --------------
  samples:          1739
  evals/sample:     1

@benchmark jacobi4($n, $α, $β, $x)
BenchmarkTools.Trial:
  memory estimate:  4.58 MiB
  allocs estimate:  6
  --------------
  minimum time:     1.493 ms (0.00% GC)
  median time:      2.236 ms (0.00% GC)
  mean time:        2.636 ms (12.19% GC)
  maximum time:     9.074 ms (37.75% GC)
  --------------
  samples:          1895
  evals/sample:     1

The removal of the wasted allocations made the code about 3x faster, and using vector instructions gained roughly another 50%.

This works out to 7.5 ns per element, about 10x faster than we started.

This is not all there is, though. For comparison, we will write the simplest possible Jacobi sum function, using the slow, waste-filled jacobi function we wrote first:

function jacobi_sum_simple(ns, weights, α, β, x)
	out = zero(x)
	for (n, w) in zip(ns, weights)
		out .+= (w .* jacobi.(n, α, β, x))
	end
	return out
end

@benchmark jacobi_sum_simple(1:100, 100 : -1 : 1, 0., 0., $x)
BenchmarkTools.Trial:
  memory estimate:  1.53 MiB
  allocs estimate:  2
  --------------
  minimum time:     5.364 s (0.00% GC)
  median time:      5.364 s (0.00% GC)
  mean time:        5.364 s (0.00% GC)
  maximum time:     5.364 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

Higher orders are inherently more time consuming, so the total time is dominated by the largest $n$ involved. With a little intuition, you might think this is as fast as it gets. However, we are not exploiting the recurrence relation across the sum: each call to jacobi recomputes every lower order from scratch. The following function is more complicated, but avoids intermediate allocations and uses the recurrence to compute each order only once:

function jacobi_sum_onepass(ns, weights, α, β, x)
	# start by allocating the output array, be a little clever
	# around the initialization value
	out = similar(x)
	if ns[1] == 0
		fill!(out, weights[1])
	else
		fill!(out, 0)
	end

	# Pnm1 = P_1; Pnm2 = P_0 == 1
	# this block of setup is the same as before
	Pnm1 = (α + 1) .+ (α + β + 2) .* ((x.-1)./2)
	a, c, b1, b2, b3 = calcac_startb(2, α, β)
	inva = 1 / a
	# .- c .* Pnm2, but Pnm2 == 1, so drop it
	Pn = similar(Pnm1)
	Pn .= @muladd (b1 .* (b2 .* x .+ b3) .* Pnm1 .- c) .* inva

	# add P_1 and P_2 if they are in the sum.  Two ifs are
	# faster than a searchsorted
	if ns[1] == 1
		out .+= weights[1] .* Pnm1
	elseif ns[2] == 1
		out .+= weights[2] .* Pnm1
	end
	if ns[1] == 2
		out .+= weights[1] .* Pn
	elseif ns[2] == 2
		out .+= weights[2] .* Pn
	end

	Pnm2 = similar(Pn)
	for i = 3:ns[end]
		# rotate the buffers; the old Pnm2 storage is reused for the new Pn
		Pnm2, Pnm1, Pn = Pnm1, Pn, Pnm2
		a, c, b1, b2, b3 = calcac_startb(i, α, β)
		inva = 1 / a
		@avx unroll=4 for i in eachindex(Pn)
			Pn[i] = (b1 * (b2 * x[i] + b3) * Pnm1[i] - c * Pnm2[i]) * inva
		end
		# i is the order that was just computed
		for j in eachindex(ns)
			if ns[j] == i
				# this order is relevant to the sum
				out .+= weights[j] .* Pn
				break
			end
		end
	end
	return out
end

This is longer, but it is mostly copy-paste: the big additions are the if..elseif block in the middle and the inner for loop. The performance reflects the fact that we compute each order only once:

@benchmark jacobi_sum_onepass(1:100, 100: -1 : 1, 0., 0., $x)
BenchmarkTools.Trial:
  memory estimate:  6.10 MiB
  allocs estimate:  8
  --------------
  minimum time:     20.800 ms (0.00% GC)
  median time:      23.255 ms (0.00% GC)
  mean time:        24.710 ms (1.98% GC)
  maximum time:     39.047 ms (0.00% GC)
  --------------
  samples:          203
  evals/sample:     1

So for a weighted sum up to order 100, this is 270x faster.

Recap

In summary, we replaced a calculation that took on the order of 50 nanoseconds per element with one that takes on the order of 5 nanoseconds per element, by replacing a factorial expansion with a recurrence relation, and then optimizing the implementation of that recurrence relation.

We then used that recurrence relation to remove redundant work in computing sums of orthogonal polynomials, which yielded an additional 270x speedup.

The results combine to create an orthogonal polynomial sum calculator that is ~2,500x faster. The performance gap grows with the number of terms in the sum, and shrinks for smaller sums.

Future work will expand this to the Zernike polynomials, which require writing the same function body a third time to inline the recurrence relation. The dual expansion in n and m requires special care.