Season Finale of Fast Zernike Polynomials

In the last several posts, I have shown the link between the Zernike and Jacobi polynomials, as well as the recurrence relation for the latter, which can be exploited to speed up calculations. In this post, I bring things full circle with what is, by a massive margin, the fastest Zernike calculator of any open source software I can find.

See the previous posts to review the slowness of the Rodrigues radial polynomial expansion and the speed of the Jacobi polynomials.

The Zernike polynomials are a bit more complicated in that they expand in $n$, $m$, and $n_j$ simultaneously. Fortunately, hidden in this complexity is shared work we can remove. Consider the following table:

| Noll | $n$ | $m$ | $n_j$ |
|------|-----|-----|-------|
| 1    | 0   | 0   | 0     |
| 2    | 1   | 1   | 0     |
| 3    | 1   | -1  | 1     |
| 4    | 2   | 0   | 1     |
| 5    | 2   | -2  | 2     |
| 6    | 2   | 2   | 0     |
| 7    | 3   | -1  | 2     |
| 8    | 3   | 1   | 1     |
| 9    | 3   | -3  | 3     |
| 10   | 3   | 3   | 0     |

The $n_j$ column is filled with duplicates. If we wish to expand up to Noll Z36, there are fewer than 36 unique combinations of $n_j$ and $|m|$. We can also see that $n_j$ is not monotonic. This shapes our algorithm, since we know we cannot simply accumulate terms in a single loop. If we are in control of allocations, this means we must allocate storage for a series of Jacobi polynomial evaluations, one for each unique combination of $n_j$ and $|m|$.

We can also see that there are duplicates within $m$, and even more so within $|m|$. These are similarly non-monotonic, so there must be intermediate allocations for those too.
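
To make the duplication concrete, here is a quick Python check using Poppy's noll_indices (the same helper that appears in the benchmark below). It assumes the usual relation $n_j = (n - |m|)/2$ for the Jacobi order, so treat it as an illustration of the counting rather than the exact bookkeeping used in OpticsPolynomials.

from poppy import zernike

# map Noll terms 1..36 to (n, m) and collect the unique lookup keys
nms = [zernike.noll_indices(j) for j in range(1, 37)]
# assumption: n_j = (n - |m|)/2, the Jacobi order in the radial identity
unique_jacobi = {((n - abs(m)) // 2, abs(m)) for n, m in nms}
unique_abs_m = sorted({abs(m) for _, m in nms})
unique_m = sorted({m for _, m in nms})

print(len(nms), len(unique_jacobi))  # 36 terms, but far fewer unique (n_j, |m|) pairs
print(unique_abs_m)                  # the few radial powers rho^|m| we need
print(unique_m)                      # the signed m values for the sin/cos terms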

The final piece of the puzzle is the pair of separable computations for terms with $m \ne 0$:

$$ \rho^{|m|} \qquad \text{and} \qquad \sin{m\theta} \;\text{ or }\; \cos{m\theta} $$

Because these depend on different inputs ($\rho$ versus $\theta$), we must store them separately.

Therefore, we design our algorithm to work in several steps (a code sketch follows the list):

  1. Calculate a lookup table of Jacobi polynomials for the unique combinations of $n_j$ and $|m|$, using the efficient recurrence relation that we know.
  2. Calculate a lookup table of $\rho^{|m|}$ for each unique $|m|$.
  3. Calculate a lookup table of $\sin{m\theta}$ and $\cos{m\theta}$ for each unique $m$.
  4. For each input $n, m$ look up the appropriate elements and compute the Zernike polynomial.
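
The following is a minimal Python sketch of those four steps, meant only to show the shape of the work-sharing; it is not the OpticsPolynomials implementation. It uses scipy.special.eval_jacobi in place of the recurrence from the previous post, skips the Noll normalization, and assumes the radial identity $R_n^{|m|}(\rho) = \rho^{|m|} P_{(n-|m|)/2}^{(0,|m|)}(2\rho^2-1)$, with the sign of $m$ selecting sine or cosine.

import numpy as np
from scipy.special import eval_jacobi  # stand-in for the recurrence-based evaluator

def zernike_series_sketch(nms, rho, theta):
    # nms: list of (n, m) pairs; rho, theta: polar coordinate grids
    x = 2 * rho**2 - 1  # argument of the Jacobi polynomials

    # step 1: Jacobi polynomials for each unique (n_j, |m|) pair
    jacobi_lut = {}
    for n, m in nms:
        am = abs(m)
        nj = (n - am) // 2
        if (nj, am) not in jacobi_lut:
            jacobi_lut[(nj, am)] = eval_jacobi(nj, 0, am, x)

    # step 2: rho^|m| for each unique nonzero |m|
    rho_lut = {am: rho**am for am in {abs(m) for _, m in nms} if am != 0}

    # step 3: sin(|m| theta) or cos(m theta) for each unique nonzero m
    # (convention assumed here: negative m -> sine term)
    azi_lut = {}
    for _, m in nms:
        if m != 0 and m not in azi_lut:
            am = abs(m)
            azi_lut[m] = np.sin(am * theta) if m < 0 else np.cos(am * theta)

    # step 4: assemble each term from the lookup tables
    out = np.empty((len(nms), *rho.shape))
    for i, (n, m) in enumerate(nms):
        am = abs(m)
        term = jacobi_lut[((n - am) // 2, am)]
        if m != 0:
            term = term * rho_lut[am] * azi_lut[m]
        out[i] = term
    return out

A production version would also apply the normalization convention and preallocate its buffers rather than building dictionaries, but the lookup-table structure is the same.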

The function that implements this is quite large; I will spare you its body here. You can find it in the OpticsPolynomials source code. As a baseline, we begin by writing the equivalent Python, using Poppy as a reference implementation built on the aforementioned Rodrigues expansion.

The Python function looks like:

import numpy as np

from poppy import zernike

def zseries(terms, samples):
    # evaluate each Noll term independently on a samples x samples grid
    out = np.empty((len(terms), samples, samples), np.float64)
    for i, t in enumerate(terms):
        n, m = zernike.noll_indices(t)
        out[i, :, :] = zernike.zernike(n, m, samples)
    return out

%timeit zseries(range(1,38),512)
920 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit zseries(range(1,38),2048)
14.7 s ± 83.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The two benchmarks simply show that the timing is linear in the number of elements: $512\cdot512 = 2048\cdot2048/16$, and $16 \times 920\,\text{ms} \approx 14.7\,\text{s}$.

This compares to the equivalent Julia, with the smart work-reusing algorithm and underlying Jacobi basis:

using BenchmarkTools
using GridCreation
using OpticsPolynomials

x, y = mkCartVecs(1/256, 512);     # Cartesian coordinate vectors, 512 samples
r, t = cartVecsToPolarGrid(x, y);  # radius and azimuth grids
nms = zernike_noll_to_nm.(1:37)    # (n, m) pairs for Noll terms 1 through 37

@benchmark zernike_series($nms, $r, $t)

# for 512x512
BenchmarkTools.Trial:
  memory estimate:  24.01 MiB
  allocs estimate:  99
  --------------
  minimum time:     4.773 ms (0.00% GC)
  median time:      5.147 ms (0.00% GC)
  mean time:        6.601 ms (14.90% GC)
  maximum time:     70.575 ms (88.02% GC)
  --------------
  samples:          762
  evals/sample:     1

# for 2048x2048
BenchmarkTools.Trial:
  memory estimate:  384.01 MiB
  allocs estimate:  99
  --------------
  minimum time:     84.019 ms (0.00% GC)
  median time:      86.484 ms (0.00% GC)
  mean time:        102.475 ms (14.03% GC)
  maximum time:     168.685 ms (42.45% GC)
  --------------
  samples:          49
  evals/sample:     1

This corresponds to a ~175x speedup (14.7 s versus ~84 ms for the 2048x2048 case). It may be possible to speed this algorithm up further, but more than two orders of magnitude is a tremendous improvement over the status quo.

The algorithm holds onto more memory for longer, but uses less total memory. Allowing portions of the memory to be released earlier may afford some speedup, but I doubt it.

On a GPU this should remain blazing fast since the number of operations per element is low.

For a 2048x2048 array, a matrix DFT with output size equal to the input size takes 1.8 seconds. When a Zernike series evaluation alone, before even scaling and summing the modes, took 15 seconds, it dominated the computation time. This speedup reduces the computation time of physical optics models to essentially just the propagations, assuming the other parts of the model can be made as fast as the Zernike polynomial evaluation.

Brandon Dube
Optical Engineer