Merkle Multi-Proofs
\gdef\p#1{\left({#1}\right)} \gdef\setp#1{\left\{{#1}\right\}} \gdef\stirlingii#1#2{{{#1} \brace {#2}}} \gdef\setn#1{\mathcal{#1}} \gdef\range#1{\left[#1\right)} \gdef\floor#1{\left\lfloor#1\right\rfloor} \gdef\ceil#1{\left\lceil#1\right\rceil} \gdef\vec#1{\mathbf{#1}} \gdef\H{\mathsf{hash}} \gdef\bar#1{\left\lvert{#1}\right\rvert} \gdef\Exp#1{\mathbb{E}\!\left[{#1}\right]} \gdef\Var#1{\operatorname{Var}\!\p{#1}} \gdef\Cov#1{\operatorname{Cov}\!\p{#1}} \gdef\union{\cup} \gdef\intersection{\cap} \gdef\empty{\varnothing}
Vector commitments, and specifically Merkle proofs, are the heart of every hash-based proof system. They are responsible for a large part of proof size and verifier work, yet we don't have sharp estimates of either. This is largely due to a complex dependency on the distribution of opened indices. In hash-based proof systems, this distribution is uniformly random with replacement, which lets us derive the concrete results that follow.
Key results are exact minimum, maximum, mean, variance and probability mass functions of proof size and verifier work, plus approximations and a way to implement Merkle caps purely on the verifier side.
Definition 1. A Vector Commitment Scheme for a set \setn S has \mathsf{commit}, \mathsf{open} and \mathsf{verify} algorithms.
- The \mathsf{commit} algorithm takes a vector \vec v \in {\setn S}^n and outputs a commitment C to \vec v.
- The \mathsf{open} algorithm takes a commitment C, a list of m indices \vec i \in [0,n)^m and values \vec y \in {\setn S}^m and outputs a proof \pi that \vec v_{\vec{i}_j} = \vec y_j for all j \in [0,m).
- The \mathsf{verify} algorithm takes C, \vec i, \vec y, \pi and outputs true if the proof is valid and false otherwise.
The canonical example of a vector commitment scheme is a Merkle tree. There are many expositions of this, so I'll just write a quick definition:
Definition 2. Given a two-to-one hash function \H: \setn H \times \setn H \rightarrow \setn H, a Merkle tree of height h is a vector commitment scheme for n = 2^h values of \setn H.
- The \mathsf{commit} algorithm takes a vector \vec v \in {\setn H}^n and constructs a perfect binary tree with \vec v_i as the i-th leaf and each node value computed by taking the hash of the child values. The commitment C \in \setn H is the value of the root node.
- The \mathsf{open} algorithm takes \vec i and marks the corresponding leaves and their ancestors. The proof \pi \in \setn H^\ast consists of the values of nodes and leaves that are not marked but have a marked parent.
- The \mathsf{verify} algorithm takes C, \vec i, \vec y, \pi and repeatedly applies \H to compute every node value for which it knows both child values. This eventually computes the root node, which is then verified to equal C.
We call the nodes in \pi the revealed nodes and the ones computed in \mathsf{verify} the computed nodes.
The variant presented here uses compressed multi-proofs. We also allow \vec i to be in arbitrary order and to contain repeated values. The revealed indices and values \vec i, \vec y are not part of the proof \pi, as the assumption is that prover and verifier have already established these by other means.
This scheme naturally extends to imperfect trees and higher- or even mixed-arity trees. This allows vector lengths that are not a power of two and can reduce tree depth where that is beneficial. The definition above uses a single hash function for all nodes, but it works with arbitrary hash functions for each node as long as prover and verifier pick the same ones. In verifier circuits the hash function cost for prover and verifier can be wildly different, and it would make sense to pick a prover-favorable hash function for the deeper nodes and a verifier-favorable hash function for the nodes closer to the top. Neither of these extensions is considered in this analysis.
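To make this concrete, here is a minimal Python sketch of the power-of-two, single-hash-function scheme (the names hash2, commit, open_proof and verify, the level-by-level tree layout, and the use of SHA-256 as the two-to-one hash are illustrative choices, not part of the definition):

from hashlib import sha256

def hash2(a: bytes, b: bytes) -> bytes:
    return sha256(a + b).digest()

def commit(leaves):
    # levels[0] holds the leaves, levels[-1] holds the root.
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hash2(prev[2 * i], prev[2 * i + 1]) for i in range(len(prev) // 2)])
    return levels

def open_proof(levels, indices):
    # Collect the unmarked nodes that have a marked sibling, level by level.
    proof, marked = [], set(indices)
    for level in levels[:-1]:
        for i in sorted(marked):
            if i ^ 1 not in marked:
                proof.append(level[i ^ 1])
        marked = {i // 2 for i in marked}
    return proof

def verify(root, h, indices, values, proof):
    proof_iter = iter(proof)
    layer = dict(zip(indices, values))   # position -> value at the current level
    for _ in range(h):
        parents = {}
        for i in sorted(layer):
            if i // 2 in parents:
                continue                 # sibling pair already handled
            sib = layer[i ^ 1] if i ^ 1 in layer else next(proof_iter)
            left, right = (layer[i], sib) if i % 2 == 0 else (sib, layer[i])
            parents[i // 2] = hash2(left, right)
        layer = parents
    return layer[0] == root

# Example: height-3 tree, three openings in arbitrary order with a repeat.
leaves = [sha256(bytes([i])).digest() for i in range(8)]
tree = commit(leaves)
idx = [5, 2, 5]
assert verify(tree[-1][0], 3, idx, [leaves[i] for i in idx], open_proof(tree, idx))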
In the aforementioned application in hash-based proof systems, the indices \vec i are generated by m uniform random draws from [0,n) with repetition. Knowing the distribution of \vec i we can compute the distribution of the proof size \bar{\pi}, i.e. the number of revealed nodes, and the number of hashes the verifier has to compute, i.e. the number of computed nodes. Finally we will look at a verifier implementation with deterministic control flow.
Number of Computed Nodes
The number of hashes a verifier needs to perform is the number of computed nodes: non-leaf nodes that are marked. Given a Merkle tree of height h with m uniformly random openings with replacement, let C_d be the number of computed nodes at level d \in [0,h) of the tree. This gives a set of random variables C_d over the choice of openings. Importantly these random variables are not independent, as the computation of a node at level d+1 implies the computation of its parent at level d.
Equivalently, consider a uniform random draw of m infinite binary strings \setn M \subset \setp{0,1}^\infty. Take \setn M_d \subset \setp{0,1}^d to be the set of prefixes of length d of elements in \setn M. We have that C_d = \bar{\setn M_d}, the number of distinct prefixes of length d in \setn M.
The distribution of C_d is the occupancy distribution of throwing m balls into 2^d bins uniformly at random with replacement. This is a well-known problem in probability theory and combinatorics. The probability mass function is given by
\Pr[C_d = k] = \frac{k!}{2^{d \cdot m}} \binom{2^d}{k} \stirlingii{m}{k}
where \binom{n}{k} is the binomial coefficient and \stirlingii{m}{k} is the Stirling number of the second kind. The support for m > 0 is k \in [1, \min(2^d, m)], but note that C_d is also bounded by C_{d+1} \in [C_d, 2 \cdot C_d] due to the tree structure.
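For reference, this per-level PMF is straightforward to evaluate directly (a small sketch; the helper names are mine):

from functools import cache
from math import comb, factorial

@cache
def stirling2(m, k):
    # Stirling number of the second kind via the standard recurrence.
    if m == k:
        return 1
    if k == 0 or k > m:
        return 0
    return k * stirling2(m - 1, k) + stirling2(m - 1, k - 1)

def occupancy_pmf(d, m, k):
    # Pr[C_d = k]: m uniform draws hit exactly k of the 2^d nodes at level d.
    return factorial(k) * comb(2**d, k) * stirling2(m, k) / 2**(d * m)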
Expected Value
Take p_d = 1 - (1 - 2^{-d})^m to be the probability that a node at level d is computed. The expected number of computed nodes at level d is then \mathbb{E}[C_d] = 2^d \cdot p_d. The total number of computed nodes C is C = \sum_{d \in [0,h)} C_d and therefore
\boxed{ \mathbb{E}[C] = \sum_{d \in [0,h)} 2^d \cdot \p{1 - \p{1- 2^{-d}}^m} }
This exact expression can be efficiently evaluated in O(h). For m \ll 2^h we can also approximate it well with m \cdot (h - \log_2 m + 0.11).
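In code, the exact sum and the approximation are one-liners (function names are mine):

from math import log2

def expected_computed(h, m):
    # Exact E[C] in O(h) operations.
    return sum(2**d * (1 - (1 - 2.0**-d)**m) for d in range(h))

def expected_computed_approx(h, m):
    # Approximation for m << 2**h.
    return m * (h - log2(m) + 0.11)

For example, at h = 20 and m = 200 the two values agree closely.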
Support
Let k \in [k_{\min}(h,m), k_{\max}(h,m)] be the support, i.e. the range where \Pr[C = k] > 0. Then
k_{\min}(h, m) = \begin{cases} 0 & \text{if } m = 0 \\ h & \text{otherwise} \\ \end{cases}
follows from the minimal case where all m openings fall on the same leaf. Similarly, the maximum follows from distributing the m openings as spread out as possible on each level, so k_{\max}(h, m) = \sum_{d \in [0,h)} \min(2^d, m). This can be written in closed form as
k_{\max}(h, m) = \begin{cases} 0 & \text{if } m = 0 \\ 2^{\floor{\log_2 m} + 1} + m \cdot (h - \floor{\log_2 m} - 1) - 1 & \text{if } 1 \leq m < 2^h \\ 2^h -1 & \text{if } m \geq 2^h \text{.} \end{cases}
# Support of the number of computed nodes C.
def c_min(h, m):
    return 0 if m == 0 else h

def c_max(h, m):
    return sum(min(2**d, m) for d in range(h))
Probability Mass Function
Let p_{h,m}(k) = \Pr[C = k] be the probability mass function. The base cases p_{h,0}(k) and p_{0,m}(k) are 1 when k=0 and 0 otherwise. Then, by induction, a tree of height h+1 with m > 0 openings requires the root to be computed, plus whatever is computed in the two height-h subtrees. This results in the recurrence
p_{h+1,m}(k) = \underbrace{2^{-m} \sum_{m' \in [0,m]} \binom{m}{m'} }_{\boxed{1}} \underbrace{\sum_{k' \in [0,k-1]}}_{\boxed{2}} p_{h,m'}(k') \cdot p_{h, m - m'}(k - 1 - k')
where \boxed{1} accounts for the binomial distribution of the m openings over the subtrees as (m', m-m'). One computed node is subtracted for the root, and \boxed{2} distributes the remaining k-1 computed nodes over the subtrees as (k', k-1-k').
The PMF can be calculated reasonably quickly on values of practical interest (e.g. h=20, m=200) using memoization and restricting the values to the support:
from functools import cache
from math import comb

@cache
def pmf_c(h, m):
    # Pr[C = k] as a dict {k: probability}, following the recurrence above.
    if h == 0 or m == 0:
        return {0: 1.0}
    out = {}
    for m1 in range(m + 1):
        w = comb(m, m1) * 2.0**-m        # probability of the split (m1, m - m1)
        for k1, p1 in pmf_c(h - 1, m1).items():
            for k2, p2 in pmf_c(h - 1, m - m1).items():
                k = k1 + k2 + 1          # +1 for the root node
                out[k] = out.get(k, 0.0) + w * p1 * p2
    return out

The thick black line is the exact PMF computed with the above recurrence formula. The dotted lines are in log scale, with black being the PMF. The red and blue lines are the normal and beta-binomial approximations discussed below.
The resulting PMF is concentrated in a narrow spike near the maximum of the support. The log scale reveals some skewness in the peak, but overall it looks like it can be well approximated by a normal distribution. We already have the expected value from above, so we now turn to the variance.
Variance
We will compute the variance by deriving a formula for \Cov{C_d, C_e} and using the identity
\Var{C} = \Var{\sum_{d \in [0,h)} C_d} = \sum_{d, e \in [0,h) } \Cov{C_d, C_e} \text{.}
Let I_{d,i} \in \setp{0,1} be the indicator random variable for the marking of node i at level d. It is Bernoulli distributed with p_d = 1 - (1 - 2^{-d})^m, such that \Exp{I_{d,i}} = p_d and \Var{I_{d,i}} = p_d \cdot (1 - p_d). Observe that C_d = \sum_{i \in [0, 2^d)} I_{d,i} and we quickly rederive \Exp{C_d} = \sum_{i \in [0, 2^d)} \Exp{I_{d,i}} = 2^d \cdot p_d.
Let L_{d,i} \subseteq [0,2^h) be the set of leaves descendant from node i at level d. Let's extend p with the joint probability of two node markings conditioned on non-overlapping leaf sets as
p_{de} = \Pr(I_{d,i} = 1, I_{e,j} = 1 \mid L_{d,i} \intersection L_{e,j} = \empty)\text{.}
By inclusion-exclusion we have
\begin{align*} p_{de} =\ & 1 - \Pr(I_{d,i} = 0) - \Pr(I_{e,j} = 0) \\ & + \Pr(I_{d,i} = 0, I_{e,j} = 0 \mid L_{d,i} \intersection L_{e,j} = \empty) \text{.} \end{align*}
The first two probabilities are (1 - 2^{-d})^m and (1 - 2^{-e})^m as we saw before. The last is the probability that all m draws avoid the leaves that are descendants of either i or j, which is (1 - 2^{-d} - 2^{-e})^m. We thus get
p_{de} = 1 - (1 - 2^{-d})^m - (1 - 2^{-e})^m + (1 - 2^{-d} - 2^{-e})^m
with the special case p_{dd} = 1 - 2 \cdot (1 - 2^{-d})^m + (1 - 2^{1-d})^m and symmetry p_{de} = p_{ed}.
We will use the identity \Cov{C_d, C_e} = \Exp{C_d \cdot C_e} - \Exp{C_d} \cdot \Exp{C_e} to compute the covariance. So next we compute \Exp{C_d \cdot C_e}:
\mathbb{E}[C_d \cdot C_e] = \sum_{i \in [0, 2^d)} \sum_{j \in [0, 2^e)} \mathbb{E}[I_{d,i} \cdot I_{e,j}] \text{.}
Take f = \max(d,e) the deepest level. We split the sum into two cases, depending on whether L_{d,i} \intersection L_{e,j} = \empty or not. If it is not empty then one node is a descendant of or identical to the other, and I_{d,i} \cdot I_{e,j} equals the indicator of the deeper node at level f. There are 2^f such cases. The remaining 2^{d+e} - 2^f cases have disjoint leaf sets and thus \mathbb{E}[I_{d,i} \cdot I_{e,j}] = p_{de}. Summing the disjoint and non-disjoint contributions we get
\begin{align*} \mathbb{E}[C_d \cdot C_e] &= 2^f \cdot p_f + \p{2^{d+e} - 2^f} \cdot p_{de} \text{.} \end{align*}
From \Cov{C_d, C_e} = \Exp{C_d \cdot C_e} - \Exp{C_d} \cdot \Exp{C_e} it then follows
\begin{align*} \Cov{C_d, C_e} &= 2^f \cdot p_f + \p{2^{d+e} - 2^f} \cdot p_{de} - 2^{d+e} \cdot p_d \cdot p_e \\ \end{align*}
and thus
\boxed{ \Var{C} = \sum_{d, e \in [0,h)} 2^f \cdot p_f + \p{2^{d+e} - 2^f} \cdot p_{de} - 2^{d+e} \cdot p_d \cdot p_e }
where f = \max(d,e). This double sum can be evaluated in O(h^2) time.
def mean_var_c(h, m):
    # Exact mean and variance of the number of computed nodes C.
    mean = 0.0
    var = 0.0
    for d in range(h):
        p_d = 1 - (1 - 2.0**-d)**m
        mean += 2.0**d * p_d
        for e in range(h):
            p_e = 1 - (1 - 2.0**-e)**m
            p_de = p_d + p_e - 1 + (1 - 2.0**-d - 2.0**-e)**m
            f, p_f = (d, p_d) if d >= e else (e, p_e)
            var += 2.0**f * p_f + (2.0**(d + e) - 2.0**f) * p_de - 2.0**(d + e) * p_d * p_e
    return mean, var
If we compare the mean and variance computed this way to the mean and variance computed from the exact PMF recurrence above, we find they match to within numerical precision (~10^{-11} for the mean and ~10^{-5} for the variance).
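The check itself is a short helper over the PMF dictionaries returned by pmf_c (name is mine):

def moments_from_pmf(pmf):
    # Mean and variance of a distribution given as a {value: probability} dict.
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean)**2 * p for k, p in pmf.items())
    return mean, var

# e.g. compare moments_from_pmf(pmf_c(20, 200)) against mean_var_c(20, 200)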
Normal Approximation
Given the mean and variance above we can construct a normal approximation as \Pr[C = k] \approx {\setn N}\!\p{\Exp{C}, \Var{C}}. We compute the approximate PMF \hat{p}_{h,m}(k) by evaluating the normal distribution at each integer in the support and normalizing:
import numpy as np

def normal_pmf_c(h, m):
    # Normal approximation of Pr[C = k], evaluated on the support and normalized.
    mean, var = mean_var_c(h, m)
    lo, hi = c_min(h, m), c_max(h, m)
    if var == 0.0:
        # Degenerate point mass
        p = np.zeros(hi + 1)
        p[lo] = 1.0
        return p
    k = np.arange(hi + 1)
    p = np.exp(-(k - mean)**2 / (2 * var))
    p[:lo] = 0.0   # PMF is zero below minimum support
    p /= p.sum()   # Normalize
    return p
Note: When computing confidence intervals from the normal distribution, don't forget to account for the tails cut off due to finite support.
We can then compute the KL-divergence between the exact PMF and the approximate PMF. For the previous graph of h = 20 and m = 200, the D_{KL} = 0.00362\ \mathrm{bits} which is very small. The linear graphs are visually indistinguishable. The log graphs show an underestimation of the left tail and overestimation of the right tail. This is desirable as it means the normal approximation is conservative in estimating the maximum.
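For reference, a small sketch of the KL-divergence computation between the exact PMF (a dict) and an approximation evaluated on the full support (an array):

from math import log2

def kl_bits(exact, approx):
    # D_KL(exact || approx) in bits.
    return sum(p * log2(p / approx[k]) for k, p in exact.items() if p > 0)

# e.g. kl_bits(pmf_c(20, 200), normal_pmf_c(20, 200))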

Over the range h \in [8, 21] and m \in [0, 300] the KL-divergence between the exact PMF and the normal approximation is below 0.01\ \mathrm{bits} for m > 75 and below 0.1 for m > 5. For small m the discrete nature of the PMF is more pronounced and the normal approximation is less accurate. But fortunately there the exact PMF can be computed quickly.
Beta-Binomial Approximation
We can do better than a normal approximation by using a discrete distribution that matches the support, mean, and variance exactly. The \beta-binomial distribution is a good candidate as it is defined on [0, n] for integer n and has two shape parameters that can be tuned to match mean and variance.
from scipy.stats import betabinom, binom

def betabinom_pmf_c(h, m):
    # Beta-binomial approximation of Pr[C = k], shifted onto the support [lo, hi].
    mean, var = mean_var_c(h, m)
    lo, hi = c_min(h, m), c_max(h, m)
    p = np.zeros(hi + 1)
    if hi == lo or var == 0.0:
        # Degenerate point mass
        p[lo] = 1.0
        return p
    n, mu = hi - lo, (mean - lo) / (hi - lo)
    v = var / (n * mu * (1 - mu))   # ratio of target variance to binomial variance
    k = np.arange(n + 1)
    if v <= 1.0:
        # Under-dispersed: Fallback to binomial
        p[lo:] = binom.pmf(k, n, mu)
        return p
    rho = (n - v) / (v - 1.0)       # solve (rho + n) / (rho + 1) = v for rho
    a, b = mu * rho, (1 - mu) * rho
    p[lo:] = betabinom.pmf(k, n, a, b)
    return p

The beta-binomial approximation is excellent! Around m > 2^{h-2} (the lower-quality region at the bottom of the plot) the variance becomes too low for a beta-binomial and we fall back to a binomial distribution. The binomial approximation gets better as m \gg 2^h.
Number of Revealed Nodes
The proof consists of the hash values associated with the revealed nodes: nodes that have a marked parent but are not themselves marked. Let S_{d,i} \in \setp{0,1} be the indicator variable that a node in the i-th sibling pair at level d is revealed. Then S_{d,i} is the XOR of the sibling I indicators:
\begin{align*} S_{d,i} &= I_{d,2i} \oplus I_{d,2i+1} \\ &= I_{d,2i} + I_{d,2i+1} - 2 \cdot I_{d,2i} \cdot I_{d,2i+1} \text{.} \end{align*}
Note that there are 2^{d-1} sibling pairs at level d > 0, with no pairs at level 0 (there is only one root node). Let R_d be the number of revealed nodes at level d \in [0,h]. We have R_0 = 0 and R_d = \sum_{i \in [0, 2^{d-1})} S_{d,i}, and thus R = \sum_{d \in [1,h]} R_d is the total number of revealed nodes.
Expected Value
Computing \Exp{R} is now a matter of expanding definitions and using previous results:
\begin{align*} \Exp{R} &= \sum_{d \in [1,h]} \sum_{i \in [0, 2^{d-1})} \Exp{I_{d,2i} + I_{d,2i+1} - 2 \cdot I_{d,2i} \cdot I_{d,2i+1}} \\ &= \sum_{d \in [1,h]} \sum_{i \in [0, 2^{d-1})} \p{p_d + p_d - 2 \cdot p_{dd}} \\ &= \sum_{d \in [1,h]} 2^d \p{p_d - p_{dd}} \\ \end{align*}
where we use the fact that leaf sets of siblings must be disjoint. Expanding p_d and p_{dd} we get
\boxed{ \Exp{R} = \sum_{d \in [1,h]} 2^d \cdot \p{(1 - 2^{-d})^m - (1 - 2^{1-d})^m} }
This exact expression can be efficiently evaluated in O(h) operations. For m \ll 2^h we can approximate it well with m \cdot (h - \log_2 m - 0.89).
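As before, the exact sum and the approximation in code (function names are mine):

def expected_revealed(h, m):
    # Exact E[R] in O(h) operations.
    return sum(2**d * ((1 - 2.0**-d)**m - (1 - 2.0**(1 - d))**m) for d in range(1, h + 1))

def expected_revealed_approx(h, m):
    # Approximation for m << 2**h.
    return m * (h - log2(m) - 0.89)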
Support
Clearly for m=0 the support is R \in \{0\}. For m > 0 the minimum number of revealed nodes is attained when a maximally sized subtree is fully filled, which gives
R_{\min} = h - \min(h, \floor{\log_2 m}) \text{.}
An easy upper bound for the number of revealed nodes is \sum_{d \in [1, h]} \min(2^d, m) but this is not tight. The maximum is attained by allocating each opening a maximally sized subtree. A greedy procedure suffices where for each additional opening we pick the largest subtree containing a single opening and split it into its two child subtrees, each containing one opening. When m \ge 2^{h-1} we reach the maximum of 2^{h-1} revealed nodes and further openings are made to overlap. For 0 < m < 2^{h-1} we can write the maximum as
R_{\max} = 2^{\ceil{\log_2 m}} + m \cdot (h - \ceil{\log_2 m} - 1)\text{.}
def r_min(h, m):
    return 0 if m == 0 else h - min(h, m.bit_length() - 1)

def r_max(h, m):
    if m == 0 or h == 0:
        return 0
    c = min(h - 1, (m - 1).bit_length())   # ceil(log2(m)), capped
    return 2**c + m * (h - c - 1)
Probability Mass Function
Let p_{h,m}(k) = \Pr[R = k] be the probability mass function. The base cases p_{h,0}(k) and p_{0,m}(k) are 1 when k=0 and 0 otherwise as before. For the inductive step, consider a tree of height h > 0 and m > 0 openings. The root has two children, each of which is the root of a subtree of height h-1. Let m_1 and m_2 be the number of openings that fall in the left and right subtree respectively. Then there are two cases to consider: Either only one subtree has openings, and R is the number of revealed nodes in that subtree plus one for the root of the other, empty subtree, which is revealed. Or both subtrees have openings, and R is the sum of the revealed nodes in the two subtrees with no contribution from this level. This gives the recurrence
\begin{align*} p_{h+1,m}(k) &= 2^{1-m} \cdot p_{h,m}(k - 1) + \phantom{x} \\ &\phantom{=} 2^{-m} \cdot \sum_{m' \in [1,m)} \binom{m}{m'} \sum_{k' \in [0,k]} p_{h,m'}(k') \cdot p_{h,m - m'}(k - k') \end{align*}
This recurrence leads to a reasonably efficient implementation.
@cache
def pmf_r(h, m):
    # Pr[R = k] as a dict {k: probability}, following the recurrence above.
    if h == 0 or m == 0:
        return {0: 1.0}
    out = {}
    # All m openings fall into one subtree (2 symmetric choices); the other
    # subtree's root is revealed, adding one node.
    for k1, p1 in pmf_r(h - 1, m).items():
        out[k1 + 1] = 2.0 * p1
    # Both subtrees receive openings; the revealed counts simply add up.
    for m1 in range(1, m):
        for k1, p1 in pmf_r(h - 1, m1).items():
            for k2, p2 in pmf_r(h - 1, m - m1).items():
                out[k1 + k2] = out.get(k1 + k2, 0.0) + comb(m, m1) * p1 * p2
    return {k: 2.0**-m * p for k, p in out.items()}

Variance
We are going to end up with cubic and quartic terms, so before we start let's extend our definition of p_d, p_{de} to its final and most general form:
\begin{align*} p_{d_0 d_1 \dots d_{n-1}} &= \Pr\p{\forall_{j \in [0,n)}\ I_{d_j,i_j} = 1 \mid L_{d_j,i_j} \text{ pairwise disjoint}} \\ &= \Exp{\prod_{j \in [0,n)} I_{d_j,i_j}} \\ &= \sum_{\setn J \subseteq [0,n)} (-1)^{\bar{\setn J}} \cdot \p{1 - \sum_{j \in \setn J} 2^{-d_j}}^m \text{.} \end{align*}
This is obtained by inclusion-exclusion, extending the two-variable case above. It is again symmetric in its arguments.
As before, we start from \Var{R} = \sum_{d, e \in [1,h] } \Cov{R_d, R_e} and use the identity \Cov{R_d, R_e} = \Exp{R_d \cdot R_e} - \Exp{R_d} \cdot \Exp{R_e}, which means we need to compute
\begin{align*} \mathbb{E}[R_d \cdot R_e] &= \sum_{\substack{ i \in [0, 2^{d-1}) \\ j \in [0, 2^{e-1}) }} \mathbb{E}[S_{d,i} \cdot S_{e,j}] \text{.} \end{align*}
We consider three cases, depending on the relationship between the sibling pairs (d,i) and (e,j): they can be identical, descendant/ancestor, or disjoint.
Identical Sibling Pairs. When d = e and i = j we have S_{d,i} \cdot S_{e,j} = S_{d,i}^2 = S_{d,i}, the latter follows from indicators being \{0, 1\}. We have \Exp{S_{d,i}} = 2 \cdot (p_d - p_{dd}) as before. There are 2^{d-1} such cases with total contribution to \Exp{R_d \cdot R_e} of
2^d \cdot (p_d - p_{dd}) \text{.}
Ancestor/Descendant Sibling Pairs. Without loss of generality assume d < e and that the sibling pair (e,j) is a descendant of left branch (d, 2 \cdot i). Then from S_{e,j} = 1 follows I_{d,2i} = 1 and thus S_{d,i} = 1 - I_{d,2i+1}.
\begin{align*} \Exp{S_{d,i} \cdot S_{e,j}} &= \Exp{\p{1 - I_{d,2i+1}} \cdot \p{I_{e,2j} + I_{e,2j+1} - 2 \cdot I_{e,2j} \cdot I_{e,2j+1}}} \\ &= \Exp{\begin{align*} &I_{e,2j} + I_{e,2j+1} - 2 \cdot I_{e,2j} \cdot I_{e,2j+1}\\ &-I_{d,2i+1} \cdot I_{e,2j} - I_{d,2i+1} \cdot I_{e,2j+1} \\ &+2 \cdot I_{d,2i+1} \cdot I_{e,2j} \cdot I_{e,2j+1} \end{align*}} \\ &= 2\cdot \p{ p_e - p_{ee} - p_{de} + p_{dee} } \end{align*}
where the last line follows from all the Is being pairwise disjoint. There are 2^{e - 1} sibling pairs at level e that each have a unique ancestor at level d (which is either left or right in a pair). So the total contribution to \Exp{R_d \cdot R_e} from this case is
2^e \cdot \p{ p_e - p_{ee} - p_{de} + p_{dee} } \text{.}
Disjoint Pairs. In this case all Is are disjoint and we just have a big tedious expansion to do
\begin{align*} \Exp{S_{d,i} \cdot S_{e,j}} &= \mathbb{E}\big[\p{I_{d,2i} + I_{d,2i+1} - 2 \cdot I_{d,2i} \cdot I_{d,2i+1}} \\ &\phantom{=} \quad \cdot \p{I_{e,2j} + I_{e,2j+1} - 2 \cdot I_{e,2j} \cdot I_{e,2j+1}}\big] \\ &= 4 \cdot \p{p_{de} - p_{dde} - p_{dee} + p_{ddee}} \text{.} \end{align*}
For d < e there are 2^{d-1} \cdot 2^{e-1} - 2^{e - 1} such disjoint pairs. Additionally there are 2^{d-1} \cdot (2^{d-1} - 1) disjoint pairs when d = e.
Putting this all together we get
\begin{align*} \mathbb{E}[R_d \cdot R_e] &= \begin{cases} \begin{align*} &2^d \cdot (p_d - p_{dd}) + \\ &(2^{2\cdot d} - 2^{d+1})\cdot \p{p_{dd} - 2 \cdot p_{ddd} + p_{dddd}} \end{align*} & \text{if } d = e \\[1.5em] \begin{align*} &2^e \cdot \p{ p_e - p_{ee} - p_{de} + p_{dee} } + \\ &\p{2^{d + e} - 2^{e + 1}} \cdot \p{p_{de} - p_{dde} - p_{dee} + p_{ddee}} \end{align*} & \text{if } d < e \\[1.5em] \text{(same as $d < e$ with $d, e$ swapped)} & \text{if } d > e \\ \end{cases} \\ \end{align*}
Recall \Var{R} = \sum_{d, e \in [1,h] } \Exp{R_d \cdot R_e} - \p{\Exp{R}}^2 and we have \Exp{R} from before. We could expand everything out into a huge expression for \Var{R}, but it is not very insightful. Instead we'll give a way to compute \Exp{R} and \Var{R} in O(h^2) time:
from itertools import combinations

def p_joint(m, *levels):
    # Joint marking probability for nodes at the given levels with
    # pairwise disjoint leaf sets, by inclusion-exclusion.
    total = 0.0
    for r in range(len(levels) + 1):
        for subset in combinations(levels, r):
            total += (-1.0)**r * (1.0 - sum(2.0**-d for d in subset))**m
    return total

def mean_var_r(h, m):
    # Exact mean and variance of the number of revealed nodes R.
    mean = 0.0
    var = 0.0
    for d in range(1, h + 1):
        p_d, p_dd = p_joint(m, d), p_joint(m, d, d)
        p_ddd, p_dddd = p_joint(m, d, d, d), p_joint(m, d, d, d, d)
        mean += 2.0**d * (p_d - p_dd)
        # d = e contribution to E[R_d * R_e]
        var += 2.0**d * (p_d - p_dd) + (2.0**(2 * d) - 2.0**(d + 1)) * (p_dd - 2 * p_ddd + p_dddd)
        # d < e contributions, counted twice for (d, e) and (e, d)
        for e in range(d + 1, h + 1):
            p_e, p_ee, p_de = p_joint(m, e), p_joint(m, e, e), p_joint(m, d, e)
            p_dde, p_dee, p_ddee = p_joint(m, d, d, e), p_joint(m, d, e, e), p_joint(m, d, d, e, e)
            var += 2.0 * 2.0**e * (p_e - p_ee - p_de + p_dee)
            var += 2.0 * (2.0**(d + e) - 2.0**(e + 1)) * (p_de - p_dde - p_dee + p_ddee)
    var -= mean**2
    return mean, var
Normal Approximation
With the exact moments we can again construct a normal approximation as \Pr[R = k] \approx {\setn N}\!\p{\Exp{R}, \Var{R}}. We compute the approximate PMF \hat{p}_{h,m}(k) by evaluating the normal distribution at each integer in the support and normalizing as before (identical code).
We can then compute the KL-divergence between the exact PMF and the approximate PMF. For the previous graph of h = 20 and m = 200, the D_{KL} = 0.0024\ \mathrm{bits} which is very small. The linear scale graphs are visually indistinguishable. The log graphs show an underestimation of the left tail and overestimation of the right tail. This is desirable as it means the normal approximation is conservative in estimating the maximum.

This time, KL-divergence between the exact PMF and the normal approximation is below 0.01\ \mathrm{bits} for essentially the entire range m > 75 and below 0.1 for m > 5. For small m the discrete nature of the PMF is more pronounced and the normal approximation is less accurate. But fortunately there the exact PMF can be computed quickly.
Beta-Binomial Approximation

Circuit Verifier
For implementation in circuits we need a Merkle-proof verifier where the control flow does not depend on the proof. We are allowed to do a pre-processing step on the proof before we hand it to the circuit. The simple solution is to preprocess into m single-proof verifications. This requires m \cdot h hashes to be computed.
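For illustration, a sketch of the per-proof check with deterministic control flow, reusing hash2 from the earlier sketch (in an actual circuit the left/right selection on the index bit becomes a multiplexer):

def verify_path(root, h, index, leaf, path):
    # One authentication path: exactly h hashes, no proof-dependent branching.
    node = leaf
    for d in range(h):
        if (index >> d) & 1:
            node = hash2(path[d], node)
        else:
            node = hash2(node, path[d])
    return node == root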
Merkle Caps
We can improve on this using a Merkle cap: First we compute all nodes up to level l using only 2^l - 1 hashes. Then each opening only requires h - l hashes to be computed up to level l. The total number of hashes is 2^l - 1 + m \cdot (h - l) which is minimized when
l_{\mathrm{min}} = \min(h, \ceil{ \log_2 m}) \text{.}
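This is easy to double-check by brute force over all cap levels (helper names are mine):

def cap_cost(h, m, l):
    return 2**l - 1 + m * (h - l)

def best_cap_level(h, m):
    # Cap level minimizing the total number of hashes.
    return min(range(h + 1), key=lambda l: cap_cost(h, m, l))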
One complication with this approach is that we may not know all the values at level l. We could solve this by having the prover provide all nodes at level l, at slightly increased proof size. Another solution is to explicitly support missing values in the circuit. We pursue the latter as it makes Merkle caps a pure verifier side optimization.
There are only two cases to consider: either both children are present, or both children are missing. The case where exactly one child is missing is impossible, as any marked node's children are all computed or revealed, and any unmarked node's children are all missing. When both children are present, we compute the hash as normal. When both children are missing, we substitute a hint value. The hint value can be either a hash, or None if the parent is also missing. The augmented hash function is as follows:
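(A Python sketch; None marks a missing value, and hints is an illustrative name for the stream of prover-supplied hint values.)

def hash_or_missing(left, right, hints):
    if left is not None and right is not None:
        # Both children present: hash as normal.
        return hash2(left, right)
    # Exactly one missing child cannot occur, see above.
    assert left is None and right is None
    # Both children missing: substitute the hint, which is either a hash
    # value or None if the parent is missing as well.
    return next(hints)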
When this computes the correct root hash, all the values at level l are either correctly computed or marked missing. For practical implementation we can use a flag value such as the all-zeros hash instead of an Option<Hash>. The overhead of the modified hash function should then be minimal.
Number of Computed Hashes
The number of computed hashes at each level d < l is always 2^d as the Merkle cap is fully computed up to level l. For levels d \ge l there are m individual paths of length h - l. Thus the total number of computed nodes is
C_d = \begin{cases} 2^d & \text{if } d < l_{\min} \\ m & \text{if } l_{\min} \le d < h \\ \end{cases}
and the total number of computed hashes is as we saw before
C = \sum_{d \in [0,h)} C_d = 2^{l_{\min}} - 1 + m \cdot (h - l_{\min})\text{.}