Merkle Multi-Proofs
\gdef\p#1{\left({#1}\right)} \gdef\setp#1{\left\{{#1}\right\}} \gdef\stirlingii#1#2{{{#1} \brace {#2}}} \gdef\setn#1{\mathcal{#1}} \gdef\range#1{\left[#1\right)} \gdef\floor#1{\left\lfloor#1\right\rfloor} \gdef\ceil#1{\left\lceil#1\right\rceil} \gdef\vec#1{\mathbf{#1}} \gdef\H{\mathsf{hash}} \gdef\bar#1{\left\lvert{#1}\right\rvert} \gdef\Exp#1{\mathbb{E}\!\left[{#1}\right]} \gdef\Var#1{\operatorname{Var}\!\p{#1}} \gdef\Cov#1{\operatorname{Cov}\!\p{#1}} \gdef\union{\cup} \gdef\intersection{\cap} \gdef\empty{\varnothing}
Vector commitments, and specifically Merkle proofs, are the heart of every hash-based proof system. They are responsible for a large part of proof size and verifier work, yet we don't have sharp estimates of either. This is largely due to a complex dependency on the distribution of opened indices. In hash-based proof systems, this distribution is uniformly random with replacement, which lets us derive the concrete results that follow.
Key results are exact minimum, maximum, mean, variance and probability mass functions of proof size and verifier work, plus approximations and a way to implement Merkle caps purely on the verifier side.
Definition 1. A Vector Commitment Scheme for a set \setn S has \mathsf{commit}, \mathsf{open} and \mathsf{verify} algorithms.
- The \mathsf{commit} algorithm takes a vector \vec v \in {\setn S}^n and outputs a commitment C to \vec v.
- The \mathsf{open} algorithm takes a commitment C, a list of m indices \vec i \in [0,n)^m and values \vec y \in {\setn S}^m and outputs a proof \pi that \vec v_{\vec{i}_j} = \vec y_j for all j \in [0,m).
- The \mathsf{verify} algorithm takes C, \vec i, \vec y, \pi and outputs true if the proof is valid and false otherwise.
The canonical example of a vector commitment scheme is a Merkle tree. There are many expositions of this, so I'll just write a quick definition:
Definition 2. Given a two-to-one hash function \H: \setn H \times \setn H \rightarrow \setn H, a Merkle tree of height h is a vector commitment scheme for n = 2^h values of \setn H.
- The \mathsf{commit} algorithm takes a vector \vec v \in {\setn H}^n and constructs a perfect binary tree with \vec v_i as the i-th leaf and each node value computed by taking the hash of the child values. The commitment C \in \setn H is the value of the root node.
- The \mathsf{open} algorithm takes \vec i and marks the corresponding leaves and their ancestors. The proof \pi \in \setn H^\ast consists of the values of nodes and leaves that are not marked but have a marked parent.
- The \mathsf{verify} algorithm takes C, \vec i, \vec y, \pi and repeatedly applies \H to compute every node value for which it knows both child values. This eventually computes the root node, which is then verified to equal C.
We call the nodes in \pi the revealed nodes and the ones computed in \mathsf{verify} the computed nodes.
The variant presented here uses compressed multi-proofs. We also allow \vec i to be in arbitrary order and to contain repeated values. The revealed indices and values \vec i, \vec y are not part of the proof \pi, as the assumption is that prover and verifier have already established these by other means.
This scheme naturally extends to imperfect trees and higher- or even mixed-arity trees. This allows vector lengths that are not a power of two and can reduce tree depth where that is beneficial. The definition above uses a single hash function for all nodes, but it works with arbitrary hash functions for each node as long as prover and verifier pick the same ones. In verifier circuits the hash function cost for prover and verifier can be wildly different, and it would make sense to pick a prover-favorable hash function for the deeper nodes and a verifier-favorable hash function for the nodes closer to the top. Neither of these extensions is considered in this analysis.
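To make this concrete, here is a minimal Python sketch of the power-of-two, single-hash-function scheme (the names hash2, commit, open_proof and verify, the level-by-level tree layout, and the use of SHA-256 as the two-to-one hash are illustrative choices, not part of the definition):

from hashlib import sha256

def hash2(a: bytes, b: bytes) -> bytes:
    return sha256(a + b).digest()

def commit(leaves):
    # levels[0] holds the leaves, levels[-1] holds the root.
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hash2(prev[2 * i], prev[2 * i + 1]) for i in range(len(prev) // 2)])
    return levels

def open_proof(levels, indices):
    # Collect the unmarked nodes that have a marked sibling, level by level.
    proof, marked = [], set(indices)
    for level in levels[:-1]:
        for i in sorted(marked):
            if i ^ 1 not in marked:
                proof.append(level[i ^ 1])
        marked = {i // 2 for i in marked}
    return proof

def verify(root, h, indices, values, proof):
    proof_iter = iter(proof)
    layer = dict(zip(indices, values))   # position -> value at the current level
    for _ in range(h):
        parents = {}
        for i in sorted(layer):
            if i // 2 in parents:
                continue                 # sibling pair already handled
            sib = layer[i ^ 1] if i ^ 1 in layer else next(proof_iter)
            left, right = (layer[i], sib) if i % 2 == 0 else (sib, layer[i])
            parents[i // 2] = hash2(left, right)
        layer = parents
    return layer[0] == root

# Example: height-3 tree, three openings in arbitrary order with a repeat.
leaves = [sha256(bytes([i])).digest() for i in range(8)]
tree = commit(leaves)
idx = [5, 2, 5]
assert verify(tree[-1][0], 3, idx, [leaves[i] for i in idx], open_proof(tree, idx))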
In the aforementioned application in hash-based proof systems, the indices \vec i are generated by m uniform random draws from [0,n) with repetition. Knowing the distribution of \vec i we can compute the distribution of the proof size \bar{\pi}, i.e. the number of revealed nodes, and the number of hashes the verifier has to compute, i.e. the number of computed nodes. Finally we will look at a verifier implementation with deterministic control flow.
Number of Computed Nodes
The number of hashes a verifier needs to perform is the number of computed nodes: non-leaf nodes that are marked. Given a Merkle tree of height h with m uniformly random openings with replacement, let C_d be the number of computed nodes at level d \in [0,h) of the tree. This gives a set of random variables C_d over the choice of openings. Importantly these random variables are not independent, as the computation of a node at level d+1 implies the computation of its parent at level d.
Equivalently, consider a uniform random draw of m infinite binary strings \setn M \subset \setp{0,1}^\infty. Take \setn M_d \subset \setp{0,1}^d to be the set of prefixes of length d of elements in \setn M. We have that C_d = \bar{\setn M_d}, the number of distinct prefixes of length d in \setn M.
The distribution of C_d is the occupancy distribution of throwing m balls into 2^d bins uniformly at random with replacement. This is a well-known problem in probability theory and combinatorics. The probability mass function is given by
\Pr[C_d = k] = \frac{k!}{2^{d \cdot m}} \binom{2^d}{k} \stirlingii{m}{k}
where \binom{n}{k} is the binomial coefficient and \stirlingii{m}{k} is the Stirling number of the second kind. The support for m > 0 is k \in [1, \min(2^d, m)], but note that C_d is also bounded by C_{d+1} \in [C_d, 2 \cdot C_d] due to the tree structure.
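For reference, this per-level PMF is straightforward to evaluate directly (a small sketch; the helper names are mine):

from functools import cache
from math import comb, factorial

@cache
def stirling2(m, k):
    # Stirling number of the second kind via the standard recurrence.
    if m == k:
        return 1
    if k == 0 or k > m:
        return 0
    return k * stirling2(m - 1, k) + stirling2(m - 1, k - 1)

def occupancy_pmf(d, m, k):
    # Pr[C_d = k]: m uniform draws hit exactly k of the 2^d nodes at level d.
    return factorial(k) * comb(2**d, k) * stirling2(m, k) / 2**(d * m)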
Expected Value
Take p_d = 1 - (1 - 2^{-d})^m to be the probability that a node at level d is computed. The expected number of computed nodes at level d is then \mathbb{E}[C_d] = 2^d \cdot p_d. The total number of computed nodes C is C = \sum_{d \in [0,h)} C_d and therefore
\boxed{ \mathbb{E}[C] = \sum_{d \in [0,h)} 2^d \cdot \p{1 - \p{1- 2^{-d}}^m} }
This exact expression can be efficiently evaluated in O(h). For m \ll 2^h we can also approximate it well with m \cdot (h - \log_2 m + 0.11).
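In code, the exact sum and the approximation are one-liners (function names are mine):

from math import log2

def expected_computed(h, m):
    # Exact E[C] in O(h) operations.
    return sum(2**d * (1 - (1 - 2.0**-d)**m) for d in range(h))

def expected_computed_approx(h, m):
    # Approximation for m << 2**h.
    return m * (h - log2(m) + 0.11)

For example, at h = 20 and m = 200 the two values agree closely.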
Support
Let k \in [k_{\min}(h,m), k_{\max}(h,m)] be the support, i.e. the range where \Pr[C = k] > 0. Then
k_{\min}(h, m) = \begin{cases} 0 & \text{if } m = 0 \\ h & \text{otherwise} \\ \end{cases}
follows from the minimal case where all m openings fall on the same leaf. Similarly, the maximum follows from distributing the m openings as spread out as possible on each level, so k_{\max}(h, m) = \sum_{d \in [0,h)} \min(2^d, m). This can be written in closed form as
k_{\max}(h, m) = \begin{cases} 0 & \text{if } m = 0 \\ 2^{\floor{\log_2 m} + 1} + m \cdot (h - \floor{\log_2 m} - 1) - 1 & \text{if } 1 \leq m < 2^h \\ 2^h -1 & \text{if } m \geq 2^h \text{.} \end{cases}
# Support of the number of computed nodes C.
def c_min(h, m):
    return 0 if m == 0 else h

def c_max(h, m):
    return sum(min(2**d, m) for d in range(h))
Probability Mass Function
Let p_{h,m}(k) = \Pr[C = k] be the probability mass function. The base cases p_{h,0}(k) and p_{0,m}(k) are 1 when k=0 and 0 otherwise. Then, by induction, a tree of height h+1 with m > 0 openings requires the root to be computed, plus whatever is computed in the two height-h subtrees. This results in the recurrence
p_{h+1,m}(k) = \underbrace{2^{-m} \sum_{m' \in [0,m]} \binom{m}{m'} }_{\boxed{1}} \underbrace{\sum_{k' \in [0,k-1]}}_{\boxed{2}} p_{h,m'}(k') \cdot p_{h, m - m'}(k - 1 - k')
where \boxed{1} accounts for the binomial distribution of the m openings over the subtrees as (m', m-m'). One computed node is subtracted for the root, and \boxed{2} distributes the remaining k-1 computed nodes over the subtrees as (k', k-1-k').
The PMF can be calculated reasonably quickly on values of practical interest (e.g. h=20, m=200) using memoization and restricting the values to the support:
from functools import cache
from math import comb

@cache
def pmf_c(h, m):
    # Pr[C = k] as a dict {k: probability}, following the recurrence above.
    if h == 0 or m == 0:
        return {0: 1.0}
    out = {}
    for m1 in range(m + 1):
        w = comb(m, m1) * 2.0**-m        # probability of the split (m1, m - m1)
        for k1, p1 in pmf_c(h - 1, m1).items():
            for k2, p2 in pmf_c(h - 1, m - m1).items():
                k = k1 + k2 + 1          # +1 for the root node
                out[k] = out.get(k, 0.0) + w * p1 * p2
    return out

The thick black line is the exact PMF computed with the above recurrence formula. The dotted lines are in log scale, with black being the PMF. The red and blue lines are the normal and beta-binomial approximations discussed below.
The resulting PMF is concentrated in a narrow spike near the maximum of the support. The log scale reveals some skewness in the peak, but overall it looks like it can be well approximated by a normal distribution. We already have the expected value from above, so we now turn to the variance.
Variance
We will compute the variance by deriving a formula for \Cov{C_d, C_e} and using the identity
\Var{C} = \Var{\sum_{d \in [0,h)} C_d} = \sum_{d, e \in [0,h) } \Cov{C_d, C_e} \text{.}
Let I_{d,i} \in \setp{0,1} be the indicator random variable for the marking of node i at level d. It is Bernoulli distributed with p_d = 1 - (1 - 2^{-d})^m, such that \Exp{I_{d,i}} = p_d and \Var{I_{d,i}} = p_d \cdot (1 - p_d). Observe that C_d = \sum_{i \in [0, 2^d)} I_{d,i} and we quickly rederive \Exp{C_d} = \sum_{i \in [0, 2^d)} \Exp{I_{d,i}} = 2^d \cdot p_d.
Let L_{d,i} \subseteq [0,2^h) be the set of leaves descendant from node i at level d. Let's extend p with the joint probability of two node markings conditioned on non-overlapping leaf sets as
p_{de} = \Pr(I_{d,i} = 1, I_{e,j} = 1 \mid L_{d,i} \intersection L_{e,j} = \empty)\text{.}
By inclusion-exclusion we have
\begin{align*} p_{de} =\ & 1 - \Pr(I_{d,i} = 0) - \Pr(I_{e,j} = 0) \\ & + \Pr(I_{d,i} = 0, I_{e,j} = 0 \mid L_{d,i} \intersection L_{e,j} = \empty) \text{.} \end{align*}
The first two probabilities are (1 - 2^{-d})^m and (1 - 2^{-e})^m as we saw before. The last is the probability that all m draws avoid the leaves that are descendants of either i or j, which is (1 - 2^{-d} - 2^{-e})^m. We thus get
p_{de} = 1 - (1 - 2^{-d})^m - (1 - 2^{-e})^m + (1 - 2^{-d} - 2^{-e})^m
with the special case p_{dd} = 1 - 2 \cdot (1 - 2^{-d})^m + (1 - 2^{1-d})^m and symmetry p_{de} = p_{ed}.
We will use the identity \Cov{C_d, C_e} = \Exp{C_d \cdot C_e} - \Exp{C_d} \cdot \Exp{C_e} to compute the covariance. So next we compute \Exp{C_d \cdot C_e}:
\mathbb{E}[C_d \cdot C_e] = \sum_{i \in [0, 2^d)} \sum_{j \in [0, 2^e)} \mathbb{E}[I_{d,i} \cdot I_{e,j}] \text{.}
Take f = \max(d,e) the deepest level. We split the sum into two cases, depending on whether L_{d,i} \intersection L_{e,j} = \empty or not. If it is not empty then one node is a descendant of or identical to the other, and I_{d,i} \cdot I_{e,j} equals the indicator of the deeper node at level f. There are 2^f such cases. The remaining 2^{d+e} - 2^f cases have disjoint leaf sets and thus \mathbb{E}[I_{d,i} \cdot I_{e,j}] = p_{de}. Summing the disjoint and non-disjoint contributions we get
\begin{align*} \mathbb{E}[C_d \cdot C_e] &= 2^f \cdot p_f + \p{2^{d+e} - 2^f} \cdot p_{de} \text{.} \end{align*}
From \Cov{C_d, C_e} = \Exp{C_d \cdot C_e} - \Exp{C_d} \cdot \Exp{C_e} it then follows
\begin{align*} \Cov{C_d, C_e} &= 2^f \cdot p_f + \p{2^{d+e} - 2^f} \cdot p_{de} - 2^{d+e} \cdot p_d \cdot p_e \\ \end{align*}
and thus
\boxed{ \Var{C} = \sum_{d, e \in [0,h)} 2^f \cdot p_f + \p{2^{d+e} - 2^f} \cdot p_{de} - 2^{d+e} \cdot p_d \cdot p_e }
where f = \max(d,e). This double sum can be evaluated in O(h^2) time.
def mean_var_c(h, m):
    # Exact mean and variance of the number of computed nodes C.
    mean = 0.0
    var = 0.0
    for d in range(h):
        p_d = 1 - (1 - 2.0**-d)**m
        mean += 2.0**d * p_d
        for e in range(h):
            p_e = 1 - (1 - 2.0**-e)**m
            p_de = p_d + p_e - 1 + (1 - 2.0**-d - 2.0**-e)**m
            f, p_f = (d, p_d) if d >= e else (e, p_e)
            var += 2.0**f * p_f + (2.0**(d + e) - 2.0**f) * p_de - 2.0**(d + e) * p_d * p_e
    return mean, var
If we compare the mean and variance computed this way to the mean and variance computed from the exact PMF recurrence above, we find they match to within numerical precision (~10^{-11} for the mean and ~10^{-5} for the variance).
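The check itself is a short helper over the PMF dictionaries returned by pmf_c (name is mine):

def moments_from_pmf(pmf):
    # Mean and variance of a distribution given as a {value: probability} dict.
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean)**2 * p for k, p in pmf.items())
    return mean, var

# e.g. compare moments_from_pmf(pmf_c(20, 200)) against mean_var_c(20, 200)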
Normal Approximation
Given the mean and variance above we can construct a normal approximation as \Pr[C = k] \approx {\setn N}\!\p{\Exp{C}, \Var{C}}. We compute the approximate PMF \hat{p}_{h,m}(k) by evaluating the normal distribution at each integer in the support and normalizing:
import numpy as np

def normal_pmf_c(h, m):
    # Normal approximation of Pr[C = k], evaluated on the support and normalized.
    mean, var = mean_var_c(h, m)
    lo, hi = c_min(h, m), c_max(h, m)
    if var == 0.0:
        # Degenerate point mass
        p = np.zeros(hi + 1)
        p[lo] = 1.0
        return p
    k = np.arange(hi + 1)
    p = np.exp(-(k - mean)**2 / (2 * var))
    p[:lo] = 0.0   # PMF is zero below minimum support
    p /= p.sum()   # Normalize
    return p
Note: When computing confidence intervals from the normal distribution, don't forget to account for the tails cut off due to finite support.
We can then compute the KL-divergence between the exact PMF and the approximate PMF. For the previous graph of h = 20 and m = 200, the D_{KL} = 0.00362\ \mathrm{bits} which is very small. The linear graphs are visually indistinguishable. The log graphs show an underestimation of the left tail and overestimation of the right tail. This is desirable as it means the normal approximation is conservative in estimating the maximum.
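For reference, a small sketch of the KL-divergence computation between the exact PMF (a dict) and an approximation evaluated on the full support (an array):

from math import log2

def kl_bits(exact, approx):
    # D_KL(exact || approx) in bits.
    return sum(p * log2(p / approx[k]) for k, p in exact.items() if p > 0)

# e.g. kl_bits(pmf_c(20, 200), normal_pmf_c(20, 200))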

Over the range h \in [8, 21] and m \in [0, 300] the KL-divergence between the exact PMF and the normal approximation is below 0.01\ \mathrm{bits} for m > 75 and below 0.1 for m > 5. For small m the discrete nature of the PMF is more pronounced and the normal approximation is less accurate. But fortunately there the exact PMF can be computed quickly.
Beta-Binomial Approximation
We can do better than a normal approximation by using a discrete distribution that matches the support, mean, and variance exactly. The \beta-binomial distribution is a good candidate as it is defined on [0, n] for integer n and has two shape parameters that can be tuned to match mean and variance.
from scipy.stats import betabinom, binom

def betabinom_pmf_c(h, m):
    # Beta-binomial approximation of Pr[C = k], shifted onto the support [lo, hi].
    mean, var = mean_var_c(h, m)
    lo, hi = c_min(h, m), c_max(h, m)
    p = np.zeros(hi + 1)
    if hi == lo or var == 0.0:
        # Degenerate point mass
        p[lo] = 1.0
        return p
    n, mu = hi - lo, (mean - lo) / (hi - lo)
    v = var / (n * mu * (1 - mu))   # ratio of target variance to binomial variance
    k = np.arange(n + 1)
    if v <= 1.0:
        # Under-dispersed: Fallback to binomial
        p[lo:] = binom.pmf(k, n, mu)
        return p
    rho = (n - v) / (v - 1.0)       # solve (rho + n) / (rho + 1) = v for rho
    a, b = mu * rho, (1 - mu) * rho
    p[lo:] = betabinom.pmf(k, n, a, b)
    return p

The beta-binomial approximation is excellent! Around m > 2^{h-2} (the lower-quality region at the bottom of the plot) the variance becomes too low for a beta-binomial and we fall back to a binomial distribution. The binomial approximation gets better as m \gg 2^h.
Number of Revealed Nodes
The proof consists of the hash values associated with the revealed nodes: nodes that have a marked parent but are not themselves marked. Let S_{d,i} \in \setp{0,1} be the indicator variable that a node in the i-th sibling pair at level d is revealed. Then S_{d,i} is the XOR of the sibling I indicators:
\begin{align*} S_{d,i} &= I_{d,2i} \oplus I_{d,2i+1} \\ &= I_{d,2i} + I_{d,2i+1} - 2 \cdot I_{d,2i} \cdot I_{d,2i+1} \text{.} \end{align*}
Note that there are 2^{d-1} sibling pairs at level d > 0, with no pairs at level 0 (there is only one root node). Let R_d be the number of revealed nodes at level d \in [0,h]. We have R_0 = 0 and R_d = \sum_{i \in [0, 2^{d-1})} S_{d,i}, and thus R = \sum_{d \in [1,h]} R_d is the total number of revealed nodes.
Expected Value
Computing \Exp{R} is now a matter of expanding definitions and using previous results:
\begin{align*} \Exp{R} &= \sum_{d \in [1,h]} \sum_{i \in [0, 2^{d-1})} \Exp{I_{d,2i} + I_{d,2i+1} - 2 \cdot I_{d,2i} \cdot I_{d,2i+1}} \\ &= \sum_{d \in [1,h]} \sum_{i \in [0, 2^{d-1})} \p{p_d + p_d - 2 \cdot p_{dd}} \\ &= \sum_{d \in [1,h]} 2^d \p{p_d - p_{dd}} \\ \end{align*}
where we use the fact that leaf sets of siblings must be disjoint. Expanding p_d and p_{dd} we get
\boxed{ \Exp{R} = \sum_{d \in [1,h]} 2^d \cdot \p{(1 - 2^{-d})^m - (1 - 2^{1-d})^m} }
This exact expression can be efficiently evaluated in O(h) operations. For m \ll 2^h we can approximate it well with m \cdot (h - \log_2 m - 0.89).
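As before, the exact sum and the approximation in code (function names are mine):

def expected_revealed(h, m):
    # Exact E[R] in O(h) operations.
    return sum(2**d * ((1 - 2.0**-d)**m - (1 - 2.0**(1 - d))**m) for d in range(1, h + 1))

def expected_revealed_approx(h, m):
    # Approximation for m << 2**h.
    return m * (h - log2(m) - 0.89)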
Support
Clearly for m=0 the support is R \in \{0\}. For m > 0 the minimum number of revealed nodes is attained when a maximally sized subtree is fully filled, which gives
R_{\min} = h - \min(h, \floor{\log_2 m}) \text{.}
An easy upper bound for the number of revealed nodes is \sum_{d \in [1, h]} \min(2^d, m) but this is not tight. The maximum is attained by allocating each opening a maximally sized subtree. A greedy procedure suffices where for each additional opening we pick the largest subtree containing a single opening and split it into its two child subtrees, each containing one opening. When m \ge 2^{h-1} we reach the maximum of 2^{h-1} revealed nodes and further openings are made to overlap. For 0 < m < 2^{h-1} we can write the maximum as
R_{\max} = 2^{\ceil{\log_2 m}} + m \cdot (h - \ceil{\log_2 m} - 1)\text{.}
def r_min(h, m):
    return 0 if m == 0 else h - min(h, m.bit_length() - 1)

def r_max(h, m):
    if m == 0 or h == 0:
        return 0
    c = min(h - 1, (m - 1).bit_length())   # ceil(log2(m)), capped
    return 2**c + m * (h - c - 1)
Probability Mass Function
Let p_{h,m}(k) = \Pr[R = k] be the probability mass function. The base cases p_{h,0}(k) and p_{0,m}(k) are 1 when k=0 and 0 otherwise as before. For the inductive step, consider a tree of height h > 0 and m > 0 openings. The root has two children, each of which is the root of a subtree of height h-1. Let m_1 and m_2 be the number of openings that fall in the left and right subtree respectively. Then there are two cases to consider: Either only one subtree has openings, and R is the number of revealed nodes in that subtree plus one for the root of the other, empty subtree, which is revealed. Or both subtrees have openings, and R is the sum of the revealed nodes in the two subtrees with no contribution from this level. This gives the recurrence
\begin{align*} p_{h+1,m}(k) &= 2^{1-m} \cdot p_{h,m}(k - 1) + \phantom{x} \\ &\phantom{=} 2^{-m} \cdot \sum_{m' \in [1,m)} \binom{m}{m'} \sum_{k' \in [0,k]} p_{h,m'}(k') \cdot p_{h,m - m'}(k - k') \end{align*}
This recurrence leads to a reasonably efficient implementation.
@cache
def pmf_r(h, m):
    # Pr[R = k] as a dict {k: probability}, following the recurrence above.
    if h == 0 or m == 0:
        return {0: 1.0}
    out = {}
    # All m openings fall into one subtree (2 symmetric choices); the other
    # subtree's root is revealed, adding one node.
    for k1, p1 in pmf_r(h - 1, m).items():
        out[k1 + 1] = 2.0 * p1
    # Both subtrees receive openings; the revealed counts simply add up.
    for m1 in range(1, m):
        for k1, p1 in pmf_r(h - 1, m1).items():
            for k2, p2 in pmf_r(h - 1, m - m1).items():
                out[k1 + k2] = out.get(k1 + k2, 0.0) + comb(m, m1) * p1 * p2
    return {k: 2.0**-m * p for k, p in out.items()}

Variance
We are going to end up with cubic and quartic terms, so before we start let's extend our definition of p_d, p_{de} to its final and most general form:
\begin{align*} p_{d_0 d_1 \dots d_{n-1}} &= \Pr\p{\forall_{j \in [0,n)}\ I_{d_j,i_j} = 1 \mid L_{d_j,i_j} \text{ pairwise disjoint}} \\ &= \Exp{\prod_{j \in [0,n)} I_{d_j,i_j}} \\ &= \sum_{\setn J \subseteq [0,n)} (-1)^{\bar{\setn J}} \cdot \p{1 - \sum_{j \in \setn J} 2^{-d_j}}^m \text{.} \end{align*}
This is obtained by inclusion-exclusion, extending the two-variable case above. It is again symmetric in its arguments.
As before, we start from \Var{R} = \sum_{d, e \in [1,h] } \Cov{R_d, R_e} and use the identity \Cov{R_d, R_e} = \Exp{R_d \cdot R_e} - \Exp{R_d} \cdot \Exp{R_e}, which means we need to compute
\begin{align*} \mathbb{E}[R_d \cdot R_e] &= \sum_{\substack{ i \in [0, 2^{d-1}) \\ j \in [0, 2^{e-1}) }} \mathbb{E}[S_{d,i} \cdot S_{e,j}] \text{.} \end{align*}
We consider three cases, depending on the relationship between the sibling pairs (d,i) and (e,j): they can be identical, descendant/ancestor, or disjoint.
Identical Sibling Pairs. When d = e and i = j we have S_{d,i} \cdot S_{e,j} = S_{d,i}^2 = S_{d,i}, the latter follows from indicators being \{0, 1\}. We have \Exp{S_{d,i}} = 2 \cdot (p_d - p_{dd}) as before. There are 2^{d-1} such cases with total contribution to \Exp{R_d \cdot R_e} of
2^d \cdot (p_d - p_{dd}) \text{.}
Ancestor/Descendant Sibling Pairs. Without loss of generality assume d < e and that the sibling pair (e,j) is a descendant of left branch (d, 2 \cdot i). Then from S_{e,j} = 1 follows I_{d,2i} = 1 and thus S_{d,i} = 1 - I_{d,2i+1}.
\begin{align*} \Exp{S_{d,i} \cdot S_{e,j}} &= \Exp{\p{1 - I_{d,2i+1}} \cdot \p{I_{e,2j} + I_{e,2j+1} - 2 \cdot I_{e,2j} \cdot I_{e,2j+1}}} \\ &= \Exp{\begin{align*} &I_{e,2j} + I_{e,2j+1} - 2 \cdot I_{e,2j} \cdot I_{e,2j+1}\\ &-I_{d,2i+1} \cdot I_{e,2j} - I_{d,2i+1} \cdot I_{e,2j+1} \\ &+2 \cdot I_{d,2i+1} \cdot I_{e,2j} \cdot I_{e,2j+1} \end{align*}} \\ &= 2\cdot \p{ p_e - p_{ee} - p_{de} + p_{dee} } \end{align*}
where the last line follows from all the Is being pairwise disjoint. There are 2^{e - 1} sibling pairs at level e that each have a unique ancestor at level d (which is either left or right in a pair). So the total contribution to \Exp{R_d \cdot R_e} from this case is
2^e \cdot \p{ p_e - p_{ee} - p_{de} + p_{dee} } \text{.}
Disjoint Pairs. In this case all Is are disjoint and we just have a big tedious expansion to do
\begin{align*} \Exp{S_{d,i} \cdot S_{e,j}} &= \mathbb{E}\big[\p{I_{d,2i} + I_{d,2i+1} - 2 \cdot I_{d,2i} \cdot I_{d,2i+1}} \\ &\phantom{=} \quad \cdot \p{I_{e,2j} + I_{e,2j+1} - 2 \cdot I_{e,2j} \cdot I_{e,2j+1}}\big] \\ &= 4 \cdot \p{p_{de} - p_{dde} - p_{dee} + p_{ddee}} \text{.} \end{align*}
For d < e there are 2^{d-1} \cdot 2^{e-1} - 2^{e - 1} such disjoint pairs. Additionally there are 2^{d-1} \cdot (2^{d-1} - 1) disjoint pairs when d = e.
Putting this all together we get
\begin{align*} \mathbb{E}[R_d \cdot R_e] &= \begin{cases} \begin{align*} &2^d \cdot (p_d - p_{dd}) + \\ &(2^{2\cdot d} - 2^{d+1})\cdot \p{p_{dd} - 2 \cdot p_{ddd} + p_{dddd}} \end{align*} & \text{if } d = e \\[1.5em] \begin{align*} &2^e \cdot \p{ p_e - p_{ee} - p_{de} + p_{dee} } + \\ &\p{2^{d + e} - 2^{e + 1}} \cdot \p{p_{de} - p_{dde} - p_{dee} + p_{ddee}} \end{align*} & \text{if } d < e \\[1.5em] \text{(same as $d < e$ with $d, e$ swapped)} & \text{if } d > e \\ \end{cases} \\ \end{align*}
Recall \Var{R} = \sum_{d, e \in [1,h] } \Exp{R_d \cdot R_e} - \p{\Exp{R}}^2 and we have \Exp{R} from before. We could expand everything out into a huge expression for \Var{R}, but it is not very insightful. Instead we'll give a way to compute \Exp{R} and \Var{R} in O(h^2) time:
from itertools import combinations

def p_joint(m, *levels):
    # Joint marking probability for nodes at the given levels with
    # pairwise disjoint leaf sets, by inclusion-exclusion.
    total = 0.0
    for r in range(len(levels) + 1):
        for subset in combinations(levels, r):
            total += (-1.0)**r * (1.0 - sum(2.0**-d for d in subset))**m
    return total

def mean_var_r(h, m):
    # Exact mean and variance of the number of revealed nodes R.
    mean = 0.0
    var = 0.0
    for d in range(1, h + 1):
        p_d, p_dd = p_joint(m, d), p_joint(m, d, d)
        p_ddd, p_dddd = p_joint(m, d, d, d), p_joint(m, d, d, d, d)
        mean += 2.0**d * (p_d - p_dd)
        # d = e contribution to E[R_d * R_e]
        var += 2.0**d * (p_d - p_dd) + (2.0**(2 * d) - 2.0**(d + 1)) * (p_dd - 2 * p_ddd + p_dddd)
        # d < e contributions, counted twice for (d, e) and (e, d)
        for e in range(d + 1, h + 1):
            p_e, p_ee, p_de = p_joint(m, e), p_joint(m, e, e), p_joint(m, d, e)
            p_dde, p_dee, p_ddee = p_joint(m, d, d, e), p_joint(m, d, e, e), p_joint(m, d, d, e, e)
            var += 2.0 * 2.0**e * (p_e - p_ee - p_de + p_dee)
            var += 2.0 * (2.0**(d + e) - 2.0**(e + 1)) * (p_de - p_dde - p_dee + p_ddee)
    var -= mean**2
    return mean, var
Normal Approximation
With the exact moments we can again construct a normal approximation as \Pr[R = k] \approx {\setn N}\!\p{\Exp{R}, \Var{R}}. We compute the approximate PMF \hat{p}_{h,m}(k) by evaluating the normal distribution at each integer in the support and normalizing as before (identical code).
We can then compute the KL-divergence between the exact PMF and the approximate PMF. For the previous graph of h = 20 and m = 200, the D_{KL} = 0.0024\ \mathrm{bits} which is very small. The linear scale graphs are visually indistinguishable. The log graphs show an underestimation of the left tail and overestimation of the right tail. This is desirable as it means the normal approximation is conservative in estimating the maximum.

This time, KL-divergence between the exact PMF and the normal approximation is below 0.01\ \mathrm{bits} for essentially the entire range m > 75 and below 0.1 for m > 5. For small m the discrete nature of the PMF is more pronounced and the normal approximation is less accurate. But fortunately there the exact PMF can be computed quickly.
Beta-Binomial Approximation

Circuit Verifier
For implementation in circuits we need a Merkle-proof verifier where the control flow does not depend on the proof. We are allowed to do a pre-processing step on the proof before we hand it to the circuit. The simple solution is to preprocess into m single-proof verifications. This requires m \cdot h hashes to be computed.
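For illustration, a sketch of the per-proof check with deterministic control flow, reusing hash2 from the earlier sketch (in an actual circuit the left/right selection on the index bit becomes a multiplexer):

def verify_path(root, h, index, leaf, path):
    # One authentication path: exactly h hashes, no proof-dependent branching.
    node = leaf
    for d in range(h):
        if (index >> d) & 1:
            node = hash2(path[d], node)
        else:
            node = hash2(node, path[d])
    return node == root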
Merkle Caps
We can improve on this using a Merkle cap: First we compute all nodes up to level l using only 2^l - 1 hashes. Then each opening only requires h - l hashes to be computed up to level l. The total number of hashes is 2^l - 1 + m \cdot (h - l) which is minimized when
l_{\mathrm{min}} = \min(h, \ceil{ \log_2 m}) \text{.}
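This is easy to double-check by brute force over all cap levels (helper names are mine):

def cap_cost(h, m, l):
    return 2**l - 1 + m * (h - l)

def best_cap_level(h, m):
    # Cap level minimizing the total number of hashes.
    return min(range(h + 1), key=lambda l: cap_cost(h, m, l))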
One complication with this approach is that we may not know all the values at level l. We could solve this by having the prover provide all nodes at level l, at slightly increased proof size. Another solution is to explicitly support missing values in the circuit. We pursue the latter as it makes Merkle caps a pure verifier side optimization.
There are only two cases to consider: either both children are present, or both children are missing. The case where exactly one child is missing is impossible, as any marked node's children are all computed or revealed, and any unmarked node's children are all missing. When both children are present, we compute the hash as normal. When both children are missing, we substitute a hint value. The hint value can be either a hash, or None if the parent is also missing. The augmented hash function is as follows:
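(A Python sketch; None marks a missing value, and hints is an illustrative name for the stream of prover-supplied hint values.)

def hash_or_missing(left, right, hints):
    if left is not None and right is not None:
        # Both children present: hash as normal.
        return hash2(left, right)
    # Exactly one missing child cannot occur, see above.
    assert left is None and right is None
    # Both children missing: substitute the hint, which is either a hash
    # value or None if the parent is missing as well.
    return next(hints)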
When this computes the correct root hash, all the values at level l are either correctly computed or marked missing. For practical implementation we can use a flag value such as the all-zeros hash instead of an Option<Hash>. The overhead of the modified hash function should then be minimal.
Number of Computed Hashes
The number of computed hashes at each level d < l is always 2^d as the Merkle cap is fully computed up to level l. For levels d \ge l there are m individual paths of length h - l. Thus the total number of computed nodes is
C_d = \begin{cases} 2^d & \text{if } d < l_{\min} \\ m & \text{if } l_{\min} \le d < h \\ \end{cases}
and the total number of computed hashes is as we saw before
C = \sum_{d \in [0,h)} C_d = 2^{l_{\min}} - 1 + m \cdot (h - l_{\min})\text{.}