Verkle proof verification on the EVM
Building and gas-optimising a Verkle proof verifier on the EVM, and comparing it to Merkle proofs in practice
1/6/2026
It's been a while since I was first introduced to Verkle tries back in 2021. Intrigued by the data structure that felt like it worked through sheer cryptographic magic, a friend and I built a simple TypeScript prover and verifier and wrote a short paper on how they work as part of a university course. Years later, now as a smart contract engineer, I keep running into Merkle trees — airdrops, allowlists, governance, proofs of inclusion — yet I couldn't find a practical Solidity verifier for Verkle tries. Thus, I took it as a Christmas holiday challenge to dust off the TypeScript prover and see how efficient I could make a verifier in Solidity in comparison to Merkle trees. What started as a straightforward port quickly spiralled into a deep rabbit hole of gas optimisation, EVM quirks, and assembly-level hacks. This article documents that journey. I'll briefly recap how Verkle tries work and the key computations involved, then dive into the Solidity implementation and the gas optimisation tricks required to make it practically viable. Finally, I'll compare the result against OpenZeppelin's Merkle proof verifier to see if it makes sense to substitute Merkle trees for Verkle tries in any practical applications. For the interested, you can find the TypeScript prover/verifier here and the Solidity verifier here. Feel free to play around with them - the prover is designed with the Solidity verifier in mind and can build a Verkle trie from a JSON file with arbitrary Solidity-typed data.
WARNING: This article assumes familiarity with Verkle tries and the underlying cryptographic math. If you’re new to the topic, some sections may feel like jargon — in that case, I recommend starting with my earlier deep dive. This post is for readers who enjoy practical cryptography, low-level Solidity/Yul, and squeezing every bit of gas out of an implementation.
Brief Verkle trie recap
Verkle tries provide the same property as Merkle trees: the ability to efficiently verify that some piece of data is part of some larger data collection. While Merkle trees require sending all sibling nodes along the root-to-leaf path, Verkle tries use vector commitments (in this case, KZG polynomial commitments) to create constant-sized proofs linking parent nodes to their children. This dramatically reduces proof size at the cost of higher computational complexity from elliptic curve operations.
The key difference is that each internal node stores a polynomial commitment covering all its children, rather than a hash. To verify that a child is the $i$-th child of a parent with commitment $C$, we use KZG commitments. The verifier must store the root commitment to be checked against.
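As a sketch: an internal node with children $c_0, \ldots, c_{b-1}$ (branching factor $b$) commits to the unique polynomial $f$ of degree less than $b$ with

$$f(i) = H(c_i) \quad \text{for } i = 0, \ldots, b-1, \qquad C = [f(s)]_1$$

so proving that a node is the $i$-th child reduces to a single KZG evaluation proof for $f(i)$.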
In the above illustration, I've constructed a simple example of a Verkle trie with a branching factor of 4 (2 bits per segment) and a key length of 8 bits. The trie contains 6 values. Note that this is a compact trie: inner nodes are merged when there's no branching, so paths are only split where multiple children exist. The leaves marked with red have the values we want to prove (or some derivation of them), and the red inner nodes contain the commitments we have to send to the verifier. First, we'll remind ourselves how to prove a single value, then how to prove multiple values with a single polynomial, and finally how random evaluation allows proving multiple values across different polynomials with a single proof. I won't bother too much with derivations here, but focus on the equations needed for the verifier. If you're only here for the Solidity implementation details, feel free to skip the following section.
Single point evaluation
For a polynomial $f(X)$ with commitment $C = [f(s)]_1$, proving that $f(z) = y$ relies on the polynomial remainder theorem. If $f(z) = y$, then $z$ is a root of $f(X) - y$, meaning:

$$q(X) = \frac{f(X) - y}{X - z}$$

is itself a polynomial (the division leaves no remainder). The prover computes a proof $\pi = [q(s)]_1$ and the verifier checks:

$$e(C - [y]_1, G_2) = e(\pi, [s - z]_2)$$

where $e$ is an elliptic curve pairing and $G_2$ is a generator point on curve $\mathbb{G}_2$.
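As a quick sanity check of the quotient construction, take the toy polynomial $f(X) = X^2 + 1$ and the claim $f(2) = 5$:

$$q(X) = \frac{f(X) - 5}{X - 2} = \frac{X^2 - 4}{X - 2} = X + 2$$

a clean polynomial. Had we instead claimed $f(2) = 6$, the division would leave a remainder and no valid proof $\pi$ could be formed.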
Multi-point evaluation for a single polynomial
To prove multiple evaluations $f(z_0) = y_0, \ldots, f(z_{m-1}) = y_{m-1}$ of a single polynomial simultaneously, we use Lagrange interpolation to construct an interpolating polynomial $I(X)$ that passes through all points:

$$I(X) = \sum_{i=0}^{m-1} y_i \cdot L_i(X)$$

where $L_i(X)$ are the Lagrange basis polynomials:

$$L_i(X) = \prod_{\substack{j=0 \\ j \neq i}}^{m-1} \frac{X - z_j}{z_i - z_j}$$

The zero polynomial $Z(X)$ is defined as:

$$Z(X) = \prod_{i=0}^{m-1} (X - z_i)$$

Since both $f(X)$ and $I(X)$ pass through the same points, $f(X) - I(X)$ has roots at all $z_i$, giving:

$$q(X) = \frac{f(X) - I(X)}{Z(X)}$$

The proof is $\pi = [q(s)]_1$ and verification requires:

$$e(C - [I(s)]_1, G_2) = e(\pi, [Z(s)]_2)$$

where $[I(s)]_1$ can be computed using the Lagrange basis polynomials evaluated at $s$:

$$[I(s)]_1 = \sum_{i=0}^{m-1} y_i \cdot [L_i(s)]_1$$

and $[Z(s)]_2$ is computed from the coefficients of $Z(X)$ and the $\mathbb{G}_2$ part of the SRS:

$$[Z(s)]_2 = \sum_{j=0}^{m} c_j \cdot [s^j]_2, \qquad Z(X) = \sum_{j=0}^{m} c_j X^j$$
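For intuition, take the two points $(z_0, y_0) = (1, 3)$ and $(z_1, y_1) = (2, 5)$ (over the integers for readability):

$$Z(X) = (X - 1)(X - 2), \qquad I(X) = 3 \cdot \frac{X - 2}{1 - 2} + 5 \cdot \frac{X - 1}{2 - 1} = 2X + 1$$

and indeed $I(1) = 3$ and $I(2) = 5$; any $f$ taking these values satisfies $Z(X) \mid f(X) - I(X)$.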
Multi-proof with random evaluation
With Verkle tries, we often need to prove multiple parent-child relationships across different polynomials simultaneously—one for each commitment along the root-to-leaf paths. Instead of requiring one proof per relationship, we can combine them into a single constant-sized proof using random evaluation.
Suppose we have $m$ relationships to prove: $f_i(z_i) = y_i$ for $i = 0, \ldots, m-1$, where each $f_i$ has commitment $C_i$. The prover constructs a combined polynomial:

$$g(X) = \sum_{i=0}^{m-1} r^i \cdot \frac{f_i(X) - y_i}{X - z_i}$$

where $r$ is a Fiat-Shamir challenge computed by hashing all commitments, values, and evaluation points:

$$r = H(C_0, \ldots, C_{m-1}, y_0, \ldots, y_{m-1}, z_0, \ldots, z_{m-1})$$

The random challenge $r$ ensures that if any of the relationships are false, the linear combination cannot cancel out the remainders, making $g(X)$ a rational function rather than a polynomial. The prover commits to $g(X)$ as $D = [g(s)]_1$.

To verify, the verifier needs to check that $D$ is indeed a commitment to a polynomial. This is done by evaluating $g$ at a random point $t$ and verifying the evaluation. The evaluation splits into two parts:

$$g(t) = \underbrace{\sum_{i=0}^{m-1} \frac{r^i \cdot f_i(t)}{t - z_i}}_{g_1(t)} - \underbrace{\sum_{i=0}^{m-1} \frac{r^i \cdot y_i}{t - z_i}}_{g_2(t)}$$

The verifier can compute $g_2(t)$ directly from the known values $y_i$ and evaluation points $z_i$:

$$y = g_2(t) = \sum_{i=0}^{m-1} \frac{r^i \cdot y_i}{t - z_i}$$

For $g_1(t)$, the verifier uses the commitments $C_i$ to compute:

$$E = \sum_{i=0}^{m-1} \frac{r^i}{t - z_i} \cdot C_i$$

where $E$ is a commitment to $h(X) = \sum_i \frac{r^i}{t - z_i} \cdot f_i(X)$, a polynomial whose evaluation at $t$ equals $g_1(t)$.

The verifier checks that the commitment $E - D$ opens to $y = g_2(t)$ at point $t$ (since $h(t) - g(t) = g_2(t)$). This is verified using a KZG proof $\pi$:

$$e(E - D - [y]_1, [1]_2) = e(\pi, [s - t]_2)$$

If this equation holds, the Schwartz-Zippel lemma guarantees that $D$ must be a commitment to the polynomial $g(X)$, meaning all $m$ relationships are valid.

In practice, child commitments in Verkle tries are hashed before being used as polynomial values. For a commitment $C = (C_x, C_y)$, the hash is computed as $H(C_x \,\|\, C_y)$ (typically using keccak256 in Ethereum), which is then reduced modulo the field size and used as the value in the polynomial commitment.
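In Solidity terms, that child-to-field-element mapping might look like the sketch below. This is a minimal illustration assuming the affine coordinates are hashed as two concatenated 32-byte words; the exact encoding used by the prover/verifier in the linked repositories may differ.

struct G1Point { uint256 X; uint256 Y; }

/// Hash a child commitment into a field element usable as a polynomial value.
/// `primeQ` is the field modulus the verifier reduces values into.
function hashCommitment(G1Point memory c, uint256 primeQ) internal pure returns (uint256) {
    return uint256(keccak256(abi.encodePacked(c.X, c.Y))) % primeQ;
}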
Solidity verifier implementation
The Solidity verifier implements the multi-proof verification algorithm using the BN128 elliptic curve (also known as alt_bn128), which is supported on Ethereum through precompiled contracts. Precompiles are special contracts at fixed addresses that provide optimised implementations of cryptographic operations, accessible via staticcall at significantly lower gas costs than equivalent Solidity code. For BN128, we use three precompiles: ECADD (0x06) for G1 point addition, ECMUL (0x07) for G1 scalar multiplication, and ECPAIRING (0x08) for the final pairing check.
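For a feel of what these precompile calls look like, here is a minimal sketch of a G1 scalar multiplication via the ECMUL precompile at 0x07. This shows the general calling pattern, not the exact wrapper from the Pairing library used in the verifier.

/// Multiply the G1 point (x, y) by scalar k using the ECMUL precompile (0x07).
/// Input is 96 bytes (x, y, k); output is the resulting 64-byte point.
function ecMul(uint256 x, uint256 y, uint256 k) internal view returns (uint256 rx, uint256 ry) {
    uint256[3] memory input = [x, y, k];
    uint256[2] memory output;
    bool ok;
    assembly {
        // staticcall(gas, address, inOffset, inSize, outOffset, outSize)
        ok := staticcall(gas(), 0x07, input, 0x60, output, 0x40)
    }
    require(ok, "ECMUL precompile failed");
    (rx, ry) = (output[0], output[1]);
}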
Proof structure and input optimisation
The verifier takes two inputs: an array of bytes[] representing the leaf data to prove, and a VerkleProof struct containing the proof components:
struct VerkleProof {
Pairing.G1Point D; // commitment to the polynomial g(s)
Pairing.G1Point pi; // KZG proof
Pairing.G1Point[] commitments; // unique commitments
uint256[] yvals; // evaluation values of inner nodes
uint256[] zvals; // evaluation points z0, z1, ..., zm-1
uint256 rootCommitmentIndex; // root idx in commitments
bytes commitmentIndices; // mapping zval/yval positions to commitments
}
function verifyVerkleMulti(bytes[] calldata data, VerkleProof calldata proof) public view returns (bool);

Note that for data, you can use abi.encode to encode any Solidity-compatible data or structs. Having it as bytes thus enables arbitrary data to be proven. Since we only need to read the inputs, all arguments are taken in via calldata instead of memory to minimise gas costs.
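As a usage sketch (hypothetical values; the VerkleProof would be generated off-chain by the TypeScript prover for exactly these leaves, and the encoding must match what was inserted into the trie):

// Prove that two (address, amount) leaves are part of the committed trie.
bytes[] memory data = new bytes[](2);
data[0] = abi.encode(address(0x1111111111111111111111111111111111111111), uint256(100 ether));
data[1] = abi.encode(address(0x2222222222222222222222222222222222222222), uint256(250 ether));
// `proof` is the VerkleProof produced by the prover for these two leaves.
bool valid = verifier.verifyVerkleMulti(data, proof);
require(valid, "invalid Verkle proof");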
The proof structure includes several optimisations to reduce the input space:
Commitment deduplication: When multiple paths share common internal nodes, the same commitment appears multiple times in the proof. Instead of sending duplicate commitments, we send a deduplicated list in commitments and use commitmentIndices, a packed bytes array, to map each (zval, yval) pair to its corresponding commitment. The indices are packed as uint16 values (2 bytes each) rather than uint256[] elements (32 bytes each), saving significant calldata. We use assembly to extract these indices directly from calldata during verification. uint16 was chosen since its maximum value of 65535 is more than sufficient for the number of internal commitments in any foreseeable application.
Leaf yvals computation: The proof only includes yvals for inner nodes, not for leaves. Leaf yvals are computed on-chain from the data array by hashing each leaf's data: y_i = keccak256(data[i]) % PRIME_Q. This serves a dual purpose: it reduces calldata costs (we don't need to send leaf yvals), and it acts as an implicit verification check—if the prover provides incorrect leaf data, the computed yvals won't match what's expected in the proof, causing verification to fail.
Verification algorithm
The verification follows a clear sequence. First, we compute leaf yvals from the input data. Then we compute the Fiat-Shamir challenge r by hashing all commitments, evaluation values, and points. Next comes the evaluation point t, which is simply a hash of r and the polynomial commitment D. The heavy lifting happens in computing E and yScalar, which requires looping through all evaluation points and performing elliptic curve operations—each iteration calls the G1 scalar multiplication and addition precompiles, which are among the most expensive operations in the verification. We also compute modular inverses in this loop, but these are relatively cheap compared to the elliptic curve precompiles. Finally, we perform the pairing check to verify the KZG proof, which is the single most expensive operation in the entire verification process.
function verifyVerkleMulti(
bytes[] calldata data,
VerkleProof calldata proof
) public view returns (bool) {
uint256 m = proof.zvals.length;
// Validation checks omitted for clarity...
// Compute leaf yvals from data (reduces calldata costs)
uint256[] memory leafYvals = new uint256[](data.length);
for (uint256 i; i < data.length;) {
leafYvals[i] = uint256(keccak256(data[i])) % PRIME_Q;
unchecked { ++i; }
}
// Step 1: Compute random challenge r = H(C₀, ..., Cₘ₋₁, y₀, ..., yₘ₋₁, z₀, ..., zₘ₋₁)
uint256 r = _computeR(proof, leafYvals, m);
// Step 2: Compute random evaluation point t = H(r, D)
uint256 t = _computeT(r, proof.D);
// Step 3: Compute E = Σ (r^i / (t - z_i)) * C_i and y = Σ (r^i * y_i / (t - z_i))
(Pairing.G1Point memory E, uint256 yScalar) = _computeEandY(
r, t, proof, leafYvals, data.length, m
);
// Step 4: Final pairing check (equation 4.8 from paper), rearranged for G1-only ops
// Original:   e(E - D - [y]_1, [1]_2) = e(π, [s - t]_2)
// Rearranged: e(E - D - [y]_1 + t*π, [1]_2) = e(π, [s]_2)
return Pairing.pairing(
Pairing.plus(
Pairing.plus(
Pairing.plus(E, Pairing.negate(proof.D)),
Pairing.negate(Pairing.mulScalar(SRS_G1_0, yScalar))
),
Pairing.mulScalar(proof.pi, t)
),
g2Generator,
Pairing.negate(proof.pi),
SRS_G2_1
);
}

The attentive reader might have noticed that the final pairing check differs from the equation in the "Multi-proof with random evaluation" section. The implementation does not make use of arithmetic operations on the G2 curve. This transformation is necessary because the EVM's BN128 precompiles only support arithmetic on G1 points directly: we cannot compute $[s - t]_2$ on-chain since there's no G2 scalar multiplication precompile. Instead, we rearrange the original equation using the bilinearity property of pairings:

$$e(E - D - [y]_1, [1]_2) = e(\pi, [s - t]_2)$$

Using bilinearity, we can split the right-hand side:

$$e(\pi, [s - t]_2) = e(\pi, [s]_2) \cdot e(\pi, [-t]_2)$$

Since $e(\pi, [-t]_2) = e(-t \cdot \pi, [1]_2)$ by bilinearity:

$$e(E - D - [y]_1, [1]_2) = e(\pi, [s]_2) \cdot e(-t \cdot \pi, [1]_2)$$

Rearranging:

$$e(E - D - [y]_1, [1]_2) \cdot e(t \cdot \pi, [1]_2) = e(\pi, [s]_2)$$

Using bilinearity again to combine the left-hand side, we get the check used in the implementation (expressed as the product-of-pairings form the ECPAIRING precompile expects):

$$e(E - D - [y]_1 + t \cdot \pi, [1]_2) \cdot e(-\pi, [s]_2) = 1$$
This final form only requires G1 operations (which we can compute using the ECADD and ECMUL precompiles) and G2 points from the SRS (which are constants).
Computing the Fiat-Shamir challenge r
The first major optimisation opportunity appeared in computing the Fiat-Shamir challenge r. My initial naive implementation used abi.encodePacked to concatenate all the commitments, yvals, and zvals. This seemed reasonable at first, but I quickly discovered it had quadratic memory complexity—each call to abi.encodePacked creates a new array and copies all previous data. For 50 leaves, this meant thousands of unnecessary memory operations.
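For illustration, the naive pattern looked roughly like this (a reconstruction, not the exact original code): every iteration re-encodes everything accumulated so far, so the total bytes copied grow quadratically with the number of items.

// Naive challenge computation: O(n^2) memory copying.
bytes memory buf;
for (uint256 i = 0; i < m; i++) {
    // Each encodePacked allocates a fresh array and copies all previously packed bytes.
    buf = abi.encodePacked(buf, commitments[i].X, commitments[i].Y);
}
for (uint256 i = 0; i < m; i++) {
    buf = abi.encodePacked(buf, yvals[i]);
}
for (uint256 i = 0; i < m; i++) {
    buf = abi.encodePacked(buf, zvals[i]);
}
uint256 r = uint256(keccak256(buf));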
The solution was to drop into inline assembly and copy data directly into a pre-allocated buffer. Instead of letting Solidity manage memory through its abstractions, we grab the free memory pointer and manually lay out all the data in one contiguous block. This requires careful pointer arithmetic—we copy commitments first (using the commitmentIndices lookup to handle deduplication), then leaf yvals, then inner yvals, and finally zvals. All in a single pass, enabling one keccak256 call at the end. This optimisation provides substantial gas savings compared to the naive approach, primarily from avoiding repeated memory allocations that scale quadratically with the number of leaves.
function _computeR(
VerkleProof calldata proof,
uint256[] memory leafYvals,
uint256 m
) internal pure returns (uint256 r) {
Pairing.G1Point[] memory commitments = proof.commitments;
uint256[] memory yvals = proof.yvals;
uint256[] memory zvals = proof.zvals;
bytes calldata commitmentIndices = proof.commitmentIndices;
uint256 numLeaves = leafYvals.length;
assembly {
let freeMemPtr := mload(0x40) // Get free memory pointer
let ptr := freeMemPtr
let commitmentsPtr := add(commitments, 0x20) // Skip length word
let commitmentsLen := mload(commitments)
let indicesOffset := commitmentIndices.offset // Hoist outside loop
// Copy commitments using commitmentIndices lookup (m * 64 bytes)
for { let i := 0 } lt(i, m) { i := add(i, 1) } {
// Extract uint16 index from packed calldata
let start := add(indicesOffset, mul(i, 2))
let loaded := calldataload(start)
let commitIdx := shr(240, loaded) // Extract leftmost 16 bits
// Bounds check
if iszero(lt(commitIdx, commitmentsLen)) {
revert(0, 0)
}
// Copy commitment (X and Y, 64 bytes total)
let commitSlot := mload(add(commitmentsPtr, mul(commitIdx, 0x20)))
mstore(ptr, mload(commitSlot)) // X (32 bytes)
mstore(add(ptr, 0x20), mload(add(commitSlot, 0x20))) // Y (32 bytes)
ptr := add(ptr, 0x40)
}
// Copy leaf yvals (numLeaves * 32 bytes)
let leafYvalsPtr := add(leafYvals, 0x20)
for { let i := 0 } lt(i, numLeaves) { i := add(i, 1) } {
mstore(ptr, mload(add(leafYvalsPtr, mul(i, 0x20))))
ptr := add(ptr, 0x20)
}
// Copy inner yvals (yvals.length * 32 bytes)
let yvalsPtr := add(yvals, 0x20)
let yvalsLen := mload(yvals)
for { let i := 0 } lt(i, yvalsLen) { i := add(i, 1) } {
mstore(ptr, mload(add(yvalsPtr, mul(i, 0x20))))
ptr := add(ptr, 0x20)
}
// Copy zvals (m * 32 bytes)
let zvalsPtr := add(zvals, 0x20)
for { let i := 0 } lt(i, m) { i := add(i, 1) } {
mstore(ptr, mload(add(zvalsPtr, mul(i, 0x20))))
ptr := add(ptr, 0x20)
}
// Hash everything: commitments (m*64) + yvals (m*32) + zvals (m*32) = m*128 bytes
let totalSize := mul(m, 128)
r := keccak256(freeMemPtr, totalSize)
// Update free memory pointer
mstore(0x40, add(freeMemPtr, add(totalSize, 0x20)))
}
}

Computing the evaluation point t
Computing the evaluation point t is much simpler—just a hash of r and the polynomial commitment D. Still, we use assembly here too, following the same pattern of direct memory manipulation to avoid Solidity's overhead. It's a small win, but these small wins add up.
function _computeT(uint256 r, Pairing.G1Point memory D)
internal pure returns (uint256 t)
{
assembly {
let ptr := mload(0x40) // Get free memory pointer
mstore(ptr, r) // r (32 bytes)
mstore(add(ptr, 0x20), mload(D)) // D.X (32 bytes)
mstore(add(ptr, 0x40), mload(add(D, 0x20))) // D.Y (32 bytes)
t := keccak256(ptr, 0x60) // Hash 96 bytes
// Safe: memory used immediately, no allocations after
}
}

Computing E and yScalar
The most expensive part of verification is computing the sums $E = \sum_i \frac{r^i}{t - z_i} \cdot C_i$ and $y = \sum_i \frac{r^i \cdot y_i}{t - z_i}$ from the multi-proof. Each term requires computing a coefficient of the form $\frac{r^i}{t - z_i} \bmod q$, where $r$ is the Fiat-Shamir challenge, $t$ is the evaluation point, $z_i$ are the leaf positions, and $q$ is the field modulus. The naive approach would compute $m$ modular inverses, one for each denominator $(t - z_i)$, which is prohibitively expensive. Each modular inverse costs ~100,000 gas, so for 50 evaluation points, this alone would cost ~5M gas just for inversions. This quickly makes verification impractical for any non-trivial proof size.
The breakthrough comes from a trick in finite field arithmetic called batched inversion. Instead of computing each $(t - z_i)^{-1}$ separately, we compute a single inverse for the product of all denominators, and then recover each individual inverse with cheap multiplications. This reduces hundreds of expensive inversions to just one. More concretely, the algorithm works as follows (a small worked example follows the steps):
1. Compute denominators and prefix products: For each $i$, compute $d_i = t - z_i$ and the cumulative product $p_i = d_0 \cdot d_1 \cdots d_i$.
2. Compute a single inverse: Compute $\mathrm{inv} = p_{m-1}^{-1}$, where $m$ is the total number of evaluation points.
3. Recover each inverse in a backward pass: For each $i$ (iterating backward from $m-1$ to $0$), compute $d_i^{-1} = \mathrm{inv} \cdot p_{i-1}$ (treating $p_{-1}$ as $1$ for the first iteration), then update $\mathrm{inv} \leftarrow \mathrm{inv} \cdot d_i$ for the next iteration.
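To make the backward pass concrete, here is a tiny worked example over the toy field $\mathbb{F}_7$ (the real implementation uses the BN128 field modulus, but the mechanics are identical). Take denominators $d = (2, 3, 5)$:

$$p = (2,\; 2 \cdot 3,\; 2 \cdot 3 \cdot 5) \equiv (2, 6, 2) \pmod{7}, \qquad \mathrm{inv} = p_2^{-1} = 2^{-1} \equiv 4 \pmod{7}$$

Backward pass:

$$d_2^{-1} = \mathrm{inv} \cdot p_1 = 4 \cdot 6 \equiv 3, \qquad \mathrm{inv} \leftarrow 4 \cdot 5 \equiv 6$$
$$d_1^{-1} = \mathrm{inv} \cdot p_0 = 6 \cdot 2 \equiv 5, \qquad \mathrm{inv} \leftarrow 6 \cdot 3 \equiv 4$$
$$d_0^{-1} = \mathrm{inv} \cdot 1 = 4$$

Indeed $2 \cdot 4 \equiv 3 \cdot 5 \equiv 5 \cdot 3 \equiv 1 \pmod{7}$: a single inversion recovered all three inverses.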
This works because each inverse can be expressed as a ratio of prefix products. Updating the accumulator as we iterate backward gives exactly the correct value for each term, the same values as computing each inverse individually, but with massive gas savings: instead of $m$ inversions at ~100k gas each, we compute just one inverse at ~100k gas total. The extra multiplications (mulmod) are negligible in comparison (~5–10 gas each). This optimisation is the difference between Verkle verification being impractically expensive and being competitive with Merkle proofs.
The savings scale with the number of evaluation points. The technique is mathematically sound and produces identical results to naive inversions, but it only works if all denominators are non-zero, which in our case is guaranteed if the evaluation point $t$ is not equal to any $z_i$ (we check this explicitly).
The implementation splits the computation into two functions: _precomputeBatchedInversion handles the precomputation, and _computeEandYLoop performs the main accumulation loop using backward iteration. We also precompute the powers $r^i$ in a forward pass to avoid repeated exponentiations, and we read commitments directly from calldata using assembly to avoid memory expansion costs.
function _computeEandY(
uint256 r,
uint256 t,
VerkleProof calldata proof,
uint256[] memory leafYvals,
uint256 numLeaves,
uint256 m
) internal view returns (Pairing.G1Point memory E, uint256 yScalar) {
// Precompute batched inversion data and rPowers
(uint256[] memory denom, uint256[] memory prefix, uint256 inv, uint256[] memory rPowers) =
_precomputeBatchedInversion(r, t, proof, m);
// Main computation loop: accumulate E and yScalar
(E, yScalar) = _computeEandYLoop(
proof, leafYvals, numLeaves, m, denom, prefix, inv, rPowers
);
}
function _precomputeBatchedInversion(
uint256 r,
uint256 t,
VerkleProof calldata proof,
uint256 m
) internal pure returns (
uint256[] memory denom,
uint256[] memory prefix,
uint256 inv,
uint256[] memory rPowers
) {
denom = new uint256[](m);
prefix = new uint256[](m);
// Compute denom[i] = (t - z_i) and prefix[i] = product of all denom[0..i]
for (uint256 i; i < m;) {
denom[i] = addmod(t, PRIME_Q - proof.zvals[i], PRIME_Q);
require(denom[i] != 0, "t equals z_i: invalid proof");
prefix[i] = (i == 0)
? denom[i]
: mulmod(prefix[i - 1], denom[i], PRIME_Q);
unchecked { ++i; }
}
// ONE single inverse for all denominators (key optimisation!)
inv = _modInverse(prefix[m - 1], PRIME_Q);
// Precompute rPowers[i] = r^i
rPowers = new uint256[](m);
rPowers[0] = 1;
for (uint256 i = 1; i < m;) {
rPowers[i] = mulmod(rPowers[i - 1], r, PRIME_Q);
unchecked { ++i; }
}
}
function _computeEandYLoop(
VerkleProof calldata proof,
uint256[] memory leafYvals,
uint256 numLeaves,
uint256 m,
uint256[] memory denom,
uint256[] memory prefix,
uint256 inv,
uint256[] memory rPowers
) internal view returns (Pairing.G1Point memory E, uint256 yScalar) {
E = Pairing.G1Point(0, 0);
yScalar = 0;
// Iterate backward for correct batched inversion
for (uint256 i = m; i > 0;) {
uint256 idx = i - 1;
// Compute 1/(t - z_idx) using batched inversion
uint256 prev = (idx == 0) ? 1 : prefix[idx - 1];
uint256 invDenom = mulmod(inv, prev, PRIME_Q);
// Compute coefficient: coeff = r^idx * invDenom
uint256 coeff = mulmod(rPowers[idx], invDenom, PRIME_Q);
// Update inverse accumulator for next iteration
inv = mulmod(inv, denom[idx], PRIME_Q);
// Accumulate E and yScalar using assembly (reads commitments from calldata)
// ... (assembly code for ECADD/ECMUL precompiles) ...
unchecked { --i; }
}
}

Modular inverse optimisation
The modular inverse function itself benefits from a few optimisations, though with batched inversion we only call it once per proof instead of $m$ times. We use the Extended Euclidean Algorithm, which is more efficient than Fermat's Little Theorem for single inversions. Since the input a is typically already less than primeQ in our use case, we conditionally compute the modulo only when necessary. The main loop is wrapped in an unchecked block since we don't need overflow checks there, and we handle edge cases like a == 0 and a == 1 with early returns. While these optimisations are less critical now that we only compute one inverse, they still provide meaningful savings and make the code more robust.
function _modInverse(uint256 a, uint256 primeQ)
internal pure returns (uint256)
{
if (a == 0) revert("Cannot invert zero");
if (a == 1) return 1; // Early return optimisation
int256 t0 = 0;
int256 t1 = 1;
uint256 r0 = primeQ;
// Conditional modulo: only compute a % primeQ if necessary
uint256 r1 = a < primeQ ? a : a % primeQ;
unchecked { // Main loop doesn't need overflow checks
while (r1 != 0) {
uint256 q = r0 / r1;
// Extended Euclidean updates
int256 t2 = t0 - int256(q) * t1;
t0 = t1;
t1 = t2;
uint256 r2 = r0 - q * r1;
r0 = r1;
r1 = r2;
}
}
require(r0 == 1, "Modular inverse does not exist");
// Handle negative result
if (t0 < 0) {
return uint256(int256(primeQ) + t0);
}
return uint256(t0);
}

These optimisations are crucial for making Verkle proof verification viable on-chain. The batched inversion optimisation alone reduced gas costs by ~77%, transforming Verkle verification from prohibitively expensive to competitive with Merkle trees. Combined with the assembly work for direct calldata access, scratch space reuse, and incremental computations, these optimisations make the difference between a theoretical curiosity and something that might actually be used in production.
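One of the smaller tricks mentioned above, scratch space reuse, deserves a quick illustration: the EVM reserves memory slots 0x00-0x3f as scratch space, so short hashes can be computed there without touching the free memory pointer at 0x40. The snippet below is a generic sketch of the technique, not a verbatim excerpt from the verifier.

/// Hash two words using the 0x00-0x3f scratch space, avoiding any memory allocation.
function hashPair(uint256 a, uint256 b) internal pure returns (uint256 h) {
    assembly {
        mstore(0x00, a)            // scratch slot 1
        mstore(0x20, b)            // scratch slot 2
        h := keccak256(0x00, 0x40) // hash 64 bytes; free memory pointer untouched
    }
}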
Empirical evaluation
So, now we've made quite an optimised implementation, but let's check how it actually competes with the often-used Merkle proof verifier by OpenZeppelin. In these tests, we'll focus on the gas cost associated with performing a multi-proof on a Merkle tree (using MerkleProof.multiProofVerify) containing the exact same values as our Verkle trie. To reproduce, the TypeScript prover comes with a function compareMerkleVerkle which can generate a Merkle tree and Verkle trie on the same random input (here we use the seed 851057), generate a multi-proof and convert it into a Solidity test file that can be run directly.
For this test, we'll insert 100000 elements into our Merkle tree and Verkle trie, and we'll construct a multi-proof for 1, 10, 100 and 1000 values (branching factor 256 for both). For tests regarding the optimal branching factor and comparison regarding proof construction time, please see my previous article.
Merkle proofs
| # of values proven | proof size (bytes) | execution gas | calldata gas |
|---|---|---|---|
| 1 | 1,280 | 26,010 | 20,480 |
| 10 | 8,640 | 196,640 | 138,240 |
| 100 | 65,280 | 1,524,874 | 1,044,480 |
| 1000 | 436,096 | 10,876,246 | 6,977,536 |
Verkle proofs
| # of values proven | proof size (bytes) | execution gas | calldata gas | total gas vs. Merkle |
|---|---|---|---|---|
| 1 | 832 | 343,904 | 13,312 | +669% |
| 10 | 4,064 | 725,451 | 65,024 | +136% |
| 100 | 34,112 | 4,358,355 | 545,792 | +91% |
| 1000 | 263,808 | 34,034,456 | 4,220,928 | +114% |
The results show a dramatic improvement after implementing batched inversion: Verkle proofs are now 2.4x more expensive in total gas for 10 values (compared to 10x before the optimisation), while still achieving substantial proof size reductions (53% smaller). The batched inversion optimisation reduced gas costs by ~77%, transforming Verkle verification from prohibitively expensive to competitive with Merkle trees for many use cases.
To put these numbers in perspective, at current gas prices (Ethereum: 0.643 Gwei, Base: 0.004 Gwei), verifying 10 pairs costs $0.21 on Ethereum and $0.0016 on Base for Merkle proofs, compared to $0.51 on Ethereum and $0.0032 on Base for Verkle proofs. The 2.4x cost difference on Ethereum is much more manageable than the 10x difference we saw before optimisation. For applications where proof size is the primary constraint—such as layer 2 solutions or cross-chain bridges—Verkle tries now make compelling sense, offering smaller proofs at a reasonable execution cost premium.
The high execution cost of Verkle verification comes primarily from elliptic curve precompiles. For a proof covering $m$ evaluation points, the verifier makes roughly the following precompile calls (a rough total estimate follows the list):
- ECADD (0x06): ~$m$ calls at 150 gas each (used for G1 point addition in computing $E$ and the final pairing check)
- ECMUL (0x07): ~$m$ calls at 6,000 gas each (used for G1 scalar multiplication to compute $E$ and final pairing setup)
- ECPAIRING (0x08): 1 call at 113,000 gas (34,000 × 2 pairs + 45,000 base cost for the final KZG proof verification)
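As a back-of-the-envelope model (counting only the precompile calls listed above and ignoring keccak hashing, the single modular inverse, memory expansion, and calldata costs), the precompile share of execution gas is roughly:

$$\text{gas}_{\text{precompiles}} \approx m \cdot (6{,}000 + 150) + 113{,}000$$

For a proof with a few dozen evaluation points this already accounts for a large fraction of the execution gas reported in the table above.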
These precompile costs were substantially reduced by EIP-1108, which lowered ECADD from 500 to 150 gas, ECMUL from 40,000 to 6,000 gas, and pairing checks from 100,000 per pair plus an 80,000 base cost to 34,000 per pair plus a 45,000 base cost. This reduction made Verkle verification viable on Ethereum, though still expensive. As the need for more advanced cryptographic operations grows—particularly with the rise of zero-knowledge proofs and other pairing-based schemes—we may see further optimisations to these precompile costs. With Verkle proofs currently sitting at around 2x the gas cost of Merkle proofs for multiple values, a 50% reduction in precompile costs could bring them close to parity, making future cost reductions highly relevant for practical adoption.
Final thoughts
Implementing a Verkle proof verifier on the EVM is feasible, but requires deep dives into gas optimisation and EVM quirks. By exploring efficient computation techniques, I managed to reduce the gas cost from approximately 40x (in my initial implementation 😅) that of Merkle tree multi-proofs to around 2x. This optimisation journey highlights how crucial it is to dig deeper when designing gas-heavy systems on-chain: what initially seems prohibitively expensive can become practically viable through careful engineering.
If you made it this far, I encourage you to play with the implementation and try it on-chain. If you find ways to optimise it further, don't hesitate to reach out or open a pull request — there's always room for improvement, and the EVM optimisation rabbit hole runs deep. Thanks for the read!