Post-Quantum Performance: A Comprehensive Benchmark Analysis

The Performance Question

As organizations evaluate post-quantum cryptography adoption, performance is a critical concern. Post-quantum algorithms have fundamentally different computational characteristics than their classical counterparts: larger keys, different mathematical operations, and new optimization opportunities.

This analysis presents comprehensive benchmarks of HPCrypt's ML-DSA and ML-KEM implementations, explaining not just the numbers but the underlying factors that determine performance.

Benchmark Methodology

Test Environment

All measurements were taken on a single, fixed hardware and software configuration so that results can be reproduced:

Component	Specification
CPU	AMD Ryzen 9 7950X (16C/32T, 5.7GHz boost)
RAM	64GB DDR5-5200 (2x32GB, dual channel)
OS	Ubuntu 24.04 LTS (kernel 6.8)
Compiler	rustc 1.82.0 (stable)
Optimization	`-C target-cpu=native -C opt-level=3`

Measurement Protocol

Each operation was measured using the following protocol:

Warmup: 1,000 iterations discarded to stabilize CPU frequency
Measurement: 10,000 iterations with nanosecond precision timing
Statistical Analysis: Median reported with IQR for outlier resistance
Repetition: 5 independent runs to verify consistency
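
The protocol above can be sketched as a small timing harness. This is a minimal illustration using only the standard library, not HPCrypt's actual benchmark code; the names `measure` and `percentile` and the nearest-rank percentile rule are assumptions made for the example.

```rust
use std::time::Instant;

/// Nearest-rank percentile of an already sorted, non-empty sample set.
/// `p` is a fraction in [0, 1], e.g. 0.5 for the median.
fn percentile(sorted: &[u64], p: f64) -> u64 {
    let idx = ((sorted.len() - 1) as f64 * p).round() as usize;
    sorted[idx]
}

/// Runs `op` `warmup` times untimed, then times `iters` runs individually.
/// Returns (median, interquartile range) in nanoseconds.
fn measure<F: FnMut()>(mut op: F, warmup: usize, iters: usize) -> (u64, u64) {
    // Warmup: results are discarded.
    for _ in 0..warmup {
        op();
    }

    // Measurement: one timestamp pair per iteration.
    let mut samples: Vec<u64> = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        op();
        samples.push(start.elapsed().as_nanos() as u64);
    }

    // Median and IQR are both robust against a few slow outliers.
    samples.sort_unstable();
    let median = percentile(&samples, 0.5);
    let iqr = percentile(&samples, 0.75) - percentile(&samples, 0.25);
    (median, iqr)
}
```

A call might look like `measure(|| { std::hint::black_box(sign(&sk, &msg)); }, 1_000, 10_000)`, where `sign`, `sk`, and `msg` are placeholders and `black_box` keeps the compiler from optimizing the call away. Repeating the whole call five times would cover the repetition step.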

ML-DSA Benchmark Results

Key Generation Performance

Key generation is typically a one-time cost, but matters for applications generating many ephemeral keys:

Security Level	HPCrypt	RustCrypto	liboqs	Improvement
ML-DSA-44	22.5μs	45.2μs	41.8μs	1.86-2.01x
ML-DSA-65	38.7μs	78.3μs	72.1μs	1.86-2.02x
ML-DSA-87	61.2μs	125.8μs	118.4μs	1.93-2.06x
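
The Improvement column appears to be the range of baseline-to-HPCrypt latency ratios, from the smaller ratio (against liboqs) to the larger (against RustCrypto). The following sketch reproduces that arithmetic; the function name `improvement_range` is hypothetical and the interpretation is inferred from the numbers in the table, not stated by the source.

```rust
/// Formats the range of speedups of one implementation over several
/// baselines, given median latencies in the same unit.
fn improvement_range(hpcrypt_us: f64, baselines_us: &[f64]) -> String {
    // Speedup = baseline latency / HPCrypt latency.
    let ratios: Vec<f64> = baselines_us.iter().map(|b| b / hpcrypt_us).collect();
    let lo = ratios.iter().cloned().fold(f64::INFINITY, f64::min);
    let hi = ratios.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    format!("{:.2}-{:.2}x", lo, hi)
}
```

For the ML-DSA-44 row, `improvement_range(22.5, &[45.2, 41.8])` gives `1.86-2.01x`, matching the table.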

Signing Performance

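Signing is the most computationally intensive operation because of rejection sampling: a candidate signature is discarded and recomputed until it passes the bounds checks, so the time per signature varies. The figures below are medians: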

Security Level	HPCrypt	RustCrypto	liboqs	Improvement
ML-DSA-44	52.3μs	98.7μs	89.2μs	1.71-1.89x
ML-DSA-65	74.4μs	124.1μs	115.8μs	1.56-1.67x
ML-DSA-87	108.2μs	189.4μs	175.3μs	1.62-1.75x

Verification Performance

Verification is typically the most frequent operation in deployed systems:

Security Level	HPCrypt	RustCrypto	liboqs	Improvement
ML-DSA-44	28.9μs	61.2μs	54.7μs	1.89-2.12x
ML-DSA-65	40.9μs	95.6μs	87.3μs	2.13-2.34x
ML-DSA-87	58.3μs	138.7μs	124.5μs	2.14-2.38x

ML-KEM Benchmark Results

Key Generation

Security Level	HPCrypt	RustCrypto	liboqs	Improvement
ML-KEM-512	12.8μs	21.3μs	19.7μs	1.54-1.66x
ML-KEM-768	21.2μs	35.8μs	33.1μs	1.56-1.69x
ML-KEM-1024	33.5μs	54.2μs	50.8μs	1.52-1.62x

Encapsulation

Security Level	HPCrypt	RustCrypto	liboqs	Improvement
ML-KEM-512	18.7μs	28.4μs	26.2μs	1.40-1.52x
ML-KEM-768	29.1μs	42.8μs	40.1μs	1.38-1.47x
ML-KEM-1024	49.0μs	66.5μs	62.8μs	1.28-1.36x

Decapsulation

Security Level	HPCrypt	RustCrypto	liboqs	Improvement
ML-KEM-512	19.2μs	29.8μs	27.5μs	1.43-1.55x
ML-KEM-768	30.5μs	45.2μs	42.1μs	1.38-1.48x
ML-KEM-1024	51.3μs	68.9μs	65.2μs	1.27-1.34x
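
To read these latencies against a target request rate, it helps to convert them into a per-core throughput ceiling. The sketch below is a back-of-envelope calculation only: it counts just the KEM operation and ignores everything else a server does per request, and the function names are hypothetical.

```rust
/// Upper bound on operations per second for one core, given the median
/// latency of a single operation in microseconds.
fn ops_per_sec(latency_us: f64) -> f64 {
    1_000_000.0 / latency_us
}

/// Minimum number of cores needed to sustain `target_rate` operations
/// per second if each core does nothing but this operation.
fn cores_needed(target_rate: f64, latency_us: f64) -> u32 {
    (target_rate / ops_per_sec(latency_us)).ceil() as u32
}
```

With the ML-KEM-768 decapsulation figure of 30.5μs, one core tops out near 32,800 operations per second, so sustaining 100,000 decapsulations per second would need at least four cores under these assumptions.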

Optimization Techniques

1. Vectorized Number Theoretic Transform (NTT)

The NTT is the computational core of lattice-based cryptography. Our AVX2 implementation processes eight 32-bit coefficients per 256-bit vector:

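use std::arch::x86_64::*;

// Forward NTT over 256 coefficients (Cooley-Tukey butterflies, 8 layers).
// `zetas` holds the twiddle factors in Montgomery form, in bit-reversed
// order; index 0 is unused. `montgomery_mul_avx2` and `montgomery_mul`
// are defined elsewhere.
#[target_feature(enable = "avx2")]
unsafe fn ntt_butterfly_avx2(
    a: &mut [i32; 256],
    zetas: &[i32; 256]
) {
    let mut k = 0;
    let mut len = 128;

    while len >= 1 {
        for start in (0..256).step_by(2 * len) {
            k += 1;
            if len >= 8 {
                // Both halves of each butterfly span whole 8-lane vectors.
                let zeta = _mm256_set1_epi32(zetas[k]);

                for j in (start..start + len).step_by(8) {
                    let a_vec = _mm256_loadu_si256(a[j..].as_ptr() as *const __m256i);
                    let b_vec = _mm256_loadu_si256(a[j + len..].as_ptr() as *const __m256i);

                    let t = montgomery_mul_avx2(b_vec, zeta);
                    let sum = _mm256_add_epi32(a_vec, t);
                    let diff = _mm256_sub_epi32(a_vec, t);

                    _mm256_storeu_si256(a[j..].as_mut_ptr() as *mut __m256i, sum);
                    _mm256_storeu_si256(a[j + len..].as_mut_ptr() as *mut __m256i, diff);
                }
            } else {
                // len = 4, 2, 1: butterfly pairs fall inside one vector.
                // Shown scalar for clarity; a tuned kernel would shuffle lanes.
                for j in start..start + len {
                    let t = montgomery_mul(a[j + len], zetas[k]);
                    a[j + len] = a[j] - t;
                    a[j] += t;
                }
            }
        }
        len >>= 1;
    }
}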
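
A slow scalar reference is a convenient way to check a vectorized kernel like the one above. The sketch below is an assumption-laden illustration rather than HPCrypt code: it uses the ML-DSA parameters from FIPS 204 (q = 8380417, root of unity 1753), plain modular arithmetic instead of Montgomery form, and hypothetical function names.

```rust
const Q: i64 = 8_380_417; // ML-DSA modulus
const ZETA: i64 = 1753; // primitive 512th root of unity mod Q (FIPS 204)

/// Square-and-multiply modular exponentiation.
fn pow_mod(mut base: i64, mut exp: u32) -> i64 {
    let mut acc = 1;
    base %= Q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % Q;
        }
        base = base * base % Q;
        exp >>= 1;
    }
    acc
}

/// Twiddle factors in plain (non-Montgomery) form:
/// zetas[k] = ZETA^bitrev8(k) mod Q. Index 0 is never read by the NTT.
fn zetas_plain() -> [i64; 256] {
    let mut z = [0i64; 256];
    for k in 0..256 {
        z[k] = pow_mod(ZETA, (k as u8).reverse_bits() as u32);
    }
    z
}

/// In-place forward NTT; inputs and outputs are in [0, Q).
fn ntt_reference(a: &mut [i64; 256], zetas: &[i64; 256]) {
    let mut k = 0;
    let mut len = 128;
    while len >= 1 {
        for start in (0..256).step_by(2 * len) {
            k += 1;
            for j in start..start + len {
                // Cooley-Tukey butterfly on the pair (j, j + len).
                let t = zetas[k] * a[j + len] % Q;
                a[j + len] = (a[j] - t).rem_euclid(Q);
                a[j] = (a[j] + t) % Q;
            }
        }
        len >>= 1;
    }
}
```

One quick sanity check: the transform of the constant polynomial 1 (first coefficient 1, all others 0) is 1 in every slot. To compare against a Montgomery-form kernel, its output would need to be converted out of Montgomery form and normalized into [0, Q) first.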

2. Montgomery Multiplication

We use Montgomery representation to avoid expensive division operations:

// Convert to Montgomery form once
let a_mont = to_montgomery(a);
let b_mont = to_montgomery(b);

// Multiply in Montgomery form (no division)
let c_mont = montgomery_mul(a_mont, b_mont);

// Convert back only when needed
let c = from_montgomery(c_mont);

This saves approximately 15-20% in polynomial multiplication routines.
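
The snippet above only names the helpers. A minimal scalar sketch of what they could look like for the ML-DSA modulus follows; the function bodies are assumptions for illustration (with R = 2^32 and the standard constant Q^-1 mod 2^32), not HPCrypt's implementation.

```rust
const Q: i32 = 8_380_417; // ML-DSA modulus
const QINV: i32 = 58_728_449; // Q^-1 mod 2^32

/// Returns a value congruent to a * 2^-32 mod Q, in the range (-Q, Q).
/// Valid for |a| <= 2^31 * Q. Uses only multiplies, a subtract and a shift.
fn montgomery_reduce(a: i64) -> i32 {
    let t = (a as i32).wrapping_mul(QINV);
    // a - t*Q is divisible by 2^32, so the shift is exact.
    ((a - (t as i64) * (Q as i64)) >> 32) as i32
}

/// Product of two Montgomery-form values, again in Montgomery form.
fn montgomery_mul(a: i32, b: i32) -> i32 {
    montgomery_reduce(a as i64 * b as i64)
}

/// Converts a in [0, Q) to Montgomery form a * 2^32 mod Q.
/// The one-time `%` here is the cost that later multiplications avoid.
fn to_montgomery(a: i32) -> i32 {
    (((a as i64) << 32) % Q as i64) as i32
}

/// Converts back to a standard representative in [0, Q).
fn from_montgomery(a: i32) -> i32 {
    let r = montgomery_reduce(a as i64);
    if r < 0 { r + Q } else { r }
}
```

The conversions only pay off when many multiplications happen between them, which is the case inside an NTT where the twiddle factors are stored in Montgomery form ahead of time.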

3. Lazy Reduction

Instead of reducing after every arithmetic operation, we accumulate results and reduce once:

// Traditional approach: 3 reductions
// (operands are in [0, q) and held in i64 so the products cannot overflow)
let t1 = (a * b) % q;
let t2 = (t1 + c * d) % q;
let t3 = (t2 + e * f) % q;

// Lazy reduction: 1 reduction
let sum = a * b + c * d + e * f;  // Not reduced yet: can be as large as 3(q-1)^2
let result = barrett_reduce(sum);  // Single reduction at the end
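
Lazy reduction is only safe when the accumulator is wide enough for the unreduced sum. The sketch below makes that bound explicit for a dot product; it is an illustration under stated assumptions (inputs in [0, Q), the ML-DSA modulus as an example, and a plain `%` standing in for `barrett_reduce`), with hypothetical function names.

```rust
const Q: i64 = 8_380_417; // ML-DSA modulus, used here only as an example

/// Eager variant: reduce after every multiply-accumulate step.
fn dot_eager(a: &[i64], b: &[i64]) -> i64 {
    let mut acc = 0i64;
    for (x, y) in a.iter().zip(b) {
        acc = (acc + x * y % Q) % Q;
    }
    acc
}

/// Lazy variant: accumulate unreduced products, reduce once at the end.
fn dot_lazy(a: &[i64], b: &[i64]) -> i64 {
    // Each product is below Q^2 < 2^46, so up to 2^17 terms fit in an i64.
    debug_assert!(a.len() <= 1 << 17);
    let sum: i64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    sum % Q // stands in for barrett_reduce
}
```

Both functions return the same value for inputs in range; the lazy one performs a single reduction regardless of vector length.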

4. Cache-Optimized Memory Access

We structure data layouts to maximize cache utilization:

// Poor cache behavior: strided access
for i in 0..256 {
    for j in 0..4 {
        result[j][i] = process(input[j][i]);
    }
}

// Optimized: sequential access
for j in 0..4 {
    for i in 0..256 {
        result[j][i] = process(input[j][i]);
    }
}

Platform Comparison

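Performance varies significantly across platforms. SIMD roughly halves latency on both x86_64 and ARM64 (about 1.8-1.9x), while WASM runs about 4x slower than the native AVX2 build: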

Platform	ML-DSA-65 Sign	ML-KEM-768 Decaps
x86_64 (AVX2)	74.4μs	30.5μs
x86_64 (scalar)	142.8μs	58.2μs
ARM64 (NEON)	89.1μs	36.8μs
ARM64 (scalar)	168.4μs	67.3μs
WASM	312.5μs	124.7μs
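
The benchmarks here were built with `-C target-cpu=native`, and the source does not say how a portable build chooses between the SIMD and scalar rows above. One common approach in Rust is runtime feature detection with the standard `is_x86_feature_detected!` macro. The sketch below uses a deliberately trivial, hypothetical kernel (coefficient-wise addition) to show the dispatch pattern only.

```rust
/// Scalar fallback: coefficient-wise addition, available on every target.
fn add_scalar(a: &mut [i32; 256], b: &[i32; 256]) {
    for (x, y) in a.iter_mut().zip(b) {
        *x = x.wrapping_add(*y);
    }
}

/// AVX2 variant: eight 32-bit lanes per iteration.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &mut [i32; 256], b: &[i32; 256]) {
    use std::arch::x86_64::*;
    for i in (0..256).step_by(8) {
        let x = _mm256_loadu_si256(a[i..].as_ptr() as *const __m256i);
        let y = _mm256_loadu_si256(b[i..].as_ptr() as *const __m256i);
        _mm256_storeu_si256(a[i..].as_mut_ptr() as *mut __m256i, _mm256_add_epi32(x, y));
    }
}

/// Picks the fastest available implementation at runtime.
fn add_dispatch(a: &mut [i32; 256], b: &[i32; 256]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified on this CPU.
            unsafe { add_avx2(a, b) };
            return;
        }
    }
    add_scalar(a, b);
}
```

On targets other than x86_64 the AVX2 function is compiled out and the dispatcher always takes the scalar path; an ARM64 build would add an analogous NEON branch.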

Recommendations by Use Case

Web Servers (TLS)

For TLS key exchange with ML-KEM:

Traffic Level	Recommended	Rationale
< 10K req/s	ML-KEM-1024	Maximum security, ample headroom
10K-100K req/s	ML-KEM-768	Balanced security/performance
> 100K req/s	ML-KEM-768 + batching	Optimize throughput

Code Signing

For software signatures with ML-DSA:

Scenario	Recommended	Rationale
Release signing	ML-DSA-87	Long-term security critical
CI/CD artifacts	ML-DSA-65	Balanced for automation
Development	ML-DSA-44	Fast iteration cycles

IoT and Embedded

For constrained environments:

Resource Constraint	Recommended
< 64KB RAM	ML-KEM-512 (minimal stack)
< 1MHz CPU	Consider hybrid classical/PQ
Battery-powered	ML-KEM-768 (efficient decapsulation)

Running Your Own Benchmarks

Reproduce these results on your hardware:

# Clone and build
git clone https://github.com/seceq/hpcrypt
cd hpcrypt

# Run comprehensive benchmarks
cargo bench --features=bench

# Run specific algorithm
cargo bench --features=bench -- ml_dsa_65

# Generate comparison report
cargo bench --features=bench -- --save-baseline current
cargo bench --features=bench -- --baseline current

Benchmark results are output in both human-readable format and JSON for automated analysis.

Conclusion

Post-quantum cryptography is ready for production deployment. With careful implementation, the performance overhead compared to classical algorithms is modest, and in some cases post-quantum algorithms outperform their classical counterparts at equivalent security levels.

HPCrypt demonstrates that pure Rust implementations can achieve performance competitive with optimized C libraries while providing the safety guarantees Rust is known for. As quantum computing advances, this combination of security, safety, and performance will become increasingly important.