Post-Quantum Performance: A Comprehensive Benchmark Analysis

Detailed performance measurements of ML-DSA and ML-KEM across different security levels, platforms, and use cases, with optimization insights from our HPCrypt implementation.

Mamone Tarsha
August 14, 2025
11 min read

The Performance Question

As organizations evaluate post-quantum cryptography adoption, performance is a critical concern. Post-quantum algorithms have fundamentally different computational characteristics than their classical counterparts—larger keys, different mathematical operations, and new optimization opportunities.

This analysis presents comprehensive benchmarks of HPCrypt's ML-DSA and ML-KEM implementations, explaining not just the numbers but the underlying factors that determine performance.

Benchmark Methodology

Test Environment

All measurements were taken on a single, fixed hardware and software configuration so that the results can be reproduced:

| Component | Specification |
|---|---|
| CPU | AMD Ryzen 9 7950X (16C/32T, 5.7GHz boost) |
| RAM | 64GB DDR5-5200 (2x32GB, dual channel) |
| OS | Ubuntu 24.04 LTS (kernel 6.8) |
| Compiler | rustc 1.82.0 (stable) |
| Optimization | -C target-cpu=native -C opt-level=3 |

Measurement Protocol

Each operation was measured using the following protocol:

  1. Warmup: 1,000 iterations discarded to stabilize CPU frequency
  2. Measurement: 10,000 iterations with nanosecond-precision timing
  3. Statistical Analysis: Median reported with IQR for outlier resistance
  4. Repetition: 5 independent runs to verify consistency
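
The protocol above can be sketched in a few lines of Rust. This is a minimal illustration rather than HPCrypt's actual harness: the names `bench`, `summarize`, and `percentile` are hypothetical, and the percentile uses simple linear interpolation, which may differ from the method used to produce the published numbers.

```rust
use std::time::Instant;

/// Median and interquartile range of one run, in nanoseconds.
struct Summary {
    median: f64,
    iqr: f64,
}

/// Percentile by linear interpolation on a sorted, non-empty slice (p in 0.0..=1.0).
fn percentile(sorted: &[f64], p: f64) -> f64 {
    assert!(!sorted.is_empty());
    let rank = p * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo as f64)
}

/// Sort the samples and report median and IQR, both robust to outliers.
fn summarize(mut samples: Vec<f64>) -> Summary {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    Summary {
        median: percentile(&samples, 0.5),
        iqr: percentile(&samples, 0.75) - percentile(&samples, 0.25),
    }
}

/// Run `op` for `warmup` discarded iterations, then time `iters` iterations.
fn bench<F: FnMut()>(mut op: F, warmup: usize, iters: usize) -> Summary {
    for _ in 0..warmup {
        op();
    }
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        op();
        samples.push(start.elapsed().as_nanos() as f64);
    }
    summarize(samples)
}
```

A call such as `bench(|| { std::hint::black_box(sign(&sk, &msg)); }, 1_000, 10_000)` would mirror steps 1 to 3, where `sign`, `sk`, and `msg` stand in for the operation under test. Step 4 amounts to calling `bench` five times and comparing the summaries.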

ML-DSA Benchmark Results

Key Generation Performance

Key generation is typically a one-time cost, but matters for applications generating many ephemeral keys:

| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-DSA-44 | 22.5μs | 45.2μs | 41.8μs | 1.86-2.01x |
| ML-DSA-65 | 38.7μs | 78.3μs | 72.1μs | 1.86-2.02x |
| ML-DSA-87 | 61.2μs | 125.8μs | 118.4μs | 1.93-2.05x |
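
The Improvement column in this and the following tables appears to be the ratio of each competing library's time to HPCrypt's time, reported as a low-to-high range. A small sketch of that arithmetic, using a hypothetical helper named `improvement_range`:

```rust
/// Speedup range of HPCrypt over a set of other implementations.
/// Returns (smallest ratio, largest ratio); times share one unit.
fn improvement_range(hpcrypt_us: f64, others_us: &[f64]) -> (f64, f64) {
    let ratios = others_us.iter().map(|t| t / hpcrypt_us);
    let min = ratios.clone().fold(f64::INFINITY, f64::min);
    let max = ratios.fold(f64::NEG_INFINITY, f64::max);
    (min, max)
}
```

For the ML-DSA-44 row, `improvement_range(22.5, &[45.2, 41.8])` gives roughly 1.86 (against liboqs) and 2.01 (against RustCrypto).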

Signing Performance

Signing is the most expensive ML-DSA operation because rejection sampling can restart the signing loop several times, so its running time also varies from call to call:

| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-DSA-44 | 52.3μs | 98.7μs | 89.2μs | 1.71-1.89x |
| ML-DSA-65 | 74.4μs | 124.1μs | 115.8μs | 1.56-1.67x |
| ML-DSA-87 | 108.2μs | 189.4μs | 175.3μs | 1.62-1.75x |

Verification Performance

Verification is typically the most frequent operation in deployed systems:

| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-DSA-44 | 28.9μs | 61.2μs | 54.7μs | 1.89-2.12x |
| ML-DSA-65 | 40.9μs | 95.6μs | 87.3μs | 2.13-2.34x |
| ML-DSA-87 | 58.3μs | 138.7μs | 124.5μs | 2.14-2.38x |

ML-KEM Benchmark Results

Key Generation

| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-KEM-512 | 12.8μs | 21.3μs | 19.7μs | 1.54-1.66x |
| ML-KEM-768 | 21.2μs | 35.8μs | 33.1μs | 1.56-1.69x |
| ML-KEM-1024 | 33.5μs | 54.2μs | 50.8μs | 1.52-1.62x |

Encapsulation

| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-KEM-512 | 18.7μs | 28.4μs | 26.2μs | 1.40-1.52x |
| ML-KEM-768 | 29.1μs | 42.8μs | 40.1μs | 1.38-1.47x |
| ML-KEM-1024 | 49.0μs | 66.5μs | 62.8μs | 1.28-1.36x |

Decapsulation

| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-KEM-512 | 19.2μs | 29.8μs | 27.5μs | 1.43-1.55x |
| ML-KEM-768 | 30.5μs | 45.2μs | 42.1μs | 1.38-1.48x |
| ML-KEM-1024 | 51.3μs | 68.9μs | 65.2μs | 1.27-1.34x |

Optimization Techniques

1. Vectorized Number Theoretic Transform (NTT)

The NTT is the computational core of lattice-based cryptography. Our AVX2 implementation processes eight 32-bit coefficients per 256-bit register:

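use std::arch::x86_64::*;

// Forward NTT layers whose butterfly distance is at least 8, so each half
// of a butterfly fills one AVX2 register. The last three layers
// (distance 4, 2, 1) need in-register shuffles and are not shown here.
// `montgomery_mul_avx2` is assumed to be defined elsewhere.
#[target_feature(enable = "avx2")]
unsafe fn ntt_butterfly_avx2(
    a: &mut [i32; 256],
    zetas: &[i32; 256] // full twiddle table; these layers read entries 1..=31
) {
    for layer in 0..5 {
        let m = 128 >> layer; // butterfly distance: 128, 64, 32, 16, 8
        let k = 1 << layer;   // index of this layer's first twiddle factor

        for i in (0..256).step_by(2 * m) {
            // One twiddle factor per block, broadcast to all eight lanes
            let zeta = _mm256_set1_epi32(zetas[k + i / (2 * m)]);

            for j in (i..i + m).step_by(8) {
                let a_vec = _mm256_loadu_si256(a[j..].as_ptr() as *const __m256i);
                let b_vec = _mm256_loadu_si256(a[j + m..].as_ptr() as *const __m256i);

                // Cooley-Tukey butterfly: (a, b) -> (a + zeta*b, a - zeta*b)
                let t = montgomery_mul_avx2(b_vec, zeta);
                let sum = _mm256_add_epi32(a_vec, t);
                let diff = _mm256_sub_epi32(a_vec, t);

                _mm256_storeu_si256(a[j..].as_mut_ptr() as *mut __m256i, sum);
                _mm256_storeu_si256(a[j + m..].as_mut_ptr() as *mut __m256i, diff);
            }
        }
    }
}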

2. Montgomery Multiplication

We use Montgomery representation to avoid expensive division operations:

// Convert to Montgomery form once
let a_mont = to_montgomery(a);
let b_mont = to_montgomery(b);

// Multiply in Montgomery form (no division)
let c_mont = montgomery_mul(a_mont, b_mont);

// Convert back only when needed
let c = from_montgomery(c_mont);

This saves approximately 15-20% in polynomial multiplication routines.
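
To make the three steps concrete, here is a scalar sketch for the ML-DSA modulus q = 8380417. It follows the standard signed Montgomery reduction with R = 2^32 and is not taken from HPCrypt's source. The function names match the snippet above, but their bodies are assumptions for illustration.

```rust
const Q: i32 = 8_380_417; // ML-DSA modulus, 2^23 - 2^13 + 1
const QINV: i32 = 58_728_449; // Q^-1 mod 2^32

/// Returns r with r * 2^32 ≡ a (mod Q), using a multiply, a subtract and a
/// shift instead of a division. Valid for |a| < 2^31 * Q.
fn montgomery_reduce(a: i64) -> i32 {
    // t is chosen so that a - t*Q is an exact multiple of 2^32
    let t = (a as i32).wrapping_mul(QINV);
    ((a - t as i64 * Q as i64) >> 32) as i32
}

/// One-time conversion a -> a * R mod Q (an ordinary `%` is fine here).
fn to_montgomery(a: i32) -> i32 {
    let r = (1i64 << 32) % Q as i64; // R mod Q
    ((a as i64 * r) % Q as i64) as i32
}

/// Product of two Montgomery-form values, still in Montgomery form.
fn montgomery_mul(a: i32, b: i32) -> i32 {
    montgomery_reduce(a as i64 * b as i64)
}

/// Convert back to the ordinary representative in 0..Q.
fn from_montgomery(a: i32) -> i32 {
    montgomery_reduce(a as i64).rem_euclid(Q)
}
```

The conversions cost a reduction each, so the technique pays off only when many multiplications happen between them, as in the NTT, where the twiddle factors can be stored in Montgomery form ahead of time.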

3. Lazy Reduction

Instead of reducing after every arithmetic operation, we accumulate results and reduce once:

// Traditional approach: 3 reductions
let t1 = (a * b) % q;
let t2 = (t1 + c * d) % q;
let t3 = (t2 + e * f) % q;

// Lazy reduction: 1 reduction
let sum = a * b + c * d + e * f;  // Exceeds q; must still fit in the accumulator type
let result = barrett_reduce(sum);  // Single reduction
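
Lazy reduction is only safe while the unreduced sum fits in the accumulator. The sketch below shows the bound for the ML-DSA modulus with an `i64` accumulator. The names `dot_eager` and `dot_lazy` are hypothetical, inputs are assumed to lie in 0..Q, and a plain `%` stands in for `barrett_reduce`.

```rust
const Q: i64 = 8_380_417; // ML-DSA modulus

/// Eager: one reduction per multiply-accumulate step.
fn dot_eager(a: &[i64], b: &[i64]) -> i64 {
    a.iter()
        .zip(b)
        .fold(0, |acc, (&x, &y)| (acc + (x * y) % Q) % Q)
}

/// Lazy: accumulate unreduced products, then reduce once.
fn dot_lazy(a: &[i64], b: &[i64]) -> i64 {
    // Each product is below Q^2 < 2^46, so an i64 can hold the sum of
    // up to 2^17 such products without overflow.
    debug_assert!(a.len() <= 1 << 17);
    let sum: i64 = a.iter().zip(b).map(|(&x, &y)| x * y).sum();
    sum % Q // stand-in for a Barrett reduction
}
```

Both functions return the same value. The lazy version replaces n reductions with one, at the price of tracking the worst-case magnitude of the accumulator.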

4. Cache-Optimized Memory Access

We structure data layouts to maximize cache utilization:

// Poor cache behavior: strided access
for i in 0..256 {
    for j in 0..4 {
        result[j][i] = process(input[j][i]);
    }
}

// Optimized: sequential access
for j in 0..4 {
    for i in 0..256 {
        result[j][i] = process(input[j][i]);
    }
}

Platform Comparison

Performance varies significantly across platforms:

| Platform | ML-DSA-65 Sign | ML-KEM-768 Decaps |
|---|---|---|
| x86_64 (AVX2) | 74.4μs | 30.5μs |
| x86_64 (scalar) | 142.8μs | 58.2μs |
| ARM64 (NEON) | 89.1μs | 36.8μs |
| ARM64 (scalar) | 168.4μs | 67.3μs |
| WASM | 312.5μs | 124.7μs |
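
The gap between the vectorized and scalar rows is why a library usually picks its code path at run time rather than at compile time. A minimal sketch of such dispatch using the standard library's feature-detection macros follows. The `Backend` enum and `select_backend` function are hypothetical and not part of HPCrypt's API.

```rust
/// Code paths corresponding to the rows of the platform table.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx2,
    Neon,
    Scalar,
}

/// Pick the fastest backend the current CPU supports.
fn select_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Backend::Neon;
        }
    }
    // WASM and CPUs without the features above fall back to scalar code.
    Backend::Scalar
}
```

Checking the feature before calling a `#[target_feature(enable = "avx2")]` function is what makes the `unsafe` call sound. A library would typically cache the result instead of querying the CPU on every operation.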

Recommendations by Use Case

Web Servers (TLS)

For TLS key exchange with ML-KEM:

| Traffic Level | Recommended | Rationale |
|---|---|---|
| < 10K req/s | ML-KEM-1024 | Maximum security, ample headroom |
| 10K-100K req/s | ML-KEM-768 | Balanced security/performance |
| > 100K req/s | ML-KEM-768 + batching | Optimize throughput |
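
A rough way to sanity-check these thresholds is to convert the measured latencies into an upper bound on handshakes per second. The sketch below assumes the server performs one encapsulation per handshake and that ML-KEM is the only cost, which real deployments will not match. `handshakes_per_second` is a hypothetical helper.

```rust
/// Upper bound on operations per second for a given per-operation latency,
/// assuming perfect scaling across `cores` and no other work.
fn handshakes_per_second(latency_us: f64, cores: u32) -> f64 {
    cores as f64 * 1_000_000.0 / latency_us
}
```

With the encapsulation numbers above, ML-KEM-1024 at 49.0μs gives about 20,400 operations per second on one core of the test machine, and ML-KEM-768 at 29.1μs gives about 34,400. These are ceilings for the cryptographic step alone, not predictions of end-to-end TLS throughput.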

Code Signing

For software signatures with ML-DSA:

| Scenario | Recommended | Rationale |
|---|---|---|
| Release signing | ML-DSA-87 | Long-term security critical |
| CI/CD artifacts | ML-DSA-65 | Balanced for automation |
| Development | ML-DSA-44 | Fast iteration cycles |

IoT and Embedded

For constrained environments:

| Resource Constraint | Recommended |
|---|---|
| < 64KB RAM | ML-KEM-512 (minimal stack) |
| < 1MHz CPU | Consider hybrid classical/PQ |
| Battery-powered | ML-KEM-768 (efficient encapsulation and decapsulation) |

Running Your Own Benchmarks

Reproduce these results on your hardware:

# Clone and build
git clone https://github.com/seceq/hpcrypt
cd hpcrypt

# Run comprehensive benchmarks
cargo bench --features=bench

# Run specific algorithm
cargo bench --features=bench -- ml_dsa_65

# Save a baseline, then compare a later run against it
cargo bench --features=bench -- --save-baseline current
cargo bench --features=bench -- --baseline current

Benchmark results are output in both human-readable format and JSON for automated analysis.

Conclusion

Post-quantum cryptography is ready for production deployment. With careful implementation, the performance overhead compared to classical algorithms is modest—and in some cases, post-quantum algorithms outperform their classical counterparts for equivalent security levels.

HPCrypt demonstrates that pure Rust implementations can achieve performance competitive with optimized C libraries while providing the safety guarantees Rust is known for. As quantum computing advances, this combination of security, safety, and performance will become increasingly important.
