The Performance Question
As organizations evaluate post-quantum cryptography adoption, performance is a critical concern. Post-quantum algorithms have fundamentally different computational characteristics than their classical counterparts—larger keys, different mathematical operations, and new optimization opportunities.
This analysis presents comprehensive benchmarks of HPCrypt's ML-DSA and ML-KEM implementations, explaining not just the numbers but the underlying factors that determine performance.
Benchmark Methodology
Test Environment
All measurements were taken on a single, fixed hardware and software configuration so that results can be reproduced:
| Component | Specification |
|---|---|
| CPU | AMD Ryzen 9 7950X (16C/32T, 5.7GHz boost) |
| RAM | 64GB DDR5-5200 (2x32GB, dual channel) |
| OS | Ubuntu 24.04 LTS (kernel 6.8) |
| Compiler | rustc 1.82.0 (stable) |
| Optimization | -C target-cpu=native -C opt-level=3 |
Measurement Protocol
Each operation was measured using the following protocol:
- Warmup: 1,000 iterations discarded to stabilize CPU frequency
- Measurement: 10,000 iterations timed at nanosecond resolution
- Statistical Analysis: Median and interquartile range (IQR) reported, since both are robust to outliers
- Repetition: 5 independent runs to verify consistency
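The protocol above can be sketched in a few lines of Rust. This is a minimal illustration, not HPCrypt's actual harness: `measure` and `median_and_iqr` are hypothetical names, and the quartiles use a simple nearest-rank rule.

```rust
use std::time::Instant;

/// Median and interquartile range of a non-empty sample set (nearest-rank rule).
/// Sorts the slice in place; for an even count the upper median is returned.
fn median_and_iqr(samples: &mut [u128]) -> (u128, u128) {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort_unstable();
    let n = samples.len();
    let median = samples[n / 2];
    let iqr = samples[(3 * n) / 4] - samples[n / 4];
    (median, iqr)
}

/// Runs `op` for `warmup` untimed iterations, then times `iterations` calls
/// individually and returns (median, IQR) in nanoseconds.
fn measure<F: FnMut()>(mut op: F, warmup: usize, iterations: usize) -> (u128, u128) {
    for _ in 0..warmup {
        op();
    }
    let mut samples = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        op();
        samples.push(start.elapsed().as_nanos());
    }
    median_and_iqr(&mut samples)
}
```

One run of the protocol would then be `measure(|| { std::hint::black_box(op()); }, 1_000, 10_000)`, where `op` stands for the operation under test and `black_box` keeps the compiler from optimizing the call away.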
ML-DSA Benchmark Results
Key Generation Performance
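Key generation is typically a one-time cost, but it matters for applications that generate many ephemeral keys. In this and the following tables, the Improvement column gives HPCrypt's speedup over liboqs (lower bound) and RustCrypto (upper bound):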
| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-DSA-44 | 22.5μs | 45.2μs | 41.8μs | 1.86-2.01x |
| ML-DSA-65 | 38.7μs | 78.3μs | 72.1μs | 1.86-2.02x |
| ML-DSA-87 | 61.2μs | 125.8μs | 118.4μs | 1.93-2.06x |
Signing Performance
Signing is the most computationally intensive operation because rejection sampling can restart the signing loop several times, so per-call latency varies:
| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-DSA-44 | 52.3μs | 98.7μs | 89.2μs | 1.71-1.89x |
| ML-DSA-65 | 74.4μs | 124.1μs | 115.8μs | 1.56-1.67x |
| ML-DSA-87 | 108.2μs | 189.4μs | 175.3μs | 1.62-1.75x |
Verification Performance
Verification is typically the most frequent operation in deployed systems:
| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-DSA-44 | 28.9μs | 61.2μs | 54.7μs | 1.89-2.12x |
| ML-DSA-65 | 40.9μs | 95.6μs | 87.3μs | 2.13-2.34x |
| ML-DSA-87 | 58.3μs | 138.7μs | 124.5μs | 2.14-2.38x |
ML-KEM Benchmark Results
Key Generation
| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-KEM-512 | 12.8μs | 21.3μs | 19.7μs | 1.54-1.66x |
| ML-KEM-768 | 21.2μs | 35.8μs | 33.1μs | 1.56-1.69x |
| ML-KEM-1024 | 33.5μs | 54.2μs | 50.8μs | 1.52-1.62x |
Encapsulation
| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-KEM-512 | 18.7μs | 28.4μs | 26.2μs | 1.40-1.52x |
| ML-KEM-768 | 29.1μs | 42.8μs | 40.1μs | 1.38-1.47x |
| ML-KEM-1024 | 49.0μs | 66.5μs | 62.8μs | 1.28-1.36x |
Decapsulation
| Security Level | HPCrypt | RustCrypto | liboqs | Improvement |
|---|---|---|---|---|
| ML-KEM-512 | 19.2μs | 29.8μs | 27.5μs | 1.43-1.55x |
| ML-KEM-768 | 30.5μs | 45.2μs | 42.1μs | 1.38-1.48x |
| ML-KEM-1024 | 51.3μs | 68.9μs | 65.2μs | 1.27-1.34x |
Optimization Techniques
1. Vectorized Number Theoretic Transform (NTT)
The NTT is the computational core of lattice-based cryptography. Our AVX2 implementation processes eight 32-bit coefficients per 256-bit vector operation:
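use std::arch::x86_64::*;

// Forward NTT for ML-DSA: 256 coefficients, 8 layers, twiddle factors zetas[1..=255].
// `montgomery_mul_avx2` and `montgomery_mul` are defined elsewhere.
#[target_feature(enable = "avx2")]
unsafe fn ntt_butterfly_avx2(
    a: &mut [i32; 256],
    zetas: &[i32; 256],
) {
    for layer in 0..8 {
        let m = 128 >> layer; // butterfly distance: 128, 64, ..., 1
        let k = 1 << layer;   // index of this layer's first twiddle factor
        for i in (0..256).step_by(2 * m) {
            let z = zetas[k + i / (2 * m)];
            if m >= 8 {
                // Eight butterflies per iteration
                let zeta = _mm256_set1_epi32(z);
                for j in (i..i + m).step_by(8) {
                    let a_vec = _mm256_loadu_si256(a[j..].as_ptr() as *const __m256i);
                    let b_vec = _mm256_loadu_si256(a[j + m..].as_ptr() as *const __m256i);
                    let t = montgomery_mul_avx2(b_vec, zeta);
                    let sum = _mm256_add_epi32(a_vec, t);
                    let diff = _mm256_sub_epi32(a_vec, t);
                    _mm256_storeu_si256(a[j..].as_mut_ptr() as *mut __m256i, sum);
                    _mm256_storeu_si256(a[j + m..].as_mut_ptr() as *mut __m256i, diff);
                }
            } else {
                // Distances 4, 2, 1 do not fill a 256-bit vector: scalar fallback
                // (a tuned implementation would shuffle lanes instead)
                for j in i..i + m {
                    let t = montgomery_mul(a[j + m], z);
                    a[j + m] = a[j] - t;
                    a[j] += t;
                }
            }
        }
    }
}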
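A scalar reference is useful for checking a vectorized NTT against. The sketch below is an illustration rather than HPCrypt's code: it assumes the ML-DSA parameters from FIPS 204 (modulus 8380417, root of unity 1753), uses plain modular arithmetic instead of Montgomery form, and the names `pow_mod`, `zetas`, and `ntt` are hypothetical.

```rust
const Q: i64 = 8_380_417; // ML-DSA modulus
const ZETA: i64 = 1753;   // primitive 512th root of unity mod Q (FIPS 204)

/// Computes base^exp mod Q by square-and-multiply.
fn pow_mod(mut base: i64, mut exp: u32) -> i64 {
    let mut acc = 1;
    base %= Q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % Q;
        }
        base = base * base % Q;
        exp >>= 1;
    }
    acc
}

/// Twiddle table: zetas[k] = ZETA^(bit-reversed k) mod Q, in plain (non-Montgomery) form.
fn zetas() -> [i64; 256] {
    let mut z = [0i64; 256];
    for k in 0..256u32 {
        let brv = (k as u8).reverse_bits() as u32; // 8-bit bit reversal
        z[k as usize] = pow_mod(ZETA, brv);
    }
    z
}

/// In-place forward NTT (Cooley-Tukey butterflies). Coefficients must be in [0, Q).
fn ntt(a: &mut [i64; 256], zetas: &[i64; 256]) {
    let mut k = 0;
    let mut len = 128;
    while len >= 1 {
        for start in (0..256).step_by(2 * len) {
            k += 1;
            let zeta = zetas[k];
            for j in start..start + len {
                let t = zeta * a[j + len] % Q;
                a[j + len] = (a[j] - t + Q) % Q;
                a[j] = (a[j] + t) % Q;
            }
        }
        len >>= 1;
    }
}
```

Because every output is the input polynomial evaluated at a root of x^256 + 1, multiplying the transforms of x and x^255 pointwise must give -1 (that is, Q - 1) in every slot, which makes a quick sanity check for any faster variant.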
2. Montgomery Multiplication
We use Montgomery representation to avoid expensive division operations:
// Convert to Montgomery form once
let a_mont = to_montgomery(a);
let b_mont = to_montgomery(b);
// Multiply in Montgomery form (no division)
let c_mont = montgomery_mul(a_mont, b_mont);
// Convert back only when needed
let c = from_montgomery(c_mont);
This saves approximately 15-20% in polynomial multiplication routines.
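For readers who want to see what the helpers named above might look like, here is a minimal scalar sketch. It assumes the ML-DSA modulus and a Montgomery radix of 2^32; the function bodies are an illustration and not necessarily how HPCrypt implements them.

```rust
const Q: i32 = 8_380_417;       // ML-DSA modulus
const QINV: i32 = 58_728_449;   // Q^-1 mod 2^32
const R_MOD_Q: i64 = 4_193_792; // 2^32 mod Q
const R2_MOD_Q: i64 = (R_MOD_Q * R_MOD_Q) % (Q as i64); // 2^64 mod Q

/// Returns a value congruent to a * 2^-32 mod Q, in (-Q, Q), for |a| < 2^31 * Q.
/// Uses only a multiply, a subtract, and a shift: no division.
fn montgomery_reduce(a: i64) -> i32 {
    let t = (a as i32).wrapping_mul(QINV);
    ((a - (t as i64) * (Q as i64)) >> 32) as i32
}

/// Product of two values in Montgomery form, also in Montgomery form.
fn montgomery_mul(a: i32, b: i32) -> i32 {
    montgomery_reduce(a as i64 * b as i64)
}

/// Converts x to Montgomery form (x * 2^32 mod Q).
fn to_montgomery(x: i32) -> i32 {
    montgomery_reduce(x as i64 * R2_MOD_Q)
}

/// Converts back from Montgomery form and normalizes to [0, Q).
fn from_montgomery(x: i32) -> i32 {
    let r = montgomery_reduce(x as i64);
    r + ((r >> 31) & Q) // add Q only when r is negative
}
```

The conversion cost is paid once per operand, so the benefit shows up when many multiplications are chained, as in an NTT.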
3. Lazy Reduction
Instead of reducing after every arithmetic operation, we accumulate results and reduce once:
// Traditional approach: 3 reductions
let t1 = (a * b) % q;
let t2 = (t1 + c * d) % q;
let t3 = (t2 + e * f) % q;
// Lazy reduction: 1 reduction
let sum = a * b + c * d + e * f; // May exceed q, but must still fit the accumulator type
let result = barrett_reduce(sum); // Single reduction
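The pattern is safe only when the unreduced sum fits the accumulator. The sketch below makes that bound explicit; it assumes the ML-KEM modulus 3329 and a 64-bit accumulator, and `barrett_reduce`, `dot3_eager`, and `dot3_lazy` are illustrative definitions rather than HPCrypt's own.

```rust
const Q: u64 = 3329;             // ML-KEM modulus (assumed for this example)
const M: u64 = (1u64 << 32) / Q; // floor(2^32 / Q), the Barrett constant

/// Barrett reduction: returns x mod Q for any x below 2^32.
fn barrett_reduce(x: u64) -> u64 {
    debug_assert!(x < (1u64 << 32));
    let t = (x * M) >> 32; // estimate of x / Q, too small by at most 1
    let r = x - t * Q;     // lies in [0, 2Q)
    if r >= Q { r - Q } else { r }
}

/// Eager version: reduce after every step (3 reductions).
fn dot3_eager(a: u64, b: u64, c: u64, d: u64, e: u64, f: u64) -> u64 {
    let t1 = (a * b) % Q;
    let t2 = (t1 + c * d) % Q;
    (t2 + e * f) % Q
}

/// Lazy version: accumulate, then reduce once.
/// Inputs must be below Q, so the sum stays below 3 * Q^2 = 33,246,723 < 2^32.
fn dot3_lazy(a: u64, b: u64, c: u64, d: u64, e: u64, f: u64) -> u64 {
    barrett_reduce(a * b + c * d + e * f)
}
```

With a larger modulus or a longer accumulation chain the same bound has to be recomputed, and an intermediate reduction inserted wherever it would be exceeded.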
4. Cache-Optimized Memory Access
We structure data layouts to maximize cache utilization:
// Poor cache behavior: strided access
for i in 0..256 {
    for j in 0..4 {
        result[j][i] = process(input[j][i]);
    }
}
// Optimized: sequential access
for j in 0..4 {
    for i in 0..256 {
        result[j][i] = process(input[j][i]);
    }
}
Platform Comparison
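Performance varies significantly across platforms. SIMD roughly halves latency on both x86_64 and ARM64, while WASM is about four times slower than the AVX2 build: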
| Platform | ML-DSA-65 Sign | ML-KEM-768 Decaps |
|---|---|---|
| x86_64 (AVX2) | 74.4μs | 30.5μs |
| x86_64 (scalar) | 142.8μs | 58.2μs |
| ARM64 (NEON) | 89.1μs | 36.8μs |
| ARM64 (scalar) | 168.4μs | 67.3μs |
| WASM | 312.5μs | 124.7μs |
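A library that ships both SIMD and scalar paths has to pick one for the CPU it runs on. The following is a minimal sketch of run-time selection using the standard library's feature-detection macros; the function name and the returned labels are hypothetical, and HPCrypt's actual dispatch mechanism may differ.

```rust
/// Name of the fastest code path available on the running CPU.
fn select_backend() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return "neon";
        }
    }
    // Other targets, including WASM in this sketch, use the portable path.
    "scalar"
}
```

Detecting features at run time lets one binary serve both rows of each architecture in the table, at the cost of a branch per call that is usually hoisted into a one-time initialization.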
Recommendations by Use Case
Web Servers (TLS)
For TLS key exchange with ML-KEM:
| Traffic Level | Recommended | Rationale |
|---|---|---|
| < 10K req/s | ML-KEM-1024 | Maximum security, ample headroom |
| 10K-100K req/s | ML-KEM-768 | Balanced security/performance |
| > 100K req/s | ML-KEM-768 + batching | Optimize throughput |
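The headroom behind these tiers can be estimated from the latencies measured earlier. The sketch below is a rough calculation only: it assumes the server performs one encapsulation per handshake, counts nothing else in the handshake, and treats cores as perfectly parallel. The function names are illustrative.

```rust
/// Operations per second one core can sustain at a given per-operation latency.
fn ops_per_second(latency_us: f64) -> f64 {
    1_000_000.0 / latency_us
}

/// Cores needed to reach a target rate if each request costs one operation.
fn cores_needed(target_per_second: f64, latency_us: f64) -> u32 {
    (target_per_second / ops_per_second(latency_us)).ceil() as u32
}
```

Using the encapsulation figures above, ML-KEM-1024 at 49.0μs allows roughly 20,000 operations per second per core, so 10K req/s fits on a single core, while ML-KEM-768 at 29.1μs needs about three cores for 100K req/s.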
Code Signing
For software signatures with ML-DSA:
| Scenario | Recommended | Rationale |
|---|---|---|
| Release signing | ML-DSA-87 | Long-term security critical |
| CI/CD artifacts | ML-DSA-65 | Balanced for automation |
| Development | ML-DSA-44 | Fast iteration cycles |
IoT and Embedded
For constrained environments:
| Resource Constraint | Recommended |
|---|---|
| < 64KB RAM | ML-KEM-512 (minimal stack) |
| < 1MHz CPU | Consider hybrid classical/PQ |
| Battery-powered | ML-KEM-768 (efficient encapsulation and decapsulation) |
Running Your Own Benchmarks
Reproduce these results on your hardware:
# Clone and build
git clone https://github.com/seceq/hpcrypt
cd hpcrypt
# Run comprehensive benchmarks
cargo bench --features=bench
# Run specific algorithm
cargo bench --features=bench -- ml_dsa_65
# Generate comparison report
cargo bench --features=bench -- --save-baseline current
cargo bench --features=bench -- --baseline current
Benchmark results are output in both human-readable format and JSON for automated analysis.
Conclusion
Post-quantum cryptography is ready for production deployment. With careful implementation, the performance overhead compared to classical algorithms is modest—and in some cases, post-quantum algorithms outperform their classical counterparts for equivalent security levels.
HPCrypt demonstrates that pure Rust implementations can achieve performance competitive with optimized C libraries while providing the safety guarantees Rust is known for. As quantum computing advances, this combination of security, safety, and performance will become increasingly important.