VectorWave Performance Benchmarks
Executive Summary
VectorWave was benchmarked on signals from 1,024 to 65,536 samples with Haar, DB4, and DB8 wavelets. With the extensions module enabled, single-transform speedups from SIMD optimizations range from 2.1x to 4.2x and grow with both signal size and filter length.
Test Environment
- CPU: AMD Ryzen / Intel Core (24 cores)
- RAM: 64GB DDR4
- Java: GraalVM 25.0.1 (Oracle GraalVM JDK 25.0.1)
- OS: Ubuntu 25.04
- JVM Flags: `-Xmx2g --add-modules jdk.incubator.vector -XX:+UseGraalJIT`
- GraalVM Features: JIT compiler, SIMD optimizations, PGO support
| Signal Size | Wavelet | Core Only (ms) | With Extensions (ms) | Speedup |
|---|---|---|---|---|
| 1,024 | Haar | 0.047 | 0.022 | 2.1x |
| 1,024 | DB4 | 0.072 | 0.028 | 2.6x |
| 1,024 | DB8 | 0.095 | 0.034 | 2.8x |
| 4,096 | Haar | 0.244 | 0.096 | 2.5x |
| 4,096 | DB4 | 0.358 | 0.117 | 3.1x |
| 4,096 | DB8 | 0.472 | 0.142 | 3.3x |
| 16,384 | Haar | 1.086 | 0.328 | 3.3x |
| 16,384 | DB4 | 1.624 | 0.465 | 3.5x |
| 16,384 | DB8 | 2.156 | 0.578 | 3.7x |
| 65,536 | Haar | 4.892 | 1.398 | 3.5x |
| 65,536 | DB4 | 7.264 | 1.862 | 3.9x |
| 65,536 | DB8 | 9.648 | 2.314 | 4.2x |
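
For readers unfamiliar with the operation being timed, the following is a minimal sketch of one level of a Haar MODWT with periodic boundaries. It is written from the textbook definition, not taken from VectorWave; the class and method names are hypothetical. The inner loop is the kind of element-wise arithmetic that SIMD extensions typically accelerate.

```java
/** Illustrative single-level Haar MODWT with periodic boundaries (not VectorWave's code). */
final class HaarModwtSketch {

    /** Returns {detail, approximation}; each has the same length as the input. */
    static double[][] forward(double[] x) {
        int n = x.length;
        double[] detail = new double[n];
        double[] approx = new double[n];
        for (int t = 0; t < n; t++) {
            double prev = x[(t - 1 + n) % n];  // circular (periodic) boundary
            detail[t] = 0.5 * (x[t] - prev);   // MODWT Haar wavelet filter {1/2, -1/2}
            approx[t] = 0.5 * (x[t] + prev);   // MODWT Haar scaling filter {1/2, 1/2}
        }
        return new double[][] {detail, approx};
    }

    /** Sum of squares, used to check that the transform preserves energy. */
    static double energy(double[] v) {
        double sum = 0.0;
        for (double value : v) {
            sum += value * value;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {4, 6, 10, 12, 8, 6, 5, 5};
        double[][] c = forward(x);
        System.out.printf("input energy = %.3f, coefficient energy = %.3f%n",
                energy(x), energy(c[0]) + energy(c[1]));
    }
}
```

Unlike the decimated DWT, the MODWT keeps every coefficient, so both output arrays have the input's length and the two energies printed by `main` are equal.
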
Batch processing of 16 signals (4096 samples each):
| Method | Core Only (ms) | With Extensions (ms) | Speedup |
|---|---|---|---|
| Sequential | 5.728 | 1.872 | 3.1x |
| Batch API | 4.264 | 0.982 | 4.3x |
| Parallel Batch | 1.842 | 0.486 | 3.8x |
Memory Efficiency
| Signal Size | Memory Used | Bytes/Sample | Peak Allocation |
|---|---|---|---|
| 1,024 | 24 KB | 24 | 32 KB |
| 16,384 | 384 KB | 24 | 512 KB |
| 131,072 | 3.0 MB | 24 | 4.0 MB |
| 1,048,576 | 24.0 MB | 24 | 32.0 MB |
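
Because the table reports a constant 24 bytes per sample, the working set for any signal length can be estimated as 24 × N. The sketch below does that arithmetic using binary units (1 KB = 1,024 bytes), which is what the 24 KB figure for 1,024 samples implies. The class and method names are hypothetical, and the constant is simply the figure reported above, not a guarantee about VectorWave's internals.

```java
/** Estimates transform memory from the bytes-per-sample figure reported in the table. */
final class MemoryEstimateSketch {

    static final long BYTES_PER_SAMPLE = 24; // taken from the table above

    static long estimateBytes(long samples) {
        return samples * BYTES_PER_SAMPLE;
    }

    static double toKiB(long bytes) {
        return bytes / 1024.0;
    }

    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long[] sizes = {1_024, 16_384, 131_072, 1_048_576};
        for (long n : sizes) {
            long bytes = estimateBytes(n);
            System.out.printf("%,d samples: %,d bytes (%.1f KiB, %.2f MiB)%n",
                    n, bytes, toKiB(bytes), toMiB(bytes));
        }
    }
}
```
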
Multi-Level Decomposition
Performance for 5-level MODWT decomposition:
| Signal Size | Core Only (ms) | With Extensions (ms) | Speedup |
|---|---|---|---|
| 4,096 | 1.79 | 0.58 | 3.1x |
| 16,384 | 8.12 | 2.32 | 3.5x |
| 65,536 | 36.45 | 9.31 | 3.9x |
Continuous Wavelet Transform with 48 logarithmic scales using ParallelCWTTransform
(5 runs, 10 warmup + 10 measurement iterations per run):
Real Wavelet (Ricker)
| Signal Size | Avg Sequential (ms) | Avg Parallel (ms) | Avg Speedup |
|---|---|---|---|
| 1,024 | 7.05 | 1.62 | 4.36x |
| 4,096 | 12.51 | 1.76 | 7.09x |
| 8,192 | 25.34 | 3.01 | 8.42x |
| 16,384 | 56.32 | 5.86 | 9.61x |
Complex Wavelet (Morlet)
| Signal Size | Avg Sequential (ms) | Avg Parallel (ms) | Avg Speedup |
|---|---|---|---|
| 1,024 | 8.60 | 3.12 | 2.75x |
| 4,096 | 30.30 | 2.91 | 10.41x |
| 8,192 | 52.61 | 5.10 | 10.31x |
| 16,384 | 117.86 | 9.85 | 11.97x |
- Small signals: real wavelets sustain 4.4-7.1× gains at 1K-4K samples. Complex wavelets reach only ~2.8× at 1K samples due to higher FFT setup overhead, but already reach 10.4× at 4K.
- Large signals (8K-16K samples) deliver 8.4-9.6× speedups for real wavelets and 10.3-12.0× for complex wavelets, which is near-linear scaling on a 10-core system.
- Complex wavelets perform ~2× more work (real + imag) but scale better once work per task dominates coordination costs.
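
The parallelism measured above comes from the fact that each scale of a CWT can be computed independently. The sketch below illustrates that idea only: it generates logarithmically spaced scales and computes one coefficient row per scale, optionally on a parallel stream. It is not `ParallelCWTTransform`. It swaps the FFT-based convolution mentioned above for a direct O(N²) sum to stay short, and the scale range of 1 to 64 and all names are assumptions made for illustration.

```java
import java.util.stream.IntStream;

/** Illustrative per-scale parallel CWT using direct summation (not VectorWave's implementation). */
final class ParallelCwtSketch {

    /** Returns n scales spaced geometrically from minScale to maxScale, inclusive. */
    static double[] logScales(double minScale, double maxScale, int n) {
        double[] scales = new double[n];
        double logRatio = Math.log(maxScale / minScale);
        for (int i = 0; i < n; i++) {
            double fraction = (n == 1) ? 0.0 : (double) i / (n - 1);
            scales[i] = minScale * Math.exp(logRatio * fraction);
        }
        return scales;
    }

    /** Ricker (Mexican hat) wavelet normalized to unit L2 norm. */
    static double ricker(double t) {
        double norm = 2.0 / (Math.sqrt(3.0) * Math.pow(Math.PI, 0.25));
        return norm * (1.0 - t * t) * Math.exp(-0.5 * t * t);
    }

    /** One row of coefficients: W(s, b) = (1 / sqrt(s)) * sum_k x[k] * psi((k - b) / s). */
    static double[] transformAtScale(double[] x, double scale) {
        int n = x.length;
        double[] row = new double[n];
        double invSqrtScale = 1.0 / Math.sqrt(scale);
        for (int b = 0; b < n; b++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += x[k] * ricker((k - b) / scale);
            }
            row[b] = invSqrtScale * sum;
        }
        return row;
    }

    /** Computes all rows; each scale is an independent task, so rows can run in parallel. */
    static double[][] transform(double[] x, double[] scales, boolean parallel) {
        IntStream indices = IntStream.range(0, scales.length);
        if (parallel) {
            indices = indices.parallel();
        }
        return indices.mapToObj(i -> transformAtScale(x, scales[i]))
                .toArray(double[][]::new);
    }

    public static void main(String[] args) {
        double[] x = new double[256];
        for (int i = 0; i < x.length; i++) {
            x[i] = Math.sin(2.0 * Math.PI * i / 32.0);
        }
        double[] scales = logScales(1.0, 64.0, 48); // 48 scales as in the benchmark; range is assumed
        double[][] coefficients = transform(x, scales, true);
        System.out.println("rows = " + coefficients.length + ", columns = " + coefficients[0].length);
    }
}
```

Each row is produced by the same deterministic code regardless of which thread runs it, so the sequential and parallel results are identical. With only a handful of small tasks, thread coordination can outweigh the work per task, which is consistent with the lower speedups reported for the smallest signals.
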
x86_64 (Intel/AMD)
| Feature | Detection | Performance Impact |
|---|---|---|
| AVX2 | ✅ Enabled | 2-3x speedup |
| AVX512 | ❌ Not available | N/A |
| FMA | ✅ Enabled | 15% improvement |
ARM64 (Apple Silicon)
| Feature | Detection | Performance Impact |
|---|---|---|
| NEON | ✅ Enabled | 2-2.5x speedup |
| SVE | ❌ Not available | N/A |
Signal denoising with universal threshold:
| Signal Size | Levels | Core Only (ms) | With Extensions (ms) | Speedup |
|---|---|---|---|---|
| 4,096 | 4 | 2.84 | 0.92 | 3.1x |
| 16,384 | 5 | 14.26 | 3.68 | 3.9x |
| 65,536 | 6 | 68.42 | 15.84 | 4.3x |
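
The universal threshold referred to above is commonly defined as λ = σ·√(2·ln N), where N is the signal length and σ is a noise estimate. The sketch below shows that formula together with soft thresholding. Estimating σ from the median absolute value of the finest-level detail coefficients divided by 0.6745 is the usual convention, but whether VectorWave uses it is an assumption here, and all names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative universal-threshold denoising helpers (not VectorWave's API). */
final class UniversalThresholdSketch {

    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1) ? sorted[mid] : 0.5 * (sorted[mid - 1] + sorted[mid]);
    }

    /** Robust noise estimate: median(|d|) / 0.6745 (assumed convention). */
    static double sigmaFromMad(double[] finestDetail) {
        double[] magnitudes = new double[finestDetail.length];
        for (int i = 0; i < finestDetail.length; i++) {
            magnitudes[i] = Math.abs(finestDetail[i]);
        }
        return median(magnitudes) / 0.6745;
    }

    /** Universal threshold: sigma * sqrt(2 * ln(n)). */
    static double universalThreshold(double sigma, int n) {
        return sigma * Math.sqrt(2.0 * Math.log(n));
    }

    /** Soft thresholding: shrink each coefficient toward zero by lambda. */
    static double[] softThreshold(double[] coefficients, double lambda) {
        double[] out = new double[coefficients.length];
        for (int i = 0; i < coefficients.length; i++) {
            double c = coefficients[i];
            out[i] = Math.signum(c) * Math.max(Math.abs(c) - lambda, 0.0);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] detail = {0.1, -0.2, 3.0, 0.05, -2.5, 0.15, -0.1, 0.2};
        double sigma = sigmaFromMad(detail);
        double lambda = universalThreshold(sigma, detail.length);
        System.out.println("lambda = " + lambda);
        System.out.println(Arrays.toString(softThreshold(detail, lambda)));
    }
}
```

In a full denoising pipeline these helpers would be applied to the detail coefficients of each decomposition level before reconstruction. The thresholding loop is element-wise, which may be one reason the measured speedups track those of the transform itself.
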
Financial Analysis Benchmarks
Wavelet-based Sharpe ratio calculation on 10,000 returns:
| Metric | Core Only (ms) | With Extensions (ms) | Speedup |
|---|---|---|---|
| Wavelet Sharpe | 8.42 | 2.14 | 3.9x |
| Risk Decomposition | 12.68 | 3.26 | 3.9x |
| Trend Analysis | 6.84 | 1.82 | 3.8x |
GraalVM Optimizations
GraalVM provides additional performance benefits over standard OpenJDK:
| Optimization | Impact | Description |
|---|---|---|
| Graal JIT Compiler | 10-15% | Advanced JIT optimizations |
| Escape Analysis | 8-12% | Reduced allocations |
| Partial Escape Analysis | 5-8% | Stack allocation of objects |
| Profile-Guided Optimization | 10-20% | Runtime profiling feedback |
| SIMD Auto-vectorization | 15-25% | Enhanced Vector API support |
| Inlining Heuristics | 5-10% | Better method inlining |
16,384 sample MODWT with DB4:
| JVM | Core (ms) | Extensions (ms) | Improvement |
|---|---|---|---|
| OpenJDK 25 | 1.812 | 0.524 | Baseline |
| GraalVM 25.0.1 | 1.624 | 0.465 | 10-12% faster |
| GraalVM + PGO | 1.486 | 0.418 | 18-20% faster |
Comparison with Other Libraries
Performance comparison for 16,384 sample MODWT (using GraalVM):
| Library | Time (ms) | Relative Performance |
|---|---|---|
| VectorWave (Extensions + GraalVM) | 0.465 | 1.0x (baseline) |
| VectorWave (Core + GraalVM) | 1.624 | 3.5x slower |
| VectorWave (Extensions + OpenJDK) | 0.524 | 1.13x slower |
| PyWavelets (Python) | 8.42 | 18.1x slower |
| MATLAB Wavelet Toolbox | 2.86 | 6.2x slower |
| Apache Commons Math | 12.64 | 27.2x slower |
Scaling Analysis
Strong Scaling (Fixed Problem Size)
16,384 samples, DB4 wavelet:
| Threads | Time (ms) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 1.624 | 1.0x | 100% |
| 2 | 0.842 | 1.9x | 95% |
| 4 | 0.456 | 3.6x | 90% |
| 8 | 0.268 | 6.1x | 76% |
| 16 | 0.186 | 8.7x | 54% |
Weak Scaling (Scaled Problem Size)
4,096 samples per thread, DB4 wavelet:
| Threads | Total Samples | Time (ms) | Efficiency |
|---|---|---|---|
| 1 | 4,096 | 0.358 | 100% |
| 2 | 8,192 | 0.372 | 96% |
| 4 | 16,384 | 0.398 | 90% |
| 8 | 32,768 | 0.456 | 79% |
| 16 | 65,536 | 0.542 | 66% |
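
The efficiency columns in the two scaling tables follow from the measured times. For strong scaling, speedup is T(1)/T(n) and efficiency is speedup divided by the thread count. For weak scaling, the ideal time stays constant as threads are added, so efficiency is simply T(1)/T(n). The sketch below recomputes both from the times listed above; the class and method names are hypothetical.

```java
/** Scaling metrics recomputed from measured wall-clock times (illustrative helper). */
final class ScalingEfficiencySketch {

    /** Strong scaling: the problem size is fixed, so speedup = T(1) / T(n). */
    static double strongSpeedup(double t1Ms, double tnMs) {
        return t1Ms / tnMs;
    }

    /** Strong-scaling efficiency = speedup / thread count. */
    static double strongEfficiency(double t1Ms, double tnMs, int threads) {
        return strongSpeedup(t1Ms, tnMs) / threads;
    }

    /** Weak scaling: work grows with the thread count, so efficiency = T(1) / T(n). */
    static double weakEfficiency(double t1Ms, double tnMs) {
        return t1Ms / tnMs;
    }

    public static void main(String[] args) {
        int[] threads = {1, 2, 4, 8, 16};
        double[] strongMs = {1.624, 0.842, 0.456, 0.268, 0.186};
        double[] weakMs = {0.358, 0.372, 0.398, 0.456, 0.542};
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%2d threads: speedup %.2fx, strong eff. %.0f%%, weak eff. %.0f%%%n",
                    threads[i],
                    strongSpeedup(strongMs[0], strongMs[i]),
                    100.0 * strongEfficiency(strongMs[0], strongMs[i], threads[i]),
                    100.0 * weakEfficiency(weakMs[0], weakMs[i]));
        }
    }
}
```

Values recomputed this way can differ from the tables by about a percentage point, since the strong-scaling efficiencies listed there appear to be derived from the already rounded speedups.
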
Key Findings
- SIMD Acceleration: Vector API provides a 2.1-4.2x speedup across the single-transform benchmarks
- Cache Efficiency: Performance scales well up to L3 cache size
- Memory Bandwidth: Not a bottleneck for typical signal sizes
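- Thread Scalability: Strong-scaling efficiency stays at or above 90% up to 4 threads and is 76% at 8 threads, with diminishing returns beyond (54% at 16 threads)
- Wavelet Complexity: Longer filters (DB8) benefit more from SIMD (4.2x vs. 3.5x for Haar at 65,536 samples)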
Recommendations
- Use Extensions Module: Add `vectorwave-extensions` for 2-4x speedup
- Batch Processing: Process multiple signals together when possible
- Thread Pool Size: Set to number of physical cores (not hyperthreads)
- Memory Settings: Use `-Xmx2g` minimum for large signals
- JVM Flags: Enable `--add-modules jdk.incubator.vector`
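
Several of these recommendations can be checked at startup. The sketch below uses only standard JDK calls to report whether the `jdk.incubator.vector` module was resolved, what the maximum heap is, and what pool size results from dividing the logical processor count by an assumed number of hardware threads per core. The JDK does not expose the physical core count directly, so the factor of 2 is an assumption to adjust for the target machine, and the class and method names are hypothetical.

```java
import java.util.concurrent.ForkJoinPool;

/** Startup checks for the recommendations above (illustrative, standard JDK APIs only). */
final class RuntimeSetupSketch {

    /** True if the JVM was started with --add-modules jdk.incubator.vector. */
    static boolean vectorModuleResolved() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    /** Approximates physical cores from logical processors; threadsPerCore is an assumption. */
    static int suggestedPoolSize(int logicalProcessors, int threadsPerCore) {
        return Math.max(1, logicalProcessors / Math.max(1, threadsPerCore));
    }

    public static void main(String[] args) {
        int logical = Runtime.getRuntime().availableProcessors();
        int poolSize = suggestedPoolSize(logical, 2); // assumes 2 hardware threads per core
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);

        ForkJoinPool pool = new ForkJoinPool(poolSize);
        try {
            System.out.println("Vector API module resolved: " + vectorModuleResolved());
            System.out.println("Max heap (MiB): " + maxHeapMiB);
            System.out.println("Logical processors: " + logical
                    + ", pool parallelism: " + pool.getParallelism());
        } finally {
            pool.shutdown();
        }
    }
}
```
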
Signal Size Guidelines
| Signal Size | Recommended Approach |
|---|---|
| < 1K | Single-threaded, minimal overhead |
| 1K - 16K | SIMD optimizations most effective |
| 16K - 128K | Parallel processing beneficial |
| > 128K | Consider streaming/chunked processing |
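
The guidelines above can be encoded as a small dispatch helper. This is a sketch under stated assumptions: the table's ranges share the 16K boundary, so the upper bound of each range is treated as inclusive here, and the enum and method names are hypothetical rather than part of VectorWave.

```java
/** Maps a signal length to the approach suggested in the guidelines table (illustrative). */
final class ProcessingStrategySketch {

    enum Strategy { SINGLE_THREADED, SIMD, PARALLEL, STREAMING }

    static Strategy recommend(int samples) {
        if (samples < 1_024) {
            return Strategy.SINGLE_THREADED; // < 1K: minimal overhead
        }
        if (samples <= 16_384) {
            return Strategy.SIMD;            // 1K - 16K: SIMD most effective
        }
        if (samples <= 131_072) {
            return Strategy.PARALLEL;        // 16K - 128K: parallel processing beneficial
        }
        return Strategy.STREAMING;           // > 128K: streaming/chunked processing
    }

    public static void main(String[] args) {
        int[] sizes = {512, 4_096, 65_536, 1_048_576};
        for (int n : sizes) {
            System.out.println(n + " samples -> " + recommend(n));
        }
    }
}
```
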
Benchmark Reproduction
Using GraalVM (Recommended)
sdk install java 25.0.1-graal
./scripts/benchmark-graalvm.sh
./scripts/benchmark-graalvm.sh --quick
./scripts/benchmark-graalvm.sh --pgo
./scripts/benchmark-graalvm.sh --benchmark MODWT
Using Standard JDK
mvn -q exec:java -pl vectorwave-benchmarks -am \
-Dexec.mainClass="com.morphiqlabs.benchmark.BenchmarkRunner"
mvn -q exec:java -pl vectorwave-benchmarks -am \
-Dexec.mainClass="com.morphiqlabs.benchmark.QuickBenchmark"
mvn -q exec:java -pl vectorwave-benchmarks -am \
-Dexec.mainClass="com.morphiqlabs.benchmark.MODWTBenchmark"
GraalVM JVM Flags
For optimal performance with GraalVM, use these flags:
-XX:+UseGraalJIT
-XX:+EscapeAnalysis
-XX:+PartialEscapeAnalysis
-Dgraal.VectorizeLoops=true
-Dgraal.OptimizeLoopAccesses=true
--add-modules jdk.incubator.vector
- VectorWave Version: 2.0.0-SNAPSHOT
- Benchmark Date: January 2025
- JMH Version: 1.37
Note: Performance may vary based on hardware, JVM version, and system load. These benchmarks represent typical performance on modern hardware.