VectorWave Performance Benchmarking Guide
This guide explains how to run and interpret performance benchmarks for the VectorWave library using JMH (Java Microbenchmark Harness).
Quick Start
Run all benchmarks:
./jmh-runner.sh
Prerequisites
- Java 25+
- Maven 3.6+
- Sufficient heap memory (4GB+ recommended)
- For SIMD benchmarks:
- x86: CPU with AVX2/AVX512 support
- ARM: NEON support (standard on modern ARM)
- Apple Silicon: Automatic optimization for M-series chips
- Optional: Java Vector API support (Java 16+ as incubator module)
- Automatically detected by jmh-runner.sh
- Falls back to scalar implementation if not available
Vector API Configuration
Automatic Detection
The jmh-runner.sh script automatically detects if the Vector API module is available:
# Run all benchmarks
./jmh-runner.sh
# Run specific benchmark
./jmh-runner.sh OptimizedFFTBenchmark
Manual Configuration
With Vector API (Java 16+ with incubator modules)
JAVA_OPTS="-Xmx2G --add-modules=jdk.incubator.vector" ./jmh-runner.sh
Without Vector API (Scalar fallback)
JAVA_OPTS="-Xmx2G" ./jmh-runner.sh
Checking Vector API Status
To verify if benchmarks are using Vector API:
# Check if module is available
java --list-modules | grep jdk.incubator.vector
# Check runtime status
mvn compile && java --add-modules=jdk.incubator.vector \
  -cp target/classes com.morphiqlabs.wavelet.util.OptimizedFFT
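Beyond the CLI checks above, availability can also be probed from Java at runtime. A minimal sketch (the class name `VectorApiCheck` is illustrative, not part of VectorWave): the reflective lookup only succeeds when the JVM was started with --add-modules=jdk.incubator.vector, so the same binary works with or without the module.

```java
// Illustrative helper (not part of VectorWave): reports whether the
// jdk.incubator.vector module is resolvable in the running JVM.
public class VectorApiCheck {

    public static boolean isVectorApiAvailable() {
        try {
            // Resolvable only when the JVM was launched with
            // --add-modules=jdk.incubator.vector
            Class.forName("jdk.incubator.vector.DoubleVector");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Vector API available: " + isVectorApiAvailable());
    }
}
```

This mirrors the scalar-fallback behavior described under Prerequisites: code paths can branch on the result instead of failing at startup.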
Available Benchmarks
1. SIMD Performance Comparison
Compares scalar vs SIMD-optimized operations across different signal sizes.
./jmh-runner.sh SIMDBenchmark
Parameters:
- Signal sizes: 64, 128, 256, 512, 1024, 2048, 4096
- Padding strategies: PERIODIC, ZERO_PADDING
- Measures: Performance difference between scalar and vector operations
- Includes financial signal benchmarks
- Note: Thresholds are platform-adaptive; on Apple Silicon, SIMD pays off for signals of 8 elements or more
2. Signal Size Scaling
Measures performance across different signal sizes to understand scaling characteristics.
./jmh-runner.sh SignalSizeBenchmark
Parameters:
- Signal sizes: 256, 512, 1024, 2048, 4096, 8192, 16384
- Measures: Throughput and average time
- Includes cold-start and batch processing tests
3. Wavelet Type Comparison
Compares performance across different wavelet families.
./jmh-runner.sh WaveletTypeBenchmark
Parameters:
- Wavelet types: Haar, Daubechies-2 (DB2), Daubechies-4 (DB4)
- Measures: Transform time for each wavelet type
- Fixed signal size: 4096 samples
4. Validation Performance
Measures the overhead of input validation.
./jmh-runner.sh ValidationBenchmark
Parameters:
- Various validation scenarios
- Measures validation overhead in nanoseconds
5. Batch Validation Performance
Measures batch validation efficiency.
./jmh-runner.sh BatchValidationBenchmark
Parameters:
- Batch sizes: 10, 100, 1000 signals
- Signal sizes: 256, 1024, 4096
- Measures: Throughput of batch validation
6. Quick Performance Test
A lightweight benchmark for quick performance checks.
./jmh-runner.sh QuickPerformanceTest
Parameters:
- Limited iterations for fast results
- Good for regression testing
7. Vector Optimization Comparison
Compares original VectorOps vs optimized VectorOps implementations.
./jmh-runner.sh VectorOptimizationBenchmark
Parameters:
- Signal sizes: 128, 256, 512, 1024, 2048, 4096
- Filter lengths: 4, 8 (DB2 and DB4)
- Measures: Performance of convolution, combined transforms, and Haar optimization
- Uses 3 forks for statistical reliability
8. Real-Time Application Benchmarks
Measures performance for real-time use cases like audio processing and financial tick data.
./jmh-runner.sh RealTimeBenchmark
Parameters:
- Audio buffer sizes: 64, 128, 256, 512 samples
- Measures: Latency, throughput, and memory allocation overhead
- Scenarios: Audio processing, financial tick batches, sensor data filtering
- Includes real-time denoising benchmarks
9. Latency-Focused Benchmarks
Detailed latency analysis for real-time constraints.
./jmh-runner.sh LatencyBenchmark
Parameters:
- Signal sizes: 16, 32, 64, 128, 256 samples
- Measures: Percentile latencies (50th, 90th, 95th, 99th, 99.9th)
- Tests: Jitter, GC impact, thread contention, allocation overhead
- Wavelet comparison: Haar vs DB2 vs DB4 latency
- Recent Results: Haar ~107 ns/op, DB2 ~193 ns/op, DB4 ~294 ns/op (64 samples)
- Thread Safety: Fixed indexing collision with AtomicInteger
10. Cache Prefetch Optimization Benchmarks
Measures the impact of cache prefetching optimizations on large signal processing.
./jmh-runner.sh PrefetchBenchmark
Parameters:
- Signal sizes: 256, 1024, 4096, 16384, 65536 samples
- Wavelets: Haar, DB4, Sym8
- Compares: Standard vs prefetch-optimized transforms
- Measures: Impact of cache-friendly access patterns
- Includes: Multi-level transform prefetch benefits
- Baseline: Random access pattern to show cache miss impact
11. Small Signal Optimization
Focused benchmarks for very small signals common in real-time applications.
./jmh-runner.sh SmallSignalBenchmark
Parameters:
- Signal sizes: 8, 16, 32, 64, 128 samples
- Wavelets: Haar, DB2, DB4
- Measures: Optimizations for small buffer processing
- Platform-specific: Apple Silicon optimizations for 8-element signals
12. Phase 4 Optimization Benchmarks
Measures advanced optimization strategies.
./jmh-runner.sh Phase4OptimizationBenchmark
Parameters:
- Various optimization techniques
- Memory pooling efficiency
- Cache-aware transformations
13. General Optimization Benchmarks
Comprehensive optimization comparison.
./jmh-runner.sh OptimizationBenchmark
Parameters:
- Full optimization suite comparison
- Scalar vs SIMD vs cache-aware
- Memory allocation patterns
JMH Parameters
Customize benchmark execution with these parameters:
# Example: Run with custom warmup and measurement iterations
./jmh-runner.sh -wi 10 -i 20
# Example: Run specific benchmark with fork count
./jmh-runner.sh SignalSizeBenchmark -f 3
# Example: Override parameter values
./jmh-runner.sh SignalSizeBenchmark -p signalSize=2048,8192
Common Parameters:
- -wi N: Warmup iterations (default: 5)
- -i N: Measurement iterations (default: 10)
- -f N: Fork count (default: 2)
- -t N: Thread count (default: 1)
- -p param=value: Override @Param values
- -rf format: Result format (json, csv, text)
- -rff file: Result file
Interpreting Results
Throughput Metrics
- ops/s: Operations per second (higher is better)
- ms/op: Milliseconds per operation (lower is better)
- samples/sec: Signal samples processed per second
Understanding Scores
Benchmark                    Mode  Cnt     Score    Error  Units
SignalSizeBenchmark.forward  thrpt  10  1234.567 ± 12.345  ops/s
- Score: the measured throughput
- Error: the margin of JMH's 99.9% confidence interval (not a standard deviation)
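Since the Error column is a margin around the score, a quick stability check is the relative margin (error divided by score). A small illustrative helper (not part of VectorWave), using the numbers from the sample row:

```java
public class ScoreMargin {

    // Relative margin of a JMH result: error divided by score, as a percentage.
    // A margin under a few percent usually indicates a stable run.
    public static double relativeMarginPercent(double score, double error) {
        return 100.0 * error / score;
    }

    public static void main(String[] args) {
        // Values from the sample row: 1234.567 ± 12.345 ops/s
        double pct = relativeMarginPercent(1234.567, 12.345);
        System.out.printf("relative margin: %.2f%%%n", pct); // about 1%
    }
}
```

If the relative margin is large, increase warmup iterations or fork count (see Troubleshooting) before trusting the score.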
Performance Guidelines
Expected performance characteristics:
- Linear scaling: Transform time should scale linearly with signal size
- Wavelet complexity: Haar < DB2 < DB4 in terms of computation time
- Memory efficiency: No significant GC pressure during normal operation
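The linear-scaling expectation can be sanity-checked from measured ms/op figures: doubling the signal size should roughly double the time. A hedged sketch with hypothetical timings (this helper and its sample numbers are illustrative, not part of VectorWave):

```java
public class ScalingCheck {

    // Returns true if time grows roughly linearly with size: for each
    // consecutive pair, the time ratio should track the size ratio
    // within the given tolerance (e.g. 0.15 = 15%).
    public static boolean isRoughlyLinear(long[] sizes, double[] msPerOp, double tolerance) {
        for (int i = 1; i < sizes.length; i++) {
            double sizeRatio = (double) sizes[i] / sizes[i - 1];
            double timeRatio = msPerOp[i] / msPerOp[i - 1];
            if (Math.abs(timeRatio / sizeRatio - 1.0) > tolerance) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] sizes = {1024, 2048, 4096};   // signal sizes under test
        double[] times = {0.10, 0.21, 0.40}; // hypothetical ms/op results
        System.out.println(isRoughlyLinear(sizes, times, 0.15));
    }
}
```

A superlinear trend (time ratio well above the size ratio) often points at cache effects, which the PrefetchBenchmark above is designed to expose.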
Advanced Usage
Running from Source
# Using the JMH runner script (recommended)
./jmh-runner.sh SignalSizeBenchmark
# Or manually compile and run
mvn test-compile
java -cp target/test-classes:target/classes:$(mvn dependency:build-classpath -Dmdep.outputFile=/dev/stdout -q) \
org.openjdk.jmh.Main SignalSizeBenchmark
Profiling Integration
# Run with async profiler
./jmh-runner.sh SignalSizeBenchmark -prof async
# Run with GC profiler
./jmh-runner.sh ValidationBenchmark -prof gc
Custom Benchmark Development
Create new benchmarks as JMH-annotated classes:
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class MyBenchmark {

    @Benchmark
    public double myTest() {
        // benchmark code; return the result so the JIT cannot
        // eliminate the computation as dead code
        return 0.0;
    }
}
Troubleshooting
Out of Memory
The JMH runner script already sets heap size, but you can modify it in the benchmark's @Fork annotation or run manually:
java -Xmx8G -cp target/test-classes:target/classes:$(mvn dependency:build-classpath -Dmdep.outputFile=/dev/stdout -q) \
org.openjdk.jmh.Main SignalSizeBenchmark
Inconsistent Results
- Ensure no other CPU-intensive processes are running
- Use more warmup iterations: -wi 10
- Increase fork count: -f 3
Slow Execution
- Reduce iteration count for quick tests: -wi 1 -i 1
- Run specific benchmarks instead of all
Best Practices
- Isolation: Run benchmarks on a quiet system
- Warmup: Allow sufficient warmup for JIT compilation
- Multiple runs: Use forks to ensure consistency
- Baseline: Always compare against a known baseline
- Documentation: Record environment details with results
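For the last point, a small illustrative helper (not part of VectorWave) that gathers the environment details worth recording alongside benchmark results:

```java
public class EnvReport {

    // Collects JVM and hardware details that should accompany any
    // recorded benchmark result, so runs remain comparable over time.
    public static String report() {
        return "java.version=" + System.getProperty("java.version")
             + ", os=" + System.getProperty("os.name")
             + " " + System.getProperty("os.arch")
             + ", cores=" + Runtime.getRuntime().availableProcessors()
             + ", maxHeapMB=" + (Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Appending this line to a JMH result file (e.g. via -rff) makes later baseline comparisons much easier to interpret.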