Cross-Architecture HNSW Performance Degradation

info

performance

Updated Feb 18, 2025

Default Chroma HNSW builds are distributed as generic pre-built wheels that lack CPU-specific SIMD optimizations (SSE, AVX, AVX2, AVX-512), so vector distance calculations run slower than the hardware allows. Rebuilding HNSW on the target architecture enables hardware-accelerated vector operations, which can improve query throughput by 2-5x and reduce latency by 40-60%.

Sources

Performance Tips - Chroma Cookbook (cookbook.chromadb.dev)

GitHub - nmslib/hnswlib: Header-only C++/python library for fast approximate nearest neighbors (github.com)

Technologies:

Chroma: The root cause of this issue originates in Chroma.

How to detect:

Query latency and throughput are significantly worse than expected for dataset size and hardware specs. Performance varies across deployment architectures (x86 vs ARM, cloud vs on-prem). No SIMD instruction usage visible in CPU profiling. HNSW was installed via pre-built wheel rather than compiled for target architecture.
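Before profiling, it can help to confirm which SIMD extensions the host CPU advertises at all. The sketch below is a minimal, Linux-only illustration that parses `/proc/cpuinfo`; the function name `parse_simd_flags` and the flag list are assumptions made for this example, not part of Chroma or hnswlib. It reports what the CPU supports, not what the installed HNSW build actually uses, so it complements CPU profiling rather than replacing it.

```python
from pathlib import Path

# SIMD feature names as they appear in /proc/cpuinfo on Linux (illustrative list).
# x86 kernels list them under "flags"; ARM kernels use "Features" (e.g. "asimd" for NEON).
SIMD_FLAGS = ("sse", "sse2", "avx", "avx2", "avx512f", "asimd", "neon")


def parse_simd_flags(cpuinfo_text: str) -> set[str]:
    """Return the SIMD-related flags found in /proc/cpuinfo-style text."""
    found: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() in ("flags", "features"):
            found.update(flag for flag in value.split() if flag in SIMD_FLAGS)
    return found


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # only present on Linux
    if cpuinfo.exists():
        print(sorted(parse_simd_flags(cpuinfo.read_text())))
    else:
        print("/proc/cpuinfo not available on this platform")
```

If the CPU lists AVX2 or AVX-512 but profiling shows only scalar distance code, the generic wheel is a likely cause.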

Recommended action:

1. Diagnose: Profile query execution to identify the vector distance calculation bottleneck. Check whether HNSW was installed from a pre-built wheel (generic) or compiled from source. Compare performance across different deployment environments; significant variance indicates an architecture-specific issue. CPU profiling should show high time in distance calculations without SIMD instructions.

2. Rebuild HNSW (core package): Uninstall the existing package: `pip uninstall chroma-hnswlib`. Install from source: `pip install --no-binary :all: chroma-hnswlib`. This compiles HNSW with architecture-specific optimizations detected at build time.

3. Rebuild HNSW (server/container): For Docker deployments, rebuild the image with the REBUILD_HNSWLIB flag: `docker build --build-arg REBUILD_HNSWLIB=true -t chroma-optimized .`. This forces source compilation during the image build. For local server installs, follow the core package steps above.

4. Verify optimization: After the rebuild, check CPU profiling for SIMD instruction usage (AVX, SSE). Measure query latency and throughput; expect a 2-5x improvement. Verify consistent performance across deployment environments.

5. Document: Record the optimized build in deployment documentation. Include architecture-specific build steps in CI/CD pipelines. Test performance on target hardware before production deployment.
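To compare latency and throughput before and after a rebuild, a small timing harness run against the same data and queries is usually enough. The following is a rough sketch using only the Python standard library; `benchmark` and its parameters are hypothetical names chosen for this example, and the callable you pass in is whatever query you want to measure.

```python
import statistics
import time
from typing import Callable


def benchmark(run_query: Callable[[], object], repeats: int = 100, warmup: int = 5) -> dict[str, float]:
    """Time repeated calls to run_query and summarize latency and throughput."""
    # Warm-up calls let caches and lazy initialization settle before timing.
    for _ in range(warmup):
        run_query()

    latencies_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    total_s = sum(latencies_ms) / 1000.0
    return {
        "p50_ms": statistics.median(latencies_ms),
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        "p95_ms": statistics.quantiles(latencies_ms, n=20)[-1],
        "mean_ms": statistics.fmean(latencies_ms),
        "qps": repeats / total_s if total_s > 0 else float("inf"),
    }
```

Assuming an existing Chroma collection named `collection` and a query vector `vec`, a call might look like `benchmark(lambda: collection.query(query_embeddings=[vec], n_results=10))`. Run it once against the pre-built wheel and once after the rebuild, on the same hardware, and compare `p50_ms`, `p95_ms`, and `qps`.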