Paper Overview
FusedMM: A Unified SDDMM–SpMM Kernel for Graph Embedding and Graph Neural Networks
(IPDPS 2021)
This paper addresses a fundamental performance issue in graph workloads:
excessive memory traffic caused by materializing intermediate messages.
Background: Why Graph Kernels Are Slow
Typical graph embedding and GNN pipelines decompose their core computation into multiple phases:
- Similarity computation
- Edge weighting
- Feature aggregation
Each phase often materializes per-edge intermediate results, whose size scales with the number of edges, leading to:
- High memory traffic
- Poor cache locality
- Limited scalability for large graphs
For sparse graphs, these costs dominate execution time.
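To make the pattern concrete, here is a minimal unfused pipeline in Python (NumPy/SciPy). This is a sketch of the general pattern, not the paper's code, and the sigmoid edge weighting is just one illustrative choice. Note how phases 1 and 2 each write an array with one entry per edge, which phase 3 must then read back from memory.

```python
import numpy as np
import scipy.sparse as sp

def unfused_propagate(A: sp.csr_matrix, H: np.ndarray) -> np.ndarray:
    """Unfused pipeline: every phase writes a full per-edge intermediate.

    A: n x n adjacency matrix in CSR format, H: n x d dense feature matrix.
    """
    src, dst = A.nonzero()                        # edge list, nnz entries

    # Phase 1: similarity computation (SDDMM) -> nnz-sized intermediate
    sim = np.einsum('ij,ij->i', H[src], H[dst])

    # Phase 2: edge weighting -> another nnz-sized intermediate
    w = 1.0 / (1.0 + np.exp(-sim))                # illustrative sigmoid weighting

    # Phase 3: feature aggregation (SpMM) -> re-reads the intermediates
    S = sp.csr_matrix((w, (src, dst)), shape=A.shape)
    return S @ H
```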
Core Idea of FusedMM
FusedMM fuses multiple graph operations into a single traversal of the graph, avoiding intermediate storage.
Conceptually, it combines:
- Sampled dense-dense matrix multiplication (SDDMM)
- Sparse-dense matrix multiplication (SpMM)
into one kernel that computes and accumulates results on the fly.
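Conceptually, the fused traversal looks like the sketch below (pure-Python loops for clarity; the actual FusedMM kernel is hand-optimized, vectorized code). Similarity, weighting, and accumulation happen edge by edge, so only an O(d) accumulator stays live instead of nnz-sized arrays.

```python
import numpy as np
import scipy.sparse as sp

def fused_propagate(A: sp.csr_matrix, H: np.ndarray) -> np.ndarray:
    """Fused SDDMM + SpMM: one traversal of the graph, no per-edge arrays."""
    n, d = H.shape
    out = np.zeros((n, d), dtype=H.dtype)
    indptr, indices = A.indptr, A.indices

    for i in range(n):                        # single pass over the sparsity
        hi = H[i]
        acc = np.zeros(d, dtype=H.dtype)      # per-vertex accumulator only
        for j in indices[indptr[i]:indptr[i + 1]]:
            sim = hi @ H[j]                   # SDDMM part: one scalar per edge
            w = 1.0 / (1.0 + np.exp(-sim))    # same illustrative weighting
            acc += w * H[j]                   # SpMM part, accumulated on the fly
        out[i] = acc
    return out
```

For the same A and H this produces the same result as the unfused sketch above, but the nnz-sized similarity and weight arrays never exist.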
Performance Perspective (HPC View)
From a performance engineering standpoint, FusedMM succeeds because it:
- Minimizes memory traffic
- Improves data locality
- Keeps intermediate values in registers
- Aligns computation with memory bandwidth limits
Roofline analysis shows that the kernel approaches the memory-bandwidth roof, the theoretical upper bound for workloads with such low arithmetic intensity.
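A back-of-the-envelope roofline check, with all machine numbers below assumed for illustration (they are not figures from the paper): per edge, the fused kernel performs on the order of 3d flops while streaming roughly one d-length feature row from memory, so arithmetic intensity stays around one flop per byte and the bandwidth roof bounds attainable performance.

```python
# Illustrative roofline estimate; the machine numbers are assumptions, not from the paper.
d = 128                       # feature dimension
flops_per_edge = 3 * d        # dot product + scale + accumulate (approximate)
bytes_per_edge = 4 * d + 8    # one float32 feature row plus index metadata (approximate)
intensity = flops_per_edge / bytes_per_edge           # ~0.74 flop/byte

peak_flops = 2.0e12           # assumed 2 TFLOP/s peak compute
peak_bw = 100.0e9             # assumed 100 GB/s memory bandwidth

attainable = min(peak_flops, intensity * peak_bw)     # classic roofline bound
print(f"intensity = {intensity:.2f} flop/byte, bound = {attainable / 1e9:.0f} GFLOP/s")
# Prints a bound of roughly 74 GFLOP/s, set by bandwidth and far below peak compute.
```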
CPU Optimization Insights
On CPUs, FusedMM benefits from:
- SIMD vectorization over feature dimensions
- Register blocking for vertex features
- Load-balanced partitioning by nonzeros
These optimizations let the kernel's performance scale with available memory bandwidth rather than be limited by instruction throughput.
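As a concrete illustration of the load-balancing point (a sketch of the general idea, not the paper's scheduler): split rows so that each thread receives roughly the same number of nonzeros rather than the same number of rows, which matters on graphs with skewed degree distributions.

```python
import numpy as np
import scipy.sparse as sp

def partition_rows_by_nnz(A: sp.csr_matrix, num_threads: int):
    """Split CSR rows into contiguous chunks holding ~equal numbers of nonzeros."""
    targets = np.linspace(0, A.nnz, num_threads + 1)
    # indptr holds the running nonzero count at each row boundary, so searching
    # it for evenly spaced nonzero targets yields balanced row ranges.
    boundaries = np.searchsorted(A.indptr, targets)
    boundaries[0], boundaries[-1] = 0, A.shape[0]
    return [(int(boundaries[t]), int(boundaries[t + 1])) for t in range(num_threads)]

# Example usage on a random sparse matrix.
A = sp.random(1000, 1000, density=0.01, format='csr')
for lo, hi in partition_rows_by_nnz(A, 4):
    print(f"rows [{lo}, {hi}): nnz = {A.indptr[hi] - A.indptr[lo]}")
```

The sketch covers only the partitioning aspect; the SIMD and register-blocking points from the list above sit below the level of abstraction Python can express.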
GPU Considerations and Limitations
Although GPUs offer massive compute throughput, FusedMM remains largely memory-bound.
Key challenges include:
- Irregular memory access patterns driven by the graph's sparsity
- The need for warp-level reductions (or similar schemes) to avoid costly atomic updates
- Limited benefit from Tensor Cores due to low arithmetic intensity
This highlights an important lesson: more compute does not always translate to more performance.
Results and Impact
The paper reports:
- Up to 34× speedup over baseline implementations
- Significant end-to-end improvements in GNN training
- Robust performance across CPU architectures
Most importantly, the performance gains are explained analytically, not just empirically.
Critical Reflection
While FusedMM is highly effective for memory-bound graph workloads:
- Benefits decrease when per-edge messages must be reused by later computations
- It is less suitable when intermediate results must be materialized anyway (for example, for a backward pass)
- Algorithm redesign is required for GPUs to fully exploit their hardware features
Understanding these limits is as important as understanding the speedups.
Why This Paper Matters
This paper is an excellent example of hardware-aware algorithm design, demonstrating how:
- Performance models guide optimization
- Kernel fusion reduces memory bottlenecks
- Analytical reasoning complements benchmarking
You can view my presentation here 😄
