
Paper Reading: FusedMM — A Unified Sparse Kernel for Graph Embedding and GNNs

Paper Overview

FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks
(IPDPS 2021)

This paper addresses a fundamental performance issue in graph workloads:
excessive memory traffic caused by materializing intermediate messages.


Background: Why Graph Kernels Are Slow

Typical graph pipelines decompose computation into multiple phases:

  1. Similarity computation
  2. Edge weighting
  3. Feature aggregation

Each phase often materializes intermediate results, leading to:

  • High memory traffic
  • Poor cache locality
  • Limited scalability for large graphs

For sparse graphs, these costs dominate execution time.
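
To make the cost concrete, below is a minimal unfused baseline in C, assuming a CSR adjacency (`rowptr`, `colidx`), row-major `n x d` features `X`, and a dot-product similarity on each edge. The function and buffer names are mine, not the paper's; the point is that `scores` and `msgs` are fully materialized and the edge list is walked three times.

```c
/* Unfused baseline (illustrative): each phase writes its result to memory.
 * scores holds nnz per-edge similarities, msgs holds nnz x d weighted
 * messages, so both intermediates are streamed through DRAM. */
#include <stddef.h>

void unfused_pipeline(size_t n, size_t d, size_t nnz,
                      const size_t *rowptr, const size_t *colidx,
                      const float *X, float *scores, float *msgs, float *out)
{
    /* Phase 1: similarity computation (SDDMM) -> scores[e] */
    for (size_t u = 0; u < n; ++u)
        for (size_t e = rowptr[u]; e < rowptr[u + 1]; ++e) {
            float s = 0.0f;
            for (size_t k = 0; k < d; ++k)
                s += X[u * d + k] * X[colidx[e] * d + k];
            scores[e] = s;
        }

    /* Phase 2: edge weighting -> msgs[e * d + k] */
    for (size_t e = 0; e < nnz; ++e)
        for (size_t k = 0; k < d; ++k)
            msgs[e * d + k] = scores[e] * X[colidx[e] * d + k];

    /* Phase 3: feature aggregation (SpMM-like reduction) -> out[u * d + k] */
    for (size_t u = 0; u < n; ++u) {
        for (size_t k = 0; k < d; ++k)
            out[u * d + k] = 0.0f;
        for (size_t e = rowptr[u]; e < rowptr[u + 1]; ++e)
            for (size_t k = 0; k < d; ++k)
                out[u * d + k] += msgs[e * d + k];
    }
}
```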


Core Idea of FusedMM

FusedMM fuses multiple graph operations into a single traversal of the graph, avoiding intermediate storage.

Conceptually, it combines:

  • Sampled dense-dense matrix multiplication (SDDMM)
  • Sparse-dense matrix multiplication (SpMM)

into one kernel that computes and accumulates results on the fly.
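
Here is a minimal sketch of the fused loop under the same assumptions as the baseline above (CSR adjacency, row-major features, dot-product similarity); the identifiers are illustrative, not the paper's actual API. The per-edge score `s` now lives in a register and is consumed immediately, so neither `scores` nor `msgs` ever exists in memory:

```c
/* Fused SDDMM + SpMM in one pass over the edges (illustrative sketch).
 * The edge score is computed and consumed on the fly, so the only
 * memory traffic is the feature rows and the CSR structure itself. */
#include <stddef.h>

void fused_sddmm_spmm(size_t n, size_t d,
                      const size_t *rowptr, const size_t *colidx,
                      const float *X, float *out)
{
    for (size_t u = 0; u < n; ++u) {
        const float *xu = &X[u * d];
        float *ou = &out[u * d];
        for (size_t k = 0; k < d; ++k)
            ou[k] = 0.0f;

        for (size_t e = rowptr[u]; e < rowptr[u + 1]; ++e) {
            const float *xv = &X[colidx[e] * d];

            /* SDDMM part: per-edge similarity, kept in a register */
            float s = 0.0f;
            for (size_t k = 0; k < d; ++k)
                s += xu[k] * xv[k];

            /* SpMM part: scale and accumulate immediately */
            for (size_t k = 0; k < d; ++k)
                ou[k] += s * xv[k];
        }
    }
}
```

In the paper, the per-edge similarity and the accumulation are expressed as user-definable operations, which is how one kernel covers both graph-embedding objectives and GNN aggregation.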


Performance Perspective (HPC View)

From a performance engineering standpoint, FusedMM succeeds because it:

  • Minimizes memory traffic
  • Improves data locality
  • Keeps intermediate values in registers
  • Aligns computation with memory bandwidth limits

Roofline analysis shows that the kernel operates close to the memory-bandwidth roof, the practical upper bound for workloads with such low arithmetic intensity.
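
A back-of-envelope estimate makes this concrete. Assuming fp32 features, a d-dimensional dot product plus a scaled accumulate per edge, and one DRAM read of each neighbor row per edge (my assumptions, including `d = 128`, not numbers from the paper), the arithmetic intensity works out to roughly one flop per byte:

```c
/* Rough arithmetic-intensity estimate for the fused kernel
 * (illustrative assumptions, not measurements from the paper). */
#include <stdio.h>

int main(void)
{
    double d = 128.0;                  /* feature dimension (assumed)   */
    double flops_per_edge = 4.0 * d;   /* 2d for the dot, 2d for axpy   */
    double bytes_per_edge = 4.0 * d    /* one fp32 neighbor row         */
                          + 8.0;       /* one 64-bit column index       */

    printf("arithmetic intensity ~ %.2f flops/byte\n",
           flops_per_edge / bytes_per_edge);   /* ~0.98 for d = 128 */
    return 0;
}
```

Since modern CPUs sustain far more than one flop per byte of DRAM bandwidth, the kernel sits on the bandwidth roof no matter how well the arithmetic itself is tuned.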


CPU Optimization Insights

On CPUs, FusedMM benefits from:

  • SIMD vectorization over feature dimensions
  • Register blocking for vertex features
  • Load-balanced partitioning by nonzeros (sketched below)

These optimizations allow the kernel to scale with memory bandwidth rather than being compute-limited.
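
Of these, the partitioning step is the easiest to show compactly. Below is a sketch of nonzero-balanced partitioning that exploits the fact that a CSR `rowptr` is already a prefix sum of degrees: each thread gets a contiguous row range containing roughly `nnz / nthreads` edges. The function name and the binary-search approach are illustrative; the paper's scheduler may differ in detail.

```c
/* Load balancing by nonzeros (illustrative sketch): bounds[t]..bounds[t+1]
 * is the row range owned by thread t, chosen so each range holds about
 * nnz / nthreads edges. bounds must have nthreads + 1 entries. */
#include <stddef.h>

void partition_by_nnz(size_t n, const size_t *rowptr,
                      size_t nthreads, size_t *bounds)
{
    size_t nnz = rowptr[n];
    bounds[0] = 0;
    for (size_t t = 1; t < nthreads; ++t) {
        size_t target = t * nnz / nthreads;
        /* first row whose edge prefix reaches the target */
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (rowptr[mid] < target)
                lo = mid + 1;
            else
                hi = mid;
        }
        bounds[t] = lo;
    }
    bounds[nthreads] = n;
}
```

Thread `t` then runs the fused loop over rows `bounds[t]` to `bounds[t+1]`, so a few high-degree vertices no longer leave most cores idle.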


GPU Considerations and Limitations

Although GPUs offer massive compute throughput, FusedMM remains largely memory-bound.

Key challenges include:

  • Irregular, data-dependent memory access patterns
  • The need for warp-level reductions to avoid expensive atomic updates
  • Limited benefit from Tensor Cores due to low arithmetic intensity

This highlights an important lesson: more compute does not always translate to more performance.


Results and Impact

The paper reports:

  • Up to 34× speedup over baseline implementations
  • Significant end-to-end improvements in GNN training
  • Robust performance across CPU architectures

Most importantly, the performance gains are explained analytically, not just empirically.


Critical Reflection

While FusedMM is highly effective for memory-bound graph workloads:

  • Benefits decrease when messages must be reused
  • Less suitable when intermediate results must be materialized
  • Algorithm redesign is required for GPUs to fully exploit hardware features

Understanding these limits is as important as understanding the speedups.


Why This Paper Matters

This paper is an excellent example of hardware-aware algorithm design, demonstrating how:

  • Performance models guide optimization
  • Kernel fusion reduces memory bottlenecks
  • Analytical reasoning complements benchmarking

You can view my presentation here 😄