
Paper Reading: FusedMM — A Unified Sparse Kernel for Graph Embedding and GNNs

Paper Overview

FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks
(IPDPS 2021)

This paper addresses a fundamental performance issue in graph workloads:
excessive memory traffic caused by materializing intermediate messages.


Background: Why Graph Kernels Are Slow

Typical graph pipelines decompose computation into multiple phases:

  1. Similarity computation
  2. Edge weighting
  3. Feature aggregation

Each phase often materializes intermediate results, leading to:

  • High memory traffic
  • Poor cache locality
  • Limited scalability for large graphs

For sparse graphs, these costs dominate execution time.
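
To make the cost concrete, below is a minimal unfused baseline in C, assuming a CSR adjacency (`rowptr`, `colidx`), row-major `n x d` features `X`, and a dot-product similarity on each edge. The function and buffer names are mine, not the paper's; the point is that `scores` and `msgs` are fully materialized and the edge list is walked three times.

```c
/* Unfused baseline (illustrative): each phase writes its result to memory.
 * scores holds nnz per-edge similarities, msgs holds nnz x d weighted
 * messages, so both intermediates are streamed through DRAM. */
#include <stddef.h>

void unfused_pipeline(size_t n, size_t d, size_t nnz,
                      const size_t *rowptr, const size_t *colidx,
                      const float *X, float *scores, float *msgs, float *out)
{
    /* Phase 1: similarity computation (SDDMM) -> scores[e] */
    for (size_t u = 0; u < n; ++u)
        for (size_t e = rowptr[u]; e < rowptr[u + 1]; ++e) {
            float s = 0.0f;
            for (size_t k = 0; k < d; ++k)
                s += X[u * d + k] * X[colidx[e] * d + k];
            scores[e] = s;
        }

    /* Phase 2: edge weighting -> msgs[e * d + k] */
    for (size_t e = 0; e < nnz; ++e)
        for (size_t k = 0; k < d; ++k)
            msgs[e * d + k] = scores[e] * X[colidx[e] * d + k];

    /* Phase 3: feature aggregation (SpMM-like reduction) -> out[u * d + k] */
    for (size_t u = 0; u < n; ++u) {
        for (size_t k = 0; k < d; ++k)
            out[u * d + k] = 0.0f;
        for (size_t e = rowptr[u]; e < rowptr[u + 1]; ++e)
            for (size_t k = 0; k < d; ++k)
                out[u * d + k] += msgs[e * d + k];
    }
}
```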


Core Idea of FusedMM

FusedMM fuses multiple graph operations into a single traversal of the graph, avoiding intermediate storage.

Conceptually, it combines:

  • Sampled dense-dense matrix multiplication (SDDMM)
  • Sparse-dense matrix multiplication (SpMM)

into one kernel that computes and accumulates results on the fly.
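
Here is a minimal sketch of the fused loop under the same assumptions as the baseline above (CSR adjacency, row-major features, dot-product similarity); the identifiers are illustrative, not the paper's actual API. The per-edge score `s` now lives in a register and is consumed immediately, so neither `scores` nor `msgs` ever exists in memory:

```c
/* Fused SDDMM + SpMM in one pass over the edges (illustrative sketch).
 * The edge score is computed and consumed on the fly, so the only
 * memory traffic is the feature rows and the CSR structure itself. */
#include <stddef.h>

void fused_sddmm_spmm(size_t n, size_t d,
                      const size_t *rowptr, const size_t *colidx,
                      const float *X, float *out)
{
    for (size_t u = 0; u < n; ++u) {
        const float *xu = &X[u * d];
        float *ou = &out[u * d];
        for (size_t k = 0; k < d; ++k)
            ou[k] = 0.0f;

        for (size_t e = rowptr[u]; e < rowptr[u + 1]; ++e) {
            const float *xv = &X[colidx[e] * d];

            /* SDDMM part: per-edge similarity, kept in a register */
            float s = 0.0f;
            for (size_t k = 0; k < d; ++k)
                s += xu[k] * xv[k];

            /* SpMM part: scale and accumulate immediately */
            for (size_t k = 0; k < d; ++k)
                ou[k] += s * xv[k];
        }
    }
}
```

In the paper, the per-edge similarity and the accumulation are expressed as user-definable operations, which is how one kernel covers both graph-embedding objectives and GNN aggregation.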


Performance Perspective (HPC View)

From a performance engineering standpoint, FusedMM succeeds because it:

  • Minimizes memory traffic
  • Improves data locality
  • Keeps intermediate values in registers
  • Aligns computation with memory bandwidth limits

Roofline analysis shows that the kernel operates close to the memory-bandwidth roof, the practical upper bound for workloads with such low arithmetic intensity.
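
A back-of-envelope estimate makes this concrete. Assuming fp32 features, a d-dimensional dot product plus a scaled accumulate per edge, and one DRAM read of each neighbor row per edge (my assumptions, including `d = 128`, not numbers from the paper), the arithmetic intensity works out to roughly one flop per byte:

```c
/* Rough arithmetic-intensity estimate for the fused kernel
 * (illustrative assumptions, not measurements from the paper). */
#include <stdio.h>

int main(void)
{
    double d = 128.0;                  /* feature dimension (assumed)   */
    double flops_per_edge = 4.0 * d;   /* 2d for the dot, 2d for axpy   */
    double bytes_per_edge = 4.0 * d    /* one fp32 neighbor row         */
                          + 8.0;       /* one 64-bit column index       */

    printf("arithmetic intensity ~ %.2f flops/byte\n",
           flops_per_edge / bytes_per_edge);   /* ~0.98 for d = 128 */
    return 0;
}
```

Since modern CPUs sustain far more than one flop per byte of DRAM bandwidth, the kernel sits on the bandwidth roof no matter how well the arithmetic itself is tuned.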


CPU Optimization Insights

On CPUs, FusedMM benefits from:

  • SIMD vectorization over feature dimensions
  • Register blocking for vertex features
  • Load-balanced partitioning by nonzeros (sketched below)

These optimizations allow the kernel to scale with memory bandwidth rather than being compute-limited.
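
Of these, the partitioning step is the easiest to show compactly. Below is a sketch of nonzero-balanced partitioning that exploits the fact that a CSR `rowptr` is already a prefix sum of degrees: each thread gets a contiguous row range containing roughly `nnz / nthreads` edges. The function name and the binary-search approach are illustrative; the paper's scheduler may differ in detail.

```c
/* Load balancing by nonzeros (illustrative sketch): bounds[t]..bounds[t+1]
 * is the row range owned by thread t, chosen so each range holds about
 * nnz / nthreads edges. bounds must have nthreads + 1 entries. */
#include <stddef.h>

void partition_by_nnz(size_t n, const size_t *rowptr,
                      size_t nthreads, size_t *bounds)
{
    size_t nnz = rowptr[n];
    bounds[0] = 0;
    for (size_t t = 1; t < nthreads; ++t) {
        size_t target = t * nnz / nthreads;
        /* first row whose edge prefix reaches the target */
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (rowptr[mid] < target)
                lo = mid + 1;
            else
                hi = mid;
        }
        bounds[t] = lo;
    }
    bounds[nthreads] = n;
}
```

Thread `t` then runs the fused loop over rows `bounds[t]` to `bounds[t+1]`, so a few high-degree vertices no longer leave most cores idle.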


GPU Considerations and Limitations

Although GPUs offer massive compute throughput, FusedMM remains largely memory-bound.

Key challenges include:

  • Irregular, data-dependent memory access patterns
  • The need for warp-level reductions to avoid expensive atomic updates
  • Limited benefit from Tensor Cores due to low arithmetic intensity

This highlights an important lesson: more compute does not always translate to more performance.


Results and Impact

The paper reports:

  • Up to 34× speedup over baseline implementations
  • Significant end-to-end improvements in GNN training
  • Robust performance across CPU architectures

Most importantly, the performance gains are explained analytically, not just empirically.


Critical Reflection

While FusedMM is highly effective for memory-bound graph workloads:

  • Benefits decrease when messages must be reused
  • Less suitable when intermediate results must be materialized
  • Algorithm redesign is required for GPUs to fully exploit hardware features

Understanding these limits is as important as understanding the speedups.


Why This Paper Matters

This paper is an excellent example of hardware-aware algorithm design, demonstrating how:

  • Performance models guide optimization
  • Kernel fusion reduces memory bottlenecks
  • Analytical reasoning complements benchmarking

You can view my presentation here 😄