Skip to content

Hardware-Aware Dispatch

Context

Different GPU architectures offer distinct instruction sets and performance characteristics:

Architecture SM Version Key Features
Hopper SM 90 WGMMA, TMA, warp specialization, FP8

A single kernel implementation cannot optimally serve all architectures.

Decision

Each Op declares a default_kernel_map that maps SM architecture identifiers to Kernel classes:

class MultiHeadAttentionFwdOp(Op):
    @property
    def default_kernel_map(self):
        return {
            "hopper": FlashAttnV3FwdKernel,
        }

At runtime, dispatch_kernel() queries the GPU's SM version via get_sm_version() and selects the matching kernel.

SM Version Detection

# tileops/utils/
def get_sm_version() -> int:
    """Returns SM version of the current CUDA device (e.g., 80, 86, 90)."""

Consequences

  • Ops automatically use the best kernel for the current hardware
  • New architectures can be added by extending the kernel map
  • Users can override the kernel map for custom dispatch logic
  • Fallback behavior when no architecture-specific kernel exists