GEMM Benchmarks

Performance results for matrix multiplication operators.

Note

This page will be populated with benchmark data from nightly CI runs.

Dense GEMM

Shape (M, N, K) | Dtype | TileOPs (ms) | cuBLAS (ms) | Speedup | TFLOPS
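The Speedup and TFLOPS columns can be derived from the two timing columns. The sketch below assumes the conventional count of 2 × M × N × K floating-point operations for a dense GEMM and takes speedup to mean the cuBLAS time divided by the TileOPs time. Both conventions and the helper names (`gemm_tflops`, `speedup`) are illustrative assumptions, not taken from the TileOPs benchmark harness.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Achieved TFLOPS for one (M, N, K) GEMM that took time_ms milliseconds.

    Assumes 2 * M * N * K floating-point operations (one multiply and one
    add per inner-product term), which is the usual convention.
    """
    flops = 2 * m * n * k
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Speedup of the candidate over the baseline (assumed: baseline / candidate)."""
    return baseline_ms / candidate_ms


if __name__ == "__main__":
    # Hypothetical timings, for illustration only (not measured results).
    print(f"{gemm_tflops(4096, 4096, 4096, time_ms=1.0):.1f} TFLOPS")
    print(f"{speedup(baseline_ms=2.0, candidate_ms=1.0):.2f}x")
```

Under these assumptions, a 4096 × 4096 × 4096 GEMM that completes in 1.0 ms corresponds to roughly 137.4 TFLOPS.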

GEMV

Shape (M, N) | Dtype | TileOPs (ms) | PyTorch (ms) | Speedup | Bandwidth (GB/s)
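GEMV is typically limited by memory traffic rather than arithmetic, which is presumably why this table reports bandwidth instead of TFLOPS. The sketch below shows one way the Bandwidth column could be computed. It assumes the M × N matrix and the length-N input vector are each read once and the length-M output vector is written once, all in the same dtype. The byte-count model, the `DTYPE_BYTES` table, and the function name are assumptions for illustration, not the definition used by the nightly CI.

```python
# Bytes per element for a few common dtypes (illustrative subset).
DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4}


def gemv_bandwidth_gbs(m: int, n: int, dtype: str, time_ms: float) -> float:
    """Effective memory bandwidth in GB/s for y = A @ x with A of shape (M, N).

    Assumed traffic: read A (M * N elements), read x (N elements),
    write y (M elements), all stored in `dtype`.
    """
    elem_bytes = DTYPE_BYTES[dtype]
    bytes_moved = (m * n + n + m) * elem_bytes
    seconds = time_ms * 1e-3
    return bytes_moved / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing, for illustration only (not a measured result).
    print(f"{gemv_bandwidth_gbs(4096, 4096, 'float16', time_ms=0.1):.1f} GB/s")
```

Comparing this figure against the device's peak memory bandwidth gives a rough sense of how close a kernel is to the hardware limit.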

Grouped GEMM

Groups × Shape | Dtype | TileOPs (ms) | PyTorch (ms) | Speedup | TFLOPS
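For grouped GEMM, a single launch covers several independent problems, so an aggregate TFLOPS figure has to sum the work across groups. The sketch below assumes each group is described by its own (M, N, K) triple, that each contributes 2 × M × N × K operations, and that the reported time covers the whole grouped call. The function name and the input layout are hypothetical.

```python
from typing import Iterable, Tuple


def grouped_gemm_tflops(shapes: Iterable[Tuple[int, int, int]], time_ms: float) -> float:
    """Aggregate TFLOPS for a grouped GEMM.

    `shapes` holds one (M, N, K) triple per group; `time_ms` is assumed to be
    the wall time of the whole grouped call, not of a single group.
    """
    total_flops = sum(2 * m * n * k for m, n, k in shapes)
    seconds = time_ms * 1e-3
    return total_flops / seconds / 1e12


if __name__ == "__main__":
    # Two hypothetical groups with different M; timing is illustrative only.
    groups = [(128, 256, 512), (64, 256, 512)]
    print(f"{grouped_gemm_tflops(groups, time_ms=0.05):.3f} TFLOPS")
```

Summing per-group work rather than multiplying one shape by the group count keeps the calculation valid when group sizes differ.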