[GPU] Dynamic Dst Scale #4245
base: main
Conversation
make test

Force-pushed f58ea11 to 0c409b2
src/gpu/intel/mx_scale.cl (Outdated)
```c
float scale_val
        = cvt_e8m0_to_f32(cvt_f32_to_e8m0(max_group)) / DST_DATA_FMAX;
```
The CPU reference implementation also seems to be missing it (and the comment is missing a closing paren), but:

oneDNN/src/cpu/matmul/ref_matmul.cpp, lines 324 to 327 in a031d34:

```c
// MXSPEC does round_down_pow2(dst_d.data_type() /
// round_down_pow2(max_dst_group) so the rounding
// to a power of two happens before the division,
// and not after.
```

Suggested change:

```diff
-float scale_val
-        = cvt_e8m0_to_f32(cvt_f32_to_e8m0(max_group)) / DST_DATA_FMAX;
+#define E8M0(x) cvt_e8m0_to_f32(cvt_f32_to_e8m0(x))
+float scale_val = E8M0(max_group) / E8M0(DST_DATA_FMAX);
+#undef E8M0
```
Since scale_val can be outside the range of e8m0, this will need an additional outer E8M0. Without it, we'd be scaling with and storing different values. Consider: all values in the group are zero.
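To illustrate the all-zero-group case, here is a hedged Python sketch. The `e8m0` helper, the `DST_FMAX` value, and the rounding mode are my own stand-ins for the kernel's `cvt_*` helpers, not oneDNN code:

```python
import math

# Toy e8m0 round-trip (my own sketch, not the oneDNN helpers): round a
# positive float down to a power of two and clamp the exponent to the
# e8m0 range [-127, 127].  Zero has no e8m0 encoding, so it clamps to
# the smallest representable scale, 2**-127.
def e8m0(x):
    if x <= 0.0:
        return 2.0 ** -127
    e = math.floor(math.log2(x))  # rounding mode is a guess here
    return 2.0 ** max(-127, min(127, e))

DST_FMAX = 448.0  # fp8 e4m3 max, used here as an example dst type

# All-zero group: the inner quotient falls below the e8m0 range, so the
# value used for scaling and the value stored as the scale would differ
# unless the quotient is passed through e8m0 once more.
max_group = 0.0
inner = e8m0(max_group) / DST_FMAX  # ~2**-135.8, outside e8m0 range
scale = e8m0(inner)                 # the outer E8M0 clamps to 2**-127
```

Without the outer rounding, the kernel would divide by `inner` but store `scale`, which differ here.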
fixed, thanks.
@dzarukin a similar change is required to make the benchdnn ref implementation align with this behavior, added as part of this PR.
Summoning @mgouicem as an architect of the feature.
```cpp
for (int i = 0; i < 16; i += 4) {
    h->mov(4, tmp.ud(0)(1), max.ud(i)(1));
    h->sel(4 | ge, max.f(0), max.f(0)(1), tmp.f(0)(1));
}
h->mov(2, tmp.ud(0)(1), max.ud(2)(1));
h->sel(2 | ge, max, max.f(0)(1), tmp.f(0)(1));
h->mov(2, tmp.ud(0)(1), max.ud(1)(1));
h->sel(1 | ge, max, max.f(0)(1), tmp.f(0)(1));
```
Would this work:

```diff
-for (int i = 0; i < 16; i += 4) {
-    h->mov(4, tmp.ud(0)(1), max.ud(i)(1));
-    h->sel(4 | ge, max.f(0), max.f(0)(1), tmp.f(0)(1));
-}
-h->mov(2, tmp.ud(0)(1), max.ud(2)(1));
-h->sel(2 | ge, max, max.f(0)(1), tmp.f(0)(1));
-h->mov(2, tmp.ud(0)(1), max.ud(1)(1));
-h->sel(1 | ge, max, max.f(0)(1), tmp.f(0)(1));
+h->sel(8 | ge, max.ud(0)(1), max.ud(0)(2), max.ud(1)(2));
+h->sel(4 | ge, max.ud(0)(1), max.ud(0)(2), max.ud(1)(2));
+h->sel(3 | ge, max.ud(0)(1), max.ud(0)(2), max.ud(1)(2));
+h->sel(1 | ge, max.ud(0)(1), max.ud(0)(0), max.ud(1)(0));
```
I guess there'd need to be some special handling of nan/infs before this sequence to avoid propagating them.
@atkassen -- you can add 0x80400000:ud to the inputs prior to the max, and then subtract after the max.
Another optimization to consider (maybe not in this PR) is a fully vectorized horizontal reduction, where you recombine partly reduced vectors as you go, so that you can get full SIMD usage at each stage -- vISA example here (DEFINE_HREDUCE16_FLOAT).
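A scalar C model of that fully vectorized reduction shape (a sketch of the recombination idea, not the vISA `DEFINE_HREDUCE16_FLOAT` macro itself): each stage combines the two halves of the vector element-wise, so a SIMD machine keeps width/2, then width/4, ... lanes busy per instruction instead of serializing the reduction.

```c
#include <string.h>

/* Horizontal max of 16 floats via log2(16) = 4 halving stages; each
 * inner loop corresponds to one full-width SIMD sel in the real code. */
float hmax16(const float v[16]) {
    float t[16];
    memcpy(t, v, sizeof t);
    for (int width = 8; width >= 1; width /= 2)
        for (int i = 0; i < width; ++i)
            t[i] = t[i] >= t[i + width] ? t[i] : t[i + width];
    return t[0];
}
```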
Force-pushed 0c409b2 to b4cb984

Force-pushed b4cb984 to 3237d6b
make test

Force-pushed 820c954 to 158d43c

make test

Force-pushed 158d43c to 5fc7b98

make test
Description

Enable MXFP4/FP8 dynamic dst scale generation in the reference and JIT implementations.

Fixes MFDNN-14330
Checklist

General

Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?