
Conversation


@kealan-barbieri kealan-barbieri commented Oct 30, 2025

Description

Enable MXFP4/FP8 dynamic dst scale generation in ref, JIT.

Fixes: MFDNN-14330
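For context, dynamic dst scale generation computes one scale per group of destination values from the group's absolute maximum, then divides the values by that scale so they fit the destination range. A minimal scalar sketch — not the oneDNN implementation; `GROUP`, `DST_FMAX`, and `quantize_group` are illustrative names, using the f8_e4m3 largest finite value of 448:

```c
#include <assert.h>
#include <math.h>

/* Illustrative sketch only: per-group dynamic dst scaling. Each group of
 * GROUP elements gets scale = max|x| / DST_FMAX, and the stored values
 * are x / scale, so the largest magnitude maps to DST_FMAX. */
#define GROUP 4
#define DST_FMAX 448.f /* f8_e4m3 largest finite value */

static void quantize_group(const float *src, float *dst, float *scale) {
    float m = 0.f;
    for (int i = 0; i < GROUP; i++) {
        float a = fabsf(src[i]);
        if (a > m) m = a;
    }
    /* An all-zero group would give scale == 0; fall back to 1. */
    *scale = (m == 0.f) ? 1.f : m / DST_FMAX;
    for (int i = 0; i < GROUP; i++)
        dst[i] = src[i] / *scale;
}
```

A real MX-style implementation additionally rounds the scale to a power of two (e8m0); that rounding is what the review discussion below is about.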

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

@kealan-barbieri kealan-barbieri requested review from a team as code owners October 30, 2025 01:08
@github-actions github-actions bot added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel component:tests Codeowner: @oneapi-src/onednn-arch component:common labels Oct 30, 2025
@kealan-barbieri (author) commented:

make test
disable benchdnn_all
set test_scope=NIGHTLY
enable benchdnn_matmul
disable test_device_cpu
enable test_device_gpu
enable arch_gpu_xe-hpc
enable arch_gpu_xe-hpg-atsm
enable arch_gpu_xe-hpg-dg2
enable arch_gpu_xe-lp
enable arch_gpu_xe-lpg
enable arch_gpu_xe-lpg+
enable arch_gpu_xe2-hpg-bmg
enable arch_gpu_xe2-lpg

@kealan-barbieri kealan-barbieri force-pushed the kealanba/dyn_scale_main branch 2 times, most recently from f58ea11 to 0c409b2 Compare October 30, 2025 16:42
Comment on lines 56 to 57
float scale_val
= cvt_e8m0_to_f32(cvt_f32_to_e8m0(max_group)) / DST_DATA_FMAX;
@atkassen commented Oct 30, 2025:

The CPU reference implementation also seems to be missing it (and the comment is missing a closing paren), but

// MXSPEC does round_down_pow2(dst_d.data_type() /
// round_down_pow2(max_dst_group) so the rounding
// to a power of two happens before the division,
// and not after.
reads like it should be

Suggested change
float scale_val
= cvt_e8m0_to_f32(cvt_f32_to_e8m0(max_group)) / DST_DATA_FMAX;
#define E8M0(x) cvt_e8m0_to_f32(cvt_f32_to_e8m0(x))
float scale_val = E8M0(max_group) / E8M0(DST_DATA_FMAX);
#undef E8M0

Since scale_val can be outside the range of e8m0, this will need an additional outer E8M0. Without it, we'd be scaling with and storing different values. Consider: all values in the group are zero.
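To make the rounding-order point concrete, here is a scalar model of the two conversions — an assumption-laden sketch, not the library code: `cvt_f32_to_e8m0` is modeled as round-down-to-power-of-two, and only positive finite inputs are handled:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* e8m0 stores only an 8-bit biased exponent, so every representable
 * value is a power of two. Modeled here as round-DOWN to a power of
 * two; positive finite inputs only (zero and NaN need the special
 * handling discussed in this thread). */
static uint8_t cvt_f32_to_e8m0(float x) {
    int e;
    frexpf(x, &e);              /* x = m * 2^e with m in [0.5, 1) */
    int biased = (e - 1) + 127; /* floor(log2(x)) + bias */
    if (biased < 0) biased = 0;
    if (biased > 254) biased = 254; /* 0xFF is the NaN encoding */
    return (uint8_t)biased;
}

static float cvt_e8m0_to_f32(uint8_t b) {
    return ldexpf(1.f, (int)b - 127); /* 2^(b - 127) */
}
```

With DST_DATA_FMAX = 448 (f8_e4m3) and max_group = 100, rounding each operand first gives 64 / 256 = 0.25, while rounding the quotient gives round_down_pow2(0.223…) = 0.125 — the two orders genuinely disagree, which is why the suggested change divides two e8m0-rounded values.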

@kealan-barbieri (author) replied:
fixed, thanks.

@kealan-barbieri (author) commented Nov 7, 2025:

@dzarukin a similar change is required to make the benchdnn ref implementation align with this behavior, added as part of this PR.

Another contributor commented:

Summoning @mgouicem as an architect of the feature.

Comment on lines 332 to 348
for (int i = 0; i < 16; i += 4) {
h->mov(4, tmp.ud(0)(1), max.ud(i)(1));
h->sel(4 | ge, max.f(0), max.f(0)(1), tmp.f(0)(1));
}
h->mov(2, tmp.ud(0)(1), max.ud(2)(1));
h->sel(2 | ge, max, max.f(0)(1), tmp.f(0)(1));
h->mov(2, tmp.ud(0)(1), max.ud(1)(1));
h->sel(1 | ge, max, max.f(0)(1), tmp.f(0)(1));
Another contributor commented:

Would this work:

Suggested change
for (int i = 0; i < 16; i += 4) {
h->mov(4, tmp.ud(0)(1), max.ud(i)(1));
h->sel(4 | ge, max.f(0), max.f(0)(1), tmp.f(0)(1));
}
h->mov(2, tmp.ud(0)(1), max.ud(2)(1));
h->sel(2 | ge, max, max.f(0)(1), tmp.f(0)(1));
h->mov(2, tmp.ud(0)(1), max.ud(1)(1));
h->sel(1 | ge, max, max.f(0)(1), tmp.f(0)(1));
h->sel(8 | ge, max.ud(0)(1), max.ud(0)(2), max.ud(1)(2));
h->sel(4 | ge, max.ud(0)(1), max.ud(0)(2), max.ud(1)(2));
h->sel(3 | ge, max.ud(0)(1), max.ud(0)(2), max.ud(1)(2));
h->sel(1 | ge, max.ud(0)(1), max.ud(0)(0), max.ud(1)(0));

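The suggested sequence is a log2-depth horizontal reduction: each step folds half of the active lanes into the other half, halving the active width. A scalar model of the idea — illustrative only; the lane pairing of the actual strided-region `sel` version differs, and NaN handling is deliberately omitted:

```c
#include <assert.h>

/* Scalar model of a SIMD horizontal max over 16 lanes: at each step
 * the active width halves, and lane i keeps max(lane i, lane i+stride).
 * Four steps (stride 8, 4, 2, 1) reduce 16 lanes to one. */
static float hmax16(const float v[16]) {
    float a[16];
    for (int i = 0; i < 16; i++) a[i] = v[i];
    for (int stride = 8; stride >= 1; stride /= 2)
        for (int i = 0; i < stride; i++)
            if (a[i + stride] > a[i]) a[i] = a[i + stride];
    return a[0];
}
```

On the GPU each such step can be a single `sel`, with strided register regions selecting the two halves, which is what removes the mov+sel pairs of the original loop.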
@atkassen commented Nov 3, 2025:

I guess there'd need to be some special handling of nan/infs before this sequence to avoid propagating them.

Another contributor replied:

@atkassen -- you can add 0x80400000:ud to the inputs prior to the max, and then subtract after the max.

Another optimization to consider (maybe not in this PR) is a fully vectorized horizontal reduction, where you recombine partly reduced vectors as you go, so that you can get full SIMD usage at each stage -- vISA example here (DEFINE_HREDUCE16_FLOAT).
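The reason the suggested reduction can use `.ud` (unsigned) `sel` on float data — and why NaNs need a fix-up first — is that for non-negative finite floats the IEEE-754 bit patterns are ordered the same way as the values, while NaN bit patterns sort above every finite value. A quick check of that property (sketch only; the `0x80400000` bias trick itself is not modeled here):

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's IEEE-754 bit pattern as an unsigned integer.
 * For non-negative finite floats, unsigned comparison of these bits
 * matches float comparison; NaN bits compare above FLT_MAX's bits. */
static uint32_t f32_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}
```

Since the reduced values here are group maxima of magnitudes (non-negative), the unsigned comparison is order-preserving; an unhandled NaN input, however, would win every `sel` and propagate, hence the pre-bias suggestion above.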

@kealan-barbieri kealan-barbieri force-pushed the kealanba/dyn_scale_main branch from 0c409b2 to b4cb984 Compare November 3, 2025 20:05
@kealan-barbieri kealan-barbieri force-pushed the kealanba/dyn_scale_main branch from b4cb984 to 3237d6b Compare November 4, 2025 01:48
@github-actions github-actions bot removed the component:tests Codeowner: @oneapi-src/onednn-arch label Nov 4, 2025
@kealan-barbieri (author) commented:

make test
disable benchdnn_all
set test_scope=NIGHTLY
enable benchdnn_matmul
disable test_device_cpu
enable test_device_gpu
enable arch_gpu_xe-hpc
enable arch_gpu_xe-hpg-atsm
enable arch_gpu_xe-hpg-dg2
enable arch_gpu_xe-lp
enable arch_gpu_xe-lpg
enable arch_gpu_xe-lpg+
enable arch_gpu_xe2-hpg-bmg
enable arch_gpu_xe2-lpg

@kealan-barbieri kealan-barbieri force-pushed the kealanba/dyn_scale_main branch 2 times, most recently from 820c954 to 158d43c Compare November 6, 2025 01:19
@kealan-barbieri (author) commented:

make test
disable benchdnn_all
set test_scope=NIGHTLY
enable benchdnn_matmul
disable test_device_cpu
enable test_device_gpu
enable arch_gpu_xe-hpc
enable arch_gpu_xe-hpg-atsm
enable arch_gpu_xe-hpg-dg2
enable arch_gpu_xe-lp
enable arch_gpu_xe-lpg
enable arch_gpu_xe-lpg+
enable arch_gpu_xe2-hpg-bmg
enable arch_gpu_xe2-lpg

@kealan-barbieri kealan-barbieri force-pushed the kealanba/dyn_scale_main branch from 158d43c to 5fc7b98 Compare November 7, 2025 01:36
@github-actions github-actions bot added the component:tests Codeowner: @oneapi-src/onednn-arch label Nov 7, 2025
@kealan-barbieri (author) commented:

make test
disable benchdnn_all
set test_scope=NIGHTLY
enable benchdnn_matmul
disable test_device_cpu
enable test_device_gpu
enable arch_gpu_xe-hpc
enable arch_gpu_xe-hpg-atsm
enable arch_gpu_xe-hpg-dg2
enable arch_gpu_xe-lp
enable arch_gpu_xe-lpg
enable arch_gpu_xe-lpg+
enable arch_gpu_xe2-hpg-bmg
enable arch_gpu_xe2-lpg

Labels: platform:gpu-intel (Codeowner: @oneapi-src/onednn-gpu-intel), component:tests (Codeowner: @oneapi-src/onednn-arch), component:common

4 participants