@yuanxion yuanxion commented Jun 30, 2025

Details

  • Fix accuracy degradation issue for hbonet0.5/hbonet1.0/nanodet-m-1.5x-416 models caused by non-zero values left in the blocked format (b_fs_yx_fsv16/b_fs_yx_fsv32) padded memory area by eltwise kernels.

Description of the issue

Symptom

oneDNN Concatenation produced different output values before and after commit d7f0f34 when using blocked format memory.

Root cause

  • The eltwise_blocked_opt kernel copies extra values from the input into the output's blocked format (fsv16/fsv32) padded memory area (to pad features to a 16/32 block) instead of writing zeros.
  • oneDNN Concatenation requires the blocked format padded memory area to be filled with zeros ([link](https://uxlfoundation.github.io/oneDNN/dev_guide_understanding_memory_formats.html#what-if-channels-are-not-multiples-of-8-or-16)). When a Concatenation input comes from the output of the eltwise_blocked_opt or generic_eltwise_ref kernel and its blocked format padded memory area is not zero-padded, the oneDNN Concatenation produces wrong output values, causing the model's accuracy degradation.
  • When running model inference a second time, the crop's output memory is shared with other primitives and may therefore hold non-zero values in the padded memory area; since the eltwise_blocked_opt and generic_eltwise_ref kernels do not fill that area with zeros by default, the oneDNN Concatenation after the crop gets wrong outputs.
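To illustrate why this matters, here is a toy Python model of a block-wise concat over fsv16 buffers. This is an assumption for illustration only, not oneDNN's actual implementation: it accumulates the second input's features into the lanes after the first input's real features, which is only correct when the first input's padded lanes hold zeros.

```python
FSV = 16  # feature-slice size in b_fs_yx_fsv16

def to_blocked(features, pad_value):
    """Pack features into fsv16 blocks; the tail block's unused lanes get
    pad_value (0 for a freshly zeroed buffer, nonzero for reused memory)."""
    n_blocks = -(-len(features) // FSV)
    buf = [pad_value] * (n_blocks * FSV)
    buf[:len(features)] = features
    return buf

def blockwise_concat(a_blocked, a_feats, b_feats):
    """Copy a's blocks verbatim, then accumulate b's features into the lanes
    starting at a_feats. Only valid when a's padded lanes hold zeros."""
    out = list(a_blocked)
    total = a_feats + len(b_feats)
    out += [0] * (-(-total // FSV) * FSV - len(out))
    for i, v in enumerate(b_feats):
        out[a_feats + i] += v
    return out

a = list(range(1, 21))                 # 20 features -> 12 padded lanes
b = [100, 101, 102, 103]

good = blockwise_concat(to_blocked(a, 0), 20, b)  # zero-padded input
bad = blockwise_concat(to_blocked(a, 7), 20, b)   # stale value 7 in padding

print(good[20:24])  # b's features land intact: [100, 101, 102, 103]
print(bad[20:24])   # stale padding corrupts them: [107, 108, 109, 110]
```

Under this model, reused (non-zeroed) padding in the producer's output directly corrupts the concatenated features, which matches the observed accuracy drop.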

How to fix it

  • Skip copying the extra input values in the eltwise_blocked_opt kernel.
  • Fill the crop's output with zeros when it has blocked format memory that is shared with other primitives and is followed by a oneDNN Concatenation.

The code and line that caused this issue

From [eltwise_blocked_opt.cl#L64](https://github.com/openvinotoolkit/openvino/blob/0dcc5adfd89dc9151f0c4448e346d0ec030f70e6/src/plugins/intel_gpu/src/kernel_selector/cl_kernels/eltwise_blocked_opt.cl#L64):

```
if ((f_block*VEC_SIZE) > OUTPUT_FEATURE_NUM || b > OUTPUT_BATCH_NUM) {
```
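A minimal Python sketch of the per-lane behavior the fix needs (the names mirror the kernel's VEC_SIZE/OUTPUT_FEATURE_NUM macros, but this is not the actual OpenCL): lanes whose global feature index falls past the real feature count belong to the padded area and must be written as zero rather than copied from the input.

```python
VEC_SIZE = 16  # lanes written per vectorized block (fsv16)

def write_block(output, input_vals, f_block, feature_num):
    """Write one block of VEC_SIZE feature lanes: lanes whose global
    feature index is at or past feature_num are padding and are forced
    to zero so a following oneDNN concat sees a clean buffer."""
    for lane in range(VEC_SIZE):
        f = f_block * VEC_SIZE + lane
        output[f] = input_vals[lane] if f < feature_num else 0.0

# block 1 covers features 16..31; with 20 real features, lanes 4..15 are padding
out = [float("nan")] * 32
write_block(out, [7.0] * 16, 1, 20)
print(out[16:20])  # real lanes copied: [7.0, 7.0, 7.0, 7.0]
print(out[20:32])  # padded lanes zeroed
```

The key point is that the bounds check must apply per lane, not only per whole block, otherwise the partially padded tail block still copies stale input values.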

Reproduction step and snapshot

  • For hbonet-1.0 FP16-INT8
    python accuracy_check.py --target_framework openvino --target_devices GPU --config ./hbonet-1.0-onnx.yml --target_tags FP16-INT8 --models ./local_models --source ./datasets --annotations ./annotations --definitions ./dataset_definitions.yml --undefined_shapes_resolving_policy default --sub_evaluation true --use_new_api True

Problematic graph

  • eltwise_blocked_opt and generic_eltwise_ref kernels used in the hbonet1.0 model
[image: problematic graph]

Checklist

  • [x] Is it a proper fix? (not a workaround)
  • [x] Did you include test case for this fix, if necessary?
  • [x] Did you review existing test that can be extended to cover this scenario? Which test did you review? No existing testcase covers this issue, so a new crop_gpu testcase "basic_in1x176x52x52_crop_b_fs_yx_fsv16" was added.

Tickets:

  • CVS-169075

@yuanxion yuanxion requested review from a team as code owners June 30, 2025 03:41
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Jun 30, 2025
@yuanxion
Contributor Author

The added b_fs_yx_fsv16 & fp16 oneDNN concatenation testcase passes with the older oneDNN GPU commit e7d51221ff8aa4698c4dd63fffc136ce7522ef62, but fails with the new oneDNN GPU commit a42b47ff2cb81df552887dd4a3575f964386b25e (which was introduced into OpenVINO by d7f0f34).

Note that the oneDNN GPU commit change does not affect the OCL concatenation; this testcase passes when the concatenation is set to the OCL implementation.

@yuanxion
Contributor Author

Note: the CI test will pass only once the oneDNN concatenation issue is fixed.

INFO: [ RUN      ] concat_gpu_onednn.b_fs_yx_fsv16_input_types
INFO: src/plugins/intel_gpu/tests/unit/test_cases/concatenation_gpu_test.cpp:1756: Failure
INFO: Expected equality of these values:
INFO:   output_vec[x + offset_pad]
INFO:     Which is: 9584
INFO:   output_ptr[x]
INFO:     Which is: nan
INFO: [  FAILED  ] concat_gpu_onednn.b_fs_yx_fsv16_input_types (303 ms)
INFO: [----------] 1 test from concat_gpu_onednn (303 ms total)
INFO:
INFO: [----------] Global test environment tear-down
INFO: [==========] 1 test from 1 test suite ran. (304 ms total)
INFO: [  PASSED  ] 0 tests.
INFO: [  FAILED  ] 1 test, listed below:
INFO: [  FAILED  ] concat_gpu_onednn.b_fs_yx_fsv16_input_types

@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from bbf21eb to 91cb174 Compare July 21, 2025 03:41
@yuanxion
Contributor Author

The CI tests will pass after oneDNN PR 3630 is merged to master.

@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch 2 times, most recently from 362986c to 950cc63 Compare August 6, 2025 06:07
@yuanxion yuanxion marked this pull request as draft August 8, 2025 07:59
@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from 8ea6458 to a2a3a7b Compare August 15, 2025 12:50
@yuanxion yuanxion marked this pull request as ready for review August 15, 2025 12:50
@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from 93dbeb0 to 76a2721 Compare August 18, 2025 06:33
@yuanxion yuanxion changed the title [GPU] Add b_fs_yx_fsv16 fp16 onednn concatenation testcase [GPU] Fix accuracy degradation issue for hbonet0.5/hbonet1.0/nanodet-m-1.5x-416 models caused by eltwise kernels Aug 19, 2025
@wilson-seok
Contributor

wilson-seok commented Aug 19, 2025

  1. Writing zeros in the eltwise kernels will impact the execution time of all eltwise blocked_opt/ref kernels when using blocked format. Please check the kernel time for several input shapes (small to large).
  2. Could you please update the PR description to explain how a non-zero value in the blocked format padded area causes the accuracy issue in concat? You can also add the oneDNN primitive description.

@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from e784eff to 508feb8 Compare August 22, 2025 01:25
@yuanxion
Contributor Author

yuanxion commented Aug 22, 2025

  1. Writing zeros in the eltwise kernels will impact the execution time of all eltwise blocked_opt/ref kernels when using blocked format. Please check the kernel time for several input shapes (small to large).
  2. Could you please update the PR description to explain how a non-zero value in the blocked format padded area causes the accuracy issue in concat? You can also add the oneDNN primitive description.

@wilson-seok I changed the behavior to zero-pad blocked format memory only with eltwise_mode::ASSIGN, and updated the PR description.

The performance test showed that the performance degradation for eltwise_blocked_opt is small, but for generic_eltwise_ref it is very large.

New kernel performance (%):

| kernel | shape | iters 1 | iters 10 | iters 20 | iters 30 |
|---|---|---|---|---|---|
| eltwise_blocked_opt | 1x176x52x52_crop_b_fs_yx_fsv16 | 99.2% | 98.3% | 97.5% | 98.7% |
| eltwise_blocked_opt | 1x352x114x114_crop_b_fs_yx_fsv16 | 99.6% | 99.1% | 94.8% | 100.3% |
| eltwise_blocked_opt | 1x200x114x114_crop_b_fs_yx_fsv16 | 96.2% | 97.7% | 99.1% | 97.5% |
| eltwise_blocked_opt | 1x200x512x512_crop_b_fs_yx_fsv16 | 103.6% | 102.0% | 101.0% | 103.8% |
| eltwise_blocked_opt | 1x200x1024x1024_crop_b_fs_yx_fsv16 | 105.7% | 100.5% | 99.8% | 99.4% |
| generic_eltwise_ref | 1x200x114x114_crop_b_fs_yx_fsv16 | 253.3% | 289.9% | 270.3% | 167.7% |
| generic_eltwise_ref | 1x200x512x512_crop_b_fs_yx_fsv16 | 287.2% | 295.0% | 288.8% | 290.9% |
| generic_eltwise_ref | 1x200x1024x1024_crop_b_fs_yx_fsv16 | 110.4% | 110.4% | 110.6% | 110.7% |

@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from c4eebc0 to 9287634 Compare August 25, 2025 03:44
@yuanxion
Contributor Author

Further optimized the performance of the generic_eltwise_ref kernel by using multiple threads for the zero-padding; its performance degradation (for the eltwise kernel itself, not an E2E test) dropped from about 250%–290% (depending on shape) to about 100%–150%.
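For illustration, a sketch of splitting the zero-fill of the padded lanes across worker threads over disjoint spatial ranges. The names are invented and the layout is simplified (one fsv16 tail block per spatial position); this is not the actual plugin code.

```python
from concurrent.futures import ThreadPoolExecutor

FSV = 16  # feature-slice size of b_fs_yx_fsv16

def zero_fill_padding(buf, feature_num, spatial, workers=4):
    """Zero the padded lanes (feature_num % FSV .. FSV-1) of the tail block
    at every spatial position; each worker owns a disjoint position range,
    so no two workers ever touch the same element."""
    rem = feature_num % FSV
    if rem == 0:
        return  # feature count is block-aligned: no padded lanes to clear
    def clear(lo, hi):
        for pos in range(lo, hi):
            base = pos * FSV
            for lane in range(rem, FSV):
                buf[base + lane] = 0.0
    step = -(-spatial // workers)  # ceil-divide positions among workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(clear, lo, min(lo + step, spatial))
                   for lo in range(0, spatial, step)]
        for f in futures:
            f.result()  # propagate any worker exception

# 8 spatial positions, 12 real features each -> 4 padded lanes per position
buf = [5.0] * (8 * FSV)
zero_fill_padding(buf, 12, 8)
```

Because only the padded lanes are touched, the real feature values remain intact, and partitioning by spatial position keeps the workers' writes disjoint.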

New kernel performance (%):

| kernel | shape | iters 1 | iters 10 | iters 20 | iters 30 |
|---|---|---|---|---|---|
| generic_eltwise_ref | 1x200x52x52_crop_b_fs_yx_fsv16 | 148.7% | 140.4% | 140.5% | 139.7% |
| generic_eltwise_ref | 1x200x114x114_crop_b_fs_yx_fsv16 | 136.9% | 134.4% | 129.7% | 128.0% |
| generic_eltwise_ref | 1x200x512x512_crop_b_fs_yx_fsv16 | 98.9% | 99.1% | 106.2% | 106.1% |
| generic_eltwise_ref | 1x200x1024x1024_crop_b_fs_yx_fsv16 | 98.8% | 99.4% | 98.7% | 99.3% |
| eltwise_blocked_opt | 1x176x52x52_crop_b_fs_yx_fsv16 | 111.0% | 109.6% | 111.4% | 110.2% |
| eltwise_blocked_opt | 1x176x52x52_crop_b_fs_yx_fsv16 | 99.7% | 98.5% | 98.3% | 97.6% |
| eltwise_blocked_opt | 1x200x52x52_crop_b_fs_yx_fsv16 | 99.0% | 99.7% | 103.0% | 102.3% |
| eltwise_blocked_opt | 1x200x114x114_crop_b_fs_yx_fsv16 | 99.0% | 98.9% | 99.9% | 100.7% |
| eltwise_blocked_opt | 1x200x512x512_crop_b_fs_yx_fsv16 | 100.5% | 99.0% | 102.3% | 101.0% |
| eltwise_blocked_opt | 1x200x1024x1024_crop_b_fs_yx_fsv16 | 105.5% | 101.4% | 100.1% | 100.6% |

@yuanxion yuanxion marked this pull request as draft August 26, 2025 07:42
Signed-off-by: yuan.xiong <[email protected]>
@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from ab46403 to e6ab99a Compare August 28, 2025 12:50
@yuanxion yuanxion marked this pull request as ready for review August 28, 2025 12:52
@yuanxion
Contributor Author

The performance test for the models with benchmark_app shows a slight performance drop with this PR:

| Throughput/FPS | original | after-fix | ratio |
|---|---|---|---|
| hbonet0.5-FP16 | 293.674 | 290.044 | 98.8% |
| hbonet1.0-FP16 | 303.176 | 298.59 | 98.5% |
| hbonet1.0-FP16-INT8 | 237.994 | 235.228 | 98.8% |
| nanodet | 259.588 | 255.878 | 98.6% |

@p-durandin p-durandin added this to the 2025.4 milestone Aug 29, 2025
@p-durandin p-durandin added this pull request to the merge queue Aug 29, 2025
Merged via the queue into openvinotoolkit:master with commit 997b5c4 Aug 29, 2025
196 of 198 checks passed
praasz pushed a commit to praasz/openvino that referenced this pull request Sep 8, 2025
[GPU] Fix accuracy degradation issue for hbonet0.5/hbonet1.0/nanodet-m-1.5x-416 models caused by eltwise kernels (openvinotoolkit#31140)
pereanub pushed a commit to pereanub/openvino that referenced this pull request Sep 9, 2025
Fix a nullptr error when a node's selected_impl is null.

Related regression:
openvinotoolkit#31140


### Tickets:
 - *173291*

Signed-off-by: hyunback <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Sep 16, 2025
### Details
- Fix performance regression introduced by PR
#31140.
 
### Description of the issue

#### Symptom
manual_yolo11 model performance dropped from 362.4 FPS to 318.25 FPS on
GPU.

#### Root cause
- The previous PR forces every crop primitive followed by a oneDNN
concatenation to clear its GPU memory by filling it with zeros when using a
blocked format.
- The manual_yolo11 model also has many such crop primitives, so its
performance drops.

#### How to fix it
- Found that the zero-fill of GPU memory can be skipped when the crop
primitive uses the eltwise_blocked_opt kernel and is not dynamic, so it is
skipped by checking the crop primitive's kernel name.
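The decision described above could be sketched like this (hypothetical helper and parameter names, not the actual primitive_inst.cpp code):

```python
def needs_zero_fill(is_blocked_format, followed_by_onednn_concat,
                    uses_blocked_opt_kernel, is_dynamic):
    """Return True when a crop's padded memory area must be cleared
    before a downstream oneDNN concatenation reads it."""
    if not (is_blocked_format and followed_by_onednn_concat):
        return False  # oneDNN's zero-padding requirement does not apply
    if uses_blocked_opt_kernel and not is_dynamic:
        # The fixed static eltwise_blocked_opt kernel already writes zeros
        # into the padded lanes, so the extra fill would be redundant.
        return False
    return True

print(needs_zero_fill(True, True, False, False))  # ref kernel: fill needed
print(needs_zero_fill(True, True, True, False))   # static blocked_opt: skip
```

This keeps the accuracy fix (the padded area is always clean before the concat) while restoring the throughput lost to redundant zero-fills on models like manual_yolo11.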

#### The code and line that caused this issue

https://github.com/openvinotoolkit/openvino/blob/453c8ee337f4a1cadebb66551bb40d6a216c1001/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L2036

#### Reproduction step and snapshot
- benchmark_app
`benchmark_app -inference_only false -b 1 -t 60 -nireq 4 -d GPU.0 -hint
none -nstreams 2 -m INT8/1/ov/optimized/manual_yolo11.xml`

#### Problematic graph
- crop primitive (eltwise_blocked_opt kernel) followed by onednn
concatenation in manual_yolo11
<img width="570" height="616" alt="image"
src="https://github.com/user-attachments/assets/c3c0b598-c147-4d23-940d-9a4ac9b4649e"
/>

#### Checklist 
 - [x] Is it a proper fix? (not a workaround) 
 - [ ] Did you include test case for this fix, if necessary? No need
- [ ] Did you review existing test that can be extended to cover this
scenario? Which test did you review?
 
### Tickets:
 - CVS-173402

---------

Signed-off-by: yuan.xiong <[email protected]>