@yuanxion yuanxion commented Jun 30, 2025

Details

  • Fix accuracy degradation issue for hbonet0.5/hbonet1.0/nanodet-m-1.5x-416 models caused by non-zero values left in the blocked format (b_fs_yx_fsv16/b_fs_yx_fsv32) padded memory area by eltwise kernels.

Description of the issue

Symptom

oneDNN Concatenation produced different output values before and after commit d7f0f34 when using blocked format memory.

Root cause

  • The eltwise_blocked_opt kernel copies extra values from the input into the output's blocked format (fsv16/fsv32) padded memory area (to pad features to a 16/32 block) instead of writing zeros.
  • oneDNN Concatenation requires the blocked format padded memory area to be filled with zeros ([link](https://uxlfoundation.github.io/oneDNN/dev_guide_understanding_memory_formats.html#what-if-channels-are-not-multiples-of-8-or-16)). When a Concatenation input comes from the output of the eltwise_blocked_opt or generic_eltwise_ref kernel and its blocked format padded memory area is not zero-padded, the oneDNN Concatenation produces wrong output values, causing the model's accuracy degradation.
  • When running model inference a second time, the crop's output memory is shared with other primitives and may therefore hold non-zero values in the padded memory area; since the eltwise_blocked_opt and generic_eltwise_ref kernels do not fill that area with zeros by default, the oneDNN Concatenation after the crop gets wrong outputs.
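To illustrate why this matters, here is a toy Python model of a block-wise concat over fsv16 buffers. This is an assumption for illustration only, not oneDNN's actual implementation: it accumulates the second input's features into the lanes after the first input's real features, which is only correct when the first input's padded lanes hold zeros.

```python
FSV = 16  # feature-slice size in b_fs_yx_fsv16

def to_blocked(features, pad_value):
    """Pack features into fsv16 blocks; the tail block's unused lanes get
    pad_value (0 for a freshly zeroed buffer, nonzero for reused memory)."""
    n_blocks = -(-len(features) // FSV)
    buf = [pad_value] * (n_blocks * FSV)
    buf[:len(features)] = features
    return buf

def blockwise_concat(a_blocked, a_feats, b_feats):
    """Copy a's blocks verbatim, then accumulate b's features into the lanes
    starting at a_feats. Only valid when a's padded lanes hold zeros."""
    out = list(a_blocked)
    total = a_feats + len(b_feats)
    out += [0] * (-(-total // FSV) * FSV - len(out))
    for i, v in enumerate(b_feats):
        out[a_feats + i] += v
    return out

a = list(range(1, 21))                 # 20 features -> 12 padded lanes
b = [100, 101, 102, 103]

good = blockwise_concat(to_blocked(a, 0), 20, b)  # zero-padded input
bad = blockwise_concat(to_blocked(a, 7), 20, b)   # stale value 7 in padding

print(good[20:24])  # b's features land intact: [100, 101, 102, 103]
print(bad[20:24])   # stale padding corrupts them: [107, 108, 109, 110]
```

Under this model, reused (non-zeroed) padding in the producer's output directly corrupts the concatenated features, which matches the observed accuracy drop.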

How to fix it

  • Skip copying the extra input values in the eltwise_blocked_opt kernel.
  • Fill the crop's output with zeros when it has blocked format memory that is shared with other primitives and is followed by a oneDNN Concatenation.

The code and line that caused this issue

From [eltwise_blocked_opt.cl#L64](https://github.com/openvinotoolkit/openvino/blob/0dcc5adfd89dc9151f0c4448e346d0ec030f70e6/src/plugins/intel_gpu/src/kernel_selector/cl_kernels/eltwise_blocked_opt.cl#L64):

```
if ((f_block*VEC_SIZE) > OUTPUT_FEATURE_NUM || b > OUTPUT_BATCH_NUM) {
```
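A minimal Python sketch of the per-lane behavior the fix needs (the names mirror the kernel's VEC_SIZE/OUTPUT_FEATURE_NUM macros, but this is not the actual OpenCL): lanes whose global feature index falls past the real feature count belong to the padded area and must be written as zero rather than copied from the input.

```python
VEC_SIZE = 16  # lanes written per vectorized block (fsv16)

def write_block(output, input_vals, f_block, feature_num):
    """Write one block of VEC_SIZE feature lanes: lanes whose global
    feature index is at or past feature_num are padding and are forced
    to zero so a following oneDNN concat sees a clean buffer."""
    for lane in range(VEC_SIZE):
        f = f_block * VEC_SIZE + lane
        output[f] = input_vals[lane] if f < feature_num else 0.0

# block 1 covers features 16..31; with 20 real features, lanes 4..15 are padding
out = [float("nan")] * 32
write_block(out, [7.0] * 16, 1, 20)
print(out[16:20])  # real lanes copied: [7.0, 7.0, 7.0, 7.0]
print(out[20:32])  # padded lanes zeroed
```

The key point is that the bounds check must apply per lane, not only per whole block, otherwise the partially padded tail block still copies stale input values.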

Reproduction step and snapshot

  • For hbonet-1.0 FP16-INT8
    python accuracy_check.py --target_framework openvino --target_devices GPU --config ./hbonet-1.0-onnx.yml --target_tags FP16-INT8 --models ./local_models --source ./datasets --annotations ./annotations --definitions ./dataset_definitions.yml --undefined_shapes_resolving_policy default --sub_evaluation true --use_new_api True

Problematic graph

  • eltwise_blocked_opt and generic_eltwise_ref kernels used in the hbonet1.0 model
[image: problematic graph]

Checklist

  • [x] Is it a proper fix? (not a workaround)
  • [x] Did you include test case for this fix, if necessary?
  • [x] Did you review existing test that can be extended to cover this scenario? Which test did you review? No existing testcase covers this issue, so a new crop_gpu testcase "basic_in1x176x52x52_crop_b_fs_yx_fsv16" was added.

Tickets:

  • CVS-169075

@yuanxion yuanxion requested review from a team as code owners June 30, 2025 03:41
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Jun 30, 2025
@yuanxion
Contributor Author

The added b_fs_yx_fsv16 & fp16 oneDNN concatenation testcase passes with the older oneDNN GPU commit e7d51221ff8aa4698c4dd63fffc136ce7522ef62, but fails with the new oneDNN GPU commit a42b47ff2cb81df552887dd4a3575f964386b25e (which was introduced into OpenVINO by d7f0f34).

Note that the oneDNN GPU commit change does not affect the OCL concatenation; this testcase passes when the concatenation is set to the OCL implementation.

@yuanxion
Contributor Author

Note: the CI test will pass only once the oneDNN concatenation issue is fixed.

INFO: [ RUN      ] concat_gpu_onednn.b_fs_yx_fsv16_input_types
INFO: src/plugins/intel_gpu/tests/unit/test_cases/concatenation_gpu_test.cpp:1756: Failure
INFO: Expected equality of these values:
INFO:   output_vec[x + offset_pad]
INFO:     Which is: 9584
INFO:   output_ptr[x]
INFO:     Which is: nan
INFO: [  FAILED  ] concat_gpu_onednn.b_fs_yx_fsv16_input_types (303 ms)
INFO: [----------] 1 test from concat_gpu_onednn (303 ms total)
INFO:
INFO: [----------] Global test environment tear-down
INFO: [==========] 1 test from 1 test suite ran. (304 ms total)
INFO: [  PASSED  ] 0 tests.
INFO: [  FAILED  ] 1 test, listed below:
INFO: [  FAILED  ] concat_gpu_onednn.b_fs_yx_fsv16_input_types

@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from bbf21eb to 91cb174 Compare July 21, 2025 03:41
@yuanxion
Contributor Author

The CI tests will pass after oneDNN PR 3630 is merged to master.

@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch 2 times, most recently from 362986c to 950cc63 Compare August 6, 2025 06:07
@yuanxion yuanxion marked this pull request as draft August 8, 2025 07:59
@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from 8ea6458 to a2a3a7b Compare August 15, 2025 12:50
@yuanxion yuanxion marked this pull request as ready for review August 15, 2025 12:50
@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from 93dbeb0 to 76a2721 Compare August 18, 2025 06:33
@yuanxion yuanxion changed the title [GPU] Add b_fs_yx_fsv16 fp16 onednn concatenation testcase [GPU] Fix accuracy degradation issue for hbonet0.5/hbonet1.0/nanodet-m-1.5x-416 models caused by eltwise kernels Aug 19, 2025
@wilson-seok
Contributor

wilson-seok commented Aug 19, 2025

  1. Writing zeros in the eltwise kernels will impact the execution time of all eltwise blocked_opt/ref kernels when using blocked format. Please check the kernel time for several input shapes (small to large).
  2. Could you please update the PR description to explain how a non-zero value in the blocked format padded area causes the accuracy issue in concat? You can also add the oneDNN primitive description.

@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from e784eff to 508feb8 Compare August 22, 2025 01:25
@yuanxion
Contributor Author

yuanxion commented Aug 22, 2025

  1. Writing zeros in the eltwise kernels will impact the execution time of all eltwise blocked_opt/ref kernels when using blocked format. Please check the kernel time for several input shapes (small to large).
  2. Could you please update the PR description to explain how a non-zero value in the blocked format padded area causes the accuracy issue in concat? You can also add the oneDNN primitive description.

@wilson-seok I changed the behavior to zero-pad blocked format memory only with eltwise_mode::ASSIGN, and updated the PR description.

The performance test showed that the performance degradation for eltwise_blocked_opt is small, but for generic_eltwise_ref it is very large.

New kernel performance (%):

| kernel | shape | iters 1 | iters 10 | iters 20 | iters 30 |
|---|---|---|---|---|---|
| eltwise_blocked_opt | 1x176x52x52_crop_b_fs_yx_fsv16 | 99.2% | 98.3% | 97.5% | 98.7% |
| eltwise_blocked_opt | 1x352x114x114_crop_b_fs_yx_fsv16 | 99.6% | 99.1% | 94.8% | 100.3% |
| eltwise_blocked_opt | 1x200x114x114_crop_b_fs_yx_fsv16 | 96.2% | 97.7% | 99.1% | 97.5% |
| eltwise_blocked_opt | 1x200x512x512_crop_b_fs_yx_fsv16 | 103.6% | 102.0% | 101.0% | 103.8% |
| eltwise_blocked_opt | 1x200x1024x1024_crop_b_fs_yx_fsv16 | 105.7% | 100.5% | 99.8% | 99.4% |
| generic_eltwise_ref | 1x200x114x114_crop_b_fs_yx_fsv16 | 253.3% | 289.9% | 270.3% | 167.7% |
| generic_eltwise_ref | 1x200x512x512_crop_b_fs_yx_fsv16 | 287.2% | 295.0% | 288.8% | 290.9% |
| generic_eltwise_ref | 1x200x1024x1024_crop_b_fs_yx_fsv16 | 110.4% | 110.4% | 110.6% | 110.7% |

@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from c4eebc0 to 9287634 Compare August 25, 2025 03:44
@yuanxion
Contributor Author

Further optimized the performance of the generic_eltwise_ref kernel by using multiple threads for the zero-padding; its performance degradation (for the eltwise kernel itself, not an E2E test) dropped from about 250%–290% (depending on shape) to about 100%–150%.
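For illustration, a sketch of splitting the zero-fill of the padded lanes across worker threads over disjoint spatial ranges. The names are invented and the layout is simplified (one fsv16 tail block per spatial position); this is not the actual plugin code.

```python
from concurrent.futures import ThreadPoolExecutor

FSV = 16  # feature-slice size of b_fs_yx_fsv16

def zero_fill_padding(buf, feature_num, spatial, workers=4):
    """Zero the padded lanes (feature_num % FSV .. FSV-1) of the tail block
    at every spatial position; each worker owns a disjoint position range,
    so no two workers ever touch the same element."""
    rem = feature_num % FSV
    if rem == 0:
        return  # feature count is block-aligned: no padded lanes to clear
    def clear(lo, hi):
        for pos in range(lo, hi):
            base = pos * FSV
            for lane in range(rem, FSV):
                buf[base + lane] = 0.0
    step = -(-spatial // workers)  # ceil-divide positions among workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(clear, lo, min(lo + step, spatial))
                   for lo in range(0, spatial, step)]
        for f in futures:
            f.result()  # propagate any worker exception

# 8 spatial positions, 12 real features each -> 4 padded lanes per position
buf = [5.0] * (8 * FSV)
zero_fill_padding(buf, 12, 8)
```

Because only the padded lanes are touched, the real feature values remain intact, and partitioning by spatial position keeps the workers' writes disjoint.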

New kernel performance (%):

| kernel | shape | iters 1 | iters 10 | iters 20 | iters 30 |
|---|---|---|---|---|---|
| generic_eltwise_ref | 1x200x52x52_crop_b_fs_yx_fsv16 | 148.7% | 140.4% | 140.5% | 139.7% |
| generic_eltwise_ref | 1x200x114x114_crop_b_fs_yx_fsv16 | 136.9% | 134.4% | 129.7% | 128.0% |
| generic_eltwise_ref | 1x200x512x512_crop_b_fs_yx_fsv16 | 98.9% | 99.1% | 106.2% | 106.1% |
| generic_eltwise_ref | 1x200x1024x1024_crop_b_fs_yx_fsv16 | 98.8% | 99.4% | 98.7% | 99.3% |
| eltwise_blocked_opt | 1x176x52x52_crop_b_fs_yx_fsv16 | 111.0% | 109.6% | 111.4% | 110.2% |
| eltwise_blocked_opt | 1x176x52x52_crop_b_fs_yx_fsv16 | 99.7% | 98.5% | 98.3% | 97.6% |
| eltwise_blocked_opt | 1x200x52x52_crop_b_fs_yx_fsv16 | 99.0% | 99.7% | 103.0% | 102.3% |
| eltwise_blocked_opt | 1x200x114x114_crop_b_fs_yx_fsv16 | 99.0% | 98.9% | 99.9% | 100.7% |
| eltwise_blocked_opt | 1x200x512x512_crop_b_fs_yx_fsv16 | 100.5% | 99.0% | 102.3% | 101.0% |
| eltwise_blocked_opt | 1x200x1024x1024_crop_b_fs_yx_fsv16 | 105.5% | 101.4% | 100.1% | 100.6% |

@yuanxion yuanxion marked this pull request as draft August 26, 2025 07:42
Signed-off-by: yuan.xiong <[email protected]>
@yuanxion yuanxion force-pushed the add-onednn-concat-testcase branch from ab46403 to e6ab99a Compare August 28, 2025 12:50
@yuanxion yuanxion marked this pull request as ready for review August 28, 2025 12:52
@yuanxion
Contributor Author

The performance test for the models with benchmark_app shows a slight performance drop with this PR:

| Throughput/FPS | original | after-fix | ratio |
|---|---|---|---|
| hbonet0.5-FP16 | 293.674 | 290.044 | 98.8% |
| hbonet1.0-FP16 | 303.176 | 298.59 | 98.5% |
| hbonet1.0-FP16-INT8 | 237.994 | 235.228 | 98.8% |
| nanodet | 259.588 | 255.878 | 98.6% |

@p-durandin p-durandin added this to the 2025.4 milestone Aug 29, 2025
@p-durandin p-durandin added this pull request to the merge queue Aug 29, 2025
Merged via the queue into openvinotoolkit:master with commit 997b5c4 Aug 29, 2025
196 of 198 checks passed
praasz pushed a commit to praasz/openvino that referenced this pull request Sep 8, 2025
[GPU] Fix accuracy degradation issue for hbonet0.5/hbonet1.0/nanodet-m-1.5x-416 models caused by eltwise kernels (openvinotoolkit#31140)
pereanub pushed a commit to pereanub/openvino that referenced this pull request Sep 9, 2025
Fix a nullptr error when a node's selected_impl is null.

Related regression:
openvinotoolkit#31140


### Tickets:
 - *173291*

Signed-off-by: hyunback <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Sep 16, 2025
### Details
- Fix performance regression introduced by PR
#31140.
 
### Description of the issue

#### Symptom
manual_yolo11 model performance dropped from 362.4 FPS to 318.25 FPS on
GPU.

#### Root cause
- The previous PR forces every crop primitive followed by a oneDNN
concatenation to clear its GPU memory by filling it with zeros when using a
blocked format.
- The manual_yolo11 model also has many such crop primitives, so its
performance drops.

#### How to fix it
- Found that the zero-fill of GPU memory can be skipped when the crop
primitive uses the eltwise_blocked_opt kernel and is not dynamic, so it is
skipped by checking the crop primitive's kernel name.
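The decision described above could be sketched like this (hypothetical helper and parameter names, not the actual primitive_inst.cpp code):

```python
def needs_zero_fill(is_blocked_format, followed_by_onednn_concat,
                    uses_blocked_opt_kernel, is_dynamic):
    """Return True when a crop's padded memory area must be cleared
    before a downstream oneDNN concatenation reads it."""
    if not (is_blocked_format and followed_by_onednn_concat):
        return False  # oneDNN's zero-padding requirement does not apply
    if uses_blocked_opt_kernel and not is_dynamic:
        # The fixed static eltwise_blocked_opt kernel already writes zeros
        # into the padded lanes, so the extra fill would be redundant.
        return False
    return True

print(needs_zero_fill(True, True, False, False))  # ref kernel: fill needed
print(needs_zero_fill(True, True, True, False))   # static blocked_opt: skip
```

This keeps the accuracy fix (the padded area is always clean before the concat) while restoring the throughput lost to redundant zero-fills on models like manual_yolo11.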

#### The code and line that caused this issue

https://github.com/openvinotoolkit/openvino/blob/453c8ee337f4a1cadebb66551bb40d6a216c1001/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L2036

#### Reproduction step and snapshot
- benchmark_app
`benchmark_app -inference_only false -b 1 -t 60 -nireq 4 -d GPU.0 -hint
none -nstreams 2 -m INT8/1/ov/optimized/manual_yolo11.xml`

#### Problematic graph
- crop primitive (eltwise_blocked_opt kernel) followed by onednn
concatenation in manual_yolo11
<img width="570" height="616" alt="image"
src="https://github.com/user-attachments/assets/c3c0b598-c147-4d23-940d-9a4ac9b4649e"
/>

#### Checklist 
 - [x] Is it a proper fix? (not a workaround) 
 - [ ] Did you include test case for this fix, if necessary? No need
- [ ] Did you review existing test that can be extended to cover this
scenario? Which test did you review?
 
### Tickets:
 - CVS-173402

---------

Signed-off-by: yuan.xiong <[email protected]>