Mryange
Contributor

@Mryange Mryange commented Aug 2, 2024

The main issue is that `_mm_movemask_epi8` has no one-to-one equivalent instruction on ARM. Benchmarks show that the SSE-based comparison is slower there than a plain `memcmp`, which lets the compiler generate native ARM instructions.
The following benchmark was run on ARM.

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_memequal16_sse         3.77 ns         3.77 ns    743238946
BM_memequal16_orgin       2.11 ns         2.11 ns   1000000000
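
For readers unfamiliar with the two approaches, here is a minimal sketch of what a 16-byte equality check might look like in each style. The function names `memequal16_memcmp`, `memequal16_sse`, and `memequal16` are hypothetical and chosen for illustration; this is not the code changed in this PR, and the `__SSE2__` guard is an assumption about how a build might select the variant.

```cpp
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: portable 16-byte equality via memcmp.
// Because the size is a compile-time constant, compilers can usually
// inline this into a handful of native loads and compares.
inline bool memequal16_memcmp(const void* a, const void* b) {
    return std::memcmp(a, b, 16) == 0;
}

#if defined(__SSE2__)
// Hypothetical helper: SSE2 variant built on _mm_movemask_epi8.
inline bool memequal16_sse(const void* a, const void* b) {
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // _mm_cmpeq_epi8 sets each equal byte to 0xFF; _mm_movemask_epi8 packs
    // the top bit of all 16 bytes into the low 16 bits of an int.
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF;
}
#endif

// Hypothetical dispatcher: use the SSE2 path only where it is native,
// and fall back to memcmp everywhere else (for example on ARM).
inline bool memequal16(const void* a, const void* b) {
#if defined(__SSE2__)
    return memequal16_sse(a, b);
#else
    return memequal16_memcmp(a, b);
#endif
}
```

Calling `memequal16(x, y)` on two 16-byte buffers returns `true` only when every byte matches. On a target without SSE2 the sketch never touches the intrinsic, which is the general idea behind preferring `memcmp` on ARM.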

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@github-actions github-actions bot added the doing label Aug 2, 2024
@Mryange
Contributor Author

Mryange commented Aug 2, 2024

run buildall

Contributor

github-actions bot commented Aug 2, 2024

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 41755 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 243bb74544ef5a6570a8297e79bd5f23a25cd8f2, data reload: false

------ Round 1 ----------------------------------
q1	17637	4147	4111	4111
q2	2020	198	200	198
q3	10458	1301	1315	1301
q4	10157	838	924	838
q5	7648	3046	2988	2988
q6	221	141	143	141
q7	1068	628	618	618
q8	9437	1943	1938	1938
q9	8595	6622	6556	6556
q10	8751	3821	3858	3821
q11	432	250	259	250
q12	438	236	230	230
q13	17752	2939	2964	2939
q14	279	242	248	242
q15	524	492	495	492
q16	527	398	387	387
q17	989	948	944	944
q18	8049	7442	7225	7225
q19	1744	1224	1227	1224
q20	583	326	352	326
q21	5272	4704	4761	4704
q22	348	287	282	282
Total cold run time: 112929 ms
Total hot run time: 41755 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4123	4035	4059	4035
q2	336	226	218	218
q3	3005	3030	3185	3030
q4	2022	2046	1986	1986
q5	5657	5503	5486	5486
q6	223	134	133	133
q7	2124	1829	1808	1808
q8	3310	3392	3374	3374
q9	8717	8713	8775	8713
q10	3964	4067	3940	3940
q11	568	461	479	461
q12	802	587	579	579
q13	16403	3107	3115	3107
q14	311	282	275	275
q15	533	489	491	489
q16	467	423	418	418
q17	1772	1762	1758	1758
q18	8277	7800	7639	7639
q19	1734	1766	1784	1766
q20	2080	1835	1835	1835
q21	5774	5492	5360	5360
q22	531	469	475	469
Total cold run time: 72733 ms
Total hot run time: 56879 ms

@doris-robot

TPC-DS: Total hot run time: 170049 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 243bb74544ef5a6570a8297e79bd5f23a25cd8f2, data reload: false

query1	916	384	371	371
query2	6480	1683	1640	1640
query3	6661	211	221	211
query4	19196	17370	17534	17370
query5	3631	501	519	501
query6	265	171	169	169
query7	4612	323	299	299
query8	251	193	206	193
query9	8521	2374	2370	2370
query10	463	285	272	272
query11	10374	10035	10113	10035
query12	120	93	94	93
query13	1641	404	379	379
query14	10303	7558	6971	6971
query15	199	163	165	163
query16	6928	463	429	429
query17	947	592	560	560
query18	1917	290	288	288
query19	195	146	141	141
query20	97	86	86	86
query21	198	104	109	104
query22	4206	3982	3918	3918
query23	33804	33889	33459	33459
query24	9513	3139	3113	3113
query25	666	412	397	397
query26	1479	153	160	153
query27	3003	299	289	289
query28	7364	2023	1988	1988
query29	1075	450	470	450
query30	232	161	160	160
query31	958	781	806	781
query32	104	62	62	62
query33	704	332	335	332
query34	930	499	519	499
query35	889	784	762	762
query36	1063	881	887	881
query37	172	91	91	91
query38	2914	2790	2786	2786
query39	878	801	828	801
query40	250	113	120	113
query41	47	45	46	45
query42	121	106	103	103
query43	455	415	427	415
query44	1197	741	749	741
query45	212	181	179	179
query46	1096	854	790	790
query47	1804	1728	1727	1727
query48	374	309	301	301
query49	868	443	449	443
query50	919	445	457	445
query51	6764	6702	6711	6702
query52	103	95	95	95
query53	260	187	183	183
query54	624	478	483	478
query55	83	79	87	79
query56	299	272	271	271
query57	1114	1049	1052	1049
query58	279	286	275	275
query59	2547	2436	2366	2366
query60	315	294	292	292
query61	154	94	92	92
query62	875	679	660	660
query63	216	190	190	190
query64	5050	1925	1923	1923
query65	3154	3103	3084	3084
query66	1292	334	333	333
query67	15237	14852	14886	14852
query68	4346	588	580	580
query69	477	300	314	300
query70	1112	1065	1026	1026
query71	429	295	289	289
query72	7783	2767	2475	2475
query73	764	333	343	333
query74	6019	5689	5663	5663
query75	3568	2771	2739	2739
query76	2445	1195	1283	1195
query77	538	310	316	310
query78	9350	9071	8830	8830
query79	2105	545	547	545
query80	1023	524	564	524
query81	573	228	227	227
query82	1059	135	133	133
query83	275	172	171	171
query84	260	85	80	80
query85	1328	331	308	308
query86	441	330	297	297
query87	3268	3114	3100	3100
query88	3095	2503	2502	2502
query89	387	299	294	294
query90	1810	203	199	199
query91	128	104	100	100
query92	62	52	51	51
query93	2032	617	611	611
query94	922	275	295	275
query95	396	272	270	270
query96	615	287	289	287
query97	3274	3093	3069	3069
query98	232	205	199	199
query99	1585	1274	1307	1274
Total cold run time: 261279 ms
Total hot run time: 170049 ms

@doris-robot

ClickBench: Total hot run time: 30.15 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 243bb74544ef5a6570a8297e79bd5f23a25cd8f2, data reload: false

query1	0.05	0.04	0.04
query2	0.08	0.04	0.04
query3	0.23	0.05	0.05
query4	1.68	0.09	0.07
query5	0.48	0.48	0.48
query6	1.17	0.72	0.72
query7	0.02	0.01	0.01
query8	0.06	0.05	0.05
query9	0.57	0.50	0.52
query10	0.56	0.57	0.58
query11	0.16	0.12	0.12
query12	0.14	0.12	0.12
query13	0.61	0.61	0.60
query14	0.78	0.79	0.81
query15	0.90	0.87	0.86
query16	0.35	0.37	0.36
query17	1.04	1.01	1.01
query18	0.23	0.22	0.21
query19	1.80	1.72	1.73
query20	0.01	0.01	0.02
query21	15.40	0.74	0.67
query22	3.68	8.11	1.27
query23	17.98	1.34	1.33
query24	2.22	0.23	0.22
query25	0.17	0.08	0.08
query26	0.31	0.21	0.22
query27	0.47	0.23	0.22
query28	13.17	1.00	0.97
query29	12.58	3.29	3.33
query30	0.25	0.06	0.05
query31	2.87	0.42	0.41
query32	3.23	0.50	0.49
query33	2.96	2.98	2.99
query34	15.42	4.28	4.25
query35	4.29	4.31	4.32
query36	0.67	0.48	0.48
query37	0.19	0.16	0.16
query38	0.16	0.16	0.15
query39	0.05	0.04	0.04
query40	0.16	0.13	0.12
query41	0.10	0.04	0.05
query42	0.06	0.05	0.04
query43	0.05	0.04	0.04
Total cold run time: 107.36 s
Total hot run time: 30.15 s

Contributor

@HappenLee HappenLee left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Aug 16, 2024
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

@Mryange
Contributor Author

Mryange commented Aug 18, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 38153 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 243bb74544ef5a6570a8297e79bd5f23a25cd8f2, data reload: false

------ Round 1 ----------------------------------
q1	17859	4492	4323	4323
q2	2050	207	221	207
q3	11812	992	1126	992
q4	10532	798	713	713
q5	7811	2857	2774	2774
q6	262	158	156	156
q7	1000	658	660	658
q8	9594	2114	2127	2114
q9	8801	6602	6598	6598
q10	7069	2229	2247	2229
q11	522	275	270	270
q12	427	257	255	255
q13	18189	2997	2967	2967
q14	298	260	263	260
q15	562	529	542	529
q16	527	437	413	413
q17	979	716	750	716
q18	7561	6963	6796	6796
q19	6457	1074	995	995
q20	707	343	362	343
q21	3961	3010	2824	2824
q22	1132	1021	1046	1021
Total cold run time: 118112 ms
Total hot run time: 38153 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4572	4450	4288	4288
q2	394	302	297	297
q3	2860	2625	2768	2625
q4	1982	1726	1714	1714
q5	5621	5761	5640	5640
q6	249	156	145	145
q7	2130	1765	1810	1765
q8	3320	3545	3485	3485
q9	8831	8672	8764	8672
q10	3641	3340	3297	3297
q11	614	543	524	524
q12	827	687	674	674
q13	17208	3161	3149	3149
q14	323	304	292	292
q15	570	516	515	515
q16	504	442	458	442
q17	1826	1563	1549	1549
q18	8227	7929	7773	7773
q19	2883	1637	1540	1540
q20	2138	1926	1942	1926
q21	14487	5418	5333	5333
q22	1175	1108	1075	1075
Total cold run time: 84382 ms
Total hot run time: 56720 ms

@Mryange
Contributor Author

Mryange commented Aug 18, 2024

run performance

@doris-robot

TPC-H: Total hot run time: 38176 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 243bb74544ef5a6570a8297e79bd5f23a25cd8f2, data reload: false

------ Round 1 ----------------------------------
q1	17865	4805	4274	4274
q2	2067	209	202	202
q3	11588	953	1104	953
q4	10528	781	765	765
q5	7788	2818	2861	2818
q6	265	160	160	160
q7	1034	648	660	648
q8	9385	2078	2136	2078
q9	7075	6596	6598	6596
q10	7066	2249	2198	2198
q11	485	258	255	255
q12	419	250	264	250
q13	18951	3038	3030	3030
q14	285	251	251	251
q15	560	527	543	527
q16	519	415	417	415
q17	985	767	658	658
q18	7380	6806	6945	6806
q19	7023	1082	1101	1082
q20	712	343	348	343
q21	3825	2920	2859	2859
q22	1119	1026	1008	1008
Total cold run time: 116924 ms
Total hot run time: 38176 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4592	4478	4477	4477
q2	435	335	329	329
q3	2979	2843	2847	2843
q4	2016	1762	1765	1762
q5	5867	5910	5798	5798
q6	248	154	157	154
q7	2173	1825	1811	1811
q8	3348	3577	3541	3541
q9	8810	8753	8760	8753
q10	3619	3307	3261	3261
q11	611	528	510	510
q12	823	684	670	670
q13	17117	3234	3077	3077
q14	309	294	305	294
q15	547	525	534	525
q16	516	461	477	461
q17	1795	1567	1551	1551
q18	8332	7805	7564	7564
q19	9923	1626	1704	1626
q20	3544	1895	1857	1857
q21	12779	5296	5323	5296
q22	1537	1066	1091	1066
Total cold run time: 91920 ms
Total hot run time: 57226 ms

@doris-robot

TPC-DS: Total hot run time: 195793 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 243bb74544ef5a6570a8297e79bd5f23a25cd8f2, data reload: false

query1	1321	915	897	897
query2	6585	2091	1972	1972
query3	10616	3871	3807	3807
query4	57863	27219	23228	23228
query5	5827	731	725	725
query6	550	212	230	212
query7	6414	329	332	329
query8	559	441	440	440
query9	9126	2557	2570	2557
query10	576	339	339	339
query11	16007	15080	15480	15080
query12	203	140	138	138
query13	1659	437	449	437
query14	11732	7467	7573	7467
query15	285	198	211	198
query16	7184	517	507	507
query17	1171	624	600	600
query18	2007	346	335	335
query19	294	178	166	166
query20	151	137	137	137
query21	245	147	145	145
query22	4592	4243	4159	4159
query23	34477	33870	33837	33837
query24	5708	2938	2946	2938
query25	586	431	434	431
query26	735	180	181	180
query27	1885	306	303	303
query28	4034	2186	2114	2114
query29	683	435	436	435
query30	214	184	185	184
query31	992	836	831	831
query32	107	77	77	77
query33	508	375	338	338
query34	908	489	499	489
query35	872	765	763	763
query36	1063	945	945	945
query37	156	104	104	104
query38	4032	3836	3923	3836
query39	1540	1494	1469	1469
query40	236	160	154	154
query41	137	135	139	135
query42	139	116	117	116
query43	530	505	514	505
query44	1146	772	773	772
query45	229	196	203	196
query46	1128	748	780	748
query47	1935	1816	1859	1816
query48	425	339	336	336
query49	911	585	581	581
query50	860	452	468	452
query51	6769	6800	6755	6755
query52	120	105	114	105
query53	298	220	222	220
query54	600	502	509	502
query55	90	88	87	87
query56	339	311	314	311
query57	1208	1137	1140	1137
query58	299	334	302	302
query59	3108	2804	2947	2804
query60	351	322	332	322
query61	153	150	152	150
query62	777	704	695	695
query63	252	227	228	227
query64	3302	1865	1895	1865
query65	3242	3140	3174	3140
query66	1051	667	671	667
query67	15382	14701	14697	14697
query68	6053	574	573	573
query69	629	319	305	305
query70	1247	1121	1202	1121
query71	571	316	348	316
query72	6962	2354	2101	2101
query73	811	354	348	348
query74	9691	8738	8740	8738
query75	4177	2746	2790	2746
query76	3635	1078	1038	1038
query77	701	438	444	438
query78	9821	9102	9770	9102
query79	3185	558	563	558
query80	1976	604	605	604
query81	602	256	259	256
query82	769	169	160	160
query83	355	217	216	216
query84	294	99	99	99
query85	1279	364	361	361
query86	346	323	322	322
query87	4459	4193	4204	4193
query88	4458	2508	2454	2454
query89	441	326	318	318
query90	1875	230	229	229
query91	150	132	125	125
query92	84	74	76	74
query93	4863	548	540	540
query94	737	333	333	333
query95	390	296	291	291
query96	622	284	281	281
query97	3214	3093	3079	3079
query98	251	234	230	230
query99	1664	1321	1298	1298
Total cold run time: 324796 ms
Total hot run time: 195793 ms

@doris-robot

ClickBench: Total hot run time: 31.74 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 243bb74544ef5a6570a8297e79bd5f23a25cd8f2, data reload: false

query1	0.05	0.04	0.05
query2	0.09	0.04	0.04
query3	0.23	0.05	0.06
query4	1.67	0.08	0.07
query5	0.49	0.48	0.49
query6	1.13	0.72	0.74
query7	0.02	0.02	0.02
query8	0.06	0.04	0.04
query9	0.54	0.48	0.48
query10	0.54	0.52	0.53
query11	0.15	0.12	0.12
query12	0.15	0.13	0.13
query13	0.61	0.59	0.60
query14	0.76	0.78	0.78
query15	0.86	0.82	0.82
query16	0.37	0.38	0.36
query17	0.99	1.00	0.95
query18	0.23	0.22	0.21
query19	1.83	1.80	1.84
query20	0.01	0.02	0.01
query21	15.42	0.66	0.66
query22	3.66	6.90	2.55
query23	18.24	1.39	1.28
query24	2.10	0.22	0.23
query25	0.15	0.09	0.09
query26	0.32	0.23	0.24
query27	0.46	0.24	0.23
query28	13.24	1.03	1.01
query29	12.66	3.32	3.34
query30	0.43	0.26	0.19
query31	2.80	0.41	0.39
query32	3.25	0.48	0.48
query33	3.00	2.94	3.00
query34	17.08	4.38	4.33
query35	4.40	4.43	4.42
query36	0.69	0.48	0.51
query37	0.21	0.17	0.17
query38	0.18	0.17	0.16
query39	0.06	0.05	0.05
query40	0.17	0.14	0.14
query41	0.11	0.07	0.06
query42	0.06	0.06	0.06
query43	0.06	0.05	0.06
Total cold run time: 109.53 s
Total hot run time: 31.74 s

@zhangstar333 zhangstar333 merged commit b2c87f4 into apache:master Aug 19, 2024
32 of 34 checks passed
dataroaring pushed a commit that referenced this pull request Oct 9, 2024
…#38759)

Mryange added a commit to Mryange/doris that referenced this pull request Nov 8, 2024
…apache#38759)

yiguolei pushed a commit that referenced this pull request Nov 10, 2024
#43510)

… (#38759)
#38759
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.8-merged dev/3.0.3-merged reviewed

7 participants