How to solve a pthread + printf performance problem
I am testing Linux C code with large arrays to measure thread performance. The application scales well as the thread count grows up to the number of cores (8 for an Intel 4770), but only for the pure-math part of my code.
If I add a printf loop over the result arrays, the run time explodes from a few seconds to several minutes, even when output is redirected to a file; printing these arrays should only add a few seconds.
Code:
(gcc 7.5.0, Ubuntu 18.04)
Without the printf loop:
gcc -O3 -m64 exp_multi.c -pthread -lm
With the printf loop:
gcc -DPRINT_ARRAY -O3 -m64 exp_multi.c -pthread -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 8
static double xv[MAXSIZE];
static double yv[MAXSIZE];
/* gcc -O3 -m64 exp_multi.c -pthread -lm */
void* run(void *received_Val){
    int single_val = *((int *) received_Val);
    int r;
    int i;
    double p;
    for (r = 0; r < REIT; r++) {
        p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
        for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i++) {
            xv[i] = p;
            yv[i] = exp(p);
            p += 0.00001;
        }
    }
    return 0;
}

int main(){
    int i;
    pthread_t tid[num_threads];
    for (i = 0; i < num_threads; i++) {
        int *arg = malloc(sizeof(*arg));
        if ( arg == NULL ) {
            fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
            exit(1);
        }
        *arg = i;
        pthread_create(&(tid[i]), NULL, run, arg);
    }
    for (i = 0; i < num_threads; i++) {
        pthread_join(tid[i], NULL);
    }
#ifdef PRINT_ARRAY
    for (i = 0; i < MAXSIZE; i++) {
        printf("x=%.20lf,e^x=%.20lf\n", xv[i], yv[i]);
    }
#endif
    return 0;
}
The malloc of the integer passed as the last argument to pthread_create follows the advice in this post.
I tried clang with no success. I also tried adding a free(tid) instruction, avoiding the malloc instruction, reversing the loops, using only one one-dimensional array, and a single-threaded version without pthreads.
EDIT2: I assumed the exp function was the processor-intensive part, possibly limited by the per-core caches or SIMD resources of the processor generation. The sample code below is based on permissively licensed code posted on Stack Overflow.
This code runs fast with or without the printf loop, and math.h's exp has been improved over the last few years: at least on an Intel 4770 (Haswell), it is around 40x faster than the well-known test code for the SSE2 math library would suggest. math.h's exp should now be close in speed to an AVX2 algorithm optimized for float and 8-wide parallel computation.
Test results: expf versus the other SSE2 algorithms (exp_ps):
sinf .. -> 55.5 millions of vector evaluations/second -> 12 cycles/value
cosf .. -> 57.3 millions of vector evaluations/second -> 11 cycles/value
sincos (x87) .. -> 9.1 millions of vector evaluations/second -> 71 cycles/value
expf .. -> 61.4 millions of vector evaluations/second -> 11 cycles/value
logf .. -> 55.6 millions of vector evaluations/second -> 12 cycles/value
cephes_sinf .. -> 52.5 millions of vector evaluations/second -> 12 cycles/value
cephes_cosf .. -> 41.9 millions of vector evaluations/second -> 15 cycles/value
cephes_expf .. -> 18.3 millions of vector evaluations/second -> 35 cycles/value
cephes_logf .. -> 20.2 millions of vector evaluations/second -> 32 cycles/value
sin_ps .. -> 54.1 millions of vector evaluations/second -> 12 cycles/value
cos_ps .. -> 54.8 millions of vector evaluations/second -> 12 cycles/value
sincos_ps .. -> 54.6 millions of vector evaluations/second -> 12 cycles/value
exp_ps .. -> 42.6 millions of vector evaluations/second -> 15 cycles/value
log_ps .. -> 41.0 millions of vector evaluations/second -> 16 cycles/value
/* Performance test exp(x) algorithm
   based on AVX implementation of Giovanni Garberoglio
   Copyright (C) 2020 Antonio R.

   AVX implementation of exp:
   Modified code from this source: https://github.com/reyoung/avx_mathfun
   Based on "sse_mathfun.h", by Julien Pommier
   http://gruntthepeon.free.fr/ssemath/
   Copyright (C) 2012 Giovanni Garberoglio
   Interdisciplinary Laboratory for Computational Science (LISC)
   Fondazione Bruno Kessler and University of Trento
   via Sommarive, 18
   I-38123 Trento (Italy)

   This software is provided 'as-is', without any express or implied
   warranty. In no event will the authors be held liable for any damages
   arising from the use of this software.
   Permission is granted to anyone to use this software for any purpose,
   including commercial applications, and to alter it and redistribute it
   freely, subject to the following restrictions:
   1. The origin of this software must not be misrepresented; you must not
      claim that you wrote the original software. If you use this software
      in a product, an acknowledgment in the product documentation would be
      appreciated but is not required.
   2. Altered source versions must be plainly marked as such, and must not be
      misrepresented as being the original software.
   3. This notice may not be removed or altered from any source distribution.
   (this is the zlib license)
*/
/* gcc -O3 -m64 -Wall -mavx2 -march=haswell expc.c -lm */
#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
__m256 exp256_ps(__m256 x) {
    /*
      To increase compatibility across different compilers, the original code was
      converted to plain AVX2 intrinsics code without ingenious macros, gcc-style
      alignment attributes etc. Moreover, the part "express exp(x) as exp(g + n*log(2))"
      has been significantly simplified. This modified code is not thoroughly tested!
    */
    __m256 exp_hi        = _mm256_set1_ps(88.3762626647949f);
    __m256 exp_lo        = _mm256_set1_ps(-88.3762626647949f);
    __m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
    __m256 inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);
    __m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
    __m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
    __m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
    __m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
    __m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
    __m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
    __m256 fx;
    __m256i imm0;
    __m256 one = _mm256_set1_ps(1.0f);

    x = _mm256_min_ps(x, exp_hi);
    x = _mm256_max_ps(x, exp_lo);

    /* express exp(x) as exp(g + n*log(2)) */
    fx = _mm256_mul_ps(x, cephes_LOG2EF);
    fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
    x = _mm256_sub_ps(x, z);
    z = _mm256_mul_ps(x, x);

    /* Horner evaluation of the Cephes polynomial */
    __m256 y = cephes_exp_p0;
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p1);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p2);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p3);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p4);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p5);
    y = _mm256_mul_ps(y, z);
    y = _mm256_add_ps(y, x);
    y = _mm256_add_ps(y, one);

    /* build 2^n */
    imm0 = _mm256_cvttps_epi32(fx);
    imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
    imm0 = _mm256_slli_epi32(imm0, 23);
    __m256 pow2n = _mm256_castsi256_ps(imm0);
    y = _mm256_mul_ps(y, pow2n);
    return y;
}
int main(){
    int r;
    int i;
    float p;
    /* 32-byte alignment is required by _mm256_store_ps */
    static float xv[MAXSIZE] __attribute__((aligned(32)));
    static float yv[MAXSIZE] __attribute__((aligned(32)));
    float *xp;
    float *yp;
    for (r = 0; r < REIT; r++) {
        p = XXX;
        xp = xv;
        yp = yv;
        for (i = 0; i < MAXSIZE; i += 8) {
            __m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003,
                                      p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
            __m256 y = exp256_ps(x);
            _mm256_store_ps(xp, x);
            _mm256_store_ps(yp, y);
            xp += 8;
            yp += 8;
            p += 0.00008;
        }
    }
    for (i = 0; i < MAXSIZE; i++) {
        printf("x=%.20f,e^x=%.20f\n", xv[i], yv[i]);
    }
    return 0;
}
For comparison, here is a code sample using expf(x) from the math library, single-threaded and with float:
#include <stdio.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
/* gcc -O3 -m64 exp_st.c -lm */
int main(){
    int r;
    int i;
    float p;
    static float xv[MAXSIZE];
    static float yv[MAXSIZE];
    for (r = 0; r < REIT; r++) {
        p = XXX;
        for (i = 0; i < MAXSIZE; i++) {
            xv[i] = p;
            yv[i] = expf(p);
            p += 0.00001;
        }
    }
    for (i = 0; i < MAXSIZE; i++) {
        printf("x=%.20f,e^x=%.20f\n", xv[i], yv[i]);
    }
    return 0;
}
SOLUTION: As Andreas Wenzel pointed out, the gcc compiler is smart enough to decide that it does not actually need to write the results to the arrays: it optimizes those writes away. After new performance tests based on this information (and after correcting some mistakes and wrong assumptions I had made earlier), the results are much clearer: both exp(double arg) and expf(float arg) have improved, with expf about 2x faster than exp, but neither is a fast AVX2 algorithm (8 parallel float args), which is about 6x faster than the SSE2 algorithm (4 parallel float args). Here are the results; they scale as expected for an Intel Hyper-Threading CPU, except for the SSE2 algorithm:

exp (double arg), single thread: 18 min 46 s
exp (double arg), 4 threads: 5 min 4 s
exp (double arg), 8 threads: 4 min 28 s
expf (float arg), single thread: 7 min 32 s
expf (float arg), 4 threads: 1 min 58 s
expf (float arg), 8 threads: 1 min 41 s

Relative error:
i x y = expf(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.518329620e+00 y = 8.059411496e-02 exp_dbl = 8.059411715e-02 rel_err =-2.716802480e-08
i = 372000 x =-1.278343201e+00 y = 2.784983218e-01 exp_dbl = 2.784983343e-01 rel_err =-4.490403948e-08
i = 496000 x =-3.867173195e-02 y = 9.620664716e-01 exp_dbl = 9.620664730e-01 rel_err =-1.481617428e-09
i = 620000 x = 1.201261759e+00 y = 3.324308872e+00 exp_dbl = 3.324308753e+00 rel_err = 3.571995830e-08
i = 744000 x = 2.441616058e+00 y = 1.149159718e+01 exp_dbl = 1.149159684e+01 rel_err = 2.955980805e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.921588898e+00 y = 1.372204742e+02 exp_dbl = 1.372204694e+02 rel_err = 3.563072184e-08
With the SSE2 algorithm from the link*, speed increases x6.8 from one thread to 8 threads. My performance-test code uses a typedef union with aligned(16) for the vectors/4-float arrays passed to that library, rather than aligning the whole float arrays, and this may cost some performance. At least for the other AVX2 code, the multithreaded improvement also looks reasonable for Intel Hyper-Threading, but at a lower rate: times improve by between x2.5 and x1.5. Maybe the SSE2 code could be sped up with better array alignment, which I was unable to improve:

exp_ps (4 parallel float args), single thread: 12 min 7 s
exp_ps (4 parallel float args), 4 threads: 3 min 10 s
exp_ps (4 parallel float args), 8 threads: 1 min 46 s

Relative error:
i x y = exp_ps(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.518329620e+00 y = 8.059412241e-02 exp_dbl = 8.059411715e-02 rel_err = 6.527768787e-08
i = 372000 x =-1.278343201e+00 y = 2.784983218e-01 exp_dbl = 2.784983343e-01 rel_err =-4.490403948e-08
i = 496000 x =-3.977407143e-02 y = 9.610065222e-01 exp_dbl = 9.610065335e-01 rel_err =-1.174323454e-08
i = 620000 x = 1.200158238e+00 y = 3.320642233e+00 exp_dbl = 3.320642334e+00 rel_err =-3.054731957e-08
i = 744000 x = 2.441616058e+00 y = 1.149159622e+01 exp_dbl = 1.149159684e+01 rel_err =-5.342903415e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.921588898e+00 y = 1.372204742e+02 exp_dbl = 1.372204694e+02 rel_err = 3.563072184e-08
AVX2 algorithm (8 parallel float args), single thread: 1 min 45 s
AVX2 algorithm (8 parallel float args), 4 threads: 28 s
AVX2 algorithm (8 parallel float args), 8 threads: 27 s

Relative error:
i x y = exp256_ps(x) double precision exp relative error
i = 0 x =-5.000000000e+00 y = 6.737946998e-03 exp_dbl = 6.737946999e-03 rel_err =-1.124224480e-10
i = 124000 x =-3.758316040e+00 y = 2.332298271e-02 exp_dbl = 2.332298229e-02 rel_err = 1.803005727e-08
i = 248000 x =-2.516632080e+00 y = 8.073104918e-02 exp_dbl = 8.073104510e-02 rel_err = 5.057888540e-08
i = 372000 x =-1.279417157e+00 y = 2.781994045e-01 exp_dbl = 2.781993997e-01 rel_err = 1.705288467e-08
i = 496000 x =-3.954863176e-02 y = 9.612231851e-01 exp_dbl = 9.612232069e-01 rel_err =-2.269774967e-08
i = 620000 x = 1.199879169e+00 y = 3.319715738e+00 exp_dbl = 3.319715775e+00 rel_err =-1.119642824e-08
i = 744000 x = 2.440370798e+00 y = 1.147729492e+01 exp_dbl = 1.147729571e+01 rel_err =-6.896860199e-08
i = 868000 x = 3.681602478e+00 y = 3.970997620e+01 exp_dbl = 3.970997748e+01 rel_err =-3.232306688e-08
i = 992000 x = 4.923286438e+00 y = 1.374535980e+02 exp_dbl = 1.374536045e+02 rel_err =-4.676466368e-08
Multithreaded AVX2 algorithm source code:
/* Performance test of a multithreaded exp(x) algorithm
   based on AVX implementation of Giovanni Garberoglio
   Copyright (C) 2020 Antonio R.

   AVX implementation of exp:
   Modified code from this source: https://github.com/reyoung/avx_mathfun
   Based on "sse_mathfun.h", by Julien Pommier
   http://gruntthepeon.free.fr/ssemath/
   Copyright (C) 2012 Giovanni Garberoglio
   Interdisciplinary Laboratory for Computational Science (LISC)
   Fondazione Bruno Kessler and University of Trento
   via Sommarive, 18
   I-38123 Trento (Italy)

   This software is provided 'as-is', without any express or implied
   warranty. In no event will the authors be held liable for any damages
   arising from the use of this software.
   Permission is granted to anyone to use this software for any purpose,
   including commercial applications, and to alter it and redistribute it
   freely, subject to the following restrictions:
   1. The origin of this software must not be misrepresented; you must not
      claim that you wrote the original software. If you use this software
      in a product, an acknowledgment in the product documentation would be
      appreciated but is not required.
   2. Altered source versions must be plainly marked as such, and must not be
      misrepresented as being the original software.
   3. This notice may not be removed or altered from any source distribution.
   (this is the zlib license)
*/
/* gcc -O3 -m64 -mavx2 -march=haswell expc_multi.c -pthread -lm */
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 4
/* 32-byte alignment so the aligned _mm256_store_ps can be used on these
   arrays; each thread's slice starts at a multiple of 32 bytes */
typedef float FLOAT32[MAXSIZE] __attribute__((aligned(32)));
static FLOAT32 xv;
static FLOAT32 yv;
__m256 exp256_ps(__m256 x) {
    /*
      To increase compatibility across different compilers, the original code was
      converted to plain AVX2 intrinsics code without ingenious macros, gcc-style
      alignment attributes etc. Moreover, the part "express exp(x) as exp(g + n*log(2))"
      has been significantly simplified. This modified code is not thoroughly tested!
    */
    __m256 exp_hi        = _mm256_set1_ps(88.3762626647949f);
    __m256 exp_lo        = _mm256_set1_ps(-88.3762626647949f);
    __m256 cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
    __m256 inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);
    __m256 cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
    __m256 cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
    __m256 cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
    __m256 cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
    __m256 cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
    __m256 cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
    __m256 fx;
    __m256i imm0;
    __m256 one = _mm256_set1_ps(1.0f);

    x = _mm256_min_ps(x, exp_hi);
    x = _mm256_max_ps(x, exp_lo);

    /* express exp(x) as exp(g + n*log(2)) */
    fx = _mm256_mul_ps(x, cephes_LOG2EF);
    fx = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m256 z = _mm256_mul_ps(fx, inv_LOG2EF);
    x = _mm256_sub_ps(x, z);
    z = _mm256_mul_ps(x, x);

    /* Horner evaluation of the Cephes polynomial */
    __m256 y = cephes_exp_p0;
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p1);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p2);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p3);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p4);
    y = _mm256_mul_ps(y, x);
    y = _mm256_add_ps(y, cephes_exp_p5);
    y = _mm256_mul_ps(y, z);
    y = _mm256_add_ps(y, x);
    y = _mm256_add_ps(y, one);

    /* build 2^n */
    imm0 = _mm256_cvttps_epi32(fx);
    imm0 = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
    imm0 = _mm256_slli_epi32(imm0, 23);
    __m256 pow2n = _mm256_castsi256_ps(imm0);
    y = _mm256_mul_ps(y, pow2n);
    return y;
}
void* run(void *received_Val){
    int single_val = *((int *) received_Val);
    int r;
    int i;
    float p;
    float *xp;
    float *yp;
    for (r = 0; r < REIT; r++) {
        p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
        xp = xv + single_val*MAXSIZE/num_threads;
        yp = yv + single_val*MAXSIZE/num_threads;
        for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i += 8) {
            __m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003,
                                      p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
            __m256 y = exp256_ps(x);
            _mm256_store_ps(xp, x);
            _mm256_store_ps(yp, y);
            xp += 8;
            yp += 8;
            p += 0.00008;
        }
    }
    return 0;
}
int main(){
    int i;
    pthread_t tid[num_threads];
    for (i = 0; i < num_threads; i++) {
        int *arg = malloc(sizeof(*arg));
        if ( arg == NULL ) {
            fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
            exit(1);
        }
        *arg = i;
        pthread_create(&(tid[i]), NULL, run, arg);
    }
    for (i = 0; i < num_threads; i++) {
        pthread_join(tid[i], NULL);
    }
    for (i = 0; i < MAXSIZE; i++) {
        printf("x=%.20f,e^x=%.20f\n", xv[i], yv[i]);
    }
    return 0;
}
Chart overview: Julien Pommier's exp(double arg) without the printf loop is not real performance: as Andreas Wenzel found, gcc does not compute exp(x) when the results are not printed, and even the float version is far slower than the chart suggests, because only a few assembly instructions are executed. The chart may still be useful for assembly algorithms that only exercise low-level CPU caches/registers. expf(float arg) shows the real performance, i.e. with the printf loop. The AVX2 algorithm has the best performance.
ANSWER
When you don't print the arrays at the end of the program, the gcc compiler is smart enough to realize that the results of the calculations have no observable effect. Therefore, the compiler decides that it is not necessary to actually write the results to the arrays, because those results are never used. Instead, the compiler optimizes these writes away.
Also, when you aren't printing the results, the library function exp has no observable side effects either, provided it is not called with an input so high that it causes a floating-point overflow (which would make the function raise a floating point exception). This allows the compiler to optimize those function calls away, too.
As you can see in the assembly instructions emitted by the gcc compiler for your code which does not print the results, the compiled program does not call the function exp unconditionally. Instead, it tests whether the input to exp is larger than 7.09e2 (to make sure that no overflow will occur). Only if an overflow would occur does the program jump to the code that actually calls exp. Here is the relevant assembly code:

    ucomisd xmm1, xmm3
    jnb     .L9

In the assembly code above, the CPU register xmm3 contains the double-precision floating-point value 7.09e2. If the input were larger than this constant, the function exp would cause a floating-point overflow, because the result cannot be represented as a double-precision floating-point value.
Since your input is always valid and low enough not to cause a floating-point overflow, your program will never take this jump, so it never actually calls the function exp.
This explains why your code is so much faster when you don't print the results: the compiler determines that the calculations have no observable effect, so it optimizes them away.
Therefore, if you want the compiler to actually perform the calculations, you must make sure they have some observable effect. This doesn't mean that you must actually print all of the results (which are several megabytes in size). It is sufficient to print a single line that depends on all of the results (for example the sum of all results).
However, if you replace the call to the library function exp with a call to some other custom function, the compiler is not smart enough to realize that the function call has no observable effects, so it is unable to optimize the call away, even if you don't print the results of the calculations.
For the reasons stated above, if you want to compare the performance of two functions, you must make sure that the calculations actually take place, by ensuring that the results have an observable effect. Otherwise, you run the risk of the compiler optimizing away at least some of the calculations, and the comparison would be unfair.
Also, I don't believe this has much to do with pthread, because your code appears to call printf only after joining the threads. Rather, the poor performance when printing is probably due to the cache misses incurred by having to read from both the xv and yv arrays in every iteration of the print loop.