Example of a concrete Optimization Step by Step on a CPU

1 May 2016

When asked how to optimize code, many small steps swarm into my brain at once. It is usually hard for me to order them, because the answer depends heavily on the case.
So I decided to walk through an optimization step by step, to have something more concrete than general, detached rules.
The optimization process described here is tied to this specific code. The same steps can certainly be applied to other code, but be warned that it is nearly impossible to predict whether (and by how much) they would actually improve its speed.

The original code

I chose this specific code, which I used recently to demonstrate some performance points. I like it because it is a very simple code that parallelizes easily. It simulates 3D temperature diffusion with a finite difference approach. The discussion here is not about the efficiency of the algorithm, but about the optimization process.

#include <stdlib.h>
#include <stdio.h>
#include <ctime>
#include <chrono>
#include <cstring>
#ifdef __linux__ 
	#include 	<unistd.h>
	#include 	<sys/resource.h>
#endif


#define valueLevel0(x,y,z) (Te_in[sizeX*sizeY*(z)+sizeX*(y)+(x)])


#define localVariation(x,y,z) (1.f*(-lam)*( \
					-(valueLevel0(x+1,y,z)+valueLevel0(x-1,y,z))	 \
					-(valueLevel0(x,y+1,z)+valueLevel0(x,y-1,z))	 \
					-(valueLevel0(x,y,z+1)+valueLevel0(x,y,z-1))	 \
					+6*valueLevel0(x,y,z)	 \
					)*dt)


float compute(float * Te_out, float * Te_in, float * Cp, int sizeX, int sizeY, int sizeZ, float dx, float dy, float dz, float dt, float lam){

	float variation=0.f;
	for (int k = 1; k < sizeZ-1; ++k)
	{
		for (int j = 1; j < sizeY-1; ++j)
		{
			for (int i = 1; i < sizeX-1; ++i)
			{
				Te_out[sizeX*sizeY*(k)+sizeX*(j)+(i)]=valueLevel0(i,j,k)+localVariation(i,j,k);
				float localVariation= localVariation(i,j,k);
				if (variation < localVariation)
				{
					variation=localVariation;
				}
			}
		}
		}
	}
	return variation;

}

int main(int argc, char const *argv[])
{

	int nx=128;
	int ny=128;
	int nz=128;
	float dx=1;
	float dy=1;
	float dz=1;
	float lam=1.0f;
	float max_dt=dx*dx/lam/6.1;
	int rad=10;


	float *Te=(float*)malloc(nx * ny * nz * sizeof(float));
	float *Cp=(float*)malloc(nx * ny * nz * sizeof(float));
	float *Te_bis=(float*)malloc(nx * ny * nz * sizeof(float));

	memset(Te,0,nx * ny * nz * sizeof(float));
	memset(Cp,0,nx * ny * nz * sizeof(float));
	memset(Te_bis,0,nx * ny * nz * sizeof(float));

//init

	int centerX=0.5*(nx)*dx;
	int centerY=0.5*(ny)*dy;
	int centerZ=0.5*(nz)*dz;
	for (int k = 0; k < nz; ++k)
	{
		for (int j = 0; j < ny; ++j)
		{
			for (int i = 0; i < nx; ++i)
			{
				int positionX=i*dx-centerX;
				int positionY=j*dy-centerY;
				int positionZ=k*dz-centerZ;
				if(positionX*positionX + positionY*positionY + positionZ*positionZ < rad*rad) {
					Te[i+nx*j+k*nx*ny]=10;
				}
			}
		}
	}
	// end init

	//computation
	float totaltime=0.f;
	std::clock_t start,stop;
	int facteur=1;  // after optimization the measured time may become too small, so we average over a bigger run
	auto t_start = std::chrono::high_resolution_clock::now();
	{

		float dt = max_dt;
		float variation=0.f;
		float oldVariation=0.f;
		double deltaVariation=1/0.;
		for (int i = 0; i < facteur*5000/2; ++i)
		{
			compute(Te_bis,Te,Cp,nx,ny,nz,dx,dy,dz,dt,lam);
			variation=compute(Te,Te_bis,Cp,nx,ny,nz,dx,dy,dz,dt,lam);
		}
	}
	auto t_end = std::chrono::high_resolution_clock::now();
	totaltime = std::chrono::duration<double, std::milli>(t_end-t_start).count();
	printf("time = %.3f s \n",totaltime/facteur/1000. );

	return 0;
}

First analysis

Memory

As we can see, the memory is managed in two big arrays, and memory positions are computed at each access. In general, a contiguous memory layout gives good performance.
Another point to discuss is the size of the problem, here 128 x 128 x 128 float cells. Each array needs 8 MB, which is the same order of magnitude as the processor's L3 cache, so the problem stays small.

Possible points of speedup

From a quick overview of the code, it looks possible to parallelize the for-loops and also to vectorize some parts.

Speed test

To check the speed-up of each improvement, I will use two computers.

1. A MacBook Pro
OS X, i7 2.3 GHz, 4 physical cores, L3: 6 MB, memory: 16 GB
2. A Bi-Xeon
Linux: Ubuntu 14.04, 2x Xeon E5-2680 v2 @ 2.80 GHz, 2 x 10 physical cores, L3: 25 MB, memory: 64 GB

As noted above, the problem is small enough to fit within the L3 cache of the Bi-Xeon.

First compilation and first run

icpc base.cpp -std=c++11 -o base

With some optimization flags it should look better!

icpc base.cpp -O3 -std=c++11 -o base_O3

The results are the same! Why? Because this compiler already optimizes by default!

What if the compilation targets the specific hardware?

icpc base.cpp -xhost -O3 -std=c++11 -o base_O3_xHost

Optimization

First step

The first idea I have when I see this code is to inline the "compute" function.

icpc inline.cpp -xhost -O3 -std=c++11 -o inline_O3_xHost

These good results are easily explained: with the -xhost flag, the compiler also uses AVX instructions.

Time to parallelize!

After this first optimization, let's try parallelization! I start by parallelizing the most heavily used part, introducing two new OpenMP lines (before line 26).

	#pragma omp parallel default(none) firstprivate(Te_out,Te_in,Cp,sizeX,sizeY,sizeZ,dx,dy,dz,dt,lam) reduction(max:variation)
	#pragma omp for collapse(3) schedule(static)

I use schedule(static) because in this situation the computation is unconditional, so every cell takes the same time to compute.

The overall function looks like

inline float compute(float * Te_out, float * Te_in, float * Cp, int sizeX, int sizeY, int sizeZ, float dx, float dy, float dz, float dt, float lam){

	float variation=0.f;
	#pragma omp parallel default(none) firstprivate(Te_out,Te_in,Cp,sizeX,sizeY,sizeZ,dx,dy,dz,dt,lam) reduction(max:variation)
	#pragma omp for collapse(3) schedule(static)
	for (int k = 1; k < sizeZ-1; ++k)
	{
		for (int j = 1; j < sizeY-1; ++j)
		{
			for (int i = 1; i < sizeX-1; ++i)
			{
				Te_out[sizeX*sizeY*(k)+sizeX*(j)+(i)]=valueLevel0(i,j,k)+localVariation(i,j,k);
				float localVariation= localVariation(i,j,k);
				if (variation < localVariation)
				{
					variation=localVariation;
				}
			}	
		}
	}
	return variation;

}
icpc inline_OMP_V1.cpp -xhost -O3 -fopenmp -std=c++11 -o inline_OMP_V1_O3_xHost

Definitely something is going badly. On the Mac, with 8 threads it's more than 5 times slower than with a single thread. And on the Bi-Xeon, 40 threads give only a 15% improvement.
Optimization efforts often start like that! Especially with parallelization!

After some consideration, it may be better to collapse only the first two for-loops. Maybe the collapse clause hurts how the memory is accessed.
Let's try the new version!

icpc inline_OMP_V2.cpp -xhost -O3 -fopenmp -std=c++11 -o inline_OMP_V2_O3_xHost

It's much better! But something strange happens with the Bi-Xeon.

Vectorization

Why are the results better now? The compiler is probably vectorizing the innermost loop. That kind of vectorization improves both computation and memory access. So let's check whether we can help the compiler by making this vectorization of the innermost loop explicit in our code.

inline float compute(float * Te_out, float * Te_in, float * Cp, int sizeX, int sizeY, int sizeZ, float dx, float dy, float dz, float dt, float lam){

	float variation=0.f;
	#pragma omp parallel default(none) firstprivate(Te_out,Te_in,Cp,sizeX,sizeY,sizeZ,dx,dy,dz,dt,lam) reduction(max:variation)
	#pragma omp for collapse(2) schedule(static)
	for (int k = 1; k < sizeZ-1; ++k)
	{
		for (int j = 1; j < sizeY-1; ++j)
		{
			#pragma omp simd
			for (int i = 1; i < sizeX-1; ++i)
			{
				Te_out[sizeX*sizeY*(k)+sizeX*(j)+(i)]=valueLevel0(i,j,k)+localVariation(i,j,k);
				float localVariation= localVariation(i,j,k);
				if (variation < localVariation)
				{
					variation=localVariation;
				}
			}	
		}
	}
	return variation;

}
icpc inline_OMP_V3.cpp -xhost -O3 -fopenmp -std=c++11 -o inline_OMP_V3_O3_xHost

Same time! As I said before, the compiler already did it automatically, but it's good practice to specify it anyway. Some compilers are much less clever!

Variation in the results! Strange, no?

The run time varies, but only on the Bi-Xeon, not on a single processor! It's clearly a sign that something is wrong with NUMA. To fix that problem, let's initialize the memory the same way we use it afterwards (first-touch placement). To do that, I add a new part to the data initialization and remove the old one.

	#pragma omp parallel default(none) firstprivate(Te,Te_bis,Cp,nx,ny,nz)
	#pragma omp for collapse(2) schedule(static) nowait
	for (int k = 0; k < nz; ++k)
	{
		for (int j = 0; j < ny; ++j)
		{
			#pragma omp simd
			for (int i = 0; i < nx; ++i)
			{
				Te[i+nx*j+k*nx*ny]=0.f;
				Cp[i+nx*j+k*nx*ny]=0.f;
				Te_bis[i+nx*j+k*nx*ny]=0.f;
			}

		}
	}
icpc inline_OMP_V4.cpp -xhost -O3 -fopenmp -std=c++11 -o inline_OMP_V4_O3_xHost

It looks a little better, but the time still varies. Is the memory stored in the right RAM? Computing one cell needs its neighbors, so it's probably the assignment of threads to chips that has the biggest impact.
To get rid of the time variations, we have to pin this assignment. That could be done by setting some environment variables, but I prefer to specify it at compilation. (I find it better, particularly if someone else uses the program afterwards.)

icpc inline_OMP_V4.cpp -xhost -O3 -fopenmp -std=c++11 -par-affinity=compact -o inline_OMP_V4_O3_xHost_affinity

It's much better!

The Bi-Xeon is much faster because we have 40 threads vs. 8 threads.

The current speed-up

Let's try to improve a little bit more!

What if we slightly change the algorithm?

For further performance optimization, we can check the optimization report.
With the Intel compiler it's easy to print it directly to the console by adding the following flags.

-opt-report -opt-report-file=stdout

In the report we find "Preprocess Loopnests: Moving Out Store [ inline_OMP_V4.cpp(37,5) ]", which means that the test was moved out of the loop. That makes sense, because a branch cannot really be vectorized.
But do we really need this test? It tells us the speed of the convergence, but maybe it is more interesting to compute it separately, only when we actually need that information! So let's remove it!

inline float compute(float * Te_out, float * Te_in, float * Cp, int sizeX, int sizeY, int sizeZ, float dx, float dy, float dz, float dt, float lam){

	float variation=0.f;
	#pragma omp parallel default(none) firstprivate(Te_out,Te_in,Cp,sizeX,sizeY,sizeZ,dx,dy,dz,dt,lam) //reduction(max:variation)
	#pragma omp for collapse(2) schedule(static) nowait
	for (int k = 1; k < sizeZ-1; ++k)
	{
		for (int j = 1; j < sizeY-1; ++j)
		{
			#pragma omp simd
			for (int i = 1; i < sizeX-1; ++i)
			{
				Te_out[sizeX*sizeY*(k)+sizeX*(j)+(i)]=valueLevel0(i,j,k)+localVariation(i,j,k);
				/*float localVariation= localVariation(i,j,k);
				if (variation<localVariation)
				{
					variation=localVariation;
				}*/
			}	
		}
	}
	return variation;

}

And compile it.

icpc inline_OMP_V5.cpp -xhost -O3 -fopenmp -std=c++11 -par-affinity=compact -o inline_OMP_V5_O3_xHost_affinity

Try Cilk Plus

For that, just rewrite the function with the Cilk Plus array-notation syntax.

#ifdef __INTEL_COMPILER 
	#define valueLevel0(x,y,z) (Te_in[sizeX*sizeY*(z)+sizeX*(y)+(x):sizeX-2])
#else
	#define valueLevel0(x,y,z) (Te_in[sizeX*sizeY*(z)+sizeX*(y)+(x)])
#endif


#define localVariation(x,y,z) (1.f*(-lam)*( \
					-(valueLevel0(x+1,y,z)+valueLevel0(x-1,y,z))	 \
					-(valueLevel0(x,y+1,z)+valueLevel0(x,y-1,z))	 \
					-(valueLevel0(x,y,z+1)+valueLevel0(x,y,z-1))	 \
					+6*valueLevel0(x,y,z)	 \
					)*dt)


inline float compute(float * Te_out, float * Te_in, float * Cp, int sizeX, int sizeY, int sizeZ, float dx, float dy, float dz, float dt, float lam){

	float variation=0.f;
	#pragma omp parallel default(none) firstprivate(Te_out,Te_in,Cp,sizeX,sizeY,sizeZ,dx,dy,dz,dt,lam) //reduction(max:variation)
	#pragma omp for collapse(2) schedule(static) nowait
	for (int k = 1; k < sizeZ-1; ++k)
	{
		for (int j = 1; j < sizeY-1; ++j)
		{
			#ifdef __INTEL_COMPILER 
				Te_out[1+sizeX*j+k*sizeX*sizeY:sizeX-2]=valueLevel0(1,j,k)+localVariation(1,j,k);
			#else
			#pragma omp simd
				for (int i = 1; i < sizeX-1; ++i)
				{
					Te_out[sizeX*sizeY*(k)+sizeX*(j)+(i)]=valueLevel0(i,j,k)+localVariation(i,j,k);
					/*float localVariation= localVariation(i,j,k);
					if (variation < localVariation)
					{
						variation=localVariation;
					}*/
				}		
			#endif
		}
	}
	return variation;

}

Now compile it!

icpc inline_OMP_V6.cpp -xhost -O3 -fopenmp -std=c++11 -par-affinity=compact -o inline_OMP_V6_O3_xHost_affinity

I don't know the real origin of this improvement; I am still investigating this point!

Conclusion

Finally, the optimization steps that bring the most significant improvements differ for each piece of hardware! And that makes sense! So always think about the hardware to decide how deep to go in the optimization process.