Dear users,
What is your experience with Elk and linear algebra libraries?
One can choose from MKL, built-in BLAS and LAPACK, OpenBLAS and others.
In my experience, Elk 4.3.6 and 5.2.14 compiled with GCC 4.9.2 and 7.3.0 (respectively) and the built-in libs run faster than their MKL plus Intel 15 counterparts.
Now I have Elk 6.2.8 with OpenBLAS, compiled with GCC 7.4.0, which seems to run a bit slower than the 5.2.14 build. I will try it with the built-in libs - are those supposed to be the fastest?
So, which are your favorite schemes?
Andrew
Edit: Also, which one is more critical for Elk performance - FFT or linear algebra? The FFT is used heavily, so it must be important.
Last edit: Andrew Shyichuk 2019-08-16
-
Hi Andrew,
I'm currently writing a HowTo on compiling Elk optimally. It should be ready in a few days.
Briefly however, we've found that MKL is fastest on Intel hardware but OpenBLAS and BLIS are faster on AMD hardware.
You can also compile Elk with FFTW (either through MKL or an external FFTW library) or use MKL's native Fourier transform. BLAS and LAPACK are more critical for performance than the FFT library.
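In make.inc this amounts to selecting the FFT interface source file and library, roughly along these lines (check the example make.inc files shipped with your version of the code for the exact file and variable names):
SRC_FFT = zfftifc_fftw.f90
LIB_FFT = -lfftw3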
Regards,
Kay.
-
Dear Kay,
Since my original post, I did (and keep doing) some benchmarks which I am going to post here.
I've come to the same conclusions: MKL brings improvements on a Xeon cluster, while the effect of the FFT library is inconclusive (apparently not that important). Also, 6.2.8 is indeed faster.
However, playing around with the OMP environment variables, I've found that a properly set OMP_MAX_ACTIVE_LEVELS changes the MKL improvement from noticeable to impressive - i.e. 2x faster than the built-in libs. It seems that OMP_DYNAMIC=TRUE and OMP_NESTED=TRUE are a must.
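For reference, the relevant part of my job script looks roughly like this (the OMP_MAX_ACTIVE_LEVELS value is only an example - I am still varying it):
export OMP_DYNAMIC=TRUE
export OMP_NESTED=TRUE
export OMP_MAX_ACTIVE_LEVELS=2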
The performance of a particular library apparently depends on the way it spawns threads (duh!) - so it would be nice if your HowTo covered that as well (or it could be another HowTo).
Elk does a great job of using threads, maybe even better than some external libs. The question, basically, is: having X k-points, which of these is better:
- 1 MPI node, X threads
- 1 node, X+Y threads
- N nodes, X/N threads per node
- N nodes, X/N + Y threads per node
- X nodes, 1,2,3 thread(s) per node (my gut tells me this one is the best)
- X+Y nodes, 1,2,3 thread(s) per node
What should be the values of Y, N?
What would be the optimal OMP settings, and do they depend on the particular libraries used (and if so, how)? Thank you.
Best regards.
Andrew
-
Hi Andrew,
That's quite a complicated question and it depends on the type of task being run.
I usually run with OMP_NUM_THREADS equal to the number of cores and on as many nodes as are available with
mpirun --mca btl openib,self,vader -bind-to none -pernode -np $NODES -x OMP_NUM_THREADS=$NCORES $ELK
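(Here -pernode starts one MPI process per node, -bind-to none leaves the OpenMP threads free to use all the cores of a node, -x exports OMP_NUM_THREADS to the remote processes, and the btl list selects the InfiniBand, loopback and shared-memory transports; that last option is specific to our Open MPI/InfiniBand setup.)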
Elk chooses how many OpenMP threads to use on each loop with the subroutines 'holdthd' and 'freethd' in the module modomp. Because of this, Elk automatically sets OMP_DYNAMIC=false and OMP_NESTED=true. You can change the number of nesting levels in elk.in with
maxlvl
 4
In earlier versions of the code the number of threads was determined by the compiler, but we found that this generally oversubscribed threads, so now Elk handles it entirely.
You should also set
export OPENBLAS_NUM_THREADS=1
or
export BLIS_NUM_THREADS=1
because there doesn't seem to be any advantage to using multi-threaded BLAS in Elk.
MKL LAPACK is multithreaded and Elk uses up to 8 threads for this.
You can also set
export OMP_PROC_BIND=true
export OMP_PLACES="{0:4}:16:4"
which seems to be best for our 32 core AMD machine.
Lastly, you can set the number of threads used at the first nesting level with
maxthd1
 2
for example. All the remaining threads are used at deeper nesting levels.
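Putting this together, a job script for one of our 32 core nodes looks roughly like this (the core counts, the places string and the MPI options are examples only and should be adjusted to your own hardware):
NCORES=32
export OMP_NUM_THREADS=$NCORES
export OPENBLAS_NUM_THREADS=1
export OMP_PROC_BIND=true
export OMP_PLACES="{0:4}:16:4"
mpirun --mca btl openib,self,vader -bind-to none -pernode -np $NODES -x OMP_NUM_THREADS=$NCORES $ELK
together with the elk.in settings (maxlvl, maxthd1) mentioned above.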
-
Also, you should be aware that the scaling with number of cores is not linear and sometimes regressive.
We recently built a test Beowulf cluster consisting of two 32 core Threadripper 2990WX machines, connected with 40Gb Infiniband cards and a switch bought second-hand off eBay. Unfortunately, for TDDFT runs (one of our major interests at the moment) the scaling with the number of cores is not great. We measured the run time on a single machine for a 6 atom slab of CoPt using TDDFT for a few time steps.
We get linear scaling up to 4 cores, then diminishing improvements up to 12 cores, and then worse performance. The reason is that this task is memory-bound. The 2990WX has only four memory channels for 32 cores, of which only 16 are directly connected to memory. All the cores accessing memory at once is too much for the memory bus to handle. Unfortunately, there's not much that can be done about this: each k-point requires a few GB of memory, which far exceeds the capacity of the level 3 cache.
It would have been better to have bought twice the number of machines with half the number of cores. However, we've ordered a mini cluster with the new Epyc Rome chips instead, which each have 8 memory channels and twice the amount of cache. We'll set this up ourselves with the aim of maximising the performance of Elk, and I'll post a HowTo for doing this along with some additional benchmarks.
Regards,
Kay.
-
Dear Kay,
Thank you for these details.
So, in case I do not specify any threading-related keywords in elk.in, would Elk comply with OMP_THREAD_LIMIT and OMP_MAX_ACTIVE_LEVELS?
Also, as long as Elk changes OMP_DYNAMIC to false, export OMP_DYNAMIC=TRUE must have no effect, correct?
Your compilation example from make.inc suggests the use of sequential MKL. However, I will try playing around with threaded MKL - for instance, with 6 cores, OMP and MKL can have (1,6), (2,3) or (3,2) max threads (6,1 being equal to sequential MKL), and some options might turn out better.
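For instance, the (2,3) split would be set up roughly as follows (assuming threaded MKL is linked and MKL_NUM_THREADS is honoured; the run itself gets 6 cores):
export OMP_NUM_THREADS=2
export MKL_NUM_THREADS=3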
Also, would it make sense to make the threading subroutines post messages like, "your setup seems optimal", or "recommended number of nodes/cores is...", or "recommended values of OMP control variables are..."?
Best regards.
Andrew.
Last edit: Andrew Shyichuk 2019-08-21
-
Elk ignores the OMP_THREAD_LIMIT, OMP_MAX_ACTIVE_LEVELS and OMP_DYNAMIC environment variables.
Instead it works out the number of threads based on the number of running threads at the higher nesting level. Setting the environment variables correctly can be quite confusing for a typical user, so Elk chooses the optimal number of threads for each loop or LAPACK call.
By default, Elk will use the number of threads given by OMP_NUM_THREADS. If you set
maxthd
 -1
then Elk will use the number of processors (which is actually the number of hyperthreads).
You can change the number of MKL threads with
maxthdmkl
 8
These will be used only when there are idle threads available.
Also, on AMD machines you can compile with
LIB_LPK = -lopenblas -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread
This will use OpenBLAS for BLAS calls and MKL for LAPACK calls.
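Depending on how the libraries are installed you may also need the corresponding -L search paths in that line, for example (the path is only illustrative):
LIB_LPK = -L$(MKLROOT)/lib/intel64 -lopenblas -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread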
Regards,
Kay.
Last edit: J. K. Dewhurst 2019-08-22
-
Dear Kay,
Thanks, I will test those.
So far, for some reason, my benchmarks have shown apparent dependence on OMP_MAX_ACTIVE_LEVELS.
Also, the most recent benchmarks (with threaded MKL) show apparent dependence on OMP_THREAD_LIMIT. Why could that be? Can the compliance be compiler-dependent? I used GCC.
I've set OMP_MAX_ACTIVE_LEVELS to 1, OMP_THREAD_LIMIT to either 1, 2 or 3, and MKL_NUM_THREADS to 6, 3 or 2 respectively, and ran it on 6 threads. Such a setup was made with the "each OMP thread spawns x MKL threads" logic in mind. These tasks show CPU usage of 100%, ~200% and ~300% respectively, indicating no use of threaded MKL. Thus, calling MKL counts as a nested level. Again, there is an apparent dependence at least on OMP_THREAD_LIMIT.
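Written out, the second of those configurations, for example, was:
export OMP_MAX_ACTIVE_LEVELS=1
export OMP_THREAD_LIMIT=2
export MKL_NUM_THREADS=3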
Best regards.
Andrew
Last edit: Andrew Shyichuk 2019-08-21
-
Are you accidentally including 'omp_stub.f90' when you compile? That will override the actual OpenMP routines.
Threaded MKL is two nesting levels down, so if you have more k-points than threads then MKL LAPACK will use only one thread.
I'm not sure about OMP_THREAD_LIMIT. It should probably be unset or set to OMP_NUM_THREADS since Elk limits the total number of threads to OMP_NUM_THREADS.
OMP_MAX_ACTIVE_LEVELS should be overwritten because Elk calls omp_set_max_active_levels(maxlvl) during initialisation. maxlvl is 4 by default.
Cheers,
Kay.
-
Dear Kay,
I might be including omp_stub.f90; it is present in $elk/src.
However, there is no respective .o file.
The only stub.o files are those specified (uncommented) in make.inc.
Just in case, I will delete it (and the other unused stubs) and recompile.
I use OMP_DISPLAY_ENV=TRUE. I set OMP_THREAD_LIMIT manually, for testing purposes. It is smaller than OMP_NUM_THREADS, which I did not set; it turned out to be equal to the number of physical threads, as expected. Elk complies with OMP_THREAD_LIMIT, which can be deduced from the CPU percentage used by the task.
As for OMP_MAX_ACTIVE_LEVELS, I have no direct way to check. Indirectly, I can see that identical tasks with different OMP_MAX_ACTIVE_LEVELS behave differently. Some effect of the cluster is present (it's a large supercomputer, the nodes are not identical in their performance, and the disk response varies over time), meaning that otherwise identical tasks show different walltimes and CPU times. The OMP_MAX_ACTIVE_LEVELS effect seems to be larger than the cluster effect, though.
Best regards.
Andrew
Last edit: Andrew Shyichuk 2019-08-21
-
Dear Andrew,
Setting OMP_MAX_ACTIVE_LEVELS should not affect the number of nesting levels because Elk calls omp_set_max_active_levels(maxlvl) before it starts any calculation.
The code which selects the number of threads for a parallel loop is found in modomp.f90. I'll reproduce it here:
subroutine holdthd(nloop,nthd)
implicit none
! arguments
integer, intent(in) :: nloop
integer, intent(out) :: nthd
! local variables
integer lvl,na,n
! current nesting level
lvl=omp_get_level()
if ((lvl.lt.0).or.(lvl.ge.maxlvl)) then
  nthd=1
  return
end if
!$OMP CRITICAL(holdthd_)
! determine number of active threads at the current nesting level
na=nathd(lvl)
na=max(min(na,maxthd),1)
! number of threads allowed for this loop
nthd=maxthd/na
if (lvl.eq.0) nthd=min(nthd,maxthd1)
nthd=max(min(nthd,maxthd,nloop),1)
! add to number of active threads in next nesting level
n=nathd(lvl+1)+nthd
n=max(min(n,maxthd),0)
nathd(lvl+1)=n
!$OMP END CRITICAL(holdthd_)
return
end subroutine
It determines which nesting level the subroutine was called from and finds the number of already active threads (na) in that level. Then it chooses the number of threads as maxthd/na.
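For example, with maxthd=16: if only four threads are active in the k-point loop at the first nesting level, each of them gets 16/4 = 4 threads for the loops (or LAPACK calls) nested inside it; if all 16 first-level threads are active, the nested loops run with a single thread each.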
Thus the thread selection can vary between runs and is unpredictable.
Hopefully this scheme is 'future-proof' and will scale for nodes with hundreds of cores (as long as the memory can keep up!).
Regards,
Kay.
-
Dear Kay,
I've studied the modomp module, and added some output of variables to it.
Elk does not change OMP_NUM_THREADS (unless I change maxthd in elk.in from its default value of 0), and it does not change OMP_THREAD_LIMIT.
Thus, the dependence on OMP_MAX_ACTIVE_LEVELS I observed must be either some kind of artifact, or hidden in OMP_NUM_THREADS / OMP_THREAD_LIMIT, which I was also exporting. I.e. it's my setup, not Elk.
Further on, and going back to threaded MKL:
From the OMP documentation, I see that OMP_NUM_THREADS can be smaller than OMP_THREAD_LIMIT. Thus, I can set OMP_THREAD_LIMIT to the number of physical threads and limit my tweaks to OMP_NUM_THREADS/maxthd and maxthdmkl, along the lines of the sketch below.
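That is, something like (6 physical cores assumed; the split is only an example):
export OMP_THREAD_LIMIT=6
export OMP_NUM_THREADS=3
with maxthdmkl adjusted accordingly in elk.in.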
Best regards.
Andrew
-
Hi Andrew,
I've made a HowTo for compiling Elk and included some of the above discussion on parallelism: https://sourceforge.net/p/elk/discussion/897820/thread/bb25c53e70/
Regards,
Kay.