[DFTB-Plus-User] DFTB+MPI-NEGF on SLURM

Gabriele Penazzi penazzi at uni-bremen.de
Thu Apr 9 10:05:20 CEST 2015


On 04/09/2015 09:07 AM, Alessandro Pirrotta wrote:
> Dear DFTB+ users,
>
> I am having a problem running the DFTB+ on SLURM.
> When I am connected to the front end of my account in my university
> computer cluster, the executable runs correctly: I have run the test
> and only 2 tests failed 
> (
> ======= spinorbit/EuN =======
> electronic_stress    element              0.000101791078878          
> Failed
> stress               element              0.000101791078878          
> Failed 
> )
>
> When I submit a job with SLURM and I execute normally ./dftb+ I get a
> MPI error (see below).
> If I run "mpi -n 1 dftb+" the job runs correctly over a node and a
> single core.
> *How do I run dftb+ over a single node, using n cores?*
*[cut]*

Hi Alessandro,

when running NEGF, the parallelization is very different from the one
used when solving the eigenvalue problem. dftb+negf is parallelized on
two levels: MPI by distribution of energy points, and possibly OMP by
linking with threaded BLAS/LAPACK libraries. The former is on us, and
compiling with MPI support is mandatory; the latter is on the BLAS/LAPACK
vendor, and it may or may not be active depending on the way you compile
it. See the README.NEGF and README.PARALLEL files in the src directory.

If you link a threaded library, then you will have to specify how many
OMP threads you assign per process in your job script. For example

$ export OMP_NUM_THREADS=4
$ mpirun -n 1 dftb+

would use 4 cores on 1 process (the correct specification depends on
your architecture; you may or may not need additional flags, but your
facility probably has a howto for this). Therefore the answer to your
question is that you may want to use n threads on 1 process, or n
processes on n cores, or (more likely) something in the middle, depending
on your system.
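
As a rough sketch of the "something in the middle" case on SLURM, a job
script could look like the one below. The 4 x 4 split, the fallback values
and the variable names NTASKS, THREADS and LAUNCH are only assumptions for
illustration, and whether your site wants mpirun or another launcher depends
on the local setup. The launch command is only printed here, so the script
can be checked before it is submitted.

```shell
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks=4          # MPI processes (energy points are distributed over these)
#SBATCH --cpus-per-task=4   # cores reserved per process for threaded BLAS/LAPACK

# SLURM exports these variables inside a job when the options above are given;
# the fallbacks (assumed values) only matter when the script is run outside SLURM.
NTASKS=${SLURM_NTASKS:-4}
THREADS=${SLURM_CPUS_PER_TASK:-4}

# Threads per MPI process, picked up by a threaded BLAS/LAPACK
export OMP_NUM_THREADS=$THREADS

# Build the launch line; it is printed rather than executed in this sketch
LAUNCH="mpirun -n $NTASKS dftb+"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
echo "$LAUNCH"
# $LAUNCH    # uncomment to actually start the run
```

With --ntasks=1 and --cpus-per-task=n this reduces to n threads on 1
process, and with --ntasks=n and --cpus-per-task=1 to n processes on n cores.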

A note on efficiency. On "common" test systems (tens to thousands of
atoms) LAPACK/ScaLAPACK scale efficiently up to 2-4 threads, so it is
usually convenient to reserve some cores for threading. Also, the
parallelization on energy points implies solving N independent Green's
functions, so it needs to allocate N times the memory, where N is the
number of processes. For large systems it may be necessary to run one
process per socket, to get the maximum available memory. With the
current version the Poisson solver is also a bit more efficient if you
have fewer processes on a socket. Considering these points, at the end of
the day I usually run with 2 or 4 OMP threads (if I don't hit memory
problems).
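
To make the memory argument concrete, here is a small back-of-the-envelope
sketch. All numbers are hypothetical placeholders (not measurements from
dftb+), and the variable names are made up for the example: since every MPI
process holds its own copy of the data, the node memory caps the number of
processes, and the remaining cores can go to threads.

```shell
#!/bin/sh
# Hypothetical values, to be replaced by those of your node and system
CORES=16             # cores on the node
MEM_NODE_GB=64       # memory available on the node
MEM_PER_PROC_GB=16   # estimated memory needed by one MPI process

# N processes need N times the memory, so memory limits the process count
MAX_PROCS=$((MEM_NODE_GB / MEM_PER_PROC_GB))
if [ "$MAX_PROCS" -gt "$CORES" ]; then
    MAX_PROCS=$CORES
fi
if [ "$MAX_PROCS" -lt 1 ]; then
    MAX_PROCS=1
fi

# Cores left over per process can be used as OMP threads
THREADS=$((CORES / MAX_PROCS))
echo "processes=$MAX_PROCS threads_per_process=$THREADS"
```

With these placeholder numbers the result is 4 processes with 4 threads
each. If the thread count comes out much larger than 4, the extra threads
will probably not pay off, given the scaling mentioned above.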

Hope this helps,
Gabriele




-- 
Dr. Gabriele Penazzi
BCCMS - University of Bremen

http://www.bccms.uni-bremen.de/
http://sites.google.com/site/gabrielepenazzi/
phone: +49 (0) 421 218 62337
mobile: +49 (0) 151 19650383
