[DFTB-Plus-User] MPI Version crashes

Bálint Aradi balint.aradi at bccms.uni-bremen.de
Wed Dec 11 10:52:14 CET 2013

Dear Fu Xiao-Xiao, dear Silvio

> First of all, I must tell you that in our case the performances of the
> MPI version of DFTB+ were quite unsatisfactory. The code ran slower than
> the serial version, and the scaling was negative.

Silvio, did you at least manage to reproduce the scaling for the
SiC-system I sent you? If that cannot be reproduced, I am somewhat
inclined to think that something is wrong with your local setup. I have
now tried that system on two different supercomputers with different
interconnects (Infiniband and Cray's own), and both show excellent
performance (figure attached). Even on one node (8 cores), the MPI
version is only about 20% slower than the OpenMP version. Going to more
nodes, it scales almost perfectly up to 64, 128 or 256 cores (depending
on whether you calculate 1000, 2000 or 4000 atoms). The systems tested
were perfect SiC crystals of the given sizes. Inputs are attached; use
the sk-files from the PBC-set.
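
To compare your own runs against this, it helps to turn wall-clock times
into speedup and parallel efficiency. A minimal sketch in Python follows;
the function name and the numbers are made up for illustration and are
NOT the measurements from the attached figure.

```python
def scaling_table(timings):
    """Return (cores, speedup, efficiency) rows relative to the smallest run.

    timings: dict mapping core count -> wall-clock time in seconds.
    """
    cores_sorted = sorted(timings)
    base = cores_sorted[0]
    t_base = timings[base]
    rows = []
    for n in cores_sorted:
        speedup = t_base / timings[n]   # how much faster than the base run
        ideal = n / base                # speedup under perfect scaling
        rows.append((n, speedup, speedup / ideal))
    return rows


# Hypothetical wall-clock times in seconds (illustration only).
example = {8: 800.0, 16: 410.0, 32: 215.0, 64: 120.0}

for n, speedup, eff in scaling_table(example):
    print(f"{n:4d} cores  speedup {speedup:5.2f}  efficiency {eff:4.0%}")
```

A speedup below 1 when going to more cores is the "negative scaling"
situation; an efficiency close to 100% corresponds to almost perfect
scaling.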

> Anyway, if you want to try, I can tell you that we compiled everything:
> DFTB+ libraries, LAPACK/SCALAPACK libraries (with respective
> dependencies) and OpenMPI libraries with the 4.9.0
> 20130922(experimental) version of the GCC compiler, and the code ran
> giving the same results as the serial version.

For the attached figure, I've used a DFTB+ binary compiled with ifort
12.1 and linked against SCALAPACK as contained in the MKL-library. The
nodes were interconnected with Infiniband (4x DDR).

Of course, the MPI code may still contain bugs that affect performance.
We had one (fixed in the version on the web) where all nodes tried to
write into the same file at the same time. On my supercomputer system it
did not cause any bottleneck (thanks to a fast global file system),
while on my local machine it caused a huge lag at the end of the SCC
cycle when charges.bin was written. So make sure you switch off all
file-writing features, and if the program is too slow, check whether it
is slow globally or only lags at a certain point.
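
One simple way to tell a global slowdown from a lag at one point is to
timestamp the program's output lines and look for the longest pauses. A
rough sketch in Python, using only the standard library; the helper names
are invented here, and the command line in the comment is an assumption
about your setup. Note that output buffering inside the program can shift
the timestamps, so treat the result as a hint only.

```python
import subprocess
import time


def stamp_output(cmd):
    """Run cmd and return a list of (seconds_since_start, line) for its stdout."""
    start = time.monotonic()
    stamped = []
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            stamped.append((time.monotonic() - start, line.rstrip("\n")))
    return stamped


def largest_gaps(stamped, top=3):
    """Return the `top` longest pauses as (gap_seconds, line_printed_after_gap)."""
    gaps = [
        (later[0] - earlier[0], later[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Hypothetical usage (launcher and binary name depend on your installation):
# stamped = stamp_output(["mpirun", "-np", "8", "dftb+"])
# for gap, line in largest_gaps(stamped):
#     print(f"{gap:8.2f} s before: {line}")
```

If one or two gaps dominate the total run time, the problem is a lag at
a specific point (as with the charges.bin bug); if the time is spread
evenly over all lines, the run is slow globally.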

Also, for periodic systems, please make sure that for huge system sizes
you fix the Ewald-alpha parameter to a reasonable value (~0.1), as
described in the README.
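
For intuition on why the value matters: textbook Ewald summation splits
1/r into a short-range part, erfc(alpha*r)/r, summed in real space, and a
long-range part, erf(alpha*r)/r, summed in reciprocal space. The sketch
below only illustrates that textbook split; it assumes that the alpha in
DFTB+ plays this role and that distances are in atomic units, and it is
not taken from the DFTB+ source.

```python
import math


def ewald_split(r, alpha):
    """Split 1/r into a short-range and a long-range part for a given alpha."""
    short_range = math.erfc(alpha * r) / r   # handled by the real-space sum
    long_range = math.erf(alpha * r) / r     # handled by the reciprocal-space sum
    return short_range, long_range


# Show how quickly the real-space term dies off at r = 20 for several alphas.
for alpha in (0.01, 0.1, 1.0):
    short, _ = ewald_split(20.0, alpha)
    print(f"alpha={alpha:<5} short-range term at r=20: {short:.3e}")
```

A larger alpha makes the real-space sum converge faster but the
reciprocal-space sum converge more slowly, and vice versa, so an
unsuitable value can make one of the two sums very expensive.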

Additionally, make sure OMP_NUM_THREADS is set to 1; otherwise parallel
MKL threads may be launched, which would kill performance. For a linking
example see the README.PARALLEL file, which you should study carefully
before setting up your parallel DFTB+.
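
If you start your jobs from a script, the thread setting can be forced
in the environment handed to the launcher. A small Python sketch; the
helper names are invented, and the launcher ("mpirun -np") and the
binary name ("dftb+") are assumptions that may differ on your system.

```python
import os


def single_threaded_env(base_env=None):
    """Copy an environment and force one OpenMP thread per MPI process."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    return env


def launch_command(n_procs, binary="dftb+"):
    """Build the launcher command line (launcher and binary are assumptions)."""
    return ["mpirun", "-np", str(n_procs), binary]


# Hypothetical usage:
# import subprocess
# subprocess.run(launch_command(8), env=single_threaded_env(), check=True)
```

Copying the environment instead of modifying os.environ keeps the
setting local to the launched job.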

Let me know about your experiences. Unfortunately, if I can't reproduce
the behaviour (as is the case with Silvio's problem), I can hardly
help. I can only say that the SiC example I used scaled nicely
on all systems I tried, provided they had a reasonable interconnect.

  Best regards,


Dr. Bálint Aradi
Bremen Center for Computational Materials Science, University of Bremen

-------------- next part --------------
A non-text attachment was scrubbed...
Name: timing.pdf
Type: application/pdf
Size: 9478 bytes
Desc: not available
URL: <http://www.dftb-plus.info/pipermail/dftb-plus-user/attachments/20131211/12fd1edc/attachment-0001.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scaling.tar.xz
Type: application/x-xz
Size: 10752 bytes
Desc: not available
URL: <http://www.dftb-plus.info/pipermail/dftb-plus-user/attachments/20131211/12fd1edc/attachment-0001.bin>