[DFTB-Plus-User] MPI Version crashes

Wed Dec 11 12:12:39 CET 2013

Dear Bálint,

when replying to Fu Xiao-Xiao I stated that what I reported is our own
experience. Our results are likely to depend strongly on our machine's
architecture (which may indeed be faulty) and are not meant to hold in
general. Your benchmarks do indeed show very good numbers.

I did not want to scare other users away from the MPI version, and I am
sorry if I unintentionally did so.

Best regards

Silvio a Beccara

> Dear Fu Xiao-Xiao, dear Silvio
>
>> First of all, I must tell you that in our case the performance of the
>> MPI version of DFTB+ was quite unsatisfactory. The code ran slower than
>> the serial version, and the scaling was negative.
> Silvio, did you manage at least to reproduce the scaling for the
> SiC system I sent you? If that cannot be reproduced, I am somewhat
> inclined to think that something is wrong with your local setup. I have
> now tried it on two different supercomputers with different
> interconnects (Infiniband and Cray's own) and both show excellent
> performance (figure attached). Even on one node (8 cores), the MPI
> version is only about 20% slower than the OpenMP version. Going to more
> nodes, it scales almost perfectly up to 64, 128 or 256 cores (depending
> on whether you calculate 1000, 2000 or 4000 atoms). The systems tested
> were perfect SiC crystals of the given sizes. Inputs are attached; use
> the sk-files from the PBC-set.
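
To compare one's own runs with the scaling described above, speedup and
parallel efficiency can be computed from measured wall-clock times. The
following is only a minimal Python sketch: the function name is
hypothetical, and any timings passed to it have to be measured by the
user (none are taken from this thread).

```python
def scaling_table(timings):
    """Compute speedup and parallel efficiency relative to the smallest run.

    `timings` maps core count -> wall-clock seconds (hypothetical input;
    measure these yourself for your own system).
    Returns a list of (cores, speedup, efficiency) sorted by core count.
    """
    if not timings:
        raise ValueError("need at least one timing")
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        # Ideal speedup is cores / base_cores; efficiency is the
        # fraction of that ideal which was actually achieved.
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows
```

An efficiency close to 1 corresponds to near-perfect scaling, while a
speedup below 1 means the run became slower when cores were added.
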
>
>> Anyway, if you want to try, I can tell you that we compiled everything:
>> DFTB+ libraries, LAPACK/SCALAPACK libraries (with respective
>> dependencies) and OpenMPI libraries with the 4.9.0
>> 20130922(experimental) version of the GCC compiler, and the code ran
>> giving the same results as the serial version.
> For the attached figure, I've used a DFTB+ binary compiled with ifort
> 12.1 and linked against the SCALAPACK contained in the MKL library. The
> nodes were interconnected with Infiniband (4x DDR).
>
> Of course, the MPI code may still contain bugs influencing performance.
> We had one (fixed in the version on the web) where all nodes tried to
> write into the same file at the same time. On my supercomputer system it
> did not cause any bottleneck (thanks to a fast global file system),
> while on my local machine it caused a huge lag at the end of the SCC
> cycle when charges.bin was written. So, make sure you switch off all
> file writing features, and if the program is too slow, check whether it
> is slow globally or only lags at a certain point.
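
One possible way to tell a global slowdown from a lag at a single point
is to timestamp the program's output lines and look for the largest
pauses between them. The sketch below assumes the timestamps have
already been collected as (seconds, text) pairs; the function name is
hypothetical, and output buffering can blur the timings.

```python
def largest_gaps(stamped_lines, top=3):
    """Return the `top` largest pauses between consecutive output lines.

    `stamped_lines` is a list of (seconds, text) pairs in output order
    (how they are collected is left to the user).
    Each result is (gap_seconds, line_before, line_after).
    """
    gaps = []
    for (t0, before), (t1, after) in zip(stamped_lines, stamped_lines[1:]):
        gaps.append((t1 - t0, before, after))
    # Largest pause first: one dominant gap suggests a lag at a specific
    # point, evenly spread gaps suggest the run is slow throughout.
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

If one gap dominates, the two lines around it indicate where in the run
the delay occurs.
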
>
> Also, for periodic systems, please make sure that for huge system sizes
> you fix the Ewald alpha parameter to a reasonable value (~0.1), as
> described in the README.
>
> Additionally, make sure OMP_NUM_THREADS is set to 1, otherwise parallel
> MKL threads may be launched, which would kill performance. For a linking
> example see the README.PARALLEL file, which you should study carefully
> before setting up your parallel DFTB+.
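
The OMP_NUM_THREADS advice above could be applied from a small launch
script, for instance. This is only a sketch under assumptions: the
binary name `dftb+`, the launcher `mpirun` with its `-np` option, and
the helper names are placeholders to be adapted to the local
installation.

```python
import os
import subprocess


def build_launch(n_procs, binary="dftb+", launcher="mpirun"):
    """Return (command, environment) for an MPI run with one thread per rank.

    `binary` and `launcher` are assumptions; adjust them to your
    installation (some sites use mpiexec or a batch-system wrapper).
    """
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    env = dict(os.environ)
    # One OpenMP/MKL thread per MPI rank, as advised above.
    env["OMP_NUM_THREADS"] = "1"
    command = [launcher, "-np", str(n_procs), binary]
    return command, env


def run(n_procs, **kwargs):
    """Launch the job and return its exit code (needs MPI and the binary)."""
    command, env = build_launch(n_procs, **kwargs)
    return subprocess.run(command, env=env).returncode
```

Depending on the MPI implementation, the variable may also have to be
forwarded explicitly to remote nodes (Open MPI's mpirun has an `-x`
option for this).
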
>
> Let me know about your experiences. Unfortunately, if I can't reproduce
> the behaviour (as is the case with Silvio's report), I can hardly
> help. I can only say that the SiC example I've used scaled nicely
> on all systems I've tried, provided they had a reasonable interconnect.
>
>    Best regards,
>
>      Bálint