[DFTB-Plus-User] DFTB+MPI-NEGF on SLURM

Argo argo.nurbawono at gmail.com
Thu Apr 9 09:46:36 CEST 2015


Hi. For SLURM, if you want, say, 4 cores on 1 node, try something like
this in your job script:

#SBATCH --nodes=1
#SBATCH --ntasks=4

mpirun -np 4 dftb+
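
In case it helps, a complete minimal script might look like this (job
name, time limit and output file are just placeholders for whatever your
cluster expects):

#!/bin/bash
#SBATCH --job-name=dftb
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

# launch 4 MPI ranks of the MPI-enabled dftb+ binary on one node
mpirun -np 4 dftb+ > dftb.out

Save it as, e.g., job_dftb.sh and submit it with "sbatch job_dftb.sh".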

If you compiled the OpenMP version of dftb+ (which is the default build
anyway), request the cores for a single task and set OMP_NUM_THREADS
instead:

#SBATCH --nodes=1
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4
dftb+
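
As a sketch, the corresponding full script could be (again, adjust names
and limits to your site; SLURM_CPUS_PER_TASK is set by SLURM at run time
to whatever you requested with --cpus-per-task):

#!/bin/bash
#SBATCH --job-name=dftb-omp
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# one process, as many OpenMP threads as cores requested
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
dftb+ > dftb.out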

For those who use LSF instead of SLURM, the line

#BSUB -R "span[ptile=4]"

in the LSF job script would do the same thing, as far as I know.
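
A rough LSF sketch of the same idea (my guess at a minimal script; the
job and output names are placeholders, and -n 4 is needed to actually
request the 4 slots):

#!/bin/bash
#BSUB -J dftb
#BSUB -n 4
#BSUB -R "span[ptile=4]"
#BSUB -o dftb.%J.out

# 4 MPI ranks, all placed on the same host by the ptile request
mpirun -np 4 dftb+

Submit it with "bsub < job_dftb.lsf".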

Argo.

On Thu, 2015-04-09 at 09:07 +0200, Alessandro Pirrotta wrote:
> Dear DFTB+ users,
> 
> 
> 
> I am having a problem running DFTB+ on SLURM.
> When I am connected to the front end of my university's computer
> cluster, the executable runs correctly: I have run the test suite and
> only 2 tests failed
> (
> ======= spinorbit/EuN =======
> electronic_stress    element    0.000101791078878    Failed
> stress               element    0.000101791078878    Failed
> )
> 
> 
> When I submit a job with SLURM and execute ./dftb+ normally, I get an
> MPI error (see below).
> If I run "mpirun -n 1 dftb+" the job runs correctly on one node with a
> single core.
> How do I run dftb+ on a single node using n cores?
> 
> 
> [bhc0141:20956] [[64086,1],0][grpcomm_pmi_module.c:398:modex]
> PMI_KVS_Commit failed: Operation failed
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process
> is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
> 
>   orte_grpcomm_modex failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [bhc0141:20956] *** An error occurred in MPI_Init
> [bhc0141:20956] *** on a NULL communicator
> [bhc0141:20956] *** Unknown error
> [bhc0141:20956] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
> 
> 
>   Reason:     Before MPI_INIT completed
>   Local host: bhc0141
>   PID:        20956
> --------------------------------------------------------------------------
> 
> 
> Kind regards,
> Alessandro
> 
> 
> Alessandro Pirrotta
> PhD student
> 
>  
> 
> Faculty of Science
> Department of Chemistry &
> Nano-Science Center
> University of Copenhagen
> Universitetsparken 5, C321
> 2100 Copenhagen Ø
> Denmark
> 
> DIR +45 21 18 11 90
> MOB +45 52 81 23 41
> 
> 
> 
> alessandro.pirrotta at chem.ku.dk
> 
> alessandro.pirrotta at gmail.com
> 
> 
> www.ki.ku.dk
> 
> 
> 
> _______________________________________________
> DFTB-Plus-User mailing list
> DFTB-Plus-User at mailman.zfn.uni-bremen.de
> https://mailman.zfn.uni-bremen.de/cgi-bin/mailman/listinfo/dftb-plus-user