How to run an MPI Python script across multiple nodes on a Slurm cluster? Error: "Warning: can't run 1 processes on 2 nodes, setting nnodes to 1"


Suggestion : 1

The reason for the warning is this line:

#SBATCH --ntasks=1

For your reference, this is the script I run on my system:

#!/bin/bash

#SBATCH -C knl
#SBATCH -q regular
#SBATCH -t 00:10:00

#SBATCH --nodes=2

module load python3

START_TIME=$SECONDS

# launch 4 MPI tasks spread over the 2 allocated nodes
srun -n 4 python mpi_py.py >& py_${SLURM_JOB_ID}.log

ELAPSED_TIME=$(($SECONDS - $START_TIME))
echo $ELAPSED_TIME

Suggestion : 2

The warning appears because you specify that you are going to run only 1 MPI process just before you request 2 nodes. Note that it is often faster to run your code on a single node if it fits there: inter-node communication is slower than communication within a node; it may be only a bit slower, but it may also be much, much slower, depending on things like the cluster architecture. Also consult your cluster's recommended settings; on some clusters you should add further Slurm options to this script, for example -c and --cpu_bind= (see your site's documentation).
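A minimal sketch of corrected directives (the task count and script name are illustrative placeholders; match --ntasks to the number of MPI ranks you actually launch):

#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2

srun -n 4 python your_mpi_script.py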

I've been trying to get it to work with a simple Hello World script, but still run into the above error. I added --oversubscribe to the options when I run the MPI script, but still get this error.

#SBATCH --job-name=a_test
#SBATCH --mail-type=ALL
#SBATCH --ntasks=1
#SBATCH --cpu-freq=high
#SBATCH --nodes=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1gb
#SBATCH --mem-bind=verbose,local
#SBATCH --time=01:00:00
#SBATCH --output=out_%x.log

module load python/3.6.2
mpirun -np 4 --oversubscribe python par_PyScript2.py

I still get the expected output, but only after the error message "Warning: can't run 1 processes on 2 nodes, setting nnodes to 1." I'm worried that without being able to run on multiple nodes, my actual script will be a lot slower.


Suggestion : 3


Starting in 20.11, the recommended way to get an interactive shell prompt is to configure use_interactive_step in slurm.conf:

LaunchParameters=use_interactive_step
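With that parameter set, salloc alone should place the shell on the allocated node; a minimal sketch (node count and walltime are illustrative):

$ salloc --nodes=1 --time=00:30:00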

How can I get the task ID in the output or error file name for a batch job?
If you want separate output by task, you will need to build a script containing this specification. For example:

$ cat test
#!/bin/sh
echo begin_test
srun -o out_%j_%t hostname

$ sbatch -n7 -o out_%j test
sbatch: Submitted batch job 65541

$ ls -l out*
-rw-rw-r-- 1 jette jette 11 Jun 15 09:15 out_65541
-rw-rw-r-- 1 jette jette  6 Jun 15 09:15 out_65541_0
-rw-rw-r-- 1 jette jette  6 Jun 15 09:15 out_65541_1
-rw-rw-r-- 1 jette jette  6 Jun 15 09:15 out_65541_2
-rw-rw-r-- 1 jette jette  6 Jun 15 09:15 out_65541_3
-rw-rw-r-- 1 jette jette  6 Jun 15 09:15 out_65541_4
-rw-rw-r-- 1 jette jette  6 Jun 15 09:15 out_65541_5
-rw-rw-r-- 1 jette jette  6 Jun 15 09:15 out_65541_6

$ cat out_65541
begin_test

$ cat out_65541_2
tdev2

Why is my MPI job failing due to the locked memory (memlock) limit being too low?
By default, Slurm propagates all of your resource limits at the time of job submission to the spawned tasks. This can be disabled by specifically excluding the propagation of specific limits in the slurm.conf file. For example, PropagateResourceLimitsExcept=MEMLOCK might be used to prevent the propagation of a user's locked memory limit from a login node to a dedicated node used for their parallel job. If the user's resource limit is not propagated, the limit in effect for the slurmd daemon will be used for the spawned job. A simple way to control this is to ensure that user root has a sufficiently large resource limit and that slurmd takes full advantage of it. For example, you can set user root's locked memory ulimit to unlimited on the compute nodes (see "man limits.conf") and ensure that slurmd uses that limit (e.g. by adding "LimitMEMLOCK=infinity" to your systemd slurmd.service file). It may also be desirable to lock the slurmd daemon's memory to help ensure that it keeps responding if memory swapping begins. A sample /etc/sysconfig/slurm which can be read from systemd is shown below. Related information about PAM is also available.

#
# Example /etc/sysconfig/slurm
#
# Memlock the slurmd process's memory so that if a node
# starts swapping, the slurmd will continue to respond
SLURMD_OPTIONS="-M"
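For the systemd route mentioned above, one possible sketch (paths and unit names may differ by distribution) is a drop-in override for the slurmd unit:

# create a drop-in override raising slurmd's locked-memory limit
$ sudo systemctl edit slurmd
# then add in the editor:
#   [Service]
#   LimitMEMLOCK=infinity
$ sudo systemctl daemon-reload
$ sudo systemctl restart slurmd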

How can I temporarily prevent a job from running (e.g. place it into a hold state)?
The easiest way to do this is to change a job's earliest begin time (optionally set at job submit time using the --begin option). The example below places a job into a hold state (preventing its initiation for 30 days) and later allows it to start immediately.

$ scontrol update JobId=1234 StartTime=now+30days
...later...
$ scontrol update JobId=1234 StartTime=now
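Alternatively (standard scontrol functionality, not shown in the excerpt above), a job can be held and released explicitly:

$ scontrol hold 1234      # priority set to 0, job will not be scheduled
$ scontrol release 1234   # job becomes eligible to run again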

Use the scontrol command to change a job's size either by specifying a new node count (NumNodes=) for the job or identify the specific nodes (NodeList=) that you want the job to retain. Any job steps running on the nodes which are relinquished by the job will be killed unless initiated with the --no-kill option. After the job size is changed, some environment variables created by Slurm containing information about the job's environment will no longer be valid and should either be removed or altered (e.g. SLURM_JOB_NUM_NODES, SLURM_JOB_NODELIST and SLURM_NTASKS). The scontrol command will generate a script that can be executed to reset local environment variables. You must retain the SLURM_JOB_ID environment variable in order for the srun command to gather information about the job's current state and specify the desired node and/or task count in subsequent srun invocations. A new accounting record is generated when a job is resized, showing the job to have been resubmitted and restarted at the new size. An example is shown below.

#!/bin/bash

srun my_big_job
# shrink the job to 2 nodes; scontrol writes a slurm_job_${SLURM_JOB_ID}_resize.sh
# script that resets the SLURM_* environment variables to the new size
scontrol update JobId=$SLURM_JOB_ID NumNodes=2
. slurm_job_${SLURM_JOB_ID}_resize.sh
srun -N2 my_small_job
rm slurm_job_${SLURM_JOB_ID}_resize.*

Suggestion : 4

Slurm ignores the concept of a parallel environment as such; it simply requires that the number of nodes, or the number of cores, be specified. You can still control how the cores are allocated (on a single node, spread over several nodes, etc.) using options such as --cpus-per-task and --ntasks-per-node. Where several partitions are possible, the job will be submitted to the partition which offers the earliest allocation according to your job parameters and priority. If, for instance, you want 16 cores spread across distinct nodes with no interference from other jobs, you would request --ntasks=16 --nodes=16 --exclusive, as sketched below.
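An illustrative sketch (the counts are assumptions): spreading 16 cores over 16 exclusive nodes, or packing them onto 2 nodes:

# 16 cores on 16 distinct nodes, with no other jobs sharing those nodes
#SBATCH --ntasks=16
#SBATCH --nodes=16
#SBATCH --exclusive

# or: 16 cores packed as 8 tasks on each of 2 nodes
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8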

The available nodes, with their CPU count, memory, features and generic resources (GRES), can be listed with sinfo:

ceciuser@cecicluster:~$ sinfo -o "%15N %10c %10m  %25f %10G"
NODELIST        CPUS       MEMORY      FEATURES                   GRES
mback[01-02]    8          31860+      Opteron,875,InfiniBand     (null)
mback[03-04]    4          31482+      Opteron,852,InfiniBand     (null)
mback05         8          64559       Opteron,2356               (null)
mback06         16         64052       Opteron,885                (null)
mback07         8          24150       Xeon,X5550                 TeslaC1060
mback[08-19]    8          24151       Xeon,L5520,InfiniBand      (null)
mback[20-32,34] 8          16077       Xeon,L5420                 (null)
For a hybrid MPI/OpenMP job, forward the number of CPUs per task to OpenMP before launching:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun ./a.out
Memory usage can be inspected with sstat while the job is still running, or with sacct once it has finished; both support the --format option, for instance:

sacct --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize
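An equivalent sstat call for a running job might look like this (the job ID is a placeholder to replace with your own):

sstat --format=JobID,MaxRSS,MaxVMSize,AveRSS,AveVMSize -j <jobid>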
man sstat
man sacct

Suggestion : 5

NERSC uses Slurm for cluster/resource management and job scheduling. Slurm is responsible for allocating resources to users, providing a framework for starting, executing and monitoring work on the allocated resources, and scheduling work for future execution. A job is an allocation of resources, such as compute nodes, assigned to a user for an amount of time; jobs can be interactive or batch (e.g., a script) scheduled for later execution. sbatch is used to submit a job script for later execution; the script will typically contain one or more srun commands to launch parallel tasks. salloc is used to allocate resources for a job in real time as an interactive batch job; typically this is used to allocate resources and spawn a shell, and the shell is then used to execute srun commands to launch parallel tasks.

$ sbatch first-job.sh
Submitted batch job 864933

#!/bin/bash
#SBATCH -N 2

sbatch -N 2 ./first-job.sh
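For the interactive route, a hedged sketch (node count, walltime, and script name are illustrative):

$ salloc --nodes=2 --time=00:30:00
$ srun -n 4 python mpi_py.py    # runs inside the interactive allocation
$ exit                          # release the allocation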
#!/bin/bash
#SBATCH --nodes=<nnodes>
#SBATCH --time=hh:mm:ss
#SBATCH --constraint=<architecture>
#SBATCH --qos=<QOS>
#SBATCH --account=<project_name>

# set up for problem & define any environment variables here

srun -n <num_mpi_processes> -c <cpus_per_task> a.out

# perform any cleanup or short post-processing here
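A concrete, hedged instance of the template above for the MPI Python case in this thread; the constraint, QOS, account, and module names are assumptions to replace with your site's values:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=00:10:00
#SBATCH --constraint=cpu      # assumption: your system's architecture keyword
#SBATCH --qos=regular         # assumption
#SBATCH --account=mxxxx       # assumption: your project name

module load python            # assumption: module name varies by site

# 4 MPI ranks in total, 2 per node
srun -n 4 python mpi_py.py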
Default accounts for sbatch and salloc can be set with environment variables, for example in your shell startup file:

if [[ $NERSC_HOST == perlmutter ]]; then
    export SBATCH_ACCOUNT=mxxxx_g
    export SALLOC_ACCOUNT=mxxxx_g
elif [[ $NERSC_HOST == cori ]]; then
    export SBATCH_ACCOUNT=mxxxx
    export SALLOC_ACCOUNT=mxxxx
else
    echo unknown
fi
A CUDA application launched on a node without an allocated GPU typically fails with an error such as "no CUDA-capable device is detected".