running ROMS on supercomputers using slurm

**Prnceton** · #1 » Thu May 24, 2018 8:46 am

Hi, everyone. I am new to ROMS and am trying to run it on a supercomputer. After working through a lot of problems, I reached the last step, running ROMS, but a new error occurred.
The supercomputer uses the module tool to manage software and the Slurm job management system in node-exclusive mode; every node has 20 cores.
The modules I have loaded are as follows:

Code:

module load hdf5/intel18/1.8.20-parallel
module load intel/18.0.2
module load mpi/intel/18.0.2
module load mpi/openmpi/3.0.1-pmi-icc18
module load netcdf/intel18/4.4.1-parallel

I have also modified build.bash as follows:

Code:

export  MY_ROOT_DIR=${HOME}/roms
export  MY_PROJECT_DIR=${MY_ROOT_DIR}/Projects/Upwelling
export  PATH=/public1/soft/openmpi/3.0.1-pmi-icc/bin:$PATH
export  USE_MY_LIBS=on
export  NF_CONFIG=/public1/soft/netcdf/4.4.1-parallel-icc18/bin/nf-config
export  NETCDF_INCDIR=/public1/soft/netcdf/4.4.1-parallel-icc18/include
export  NETCDF_LIBDIR=/public1/soft/netcdf/4.4.1-parallel-icc18/lib
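
A quick sanity check on paths like the ones above can catch a wrong module or install location before a long build. The following is only a sketch: the function name `check_netcdf_paths` is hypothetical, and it assumes the NetCDF-Fortran install keeps its `netcdf.mod` module file in the include directory.

```shell
# Hypothetical helper: report any build.bash path that does not exist here.
# Arguments: nf-config path, include directory, library directory.
check_netcdf_paths() {
    local nf_config="$1" incdir="$2" libdir="$3" status=0
    [ -x "$nf_config" ] || { echo "nf-config not executable: $nf_config" >&2; status=1; }
    [ -d "$incdir" ]    || { echo "missing include dir: $incdir" >&2; status=1; }
    [ -d "$libdir" ]    || { echo "missing library dir: $libdir" >&2; status=1; }
    # Assumption: the Fortran module file netcdf.mod lives in the include dir.
    [ -f "$incdir/netcdf.mod" ] || { echo "no netcdf.mod in $incdir" >&2; status=1; }
    return "$status"
}
```

It could be called as `check_netcdf_paths "$NF_CONFIG" "$NETCDF_INCDIR" "$NETCDF_LIBDIR"` after the exports; a non-zero return means at least one path needs another look.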

To avoid some errors, I have also modified Compilers/Linux-ifort.mk as follows:

Code:

LIBS := -L$(NETCDF_LIBDIR) -lnetcdff -lnetcdf
#FFLAGS += -Wl,-stack_size,0x64000000
#FFLAGS += -Wl,-stack_size,0x64000000

The last step is to run ROMS, and here comes the trouble. The ROMS wiki tells me to type "mpirun -np 40 oceanM ocean_upwelling.in" to run in parallel (distributed-memory) on 40 processors, but on the supercomputer I have to use Slurm. I need to write a script job.sh, with the following content:

Code:

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 40
srun -n 40 oceanM ocean_upwelling.in

Then I need to type "sbatch -p paratera job.sh" to run ROMS. Here paratera stands for the queue name, for example pg2_64_pool. I also had to modify ocean_upwelling.in so that NtileI*NtileJ equals 40. The job ran for a short time, and the output log file slurm-xxx.out still reported an error, shown as follows:

I want to know the cause of the error and how to correct it. I hope to get some advice.
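
The tiling constraint mentioned above (NtileI*NtileJ equal to the number of MPI tasks) can be checked before the job starts. The sketch below is illustrative only: the helper names `get_tile` and `check_tiles` are hypothetical, and it assumes the input file writes each value as a whitespace-separated `NtileI == 5` (or `NtileI = 5`) entry.

```shell
# Hypothetical helpers: compare NtileI*NtileJ in a ROMS input file with the
# number of MPI tasks requested from Slurm.

# Print the value assigned to a keyword, e.g. "NtileI == 5  ! comment" -> 5.
# Commented-out lines start with "!", so their first field never matches.
get_tile() {
    local key="$1" file="$2"
    awk -v key="$key" '$1 == key && ($2 == "=" || $2 == "==") { print $3; exit }' "$file"
}

# Return 0 if the tile product matches the task count, 1 if it does not,
# and 2 if either keyword could not be found.
check_tiles() {
    local file="$1" ntasks="$2"
    local ni nj
    ni=$(get_tile NtileI "$file")
    nj=$(get_tile NtileJ "$file")
    if [ -z "$ni" ] || [ -z "$nj" ]; then
        echo "could not find NtileI/NtileJ in $file" >&2
        return 2
    fi
    if [ $((ni * nj)) -ne "$ntasks" ]; then
        echo "NtileI*NtileJ = $((ni * nj)) but ntasks = $ntasks" >&2
        return 1
    fi
    echo "OK: ${ni} x ${nj} = ${ntasks} tiles"
}
```

In job.sh this could run just before the srun line, for example `check_tiles ocean_upwelling.in "$SLURM_NTASKS" || exit 1`, so a mismatched tiling stops the job with a clear message. This only rules out one possible cause; it does not diagnose the error reported in the log.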

**kate** · #2 » Thu May 24, 2018 3:18 pm

I don't know what your error is. I assume you have talked to your local supercomputer people? We use Slurm too, and here is a job script:

Code:

#!/bin/bash
#SBATCH -t 144:00:00
#SBATCH --ntasks=192
#SBATCH --job-name=ARCTIC4
#SBATCH --tasks-per-node=24
#SBATCH -p t2standard
#SBATCH --account=akwaters
#SBATCH --output=ARCTIC4.%j
#SBATCH --no-requeue

cd $SLURM_SUBMIT_DIR
. /usr/share/Modules/init/bash
module purge
module load slurm
module load toolchain/pic-iompi/2016b
module load numlib/imkl/11.3.3.210-pic-iompi-2016b
module load toolchain/pic-intel/2016b
module load compiler/icc/2016.3.210-GCC-5.4.0-2.26
module load compiler/ifort/2016.3.210-GCC-5.4.0-2.26
module load openmpi/intel/1.10.4
module load data/netCDF-Fortran/4.4.4-pic-intel-2016b
module list

#
#  Prolog
#
echo " "
echo "++++ Chinook ++++ $PGM_NAME began:    `date`"
echo "++++ Chinook ++++ $PGM_NAME hostname: `hostname`"
echo "++++ Chinook ++++ $PGM_NAME uname -a: `uname -a`"
echo " "
TBEGIN=`echo "print time();" | perl`

srun -l /bin/hostname | sort -n | awk '{print $2}' > ./nodes
mpirun -np $SLURM_NTASKS -machinefile ./nodes --mca mpi_paffinity_alone 1 ./oceanM ocean_arctic4.in

#
#  Epilog
#
TEND=`echo "print time();" | perl`
echo " "
echo "++++ Chinook ++++ $PGM_NAME pwd:      `pwd`"
echo "++++ Chinook ++++ $PGM_NAME ended:    `date`"
echo "++++ Chinook ++++ $PGM_NAME walltime: `expr $TEND - $TBEGIN` seconds"
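
The machinefile step in the script above pipes the labeled output of `srun -l /bin/hostname` (one `taskid: hostname` line per task) through `sort -n` and `awk`, leaving one hostname per MPI task in task order. A small sketch of just that transformation, run on made-up input so it works without Slurm; the function name `make_machinefile` and the node names are hypothetical.

```shell
# Turn "taskid: hostname" lines into a machinefile: one hostname per task,
# ordered by task id. Reads stdin, writes stdout.
make_machinefile() {
    sort -n | awk '{print $2}'
}

# Sample input invented for illustration (4 tasks on 2 hypothetical nodes).
printf '2: node02\n0: node01\n3: node02\n1: node01\n' | make_machinefile
# prints:
# node01
# node01
# node02
# node02
```

Within a job, the equivalent of the original line would be `srun -l /bin/hostname | make_machinefile > ./nodes`.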
