Run fails when using more than 1 node in a computing cluster

Report or discuss software problems and other woes

Moderators: arango, robertson

lcbernardo
Posts: 88
Joined: Wed Oct 01, 2014 8:57 pm
Location: International Coastal Research Center

#1 Post by lcbernardo

Dear ROMS users,

We have had this problem for about half a year and have tried to resolve it with the technical staff at our institution. The problem persists, so I thought I would ask here in the forums.

We're running ROMS on a parallel computing cluster. When we use only 1 node (28 cores in our case), the run completes successfully. However, whenever we try to use 2 or more nodes, the run fails near the start, apparently while reading the initial conditions NetCDF file. Here is the relevant part of the log file:

Metrics information for Grid 01:
===============================

Minimum X-grid spacing, DXmin = 1.50000000E+00 km
Maximum X-grid spacing, DXmax = 1.50000000E+00 km
Minimum Y-grid spacing, DYmin = 1.50000000E+00 km
Maximum Y-grid spacing, DYmax = 1.50000000E+00 km
Minimum Z-grid spacing, DZmin = -1.33120450E+01 m
Maximum Z-grid spacing, DZmax = 2.34310913E+03 m

Minimum barotropic Courant Number = 2.66422670E-02
Maximum barotropic Courant Number = 7.21447999E-01
Maximum Coriolis Courant Number = 3.96367952E-03


Minimum horizontal diffusion coefficient = 1.25000000E+01 m2/s
Maximum horizontal diffusion coefficient = 1.25000000E+01 m2/s

Minimum horizontal viscosity coefficient = 1.25000000E+01 m2/s
Maximum horizontal viscosity coefficient = 1.00000000E+20 m2/s

NLM: GET_STATE - Reading state initial conditions, 2016-04-30 00:00:00.00
(Grid 01, t = 5964.0000, File: CRSE_MB1_ini_160501.nc, Rec=0001, Index=1)
- free-surface
(Min = -2.04140008E-01 Max = 1.37334052E+00)
- vertically integrated u-momentum component
(Min = -3.23873087E-01 Max = 7.55435041E-01)
- vertically integrated v-momentum component
(Min = -2.82343629E-01 Max = 6.29832212E-01)

When the run fails, a file with a *.btr extension is generated, containing the following lines:

oceanM:56948 terminated with signal 11 at PC=0 SP=7fffffff74a8. Backtrace:
/usr/lib64/libinfinipath.so.4(+0x45a8)[0x2aaac28bd5a8]
/lib64/libpthread.so.0(+0x10b20)[0x2aaaac957b20]
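
When sharing a backtrace like this with technical staff, it can help to list which shared libraries appear in it. Below is a minimal Python sketch, assuming the *.btr file always follows the two-part layout shown above (one "terminated with signal" header, then one frame per line). The function name `parse_btr` and the `SAMPLE` string are hypothetical and only for illustration; they are not part of ROMS or of any MPI library.

```python
import posixpath
import re

# Header line, e.g. "oceanM:56948 terminated with signal 11 at PC=0 ..."
HEADER_RE = re.compile(
    r"^(?P<prog>\S+):(?P<pid>\d+) terminated with signal (?P<sig>\d+)"
)
# Frame line, e.g. "/usr/lib64/libfoo.so.4(+0x45a8)[0x2aaac28bd5a8]"
FRAME_RE = re.compile(
    r"^(?P<path>/[^\s(]+)\((?P<where>[^)]*)\)\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_btr(text):
    """Extract program, PID, signal number, and library names from a .btr dump.

    Assumes the layout seen in this thread; other layouts are simply skipped.
    """
    report = {"program": None, "pid": None, "signal": None, "libraries": []}
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            report["program"] = header.group("prog")
            report["pid"] = int(header.group("pid"))
            report["signal"] = int(header.group("sig"))
            continue
        frame = FRAME_RE.match(line)
        if frame:
            # Keep only the file name of each library, in stack order.
            report["libraries"].append(posixpath.basename(frame.group("path")))
    return report


# The backtrace quoted in this post, used as sample input.
SAMPLE = """\
oceanM:56948 terminated with signal 11 at PC=0 SP=7fffffff74a8. Backtrace:
/usr/lib64/libinfinipath.so.4(+0x45a8)[0x2aaac28bd5a8]
/lib64/libpthread.so.0(+0x10b20)[0x2aaaac957b20]
"""

if __name__ == "__main__":
    print(parse_btr(SAMPLE))
```

For the backtrace above, this reports signal 11 (a segmentation fault on Linux) with frames in libinfinipath.so.4 and libpthread.so.0. Neither frame is in oceanM itself, which suggests the crash is happening in a system library rather than in the model code, though the backtrace alone does not prove that.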

If anyone has experienced a similar issue and solved it, or has any thoughts on how to approach it, I would greatly appreciate the help.

Thanks,
Lawrence

jcwarner
Posts: 1172
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#2 Post by jcwarner

This looks like an architecture/library issue. This thread looks similar:
https://software.intel.com/en-us/forums ... pic/270080

-j

lcbernardo
Posts: 88
Joined: Wed Oct 01, 2014 8:57 pm
Location: International Coastal Research Center

#3 Post by lcbernardo

Thank you for the link, Dr. Warner. I'll see if I can use this when I get a chance to consult our technical staff about the issue.

Lawrence
