MPI process terminated unexpectedly

Report or discuss software problems and other woes

Moderators: arango, robertson

FengZhou
Posts: 52
Joined: Wed Apr 07, 2004 10:48 pm
Location: 2nd Institute of Oceanography,SOA

MPI process terminated unexpectedly

#1 Unread post by FengZhou »

Hi, all

I ran into a new problem after upgrading to ROMS 3.1. The model quits after several months of simulation (in some cases more than a year) without any blow-up message. Comparing the results, ROMS 3.1 behaves better at the open boundaries than ROMS 3.0 and yields clean patterns over the whole domain, but I really don't know why it stops unexpectedly.

The messages at the end of the log file are:

791830 1897 02:13:30 3.629872E-02 8.952483E+03 8.952519E+03 9.528867E+14
791840 1897 02:48:00 3.516587E-02 8.953228E+03 8.953263E+03 9.529380E+14
791850 1897 03:22:30 3.441360E-02 8.953581E+03 8.953615E+03 9.530236E+14
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000002
MPI process terminated unexpectedly
Exit code -5 signaled from node34
Killing remote processes...DONE
Signal 15 received.
Signal 15 received.
Signal 15 received.
Signal 15 received.
The model was restarted from the 4th-year RST file, ran from day 1440 to day 1897, and then quit.

My configuration:
Operating system : Linux
CPU/hardware : x86_64
Compiler system : pgi
Compiler command : /opt/mpi/mvapich/1.1/gcc.pgf90/bin/mpif90
Compiler flags : -O3 -tp k8-64 -Mfree
For the climatological forcing and boundary input, it runs for 457 days; for the other two kinds of forcing, it runs for 381 or 457 days.

I asked our cluster administrator, and he said it is not a system problem.

The first thing I could imagine is that the model blew up, but there is no NaN in the log file.
The second thing I would suspect is a problem with cycle_length (my cycle_length is 360 days), but the model has already run for more than 14 months.

Any suggestion is welcome! Thank you!

zhou

nacholibre
Posts: 81
Joined: Thu Dec 07, 2006 3:14 pm
Location: USGS

Re: MPI process terminated unexpectedly

#2 Unread post by nacholibre »

Hello Zhou,
Have you found the reason for this? I have a similar issue, where the model exits with signal 15 and no error messages. Are you modeling any river inflow?
Zafer


Re: MPI process terminated unexpectedly

#3 Unread post by FengZhou »

My experience is that this message is caused by the job's walltime limit, so please modify the job script to increase the allowed running time at this line:

#PBS -l walltime=600:00:00

Please try it.
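To check whether a job is actually approaching its limit, `qstat -f <jobid>` on Torque/PBS reports `resources_used.walltime` alongside `Resource_List.walltime`. A small sketch (the helper function is mine, not part of PBS) converts a walltime string to seconds so the two values are easy to compare:

```shell
# Quick check on Torque/PBS (replace <jobid> with your job's ID):
#   qstat -f <jobid> | grep -E 'resources_used.walltime|Resource_List.walltime'
#
# Helper sketch: convert a PBS walltime string (HH:MM:SS) to seconds.
to_seconds() {
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

to_seconds 600:00:00   # the limit requested above; prints 2160000
```

If `resources_used.walltime` is close to the limit when the job dies, the signal 15 is just the scheduler killing the job.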

zhou

gcreager

Re: MPI process terminated unexpectedly

#4 Unread post by gcreager »

This actually looks like a node crashed on you (node34). I see you're using Torque (PBS). Is there a stderr file that shows any more information?

gerry


Re: MPI process terminated unexpectedly

#5 Unread post by FengZhou »

OK, I have failed to fix the problem completely. Some of the failures came from insufficient walltime for the model to finish the specified run. Others, however, did not. I have run into the trouble again, like this:
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000005
169779 406 18:17:33 1.471362E-02 8.948875E+03 8.948890E+03 9.533103E+14
MPI process terminated unexpectedly
Exit code -5 signaled from node33
Killing remote processes...DONE
Signal 15 received.
or like this:
Abort signaled by rank 2: [node48:2] Got completion with error IBV_WC_WR_FLUSH_ERR, code=5, dest rank=34

Exit code -3 signaled from node39
Killing remote processes...MPI process terminated unexpectedly
DONE
Signal 15 received.
But the model can go ahead if restarted from the previously completed HIS file. Why?
Any suggestions?


Re: MPI process terminated unexpectedly

#6 Unread post by FengZhou »

gcreager wrote:This actually looks like a node crashed on you (node34). I see you're using Torque (pbs). Is there a stderr file that shows any more information?

gerry
Yes, I use PBS. There is no further information.

hetland
Posts: 81
Joined: Thu Jul 03, 2003 3:39 pm
Location: TAMU,USA

Re: MPI process terminated unexpectedly

#7 Unread post by hetland »

There is a known issue with OpenMPI that gives an error like this. I used openmpi-1.2.8 and got this error (that is the version that ships with the latest release of Rocks, so you probably have it, or older). This flag helped:

Code: Select all

-mca mpi_leave_pinned 0
as in:

Code: Select all

/usr/mpi/gcc/openmpi-1.2.8/bin/mpirun -mca mpi_leave_pinned 0 -np $NODES -machinefile $PBS_NODEFILE ./oceanM $INFILE > $OUTFILE
as the run statement at the end of the script submitted to Torque/Maui.

I still get this error occasionally, but this flag seemed to fix many of the problems I was having. The proper solution is to upgrade to OpenMPI 1.3.x.
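If editing every run line is inconvenient, OpenMPI also reads MCA parameters from a per-user file, so the same workaround can be made persistent (a sketch; `$HOME/.openmpi/mca-params.conf` is OpenMPI's documented per-user parameter file):

```
# $HOME/.openmpi/mca-params.conf
# Equivalent to passing "-mca mpi_leave_pinned 0" to every mpirun:
mpi_leave_pinned = 0
```

Command-line `-mca` flags still override anything set in this file, so individual jobs can opt back in if needed.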

-Rob


Re: MPI process terminated unexpectedly

#8 Unread post by FengZhou »

Hi, Rob,

Thank you for proposing a solution! I don't use OpenMPI 1.2, but MVAPICH 1.1 with the PGI compiler. My job script is:

Code: Select all

#PBS -N nwpo
#PBS -l nodes=4:ppn=4
#PBS -l walltime=500:00:00
#PBS -j oe
#PBS -q general
#define variables
NSLOTS=`cat ${PBS_NODEFILE} | wc -l`
echo "This job is "$PBS_JOBID@$PBS_QUEUE
#running jobs
cd $PBS_O_WORKDIR
# PGI
#time -p /opt/mpi/mvapich/1.1/gcc.pgf90/bin/mpirun_rsh -ssh -np ${NSLOTS} -hostfile ${PBS_NODEFILE} ./oceanM ROMS/External/ocean_nwp.in >& logroms_nwp.txt
so I cannot insert the option you propose: mpirun_rsh has no parameter corresponding to

-mca mpi_leave_pinned 0

from your run statement.
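For what it's worth, MVAPICH's mpirun_rsh takes runtime tunables as NAME=VALUE environment settings placed before the executable, rather than OpenMPI-style -mca options. A sketch only: SOME_MVAPICH_PARAM below is a placeholder, not a real variable name; consult the MVAPICH 1.1 user guide for the actual InfiniBand memory-registration parameters relevant to errors like IBV_WC_WR_FLUSH_ERR.

```
# Sketch only: mpirun_rsh passes NAME=VALUE settings to all ranks.
# SOME_MVAPICH_PARAM is a PLACEHOLDER, not a real MVAPICH variable.
/opt/mpi/mvapich/1.1/gcc.pgf90/bin/mpirun_rsh -ssh \
  -np ${NSLOTS} -hostfile ${PBS_NODEFILE} \
  SOME_MVAPICH_PARAM=0 \
  ./oceanM ROMS/External/ocean_nwp.in >& logroms_nwp.txt
```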
