MPI error when defining TS_MPDATA???

General scientific issues regarding ROMS

Moderators: arango, robertson

rongzr
Posts: 35
Joined: Mon Jul 17, 2006 4:03 pm
Location: OUC/UMCES

MPI error when defining TS_MPDATA???

#1 Unread post by rongzr »

When I define TS_MPDATA, the model compiles successfully, but the following error occurs when running it:

STEP Day HH:MM:SS KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME

0 0 00:00:00 0.000000E+00 3.897334E+03 3.897334E+03 5.162515E+14
DEF_HIS - creating history file: high_keps_0001.nc
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000001
aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(139): MPI_Wait(request=0x7fffff61681c, status0x825688) failed
MPIDI_CH3_Progress(327): handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(93):
connection_recv_fail(642):
MPIDU_Socki_handle_read(603): connection closed by peer (set=0,sock=2)
rank 9 in job 70 master_4268 caused collective abort of all ranks
exit status of rank 9: killed by signal 11
rank 8 in job 70 master_4268 caused collective abort of all ranks
exit status of rank 8: killed by signal 11
rank 6 in job 70 master_4268 caused collective abort of all ranks
exit status of rank 6: return code 13
aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(139): MPI_Wait(request=0x7fffff42af8c, status0x825688) failed
MPIDI_CH3_Progress(327): handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(93):
connection_recv_fail(642):
MPIDU_Socki_handle_read(603): connection closed by peer (set=0,sock=2)
aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(139): MPI_Wait(request=0x7fffff689ce4, status0x825674) failed
MPIDI_CH3_Progress(327): handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(93):
connection_recv_fail(642):
MPIDU_Socki_handle_read(603): connection closed by peer (set=0,sock=3)
aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(139): MPI_Wait(request=0x7fffff474a2c, status0x825414) failed
MPIDI_CH3_Progress(327): handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(93):
connection_recv_fail(642):
MPIDU_Socki_handle_read(603): connection closed by peer (set=0,sock=5)
rank 5 in job 70 master_4268 caused collective abort of all ranks
exit status of rank 5: return code 13
rank 4 in job 70 master_4268 caused collective abort of all ranks
exit status of rank 4: return code 13
rank 3 in job 70 master_4268 caused collective abort of all ranks
exit status of rank 3: return code 13
rank 2 in job 70 master_4268 caused collective abort of all ranks
exit status of rank 2: return code 13
aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(177): MPI_Send(buf=0x7fffff40cc10, count=15750, MPI_DOUBLE_PRECISION, dest=3, tag=4, MPI_COMM_WORLD) failed
MPIDI_CH3_Progress(327): handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(93):
connection_recv_fail(642):
MPIDU_Socki_handle_read(603): connection closed by peer (set=0,sock=3)
aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(139): MPI_Wait(request=0x7fffff4c540c, status0x825688) failed
MPIDI_CH3_Progress(327): handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(93):
connection_recv_fail(642):
MPIDU_Socki_handle_read(603): connection closed by peer (set=0,sock=5)
aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(177): MPI_Send(buf=0x7fffff8642c0, count=23625, MPI_DOUBLE_PRECISION, dest=5, tag=4, MPI_COMM_WORLD) failed
MPIDI_CH3_Progress(327): handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(93):
connection_recv_fail(642):
MPIDU_Socki_handle_read(603): connection closed by peer (set=0,sock=4)
aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(139): MPI_Wait(request=0x7fffffa1d434, status0x825674) failed
MPIDI_CH3_Progress(327): handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(93):
connection_recv_fail(642):
MPIDU_Socki_handle_read(603): connection closed by peer (set=0,sock=4)



When I replace TS_MPDATA with TS_U3HADVECTION and TS_SVADVECTION, everything works fine. I didn't change any other flags, and I'm using the latest released ROMS. I remember that I could use this flag in an older version of ROMS.
Has anyone else encountered this problem??

jcwarner
Posts: 1181
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#2 Unread post by jcwarner »

How many MPI processes did you use?
What is the grid size?
How did you tile it?
Are you using periodic BCs?

rongzr
Posts: 35
Joined: Mon Jul 17, 2006 4:03 pm
Location: OUC/UMCES

#3 Unread post by rongzr »

The grid size is 362x242x20. I use 10 CPUs and real (tidal) boundary conditions. Does MPDATA need a lot of memory??
The following are the relevant parameters:
Lm == 240 ! Number of I-direction INTERIOR RHO-points
Mm == 360 ! Number of J-direction INTERIOR RHO-points
N == 20 ! Number of vertical levels
NtileI == 2 ! I-direction partition
NtileJ == 5 ! J-direction partition
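
For reference, my rough estimate of one tile's size (my own numbers, ignoring the halo):

(Lm/NtileI) x (Mm/NtileJ) x N = (240/2) x (360/5) x 20 = 120 x 72 x 20 points
one double-precision 3D array = 120*72*20*8 bytes, about 1.3 MB per tile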

jcwarner
Posts: 1181
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#4 Unread post by jcwarner »

Can you try to run a standard distributed test, like UPWELLING or something else that is distributed with the code, and see if MPDATA works for that? Then I will know whether it is a problem with the code or with your system.

rongzr
Posts: 35
Joined: Mon Jul 17, 2006 4:03 pm
Location: OUC/UMCES

#5 Unread post by rongzr »

TS_MPDATA works fine for UPWELLING!
The following is my header file; are there any conflicts??

/* Basic physics options */
#define UV_COR /* Coriolis acceleration */
#define UV_ADV /* momentum advection */
#define UV_VIS2 /* harmonic horizontal mixing of momentum */
#define TS_DIF2 /* harmonic horizontal mixing of tracers */
#define SOLVE3D /* 3D primitive equations */
#define SALINITY /* salinity as an active tracer */
#define MIX_GEO_UV /* momentum mixing on geopotential (constant Z) surfaces */
#define MIX_GEO_TS /* tracer mixing on geopotential (constant Z) surfaces */
#define NONLIN_EOS /* nonlinear equation of state */

#define SSH_TIDES /* impose tidal elevation at open boundaries */
#define FSOBC_REDUCED /* reduced-physics free-surface open boundary condition */
!#define UV_TIDES /* impose tidal currents at open boundaries */
#define ADD_FSOBC /* add tidal elevation to processed OBC data */
#define ADD_M2OBC /* add tidal currents to processed OBC data */
#define INLINE_2DIO /* process 3D I/O level by level */

/* Basic numerics options */
#define CURVGRID /* curvilinear coordinates grid */
#define MASKING /* land/sea masking */
#define SPLINES /* parabolic splines vertical reconstruction */
#define TS_MPDATA /* recursive MPDATA 3D tracer advection */
!#define TS_U3HADVECTION /* define if 3rd-order upstream horiz. advection */
!#define TS_SVADVECTION /* define if splines vertical advection */
#define UV_SADVECTION /* turn ON or OFF splines vertical advection */
#define DJ_GRADPS /* splines density Jacobian pressure gradient (Shchepetkin) */
#define RADIATION_2D /* tangential phase speed in radiation conditions */

/* Outputs */
#define NO_WRITE_GRID /* define if not writing grid arrays */

/* Surface and bottom boundary conditions */
#define ANA_BSFLUX /* analytical bottom salinity flux */
#define ANA_BTFLUX /* analytical bottom temperature flux */
#define ANA_SSFLUX /* analytical surface salinity flux */
#define ANA_STFLUX /* analytical surface temperature flux */
#define UV_LOGDRAG /* turn ON or OFF logarithmic bottom friction */


/* Vertical subgridscale turbulence closure */
#ifdef SOLVE3D
# define GLS_MIXING /* Activate Generic Length-Scale mixing */
# undef LMD_MIXING
# ifdef MY25_MIXING
# undef N2S2_HORAVG /* Activate horizontal smoothing of buoyancy/shear */
# undef KANTHA_CLAYSON
# endif
# ifdef LMD_MIXING
# define LMD_RIMIX
# define LMD_CONVEC
# define LMD_BKPP
# define LMD_SKPP
# undef LMD_NONLOCAL
# endif
# ifdef GLS_MIXING
# define KANTHA_CLAYSON /* Kantha and Clayson stability function formulation */
# define N2S2_HORAVG /* Activate horizontal smoothing of buoyancy/shear */
# undef CANUTO_A /* Canuto A-stability function formulation */
!# undef CANUTO_B /* Canuto B-stability function formulation */
# endif
#endif

/* Open boundary conditions */
#define EAST_FSCHAPMAN
#define EAST_M2FLATHER
#define EAST_M3RADIATION
#define EAST_TRADIATION

#define SOUTH_FSCHAPMAN
#define SOUTH_M2FLATHER
#define SOUTH_M3RADIATION
#define SOUTH_TRADIATION

#define WESTERN_WALL
#define NORTHERN_WALL


#define TS_PSOURCE
#define UV_PSOURCE
#ifdef UV_PSOURCE
# define PSOURCE_FSCHAPMAN
#endif

jcwarner
Posts: 1181
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#6 Unread post by jcwarner »

what is this:
define PSOURCE_FSCHAPMAN

rongzr
Posts: 35
Joined: Mon Jul 17, 2006 4:03 pm
Location: OUC/UMCES

#7 Unread post by rongzr »

I added a free-surface Chapman boundary condition for the river source. It works fine with the other advection schemes.

jcwarner
Posts: 1181
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#8 Unread post by jcwarner »

Does your application work if you:
undefine the PSources and define MPDATA?

rongzr
Posts: 35
Joined: Mon Jul 17, 2006 4:03 pm
Location: OUC/UMCES

#9 Unread post by rongzr »

It doesn't work even if I undefine all the flags related to the PSources; the same error occurs.
I think the grid size may be the issue! I tested it with my old grid (240x180x20), and it works fine!
It seems MPDATA needs more memory/CPUs. The maximum number of CPUs I can use now is 12. How can I solve this problem? Are there any system variables to change?

jcwarner
Posts: 1181
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#10 Unread post by jcwarner »

Mpdata has a larger tile footprint.
The errors do not seem like a memory issue, but it is hard to tell from that info.
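(The reason is the halo width: if I remember the code right, globaldefs.h widens the ghost zone when mpdata is on, something like

#if defined TS_MPDATA || defined UV_VIS4
# define GHOST_POINTS 3
#else
# define GHOST_POINTS 2
#endif

so every tile, work array, and MPI exchange buffer carries an extra row/column on each side.)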
1) can you rebuild with debug on and see if the errors are more useful?
2) can you try 240x360x10 to see if that works?

rongzr
Posts: 35
Joined: Mon Jul 17, 2006 4:03 pm
Location: OUC/UMCES

#11 Unread post by rongzr »

Warner, thanks for your help. I tested both:
1) Building with USE_DEBUG on gives no more error information.
2) It does work for the 10-vertical-layer case!!
Any idea about that?

jcwarner
Posts: 1181
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#12 Unread post by jcwarner »

Maybe it is a memory issue, since it works for
240x360x10 and for 240x180x20,
but not for 240x360x20.
What kind of system are you using?
Try a Google search for memory commands for your system,
like ulimit, etc.
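For example, on a Linux cluster with bash (just a sketch; your system may use different limits):

ulimit -a # show all current limits
ulimit -s unlimited # remove the per-process stack size limit

and it usually has to be set on every node the MPI job runs on (e.g. in your shell startup file), not just where you launch the job.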

gianni

runtime problem with larger domains

#13 Unread post by gianni »

I'm getting an error, and it seems to be related to memory issues.
I'm trying to run the UPWELLING test: it works fine for a grid size of 300x200x20 but not for a grid size of 600x300x40, which is the size needed by my simulation.
I've also already tried the INLINE_2DIO option, without success.

This is the error output:

NL ROMS/TOMS: started time-stepping: (Grid: 01 TimeSteps: 00000001 - 00000100)

STEP Day HH:MM:SS KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME

0 0 00:00:00 0.000000E+00 7.190644E+02 7.190644E+02 2.548445E+13
[cli_0]: aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(156).............................: MPI_Wait(request=0x7fff517a2280, status0xbf4400) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=6,errno=104:(strerror() not found))
rank 7 in job 9 hpc01_58410 caused collective abort of all ranks
exit status of rank 7: killed by signal 11
rank 6 in job 9 hpc01_58410 caused collective abort of all ranks
exit status of rank 6: killed by signal 11
rank 3 in job 9 hpc01_58410 caused collective abort of all ranks
exit status of rank 3: killed by signal 11
rank 2 in job 9 hpc01_58410 caused collective abort of all ranks
exit status of rank 2: killed by signal 11


Any idea about that?
I appreciate any help.
Giovanni

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: MPI error when defining TS_MPDATA???

#14 Unread post by kate »

How much memory do you have? Can you move to a bigger system? Can you use more nodes of the parallel system? Also, some options such as AVERAGES or vertical mixing use memory. Can you turn them off?
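
For example, something like this in your header (just an illustration, not necessarily the options you have on):

#undef AVERAGES /* turn off time-averaged output arrays */
#undef DIAGNOSTICS_TS /* turn off tracer diagnostic arrays */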

gianni

Re: runtime problem with larger domains

#15 Unread post by gianni »

Hi Kate,
The cluster where I'm trying to run has good hardware (80 nodes with 4 cores each and 4 GB of RAM), so I suspect this problem is related in some way to MPI.
At the moment I'm doing my simulations on an older, slower cluster, but I hope to get ROMS 3.2 working on the newer one.
I used the standard UPWELLING case just as a test, but my model configuration will probably require even more resources than this simple test.
I can't change the model options, so I have to try to fix this problem.
Any other ideas?

Thanks for your help,
Giovanni.

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: MPI error when defining TS_MPDATA???

#16 Unread post by kate »

Do you have 4 GB per node or for the whole system? How much memory do you have on the system you are using now? My problem is roughly the same size as yours, and I can run on a system with 4 GB per core (processor), even when turning on biology, which doubles the memory needed.

longmtm
Posts: 55
Joined: Tue Dec 13, 2005 7:23 pm
Location: Univ of Maryland Center for Environmental Science

Re: MPI error when defining TS_MPDATA???

#17 Unread post by longmtm »

The same interesting/odd thing occurred with my Chesapeake Bay runs with TS_MPDATA:

I have TS_MPDATA and biology (Fasham/Fennel) on and tried it on two different machines. It works on one machine but not on the other when using MPI. Both machines work with the serial model (no MPI), so I guess it is related to MPI somehow. Since the serial code works, I guess it is not a memory issue for the computation itself, but it may still be linked to the memory used for data exchange between tiles when MPI is on.

My grid size is 150x100x20 and I have 8 processors with 4 GB of RAM each, so there should be plenty of memory. "ulimit -s unlimited" is set.

Any news on this?

Wen

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: MPI error when defining TS_MPDATA???

#18 Unread post by kate »

Does the one where it runs give the same answer as the serial code?
