Distributed Memory

Report or discuss software problems and other woes

Moderators: arango, robertson

bronwyn
Posts: 26
Joined: Sun Nov 27, 2005 10:54 pm
Location: Free University Berlin

#1 Unread post by bronwyn »

Hi,

I'm trying to run ROMS/Ecosim using 8 processors.
1) Executable was compiled with MPI on

2) Tiling in input set to:

NtileI == 2
NtileJ == 4

3) command line:

mpirun -np 8 oceanMlatteecosim External/ocean_latte_2005_Apr_bio.in > & loglatteecosim2
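A quick sanity check on steps 2) and 3): in a distributed-memory run the tile product NtileI * NtileJ should match the process count passed to mpirun -np. The following is only a minimal Python sketch, assuming the input file uses the `Name == value` form shown above; the function names `parse_tiling` and `tiling_matches` are hypothetical helpers and not part of ROMS.

```python
import re


def parse_tiling(input_text):
    """Return (NtileI, NtileJ) read from 'Name == value' lines."""
    values = {}
    for name in ("NtileI", "NtileJ"):
        match = re.search(rf"^\s*{name}\s*==\s*(\d+)", input_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"{name} not found in input text")
        values[name] = int(match.group(1))
    return values["NtileI"], values["NtileJ"]


def tiling_matches(input_text, nprocs):
    """True when NtileI * NtileJ equals the number of MPI processes."""
    ntile_i, ntile_j = parse_tiling(input_text)
    return ntile_i * ntile_j == nprocs


# Tiling values taken from the post above; 8 is the -np value used.
sample = """
NtileI == 2
NtileJ == 4
"""
print(tiling_matches(sample, 8))
```

With the values from this post the product is 8, which agrees with -np 8, so a simple tile/process mismatch does not look like the cause here.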

It gets as far as the first line in the time stepping and then gives the errors below. Anyone know what this means?

STEP time[DAYS] KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME trd

103200 107.500000 2.914093E+00 2.047989E+02 2.077130E+02 1.364563E+12 0

p2_1008: p4_error: interrupt SIGSEGV: 11
0: DEALLOCATE: memory at 0xef5720 not allocated
0: DEALLOCATE: memory at 0xf1b940 not allocated
0: DEALLOCATE: memory at 0xf11f90 not allocated
0: DEALLOCATE: memory at 0xf9d1e0 not allocated
p4_1012: p4_error: interrupt SIGSEGV: 11
p1_1006: p4_error: interrupt SIGSEGV: 11
p5_1014: p4_error: interrupt SIGSEGV: 11
rm_l_6_1018: (155.343750) net_send: could not write to fd=6, errno = 9
p4_error: latest msg from perror: Bad file descriptor
rm_l_6_1018: p4_error: net_send write: -1
rm_l_7_1020: (155.300781) net_send: could not write to fd=6, errno = 9
p4_error: latest msg from perror: Bad file descriptor
rm_l_7_1020: p4_error: net_send write: -1

jcwarner
Posts: 1181
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#2 Unread post by jcwarner »

Try running with just one processor.
Set the tiling to:
NtileI == 1
NtileJ == 1

and use the command:

mpirun -np 1 oceanMlatteecosim External/ocean_latte_2005_Apr_bio.in > & loglatteecosim2

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

#3 Unread post by kate »

Also, do simple MPI "hello" type programs work?

bronwyn
Posts: 26
Joined: Sun Nov 27, 2005 10:54 pm
Location: Free University Berlin

#4 Unread post by bronwyn »

OK, running with one processor seems to be working so far: it has created a history file and is going through time steps (slowly).

What does this mean?

jcwarner
Posts: 1181
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#5 Unread post by jcwarner »

OK, good so far.

Next step would be to try 2 tiles. Let's go with
NtileI == 1
NtileJ == 2

As a hunch, it may be an issue in the "J" direction, because you had four of these errors:
0: DEALLOCATE: memory at 0xef5720 not allocated
0: DEALLOCATE: memory at 0xf1b940 not allocated
0: DEALLOCATE: memory at 0xf11f90 not allocated
0: DEALLOCATE: memory at 0xf9d1e0 not allocated

and NtileJ was set to 4.

Just a hunch, let's see if I am close.
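
To compare runs with different tilings, the messages in each log can be tallied instead of counted by eye. This is a rough Python sketch, assuming the log lines look like the ones quoted in this thread; `tally_errors` is a hypothetical helper, and treating the leading `p2`, `p4`, ... as a process label is an assumption about the p4 message format.

```python
import re
from collections import Counter

# Patterns based only on the log lines quoted in this thread.
DEALLOC = re.compile(r"DEALLOCATE: memory at 0x[0-9a-fA-F]+ not allocated")
SEGV = re.compile(r"^(p\d+)_\d+:\s+p4_error: interrupt SIGSEGV")


def tally_errors(log_text):
    """Return (deallocate_count, sigsegv_count, process_labels)."""
    counts = Counter()
    labels = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if DEALLOC.search(stripped):
            counts["deallocate"] += 1
        match = SEGV.match(stripped)
        if match:
            counts["sigsegv"] += 1
            labels.append(match.group(1))
    return counts["deallocate"], counts["sigsegv"], labels


# Excerpt copied from the 8-processor run in post #1.
log_8proc = """\
p2_1008: p4_error: interrupt SIGSEGV: 11
0: DEALLOCATE: memory at 0xef5720 not allocated
0: DEALLOCATE: memory at 0xf1b940 not allocated
0: DEALLOCATE: memory at 0xf11f90 not allocated
0: DEALLOCATE: memory at 0xf9d1e0 not allocated
p4_1012: p4_error: interrupt SIGSEGV: 11
p1_1006: p4_error: interrupt SIGSEGV: 11
p5_1014: p4_error: interrupt SIGSEGV: 11
"""
print(tally_errors(log_8proc))
```

Running the same function on the log from each tiling shows whether the DEALLOCATE count really follows NtileJ.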

bronwyn
Posts: 26
Joined: Sun Nov 27, 2005 10:54 pm
Location: Free University Berlin

#6 Unread post by bronwyn »

OK, this is what happens with 2 tiles: an error message after the first step.

STEP time[DAYS] KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME trd

103200 107.500000 2.914093E+00 2.047989E+02 2.077130E+02 1.364563E+12 0
p1_16392: p4_error: interrupt SIGSEGV: 11
DEF_HIS - creating history file: latte_out/ecosim/his_latte_003_2005_0108.nc

jcwarner
Posts: 1181
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#7 Unread post by jcwarner »

OK, so that (I assume) was 2 tiles in the J direction. Try
NtileI == 2
NtileJ == 1

and see if you get the error.

Are you running on the computer with Totalview?

bronwyn
Posts: 26
Joined: Sun Nov 27, 2005 10:54 pm
Location: Free University Berlin

#8 Unread post by bronwyn »

Not sure what Totalview is. Is it for debugging?

And yes, the last run was
NtileI == 1
NtileJ == 2

Just tried it the other way around
NtileI == 2
NtileJ == 1

and this happens

STEP time[DAYS] KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME trd

103200 107.500000 2.914093E+00 2.047989E+02 2.077130E+02 1.364563E+12 0
0: DEALLOCATE: memory at 0xadab30 not allocated
p1_20887: p4_error: interrupt SIGSEGV: 11

jcwarner
Posts: 1181
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#9 Unread post by jcwarner »

Without a debugger you will have to do things the old-fashioned way, which is tried and proven to work.

First compile the program with -g (or whatever the debug flag is for your system) and with the flag for array bounds checking. Run the model again. If the errors do not shed more light on the situation, then I recommend that you put write statements in, such as:
write(*,*) 'line xx of ecosim'
Recompile and run that.
The debug build may be named oceanG (not oceanM).

This will help determine where in the code the error occurs.
You can usually locate the error in a few tries.
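
Once the write statements are in, the last marker that reached the log brackets the crash from above. Here is a small Python sketch for pulling it out, assuming markers of the form 'line xx of ecosim' as suggested above; `last_marker` is a hypothetical helper, and the sample log is made up for illustration apart from the SIGSEGV line copied from this thread.

```python
import re

# Matches markers such as "line 20 of ecosim" written by the debug statements.
MARKER = re.compile(r"line\s+(\d+)\s+of\s+(\w+)")


def last_marker(log_text):
    """Return (line_number, routine) of the last marker found, or None."""
    last = None
    for line in log_text.splitlines():
        match = MARKER.search(line)
        if match:
            last = (int(match.group(1)), match.group(2))
    return last


# Illustrative log: two markers printed, then the crash.
sample_log = """\
 line 10 of ecosim
 line 20 of ecosim
p1_16392: p4_error: interrupt SIGSEGV: 11
"""
print(last_marker(sample_log))
```

Output from parallel processes may be buffered, so the crash probably sits somewhere after the reported marker rather than exactly at it; moving the write statements closer together on the next try narrows it down.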

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

#10 Unread post by kate »

Totalview is a debugger that works for both serial and parallel applications.

Do you know whether this system works for other MPI jobs?

bronwyn
Posts: 26
Joined: Sun Nov 27, 2005 10:54 pm
Location: Free University Berlin

#11 Unread post by bronwyn »

Thanks for the help and suggestions. I'll let you know how it works out.
