ROMS not running with mvapich

Report or discuss software problems and other woes

Moderators: arango, robertson

prakrati
Posts: 24
Joined: Thu Oct 21, 2010 9:35 pm
Location: CRL

ROMS not running with mvapich

#1 Post by prakrati »

I tried to run ROMS with mvapich on 64 cores for 100 iterations, but it fails with a SIGSEGV (signal 11) fault.
Could you please advise what I should do?

kate
Posts: 3940
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not running with mvapich

#2 Post by kate »

This is when I get out a debugger or some print statements to find out where in the code you are getting into trouble. You don't provide enough information (nor is it always easy to get).
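As a first pass before reaching for a debugger, it can help to tally which MPI ranks reported which error in the job output. The following is only a rough sketch in Python: it assumes the error lines look like `pRANK_PID: p4_error: MESSAGE: CODE`, and the function name `tally_p4_errors` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape: "p61_12840: p4_error: interrupt SIGSEGV: 11"
# (rank, pid, message, trailing numeric code).
P4_ERROR = re.compile(r"^p(\d+)_(\d+):\s+p4_error:\s+(.*?):\s*(-?\d+)\s*$")


def tally_p4_errors(lines):
    """Return (Counter of error messages, dict mapping rank -> message)."""
    counts = Counter()
    by_rank = {}
    for line in lines:
        match = P4_ERROR.match(line.strip())
        if match:
            rank, _pid, message, _code = match.groups()
            counts[message] += 1
            by_rank[int(rank)] = message
    return counts, by_rank


# Illustrative sample lines; in practice, pass the lines of the job output file.
sample = [
    "Node # 3 CPU: 0.028",
    "p61_12840: p4_error: interrupt SIGSEGV: 11",
    "p4_10732: p4_error: net_recv read: probable EOF on socket: 1",
    "p59_12682: p4_error: interrupt SIGSEGV: 11",
]
counts, by_rank = tally_p4_errors(sample)
for message, n in counts.most_common():
    print(f"{n:3d}  {message}")
```

A tally like this only shows which ranks died and with what signal. It does not say where in the code the fault happened, so a debugger or print statements are still needed for that.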

prakrati
Posts: 24
Joined: Thu Oct 21, 2010 9:35 pm
Location: CRL

Re: ROMS not running with mvapich

#3 Post by prakrati »

This is the error I get when running ROMS on 64 cores for 100 iterations using mvapich:

Resource usage summary:

CPU time : 1.44 sec.
Max Memory : 5 MB
Max Swap : 36 MB


The output (if any) follows:

Process Information:

Node # 19 (pid= 8769) is active.
Node # 27 (pid= 8819) is active.
Node # 51 (pid= 9920) is active.
Node # 59 (pid= 12682) is active.
Node # 62 (pid= 12919) is active.
Node # 30 (pid= 9038) is active.
Node # 22 (pid= 8988) is active.
Node # 54 (pid= 10139) is active.
Node # 26 (pid= 8746) is active.
Node # 58 (pid= 12609) is active.
Node # 23 (pid= 9061) is active.
Node # 7 (pid= 11308) is active.
Node # 3 (pid= 10536) is active.
Node # 55 (pid= 10212) is active.
Node # 31 (pid= 9111) is active.
Node # 63 (pid= 12994) is active.
Node # 35 (pid= 9915) is active.
Node # 39 (pid= 10207) is active.
Node # 47 (pid= 8892) is active.
Node # 43 (pid= 8600) is active.
Node # 15 (pid= 10214) is active.
Node # 11 (pid= 9908) is active.
Node # 2 (pid= 10338) is active.
Node # 18 (pid= 8696) is active.
Node # 50 (pid= 9847) is active.
Node # 6 (pid= 11113) is active.
Node # 34 (pid= 9842) is active.
Node # 38 (pid= 10134) is active.
Node # 10 (pid= 9832) is active.
Node # 42 (pid= 8527) is active.
Node # 0 (pid= 9848) is active.
Node # 40 (pid= 8381) is active.
Node # 8 (pid= 9570) is active.
Node # 32 (pid= 9696) is active.
Node # 24 (pid= 8600) is active.
Node # 56 (pid= 12170) is active.
Node # 12 (pid= 9987) is active.
Node # 28 (pid= 8892) is active.
Node # 44 (pid= 8673) is active.
Node # 60 (pid= 12761) is active.
Node # 14 (pid= 10141) is active.
Node # 46 (pid= 8819) is active.
Node # 20 (pid= 8842) is active.
Node # 52 (pid= 9993) is active.
Node # 4 (pid= 10732) is active.
Node # 36 (pid= 9988) is active.
Node # 5 (pid= 10924) is active.
Node # 13 (pid= 10068) is active.
Node # 45 (pid= 8746) is active.
Node # 21 (pid= 8915) is active.
Node # 53 (pid= 10066) is active.
Node # 37 (pid= 10061) is active.
Node # 29 (pid= 8965) is active.
Node # 61 (pid= 12840) is active.

Model Input Parameters: ROMS/TOMS version 3.2
Monday - November 15, 2010 - 5:17:58 PM
-----------------------------------------------------------------------------

INP_PAR - Unable to open ROMS/TOMS input script file.
In distributed-memory applications, the input
script file is processed in parallel. The Unix
routine GETARG is used to get script file name.
For example, in MPI applications make sure that
command line is something like:

mpirun -np 4 ocean ocean.in

and not

mpirun -np 4 ocean < ocean.in


Elapsed CPU time (seconds):

Node # 48 (pid= 9701) is active.
Node # 16 (pid= 8550) is active.
Node # 1 (pid= 10142) is active.
Node # 33 (pid= 9769) is active.
Node # 17 (pid= 8623) is active.
Node # 49 (pid= 9774) is active.
Node # 9 (pid= 9756) is active.
Node # 41 (pid= 8454) is active.
Node # 57 (pid= 12530) is active.
Node # 25 (pid= 8673) is active.
Node # 2 CPU: 0.001
Node # 5 CPU: 0.006
Node # 0 CPU: 0.003
Node # 4 CPU: 0.000
Node # 18 CPU: 0.087
Node # 28 CPU: 0.194
Node # 27 CPU: 0.027
Node # 59 CPU: 0.018
Node # 21 CPU: 0.016
Node # 56 CPU: 0.015
Node # 26 CPU: 0.089
Node # 58 CPU: 0.002
Node # 29 CPU: 0.223
Node # 50 CPU: 0.201
Node # 19 CPU: 0.002
Node # 31 CPU: 0.003
Node # 53 CPU: 0.207
Node # 20 CPU: 0.103
Node # 7 CPU: 0.006
Node # 60 CPU: 0.106
Node # 6 CPU: 0.006
Node # 54 CPU: 0.102
Node # 63 CPU: 0.019
Node # 22 CPU: 0.007
Node # 55 CPU: 0.010
Node # 30 CPU: 0.005
Node # 24 CPU: 0.106
Node # 62 CPU: 0.006
Node # 51 CPU: 0.015
Node # 52 CPU: 0.105
Node # 61 CPU: 0.225
Node # 23 CPU: 0.111
Node # 3 CPU: 0.028
p61_12840: p4_error: interrupt SIGSEGV: 11
p59_12682: p4_error: interrupt SIGSEGV: 11
p56_12170: p4_error: interrupt SIGSEGV: 11
p58_12609: p4_error: interrupt SIGSEGV: 11
p60_12761: p4_error: interrupt SIGSEGV: 11
p28_8892: p4_error: interrupt SIGSEGV: 11
p26_8746: p4_error: interrupt SIGSEGV: 11
p29_8965: p4_error: interrupt SIGSEGV: 11
p31_9111: p4_error: interrupt SIGSEGV: 11
p63_12994: p4_error: interrupt SIGSEGV: 11
p24_8600: p4_error: interrupt SIGSEGV: 11
p23_9061: p4_error: interrupt SIGSEGV: 11
p62_12919: p4_error: interrupt SIGSEGV: 11
p22_8988: p4_error: interrupt SIGSEGV: 11
p20_8842: p4_error: interrupt SIGSEGV: 11
p4_10732: p4_error: net_recv read: probable EOF on socket: 1
p55_10212: p4_error: interrupt SIGSEGV: 11
p52_9993: p4_error: interrupt SIGSEGV: 11
p51_9920: p4_error: interrupt SIGSEGV: 11
p54_10139: p4_error: interrupt SIGSEGV: 11
p27_8819: p4_error: interrupt SIGSEGV: 11
p18_8696: p4_error: interrupt SIGSEGV: 11
p19_8769: p4_error: interrupt SIGSEGV: 11
p30_9038: p4_error: interrupt SIGSEGV: 11
p5_10924: p4_error: net_recv read: probable EOF on socket: 1
p21_8915: p4_error: interrupt SIGSEGV: 11
p50_9847: p4_error: interrupt SIGSEGV: 11
p53_10066: p4_error: interrupt SIGSEGV: 11
Node # 40 CPU: 0.203
Node # 47 CPU: 0.110
Node # 43 CPU: 0.114
Node # 45 CPU: 0.021
Node # 44 CPU: 0.203
Node # 42 CPU: 0.200
Node # 46 CPU: 0.207
Node # 35 CPU: 0.108
Node # 37 CPU: 0.012
Node # 34 CPU: 0.104
Node # 36 CPU: 0.104
Node # 39 CPU: 0.104
Node # 32 CPU: 0.001
Node # 38 CPU: 0.098
p40_8381: p4_error: interrupt SIGSEGV: 11
p35_9915: p4_error: interrupt SIGSEGV: 11
p34_9842: p4_error: interrupt SIGSEGV: 11
p36_9988: p4_error: interrupt SIGSEGV: 11
p37_10061: p4_error: interrupt SIGSEGV: 11
p39_10207: p4_error: interrupt SIGSEGV: 11
p43_8600: p4_error: interrupt SIGSEGV: 11
p45_8746: p4_error: interrupt SIGSEGV: 11
p44_8673: p4_error: interrupt SIGSEGV: 11
p42_8527: p4_error: interrupt SIGSEGV: 11
p38_10134: p4_error: interrupt SIGSEGV: 11
p32_9696: p4_error: interrupt SIGSEGV: 11
p46_8819: p4_error: interrupt SIGSEGV: 11
p47_8892: p4_error: interrupt SIGSEGV: 11
Node # 16 CPU: 0.004
p16_8550: p4_error: interrupt SIGSEGV: 11
Node # 48 CPU: 0.200
p48_9701: p4_error: interrupt SIGSEGV: 11
Node # 17 CPU: 0.006
p17_8623: p4_error: interrupt SIGSEGV: 11
Node # 49 CPU: 0.211
p49_9774: p4_error: interrupt SIGSEGV: 11
Node # 33 CPU: 0.110
p33_9769: p4_error: interrupt SIGSEGV: 11
Node # 41 CPU: 0.310
p41_8454: p4_error: interrupt SIGSEGV: 11
Node # 1 CPU: 0.000
Node # 57 CPU: 0.319
p57_12530: p4_error: interrupt SIGSEGV: 11
Node # 25 CPU: 0.109
p25_8673: p4_error: interrupt SIGSEGV: 11


PS:

Read file <benchmark4_err_file> for stderr output of this job.

kate
Posts: 3940
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not running with mvapich

#4 Post by kate »

How did you invoke ROMS in your script? It should have the ocean.in filename as the first argument. This is described here.
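To rule out the simplest cause, a small pre-flight check can confirm that the input script exists and is readable before the job is launched, and then build the command with the script as the first argument. This is a minimal sketch in Python, not part of ROMS: the helper name `build_launch_command` is hypothetical, the `mpirun -np N ocean ocean.in` form is taken from the INP_PAR message, and your MPI installation or batch system may need a different launcher or extra options.

```python
import os
import tempfile


def build_launch_command(executable, input_script, nprocs, launcher="mpirun"):
    """Return an argv list that passes the input script as the first argument.

    Raises FileNotFoundError if the script is missing or unreadable on the
    machine where this check runs.
    """
    if not os.path.isfile(input_script) or not os.access(input_script, os.R_OK):
        raise FileNotFoundError(f"cannot read input script: {input_script}")
    # An absolute path avoids surprises if the job starts in another directory.
    return [launcher, "-np", str(nprocs), executable,
            os.path.abspath(input_script)]


# Demonstration with a throwaway file standing in for the real input script.
with tempfile.TemporaryDirectory() as workdir:
    script = os.path.join(workdir, "ocean.in")
    with open(script, "w") as handle:
        handle.write("! placeholder input script\n")
    command = build_launch_command("ocean", script, 64)
    print(" ".join(command))
```

The check only covers the machine it runs on. On a cluster, every compute node also has to see the same path, for example through a shared filesystem.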

prakrati
Posts: 24
Joined: Thu Oct 21, 2010 9:35 pm
Location: CRL

Re: ROMS not running with mvapich

#5 Post by prakrati »

I did specify ocean_benchmark4.in correctly in the script, and it runs for all sizes except 1024.
