Ocean Modeling Discussion

ROMS/TOMS

PostPosted: Fri Mar 09, 2018 1:34 pm 

Joined: Wed Jan 02, 2008 3:15 pm
Posts: 78
Location: University of Copenhagen
Dear Users,
Has anybody ever experienced the following problem? It happened in the sixth year of the model run. The model is configured with two one-way nested domains. I don't really know what caused it, but I guess it could be related to memory or the calculations carried out by the CPU. Thank you in advance for your kind help.



664885 2308 15:05:00 4.384476E-03 1.947892E+04 1.947893E+04 1.179191E+15 01
(071,100,32) 0.000000E+00 2.165410E-02 1.365106E+00 1.204906E+00
1994655 2308 15:05:00 4.166940E-03 9.709681E+03 9.709686E+03 2.807089E+13 02
(120,108,26) 1.272484E-02 1.510138E-03 7.822898E-02 8.277512E-01
1994656 2308 15:06:40 4.149592E-03 9.709604E+03 9.709608E+03 2.807119E+13 02
(120,108,26) 1.256059E-02 1.479506E-03 7.514652E-02 8.372069E-01

Blowing-up: Saving latest model state into RESTART file

WRT_RST - wrote re-start fields (Index=1,2) in record = 0000003 01

Blowing-up: Saving latest model state into RESTART file

Node # 30 CPU: 948932.114
WRT_RST - wrote re-start fields (Index=1,1) in record = 0000003 02

Elapsed CPU time (seconds):

Node # 0 CPU: 948231.009
Node # 22 CPU: 948987.058
Node # 1 CPU: 948848.587
Node # 2 CPU: 949108.055
Node # 3 CPU: 948931.968
Node # 4 CPU: 948258.722
Node # 5 CPU: 949246.523
Node # 6 CPU: 949428.057
Node # 7 CPU: 949589.333
Node # 8 CPU: 949136.765
Node # 9 CPU: 948345.301
Node # 10 CPU: 948864.655
Node # 11 CPU: 948901.377
Node # 12 CPU: 949084.637
Node # 13 CPU: 949021.165
Node # 14 CPU: 948570.448
Node # 15 CPU: 949175.940
Node # 16 CPU: 949067.197
Node # 17 CPU: 949038.260
Node # 18 CPU: 949279.973
Node # 19 CPU: 949803.201
Node # 20 CPU: 949101.272
Node # 21 CPU: 949068.425
Node # 31 CPU: 948373.140
Node # 23 CPU: 948167.371
Node # 24 CPU: 948593.581
Node # 25 CPU: 948825.095
Node # 26 CPU: 948294.588
Node # 27 CPU: 948344.970
Node # 28 CPU: 948572.975
Node # 29 CPU: 949112.105
Total: 30364303.869

Cheers,
Farshid


PostPosted: Fri Mar 09, 2018 4:23 pm 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
I've had plenty of single-grid domains blow up like that after some years of running. It could have nothing whatever to do with the nesting. I always look at the saved restart file(s) to see where and how things go bad. I've made diag.F more verbose when things go bad to help find the trouble. Then usually I can restart with a shorter timestep and get through the trouble. If that doesn't work, that's when the real work begins.
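For example, a minimal sketch along those lines, assuming a standard ROMS NetCDF restart and the Python netCDF4 package (the variable list and the file name are placeholders for your own setup):

# Scan the restart file that WRT_RST wrote and report min/max and any
# non-finite values for a few fields, to see which one goes bad first.
import numpy as np
from netCDF4 import Dataset

def scan_restart(path, varnames=("zeta", "temp", "salt", "u", "v")):
    with Dataset(path) as nc:
        for name in varnames:
            if name not in nc.variables:
                continue
            field = nc.variables[name][-1]          # last record written before the blow-up
            data = np.ma.masked_invalid(field)      # mask NaN/Inf in addition to land
            bad = np.ma.count_masked(data) - np.ma.count_masked(field)
            print(f"{name:5s} min={data.min():13.5e} max={data.max():13.5e} "
                  f"non-finite points={bad}")

scan_restart("ocean_rst_02.nc")   # placeholder name for the Grid 02 restart file

The (i,j,k) triplets ROMS prints in the log, e.g. (120,108,26), point at the same hot spot, so the two views should agree.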


PostPosted: Fri Mar 09, 2018 4:33 pm 

Joined: Wed Jan 02, 2008 3:15 pm
Posts: 78
Location: University of Copenhagen
Dear users,
I think the problem might be due to the tiling configuration. This is what I have:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2


I have 2 CPU sockets, each with 8 cores, and each core runs 2 threads. The maximum thread count is therefore 2 sockets x 8 cores x 2 threads per core = 32, and the maximum physical core count is 16.

Lm == 114 120
Mm == 114 120

NtileI == 4 4
NtileJ == 8 8

A 4x8 partitioning of the 114x114 grid ("Grid 01") gives an MPI subdomain size of 28.5 x 14.25 points, which does not divide evenly, whereas it works out evenly for Grid 02 (30 x 15). Anyway, I would greatly appreciate any comments. Regards, Farshid
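For reference, the tiling arithmetic as a quick sketch (plain Python, just restating the numbers above):

# An uneven split means some tiles carry an extra row/column of points,
# which costs a little load balance but is not an error in itself.
def tile_split(Lm, Mm, NtileI, NtileJ):
    even = (Lm % NtileI == 0) and (Mm % NtileJ == 0)
    note = "" if even else "  (uneven)"
    print(f"{Lm}x{Mm} grid on {NtileI}x{NtileJ} tiles: "
          f"{Lm / NtileI:g} x {Mm / NtileJ:g} points per tile{note}")

tile_split(114, 114, 4, 8)   # Grid 01: 28.5 x 14.25 per tile, uneven
tile_split(120, 120, 4, 8)   # Grid 02: 30 x 15 per tile, even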


PostPosted: Fri Mar 09, 2018 4:41 pm 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
That tiling will make things slightly inefficient, but it won't cause the model to blow up. A blow-up from a mis-configured run happens at timestep one, not after six years.


PostPosted: Fri Mar 09, 2018 5:00 pm 

Joined: Wed Jan 02, 2008 3:15 pm
Posts: 78
Location: University of Copenhagen
Dear Kate,
I've tried two different timestep settings, as follows:

NTIMES == 1036800 3110400
DT == 900.0d0 300.0d0
NDTFAST == 30 30

&

NTIMES == 3110400 9331200
DT == 300.0d0 100.0d0
NDTFAST == 50

The second one runs too slowly. What would you suggest for a proper timestep here?

Regards, Farshid


PostPosted: Fri Mar 09, 2018 5:50 pm 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
The timestep that worked for six years should be close. If that was the 900, 300, then try 840, 280 or 810, 270.
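Since the total simulated time is NTIMES x DT, NTIMES has to be rescaled to match whichever DT you settle on. A quick sketch of that arithmetic with the numbers quoted in this thread (plain Python):

# 1036800 steps x 900 s = 933,120,000 s of simulation in the original setup.
TOTAL_SECONDS = 1036800 * 900

def ntimes_for(dt):
    n, rem = divmod(TOTAL_SECONDS, dt)
    extra = "" if rem == 0 else f"  ({rem} s left over)"
    print(f"DT = {dt:3d} s -> NTIMES = {n}{extra}")

for dt in (900, 840, 810, 300, 280, 270):
    ntimes_for(dt)

# 810 s and 270 s divide the period exactly (NTIMES = 1,152,000 and 3,456,000);
# 840 s and 280 s leave a 120 s remainder, so trim or round the run length accordingly.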

