Ocean Modeling Discussion

ROMS/TOMS

PostPosted: Fri Mar 09, 2018 1:34 pm 

Joined: Wed Jan 02, 2008 3:15 pm
Posts: 78
Location: University of Copenhagen
Dear Users,
Has anybody ever experienced the following problem? It happened in the sixth year of the model run. The model is configured with two one-way nested domains. I don't really know what caused it, but I guess it could be related to memory or the calculations carried out by the CPU. Thank you in advance for your kind help.



664885 2308 15:05:00 4.384476E-03 1.947892E+04 1.947893E+04 1.179191E+15 01
(071,100,32) 0.000000E+00 2.165410E-02 1.365106E+00 1.204906E+00
1994655 2308 15:05:00 4.166940E-03 9.709681E+03 9.709686E+03 2.807089E+13 02
(120,108,26) 1.272484E-02 1.510138E-03 7.822898E-02 8.277512E-01
1994656 2308 15:06:40 4.149592E-03 9.709604E+03 9.709608E+03 2.807119E+13 02
(120,108,26) 1.256059E-02 1.479506E-03 7.514652E-02 8.372069E-01

Blowing-up: Saving latest model state into RESTART file

WRT_RST - wrote re-start fields (Index=1,2) in record = 0000003 01

Blowing-up: Saving latest model state into RESTART file

Node # 30 CPU: 948932.114
WRT_RST - wrote re-start fields (Index=1,1) in record = 0000003 02

Elapsed CPU time (seconds):

Node # 0 CPU: 948231.009
Node # 22 CPU: 948987.058
Node # 1 CPU: 948848.587
Node # 2 CPU: 949108.055
Node # 3 CPU: 948931.968
Node # 4 CPU: 948258.722
Node # 5 CPU: 949246.523
Node # 6 CPU: 949428.057
Node # 7 CPU: 949589.333
Node # 8 CPU: 949136.765
Node # 9 CPU: 948345.301
Node # 10 CPU: 948864.655
Node # 11 CPU: 948901.377
Node # 12 CPU: 949084.637
Node # 13 CPU: 949021.165
Node # 14 CPU: 948570.448
Node # 15 CPU: 949175.940
Node # 16 CPU: 949067.197
Node # 17 CPU: 949038.260
Node # 18 CPU: 949279.973
Node # 19 CPU: 949803.201
Node # 20 CPU: 949101.272
Node # 21 CPU: 949068.425
Node # 31 CPU: 948373.140
Node # 23 CPU: 948167.371
Node # 24 CPU: 948593.581
Node # 25 CPU: 948825.095
Node # 26 CPU: 948294.588
Node # 27 CPU: 948344.970
Node # 28 CPU: 948572.975
Node # 29 CPU: 949112.105
Total: 30364303.869

Cheers,
Farshid


PostPosted: Fri Mar 09, 2018 4:23 pm 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
I've had plenty of single-grid domains blow up like that after some years of running. It could have nothing whatever to do with the nesting. I always look at the saved restart file(s) to see where and how things go bad. I've made diag.F more verbose when things go bad to help find the trouble. Then usually I can restart with a shorter timestep and get through the trouble. If that doesn't work, that's when the real work begins.
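For example, a minimal sketch along those lines, assuming a standard ROMS NetCDF restart and the Python netCDF4 package (the variable list and the file name are placeholders for your own setup):

# Scan the restart file that WRT_RST wrote and report min/max and any
# non-finite values for a few fields, to see which one goes bad first.
import numpy as np
from netCDF4 import Dataset

def scan_restart(path, varnames=("zeta", "temp", "salt", "u", "v")):
    with Dataset(path) as nc:
        for name in varnames:
            if name not in nc.variables:
                continue
            field = nc.variables[name][-1]          # last record written before the blow-up
            data = np.ma.masked_invalid(field)      # mask NaN/Inf in addition to land
            bad = np.ma.count_masked(data) - np.ma.count_masked(field)
            print(f"{name:5s} min={data.min():13.5e} max={data.max():13.5e} "
                  f"non-finite points={bad}")

scan_restart("ocean_rst_02.nc")   # placeholder name for the Grid 02 restart file

The (i,j,k) triplets ROMS prints in the log, e.g. (120,108,26), point at the same hot spot, so the two views should agree.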


PostPosted: Fri Mar 09, 2018 4:33 pm 

Joined: Wed Jan 02, 2008 3:15 pm
Posts: 78
Location: University of Copenhagen
Dear users,
I think the problem might be due to the tiling configuration. This is what I have:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2


I have 2 CPU sockets, each with 8 cores, and each core runs 2 threads. The maximum thread count is therefore 2 sockets x 8 cores x 2 threads per core = 32, and the maximum physical core count is 16.

Lm == 114 120
Mm == 114 120

NtileI == 4 4
NtileJ == 8 8

A 4x8 partitioning of the 114x114 grid ("Grid 01") gives an MPI subdomain size of 28.5 x 14.25 points, which does not divide evenly, whereas it works out evenly for Grid 02 (30 x 15). Anyway, I would greatly appreciate any comments. Regards, Farshid
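For reference, the tiling arithmetic as a quick sketch (plain Python, just restating the numbers above):

# An uneven split means some tiles carry an extra row/column of points,
# which costs a little load balance but is not an error in itself.
def tile_split(Lm, Mm, NtileI, NtileJ):
    even = (Lm % NtileI == 0) and (Mm % NtileJ == 0)
    note = "" if even else "  (uneven)"
    print(f"{Lm}x{Mm} grid on {NtileI}x{NtileJ} tiles: "
          f"{Lm / NtileI:g} x {Mm / NtileJ:g} points per tile{note}")

tile_split(114, 114, 4, 8)   # Grid 01: 28.5 x 14.25 per tile, uneven
tile_split(120, 120, 4, 8)   # Grid 02: 30 x 15 per tile, even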


PostPosted: Fri Mar 09, 2018 4:41 pm 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
That tiling will make things slightly inefficient, but it won't cause the model to blow up. A blow-up from a mis-configured run happens at timestep one, not after six years.


PostPosted: Fri Mar 09, 2018 5:00 pm 

Joined: Wed Jan 02, 2008 3:15 pm
Posts: 78
Location: University of Copenhagen
Dear Kate,
I've tried two different timestep settings, as follows:

NTIMES == 1036800 3110400
DT == 900.0d0 300.0d0
NDTFAST == 30 30

&

NTIMES == 3110400 9331200
DT == 300.0d0 100.0d0
NDTFAST == 50

The second one runs too slowly. What would you suggest for a proper timestep here?

Regards, Farshid


PostPosted: Fri Mar 09, 2018 5:50 pm 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
The timestep that worked for six years should be close. If that was the 900, 300, then try 840, 280 or 810, 270.
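Since the total simulated time is NTIMES x DT, NTIMES has to be rescaled to match whichever DT you settle on. A quick sketch of that arithmetic with the numbers quoted in this thread (plain Python):

# 1036800 steps x 900 s = 933,120,000 s of simulation in the original setup.
TOTAL_SECONDS = 1036800 * 900

def ntimes_for(dt):
    n, rem = divmod(TOTAL_SECONDS, dt)
    extra = "" if rem == 0 else f"  ({rem} s left over)"
    print(f"DT = {dt:3d} s -> NTIMES = {n}{extra}")

for dt in (900, 840, 810, 300, 280, 270):
    ntimes_for(dt)

# 810 s and 270 s divide the period exactly (NTIMES = 1,152,000 and 3,456,000);
# 840 s and 280 s leave a 120 s remainder, so trim or round the run length accordingly.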

