Inconsistent timestep reporting after restart under OpenMP?

kawase
Posts: 7
Joined: Tue Aug 31, 2004 4:40 pm
Location: University of Washington

#1 Post by kawase » Sat Mar 01, 2014 8:39 pm

I am having a puzzling problem running a very simple (2-D), idealized tidal simulation on an Ubuntu box with a 6-core/12-hyperthread Intel processor. After a restart, the timestep count and the time reported on the standard output line switch erratically between the cumulative count/time for the whole run and the ones counted from the beginning of the restart, e.g.:

1037740    60 01:18:20  3.056049E-04  3.876369E-04  6.932419E-04  3.751680E+17
             (444,396)  1.765230E-03  2.295712E-03  4.060942E-03  3.896062E-01
    950     0 01:19:10  3.047232E-04  3.884660E-04  6.931892E-04  3.751680E+17
             (444,396)  1.782631E-03  2.305139E-03  4.087770E-03  3.903817E-01
    960     0 01:20:00  3.038647E-04  3.892719E-04  6.931366E-04  3.751680E+17
             (444,396)  1.800103E-03  2.314523E-03  4.114626E-03  3.911381E-01
    970     0 01:20:50  3.030295E-04  3.900545E-04  6.930840E-04  3.751680E+17
             (444,396)  1.817652E-03  2.323872E-03  4.141525E-03  3.918753E-01
1037780    60 01:21:40  3.022178E-04  3.908137E-04  6.930315E-04  3.751680E+17
             (444,396)  1.835281E-03  2.333205E-03  4.168486E-03  3.925934E-01
    990     0 01:22:30  3.014298E-04  3.915492E-04  6.929790E-04  3.751680E+17
             (444,396)  1.852984E-03  2.342514E-03  4.195498E-03  3.932922E-01
   1000     0 01:23:20  3.006655E-04  3.922610E-04  6.929265E-04  3.751680E+17
             (444,396)  1.870759E-03  2.351797E-03  4.222556E-03  3.939718E-01

No doubt related to this, the tidal forcing appears to become erratic as well, since it is computed analytically from time(ng). This happens with OpenMP under all compile options (-g and various -O optimization levels) but not with a single-thread executable. I'm using gfortran 4.6.3 and the Ubuntu-supplied netCDF libraries (4.1.1) on Ubuntu 12.04 LTS.
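As a quick sanity check on the listing above (a sketch, not part of the original run): the clock advances 50 s for every 10 steps, which implies a 5 s timestep, assuming the timestep is constant. Under that assumption, each counter is consistent with its own day/time columns, and the two counters differ by exactly 60 days' worth of steps. The variable names are illustrative only.

```python
# Values copied from the log excerpt above.
DT = 50 / 10  # seconds per step, inferred: 10 steps advance the clock by 50 s


def clock_seconds(days, hms):
    """Convert the 'days hh:mm:ss' columns of a log line to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return days * 86400 + h * 3600 + m * 60 + s


through = (1037740, 60, "01:18:20")  # line showing the cumulative count
local = (950, 0, "01:19:10")         # next line, counted from the restart

# Each counter agrees with the time reported on its own line.
assert through[0] * DT == clock_seconds(through[1], through[2])
assert local[0] * DT == clock_seconds(local[1], local[2])

# The second line is 10 steps later, so compare against step 940.
offset_steps = through[0] - (local[0] - 10)
offset_days = offset_steps * DT / 86400
print(offset_steps, offset_days)  # 1036800 60.0
```

So the lines with small step numbers are not garbage: they are the same steps with the 60-day restart offset missing from both the step counter and the day column.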

Has anyone experienced something like this? I suspect that the time information from the restart file is not properly passed on to all threads (and as such it is probably a gfortran bug, if anything). But I only very recently migrated this code and the attendant input files from ROMS 2 to the latest version, so I may be overlooking something obvious.
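To make that suspicion concrete, here is a minimal Python sketch of the general failure mode: per-thread state that is initialized on only one thread. It is not ROMS code and is not claimed to be the actual cause of the problem; `state`, `restart_step`, and `report` are hypothetical names, and `threading.local` merely stands in for thread-private data in an OpenMP program.

```python
import threading

# Per-thread storage: each thread sees its own copy of the attributes.
state = threading.local()


def report(step_since_restart):
    """Return the step number as this thread would print it."""
    # A thread that never received the restart offset falls back to 0.
    base = getattr(state, "restart_step", 0)
    return base + step_since_restart


# Only the main thread "reads the restart file" and stores the offset
# (1036800 steps is the 60-day offset implied by the log excerpt).
state.restart_step = 1036800

results = [report(940)]  # reported by the main thread

# A second thread reports the next line without ever seeing the offset.
worker = threading.Thread(target=lambda: results.append(report(950)))
worker.start()
worker.join()

print(results)  # [1037740, 950]
```

Whichever thread happens to write the diagnostic line decides which number appears, which would produce the kind of switching seen in the output.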

TIA - Mitsuhiro

m.hadfield
Posts: 520
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

#2 Post by m.hadfield » Mon Mar 03, 2014 5:01 am

It's a bug in the current ROMS when restarting under OpenMP, not specific to any compiler or platform. See this thread:

viewtopic.php?f=19&t=3260

As far as I am aware, there has been no solution posted. You may be able to work around it by reverting to an earlier version, but I don't know which one.

#3 Post by kawase » Thu Mar 06, 2014 1:41 pm

Thank you for the info. That's a pity :( I've switched to using shared-memory MPI for parallelism, and that certainly is working.

#4 Post by m.hadfield » Thu Mar 06, 2014 9:16 pm

Yes, MPI does work and is the best choice for most situations. OpenMP is generally not used as much and is therefore less well tested. However, OpenMP is much faster than MPI for floats simulations, in my experience.

I don't think this bug will be too hard to fix, but I don't have time to look at it further right now.

arango
Site Admin
Posts: 1103
Joined: Wed Feb 26, 2003 4:41 pm
Location: IMCS, Rutgers University

#5 Post by arango » Mon Mar 17, 2014 2:11 pm

This bug was corrected. Please check the following ticket for details. Thank you for reporting this problem. Please update.
