NIWA (Taihoro Nukurangi)

Computer benchmarks

ROMS Upwelling Case

ROMS 1

This is one of the standard test cases in the ROMS model. ROMS 1 is a family of versions written in Fortran 77 and supporting shared-memory parallelism via OpenMP. The upwelling test case is a smallish 3D model run, occupying some 20 MiB of memory when compiled with 8-byte floating-point numbers (the default). To use this model as a benchmark I have taken the test case, reduced the number of time steps to 72, set the output frequency to the minimum and directed output to /tmp (which is normally a local disk or RAM disk). I therefore expect that disk I/O performance will not be a factor in the execution time.

Ocean models like ROMS tend to be sensitive to memory I/O performance because they cycle through 2D and 3D fields, doing relatively little computation on each value. However, ROMS has been specifically designed to be cache-friendly on modern RISC CPUs, so I think a smallish run like this will primarily test CPU performance, especially floating-point.

ROMS 1 UPWELLING results
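| Machine | CPU | OS | Compiler & switches | Notes | CPU time |
|---|---|---|---|---|---|
| Hadfield (2001) | P3 800 MHz | Win 2000 | df (release) | | 170 |
| Hadfield (2001) | P3 800 MHz | Win 2000 | df (debug) | | 630 |
| Hadfield (2001) | P3 800 MHz | Win 2000 | g77 -O | | 340 |
| Hadfield (2001) | P3 800 MHz | Linux | g77 -O | | 310 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | g77 -O3 | | 41 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | g95 -O3 | | 42 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | | 40 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /check:bounds | | 45 |
| Fargo | P3 600 MHz | Linux | g77 -O | | 300 |
| Lebowski | P3 600 MHz | Linux | g77 -O | | 240 |
| Duathlon | Athlon MP1800 × 2 | Linux | g77 -O | | 78 |
| Duathlon | Athlon MP1800 × 2 | Linux | g77 -O | 2 concurrent runs | 110/110 |
| Weinberg | P4 Xeon 2.? GHz | Linux | g77 -O | | 57 |
| Weinberg | P4 Xeon 2.? GHz | Linux | g77 -O | REAL*4 | 38 |
| Shuttle | Athlon XP2600 + 333FSB | Linux | g77 -O | | 49 |
| Shuttle | Athlon XP2600 + 333FSB | Linux | g77 -O | REAL*4 | 36 |
| Grass | P4 2.4 GHz DDR266 | Linux | g77 -O | | 41 |
| Wetocean | P4 Xeon 2.4 GHz × 2 | Linux | g77 -O3 | | 36 |
| Wetocean | P4 Xeon 2.4 GHz × 2 | Linux | g95 -O3 | | 37 |
| Otter | P4 Xeon 2.8 GHz × 2 | Linux | g77 -O | | 41 |
| Otter | P4 Xeon 2.8 GHz × 2 | Linux | f90 -O3 (Absoft) | | 41 |
| Otter | P4 Xeon 2.8 GHz × 2 | Linux | g77 -O | 2 concurrent runs | 55–90 |
| Otter | P4 Xeon 2.8 GHz × 2 | Linux | g77 -O | 4 concurrent runs | 140 |
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O | | 220 |
| Rangi | Alpha EV56 600 MHz | Digital Unix | f95 -O | | 110 |
| Thor | Alpha EV67 667 MHz | Digital Unix | f95 -O | | 32 |
| Thor | Alpha EV67 667 MHz | Digital Unix | f95 -O | REAL*4 | 25 |
| Thor | Alpha EV67 667 MHz | Digital Unix | f95 -O -check bounds | | 52 |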

ROMS 2

ROMS 2 is a rewrite of the model in Fortran 95, allowing multiple nested grids (not yet fully implemented) and supporting distributed-memory parallelism (via MPI) as an alternative to the shared-memory parallelism of the earlier versions. ROMS 2 includes code to measure CPU time during a run; this is the source of the numbers in the table below. One is inclined to be suspicious of these numbers at first, as they imply that the upwelling test case runs approx. 60% faster in ROMS 2 than in ROMS 1. However, in several cases I have compared the CPU time reported by the model with the results of the "time" utility and found good agreement.
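The cross-check against "time" described above can be sketched in Python. This is only an illustration of the idea, not the procedure actually used: the function names are hypothetical, the command in the usage section is a stand-in for a model run, and the child CPU counters from `os.times()` are only meaningful on Unix-like systems.

```python
import os
import subprocess
import sys


def child_cpu_seconds(cmd):
    """Run cmd and return the CPU time (user + system) it consumed.

    This is the same quantity as "user" + "sys" from the time utility.
    Assumption: a Unix-like OS; elsewhere the child counters stay at zero.
    """
    before = os.times()
    subprocess.run(cmd, check=True)
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return user + system


def relative_difference(a, b):
    """Relative difference between two timings, as a fraction of the larger."""
    largest = max(abs(a), abs(b))
    if largest == 0.0:
        return 0.0
    return abs(a - b) / largest


if __name__ == "__main__":
    # Stand-in workload; a real check would launch the model executable here
    # and compare against the CPU time printed in the model's own log.
    measured = child_cpu_seconds([sys.executable, "-c", "sum(range(10**6))"])
    print(f"external CPU time: {measured:.2f} s")
```

A small relative difference between the externally measured figure and the model's self-reported figure is what "good agreement" would look like in this sketch.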

ROMS 2 raises some interesting issues about memory handling and performance. As mentioned in the previous paragraph (see also the table below), the upwelling test case runs significantly faster on many compilers in ROMS 2 than in ROMS 1. However, on some compilers (mostly older ones) early versions of ROMS 2 ran much slower than ROMS 1. This seems to be related to the way in which dummy arguments are declared in subprograms. Early versions of ROMS 2 used explicit-shape declarations; it seems that this causes some compilers to create (unnecessary) temporary copies of array data, which slows down performance drastically. Later versions use assumed-shape declarations, which eliminates the copying.

Another issue relates to tiling. Like most parallel ocean models, ROMS 2 divides the domain horizontally into tiles. In MPI mode the number of tiles must equal the number of MPI nodes (processors). In OpenMP and serial mode the number of tiles must be an integer multiple of the number of threads. I haven't experimented with OpenMP but I have played around with varying the number of tiles in serial mode. For smaller cases like UPWELLING there is no benefit in running more than one tile on a single processor, but on larger runs like BENCHMARK1 (below) the multi-tile configurations run slightly faster. This presumably occurs because the data from each tile fit in the processor's cache.
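The tiling rules can be written down as a short Python check. This is a sketch only: the function and parameter names are hypothetical and are not ROMS identifiers. It also assumes that a tiling written as "4 × 4" means 4 tiles in each horizontal direction, with the first factor applied to the longer (512-point) axis of BENCHMARK1.

```python
def check_tiling(ntile_i, ntile_j, mode, workers=1):
    """Return the total tile count if it satisfies the stated rules.

    mode is "mpi", "openmp" or "serial"; workers is the number of MPI
    processes or threads (1 for serial runs).
    """
    ntiles = ntile_i * ntile_j
    if mode == "mpi":
        # MPI: exactly one tile per process.
        ok = ntiles == workers
    elif mode in ("openmp", "serial"):
        # OpenMP/serial: tile count is an integer multiple of thread count.
        ok = ntiles % workers == 0
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if not ok:
        raise ValueError(f"{ntiles} tiles not allowed for {workers} workers in {mode} mode")
    return ntiles


def tile_extent(n_i, n_j, ntile_i, ntile_j):
    """Horizontal points per tile, rounding up when the division is uneven."""
    return (-(-n_i // ntile_i), -(-n_j // ntile_j))


print(check_tiling(4, 4, "serial"))   # 16 tiles on one thread
print(tile_extent(512, 64, 4, 4))     # (128, 16)
print(tile_extent(512, 64, 8, 8))     # (64, 8)
```

Smaller tiles mean a smaller working set per tile, which is consistent with the cache explanation offered above, though the sketch does not model cache behaviour itself.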

ROMS 2 UPWELLING results
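| Machine | CPU | OS | Compiler & switches | Notes | CPU time |
|---|---|---|---|---|---|
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | | 140 |
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -R b | | 416 |
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | #undef ASSUMED_SHAPE | 119 |
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -R b | #undef ASSUMED_SHAPE | 173 |
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | MPI 2 × 2 | 38 |
| Thor | Alpha EV67 667 MHz | Digital Unix | f90 -fast | | 22 |
| Rickard (2002) | P4 1.8 GHz | Win 2000 | df /fast | | 32 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | | 22 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /check:bounds | | 30 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | g95 -O3 | | 31 |
| Otter | P4 Xeon 2.8 GHz × 2 | | f90 -O1 (Absoft) | | 40 |
| Otter | P4 Xeon 2.8 GHz × 2 | | g95 -O3 | | 29 |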
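As a worked check of the "approx. 60% faster" remark, the following Python sketch compares rows that appear in both UPWELLING tables. It assumes "faster" means the ratio of CPU times minus one, and it takes the row pairings as given even though the compiler switches are not identical in every pair. The helper name is hypothetical.

```python
def percent_faster(t_old, t_new):
    """Speed gain of the new run over the old one, in percent."""
    return 100.0 * (t_old / t_new - 1.0)


# (machine, ROMS 1 CPU time, ROMS 2 CPU time), copied from the tables.
pairs = [
    ("Kupe, f90 -O vs f90 -O3", 220, 140),
    ("Hadfield (2003), df /fast", 40, 22),
    ("Thor, f95 -O vs f90 -fast", 32, 22),
]

for label, t_roms1, t_roms2 in pairs:
    print(f"{label}: {percent_faster(t_roms1, t_roms2):.0f}% faster")
# Kupe, f90 -O vs f90 -O3: 57% faster
# Hadfield (2003), df /fast: 82% faster
# Thor, f95 -O vs f90 -fast: 45% faster
```

Under that reading, the three pairs bracket the quoted figure of about 60%.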

ROMS Benchmark Case

A set of three BENCHMARK runs is bundled in ROMS 2. They are all simulations of an idealised Southern Ocean, on grids of 512 × 64 × 30 (BENCHMARK1), 1024 × 128 × 30 (BENCHMARK2) and 2048 × 256 × 30 (BENCHMARK3). BENCHMARK1 takes approx. 300 MiB of RAM in REAL*8 precision, so it can be run on a number of machines at Greta Point. BENCHMARK2 takes approx. 1200 MiB of RAM in REAL*8 precision and I have not yet found a machine that will run it in serial mode. On Kupe it requires a minimum of between 8 and 16 processors.
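The memory figures are consistent with simple scaling by grid size, as this Python sketch shows. It assumes memory is proportional to the number of 3D grid points and ignores fixed overheads and 2D fields. The BENCHMARK3 value is an extrapolation under that assumption, not a measurement.

```python
MIB = 1024 * 1024

GRIDS = {
    "BENCHMARK1": (512, 64, 30),
    "BENCHMARK2": (1024, 128, 30),
    "BENCHMARK3": (2048, 256, 30),
}


def grid_points(shape):
    """Total number of points in a 3D grid."""
    nx, ny, nz = shape
    return nx * ny * nz


def scaled_memory_mib(ref_mem_mib, ref_shape, shape):
    """Memory estimate for shape, scaled linearly from a reference case."""
    return ref_mem_mib * grid_points(shape) / grid_points(ref_shape)


ref = GRIDS["BENCHMARK1"]

# Bytes per grid point implied by the quoted 300 MiB for BENCHMARK1.
per_point = 300 * MIB / grid_points(ref)
print(per_point)        # 320.0 bytes
print(per_point / 8)    # 40.0 REAL*8 values per grid point

for name, shape in GRIDS.items():
    print(name, scaled_memory_mib(300, ref, shape), "MiB")
# BENCHMARK1 300.0 MiB
# BENCHMARK2 1200.0 MiB
# BENCHMARK3 4800.0 MiB
```

Each step up quadruples the horizontal grid, so the estimate quadruples as well. That reproduces the quoted 1200 MiB for BENCHMARK2.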

ROMS 1

I have ported the BENCHMARK cases back into the ROMS 1 source code. Since ROMS 1 supports only serial and OpenMP modes, BENCHMARK1 is the only one I can run. Here it is run for 20 time steps. 

ROMS 1 BENCHMARK1 results
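| Machine | CPU | OS | Compiler & switches | Notes | CPU time |
|---|---|---|---|---|---|
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | | 185 |
| Otter | P4 Xeon 2.8 GHz × 2 | Linux | f90 -O3 | | 205 |
| Otter | P4 Xeon 2.8 GHz × 2 | Linux | g95 -O3 | | 195 |
| Otter | P4 Xeon 2.8 GHz × 2 | Linux | g77 -O | | 195 |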

ROMS 2

Here are BENCHMARK1 results from ROMS 2. I originally ran most of these for 20 time steps but have been redoing them with the standard ROMS 2.1 input file, which runs the simulation for 200 steps.

ROMS 2 BENCHMARK1 results
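| Machine | CPU | OS | Compiler & switches | Notes | CPU time |
|---|---|---|---|---|---|
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | Serial | 770 |
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | Serial, #undef ASSUMED_SHAPE | 740 |
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | MPI 12 × 2 | 36 |
| Nforce2 | Athlon XP2600 | Linux | ifc -O3 -tpp7 | | 198 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | 200 steps | 1600 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | Serial 4 × 4 | 136 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | Serial 8 × 2 | 147 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | df /fast | Serial 8 × 8 | 137 |
| Hadfield (2003) | P4 2.67 GHz DDR 266 | Win 2000 | g95 -O3 | Serial 4 × 4 | 230 |
| Otter | P4 Xeon 2.8 GHz × 2 | Linux | f90 -O1 | Serial 4 × 4 | 240 |
| Otter | P4 Xeon 2.8 GHz × 2 | Linux | g95 -O3 | Serial 4 × 4 | 200 |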

Here are BENCHMARK2 results from ROMS 2 on Kupe:

ROMS 2 BENCHMARK2 results
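| Machine | CPU | OS | Compiler & switches | Notes | CPU time |
|---|---|---|---|---|---|
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | MPI 16 × 1 | 2080 |
| Kupe | Alpha EV5 600 MHz | UNICOS/mk | f90 -O3 | MPI 16 × 2 | 1340 |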
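The two BENCHMARK2 rows allow a rough scaling estimate, sketched below in Python. It rests on two assumptions that the page does not confirm: that the MPI tiling gives the processor count (16 × 1 = 16, 16 × 2 = 32), and that the reported CPU time is a per-process figure comparable between the two runs. The function names are hypothetical.

```python
def speedup(t_base, t_new):
    """How many times faster the new run is than the base run."""
    return t_base / t_new


def scaling_efficiency(t_base, p_base, t_new, p_new):
    """Speedup achieved divided by the increase in processor count."""
    return speedup(t_base, t_new) / (p_new / p_base)


# Values copied from the BENCHMARK2 table.
t16, p16 = 2080, 16 * 1
t32, p32 = 1340, 16 * 2

print(f"speedup from {p16} to {p32} processors: {speedup(t16, t32):.2f}")
# speedup from 16 to 32 processors: 1.55
print(f"scaling efficiency: {scaling_efficiency(t16, p16, t32, p32):.2f}")
# scaling efficiency: 0.78
```

On those assumptions, doubling the processor count recovers a little over three quarters of the ideal twofold gain.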

Mark Hadfield 2004-11-10