ROMS 3.0 on the Altix 4700, benchmarks





#1 Post by gerardo »

I ran the BENCHMARK1 and BENCHMARK2 test cases on 32 cores of an Altix 4700 with dual-core Itanium2 ("Montecito") processors (1.6 GHz, 9 MB L3 cache per core), and the BENCHMARK3 test case on 64, 128 and 192 cores of the same system.

The benchmarks ran pretty much out of the box, with only minor changes to Compilers/, most notably some paths and the compiler options for ifort (10.0.013 beta, the latest), for which I used:

FFLAGS += -ip -O3 -unroll0 -ftz -fno-alias -g

Note that using -g with Intel Fortran doesn't affect any of the optimizations, and serves only to keep extra symbol table information in the executable, which is excellent for debugging and profiling purposes.

The MPI version has clearly improved when compared against ROMS 2.2, and is now uniformly faster than the OpenMP version. The SGI MPI implementation (over shared memory) is pretty fast. Curiously, the MPI version does better with NtileI > NtileJ, in contrast to the OpenMP version, where NtileI = 2 does best in most test cases.
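One plausible reason for that difference (my speculation, not from the post) is tile geometry: for the BENCHMARK1 grid, NtileI > NtileJ gives squarer tiles with a smaller perimeter, which means less halo data to exchange in MPI, while NtileI = 2 keeps the inner I dimension long, which favors the vectorized inner loops under OpenMP. A small sketch in plain Python, using the 512x64 grid from the table below:

```python
# Tile shapes and halo perimeters for the BENCHMARK1 grid (512x64),
# for two of the decompositions from the table below.
GRID_I, GRID_J = 512, 64

def tile_shape(ntile_i, ntile_j):
    """Interior points per tile for an NtileI x NtileJ decomposition."""
    return GRID_I // ntile_i, GRID_J // ntile_j

for ni, nj in [(2, 16), (16, 2)]:
    ti, tj = tile_shape(ni, nj)
    perimeter = 2 * (ti + tj)   # rough proxy for halo-exchange volume per tile
    print(f"{ni:2d}x{nj:<2d} -> tile {ti}x{tj}, perimeter {perimeter}")
```

The 16x2 split yields compact 32x32 tiles (perimeter 128), while 2x16 yields long thin 256x4 tiles (perimeter 520), consistent with 16x2 being the best MPI decomposition and 2x16 the best OpenMP one.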

Here are the wallclock times in seconds for the various runs I made:

Code:

Benchmark1: 512x64x30

Decomp     MPI elapsed     OpenMP elapsed
 1x32        29.35            34.90
 2x16        21.84            27.94
 4x8         18.65            28.84
32x1         16.29           100.22
 8x4         16.16            43.61
16x2         15.14            56.53

Benchmark2: 1024x128x30

Decomp     MPI elapsed     OpenMP elapsed
 1x32        91.39            90.28
 2x16        70.96            78.21
 4x8         63.70            79.56
32x1         58.57           277.21
 8x4         57.73           123.67
16x2         54.94           178.90

Benchmark3: 2048x256x30

Decomp     MPI elapsed     OpenMP elapsed
 2x32       157.53           178.60
 4x16       135.70           171.18
 2x64       100.33           118.23
 8x8        120.45           263.14
32x2        116.24           474.47
16x4        115.92           382.44
 4x32        81.47           103.86
 8x16        70.91           138.24
16x8         66.02           197.59
64x2         64.69           387.64
32x4         61.06           250.06
 6x32        58.29           117.84
 8x24        55.63           117.85
12x16        52.35           138.27
16x12        52.23           152.11
64x3         51.32           314.30
48x4         50.27           250.98
24x8         49.81           168.98
32x6         49.46           200.25
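
The BENCHMARK3 column can also be read as a strong-scaling test. A quick script (plain Python, MPI times copied from the table above) picks the best decomposition at each core count and computes speedup and parallel efficiency relative to the 64-core runs; this is my own reading of the numbers, not part of the original post.

```python
# Best MPI wallclock per core count for BENCHMARK3 (2048x256x30).
# Keys are (NtileI, NtileJ) decompositions; values are elapsed seconds,
# copied from the table above.
mpi = {
    (2, 32): 157.53, (4, 16): 135.70, (8, 8): 120.45,
    (32, 2): 116.24, (16, 4): 115.92,
    (2, 64): 100.33, (4, 32): 81.47, (8, 16): 70.91,
    (16, 8): 66.02, (64, 2): 64.69, (32, 4): 61.06,
    (6, 32): 58.29, (8, 24): 55.63, (12, 16): 52.35,
    (16, 12): 52.23, (64, 3): 51.32, (48, 4): 50.27,
    (24, 8): 49.81, (32, 6): 49.46,
}

def best(cores):
    """Fastest elapsed time among decompositions using `cores` processes."""
    return min(t for (ni, nj), t in mpi.items() if ni * nj == cores)

t64 = best(64)
for cores in (64, 128, 192):
    speedup = t64 / best(cores)
    eff = speedup / (cores / 64)   # parallel efficiency vs. the 64-core run
    print(f"{cores:3d} cores: best {best(cores):6.2f} s, "
          f"speedup {speedup:.2f}, efficiency {eff:.0%}")
```

Doubling from 64 to 128 cores keeps efficiency near 95%, while tripling to 192 cores drops it to just under 80%, so the choice of decomposition matters more than raw core count at the high end.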


#2 Post by hetland »

This short suite of benchmarks was run on a cluster of 2.0 GHz dual-core Opterons. The cluster has eight nodes, two processors per node, and two cores per processor, for a total of 16 actual CPUs and 32 kind-of CPUs. The nodes are connected with InfiniBand. The total cost of the cluster was about 40K, with 15K of that for the 24-port InfiniBand switch (which is only 1/3 full presently). I only ran the BENCHMARK2 test case, over a few settings that I would normally try. Here are the results:

Code:


   16 CPUS

      Total:                3513
      avg seconds per node:  219

      Total:                3489
      avg seconds per node:  218

   32 CPUS

      Total:                3380
      avg seconds per node:  105

      Total:                3255
      avg seconds per node:  102
The avg seconds per node, as computed in the ROMS output file, is within a second of the actual wallclock time. The interesting thing here is that using both CPU cores does indeed speed things up. The other thing is that this machine is generally only about half as fast as the Altix 4700 described above.
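
As a rough sanity check on that "half as fast" estimate (my comparison, not part of either post): the best 32-process BENCHMARK2 time on the Altix above was 54.94 s, versus 102 s per node here.

```python
# Rough cross-machine comparison for BENCHMARK2 (1024x128x30), using the
# best 32-core times quoted in the two posts above.
altix_32 = 54.94     # Altix 4700, best MPI decomposition (16x2)
cluster_32 = 102.0   # Opteron cluster, best avg seconds per node at 32 CPUs

ratio = cluster_32 / altix_32
print(f"Opteron cluster is about {ratio:.2f}x slower on this run")
```

The ratio comes out a bit under 2x, which matches the "about half as fast" observation.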
