Slow performance on SGI with Opterons

antoniofetter · Thu Apr 04, 2013 4:00 pm

I have been struggling with a benchmark problem with ROMS and maybe you can bring a different insight to this issue.

I bought an SGI server to run ROMS for a modeling project. The server has four Opteron processors with 16 cores each (2 GHz) and 256 GB of RAM.

I have compiled the upwelling application and run it serially on one core. This is the result:

Code: Select all

Elapsed CPU time (seconds):

Thread #  0 CPU:     460.741
Total:               460.741

Nonlinear model elapsed time profile:

 Allocation and array initialization ..............                 0.104  ( 0.0226 %)
 Ocean state initialization .......................                 0.012  ( 0.0026 %)
 Reading of input data ............................                 0.004  ( 0.0009 %)
 Processing of input data .........................                 0.084  ( 0.0182 %)
 Processing of output time averaged data ..........                 81.489  (17.6865 %)
 Computation of vertical boundary conditions ......                 0.068  ( 0.0148 %)
 Computation of global information integrals ......                 2.432  ( 0.5279 %)
 Writing of output data ...........................                 3.752  ( 0.8144 %)
 Model 2D kernel ..................................                 256.432  (55.6565 %)
 2D/3D coupling, vertical metrics .................                 2.352  ( 0.5105 %)
 Omega vertical velocity ..........................                 1.892  ( 0.4107 %)
 Equation of state for seawater ...................                 1.884  ( 0.4089 %)
 3D equations right-side terms ....................                 11.097  ( 2.4084 %)
 3D equations predictor step ......................                 22.493  ( 4.8820 %)
 Pressure gradient ................................                 8.405  ( 1.8241 %)
 Harmonic mixing of tracers, S-surfaces ...........                 3.552  ( 0.7710 %)
 Harmonic stress tensor, S-surfaces ...............                 8.345  ( 1.8111 %)
 Corrector time-step for 3D momentum ..............                 28.246  ( 6.1305 %)
 Corrector time-step for tracers ..................                 23.653  ( 5.1338 %)
                                             Total:                 456.297   99.0354

All percentages are with respect to total time =          460.741

For comparison, I ran the upwelling problem on one core of my Mac laptop (quad-core i7, 2.2 GHz). This is the result:

Code: Select all

Elapsed CPU time (seconds):

Thread #  0 CPU:     130.859
Total:               130.859

Nonlinear model elapsed time profile:

 Allocation and array initialization ..............                 0.047  ( 0.0355 %)
 Ocean state initialization .......................                 0.006  ( 0.0048 %)
 Reading of input data ............................                 0.003  ( 0.0020 %)
 Processing of input data .........................                 0.040  ( 0.0302 %)
 Processing of output time averaged data ..........                 7.020  ( 5.3642 %)
 Computation of vertical boundary conditions ......                 0.042  ( 0.0318 %)
 Computation of global information integrals ......                 1.452  ( 1.1096 %)
 Writing of output data ...........................                 0.561  ( 0.4288 %)
 Model 2D kernel ..................................                 68.150  (52.0792 %)
 2D/3D coupling, vertical metrics .................                 1.103  ( 0.8427 %)
 Omega vertical velocity ..........................                 1.032  ( 0.7890 %)
 Equation of state for seawater ...................                 0.642  ( 0.4907 %)
 3D equations right-side terms ....................                 5.884  ( 4.4967 %)
 3D equations predictor step ......................                 13.001  ( 9.9347 %)
 Pressure gradient ................................                 3.383  ( 2.5854 %)
 Harmonic mixing of tracers, S-surfaces ...........                 1.418  ( 1.0839 %)
 Harmonic stress tensor, S-surfaces ...............                 2.264  ( 1.7298 %)
 Corrector time-step for 3D momentum ..............                 14.086  (10.7640 %)
 Corrector time-step for tracers ..................                 8.481  ( 6.4809 %)
                                             Total:                 128.614   98.2840

All percentages are with respect to total time =          130.859

As you can see, the Mac's Intel processor is much faster. The difference in "Processing of output time averaged data" may be explained by the fact that the server has a RAID in it.

However, the differences in the calculation times are too large. I cannot believe that the Opteron processors are so inferior to the Intel ones, especially in the "Model 2D kernel" part.
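To put numbers on the gap, the per-section slowdown can be computed directly from the two profiles above. The following is a small sketch; the timings are copied from the listings, and only a handful of the larger sections are included.

```python
# Elapsed times in seconds, copied from the two profiles above.
sgi = {
    "Total": 460.741,
    "Model 2D kernel": 256.432,
    "Processing of output time averaged data": 81.489,
    "3D equations predictor step": 22.493,
    "Corrector time-step for 3D momentum": 28.246,
    "Corrector time-step for tracers": 23.653,
}
mac = {
    "Total": 130.859,
    "Model 2D kernel": 68.150,
    "Processing of output time averaged data": 7.020,
    "3D equations predictor step": 13.001,
    "Corrector time-step for 3D momentum": 14.086,
    "Corrector time-step for tracers": 8.481,
}


def slowdown(slow, fast):
    """Return slow/fast for every section present in both profiles."""
    return {name: slow[name] / fast[name] for name in slow if name in fast}


ratios = slowdown(sgi, mac)
# Print the sections from the largest slowdown to the smallest.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:45s} {ratio:5.2f}x")
```

The total comes out at roughly 3.5x and the 2D kernel at roughly 3.8x, while the time-averaging step is more than 11x slower, so the slowdown is far from uniform across sections.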

A friend of mine has raised the question of whether the server may be using the GPU (graphics processor) to do some of the computations. I doubt it! Is there anything in the ROMS code that uses this kind of resource?

With your experience, do you see any other possibility to explain such a difference?

The HPC architecture of the SGI servers is becoming very popular.

robertson · Thu Apr 04, 2013 4:36 pm

The first thing I would check is that your process (oceanS) is not constantly switching cores on your machine. I don't know a good way to do this with oceanS, but I would suggest compiling with Open MPI, then running with one process and enabling the --bind-to-core option (or the equivalent memory affinity option for pre-1.4 versions of Open MPI):

Code: Select all

mpirun --bind-to-core -np 1 oceanM ocean_upwelling.in

The --bind-to-core option will ensure that your program is not continually shuffled between different cores on your machine, which causes extra memory reads and copies, page faults, etc. Many thanks to Andy Moore for letting us know about this option.
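For a serial run that is not launched through mpirun, the same idea can be sketched with the Linux CPU affinity calls that Python exposes in the os module. This is only an illustration of core pinning, not part of ROMS or Open MPI; the helper name pin_to_core is made up for this example, and the calls do not exist on macOS.

```python
import os


def pin_to_core(core):
    """Restrict the calling process to a single logical CPU.

    Returns the resulting set of allowed CPUs, or None on platforms
    where Python does not expose the affinity calls (they are Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {core})  # pid 0 refers to the calling process
    return os.sched_getaffinity(0)


# Pick a CPU that the process is already allowed to use, so the call
# also works inside a restricted cpuset or container.
if hasattr(os, "sched_getaffinity"):
    target = min(os.sched_getaffinity(0))
    allowed = pin_to_core(target)
    print(f"pinned to CPU {target}; allowed set is now {sorted(allowed)}")
else:
    allowed = None
    print("CPU affinity is not available through the os module on this platform")
```

On Linux a child process started afterwards (for example with subprocess) inherits this affinity, so a wrapper along these lines could launch a serial executable on a fixed core.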

antoniofetter wrote:
As you can see, the Mac's Intel processor is much faster. The difference in "Processing of output time averaged data" may be explained by the fact that the server has a RAID in it.

This shouldn't be affected by RAID, because the averaging is done in memory, not by reading and writing to the hard disk. Averages are calculated and then written to the averages file. More likely this is due to needing to access memory that is not local to the processor where the calculations are taking place. If this is the case, the --bind-to-core option I suggested above should help.

antoniofetter wrote:
A friend of mine has raised the question of whether the server may be using the GPU (graphics processor) to do some of the computations. I doubt it! Is there anything in the ROMS code that uses this kind of resource?

ROMS does not currently contain any GPU directives, so unless SGI is doing something behind the scenes, this is not the issue.

There is also the difference between AMD and Intel here. I can't imagine the performance gap is that wide, but in your particular case I believe your laptop will always be faster than the SGI for serial problems and for MPI problems of 4 processes or fewer. Not only does your laptop have a higher clock rate, it also has a higher cache-per-core ratio, which should mean fewer accesses to main memory. Your laptop also doesn't have any memory that isn't local to a given core.

Good luck and please post your progress here.

shchepet · Sat Apr 06, 2013 10:11 pm

Besides everything else highlighted by David, your code spends ~55% of its time computing
the barotropic mode. This is way too much. What is your barotropic gravity wave Courant number,
Cg_max = max_{i,j} { dtfast*sqrt[g*h*(1/dx^2 + 1/dy^2)] }, computed by "metrics.F" and reported
as "Maximum barotropic Courant Number = ???" in the ROMS printout?

Anything less than 0.8 is a plain waste of computing resources.
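The Courant number formula above can be evaluated by hand with a few lines of Python. This is a minimal sketch, assuming a uniform grid spacing (dx, dy constant) and g = 9.81 m/s^2; the function names and the sample depth, spacing and time step are illustrative values, not taken from the upwelling configuration or from metrics.F.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def barotropic_courant(dtfast, h, dx, dy, g=G):
    """Cg = dtfast * sqrt(g*h*(1/dx^2 + 1/dy^2)) at a single grid point."""
    return dtfast * math.sqrt(g * h * (1.0 / dx**2 + 1.0 / dy**2))


def max_courant(dtfast, depths, dx, dy):
    """Cg_max over a collection of depths on a uniform grid."""
    return max(barotropic_courant(dtfast, h, dx, dy) for h in depths)


def dtfast_for_target(target, h_max, dx, dy, g=G):
    """Barotropic time step that gives the target Courant number at depth h_max."""
    return target / math.sqrt(g * h_max * (1.0 / dx**2 + 1.0 / dy**2))


# Illustrative values only: 1 km spacing, 100 m maximum depth, 10 s fast step.
cg = max_courant(10.0, [25.0, 50.0, 100.0], 1000.0, 1000.0)
print(f"Cg_max = {cg:.3f}")
print(f"dtfast for Cg_max = 0.8: {dtfast_for_target(0.8, 100.0, 1000.0, 1000.0):.2f} s")
```

Because the Courant number is linear in dtfast, a value well below 0.8 means the barotropic step could be lengthened by roughly the same factor.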

antoniofetter · Wed Apr 10, 2013 5:21 pm

Dear Shchepet,

This is the default upwelling example that comes with ROMS. I have just compiled and run it on both machines. It is just a benchmark test.

Thanks for your comment.
Best,

Antonio

antoniofetter · Wed Apr 10, 2013 5:51 pm

robertson wrote:
First thing I would check is that your process (oceanS) is not constantly switching cores on your machine. I don't know a good way to do this with oceanS but I would suggest compiling with Open MPI then run with one processor and enable the --bind-to-core (or equivalent memory affinity option for pre version 1.4 implementations of Open MPI):

Dear Robertson,

Your bet is absolutely right. Actually, that was one of my guesses early in the game. I noticed that the process jumped from core to core during execution, so I forced the process to stay on one specific core. SGI has a tool called dplace to specify the core that one wants to use. However, to my surprise, the execution time did not change. It is still too slow.

robertson wrote:
There is also the difference between AMD and Intel here. I can't imagine the performance gap is that wide, but in your particular case I believe your laptop will always be faster than the SGI for serial problems and MPI problems of 4 processes or less. Not only does your laptop have a higher GHz rate, it has a higher per core cache ratio which should mean fewer accesses to memory. Your laptop also doesn't have any memory that isn't local to any given core.

The difference in clock speed is quite small: 2.2 GHz in the Mac, as opposed to 2 GHz in the Opteron.

I have run a little floating-point benchmark program on both of them. It is called 'flops' and basically performs a bunch of floating-point operations. The program provides an estimate of PEAK MFLOPS performance by making maximal use of register variables with minimal interaction with main memory. The execution loops are all small, so they will fit in any cache.

This is the result for the Mac:

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module     Error        RunTime      MFLOPS
                        (usec)
  1      4.0146e-13     0.0070     1990.9336
  2     -1.4166e-13     0.0058     1208.9005
  3      4.7184e-14     0.0046     3701.9945
  4     -1.2557e-13     0.0042     3564.6420
  5     -1.3800e-13     0.0204     1419.3780
  6      3.2380e-13     0.0081     3568.9503
  7     -8.4583e-11     0.0210      570.3810
  8      3.4867e-13     0.0207     1450.5600

Iterations = 512000000
NullTime (usec) = 0.0000
MFLOPS(1) = 1550.2006
MFLOPS(2) = 1182.4800
MFLOPS(3) = 1695.5185
MFLOPS(4) = 2419.7351

This is the result for the Opteron:

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module     Error        RunTime      MFLOPS
                        (usec)
  1      4.0146e-13     0.0088     1599.5694
  2     -1.4166e-13     0.0060     1170.2486
  3      4.7184e-14     0.0062     2723.9412
  4     -1.2557e-13     0.0076     1976.4463
  5     -1.3800e-13     0.0133     2175.2217
  6      3.2380e-13     0.0154     1879.5498
  7     -8.4583e-11     0.0183      654.0273
  8      3.4867e-13     0.0123     2441.6659

Iterations = 512000000
NullTime (usec) = 0.0000
MFLOPS(1) = 1438.4855
MFLOPS(2) = 1218.5802
MFLOPS(3) = 1780.9560
MFLOPS(4) = 2190.3297

Regardless of the meaning of those floating-point operations, the performance of the two processors is not that different, as expected. Overall, the Intel processor seems to perform better, but not by much: nothing close to the 3x difference in execution time when running ROMS. As I said before, the upwelling case took 2.5 min on the Mac and over 8 min on the Opteron.
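The comparison can be made explicit by taking the ratio of the four summary MFLOPS figures and setting it against the ratio of the ROMS totals. This is just arithmetic on the numbers reported above; the variable names are chosen for this sketch.

```python
# Summary MFLOPS(1)..MFLOPS(4) copied from the two flops runs above.
mac_mflops = [1550.2006, 1182.4800, 1695.5185, 2419.7351]
opteron_mflops = [1438.4855, 1218.5802, 1780.9560, 2190.3297]

# ROMS total elapsed CPU time in seconds from the two upwelling profiles.
mac_roms_seconds = 130.859
opteron_roms_seconds = 460.741

# A ratio above 1 means the Mac is faster on that measure.
flops_ratios = [m / o for m, o in zip(mac_mflops, opteron_mflops)]
mean_flops_ratio = sum(flops_ratios) / len(flops_ratios)
roms_ratio = opteron_roms_seconds / mac_roms_seconds

for index, ratio in enumerate(flops_ratios, start=1):
    print(f"MFLOPS({index}) Mac/Opteron: {ratio:.3f}")
print(f"mean flops ratio:       {mean_flops_ratio:.3f}")
print(f"ROMS Opteron/Mac time:  {roms_ratio:.3f}")
```

The flops ratios all sit within about 10% of 1, with the Opteron ahead on two of the four, whereas the ROMS run differs by a factor of about 3.5.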

So, there must be something peculiar to the ROMS code that slows it down when running on the SGI architecture.

I am still investigating it.

Best.
Antonio

hugobastos · Thu Apr 11, 2013 2:58 pm

Hello Antonio,

Are you sure that you used the same compiler (including version) and the same flags for both runs (see the first lines of your log)? Are you using the Intel compilers? If so, maybe you should check the compiler flags, or disable optimizations and check the performance again. The Intel compiler has some automatic detection of architecture/CPU that could raise the performance on your Mac (i7) and not be enabled on the SGI workstation (Opteron).

Here is a test that may help you:

Set the compiler to gfortran, change the flag "-O2" to "-O0", and run both tests on the machines. Look at the timings (I hope that they are close, slightly faster on the i7). After that, you can start to increase the optimization level to track down the magic option that raises your performance on the i7, and try to find an equivalent for the Opteron.
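A small harness makes this kind of flag-by-flag comparison less error-prone. The sketch below times a command several times and keeps the fastest run; the commands in builds are placeholders (they just start the Python interpreter) and would be replaced by whatever runs the executable built at each optimization level. The function and dictionary names are made up for this example.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run cmd `repeats` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# Placeholder commands: replace each with the run of the build made at that level.
builds = {
    "-O0": [sys.executable, "-c", "pass"],
    "-O2": [sys.executable, "-c", "pass"],
}

timings = {flag: time_command(cmd) for flag, cmd in builds.items()}
for flag, seconds in timings.items():
    print(f"{flag}: {seconds:.3f} s")
```

Running the same script on both machines gives one timing per flag per machine, which makes it easier to see at which optimization level the two start to diverge.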

If you are using the Intel compilers, things are a little different, since they add some magic flags to the default values and you need to force the compiler not to optimize (it's better to look in the manual for your version, since I don't remember Intel having specific mtune/march values for Opterons).

On the other hand, if you are using gcc and the timing differences are still large after removing the optimizations, check whether the march/mtune flags for your SGI workstation are correct (I think it's march=bdver for Opterons, or something like that) or whether the gcc versions differ too much between the machines.

As always, take care with the optimization levels/flags you use, because in parallel some flag combinations can lead to roundoff differences between runs. That's why ROMS uses "fp-model precise" in the default makefile (and the flag order is important too!).

antoniofetter · Fri Apr 12, 2013 12:02 am

Hi Hugo,

I am using gfortran on both machines, with the same flags.

Thanks,
Antonio
