Ocean Modeling Discussion

ROMS/TOMS

PostPosted: Tue Sep 12, 2017 6:14 pm 

Joined: Sun Jul 27, 2003 6:49 pm
Posts: 70
Location: UNH, USA
Dear all --

I hope some of you find this useful. I have been researching the purchase of a new modeling server, and the results should be interesting to those trying to optimize performance on their own machines, or who are thinking of purchasing new ones. There are a number of new architectures with interesting properties to consider when trying to optimize the performance of ROMS. To understand some of the conclusions and results I present below, it is useful to read Sacha's post: viewtopic.php?f=17&t=2001&p=7771#p7714 . Not all of these results are new, but it is worth pointing out that they are still valid on new architectures.

Summary of results:

  • On modern architectures like AMD's Epyc and Intel's Xeon Scalable processors with non-uniform memory access (NUMA), MPI performs much better than openMP when running with many threads or processes.
  • ROMS scales well with an increasing number of MPI processes (i.e. NtileI*NtileJ) until the number of processes exceeds two to three times the number of memory channels -- two to three processes per memory channel is about as good as it gets.
  • The new AMD architectures are very competitive and at the high end give the best performance among the systems I tested.

All results are from the benchmark test case with the largest domain (ocean_benchmark3.in). The tiling (NtileI and NtileJ) was chosen to be optimal for each architecture and number of processes/threads, and thus varied between runs (for openMP, the number of tiles is independent of the number of threads; for MPI, the number of tiles must match the number of processes). I used gfortran version 6.3 with OpenMPI and openMP on Ubuntu 17.04, with the default flags in Linux-gfortran.mk. The number of timesteps was chosen so that each run lasted at least 200 seconds or took at least 20 timesteps. All results are presented as (number of model time steps)/(total time of run).
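As a minimal sketch of the timing procedure (this is a reconstruction, not my actual script; it assumes the ROMS MPI executable is named oceanM and that you have already set NTIMES and the tiling in ocean_benchmark3.in):

# Minimal sketch of timing one benchmark run: launch ROMS under MPI and divide
# the number of model time steps by the wall time. "oceanM" and the step count
# are assumptions -- adjust to your build and input file.
import subprocess
import time

def benchmark_speed(nprocs, ntimes, infile="ocean_benchmark3.in"):
    """Run ROMS with nprocs MPI processes; return speed in timesteps/second."""
    start = time.time()
    subprocess.run(["mpirun", "-np", str(nprocs), "./oceanM", infile], check=True)
    elapsed = time.time() - start
    return ntimes / elapsed

if __name__ == "__main__":
    # e.g. 32 MPI processes, 20 time steps (NtileI*NtileJ must equal 32)
    print("%.3f timesteps/second" % benchmark_speed(32, 20))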

Four systems were tested. One is a Ryzen 1800x that I own: memory clocked at 3200 MHz, CPU at 3.6 GHz, 8 cores, and two memory channels. Silicon Mechanics allowed me to benchmark the other three. The first was a dual-CPU Epyc 7601 with 2600 MHz memory, the CPUs at 2.2 GHz, 64 cores, and 16 memory channels. The second was a single-CPU Epyc 7601 with 2600 MHz memory, the CPU at 2.2 GHz, 32 cores, and 8 memory channels. The third was a dual Xeon Gold 6138 with 2600 MHz memory, the CPUs at 2 GHz, 40 cores, and 12 memory channels. In all systems, each memory channel was populated with a single stick of RAM.

The Epyc and Xeon systems have non-uniform memory access (NUMA); this means that each core can only access some of the memory at the highest speed, and access to the rest of the memory comes at a significant bandwidth and latency penalty. When ROMS is run in parallel using openMP, it appears to reside in a single region of memory, so when run with many threads on many cores, many of the cores have to access the ROMS arrays at locations where the memory bandwidth from that core is slow. However, when ROMS is parallelized with MPI, the work is broken up into one process per tile, and Linux will automatically migrate each process's memory to the memory that is most rapidly accessed by that process (google numastat to see how to monitor this). Because of this, memory access is faster for MPI runs when using a large number of cores. This is clearly seen in the graph below, which shows how quickly the model runs as a function of the number of processes or threads for both MPI (solid lines) and openMP (dashed lines) on the Epyc and Xeon systems; this hypothesis was confirmed by using numastat to monitor non-local memory accesses while running the model. The circles indicate where the number of threads or processes equals the number of memory channels of the system.
[Figure: model speed vs. number of MPI processes (solid lines) or openMP threads (dashed lines) for the Epyc and Xeon systems; circles mark where the count equals the number of memory channels.]
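If you want to watch this happen on your own machine, below is a minimal sketch of reading the same per-node counters that numastat reports, straight from the standard Linux sysfs files (illustrative only; the numastat command itself gives you the same information):

# Minimal sketch: read the per-node NUMA counters that numastat reports from
# the standard Linux sysfs files. A numa_miss count that grows while ROMS is
# running means threads/processes are touching memory on a remote node.
import glob

def read_numastat():
    """Return {node: {counter: value}} from /sys/devices/system/node/node*/numastat."""
    stats = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node*/numastat")):
        node = path.split("/")[-2]                      # e.g. "node0"
        with open(path) as f:
            stats[node] = {k: int(v) for k, v in (line.split() for line in f)}
    return stats

if __name__ == "__main__":
    for node, c in read_numastat().items():
        print(node, "hit:", c["numa_hit"], "miss:", c["numa_miss"])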

In the following figure, the speed of the model in time-steps/second is shown as a function of thread or process count for each of the systems, each configured optimally with respect to openMP/MPI and tiling. At low thread counts, the high-clock-speed Ryzen wins, but as the number of threads increases, systems with more cores run faster, as we would expect. However, note the circles on the plots -- these mark where the number of MPI processes or openMP threads matches the number of memory channels. It is clear from this figure that the gain in performance with added processes plateaus at two to three threads/processes per memory channel.
[Figure: model speed (time-steps/second) vs. thread/process count for each system in its optimal configuration; circles mark where the count equals the number of memory channels.]
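As a rough illustration of that rule of thumb, here is one way to pick a process count and tiling from the number of memory channels (the near-square factorization is just my own heuristic, not something taken from the figure):

# Rough illustration of the rule of thumb above: target roughly two to three
# MPI processes per memory channel, then pick an NtileI x NtileJ pair whose
# product matches that target. The near-square factorization is an assumption
# for illustration, not a prescription from the benchmarks.
import math

def suggest_tiling(memory_channels, procs_per_channel=2):
    """Return (NtileI, NtileJ) with NtileI*NtileJ = procs_per_channel*memory_channels."""
    nprocs = procs_per_channel * memory_channels
    ntile_i = max(i for i in range(1, int(math.sqrt(nprocs)) + 1) if nprocs % i == 0)
    return ntile_i, nprocs // ntile_i

if __name__ == "__main__":
    # e.g. the dual Epyc 7601 with 16 memory channels -> 32 processes, NtileI=4, NtileJ=8
    print(suggest_tiling(16))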

To hammer this point home, below is a figure showing the maximum model speed as a function of the number of memory channels. Squares are with MPI, stars with openMP. When running with the more efficient MPI on the Epyc and Xeon systems, the maximum performance scales roughly linearly with the number of memory channels. The Xeon is a bit of an outlier in the last two plots -- it is unclear to me whether this is because its ratio of CPU clock speed to memory speed is lower, or because its memory system has lower latency. Nonetheless, the two-CPU Epyc system ends up outperforming it, consistent with its larger number of memory channels.
[Figure: maximum model speed vs. number of memory channels; squares are MPI runs, stars are openMP runs.]

Finally, a few thoughts on the performance/price of these systems. I hesitate to include this: if you are reading this much after the fall of 2017, realize that the world has likely moved on -- the principles will likely remain valid, but the details will have shifted. Also, PLEASE NOTE WELL: this discussion is about the performance/price of using these machines to run ROMS. For, say, data analysis with Python and Matlab, one should likely pay more attention to single-thread performance (e.g. Intel i7s, Ryzens, Threadrippers, etc.); these have much higher clock speeds, and thus per-core performance that can be twice that of the big server chips. Also, note that I am GUESSING at the performance of an AMD Threadripper chip based on the number of memory channels it has; I have not used one. My estimate is conservative, given that a Threadripper has the higher clock speed and memory speed of the Ryzen chips.

I configured a basic system for each of these, with just the CPU(s) and a 1 TB SSD boot disk (and a cheap graphics card if needed). I assume I need no more than 3 cores per memory channel, and all memory channels are filled with one 8 GB stick each of the fastest supported RAM. I estimated prices from various vendors on the internet -- these are approximate! Also, I chose the Xeon Gold 6140 since it has slightly fewer cores and a slightly faster per-core clock than the 6138 I benchmarked, so I am guessing somewhat at its speed...

  • Ryzen 1800x, 8 cores: $1360 for 2 memory channels, 0.44 timesteps/second in benchmark3
  • Threadripper 1920x, 12 cores: $2220 for 4 memory channels, 0.7 timesteps/second
  • Dual Epyc 7351, 2x16 cores: $7800 for 16 memory channels, 2.1 timesteps/second
  • Dual Xeon Gold 6140, 2x18 cores: $9250 for 12 memory channels, 1.9 timesteps/second

The cost of each system scaled by how quickly it runs can be calculated as cost/(timesteps/second); a lower number is better. (A short script reproducing these numbers follows the list.)

  • Ryzen 1800x, 8 cores: 3090
  • Threadripper 1920x, 12 cores: 3170
  • Dual Epyc 7351, 2x16 cores: 3710
  • Dual Xeon Gold 6140, 2x18 cores: 4868
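As a quick check, these ratios follow directly from the prices and speeds listed above:

# Reproduce (to rounding) the price/performance figures above from the listed
# prices (USD) and benchmark3 speeds (timesteps/second).
systems = {
    "Ryzen 1800x, 8 cores":            (1360, 0.44),
    "Threadripper 1920x, 12 cores":    (2220, 0.70),
    "Dual Epyc 7351, 2x16 cores":      (7800, 2.10),
    "Dual Xeon Gold 6140, 2x18 cores": (9250, 1.90),
}
for name, (price_usd, steps_per_sec) in systems.items():
    print("%s: %.0f" % (name, price_usd / steps_per_sec))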

This is interesting. As one would expect, the more consumer-oriented and cheaper boxes have a better price/performance ratio than the more specialized server hardware, but the difference is much smaller than it has been in the past. The AMD solutions seem to be a better deal for the high-core-count chips than Intel's -- but I am not at all confident that I understand how Intel segments their chips, and I may not have found the optimal CPU.

Comments/corrections welcome.
Jamie


PostPosted: Wed Sep 13, 2017 2:39 pm 

Joined: Mon Jan 21, 2013 3:58 pm
Posts: 1
Location: RPS ASA
Jamie,

Really helpful information here, thank you for this!

Brian McKenna

