Benchmarking Epyc, Ryzen, and Xeon: Tyranny of Memory

jpringle · Tue Sep 12, 2017 6:14 pm

Dear all --

I hope some of you find this useful. I have been researching the purchase of a new modeling server, and the results should interest anyone trying to optimize performance on their own machines or thinking of purchasing new ones. Several new architectures have properties worth considering when optimizing the performance of ROMS. To understand some of the conclusions and results I present below, it is useful to read Sasha's post: viewtopic.php?f=17&t=2001&p=7771#p7714 . Not all of these results are new, but it is worth pointing out that they are still valid on new architectures.

Summary of results:

On modern architectures like AMD's Epyc and Intel's Scalable Xeon with non-uniform memory access (NUMA), MPI performs much better than openMP when running with many threads or processes.
ROMS scales well with increasing number of MPI processes (i.e. NtileI*NtileJ) until the number of processes exceeds two to three times the number of memory channels -- two to three processes per memory channel is about as good as it gets.
The new AMD architectures are very competitive and at the high end give the best performance among the systems I tested.

All results are from the benchmark test case with the largest domain (ocean_benchmark3.in). The tiling (NtileI and NtileJ) was chosen to be optimal for each architecture and number of processes/threads, and thus varied between runs (for openMP, the number of tiles is independent of the number of threads; for MPI, the number of tiles must match the number of processes). I used gfortran version 6.3 with openMPI and openMP on Ubuntu 17.04, with the default flags in Linux-gfortran.mk. The number of timesteps was chosen so that each run lasted at least 200 seconds or took at least 20 timesteps. All results are presented as (number of model time steps)/(total time of run).
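As a rough illustration of the setup above, here is a minimal Python sketch that lists the candidate (NtileI, NtileJ) pairs for a given MPI process count and computes the timesteps-per-second metric. The function names and the sample numbers (16 processes, 100 steps in 250 seconds) are made up for illustration; which tiling is actually optimal still has to be found by running the model.

```python
def tilings(n_processes):
    """Return every (NtileI, NtileJ) pair whose product equals n_processes.

    For MPI runs the number of tiles must match the number of processes,
    so these are the only tilings worth trying for that process count.
    """
    pairs = []
    for ntile_i in range(1, n_processes + 1):
        if n_processes % ntile_i == 0:
            pairs.append((ntile_i, n_processes // ntile_i))
    return pairs


def steps_per_second(n_timesteps, wall_seconds):
    """Metric used in this post: model time steps / total run time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return n_timesteps / wall_seconds


if __name__ == "__main__":
    # Illustrative values only, not measurements from the post.
    for pair in tilings(16):
        print(pair)
    print(steps_per_second(100, 250.0))
```

Each pair would then be tried in turn as NtileI and NtileJ in the input file, keeping the one with the highest timesteps/second.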

Four systems were tested. The first is a Ryzen 1800x that I own; the memory is clocked at 3200MHz, the CPU at 3.6GHz, and there are 8 cores and two memory channels. Silicon Mechanics allowed me to benchmark the other three. The first of these was a dual CPU Epyc 7601 with 2600MHz memory, the CPU at 2.2GHz, and 64 cores and 16 memory channels. The second was a single CPU Epyc 7601 with 2600MHz memory, the CPU at 2.2GHz, and 32 cores and 8 memory channels. The third was a dual Xeon Gold 6138 with 2600MHz memory, the CPU at 2GHz, 40 cores and 12 memory channels. In all systems, each memory channel is populated by a single stick of RAM.

The Epyc and Xeon systems have non-uniform memory access (NUMA); this means that each core can access only some of the memory at the highest speed, and access to the rest of the memory carries a significant bandwidth and latency penalty. When ROMS is run in parallel using openMP, it seems to reside in a single region of memory. So when run with many threads on many cores, many of the cores have to access the ROMS arrays at locations where the memory bandwidth from that core is slow. However, when ROMS is parallelized with MPI, the code is broken up into one process per thread (and also one process per tile). Linux will automatically move each process's data to the memory that is most rapidly accessed by that process (google numastat to see how to monitor this). Because of this, memory access will be faster for codes run with MPI when running on a large number of cores. This is clearly seen in the graph below, which shows how quickly the model runs as a function of the number of processes or threads used for both MPI (solid lines) and openMP (dashed lines) for the Epyc and Xeon systems. This hypothesis was confirmed by using numastat to monitor memory latency while running the model. The circles indicate when the number of threads or processes equals the number of memory channels for the system.

In the following figure, the speed of the model runs in time-steps/second is shown as a function of thread or process count for each of the systems, each configured in the optimal manner w.r.t. openMP/MPI and tiling. At low thread counts, the high-clock speed Ryzen wins. But as the number of threads increases, systems with more cores run faster, as we would expect. However, note the circles on the plots -- these mark where the number of MPI processes or openMP threads matches the number of memory channels. It is clear from this figure that the increase in performance plateaus once the number of threads/processes reaches 2 to 3 times the number of memory channels.
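The plateau described above can be turned into a rough rule of thumb. The sketch below is only an illustration: the function name is hypothetical, and capping the range at the physical core count is my assumption (one process per core), not something measured here. Channel and core counts are the ones listed for the tested systems.

```python
def process_range(memory_channels, cores):
    """Suggest a (low, high) range of MPI process counts.

    Rule of thumb from the benchmarks: performance plateaus at roughly
    2 to 3 processes per memory channel. Capping at the core count is
    an added assumption (one process per physical core).
    """
    low = min(2 * memory_channels, cores)
    high = min(3 * memory_channels, cores)
    return low, high


# (memory channels, cores) for the systems described in this post.
SYSTEMS = {
    "Ryzen 1800x": (2, 8),
    "Single Epyc 7601": (8, 32),
    "Dual Epyc 7601": (16, 64),
    "Dual Xeon Gold 6138": (12, 40),
}

if __name__ == "__main__":
    for name, (channels, cores) in SYSTEMS.items():
        print(name, process_range(channels, cores))
```

This only narrows where to look; the best count within the range still depends on the tiling and has to be benchmarked.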

To hammer this point home, below is a figure showing the maximum model speed as a function of number of memory channels. Squares are with MPI, stars with openMP. When running with the more efficient MPI on the Epyc and Xeon systems, it can be seen that the maximum performance scales roughly linearly with number of memory channels. The Xeon is a bit of an outlier in the last two plots -- it is unclear to me if this is because the ratio of (CPU clock speed) to (memory speed) is lower, or because its memory system has lower latency. Nonetheless, the 2 CPU Epyc system ends up outperforming it, consistent with its increased number of memory channels.

Finally, a few thoughts on performance/price of these systems. I hesitate to put this in here. If you are reading this much after the Fall of 2017, realize the world has likely moved on... the principles will likely remain valid, but the details will have shifted. Also, PLEASE NOTE WELL: this discussion is about the performance/price of using these machines to run ROMS. For doing, for instance, data analysis with Python and Matlab, one should likely pay more attention to the performance of a single thread (e.g. Intel i7s, Ryzens, Threadrippers, etc). These have much higher clock speeds, and so their per-core performance can be twice that of the big server chips. Also, note that I am GUESSING at the performance of an AMD Threadripper chip based on the number of memory channels it has. I have not used one. My estimate is conservative, given that a Threadripper has the high clock speed and memory speed of the Ryzen chips.

I configured a basic system for each of these, with just the CPU(s) and a 1TB SSD boot disk (and a cheap graphics card if needed). I assume I need no more than 3 cores per memory channel. All memory channels are filled with 8GB each, at the fastest supported speed. I estimated prices from various vendors on the internet -- this is approximate! Also, I chose the Xeon 6140 since it has slightly fewer cores and a slightly faster per-core clock, and so I am slightly guessing as to its speed...

Ryzen 1800x, 8 cores: $1360 for 2 memory channels, 0.44 timesteps/second in benchmark3
Threadripper 1920x, 12 cores: $2220 for 4 memory channels, 0.7 timesteps/second
Dual Epyc 7351, 2*16 cores: $7800 for 16 memory channels, 2.1 timesteps/second
Dual Xeon Gold 6140, 2*18 cores: $9250 for 12 memory channels, 1.9 timesteps/second
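The post does not say exactly how the 0.7 timesteps/second Threadripper guess was reached. Assuming it starts from the roughly linear scaling with memory channels noted earlier, a strictly linear extrapolation from the measured Ryzen number looks like the sketch below (the function name is hypothetical). It gives a higher figure than 0.7, which is consistent with calling the 0.7 estimate conservative.

```python
def linear_channel_estimate(ref_speed, ref_channels, target_channels):
    """Scale a measured speed linearly with the number of memory channels.

    This is an assumed model for illustration, not a measurement.
    """
    if ref_channels <= 0:
        raise ValueError("ref_channels must be positive")
    return ref_speed * target_channels / ref_channels


if __name__ == "__main__":
    # Ryzen 1800x: 0.44 timesteps/second on 2 channels (from the list above).
    # Threadripper 1920x has 4 channels.
    estimate = linear_channel_estimate(0.44, 2, 4)
    print(f"linear estimate: {estimate:.2f} timesteps/second")
```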

The cost of these scaled by how quickly they run can be calculated as (cost/(timesteps/second)); a lower number is better.

Ryzen 1800x, 8 cores: 3090
Threadripper 1920x, 12 cores: 3170
Dual Epyc 7351, 2*16 cores: 3710
Dual Xeon Gold 6140, 2*18 cores: 4868
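For anyone who wants to redo this with current prices, the calculation is simple enough to sketch in a few lines of Python. The prices and speeds are the ones quoted above (the Threadripper and Xeon 6140 speeds are guesses, as noted); the function names are just for illustration. The results agree with the list above to within rounding.

```python
# (name, price in dollars, timesteps/second), as quoted in this post.
SYSTEMS = [
    ("Ryzen 1800x", 1360, 0.44),
    ("Threadripper 1920x", 2220, 0.7),
    ("Dual Epyc 7351", 7800, 2.1),
    ("Dual Xeon Gold 6140", 9250, 1.9),
]


def cost_per_throughput(price_dollars, steps_per_second):
    """Dollars per (timestep/second); lower is better."""
    if steps_per_second <= 0:
        raise ValueError("steps_per_second must be positive")
    return price_dollars / steps_per_second


def ranked(systems):
    """Return (name, cost per throughput) pairs, best value first."""
    scored = [(name, cost_per_throughput(price, speed))
              for name, price, speed in systems]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    for name, score in ranked(SYSTEMS):
        print(f"{name}: {score:.0f}")
```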

This is interesting. As one would expect, the more consumer-oriented and cheaper boxes have a better price/performance ratio than the more specialized server stuff. But the difference is much less than it has been in the past. The AMD solutions seem to be a better deal for the high core count chips than Intel -- but I am not at all confident I understand how Intel segments their chips, and I may not have found the optimal CPU.

Comments/corrections welcome.
Jamie

brianmckenna · Wed Sep 13, 2017 2:39 pm

Jamie,

Really helpful information here, thank you for this!

Brian McKenna

wmartin · Sun Dec 03, 2017 10:09 pm

Jamie
This is great and very reasonable info. Thanks.

I want to build a ROMS system for under or around $2k and was headed toward Threadripper. Using PartsPicker I have been able to put together: Threadripper 1900X, MSI X399 motherboard, Corsair water cooling, 4x4GB DDR4-3200 memory, Samsung 1TB SSD, low end graphics and a case for about $1700.

I was hoping for your, or anybody’s, opinion on this build. In particular, I wonder if the larger cache and extra cores of the 1920X or 1950X would be worthwhile. I’ve tried to find any info on how to estimate ROMS cache use in this environment, but haven’t had much luck. I assume a total of 16GB RAM is ok. Lastly, I will be overclocking like a gamer to get the CPU and memory up to full speed (4GHz, 3200) so I wondered if the ASUS motherboards might be more reliable and capable for this.

Also, does the OS make a difference, along with the free compilers etc. you can get with it? I’m thinking Ubuntu, but really don’t know.

Thanks to anyone for comments.

Wayne

jpringle · Mon Dec 04, 2017 12:44 am

Wayne--

This configuration seems reasonable -- but I have not experimented with the Intel compilers yet. At some point I will have one of my students install the student version and let people know. If you would like to benchmark my Threadripper system, let me know via email.

On the 1920x (12 cores) versus the 1950x: for ROMS I find best performance at 8 threads with openMP, and little or negative improvement beyond that. This suggests that a 1920x should do fine. But with my 1950x, I can run ROMS and have a virtual machine running Windows for PowerPoint and Word, with music playing, and the ROMS speed seems unaffected. I do have 32GB memory (4x8).

Jamie

wmartin · Wed Dec 06, 2017 6:36 pm

Jamie, thank you for your quick reply.

Reading your note was a little confusing. I wondered if you have heard of the TR 1900x with 8 cores for about $500? It has four channels and runs a bit faster than the 1920x (same overclock numbers) but has less cache: L1/L2/L3 = 768KB/4MB/16MB vs 1.125MB/6MB/32MB. I started trying to do some estimates (like Sasha seems to do in his head) but got tangled in so many independent variables (model definition, tiles, compilers, cores, threads, cache, etc.) that I gave up and decided you have to benchmark. But how do you do that when you don't have the hardware? (You did great getting those folks to loan you equipment!) I started thinking about 3rd party benchmarks, like PassMark, who actually have a whole suite of tests focused on different features and applications. If we could identify public benchmark tests which correlate with ROMS performance, then we could check the public sites for all kinds of hardware. I'm going to see what I can do, and I wonder if this sort of analysis has already been done?
Wayne

jpringle · Wed Dec 06, 2017 7:46 pm

Wayne --

A couple of things. To get the model to run on my Threadripper 1950x with only 8 threads, I used the environment variable OMP_NUM_THREADS. I suspect, based on my earlier post, that the speed of ROMS on this kind of system is limited by memory bandwidth (see Sasha's long thread on this that I reference, and my plot showing speed vs. memory channels). If you had to pick a single synthetic benchmark, I would then look at one for memory bandwidth... But nothing substitutes for benchmarking.
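For anyone scripting a thread-count sweep, here is a minimal Python sketch of the thread limit described above. OpenMP runtimes read the upper-case environment variable OMP_NUM_THREADS. The helper names are hypothetical, and the executable path and the assumption that the openMP build reads its input file on standard input are things to check against your own ROMS build.

```python
import os
import subprocess


def openmp_env(n_threads, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(n_threads)
    return env


def run_openmp_model(executable, input_file, n_threads):
    """Run an openMP executable with a fixed thread count.

    Assumes (check for your build) that the executable reads its
    input file on standard input.
    """
    with open(input_file) as stdin:
        return subprocess.run([executable], stdin=stdin,
                              env=openmp_env(n_threads), check=True)


if __name__ == "__main__":
    # Show only the variable that would be set; nothing is launched here.
    print(openmp_env(8)["OMP_NUM_THREADS"])
```

A call such as run_openmp_model("./oceanO", "ocean_benchmark3.in", 8) would then run the benchmark with 8 threads, where ./oceanO stands in for wherever your openMP executable lives.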

On benchmarking computers that you don't own: if the system will cost more than about $8k, call a vendor and ask them. Silicon Mechanics was helpful for me; it helps that I and others in my institute have used them multiple times, and they were curious about this new processor. Be polite, and it is amazing what you can ask for. They don't loan me the equipment -- they let me log in remotely. I know exactly what packages to install, so it only takes me a few hours to benchmark the system.

On cheaper systems, it is trickier. You could ask someone like Puget Systems and see what you find. Or you can ask your local sys-admin types to see if anyone at your institute has one.

I went with a 16 core machine because for other modeling I do (primarily genetics in C and Python) cores matter and memory bandwidth is not important. I have an Epyc system for larger runs. I run the Threadripper system with Ubuntu, and Windows 10 in VirtualBox for running Word and PowerPoint (I am not going to get collaborators to learn LaTeX, and LibreOffice is not worth it...).

Jamie
