Ocean Modeling Discussion

ROMS/TOMS


All times are UTC

Posted: Tue Jul 25, 2017 9:43 pm

Joined: Sun Jul 27, 2003 6:49 pm
Posts: 70
Location: UNH, USA
Dear all --

IF YOU ARE INTERESTED IN THIS POST, PLEASE ALSO SEE: viewtopic.php?f=29&t=4665

I hope some of you find this helpful. I needed a computer for a new graduate student to run models and do data analysis. For $1500 I built a Ryzen 1800X box. Since this is a new and popular CPU, I thought others might be interested in benchmarks. The Ryzen has 8 physical cores and can run up to 16 threads. I compare my results to runs on a dual-CPU Xeon X5650 system with 12 cores, purchased in 2011 for $5400. Some of my understanding of the results is grounded in the discussion in viewtopic.php?f=17&t=2001&p=7771#p7714 ; this is a useful read for anyone trying to make ROMS run faster.

I ran the benchmark test case with the largest domain (ocean_benchmark3.in); all results are the same with the next smaller domain (ocean_benchmark2.in) when the appropriate number of tiles is chosen. For the large domain, I used a tiling of 4x32; this was empirically found to be optimal (with the Xeon system, I used 4x30). I used gfortran version 6.3; I will install ifort when my student arrives and can get a student license, and I will update this post then. The model was parallelized with OpenMP, and I used the default flags in Linux-gfortran.mk; compiling with -march=znver1 made no difference.

On the Ryzen system, the memory speed was set to either the default of 2400 MHz or 3200 MHz, the maximum supported speed for the memory I purchased.

I attach two plots; the first shows the time to compute one grid point for one timestep for various numbers of threads. To calculate the time to run one timestep, multiply by the grid size of 2048*256. This figure illustrates
  • The new CPU is about twice as fast as the old one; time marches on. Despite the comments below, the Ryzen box is always faster than the (old!) dual-Xeon setup.
  • ROMS on Ryzen in this application does not show much performance increase beyond 4 threads; there is some marginal increase in performance up to 8 threads.
  • Hyperthreading (any threads beyond 8) hurts ROMS performance on Ryzen. This is not true for compiling or for some of my biology Python codes, but it is certainly true for ROMS. The Intel chip does gain (some) from virtual threads.
  • Overclocking the Ryzen CPU made nearly no difference on ROMS run speeds -- suggesting memory is the bottleneck. I do not show the CPU overclocking results, since they would be visually indistinguishable from the other results.
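
The conversion from the plotted quantity (time per grid point per timestep) to time per timestep can be sketched as below. This is only an illustration: the function name is made up here, and the example input is a hypothetical placeholder, not a value read from the plots.

```python
# Grid size of the large benchmark domain, as stated in the post.
GRID_POINTS = 2048 * 256


def time_per_timestep(seconds_per_point: float) -> float:
    """Wall-clock seconds for one timestep over the whole grid."""
    return seconds_per_point * GRID_POINTS


# HYPOTHETICAL value for illustration only; not taken from the plots.
example_seconds_per_point = 1.0e-6
print(f"{time_per_timestep(example_seconds_per_point):.3f} s per timestep")
```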

The second plot, titled "scaling with threads," shows the change in computation speed with increasing number of threads, scaled by the speed for one thread. In an (unrealistic) perfect world, the increase would be linear with thread number.
  • ROMS on Ryzen shows somewhat better scaling with faster memory speed. Faster memory on a 1 thread job increases performance by 7%, on an 8 thread job by 17%.
  • ROMS on the dual-Xeon shows better scaling with increased numbers of threads. Given the dependence of scaling on memory speed in Ryzen shown above, I strongly suspect that the Xeon scales better because it has three memory channels, while Ryzen has only two.
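
For anyone wanting to reproduce the "scaling with threads" numbers from their own timings, here is a minimal sketch. The function names and the example timings are hypothetical and are not taken from the plots; only the 7% and 17% figures come from the text above. It also assumes "performance" means speed (1/time).

```python
def speedup(times):
    """Speedup relative to the 1-thread run: S(n) = t(1) / t(n)."""
    t1 = times[1]
    return {n: t1 / t for n, t in sorted(times.items())}


def efficiency(times):
    """Parallel efficiency S(n) / n; 1.0 is the ideal linear case."""
    return {n: s / n for n, s in speedup(times).items()}


def scaling_gain(perf_gain_1_thread, perf_gain_n_threads):
    """Factor by which the n-thread speedup improves when faster memory
    raises 1-thread speed by one factor and n-thread speed by another."""
    return perf_gain_n_threads / perf_gain_1_thread


# HYPOTHETICAL timings (seconds per timestep) keyed by thread count.
example_times = {1: 8.0, 2: 4.0, 4: 2.5, 8: 2.0}
print(speedup(example_times))
print(efficiency(example_times))

# Figures quoted above: +7% at 1 thread, +17% at 8 threads.
print(f"{scaling_gain(1.07, 1.17):.3f}")
```

With the quoted figures, the faster memory improves the 8-thread scaling by a factor of roughly 1.09.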

These issues of scaling will be important when thinking about buying larger systems from AMD. They are coming out with systems with more cores and silly names: Threadripper, with up to 16 cores and 4 channels of memory, and Epyc, with up to 32 cores, a slower clock, and 8 memory channels per CPU (and up to two CPUs). AMD is pushing the very large core counts... but these results suggest that the additional performance per core (for ROMS, in this configuration!) diminishes rapidly at roughly 2 to 3 cores per memory channel.

I have money for a $7k system that I need to spend; does anybody have a new Intel system to compare to? :-)

I also welcome comments (Sacha?) about what I am being stupid about...

Jamie

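[Attached plot 1: time to compute one grid point for one timestep vs. number of threads]
[Attached plot 2: "scaling with threads", computation speed relative to one thread vs. number of threads]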


Last edited by jpringle on Thu Sep 21, 2017 2:01 pm, edited 1 time in total.

Posted: Wed Jul 26, 2017 5:22 am

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 171
Location: UCLA, USA
Jamie,

This is very useful, and actually encouraging. The first time I heard about
AMD Ryzen was from my son, who is something of an enthusiast, but I have no
first-hand experience.

Over a long period of time Intel and AMD have had very different design philosophies.
Let's just assume that there is a silicon die of 12x14 mm, its size limited by the
amount of heat it generates (it cannot be made larger), and that for a given technology
(14 nm this time) one can fit only so many transistors (or gates, nowadays). What would
be the best use of them?

Intel: fewer cores, but of a more advanced design (e.g. two load-stores per clock cycle),
and larger caches to keep them busy (in fact over half of the die area of Intel CPUs
is cache).

AMD: just put in more cores of a less sophisticated design, at the expense of having
smaller caches, and actually non-shared caches (in principle these caches
may hold duplicated data).

Memory controllers (and bandwidth) seem to be comparable between Intel and AMD,
or are they? The current Intel counterpart, the i7-6800k, is quad-channel DDR4, but
what about the Ryzen 1800x? Is it dual- or quad-channel?

For comparison: I looked for a "best-bang-for-the-buck" hardware configuration, a machine
which can host a RAID array, http://people.atmos.ucla.edu/alex/complete_linux_raid.htm ,
and settled on something along the lines of an i7-6800k CPU on an ASUS X99-AII
motherboard. It is the obvious Intel counterpart of a Ryzen 1800x based machine, as
they have comparable costs.

Scaling with the number of threads: more cores and/or more powerful cores make it
harder to achieve good scaling because of the memory bandwidth limitation.

Then, for the first time in the history of Intel/AMD, starting with quad-channel DDR4,
CPUs are designed in such a way that a single core alone cannot saturate the memory
bandwidth. Accordingly, you observe a healthy near-perfect speedup from 1 to 2 threads,
but not beyond: two cores can saturate it, and modern CPUs tend to have more and more
cores, so this does not offer any relief, and cache misses are as punishing as they
were before, or even more so.

As a result, I think that the optimum tiling should be finer for AMD CPUs than for Intel,
so that just using the same 4x32 (the empirical optimum from the dual Xeon experience)
should be revised toward more tiles.

For a 2048*256 grid I would try something like 16x32 and see how it goes.
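
To see what the two tilings mean in terms of tile size, here is a rough sketch. It assumes the tiling is written as NtileI x NtileJ with the first number dividing the 2048-point direction (an assumption, not stated in the thread), and it counts interior points only, ignoring halo regions. The function name is hypothetical.

```python
def tile_shape(n_i, n_j, ntile_i, ntile_j):
    """Largest tile size (points in I, points in J) for a given tiling.

    Uses ceiling division so uneven splits report the biggest tile.
    """
    return (-(-n_i // ntile_i), -(-n_j // ntile_j))


# Grid size from the thread; tilings are the two discussed above.
for ntile_i, ntile_j in [(4, 32), (16, 32)]:
    pts_i, pts_j = tile_shape(2048, 256, ntile_i, ntile_j)
    print(f"{ntile_i}x{ntile_j}: {ntile_i * ntile_j} tiles of "
          f"{pts_i} x {pts_j} = {pts_i * pts_j} points each")
```

Under these assumptions, 16x32 gives four times as many tiles as 4x32, each with a quarter of the points, so each tile has a smaller working set to fit in cache.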


Posted: Wed Jul 26, 2017 2:37 pm

Joined: Sun Jul 27, 2003 6:49 pm
Posts: 70
Location: UNH, USA
Sacha and All--

A couple of things. First, the AMD Ryzen has dual memory channels, so it has less memory bandwidth than the X99 Intel boards. Compared to older AMD processors, the L1 and L2 caches are private to each core, but the L3 cache is shared among 4 cores. But see below for more on this, and why I am interested in looking at AMD.

I checked; the optimal tiling for ocean_benchmark3.in is 4x32 on my Ryzen system.

Sacha -- could you give me numbers for your i7-6800k machine with optimal tiling for benchmark3 and 1 through 6 threads? I would prefer compiling with gfortran, but ifort numbers would be fine -- I am most interested in scaling. I run for 20 timesteps for this benchmark (yes, I know...). Also, for this configuration, if we had infinite memory bandwidth, do you have a sense of when Amdahl's law would start to limit scaling?
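
For reference, the Amdahl's law limit asked about above can be sketched as follows. The parallel fraction used here is a purely hypothetical placeholder; nobody in this thread has measured it for ROMS.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Amdahl's law: S(n) = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the thread count goes to infinity."""
    return 1.0 / (1.0 - parallel_fraction)


# HYPOTHETICAL parallel fraction, for illustration only.
p = 0.95
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} threads: ideal speedup {amdahl_speedup(p, n):.2f}")
print(f"limit for p = {p}: {amdahl_limit(p):.1f}")
```

Comparing measured speedups against this curve would show whether the serial fraction or memory bandwidth is the tighter limit for a given thread count.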

On AMD's future: they will be releasing CPUs with 4 memory channels (Threadripper) and 8 memory channels (Epyc). These chips have lots of cores, but there are cheaper versions with fewer cores and higher clock speeds (and more cache per core, because the amount of L3 cache is fixed per type).

If we assume that ROMS scales best with two cores per memory channel, then the optimal core count for Threadripper is 8 cores/CPU, and for Epyc 16 cores/CPU. A 12-core Threadripper chip is $800 (there is no 8-core Threadripper), and a 16-core Epyc costs $1100 for a dual-socket part and $700 for a single-socket part. These parts are considerably cheaper than the full core count versions, and are probably optimal for ROMS runs. (The price for the 16-core Epyc part is likely less than the price for a similarly cored Threadripper since the clock is slower; this may suggest that Epyc will allow more cores per memory channel, but only benchmarking will tell.)
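
The arithmetic behind those core counts, as a small sketch. The channel counts are the ones quoted in this thread; the rule of two cores per memory channel is the working assumption stated above, not a measured result for Threadripper or Epyc, and the function name is hypothetical.

```python
# Memory channels per CPU, as quoted in this thread.
MEMORY_CHANNELS = {"Ryzen": 2, "Threadripper": 4, "Epyc": 8}


def useful_cores(channels, cores_per_channel=2):
    """Core count beyond which extra cores are assumed to add little."""
    return channels * cores_per_channel


for cpu, channels in MEMORY_CHANNELS.items():
    print(f"{cpu}: {channels} channels -> ~{useful_cores(channels)} cores")
```

For Ryzen this gives 4 cores, which is consistent with the benchmark showing little gain beyond 4 threads.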

It is worth noting that other kinds of codes are less memory sensitive -- my population genetics codes scale nicely out to 12 threads on Ryzen.

In the next few weeks I will see if any vendors will let me benchmark on an Epyc system (Threadripper is not yet out); I shall let y'all know what I find.

Jamie

