Ocean Modeling Discussion

ROMS/TOMS

[ 49 posts ]
All times are UTC
PostPosted: Wed Sep 15, 2010 1:56 pm 

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
Hi
I have been running ROMS on a pc with an i7 930 CPU (4 core, 2.80GHz, 64 bit, triple channel DDR3 1333MHz RAM). The operating system is the 64-bit version of Ubuntu 10.04 (lucid).

Now I tried to run ROMS on an equivalent pc but with an i7 980x CPU (6 core, 3.33GHz) which is the present flagship from Intel regarding CPU performance (and price!!).

The results so far have been disappointing. On paper the CPU power is increased from 4*2.8=11.2 to 6*3.33=19.98 i.e. almost double. Parallel computations do not scale linearly, but I was hoping/expecting a speedup in the range of 30%, but this was not by any means realized.

Running an identical ROMS run with 4 threads (model domain partitioned into 4 sections) on the 930 PC and the 980X PC, the speedup is merely 3.2% (the clock-speed difference alone is 19%).

Running the same job with 6 threads (model domain partitioned into 6 sections) on the 980X PC gives a speedup of only 11.5% compared with the 4-thread run on the 930 PC.

This tells me that either something strange has gone wrong, or that my ROMS application is more dependent on memory bandwidth (RAM speed, RAM accessed in parallel, speed of motherboard etc.) than brute CPU force (clock speed and number of cores).
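For reference, here is the arithmetic behind these figures (numbers taken directly from this post; the "on paper" speedup is just cores times clock, which ignores memory bandwidth entirely):

```python
# "Paper" CPU power: cores x clock (GHz), as quoted in the post.
i7_930  = 4 * 2.80    # = 11.2
i7_980x = 6 * 3.33    # = 19.98

paper_speedup = i7_980x / i7_930   # ~1.78, i.e. "almost double"
clock_speedup = 3.33 / 2.80        # ~1.19, the 19% clock-speed difference

print(f"paper speedup: {100 * (paper_speedup - 1):.0f}%")   # 78%
print(f"clock speedup: {100 * (clock_speedup - 1):.0f}%")   # 19%
# Observed: 3.2% (4 threads) and 11.5% (6 threads) -- far below even the
# clock-only figure, which is what points at a memory-bandwidth bottleneck.
```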

Now to the QUESTIONS:
- Does this observation (memory bandwidth vs. brute CPU force) agree with your experience with ROMS?
- Do you know any rules of thumb regarding what hardware is best to purchase - that is for the unfortunate of us who do not have cluster kind of money?
- Is it correct in my case that the performance would be much better using DDR3 2000MHz RAM rather than the present DDR3 1333MHz RAM?

p.s. I tried different compiler optimizations which on paper should be faster (take more advantage of the new hardware) compared to the default ROMS-Linux-ifort settings, but the default values turned out to be just as fast . . . for some reason ROMS could not compile with the option -ipo turned on!?


PostPosted: Wed Sep 15, 2010 4:51 pm

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
The person who could write a thesis in response to this is Sasha Shchepetkin, who would explain that the community ROMS does not do things in the best way for speed.

Beyond that, you have to consider things like memory bandwidth to the chip. Now the same memory bus has to serve six cores instead of four, but did the pipe get better? I don't know. Yes, our code is memory bandwidth limited.

We have several Linux clusters here and the per-core speed hasn't changed much in recent years. Instead, the idea is that you add more cores. The community ROMS is disappointing in its speedup beyond some modest number of cores, the number depending on the details of your problem. I believe there's going to have to be some fundamental change in how we do computing if we plan to take advantage of 1000+ cores. Then there's the parallel output and the parallel pre- and post-processing that we aren't doing yet.


PostPosted: Thu Sep 16, 2010 2:24 am

Joined: Tue Jan 08, 2008 3:57 pm
Posts: 41
Location: Universidade Federal do Ceará
Hi Ban;

I'm wondering how many grid points (i x j) you have?
You probably already know this, but have a look at Hernan's comments about partitions and number of cores at https://www.myroms.org/forum/viewtopic.php?t=1979

Your question came at a good time: I'm about to buy a computer with an Intel i7 processor, and I was curious to know how the Fortran compilers (and ROMS) deal with Intel's hyper-threading technology.

Comments from other people would be great.

Cheers;


PostPosted: Thu Sep 16, 2010 11:40 am

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
Hi Kate and ocecept, and thank you for your quick replies.

As you agree that memory bandwidth is an important issue, I’ll make the relatively modest investment and purchase the faster RAM.

The present model domain is not ideal for parallel tiling; it was originally derived for a serial Fortran 77 application, where water depth was the only important factor (the model domain size and rotation give the least possible maximum depth, i.e. as large a time step as possible).


PostPosted: Fri Sep 17, 2010 1:03 am

Joined: Fri Sep 05, 2003 4:49 pm
Posts: 26
Location: COAS/OSU
Ban, -- Thanks for the interesting statistics. Did you run the ROMS test in OpenMP or MPI mode?
-- Alex


PostPosted: Mon Sep 20, 2010 10:44 am

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
Hi Alex

I first ran it using MPI on an old small Linux cluster that the University owns (five Dell 2600 2.6GHz Xeon servers with a total of 10 dual-core CPUs). When the cluster recently stopped working I tried to run the application in OpenMP mode on my Ubuntu i7 930 PC. To my surprise the 4-core PC was actually slightly faster than the cluster.
I have not tried to run the application in MPI mode on the PCs, but I assume it would run slower than in OpenMP mode.
My MPI experience from the cluster was that an application would run faster when adding nodes/CPUs, but activating more threads/cores on an already active node would not speed up the process; at best it would stay the same.


PostPosted: Mon Sep 20, 2010 4:30 pm
Site Admin

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1051
Location: IMCS, Rutgers University
Benchmarking ROMS on a computer is not as simple as you may think. There are several things that you need to consider: application, grid size, I/O, number of threads/nodes, memory, cache size, tile partition, tile balancing, compiler, compiler flags, IEEE standard representation of floating-point operations, math processor, distributed-memory library (MPICH, MPICH2, OpenMPI), OpenMP shared-library version, number and type of jobs running in the computer, intra-processor communication, outside disk communications, time of the day, and so on.

ROMS comes with its own benchmark. It is an idealized Southern Ocean application, activated with the BENCHMARK option. By default there is no I/O. There are three grid sizes: 512x64x30 (benchmark1.in), 1024x128x30 (benchmark2.in), and 2048x256x30 (benchmark3.in). Notice that all the horizontal grid sizes are powers of two, so we can have endless balanced tile partitions.
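Since the horizontal sizes are powers of two, every factorization of the tile count gives an even partition. A few lines of Python (illustrative only, not part of ROMS) list the balanced NtileI x NtileJ choices for the benchmark1 grid:

```python
# List tile partitions (NtileI x NtileJ) that divide the benchmark1
# grid (512 x 64) evenly, for a given total tile count.
def balanced_partitions(lm, mm, ntiles):
    return [(i, ntiles // i) for i in range(1, ntiles + 1)
            if ntiles % i == 0 and lm % i == 0 and mm % (ntiles // i) == 0]

print(balanced_partitions(512, 64, 8))
# every factorization of 8 works: [(1, 8), (2, 4), (4, 2), (8, 1)]
```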

You need to carry out an ensemble of benchmarks for the results to be statistically meaningful. The runs need to be done at different times of the day. We always need to avoid I/O to get realistic timings. Turn on I/O in the BENCHMARK application, for instance, and you will see what I am talking about.

My experience with shared memory and distributed memory is that for small grids the timings are about the same, and I haven't observed any statistical trends. You will sometimes be surprised that the distributed-memory run is actually faster. As the grid gets larger and no longer fits in cache, the distributed-memory configuration is actually faster. This is because of the page faulting on shared-memory state global arrays once the cache size is exceeded. In distributed memory, the state arrays are not global but only the size of the partition plus ghost points. An application has an optimal tile partition for a particular grid size; once this is reached, the efficiency deteriorates due to excessive MPI communications. I have made this point countless times in this forum.
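A back-of-the-envelope sketch of why cache matters here (the partition and array sizes below are illustrative assumptions, not actual ROMS internals): a whole-domain 3-D state array for benchmark2 dwarfs any cache, while one tile's share can fit.

```python
# One double-precision 3-D state array on the benchmark2 grid,
# versus the share held by one tile of an 8 x 8 partition.
lm, mm, n = 1024, 128, 30       # benchmark2 horizontal grid, 30 levels
bytes_per_real = 8              # double precision

global_array = lm * mm * n * bytes_per_real
tile_array   = (lm // 8) * (mm // 8) * n * bytes_per_real

print(f"global array: {global_array / 2**20:.0f} MiB")  # 30 MiB -- way beyond cache
print(f"one tile:     {tile_array / 2**10:.0f} KiB")    # 480 KiB -- cache-sized
```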


PostPosted: Mon Sep 20, 2010 4:47 pm

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
I would argue that you need to check the timings on your full realistic application, with I/O. We had one system that looked really pretty good with the BENCHMARK case. Then it was an absolute dog with my realistic setup. Doing the profiling showed that it was all in the I/O, where the base ROMS code had vectorized, but the netcdf library had not (notably the conversion to single precision for smaller output).

Turning on an ecosystem model with its dozen+ tracers will change things too. Suddenly the chunk of code taking the most time is the rotated mixing tensor for tracers as opposed to the 2-D timestepping.


PostPosted: Mon Sep 20, 2010 4:58 pm
Site Admin

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1051
Location: IMCS, Rutgers University
True, but the I/O is not benchmarking the computer CPU; it is benchmarking the connectivity to the disk where the files are read or written. This is the problem with parallel I/O. I agree that the NetCDF and HDF5 libraries are very inefficient and there is a lot of room for improvement. When doing I/O, it also depends on the frequency of the I/O. This is the killer.

Like I said, benchmarking is not trivial :!:


PostPosted: Sat Sep 25, 2010 12:05 pm

Joined: Tue Aug 21, 2007 5:44 pm
Posts: 158
Location: Jeju National University
Hi Mr. Ban,

Did you fix the problem?

I faced a similar problem to the one you mention above.

Recently I tried the updated ROMS svn 514, and it shows better performance than the previous ROMS version.

Dr. Arango fixed some bugs related to parallelism, and I believe those fixes brought the better result.

Would you try that one?

-JH


PostPosted: Sun Sep 26, 2010 3:01 am

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
I just saw this thread of conversation and, to my amusement, found what I would characterize as pristine naivety: have we not been through this before?

MPI vs. OpenMP... the influence of I/O, and... the usual award-winning phrase that "benchmarking ROMS is not trivial..."

Contrary to popular belief, it is trivial: Poor Man's computing at work -- there is nothing new about it.

Furthermore, in my experience, the i7-family CPUs outperform all previous generations of CPUs by at least a factor of 2, including Core 2 and 5400-series Xeons (that is, ROMS running 8 threads on a Core i7 920 is faster than 8 threads on a dual quad-core Xeon 5420). And yes, it is 8 threads on a 4-core i7: hyperthreading makes its 4 cores appear to the operating system as 8 CPUs, and unlike in the case of the Pentium 4 (remember that Intel HT-technology gimmick?), this time it actually works(!). I observe a 15...20% gain when going from 4 to 8 threads. It is impressive.

First, let's eliminate irrelevant suspects: time spent in I/O is negligible, since you are running on a single machine.

Obviously OpenMP is faster than MPI in such conditions, but that is kind of trivial.

Now I have a set of questions:

old:

What is your grid size and what is your partition?

and new:

What are the BIOS settings of your machine?

FSB?

Memory speed?

Memory profile?

Memory timings?

Did you set your BIOS to "all default" or did you play with it?


Last edited by shchepet on Sun Jan 30, 2011 5:17 am, edited 1 time in total.

PostPosted: Mon Sep 27, 2010 11:45 am

Joined: Tue Jul 01, 2003 10:31 am
Posts: 10
Location: MIT,EAPS
shchepet wrote:
Furthermore, in my experience, the i7-family CPUs outperform all previous generations of CPUs by at least a factor of 2, including Core 2 and 5400-series Xeons (that is, ROMS running 8 threads on a Core i7 920 is faster than 8 threads on a dual quad-core Xeon 5420).

This is a statement that holds true more or less for Nehalem/Westmere (the Core i7 tick/tock) for other ocean models (e.g. MITgcm) as well, and for many other codes that exercise the memory subsystem heavily enough. Intel had been constrained performance-wise for years by sticking with the Front Side Bus, and with QPI in the i7 it got things right for a change. A superior branch predictor helps as well, but not so much for our type of code.

shchepet wrote:
And yes, it is 8 threads on a 4-core i7: hyperthreading makes its 4 cores appear to the operating system as 8 CPUs, and unlike in the case of the Pentium 4 (remember that Intel HT-technology gimmick?), this time it actually works(!). I observe a 15...20% gain when going from 4 to 8 threads. It is impressive.

Again, simultaneous multithreading (the academic, non-Intel name for hyperthreading) is supposed to help when the execution pipelines have bubbles because of dependencies (either branch or load): if you are waiting on main memory and cannot saturate the main memory bus with one ocean-model thread, it can help you. The old Pentium 4, for example, could use up all of the FSB bandwidth with one thread just doing memory copies (and its implementation of hyperthreading had a few other limitations as well).

shchepet wrote:
First, let's eliminate irrelevant suspects: time spent in I/O is negligible, since you are running on a single machine.

Obviously OpenMP is faster than MPI in such conditions, but that is kind of trivial.

In certain cases OpenMP is going to be slower than MPI just by virtue of having less strictly enforced data ownership (all data is private to its MPI process, while you have the potential for false sharing with OpenMP). So even within the same box, it might be beneficial to do OpenMP within a socket and MPI across sockets -- ROMS does not have such a "hybrid" mode (yet).


PostPosted: Mon Sep 27, 2010 3:04 pm

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
Hi again, and thank you all for the interest and comments

To answer the questions from one end:

It appears that I’m using ROMS svn 511.
- I’ll follow the suggestion by subsonic and update the code.

Grid size: 1398 x 726 x 1
Tiling: 3 x 2

Regarding the BIOS settings, I haven’t done anything at all (both PCs were purchased assembled and tested from the same vendor). I guess everything is default, but I do not know.

Some hardware information regarding both PCs which might be of interest:
- Motherboard: ASUS P6T SE, X58 chipset
- RAM: DDR3 1333MHz, in triple-channel configuration.
I have just received new DDR3 RAM rated at 2000MHz (XMP), which I hope will give some extra speedup.

- I have not yet tested whether MPI for some reason would be faster on this PC.


PostPosted: Mon Sep 27, 2010 4:52 pm

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:
Grid size: 1398 x 726 x 1
Tiling: 3 x 2


This is not the best way to run ROMS.

Try to set the tiling to 8 x 81 (yes, meaning eight by eighty-one),
rerun it, and report your findings back to this forum. We will
continue after that...


Quote:
Motherboard: ASUS P6T SE, X58 chipset
RAM: DDR3 1333MHz , in triple channel configuration.
I have just received new RAM DDR3 which is 2000MHz (XMP)...


This is a good choice of motherboard, although faster memory
would be useful (you obviously realized that already). One
somewhat confusing part about buying memory is that one has to
pay attention to latencies and cooling. As a rule, an increase
in frequency comes with some penalty in CAS latencies, which
partially devalues the gain of the higher frequency. For example,
Corsair makes 1600MHz 8-8-8-24 memory, but going to a higher
speed, say 1800MHz, it ends up set to 9-9-9-whatever.

Also, a higher memory clock speed typically requires higher
voltages (beyond the JEDEC standards), so in any case
you have to open the box, check exactly what kind of memory
you have, go to the manufacturer's web site and get a technical
document about the recommended settings for that particular
memory module. No BIOS will do it for you. Neither will the
vendor/computer manufacturer.


Quote:
Regarding the BIOS settings, I haven’t done anything at all
(both PC's were purchased assembled and tested from the same
vendor). I guess everything is default, but I do not know.


ASUS boards always have a very rich BIOS. The XMP profile is
NEVER enabled by default. Furthermore, your 1333MHz memory
may be clocked at 1066, because this is the kind of default set
by the Intel specifications for the i7.

Even if your memory is not "XMP certified" you may take advantage
of manual settings for the timings. For example, on Intel DX58SO boards
I ended up not using the XMP profile, but rather set the memory to 1333
(even though it is 1600MHz memory), instead cranking up the base
clock from 133 to 145MHz while keeping the FSB locked to memory.
This overclocks both the FSB (hence the processor) and the memory
(if it were 1333 memory; in fact it ends up actually downclocked,
since it is 1600MHz memory set to run at 1450).

Inspect the memory timings, but do not mess with them the first
time around [your messing with the BIOS is far from over, so you
will revise the memory timings later].

Go to the Boot section and make sure that Quick Boot and Logo are
both set to "Disable". Enable the Summary screen. This way you will
see what the machine is set to.

...also go to the SouthBridge SATA settings and make sure that AHCI
mode is set to "Enabled". This is specifically important to
address Hernan's concern about I/O. ...and please ignore the
warning that you have to have the latest Windows XP (or 7, whatever
it is called nowadays) to enable this feature: Linux is
perfectly capable of taking advantage of it.


PostPosted: Mon Sep 27, 2010 5:24 pm
Site Admin

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1051
Location: IMCS, Rutgers University
Interesting information... 8) Perhaps we need to create a page in :arrow: WikiROMS that contains all this information for future reference, to avoid repeating it again in the future.


PostPosted: Wed Sep 29, 2010 12:50 pm

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
Hi again shchepet,
and thank you for your advice, which so far has given surprisingly large performance gains.

QUESTION: where does the 8x81 tiling come from? Memory configurations?

So far I’d been thinking the tiling should correspond to the number of cores/threads on the PC/cluster, but apparently I was very wrong, at least on the PC part!

To get some perspective: I used the model to calculate the barotropic currents over a 40-day period (tiling: 3x2, number of threads: 6). Until now this has taken on the order of 5-6 days to complete on the 980X PC. As a base reference, these runs take 8:32 (8 min 32 sec) to model one hour.

To compare different setups of the PC and the model, I run the model for only one hour in the following examples.

As the startup effects of very short model runs will bias the run time, the following SETUP 1 reference runs were made:

SETUP 1 default
(ROMS svn 511, tiling: 3x2, number of threads: 6, RAM 1333MHz)
Run 1: duration 9:02
Run 2: duration 8:38
Run 3: duration 8:40
Run 4: duration 8:59
Average duration SETUP1: 8:50

SETUP 2 new svn version
(ROMS svn 514, tiling: 3x2, number of threads: 6, RAM 1333MHz)
Run 1: duration 8:41
Run 2: duration 8:41
Run 3: duration 8:39
Run 4: duration 8:47
Average duration SETUP2: 8:42

SETUP 3 new tiling, 6 threads
(ROMS svn 514, tiling: 8x81, number of threads: 6, RAM 1333MHz)
Run 1: duration 4:26
Run 2: duration 4:24
Run 3: duration 4:22
Run 4: duration 4:29
Average duration SETUP3: 4:25

SETUP 4 new tiling, 12 threads
(ROMS svn 514, tiling: 8x81, number of threads: 12, RAM 1333MHz)
Run 1: duration 4:26
Run 2: duration 4:25
Run 3: duration 4:25
Run 4: duration 4:27
Average duration SETUP4: 4:25

SETUP 5 faster RAM
(ROMS svn 514, tiling: 8x81, number of threads: 6, RAM 2000MHz)
Run 1: duration 4:02
Run 2: duration 4:02
Average duration SETUP5: 4:02

SETUP 6 faster RAM, 12 threads
(ROMS svn 514, tiling: 8x81, number of threads: 12, RAM 2000MHz)
Run 1: duration 3:55
Run 2: duration 3:55
Run 3: duration 3:55
Average duration SETUP6: 3:55
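Converted to seconds, the averages above work out to roughly a 2x gain from the tiling alone and about 2.26x with the faster RAM as well (durations copied from the setups above):

```python
# Speedups relative to SETUP 1, computed from the averaged mm:ss durations.
def secs(mmss):
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

baseline = secs("8:50")   # SETUP 1 average
for name, avg in [("SETUP 2", "8:42"), ("SETUP 3", "4:25"),
                  ("SETUP 5", "4:02"), ("SETUP 6", "3:55")]:
    print(f"{name}: {baseline / secs(avg):.2f}x vs SETUP 1")
```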


. . . and an update . . . with some additional questions . . .
The XMP option is activated in the BIOS, and it tells me that Profile #1 (which I now use) is:
Profile Info: 2000MHz-9-9-10-27-1N-1.65V-1.70V
Profile #2 is:
Profile Info: 1866MHz-9-9-10-27-1N-1.65V-1.60V

Quick Boot and Logo are both set to "Disable". There is no option regarding the Summary screen, but I think it appears nevertheless. The only problem is that it is shown so briefly that I cannot see what is at the beginning of the table.

Question: Is there some way that the summary screen can be accessed afterward as a text file?

Regarding:
Quote:
...also go to SouthBridge SATA settings and make sure that AHCI
mode is set to "Enabled"


I’m unfamiliar with many of these hardware words/terms, and the closest thing I can find in the BIOS to the above is a setting under [MAIN] then [Storage Configuration] then [Configure SATA], where the options are {[IDE],[RAID],[AHCI]}; by default the setting is IDE.

Question: Is this the one option that speeds up I/O?

I just want to be sure, as I did some googling and some think that this option has to be set prior to installing the operating system (OS); otherwise there might be complications with the existing OS after this option is changed!?


PostPosted: Thu Sep 30, 2010 11:13 pm

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
So you recover some of the performance loss, but there is still a way to go.

Quote:
QUESTION: where does the 8x81 tiling come from? Memory configurations?


I made it up just as a first guess. Generally, for these kinds of problems one wants to choose the tile size to reach the best possible compromise among all of the following:
(1) the size must be sufficiently small to fit into the processor cache [typically this means the outermost-level cache: L3 if the processor has it, or L2];
(2) the "perimeter-vs-area" consideration: once tiling is introduced, a bit of redundant computation takes place along the subdivision lines (literally, certain provisional variables [e.g., fluxes, etc.] are computed twice: once when processing a boundary row of a tile, and again on the adjacent boundary row of the neighboring tile). Consequently, if your tiles are too narrow, say only a few points wide, the cost of the extra computing along the sides may not be negligible;
(3) the length of the innermost loop [the i-loop in ROMS] must be large to ensure good pipelined execution on the processor.

Obviously it is hard to satisfy all three at once, but the situation is simplified a bit by realizing that the optimal geometry [size and shape] of the tiles does not actually depend on your problem -- the horizontal grid dimensions -- but is instead a function of the CPU and computer hardware you use. Thus, when going to a larger problem, it makes sense to increase the number of tiles rather than their size.
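Point (2) can be made concrete with a toy halo model: count the extra points that a one-point-wide boundary adds around each tile (a rough illustration of the perimeter-vs-area tradeoff, not ROMS's actual cost model):

```python
# Redundant-work fraction for an (li x mj) tile with a 1-point halo:
# (points including halo) / (interior points) - 1.
def halo_overhead(li, mj, ghost=1):
    return (li + 2 * ghost) * (mj + 2 * ghost) / (li * mj) - 1.0

# Narrow tiles, roughly what 8 x 81 gives on a 1398 x 726 grid (~175 x 9):
print(f"175 x 9 tile: {100 * halo_overhead(175, 9):.0f}% extra points")   # 24%
# Near-square tiles of comparable area:
print(f"40 x 40 tile: {100 * halo_overhead(40, 40):.0f}% extra points")   # 10%
```

The narrow tiles pay noticeably more halo overhead, which is exactly why criterion (2) pushes against very thin tiles and has to be traded off against the cache criterion (1).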

Some of these ideas are expressed in http://marine.rutgers.edu/po/Workshops/Venice2004/Posters/Shchepetkin.pdf ...and do not be put off by its age: sometimes fresh news is just well-forgotten old news. What you observe on your i7 machine is the kind of behavior described on page 9 of that poster.

To get a more precise feeling about the optimal tiling for your machine and problem you have to run and compare more possibilities: not necessarily as many as in that plot, but a dozen would be useful to orient yourself.

Quote:
[Configure SATA] where the options are {[IDE],[RAID],[AHCI]}, by default the setting is IDE.
Question: Is this the one option that speeds up I/O?


Is it the only option? Perhaps not. Advanced Host Controller Interface (AHCI) is Intel's term for a newer standard for the SATA interface, which may or may not be supported by hard drives and operating systems. It basically enables SCSI-like behavior of a SATA disk, such as native command queuing; simply put, it is a revision/expansion of the SATA command set at the hardware level. If you buy a modern disk, it is fully supported, as it is by modern Linux. Old (in this context, 5-year-old) drives do not support it. That is why it is not enabled by default.

As far as I can tell, AHCI vs. non-AHCI makes a difference only under heavy load, and it is barely noticeable in common use, i.e., running a model which does computing 99% of the time.


Now let's focus on other seemingly insignificant details which may affect your timings:

1. What version, branch and release of Linux do you have?

2. Are you running "desktop" or "server" kernel? What version?

uname -r ???

uname --all ???

In my experience one must use a server-type kernel for these kinds of computations. Some Linux distributions, notably Mandriva, maintain two or three branches of their kernels: "kernel-desktop", "kernel-server", and "kernel-laptop" (although it appears that in the most recent release laptop and desktop have more or less merged). The Linux installer automatically decides which one to install depending on what hardware it finds, and the desktop version is installed by default on most motherboards. I always uninstall desktop and install server, no matter what the intended usage of the machine is.

The difference? Different optimization targets: server is optimized for maximum throughput; desktop is more about the latency of responses to user input. Basically, different priority policies in process management. When running multi-threaded jobs, scheduling becomes important. I have observed as much as a 20% difference in performance on an i7 due to the kernel alone.


3. If a threaded job like ROMS is running inside a "konsole" window [konsole is the default shell of KDE, very popular for its versatility] and printing its output to the screen, it actually runs slower than if it is run from an xterm, or with its standard output redirected into a file. It is bizarre; I do not have an explanation for this. In any case, the most reliable timings are obtained when all output is redirected into a file, which should always be done.


4. What version of the Intel compiler do you use?

ifort -V ???

As of today, the current release is 11.1.073. If you are running anything older than that, please update.

5. Compiler FLAGS ??? is it

ifort -fpp2 -openmp -pc80 -axSSE4.2 -auto -stack_temps -O3 -IPF_fma -ip ....

or something else?

In your first post you mentioned that you tried more advanced compiler options but they made no difference relative to the "default". Is that still the case?

Note that this whole process may be iterative: once you fix one thing, you may end up readjusting what you did before. For example, your insensitivity to compiler options may be explained by the code being memory-bandwidth limited, so that compiler optimizations of what goes on inside the processor core do not make much difference (since the core mostly waits for data to be retrieved from main memory). But once that is fixed, the effect of different optimization levels may become noticeable.

6. Did you try to see whether #define/#undef CPP switch ASSUMED_SHAPE in file Include/globaldefs.h makes any difference?

Just edit that file and define or undefine it manually, replace
Code:
#if !((defined G95 && defined I686) || defined UNICOS_SN)
# define ASSUMED_SHAPE
#endif

with
Code:
# define ASSUMED_SHAPE

or
Code:
# undef ASSUMED_SHAPE

My own experience is that, when using the Intel compiler, the code actually runs faster if this switch is undefined. This is somewhat contrary to the recommended setting, but it needs to be checked whether this is still the case.

7. Do you limit or unlimit stacksize?

To check, type limit:
Code:
  woossee:/home/alex 146> limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    8192 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024
memorylocked 64 kbytes
maxproc      30915

to change specific setting, type
Code:
 
  woossee:/home/alex 147> limit stacksize 16M

to verify the effect of your change
Code:
  woossee:/home/alex 148> limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    16384 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024
memorylocked 64 kbytes
maxproc      30915

Some people in the ROMS community advocate unlimiting the stacksize with the command
Code:
 
   woossee:/home/alex 149> unlimit stacksize

This prevents the model from crashing with complaints about segmentation faults/stacksize violations. However, one should be aware that unlimiting the stacksize has a downside: memory allocation by different threads becomes essentially serialized. The dilemma is that if a thread wants to reserve some memory for its private needs, the operating system must ensure that two threads do not attempt to grab the same chunk of memory. This can be done in two ways: (1) at startup time, give each thread its own fixed-size range of addresses in which it is allowed to allocate memory without worrying about what the other threads are doing [you can do whatever you want, but only within your range ==> limited stacksize]; or (2) serialize the allocation process -- one thread at a time is allowed to call malloc, the others wait. This leads to some performance degradation.
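For what it's worth, the same limits can be inspected and adjusted from Python's standard `resource` module, the programmatic equivalent of the csh `limit` commands shown above (note: `setrlimit` may be refused in restricted environments):

```python
import resource

# Current stack-size limits for this process: (soft, hard), in bytes.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

def show(v):
    return "unlimited" if v == resource.RLIM_INFINITY else f"{v // 1024} kbytes"

print("stacksize soft limit:", show(soft))
print("stacksize hard limit:", show(hard))

# Equivalent of `limit stacksize 16M`: move the soft limit toward 16 MiB
# (no privileges needed as long as we stay at or below the hard limit).
target = 16 * 1024 * 1024
if hard == resource.RLIM_INFINITY or target <= hard:
    try:
        resource.setrlimit(resource.RLIMIT_STACK, (target, hard))
    except (ValueError, OSError):
        pass  # some sandboxes disallow changing limits
```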


PostPosted: Mon Nov 08, 2010 7:18 pm

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
Once again thank you for the valuable advice, and I apologize for the long delay.

I have tried some different tiling options, and as I understand the tile partitioning in ROMS, almost anything goes as long as the total number (NtileI * NtileJ) is a multiple of the number of threads (in my case 12).

In the present grid:
Lm (#points in I-direction) = 1398 = 2 * 3 * 233, and
Mm (#points in J-direction) = 726 = 2 * 3 * 11 * 11
As I understand it, NtileI and NtileJ should preferably be built from the prime factors of Lm and Mm, as this generates equally sized tiles. Fortunately for me, the overhead of having unequally sized tiles is not that big.
Based on the testing I have done, the best choice is NtileI=68 and NtileJ=33. This choice is 8 sec (7%) faster than 8x81 for 1000 modeled time steps.
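The factorizations quoted above are easy to double-check with plain Python (nothing ROMS-specific here):

```python
# Trial-division prime factorization, enough for grid-size arithmetic.
def prime_factors(n):
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(prime_factors(1398))   # [2, 3, 233]
print(prime_factors(726))    # [2, 3, 11, 11]

# With NtileI=68, NtileJ=33 on a 1398 x 726 grid: 726/33 = 22 exactly,
# but 68 does not divide 1398, so the I-direction tiles are slightly unequal.
print(1398 % 68, 726 % 33)   # 38 0
```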

Reply questions 1 and 2 (regarding the operating system):
I’m running Ubuntu 10.04 (Lucid Lynx) on both PCs
‘uname -r’ gives:
2.6.32-25-generic
‘uname --all’ gives:
2.6.32-25-generic #44-Ubuntu SMP Fri Sep 17 2010 x86_64 GNU/Linux

So in short, I use the 64-bit desktop version and not a server version, which you clearly recommend. As 64-bit Ubuntu has issues with missing 32-bit libraries in the desktop version, I’m somewhat hesitant to plunge into installing the server version . . . (the most stable versions of Ubuntu are the ones with the largest number of users, and not that many use the server version) . . . but then again, up to 20% gain in performance is a lot. . .

Reply to question 3 (regarding how ROMS is executed):
I use the default bash terminal and all output is redirected into a file (run_log). The usual way is the following:

Code:
OMP_NUM_THREADS=12
export OMP_NUM_THREADS
nohup ./oceanO < ocean_CoarGrid.in > run_log


Reply to question 4 (regarding the ifort version):
‘ifort -V’ gives:
Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1 Build 20100414, Package ID: l_cprof_p_11.1.072. . .
- So I’m apparently using the second newest package.

Reply to question 5 (regarding compiler flags)
The ROMS default gives following flags for my PC:
Code:
-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -xW -free

The only flags that I have tried to change are -ip, -O3 and -xW.
The -ipo option would not compile, so -ip remains.
The less aggressive options -O1 and -O2 did not give better results, and the more aggressive option -fast did not compile. So again -O3 remains.
The updated hardware-specific option -axSSE4.2 gave slightly better performance (1-2%), so this option has now replaced -xW.

The new series of flags is thus:
Code:
  -heap-arrays -fp-model precise -openmp -fpp -ip -O3 -axSSE4.2 -free


Reply to question 6 (regarding ASSUMED_SHAPE):
I made the changes you suggested and recompiled/reran the model. The results were as close to equal as possible, with ‘#undef ASSUMED_SHAPE’ being 1 sec faster than ‘#define ASSUMED_SHAPE’.

Reply to question 7:
On the old cluster I did use the unfortunate unlimited option for the stack size. On the i7 PCs the default settings have been working, so I have not changed anything.
‘ulimit -a’ gives:

Code:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited     


So as long as ROMS runs with the default stack size there is no point in changing it, right? In Ubuntu/bash I can increase the stack size by setting e.g. ‘ulimit -s 16384’.
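For example, the limit can be raised for a single run only by doing it in a subshell, so the login shell keeps its old limit and can be adjusted again for the next run. This is just a sketch using my own file names from above:

```shell
# Current soft stack limit of the login shell (kbytes):
ulimit -s

# Raise the limit inside a child bash only; if the new value exceeds
# the hard limit, bash refuses and keeps the old soft limit.
bash -c 'ulimit -s 16384
         OMP_NUM_THREADS=12
         export OMP_NUM_THREADS
         echo "child stack limit: $(ulimit -s) kbytes"
         # ./oceanO < ocean_CoarGrid.in > run_log
        '

# The parent shell limit is unchanged:
ulimit -s
```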


PostPosted: Mon Nov 15, 2010 2:51 am 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:
So as long as ROMS runs with default stack size there is no point in changing it right?
In Ubuntu/bash I can increase the stack size by setting e.g. ‘ulimit -s 16384’.


This time you actually made me do my homework, because what you are reporting --
essentially insensitivity of your code's performance with respect to most, if not all, of
the settings I asked you to test -- means that something is very wrong, because I knew
that there must be some sensitivity.

And this is what I found; please check the following:

1. Edit file Compilers/Linux-ifort.mk and find a line which looks like
Code:
FFLAGS := -heap-arrays -fp-model precise
and change it to
Code:
FFLAGS := -no-heap-arrays -fp-model precise
i.e., just add "-no" in front of -heap-arrays. Recompile (meaning "make clean" followed
by "make") and rerun your tests using the best settings you know for your problem.
Adjust the stack size limit as we previously discussed, i.e., your code should now be sensitive to it.
Compare with what you had before and report the result back to this board.
Include a -heap-arrays/-no-heap-arrays comparison.



2. It would be useful to deduce some absolute figures from your test results (i.e.,
we may achieve some gains relative to what you started with, but we have no idea
how good or bad it is, or when it is good enough that we should stop).

Therefore, since I cannot reproduce your configuration, but I know that this is
basically a 2D problem, I ask you to run a standard test problem which has been in
ROMS from the very beginning: the SOLITON problem. I modified it to be of
sufficient size, and therefore relevant for Core i7 testing.

Everything is preconfigured. Go to the web directory
http://www.atmos.ucla.edu/~alex/ROMS/BAN/
and get file roms_v3.4_rev514_soliton_patch.tar.
Create a scratch directory and untar it there. There are six files inside plus
a README file which explains what they are:
Code:
-rw-r--r-- 1 alex users  5864 2010-11-14 16:52 README
-rw-r--r-- 1 alex users 19278 2010-11-13 20:07 makefile
-rw-r--r-- 1 alex users  5050 2010-11-14 16:31 Compilers/Linux-ifort.mk
-rw-r--r-- 1 alex users 85755 2010-11-13 22:43 ROMS/External/ocean_soliton.in
-rw-r--r-- 1 alex users 32748 2010-11-14 16:23 ROMS/Include/globaldefs.h
-rw-r--r-- 1 alex users   867 2010-11-14 14:21 ROMS/Include/soliton.h
-rw-r--r-- 1 alex users  4841 2010-11-14 14:54 ROMS/Utility/mp_routines.F

These are the only files I had to touch from the ROMS v3.4 rev. 514 SVN snapshot.
I do not think that the exact revision number matters, since these files were
unchanged for a long time. Just compare these files with the standard ones and
replace/copy-paste as necessary. In principle you can substitute them all in your code:
the changes are minor and all explained in README. All you have to do is edit
Compilers/Linux-ifort.mk and make sure that the include and library directories
for netCDF are properly set for your machine. (This application does not require
netCDF 4. I use -lnetcdf -lhdf5_hl -lhdf5 -lz because my netCDF library was compiled
with HDF5 enabled so I have to, but if you use netCDF 3.6.3 or netCDF 4.xx without
enabling HDF, you do not need -lhdf5_hl -lhdf5 -lz after -lnetcdf.)

For the reference: the size and the duration of the problem are
Code:
          Lm == 768
          Mm == 256
          NtileI == 6
          NtileJ == 32
          NTIMES == 4800
          DT == 0.0255d0

It takes about 43.5 seconds wall-clock time to run this test on a slightly overclocked
Core i7 machine [nominally a 2.66 GHz Core i7 920 CPU overclocked to 2.88 GHz by changing
the base frequency from 133 to 144 MHz on an ASUS P6T Deluxe V2 with DDR3 1600
CAS 8 memory running at 1440 MHz] using all resources available, i.e., 8 threads.



3. It is instructive to compare the above with the performance of a very old code --
ROMS 1.9 -- Hernan's last pre-Fortran 90 code, dating back to 2003. I can configure
it to be mathematically identical to the above, so literally it goes through the same
stages and produces the same result.

Please go to the web directory http://www.atmos.ucla.edu/~alex/ROMS/BAN/
and get the other file, roms_1.9.tar. Make a scratch directory and place
the tar file there. Go there and untar:
Code:
 tar xvf roms_1.9.tar

Edit file Makedefs.IntelEM64T and adjust -I/netcdf/include/directory and
-L/netcdf/library/directory so it can find the netcdf.inc file and the netCDF library on
your machine, then compile it:
Code:
 make mpc
 make

Note, you have to compile "mpc" before you compile the code for the first time (it will
fail to compile otherwise), but you do not have to recompile it again
after "make clean". Also note that this code allows parallel make, so you can replace
the second "make" with
Code:
 make -j 4

and it takes just a few seconds on an i7 machine.


Then run it
Code:
 roms < roms_soliton.in

or
Code:
time roms < roms_soliton.in

and see how it goes. Report it back to this message board.

Note that unlike in modern ROMS 3.4, the grid dimensions Lm,Mm and tiling parameters
NSUB_X,NSUB_E (equivalents of the modern NtileI, NtileJ) are set in file param.h,
so you have to recompile every time you need to change any of them.
The time step size and other runtime parameters are controlled in roms_soliton.in, which
is a fixed-format file, so you have to be very careful not to mess up the alignment or
introduce blank lines.


PostPosted: Thu Dec 30, 2010 9:55 pm 
Offline
User avatar

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
Hi, and once again sorry for the long delay.

Tonight I unfortunately only had time to test the first part of what you suggested. The rest will come later. To begin at the end: I'm most grateful for the help and guidance you have already provided, and relative to where I began, the simulations are running much faster now.

Answer to 1:
When recompiling after editing the compiler flags (adding the '-no') I got a segmentation fault error. I doubled the stack size (by default the stack size is 8192 kB) and the model could run again. And yes, there is a clear difference and quite a good speedup, see the table below:

(The first two rows give the tiling, the third row gives the execution time in seconds using the default flags, and the fourth row gives the corresponding execution times using the added '-no' flag.)

Code:
NtileI 12  12  12  12  24  24  36  36  36
NtileJ 12  24  30  36  24  12  12  24  36
------------------------------------------
'def.' 81  73  71  71  71  78  79  77  80
'-no ' 96  60  62  64  74  66  72  70  74


A 15% decrease in computation time :D

Answer to 2:
. . . is coming later

Answer to 3:
. . . is coming later


PostPosted: Tue Jan 04, 2011 10:25 pm 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:
...and relative to where I began the simulations are running much faster now.


The point is that you know exactly where you started, but you do not know where
you are heading, in the sense that you do not know when it is good enough to stop. And I feel
that it will take several iterations from where we are right now.

First, last time you reported an execution time of about 4 minutes (3:55 at best, to be exact).
Now the best time is 60 seconds using 288 tiles (12x24). Are you running a different problem
now? Or the same problem with a different duration?

Secondly, the contrast between -heap-arrays and -no-heap-arrays you are observing
(roughly 70 --> 60 seconds) is about 15%, while I expected to see something close
to 25%. So it is kind of flatter, which makes me feel that your code has some sort
of "ballast" attached somewhere -- an expensive, computationally intense piece of code
which behaves perfectly "flat" (not sensitive to tiling, compiler flags, etc.),
takes about 10..20 seconds (out of 60), and dilutes all the sensitivities.

Looking at the upstream portion of this thread, as well as some of your other
posts: Are you doing tides by any chance? Meaning, do your computations in
these timing results go through the routine "set_tides.F"?

If the answer is "yes", this is a known CPU time waster. The problem is that it computes
a lot of sines and cosines in two-dimensional loops (covering the entire computational
domain), but the results of these computations are needed only at the perimeter.
This is aggravated by the fact that you are doing a 2D problem (SOLVE3D not defined).
[In a 3D setup the excessive cost of "set_tides" is mitigated by the mode-splitting ratio
"ndtfast".]

The offending loops are: starting at about line 393 of "set_tides.F",
Code:
      Etide(:,:)=0.0_r8
      cff=2.0_r8*pi*(time(ng)-tide_start*day2sec)
      DO itide=1,NTC
        IF (Tperiod(itide).gt.0.0_r8) THEN
          omega=cff/Tperiod(itide)
          DO j=JstrR,JendR
            DO i=IstrR,IendR
              Etide(i,j)=Etide(i,j)+                                    &
     &                   ramp*SSH_Tamp(i,j,itide)*                      &
     &                   COS(omega-SSH_Tphase(i,j,itide))
#  ifdef MASKING
              Etide(i,j)=Etide(i,j)*rmask(i,j)
#  endif
            END DO
          END DO
        END IF
      END DO

and starting at about line 524,
Code:
      Utide(:,:)=0.0_r8
      Vtide(:,:)=0.0_r8
      cff=2.0_r8*pi*(time(ng)-tide_start*day2sec)
      DO itide=1,NTC
        IF (Tperiod(itide).gt.0.0_r8) THEN
          omega=cff/Tperiod(itide)
          DO j=MIN(JstrR,Jstr-1),JendR
            DO i=MIN(IstrR,Istr-1),IendR
              angle=UV_Tangle(i,j,itide)-angler(i,j)
              Cangle=COS(angle)
              Sangle=SIN(angle)
              phase=omega-UV_Tphase(i,j,itide)
              Cphase=COS(phase)
              Sphase=SIN(phase)
              Uwrk(i,j)=UV_Tmajor(i,j,itide)*Cangle*Cphase-             &
     &                  UV_Tminor(i,j,itide)*Sangle*Sphase
              Vwrk(i,j)=UV_Tmajor(i,j,itide)*Sangle*Cphase+             &
     &                  UV_Tminor(i,j,itide)*Cangle*Sphase
            END DO
          END DO
          DO j=JstrR,JendR
            DO i=Istr,IendR
              Utide(i,j)=Utide(i,j)+                                    &
     &                   ramp*0.5_r8*(Uwrk(i-1,j)+Uwrk(i,j))
#  ifdef MASKING
              Utide(i,j)=Utide(i,j)*umask(i,j)
#  endif
            END DO
          END DO
          DO j=Jstr,JendR
            DO i=IstrR,IendR
              Vtide(i,j)=(Vtide(i,j)+                                   &
     &                    ramp*0.5_r8*(Vwrk(i,j-1)+Vwrk(i,j)))
#  ifdef MASKING
              Vtide(i,j)=Vtide(i,j)*vmask(i,j)
#  endif
            END DO
          END DO
        END IF
      END DO

[search for "DO itide=1,NTC" to find them.]

These loops compute the arrays Etide, Utide, Vtide, which are then added to the boundary
arrays {zeta,ubar,vbar}_{west,east,south,north} = 12 combinations total, along the
perimeter.

Instead of computing Etide, Utide, Vtide over the 2D range of indices

i,j={ MIN(IstrR,Istr-1):IendR, MIN(JstrR,Jstr-1):JendR }

you should compute them along the perimeter only.
And, in principle, eliminate
these scratch arrays altogether: the results of these computations may be added
directly to the boundary forcing arrays ubar_west, zeta_west, etc...

Transform loops
Code:
      DO itide=1,NTC
        IF (Tperiod(itide).gt.0.0_r8) THEN
          omega=cff/Tperiod(itide)
          DO j=MIN(JstrR,Jstr-1),JendR
            DO i=MIN(IstrR,Istr-1),IendR
                   .....
                   .....
            END DO
          END DO 
        END IF
      END DO

into
Code:
      IF (WESTERN_EDGE) THEN
        i=Istr-1             !<-- or appropriate i-index for boundary U-points
        DO itide=1,NTC
          IF (Tperiod(itide).gt.0.0_r8) THEN
            omega=cff/Tperiod(itide)
            DO j=MIN(JstrR,Jstr-1),JendR
                   .....
                   .....
            END DO
          END IF
        END DO
      END IF

and, similarly
Code:
      IF (EASTERN_EDGE) THEN
         i=Iend+1
         ......

Code:
      IF (SOUTHERN_EDGE) THEN
         j=Jstr-1             !<-- or appropriate j-index for boundary V-points
         ......

Code:
      IF (NORTHERN_EDGE) THEN
         j=Jend+1
         ......

In doing so you will replace each of the 2D loops with four 1D boundary loops,
so your code will be longer in the end in terms of number of lines, but you will
eliminate most of the computational cost, bringing it down to ~1% of the original.

The easiest way to avoid confusion is to get rid of scratch arrays Etide, Utide, and
Vtide altogether along with the MPI exchange calls associated with these arrays
and work directly with ubar_west, etc...


Use the following routine as a template
http://www.atmos.ucla.edu/~alex/ROMS/BAN/set_tides.F
but do not substitute the file as a whole: it is semantically
incompatible with your code.


...and I am still waiting for answers to 2 and 3 from the previous exchange.


PostPosted: Thu Jan 06, 2011 3:00 pm 
Offline
User avatar

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
Answer to 2:

Running with FFLAGS := -no-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -axSSE4.2 -free
and increased stack size (16384), using the same tiling (6 x 32), on:
12 threads the time was: 39 sec.
8 threads the time was: 44 sec.

Running with
FFLAGS := -pc80 -xSSE4.2 -auto -stack_temps -openmp -fpp -ip -O3 -free
and increased stack size (16384), using the same tiling (6 x 32) and 12 threads, gave the same time, 39 sec. Using other tilings (e.g. 12x32 or 12x24) on 12 threads, the time went down to about 37.5 sec.

The speedup could, as you pointed out above, probably be larger if I were running on a server setup rather than the PC setup I'm presently using.

Answer to 3:

I usually run current version of ROMS using the build.bash script, where netcdf is located by the following lines:
export NETCDF_INCDIR=/usr/local/include
export NETCDF_LIBDIR=/usr/local/lib

In Makedefs.IntelEM64T, I do not know what to change. If I delete all comment and suffix lines, here is what is left:

Code:
CPP = /lib/cpp -traditional -D_OPENMP -D__IFC
OMP_FLAG = -fpp2 -openmp
CFTFLAGS = -pc80 -xSSE4.2 -auto -stack_temps
CFT = ifort $(OMP_FLAG) $(CFTFLAGS) $(LARGE_MEM_FLAG)
LDR = $(CFT)
FFLAGS = -O3 -IPF_fma -ip -warn unused
LDFLAGS =
COMP_FILES =
LCDF = -lnetcdf -lhdf5 -lhdf5_hl
libncar = -lncarg -lncarg_gks -lncarg_c -lXpm -lX11 -lXext -lpng -lz
LIBNCAR = -L$(NCARG_ROOT)/lib -L/usr/lib64 $(libncar)

which line should I edit?

Answer to the new post:

Yes, I use set_tides, and thank you a lot for the information. I'll look into it.

and

Yes, in the latest reply I was using a different grid and a different duration.
I seem to have some trouble right now running the original version again (segmentation fault), but a summation of the gains reported in the posts above shows that the run times should be about a third of what they originally were.


PostPosted: Thu Jan 06, 2011 5:58 pm 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:
I usually run current version of ROMS using the build.bash script, where netcdf is located by the following lines:
export NETCDF_INCDIR=/usr/local/include
export NETCDF_LIBDIR=/usr/local/lib


Then you should set
Code:
CPP = /lib/cpp -traditional -D_OPENMP -D__IFC -I/usr/local/include

LCDF = -L/usr/local/lib -lnetcdf

and leave everything else as before.

Compiler/CPP flags are generally cumulative, so there are different ways to express
the same thing, e.g.,

Code:
FIRST_SET_FLAGS = -flag1 -flag2 -flag3
SECOND_SET_FLAGS = -flag4 -flag5
CFT = ifort $(FIRST_SET_FLAGS) $(SECOND_SET_FLAGS)

is the same as

Code:
CFT = ifort -flag1 -flag2 -flag3 -flag4 -flag5

Grouping flags together is a matter of style, but also of convenience, because you want to
change them as sets: e.g., when going from high optimization to debugging, you want to
drop -O3 with a few associated things, but keep your -pc80 -xSSE4.2, which define the
instruction set for your CPU. So these flags belong to different groups. Hernan's bash
script in today's ROMS does the same thing, but it is kind of hidden from users.


PostPosted: Thu Jan 06, 2011 7:01 pm 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:

Answer to 2: Running with FFLAGS := -no-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -axSSE4.2 -free and increased stack size (16384), using the same tiling (6 x 32), on:
12 threads the time was: 39 sec.
8 threads the time was: 44 sec.

This is good news: it looks like hyperthreading works now.

Running 8 threads on your machine does not make any sense: you have 6 cores, 8 is not
divisible by 6, so you get imbalance and possibly frequent thread migration from core
to core, because the scheduler tries to balance the load between the cores. This leads to some
penalty, which is probably partially compensated by hyperthreading (since you have more
threads than physical cores).

What about 6 vs. 12 threads?

Quote:

...Using other tilings (e.g. 12x32 or 12x24) on 12 threads the time went down to about 37.5 sec

This is a bit surprising (shortening the first dimension to below 100 = shortening vector
loops should slow it down), but anyway, always compare the best against the best.

What about 8x24 using 12 threads?

Also, note that different versions of the code may have different optimal tilings.


PostPosted: Mon Jan 10, 2011 4:15 pm 
Offline
User avatar

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
Hi, here is an update, with some surprises.

Starting with the last post:

I agree, 8 threads is far from optimal, and I just included it as it gave a “semi-comparable” reference time to the 8-threaded run time you reported.

Tiling 8x24 takes 38.6 sec with 12 threads.

And a surprise:
Tiling 12x32 takes 37.5 sec with 12 threads and 32.8 sec with 6 threads??
. . . something strange is going on with the hyperthreading.

Above I used the new flags
Compiler flags : -pc80 -xSSE4.2 -auto -stack_temps -openmp -fpp -ip -O3 -free

If I use the original flags
Compiler flags : -heap-arrays -fp-model precise -openmp -fpp -ip -O3 -xW -free
tiling 12x32 takes 54.8 sec with 12 threads and 53.6 sec with 6 threads.

Using the modified flags
Compiler flags : -no-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -axSSE4.2 -free
tiling 12x32 takes 38.8 sec with 12 threads and 35.6 sec with 6 threads.

So there is a very good speedup between the original flags and the present ones when running this application.


Something else that is somewhat strange is that all my other applications seem to need a very large stack size if they are to run with the new compiler flags. At the moment I therefore use unlimited stack size, although I know that this is not the best option. . . when running Ubuntu remotely I can only increase the size once per session, so these stack size iterations require patience. . .


How much has my reference 2D tidal run sped up so far?
I reran the reference problem with the new compiler options and unlimited stack size. It originally took 8:55 [min:sec] and now it takes 2:44 using 12 threads and 3:16 using 6 threads. So for this application the run time has been reduced to only 31% of the original.


And finally to the status of running the soliton test with ROMS v1.9.
I ran into the following error:

Quote:
. . .
mp_routines.f(56): remark #7712: This variable has not been used. [WTIME]
function my_wtime (wtime)
-------------------------^
/lib/cpp -traditional -D_OPENMP -D__IFC -I/usr/local/include -P analytical.F | mpc > analytical.f
analytical.F:446: error: missing binary operator before token "ISWAKE"
ifort -fpp2 -openmp -pc80 -xSSE4.2 -auto -stack_temps -c -O3 -IPF_fma -ip -warn unused analytical.f -o analytical.o
analytical.f(286): remark #7712: This variable has not been used. [X0]
& r, theta, twopi, val1, val2, x0,y0,rd_inner
. . .
ifort -fpp2 -openmp -pc80 -xSSE4.2 -auto -stack_temps -O3 -IPF_fma -ip -warn unused -o roms roms.o initial.o main2d.o . . . zetabc.o -L/usr/local/lib -lnetcdf
/opt/intel/Compiler/11.1/072/lib/intel64/for_main.o: In function `main':
/export/users/nbtester/efi2linux_nightly/branch-11_1/20100415_000000/libdev/frtl/src/libfor/for_main.c:(.text+0x38): undefined reference to `MAIN__'
make: *** [roms] Error 1


Any idea on what is going wrong?


PostPosted: Mon Jan 10, 2011 4:59 pm 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:
.....
analytical.F:446: error: missing binary operator before token "ISWAKE"
.....

There is a typo on line 446: It says
Code:
#if defined ANA_FSOBC && !defines ISWAKE

should be
Code:
#if defined ANA_FSOBC && !defined ISWAKE

i.e., !defines --> defined

On my machines it still compiles and runs correctly -- CPP produces error
message, but does not quit despite the typo.
This piece of code in analytical.F is not needed for this application.


PostPosted: Mon Jan 10, 2011 5:29 pm 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:
/opt/intel/Compiler/11.1/072/lib/intel64/for_main.o: In function `main':
/export/users/nbtester/efi2linux_nightly/branch-11_1/20100415_000000/libdev/frtl/src/libfor/for_main.c:(.text+0x38): undefined reference to `MAIN__'

This is caused by an attempt to compile the code without compiling mpc first.

Explanation: If you just untar the tar file and type "make", it tries to compile the first target,
which is main.o; however, because the CPP output is piped through mpc, and mpc does not exist
yet, the result is an empty file "main.f". Then make quits, complaining about
Quote:
/bin/sh: mpc: command not found

If, in response to that, you type "make mpc" followed by "make", everything compiles
to the very end, but because it takes the empty "main.f" left over from the first attempt
to compile the code, the resultant "main.o" is also empty, so you end up with the
Quote:
undefined reference to `MAIN__'

error at the very end.

To fix: You must remove empty "main.f" and "main.o" files before making the second
attempt to compile the code. A sequence of commands
Code:
make mpc
make clean
make

is sufficient to do that; i.e., a simple rule is that if, for whatever reason, you decide
to recompile "mpc", always use "make clean" after that.


PostPosted: Mon Jan 10, 2011 7:19 pm 
Offline
User avatar

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
That was just it.
I did not use make clean between make mpc and make -j.

Here are some run-times from running soliton on ROMS_1.9

Code:
NSUB_X        6     6     8     8     8     9     3    12    12    12
NSUB_E       32    64    24    36    48    32    32    32    24    48
12 threads 31.2  29.6  31.4  30.4  29.4  29.6  36.4  27.8  25.6  27.2
6 threads  28.6  26.8  26.6  25.8  38.6  24.4  34.4  23.0  29.2  32.4


PostPosted: Wed Jan 12, 2011 7:26 am 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
...So what? The best results for the SOLITON problem are:
for ROMS 1.9
Quote:
23.0 sec using 6 threads, 12x32 tiling

vs. for the new code
Quote:
32.8 sec also 6 threads, 12x32 tiling,
Compiler flags: -pc80 -xSSE4.2 -auto -stack_temps -openmp -fpp -ip -O3 -free

The above slightly degrades to
Quote:
35.6 sec with 6 threads, 12x32 tiling
Compiler flags: -no-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -axSSE4.2 -free

and then dramatically degrades to
Quote:
53.6 sec with 6 threads, 12x32 tiling
Compiler flags: -heap-arrays -fp-model precise -openmp -fpp -ip -O3 -xW -free


Obviously the compiler flag -heap-arrays causes a major crippling effect, which we have
identified and repaired. Also, hyperthreading does not yield any positive effect.
[It also appears that -fp-model precise makes the code run a bit slower, though
not as dramatically.]


But, even after fixing this and leaving hyperthreading behind for a moment (it seems
to affect both codes equally), it looks like your eight-year-old ROMS 1.9 code runs
significantly faster -- about 1.4 times -- than your newest code.

Do you want to ask why?


At this point it is worth verifying, by comparing step2d.F of both codes, that for this
particular problem they are indeed mathematically equivalent to the last operation.

...Independently of the above, it would be useful to limit the stack size and verify that
1.9 can go with a very small stack size.


PostPosted: Wed Jan 12, 2011 4:56 pm 
Offline
User avatar

Joined: Tue May 12, 2009 2:18 pm
Posts: 32
Location: Faroe Islands
Running the soliton test with the default stack size (8192 kbytes) is no problem. The problem with large stack sizes is only with my other applications.

Just to be on the safe side I reran with the default stack size some of the cases reported above, which were executed with unlimited stack size. There was no clear difference in the run times.

It must nevertheless be mentioned that the run times reported above are single-run times and not averages over several runs, which would be the more accurate procedure.

Here are a few runs from ROMS 1.9 with 12x32 tiling and 6 threads:

Code:
. . /ROMS_1.9$ time roms < roms_soliton.in > BAN_run_log
real 0m25.819s
real 0m25.419s
real 0m23.018s
real 0m25.819s
real 0m23.219s
real 0m23.219s
real 0m36.223s
real 0m23.019s

Here are some runs from the recent version of ROMS with 12x32 tiling and 6 threads:

Code:
. . /Test_Soliton$ time ./oceanO < ocean_soliton.in > run_log
real 0m33.020s
real 0m35.820s
real 0m36.420s
real 0m35.620s
real 0m32.820s
real 0m33.020s
real 0m41.222s
real 0m32.819s
real 0m44.824s
real 0m49.228s
real 0m47.024s
real 0m32.820s
real 0m33.419s
real 0m33.419s

As you can see, there is the possibility that the PC wants to use its resources for other purposes and thus suddenly gives unexplainably long run times. But the most frequent run times are the lowest times, and these seem to be equivalent to the run times reported in the posts above.

Why there is this big difference in run times between the old code and the new code is a good question!


PostPosted: Thu Jan 27, 2011 6:54 am 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:
It must nevertheless be mentioned that the run times reported above are single-run
times and not averages over several runs, which would be the more accurate procedure.

No, comparing averages is not the proper way to analyze it. Comparing the best against
the best would be more representative.

This is a bit another topic, but let's just have a brief explanation of what is going on.

I assume that you are not running firefox while doing these tests, and there are no other
competing applications. The best way is also to stop KDE and, in principle, to stop the X-server
completely -- just type telinit 3 as root while logged in remotely, and telinit 5 to restart the
X-server again -- if, say, you use this machine solely for computing while using some other
machine -- an older Pentium 4 -- as a desktop (KDE, firefox, LaTeX editing, whatever).
So if you run top while the ROMS test is running, you get a solid 600%, and not a penny less,
if using 6 threads (1200% if using 12).

So the machine is perfectly clean, in the sense that there are no competing jobs running and
taking away CPUs/cores.

Then if you run the same test again and again, say 1000 times or more (just write a script)
to get enough statistics, and plot a histogram of the probability density of having a certain
runtime (say, you bin your results ranging from ~23 to ~36 seconds into 0.25-second
bins, and see how many runs out of 1000 end up in each bin), you will discover that
this histogram has not one, but two peaks. Perhaps it is even more complex.
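Such a script can be as simple as the following sketch (the model command here is a placeholder -- substitute whatever you actually run):

```shell
# Run the same case N times, record the wall-clock seconds of each
# run, then count the results in 0.25-second bins as a text histogram.
CMD=${CMD:-"./oceanO < ocean_soliton.in > run_log"}   # placeholder
N=${N:-10}

for i in $(seq 1 "$N"); do
  t0=$(date +%s.%N)                 # GNU date, sub-second resolution
  sh -c "$CMD" 2>/dev/null
  t1=$(date +%s.%N)
  awk -v a="$t0" -v b="$t1" 'BEGIN { printf "%.2f\n", b - a }'
done > times.txt

# How many runs fall into each 0.25-second bin:
awk '{ bin = int($1 / 0.25) * 0.25; count[bin]++ }
     END { for (b in count) printf "%7.2f s : %d runs\n", b, count[b] }' \
    times.txt | sort -n
```

A bimodal distribution then shows up directly as two separate bins with large counts.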

The problem is that some of your test runs literally experience mutations, and become slower
by some finite amount than most other runs, which are "normal" -- with some finite probability,
albeit not a very large one. You have 6 physical cores, and you are running 6 threads. You expect
that each thread gets its own core and sits on it throughout the whole run. This is ideally what
should happen, and it happens most of the time. However, because of CPU hyperthreading your
machine appears to the operating system as a 12-CPU machine, and the Linux kernel does not
actually distinguish between real physical cores and virtual cores, so it allocates threads to
those CPU cores which it thinks are not busy at the moment -- it is designed to statistically
level the load. Thread allocation is actually a sequential process. The OpenMP job starts as
a single thread, then reaches the first parallel region, and creates child threads one at a time.
Because the kernel does not distinguish whether the core it thinks is free belongs to the pair
of virtual cores corresponding to the same physical core or to a different one, there is a chance
that it puts two threads onto two virtual cores which correspond to the same physical core.
This is like a mutation.

In the past, when hyperthreading first appeared in 2003 on "Prestonia"-core Xeon CPUs
(a Xeon version of the Pentium 4 "Northwood") and the follow-up "Nocona", the recommendation
was very plain and simple: go to the BIOS and disable hyperthreading. The machine always runs
faster with two real CPUs than with four virtual ones (using either 2 or 4 threads). Intel claims
that hyperthreading is useful because if a thread stalls on a cache miss, the resources of
the core will be given to another thread, and therefore it utilizes cycles which are otherwise
wasted while the cache miss is resolved. The problem is that there are trade-offs, and thread
switching is not free: it may flush cached data previously loaded but not yet used by the thread
which was switched off, so the data must be reloaded when it comes back. Besides, ROMS, as a
coarse-grained code, uses static scheduling in OpenMP, which assumes that the resources are
available when requested, and hyperthreading only confuses that. Maybe other codes behave
differently, but we were not able to extract anything useful from hyperthreading at that time.

The next generation of CPUs -- Core 2 Quad -- did not have hyperthreading.

It is back with the i7, and, contrary to my pessimistic expectations, I was able to extract some
gain (but only in a 3D large-memory configuration). Your newest results obviously do not
support that. Perhaps the only way to verify whether the fluctuations in run time are caused
by mutations or by something else is to turn hyperthreading OFF in the BIOS and check that the
results become stable.

P.S.: The inconsistency of run times caused by mutations can also be observed on dual-CPU
Opteron boards, especially on "dual-dual" ones (dual-CPU dual-core Opteron 275, for example),
and this puzzled me for some time (I have such a machine). About a year ago I found an article
by Allan Porterfield et al., "Performance Consistency on Multi-socket AMD Opteron Systems",
explaining this: http://www.renci.org/wp-content/pub/techreports/TR-08-07.pdf
Of course, it has nothing to do with hyperthreading, but the mechanism is similar: the Linux
kernel does not distinguish between CPUs/cores belonging to the same or different CPU
sockets and, correspondingly, memory systems (these CPUs have their individual memory
systems, so some parts of memory are "closer" to one CPU than to the other, resulting in
a non-symmetry which is ignored by the Linux kernel).


PostPosted: Sat Jan 29, 2011 10:48 pm
Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Regarding the performance comparison between ROMS 1.9 vs. 3.4: what
you are observing can be called the Fortran 90 penalty. The fact that pre-Fortran 90
codes run faster than their supposedly more advanced successors actively using
new Fortran 90 features is nothing new, and it has actually been noticed in the ocean
modeling community -- not just for ROMS, but for other models as well.

Perhaps the earliest documented evidence of this dates back to 1999 (I saw it even
earlier, but cannot point to anything on the web), and it was initially attributed to the
immaturity of the F90 compilers of that time. Go to http://www.gfdl.noaa.gov/ocean-model,
scroll down to MOM, download the MOM3 Manual, and read page 9. It reports
that an attempt to replace common blocks with Fortran 90 modules degraded the
code performance by as much as 30%.

Sifting through old posts on this Board, you can find a few posts dating back
to 2005 complaining that "ROMS 2.1 is again about 2 times slower than 1.8",
https://www.myroms.org/forum/viewtopic.php?f=29&t=134&p=596#p596

Most likely related to this problem is a recent comparison of speed between
ROMS and POM, not in favor of ROMS (POM is a basic Fortran 77-style code):
https://www.myroms.org/forum/viewtopic.php?f=14&t=1766&p=6346
[I do contend that, if properly used, ROMS (at least some versions of it) will
decisively outperform POM, leaving it no chance whatsoever.]

The stacksize limitation problem is perhaps the most frequently noticed, see
https://www.myroms.org/forum/viewtopic.php?f=31&t=216
and
https://www.myroms.org/forum/viewtopic.php?f=17&t=1794&p=6462
and
https://www.myroms.org/forum/viewtopic.php?f=31&t=216&p=613
among numerous others.

The relative insensitivity of the computational performance of the 2.x and later codes
to compiler optimization settings (-O2 vs. -O3) is reported in posts dating back to 2005:
https://www.myroms.org/forum/viewtopic.php?f=31&t=230&p=468
and
https://www.myroms.org/forum/viewtopic.php?f=14&t=252
[Note that for the Intel Ifort compiler the flag -O is the same as -O2.] Both
report a significant increase in compilation time when changing from -O2 to -O3,
but no gain in execution performance.



A quick comparison of the step2d.F files between the 1.9 and 3.4 codes illustrates
the stages of code evolution over the long run: 1.9 places all its global/shared arrays
into common blocks, which are stored in separate files and are included where
needed using CPP-style #include. As a result, the global arrays are never passed
as arguments, and there is no need to declare them inside a routine. This is old-man's
Fortran 90: having F90-like functionality without having an F90 compiler. Placing
common blocks into include files guarantees consistency of their sizes and of the meaning
of the variables inside. Fortran common blocks map onto global variables in C: one
can declare an array in a piece of Fortran code and place it into a common block
named X, then write another piece of code in C which contains the declaration of
a global variable named X_ (i.e., the same name with a trailing underscore appended),
compile both pieces with their respective compilers using the -c option, then link the
resultant .o files together using either compiler. (Note that Fortran and C compilers
from the same vendor are normally bundled together and actually share most of their
libraries. The format of .o object and .a library files is common between the two,
so either compiler can handle .o's regardless of whether the original source codes are
written in Fortran or C.) The Fortran array can then be accessed from inside the C code
as X_, and vice versa. This trick is no secret: the core netCDF and MPI libraries are written
completely in C, but can be called from Fortran programs. [The only parts of the netCDF
and MPI packages written in Fortran are the Fortran 90 .mod interfaces.] Because placing
a variable into a common block is an ancient mechanism for creating a global variable,
compilers can handle them efficiently with full optimization.

Conversely, the scratch arrays in 1.9 are always passed as arguments from the
driver routine to the driven one. In fact, the functionality of the driver is two-fold:
(1) to decode the tile index tile into the bounding indices istr,iend,jstr,jend; and
(2) to provide scratch workspace arrays for internal use in such a way that
dynamic (or automatic-array) allocation is completely avoided while, at the same time,
guaranteeing that each thread has its own workspace memory non-overlapping with
that of any other thread. Note the use of the !$OMP THREADPRIVATE directive inside
the file "scratch.h". This means that everything placed into this common block is
statically allocated (retaining its assigned values throughout the entire run time),
but is private to the individual threads.

In 3.4 it is actually the other way around: all global arrays are passed as arguments,
while the scratch arrays are not passed as arguments, but instead are declared internally
and therefore become automatic arrays. Why? It is a long story. To make it short,
let's speculate on a procedure to convert 1.9 into 3.4. At first, one can convert each
include file into a Fortran 90 module declaration and, correspondingly, replace each
CPP-style #include with a Fortran 90 USE of that module name. Also move all the
USE statements to just above implicit none. This is straightforward and can be done
relatively quickly.

What to do about !$OMP THREADPRIVATE? Nothing. There was not much
choice here. As of 2005 and earlier (Open MP Specification Version 2.5, released in
May 2005; see http://openmp.org/wp/openmp-specifications/ for detail) the only way
to create threadprivate, statically allocated memory was to place the variable into
a common block and put a !$OMP THREADPRIVATE directive with the name of that
common block just below it. There is nothing like a THREADPRIVATE module.
[The newer/current Open MP Specification Version 3.0, released in May 2008 (see p. 81),
allows declaring individual variables as threadprivate. The same is formally true
for the Version 2.5 Specification (see p. 66), but it took some time -- a couple of years --
for compiler vendors to fully implement this. I personally reported a compiler issue
(Intel's term for a bug) to Intel when the compiler simply ignored a THREADPRIVATE
directive placed inside a module.] "Fortunately", from the mathematical point of
view, ROMS 1.9 did not rely on threadprivate arrays to transmit any information
into or out of a subroutine -- the arrays are used in purely scratch mode to store
intermediate results between consecutive loops, and it does not matter what the
state of these arrays is on input or output. [Note: this is no longer true for the
present-day AGRIF and UCLA codes, both of which extensively use threadprivate
arrays to transmit data from one subroutine to another.] So back before 2005 the
choice was to either
(1) leave the !$OMP THREADPRIVATE common block declaration alone as it was
before (hence violate the principle of Fortran 90 purity -- no more common blocks), or
(2) discard this mechanism altogether, and let the automatic arrays be created/
destroyed every time a subroutine is entered/exited.
In the end the purity principle prevailed and the second option was selected.

After some time passed it was discovered that both steps -- converting common
blocks with global array declarations into Fortran 90 modules and getting rid of
the THREADPRIVATE common blocks -- had a crippling effect on the code performance.

A common feeling (yes, we are entering speculative territory where nobody can
say anything for sure), influenced by computer scientists, is that this degradation of code
performance is caused mainly by the inhibition of compiler optimizations due to the
inability of the compiler to determine whether different elements of pointer-based
(or allocatable) arrays correspond to distinct memory addresses (this is often called
aliasing or pointer aliasing, and compiler engineers sometimes refer to it as an aliasing
hazard). A desperate attempt to counter this effect is to pass all of the global/
shared arrays (arrays which are declared in modules) through the arguments of each
subroutine rather than via a Fortran 90 USE statement inside it, which
is nothing else but an attempt to fool the compiler by "hiding" the fact that
the arrays are pointer-based and hoping that it will properly optimize the code.

[ Obviously, using F90 USE statements to pass the variables would result in a much
more compact and, arguably, more elegant code than having up to a dozen arguments
to each subroutine -- e.g., compare step2d.F from 1.9 vs. step2d_LF_AM3.h from 3.4:
in 1.9 the first executable statement in the driven routine occurs on line 82, which is
#include "set_bounds.h", and everything above it is variable declarations and the driver
routine. In contrast, in 3.4 #include "set_bounds.h" is on line 588, so the declaratory
part of the code is much more massive and, arguably, ugly. However, this is not about
aesthetics any more: there is much more at stake here. ]

[ I believe that the performance difference between the common-block and F90-module versions of
the code is actually more sophisticated than just scaring off compiler optimizations with the
possibility of an aliasing hazard. After all, running nm to see the symbolic names inside a .o file compiled
from a source code with a module declaration reveals some remapping of names and their
encapsulation into a kind of structure. But I am not aware of any literature which clearly states
whether this by itself may cause a performance hit. Clearly F90 modules allow more run-time
checking -- array bounds, multidimensional shapes, etc. -- which may come at some price. ]


This passing of global arrays through arguments exposed the code to another
problem: Fortran 90 compilers sometimes tend to introduce temporary variables to
copy the data before passing it into a subroutine, and to copy it back upon return.
If this happens, it slows down everything significantly. To minimize the unintended
copying in Fortran 90 one should use explicit interfaces for all procedures, either
with INTERFACE blocks, or with module USE statements, or by nesting one
procedure inside another with CONTAINS. The cause of copying/not copying is
quite complex and is related to the fact that F90 arrays, unlike F77 ones, can be
noncontiguous in memory; see http://www.pathscale.com/node/151. For this
reason, modularization of subroutines was introduced into ROMS, resulting in
the familiar structure,
Code:
      MODULE step2d_mod
      PRIVATE
      PUBLIC  :: step2d
      CONTAINS
# include "step2d_LF_AM3.h"
      END MODULE step2d_mod

as it appears in modern-day ROMS. This, along with the use of the ASSUMED_SHAPE
array declaration inside the working routine, and with some luck, prevents the unwanted
copying. Some relevant experiences are reported here:
https://www.myroms.org/forum/viewtopic.php?f=29&t=134&p=572&hilit=ASSUMED_SHAPE#p572

The above more or less completes the explanation of why 3.4 looks the way it does
today, but it does not fully answer why 1.9 is still faster, at least for some configurations.
Nor does it explain why the code performance was gradually depleted, step by step, in front
of the eyes of so many, with so few voices complaining.

There has never been a deficit of outside advisors. The transition to F90 was influenced
by the appearance of various official Guidelines and Recommendations about how the
code should be written and maintained. Perhaps the earliest document of this kind is
European Standards for Writing and Documenting Exchangeable Fortran 90 Code,
written by Phillip Andrews (Met Office), Gerard Cats (KNMI/HIRLAM),
David Dent (ECMWF), Michael Gertz (DWD), Jean Louis Ricard (Météo-France)
http://www.scribd.com/doc/7058020/European-Standards-for-Writing-and-Documenting-Exchangeable-Fortran-90-Code
Note the date -- 23 October 1996 -- at that time the Fortran 90 compilers for SGI (our
main computing platform at the time, also see below) were in the process of transitioning
from non-existent to immature status. In contrast, the MIPS F77 compilers were very
advanced, fully supporting multi-threading and software pipelining, and providing extensive
diagnostics (including software tools to check software pipelining and to measure the
absolute performance of the code in terms of practical MFlops/sec -- a capability
unmatched today. Does anybody remember the SGI pixie profiler?)

[ After a long history of being fired, rehired, fired again, and then SGI's own demise, the very same
MIPS people started their own compiler company, PathScale http://www.pathscale.com.
EKOPath and all known optimizations are their product and motto. Back in the SGI/MIPS era their
compilers were able to extract up to 30-50% of the theoretical peak performance from an R10k
processor. In practice, a 200MHz Origin 200 could match or run faster than a Sun Enterprise with twice
the clock speed. I doubt that this level of efficiency is ever matched today. ]


The above Guidelines were adopted by the UK Met Office Fortran 90 Standards,
http://research.metoffice.gov.uk/research/nwp/numerical/fortran90/f90_standards.html

Programming Guidelines for PARAMESH Software Development
http://www.physics.drexel.edu/~olson/paramesh-doc/Users_manual/Developers_Guide.html

I recall a similar official document from ONR/NRL dated back in 1998, but can
no longer find it on the web.

All the documents are consistent with each other, advocating code clarity, readability,
consistent indenting rules, proper in-line code documentation, and not using obsolescent
arithmetic ifs, gotos, non-integer loop indices, assigns, equivalences (use Fortran 90
pointers instead), etc. Common blocks must be eliminated from modern codes --
Fortran 90 modules are the better way to do it. Open MP-like directives existed in the
form of proprietary sets from different vendors, but were not standardized at that time,
so they were not discussed or even mentioned in the Fortran 90 Guidelines. [All the directives
were similar in semantics to the original Cray Y-MP/C-90 directives from which they were
derived, but were always altered because of obvious copyright issues.] Nor was it a
concern of the authors of the Guidelines that certain F90 features were poorly supported
(resulting in significant performance penalties), not supported at all, and/or interfered
with parallel directives.

A modern (2009) document, Recommendations for Writing Fortran 95 Code,
by S.-A. Boukabara and P. Van Delst,
http://projects.osd.noaa.gov/spsrb/standards_docs/Fortran95_standard_rev22Jun2009.pdf
among other things states:
Quote:

p. 6: Do not use external routines (subroutine not contained within a module and not within
the CONTAINS statement of the main program) as in some cases, these functions need interface
blocks that would need to be updated each time the interface of the external routine is changed.

Quote:

p. 7: No Common blocks. Modules are a better way to declare/store...

Quote:

p. 7, below: No implicit changing of the shape of an array when passing it into a subroutine.
Although actually forbidden in the standard it was very common practice in FORTRAN 77 to pass
'n' dimensional arrays into a subroutine where they would, say, be treated as a 1 dimensional array.
This practice, though banned in Fortran 90, is still possible with external routines for which no Interface
block has been supplied. This only works because of assumptions made about how the data is stored.

[ Note that the ROMS 1.9 practice of passing private scratch arrays into a _tile
subroutine from its driver only narrowly escapes this on a technicality. It is prohibited
to pass a multidimensional array and treat it as one-dimensional inside; ROMS 1.9
does the opposite: it passes a one-dimensional (hence contiguous in memory) array
which is then treated as multidimensional inside. ]

The use of dynamic memory allocation was encouraged and even insisted upon, especially
in the operational community -- ideally one could provide an executable file called roms.exe
which can be used for any configuration without recompiling: just read the grid dimensions
and all the configuration parameters from roms.in or from a netCDF file.

[ Here it should be noted that at approximately the same time at least two ocean
modeling communities -- POP/Los Alamos and, sometime later, MOM/GFDL -- decided
to get rid of C-preprocessor directives altogether: CPP was declared an "evil piece of
software", and Fortran run-time if-statements were adopted instead. ]

However, the encouragement of dynamic memory allocation was not universal.

CCM4 code standard - NESL's Climate & Global Dynamics (CGD)
Coding Standard for CCM4

http://www.cgd.ucar.edu/cms/ccm4/codingstandard.shtml
and another set of Official Guidelines, NCAR/CCSM,
http://www.cesm.ucar.edu/working_groups/Software/dev_guide/dev_guide/node7.html
states:
Quote:

Memory management: The use of dynamic memory allocation is not discouraged
because we realize that there are many situations in which run-time array sizing is
desirable. However, this type of memory allocation can cause performance problems on
some machines, and some debuggers get confused when trying to diagnose the contents
of such variables. Therefore, dynamic memory allocation is allowed only "when necessary".
The ability to run a code at a different spatial resolution without recompiling is not
considered to be an adequate reason
to use dynamically allocated arrays.



The loss of performance was initially obscured by the failure to take Linux computing
seriously, at least in the US. While for younger people Linux is the only UNIX-like
environment they have ever known (the recent free-BSD-based MAC OS is also similar,
...sort of), the common attitude back around 2000 was that Linux was merely a cheap
substitute for a UNIX workstation. The most widely used everyday workstation environment
of that era was Sun, and the supercomputing environment was dominated by the SGI Origin
2000, IBM SP2, and, somewhat earlier, Cray T3D. The Cray C90 -- some time before the most
powerful and the easiest to use -- was fading away, as did the Cray J90 (a CMOS replica of
the even earlier Y-MP, designed with cost in mind -- no more cryogenic cooling). All were
UNIX systems, http://www.youtube.com/watch?v=dFUlAQZB9Ng.
The strategies for code optimization were very different from what we have now: on all
these machines the ratio between the processor's ability to execute floating-point
operations and its ability to load/store numbers was much closer to 1:1, because the
contrast between the clock frequency of the CPU core and that of memory was nowhere
close to what we see in PC architectures, then and today; and, at the same time, logical
operations were considered relatively expensive compared to floating-point ones. Cache
effects were sometimes discussed at that time, but this had no practical consequences
in the design of ocean modeling codes. Linear algebra was a hot topic. SGI people were
promoting their out-of-order execution idea in the R10k and follow-up CPU architectures.
IBM engineers were very excited about the then-new Power4 processor and were talking
about hiding latencies by pre-fetching data before a cache miss occurs. Simply put,
if a miss occurs, record the address; on a second miss, record it too, then extrapolate the
address, trying to predict where the next miss will occur (assuming that the memory
access pattern is regular, as it is in a very long do-loop), and immediately start
pre-loading data from there. Subsequently, if the data is needed, the stream is
confirmed, so keep extrapolating and pre-loading; if not (hence a new miss), cancel
the stream and recalculate the extrapolated address based on the new miss. This is all
implemented in hardware. Obviously, they assumed a large excess of memory
bandwidth for this strategy to work -- this would never be the case for a PC-type
architecture. Some of these influences still remain today, even though they no longer
have any basis. The truth is that as of the end of the year 2000 a dual Pentium III
machine with Linux and proper code and compiler could match an entry-level, dual-CPU
Origin 200, which was at least 10 times more expensive. However, nobody cared. While
F90 compilers were reasonably mature for UNIX workstations -- SGI, Sun, and IBM -- in
the Linux world... forget about it: the only compiler available at that time was g77.
The free Intel compiler, IFC 5.0, became available late in 2001, and this changed the
equation completely. Still, it took several years before it was taken seriously.

As far as I remember, the conversion of ROMS to Fortran 90 was completed long before
it was tried for the first time on a Linux computer.

Did I not mention the Connection Machine? Remember the Nedry guy from the iconic
movie? http://www.youtube.com/watch?v=c602hzsL0VA Two Connection Machines
were donated by the Navy in the mid-199x: one to Rutgers and one to MIT. In both cases
the consequences were devastating. On one hand, these were the biggest computing
resources available to these groups, and they were in-house: no queues, no waiting.
On the other, running codes on them required a very special programming style which
was not supported on any other machine/compiler/operating system, and it was clear
at that time that there would be no successor in development along that line. [Thinking
Machines Corporation filed for bankruptcy in Nov. 1994; the extensions to the Fortran
standard pioneered by TMC CM Fortran (notably the FORALL and WHERE statements
and the array syntax) were later incorporated into the F90 Standard and eventually
finalized in F95; however, hardware support for these was and still is very poor --
during the Connection Machine era the CPU and memory clocks and the frequencies of
the signals transmitting data between the nodes were approximately the same, < 100MHz,
so the idea of having very many CPUs work under the condition of very tight
synchronization actually worked. Cache? What is cache? What would one need
it for?] I estimate that the development of MITgcm was delayed by at least 3 years, and,
curiously enough, late in the 199x they were busy rewriting their CM F90 code into plain
F77 (absolutely no F90 extensions) in order to be able to run it on Linux clusters,
where the only available compiler was g77. The present-day MITgcm,
http://mitgcm.org/public/source_code.html
still stays pretty much this way as of Jan 25, 2011. It was learned the hard way.

[ Los Alamos POP code was rewritten into array syntax originally to run it
on a Connection Machine. It still remains in this form today. ]

...This review may continue, but it is already way too long.

The paradox of this overall situation is that if one were to start developing
an ocean modeling code today from scratch, most likely he would start writing
an advanced F90-style code, and, because there is no legacy code to compare
with, the penalty in performance would never be noticed.

The point is that the different causes of performance loss tend to conceal each
other. I hope it is clear from the above in this thread: one loss causes another
to go unnoticed. If someone's code has poor cache utilization, compiler optimizations
do not matter much: the code shows little or no sensitivity to -O3 vs. -O2 settings.
If the optimizations are crippled for whatever reason, then cache does not matter any
more, and the code is not sensitive to tiling. And if you have no tiling -- and most
ocean modeling codes do not have this capability -- then there is no way to discover
that the code is sensitive to tiling, hence no awareness that cache matters. In order
to lose something, one must have something to lose in the first place. We have.


PostPosted: Sun Jan 30, 2011 4:05 am
Joined: Mon Oct 26, 2009 3:06 am
Posts: 12
Location: Portland State University
Alex,

Thanks a lot for writing this review and taking the time to engage with the original poster. It is no small task to assemble the list of examples and history into a narrative as you have done. And I think your final take-home point is very well said. If you would consider writing a topical piece for Ocean Modelling I imagine it would be an instant hit with new graduate students.

All the best,
Ed


PostPosted: Mon Jan 31, 2011 8:10 am
Joined: Fri Sep 17, 2004 2:22 pm
Posts: 74
Location: Institut Rudjer Boskovic
In the 90s the talk was that well-optimized Fortran
libraries were about 30% faster than the same code in C.
The common explanation was that in Fortran there are no
pointers and all arrays are directly declared, which gives
the compiler much room to optimize the code, as opposed
to C, where pointer-based code essentially cannot
be optimized.
Most of the above explanations seem to refer to the use
of automatic arrays and pointers. Why was it felt necessary
for ROMS to use them?


PostPosted: Tue Feb 01, 2011 8:13 pm
Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
Why knowingly change to a less efficient style? Perhaps this quote sums it up:
Quote:
I well remember when this realization first came on me with full force. The EDSAC was on the top floor of the building and the tape-punching and editing equipment one floor below. [...] It was on one of my journeys between the EDSAC room and the punching equipment that "hesitating at the angles of stairs" the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs. -Maurice Wilkes

Do we spend our time staring at the code or waiting for the numbers to come out? Do we want the fastest code that's perhaps more challenging to debug? I'm sure each of us has a different answer to that - and to what "looks good".


PostPosted: Tue Feb 01, 2011 8:45 pm
Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:

Do we spend our time staring at the code or waiting for the numbers to come
out? Do we want the fastest code that's perhaps more challenging to debug?

Kate, or perhaps anybody else: just compare step2d.F from v.3.4 vs. v.1.9 and
explain to me and to everybody else why v.3.4 is easier to debug than v.1.9.


PostPosted: Tue Feb 01, 2011 9:25 pm
Site Admin
Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1051
Location: IMCS, Rutgers University
Well, I didn't want to put my biases into the conversation here. There is a lot of very useful information here for fine-tuning computer hardware running ROMS.

I just wanted to note that we are talking about scientific coding and programming scientists. As programming scientists, we all have our biases and preferences. Everybody has them. Diversity makes our life more interesting. This is not that different from the political discourse that we see nowadays between political parties and their leaders. However, science is unbiased and we should follow the scientific methodology. Comparing elapsed time between two versions of a code requires a more careful mathematical analysis. Usually during code evolution we enhance the code by adding capabilities, which may require more computations. If a comparison such as the one mentioned above is to be made, we had better be sure that the numbers of floating point operations are the same. Otherwise, we may reach the wrong conclusion.

I don't know how the timings reported here were collected or how the associated code between the two versions differs mathematically. One would have to look into the assembly code and start counting the floating point operations. Measuring floating point operations per second (FLOPS) is not trivial on today's cheap computer architectures. I recall that the CRAY computers gave us that kind of information when running programs about two decades ago. Accurate code benchmarking is difficult. There are too many parameters to consider. Then, we have the random computer architecture and software behaviors...


PostPosted: Tue Feb 01, 2011 11:49 pm
Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
One could argue that since it is Hernan's goal that users never change the base code, it shouldn't matter how ugly it is. Fast code for all should be worth the pain of the few.

Sasha - have any other compilers caught up with old Fortran? Any good C codes out there? C++?


PostPosted: Wed Feb 02, 2011 2:37 am
Site Admin
Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1051
Location: IMCS, Rutgers University
Actually, that is not my goal. The issue here is that I need to support all the ROMS algorithms in their totality. This includes all the adjoint-based algorithms. The nonlinear ROMS model is only a fraction of all the algorithms that we support. If you are curious, please check any of the 32 different drivers that we currently offer in the ROMS/Drivers sub-directory. The nonlinear model driver (nl_ocean.h) is just one of them. The problem is that changing any single line in the nonlinear computational kernel requires an equivalent change in its adjoint model (sub-directory Adjoint) and in the perturbation and finite-amplitude tangent linear models (sub-directories Tangent and Representer, respectively). This includes re-testing the symmetry of all these operators at each discrete point. This is extremely time consuming and we don't have many volunteers or experts to do so. However, changes are still coming, but at a slower pace. This is one of the reasons why I have delayed the release of the nesting capabilities: I have been re-writing and testing this development for more than a year now.

When the ROMS code was designed over 10 years ago, we did not foresee the algorithms that we offer today. There are still more complex drivers coming... During the development of all these new algorithms, we needed to rework several aspects of the code. For example, we cannot have any redundant computations or assignments anywhere, because they will yield an incorrect adjoint. The private storage was reworked to guarantee the correct adjoint and symmetry within the linearized model operators. In the early days, we had private storage equivalence by C-preprocessing re-assignment of internal variable names. For example, in step2d.F for ROMS 1.8 we had:

Code:
#define zwrk UFx
#define gzeta UFe
#define gzeta2 VFx
#define gzetaSA VFe

This use of private storage to minimize memory requirements is fatal and not possible in adjoint computations, when the model is run backwards in time. Therefore, we had to re-write it and change strategies. In my personal opinion, the continuous evolution of any ocean numerical model is essential for its survival. Otherwise, such models become stagnant and less attractive to the user community. Our scientific knowledge and literature are never stagnant and always evolving. So nothing is written in stone, and everything is always open for revision. One of the attractive properties of ROMS is that it offers an extensive set of options and capabilities to thousands of users in the community.

In the end, the model is offered for free to the ocean modeling community worldwide. The user has a choice, and we do not tell anyone to use a particular version of ROMS or any other model.


PostPosted: Wed Feb 02, 2011 5:42 pm 
Offline
User avatar

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3517
Location: IMS/UAF, USA
Another question for Sasha: People are going to more and more processors these days, not faster processors. I've used about 100 at a time, but don't feel I can usefully use 1000. How do we get there?

I'm sure it wouldn't surprise you that our new cluster with two hex-core chips per node is slower per core than our old cluster with two dual-core chips per node.


PostPosted: Wed Feb 02, 2011 8:41 pm 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:
...This use of private storage to minimize memory requirements is fatal and
not possible in adjoint computations when the model is run backward in time...

Hernan, the horrendous practice in v.1.9 you are referring to is merely to
reuse the same scratch memory for two different purposes in a sequence.
It is explained as follows:
Code:
   real, dimension(TILE_SIZE_ARRAY) :: wrk1,wrk2,wrk3,wrk4
 ......
   do i,j=...
     wrk1(i,j)= [ scratch variables    ]
     wrk2(i,j)= [ needed for computing ]
     wrk3(i,j)= [ barotropic pressure- ]
     wrk4(i,j)= [ gradient terms       ]
   enddo
   do i,j=...
     rubar(i,j) = [ finite-difference expressions ]
     rvbar(i,j) = [ involving wrk1,wrk2,wrk3,wrk4 ]
   enddo
         !--> discard wrk1,wrk2,wrk3,wrk4 because their
              content not needed beyond this point

   do i,j=...
     wrk1(i,j)= [ fluxes for  ]
     wrk2(i,j)= [ barotropic  ]
     wrk3(i,j)= [ advection   ]
     wrk4(i,j)= [ terms       ]
   enddo
   do i,j=...
     rubar(i,j) = rubar(i,j) + [ finite-difference expressions ]
     rvbar(i,j) = rvbar(i,j) + [ involving wrk1,wrk2,wrk3,wrk4 ]
   enddo
         !--> discard wrk1,wrk2,wrk3,wrk4

To make it more human readable by having more interpretable names for the
scratch variables instead of the rather meaningless wrk1,wrk2,wrk3,wrk4, as
well as to highlight the purpose of each scratch variable and to limit the scope
of its existence/usage/need, the above was rewritten into
Code:
   real, dimension(TILE_SIZE_ARRAY) :: UFx,UFe,VFx,VFe
 ......

#define zwrk UFx
#define gzeta UFe
#define gzeta2 VFx
#define gzetaSA VFe
   do i,j=...
     zwrk(i,j)=    [ scratch variables    ]
     gzeta(i,j)=   [ needed for computing  ]
     gzeta2(i,j)=  [ barotropic pressure- ]
     gzetaSA(i,j)= [ gradient terms       ]
   enddo
   do i,j=...
     rubar(i,j) = [ finite-difference expressions ]
     rvbar(i,j) = [ involving zwrk, gzeta,gzeta2,gzetaSA ]
   enddo
         !--> discard zwrk,gzeta,gzeta2,gzetaSA
#undef gzetaSA
#undef gzeta2
#undef gzeta
#undef zwrk


   do i,j=...
     UFx(i,j)= [ fluxes for  ]
     UFe(i,j)= [ barotropic  ]
     VFx(i,j)= [ advection   ]
     VFe(i,j)= [ terms       ]
   enddo
   do i,j=...
     rubar(i,j) = rubar(i,j) + [ finite-difference expressions ]
     rvbar(i,j) = rvbar(i,j) + [ involving UFx,UFe,VFx,VFe ]
   enddo
          !--> discard UFx,UFe,VFx,VFe

which is exactly as it is done in step2d.F from v.1.8.

In principle one could treat UFx,UFe,VFx,VFe exactly the same way, so that
they and zwrk,gzeta,gzeta2,gzetaSA are treated symmetrically, i.e., leave
the array declaration as in the original version, but apply #define/#undef
for both sets of scratch variables,
Code:
   real, dimension(TILE_SIZE_ARRAY) :: wrk1,wrk2,wrk3,wrk4
...
#define zwrk wrk1
#define gzeta wrk2
#define gzeta2 wrk3
#define gzetaSA wrk4
   do i,j=...
     ....          compute, use, and discard
     ....          zwrk,gzeta,gzeta2,gzetaSA
   enddo
#undef gzetaSA
#undef gzeta2
#undef gzeta
#undef zwrk

#define UFx wrk1
#define UFe wrk2
#define VFe wrk3
#define VFx wrk4
   do i,j=...
     ....          compute, use, and discard
     ....               UFx,UFe,VFe,VFx
   enddo
#undef VFx
#undef VFe
#undef UFe
#undef UFx

In terms of substance the three versions above are exactly equivalent.
Which one to prefer is simply a matter of programming style and taste.
The point here is to give the human eye what it likes most (i.e.,
meaningful names), and to give the compiler/machine what they handle
best (i.e., fewer arrays and less memory).

I simply do not buy the argument that this horrendous practice
complicates writing the adjoint code.
In fact, I actually see the opposite.
After all, the adjoint is about tracking dependencies in the reverse,
down-to-top order. The second and third versions of the code visually
express the fact that data dependencies transmitted through wrk1,...,wrk4
originate and terminate within the two segments of the code, and are not
transmitted between the two. Therefore, the dependency chains are visually shortened.

If, for whatever reason, this explanation is not satisfactory, then one can
always revert to the first version of the code -- explicitly use array names
wrk1,....,wrk4. It is expressed in plain standard Fortran. It is not a big deal.


P.S.: The above also exposes a deficiency of Fortran and, as a matter
of fact, of any programming language. The comment line
Code:
 !--> discard wrk1,wrk2,wrk3,wrk4

should be replaced with a compiler directive
Code:
!$DIR DO_NOT_SAVE_FROM_CACHE_TO_MEMORY :: wrk1,wrk2,wrk3,wrk4

after the relevant enddo. Imagine wrk1,wrk2,wrk3,wrk4 go out of cache:
the machine will still handle them like any other variables, i.e., save them,
because the compiler cannot distinguish whether they will be needed in the
future or not, hence it will do the useless operation of storing them. There
must be a way to tell it not to, but I am not aware of any such directive.
C used to have the volatile attribute, but it is kind of outmoded nowadays.


PostPosted: Wed Feb 02, 2011 10:05 pm 
Offline
Site Admin
User avatar

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1051
Location: IMCS, Rutgers University
Writing the adjoint of simple codes is trivial. However, it becomes really tricky with advanced kernels, like the one in ROMS, which includes multiple time-levels, time-averaging, integration, quadratic or higher dependencies, recurrence, implicit algorithms, and so on. Sometimes, we need to save intermediate solutions when linearizing high-order dependencies on the state (adjointable, time-dependent) variables. So we need to strike a balance between storing and recomputing forward solutions in the intermediate computations. It requires a lot of skill and years of practice to do this accurately. Then, there is the issue of clarity and debugging, which may take months. For example, the adjoint of biological models becomes very difficult and complex as the number of compartments increases. If you are curious, check ad_npzd_iron.h, for example. The semi-Lagrangian sinking is a nightmare.

I recall playing with F90 pointer association of variables, for example:

Code:
      gzeta => SCRATCH(ng) % wrk2d(:,:,5)

the performance penalty was much higher than having automatic arrays. I don't recall the slow-down factor, but it was high for step2d. This pointer association (equivalence) implied copying and initialization in some compilers. This is not a good strategy to manage private arrays in step2d. I haven't checked the performance of this type of assignment in several years. Perhaps it is better now.


PostPosted: Sun Feb 06, 2011 8:56 am 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Quote:

People are going to more and more processors these days, not faster processors.
I've used about 100 at a time, but don't feel I can usefully use 1000. How do we get there?

Kate, do you mean "cores" or "CPUs" or MPI nodes or hardware nodes
(motherboards)? We routinely use 256 cores "in house", and the maximum,
I believe, some of us have used is up to 512 (personally I have used 384),
and I do not feel that I have hit the ceiling: the code runs faster on 384 than
on 256, and we do observe superlinear scaling from 128 to 256. But, of
course, we are running much larger problems than we used to. Typical
MPI subdomain sizes are 100 x 50 as seen from above, and we customarily
partition each MPI subdomain into 2 or 4 tiles -- this helps a bit.

Controlling your node placement (which MPI node runs on which
motherboard) is absolutely essential. A plain
Code:
mpiexec -np 256 roms

is just a non-starter. The correct way to go is
Code:
mpiexec -np 256 -machinefile machine_list  roms

where "machine_list" is the list of hosts in the proper order. The point is
to partition your grid into machines first, say, if -np 256 means 32 nodes 8
cores each (two quad-core CPUs), then partition your grid first to 32 using
the best perimeter vs. ratio idea (thus, ignoring the fact that there are
multiple processors inside) -- these "macro-subdomains" are approximately
squares. Then partition them: usually one-dimensionally, 1x8 (most of the
time), seldom 2x4, if the squares are not squares, but are elongated in
XI-directon. Then create machine_list in such a way that MPI-ranks
belonging to the same "macro-subdomain" end up on the same motherboard.
Optionally, depending on their size ans shape, partition "micro-subdomains"
into tiles.

In-house the names of the compute nodes are known, so the whole procedure
is done by hand. At NCSA the names of the nodes are not known a priori
before the job starts, but NCSA provides an environment variable containing
the list of the nodes just when the job starts. So dump this into a file, read
it with a special Fortran program, and create machine_list. Then run mpiexec.

I/O is starting to worry me, but it is not a showstopper yet. Thus far we have
by-passed the problem (not solved it): every MPI node writes its own
netCDF file. Post-processing tools to assemble the partial files into the
big one are provided, as well as tools to partition input netCDF files.
This makes I/O completely irrelevant as the scaling bottleneck (in house we
write onto the local scratch disk of each individual compute node; at NCSA
the Lustre filesystem is sufficiently fast, so it copes with it). Of course,
post-processing assembly requires baby-sitting your running job, but with
the number of cores we are using thus far, we are able to assemble data
faster than it is created by the running MPI job.

What to do with the data, besides admiring it using ncview, is another
problem. Most post-processing is done by Matlab, and this has certain
limits.

Quote:

I'm sure it wouldn't surprise you that our new cluster with two hex-core chips per
node is slower per core than our old cluster with two dual-core chips per node.

Going from dual- to six-core CPUs sounds like a big span of time to me.
Dual-, I guess, means either dual-core Opterons of the 27x-series or early
versions of 5000-series Xeons. That dates back to late 2006 -- early 2007.
Six-core means 5600-series Xeons -- the second generation of the Nehalem
core, starting from the second half of 2010.

No, cores only become faster and faster, and so do memory systems.
But getting the best utilization of them became a bit harder lately,
although from the historical point of view the tendency is not monotone.

Let's do some basic estimates. Imagine a mid-2004 Pentium 4 "Prescott" core.
Single processor, single core, 3.2GHz, 800MHz FSB, 1M cache. Memory is
dual-channel DDR (first generation of DDR=DDR1, also known as DDR400 and
PC3200). PC3200 means 3.2GBytes/sec bandwidth per channel, so the machine
has total bandwidth of 6.4GByte/sec, which translates into 800M/sec of
double-precision numbers [6.4 divided by 8 Bytes/number]. This is how
many numbers can go from main memory into L2 cache per second. At the same
time the CPU (meaning core) can load or store 1 double-precision number
per clock cycle, meaning that 1 number can go from L1 cache into registers
per each clock cycle of the core. The clock speed is 3.2GHz, so our
beloved Pentium 4 can load/store 4 times as many numbers from/into
its registers as the memory bandwidth of its 875P Northbridge can provide
[3.2G/sec numbers vs. 800M/sec].

Not a very promising idea, if all you intend to do is linear algebra,
Code:
     do i=1,some_very_large
        a(i)=b(i)+c(i)*d(i)
     enddo

Assuming a dream compiler capable of pre-fetching everything in advance,
the above loop would go at only 1/16 of the theoretical computational
peak speed of the machine [Pentium 4 can do one multiply-add per clock
cycle in a pipelined loop]. In practice, however, it will be even worse,
because it will go from one cache miss (stall) to the next, wasting about
40 clock cycles during each miss and loading 4 numbers from main memory
at a time.

Why 40 clock cycles?

The CAS latency for typical DDR of that era is 2.5 (ranging from 2.0 for HyperX
to 3.0 for typical ECC memory), meaning that the actual latency is 2.5 periods
of the 200MHz memory command clock. Hence 12.5 nanoseconds.

Why 4 numbers at a time?

Because its cache line is equivalent to 4 double-precision numbers. Very
simple: each DIMM has 184 pins, 128 of which are signals. Dual-channel
means that there are 2*128 = 256 signaling wires going from memory to
the Northbridge chip. 256 wires = 256 bits transmitted simultaneously,
or 32 Bytes, or 4 REAL(kind=8) numbers.

...Anyway, from now on just keep in mind the "can load/store 4 times
as many numbers" ratio highlighted above. This is our reference point.
Obviously, the only way to keep the floating-point units of that CPU busy
is to perform mathematical operations again and again on the same numbers
loaded into L2 cache.

How are other designs doing in comparison with this?

Xeons contemporary to that 2004 Pentium 4, together with their E7505 chip
set, are actually much worse. The same CPU clock speed, but the FSB is 667MHz
instead of 800MHz, and the two CPUs share the same memory bandwidth.
The above 4 times as many becomes 9.6 times as many.

Comparing it with slightly earlier Xeons of 2002-2003 -- the FSB was
even lower, 533 and 400MHz, and so was the memory speed (at that time and
earlier the memory clock was tied to the FSB, though Xeons, unlike Pentium 4,
had a BIOS-selectable step multiplier). It is easy to calculate that the ratios
were:

11.5 times as many for 3.06GHz/533MHz FSB Xeon, E7501 chip set,
533MHz dual-channel DDR memory.

A suffocating 14 times as many ratio for the 2.8GHz/400MHz FSB "Prestonia"
core Xeon, i860 chip set, RDRAM memory, further penalized by its 45
nanosecond memory latency.

This overall observation of Xeons having lower FSB/memory speed than
Intel's single-processor CPUs contemporary to them is referred to as the
Xeon gap in Poor Man's Computing. The final version of the Xeon of that
generation, called "Nocona", actually caught up with its Pentium 4 counterpart
in both FSB and clock frequency, resulting in 8 times as many -- obviously
2*4, because now 2 CPUs share the same memory bandwidth.

The first generation of AMD Opteron, the single-core 200-series, debuted in
2004 and was actually a much more balanced design. Say, the Opteron 248
-- 2.2GHz, but now each CPU has its own dual-channel memory system,
DDR 400MHz, resulting in only 2.75 times as many. This was the most
successful machine of its time: while the measured single-processor
performance running the ROMS code was slightly slower than that of the
3.2GHz Pentium 4 [comparing one Opteron (the second idle) vs. the Pentium],
the 248s decisively outperformed "Nocona" in a dual-CPU vs. dual-CPU
comparison, despite the latter having a much higher clock speed of 3.2GHz.

The dual-CPU Opteron 248 also outperformed the dual-CPU Macintosh G5
(built around IBM PowerPC CPUs), for those who remember it.

This balanced design was so successful that a drop-in replacement of the
Opteron 248 with the Opteron 275 (same 2.2GHz, but dual-core) resulted in a
still viable and competitive machine, comparable to the Core 2 Quad Q6600
arriving in late summer 2007. The ratio is 5.5 times as many.

Going to more modern times, we first note that Pentium 4 became
Pentium D in 2006, and DDR memory became DDR2, so now it is
dual-core, but memory also improved by nearly a factor of two -- in practice
slightly less. So the proportionality changed only slightly (became worse):
assuming a dual-core 3.2GHz Pentium D with dual-channel DDR2-667
results in 4.8 times as many.

Dual-core Xeon 5000-series -- the first generation of Xeons designed to
fit into the new socket 771 -- had exactly the same ratio, 4.8 times as many.
This is because this time Intel took a radically different approach in designing
chipsets for dual Xeon motherboards: the memory system remains shared
between the two CPUs, but it becomes quad-channel, thus doubling
not only the aggregate bandwidth, but also the bandwidth available to each CPU
individually. The memory ended up being FBDIMM (FB = fully buffered), and
the engineering merit of this is that it uses serial signals to communicate
with the Northbridge chip, somewhat similar to SATA and PCI-express.
The point is that it would be hard to increase the number of wires (due to
doubling the number of channels) and still keep arrival of signals travelling
through essentially a parallel bus in sync with each other. The
penalty here is the extra latency needed to convert the signals, extra cost,
and extra heat generated by each memory module (in practice it is actually
outrageous -- 4 DIMMs generate as much heat as 1 CPU).

Note that this is an overall well balanced design.

Core 2 Duo arrived sometime in late 2006, and Intel declared that Pentium D
and the overall Pentium architecture were a dead end. Intel said that it had
a better idea. The rumor was that Core 2 evolved from Core -- a family of
laptop CPUs, and Core evolved from .... the Pentium III "Tualatin" core.
Remember back in 2001 Pentium 4 was nicknamed "another recount for Al
Gore"? This is because several times Intel contested that P4 was faster
than Athlon, and every time Athlon beat it. And, embarrassingly enough,
the practically measured computing performance per clock cycle of
then new P4 was lower than that not only of Athlon, but also of Intel's
own earlier design of "Tualatin" core (the final version of Pentium III).
It took about a year or more for Intel to rectify the issue back then.

Whatever the cause, and whether or not it is true that Core 2 is a deeply
redesigned Pentium III, from this moment practical performance can
no longer be judged by CPU clock alone. A 2.4GHz Core 2 Duo is significantly
faster than a 3.2GHz Pentium D. And there is another reason to mention Al Gore
here: Core 2 Duo was the first Intel desktop CPU designed with
environmental consciousness in mind. The ratio? Formally speaking,
3.6 times as many, assuming DDR2-667 and a 2.4GHz clock. Again,
it can load one double-precision number from L1 to registers at each
clock cycle.

The situation changed for the worse when Core 2 Quad arrived in late
summer 2007. The ratio doubled: 7.2 times as many, assuming a 2.4GHz
Q6600 with DDR2-667 (around Atmos. UCLA it is known as the "student
computer"). The raw performance is excellent: in practical ROMS computing,
using all resources available, it is six times faster than our
canonical Pentium 4, assuming optimal tiling. The scaling from 1 to
2 threads is 1.9, and then going from 2 to 4 is another 1.3 ... 1.4, so
it is about 2.5 total when comparing a single thread vs. 4 threads.

WRF people are not very happy: WRF does not scale well.

Quad-core Xeon 5400-series in 2008. Our primary "in house" workhorse,
the E5420, is a 2.5GHz/1333MHz FSB, 12MB cache quad-core Xeon,
based on essentially the Core 2 core, with quad-channel DDR2-667
FBDIMM memory. The ratio is 7.5 times as many. ...Actually it starts
looking like "Nocona" 5 years earlier. Of course, the giant 12MB L2 cache
mitigates this, but only if the code is optimized for cache.

Core i7 920. This is the first CPU of the i7 family. We have 4 of them here
in our group at UCLA. The clock speed is 2.66GHz, 1333MHz FSB, 8 MByte cache.
The clock speed is only slightly faster than the Q6600, and the cache size is
the same (although it is now unified and shared among all four cores, not
4MB + 4MB). What is radically new is the triple-channel DDR3 memory, resulting
in 3 times the bandwidth (a factor of 2 due to DDR3-1333MHz vs.
DDR2-667MHz, and another factor of 1.5 due to triple- vs. dual-channel).
The suffocating ratio is pushed back to 8/3 = 2.666 times as many.
This is a very balanced design! In practical ROMS computations using
all resources available, this machine is 2.1 times faster than the Q6600, and
it matches/outperforms a dual-E5420 machine as well.

Quad-core Xeon 5500-series. ...I guess, the same as i7: ~2.5 times as
many, depending on clock speed. Unlike all previous Xeon designs, each
CPU has its own memory system (like Opterons), so having two CPUs on
the board also doubles the memory bandwidth. Though they are significantly
more expensive than an i7 with the same clock speed, so the clock speeds are
usually lower, hence the ratio.

Hexa-core Xeon 5600-series. Same as above, but now they are packing
six cores sharing the same memory bandwidth. The ratio is ~4 times as
many ...the same as the Pentium 4 we started with.

I think this explains it all.

P.S.: CAS latency has not gone anywhere during the same period of
time since early 2004: DDR2-667MHz CAS 5 translates into the same
~15 nanosecond delay as DDR400 CAS 3, and so, nearly, does
DDR3-1333MHz CAS 9.


PostPosted: Sun Feb 06, 2011 10:58 pm 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
Kate, Hernan, Bárður, and whoever cares,

Here is a little test program to illustrate the finiteness of cache lines -- the fact
that a cache line holds more than just one number -- as well as the pitfalls
associated with memory system architecture. It is worth spending 15 minutes playing with it.

All the program does is perform an "in-place" transpose of a fairly large
out-of-cache square matrix,
Code:
        A(i,j) ---> A(j,i)

repeated a sufficient number of times (an odd number) to get reliable timing results.

The idea came from an SGI training workshop I participated in 15 years ago while
still in Tallahassee, FL. I do not remember the last name of the SGI person who
gave the lecture, and I no longer have the "SGI Green Power Book" -- his lecture
notes. His first name is Jerry, and he was around 50 at that time. It was a 3-day
workshop, and he was talking all day long all three days.

But I remember the idea of what he wanted to show, so now I am making it up myself.

Just copy-paste the code below into a file called "transp.F", compile it,
Code:
 ifort -o transp transp.F

and run it as
Code:
 transp

or, for example,
Code:
 transp 32

(it can take one argument, an integer number).

(1) play with the number m=32 -- make it 8, 10, 16, 30, 32, 60, 100, 200, etc.
See how long it takes, and whether you find an optimum.

(2) recompile it using -O3
Code:
 ifort -O3 -o transp transp.F

and see whether it makes a difference relative to -O = -O2 (the default).

(3) change matrix size,
Code:
integer, parameter :: N=4100

to N=4096 and also try N=4095, N=4097, N=4098, N=4099 and see whether
you observe any sensitivity. Use your best known m, as well as m=0, or just
play with it.

If you chose to reply, please, also specify details about your hardware
(CPU, memory type, number of channels), compiler, etc.

The code is as follows:
Code:
      program transp

! A program to demonstrate effect of cache_line length going beyond
! storing just a single number.  It performs "in-place" transpose of
! a large-size (out of L2-cache) square matrix and reports time needed
! to do so.  Two  mathematically equivalent algorithms are compared
! against each other: "transp_simple" and "transp_blocked".  To use:
! compile and run it as
!
!       transp
! or
!       transp m
!
! where "m", and integer number, is block size. Setting m=0 (same as
! having no argument) causes the use "transp_simple", while m>0 uses
! "transp_blocked" which does the transpose using square blocks of
! m X m.  The purpose is to! demonstrate that "m" matters and there
! is an optimal block size.

      implicit none
      integer, parameter :: N=4100
      integer(kind=4) :: A(N,N), error
      common /AA/ A
      character(len=8) arg
      integer bsize, iter, i,j, iargc
      integer iclk_start, iclk_end, iclk_rate, iclk_max

      if (iargc() == 1) then
        call getarg(1,arg)
        read(arg,'(I8)') bsize
        write(*,*) 'bsize=', bsize
      else
        bsize=0
      endif

      write(*,*) 'initializing...'
      do j=1,N
        do i=1,N
          A(i,j)=i+(j-1)*N
        enddo
      enddo

      call system_clock (iclk_start, iclk_rate, iclk_max)
      write(*,*) 'starting transpose...'
      do iter=1,161  !<-- this must be an odd number!!
        if (bsize.gt.0) then
          call transp_blocked(A, N, bsize)
        else
          call transp_simple(A,N)
        endif
      enddo
      call system_clock (iclk_end, iclk_rate, iclk_max)
      write(*,'(/1x,A,F12.4,1x,A/)') 'Elapsed Wall Clock Time =',
     &          dble(iclk_end-iclk_start)/dble(iclk_rate), 'sec'

      write(*,*) 'checking...'
      error=0
      do j=1,N
        do i=1,N
          error=max(error, abs(j+(i-1)*N -A(i,j)))
        enddo
      enddo
      write(*,*) 'error =', error
      stop
      end

      subroutine transp_simple(A, N)
      implicit none
      integer N,i,j
      integer(kind=4) :: A(N,N), tmp
      do j=1,N
        do i=1,j-1
          tmp=A(i,j)
          A(i,j)=A(j,i)
          A(j,i)=tmp
        enddo
      enddo
      return
      end

      subroutine transp_blocked(A, N, bsize)
      implicit none
      integer N, bsize, i,j, nblocks, ii,jj, istr,iend, jstr,jend
      integer(kind=4) :: A(N,N), tmp

      nblocks=(N+bsize-1)/bsize  !<-- division with roundoff up

      do ii=0,nblocks-1                 ! Processing blocks
        istr=1 + ii*bsize               ! located on the main
        iend=min(istr+bsize-1, N)       ! diagonal
        do j=istr,iend
          do i=istr,j-1
            tmp=A(i,j)
            A(i,j)=A(j,i)
            A(j,i)=tmp
          enddo
        enddo
      enddo

      do jj=0,nblocks-1                 ! Processing
        jstr=1 + jj*bsize               ! OFF-diagonal
        jend=min(jstr+bsize-1, N)       ! blocks
        do ii=0,jj-1
          istr=1 + ii*bsize
          iend=min(istr+bsize-1, N)
          do j=jstr,jend
            do i=istr,iend
              tmp=A(i,j)
              A(i,j)=A(j,i)
              A(j,i)=tmp
            enddo
          enddo
        enddo
      enddo
      return
      end


PostPosted: Sun Feb 06, 2011 11:23 pm 
Offline
Site Admin
User avatar

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1051
Location: IMCS, Rutgers University
Thank you. I will check sometime this week on the computers that I use for ROMS development, testing, and debugging. I have two 4-CPU iMacs, one at home and the other at the office. The one at the office is faster than the one at home. It will be interesting to see how fast they are; we bought them recently. I also have an 8-CPU Linux desktop. The nice thing is that I can control everything that runs on them. I have noticed that the iMacs give me a lot of variation in the elapsed time for the same application. I haven't been able to figure out why. I am not using any external communications to our storage disks.


PostPosted: Mon Feb 07, 2011 12:42 pm 
Dear all:
Maybe this is the place to ask this question: Why do I get such different performance from similar machines?
I have two machines:
1.- HP xw8600, with 2 Intel Xeon Quad Core X5450@3 GHz, RAM 16 GB DDR-2@667 MHz, and writing down to 2x15000 rpm RAID 0 drives. Running Ubuntu 10.10 (maverick) 2.6.35-24 server.
WS> uname -a
Linux ieofisica 2.6.35-25-server #44-Ubuntu SMP Fri Jan 21 19:09:14 UTC 2011 x86_64 GNU/Linux

2.- Mac Pro (early 2009), with 2 Intel Xeon Quad Core X5500@2.26 GHz, RAM 16 GB DDR-3@1066 MHz, and writing down to 2x7000 rpm RAID 0 drives. Running Mac OS 10.6.6
Mac> uname -a
Darwin Mac-Pro-de-Rosa-Balbin.local 10.6.0 Darwin Kernel Version 10.6.0: Wed Nov 10 18:11:58 PST 2010; root:xnu-1504.9.26~3/RELEASE_X86_64 x86_64


They are running exactly the same problem with the same input files. In the end, (Total HP time) / (Total Mac Pro time) = 1.76.

I will try Shchepetkin's checks, but I am not sure whether I am making some very basic mistake. I attach a summary of the output of both machines -- the only lines where the differences are relevant. Thanks


Attachments:
HP.txt [3.55 KiB]
Downloaded 78 times
MacPro.txt [3.57 KiB]
Downloaded 74 times
PostPosted: Mon Feb 07, 2011 5:20 pm 
Offline
User avatar

Joined: Fri Nov 14, 2003 4:57 pm
Posts: 183
HP.txt
Quote:
Compiler flags : -heap-arrays -fp-model strict -openmp -fpp -ip -O3 -msse2 -free
Resolution, Grid 01: 0384x0176x030, Parallel Threads: 8, Tiling: 001x016

Compiler flags are not optimal: -heap-arrays ---> -no-heap-arrays, but you have to adjust the stacksize limit.
Instruction set: -msse2 --> -xSSE4.1, because your processor is a Xeon 5400-series, not a Pentium 4 "Northwood"
or an Opteron 248.

Tiling is not optimal: my guess is you may try 4x22 or 3x16, but overall this is worth
some experimentation.

...Read the earlier posts on this thread from the very beginning -- they may explain these issues.

MacPro.txt
Quote:
Compiler flags : -heap-arrays -fp-model strict -openmp -fpp -ip -O3 -axP -free
Resolution, Grid 01: 0384x0176x030, Parallel Threads: 8, Tiling: 001x016

Exactly the same issues as above, except -axP --> -xSSE4.2

Mac Pro Hardware Setup
Quote:
Mac Pro (early 2009), with 2 Intel Xeon Quad Core X5500@2,26 GHz, Ram 16 GB DDR-3@1066 MHz..

Intel 5500-series Xeons have a tri-channel memory controller. You have two of them,
so your total memory should be divisible by 6, that is, 12GB or 24GB.

Are you running it in tri-channel, or dual-channel, or some kind of mixed mode?
Can you verify through the BIOS or something that it is indeed a tri-channel configuration?

I never saw a MacPro inside, but I do know that some of the LGA1366 board designs
are very confusing: they have tri-channel capability, but also have a 4th memory slot,
for example, Intel's own DX58SO,
http://www.newegg.com/Product/ImageGallery.aspx?CurImage=13-121-361-S03&ISList=13-121-361-S01%2c13-121-361-S02%2c13-121-361-S03%2c13-121-361-S04%2c13-121-361-S05&S7ImageFlag=1&Item=N82E16813121361&Depa=0&WaterMark=1&Description=Intel%20Extreme%20Series%20BOXDX58SO%20LGA%201366%20Intel%20X58%20ATX%20Intel%20Motherboard
The instructions say that "for best memory performance" (i.e., tri-channel configuration)
you should install 3 identical sticks of memory into the blue slots and put nothing into the black one.
This makes perfect sense. What does not make any sense is why Intel designed it to
have the black slot instead of having no black slot at all, or to have six (three blue and
three black, like most other LGA1366 boards). The same applies to 5500-series Xeon
boards.

...Chances are that you may have to open your Mac and remove the two excess
memory sticks to regain the tri-channel symmetry of each CPU's memory system.

Finally,
Quote:
...running exactly the same problem... At the end (Total HP time) / (Total Mac Pro time) = 1.76

Assuming that all you do is linear algebra limited solely by memory access
(hence completely neglecting the time needed to do the arithmetic operations), and
assuming that your Mac Pro is tri-channel (needs to be confirmed), so that you have a total
of 6 DDR3 channels vs. the 4 DDR2 channels of the HP machine, the ratio should be:
Code:
          6 * 1066MHz
         -------------- = 2.4
           4 * 667MHz

On the other hand, neglecting the time spent on memory access and assuming that all
the code does is arithmetic calculations (a grossly naive assumption in practice),
the ratio should be the ratio of CPU clock speeds times the number of cores,
Code:
         8 * 2.26GHz
        -------------- = 0.753
          8 * 3.0GHz

In reality you should get something in between -- a kind of weighted combination of the two
factors. A very crude model is that if your HP time is
Code:
   X + Y

where X is computational time and Y memory access time, then your Mac Pro time should be
Code:
        X          Y
     -------- +  ------
      0.753       2.4

Knowing the practical ratio of 1.76, you can even get an idea of the X/Y ratio for your code.

[Note this model is crude because it neglects the fact that computations and memory
accesses are overlapped in time; Y is really a kind of "cache miss time" -- time spent
when the processors/cores stall because of cache misses.]
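As an illustrative sketch (not from the thread itself), the crude model above can be inverted numerically to estimate X/Y from the observed 1.76 speedup, using the rounded 0.753 and 2.4 ratios quoted in the post:

```python
# Crude two-term timing model from the post:
#   HP time      = X + Y                (X = compute time, Y = memory-stall time)
#   Mac Pro time = X/0.753 + Y/2.4
# Observed: (Total HP time) / (Total Mac Pro time) = 1.76.  Solve for X/Y.
cpu_ratio = 0.753   # (8 cores * 2.26 GHz) / (8 cores * 3.0 GHz)
mem_ratio = 2.4     # (6 channels * 1066 MHz) / (4 channels * 667 MHz)
speedup   = 1.76

# Let f = X/(X+Y); the model gives  f/cpu_ratio + (1-f)/mem_ratio = 1/speedup.
f = (1.0 / speedup - 1.0 / mem_ratio) / (1.0 / cpu_ratio - 1.0 / mem_ratio)
x_over_y = f / (1.0 - f)
print(f"X/Y = {x_over_y:.2f}")  # prints X/Y = 0.20
```

With these numbers X/Y comes out near 0.2, i.e. roughly five times more time is lost to memory stalls than to arithmetic, consistent with the suspicion that this kind of code is bandwidth-limited rather than CPU-limited.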


PostPosted: Thu Feb 10, 2011 11:30 am 
Thank you very much for your interest and for the answer.

I will check with the -no-heap-arrays flag, probably next week.
My stacksize limit is 16M
Mac> limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 16384 kbytes
coredumpsize 0 kbytes
memoryuse unlimited
descriptors 256
memorylocked unlimited
maxproc 266

and
WS> limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 16384 kbytes
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 64 kbytes
maxproc unlimited
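Since -no-heap-arrays places automatic arrays on the stack, a 16MB stacksize limit can become a constraint for large tiles. As a hypothetical sketch (not from the post), the current limit can be inspected and raised toward the hard limit from Python before launching a run; whether the raise succeeds depends on your OS and privileges:

```python
import resource

# Read the current stack size limits (in bytes);
# resource.RLIM_INFINITY corresponds to "unlimited" in the shell output above.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("soft:", soft, "hard:", hard)

# Try to raise the soft limit to the hard limit -- an unprivileged
# analogue of "limit stacksize unlimited" / "ulimit -s unlimited".
try:
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
except (ValueError, OSError):
    pass  # the hard limit itself cannot be raised without privileges
```

In practice the same effect is usually obtained in the shell (csh `limit stacksize unlimited`, bash `ulimit -s unlimited`) before starting the model, plus OMP_STACKSIZE for the per-thread stacks in OpenMP runs.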


Quote:
Instruction set: -msse2 --> -xSSE4.1, because your processor is a Xeon 5400-series, not a Pentium 4 "Northwood"
or an Opteron 248.
Ok, I will. I am not an expert on this; I was only using Intel's suggestions for the deprecated options without understanding what was going on behind them. The same applies to the Mac Pro.

I tried to optimize the tiling for the Mac Pro, and the attached med_benchmark_tiles shows what I got. I also played with the number of threads (attached med_benchmark_threads); I did not include the number I got using 16 threads for the 1x16 case because it was so bad that I discarded it. As I was happy with the results for the Mac Pro, I used the same settings for the HP. Maybe this is not correct.

Quote:
Intel 5500-series Xeons have a tri-channel memory controller. You have two of them,
so your total memory should be divisible by 6, that is, 12GB or 24GB.

Are you running it in tri-channel, or dual-channel, or some kind of mixed mode?
Can you verify via the BIOS or similar that it is indeed a tri-channel configuration?
Good question. No idea; I am not sure of the answer. These are the Mac Pro technical specs: http://support.apple.com/kb/SP506
The only thing I could find regarding memory was a note in the manual that explains how to handle the memory slots:
Quote:
Note: Populating slot 4 or 8 slightly drops maximum memory bandwidth, but depending on the applications used, overall system performance may benefit from the larger amount of memory.
http://manuals.info.apple.com/en_US/Mac ... Ms_DIY.pdf
It looks like 12GB is better than 16GB, at least under the conditions of Bare Feats' benchmark, which suggests the Mac Pro is forced into dual-channel mode with all slots filled. http://barefeats.com/nehal04.html
But, this is what intel says http://www.intel.com/support/motherboar ... 011965.htm
I could not find any clear answer on the Apple site.
I have 16GB installed, but my ROMS runs never use more than 1GB. I only run out of memory using Matlab, and that is probably because I am doing something wrong. I will check everything again after removing 2 of my 8 DIMMs.
Up to now I have been happy with the Mac Pro's performance, but it would be better if I can improve it. I would also like to improve the HP's performance. I will check your suggestions.
Thanks again.


Attachments:
med_benchmark_threads.jpg
med_benchmark_tiles.jpg
PostPosted: Thu Feb 24, 2011 2:44 pm 
Offline
Site Admin
User avatar

Joined: Wed Feb 26, 2003 3:12 pm
Posts: 116
Location: IMCS, Rutgers University
This thread was getting off topic, so I moved the discussion about optimization flag errors to a new thread:

viewtopic.php?t=2180

Please use the above topic for discussion of ifort optimization flag issues on Intel i7 machines, and reserve this topic for performance-related posts.

Thank you.

