Serial faster than OpenMP

leommcruz · Tue Sep 28, 2010 10:13 pm

Hi all,

I'm running a 200x256x30 (x,y,z) grid using ifort and OpenMP on a 64-bit, 8-CPU machine:

model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
cpu MHz : 2333.414
cache size : 6144 KB

I have two questions:

1) My serial runs are always faster than the OpenMP ones, no matter which tiling I use.
With a 1x8 tiling I got the best OpenMP result, but it was still not faster than the serial one.

2) My log file after the OpenMP runs indicates a CPU time much larger than the run time measured with a stopwatch. This issue does not occur when I run the serial cases.

PS: the machine is doing nothing but running my ROMS cases (one at a time).

Thank you very much.

leonjld · Wed Sep 29, 2010 1:30 am

In my experience, at least for ROMS ver. 3.3, an OpenMP run shows the CPU time incorrectly. For example, if I run for 5 minutes with 8 threads, the ROMS output shows 40 minutes (which I think is 5 minutes times 8 threads) for each thread (every thread reports exactly the same CPU time).

So you should not estimate the running time from the CPU time in the ROMS output.

flcastej · Fri Oct 01, 2010 6:07 am

I can't give you an answer, but I did some research on auto-optimization with OpenMP for Coherens (another ocean model), and one of the conclusions was that using more CPUs does not mean that your model will run faster. The optimum number of cores depends on the grid, the machine where you run the model, whether you are reading files, and so on.

Here is the link to the abstract in case it helps.

http://servinf.dif.um.es/~domingo/10/WA ... stract.pdf

If you resolve your problem, please post the solution.

leommcruz · Fri Oct 08, 2010 11:48 am

Dear all,

It seems to me that the CPU time really is a bug in the OpenMP runs. The CPU time shown in the log file is always much longer than the actual run time.
Regarding the tile option, in my case the best results I got were with a 2x2 tiling (remember my grid is 200x256x30).
I also tried running benchmark3 with different tilings, and there I saw better performance in the OpenMP runs. An 8x1 tiling gave me a run 1.5 times faster than the serial one.
I'm testing another machine with faster processors, and as soon as I get new results I'll post them here.

Cheers

leonhardherrmann · Sat Oct 09, 2010 10:31 am

I think the CPU time should be the sum of the CPU times of all threads.
Since a parallel run on P processors never takes just 1/P of the sequential run time (on SMPs), the summed CPU time will be higher than the sequential time.
As long as that sum is lower than P*(sequential time), you should be faster. (This is not exactly true if one processor takes more time than the others, but roughly.)
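
The rule of thumb above can be sketched in a few lines of Python (used here only for illustration; the function name and the numbers are hypothetical and do not come from ROMS). It assumes perfectly balanced threads, so that the wall-clock time is roughly the summed CPU time divided by P.

```python
def openmp_beats_serial(summed_cpu_time, serial_time, n_threads):
    """Rough check: is the summed CPU time below P * (serial time)?

    Assumes perfectly balanced threads, so the wall-clock time is
    approximately summed_cpu_time / n_threads.
    """
    estimated_wall_time = summed_cpu_time / n_threads
    return estimated_wall_time < serial_time


# Hypothetical timings in minutes: 8 threads, serial run takes 10 minutes.
print(openmp_beats_serial(40.0, 10.0, 8))  # estimated wall time 5 min
print(openmp_beats_serial(96.0, 10.0, 8))  # estimated wall time 12 min
```

This only makes sense if the reported CPU time really is a per-thread sum, which is exactly what is in doubt in this thread.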

Also, if your code seems to run serially over each tile (did you watch the processor load while the code was running?), then it sounds like you or ROMS forgot to set the number of OpenMP threads to 8 (for example via the OMP_NUM_THREADS environment variable).

It has been a long time since I actively looked at ROMS, so I can't give you any better information. I hope this is useful.

shchepet · Sat Oct 09, 2010 6:14 pm

leonhardherrmann
the cpu time should be the sum of all the cpu-times together
since a parallel run is never 1/P as fast as a sequential run (in SMPs),
the sum of cpu time will be higher.

ROMS reports two times: CPU time spent and wall-clock time. This is the intent.
Wall-clock time is easy to understand. CPU time spent is a bit more confusing, and
there is also a historical legacy inherited by ROMS that leads to confusion (including
mis-setting of the related CPP flag).

The following is to clarify:

First, there are (actually there were) two kinds of threads: kernel-level threads and
user-level threads. Second, there are differences among operating systems
(including different versions of Linux and of the Linux kernel) in how the CPU time consumed
by a job is reported.

In the past, in the good old days of SGI computing (PowerChallenge and Origin 200/2000),
when you ran a shared-memory parallel code (OpenMP was not officially standardized
then, but did exist in the form of SGI/Workshop directives) and used the "top" command to
monitor what was going on, you could see all your threads appearing under different PIDs.
You could also observe imbalance among the threads by noting that some of them
accumulated more CPU time than others: if your code was not fully parallelized and there
were sections done by a single process, you could clearly observe it with the "top" command.
The CPU time accumulated by each thread (as reported by "top") was always "reasonable":
slightly less than the wall-clock time, and matching %CPU (reported by "top") multiplied by
the wall-clock time. These are kernel-level threads.

What is CPU time? It is literally the time accumulated by a thread while it occupies
a CPU. If, for example, your job has intense I/O, the time spent reading from or writing
to the disk does not count. Consequently, this kind of job never gets close to
100% CPU as reported by "top".

Early Linux systems (Mandrake 7.2 up to 9.1, Red Hat 7.1 to 8) back in 2000 -- 2002
behaved in a similar way.

The ROMS code of that time was adapted to do what it is supposed to do: (1) report the CPU time
consumed by each thread individually; (2) add them up --> this is the CPU time consumed by
the job overall; and (3) report the wall-clock time.

Later Red Hat (I believe it was Red Hat 9) made a big publicity stunt and declared
that their kernels were "native posix thread compliant" (obviously implying that nobody
else's were), and appended the suffix -nptl to the names of their kernels. The threads became
user-level threads. They are supposedly light-weight, meaning easier/cheaper
to create and close, as well as to manage, suspend, and swap in/out of the CPU.
They are no longer viewed as separate processes by the operating system, but are
considered "different threads of the same process". As a result, they no longer
receive different PIDs, so the "top" command does not distinguish them. There is absolutely
no way to see/diagnose the load imbalance. And this was all done for the convenience of
users. In reality Red Hat screwed up badly at that time. Performance sucked. I mean
it did suck. So did operating system stability: OpenMP jobs were locking up.
They were not before.

[Personally, at that time I said to myself: no more Red Hat. Contemporary Mandrake
beat them decisively by all measures. Three years later I gave them another
chance: Fedora 4 vs. Mandriva 2006 ==> same outcome.]

[Not to mention that at approximately the same time Intel decided to screw up its
compiler, and the parallel performance of the v. 7.1.xxx series compilers was much worse
than that of 6.1.xxx.]

This type of "native posix thread compliant" behavior was as follows: "top" sees
only one process with %CPU not exceeding 100%, but if you keep looking, you notice
that it accumulates CPU time faster than the wall clock. ROMS reports the same PID for
all the threads, and the same CPU time for each thread reported individually.

Mandrake did not bother with "native posix thread compliant" for about one year
after Red Hat's revolutionary move (thus sticking with the good old behavior), but made
a similar switch sometime later.

For a while it was just an inconvenience. Then you became used to it, though never
liked it. The only way to see that the job running under a given PID is indeed a
multi-threaded job is to go to the /proc/PID_NUMBER directory corresponding to that job
and look at the subdirectories belonging to individual threads.
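
As a minimal sketch of that /proc inspection, the Python snippet below lists the thread IDs of a process. It assumes a Linux kernel that exposes per-thread entries under /proc/<PID>/task; the function name is hypothetical, and on other systems it simply returns an empty list.

```python
import os


def list_thread_ids(pid):
    """Return the thread IDs found under /proc/<pid>/task (Linux only)."""
    task_dir = f"/proc/{pid}/task"
    if not os.path.isdir(task_dir):
        return []  # not Linux, or no such process
    # Each subdirectory of task/ is named after one thread ID.
    return sorted(int(name) for name in os.listdir(task_dir) if name.isdigit())


# Inspect the current process; a running OpenMP job would show one ID per thread.
print(list_thread_ids(os.getpid()))
```

To look at a ROMS job, pass the PID shown by "top" instead of os.getpid().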

Finally, back in 2006 the behavior of "top" command was changed again: modern operating
systems report %CPU exceeding 100% for multi-threaded jobs. This is more convenient.

To reflect these legacies, you may see the CPP switch KERNEL_THREADS affecting "timers.F"
(at least in the UCLA and AGRIF branches):

# undef KERNEL_THREADS
....
....
# ifdef KERNEL_THREADS
! kernel-level threads: each thread reports its own CPU time, so sum
        CPU_time_ALL(1)=CPU_time_ALL(1) +CPU_time(1)
        CPU_time_ALL(2)=CPU_time_ALL(2) +CPU_time(2)
        CPU_time_ALL(3)=CPU_time_ALL(3) +CPU_time(3)
# else
! user-level threads: every thread reports the same time, so take max
        CPU_time_ALL(1)=max(CPU_time_ALL(1), CPU_time(1))
        CPU_time_ALL(2)=max(CPU_time_ALL(2), CPU_time(2))
        CPU_time_ALL(3)=max(CPU_time_ALL(3), CPU_time(3))
# endif
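
To see why the switch matters, here is a small Python sketch (not ROMS code; the function name and the numbers are illustrative assumptions) of the two ways of combining per-thread timings. The example echoes the case reported earlier in the thread: 8 threads, 5 minutes of wall clock, and every thread reporting the same 40 minutes.

```python
def total_cpu_time(per_thread_cpu, kernel_threads):
    """Combine per-thread CPU times the way the two CPP branches do.

    kernel_threads=True:  each thread reports only its own time, so sum.
    kernel_threads=False: each thread reports the process total, so take max.
    """
    if kernel_threads:
        return sum(per_thread_cpu)
    return max(per_thread_cpu)


# Every thread reports the same process-wide 40 minutes.
reported = [40.0] * 8
print(total_cpu_time(reported, kernel_threads=False))  # 40.0
print(total_cpu_time(reported, kernel_threads=True))   # 320.0
```

Summing values that are already process-wide totals overcounts by a factor equal to the number of threads, which is the symptom of a mis-set flag.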

In summary, check your "timers.F". It is more likely than not that your operating
system and compilers behave according to the native posix standard, that is, your threads
are user-level, so do not add up the CPU times of individual threads.

In any case, as a sanity check, the total CPU time reported by ROMS must be
close to, but not exceeding, the wall-clock time multiplied by the number of threads.
Why not exceeding? Because no thread gets more than 100% CPU.
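
This sanity check is easy to script. The Python sketch below is only an illustration with hypothetical names and numbers: it returns the fraction of the available CPU time that was used, which should be close to, but not above, 1.

```python
def cpu_utilization(total_cpu, wall_clock, n_threads):
    """Fraction of the available CPU time (wall_clock * n_threads) used."""
    return total_cpu / (wall_clock * n_threads)


def cpu_time_is_plausible(total_cpu, wall_clock, n_threads):
    """True if the reported total does not exceed wall_clock * n_threads."""
    return cpu_utilization(total_cpu, wall_clock, n_threads) <= 1.0


# Hypothetical timings in minutes: 5 minutes of wall clock on 8 threads.
print(cpu_time_is_plausible(38.0, 5.0, 8))   # 38 <= 40
print(cpu_time_is_plausible(320.0, 5.0, 8))  # 320 > 40, timers are overcounting
```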

shchepet · Sat Oct 09, 2010 7:04 pm

What also caught my attention is

leommcruz
cpu: Intel(R) Xeon(R) CPU E5410 @ 2.33GHz cache size : 6144 KB
I'm running a grid with 200x256x30 ...With a 1x8 tile I got the best openmp result ...

and

leommcruz
Regarding the tile option, in my case the best results I got were using 2x2 tile (my grid - 200x256x30)....
I also tried running benchmark3 with different tiles and saw a better performance in the openmp runs.
An 8x1 tile gave me a run 1.5 times faster than the serial....

It sounds like the number of tiles was chosen just to match the number of CPU cores in the OpenMP job.
This is not the optimal way of running ROMS in this kind of environment, as was already discussed in
a parallel thread on this forum, viewtopic.php?f=17&t=2001
