Query: GPU Offload "Math Co-Processor" with ROMS?

timchipman · Tue Nov 27, 2007 5:49 pm

Hi all,

I've been reading on and off about this topic over the past year, and figured I should ask the forum whether anyone has looked at how relevant this approach might be for speeding up ROMS model runs: using the "GPU as math co-processor", something many video card makers are pushing these days.

Nvidia has a web site,

http://www.developer.nvidia.com/object/cuda.html

which goes into pretty good detail about the types of "math routines" that are suitable for, and easily offloaded to, the GPU. It seems this can be done "fairly easily" using C/C++ routines to hand appropriate lumps of work to the GPU in a suitably equipped system. They claim (for example) that some offloaded code can yield an order-of-magnitude boost in performance, with the GPU effectively acting as a "highly parallel SMP coprocessor".

I was just curious whether anyone has real-world experience with this, or whether it is in fact a big messy can of worms and too much trouble to implement.

Thanks,

--Tim Chipman

kate · Tue Nov 27, 2007 6:00 pm

My boss is looking to get such a test system to play with. He says they are good at single precision floating point, but double precision is not so fast. I think the people who would get the easiest benefit are those using libraries such as LAPACK, where you just change your function calls or the library you link to. Unfortunately, ROMS is not such a code. Let us know if you find out it's easier than I'd guess.

ce107 · Wed Nov 28, 2007 6:08 pm

kate wrote: My boss is looking to get such a test system to play with. He says they are good at single precision floating point, but double precision is not so fast. I think the people who would get the easiest benefit are those using libraries such as LAPACK, where you just change your function calls or the library you link to. Unfortunately, ROMS is not such a code. Let us know if you find out it's easier than I'd guess.

Nvidia's hardware currently doesn't do DP FP at all (it's in the plans, however); it's all SP FP. AMD's latest GPU just added DP FP, at reduced speed relative to single precision (1/2 or less?). Nvidia offers a BLAS library (I'm not sure about LAPACK, but it can be built on top of it), and AMD offers ACML for GPUs, which is BLAS + LAPACK + FFT + some more. As you correctly point out, though, for Fortran user code one has to:

1. rewrite a performance-critical subroutine in CUDA (Nvidia) or Brook (AMD);
2. write interface code between the GPGPU code and the Fortran code (I'm not sure how that is done for Brook);
3. hope there is enough work for the GPU to do that the cost of moving data between the GPU and main memory is outweighed by the performance gains (which are indeed substantial).
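
The trade-off in the last point can be put into rough numbers. The following is only a back-of-the-envelope sketch in C, not a measurement of any real card: the struct `OffloadEstimate`, its fields, and both function names are hypothetical, and every input would have to be measured on the actual system. It assumes the transfer and the GPU computation do not overlap.

```c
#include <stdbool.h>

/* Hypothetical cost model: all four inputs are assumptions to be measured. */
typedef struct {
    double cpu_seconds;       /* time the routine currently takes on the CPU */
    double kernel_speedup;    /* how many times faster the GPU computes it   */
    double bytes_moved;       /* bytes copied to the GPU plus bytes copied back */
    double bus_bytes_per_sec; /* sustained host <-> GPU transfer rate        */
} OffloadEstimate;

/* Estimated wall time when offloading: transfer time plus GPU compute time. */
double offload_seconds(const OffloadEstimate *e)
{
    double transfer = e->bytes_moved / e->bus_bytes_per_sec;
    double compute  = e->cpu_seconds / e->kernel_speedup;
    return transfer + compute;
}

/* Offloading only helps if the total beats simply staying on the CPU. */
bool offload_pays_off(const OffloadEstimate *e)
{
    return offload_seconds(e) < e->cpu_seconds;
}
```

With made-up inputs: a routine that takes 1 s on the CPU, runs 10x faster on the GPU, and moves 100 MB over a 1 GB/s bus comes out at about 0.2 s, so it pays off. The same transfer wrapped around a 0.01 s routine costs about 0.1 s and is a net loss, which is why keeping data resident on the GPU across many time steps matters.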

longmtm · Sat Dec 01, 2007 9:27 pm

Tough but certainly a very good direction to go.

The gaming industry and graphics gang certainly have more money and power than the Godzilla-sized supercomputer sect.

Optimizing the whole of ROMS for GPUs would be quite challenging, especially since ROMS is still evolving rapidly. Yet I think focusing on a critical part of it, such as main2d(), is a good starting point and would give enough learning experience in GPU programming.

We plan to implement a shallow water wave equation model first.

Wen
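
For readers wondering what such a first test case might look like, here is a minimal sketch in plain C of one time step of a 1D linearized shallow water model on a staggered grid, using a forward-backward scheme. It is an illustration only, not ROMS's main2d() and not the model planned above; the function name `sw_step` and all parameters are hypothetical. The point is the loop structure: within each loop every iteration is independent of the others, which is the kind of work that could be handed to one GPU thread per grid point.

```c
/* One forward-backward step of the 1D linearized shallow water equations:
 *   du/dt   = -g     * d(eta)/dx
 *   d(eta)/dt = -depth * du/dx
 * eta[0..n-1] lives at cell centers, u[0..n] at cell faces.
 * Closed walls: u[0] and u[n] are held at zero.
 * Hypothetical sketch; names and layout are assumptions. */
void sw_step(double *eta, double *u, int n,
             double depth, double g, double dx, double dt)
{
    /* Momentum update: each interior face reads only eta, writes only u[i]. */
    for (int i = 1; i < n; ++i) {
        u[i] -= g * dt / dx * (eta[i] - eta[i - 1]);
    }
    u[0] = 0.0;
    u[n] = 0.0;

    /* Continuity update: each cell reads only the new u, writes only eta[i]. */
    for (int i = 0; i < n; ++i) {
        eta[i] -= depth * dt / dx * (u[i + 1] - u[i]);
    }
}
```

Because the flux differences telescope and the walls are closed, the sum of `eta` over all cells stays constant up to rounding, which makes a convenient sanity check when comparing a CPU version against a GPU port.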