We have been working on several optimizations to the ROMS code base, primarily to improve performance. The changes that use newer MPI-3 interfaces, such as neighborhood collectives, have also made the code much cleaner.
The changes, in no particular order, are:
- Alignment and Padding: Arrays are aligned and padded to make efficient use of the vector units. These changes are mostly in the functions that were hotspots in the benchmark application: step2d, lmd_skpp, step3d_uv, rhs3d and pre_step3d.
- Loop Transformations: Loops are restructured to improve cache performance.
- Cartesian Topology: Use of a Cartesian communicator for neighbor exchanges in mp_exchange.
- Derived Data Types: Use of MPI derived data types avoids explicit packing/unpacking in the mp_exchange code, making it simpler and more efficient.
- Neighborhood Collectives: Use of MPI-3 neighborhood collectives replaces the four sends and receives in mp_exchange with a single neighborhood all-to-all call, making the code simpler and more efficient.
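
To illustrate the alignment and padding item, here is a minimal sketch of the general technique. It is written in C for brevity (ROMS itself is Fortran) and is not taken from the ROMS sources. The 64-byte alignment and the names `ALIGN_BYTES`, `padded_row_len`, `alloc_field` and `axpy_field` are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed vector / cache-line width in bytes; illustrative, not a ROMS setting. */
#define ALIGN_BYTES 64

/* Round a row length (in doubles) up to a whole number of 64-byte blocks. */
size_t padded_row_len(size_t n)
{
    const size_t per_block = ALIGN_BYTES / sizeof(double);
    return (n + per_block - 1) / per_block * per_block;
}

/* Allocate an nrows x ncols field. The padded row length is returned in *ld,
 * so every row starts on an ALIGN_BYTES boundary. Caller frees with free(). */
double *alloc_field(size_t nrows, size_t ncols, size_t *ld)
{
    *ld = padded_row_len(ncols);
    /* The total size is a multiple of ALIGN_BYTES, as aligned_alloc requires. */
    return aligned_alloc(ALIGN_BYTES, nrows * *ld * sizeof(double));
}

/* y += a * x over the unpadded part of each row. The inner loop is
 * unit-stride and starts on an aligned address, which helps vectorization. */
void axpy_field(double *y, const double *x, double a,
                size_t nrows, size_t ncols, size_t ld)
{
    assert(ld >= ncols);
    for (size_t j = 0; j < nrows; ++j) {
        double *yr = y + j * ld;
        const double *xr = x + j * ld;
        for (size_t i = 0; i < ncols; ++i) {
            yr[i] += a * xr[i];
        }
    }
}
```

The padding elements at the end of each row are never read or written by the kernel; they only keep the start of the next row aligned. The cost is a small amount of extra memory per row.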