I have a problem that I hope someone can help me resolve.
I am encountering an MPI-related error while running the nonlinear ROMS model on a Cray system. Interestingly, this issue does not occur when I run ROMS 4D-Var or ROMS-split on the same system with similar configurations. The problem arises during the initialization phase of the nonlinear simulation. Below are the details of the error message and my current settings:
Code:
Job started at:
Tue 24 Dec 2024 12:04:13 PM EST
MPICH ERROR [Rank 15] [job id 207418914.0] [Tue Dec 24 12:04:17 2024] [c6n0145] - Abort(939550479) (rank 15 in comm 0): Fatal error in MPIDI_Cray_shared_mem_coll_bcast: Other MPI error, error stack:
MPIDI_Cray_shared_mem_coll_bcast(500): message sizes do not match across processes in the collective routine: I am using 4 but a peer process on my node is using 12
aborting job:
Fatal error in MPIDI_Cray_shared_mem_coll_bcast: Other MPI error, error stack:
MPIDI_Cray_shared_mem_coll_bcast(500): message sizes do not match across processes in the collective routine: I am using 4 but a peer process on my node is using 12
MPICH ERROR [Rank 13] [job id 207457413.0] [Mon Dec 30 14:03:34 2024] [c6n0150] - Abort(939550479) (rank 13 in comm 0): Fatal error in MPIDI_Cray_shared_mem_coll_bcast: Other MPI error, error stack:
MPIDI_Cray_shared_mem_coll_bcast(500): message sizes do not match across processes in the collective routine: I am using 4 but a peer process on my node is using 12
...
Any guidance or suggestions would be greatly appreciated. Please let me know if more information is needed to troubleshoot the issue. I have attached my output and error files here.
Best,
Parisa