Trac # 443 Bug in MPI programs

hugobastos
Posts: 15
Joined: Tue May 06, 2008 8:46 pm
Location: FURG

#1 Post by hugobastos

Hello,

First, a minor issue:

I recently opened a ticket in Trac, but it contained a mistake. When I try to edit it, Trac gives me an error (after clicking "Submit changes"):

---
Trac Error
Invalid action "view"
---

-------------------------

Now the bug:

I recently updated the code to revision 484 from an older revision (I think it was 459).

Compile with MPI and undef ANA_INITIAL, then set the tiling larger than 1x1: the model stops with an MPI_Bcast error.

Compile with MPI and define ANA_INITIAL, then set the tiling larger than 1x1: the model runs!

With MPI and 1x1 tiling the program runs, just as it does with OpenMP and in serial. In OpenMP, other tile settings work too.

My ana_initial configuration changes the boundary conditions (so the reproduction tests I posted in Trac are not 100% OK! :oops: ), but the error persists with or without the boundary conditions.

Thanks

The MPI error is at the end of this message.

Fatal error in PMPI_Bcast: Message truncated, error stack:
PMPI_Bcast(1302)........................: MPI_Bcast(buf=0x8c57e8, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(998).........................: 
MPIR_Bcast_scatter_ring_allgather(849)..: 
MPIR_Bcast_binomial(157)................: 
MPIDI_CH3_PktHandler_EagerShortSend(351): Message from rank 0 and tag 2 truncated; 12 bytes received but buffer size is 4
Fatal error in PMPI_Bcast: Message truncated, error stack:
PMPI_Bcast(1302)........................: MPI_Bcast(buf=0x8c57e8, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(998).........................: 
MPIR_Bcast_scatter_ring_allgather(849)..: 
MPIR_Bcast_binomial(157)................: 
MPIDI_CH3_PktHandler_EagerShortSend(351): Message from rank 0 and tag 2 truncated; 12 bytes received but buffer size is 4
rank 2 in job 52  fox_55915   caused collective abort of all ranks
  exit status of rank 2: killed by signal 9 
rank 1 in job 52  fox_55915   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9 
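
For anyone decoding the trace above: the last line of the stack says the receiving rank posted a buffer for count=1 MPI_INTEGER (4 bytes), while rank 0 sent 12 bytes. Here is a minimal Python sketch of that arithmetic, assuming the usual 4-byte MPI_INTEGER. The helper name `truncation_summary` is hypothetical and is not part of ROMS or MPI.

```python
import re

# Last line of the MPICH error stack quoted above.
ERROR_LINE = (
    "MPIDI_CH3_PktHandler_EagerShortSend(351): Message from rank 0 and tag 2 "
    "truncated; 12 bytes received but buffer size is 4"
)

# Assumption: MPI_INTEGER maps to a default 4-byte Fortran INTEGER.
BYTES_PER_INTEGER = 4


def truncation_summary(line, item_size=BYTES_PER_INTEGER):
    """Return (items_sent, items_expected) parsed from a 'Message truncated' line."""
    match = re.search(r"(\d+) bytes received but buffer size is (\d+)", line)
    if match is None:
        raise ValueError("not a 'Message truncated' line")
    received, buffer_size = (int(group) for group in match.groups())
    return received // item_size, buffer_size // item_size


sent, expected = truncation_summary(ERROR_LINE)
print(f"root broadcast {sent} integers, receiver expected {expected}")
```

Under that assumption, rank 0 broadcast three integers where the other ranks expected one, so the ranks disagree about the element count of the same broadcast.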


Re: Trac # 443 Bug in MPI programs

#2 Post by hugobastos

I found the answer, hehe:

It was my mistake... my initial conditions file (inifile). I recently created a new one with ncks, with the ocean_time variable fixed to a value = 1.

So if you have a file with:

	double ocean_time ;
		ocean_time:long_name = "time since initialization" ;
		ocean_time:units = "days since 1968-05-23 00:00:00 GMT" ;
	double salt(ocean_time, s_rho, eta_rho, xi_rho) ;
		salt:long_name = "salinity" ;
		salt:time = "ini_time" ;
		salt:coordinates = "lon_rho lat_rho s_rho" ;

and try to run with MPI with more than 1x1 tiles, you get an MPI_Bcast error.

But if you run ncecat on the file, then:

	double ocean_time(record) ;
		ocean_time:long_name = "time since initialization" ;
		ocean_time:units = "days since 1968-05-23 00:00:00 GMT" ;
	double salt(ocean_time, s_rho, eta_rho, xi_rho) ;
		salt:long_name = "salinity" ;
		salt:time = "ini_time" ;
		salt:coordinates = "lon_rho lat_rho s_rho" ;


and this will work.

I think this is still a bug ( :x ), because you can run with both configurations in serial and OpenMP but not in MPI, and there is no warning about the missing dimension.
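
Since the model gives no warning, one way to catch this before a run is to scan the `ncdump -h` header for variables declared without any dimension. This is a minimal Python sketch, assuming the header text has been pasted into a string. The function name `scalar_variables` and the regular expression are illustrative assumptions, not something ROMS or NCO provides.

```python
import re

# Header fragment as printed by `ncdump -h`, copied from the failing file above.
HEADER = """
	double ocean_time ;
		ocean_time:long_name = "time since initialization" ;
	double salt(ocean_time, s_rho, eta_rho, xi_rho) ;
		salt:long_name = "salinity" ;
"""

# Matches variable declarations such as `double salt(ocean_time, s_rho) ;`
# or `double ocean_time ;`. Attribute lines contain ':' and do not match.
DECLARATION = re.compile(r"^\s*(\w+)\s+(\w+)\s*(?:\(([^)]*)\))?\s*;\s*$")


def scalar_variables(header_text):
    """Return names of variables declared without any dimension."""
    scalars = []
    for line in header_text.splitlines():
        match = DECLARATION.match(line)
        if match and match.group(3) is None:
            scalars.append(match.group(2))
    return scalars


print(scalar_variables(HEADER))
```

If `ocean_time` shows up in the returned list, the file has the dimensionless time variable described above.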



The error with the debug flag:

At line 198 of file get_state.f90
Fortran runtime error: Array reference out of bounds for array 'mytime', upper bound of dimension 1 exceeded (1 > 0)
Fortran runtime error: Array reference out of bounds for array 'mytime', upper bound of dimension 1 exceeded (1 > 0)
At line 198 of file get_state.f90
Fortran runtime error: Array reference out of bounds for array 'mytime', upper bound of dimension 1 exceeded (1 > 0)At line 198 of file get_state.f90
Fortran runtime error: Array reference out of bounds for array 'mytime', upper bound of dimension 1 exceeded (1 > 0)

Additionally, while searching for this bug I think I found another one:

When running with OpenMP and tiling, the "TotVolume" value printed to stdout changes between runs (extracted with grep).

OK, all the numbers are of the same order of magnitude, but I was using the default gfortran compiler flags from the ROMS folder, so I was expecting exactly the same value, right? Maybe this is not the only field that is computed differently with the default flags :shock: ?

Below is the TotVolume output.

Without tiling in OpenMP:

Trynumber = 2
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 3
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 4
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 5
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 6
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 7
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 8
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 9
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 10
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 11
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 12
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 13
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 14
Initial basin volumes: TotVolume = 5.2601331615E+14 m3

With Tiling in OpenMP:
Trynumber = 2
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 3
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 4
Initial basin volumes: TotVolume = 2.2462582921E+14 m3
Trynumber = 5
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 6
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 7
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 8
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 9
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 10
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 11
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 12
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 13
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Trynumber = 14
Initial basin volumes: TotVolume = 4.5740565073E+14 m3
Trynumber = 15


With Tiling in MPI:


Trynumber = 2
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 3
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 4
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 5
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 6
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 7
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 8
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 9
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 10
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 11
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 12
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 13
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 14
Initial basin volumes: TotVolume = 5.2601331615E+14 m3
Trynumber = 15
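
To put numbers on the differences, here is a minimal Python sketch that extracts the TotVolume values from grep'ed output and compares them with the value that the untiled OpenMP and tiled MPI runs agree on. The function names and the choice of reference value are my own assumptions for illustration.

```python
import re

# A few of the grep'ed lines from the OpenMP runs with tiling quoted above.
LOG = """
Initial basin volumes: TotVolume = 3.7050210981E+14 m3
Initial basin volumes: TotVolume = 2.2462582921E+14 m3
Initial basin volumes: TotVolume = 4.5740565073E+14 m3
"""

# Value printed by the untiled OpenMP runs and by the tiled MPI runs.
REFERENCE = 5.2601331615e14


def tot_volumes(text):
    """Extract every TotVolume value from model stdout as a float."""
    return [float(v) for v in re.findall(r"TotVolume\s*=\s*([0-9.E+-]+)", text)]


def relative_differences(values, reference=REFERENCE):
    """Relative deviation of each value from the reference volume."""
    return [(v - reference) / reference for v in values]


volumes = tot_volumes(LOG)
for volume, diff in zip(volumes, relative_differences(volumes)):
    print(f"{volume:.10E}  {diff:+.1%}")
```

The tiled OpenMP values fall short of the reference by roughly 13% to 57%, which is much larger than round-off from a different summation order.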
