Description
Version
main
Are there any linked Issues or Pull Requests?
No response
What happened?
GC6 with LFRic at the git_migration release fails sporadically with "rank 510 died from signal 6 and dumped core". I interrogated the core file with ddt, and the traceback shows that the failure is in the call to mpi_bcast in NEMO's file cpl_oasis3.F90 (subroutine cpl_rcv_1d). This is where NEMO receives 1D data from OASIS (ice mass or river outflow) on processor zero and then passes it to all the other NEMO processors using mpi_bcast. This has only started happening since we upgraded the modules in the GC6 trunk workflow, which I think involves an upgrade to MPICH.

I tried replacing mpi_bcast with mpi_send (from processor zero) and mpi_recv (on all the other processors), and it still fails with a similar error. However, it now appears to fail in one of the last NEMO processors while it is in mpi_recv, waiting for processor zero to reach the mpi_send stage.
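For anyone following along, here is a rough sketch of the send/recv replacement pattern described above. It is a toy model using plain Python threads and queues, not MPI and not the NEMO code: the function name `simulate_send_recv`, the mailbox queues, and the timeout are all assumptions made for illustration. It only shows the shape of the exchange: processor zero holds the data and sends it to each other processor in turn, while every other processor blocks in a receive until its message arrives.

```python
import queue
import threading


def simulate_send_recv(n_ranks, payload, timeout=5.0):
    """Toy model: rank 0 sends `payload` to every other rank, one at a time.

    Each non-zero rank is a thread that blocks on its own mailbox,
    standing in for a blocking receive from rank 0 (hypothetical model,
    not real MPI).
    """
    mailboxes = [queue.Queue() for _ in range(n_ranks)]
    received = [None] * n_ranks
    send_order = []

    def worker(rank):
        # Blocks until rank 0 has posted this rank's message.
        received[rank] = mailboxes[rank].get(timeout=timeout)

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(1, n_ranks)
    ]
    for thread in threads:
        thread.start()

    # Rank 0 already has the data; it then serves the other ranks in order.
    received[0] = payload
    for dest in range(1, n_ranks):
        mailboxes[dest].put(payload)
        send_order.append(dest)

    for thread in threads:
        thread.join()
    return received, send_order


if __name__ == "__main__":
    data, order = simulate_send_recv(8, [1.0, 2.0])
    print(order)
    print(all(item == [1.0, 2.0] for item in data))
```

In this model the highest-numbered ranks are served last, so they spend the longest blocked in the receive. That matches where the failure was observed, but it says nothing about why the real run aborts.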