Sporadic failures in mpi_bcast #269

@DanCopsey

Version

main

Are there any linked Issues or Pull Requests?

No response

What happened?

GC6 with LFRic at the git_migration release fails sporadically with "rank 510 died from signal 6 and dumped core". I have interrogated the core file with ddt, and the traceback shows that the failure is in the call to mpi_bcast in NEMO's file cpl_oasis3.F90 (subroutine cpl_rcv_1d). This is where NEMO receives 1D data from OASIS (ice mass or river outflow) on processor zero and then passes it to all the other NEMO processors using mpi_bcast. The failures only started after we upgraded the modules in the GC6 trunk workflow, which I think involves an upgrade to MPICH.

I have tried replacing mpi_bcast with mpi_send (from processor zero) and mpi_recv (on all the other processors), and it still fails with a similar error. In that case, however, it appears to fail on one of the last NEMO processors while that processor is in mpi_recv, waiting for processor zero to reach the mpi_send stage.
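For readers unfamiliar with the communication pattern, the following is a toy, thread-based sketch of the root-to-all send/recv replacement described above. It is not MPI and not the NEMO code: the function name `run_toy_broadcast`, the per-rank inbox queues, and the 5 second timeout are all hypothetical, chosen only to show that every non-zero rank blocks in its receive until rank zero reaches its send.

```python
import queue
import threading


def run_toy_broadcast(n_ranks, payload):
    """Toy model of rank 0 sending a 1D field to every other rank.

    Threads stand in for MPI ranks and queues stand in for messages.
    This is an illustration only, not an MPI implementation.
    """
    # One inbox per rank plays the role of point-to-point delivery.
    inboxes = [queue.Queue() for _ in range(n_ranks)]
    received = [None] * n_ranks

    def rank_main(rank):
        if rank == 0:
            # Rank 0 already holds the field (standing in for the data
            # that arrives from the coupler in the real model).
            received[0] = list(payload)
            for dest in range(1, n_ranks):
                inboxes[dest].put(list(payload))  # analogue of a send
        else:
            # Analogue of a blocking receive: waits until rank 0 has sent.
            # The timeout (an assumption) stops the toy hanging forever.
            received[rank] = inboxes[rank].get(timeout=5.0)

    threads = [
        threading.Thread(target=rank_main, args=(r,)) for r in range(n_ranks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    # Every rank should end up with its own copy of the same field.
    print(run_toy_broadcast(4, [0.5, 1.5]))
```

In this model a receiver that gives up, or is killed, while rank zero has not yet sent simply ends with no data. That is the stage at which the failing NEMO processor was observed in the send/recv variant.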

Relevant log output

Labels: bug
