dx.doi.org/10.1109/IPDPSW63119.2024.00193

Preview meta tags from the dx.doi.org website.

Linked Hostnames

2

Search Engine Appearance

Google

https://dx.doi.org/10.1109/IPDPSW63119.2024.00193

Asynchrony and Failure Masking via Pseudo-Local Process Recovery in MPI Applications

For parallel solvers susceptible to hardware-related failures, localizing recovery to the processes directly affected by the failure preserves asynchronous progress and exhibits “failure masking” due to limited propagation of recovery delays. This results in improved scalability compared to global recovery, which is a disproportionate response. However, localizing recovery from hard failures is challenging because such failures are not transparent to the MPI runtime, requiring reconstruction of the communication layers and of a consistent application state. In this work we present the process- and data-recovery concepts that enable the performance and scalability of localized recovery despite the inherently non-local nature of some recovery steps. We present design enhancements to existing resilience middleware (the Fenix library and MPI User-Level Failure Mitigation) to robustly support larger-scale execution and “pseudo-local” checkpointing and recovery from many process failures. Using an example stencil solver with emulated hard failures, we present an experimental evaluation, with runs on up to ~1000 ranks subject to ~100 process failures, which confirms that pseudo-local recovery has significantly improved weak scaling compared to the roughly exponential slowdown of global recovery. Our work shows how fault tolerance infrastructure originally designed for global checkpoint/restart can be repurposed to enable greater efficiency in a resilience-aware application.
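
The "failure masking" idea in the abstract can be illustrated with a deliberately simplified toy model. This is a sketch under stated assumptions, not the paper's method and not the Fenix or ULFM API: ranks sit on a 1D ring, each iteration costs one time unit, a rank may start an iteration only after it and its two neighbors have finished the previous one, and a failure adds a fixed recovery cost. The function name `makespan` and all parameter values are hypothetical.

```python
def makespan(n_ranks, n_iters, failures, recovery_cost, local):
    """Toy completion time of a ring stencil with injected failures.

    failures: set of (rank, iteration) pairs (hypothetical inputs).
    local=True  -> only the failed rank pays the recovery cost.
    local=False -> every rank pays it (a global rollback, in this toy model).
    """
    finish = [0.0] * n_ranks  # time at which each rank finished its last iteration
    for it in range(n_iters):
        failed_now = {r for (r, t) in failures if t == it}
        nxt = []
        for r in range(n_ranks):
            # Halo dependency: wait for both ring neighbors and for yourself.
            left = finish[(r - 1) % n_ranks]
            right = finish[(r + 1) % n_ranks]
            start = max(finish[r], left, right)
            if local:
                delay = recovery_cost if r in failed_now else 0.0
            else:
                delay = recovery_cost * len(failed_now)
            nxt.append(start + 1.0 + delay)
        finish = nxt
    return max(finish)


# Two failures that are far apart on the ring but close together in time.
failures = {(0, 10), (32, 12)}
local_time = makespan(64, 50, failures, 5.0, local=True)
global_time = makespan(64, 50, failures, 5.0, local=False)
print(local_time, global_time)
```

In this toy model a delay spreads by at most one neighbor per iteration, so two failures that are far apart in rank space but close in time overlap rather than add up under local recovery, while global recovery charges every rank for each failure. The model only illustrates that mechanism; it does not reproduce the scaling measurements reported in the abstract.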




  • General Meta Tags

    12
    • title
      Asynchrony and Failure Masking via Pseudo-Local Process Recovery in MPI Applications | IEEE Conference Publication | IEEE Xplore
    • google-site-verification
      qibYCgIKpiVF_VVjPYutgStwKn-0-KBB6Gw4Fc57FZg
    • Description
      For parallel solvers susceptible to hardware-related failures, localizing recovery to the processes directly affected by the failure allows preserving asynchron
    • Content-Type
      text/html; charset=utf-8
    • viewport
      width=device-width, initial-scale=1.0
  • Open Graph Meta Tags

    3
    • og:image
      https://ieeexplore.ieee.org/assets/img/ieee_logo_smedia_200X200.png
    • og:title
      Asynchrony and Failure Masking via Pseudo-Local Process Recovery in MPI Applications
    • og:description
      For parallel solvers susceptible to hardware-related failures, localizing recovery to the processes directly affected by the failure preserves asynchronous progress and exhibits “failure masking” due to limited propagation of recovery delays. This results in improved scalability compared to global recovery, which is a disproportionate response. However, localizing recovery from hard failures is challenging because such failures are not transparent to the MPI runtime, requiring reconstruction of the communication layers and of a consistent application state. In this work we present the process- and data-recovery concepts that enable the performance and scalability of localized recovery despite the inherently non-local nature of some recovery steps. We present design enhancements to existing resilience middleware (the Fenix library and MPI User-Level Failure Mitigation) to robustly support larger-scale execution and “pseudo-local” checkpointing and recovery from many process failures. Using an example stencil solver with emulated hard failures, we present an experimental evaluation, with runs on up to ~1000 ranks subject to ~100 process failures, which confirms that pseudo-local recovery has significantly improved weak scaling compared to the roughly exponential slowdown of global recovery. Our work shows how fault tolerance infrastructure originally designed for global checkpoint/restart can be repurposed to enable greater efficiency in a resilience-aware application.
  • Twitter Meta Tags

    1
    • twitter:card
      summary
  • Link Tags

    9
    • canonical
      https://ieeexplore.ieee.org/document/10596520/
    • icon
      /assets/img/favicon.ico
    • stylesheet
      https://ieeexplore.ieee.org/assets/css/osano-cookie-consent-xplore.css
    • stylesheet
      /assets/css/simplePassMeter.min.css?cv=20250701_00000
    • stylesheet
      /assets/dist/ng-new/styles.css?cv=20250701_00000

Links

17