STTR Phase I: Fault-Tolerant MPI: An Enabling Strategy, Product Concept, and Enhancement of Production Cluster Computing

Award Information

Agency: National Science Foundation

Branch: N/A

Contract: 0340071

Agency Tracking Number: 0340071

Amount: $99,958.00

Phase: Phase I

Program: STTR

Solicitation Topic Code: N/A

Solicitation Number: N/A

Timeline

Solicitation Year: N/A

Award Year: 2004

Award Start Date (Proposal Award Date): N/A

Award End Date (Contract End Date): N/A

Small Business Information

MPI SOFTWARE TECHNOLOGY, INC.

110 12th Street North

Birmingham, AL 35203

United States

DUNS: N/A

HUBZone Owned: No

Woman Owned: No

Socially and Economically Disadvantaged: No

Principal Investigator

Name: Pirabhu Raman
Title: PI
Phone: (205) 314-3471
Email: pirabhu@mpi-softtech.com

Business Contact

Name: Jennifer Herrett-Skjellum
Phone: (662) 320-4300
Email: jennifer@mpi-softtech.com

Research Institution

Name: University of Florida
Contact: Alan George
Address:

218 Grinter Hall

Gainesville, FL 32611

United States

Phone: (352) 392-1582
Type: Nonprofit College or University

Abstract

This Small Business Technology Transfer Research (STTR) Phase I project proposes to develop a scalable, fault-tolerant, and reliable message-passing interface. MPI-2 is a key technology for the next several years in enabling scalable parallel computing on massively parallel machines and Beowulf clusters. Fault tolerance is absent from both the MPI-1 and MPI-2 standards, and no satisfactory products or research results offer production scalable computing applications an effective path to fault tolerance. This proposal addresses fault tolerance in both the MPI-1 and MPI-2 standards, with attention to key application classes, fault-free overhead, and checkpoint-restart strategies. It also connects the fault-tolerance mechanisms with the resource management infrastructure used by many types of MPI users in science and industry, so that the result is useful in practical scientific computing.

If successful, this product will provide key new capabilities to parallel programs, programmers, and cluster systems, including the enhancement of existing commercial applications based on MPI, such as CFD applications. The proposed effort is the first effort towards realizing a fault-tolerant MPI-2, a technology that scientists and engineers across the spectrum would exploit, since parallel computing based on MPI is a widespread enabler of scientific discovery. The project expects to increase the adoption of MPI-2, as well as of higher-productivity parallel computing for clusters and potentially for the grid. The computer science and engineering experience gained through this work will enable better interoperation of MPI implementations and resource allocators, which in turn will further enable efficient production parallel computing. The optimizations targeted at the popular recovery mechanisms, for key classes of applications, can be applied to any middleware and hence would improve the performance of applications in general.

* Information listed above is at the time of submission. *
