Fault-tolerance for many-task computing applications


About Fault-tolerance for many-task computing applications


Many-task computing (MTC) applications execute many small tasks that communicate by reading and writing temporary files. MemFS, our in-memory file system for MTC applications, stores these files in the main memory of the participating compute nodes.

As the scale of execution grows, so does the risk that compute nodes fail during a run. The focus of this project is to study and analyze large MTC applications and their communication patterns in order to propose and implement a strategy for fault-tolerant execution of MTC applications.

This strategy must trade off redundant storage against recomputation of tasks. Because main memory is limited, we are looking for minimal redundancy in storage. At the same time, for a given storage redundancy scheme, we want to minimize the number of tasks that need to be recomputed when files are lost due to a failure.
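
To make the trade-off concrete, the following is a minimal sketch and not part of MemFS. It assumes a simplified model in which every file is written by exactly one task, a file is lost only when all nodes holding a replica of it fail, and a lost file is regenerated by re-running its producer task. All names (`lost_files`, `tasks_to_recompute`, the task names `t1` to `t3`, and the nodes `n1`, `n2`) are hypothetical and chosen for illustration only.

```python
from typing import Dict, Iterable, List, Set


def lost_files(placement: Dict[str, Set[str]], failed: Set[str]) -> Set[str]:
    """Files whose replicas all resided on failed nodes (assumed loss model)."""
    return {f for f, nodes in placement.items() if nodes <= failed}


def tasks_to_recompute(
    producer: Dict[str, str],      # file -> task that writes it
    inputs: Dict[str, List[str]],  # task -> files it reads
    lost: Set[str],
    needed: Iterable[str],         # files that pending tasks still have to read
) -> Set[str]:
    """Tasks that must be re-run so that every needed file exists again."""
    rerun: Set[str] = set()
    stack = [f for f in needed if f in lost]
    while stack:
        f = stack.pop()
        task = producer.get(f)
        if task is None:
            raise ValueError(f"lost file {f!r} has no producer task")
        if task in rerun:
            continue
        rerun.add(task)
        # Re-running a task requires its inputs; lost inputs cascade upstream.
        stack.extend(g for g in inputs.get(task, []) if g in lost)
    return rerun


# Hypothetical three-stage pipeline: t1 writes a, t2 reads a and writes b,
# t3 reads b and writes c. Task t3 has not run yet, so file b is still needed.
producer = {"a": "t1", "b": "t2", "c": "t3"}
inputs = {"t1": [], "t2": ["a"], "t3": ["b"]}
failed = {"n1"}

# No redundancy: a and b each live only on the failing node n1.
no_replicas = {"a": {"n1"}, "b": {"n1"}}
rerun_plain = tasks_to_recompute(
    producer, inputs, lost_files(no_replicas, failed), ["b"]
)

# One extra copy of a on node n2: the recomputation chain stops at t2.
one_replica = {"a": {"n1", "n2"}, "b": {"n1"}}
rerun_replicated = tasks_to_recompute(
    producer, inputs, lost_files(one_replica, failed), ["b"]
)

print(sorted(rerun_plain), sorted(rerun_replicated))
```

In this toy model, one additional replica of file `a` reduces the recomputation from two tasks to one, at the cost of storing `a` twice in main memory. A real strategy would weigh such savings against file sizes and task run times, which the sketch ignores.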