Title: Replication for Efficiency and Fault Tolerance in a DSM System


Description: Distributed Shared Memory (DSM) systems implemented on a network of workstations (NOW) have become a convenient alternative to shared memory architectures for executing long-running parallel applications. However, such architectures are susceptible to failures. This paper presents the design and implementation of a recoverable DSM (RDSM) based on a backward error recovery (BER) mechanism. Our RDSM's design focuses on exploiting data replication for both fault tolerance and efficiency. The RDSM has been implemented on a NOW, and performance evaluation shows the benefits of exploiting both types of replication to design an efficient, scalable, and low-cost recoverable DSM. Key Words: Distributed Shared Memory, Replication, Fault Tolerance, Network of Workstations. 1 INTRODUCTION Networks of workstations (NOW) are an attractive and much cheaper alternative [1] to shared memory parallel architectures for executing long-running parallel applications. A DSM [2] implemented o...

Date: 1998-04-03

