
Fault-Tolerance by Replication in Distributed Systems

by André Schiper


Email: schiper@lsesun3.epfl.ch
URL: ftp://ftp-lse.epfl.ch/pub/private/people/schiper.html

Abstract: Dependability, i.e. reliability and availability, is one of the major trends in software technology. In the past, it was considered acceptable for services to be unavailable because of failures. This is rapidly changing: the requirement for high software reliability and availability is continually increasing in domains such as finance, booking and reservation systems, industrial control, and telecommunications. One obvious possibility is to build reliable software on top of fault-tolerant (replicated) hardware. This may indeed be a viable solution for some application classes, and has been successfully pursued by companies such as Tandem and Stratus. Economic factors have, however, motivated the search for cheaper software-based fault-tolerance, i.e. fault-tolerance by software-based replication. While this principle is readily understood, the techniques required to implement replication are surprisingly difficult to master.

The talk will concentrate on the techniques that have been developed since the mid-eighties to implement replicated services. These techniques take process groups and group communication as the basic abstractions. A replicated service is encapsulated within a group and thus appears as a single entity to the outside world. Interactions among the processes within the group, as well as those between the service and its clients, are implemented through group communication primitives, including reliable multicast with different message ordering guarantees. The Isis system, developed at Cornell University, pioneered this approach.

The talk will describe the various replication techniques (passive replication, active replication, coordinator-cohort replication) and the group multicast primitives needed to implement them. The implementation of group multicast primitives will be briefly discussed, stressing the difficulty of guaranteeing both safety and liveness properties. Finally, the relationship between the group communication paradigm and classical quorum-based techniques (e.g. static voting, dynamic voting) for handling replication will be mentioned.



Biography: André SCHIPER has been a professor of Computer Science at the EPFL since 1985, leading the Operating Systems laboratory. His current research interests are in the area of fault-tolerant distributed systems. He has worked on the new implementation of communication primitives for the Isis system, developed at Cornell University. André Schiper has taken part in the ESPRIT Basic Research Project BROADCAST, and is a member of the CABERNET Network of Excellence in Distributed Systems. He is also a partner in the new ESPRIT R&D project OpenDREAMS, whose objective is to integrate various technologies, including fault-tolerance by replication, within a CORBA infrastructure.



