Error: The part of a component’s state that can lead to a failure
Fault: The cause of an error
What to do about faults
Fault prevention: prevent the occurrence of a fault
Fault tolerance: build a component such that it can mask the presence of
faults
Fault removal: reduce presence, number, seriousness of faults
Fault forecasting: estimate present number, future incidence, and
consequences of faults
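Of the four approaches above, fault tolerance by masking is the easiest to illustrate in code. The following is a minimal sketch, assuming redundancy with majority voting as the masking technique (one possible choice, not one prescribed by these slides). The name masked_call and the replica functions are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def masked_call(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Call every replica and return the value a strict majority agrees on."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(x))
        except Exception:
            # A replica that fails outright simply contributes no vote.
            continue
    if not results:
        raise RuntimeError("all replicas failed")
    value, votes = Counter(results).most_common(1)[0]
    # Require more than half of ALL replicas, not just of those that answered.
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def square(x: int) -> int:
    return x * x


def wrong_value(x: int) -> int:
    return x * x + 1  # hypothetical replica that produces a wrong value


def crashing(x: int) -> int:
    raise ConnectionError("replica is down")  # hypothetical crashed replica
```

With three replicas, one faulty replica is hidden from the caller: masked_call([square, square, wrong_value], 3) and masked_call([square, crashing, square], 3) both return the correct square. Two faulty replicas out of three can no longer be masked.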
Fault Tolerance Introduction
Failure models
Failure semantics
Crash failures: Component halts, but behaves correctly before halting
Omission failures: Component fails to respond
Timing failures: Output is correct, but lies outside a specified real-time
interval (performance failures: too slow)
Response failures: Output is incorrect (but at least cannot be attributed
to another component)
  Value failure: Wrong value is produced
  State transition failure: Execution of the component brings it into a
  wrong state
Arbitrary failures: Component may produce arbitrary output and be subject
to arbitrary timing failures
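The failure classes above can be made concrete from the point of view of a client that observes a single reply. The sketch below is illustrative only: the names Failure, Observation, and classify are hypothetical, and it assumes the client knows the expected value and a deadline. State transition failures and arbitrary failures are left out because they cannot be identified from one reply.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Failure(Enum):
    NONE = "no failure observed"
    OMISSION = "omission (or crash) failure"
    TIMING = "timing failure"
    RESPONSE = "response failure (wrong value)"


@dataclass
class Observation:
    reply: Optional[int]  # None means no reply arrived at all
    latency_s: float      # how long the client waited, in seconds


def classify(obs: Observation, expected: int, deadline_s: float) -> Failure:
    """Map one observed reply onto the failure classes listed above."""
    if obs.reply is None:
        # From a single missing reply, a crash and an omission look the same.
        return Failure.OMISSION
    if obs.reply != expected:
        return Failure.RESPONSE
    if obs.latency_s > deadline_s:
        # Correct output, but outside the specified real-time interval.
        return Failure.TIMING
    return Failure.NONE
```

For example, classify(Observation(reply=42, latency_s=0.9), expected=42, deadline_s=0.5) yields Failure.TIMING, because the value is correct but arrives late.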
Crash failures
Problem
Clients cannot distinguish between a crashed component and one that is just
a bit slow
Consider a server from which a client is expecting output
Is the server perhaps exhibiting timing or omission failures?
Is the channel between client and server faulty?
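The problem can be reproduced with a small simulation. This is a sketch under stated assumptions: the server runs as a local thread, the three behaviors ("ok", "slow", "crashed") and the timeout values are made up for illustration, and the function names are hypothetical.

```python
import queue
import threading
import time


def server(behavior: str, replies: "queue.Queue[str]") -> None:
    """Simulated server: replies at once, replies late, or never replies."""
    if behavior == "crashed":
        return              # halts without ever producing output
    if behavior == "slow":
        time.sleep(0.3)     # replies, but only after the client's timeout
    replies.put("result")


def call_with_timeout(behavior: str, timeout_s: float = 0.1) -> str:
    """Client side: wait for output for at most timeout_s seconds."""
    replies: "queue.Queue[str]" = queue.Queue()
    threading.Thread(target=server, args=(behavior, replies), daemon=True).start()
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        # All the client knows is that nothing arrived in time.
        return "suspected"
```

Both call_with_timeout("slow") and call_with_timeout("crashed") end in the same "suspected" outcome. From the client's side the two cases are identical, which is exactly why a timeout alone cannot tell a crashed server from a slow one.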
Assumptions we can make
Fail-silent: The component exhibits omission or crash failures; clients
cannot tell what went wrong
Fail-stop: The component exhibits crash failures, but its failure can be
detected (either thr