failure_detector.rs

Overview

This file implements a phi accrual failure detector. It monitors the health of nodes (identified by ChitchatId) in a distributed system by analyzing heartbeat intervals and computing a suspicion level, phi, that grows with the likelihood that a node has failed.

The primary goal is to classify nodes as live or dead from a statistical analysis of heartbeat arrival times, and to support garbage collection of nodes that have been dead for longer than a configurable grace period. The implementation includes smoothing to handle early startup, when few samples are available, and aims to limit false positives caused by transient network or node delays.
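The grace-period rule can be sketched as follows. This is a minimal illustration, not the file's actual code: the `NodeId` alias, the free function `garbage_collect`, and the explicit `now` parameter are hypothetical names chosen to keep the example self-contained and deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real node identifier type.
type NodeId = String;

/// Removes every node whose time of death is at least `grace_period`
/// before `now`, and returns the ids that were removed.
fn garbage_collect(
    dead_nodes: &mut HashMap<NodeId, Instant>,
    grace_period: Duration,
    now: Instant,
) -> Vec<NodeId> {
    let mut removed = Vec::new();
    dead_nodes.retain(|id, died_at| {
        let expired = now.saturating_duration_since(*died_at) >= grace_period;
        if expired {
            removed.push(id.clone());
        }
        // `retain` keeps the entries for which the closure returns true.
        !expired
    });
    removed
}
```

Passing `now` as an argument rather than reading the clock inside the function is a design choice made here so the behavior can be exercised without sleeping.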


Main Components

FailureDetector

Description

The core struct that maintains heartbeat samples for each node, tracks live and dead nodes, and applies failure detection logic based on phi accrual.
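The live/dead classification can be pictured with the sketch below. It assumes a node is considered dead once its phi value exceeds the configured threshold, and it records the time of death so that a grace period can be applied later. The `Liveness` type, its `update` method, and the `NodeId` alias are hypothetical and only mirror the field names shown in the class diagram.

```rust
use std::collections::{HashMap, HashSet};
use std::time::Instant;

/// Hypothetical stand-in for the real node identifier type.
type NodeId = String;

/// Illustrative bookkeeping for live and dead nodes.
struct Liveness {
    phi_threshold: f64,
    live_nodes: HashSet<NodeId>,
    dead_nodes: HashMap<NodeId, Instant>,
}

impl Liveness {
    fn new(phi_threshold: f64) -> Self {
        Self {
            phi_threshold,
            live_nodes: HashSet::new(),
            dead_nodes: HashMap::new(),
        }
    }

    /// Reclassifies `node` from its current phi value.
    fn update(&mut self, node: &NodeId, phi: f64, now: Instant) {
        if phi > self.phi_threshold {
            self.live_nodes.remove(node);
            // Keep the earliest time of death if the node was already dead.
            self.dead_nodes.entry(node.clone()).or_insert(now);
        } else {
            self.dead_nodes.remove(node);
            self.live_nodes.insert(node.clone());
        }
    }
}
```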

Fields

Methods

Usage Example

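// Build a detector with the default configuration.
let config = FailureDetectorConfig::default();
let mut detector = FailureDetector::new(config);

// Record a heartbeat for a node, then re-evaluate whether it is live or dead.
let node_id = ChitchatId::for_local_test(42);
detector.report_heartbeat(&node_id);
detector.update_node_liveness(&node_id);

// Iterate over the nodes currently considered live.
for live_node in detector.live_nodes() {
    println!("Live node: {}", live_node.node_id);
}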

FailureDetectorConfig

Description

Configuration struct defining parameters controlling the behavior and sensitivity of the failure detector.
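As a rough picture of how such a configuration might be declared and overridden, the sketch below reuses the field names and types from the class diagram. The default values are assumptions picked for illustration and are not taken from this file.

```rust
use std::time::Duration;

/// Illustrative mirror of the configuration fields listed in the diagram.
#[derive(Debug, Clone, PartialEq)]
struct FailureDetectorConfig {
    /// Phi value above which a node is considered dead.
    phi_threshold: f64,
    /// Number of heartbeat intervals kept per node.
    sampling_window_size: usize,
    /// Intervals longer than this are not used as samples.
    max_interval: Duration,
    /// Prior estimate of the heartbeat interval, used before samples exist.
    initial_interval: Duration,
    /// How long a dead node is kept before it can be garbage collected.
    dead_node_grace_period: Duration,
}

impl Default for FailureDetectorConfig {
    fn default() -> Self {
        // Assumed values, for illustration only.
        Self {
            phi_threshold: 8.0,
            sampling_window_size: 1_000,
            max_interval: Duration::from_secs(10),
            initial_interval: Duration::from_secs(5),
            dead_node_grace_period: Duration::from_secs(24 * 60 * 60),
        }
    }
}
```

A caller could then override a single field with struct update syntax, for example `FailureDetectorConfig { phi_threshold: 12.0, ..Default::default() }`, making the detector slower to declare a node dead.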

Fields

Methods


SamplingWindow

Description

A fixed-size window that stores recent heartbeat intervals for a single node and computes the phi value used to estimate the likelihood of node failure. It uses additive smoothing to dampen fluctuations during startup.
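One simplified way to compute phi is the ratio of the time elapsed since the last heartbeat to the mean observed interval. Under an exponential model of inter-arrival times this ratio is proportional to the classical phi value (it omits a constant factor of 1 / ln 10). The sketch below assumes that simplification; the `Window` type, the unbounded `Vec`, and the explicit `now` parameter are hypothetical, and smoothing is left out for brevity.

```rust
use std::time::{Duration, Instant};

/// Illustrative sampling window for a single node.
struct Window {
    /// Observed heartbeat intervals, in seconds (unbounded here for brevity).
    intervals: Vec<f64>,
    last_heartbeat: Option<Instant>,
    /// Intervals longer than this are not recorded as samples.
    max_interval: Duration,
}

impl Window {
    fn new(max_interval: Duration) -> Self {
        Self { intervals: Vec::new(), last_heartbeat: None, max_interval }
    }

    /// Records a heartbeat observed at `now`.
    fn report_heartbeat(&mut self, now: Instant) {
        if let Some(last) = self.last_heartbeat {
            let interval = now.saturating_duration_since(last);
            if interval <= self.max_interval {
                self.intervals.push(interval.as_secs_f64());
            }
        }
        self.last_heartbeat = Some(now);
    }

    /// Returns elapsed / mean interval, or `None` without enough data.
    fn phi(&self, now: Instant) -> Option<f64> {
        let last = self.last_heartbeat?;
        if self.intervals.is_empty() {
            return None;
        }
        let mean = self.intervals.iter().sum::<f64>() / self.intervals.len() as f64;
        if mean <= 0.0 {
            return None;
        }
        let elapsed = now.saturating_duration_since(last).as_secs_f64();
        Some(elapsed / mean)
    }
}
```

With heartbeats every 2 seconds and 6 seconds of silence, this sketch yields a phi of 3; the value keeps growing for as long as no heartbeat arrives.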

Fields

Methods

Implementation Details


AdditiveSmoothing

Description

Utility struct that applies additive smoothing to the interval mean, reducing volatility when the sample size is small.
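Additive smoothing of a mean is commonly written as (sum + prior_weight * prior_mean) / (len + prior_weight): the prior behaves like `prior_weight` extra samples, each equal to `prior_mean`. The sketch below assumes that formula. The field names follow the class diagram, while the `compute_mean(sum, len)` signature is an assumption.

```rust
/// Illustrative additive smoothing of a sample mean.
struct AdditiveSmoothing {
    /// Mean assumed before any sample has been observed.
    prior_mean: f64,
    /// How many "virtual" samples the prior counts for.
    prior_weight: f64,
}

impl AdditiveSmoothing {
    /// Blends the observed samples (`sum` over `len` values) with the prior.
    fn compute_mean(&self, sum: f64, len: usize) -> f64 {
        (sum + self.prior_weight * self.prior_mean) / (len as f64 + self.prior_weight)
    }
}
```

With no samples the result equals the prior mean. As samples accumulate, the prior's influence shrinks and the result approaches the plain arithmetic mean.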

Fields

Methods


BoundedArrayStats

Description

A fixed-size circular buffer of floating-point values representing heartbeat intervals. It maintains a running sum of the stored values so that the mean can be computed without rescanning the buffer.
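A circular buffer with a running sum can be sketched as below. The field and method names follow the class diagram, but the bodies are an illustration of the general technique rather than the file's actual code.

```rust
/// Illustrative fixed-size circular buffer that tracks the sum of its values.
struct BoundedArrayStats {
    values: Box<[f64]>,
    /// True once the buffer has wrapped around at least once.
    is_filled: bool,
    /// Position of the next write.
    index: usize,
    sum: f64,
}

impl BoundedArrayStats {
    fn with_capacity(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self {
            values: vec![0.0; capacity].into_boxed_slice(),
            is_filled: false,
            index: 0,
            sum: 0.0,
        }
    }

    /// Appends a value, evicting the oldest one once the buffer is full.
    fn append(&mut self, value: f64) {
        if self.is_filled {
            // The slot about to be overwritten holds the oldest value.
            self.sum -= self.values[self.index];
        }
        self.values[self.index] = value;
        self.sum += value;
        self.index += 1;
        if self.index == self.values.len() {
            self.is_filled = true;
            self.index = 0;
        }
    }

    fn len(&self) -> usize {
        if self.is_filled { self.values.len() } else { self.index }
    }

    fn sum(&self) -> f64 {
        self.sum
    }

    fn clear(&mut self) {
        self.index = 0;
        self.is_filled = false;
        self.sum = 0.0;
    }
}
```

Updating the sum on every append makes the mean an O(1) computation (`sum() / len()`), at the cost of some floating-point drift over very long runs.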

Fields

Methods


Algorithms and Implementation Details


Interaction with Other Components


Visual Diagram

classDiagram
class FailureDetector {
- node_samples: HashMap<ChitchatId, SamplingWindow>
- config: FailureDetectorConfig
- live_nodes: HashSet<ChitchatId>
- dead_nodes: HashMap<ChitchatId, Instant>
+ new()
+ report_heartbeat()
+ update_node_liveness()
+ garbage_collect()
+ live_nodes()
+ dead_nodes()
+ scheduled_for_deletion_nodes()
- phi()
}
class FailureDetectorConfig {
+ phi_threshold: f64
+ sampling_window_size: usize
+ max_interval: Duration
+ initial_interval: Duration
+ dead_node_grace_period: Duration
+ new()
+ default()
}
class SamplingWindow {
- intervals: BoundedArrayStats
- last_heartbeat: Option<Instant>
- max_interval: Duration
- additive_smoothing: AdditiveSmoothing
+ new()
+ report_heartbeat()
+ reset()
+ phi()
}
class AdditiveSmoothing {
- prior_mean: f64
- prior_weight: f64
+ compute_mean()
}
class BoundedArrayStats {
- values: Box<[f64]>
- is_filled: bool
- index: usize
- sum: f64
+ with_capacity()
+ sum()
+ append()
+ clear()
+ len()
}
FailureDetector --> FailureDetectorConfig
FailureDetector --> SamplingWindow
SamplingWindow --> BoundedArrayStats
SamplingWindow --> AdditiveSmoothing