Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DoubleMAD outlier detector based on the Harrell-Davis quantile estimator #20

Open
FilipAndersson245 opened this issue Apr 4, 2024 · 1 comment

Comments

@FilipAndersson245
Copy link
Contributor

Hello, I read this blog post from the maintainers of BenchmarkDotNet,
It talks about an alternative and better suitable outlier detection for non normally distributed data. As benchmarks has high variance I thought it interesting even if I myself did not understand the exact math around it. Is this something Tango.rs may consider?

@FilipAndersson245 FilipAndersson245 changed the title Alternative outlier detection algorithm. DoubleMAD outlier detector based on the Harrell-Davis quantile estimator Apr 4, 2024
@FilipAndersson245
Copy link
Contributor Author

FilipAndersson245 commented Apr 6, 2024

I looked at your IQR and tried copy it to use a double MAD, I am not confident enough to try to implement the Harrell-Davis quantile estimator.
Double MAD perform better when we have unsymetric distributions, ex when we have long tails. in one direction.
https://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/

fn double_mad_thresholds(mut input: Vec<f64>) -> Option<RangeInclusive<f64>> {
    const C: f64 = 1.4826;
    const K: f64 = 3.0;

    input.sort_unstable_by(|a, b| a.partial_cmp(b).unwrap_or(Ordering::Equal));

    let m = median(&mut input);

    let x_l: Vec<f64> = input
        .iter()
        .filter(|v| **v <= m)
        .map(|v| f64::abs(*v - m))
        .collect();
    let x_u: Vec<f64> = input
        .iter()
        .filter(|v| **v >= m)
        .map(|v| f64::abs(*v - m))
        .collect();

    let mad_l = C * median(&x_l);
    let mad_u = C * median(&x_u);

    let lower = m - K * mad_l;
    let upper: f64 = m + K * mad_u;
    println!("lower: {}, upper: {}", lower, upper);

    // Calculating the indicies of the thresholds in an dataset
    let low_threshold_idx = match input[..].binary_search_by(|probe| probe.total_cmp(&lower)) {
        Ok(idx) => idx,
        Err(idx) => idx,
    };

    let high_threshold_idx = match input[..].binary_search_by(|probe| probe.total_cmp(&upper)) {
        Ok(idx) => idx,
        Err(idx) => idx,
    };

    println!(
        "low_threshold_idx: {}, high_threshold_idx: {}",
        low_threshold_idx, high_threshold_idx
    );

    Some(input[low_threshold_idx]..=(input[high_threshold_idx - 1]))
}

/// Calculate the median of a slice of data, expects the data to be sorted already.
fn median(data: &[f64]) -> f64 {
    let len = data.len();
    let mid = len / 2;

    if len % 2 == 0 {
        // Even number of elements
        (data[mid - 1] + data[mid]) / 2.0
    } else {
        // Odd number of elements
        data[mid]
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant