Skip to content

Latest commit

 

History

History
165 lines (121 loc) · 6.41 KB

Github-workflow-usage-instruction.md

File metadata and controls

165 lines (121 loc) · 6.41 KB

Documentation

This repo is accompanying the publication: “The Use of Computational Fingerprinting Techniques to Distinguish Sources of Accelerants Used in Wildfire Arson”.

Users need to first install R with this link and Rstudio with this link.

This workflow ran on Windows 11 OS 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz, 16 GB RAM;

THe RStudio version used in this demo is 2023.06.0+421 “Mountain Hydrangea” Release for Windows;

The R version used in this demo is 4.3.1

Data processing

First, the following R packages are installed and loaded in the global environment along with in-house built functions to minimize repetitiveness in the code.

Details about these functions can be found in Data processing.R file in this repo.

PCA

Gas versus Diesel

## [1] TRUE

Between groups of gas stations

## [1] TRUE

Multiple Wilcoxon tests with p-value correction for multiple testing

Research Question 1: Gas versus Diesel

Before running multiple Wilcoxon tests, it is recommended to examine whether criteria for univariate parametric test, such as t-test, are violated. If yes, then it is safe to proceed using non-parametric univariate test, such as Wilcoxon test.

First, equal variance and normally distributed between two populations, here are Gas and Diesel using histogram, Q-Q plots. Here, both Gas and Diesel populations are NOT normally distributed.

GasData <- as.vector(t(cat_5[,c(1:21)])) # 100149 data points
DieselData <- as.vector(t(cat_5[,c(22:25)])) # 19076 data point

# Histogram
hist(GasData, col='steelblue', main='Gas')

hist(DieselData, col='steelblue', main='Diesel')

# Q-Q plots aka. Normal Probability plots
stats::qqnorm(GasData, main='Gas')
stats::qqline(GasData)

stats::qqnorm(DieselData, main='Diesel')
stats::qqline(DieselData)

Then, equality of variance between Gas and Diesel populations can be examined using Levene’s test and Fligner-Killeen test for non-normally distributed data. Here, for both tests, p values are < 0.05, and thus, there is significant difference in variances between Gas and Diesel populations.

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    1  34.244 5.016e-09 ***
##       9948                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  data by group
## Fligner-Killeen:med chi-squared = 531.7, df = 1, p-value < 2.2e-16

Here, multiple Wilcoxon tests followed by p-value correction for multiple testing was done. Different method for p-value correction from function p.adjust from package stats were used. After p-value correction, the threshold for p-value can be set (for example,p.adjust < 0.05 or < 0.1), to see how it affects the number of significant compounds that can be found. Importantly, the threshold should be set so that it can include as many ASTM reference compounds in the list of significant compounds as possible. For example here, no significant compounds can be found with adjusted p-value < 0.05. But when adjusted p-value is < 0.1, 127 significant compounds are found.

## [1] "Number of significant compounds with adjusted p-value < 0.05 = 0"

## [1] "Number of significant compounds adjusted p-value < 0.1 = 127"

Research Question 2: Distinguishing samples from groups of gas stations

Here again, before running multiple Wilcoxon tests, it is recommended to examine whether criteria for univariate parametric test, such as t-test, are violated. If yes, then it is safe to proceed using non-parametric univariate test, such as Wilcoxon test.

First, equal variance and normally distributed between two populations, here are Gas and Diesel using histogram, Q-Q plots. Here, both Gas station group 1 (1, 3, 8) and Gas station group 2 (5, 7, 9) populations are NOT normally distributed.

Equality of variance between Gas station group 1 (1, 3, 8) and Gas station group 2 (5, 7, 9) populations can be examined using Levene’s test and Fligner-Killeen test for non-normally distributed data. Here, for both tests, p values are < 0.05, and thus, there is significant difference in variances between Gas station group 1 (1, 3, 8) and Gas station group 2 (5, 7, 9) populations.

## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value Pr(>F)
## group     1  0.6128 0.4338
##       11149

## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  data by group
## Fligner-Killeen:med chi-squared = 713.21, df = 1, p-value < 2.2e-16

Here, multiple Wilcoxon tests followed by p-value correction for multiple testing was done. Different method for p-value correction from function p.adjust from package stats were used. After p-value correction, the threshold for p-value can be set (for example,p.adjust < 0.05 or < 0.1), to see how it affects the number of significant compounds that can be found. Importantly, the threshold should be set so that it can include as many ASTM reference compounds in the list of significant compounds as possible. For example here, no significant compounds can be found with adjusted p-value < 0.05. But when adjusted p-value is < 0.1, 127 significant compounds are found.

## [1] "Number of significant compounds with adjusted p-value < 0.05 = 79"

## [1] "Number of significant compounds adjusted p-value < 0.1 = 89"