This script, provides tools for performing ANCOVA (Analysis of Covariance) and related statistical analyses. It includes a primary function, do_ancova
, which integrates multiple steps of ANCOVA analysis and allows for flexible customization of inputs and outputs, including graphical representations of results.
The package can be installed via:
Clone the repository and install it manually:
git clone https://github.com/GERMAN00VP/ANCOVA
cd ./ANCOVA
pip install .
Install it directly from PyPI:
pip install ANCOVA
python>=3.10
The script relies on the following Python packages:
numpy
pandas
statsmodels
scipy
seaborn
matplotlib
scikit_posthocs
Install these dependencies using:
pip install numpy pandas statsmodels scipy seaborn matplotlib scikit-posthocs
The main purpose of the do_ancova
function is to perform parametric or non-parametric ANCOVA on a dataset. It accepts a DataFrame containing the dependent variable, categorical variables, and covariates to evaluate the relationship between them while adjusting for covariates.
-
Parametric and Non-Parametric ANCOVA:
Automatically switches between parametric or ranked (non-parametric) ANCOVA depending on the assumptions of normality and homoscedasticity. -
Interaction Effects:
Allows inclusion of interactions between variables. -
Post-Hoc Analysis:
Automatically performs Tukey or Dunn post-hoc tests when significant differences are found between groups. -
Data Visualization:
Generates boxplots and scatterplots with regression lines, including statistical significance indicators. -
Customizable Options:
Users can customize interactions, colors, and plot details.
-
data
:
A pandas DataFrame containing:- Column 1: Dependent (response) variable.
- Column 2 (to n categories): Categorical independent variable(s).
- Remaining columns: Continuous covariates.
-
interactions
(Optional):
Specifies interactions between variables:"ALL"
: Includes all interactions.list
: List of tuples specifying interacting variables.
-
plot
(Default: False):
IfTrue
, generates a regression plot and a boxplot. -
save_plot
(Default: False):
If provided with a file path, saves the generated plots to the specified location. -
covariate_to_plot
(Optional):
Specifies the covariate to display in plots. -
palette
(Optional):
A dictionary mapping categorical levels to colors. -
categories
(Default: 1):
Number of categorical variables. -
ax
(Optional):
A Matplotlib axis for custom plotting. -
y_lab
(Optional): Label for the y-axis in the generated plot. Default is False (no label). -
x_lab
(Optional): Label for the x-axis in the generated plot. Default is False (no label). -
sum_of_squares_type
(Optional): Specifies the type of sums of squares for ANCOVA. Default is Type 2 (value = 2).
-
Results:
- A summary data frame with the ANCOVA parameters and outcomes.
- An ANCOVA table with p-values for each effect.
- Post-hoc results (if applicable).
-
Plots:
- Scatterplot with regression lines for covariates + Boxplot for main categorical copmpaisons.
- A Matplotlib axis with a Boxplot for categorical comparisons (allows customizing).
-
Files (Optional):
Saves plots to the specified file path ifsave_plot
is provided.
- Ensure that your dataset has the shape: Cases*Variables.
- The script assumes the columns are sorted like this: [Response variable, Main category to compare, Other categorical co-variables (optional), Other continous co-variables].
- For multiple categorical variables, specify the number using the categories parameter.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Charge the main function from our package
from Ancova_analysis import do_ancova
-
Number of T Cells: The number of T cells, which is affected by the individual's age and HIV status. Individuals with HIV+ (Untreated) have a significant reduction in T cells, while HIV+ (TAR Treatment) individuals have a minimal reduction compared to HIV- individuals.
-
HIV Status: A categorical variable representing the individual's HIV status. It can take three values:
-> HIV- (no HIV) -> HIV+ (TAR Treatment) (HIV positive, receiving treatment) -> HIV+ (Untreated) (HIV positive, not receiving treatment)
-
Sex: The individual's sex, either Male or Female.
-
Age: The individual's age, ranging from 20 to 70 years.
The Number of T Cells decreases with age, and the reduction is more significant for individuals with HIV+ (Untreated).
# Set the seed for reproducibility
np.random.seed(4)
# Number of samples
n = 150
# Categorical variables
sex = np.random.choice(['Male', 'Female'], size=n)
hiv_status = np.random.choice(['HIV-', 'HIV+ (TAR Treatment)', 'HIV+ (Untreated)'], size=n, p=[0.4, 0.3, 0.3])
# Covariate: Age
age = np.random.randint(20, 70, size=n)
# Generate T cell count
t_cells = []
for i in range(n):
base_t_cells = 1000 # General base for T cells
age_effect = -3 * (age[i] - 30) # Mild effect of age
if hiv_status[i] == 'HIV+ (Untreated)':
hiv_effect = -200 # Significant reduction for untreated
elif hiv_status[i] == 'HIV+ (TAR Treatment)':
hiv_effect = -30 # Minimal reduction for treated
else:
hiv_effect = 0 # No effect for HIV-
noise = np.random.normal(0, 50) # Random noise
t_cells.append(base_t_cells + age_effect + hiv_effect + noise)
# Define a palette to select the plotting colors for each category, else it would be randomly assigned
palette = {"HIV-":"skyblue",
"HIV+ (Untreated)":"salmon",
"HIV+ (TAR Treatment)":"orange"}
# Create the DataFrame
data_hiv = pd.DataFrame({
'Number of T Cells': np.round(t_cells).astype(int),
'HIV Status': hiv_status,
'Sex': sex,
'Age': age
})
data_hiv.head()
Lets see if the ANCOVA analysis is able to capture this differences:
# Run the main function and display the results
df_results, ancova_summary,post_hoc = do_ancova(data=data_hiv,
palette=palette,
categories=2, # HIV Status and Sex
interactions=[('HIV Status',"Age")], # Test the significance of the interaction of these variables
y_lab="CD4 T Cells (count)",# Set the y_label
plot=True, # Create the plot
save_plot= "./Images/ANCOVA_Regression_boxplot.png" # Sves the plot in that path
)
display(df_results)
display(ancova_summary)
display(post_hoc)
# Create two subplots in a row
fig, axs = plt.subplots(ncols=2,figsize=(12,6))
df_results, ancova_summary,post_hoc,ax= do_ancova(data=data_hiv,palette=palette,categories=2, y_lab="CD4 T Cells (count)",plot=True,
ax=axs[0] # When the axis is provided it returns the boxplot and can be integrated with other subplots as you wish
)
# Modify the df order to plot the sex differences
data_hiv_sex = data_hiv[['Number of T Cells','Sex','HIV Status','Age']]
df_results, ancova_summary,post_hoc,ax= do_ancova(data=data_hiv_sex,categories=2, y_lab="CD4 T Cells (count)",plot=True,
ax=axs[1], # The other subplot
)
# Save and show
plt.savefig("./Images/ANCOVA_two_boxplots.png",bbox_inches="tight")
plt.show()