- Feature importance: Apply machine learning to understand the most important features that determine the evolution of binary stellar systems into merging binary black holes. A dataset of 2 million instances of simulated binary stars is analysed.
Different techniques are used to infer feature importance:
- Weights analysis (Linear SVM)
- Mean decrease in impurity (Random Forest)
- Permutation importance
- Feature dropping
Since the first detections of gravitational waves (GWs) in 2015 by the LIGO-Virgo interferometers, the scientific community has made a huge effort to understand the processes underlying their emission. Mergers of binary black holes (BBHs) are considered to be the main sources of this kind of signal. The open question is then how such systems formed from their progenitors, the stellar binaries, so that a detected signal can be interpreted and linked to its astronomical source.
You can find the details of the theory behind this in the notebook.
We inspect the dataset for the possibility of dimensionality reduction by performing PCA. This does not directly help with the ultimate task of finding feature importance, but it gives an idea of how many dimensions are relevant in the covariance structure of the data. If PCA showed that dimensionality reduction is possible, it would be a good indicator that some features matter more than others; if instead all principal components were relevant, that would indicate that most features are important, in the sense that most of them contribute to the variance of the data.
Moreover, we observe that PCA is sensitive to the scaling of the data: it assigns high relevance to the largest-valued features (for instance, the ZAMS mass) while ignoring small-valued features, like the orbital eccentricity, even if they have physical importance.
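A minimal sketch of this check (assuming the full design matrix is stored in the DataFrame X and is standardised first, following the scaling remark above):
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardise so that large-valued features (e.g. ZAMS mass) do not dominate the variance
X_std = StandardScaler().fit_transform(X)

# Fit PCA on all components and inspect the cumulative explained-variance spectrum
pca = PCA().fit(X_std)
print(pca.explained_variance_ratio_.cumsum())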
We run some simple machine learning algorithms in order to extract the most important features in our dataset.
We compare a Linear SVM, a Random Forest, and a Perceptron. We use sklearn's built-in feature-importance methods, such as permutation importance and mean decrease in impurity (MDI). For the linear SVM we also interpret the coefficients found by the model to infer feature importance.
Finally, we implement a custom-made measure that compares the accuracy of the model trained on all the features against the accuracy of the model trained without the feature we want to assess.
def feat_omit_imp(model, X_train, y_train, X_test, y_test):
    # Drop one feature at a time, refit the model and compare with the full-feature baseline
    scores = []
    for col in X_train.columns:
        # Drop a single column from both the training and the test set
        ds_train, ds_test = X_train.drop(columns=col), X_test.drop(columns=col)
        # Refit the model without that feature
        model.fit(ds_train, y_train)
        # Score the classifier and keep the accuracy obtained when this column is omitted
        model_score = model.score(ds_test, y_test)
        scores.append([col, model_score])
    # Baseline: refit and score the model using all the features
    model.fit(X_train, y_train)
    baseline = model.score(X_test, y_test)
    # Importance: drop in accuracy (in percentage points) when each feature is removed
    dif = [100 * (baseline - acc) for _, acc in scores]
    return baseline, scores, dif
# Train-Test-Split.
X_train, X_test, y_train, y_test = train_test_split(X ,Y, train_size=0.5, test_size=0.5, random_state=None, shuffle=True)
# Normalization of the data: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
lin_svm = LinearSVC(penalty='l2',
                    dual=False,
                    C=0.1,
                    max_iter=100,
                    loss='squared_hinge',
                    fit_intercept=True,
                    verbose=False,
                    random_state=None)
lin_svm.fit(X_train,y_train)
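A minimal sketch of how the SVM weights can be read as an importance ranking (averaging the absolute coefficient values over classes is our own aggregation choice):
import numpy as np

# Weights analysis: the magnitude of each coefficient measures how strongly the
# (standardised) feature pushes the decision function
svm_importance = pd.Series(np.abs(lin_svm.coef_).mean(axis=0), index=X_train.columns)
svm_importance = svm_importance.sort_values(ascending=False)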
The ranking is shown in the following plot:
# Optimum hyperparameters from GridSearchCV: {'max_depth': 10, 'min_samples_leaf': 5, 'min_samples_split': 10, 'n_estimators': 75}
rfc = RandomForestClassifier(n_estimators=75,
                             criterion='gini',
                             max_depth=10,
                             min_samples_split=10,
                             min_samples_leaf=5,
                             min_weight_fraction_leaf=0.0,
                             max_features='sqrt',
                             max_leaf_nodes=None,
                             min_impurity_decrease=0.0,
                             bootstrap=True,
                             oob_score=False,
                             n_jobs=-1,
                             random_state=None,
                             verbose=False,
                             warm_start=False,
                             ccp_alpha=0.0)
rfc.fit(X_train,y_train)
To determine feature importance from the random forest we use three different methods (a minimal sketch follows the list):
- Feature dropping:
  Change in the model score when a single feature is removed
- Feature permutation:
  Decrease in the model score when a single feature is randomly shuffled
- Mean decrease in impurity (MDI):
  Total decrease in node impurity, weighted by the probability of reaching that node (approximated by the proportion of samples reaching that node), averaged over all trees of the ensemble
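A minimal sketch of the three measures, assuming the feat_omit_imp helper defined above and the standardised train/test split (the number of permutation repeats is an illustrative choice):
from sklearn.inspection import permutation_importance

# Feature dropping: re-use the custom helper, which leaves rfc refit on all features
baseline, drop_scores, drop_dif = feat_omit_imp(rfc, X_train, y_train, X_test, y_test)

# Feature permutation: mean decrease in test accuracy when one feature is shuffled
perm = permutation_importance(rfc, X_test, y_test, n_repeats=5, n_jobs=-1)
perm_importance = pd.Series(perm.importances_mean, index=X_test.columns)

# MDI: impurity-based importances accumulated while growing the trees
mdi_importance = pd.Series(rfc.feature_importances_, index=X_train.columns)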
mlp = MLPClassifier(hidden_layer_sizes=(1,),  # a single hidden unit, i.e. a perceptron-like network
                    activation='logistic',
                    solver='adam',
                    alpha=0.01,
                    batch_size=10,
                    learning_rate='adaptive',
                    learning_rate_init=1e-05,
                    # beta_1, beta_2 and epsilon are left at their Adam defaults
                    max_iter=40,
                    tol=0.001,
                    verbose=1,
                    early_stopping=True,
                    n_iter_no_change=3)
mlp.fit(X_train,y_train)
The coefficients of the Perceptron are used to infer feature importance.
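A minimal sketch of this reading of the network weights, treating the magnitude of the input-to-hidden weights of the single hidden unit as the importance of each feature (this interpretation is our assumption, not a built-in attribute of the fitted model):
import numpy as np

# Input-to-hidden weights of the single hidden unit, one weight per feature
mlp_importance = pd.Series(np.abs(mlp.coefs_[0][:, 0]), index=X_train.columns)
mlp_importance = mlp_importance.sort_values(ascending=False)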
## Results
We find that the mass ratio Q of the two stars is the most important feature, followed by the semi-major axis. This is consistent with the physical theory.