- Feature importance: Apply machine learning to understand the most important features that determine the evolution of binary stellar systems into merging binary black holes. A dataset of 2 million instances of simulated binary stars is analysed.
Different techniques are used to infer feature importance:
- Weights analysis (Linear SVM)
- Mean decrease in impurity (Random Forest)
- Permutation importance
- Feature dropping
Since the first detections of gravitational waves (GWs) in 2015 by the LIGO-Virgo interferometers, the scientific community has made a huge effort to understand the processes underlying their emission. Mergers of binary black holes (BBHs) are considered to be the main sources of this kind of signal. The open question is then how such systems formed from their progenitors, the stellar binaries, so that a detected signal can be interpreted and linked to its astronomical source.
You can find the details of the theory behind this in the notebook.
We inspect the dataset for the possibility of dimensionality reduction by performing PCA. This does not directly help with the ultimate task of finding feature importance, but it gives an idea of how many dimensions are relevant in the covariance structure of the data. If PCA showed that dimensionality reduction is possible, it would be a good indicator that some features matter more than others; if instead all principal components were relevant, that would indicate that most features are important, in the sense that most of them contribute to the variance of the data.
Moreover, we observe that PCA is sensitive to the scaling of the data: it assigns high relevance to the largest-valued features (for instance, the ZAMS mass) while ignoring small-valued features, like the orbital eccentricity, even if they have physical importance.
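A minimal sketch of this check (assuming the full design matrix is stored in the DataFrame X and is standardised first, following the scaling remark above):
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardise so that large-valued features (e.g. ZAMS mass) do not dominate the variance
X_std = StandardScaler().fit_transform(X)

# Fit PCA on all components and inspect the cumulative explained-variance spectrum
pca = PCA().fit(X_std)
print(pca.explained_variance_ratio_.cumsum())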
We run some simple machine learning algorithms in order to extract the most important features in our dataset.
We compare a Linear SVM, a Random Forest, and a Perceptron. We use sklearn's built-in feature-importance methods, such as permutation importance and mean decrease in impurity (MDI). For the linear SVM we also interpret the coefficients found by the model to infer feature importance.
Finally, we implement a custom-made measure that compares the accuracy of the model trained on all the features against the accuracy of the model trained without the feature we want to assess.
def feat_omit_imp(model, X_train, y_train, X_test, y_test):
    # Drop one feature at a time, refit the model and compare with the full-feature baseline
    scores = []
    for col in X_train.columns:
        # Drop a single column from both the training and the test set
        ds_train, ds_test = X_train.drop(columns=col), X_test.drop(columns=col)
        # Refit the model without that feature
        model.fit(ds_train, y_train)
        # Score the classifier and keep the accuracy obtained when this column is omitted
        model_score = model.score(ds_test, y_test)
        scores.append([col, model_score])
    # Baseline: refit and score the model using all the features
    model.fit(X_train, y_train)
    baseline = model.score(X_test, y_test)
    # Importance: drop in accuracy (in percentage points) when each feature is removed
    dif = [100 * (baseline - acc) for _, acc in scores]
    return baseline, scores, dif
# Train-Test-Split.
X_train, X_test, y_train, y_test = train_test_split(X ,Y, train_size=0.5, test_size=0.5, random_state=None, shuffle=True)
# Normalization of the data: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
lin_svm = LinearSVC(penalty='l2',
                    dual=False,
                    C=0.1,
                    max_iter=100,
                    loss='squared_hinge',
                    fit_intercept=True,
                    verbose=False,
                    random_state=None)
lin_svm.fit(X_train,y_train)
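A minimal sketch of how the SVM weights can be read as an importance ranking (averaging the absolute coefficient values over classes is our own aggregation choice):
import numpy as np

# Weights analysis: the magnitude of each coefficient measures how strongly the
# (standardised) feature pushes the decision function
svm_importance = pd.Series(np.abs(lin_svm.coef_).mean(axis=0), index=X_train.columns)
svm_importance = svm_importance.sort_values(ascending=False)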
The ranking is shown in the following plot:
# Optimum hyperparameters from GridSearchCV: {'max_depth': 10, 'min_samples_leaf': 5, 'min_samples_split': 10, 'n_estimators': 75}
rfc = RandomForestClassifier(n_estimators=75,
                             criterion='gini',
                             max_depth=10,
                             min_samples_split=10,
                             min_samples_leaf=5,
                             min_weight_fraction_leaf=0.0,
                             max_features='sqrt',
                             max_leaf_nodes=None,
                             min_impurity_decrease=0.0,
                             bootstrap=True,
                             oob_score=False,
                             n_jobs=-1,
                             random_state=None,
                             verbose=False,
                             warm_start=False,
                             ccp_alpha=0.0)
rfc.fit(X_train,y_train)
To determine feature importance from the random forest we use three different methods (a minimal sketch follows the list):
- Feature dropping:
  Change in the model score when a single feature is removed
- Feature permutation:
  Decrease in the model score when a single feature is randomly shuffled
- Mean decrease in impurity (MDI):
  Total decrease in node impurity, weighted by the probability of reaching that node (approximated by the proportion of samples reaching that node), averaged over all trees of the ensemble
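A minimal sketch of the three measures, assuming the feat_omit_imp helper defined above and the standardised train/test split (the number of permutation repeats is an illustrative choice):
from sklearn.inspection import permutation_importance

# Feature dropping: re-use the custom helper, which leaves rfc refit on all features
baseline, drop_scores, drop_dif = feat_omit_imp(rfc, X_train, y_train, X_test, y_test)

# Feature permutation: mean decrease in test accuracy when one feature is shuffled
perm = permutation_importance(rfc, X_test, y_test, n_repeats=5, n_jobs=-1)
perm_importance = pd.Series(perm.importances_mean, index=X_test.columns)

# MDI: impurity-based importances accumulated while growing the trees
mdi_importance = pd.Series(rfc.feature_importances_, index=X_train.columns)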
mlp = MLPClassifier(hidden_layer_sizes=(1,),  # a single hidden unit, i.e. a perceptron-like network
                    activation='logistic',
                    solver='adam',
                    alpha=0.01,
                    batch_size=10,
                    learning_rate='adaptive',
                    learning_rate_init=1e-05,
                    # beta_1, beta_2 and epsilon are left at their Adam defaults
                    max_iter=40,
                    tol=0.001,
                    verbose=1,
                    early_stopping=True,
                    n_iter_no_change=3)
mlp.fit(X_train,y_train)
The coefficients of the Perceptron are used to infer feature importance.
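A minimal sketch of this reading of the network weights, treating the magnitude of the input-to-hidden weights of the single hidden unit as the importance of each feature (this interpretation is our assumption, not a built-in attribute of the fitted model):
import numpy as np

# Input-to-hidden weights of the single hidden unit, one weight per feature
mlp_importance = pd.Series(np.abs(mlp.coefs_[0][:, 0]), index=X_train.columns)
mlp_importance = mlp_importance.sort_values(ascending=False)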
## Results
We find that the mass ratio Q of the two stars is the most important feature, followed by the semi-major axis. This is consistent with the physical theory.