The method implemented here is based on that presented in Logan & Fotopoulou 2020 (hereafter LF2020), which uses photometric data to classify objects into stars, galaxies and QSOs. Two potential uses of this method are:
- training the classifier on a training set with known labels using binary_classifier.py and classifier_consolidation.py, and then running the predict.py script on a new catalogue with unseen labels, to predict labels for a completely new dataset.
- given a training set with known labels, the binary_classifier.py and classifier_consolidation.py scripts will return a catalogue with labels predicted by HDBSCAN (the training labels are used only to judge its performance, not to dictate the clustering itself - i.e. it is semi-supervised). These new predicted labels can be compared to the original labels, and any disagreements may highlight previously misclassified sources. This is one advantage of using unsupervised learning.
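As a toy sketch of that comparison (the labels and lists below are invented for illustration; the actual code works on full catalogues):

```python
# Illustrative sketch (not from the repo): flag sources where the label
# predicted by the clustering disagrees with the original catalogue label.
original = ["STAR", "GALAXY", "QSO", "GALAXY"]   # labels from the catalogue
predicted = ["STAR", "QSO", "QSO", "GALAXY"]     # labels from HDBSCAN
disagree = [i for i, (o, p) in enumerate(zip(original, predicted)) if o != p]
# disagree -> [1]: these indices may flag previously misclassified sources
```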
One difference between this code and LF2020 is a reduced attribute selection step (used to select the input attributes for the HDBSCAN gridsearch). However, similar performance (see the second table in Example / Test Run below) can be reached with the method implemented here. We note that we only implement the method where colours are used as input, and do not include any half-light radius information.
To download the code repository, run the following in a terminal, in the directory where you want the code to be downloaded:

```shell
git clone https://github.com/crispinlogan/StarGalaxyQSOClassification.git
```
Requirements are given in the requirements.txt file. We used Anaconda, specifically Python 3.7.
The requirements can be installed either via pip:

```shell
pip install -r requirements.txt
```
or via conda (a new environment could be created first):

```shell
conda install -c conda-forge astropy=4.0.1.post1
conda install -c conda-forge hdbscan=0.8.26
conda install -c conda-forge seaborn=0.10.0
conda install -c conda-forge matplotlib=3.1.3
conda install -c conda-forge pandas=1.0.3
conda install -c conda-forge numpy=1.18.1
conda install -c conda-forge scikit-learn=0.23.0
```
Once the repo is cloned, you need to set the following variables in the `ConfigVars` class in the config file:

- `base_path` - this should be set to where the repo is downloaded, e.g. `/home/yourname/StarGalaxyQSOClassification/`.
- `n_jobs` - the number of cores on which to run. It is automatically set to 50% of available cores (with a minimum of 1) in the `ConfigVars` class. It can also be passed as an input argument when instantiating the `conf` object (e.g. calling `conf = ConfigVars(test=True, n_jobs=10)`), or set manually after instantiating the `conf` object (e.g. `conf.n_jobs = 10`) in main.py or any of the binary_classifier.py, classifier_consolidation.py or predict.py scripts.
- input catalogue - in the config file, the `catname` attribute needs to be changed to your training catalogue name, the `targetname` attribute should be set to the column name in your catalogue that holds the 'true' labels for the objects, the `hclass_dict` attribute links the numeric label values in the catalogue to the object names, and the `data_file` attribute uses `catname` to form the path to the catalogue itself.
- prediction catalogue - in the config file, the `catname_predicted` attribute needs to be changed to the name of the unseen catalogue that you want labels to be predicted for. It is used in the `data_file_predicted` variable, which is the path to this new catalogue.
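As an illustration, the `n_jobs` default described above (50% of available cores, with a minimum of 1) can be computed as follows; this is a sketch, not the repo's actual code:

```python
import os

# Sketch of the described default: half the available cores, but at least 1.
# (Illustrative only; the real logic lives in the repo's ConfigVars class.)
cores = os.cpu_count() or 1  # os.cpu_count() can return None
n_jobs = max(1, cores // 2)
```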
The input catalogues (for training and prediction) need to be in the same format as the example training catalogue (taken from LF2020) and the example prediction catalogue, which is a random sub-sample of 50,000 data points from the KiDSVW catalogue, also described in LF2020. Depending on the photometric bands available in your catalogue, the `photo_band_list` variable in the config file may need to be changed.
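Before running, it can be useful to verify that a catalogue actually contains the columns the config expects. The column names below are placeholders for illustration, not the repo's actual schema:

```python
import csv
import io

# Hypothetical required columns (placeholders; in practice, use the bands
# from photo_band_list plus the targetname column from your config).
required = {"u", "g", "r", "hclass"}

# Stand-in for opening your catalogue CSV and reading its header row.
catalogue = io.StringIO("u,g,r,i,hclass\n21.2,20.8,20.1,19.9,1\n")
header = set(next(csv.reader(catalogue)))

missing = required - header  # empty set when the catalogue has the expected columns
```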
Then you need to run:

```shell
python setup_directory_structure.py
```
Then you can either run the individual scripts, or run all three stages (binary_classifier.py, classifier_consolidation.py and predict.py) at once with:

```shell
python main.py
```
Outputs: The outputs from the code are in the `/data/output` directories. The specific directory structure is created upon running setup_directory_structure.py, and the directory names are:
- RF - outputs from the RF gridsearch (best RF setups, and best metrics).
- hdbscan_gridsearch - outputs from the HDBSCAN gridsearch (gridsearch labels data file, gridsearch performance summary data file, dendrograms, data points in PCA space).
- saved_models - saved models (scaler, PCA and HDBSCAN) trained on the training data, to later apply to new data.
- consolidation - outputs from consolidation (colour plots, metrics and summary of output labels, confusion matrices).
- prediction - outputs from the prediction stage (colour plot, catalogue with predicted labels, summary of output labels, data points in PCA space).
We provide the CPz.csv file (which is used in LF2020) to run the code on as an example. For the prediction stage, a random sample of 50,000 data points selected from the KiDSVW catalogue (also presented in LF2020) is likewise provided. This test can be run by setting `test_bool` in the main.py script to `True`.
For this test, the metrics obtained on the training catalogue (found in `/data/test_output`, which is created upon running `python setup_directory_structure.py`) should be as follows for the optimal consolidation method (note the performance is sub-optimal, as the test runs a very quick gridsearch):
| | F1 | Accuracy | Precision | Recall |
|---|---|---|---|---|
| Star | 0.9648 | 0.9892 | 0.9942 | 0.9371 |
| Galaxy | 0.98 | 0.9698 | 0.9828 | 0.9772 |
| QSO | 0.801 | 0.9702 | 0.9414 | 0.697 |
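In these tables, F1 is the harmonic mean of precision and recall, which can be checked directly, e.g. for the Star row above:

```python
# F1 is the harmonic mean of precision and recall; values from the Star row.
precision, recall = 0.9942, 0.9371
f1 = 2 * precision * recall / (precision + recall)
# round(f1, 4) -> 0.9648, matching the F1 column
```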
However, we note that the training set provided, when run not in test mode (i.e. with `test_bool` set to `False`), can achieve the following performance (again using the optimal consolidation method):
| | F1 | Accuracy | Precision | Recall |
|---|---|---|---|---|
| Star | 0.9853 | 0.9954 | 0.9933 | 0.9775 |
| Galaxy | 0.9867 | 0.9799 | 0.9838 | 0.9896 |
| QSO | 0.9152 | 0.986 | 0.9568 | 0.8771 |
The prediction stage is also run in the test setup, on the short_KiDSVW.csv catalogue, and there is a check in main.py that the output from the prediction matches what is expected. It is recommended to run this test to check that the scripts are working as expected on your computer, before running in the non-test setup on your own data.
Further documentation can be found here.
If you have any questions about the code, please write to: crispin.logan@bristol.ac.uk or sotiria.fotopoulou@bristol.ac.uk
If you have used this codebase, please add a footnote with the link to this GitHub repo (https://github.com/crispinlogan/StarGalaxyQSOClassification), and cite the LF2020 paper:
@ARTICLE{2020A&A...633A.154L,
author = {{Logan}, C.~H.~A. and {Fotopoulou}, S.},
title = "{Unsupervised star, galaxy, QSO classification. Application of HDBSCAN}",
journal = {\aap},
year = 2020,
month = jan,
volume = {633},
pages = {A154},
doi = {10.1051/0004-6361/201936648}
}