Skip to content

Repository for exploring machine learning algorithms for quasar selection/classification.

License

Notifications You must be signed in to change notification settings

gtrichards/QuasarSelection

Repository files navigation

QuasarSelection

Repository for exploring machine learning algorithms for quasar selection/classification. These are intended for analysis of SDSS+SpIES data, but can serve as an introduction for analysis of other data sets (e.g., DES, LSST, etc.) for those looking for a template to follow. These are largely "sand boxes", so apologies if they aren't fully coherent. Though I have gone through them to try to make sure that they have some longer-term value.

SpIESHighzCandidateSelection.ipynb
is a notebook that takes the training sets that we used for KDE selection of 3.5<z<5 quasars from SDSS+SpIES data and does color-based self classifation using other machine learning algorithsm from Scikit-Learn. KDE was about 78% complete and 97% efficient. The sklearn algorithms were about the same, which gives confidence that the sample is not highly contaminated (which isn't to so that it isn't contaminated at all) and that we could replace our usage of the proprietary KDE algorithm with public algorithms from sklearn. I looked at neural networks, SVM, random forests, bagging, and boosting. While bagging has the highest efficiency, it also takes considerably longer than random forests, which provided the best overall solution (in terms of completeness, efficiency and computing time). The results of this get applied to the test set in SpIESHighzQuasars.ipynb.

SpIESHighzCandidateSelection2.ipynb
is similar to the notebook above, but here I was exploring what would happen if we added i-band magnitude and u-band extinction as classification attributes. I also explored some different combinations of selection algorithms. I found that bagging including i and extinctu has less contamin than anything else. SpIESHighzQuasars2.ipynb is the notebook where I explore these changes in selection.

SpIESHighzQuasars.ipynb
is a notebook that applies the RF classifier from SpIESHighzCandidateSelection.ipynb to the test set. This uses colors only. I also explored with bagging, but that crashes and later led to trying just to classify Stripe 82 objects with stand-alone python code, running on the server. See SpIESHighzQuasarsS82all.py. But this notebook does successfully produce output classifications for the test set using the RF classifier. The output is in GTR-ADM-QSO-ir_good_test_2016_out.fits.

SpIESHighzQuasars.ipynb
is a notebook that applies the SVM and bagging classifiers from SpIESHighzCandidateSelection2.ipynb to the test set. I also tried to create a new version of the test set that included i-band magnitude and u-band extinction, but I ran into all sorts of problems. (Though it may be worth trying again). In the end, I ended up writing a stand-alone script SpIESHighzQuasarsS82all.py that just classifies the objects in Stripe 82. However, it does so using RF, SVM, and bagging. That way we can compare the results and find out which produces the least contamination as that is important for our clustering analysis.

SpIESHighzQuasarsS82all.py
See note above. Outputs GTR-ADM-QSO-ir_good_test_2016_out_Stripe82all.fits.

SpIESHighzQuasarPhotoz.ipynb
The above notebooks just do selection, now we need to do photo-$z$ on the output files with the quasar canddiates. Here I play with Nadaraya-Watson, Random Forest, Gaussian Process Regression, SVM, and Stochastic Gradient Descent. The first two are good, the next two crash, and the last one is terrible. Just doing self classification of the training set here.

SpIESHighzQuasarPhotoz2.ipynb
Now we apply the best photo-z algorithms to the test data, specifically quasar candidates. Right now just sticking to the Stripe 82 quasar candidates from GTR-ADM-QSO-ir_good_test_2016_out_Stripe82all.fits as the Nadaraya-Watson algorithm isn't particularly speedy. Output is GTR-ADM-QSO-ir_good_test_2016_out_Stripe82all_zphot.fits.

Included quasar candidates file, which includes all of the test object, not just the quasar candidates:
GTR-ADM-QSO-ir_good_test_2016_out_Stripe82all.fits.bz2
Actually, I couldn't because it exceeded the file size limit. So ask me if you want it!

Included the quasar candidate file limited to quasar candidates only with two photo-z outputs:
GTR-ADM-QSO-ir_good_test_2016_out_Stripe82all_zphot.fits

This work was supported in part by NSF CDS&E grant 1411773.

About

Repository for exploring machine learning algorithms for quasar selection/classification.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published