New methods #105
SGP should be placed in a new module/package, like in scikit-protopy.
@chkoar What would be the reason to disassociate prototype generation from over-sampling?
Actually none. Just for semantic reasons. Obviously, prototype generation methods could be considered as over-sampling methods.
@glemaitre actually, over-sampling is different from prototype generation. Prototype selection: picks a subset of the original samples. Prototype generation: creates new, artificial samples that represent the original set. Both are usually applied to the majority class to reduce it, whereas over-sampling grows the minority class.
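For illustration, here is a tiny numpy/scikit-learn sketch of that contrast; random selection and k-means centroids are only stand-ins for real PS/PG algorithms:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.randn(100, 2)  # toy samples from a single class

# Prototype selection: keep a subset of the ORIGINAL samples.
keep = rng.choice(len(X), size=10, replace=False)
X_selected = X[keep]  # every row already existed in X

# Prototype generation: build NEW points that summarise X.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
X_generated = km.cluster_centers_  # artificial points, generally not rows of X
```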
Thanks for the clarification @dvro. That could be placed in the wiki!
Hi, if by SPIDER you mean the algorithms from "Selective Pre-processing of Imbalanced Data for Improving Classification Performance" and "Learning from imbalanced data in presence of noisy and borderline examples", maybe I could be of some help. I know the authors, and maybe I could implement a Python version of this algorithm with their "supervision"? That might be "safer" than using only the pseudo-code from the conference papers.
Yes, it is this article. We would be happy to have a PR on that. The only important thing is to follow the scikit-learn conventions regarding the estimator API.
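For reference, a minimal sampler skeleton following that convention could look like the sketch below; the class body is a placeholder, and base-class details may vary between imbalanced-learn versions:

```python
from imblearn.base import BaseSampler

class SPIDER(BaseSampler):
    """Skeleton only: the resampling logic below is a placeholder."""

    _sampling_type = "clean-sampling"  # tells imblearn how to validate sampling_strategy

    def __init__(self, sampling_strategy="auto", n_neighbors=3):
        super().__init__(sampling_strategy=sampling_strategy)
        self.n_neighbors = n_neighbors  # hyper-parameters stored untouched in __init__

    def _fit_resample(self, X, y):
        # the actual SPIDER cleaning/amplification logic would go here;
        # returning the data unchanged as a placeholder
        return X, y
```

Users would then call `SPIDER().fit_resample(X, y)` like any other sampler.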
MetaCost could be a nice addition.
Yep. You can add it to the previous list.
Hi,
@glemaitre do you think that we should have requirements, e.g. number of citations, before we merge an implementation into the package?
I would say no. This is something that scikit-learn is doing, but the contribs are here to give some freedom regarding that and to have bleeding-edge estimators. I would just require that the estimator show some advantage on some benchmark, so that we can explain to users when to use it.
@glemaitre I was thinking of asking @mwydmuch to include a comparison with the
Yes, regarding the dependencies, we are limiting ourselves to numpy/scipy/scikit-learn. Then we can see if we can vendor code, but it should be avoided as much as possible. Regarding the comparison, it is a bit my point in making a benchmark. I need to fix #360 in fact :)
A new one: Sharifirad, S., Nazari, A., & Ghatee, M. (2018). Modified SMOTE using mutual information and different sorts of entropies. arXiv preprint arXiv:1803.11002. Includes MIESMOTE, MAESMOTE, RESMOTE and TESMOTE. Since SMOTE is mostly a meta-algorithm that interpolates new samples, with a strategy that changes depending on the author, would it be possible to implement a generic SMOTE model where the user can provide a custom function to make their own version of SMOTE? This might also ease the writing (and contribution) of new SMOTE variants. A sketch of this idea follows.
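As a rough illustration of that proposal (this is not an existing imbalanced-learn API; `generic_smote` and `linear_interp` are made-up names):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def generic_smote(X_min, n_samples, interpolate, k=5, random_state=0):
    """SMOTE as a meta-algorithm: the neighbour search is fixed, while the
    interpolation strategy is supplied by the caller."""
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # column 0 is the point itself
    new = []
    for _ in range(n_samples):
        i = rng.randint(len(X_min))        # pick a random minority sample
        j = idx[i, rng.randint(1, k + 1)]  # and one of its k nearest neighbours
        new.append(interpolate(X_min[i], X_min[j], rng))
    return np.asarray(new)

# vanilla SMOTE is then just one particular interpolation function
linear_interp = lambda x, neighbor, rng: x + rng.uniform() * (neighbor - x)
```

Each published variant would then only have to supply its own `interpolate` (and possibly its own neighbour selection).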
Hi Marek, Thank you. Haleem
Hello, I am writing because in the current use case I am working on, we would love to have a certain over-sampling feature, yet it is not implemented anywhere, so I would like to propose it here. We are building an NLP model for binary classification, where one of the classes is strongly imbalanced. One of the approaches would be to over-sample using data augmentation techniques for NLP, e.g. using the nlpaug library to replace some words with synonyms. Having a class in the library that allows packaging the augmentation into an sklearn pipeline would be great! I can also see this being used in computer vision. Let me know what you think: whether this could become one of the features in this library, in which case I would love to contribute, and if it doesn't fit into this library, do you know any other open-source project where it would fit? Cheers,
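Something close to this is already expressible with imbalanced-learn's `FunctionSampler`; a sketch, where `augment_text` is a hypothetical stand-in for a real augmenter such as nlpaug's synonym replacement:

```python
import numpy as np
from collections import Counter
from imblearn import FunctionSampler

def augment_text(doc):
    # hypothetical stand-in for a real augmentation, e.g. synonym replacement
    return doc + " (augmented)"

def oversample_with_augmentation(X, y):
    """Balance classes by appending augmented copies of minority documents;
    X is a 1-d object array of raw strings."""
    rng = np.random.RandomState(0)
    counts = Counter(y)
    target = max(counts.values())
    X_new, y_new = list(X), list(y)
    for label, n in counts.items():
        pool = [doc for doc, lab in zip(X, y) if lab == label]
        for _ in range(target - n):  # zero iterations for the majority class
            X_new.append(augment_text(pool[rng.randint(len(pool))]))
            y_new.append(label)
    return np.array(X_new, dtype=object), np.array(y_new)

# validate=False lets the sampler pass raw strings through untouched
sampler = FunctionSampler(func=oversample_with_augmentation, validate=False)
```

The resulting sampler can sit in an `imblearn.pipeline.Pipeline` before the vectorisation step.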
Not sure if this is the right place, but for my work I implemented a custom version of SMOTE for regression as described in this paper: Torgo, L., Ribeiro, R.P., Pfahringer, B., Branco, P. (2013). SMOTE for Regression. In: Correia, L., Reis, L.P., Cascalho, J. (eds) Progress in Artificial Intelligence. EPIA 2013. Lecture Notes in Computer Science, vol 8154. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40669-0_33 As mentioned in the original post, it would be nice to get SMOTE for Regression into imbalanced-learn.
Great ideas!
Can you share a link to the code?
Is the code shared so far?
Not yet! Thank you. Haleem
@beeb actually they call it imbalanced regression, but in my view it is not. The whole thing they call utility-based learning, and the key point is the utility function that is used, right? In any case, you can draft an implementation and we can talk about it.
Here is the code of the original paper, which is also what I took as inspiration for my modified implementation: https://rdrr.io/cran/UBL/man/smoteRegress.html
I'm not sure what you are saying. It's SMOTE, but they use a function to determine whether a data point is common or "rare" depending on how far from the mean of the distribution it falls (kind of: I used the extrema of the whiskers of a box plot as the inflection points for a CubicHermiteSpline that defines "rarity"; I think they also do this in the original code). Then they oversample those points by selecting a random nearest neighbour and computing the new sample in between (just like SMOTE); the difference is that the label value for the new point is a weighted average of the labels of the two parents. A rough sketch of both ingredients follows.
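A sketch of those two ingredients, assuming whisker-based control points for the relevance spline (my reading of the approach, not the reference implementation):

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

def make_relevance(y):
    """Box-plot based relevance: 0 at the median, 1 at or beyond the whiskers
    (assumes a non-degenerate IQR so the control points are increasing)."""
    q1, med, q3 = np.percentile(y, [25, 50, 75])
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    spline = CubicHermiteSpline([lo, med, hi], [1.0, 0.0, 1.0], [0.0, 0.0, 0.0])
    return lambda t: np.clip(spline(np.clip(t, lo, hi)), 0.0, 1.0)

def smoter_sample(x_a, y_a, x_b, y_b, rng):
    """Interpolate between a rare sample and one of its neighbours; the new
    target is a distance-weighted average of the parents' targets."""
    x_new = x_a + rng.uniform() * (x_b - x_a)
    d_a, d_b = np.linalg.norm(x_new - x_a), np.linalg.norm(x_new - x_b)
    w = d_b / (d_a + d_b) if d_a + d_b > 0 else 0.5  # closer parent weighs more
    return x_new, w * y_a + (1 - w) * y_b
```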
@beeb yep, I have read all their related work. Since they involve that utility function, to me it is not imbalanced regression but something like cost-sensitive/utility-based regression. Apart from my personal opinion, I think that this method still remains in the scope of the package, so I would love to see it implemented in imbalanced-learn.
Is there any interest from the maintainers in adding Localized Random Affine Shadowsampling (LoRAS)? To quote from the paper's abstract:
If there is interest in including it in the library, then I can prepare a PR. Reference:
Hey @zoj613 and @Sandy4321, please keep the discussion focused; it creates a lot of noise otherwise. @zoj613 I'm -1 on including it right now. We loosely follow scikit-learn's rule of thumb to keep the maintenance burden down: methods should be roughly 3 years old and have 200+ citations.
Fair enough. Keeping to the topic at hand, I submitted a PR at #789 implementing
I think that we should prioritize the SMOTE variants that we want to include. Basically, we could propose to implement the following:
Currently, we have SVM-/KMeans-/KNN-based SMOTE for historical rather than performance reasons. I think that we should probably make an effort regarding the documentation. Currently, we show the differences in how the methods sample (this is already a good point). However, I think that we should have a clearer guideline on which SMOTE variant works best for which application. What I mean is that SMOTE, SMOTENC, and SMOTEN might already cover a good basis.
@glemaitre are there any standard APIs to follow for the SMOTE variants?
Whenever possible, it should inherit from SMOTE; see the sketch below.
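For example, a variant could hook into SMOTE roughly like this; `CustomSMOTE` and `my_strength` are hypothetical, and `_fit_resample` is the internal resampling hook that imbalanced-learn samplers implement:

```python
from imblearn.over_sampling import SMOTE

class CustomSMOTE(SMOTE):
    """Reuses SMOTE's machinery and customises a single step."""

    def __init__(self, sampling_strategy="auto", random_state=None,
                 k_neighbors=5, my_strength=1.0):
        super().__init__(sampling_strategy=sampling_strategy,
                         random_state=random_state, k_neighbors=k_neighbors)
        self.my_strength = my_strength  # extra hyper-parameter, stored as-is

    def _fit_resample(self, X, y):
        # variant-specific neighbour selection/interpolation would go here;
        # delegating to the parent as a placeholder
        return super()._fit_resample(X, y)
```

Parameters are listed explicitly (no `**kwargs`) so that scikit-learn's `get_params`/`clone` keep working.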
This is a non-exhaustive list of the methods that can be added for the next release.
Oversampling:
Prototype Generation/Selection:
Ensemble:
Regression:
Branco, P., Torgo, L. and Ribeiro, R.P. (2016). A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, 49(2), 31. DOI: http://dx.doi.org/10.1145/2907070
Branco, P., Torgo, L. and Ribeiro, R.P. (2017). Pre-processing Approaches for Imbalanced Distributions in Regression. Special Issue on Learning in the Presence of Class Imbalance and Concept Drift, Neurocomputing Journal (submitted).