SequentialFeatureSelection Early Stopping Criterion #886

aldder wants to merge 6 commits into rasbt:master
Conversation
Thanks for the PR! I agree that overfitting can become an issue. Currently, there is the `k_features='parsimonious'` option, which will select the smallest feature set within 1 standard error of the best feature set, which helps with this. I like the idea of adding an early stopping criterion, but I have two suggestions:

1) I think that the two parameters (`early_stop` and `early_stop_rounds`) could be merged into one. What I mean is, instead of having

```python
if self.early_stop and k != k_to_select:
    if k_score <= best_score:
```

this could be simplified to

```python
if self.early_stop_rounds and k != k_to_select:
    if k_score <= best_score:
```

2) The second concern I have is what happens if a user selects a fixed number of features, e.g., `k_features=3`, and early stopping triggers first. I wonder if it makes sense to allow early stopping only when `k_features` is `'best'` or `'parsimonious'`.

What are your thoughts?
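To illustrate the merge-into-one suggestion above (a standalone sketch, not code from the PR; the function name is illustrative): since `0` is falsy in Python, a single integer parameter can act both as the on/off switch and as the round count, which makes a separate boolean `early_stop` flag redundant.

```python
def early_stopping_active(early_stop_rounds):
    """A single integer parameter doubles as an on/off switch:
    0 (falsy) disables early stopping, any positive count enables it
    and says how many non-improving rounds to tolerate."""
    return bool(early_stop_rounds)

print(early_stopping_active(0))  # False -> early stopping disabled
print(early_stopping_active(3))  # True  -> stop after 3 bad rounds
```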
Thanks for your suggestions, I agree with your points.
Sounds great! Looking forward to it!
```python
if not isinstance(early_stop_rounds, int) or early_stop_rounds < 0:
    raise ValueError('Number of early stopping round should be '
                     'an integer value greater than or equal to 0.'
                     'Got %d' % early_stop_rounds)
```
I think `...Got %d' % early_stop_rounds` might not work if `early_stop_rounds` is not an integer. Maybe it's better to replace it with `...Got %s' % early_stop_rounds`.
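A quick standalone illustration of this point (not code from the PR): `%d` requires a numeric argument, so if the user passed something non-numeric, formatting the error message would itself raise a `TypeError` instead of showing the offending value, while `%s` accepts any object.

```python
value = 'not-a-number'  # e.g., a user passes a string by mistake

# '%d' raises TypeError for non-numeric arguments, so the error
# message itself would crash before reaching the user:
try:
    message = 'Got %d' % value
except TypeError:
    message = 'formatting failed with TypeError'

print(message)            # formatting failed with TypeError
print('Got %s' % value)   # Got not-a-number
```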
```python
# early stop
if self.early_stop_rounds \
        and k != k_to_select \
        and self.k_features in {'best', 'parsimonious'}:
```
Instead of the check here, I would maybe change it to raising a ValueError at the top of the function if `self.early_stop_rounds` is set and `self.k_features` is not in `{'best', 'parsimonious'}`. This way the user is aware, and we only have to perform the check once.
@rasbt Do you prefer having this check at the top of the `fit` function or during initialization?
Ahh, totally lost track and missed your comment! Sorry! Regarding your question, I think `fit` might be better to keep it more consistent with scikit-learn behavior.
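A minimal sketch of the fit-time validation suggested in this thread (a hypothetical standalone helper, not code from the PR; names follow the discussion above):

```python
def check_early_stop_config(early_stop_rounds, k_features):
    """Validate the early-stopping configuration once, at the start
    of fit, instead of re-checking inside the selection loop."""
    if early_stop_rounds and k_features not in {'best', 'parsimonious'}:
        raise ValueError("early_stop_rounds is only supported when "
                         "k_features is 'best' or 'parsimonious'; "
                         "got k_features=%s" % repr(k_features))

check_early_stop_config(2, 'best')  # OK: no exception raised
```

Raising once up front also surfaces the misconfiguration to the user immediately, rather than silently skipping the early-stop branch.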
What is the status on this? I use `k_features="parsimonious"` on my model. I think this PR could get my runtime down from 10 days to hours ;-)
Thanks for the ping, and I need to look into this some time -- sorry, haven't had a chance recently due to too many other commitments. Unfortunately, I currently don't have a timeline for this.
Description
According to studies reported in papers (such as https://www.researchgate.net/publication/220804144_Overfitting_in_Wrapper-Based_Feature_Subset_Selection_The_Harder_You_Try_the_Worse_it_Gets), wrapper feature selection methods suffer from overfitting as the number of explored states increases.
A way to reduce this overfitting is to use an automatic stopping criterion (early stopping, best known from neural network training).
In this PR I have implemented an early stopping criterion for the SequentialFeatureSelector class.
One parameter, `early_stop_rounds`, has been added during instantiation of the object:
Code Example:
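The PR's original example is not reproduced here; the following standalone sketch illustrates the intended criterion on a list of per-step CV scores, such as sequential selection would produce (all names are illustrative; it assumes `early_stop_rounds=0` disables early stopping, matching the validation code quoted above):

```python
def select_with_early_stopping(step_scores, early_stop_rounds=0):
    """Walk through per-step CV scores and stop after
    `early_stop_rounds` consecutive steps without improvement;
    0 disables early stopping. Returns (best step, steps explored)."""
    best_score = float('-inf')
    best_step = None
    rounds_without_improvement = 0
    explored = 0
    for step, score in enumerate(step_scores):
        explored += 1
        if score > best_score:
            best_score, best_step = score, step
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if early_stop_rounds and rounds_without_improvement >= early_stop_rounds:
                break  # stop exploring further feature-set sizes

    return best_step, explored

# Scores peak at step 2, then degrade: with early_stop_rounds=2 the
# search stops after two non-improving steps instead of running to the end.
print(select_with_early_stopping([0.70, 0.80, 0.85, 0.84, 0.83, 0.82], 2))  # -> (2, 5)
```

With `early_stop_rounds=0` the same call explores all six steps, which is why fewer explored states can translate into the large runtime savings mentioned in the conversation above.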
Pull Request Checklist
- [ ] `./docs/sources/CHANGELOG.md` file (if applicable)
- [ ] `./mlxtend/*/tests` directories (if applicable)
- [ ] `mlxtend/docs/sources/` (if applicable)
- [ ] Run `PYTHONPATH='.' pytest ./mlxtend -sv` and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., `PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv`)
- [ ] `flake8 ./mlxtend`