Feature selection (variable selection)
Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction (Wikipedia)
Why feature selection?
- Data exploration
- Curse of dimensionality
- Fewer features, faster models
- Better metrics
- Overview
- An Introduction to Variable and Feature Selection (2003) Isabelle Guyon, Andre Elisseeff
- A Survey on Feature Selection (2016) Jianyu Miao, Lingfeng Niu
- Feature Selection: A Data Perspective (2016) Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, Huan Liu
- Feature Selection and Feature Extraction in Pattern Analysis: A Literature Review (2019) Benyamin Ghojogh, Maria N. Samad, Sayema Asif Mashhadi, Tania Kapoor, Wahab Ali, Fakhri Karray, Mark Crowley
- All-relevant vs minimal-optimal feature selection
Filter methods
Filter methods use model-free ranking to filter out less relevant features
- Missing Values Ratio
- Removing features with a ratio of missing values greater than some threshold
- Low Variance Filter (sklearn)
- Removing features with a variance lower than some threshold
- Correlation (Wiki)
- χ² Chi-squared statistic for categorical features (Wiki, sklearn)
- ANOVA F-value for quantitative features (Wiki, sklearn)
- Mutual information (Wiki)
- mRMR Minimum Redundancy Maximum Relevance (Link, Wiki)
- Relief (Wiki)
- Markov Blanket (Wiki)
- Fast Correlation-based Filter
- CBF Consistency-Based Filters
- Interact
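Several of the filters above are available in scikit-learn. A minimal sketch (the threshold and `k` values are illustrative choices, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, chi2, mutual_info_classif,
)

X, y = load_iris(return_X_y=True)

# Low variance filter: drop features whose variance is below a threshold
X_vt = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-squared ranking (requires non-negative feature values)
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# Mutual information ranking works for arbitrary numeric features
X_mi = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)

print(X.shape, X_vt.shape, X_chi2.shape, X_mi.shape)
```

Note that these filters score each feature independently of any downstream model, which is what makes them cheap but blind to feature interactions.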
Wrapper methods
Wrapper methods use a model and its performance to find the best feature subset
- SFS Sequential Feature Selection
- SFFS Sequential Floating Forward Selection
- Genetic algorithm (Wiki)
- PSO Particle Swarm Optimization (Wiki)
- Boruta All-relevant feature selection (CRAN, PyPI)
- MUVR (GitLab)
- Wrapper methods and overfitting
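Sequential Feature Selection is the simplest wrapper to try; scikit-learn ships an implementation. A sketch with an arbitrary estimator (the classifier and `n_features_to_select` are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward SFS: greedily add the feature that most improves CV accuracy
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```

Because the model is retrained for every candidate subset at every step, wrappers are far more expensive than filters, and the cross-validation inside the loop is what guards (imperfectly) against the overfitting mentioned above.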
Embedded methods
- LASSO
- Elastic net
- Spike and Slab regression (Wiki)
- Decision Tree (Wiki)
- Random Forest (Wiki)
- Random Forests (2001) Leo Breiman
- Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics (2012) Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, Inke R. König
- Variable selection using random forests (2010) Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot
- Bias in random forest variable importance measures: Illustrations, sources and a solution (2007) Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, Torsten Hothorn
- Conditional Variable Importance for Random Forests (2008) Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, Achim Zeileis
- Correlation and variable importance in random forests (2016) Baptiste Gregorutti, Bertrand Michel, Philippe Saint-Pierre
- Gradient Boosting (Wiki)
Unsupervised and semi-supervised feature selection
- FSSEM Feature Subset Selection using Expectation-Maximization
- Laplacian Score
- Principal Feature Analysis
- Spectral Feature Selection
- MCFS Multi-cluster Feature Selection
- Autoencoders (Wiki)
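Without labels, the methods above score features by how well they preserve the data's local geometry. The Laplacian Score, for instance, fits in a few lines of NumPy; `n_neighbors` and the heat-kernel bandwidth `t` below are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, n_neighbors=5, t=1.0):
    """Lower score = feature varies less across neighboring samples."""
    # Symmetrized kNN affinity matrix with a heat-kernel weighting
    A = kneighbors_graph(X, n_neighbors, mode="distance").toarray()
    W = np.exp(-(A ** 2) / t) * (A > 0)
    W = np.maximum(W, W.T)
    d = W.sum(axis=1)          # degree vector (diagonal of D)
    L = np.diag(d) - W         # graph Laplacian L = D - W
    scores = []
    for f in X.T:
        # Center the feature by its degree-weighted mean
        f_tilde = f - (f @ d) / d.sum()
        scores.append((f_tilde @ L @ f_tilde) / (f_tilde @ (d * f_tilde)))
    return np.array(scores)

X, _ = load_iris(return_X_y=True)
print(laplacian_score(X))
```

This is a sketch for intuition, not a reference implementation; dedicated packages (e.g. for MCFS or spectral feature selection) handle graph construction and normalization more carefully.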
Stable feature selection
Domain-specific
Packages
- R
- Package: fscaret (CRAN) Jakub Szlek
- Package: praznik (Code) Miron Kursa
- Package: FSinR (CRAN, Paper) Francisco Aragón-Royón, Alfonso Jiménez-Vílchez, Antonio Arauzo-Azofra, José Manuel Benítez
- Package: VSURF (CRAN, Paper)
- Package: spikeSlabGAM (Code, CRAN, Paper)
- Package: copent (CRAN, Code, Paper)
- Python
- Julia
- The main packages for ML in Julia are MLJ and Flux