hfs.SHSELSelector
- class hfs.SHSELSelector(hierarchy: Optional[ndarray] = None, relevance_metric: str = 'IG', similarity_threshold=0.99, use_hfe_extension=False, preprocess_numerical_data=False)
SHSEL feature selection method for hierarchical features.
This feature selection method was proposed by Ristoski and Paulheim in 2014. Selection proceeds in two stages: first, features whose parent has a similar relevance are removed; then, on each path from a leaf to the root, features with lower-than-average information gain are removed. This selector also implements the hierarchical feature engineering (HFE) extension proposed by Oudah and Henschel in 2018.
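The two stages described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names (`information_gain`, `initial_selection`, `prune_paths`), the child-to-parent dictionary, and the similarity formula `1 - |IG(child) - IG(parent)|` are all chosen here for illustration, and the toy data is made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def initial_selection(parent_of, gain, threshold):
    """Stage 1: drop every node whose relevance is similar to its parent's.

    Assumption: similarity is 1 - |IG(child) - IG(parent)|.
    """
    kept = set(gain)
    for child, parent in parent_of.items():
        if 1.0 - abs(gain[child] - gain[parent]) >= threshold:
            kept.discard(child)
    return kept


def prune_paths(parent_of, gain, kept):
    """Stage 2: on each leaf-to-root path, drop nodes with below-average gain."""
    leaves = set(gain) - set(parent_of.values())
    removed = set()
    for leaf in leaves:
        path, node = [], leaf
        while node is not None:
            if node in kept:  # only nodes that survived stage 1 count
                path.append(node)
            node = parent_of.get(node)
        if path:
            average = sum(gain[n] for n in path) / len(path)
            removed |= {n for n in path if gain[n] < average}
    return kept - removed


# Hypothetical toy hierarchy: A is the root, B and C are its children, D is B's child.
parent_of = {"B": "A", "C": "A", "D": "B"}
y = [1, 1, 0, 0]
features = {
    "A": [1, 1, 1, 0],
    "B": [1, 1, 0, 0],
    "C": [1, 0, 1, 0],
    "D": [1, 1, 0, 0],  # identical to its parent B, so stage 1 removes it
}

gain = {name: information_gain(column, y) for name, column in features.items()}
kept = initial_selection(parent_of, gain, threshold=0.99)
selected = prune_paths(parent_of, gain, kept)
print(sorted(selected))
```

In this toy case D is dropped in the first stage because it carries the same information as B, and A and C are dropped in the second stage because each falls below the average gain of a path it lies on.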
- __init__(hierarchy: Optional[ndarray] = None, relevance_metric: str = 'IG', similarity_threshold=0.99, use_hfe_extension=False, preprocess_numerical_data=False)
Initializes a SHSELSelector.
- Parameters
- hierarchy: np.ndarray
The hierarchy graph as an adjacency matrix.
- relevance_metric: str
The relevance metric to use in the initial selection stage of the algorithm. The options are “IG” for information gain and “Correlation”. Default is “IG”.
- similarity_threshold: float
The similarity threshold to use in the initial selection stage of the algorithm. This can be a number between 0 and 1. Default is 0.99.
- use_hfe_extension: bool
If True, the HFE algorithm proposed by Oudah and Henschel is used. Set relevance_metric to “Correlation” when using this extension. Default is False.
- preprocess_numerical_data: bool
If True, the data is preprocessed by adding up the child values. This preprocessing is used in the HFE extension algorithm, which expects numerical data. If binary data is used, it is recommended to set this parameter to False. Default is False.
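As a rough illustration of the hierarchy parameter, the sketch below builds an adjacency matrix from a list of edges using plain nested lists. The helper `adjacency_matrix`, the toy edge list, and the convention that row = parent and column = child are assumptions made for this example; check the edge direction the library expects before relying on it.

```python
def adjacency_matrix(n_nodes, edges):
    """Return an n_nodes x n_nodes 0/1 matrix with a 1 at [parent][child]."""
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for parent, child in edges:
        matrix[parent][child] = 1
    return matrix


# Hypothetical hierarchy with four nodes: 0 -> 1, 0 -> 2, 1 -> 3.
edges = [(0, 1), (0, 2), (1, 3)]
hierarchy_rows = adjacency_matrix(4, edges)

for row in hierarchy_rows:
    print(row)
```

Since the parameter is typed as an ndarray, the nested list would still need converting, for example with `numpy.array(hierarchy_rows)`, before being passed as `hierarchy`.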
- fit(X, y, columns=None)
Fitting function that sets self.representatives_.
The number of columns in X and the number of nodes in the hierarchy are expected to be the same, and each column should be mapped to exactly one node in the hierarchy with the columns parameter. After fitting, self.representatives_ contains the names of all nodes from the hierarchy that remain after feature selection. The features are selected by removing features with parents that have a similar relevance and removing features with lower than average information gain for each path from leaf to root.
- Parameters
- X: {array-like, sparse matrix}, shape (n_samples, n_features)
The training input samples.
- y: array-like, shape (n_samples,)
The target values. An array of int.
- columns: list or None, length n_features
The mapping from the hierarchy graph’s nodes to the columns in X. A list of ints. If this parameter is None, the columns in X and the corresponding nodes in the hierarchy are expected to be in the same order.
- Returns
- self: object
Returns self.
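The expectations that fit places on the columns parameter can be checked before fitting. The sketch below is a hypothetical helper, not part of hfs, and it assumes that `columns[i]` holds the index of the hierarchy node that column `i` of X belongs to; the text above does not spell out the direction of the mapping, so treat that as an assumption.

```python
def check_columns(n_features, n_nodes, columns=None):
    """Validate a column-to-node mapping and return it as a list of ints.

    Assumption: columns[i] is the hierarchy node index for column i of X.
    """
    if n_features != n_nodes:
        raise ValueError("X must have one column per hierarchy node")
    if columns is None:
        # Default: columns and nodes are in the same order.
        return list(range(n_features))
    if len(columns) != n_features:
        raise ValueError("columns must have length n_features")
    if sorted(columns) != list(range(n_nodes)):
        raise ValueError("each column must map to exactly one distinct node")
    return list(columns)


default_mapping = check_columns(3, 3)
custom_mapping = check_columns(3, 3, columns=[2, 0, 1])

try:
    check_columns(3, 3, columns=[0, 0, 1])  # node 0 used twice, node 2 missing
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(default_mapping, custom_mapping, duplicate_rejected)
```

With a validated mapping, a call following the documented signature would look like `SHSELSelector(hierarchy).fit(X, y, columns=custom_mapping)`.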