Introduction

As datasets grow, feature selection becomes increasingly important in machine learning, where it is a typical pre-processing step. Feature selection methods aim to improve predictive performance [PPZ+11a], for example by avoiding overfitting caused by noisy features, and may thus improve generalization. Furthermore, a smaller feature set can shorten training time, because the model has less data to process. Feature selection can also increase the explainability of models, since users can concentrate on the most important variables and their meaning for the model.

Many real-world settings contain hierarchical relations. In text mining, words can be ordered in generalization-specialization relationships [RP14], while in bioinformatics the function of genes is often described as a hierarchy, for example in the Gene Ontology (GO), which organizes terms in a directed acyclic graph. More information about the ontology can be found in Ashburner et al. [ABB+00].
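As a minimal sketch of what such a hierarchy looks like in code, the following example stores a directed acyclic graph as a mapping from each term to its direct parents and collects all ancestors of a term. The term names, the `PARENTS` mapping, and the `ancestors` helper are hypothetical and chosen only for illustration; they are not taken from any ontology or library.

```python
# Hypothetical toy hierarchy: each term maps to its direct parents.
# "toy_car" has two parents, so the structure is a DAG rather than a tree.
PARENTS = {
    "entity": [],
    "vehicle": ["entity"],
    "toy": ["entity"],
    "toy_car": ["vehicle", "toy"],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


print(sorted(ancestors("toy_car", PARENTS)))
```

Storing parent links rather than child links keeps the generalization lookup simple, which is the direction most often needed when reasoning about more general and more specific features.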

While there are implementations of hierarchical classification methods, for example in the HiClass library by Miranda, Köhnecke, and Renard [MKohneckeR21], few implementations of feature selection methods deal with hierarchical structures. Some hierarchical methods are implemented in the kgextension library <https://github.com/om-hb/kgextension> [1]_, but it does not provide an extensive collection. Most methods assume flat feature spaces without hierarchical dependencies. However, because the redundancy and relevance of features can be determined more precisely from hierarchical information :cite:p:`wanbook`, hierarchical feature selection aims to use this additional information to improve the selection step. Hence, we propose a library that implements methods presented by Wan and Freitas (among others Hierarchy Based Redundant Attribute Removal, see [WFdM15]), Ristoski and Paulheim [RP14], Oudah and Henschel [OH18], Wang et al. [WMAB02], Jeong and Myaeng [JM13], and Lu et al. [LYT+13].
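To illustrate how redundancy can be read off a hierarchy, the sketch below assumes binary features in which a value of 1 for a feature implies a value of 1 for all of its ancestors. Under that assumption, an ancestor's 1 adds nothing once a descendant is 1, and a descendant's 0 adds nothing once an ancestor is 0. The `ANCESTORS` mapping and the `implied_features` function are hypothetical names for this example. This is a simplified illustration of the general idea, not an implementation of any of the published methods listed above.

```python
# Hypothetical toy data: each feature maps to the set of all its ancestors.
ANCESTORS = {
    "entity": set(),
    "vehicle": {"entity"},
    "car": {"vehicle", "entity"},
    "bicycle": {"vehicle", "entity"},
}


def implied_features(instance, ancestors):
    """Return features whose value follows from another feature's value.

    Assumes binary features where a 1 at a node implies a 1 at all ancestors.
    """
    redundant = set()
    for feature, value in instance.items():
        for ancestor in ancestors[feature]:
            if value == 1 and instance[ancestor] == 1:
                # The ancestor's 1 is implied by the more specific feature.
                redundant.add(ancestor)
            elif value == 0 and instance[ancestor] == 0:
                # The descendant's 0 is implied by the more general feature.
                redundant.add(feature)
    return redundant


instance = {"entity": 1, "vehicle": 1, "car": 1, "bicycle": 0}
print(sorted(implied_features(instance, ANCESTORS)))
```

For this instance, only `car` and `bicycle` carry information that cannot be derived from the rest of the hierarchy.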

1

Knowledge Graph Extension for Python - https://github.com/om-hb/kgextension

References

ABB+00

M. Ashburner, C.A. Ball, Judith Blake, David Botstein, Heather Butler, and J. Cherry. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25:25–29, 01 2000.

CaMKa16

Konika Chawla and Martin Kuiper. Genes2go: a web application for querying gene sets for specific GO terms. Bioinformation, 12(3):231–232, June 2016. URL: https://doi.org/10.6026/97320630012231, doi:10.6026/97320630012231.

Con04

Gene Ontology Consortium. The gene ontology (GO) database and informatics resource. Nucleic Acids Research, 32(90001):D258–D261, January 2004. URL: https://doi.org/10.1093/nar/gkh036, doi:10.1093/nar/gkh036.

JM13

Yoonjae Jeong and Sung-Hyon Myaeng. Feature selection using a semantic hierarchy for event recognition and type classification. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, 136–144. 2013.

LYT+13

Sisi Lu, Ye Ye, Rich Tsui, Howard Su, Ruhsary Rexit, Sahawut Wesaratchakit, Xiaochu Liu, and Rebecca Hwa. Domain ontology-based feature reduction for high dimensional drug data and its application to 30-day heart failure readmission prediction. In 9th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing, 478–484. IEEE, 2013.

MKohneckeR21

Fábio Malcher Miranda, Niklas Köhnecke, and Bernhard Y. Renard. HiClass: a Python library for local hierarchical classification compatible with scikit-learn. CoRR, 2021. URL: https://arxiv.org/abs/2112.06560, arXiv:2112.06560.

OH18

Mai Oudah and Andreas Henschel. Taxonomy-aware feature engineering for microbiome classification. BMC bioinformatics, 19(1):1–13, 2018.

PPZ+11a

Rafael Pereira, Alexandre Plastino, Bianca Zadrozny, Luiz Merschmann, and Alex Freitas. Lazy attribute selection: choosing attributes at classification time. Intell. Data Anal., 15:715–732, 08 2011. doi:10.3233/IDA-2011-0491.

RP14

Petar Ristoski and Heiko Paulheim. Feature selection in hierarchical feature spaces. In Discovery Science: 17th International Conference, DS 2014, Bled, Slovenia, October 8-10, 2014. Proceedings 17, 288–300. Springer, 2014.

TCB+13

Robi Tacutu, Thomas Craig, Arie Budovsky, Daniel Wuttke, Gilad Lehmann, Dmitri Taranukha, Joana Costa, Vadim E Fraifeld, and João Pedro de Magalhães. Human ageing genomic resources: integrated databases and tools for the biology and genetics of ageing. Nucleic acids research, 41(Database issue):D1027–D1033, 2013.

WFdM15

Cen Wan, Alex Freitas, and Joao Pedro de Magalhaes. Predicting the pro-longevity or anti-longevity effect of model organism genes with new hierarchical feature selection methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12:262–275, 03 2015. doi:10.1109/TCBB.2014.2355218.

WMAB02

Bill B Wang, RI McKay, Hussein A Abbass, and Michael Barlow. Learning text classifier using the domain concept hierarchy. In IEEE 2002 International Conference on Communications, Circuits and Systems and West Sino Expositions, volume 2, 1230–1234. IEEE, 2002.

WMAB03

Bill B Wang, RI Bob Mckay, Hussein A Abbass, and Michael Barlow. A comparative study for domain ontology guided feature extraction. In Proceedings of the 26th Australasian computer science conference-Volume 16, 69–78. 2003.