Volume 80, Issue 7 (October 2022)                   Tehran Univ Med J 2022, 80(7): 546-562

Ghafouri T, Manavizadeh N. Modeling and design of a diagnostic and screening algorithm based on hybrid feature selection-enabled linear support vector machine classification. Tehran Univ Med J 2022; 80(7): 546-562
URL: http://tumj.tums.ac.ir/article-1-11963-en.html
1- Department of Electrical and Electronic Engineering, Nanostructured-Electronic Devices Laboratory, Faculty of Electrical Engineering, K. N. Toosi University of Technology, Tehran, Iran.
2- Department of Electrical and Electronic Engineering, Nanostructured-Electronic Devices Laboratory, Faculty of Electrical Engineering, K. N. Toosi University of Technology, Tehran, Iran, manavizadeh@kntu.ac.ir
Abstract:
Background: This study applies a hybrid feature selection approach, combining filter and wrapper methods, to several bioscience databases that differ in their numbers of records, attributes, and classes. The hybrid strategy brings together the advantages of both methods: fast execution, generality, and accuracy. The aim is to diagnose disease status and to estimate patient survival.
Methods: The feature selection algorithms were modeled in Matlab R2021a during April and May 2022 within the framework of statistical pattern recognition. First, the features are ranked by normalized mutual information, a metric of feature relevance and redundancy; the feature subset that yields the highest classification accuracy is then selected. Two feature selection algorithms, namely inclusion of features that enhance the classification accuracy and exclusion of irrelevant features, are applied to the datasets of interest after mini-batch k-means clustering of the records.
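The ranking and inclusion steps described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' Matlab implementation. It assumes discrete features, assumes the normalization I(X;Y) / sqrt(H(X) H(Y)) (the abstract does not state which normalization is used), and leaves the classifier out entirely: `evaluate` is a hypothetical callback that would return the accuracy of, for example, a linear SVM trained on the given feature subset. All function names are illustrative.

```python
from collections import Counter
from math import log2, sqrt


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) simplifies to count_xy * n / (count_x * count_y)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def normalized_mi(xs, ys):
    """Assumed normalization: I(X;Y) / sqrt(H(X) * H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0  # a constant variable carries no information
    return mutual_information(xs, ys) / sqrt(hx * hy)


def rank_features(columns, labels):
    """Filter step: order feature names by relevance to the class labels."""
    scores = {name: normalized_mi(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def forward_inclusion(ranked, evaluate):
    """Wrapper step: keep a feature only if it improves the score.

    `evaluate` is a hypothetical callback returning the classification
    accuracy obtained with a given list of feature names.
    """
    selected, best = [], float("-inf")
    for name in ranked:
        score = evaluate(selected + [name])
        if score > best:
            selected.append(name)
            best = score
    return selected, best
```

Because the loop stops adding features once the score no longer improves, the size of the selected subset is determined by the data rather than fixed in advance.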
Results: After both feature selection methods are run, the evaluation metrics (accuracy, precision, recall, and F1 score) are measured and compared. For the molecular biology, hepatitis C virus (HCV), and E. coli bacteria datasets, both proposed feature selection approaches yield precision and recall scores above 98 percent, meaning that the linear support vector machine (LSVM) classification produces few false positives and false negatives. For the HCV dataset, selecting nine relevant features out of the thirteen available with the feature exclusion method yields a classification accuracy of 98.92 percent and an F1 score of 99.02 percent. The feature inclusion approach gives a slightly lower accuracy of 98.78 percent.
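For reference, the four metrics named above follow directly from the confusion-matrix counts. The sketch below covers the binary case in Python. The function name and the example counts are illustrative assumptions, not values from the study, and a multi-class dataset would need a per-class or averaged variant.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Undefined ratios are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # few false positives -> high
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # few false negatives -> high
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only (not taken from the study)
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```

High precision corresponds to few false positives and high recall to few false negatives, which is why scores above 98 percent on both indicate that both error types are rare.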
Conclusion: The results demonstrate the strength of the feature selection methods used here for life science datasets with higher-order features, such as protein/gene expression databases. Their potential to generalize to other classifiers and to determine the optimal number of features automatically during feature selection makes these approaches flexible for many data mining applications in the life sciences.
Type of Study: Original Article | Subject: Endocrinology

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

© 2023, Tehran University of Medical Sciences, CC BY-NC 4.0
