Feature Selection in Supervised Learning Problems using Data Mining Approach
- Abstract
- When analyzing massive high-dimensional data, it is desirable to identify the few important features that affect an outcome of interest, which is the task of feature selection methods. Feature selection becomes even more important when the number of features far exceeds the number of observations, a setting in which traditional statistical methods are infeasible. This study presents quantitative and qualitative results of applying feature selection methods in two case studies and compares their performance on each: multivariate calibration for determining soil carbonate content (a regression problem) and cancer prediction from gene expression data (a classification problem). The feature selection methods compared are the Least Angle Regression algorithm (LARS), the Least Absolute Shrinkage and Selection Operator (Lasso), the Genetic Algorithm (GA), and classical methods such as forward and stepwise selection. The subset selected by each method is used as the input to Support Vector Machines (SVM) for supervised modeling. Root Mean Square Error of Prediction (RMSEP), Mean Squared Error (MSE), Prediction Error (PE), and Area Under the Curve are the quantitative criteria used to compare the methods. Because the number of samples is small, bootstrapping is also applied when building the models in order to obtain more reliable results. The results of case study 1 show that LARS and Lasso are highly capable of extracting the features that are effective for carbonate content determination and of selecting the truly relevant features (wavenumbers); the lowest prediction errors (RMSEP, MSE, and PE) were obtained with the subset selected by LARS. In case study 2, LARS and Lasso likewise show high accuracy in predicting prostate cancer.
- Author(s)
- PIRASTEH FARNAZ
- Issued Date
- 2015
- Awarded Date
- 2015. 8
- Type
- Dissertation
- Publisher
- Pukyong National University
- URI
- https://repository.pknu.ac.kr:8443/handle/2021.oak/12560
http://pknu.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000002069314
- Affiliation
- Pukyong National University
- Department
- Graduate School, Department of Chemical Engineering
- Advisor
- 유준
- Table Of Contents
- LIST OF TABLES
LIST OF FIGURES
Abstract
CHAPTER 1. INTRODUCTION
CHAPTER 2. METHODS
2.1 Feature Selection Methods
2.1.1 Forward Selection
2.1.2 Stepwise Selection
2.1.3 Least Absolute Shrinkage and Selection Operator (Lasso)
2.1.4 Least Angle Regression (LARS)
2.1.5 Genetic Algorithm
2.2 Support Vector Machines
2.2.1 SVM in Regression problems
2.2.2 SVM in Classification Problems
2.3 Bootstrapping
CHAPTER 3. PROBLEM DESCRIPTION
3.1. Case study I: Soil Carbonate Content Prediction
3.1.1 Sampling
3.1.2 FT-IR & XRD data
3.1.3 Performance evaluation of feature selection methods
3.2. Case Study II: DNA Microarray Gene Expression Data for Diagnosing Prostate Cancer
3.2.1 Performance evaluation of feature selection methods
CHAPTER 4. RESULTS AND CONCLUSION
4.1 Results for Case Study I: FT-IR & XRD Data
4.2 Results for Case Study II: Gene Expression Data
CHAPTER 5. CONCLUSION
REFERENCES
- Degree
- Master
Appears in Collections:
- Graduate School of Industry > Department of Applied Chemical Engineering