PUKYONG

Feature Selection in Supervised Learning Problems using Data Mining Approach

Abstract
When analyzing massive high-dimensional data, it is desirable to use feature selection methods to identify the few important features that affect an outcome of interest. Feature selection plays an even more important role when the number of features far exceeds the number of observations, which makes traditional statistical methods infeasible for data analysis. This study presents quantitative and qualitative results of applying feature selection methods in two case studies: multivariate calibration for determining soil carbonate content (a regression problem) and cancer prediction using gene expression data (a classification problem), and compares the methods' performance on each case study. The feature selection methods compared are Least Angle Regression (LARS), the Least Absolute Shrinkage and Selection Operator (Lasso), the Genetic Algorithm (GA), and classical methods such as forward and stepwise selection. The subset selected by each method is used as the input to Support Vector Machines (SVM) for supervised modeling. Root Mean Square Error of Prediction (RMSEP), Mean Squared Error (MSE), Prediction Error (PE), and Area Under the Curve (AUC) are the quantitative criteria used to compare the methods. Because the number of samples is small, bootstrapping is also applied when building models in order to obtain more reliable results. The results of case study 1 show that LARS and Lasso are highly capable of extracting effective features for carbonate content determination and of selecting the largest number of true features (wavenumbers), while the lowest prediction errors (RMSEP, MSE, and PE) were obtained with the subset selected by LARS. In case study 2, LARS and Lasso also show high accuracy in predicting prostate cancer.
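As a rough illustration of two of the evaluation ideas named in the abstract, the sketch below computes RMSEP and estimates it with bootstrap resampling on out-of-bag observations. It is not the thesis's procedure: the abstract does not state the exact bootstrap scheme, so the out-of-bag design is an assumption, and a one-feature least-squares line (the hypothetical helper `fit_line`) stands in for the SVM model. All function and parameter names are made up for this example.

```python
import math
import random


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over paired observations."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def fit_line(x, y):
    """Least-squares fit of y = a + b*x (assumed stand-in for the SVM model)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0  # degenerate resample: predict the mean
    return mean_y - slope * mean_x, slope


def bootstrap_rmsep(x, y, n_resamples=200, seed=0):
    """Average out-of-bag RMSEP over bootstrap resamples of the observations."""
    rng = random.Random(seed)
    n = len(x)
    scores = []
    for _ in range(n_resamples):
        # Draw n observations with replacement to build the model.
        drawn = [rng.randrange(n) for _ in range(n)]
        drawn_set = set(drawn)
        # Observations never drawn form the out-of-bag evaluation set.
        out_of_bag = [i for i in range(n) if i not in drawn_set]
        if not out_of_bag:
            continue
        a, b = fit_line([x[i] for i in drawn], [y[i] for i in drawn])
        scores.append(
            rmsep([y[i] for i in out_of_bag], [a + b * x[i] for i in out_of_bag])
        )
    return sum(scores) / len(scores), scores
```

Averaging over many resamples is what makes the error estimate less sensitive to any single train/test split, which matters when only a few samples are available.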
Author(s)
PIRASTEH FARNAZ
Issued Date
2015
Awarded Date
2015. 8
Type
Dissertation
Publisher
Pukyong National University
URI
https://repository.pknu.ac.kr:8443/handle/2021.oak/12560
http://pknu.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000002069314
Affiliation
Pukyong National University
Department
Graduate School, Department of Chemical Engineering
Advisor
유준
Table Of Contents
LIST OF TABLES
LIST OF FIGURES
Abstract
CHAPTER 1. INTRODUCTION
CHAPTER 2. METHODS
2.1 Feature Selection Methods
2.1.1 Forward Selection
2.1.2 Stepwise Selection
2.1.3 Least Absolute Shrinkage and Selection Operator (Lasso)
2.1.4 Least Angle Regression (LARS)
2.1.5 Genetic Algorithm
2.2 Support Vector Machines
2.2.1 SVM in Regression Problems
2.2.2 SVM in Classification Problems
2.3 Bootstrapping
CHAPTER 3. PROBLEM DESCRIPTION
3.1 Case Study I: Soil Carbonate Content Prediction
3.1.1 Sampling
3.1.2 FT-IR & XRD Data
3.1.3 Performance Evaluation of Feature Selection Methods
3.2 Case Study II: DNA Microarray Gene Expression Data for Diagnosing Prostate Cancer
3.2.1 Performance Evaluation of Feature Selection Methods
CHAPTER 4. RESULTS AND CONCLUSION
4.1 Results for Case Study I: FT-IR & XRD Data
4.2 Results for Case Study II: Gene Expression Data
CHAPTER 5. CONCLUSION
REFERENCES
Degree
Master
Appears in Collections:
Graduate School of Industry > Department of Applied Chemical Engineering
Authorize & License
  • Authorize: Public
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.