A Study of Machine Learning Algorithms for Chl-a Prediction in Lakes and in Water Bodies Exhibiting Lake-like Characteristics
- Alternative Title
- A study of machine learning algorithms for Chl-a prediction in the river and the lake like exhibiting lake characteristics
- Abstract
In this study, we set out to predict the chlorophyll-a (Chl-a) concentration of algae, which have limited water use since weirs were constructed on the Nakdong River. Machine learning, one of the A.I. techniques, was used as the predictive method. Because current machine learning algorithms are very diverse, algorithms suitable for predicting Chl-a were selected from among those in wide recent use and then applied.
The study area comprised the midstream and downstream reaches of the Nakdong River and Daecheong Lake. Algae have occurred continuously in the midstream and downstream reaches of the Nakdong River since the weirs were constructed in 2012. The Nakdong River data set was built by integrating water quality and water quantity data from these reaches. Daecheong Lake is a typical lake in Korea, and its data set was built in the same way. The data sets for the Nakdong River and Daecheong Lake comprised a total of 21 or 20 variables. From the water quality and quantity data we extracted the ten factors of highest importance, and the algorithms predicted how these ten factors affected Chl-a occurrence. Four machine learning algorithms (decision tree, random forest, elastic net, and gradient boosting) were run in Python. The two best-performing of the four were then applied to Daecheong Lake to verify their applicability.
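The factor-extraction step can be illustrated with a minimal sketch that ranks candidate variables by the absolute Pearson correlation with Chl-a and keeps the top k. This is an assumption about how such a ranking might be done, not the dissertation's actual procedure; the helper names `pearson` and `top_k_features` and all numeric values are hypothetical and made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def top_k_features(columns, target, k=10):
    """Return the k column names with the largest |r| against the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up illustrative values, not data from the study.
chl_a = [10.0, 20.0, 30.0, 40.0, 50.0]
columns = {
    "BOD": [1.0, 2.1, 2.9, 4.2, 5.0],    # rises with Chl-a
    "DO": [12.0, 10.5, 9.0, 8.2, 6.9],   # falls as Chl-a rises
    "Flux": [3.0, 1.0, 4.0, 1.0, 3.0],   # no linear relation
}
selected = top_k_features(columns, chl_a, k=2)
print(selected)
```

Ranking by the absolute value keeps strongly negative relationships (such as DO in this toy data) alongside strongly positive ones.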
For the midstream reach of the Nakdong River, correlation analysis identified BOD, DO, pH, SS, WV, TOC, NH₃-N, E·C, Flux, and T-P at the GG (Gangjeong Goryeong) point, and BOD, DO, COD, TOC, DTP, SS, E·C, T-P, PO₄-P, and NH₃-N at the DS (Dal Seong) point. For the downstream reach, it identified BOD, pH, DO, PO₄-P, COD, DTP, SS, NH₃-N, water temperature, and T-P at the HC (Hapcheon Changnyeong) point, and BOD, COD, pH, TOC, PO₄-P, NH₃-N, DO, DTP, SS, and E·C at the CH (Changnyeong Haman) point.
The performance of the four machine learning algorithms was analyzed using three performance indicators: MSE (Mean Square Error), RMSE (Root Mean Square Error), and R² (coefficient of determination). Default values were used for the hyperparameters. In this analysis, performance at the GG point increased in the order decision tree, elastic net, random forest, gradient boosting. At the DS point the order was elastic net, decision tree, random forest, gradient boosting. At the HC point the order was decision tree, elastic net, random forest, gradient boosting. At the CH point the order was elastic net, decision tree, random forest, gradient boosting.
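The three indicators can be written out directly from their standard definitions. The following is a minimal sketch in plain Python; the measured and predicted values are invented for illustration and are not results from the study.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up illustrative values, not data from the study.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(mse(measured, predicted))   # mean squared error
print(rmse(measured, predicted))  # root mean squared error
print(r2(measured, predicted))    # coefficient of determination
```

Because RMSE is the square root of MSE, the two always move together; the Daecheong Lake random forest figures reported in this abstract (MSE 12.27, RMSE 3.50) are consistent with that relationship.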
Residual scatter plots of the four machine learning algorithms were checked. When checking the error variance, a regression can be regarded as sound if the residual plot shows residuals scattered with constant variance rather than residuals that grow with the predicted value. Although the four algorithms differed across the study area, the random forest and gradient boosting algorithms showed residuals that were relatively close to a normal distribution. The importance of the variables was confirmed by checking the residuals of the independent variables at each of the four points. At the GG point the residuals were E·C (-0.11), BOD (-0.24), TOC (-0.28), NH₃-N (-0.52), pH (-0.52), T-P (-0.59), SS (-0.84), Flux (0.94), WV (-1.03), and DO (-1.66). At the DS point they were SS (-0.05), DO (0.16), COD (-0.29), E·C (0.31), TOC (-0.47), BOD (-0.56), PO₄-P (-0.71), T-P (-0.85), DTP (-0.9), and NH₃-N (-0.94). At the HC point they were PO₄-P (-0.01), DTP (-0.05), water temperature (-0.13), COD (0.16), NH₃-N (0.47), DO (-0.48), SS (-0.59), T-P (-0.64), pH (0.73), and BOD (-1.32). At the CH point they were COD (0.02), SS (0.14), DTP (-0.14), PO₄-P (0.36), TOC (0.38), pH (-0.44), DO (-0.56), E·C (-0.78), NH₃-N (-0.78), and BOD (-0.97).
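One simple numerical companion to a residual scatter plot is to compare the spread of the residuals for low and high predicted values. The sketch below is an illustrative check under that assumption, not the procedure used in the dissertation; `spread_ratio` is a hypothetical helper and the values are made up.

```python
import statistics


def residuals(measured, predicted):
    """Residual = measured value minus predicted value."""
    return [m - p for m, p in zip(measured, predicted)]


def spread_ratio(predicted, resid):
    """Std. dev. of residuals in the upper half of predictions divided by
    the std. dev. in the lower half. A value near 1 suggests roughly
    constant variance; a value well above 1 suggests residuals that grow
    with the predicted value. Assumes the lower half is not all identical."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    half = len(order) // 2
    low = [resid[i] for i in order[:half]]
    high = [resid[i] for i in order[half:]]
    return statistics.pstdev(high) / statistics.pstdev(low)


# Made-up illustrative values, not data from the study.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
steady = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]      # errors of constant size
fanning = [2.0, 1.0, 4.0, 3.0, 8.0, 3.0, 10.0, 5.0]    # errors grow with prediction

print(spread_ratio(predicted, residuals(steady, predicted)))
print(spread_ratio(predicted, residuals(fanning, predicted)))
```

A ratio like this only summarizes one aspect of the plot; a visual check of the scatter remains the more informative diagnostic.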
If the predicted and measured values are similar, the valid factors identified by the residual analysis can be regarded as important factors for predicting algal occurrence. In particular, as the prediction accuracy of the algorithm increases, a wider range of effective factors can be set as important factors. Because the effective factors differ by point, it should also become possible to predict where algal blooms occur and when they occur in summer. In addition, predicting an algal bloom at a specific point in time requires a time series model in which the data are arranged and classified in cross-sections. The number of data sets may be a limitation, but this appears feasible if cross-validation is used.
The ROC (Receiver Operating Characteristic) curve was used to assess how accurately the algorithms predicted whether the average Chl-a concentration was exceeded at the four Nakdong River points. At the GG point, the AUC (Area Under Curve) was 0.790 for the decision tree, 0.867 for the elastic net, 0.869 for the random forest, and 0.877 for gradient boosting. At the DS point, the AUC was 0.878 for the decision tree, 0.880 for the elastic net, 0.931 for the random forest, and 0.951 for gradient boosting. At the HC point, the AUC was 0.886 for the elastic net, 0.905 for the decision tree, 0.955 for the random forest, and 0.961 for gradient boosting. At the CH point, the AUC was 0.811 for the elastic net, 0.848 for the decision tree, 0.881 for the random forest, and 0.885 for gradient boosting.
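To make the ROC evaluation concrete, the sketch below labels each observation by whether the measured Chl-a exceeds the mean, uses the predicted Chl-a as the ranking score, and computes the AUC as the fraction of (exceeding, non-exceeding) pairs that the prediction ranks correctly. Treating the mean of the measured values as the threshold and the raw prediction as the score is an assumption made for illustration; the values are invented.

```python
def exceedance_labels(measured):
    """1 if the measured value is above the mean of all measured values, else 0."""
    mean_value = sum(measured) / len(measured)
    return [1 if m > mean_value else 0 for m in measured]


def roc_auc(labels, scores):
    """AUC as the probability that a positive case scores higher than a
    negative case, counting ties as half (equivalent to the area under
    the ROC curve)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up illustrative values, not data from the study.
measured = [5.0, 8.0, 12.0, 30.0, 45.0, 20.0]
predicted = [6.0, 15.0, 11.0, 28.0, 40.0, 29.0]

labels = exceedance_labels(measured)
auc = roc_auc(labels, predicted)
print(labels, auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported values of 0.790 to 0.961 should be read.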
After hyperparameter tuning, the random forest algorithm gave an RMSE of 7.03 at the HC point (max_depth 10, n_estimator 30) and 8.13 at the CH point (max_depth 10, n_estimator 60). The gradient boosting algorithm gave an RMSE of 7.48 at the CH point (max_depth 1, learning_rate 0.3), 10.55 at the GG point (max_depth 3, learning_rate 0.1), 10.72 at the DS point (max_depth 7, learning_rate 0.1), and 7.38 at the HC point (max_depth 2, learning_rate 0.1). At all four points, performance improved compared with the default hyperparameter values.
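Tuning of this kind is commonly done by exhaustive grid search: every combination of candidate values is evaluated and the one with the lowest RMSE is kept. The sketch below shows the search loop only. The scoring function `toy_rmse` is a hypothetical stand-in for "fit the model with these settings and return the validation RMSE", and the grid values are illustrative; whether the dissertation used grid search is not stated in the abstract. In scikit-learn the corresponding random forest parameters are spelled `max_depth` and `n_estimators`.

```python
import itertools


def grid_search(param_grid, rmse_for):
    """Evaluate every combination in param_grid and return
    (best_params, best_rmse) for the combination with the lowest RMSE."""
    names = list(param_grid)
    best_params, best_rmse = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = rmse_for(params)
        if score < best_rmse:
            best_params, best_rmse = params, score
    return best_params, best_rmse


def toy_rmse(params):
    """Hypothetical score surface with its minimum at max_depth=5 and
    n_estimators=60. A real run would train and validate a model here."""
    return (params["max_depth"] - 5) ** 2 + (params["n_estimators"] - 60) ** 2 / 100 + 1.0


grid = {"max_depth": [1, 3, 5, 7, 10], "n_estimators": [30, 60, 100]}
best_params, best_rmse = grid_search(grid, toy_rmse)
print(best_params, best_rmse)
```

Scoring each combination on held-out data or with cross-validation, rather than on the training data, keeps the search from simply rewarding the deepest trees.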
The CH point was examined as an example to check variable importance. The importance rankings produced by the gradient boosting and random forest algorithms differed, as did the variables each algorithm relied on. Variable importance was also affected by multicollinearity arising from correlation between variables, and this effect differed between algorithms. In addition, the misclassification rate, which reflects the difference in algorithm performance, was small for the boosting algorithm. This was attributed to the difference in bias and variance between the algorithms.
For Daecheong Lake, correlation analysis identified SS, water temperature, COD, DO, Flux, BOD, TOC, TCN, T-P, and NO₃-N. These were extracted as the important factors at the CD (Chu Dong) point. The two algorithms that performed best with default hyperparameters on the Nakdong River were applied to the CD point in Daecheong Lake. At the CD point, the random forest algorithm gave an MSE of 12.27, an RMSE of 3.50, and an R² of 0.48, while the gradient boosting algorithm gave an MSE of 7.40, an RMSE of 2.72, and an R² of 0.66.
Hyperparameters at the CD point were tuned in the same way as for the Nakdong River. After tuning, the random forest algorithm at the CD point gave an RMSE of 2.46 with max_depth 7(5) and n_estimator 60(100). The gradient boosting algorithm gave an RMSE of 2.49 with max_depth 4 and learning_rate 0.1. As with the Nakdong River, the tuned hyperparameters performed better than the default hyperparameters.
Chl-a prediction for the Nakdong River and Daecheong Lake confirmed that the machine learning algorithms can be applied. On that basis, further research can extend to other areas where weirs have been constructed, lake-like characteristics appear, and algae occur frequently. For more accurate prediction, if a variety of data from water bodies where algae currently occur are additionally used as input, the machine learning algorithms should achieve better prediction performance.
Machine learning can process unstructured data such as photos, maps, and graphs, so it can be applied to many different fields. Examples include forecasting the dynamics of sea level rise and ice sheets under climate change, predicting and reducing carbon emissions in support of recent global reduction efforts, and forecasting the degree of temperature rise caused by the increase in atmospheric carbon dioxide. In particular, it is possible to predict a disaster situation and issue an alert before the disaster occurs, and such predictions can help reduce the damage. Advances in machine learning can provide a way to predict and control the pace of climate change. In addition, machine learning could be used in fields where the various causes and triggers are difficult to identify, such as odor substances.
- Author(s)
- 이상민
- Issued Date
- 2022
- Awarded Date
- 2022. 2
- Type
- Dissertation
- Publisher
- Pukyong National University
- URI
- https://repository.pknu.ac.kr:8443/handle/2021.oak/24408
http://pknu.dcollection.net/common/orgView/200000602733
- Alternative Author(s)
- Sang Min Lee
- Affiliation
- Graduate School, Pukyong National University
- Department
- Graduate School, Department of Environmental Engineering
- Advisor
- 김일규
- Table Of Contents
- Ⅰ. Introduction
1.1 Research background and purpose
1.1.1 Characteristics of the Nakdong River weirs and the Daecheong Lake basin
1.1.2 Need for water quality management in stagnant waters
1.1.3 Need for machine learning research
Ⅱ. Theoretical background
2.1 Algae
2.1.1 Characteristics of algal occurrence in Korea
2.1.2 Causes of algal occurrence
a. Inflow of pollutants
b. Water temperature and solar radiation
c. Stagnation of water circulation
2.1.3 Effects of algal occurrence
2.2 Chlorophyll-a
2.2.1 Chlorophyll-a
2.2.2 Eutrophication and chlorophyll-a
2.3 Previous studies
2.3.1 Previous studies on predicting algae and chlorophyll-a with machine learning
2.4 Distinctiveness of this study
2.5 Machine learning research
2.5.1 Machine learning
2.5.2 Algorithm
2.5.3 Ensemble methods
a. Boosting
b. Bagging
2.5.4 Learning from data
a. Supervised learning
b. Unsupervised learning
2.5.5 Python
2.6 Data mining
2.6.1 Data preprocessing
a. StandardScaler
b. RobustScaler
c. MinMaxScaler
d. Normalizer
2.7 Analysis methods
2.7.1 Correlation analysis
2.7.2 Regression analysis
2.7.3 Machine learning algorithms
a. Decision tree
b. Random forest
c. Elastic net
d. Gradient boosting
2.8 Hyperparameters
2.8.1 Decision tree hyperparameters
2.8.2 Random forest hyperparameters
2.8.3 Gradient boosting hyperparameters
2.8.4 Hyperparameter optimization
2.9 Performance analysis of machine learning algorithms
2.9.1 R² (coefficient of determination)
2.9.2 MSE (Mean Square Error), RMSE (Root Mean Square Error)
2.9.3 ROC (Receiver Operating Characteristic) curve
2.9.4 AUC (Area Under Curve)
2.9.5 Residual analysis
Ⅲ. Research content and methods
3.1 Experimental method
3.1.1 Survey sites and periods
3.2 Data collection and data set construction
Ⅳ. Results and discussion
4.1 Correlation analysis results for water quality and quantity items
4.1.1 Nakdong River system (midstream, downstream)
4.2 Algorithm analysis results by model
4.2.1 Results of applying hyperparameters by algorithm
4.2.2 Results of applying the decision tree algorithm
4.2.3 Results of applying the random forest algorithm
4.2.4 Results of applying the elastic net algorithm
4.2.5 Results of applying the gradient boosting algorithm
4.3 Analysis results based on algorithm performance indicators
4.3.1 Analysis results of applying the algorithms to the Nakdong River system
a. Analysis results of applying the algorithms at the Gangjeong Goryeong weir (GG)
b. Analysis results of applying the algorithms at the Dalseong weir (DS)
c. Analysis results of applying the algorithms at the Hapcheon Changnyeong weir (HC)
d. Analysis results of applying the algorithms at the Changnyeong Haman weir (CH)
4.4 Results of evaluating prediction accuracy using the ROC curve
4.5 Applying the optimal algorithm through hyperparameter tuning
4.5.1 Applying optimal random forest hyperparameters
4.5.2 Applying optimal gradient boosting hyperparameters
4.6 Superiority of boosting algorithms and application cases
4.7 Summary
5.1 Study on applying machine learning algorithms to a lake
5.1.1 Correlation analysis of Daecheong Lake water quality and quantity
5.1.2 Evaluation of algorithm application in Daecheong Lake
a. Algorithm evaluation results for Daecheong Lake
5.2 Results of applying tuned hyperparameters to the algorithms
5.3 Summary
Ⅴ. Overall conclusions
Ⅵ. Future research directions
References
Appendix
- Degree
- Doctor
Appears in Collections:
- Graduate School > Department of Environmental Engineering
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.