Integrative Data-Driven and First-Principle Approaches for Chromatographic Retention Time Prediction and Binding Affinity Enhancement

Abstract
Advances in predictive modeling of molecular properties are essential for accelerating progress in analytical chemistry, cheminformatics, and drug discovery. This thesis investigates three integrated computational strategies: (i) development of a cross-column Quantitative Structure–Retention Relationship (QSRR) model that uses Density Functional Theory (DFT) descriptors and machine learning (ML) to predict retention time (RT), (ii) prediction of chromatographic RT directly from Simplified Molecular Input Line Entry System (SMILES) sequences using a hybrid Transformer–Long Short-Term Memory (LSTM) model, and (iii) identification of binding-enhancing mutations in the coiled-coil interaction between methyl-CpG-binding domain protein 2 (MBD2) and transcriptional repressor p66-alpha (p66α) through a computational pipeline that combines statistical prediction, alchemical free energy calculation, and molecular dynamics (MD) analysis. Together, these studies advance methodologies for chromatographic analysis and protein–protein interaction (PPI) modulation, and provide insights into the integrative use of data-driven and first-principles approaches.
Chapter 2 presents cross-column QSRR models that combine DFT-derived quantum descriptors with ML algorithms to predict RTs. RTs on three reversed-phase high-performance liquid chromatography (RP-HPLC) columns were predicted using Partial Least Squares (PLS), Ridge Regression (RR), Random Forest (RF), and Gradient Boosting (GB). The GB-QSRR model performed best, with a predictive squared correlation coefficient (Q²) of 0.989 and a root mean square error of prediction (RMSEP) of 0.749 minutes. Feature analysis identified the key descriptors as Solvation Energy (SE), HOMO–LUMO Energy Gap (ΔE HOMO–LUMO), Total Dipole Moment (Mtot), Global Hardness (η), and Gradient Time. These results show that ensemble ML methods effectively capture retention behavior across diverse RP-HPLC systems.
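As a rough illustration of the two validation metrics quoted above, the following sketch computes RMSEP and an external Q². The abstract does not state which Q² variant the thesis uses; the form below (residual sum of squares on the test set relative to deviations from the training-set mean) is an assumption, and the function names are hypothetical.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean square error of prediction, in the units of y (e.g. minutes)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def q2_external(y_true, y_pred, y_train_mean):
    """External Q2 (assumed variant): 1 - PRESS / SS.

    PRESS is the sum of squared prediction errors on the test set.
    SS is the sum of squared deviations of the test observations
    from the mean of the training-set responses.
    """
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss = sum((t - y_train_mean) ** 2 for t in y_true)
    return 1.0 - press / ss
```

Both functions take plain sequences of observed and predicted retention times, so they can be applied to the output of any of the four regressors named above.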
Chapter 3 addresses the use of a hybrid Transformer–LSTM deep learning (DL) model for predicting chromatographic RTs directly from SMILES strings. The architecture integrates a pretrained RoBERTa-based language model with bidirectional LSTM layers, enabling the model to sequentially capture contextual chemical information. The model was trained on the METLIN Small Molecule Retention Time (SMRT) dataset, containing 80,038 small molecules. It achieved a mean absolute error (MAE) of 26.23 seconds and a coefficient of determination (R²) of 0.91, outperforming previous state-of-the-art approaches. Additionally, transfer learning evaluations across ten datasets demonstrated the model’s robust adaptability to diverse chromatographic systems.
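Before a sequence model can read a SMILES string, the string has to be split into tokens and mapped to integer ids. The abstract does not describe the tokenizer used with the pretrained RoBERTa-based model, so the sketch below is only an assumption: it uses an atom-level regular expression that is common in the cheminformatics literature, and the names `tokenize_smiles`, `build_vocab`, and `encode` are hypothetical.

```python
import re

# Atom-level SMILES pattern (an assumed scheme, not the thesis tokenizer).
# Bracket atoms and two-letter halogens are matched before single characters.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Refuse inputs containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES: {smiles!r}")
    return tokens


def build_vocab(smiles_list):
    """Assign an integer id to every token seen in the training strings."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for smiles in smiles_list:
        for token in tokenize_smiles(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles, vocab, max_len):
    """Map a SMILES string to a fixed-length list of token ids."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize_smiles(smiles)]
    ids = ids[:max_len]  # truncate long sequences
    return ids + [vocab["<pad>"]] * (max_len - len(ids))  # pad short ones
```

The resulting id sequences are the kind of input that an embedding layer followed by Transformer and bidirectional LSTM layers would consume; a pretrained language model would normally ship with its own tokenizer and vocabulary instead.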
Chapter 4 explores the computational design of enhanced coiled-coil protein–protein interactions. Using the Binding Affinity Tool for Mutational Scanning in Complexes (BeAtMuSiC), ten single-point mutations at the MBD2–p66α interface were initially screened for stability enhancement. Selected mutations were further evaluated through alchemical free energy simulations using the double-system/single-box method combined with Crooks–Gaussian analysis. Mutations such as K149I, K149L, and K163L were identified as significantly stabilizing. MD simulations and Molecular Mechanics/Poisson–Boltzmann Surface Area (MM/PBSA) binding free energy calculations validated these predictions, revealing reduced flexibility and increased contact surface areas in the stabilized mutants. This integrated workflow presents a strategy for stabilizing protein complexes.
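The Crooks–Gaussian analysis mentioned above can be sketched as follows. In non-equilibrium alchemical calculations, the forward work distribution and the negated reverse work distribution cross at the free energy difference; fitting a Gaussian to each and solving for the crossing point gives the estimate. This is a minimal sketch under stated assumptions: the function name, the fallback to the midpoint of the two means, and the choice of the root lying between the means are illustrative, not taken from the thesis.

```python
import math
from statistics import mean, stdev


def crooks_gaussian_intersection(w_forward, w_reverse):
    """Estimate a free energy difference from forward and reverse work values.

    Gaussians are fitted to the forward work values and to the negated
    reverse work values; the estimate is the point where the two fitted
    densities are equal.
    """
    mu_f, s_f = mean(w_forward), stdev(w_forward)
    negated_reverse = [-w for w in w_reverse]
    mu_r, s_r = mean(negated_reverse), stdev(negated_reverse)
    midpoint = 0.5 * (mu_f + mu_r)

    # Equating the two Gaussian log-densities gives a*x^2 + b*x + c = 0.
    a = 1.0 / s_f**2 - 1.0 / s_r**2
    b = -2.0 * (mu_f / s_f**2 - mu_r / s_r**2)
    c = mu_f**2 / s_f**2 - mu_r**2 / s_r**2 + 2.0 * math.log(s_f / s_r)

    if abs(a) < 1e-12:
        # Equal widths: the densities cross exactly halfway between the means.
        return midpoint

    disc = max(b * b - 4.0 * a * c, 0.0)  # guard against rounding below zero
    roots = [(-b + sign * math.sqrt(disc)) / (2.0 * a) for sign in (1.0, -1.0)]

    # Keep the crossing that lies between the two means (assumed convention);
    # otherwise fall back to the midpoint.
    low, high = sorted((mu_f, mu_r))
    between = [r for r in roots if low <= r <= high]
    return between[0] if between else midpoint
```

Both arguments are lists of work values in the same energy unit, with the reverse values given as measured (the function negates them internally). In a mutation study, this estimate would be computed separately for each leg of the thermodynamic cycle and the two values subtracted to obtain the change in binding free energy.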
Chapter 5 highlights the objective shared by all studies: improving molecular prediction through the integration of ML, DL, MD, and quantum chemistry. Collectively, these studies demonstrate the potential of computational tools for solving real-world molecular problems: they improve predictive accuracy, reduce experimental workload, and deepen understanding of molecular mechanisms.
Author(s)
MAZRAEDOOST SARGOL
Issued Date
2025
Awarded Date
2025-08
Type
Dissertation
Keyword
Quantitative Structure–Retention Relationship (QSRR) Modeling, Chromatographic Retention Time Prediction, Simplified Molecular Input Line Entry System (SMILES) Representation, Transformer–Bidirectional Long Short-Term Memory (BiLSTM) Neural Network, Density Functional Theory (DFT)-Derived Quantum Descriptors, Protein–Protein Interaction (PPI), Binding Affinity
Publisher
Pukyong National University Graduate School
URI
https://repository.pknu.ac.kr:8443/handle/2021.oak/34340
http://pknu.dcollection.net/common/orgView/200000899818
Affiliation
Pukyong National University Graduate School
Department
Graduate School, Division of Chemical Convergence Engineering
Advisor
J Jay Liu
Table Of Contents
Chapter 1. Introduction
1.1. Background and Motivation
1.2. Research Framework
1.3. Research Objectives and Questions
1.4. Research Method
1.5. Scope of This Study
1.5.1. Chapter 1. Introduction
1.5.2. Chapter 2. Cross-Column Retention Time Prediction
1.5.3. Chapter 3. SMILES-Based Retention Time Prediction
1.5.4. Chapter 4. Computational Design for Binding Affinity Enhancement
1.5.5. Chapter 5. Conclusions and Future Research Direction
Chapter 2. Cross-column Density Functional Theory-based Quantitative Structure-Retention Relationship model development powered by Machine Learning
2.1. Background
2.2. Materials and Methods
2.2.1. Instrumentation or Equipment and Chromatographic Conditions
2.2.2. Reagents and Chemicals
2.2.3. Data Analysis and Model Development
2.2.4. Descriptor Categorization for Model Building
2.2.5. Molecular Descriptors
2.2.6. Characteristic Descriptors
2.2.7. Machine Learning Models
2.2.7.1. Partial Least Squares
2.2.7.2. Ridge Regression
2.2.7.3. Ensemble Learning Algorithms
2.2.8. Machine Learning Hyper-Parameter Optimization
2.2.9. QSRR model validation
2.2.9.1. Evaluation Metrics
2.2.9.2. Y-randomization
2.2.9.3. SHapley Additive Explanations
2.3. Results and Discussion
2.3.1. Hyper-parameters Optimization
2.3.2. Predictive Performance
2.3.3. Residual Analysis
2.3.4. Y-randomization
2.3.5. Interpretability and Feature Importance
2.3.6. A Comparison of the Results of the QSRR Model
2.4. Conclusions
Chapter 3. Prediction of Chromatographic Retention Time of a Small Molecule from SMILES Representation Using a Hybrid Transformer-LSTM Model
3.1. Background
3.2. Material and Methods
3.2.1. Dataset and Preprocessing
3.2.1.1. Small Molecule Retention Time (SMRT) Dataset
3.2.1.2. Data preprocessing
3.2.1.3. Data Augmentation
3.2.2. Retention time prediction with RoBERTa-BiLSTM
3.2.2.1. Transformer Neural Network
3.2.2.2. Transformer: Input Processing, Attention, and Architecture
3.2.2.3. Transfer Learning
3.2.2.4. RoBERTa
3.2.2.5. Layer Fusion
3.2.2.6. Bidirectional Long Short-Term Memory (BiLSTM)
3.2.2.7. Dense Layer
3.2.3. Experimental Setup
3.2.4. Evaluation Metrics
3.3. Results and Discussion
3.3.1. Overall performance
3.3.2. Explainability in Retention Time Prediction
3.3.3. Transfer Learning Capability
3.3.4. Ablation Study (Effect of Different Modules)
3.4. Conclusion
Chapter 4. Integrative computational pipeline for identifying binding-enhancing mutations targeting the MBD2–p66α interaction: Implications for therapeutic applications
4.1. Background
4.2. Materials and Methods
4.2.1. Protein-protein complexes preparation
4.2.2. Hybrid residues construction
4.2.3. Thermodynamic cycle and free-energy calculation
4.2.4. MD simulations-based alchemical free energies computation
4.2.5. Binding free energy calculation using the MM/PBSA
4.3. Results and Discussion
4.3.1. MD-based alchemical free energy calculations
4.3.2. Binding free energy calculation using the MM/PBSA method
4.3.3. Analysis of MD simulations using conventional methods
4.4. Conclusion
Chapter 5. Conclusions and Future Research Direction
5.1. Conclusions
5.2. Future Research Direction
References
Abstract (in Korean)
Abbreviations
Acknowledgements
Degree
Doctor
Authorize & License
  • Authorize: Public
  • Embargo: 2025-08-22