Assessing Data Importance from HTS Results
The PubChem database hosts a large number of High-Throughput Screening (HTS) results from various screening centers. As of October 29, 2024, for example, it lists 23,501 bioassays for EGFR. The top record, AID 720582, is a QFRET-based biochemical assay conducted by The Scripps Research Institute in 2013 that tested 370,276 compounds, of which 2,294 were active and 367,982 inactive. Researchers interested in EGFR may find this data useful for building predictive models that identify active compounds.
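If you want to pull this data programmatically, PubChem's PUG REST interface can return the active and inactive compound IDs for an assay. Below is a minimal sketch; the `assay/aid/.../cids` endpoint and the `cids_type` parameter are taken from the PUG REST documentation as I understand it, so verify them before building on this.

```python
import requests

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
AID = 720582  # QFRET-based biochemical EGFR assay (Scripps, 2013)

def fetch_cids(aid: int, outcome: str) -> list[int]:
    """Fetch compound CIDs for one activity outcome ('active' or 'inactive')."""
    resp = requests.get(f"{BASE}/assay/aid/{aid}/cids/TXT",
                        params={"cids_type": outcome}, timeout=120)
    resp.raise_for_status()
    return [int(tok) for tok in resp.text.split()]

active_cids = fetch_cids(AID, "active")      # expected roughly 2,294 CIDs
inactive_cids = fetch_cids(AID, "inactive")  # expected roughly 367,982 CIDs
```

Note that the assay outcome counts are reported per substance (SID), so the CID counts may differ slightly from the figures above.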
To build a reliable model, it is vital to verify the credibility of the experimental data. This involves checking the Same-Project BioAssays linked to the main study, identifying stochastic errors, and filtering out Pan-Assay Interference Compounds (PAINS). In practice, this data-understanding step is more involved than it sounds and requires thorough investigation, which typically takes far longer than the computation itself.
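For the PAINS step specifically, RDKit ships a built-in PAINS filter catalog. Here is a minimal sketch; the example SMILES is an arylidene rhodanine, a scaffold commonly flagged by PAINS alerts:

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a catalog of the PAINS substructure filters bundled with RDKit.
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(params)

def is_pains(smiles: str) -> bool:
    """Return True if the molecule matches any PAINS pattern."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and catalog.HasMatch(mol)

print(is_pains("O=C1NC(=S)SC1=Cc1ccccc1"))  # arylidene rhodanine: likely True
```

Rather than silently deleting matches, it is usually safer to flag them for manual review, since PAINS alerts are heuristics, not proof of artifactual activity.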
Even after cleansing, challenges remain because the dataset is highly imbalanced: roughly 160 inactive compounds for every active one. Most machine learning algorithms for binary classification perform better when the classes are reasonably balanced.
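To see why this matters: with 2,294 actives against 367,982 inactives, a model that always predicts "inactive" already reaches about 99.4% accuracy while finding nothing. A common first baseline, sketched below with scikit-learn (the feature matrix X and labels y are assumed to come from the cleaned assay data), is to reweight the classes rather than discard data:

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" scales each class inversely to its frequency,
# so the ~1:160 imbalance does not dominate the training loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
# clf.fit(X, y)  # X: compound features (e.g., fingerprints); y: 0/1 outcomes
```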
The paper "Machine Learning-Driven Data Valuation for Optimizing High-Throughput Screening Pipelines," published in the Journal of Chemical Information and Modeling on October 23, 2024, presents methodologies for analyzing such imbalanced datasets. I recommend it to anyone interested; it is available via Open Access. Several related publications are also worth reading, including "Data Valuation: A novel approach for analyzing high throughput screen data using machine learning," "Minority Class Oriented Active Learning for Imbalanced Datasets," "On the Prediction of Statistical Parameters in High-Throughput Screening Using Resampling Techniques," and "Machine Learning Assisted Hit Prioritization for High Throughput Screening in Drug Discovery."
The paper’s core idea is to score each data point by its impact on the model, which helps distinguish true from false positives and select informative inactive data, thereby reducing dataset imbalance and improving the reliability of ML models. Key methods for calculating data importance include:
- KNN Shapley Values: Approximates Shapley values using K-nearest neighbors to evaluate each data point’s impact on model performance (a minimal sketch follows this list).
- CatBoost Object Importance: Utilizes a CatBoost model to determine the importance of each training sample through a Leave-One-Out retraining process (a second sketch follows this list).
- DVRL (Data Valuation using Reinforcement Learning): A framework that calculates importance values for training data using reinforcement learning.
- TracIn: Tracks how each training sample affects test sample loss during deep learning training to compute importance.
- MVS-A (Minimal Variance Sampling Analysis): Analyzes the impact of samples on decision tree structure changes during gradient boosting model training.
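To make the first method concrete, here is a minimal NumPy sketch of the exact Shapley-value recursion for an unweighted KNN classifier (Jia et al., 2019). This is a generic implementation rather than the paper's code, and it assumes compounds have already been featurized as numeric vectors (e.g., Morgan fingerprints):

```python
import numpy as np

def knn_shapley(X_train, y_train, X_test, y_test, K=5):
    """Exact Shapley values for an unweighted KNN classifier,
    averaged over the test set (one score per training point)."""
    N = len(X_train)
    values = np.zeros(N)
    for x_t, y_t in zip(X_test, y_test):
        # Sort training points by distance to the test point, closest first.
        order = np.argsort(np.linalg.norm(X_train - x_t, axis=1))
        match = (y_train[order] == y_t).astype(float)
        s = np.zeros(N)
        s[N - 1] = match[N - 1] / N
        # Recurrence runs from the farthest neighbor back to the closest.
        for i in range(N - 2, -1, -1):
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)
        values[order] += s
    return values / len(X_test)

# Toy usage with random features standing in for fingerprints:
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))
y = (X[:, 0] > 1.0).astype(int)
vals = knn_shapley(X[:250], y[:250], X[250:], y[250:], K=5)
print(vals.argsort()[-5:])  # indices of the five highest-value training points
```

Sorting dominates the cost, so each test point takes O(N log N) instead of retraining over exponentially many subsets; this closed form is what makes Shapley values tractable at HTS scale.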
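The second method is exposed directly by CatBoost through `get_object_importance`. The sketch below uses synthetic stand-in data; check the exact signature and the sign convention of the returned scores against the CatBoost documentation before relying on them.

```python
import numpy as np
from catboost import CatBoostClassifier, Pool

# Synthetic stand-in data: 16 features per "compound", rare positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 1.5).astype(int)
train_pool = Pool(X[:800], y[:800])
valid_pool = Pool(X[800:], y[800:])

model = CatBoostClassifier(iterations=300, verbose=False, random_seed=0)
model.fit(train_pool)

# Approximate leave-one-out influence of each training object on the
# validation loss, without actually retraining once per object.
indices, scores = model.get_object_importance(valid_pool, train_pool)
print(indices[:10])  # training samples with the largest influence
```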
The significance of calculating data importance in HTS analysis includes:
- Improved Handling of Imbalanced Data: Identifies important samples even in extreme imbalances typical of HTS data.
- Efficient Active Learning: Focuses on samples that significantly impact model training rather than just likely active ones, enhancing the discovery of active compounds.
- Enhanced False Positive Identification: Methods like MVS-A and TracIn effectively identify false positives by considering both structural characteristics and overall dataset influence.
- Identification of Significant Inactive Samples: Importance-based under-sampling helps maintain model performance by identifying crucial inactive samples (see the sketch after this list).
- Computational Efficiency: Methods such as MVS-A provide high efficiency, making them suitable for large HTS datasets.
- Exploration of Diverse Chemical Structures: Identifies significant samples with varied chemical structures, increasing the potential for discovering new active compounds.
- Enhanced Model Performance: Using important samples for training improves overall model performance and generalization capabilities.
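For the importance-based under-sampling mentioned above, here is a minimal sketch. The helper name and the 10:1 target ratio are illustrative choices, not the paper's; the importance scores could come from any of the methods listed earlier:

```python
import numpy as np

def importance_undersample(y, importance, n_inactive_keep):
    """Keep all actives plus the highest-importance inactives.

    y          : binary labels (1 = active, 0 = inactive)
    importance : per-sample scores (e.g., KNN Shapley values)
    """
    active_idx = np.flatnonzero(y == 1)
    inactive_idx = np.flatnonzero(y == 0)
    # Rank inactives by importance, descending, and keep the top slice.
    top_inactive = inactive_idx[np.argsort(-importance[inactive_idx])][:n_inactive_keep]
    return np.sort(np.concatenate([active_idx, top_inactive]))

# Example: shrink ~367,982 inactives to ~10 per active (2,294 * 10):
# keep_idx = importance_undersample(y, scores, n_inactive_keep=22_940)
```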
According to Figure 6 of the referenced paper, removing low-importance points yields predictive performance comparable to random under-sampling, whereas removing high-importance points causes a substantial drop. The practical takeaway is that data importance calculations mainly buy computational efficiency: they let you shrink the dataset safely, rather than significantly improving predictive capability.