Understanding Data Curation and Its Impact on Predictive Modeling in Drug Discovery

The phrase “Garbage in, garbage out” is well known not only in AI but throughout data science, and it resonates intuitively because it has been validated again and again in practice. My journey in QSAR (Quantitative Structure-Activity Relationship) research began in the late 1990s, working with data from more than a hundred compound efficacy studies. That experience made the critical importance of meticulous data gathering and organization unmistakable.

In the early 2000s, when I was developing predictive models for hERG protein inhibition, the situation remained unchanged. Today, high-throughput electrophysiological experiments yield significantly better data quality, but back then, compiling datasets from published results required detailed review of each study's experimental methodology, and reproducibility was difficult to establish. I came to recognize that validating how the dataset itself was assembled mattered just as much as the validation performed after the dataset was fixed. Unfortunately, no universally accepted workflow existed for this, leaving it to individual researchers’ discretion.

The recently published paper titled Influence of Data Curation and Confidence Levels on Compound Predictions Using Machine Learning Models, with Jürgen Bajorath as the corresponding author, addresses these concerns. Although I have yet to read the full text, one notable point from the abstract is the phrase “due to subsequent elimination of singletons rather than compounds from analogue series.” My own work in AI drug discovery has repeatedly underscored how much the decision to include or exclude singletons matters, as illustrated in the sketch below, which makes this observation ring true.
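To make the distinction concrete, here is a minimal sketch of one common way to separate singletons from crude analogue series, using Bemis-Murcko scaffolds in RDKit. This is only an illustrative heuristic, not the procedure used in the paper; the example SMILES and the definition of an analogue series as “compounds sharing a scaffold” are my assumptions.

```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

# Hypothetical example SMILES; in practice these would come from the curated dataset.
smiles_list = [
    "c1ccc2[nH]ccc2c1",       # indole
    "Cc1ccc2[nH]ccc2c1",      # methyl-indole (shares the indole scaffold)
    "CC(=O)Nc1ccc(O)cc1",     # different scaffold, likely a singleton here
]

# Group compounds by their Bemis-Murcko scaffold.
scaffold_to_members = defaultdict(list)
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue  # drop unparsable structures during curation
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
    scaffold_to_members[scaffold].append(smi)

# A singleton is a compound whose scaffold occurs only once in the dataset;
# compounds sharing a scaffold form a (crude) analogue series.
singletons = [m[0] for m in scaffold_to_members.values() if len(m) == 1]
series = {s: m for s, m in scaffold_to_members.items() if len(m) > 1}

print(f"{len(singletons)} singletons, {len(series)} analogue series")
```

Whether such singletons are removed or retained changes the chemical space the model sees, which is exactly why the curation decision deserves explicit reporting.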

QSAR has been studied for more than 50 years since its conceptual introduction by Hansch and Fujita in the 1960s, and data reliability has been a topic of discussion for just as long. Because predictions of environmental toxicity carry legal implications, that field has long maintained high standards for the scientific validation of predictive models, and its researchers are well aware of the importance of data quality and model validation.

However, with the rapid increase in studies utilizing AI for predictive modeling, many new researchers may not fully grasp the gold standards developed over decades. This paper serves as a valuable resource for these researchers to better understand and adopt these standards.

In a previous post titled Calling for Rigorous Method Comparison in Machine Learning for Drug Discovery, I addressed similar themes. In private LinkedIn discussions about that article, some researchers emphasized that experimental results obtained through high-throughput screening (HTS) and deposited in public databases such as PubChem vary considerably in quality. Knowing whether the resolution of the experimental data meets the requirements of the predictive model is essential for researchers building models for specific drug discovery tasks. It is equally important for those who use these models to assess whether the data behind them align with their intended applications; statistical validation may not be mandatory for users, but an understanding of the underlying data quality is.
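One simple way to ask whether public assay data have the resolution a model needs is to examine the spread of replicate measurements per compound. The sketch below assumes a hypothetical CSV of assay records with "cid" and "pIC50" columns and a 0.5 log-unit tolerance; both the file layout and the threshold are illustrative assumptions, not values from the paper or from PubChem.

```python
import pandas as pd

# Hypothetical input: repeated assay records pulled from a public source,
# with a compound identifier ('cid') and an activity value ('pIC50').
df = pd.read_csv("assay_records.csv")  # assumed file layout, for illustration only

# Aggregate replicate measurements per compound.
per_compound = df.groupby("cid")["pIC50"].agg(["count", "mean", "std"])

# Keep only compounds measured more than once, so the spread is meaningful.
replicated = per_compound[per_compound["count"] > 1]

# If the downstream model must resolve differences of ~0.5 log units,
# compounds whose replicate spread exceeds that are candidates for removal
# or down-weighting during curation.
MAX_SPREAD = 0.5  # log units; assumption tied to the intended model resolution
noisy = replicated[replicated["std"] > MAX_SPREAD]

print(f"{len(noisy)} of {len(replicated)} replicated compounds exceed "
      f"a spread of {MAX_SPREAD} log units")
```

A check like this does not replace proper statistical validation, but it gives both model builders and model users a quick, shared view of whether the data resolution matches the question being asked.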

The drug discovery process involves complex interactions among various tasks. Data quality issues arising from one task can impact all related tasks, underscoring the importance of addressing these concerns throughout the process.

Here are three key takeaways:

  1. Data Quality is Crucial: The phrase “Garbage in, garbage out” underscores the importance of high-quality data in predictive modeling, particularly in drug discovery. Meticulous data gathering and validation processes are essential for developing reliable predictive models.
  2. Understanding Gold Standards: The recent paper discussing the influence of data curation highlights the necessity for researchers, especially newcomers in AI and drug discovery, to comprehend and adhere to established gold standards in data quality and validation to ensure accurate predictions.
  3. Interconnectedness of Tasks: In the drug discovery process, quality issues in one task can significantly impact related tasks. Therefore, maintaining high data quality across all stages is vital for the overall success of the drug development process.