Be aware of overfitting by hyperparameter optimization!
The title of this blog post is the same as that of a paper published in the Journal of Cheminformatics by I. Tetko and others.
(This article addresses similar content as previous posts but from a different perspective. It may be helpful to read these articles together: Understanding Data Curation and Its Impact on Predictive Modeling in Drug Discovery, Calling for Rigorous Method Comparison in Machine Learning for Drug Discovery, Assessing Data Importance from HTS Results)
Hyperparameter optimization is an essential part of developing most predictive models. When trying to build a better predictive model from the same dataset, it is common to try new algorithms or different parameter-optimization methods. And where many predictive models must be created and used, automating this process so that the best model is selected quickly has likely become part of the standard workflow.
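As a concrete illustration of that automated selection step, here is a minimal sketch using scikit-learn's GridSearchCV; the synthetic dataset and the parameter grid are my own illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of automated hyperparameter selection, assuming scikit-learn.
# The synthetic dataset and the alpha grid are illustrative only.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

# Search over the regularization strength; the "best" model is picked
# automatically by cross-validated score.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_)
```

Note that the score this search maximizes is itself a selection criterion, which is exactly where the overfitting risk discussed below comes in.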
Like many others who have worked in QSAR research for a long time, I find that people who try to understand why a predictive model performs well, and under what circumstances it may fail, tend to place less value on new algorithms or optimization methods. The numerical gains in predictive power from these advances can be significant, but they often contribute little to understanding the data and can even obscure it. The major advantage of a good predictive model is not merely accurate numerical predictions but the insight it provides into the data.
The dataset used in this paper pertains to the water solubility of compounds. Among the various parameters used in drug discovery, this is a highly physical experimental result, with well-documented experimental findings and generally high data quality. While distinguishing between thermodynamic and kinetic solubility can be challenging, this model can provide reliable predictions suitable for the early stages of drug discovery.
In contrast, predicting how molecules behave in the various environments of the body is much harder. The phenomena collectively referred to as ADME (Absorption, Distribution, Metabolism, and Excretion) describe how the body handles drug molecules; these are not simple physical processes but complex interactions among many molecules and systems. Such phenomena are difficult to capture with straightforward algorithms and contain many nonlinear elements. For data of this kind, certain algorithms may show clear advantages over others, and it is reasonable to expect that hyperparameter optimization also contributes to predictive power here.
Overfitting means that solving a problem in an overly complex way increases the likelihood of incorrect predictions on new data. Knowing whether the model we want to build describes a simple physical phenomenon or one with strongly nonlinear characteristics is crucial in deciding how much overfitting risk we are willing to accept. For relatively simple physical phenomena like water solubility, overfitting can be a significant issue, and it is generally better to choose simpler, more explainable models. Conversely, when the problem itself is highly complex and contains many nonlinear factors, the benefit gained from sophisticated algorithms or parameter optimization may be relatively greater.
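One practical way to detect optimism introduced by hyperparameter search is nested cross-validation: the inner loop tunes the hyperparameters, and the outer loop scores the tuned model on data the search never saw. The sketch below, again assuming scikit-learn with an illustrative synthetic dataset, contrasts the naive search score with the nested estimate.

```python
# Sketch of guarding against optimism from hyperparameter tuning, assuming
# scikit-learn. Nested cross-validation keeps the data used to pick
# hyperparameters separate from the data used to score the final model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=150, n_features=50, noise=10.0, random_state=1)

# Inner loop: tune the regularization strength by 3-fold CV.
inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=3)

# Outer loop: score the tuned model on held-out folds the search never saw.
nested_scores = cross_val_score(inner, X, y, cv=5)

# Naive estimate: the same CV score that the search itself maximized.
inner.fit(X, y)
naive_score = inner.best_score_

print(round(naive_score, 3), round(float(np.mean(nested_scores)), 3))
```

On problems with small datasets and large search grids, the naive score is typically the more optimistic of the two, which is the overfitting-by-hyperparameter-optimization effect the paper warns about.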
This assessment is quite qualitative and relative, meaning it cannot be easily generalized for all problems. The insights are often acquired through experience with numerous instances of prediction and validation. However, acknowledging that there is no one-size-fits-all optimal approach applicable to every problem can be quite beneficial in practice.
Here are three key takeaway messages from the article:
Value of Understanding Over Algorithms: The primary benefit of a predictive model lies in the insights it provides about the data rather than just numerical accuracy. Understanding the factors influencing predictions is often more valuable than simply implementing new algorithms or optimization techniques.
Complexity vs. Simplicity in Modeling: When developing predictive models, it’s crucial to assess whether the phenomena being modeled are simple physical processes or complex nonlinear interactions. This understanding can guide the choice between simpler, more interpretable models and more complex algorithms that may better capture intricate relationships.
Risks of Overfitting: Overfitting can significantly impact predictive performance, especially in cases involving relatively straightforward physical phenomena. Striking a balance between model complexity and interpretability is essential to avoid overfitting while still achieving reliable predictions.