Foundation Models in Biology and Chemistry

Introduction

The article "Foundation models build on ChatGPT tech to learn the fundamental language of biology," published in Nature Biotechnology, explores in depth how foundation models are being pursued in biology. It presents a range of attempts, from models that work directly on the fundamental substances of life such as DNA and RNA to Recursion’s Phenom, which is based on cell images. For anyone curious about how artificial intelligence might bring innovation to biology, I highly recommend reading it.

As a chemist reading such an article, I wonder whether similar research can be done in chemistry, and specifically whether a foundation model for chemistry can be developed soon. This question has been growing and becoming more concrete for me since I ventured into the field of AI drug discovery around 2019. In this post, I would like to discuss two central thoughts related to this topic: the complexity of molecules and the role of electrons.

The Complexity of Molecules

For Biomolecules

Biomolecules (DNA, RNA, and proteins) are the most important fundamental substances in biological phenomena, and they are central themes in two crucial fields: biochemistry and bioinformatics. The three types of molecules form the basis of the Central Dogma and are responsible for storing, interpreting, and expressing genetic information. It would be naive to regard these processes as error-free, and we are steadily gaining a better understanding of the diverse and complex phenomena that arise when something goes wrong in them. Still, all three types of molecules fundamentally carry the same information.

From a chemical perspective, DNA and RNA are each polymers made up of four monomers, while proteins consist of 20 different monomers. The monomers of DNA and RNA are linked by phosphodiester bonds, whereas those of proteins are linked by amide (peptide) bonds. Reductively speaking, no matter how large or complex these molecules are, there are at most four or twenty types of basic pieces, and there is only one way to connect them.

This characteristic allows us to represent these molecules very simply: DNA can be written with the letters ATGC, RNA with AUGC, and proteins with 20 letters (RHKDESTNQCGPAVILMFYW). Consequently, bioinformatics can process these text strings in various ways to answer diverse biological questions. However, to analyze and understand the relationship between biological phenomena and these text strings beyond the domain of bioinformatics, data that fills the gap is essential. Basic predictions can be made from known relationships alone, but the approach taken by Recursion is an example of acquiring data that fills this gap.
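To make the "molecules as text" idea concrete, here is a minimal Python sketch. The alphabets are the standard one-letter codes; the function names `classify_sequence` and `transcribe` and the example strings are my own, chosen only for illustration.

```python
# Alphabets for the three biopolymer types (standard one-letter codes).
DNA_ALPHABET = set("ATGC")
RNA_ALPHABET = set("AUGC")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def classify_sequence(seq):
    """Return every biopolymer type whose alphabet covers all letters in seq."""
    letters = set(seq.upper())
    kinds = []
    if letters <= DNA_ALPHABET:
        kinds.append("DNA")
    if letters <= RNA_ALPHABET:
        kinds.append("RNA")
    if letters <= PROTEIN_ALPHABET:
        kinds.append("protein")
    return kinds


def transcribe(dna):
    """Coding-strand DNA to mRNA: at the text level, the only change is T -> U."""
    return dna.upper().replace("T", "U")


print(classify_sequence("AUGGCC"))  # only the RNA alphabet contains U
print(classify_sequence("MKVLF"))   # letters found only in the protein alphabet
print(transcribe("ATGGCC"))
```

Note that a string such as "ATGGCC" passes both the DNA and the protein check, because A, T, G, and C are also amino acid codes. The text alone does not say which molecule it describes, which is a small reminder that context always has to come from outside the string.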

In essence, having a foundation model for biology means starting from text strings that express the structures of biomolecules and progressing towards making predictions and performing tasks at a more macroscopic level. Similarly, a foundation model for chemistry must start from the same kind of point, text strings representing molecular structures, and enable predictions and tasks at a more microscopic level.

For Small Organic Molecules

Considering how small organic molecules are created (assuming we are discussing their synthesis), there are tens to hundreds of thousands of monomers (a.k.a. building blocks) involved, and the chemistry connecting these monomers encompasses hundreds to thousands of reaction types. These numbers continue to grow, and aside from a few proven cases, we generally have to determine experimentally which monomers are compatible with a given reaction. Thus, in terms of molecular complexity, biomolecules are large molecules made in a straightforward manner, whereas small organic molecules are comparatively small but synthesized in very complex and creative ways.
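A back-of-the-envelope calculation shows the size of this gap. The sketch below counts how many (building block, linking chemistry) options exist for a single extension step. The values 100,000 and 1,000 are illustrative picks from the ranges mentioned above, not measured numbers, and the helper name `choices_per_step` is hypothetical.

```python
def choices_per_step(n_building_blocks, n_linking_chemistries):
    """Count the (building block, linking chemistry) options for one extension step."""
    return n_building_blocks * n_linking_chemistries


# Biopolymers: a small alphabet and a single linking chemistry.
dna = choices_per_step(4, 1)        # phosphodiester bond only
protein = choices_per_step(20, 1)   # amide bond only

# Small molecules: illustrative values taken from the ranges quoted in the text.
small_molecule = choices_per_step(100_000, 1_000)

print(dna, protein, small_molecule)
```

This count is a naive upper bound, since most building blocks are not compatible with most reactions. That is exactly the part we usually have to find out by experiment.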

Conformer Space

The properties of all molecules are determined by their structures, which generally means their three-dimensional arrangements. Determining the structures of biomolecules is challenging, but AlphaFold has brought remarkable innovation to this area, particularly for proteins, and has made the task much more feasible.

In contrast, small organic molecules present a very different scenario because of the single bonds they contain. Rotation around single bonds lets a small organic molecule adopt various conformations at room temperature, and in general, the more rotatable single bonds there are, the more conformers can exist. (Please refer to the Wikipedia page on conformational isomerism for a better understanding.) We need to connect molecular properties to structure, but since there is not one structure but many possible ones, the problem becomes complex.
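How quickly this grows can be sketched with a common rule of thumb. The code below assumes that each rotatable single bond has about three low-energy staggered arrangements (as in butane, with one anti and two gauche forms) and that the bonds are independent. Both are simplifying assumptions of mine, and the function name `estimate_conformer_count` is hypothetical.

```python
def estimate_conformer_count(n_rotatable_bonds, minima_per_bond=3):
    """Naive upper bound: every rotatable bond independently picks one minimum."""
    if n_rotatable_bonds < 0:
        raise ValueError("n_rotatable_bonds must be non-negative")
    return minima_per_bond ** n_rotatable_bonds


for n in (1, 5, 10):
    print(n, "rotatable bonds ->", estimate_conformer_count(n), "candidate conformers")
```

Real molecules have fewer distinct conformers than this bound, because symmetry makes some of them identical and steric clashes rule others out. Even so, the exponential trend is the point: ten rotatable bonds already give tens of thousands of candidate shapes.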

Considering the interaction between proteins and drugs, it is difficult to assert that a drug always adopts the same structure when it interacts with numerous different proteins. This characteristic is both an advantage and a disadvantage for small organic compounds when they act as drugs. It also explains why two seemingly similar drugs can have very different biological profiles. If we do not properly understand this issue, it becomes nearly impossible to accurately grasp how drugs function within the body.

(I discovered a software package called GeminiMol, along with a publication that explains it. It's exciting to see this type of research already underway!)

The Role of Electrons

Chemists face an even greater challenge because all properties of molecules, such as intermolecular interactions and reactivity, are determined by electron density. For biomolecules, a single letter per monomer is sufficient to represent the structure, since the monomers are few in number and relatively similar to one another. Small organic molecules do not share this simplicity, so understanding their properties requires knowledge of their electron distributions. Obtaining that knowledge means solving the Schrödinger equation, which demands substantial computational resources and is well known for becoming intractable as the number of electrons increases.
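One way to get a feel for this intractability is to count the electron configurations (Slater determinants) that an exact solution within a finite orbital basis, known as full configuration interaction, has to consider. For M spatial orbitals with given numbers of spin-up and spin-down electrons, that count is C(M, N_up) × C(M, N_down). The sketch below only evaluates this formula; the function name `fci_determinants` and the example sizes are my own choices.

```python
from math import comb


def fci_determinants(n_orbitals, n_alpha, n_beta):
    """Number of Slater determinants in a full CI expansion.

    n_orbitals: spatial orbitals in the basis
    n_alpha, n_beta: spin-up and spin-down electron counts
    """
    return comb(n_orbitals, n_alpha) * comb(n_orbitals, n_beta)


# Half-filled examples: doubling the system size far more than doubles the work.
for n_orb in (2, 10, 20, 30):
    half = n_orb // 2
    print(n_orb, "orbitals ->", fci_determinants(n_orb, half, half), "determinants")
```

Practical methods such as density functional theory avoid this explosion by approximating, which is why they are usable at all for drug-sized molecules. The price is that the result is no longer exact.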

A prime example of where understanding the electron distribution within a molecule is crucial is molecular glues. To my knowledge, no theoretically sound approaches have yet been established in this area.

To train AI models effectively, a sufficient amount of information and data must be supplied. However, most cheminformatics software currently represents small organic molecules as SMILES strings, which encode atoms and bonds but carry no explicit information about electron distribution. Consequently, most AI models are trained on data that lacks critical information needed to understand molecular properties accurately, and as a result they struggle to predict the properties of small organic compounds effectively.
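To see what a SMILES string actually contains, it helps to split one into tokens. The sketch below uses a simplified regular expression of my own that covers common SMILES syntax (bracket atoms, organic-subset atoms, bonds, branches, ring closures). It is an illustration, not a complete or validating SMILES parser, and the name `tokenize_smiles` is hypothetical.

```python
import re

# Simplified pattern: bracket atoms, two-letter halogens, single-letter atoms,
# aromatic atoms, ring closures, then bond/branch/other punctuation.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:.~@*$])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(tokenize_smiles("CC(=O)O"))   # acetic acid
print(tokenize_smiles("c1ccccc1"))  # benzene
print(tokenize_smiles("CCO"))       # ethanol
print(tokenize_smiles("COC"))       # dimethyl ether, same atoms as ethanol
```

Every token is an atom, a bond, a branch marker, or a ring-closure digit. Ethanol and dimethyl ether differ only in the order of three letters, and nothing in either string describes where the electrons sit. A model trained on these strings has to infer that on its own.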

Similar issues arise with protons in tautomerism, another unresolved problem, but I won’t delve into that here. It goes without saying that the issue becomes much more complex when it comes to inorganic compounds.

Tentative Conclusion

In the next few months or years, I expect that biological foundation models will be developed through various methods and put to use in practical applications, yielding significant results. However, I believe the likelihood of developing a similarly advanced foundation model for chemistry within that timeframe is quite low. Even if some basic models are introduced during that period, achieving practical outcomes will be very challenging.

Many researchers anticipate that quantum computing will play a crucial role in this area. Although I am not well-versed in this field myself, if what researchers hope for becomes feasible, then a foundation model for chemistry might indeed be an attainable dream. Nevertheless, I am uncertain which goal might be reached sooner.
