Abstract
This review provides a comprehensive overview of the paradigm shift in computer-aided molecular design and property prediction, from similarity-based modeling, including quantitative structure–activity/property relationship (QSAR/QSPR), read-across, read-across structure–activity relationship (RASAR), and pharmacophore mapping, to sequence-based chemical language models (CLMs) built with deep learning techniques. It begins with the main methods for representing chemical structures and latent chemical space and briefly covers classical modeling based on molecular descriptors and fingerprints. It then introduces string-based deep learning models, including recurrent neural networks (RNNs) with long short-term memory (LSTM) and other architectures such as variational autoencoders (VAEs), attention models, and generative adversarial networks (GANs). The basics of more efficient transformer models are also discussed. Strategies for training with scarce data, namely transfer learning, data augmentation, and natural-product-inspired training, are analyzed. Applications of CLMs are presented for the de novo design of small molecules of medicinal interest, enzymes, peptides, and multitask agents, and for predicting the properties of drug candidates and activity cliffs. Applications of CLMs in materials science and predictive toxicology are also mentioned. We discuss the limitations of both approaches: feature-based modeling is confined to a restricted feature space, whereas CLMs lack specific insights into aspects such as SARs, bioisosteric replacements, and synthesizability, which collectively hinder their acceptance by regulators and synthetic chemists.
This review concludes that cheminformaticians need both approaches, which are complementary: simplicity, reproducibility, and regulatory acceptability may favor feature-based approaches, while the pursuit of higher accuracy and the generation of novel molecules may favor CLMs. This article is categorized under: Data Science > Chemoinformatics; Structure and Mechanism > Computational Biochemistry and Biophysics; Software > Molecular Modeling.
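The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not taken from the review: the token pattern is a simplified assumption, and `toy_features` is a hypothetical stand-in for a real structural fingerprint (it uses token bigrams, not a chemistry toolkit). The first half shows the sequence view a CLM consumes (SMILES tokens mapped to integer ids); the second half shows the feature view, where two molecules are compared by the Tanimoto coefficient of their feature sets.

```python
import re

# Simplified SMILES token pattern (an assumption for illustration; real
# CLM tokenizers differ in detail). Bracket atoms and the two-letter
# halogens Br and Cl are kept as single tokens.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-\\/.:~@*$]")


def tokenize(smiles):
    """Split a SMILES string into tokens; raise if any character is skipped."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def encode(tokens, vocab):
    """Map tokens to integer ids, growing the vocabulary as new tokens appear."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]


def toy_features(smiles):
    """Hypothetical stand-in for a fingerprint: the set of adjacent token pairs."""
    toks = tokenize(smiles)
    return set(zip(toks, toks[1:]))


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Sequence view: the integer ids a language model would be trained on.
vocab = {}
ids = encode(tokenize("ClCCBr"), vocab)  # [0, 1, 1, 2]

# Feature view: similarity between ethanol and ethylamine on the toy features.
similarity = tanimoto(toy_features("CCO"), toy_features("CCN"))
```

In practice, the feature view would use established descriptors or fingerprints computed by a cheminformatics toolkit, and the sequence view would feed the token ids into an RNN or transformer; the sketch only shows how differently the two paradigms represent the same molecule.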
| Original language | English |
|---|---|
| Article number | e70057 |
| Journal | Wiley Interdisciplinary Reviews: Computational Molecular Science |
| Volume | 15 |
| Issue number | 6 |
| DOIs | |
| State | Published - 1 Nov 2025 |
Keywords
- QSAR
- Seq-2-seq modeling
- attention
- autoencoder
- chemical language models
- deep learning models
- predictions
- regenerative modeling
- self-attention
- transformer
Title: From Feature-Based Chemical Similarity to Chemical Language Models—A Paradigm Shift in Computer-Aided Molecular Design and Property Predictions