From Feature-Based Chemical Similarity to Chemical Language Models—A Paradigm Shift in Computer-Aided Molecular Design and Property Predictions

  • Arkaprava Banerjee
  • Supratik Kar
  • Kunal Roy
  • Grace Patlewicz
  • Imran Shah
  • Panagiotis G. Karamertzanis
  • Giuseppina Gini
  • Emilio Benfenati

Research output: Contribution to journal (Comment/debate)

3 Scopus citations

Abstract

This review provides a comprehensive overview of the paradigm shift in computer-aided molecular design and property prediction from similarity-based modeling, including quantitative structure–activity/property relationship (QSAR/QSPR), read-across, read-across structure–activity relationship (RASAR), and pharmacophore mapping, to sequence-based chemical language models (CLMs) built with deep learning techniques. Starting from methods for representing chemical structures and latent chemical space, and touching on classical molecular descriptor- and fingerprint-based modeling, the review introduces string-based deep learning models, including recurrent neural networks (RNNs) with long short-term memory (LSTM) and other architectures such as variational autoencoders (VAEs), attention models, and generative adversarial networks (GANs). The basics of more efficient transformer models are also discussed. Approaches to training with scarce data, namely transfer learning, data augmentation, and natural-product-inspired training, are analyzed. The applications of CLMs in the de novo design of small molecules of medicinal interest, enzymes, peptides, and multitask agents, and in the prediction of drug-candidate properties and activity cliffs, are presented. Applications of CLMs in materials science and predictive toxicology are also mentioned. We discuss the limitations of feature-based modeling approaches, which are confined to a restricted feature space. CLMs, for their part, lack specific insights into aspects such as SARs, bioisosteric replacements, and synthesizability, which collectively hinder their acceptance by regulators and synthetic chemists. The review concludes that cheminformaticians need to use the two approaches as complements: simplicity, reproducibility, and regulatory acceptability may favor feature-based approaches, while the pursuit of higher accuracy and the generation of novel molecules may drive the adoption of CLMs. This article is categorized under: Data Science > Chemoinformatics; Structure and Mechanism > Computational Biochemistry and Biophysics; Software > Molecular Modeling.

Original language: English
Article number: e70057
Journal: Wiley Interdisciplinary Reviews: Computational Molecular Science
Volume: 15
Issue number: 6
State: Published - 1 Nov 2025

Keywords

  • QSAR
  • Seq-2-seq modeling
  • attention
  • autoencoder
  • chemical language models
  • deep learning models
  • predictions
  • regenerative modeling
  • self-attention
  • transformer
