BORIS Theses

Bern Open Repository and Information System

Learning the Language of Chemical Reactions – Atom by Atom. Linguistics-Inspired Machine Learning Methods for Chemical Reaction Tasks

Schwaller, Philippe (2021). Learning the Language of Chemical Reactions – Atom by Atom. Linguistics-Inspired Machine Learning Methods for Chemical Reaction Tasks. (Thesis). Universität Bern, Bern

21schwaller_p.pdf - Thesis
Available under License Creative Commons: Attribution-Noncommercial-No Derivative Works (CC-BY-NC-ND 4.0).


Over the last hundred years, not much has changed in how organic chemistry is conducted. In most laboratories, the current practice is still trial-and-error experimentation guided by human expertise acquired over decades. What if, given all the published knowledge, we could develop an artificial intelligence-based assistant to accelerate the discovery of novel molecules? Although many approaches were recently developed to generate novel molecules in silico, only a few studies complete the full design-make-test cycle, including the synthesis and the experimental assessment. One reason is that the synthesis part can be tedious and time-consuming, and it requires years of experience to perform successfully. Hence, synthesis is one of the critical limiting factors in molecular discovery. In this thesis, I take advantage of similarities between human language and organic chemistry to apply linguistic methods to chemical reactions, and develop artificial intelligence-based tools for accelerating chemical synthesis. First, I investigate reaction prediction models focusing on small data sets of challenging stereo- and regioselective carbohydrate reactions. Second, I develop a multi-step synthesis planning tool predicting reactants and suitable reagents (e.g. catalysts and solvents). Both forward prediction and retrosynthesis approaches use black-box models. Hence, I then study methods to provide more information about the models’ predictions. I develop a reaction classification model that labels chemical reactions and facilitates the communication of reaction concepts. As a side product of the classification models, I obtain reaction fingerprints that enable efficient similarity searches in chemical reaction space. Moreover, I study approaches for predicting reaction yields.
Lastly, after approaching all chemical reaction tasks with atom-mapping-independent models, I demonstrate the generation of accurate atom-mapping from the patterns my models have learned during self-supervised training on chemical reactions. The leitmotif of my PhD thesis is the application of the attention-based Transformer architecture to molecules and reactions represented in a text notation. It is as if atoms were my letters, molecules my words, and reactions my sentences. With this analogy, I teach my neural network models the language of chemical reactions, atom by atom. While exploring the link between organic chemistry and language, I take an essential step towards the automation of chemical synthesis, which could significantly reduce the costs and time required to discover and create new molecules and materials.
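
To make the letters-words-sentences analogy concrete, the following is a minimal sketch of how a reaction written in a text notation could be split into atom-level tokens before being passed to a sequence model. It assumes the text notation is a SMILES-like reaction string, which the abstract does not state explicitly. The pattern `TOKEN_PATTERN`, the function `tokenize`, and the example reaction are hypothetical illustrations and are not taken from the thesis.

```python
import re

# Hypothetical, simplified token pattern for a SMILES-like reaction string.
# It is an illustration only and does not cover the full notation.
TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]"           # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"               # two-letter halogens, matched before single letters
    r"|[BCNOSPFI]"          # common aliphatic atoms
    r"|[bcnosp]"            # aromatic atoms
    r"|%\d{2}|\d"           # ring-closure labels
    r"|>>|>"                # reaction arrow separating reactants and products
    r"|[()=#\-+\\/:.~@*$]"  # bonds, branches, and molecule separators
)


def tokenize(reaction: str) -> list[str]:
    """Split a reaction string into atom-level tokens.

    Raises ValueError if any character is not covered by the pattern,
    so that nothing is dropped silently.
    """
    tokens = TOKEN_PATTERN.findall(reaction)
    if "".join(tokens) != reaction:
        raise ValueError(f"untokenizable characters in {reaction!r}")
    return tokens


# Assumed example: acetic acid and ethanol giving ethyl acetate.
example = "CC(=O)O.OCC>>CC(=O)OCC"
print(" ".join(tokenize(example)))
```

In this picture each token plays the role of a letter, each dot-separated molecule the role of a word, and the full reaction string the role of a sentence.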

Item Type: Thesis
Dissertation Type: Cumulative
Date of Defense: 22 March 2021
Subjects: 500 Science > 540 Chemistry
500 Science > 570 Life sciences; biology
Institute / Center: 08 Faculty of Science > Department of Chemistry, Biochemistry and Pharmaceutical Sciences (DCBP)
Depositing User: Hammer Igor
Date Deposited: 27 May 2021 15:46
Last Modified: 27 May 2021 15:49
