Chemoinformatics (also known as ‘cheminformatics’) applies computational and information science techniques to a range of chemical problems. These techniques are used, for example, to model compounds and to predict their properties. Chemoinformatics has made it possible to simulate expensive operations such as high-throughput screening (the rapid testing of numerous compounds for desirable qualities). A good starting point for those new to the field is the book ‘An Introduction to Chemoinformatics’ by Leach and Gillet.
My project is specifically concerned with the idea of de novo design (Latin for ‘anew’). De novo design builds novel molecules in virtual space and evaluates them against some measure of functionality, such as similarity to a compound known to work well, or the complementarity between the shape of the molecule and the receptor it is to fill. The aim is to streamline the drug discovery process: by evaluating compounds virtually first, it may be possible to reduce the number of syntheses that need to be carried out. However, the sheer number of molecules that could be made renders it impossible to explore all of chemical space for solutions, even with modern levels of computing power. As a result, various stochastic sampling processes (such as the Monte Carlo method) have been adapted for use in this field. These can give good results, but their simplistic approach to constructing molecules can generate products that are impossible to synthesise in real-world situations, limiting the overall usefulness of the methods.
To work around such problems, methods have been developed within the chemoinformatics group that use genuine reaction data from the literature to create generic rules (the reaction vectors). These rules can be applied to a given starting material to generate new molecules. This provides a compromise between finding novel molecules and retaining a synthetic awareness, as each transformation is based on a literature precedent. However, this method has its own limitations. Some structures, for example, can only be built up using multistep reaction sequences. In many cases, the intermediates in such sequences are less similar to the target than either the starting material or the end product. Consequently, many potentially useful molecules are never completed because steps en route to the end product score poorly.
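The limitation described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the molecule labels, the similarity scores and the function names are hypothetical and are not taken from the group's methods or data. It compares a search that scores every step with one that scores only the end of a sequence.

```python
# Hypothetical similarity-to-target scores (0 to 1); invented for illustration.
SCORES = {"start": 0.40, "intermediate": 0.25, "product": 0.90}

# Hypothetical single-step transformations: molecule -> molecules it can become.
STEPS = {"start": ["intermediate"], "intermediate": ["product"]}


def stepwise_search(start, min_score):
    """Apply one step at a time, discarding any molecule scoring below min_score."""
    kept = [start]
    frontier = [start]
    while frontier:
        molecule = frontier.pop()
        for child in STEPS.get(molecule, []):
            if SCORES[child] >= min_score:
                kept.append(child)
                frontier.append(child)
    return kept


def end_products(molecule):
    """Follow every sequence of steps to its end, ignoring intermediate scores."""
    children = STEPS.get(molecule, [])
    if not children:
        return [molecule]
    return [end for child in children for end in end_products(child)]


def sequence_search(start, min_score):
    """Score only the end product of each complete sequence."""
    return [end for end in end_products(start) if SCORES[end] >= min_score]


print(stepwise_search("start", 0.35))   # the intermediate is rejected, so the product is never reached
print(sequence_search("start", 0.35))   # the whole sequence is treated as one operation
```

With these made-up scores, the stepwise search stops at the starting material, while the sequence-level search reaches the high-scoring product.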
One solution to this problem is to create a new rule format that represents all the reaction steps of a sequence in one operation. This requires those sequences to be identified first, which is where my project comes in. If we regard a chemical reaction as a transformation of one molecule into another, it becomes possible to connect these transformations together into a network. Every path through the network then represents a multistep synthetic route in which each step has a known example in the literature. Reaction properties such as ease of synthesis, cost of materials and product yield can then be added to the network to bias the selection of routes in accordance with a particular set of criteria.
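As a minimal sketch of the network idea, the following treats molecules as nodes and reactions as directed, weighted edges. The molecule labels, the cost values and the function names are all hypothetical placeholders, not real literature data or the project's actual implementation. The single "cost" weight stands in for whichever reaction property (ease of synthesis, cost of materials, yield) is used to bias route selection.

```python
import heapq
from collections import defaultdict

# Hypothetical reaction records: (reactant, product, cost).
# Labels and costs are placeholders invented for illustration.
REACTIONS = [
    ("A", "B", 1.0),
    ("B", "C", 2.0),
    ("A", "C", 5.0),
    ("C", "D", 1.0),
    ("B", "D", 4.5),
]


def build_network(reactions):
    """Connect single-step transformations into a directed network."""
    network = defaultdict(list)
    for reactant, product, cost in reactions:
        network[reactant].append((product, cost))
    return network


def all_routes(network, start, target, path=None):
    """Yield every cycle-free path from start to target (each is a multistep route)."""
    path = (path or []) + [start]
    if start == target:
        yield path
        return
    for product, _cost in network.get(start, []):
        if product not in path:
            yield from all_routes(network, product, target, path)


def cheapest_route(network, start, target):
    """Dijkstra's algorithm: return (total cost, route) or None if unreachable."""
    queue = [(0.0, [start])]
    settled = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for product, step_cost in network.get(node, []):
            heapq.heappush(queue, (cost + step_cost, path + [product]))
    return None


network = build_network(REACTIONS)
for route in all_routes(network, "A", "D"):
    print(" -> ".join(route))
print(cheapest_route(network, "A", "D"))
```

Changing the edge weights changes which route is preferred, which is one simple way the network could be biased towards a particular set of criteria. A property such as yield, which combines multiplicatively along a route, would need to be transformed (for example by taking negative logarithms) before being summed like this.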