Corresponding Author:
E. C. Coutinho
Department of Pharmaceutical Chemistry, Bombay College of Pharmacy, Kalina, Santacruz (East), Mumbai-400 098, India.
[email protected]
Date of Submission 15 September 2005
Date of Revision 09 June 2006
Date of Acceptance 15 September 2005
Indian J Pharm Sci, 2006, 68 (6): 689-696  


Pressure is mounting on the pharmaceutical industry to reduce both the cost of drugs and the time to market. The large number of targets made available in the last decade has created a new area for technologies that can rapidly identify quality lead candidates. Virtual screening is one such technology that is gaining increasing importance in the drug discovery process. Virtual screening is a reliable and inexpensive method currently being employed as a complementary approach to high-throughput screening. Virtual screening can be adopted irrespective of the structural information of the target receptor. In the absence of structural data, virtual screening using pharmacophore-based search is a major in silico tool. However, when the structure of the receptor is available, virtual screening using both pharmacophore-based and docking techniques can be employed. Both of these methods can be synergistically integrated to improve the drug design and development process. In this article, we provide an overview of methods for virtual screening - in particular, docking and pharmacophore-based - along with commercial algorithms implementing these methods, and a successful example in this arena. Further, we enumerate the potential for patenting such kind of studies.


The pharmaceutical industry is under ever increasing pressure to increase its success rate in bringing drugs to the market. Current efforts within the industry are directed at reducing the hit-to-drug timeline and increasing the number of quality candidate drugs that make the transition from discovery to the clinical phase. Enormous advances in genomics have resulted in large increase in the number of potential therapeutic targets. This growth in potential targets has increased the demand for reliable target validation, as well as technologies that can identify rapidly several quality lead candidates [1-4]. Virtual screening (VS) technology is gaining increasing importance in the discovery process because it is a reliable and inexpensive method for identifying lead molecules [5]. It is seen as a complementary approach to high-throughput screening (HTS) and when coupled with structural biology, it enhances the success rate of identification of leads. Further, the advances in computational techniques have enabled VS to make a deep impact on the drug discovery process.

The tools for carrying out VS can be broadly categorized as receptor-based or ligand-based. The ligand-based methods use information provided by a compound or a set of compounds that are known to bind to the desired target, and use this knowledge to identify other compounds with similar properties in databases. This is usually achieved by similarity and substructure searching [6] or pharmacophore matching [7] or 3D shape matching [8]. However, when the structure of the receptor is available, both pharmacophore-based and docking techniques can be employed. The latter method involves positioning each ligand into the binding site of the target, thus proposing a binding mode and affinity for each compound in the database. This information is then used to rank the compounds in order to select and experimentally test a small subset of hits for biological activity. In particular, both these methods can be synergistically integrated to improve the drug design and development process.

This article is intended to provide an overview of methods, in particular docking and pharmacophore searching, along with commercial and non-commercial software for in silico screening of ligand databases and some successful examples in this arena.

Database Structure

The primary requirement for performing VS, by either docking or by pharmacophore search method, is the availability of a database (DB) of ligand molecules. A corporate database or one available from commercial vendors such as Available Chemicals Directory (ACD) [9] or Cambridge Structural Database (CSD) [10] can be employed for this purpose. Additionally, a database of known reagents and compounds which are readily synthesizable can be used for VS after primary filtration for ‘drug-like’ properties [2] such as variation of Lipinski’s rule-of-five [11], number of rotatable bonds or the polar surface area. A database could also be filtered to remove compounds with specific substructures associated with poor chemical stability or toxicity. Also, physically relevant ionization and tautomeric states should be assigned to compounds in the DB. It is prudent to use all the relevant tautomers because there is no way of knowing a priori which tautomer is most likely to bind to the receptor.

For the generation of the 3D structure of a molecule during compilation of a database, one can use programs such as Corina [12] (Molecular Networks GmbH, http://, Concord and Confort (Tripos Inc., USA) or Converter (Accelrys Inc., USA). Sometimes it is necessary to assign partial charges to compounds in the database depending on the requirement of the 3D search method.

Another issue in VS is the handling of ligand flexibility. Conformational flexibility can be introduced in VS either through the database or the query, or during the search process. Handling conformational flexibility in the database involves storage of multiple conformations of each ligand in the DB. However, this is a practically intractable solution because the number of conformations of a ligand increases with an increase in the number of rotatable bonds. A small drug-like molecule with four rotatable bonds, if scanned at a resolution of 120° for each dihedral angle, yields [360/120]4 = 81 conformations, and there is no firm basis for the ones to be included in the DB. A better solution is the hybrid approach. The combination of multi-conformer database and flexible searching provides an efficient and effective route. This hybrid approach is employed in the Catalyst BEST search method (Accelrys Inc., USA).

Vs Using Docking

The docking process involves the simulation of molecular recognition events using the 3D structure of the receptor (obtained by X-ray crystallography or NMR or a theoretical method such as comparative modelling). The 3D structure of the receptor is corrected and transformed into an appropriate format depending on the requirements of the docking program. The docking method then samples the conformational space of the binding site and scores each possible ligand pose, which is then taken as the predicted binding mode of that ligand. The ligand and receptor flexibility is an important aspect in VS because the conformations of the ligand, as well as the receptor binding site residues, might be different from their conformation in the bound state. The aspects of ligand and receptor flexibility are tackled to some extent in some of the docking programs (flexible search) or by pre-computing a DB of several conformers of each compound to be screened, as mentioned in the previous section. A majority of the docking programs explore ligand flexibility through a variety of algorithms. One way to explore the ligand’s conformational space is to use Monte Carlo or simulated annealing methods [13,14], but these are often time consuming and not suitable for large-scale VS. The Genetic Optimization of Ligand Docking (GOLD [15] , Cambridge Crystallographic Data Center, UK; and the most recent version of AutoDock [16] (Scripps Research Institute, USA) use genetic algorithm-based approaches to generate and select the best conformations of a ligand. Flex-X17 (Tripos Inc., USA) has a fragment-based approach and uses a set of low energy torsion angles derived from the CSD for each single bond. GOLD also enables the customization of torsional energy contributions via a statistical analysis of torsion angle distributions observed in CSD. The program Glide (Schrödinger Inc., USA; http://www.schrö treats ligand flexibility by an exhaustive enumeration of the rotamer states for each rotatable bond coupled to a heuristic screen that rapidly eliminates conformations deemed unsuitable for binding to a receptor.

Majority of the docking methods treat the receptor as a rigid entity, which is an inaccurate but necessary approximation in order to limit the complexity and consequently the computational cost required to accurately sample the flexibility of the binding site. Researchers have used an ensemble of protein structures to incorporate protein flexibility. This ensemble of protein structures can be obtained by simulated annealing of the binding site or by use of several high resolution structures of the same target bound to different ligands or in apo form, or by use of an ensemble of NMR structures, or by use of multiple homology models. The programs AutoDock [16] and Flex-E [18] (Tripos Inc., USA) incorporate protein mobility through rotamer libraries for flexible side chains and by use of an ensemble of protein structures to generate Boltzman-weighted grids with which the docking function is generated [19].

The next important step that determines the success of the docking process is ranking the quality of different poses of the same molecule and then with respect to other molecules in the DB (Scoring). There are several scoring functions [20] available to achieve this task. Force field-based scoring functions such as AMBER [21],CHARMm [22] and CVFF [23] are based on atomic force fields in which single-point intermolecular interaction energy between the ligand and the receptor is used as a score for the given pose. Empirical scoring functions are based on an additive approximation of physicochemical properties such as H-bonding, hydrophobic interactions, entropic changes and metal-ion interactions, if present. Experimental binding energies of known ligand-receptor complexes are used to derive the coefficients of each term explaining a binding event. Therefore, the efficiency of these scoring functions depends on the variety and number of ligand-receptor complexes used to derive the scoring function and its validation on a large set of ligand-receptor complexes in the PDB. GOLD [24] , FlexX]17], SCORE [25], VALIDATE [26], ChemScore [27], Ludi [28] (Accelrys Inc., USA) and PLP [29] use an empirical scoring function to rank different poses. The GOLD scoring function has been validated on a large data set [30] and uses a genetic algorithm to rank the poses, and it is considered to be a reliable docking and scoring algorithm for VS. Another class of scoring functions, knowledge-based, is derived from the structures of ligand-receptor complexes using statistical mechanics. In contrast to empirical scoring functions, they do not require binding affinity data of ligand-receptor complexes and so are free to use information in ligand-receptor complexes deposited in the public or proprietary databases. Further, such scoring functions are expected to be less biased and more readily transferable to systems not included in the development of the scoring function. PMF [31], Bleep [32], SMOG [33] and DrugScore [34] (Cambridge Crystallographic Data Center, CCDC, UK) are some examples of knowledge-based scoring functions. A flow chart of the steps involved in VS by docking method is depicted in fig. 1.

Fig. 1: Flow chart of the virtual screening process using docking method

At the end of the docking process, a set of best compounds is selected. It is not advisable to select the compounds merely on the basis of scoring or ‘best ranking’. A recent study [35] that evaluated the performance of several scoring functions such as PLP, ChemScore and Dock found little or no correlation between predicted and experimental binding affinities. Therefore, it becomes crucial to use a post-analysis strategy to minimize the false positives in the selection list. In consensus scoring strategy [36-38], a docking function is used to generate top-ranked poses and then multiple scoring functions are used to score these poses. Then, only the top-ranked compounds common to each scoring method are chosen for biological evaluation. It has been observed that almost all docking methods are able to identify the correct binding pose [39] but usually cannot rank it as the top-ranked binding pose for the ligand. Therefore, it is prudent to use consensus scoring of multiple poses rather than a top-ranked single pose.

Based on the above known limitations, it is strongly advisable to use a set of ligands with known binding modes and affinities, and multiple docking and scoring (consensus) algorithms, to calibrate the docking methodology for the target under study before undertaking VS of large databases.

Vs Using Pharmacophore Methods

The docking methodologies discussed above help in screening databases only when the 3D structure of the target receptor is available. In the early phase of drug discovery, or in cases where the structure elucidation is beyond the capabilities of present day techniques, the docking technique cannot be used. A pharmacophore-based search of databases can be carried out in absence of the receptor structure with the help of the information available from a set of ligands binding preferentially to the given receptor. The credit for first conceiving the concept of the pharmacophore goes to Paul Ehrlich, who devised a way (in early 1900s) to develop dyes through chromophores (i.e., the part of a molecule responsible for imparting colour) [40]. A good review of the evolution and history of the pharmacophore concept has been published by Peter Gund [41], who is also the author of the modern definition of a pharmacophore - “a set of free download froma site hosted by Medknow structural features in a molecule that is recognized at a receptor site and is responsible for that molecule’s biological activity” [42,43]. The concept could not achieve its full utilization until the development of 3D database searching software in 1990. MOLPAT [44], the first computer program, was developed at Princeton University in 1974 by Gund, Wipke and Langridge to recognize pharmacophore patterns. The demand for 3D searching software reached a critical level with the development of rapid 3D structure generation programs such as CONCORD [45], CORINA [12,46], AIMB [47] and WIZARD [48] . ALADDIN [49] (developed at Abbott and later commercialized by Daylight) and 3D-Search [50] (developed at Lederle) were developed by pharmaceutical companies, whereas academic and government institutions were responsible for CAST-3D [51] (Chemical Abstract Services), DOCK [52] (University of California at San Francisco) and CAVEAT [53] (University of California at Berkley). The first commercial 3D searching system, MACCS-3D [54], was developed by Güner et al. and was released in December of 1989. During the next 4 y, all of the technology that is available today was developed - ChemDBS-3D [55] (Accelrys Inc., USA), UNITY [56] (Tripos Inc., USA) and Catalyst [57] (Accelrys Inc., USA). The critical demand for the pharmacophore recognition software arose when the above-mentioned 3D searching technologies were made easily available. Though these 3D-search software had inbuilt query-generation tools, specialized pharmacophore generation software were also being developed. Most notable among them are DISCO [58] by Martin et al. (Tripos Inc., USA), HipHop [59] by Barnum et al. (Accelrys Inc., USA) and GASP [60] by Jones and Willett (Tripos Inc., USA). Meanwhile, predictive pharmacophore models such as CoMFA [61] by Cramer et al. (Tripos Inc., USA), Apex-3D [62] by Golander and Vorpagel (Accelrys Inc., USA) and HypoGen [63] by Teig et al. (Accelrys Inc., USA) were being developed. Detailed usage and validation of all of the pharmacophore development software are covered elsewhere41. For timely reviews in the field, the reader may follow other references [64-68].

Pharmacophore modelling provides a useful framework for better understanding of the existing data and can be used as a predictive tool in the design of compounds with improved potency, selectivity and/or pharmacokinetic properties. Pharmacophore models are generated by analyzing structure-activity-relationships and mapping common structural features of active analogues. The pharmacophore can be identified by direct method (using a receptor-ligand complex) or by indirect method (using only a collection of ligands that have experimentally been observed to interact with a given receptor). However, direct methods are becoming extremely important because of the increase in the rate at which protein structures are being determined. A flow chart of steps involved in VS by the pharmacophore method is depicted in fig. 2.

Fig. 2: Flow chart of the virtual screening process using pharmacophore method

Pharmacophore generation

Pharmacophore generation starts with the selection of a set of ligands for which the pharmacophore model is to be constructed. The type of ligand molecules, the size of the dataset and its diversity have a great impact on the resulting pharmacophore model [67]. The Carnell-Smith method69, RAPID [70] and HipHop [59,64,71] assume that ligands have the same activity and thus do not consider activity data. The information of inactive ligands can also be used to indicate structural features that significantly decrease the activity. The programs that use information of inactive ligands are CLEW [72] and the current version of DISCO [58]. Pharmacophore models can also be used to predict the activity of unknown compounds. For deriving such models, HypoGen [63] utilizes a large enough set of diverse compounds (about 18-30) with different activity levels (4-5 orders of magnitude). Most of the currently available methods such as HipHop, HypoGen, MPHIL [73] and RAPID are designed to handle small (less than 100 ligands) data sets. Other methods use larger data sets as input but convert them into a smaller one by sorting the activities of the ligands, depending on a user-specified cut-off. Lastly, one should remember that though the dataset should be as diverse as possible to get an accurate pharmacophore model, very different ligands may bind at different binding sites, resulting in a misleading pharmacophore model [74].

In the next step, the features relevant to the pharmacophore model are extracted from the input ligands (feature extraction). Features can be defined depending on topology (phenyl ring and carbonyl group), function (H-bond donor/acceptor, acid, base, aromatic ring and hydrophobic group) and atom (3D position of an atom and its type) [67]. Both topology-based and function-based features have some drawbacks. For example, hydroxyl oxygen can be classified as both an H-bond donor and acceptor. A simple way to represent a functional group is by its centre. The centre of an acid, base, H-bond donor/acceptor is usually defined as the position of an actual atom. For a hydrophobic region or an aromatic ring, the centre is defined as the centroid of the group. Furthermore, a vector representation is more accurate than a point representation since it imposes an additional constraint on bond directionality between the ligand feature and its complementary feature on the receptor [71,75]. In addition, a hydrophobic group can be represented by a sphere and an aromatic ring by a plane and its normal.

From each ligand structure, the selected features are combined to form a representation of the whole structure. The RAPID method represents ligand structures as a set of labelled points in three-dimensional space, where each point is associated with a feature [70]. Another approach by Takahashi et al. [76] represents a ligand structure through a labelled graph, with nodes representing the features and the edges representing the relations. Different methods that use this representation are given in the literature [77]. In another approach, a ligand structure is considered a set of labelled points, together with the associated interpoint distances [77]. This type of representation is orientation independent, in contrast to the 3D point-set representation.

The features extracted from different ligand molecules are matched and pharmacophore candidates are proposed. This is called pattern identification. A pattern or configuration is a set of relative locations in 3D space, each associated with a feature. A ligand is said to match a pattern if it possesses a set of features and a conformation such that the features can be superimposed with the corresponding locations. The most popular approach to define a pattern is to find the maximal common substructure (MCS). The methods that have adopted this approach are DISCO [58], RAPID [70], GAMMA [78] and GASP [60]. This approach assumes that a common pharmacophore is responsible for the observed activity. This is an inaccurate assumption in the MCS approach and can be overcome by relaxing the requirements (“Relax MCS” approach) that all input ligands possess all the features. This philosophy is used in the MPHIL [74] method.

Algorithms for pattern identification have also been developed. (For a detailed description, see reference [67]). The Clique detection algorithm [79-81] has been implemented in DISCO and MPHIL methods. Clique detection has its origin in graph theory. A clique is a subgraph in which every node is connected to every other node. The Clique detection algorithm finds the largest clique in a reference graph which is also present in every other graph in the set. HipHop and SCAMPI use an exhaustive search algorithm in which the search for a pattern starts with small sets of features and extends until no larger common pattern exists. HypoGen uses a similar approach but also incorporates activity information into the pharmacophore derivation process. This is done in three steps: the constructive stage identifies pharmacophore candidates that are common among the set of most active ligands, followed by the subtractive stage in which those pharmacophore candidates identified in the constructive stage and also present in more than half of the least active ligands are removed; and the last step of optimization attempts to improve the score of the pharmacophore candidates that pass the subtractive stage by simulated annealing [67]. GASP [60] and GAMMA [78] are based on the genetic algorithm [82] and also perform a conformational search as part of the GA run. Thus, molecular flexibility is simulated by generating multiple conformations of the given ligand followed by an exploration of different ways to align the molecules and to identify the pharmacophore pattern.

In the last step of pharmacophore generation, candidates are scored and ranked. A lower score indicates a greater possibility that the model is a result of chance correlation. A detailed description of scoring methods implemented in techniques that relax the MCS requirements, and scoring in genetic algorithms like GASP and GAMMA is given in the literature [59,67].

As mentioned in the introductory section, VS by pharmacophore searching is more efficient when structural knowledge of the target receptor is available. The receptor-based approach for pharmacophore generation involves analyzing features of the active site and the spatial relationship among them, and then an active image of this is used to construct the pharmacophore model. Such an operation gives rise to a large number of features, and it is necessary to determine which of these are actually parts of the pharmacophore. The method begins with a 3D structure of the receptor (usually in PDB format) and a set of ligands with known activity. Using the information of the active site residues (from biochemical or structural studies), a program such as Ludi [25,28] (Accelrys Inc., USA) generates an interaction map which is a complement of the receptor-binding site. The ligands with known activity are used to identify the functional features such as H-bond donors/acceptors and lipophilic groups from the interaction map in the active site. Often it is necessary to filter/limit the number of features since queries with multiple features may fail to retrieve any hits from the database. Therefore, 3D queries composed of fewer features are generated by considering all possible combinations. Catalyst [57] uses these queries to search the ligand database.

Successful Application of vs Methods

DNA gyrase inhibitors

A 3D database search using Ludi and Catalyst was performed to find novel scaffolds of DNA gyrase inhibitors [83]. The ACD database of 350,000 compounds and a portion of the Roche compound inventory (RCI) were screened to obtain 150 weak inhibitors. These novel DNA gyrase inhibitors binding to the ATP-binding site have seven structurally different scaffolds. The optimization of the indazole scaffold provided a DNA gyrase inhibitor which was 10 times more potent than novobiocin.

Sterol metabolism

Laggner and colleagues [84] have reported pharmacophore models for three protein targets involved in sterol metabolism. Twenty-three structurally diverse molecules with binding affinity data for EBP (emopamil binding protein), ERG2 (fungal counterpart of EBP) and the sigma-1 receptor were used in the study to derive a pharmacophore model with the HypoGen module of Catalyst. These three enzymes of sterol metabolism share high affinity for various structurally diverse compounds. Three pharmacophore models with one positive ionizable group and four hydrophobic features in common but with different spatial arrangements were derived and validated. The study showed that the hydrogen-bonding interactions are not required for high-affinity inhibitor binding. The models were subsequently used to search the database [such as World Drug Index (WDI), Kyoto Encyclopedia of Genes and Genomes (KEGG) and COMPOUND database]. In virtual screening, the drugs that were reported previously to bind to one or several of these proteins were retrieved along with 11 new hits, which were then tested experimentally. Inhibitors with nanomolar binding affinity were discovered.

Can one Protect Knowledge-Based Concepts?

The answer is now YES. Though there are no patents yet of QSAR studies, the pharmacophores are being protected under Intellectual Property Rights. The credit for the first application of patenting such a knowledge-based concept goes to Biogen. In 1998, Biogen applied for a world patent of pharmacophores (WO 98/04913) in which all compounds derived from a 3D database search of the described pharmacophore were included. Peptor Ltd. filed a patent (US 6,343,257) that involves the development of a pharmacophore, its use in VS and use of the hits to design new compounds. Another patent of pharmacophores covers Hepatitis C NS3 protease inhibitors. This patent (WO 98/46630) claims all compounds that fit the pharmacophore model that in turn represents the structure for inhibitors of Hepatitis C NS3 protease. Another patent filed for pharmacophores is US 2002/0013372 for the identification of CYP2D6 inhibitors.


The support provided by the Department of Science and Technology (DST), New Delhi, through the FIST program (SR/FST/LS1-083/2003) is acknowledged with thanks; and the authors SAK and AKM thank the Council of Scientific and Industrial Research (CSIR), New Delhi, for Research Fellowships.