Bioinformatics

The TARGET SELECTION AND BIOINFORMATICS Team was headed by George Phillips, Jr., PhD, and was responsible for identifying and ranking suitable target cDNAs for entry into the production pathway. The group also organized the data collected by the Center and released them to several public databases via the Sesame Laboratory Information Management System (LIMS):

| BMRB | PDB | PepcDB | PSI-SBKB | PSI-MR | TargetDB | Target XML

Goals for CESG Bioinformatics

  • Define target selection criteria.
  • Generate target priority scores.
  • Coordinate project data capture.
  • Analyze data produced by the project.
  • Improve pipeline efficiency.
  • Create and generate reports.
  • Support structure calculation, validation, and analysis.

Methods Being Investigated and Publications

CESG's mission allocated 60% of its effort to "fold-space" targets, 20% to targets of medical relevance, and 20% to outside user requests. A target can often satisfy more than one of these criteria; such targets are given high priority.

Fold-Space Scoring. Potential targets were scored on the basis of a variety of quantitative and predicted characteristics identified by software packages developed in-house, provided by collaborators, or licensed from academic sources. Where applicable, the software applications were run using the default settings. Characteristics (and the relevant tools) used to score targets are listed below:

  • Homology to proteins in the PDB and in the TargetDB that are at, or nearing, structure completion (BLAST installed at the University of Wisconsin-Madison Condor GRID computing facility). We currently use 30% identity as a threshold for being a valid fold-space target.
  • Presence of target "fragments" which lack homology to proteins in the PDB and in the TargetDB that are at or nearing structure completion (BLAST and in-house fragment analysis software).
  • Progress by other structural genomics centers.
  • Coiled coil domains (COILS).
  • Transmembrane segments (TMHMM and HMMTOP).
  • Signal peptide (SignalP and TargetP).
  • Number of Cys residues.
  • PFAM member identifier.
  • Percent low complexity (seg).
  • New fold prediction (not used).
  • Microarray analysis of in vivo transcription (where available).
  • Predicted disorder (PONDR).
  • Solubility prediction (under development).
  • Gene homology to transposable elements (Censor).

Based on the results of these analyses, each ORF is allocated a 14-digit "target score". In each category, low digits are considered to be desirable. These scores are then used to divide the ORFs into priority tiers. An additional digit, representing the overall tier into which the ORF has been placed, is then added to the beginning of the target score. ORFs in the first, second, or third tiers are considered prime targets.
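The assembly of such a score string can be sketched as follows. This is a minimal illustration, not the CESG implementation: the tiering rule in `assign_tier` and the function names are hypothetical, since the text does not state how per-category digits map to tiers.

```python
from typing import Sequence

NUM_CATEGORIES = 14  # one digit per scoring category listed above


def assign_tier(digits: Sequence[int]) -> int:
    """Hypothetical tiering rule: the tier follows the worst category digit.

    The real tier cut-offs are not given in the text; this rule is an
    assumption made only so the example runs.
    """
    worst = max(digits)
    # Clamp to 1..9 so the tier is always a single non-zero digit.
    return min(max(worst, 1), 9)


def target_score(digits: Sequence[int]) -> str:
    """Return the tier digit followed by the 14 category digits."""
    if len(digits) != NUM_CATEGORIES:
        raise ValueError(f"expected {NUM_CATEGORIES} digits, got {len(digits)}")
    if any(d < 0 or d > 9 for d in digits):
        raise ValueError("each category score must be a single digit 0-9")
    tier = assign_tier(digits)
    return str(tier) + "".join(str(d) for d in digits)


def is_prime_target(score: str) -> bool:
    """ORFs in the first, second, or third tier are prime targets."""
    return score[0] in "123"
```

Keeping the tier as the leading character means that a plain lexicographic sort of the score strings orders ORFs by tier first and by individual category digits afterwards.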

Medically Relevant Targets. Targets are deemed medically relevant if they have entries in the OMIM (Online Mendelian Inheritance in Man) database. We are currently prioritizing human proteins in this category with up to 40% sequence identity to anything in the PDB.
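The two identity thresholds mentioned so far (30% for fold-space targets, 40% for medically relevant human proteins) can be combined into a simple category check. The sketch below is illustrative only: the function name, the use of a single identity value for both tests, and the reading of "threshold" as "strictly below 30%" are assumptions.

```python
FOLD_SPACE_MAX_IDENTITY = 30.0  # percent; fold-space threshold from the text
MEDICAL_MAX_IDENTITY = 40.0     # percent; medical-relevance limit from the text


def categories(max_identity, organism, has_omim_entry):
    """Return the set of mission categories a target falls into.

    max_identity is the highest percent identity BLAST reports against known
    structures. Simplification: the same value is used for both tests, although
    the fold-space test also considers TargetDB entries nearing completion.
    """
    cats = set()
    # Assumption: a valid fold-space target is strictly below 30% identity.
    if max_identity < FOLD_SPACE_MAX_IDENTITY:
        cats.add("fold-space")
    # Medically relevant: human protein with an OMIM entry, up to 40% identity.
    if (has_omim_entry and organism == "Homo sapiens"
            and max_identity <= MEDICAL_MAX_IDENTITY):
        cats.add("medical")
    return cats
```

A target for which this returns more than one category would be one of the high-priority, multi-criteria targets described earlier.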

Outside User Requests. The management committee explicitly considers all outside user requests, as these can be difficult to manage. We first apply our usual fold-space scoring string; if the request meets those criteria and source DNA is available, the request is accepted. A request with high medical relevance can also be accepted. Requests that seem unlikely to succeed, or for which highly identical structures are already in the PDB, are declined.
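The decision rules in this paragraph can be written out as a small triage function. This is a sketch under stated assumptions, not the committee's procedure: the `Request` fields, the 90% cut-off for "highly identical", the use of tiers 1 to 3 as "meets the fold-space criteria", and the order in which the rules are checked are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tier: int                   # fold-space tier from the score string (1 = best)
    source_dna_available: bool
    high_medical_relevance: bool
    max_pdb_identity: float     # highest percent identity to any PDB entry
    unlikely_to_succeed: bool   # judgment made by the management committee


# Hypothetical cut-off; the text does not quantify "highly identical".
HIGH_IDENTITY_CUTOFF = 90.0


def triage(req: Request) -> str:
    """Apply the acceptance rules; checking declines first is an assumption."""
    if req.unlikely_to_succeed or req.max_pdb_identity >= HIGH_IDENTITY_CUTOFF:
        return "declined"
    if req.tier <= 3 and req.source_dna_available:
        return "accepted"
    if req.high_medical_relevance:
        return "accepted"
    # Anything else is left to the committee rather than decided automatically.
    return "needs committee review"
```

For example, `triage(Request(2, True, False, 22.0, False))` returns `"accepted"`, while the same request with `max_pdb_identity=95.0` returns `"declined"`.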

Once approved, targets are uploaded into the Sesame LIMS through a comma-separated values (.csv) file using the "Bulk Upload" option available in the "Lab Resources" area. The .csv file must contain specific information in a standard format.
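A file of this kind could be produced with a short script such as the one below. The column names in `FIELDS` are hypothetical placeholders; the actual layout required by the Sesame "Bulk Upload" option is not given here and would need to be taken from the Sesame documentation.

```python
import csv
import io

# Hypothetical column names; not the real Sesame bulk-upload layout.
FIELDS = ["target_id", "organism", "sequence", "target_score"]


def targets_to_csv(targets):
    """Serialize a list of target dicts to CSV text with a fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for t in targets:
        # Reject incomplete rows before they reach the upload file.
        missing = [f for f in FIELDS if not t.get(f)]
        if missing:
            raise ValueError(f"target {t.get('target_id', '?')} is missing {missing}")
        writer.writerow({f: t[f] for f in FIELDS})
    return buf.getvalue()
```

Validating each row before writing keeps incomplete records out of the upload, which matters when the receiving system expects every field in a fixed position.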

Using a maskless photolithography method, we produced DNA oligonucleotide microarrays with probe sequences tiled throughout the genome of the plant Arabidopsis thaliana. RNA expression was determined for the complete nuclear, mitochondrial, and chloroplast genomes by tiling 5 million 36-mer probes. These probes were hybridized to labeled mRNA isolated from liquid grown T87 cells, an undifferentiated Arabidopsis cell culture line. Transcripts were detected from at least 60% of the nearly 26,330 annotated genes, which included 151 predicted genes that were not identified previously by a similar genome-wide hybridization study on four different cell lines. In comparison with previously published results with 25-mer tiling arrays produced by chromium masking-based photolithography technique, 36-mer oligonucleotide probes were found to be more useful in identifying intron-exon boundaries. Using two-dimensional HPLC tandem mass spectrometry, a small-scale proteomic analysis was performed with the same cells. A large amount of strongly hybridizing RNA was found in regions "antisense" to known genes. Similarity of antisense activities between the 25-mer and 36-mer data sets suggests that it is a reproducible and inherent property of the experiments. Transcription activities were also detected for many of the intergenic regions and the small RNAs, including tRNA, small nuclear RNA, small nucleolar RNA, and microRNA. Expression of tRNAs correlates with genome-wide amino acid usage.

Stolc, V., Samanta, M.P., Tongprasit, W., Sethi, H., Liang, S., Nelson, D.C., Hegeman, A., Nelson, C., Rancour, D., Bednarek, S., Ulrich, E.L., Zhao, Q., Wrobel, R.L., Newman, C.S., Fox, B.G., Phillips, G.N., Jr., Pak, J.W., Markley, J.L., Sussman, M.R. (2005) Identification of transcribed sequences in Arabidopsis thaliana by using high-resolution genome tiling arrays. PNAS 102(12):4453-8. |15755812|

It has been previously shown that protein sequences containing a quasi-repetitive assortment of amino acids are common in genomes and databases such as Swiss-Prot but are under-represented in the structure-based Protein Data Bank (PDB). Structural genomics groups have been using the absence of these "low-complexity" sequences for several years as a way to select proteins that have a good chance of successful structure determination. In this study, we present a careful examination of the data deposited in the PDB as well as the available data from structural genomics groups in TargetDB and PepcDB to reveal interesting trends that could be taken into consideration when using low-complexity sequences as part of the target selection process. In particular, while the presence of low-complexity regions appears to inhibit structure determination of proteins by both nuclear magnetic resonance (NMR) and X-ray crystallography, it appears that when the proteins are normalized by length, NMR shows a higher tolerance for proteins with low-complexity regions.

Bannen, R.M., Bingman, C.A., Phillips, G.N., Jr. (2008) Effect of low-complexity regions on protein structure determination. JSFG 8(4):217-26. |18302007|

The effect of the target filtering algorithm PONDR (the Predictor of Naturally Disordered Regions) on the yield of viable structure determination candidates was analyzed. In addition, we compared the ability of PONDR and 13 other approaches for predicting disorder from sequence to predict experimental results obtained on 70 protein targets from Arabidopsis thaliana and 1 from Caenorhabditis elegans, which had been labeled uniformly with nitrogen-15 and screened for disorder by NMR spectroscopy. Our study indicates that the efficiency of structural proteomics of eukaryotes can be improved significantly by removing targets predicted to be disordered by an algorithm chosen to provide optimal performance.

Oldfield, C.J., Ulrich, E.L., Cheng, Y., Dunker, A.K., Markley, J.L. (2005) Addressing the intrinsic disorder bottleneck in structural proteomics. Proteins 59(3):444-53. |15789434|

Determination of a protein structure requires a series of decisions and processes, starting with target selection, through cloning, expression, purification, and finally structure determination. Structural genomics projects may distribute these steps among several different groups of researchers. Although this division may achieve a lower cost per solved structure, it creates a unique set of challenges for integrating and passing information on the progress of a given target across several functional divisions. Laboratory information management systems (LIMS) are essential for gathering this information, but may not display the progress of a given target in an intuitive way. In addition, structural genomics projects funded by the Protein Structure Initiative (PSI) are obliged to disseminate data regularly to the TargetDB and PepcDB data repositories, and this requires the creation of specialized views of the data. We report here how the flow of a target through a structural genomics pipeline and reports to TargetDB and PepcDB can be abstracted as directed acyclic graphs or trees. To implement this kind of display, we created software that tracks the flow of activity leading toward protein structure determination and prepares XML reports as input to TargetDB and PepcDB. The target tracing software consists of a set of Perl CGI scripts that integrate with the Graphviz visualization system to provide a graphical, user-friendly Web interface. The database reporting software, also coded in Perl, transfers large-scale genomics data from our LIMS into a PepcDB reportable XML file. This software package has facilitated inter-group communication, improved the quality and accuracy of information in our LIMS, and increased the efficiency and accuracy of our reports to PepcDB.
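The idea of abstracting a target's progress as a directed acyclic graph and handing it to Graphviz can be illustrated briefly. The published software was written in Perl; the Python sketch below is not that software. The stage names come from the steps named in the abstract, while the edge list and function names are assumptions made for illustration.

```python
from collections import defaultdict, deque

# Illustrative flow for one target; stage names follow the abstract.
EDGES = [
    ("target selection", "cloning"),
    ("cloning", "expression"),
    ("expression", "purification"),
    ("purification", "structure determination"),
]


def topological_order(edges):
    """Return the nodes in pipeline order (Kahn's algorithm).

    Raises ValueError if the edges contain a cycle, i.e. are not a DAG.
    """
    successors = defaultdict(list)
    indegree = {}
    nodes = []
    for src, dst in edges:
        for n in (src, dst):
            if n not in indegree:
                indegree[n] = 0
                nodes.append(n)
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


def to_dot(edges, name="target_flow"):
    """Render the edges as Graphviz DOT text."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)
```

The text returned by `to_dot(EDGES)` can be saved to a file and rendered with the Graphviz `dot` program; checking the order first guards against records that would make a target appear to loop back on itself.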

Pan, X., Wesenberg, G.E., Markley, J.L., Fox, B.G., Phillips, G.N., Jr., Bingman, C.A. (2008) A graphical approach to tracking and reporting target status in structural genomics. JSFG 8(4):209-16. |18236171|

Collaborations

  • Keith Dunker (Indiana University School of Medicine and Molecular Kinetics, Inc. at Indianapolis, IN) protein disorder, solubility, and crystallization predictions
  • Dmitrij Frishman (Institute for Bioinformatics, GSF) PEDANT database ORF annotation
  • Christine Orengo and Russell Marsden (University College, London, Midwest Structural Genomics Consortium) Domain predictions
  • Sue Rhee (TAIR)
  • Chris Town, Owen White, and Steven Salzberg (TIGR) ORF predictions and annotations
  • Simon Twigger (Medical College of Wisconsin)