Institute for Mathematical Sciences Event Archive
Post-Genome Knowledge Discovery
(January – June 2002)
~ Abstract ~
Sub-theme 3. Protein interaction and clinical data analysis (May-June 2002)
Tutorial on protein interaction and clinical data analysis (21 - 23 May 2002)
An introduction to protein interactions, computational
prediction of protein interactions, literature mining, and
protein networks
Ed Marcotte, University of Texas-Austin, USA
Lecture I: Principles of protein interactions & experiments for detecting them
Lecture II: Computational prediction of protein interactions
Lecture III: Mining the literature for protein interactions
Lecture IV: Networks of interacting proteins
These lectures will introduce the physical basis by which proteins interact and high-throughput experimental techniques for measuring interactions, then describe methods to discover interactions "in silico". These methods will be accompanied by an overview of the current movement to "mine" the extensive biological literature for known protein interactions. Combining the interactions discovered by all of these approaches reveals that proteins interact in extensive networks. The structure and properties of these networks are only beginning to be discovered, as will be described in the fourth lecture. Lectures are one hour each and are not evenly divided by topic.
Gene network inference and modeling biopathways
Satoru Miyano, University of Tokyo, Japan
The concept of systems biology will play an important role in the post-genome era, which should comprise harmonized activities in bioinformatics and discovery science. In this lecture, we pick up two topics which will be key issues for developing systems biology from the bioinformatics standpoint. One is computational knowledge discovery for biopathway information and the other is technology for modeling and simulating biopathways.
(1) Inference of Gene Networks from cDNA Microarray Data
Computational knowledge discovery for biopathway information ranges over various kinds of data, e.g., microarray gene expression profile data, protein-protein interaction data, proteome analysis, scientific literature, etc. This lecture focuses on cDNA microarray gene expression profile data and discusses a strategy for inferring the relations between genes from cDNA microarray data obtained by various perturbations such as gene disruptions, shocks, etc. Based on our work [1], we show a method for inferring a network of causal relations between genes from cDNA microarray gene expression data by using Bayesian networks. This method employs nonparametric regression for capturing nonlinear relationships between genes and derives a new criterion called BNRC (Bayesian Network and Nonlinear Regression) for choosing the network in general situations. In theory, this methodology subsumes previous methods based on the Bayesian approach [2]. The method is applied to the S. cerevisiae cell cycle data and to cDNA microarray data of 120 transcription factor disruptants. The results show that we can infer relations between genes as a directed acyclic network very effectively. We have also considered the use of linear splines fitted to gene expression data. Together with AIC, this allows us to infer statistically meaningful information from time-course cDNA microarray data [3] in which only a small number of time points are measured.
(2) Genomic Object Net - Towards Biopathway Modeling and Simulation
The other important issue is to reorganize and represent various biopathway information so that we can model biopathways and simulate them for new hypothesis generation and testing. For this purpose, we have developed a software tool, Genomic Object Net, which aims at describing and simulating structurally complex dynamic causal interactions and processes such as metabolic pathways, signal transduction cascades, and gene regulation. The notion of a hybrid functional net is introduced and employed as its basic architecture. For Genomic Object Net, an XML personalized visualization environment has also been developed for intuitive understanding of the representation and simulation. The software and some biopathway representations are available from http://www.GenomicObject.Net [4,5].
References:
[1] Imoto, S., Goto, T. & Miyano, S. (2002). Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Proc. Pacific Symposium on Biocomputing 7, 175-186.
[2] Friedman, N., Linial, M., Nachman, I. & Pe'er, D. (2000). Using Bayesian networks to analyze expression data. J. Comp. Biol. 7, 601-620.
[3] De Hoon, M., Imoto, S. & Miyano, S. (2002). Statistical analysis of a small set of time-ordered gene expression data using linear splines. Bioinformatics (accepted).
[4] Matsuno, H., Doi, A., Nagasaki, M. & Miyano, S. (2000). Hybrid Petri net representation of gene regulatory network. Proc. Pacific Symposium on Biocomputing 5, 338-349.
[5] Matsuno, H., Doi, A., Hirata, H. & Miyano, S. (2001). XML documentation of biopathways and their simulations in Genomic Object Net. Genome Informatics 12, 54-62.
Data mining techniques
Mohammed Zaki, Rensselaer Polytechnic Institute, USA
Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data. Traditional data analysis is assumption driven in the sense that a hypothesis is formed and validated against the data. Data mining, in contrast, is data driven in the sense that patterns are automatically extracted from data.
The goal of this tutorial is to provide an introduction to data mining techniques. The focus will be on methods appropriate for mining massive data sets using techniques from scalable and high performance computing. The techniques covered will include association rules, sequence mining, decision tree classification and clustering. Some aspects of preprocessing and postprocessing will also be covered.
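As a small illustration of the first technique listed above, the sketch below computes the two standard quantities behind an association rule X -> Y: support (how often X and Y occur together) and confidence (how often Y occurs when X does). The transactions and the helper names `support` and `confidence` are made up for this example and are not taken from the tutorial material.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) over the transactions.

    Assumes the antecedent occurs at least once (otherwise division by zero).
    """
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Toy, invented transactions: each one is a set of items.
transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]

rule_support = support({"a", "b"}, transactions)            # 2 of 4 transactions
rule_confidence = confidence({"a"}, {"b"}, transactions)    # 2 of the 3 with "a"
```

Scalable miners avoid scanning all transactions for every candidate itemset; this brute-force version only shows what is being counted.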
Workshop on protein interaction and clinical data analysis (28 - 31 May 2002)
Boosting and microarray data
Phil Long, Genome Institute of Singapore
(This is joint work with Vinsensius Vega.)
Boosting is a method for using training data to "learn" classification rules. It works by aggregating a number of rough "rules-of-thumb". Boosting has been successfully applied in a variety of domains. However, previous work has suggested that it is not well-suited to microarray data.
We have found one reason why AdaBoost, the standard boosting algorithm, does not perform well on microarray data, and identified a simple fix that dramatically improves its ability to find accurate classification rules. The modified algorithm performs nearly as well as the best previously known methods, and finds accurate classification rules that interrogate the expression levels of only a few genes.
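For readers unfamiliar with boosting, the following is a minimal sketch of textbook AdaBoost using one-feature threshold rules ("decision stumps") as the rules-of-thumb. It is the standard algorithm only; the modification described in this abstract is not specified here and is not reproduced. The toy data and all function names are assumptions made for illustration.

```python
import math


def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) for the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] > thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] > thr else -sign


def adaboost(X, y, rounds=3):
    """Standard AdaBoost for labels in {-1, +1}; returns [(alpha, stump), ...]."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # clip to keep log finite
        alpha = 0.5 * math.log((1 - err) / err)      # vote weight of this stump
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy, invented "expression" data: 4 samples, 2 features, labels in {-1, +1}.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]]
y = [-1, -1, 1, 1]
model = adaboost(X, y)
train_predictions = [predict(model, row) for row in X]
```

Because each stump looks at a single feature, the final rule depends on at most `rounds` features, which is one reason stump-based boosting is attractive when a rule involving few genes is wanted.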
Dynamic experimental design methodology based on query
learning and its application to prediction of MHC class I
binding peptides
Naoki Abe, IBM T.J. Watson Research Center, USA
(This is joint work with Keiko Udaka, Hiroshi Mamitsuka, and Yukinobu Nakaseko, Kyoto University.)
We propose a paradigm for experiment design based on the framework of "query learning", and apply it to the problem of predicting MHC class I binding peptides. Query learning refers to a form of learning in which a function is estimated based on samples obtained by "querying" the function values at points of the learner's choice. The proposed paradigm is a "dynamic" method of experimental design, which is to be contrasted with the traditional approaches in which the points of experimentation are determined prior to the experiments. In the proposed method, the experiment design is conducted on the basis of the outcomes of earlier experiments, in an interactive fashion. We applied this paradigm to the important problem of predicting the MHC binding capacity of peptides. The experiments were conducted in a number of feedback loops (7 iterations in total), in which our computational method was used to determine the next set of peptides to be tested based on the results of the earlier iterations. Our experimental results demonstrate that the proposed method attains superior predictive performance compared to a state-of-the-art prediction method based on matrix data. Furthermore, by combining the two methods, binder peptides (logKd < -6) could be predicted with 84% accuracy, reaching a level of predictive performance unprecedented to date.
Using emerging patterns to analyse clinical data
Jinyan Li, Laboratories for Information Technology,
Singapore
Emerging patterns (EPs) are a data mining notion intended to capture significant differences between two classes of data. By definition, an emerging pattern is a set of conditions that most samples of one class satisfy but no samples of the other class satisfy.
A general introduction to EPs is first presented in this talk. To demonstrate the discriminating power of EPs, we then present a new classifier named PCL (Prediction by Collective Likelihood of EPs). PCL is highly accurate and easy to comprehend.
C4.5 and PCL are compared on several clinical data sets. The comparison covers their basic ideas, accuracies, and derived rules. Most experimental results show that PCL outperformed C4.5 on accuracy, especially when applied to gene expression data.
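The definition of an emerging pattern given above can be checked directly on data. The sketch below treats each sample as a set of conditions and tests whether a candidate pattern is frequent in one class and absent from the other. Reading "most samples" as a support threshold (`min_support`, default 0.5) is an assumption of this example, as are the toy samples; this is not the PCL classifier itself.

```python
def support(pattern, samples):
    """Fraction of samples that satisfy every condition in `pattern`."""
    pattern = frozenset(pattern)
    return sum(pattern <= s for s in samples) / len(samples)


def is_emerging_pattern(pattern, home, other, min_support=0.5):
    """True if `pattern` is frequent in `home` and never occurs in `other`."""
    return support(pattern, home) >= min_support and support(pattern, other) == 0.0


# Toy, invented samples: each is a set of discretised conditions.
class_a = [
    {"g1=high", "g2=low"},
    {"g1=high", "g2=low", "g3=high"},
    {"g1=high", "g2=high"},
]
class_b = [
    {"g1=high", "g2=high"},
    {"g1=low", "g2=low"},
]

ep_check = is_emerging_pattern({"g1=high", "g2=low"}, class_a, class_b)
non_ep_check = is_emerging_pattern({"g1=high"}, class_a, class_b)
```

Here the two-condition pattern holds in two of three class A samples and in no class B sample, whereas the single condition "g1=high" also appears in class B and so does not qualify.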
Computational methods towards predicting aspects of
protein structure and interactions
Mona Singh, Princeton University, USA
TBA
A multi-queue branch-and-bound algorithm for anytime
optimal search with biological applications
Richard Lathrop, University of California-Irvine, USA
Many practical biological problems involve an intractable (NP-hard) search through a large space of possibilities. This paper describes preliminary results from a multi-queue variant of branch-and-bound search that combines anytime and optimal search behavior. The algorithm applies to problems whose solutions may be described by an N-dimensional vector. It produces an approximate solution quickly, then iteratively improves the result over time until a global optimum is produced. A global optimum may be produced before producing its proof of global optimality. Local minima are never revisited. We describe preliminary applications to ab initio protein backbone prediction, small drug-like molecule conformations, and protein-DNA binding motif discovery. The results are encouraging, although still quite preliminary.
In search of a global representation of the protein
space in view of structural and functional genomics
initiatives
Michal Linial, The Hebrew University of Jerusalem, Israel
Structural genomics initiatives aim to solve a large number of protein structures that represent the diversity of the protein space. An essential step in any large-scale structural genomics project is to define a relatively small set of proteins with new, currently unknown folds. Here, we present a method that ranks each protein according to how likely it is to belong to a yet-unsolved superfamily or fold. The method makes extensive use of protein classification databases such as ProtoMap and ProtoNet.
In these classifications, the protein space is encoded as a graph whose vertices correspond to clusters of proteins. Structural information is derived from the PDB, adopting the SCOP superfamily and fold classification (SCOP 1.55). All SCOP domains are mapped onto ProtoMap clusters, which allows us to view many clusters as 'structurally solved'. For the rest of the clusters, distances within the ProtoMap graph are computed to reflect the space around each cluster that is free of solved neighbors. These distances are used to rank all unsolved proteins. At the top of the list are proteins that are more likely to belong to new superfamilies or folds. The computed scores were tested against newly released structures that are disjoint from the original set. The top-ranked proteins are at least 100 times more likely to represent new superfamilies than randomly chosen proteins. Proteins with the highest expectancy of representing new superfamilies constitute the target list for structural determination. Our list of selected targets is available through an interactive web site, ProTarget. A statistical analysis was conducted to assess the equivalence between several sequence-based classifications and structural information. Distances in the protein sequence map were compared to distances obtained from structural comparisons (DALI scores in FSSP).
The potential of functional prediction from large-scale classification methods and their limitations in obtaining structural information will be discussed.
Related publications:
- Yona, G., Linial, N. and Linial, M. (2000) ProtoMap - A classification of all protein sequences and hierarchy of protein families. Nucleic Acids Research 28, 49-55.
- Portugaly, E. and Linial, M. (2000) Estimating the probability of a protein to have a new fold - A statistical computational approach. Proc. Natl. Acad. Sci. USA 97, 5161-5166.
- Bilu, Y. and Linial, M. (2001) The advantage of functional prediction based on clustering of yeast genes and its correlation with non-sequence based classifications. Journal of Computational Biology 9, 193-210.
- Sasson, O., Linial, N. and Linial, M. (2002) The Metric Space of Proteins - Comparative Study of Clustering Algorithms. Proc. Xth Int. Conf. Intell. Syst. Mol. Biol. (in press).
Atomic Reconstruction of Metabolism (ARM) project
Masanori Arita, National Institute of Advanced Industrial
Science and Technology and PRESTO, Japan
Three GUI tools for comprehensive metabolite analysis are introduced. The first is a database of metabolites that can suggest structurally similar compounds for a given input. The second is a viewer of enzymatic reactions that shows structural correspondence among reactants in color. The last is a simulator of tracer experiments, which can display all logically possible pathways as well as those with putative metabolites (i.e. predictions). All tools are based on up-to-date graph algorithms for representing molecules and metabolic pathways. Metabolic data for Escherichia coli and Bacillus subtilis are prepared and ready for use.
Validating gene clusters
Dannie Durand, Carnegie Mellon University, USA
Large scale gene duplication, the duplication of whole genomes and subchromosomal regions, is a major force driving the evolution of genetic functional innovation. Whole genome duplications are widely believed to have played an important role in the evolution of the maize, yeast and vertebrate genomes. Two or more linked clusters of similar genes found in distinct regions on the same genome are often presented as evidence of large scale duplication. However, as the gene order and the gene complement of duplicated regions diverge progressively due to insertions, deletions and rearrangements, it becomes increasingly difficult to determine whether observed similarities in local genomic structure are indeed remnants of common ancestral gene order, or are merely coincidences. In this talk, I present combinatorial and graph theoretic approaches to validating gene clusters in comparative genomics.
New cytokine-related gene candidates identified from the
mouse transcriptome
Vladimir Brusic, Laboratories for Information Technology,
Singapore
We annotated more than 2000 RIKEN mouse cDNA clones pre-selected by keywords related to immune system products and by similarity to the human MHC region. The keywords included the terms cytokine and interleukin. Gene Ontology IDs included, among others, cytokine and chemokine mediated signaling pathway, cytokines, and IL1R ligand. Clones representing known mouse interleukin genes or their subunits include IL-1 members 5 and 6, IL-7, IL-10, IL-16, IL-17B, IL-20, IL-23, and IL-25. Clones representing known genes of interleukin receptors or their subunits include IL-1R, IL-2R, IL-4R, IL-6R, IL-7R, IL-10R, IL-12R, IL-13R, IL-15R, IL-17R, IL-18R, and IL-21R. Seventeen known small inducible cytokines of subfamilies A, B, C, and D are also represented in the annotated clones. Clones representing several known genes of cytokines or cytokine-associated genes (enhancers, accessory proteins, or cytokine-induced genes) are also in the annotated set. We found nine clusters of RIKEN clones that are candidates for novel cytokine or cytokine-related genes. These include candidates for the mouse guanylate binding protein (mGBP-5), mouse interleukin-1 receptor-associated kinase 2 (mIRAK-2), two members of the interferon-inducible proteins of the Ifi 200 cluster, three members of the membrane-associated family 1-8 of interferon-inducible proteins, one p27-like protein, and a hypothetical protein containing a Toll/Interleukin receptor domain.
Inferring HIV and cellular protein interactions during
infection with a functional genomics knowledge discovery
support system
Christian Schoenbach, RIKEN Genomic Sciences Center, Japan
(joint work with: Takeshi Nagashima, Akihiko Konagaya, Igor Kurochkin)
Protein interactions are functions of protein structure, post-translational modifications, translation, transcription and cellular context in a complex network at a given time. Context and temporal information are important when studying dynamic processes such as viral infection. To infer higher functional aspects of viral and cellular protein interactions during HIV infection, we are developing a rule-based, semi-automatic knowledge discovery support system to associate experimental gene expression data with protein interactions, MeSH terms and gene ontology extracted from MEDLINE abstracts. Predicted and annotated protein interactions derived from the 2HAPI GeneChip expression data set (SDSC) of HIV-infected T-cells are stored in a web-accessible database. Although we work with known, informative genes and proteins, the association with text information allows us to extract and summarize non-obvious relations for higher functional annotation, protein network construction and drug target discovery. For example, prosaposin, which is mostly known for its OMIM-reported association with Gaucher disease and some hereditary metabolic diseases, is down-regulated during HIV infection and predicted to be associated with apoptosis, binding to a G-protein associated receptor, and hyperalgesia.
Data mining for protein structure prediction
Mohammed Zaki, Rensselaer Polytechnic Institute, USA
Proteins fold spontaneously and reproducibly into complex three-dimensional globules when placed in an aqueous solution, and the sequence of amino acids making up a protein appears to completely determine its three-dimensional structure. This self-organization cannot occur by a random conformational search for the lowest energy state, since such a search would take millions of years and proteins fold in milliseconds (known as Levinthal's paradox). In this talk I'll highlight some of the data mining challenges for the protein folding problem, i.e., how to predict the three-dimensional tertiary structure of a protein given its linear amino acid sequence. I'll discuss some recent work on a hybrid approach that predicts local structure using a Hidden Markov Model and then infers contact rules based on association mining. The HMM models the interactions between adjacent short regions of the protein sequence, and so attempts to model the propagation of structure along the sequence. To detect long-range amino-acid contacts, we discover rules to predict whether a pair of residues is in contact or not. In the testing phase one can predict the contact map for an unknown protein, and from the contact map one can recover the 3D shape. I'll discuss limitations of the current approach, and some future directions on how to incorporate geometric constraints while mining and whether one can learn the folding pathways.
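To make the notion of a contact map concrete, the sketch below builds one from known 3D coordinates: residues i and j are marked as in contact when their distance falls below a cutoff. This shows the object being predicted, not the HMM or rule-mining method of the talk. The 8 angstrom cutoff, the one-point-per-residue representation, and the toy coordinates are assumptions chosen for illustration.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=1):
    """Symmetric boolean matrix: True where two residues are within `cutoff`.

    `coords` holds one (x, y, z) point per residue; pairs closer than
    `min_separation` positions along the sequence are skipped.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap


# Toy, invented coordinates: four residues spaced 3.8 units apart on a line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)
```

Raising `min_separation` restricts the map to pairs that are far apart in sequence, which corresponds to the long-range contacts mentioned above.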
Literature Mining for Interaction Pathway Discovery
See-Kiong Ng, Laboratories for Information Technology,
Singapore
Despite the proliferation of online molecular databases that have provided many opportunities for bioinformatic data analysis, the bulk of biomedical information and knowledge still resides in free-text format, as scientists routinely report their experimental results and insights in journal and conference articles. The resulting biomedical literature is uniquely centralized in the form of the Medline database, which contains a publicly accessible online collection of literature abstracts. Literature mining, the process of automatically extracting and combining facts from scientific publications, is therefore key for post-genome knowledge discovery.
An important literature mining application is the extraction of molecular (e.g. protein-protein) interactions from the online abstracts to elucidate the complex systems and networks responsible for important biological functions. We have conducted an investigative study to assess the effectiveness of such an approach; in this talk we discuss the implications of our results as well as the challenges of literature mining for interaction pathway discovery.
E-CELL: Towards integrative modeling of cellular
processes
Masaru Tomita, Keio University, Japan
The E-CELL Project (http://www.e-cell.org) was launched in 1996 at Keio University in order to model and simulate various cellular processes with the ultimate goal of simulating the cell as a whole. The E-CELL System, a generic software package we have developed, enables us to model not only metabolic pathways but also other higher-order cellular processes such as protein synthesis and signal transduction. Using the system, we have successfully constructed a virtual cell with 127 genes sufficient for "self-support". The gene set was selected from the genome of Mycoplasma genitalium, and the modeled metabolism includes transcription, translation, membrane transport, the glycolysis pathway for energy production, and the phospholipid biosynthesis pathway for membrane structure. Since all its proteins and membrane structure are modeled to degrade spontaneously over time, the virtual cell must keep synthesizing proteins and phospholipid bilayer to sustain its life. It thus takes up glucose as its energy source, and depleting glucose in the environment would result in "cell death from hunger". The Modeling Group in our institute is now developing many different models of cellular processes, including bacterial chemotaxis, circadian rhythms, photosynthesis, as well as cell cycle and cell division. For gene expression, we are working on general quantitative models and their application to the gene regulation network of the lactose operon in E. coli and the lambda phage genetic switch. For organelles, a quantitative model of mitochondria is nearly complete, and we will soon be developing a model of chloroplasts in the context of the e-Rice Project funded by the Japanese Ministry of Agriculture. For human cells, we have already developed a quantitative model of erythrocytes, which is being used in pathological analyses of enzyme deficiencies causing anemia. Other human cell models now being developed include myocardial cells, neural cells, and pancreatic beta-cells.
A major bottleneck in cell modeling is the lack of quantitative data, such as kinetic parameters, dissociation constants, steady-state concentrations, and flux rates. The Metabolome Group of our institute is developing methodologies for mass production of such quantitative metabolic data. We analyze metabolic flux distributions (MFDs) under different conditions such as dissolved oxygen (DO) concentration, pH, temperature, and media composition. We also label certain substrates with U-13C or 1-13C, and measure the isotope distribution of intercellular metabolites using NMR and GC-MS. Gene expression analyses are conducted using DNA microarrays for the transcription level, and 2D electrophoresis with SWISS-2D PAGE and TOF-MS for the protein level. For high-throughput measurement of metabolites, we are developing a novel analytical device based on capillary electrophoresis, which can eventually measure up to hundreds of metabolites at the same time.
Automatic reconstruction of 3D structures in single
particle analyses
Kiyoshi Asai, National Institute of Advanced Industrial
Science and Technology, Japan
Single particle analysis is a method for building 3D structures of particles by estimating the projection angles of their randomly oriented electron microscopic images. We give an overview of an automated system for single particle analysis, which was developed after solving the 3D structure of the sodium channel protein.
On haplotype reconstruction for diploid populations
Jian Zhang, EURANDOM, Netherlands
The problem of inferring haplotype pairs directly from unphased genotype data is crucial for haplotype-based disease gene discovery by whole genome linkage disequilibrium (LD) mapping and/or haplotype-based candidate gene analysis. Several in silico approaches have been described. However, none of them is directly based on the likelihood of haplotypes. This makes it rather difficult to understand the relationships between these methods and therefore to develop more efficient ones. In this work, we introduce the general complete-data-likelihood framework and the new concept of 'haplotype likelihood', which appear useful for avoiding some limitations of the expectation-maximization (EM) and Clark methods. Based on these likelihoods, we develop two kinds of estimators, called the maximum haplotype-likelihood (MHL) estimator and its Bayesian version. The MHL differs from the others in that it predicts the most likely haplotype pair for each individual in a given sample by using the haplotype likelihood and a new algorithm called the block-wise evolutionary Monte Carlo algorithm. Like the coalescence-based and Bayesian segmentation approaches, our method improves significantly on the EM and Clark methods in terms of error rate or capacity for handling a large number of ambiguous loci. We find that among all existing approaches, although the coalescence-based approach has the best performance under the coalescent model, our method can yield the lowest error rate for heterogeneous genotype data such as the ACE data, for which the coalescent model may not be suitable. This is based on joint work with Karla Koepke, Faming Liang, Margret R. Hoehe, and Martin Vingron.
Optimal adaptive designs and clinical research
William F. Rosenberger, University of Maryland, USA
We explore the reasons behind using equal allocation to competing treatments in gene therapy clinical trials. In particular, we derive the optimal allocation ratio for maximizing the expected number of treatment successes, while preserving power of the clinical trial. Since this allocation depends on unknown parameters, we explore different response-adaptive randomization procedures to achieve this allocation. We illustrate with a gene therapy trial in cystic fibrosis.
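As an illustration of what such an allocation ratio can look like, consider two treatments A and B with binary outcomes and success probabilities p_A and p_B. Minimizing the expected number of failures while holding the variance of the estimated difference p_A - p_B fixed gives n_A / n_B = sqrt(p_A / p_B), a result known from the response-adaptive randomization literature. Whether this is exactly the ratio derived in the talk is an assumption of this sketch; the function name and the example probabilities are invented.

```python
import math


def allocation_to_a(p_a, p_b):
    """Target proportion of patients assigned to treatment A.

    Follows from n_A / n_B = sqrt(p_A / p_B); requires p_a + p_b > 0.
    In practice p_a and p_b are unknown and would be replaced by running
    estimates, which is what makes the randomization response-adaptive.
    """
    root_a, root_b = math.sqrt(p_a), math.sqrt(p_b)
    return root_a / (root_a + root_b)


# Invented success probabilities for illustration only.
rho_unequal = allocation_to_a(0.64, 0.16)   # sqrt values are 0.8 and 0.4
rho_equal = allocation_to_a(0.5, 0.5)       # equal treatments
```

When the two treatments are equally effective the target reduces to equal allocation, and it shifts toward the better treatment as the success probabilities diverge.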