Algorithmic Biology: Algorithmic Techniques in Computational Biology

RECOMB Satellite Workshop on Regulatory Genomics

(17 - 18 Jul 2006)

~ Abstracts ~

Discovering motifs with transcription factor domain knowledge
Francis Chin, Hong Kong University, Hong Kong

Finding the binding sites of transcription factors from a set of promoter regions of co-regulated genes is an important problem in molecular biology. Most motif-discovery algorithms treat over-represented similar patterns as binding sites and report the position specific score matrix (PSSM) with the maximum likelihood as the solution motif. However, many motifs in real biological data cannot be discovered by these algorithms because they do not consider the biological characteristics of binding sites. We introduce a new algorithm, DIMDom, which exploits two kinds of information: (a) the characteristic pattern of binding site classes, where the class is determined from biological information about transcription factor domains, and (b) the posterior probabilities of these classes.
We compared the performance of DIMDom with MEME on all the transcription factors of Drosophila in the TRANSFAC database and found that DIMDom outperformed MEME with more than double the number of successes and double the accuracy in finding binding sites and motifs.
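For readers unfamiliar with PSSMs, the following minimal sketch shows how a log-odds PSSM can be built from aligned sites and used to scan a promoter sequence. It is not DIMDom or MEME; the function names, the pseudocount, the uniform background and the toy sites are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict per position) from equal-length aligned sites.

    The pseudocount and the uniform background are illustrative assumptions.
    """
    width = len(sites[0])
    pssm = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1.0
        total = sum(counts.values())
        pssm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pssm

def best_hit(pssm, sequence):
    """Return (score, offset) of the highest-scoring window, or None if too short."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best

# Hypothetical aligned binding sites and a hypothetical promoter fragment.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pssm = build_pssm(sites)
score, offset = best_hit(pssm, "CCGTTGACTCAGGAT")
```

A maximum-likelihood motif finder searches over many candidate alignments for the PSSM that best explains the data; the sketch covers only the scoring step once an alignment is given.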

Joint work with Henry Leung.


Computational challenges for top-down modeling and simulation of biological pathways
Satoru Miyano, University of Tokyo, Japan

If ordinary/partial differential equations were the only way to model biological pathways for simulation, as in some software tools, our understanding of life as a system through computation would not increase drastically and would be very biased. If the language for modeling and describing biological pathways were not rich enough to cover graph structures, GIF files, binary relations, kinetic equations, links to other information resources, English narrations, etc., we would lose a lot of the valuable knowledge and information on biological systems produced and reported by laboratories, because biological knowledge and information are very heterogeneous.

With this understanding as the basis of our development, we have been developing an XML format, Cell System Markup Language (CSML), and a modeling and simulation tool, Cell Illustrator. In this talk, we present the newest versions, CSML 3.0 and Cell Illustrator 3.0, which supports CSML 3.0.

Cell Illustrator (CI for short) is a software tool for modeling and simulating biological pathways. It is based on the notion of the Petri net and was originally developed under the name Genomic Object Net. An important challenge for Systems Biology is to create a software platform with which scientists in biology/medicine can comfortably create models of dynamic causal interactions and processes in the cell(s) and simulate them for further investigations, e.g. testing/creating hypotheses. CI employs the notion of Hybrid Functional Petri Net with extension (HFPNe) as its architecture. HFPNe was defined by adding functions to the hybrid Petri net so that various aspects of pathways can be modeled intuitively, with support for integer, real, string, boolean, vector, and object types, etc. The architecture of CI 3.0 is designed so that users can engage in modeling and simulation in a biologically intuitive way, drawing on their profound knowledge and insights, and can also benefit from public/commercial pathway databases. We consider that biological system modeling should be conducted by biological scientists because their minds are full of unpublished deep insights which are indispensable for correct modeling. Therefore, any computational challenge in developing such modeling and simulation tools should take this aspect into account. CI 3.0 has a biology-oriented GUI, so models of very complex biological processes can be built as if with a drawing tool. Further, a personalized visualization of a simulation can be created by writing an XML document for animation. Its effectiveness has been demonstrated by modeling various biological processes. Recently, we have developed a method for automatic parameter estimation for HFPN models by developing a theory of data assimilation, which will be implemented as a function of CI.
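To make the Petri net notion concrete, here is a minimal sketch of a plain discrete Petri net, in which places hold integer token counts and a transition fires when all its input places hold enough tokens. It is far simpler than HFPNe (no continuous places, no functional rates, no object types) and does not reflect Cell Illustrator's implementation; the function names, the dictionary layout and the toy enzyme reaction are assumptions for illustration.

```python
def enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(p, 0) >= n for p, n in transition["inputs"].items())

def fire(marking, transition):
    """Return a new marking after consuming input tokens and producing output tokens."""
    new = dict(marking)
    for place, n in transition["inputs"].items():
        new[place] -= n
    for place, n in transition["outputs"].items():
        new[place] = new.get(place, 0) + n
    return new

def simulate(marking, transitions, max_steps):
    """Repeatedly fire the first enabled transition (deterministic, for illustration)."""
    history = [dict(marking)]
    for _ in range(max_steps):
        for transition in transitions:
            if enabled(marking, transition):
                marking = fire(marking, transition)
                history.append(dict(marking))
                break
        else:
            break  # no transition is enabled: the net is dead
    return history

# Hypothetical toy pathway: substrate + enzyme -> complex -> product + enzyme.
transitions = [
    {"inputs": {"substrate": 1, "enzyme": 1}, "outputs": {"complex": 1}},
    {"inputs": {"complex": 1}, "outputs": {"product": 1, "enzyme": 1}},
]
start = {"substrate": 3, "enzyme": 1, "complex": 0, "product": 0}
history = simulate(start, transitions, max_steps=100)
final = history[-1]
```

In this toy run, the single enzyme token is recycled until all three substrate tokens have been converted to product, after which no transition is enabled and the simulation stops.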

Simultaneously, we developed an XML format called Cell System Markup Language (CSML) for describing biological systems for simulation. Several XML formats have been proposed as standard formats for biological pathways. However, each of them provides only a partial solution for the storage and integration of biological data. The aim of CSML 3.0 is to create a truly usable XML format for visualizing, modeling and simulating biological pathways. In many cases, in vivo/in vitro experimental results and in silico analysis results are useful information for biological pathway analysis. A successful application is Cytoscape, which can combine in vivo/in vitro and in silico analyses into one graphical network. Its core application supports a text-based format and the GML format. Plugins for importing XML formats have been developed, but their functionality is limited. In addition, the application only visualizes data related to biological pathways; dynamic simulation is missing.

Other XML formats, SBML 2.0 and CellML 1.0, have been proposed and developed for dynamic simulation. These formats have become popular for chemical reactions, and many applications support them as data exchange formats. However, they do not define any graphical elements, which makes it difficult for them to serve as a powerful data exchange format among biological pathway applications. CSML 3.0 has therefore been developed as an integrated/unified data exchange format which covers widely used data formats and applications, e.g. CellML 1.0, SBML 2.0, BioPAX, and Cytoscape. In CSML 1.9 and CSML 2.0, the main focus was to support Hybrid Functional Petri net (HFPN) based visualization and simulation. CSML 3.0 focuses on the Hybrid Functional Petri net with extension (HFPNe) architecture, which extends HFPN with the notion of objects, for more advanced biological pathway modeling. In short, the objects that constitute biological pathways are treated as a "generic entity" of the HFPNe architecture, and any relations among objects are treated as a "generic process" of the HFPNe architecture. The details of CSML 3.0 will be available from http://www.csml.org/

We also developed programs which automatically convert SBML 2.0 and CellML 1.0 to CSML 3.0. Cell Illustrator 3.0 fully supports CSML 3.0 as its base XML format. Thus every model in SBML 2.0 and CellML 1.0 can be executed on Cell Illustrator 3.0. It is also possible to automatically convert KEGG and BioCyc metabolic pathways to CSML.


A tale of two topics --- motif significance and sensitivity of spaced seeds
Ming Li, University of Waterloo, Canada

Computing the p-value of a motif has been a very difficult problem. Many heuristic algorithms try to approximate it. It turns out that this problem is very similar to optimal spaced seed design in homology search. Connecting the two topics, we show for the first time that computing the p-value is NP-hard, and we give a reasonably fast dynamic programming algorithm. Test results will be given.
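The abstract does not define the motif model or the algorithm, so the following is only a loosely related sketch of the dynamic programming idea on a much simpler problem: the probability that one exact word occurs at least once in a random sequence with independent, identically distributed bases. It is not the authors' algorithm and says nothing about the NP-hard general case; the function name, the uniform background and the use of a KMP automaton are assumptions.

```python
def occurrence_probability(word, length, base_probs):
    """P(word occurs at least once in an i.i.d. random sequence of the given length)."""
    m = len(word)
    if m == 0:
        return 1.0
    # KMP failure function: fail[i] = length of the longest proper border of word[:i+1].
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and word[i] != word[k]:
            k = fail[k - 1]
        if word[i] == word[k]:
            k += 1
        fail[i] = k

    def step(state, ch):
        """Next automaton state (length of the matched prefix) after reading ch."""
        while state > 0 and word[state] != ch:
            state = fail[state - 1]
        if word[state] == ch:
            state += 1
        return state

    # dist[s] = probability of being in state s; state m is absorbing ("word seen").
    dist = [0.0] * (m + 1)
    dist[0] = 1.0
    for _ in range(length):
        nxt = [0.0] * (m + 1)
        nxt[m] = dist[m]
        for state in range(m):
            if dist[state] == 0.0:
                continue
            for ch, p in base_probs.items():
                nxt[step(state, ch)] += dist[state] * p
        dist = nxt
    return dist[m]

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p_short = occurrence_probability("AA", 3, uniform)
p_long = occurrence_probability("TGACTCA", 1000, uniform)
```

The running time is proportional to the sequence length times the word length times the alphabet size. For a motif given as a large set of words or as a PSSM, the automaton can grow quickly, which hints at why the general problem is harder.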

Joint work with J. Zhang, Bo Jiang, J. Tromp, X. Zhang, M.Q. Zhang


Computational structural proteomics and inhibitor discovery
Ruben Abagyan, The Scripps Research Institute, La Jolla, USA

The rapid advance of structural proteomics calls for the development of new methods for predicting structural changes, association, and function, as well as for improved methods for structure-based molecular design. The main challenges of computational structural biology and chemistry will be reviewed. We have developed methods for predicting the functional map of a protein with a known 3D structure, for accurate docking of compounds to a binding site and virtual ligand screening of large chemical databases, and for structure prediction by global energy optimization, e.g. characterizing mutants and SNPs, homology modeling, protein-protein or peptide docking, and accurate loop prediction.

Predicting how flexible molecules dock to a flexible receptor is one of the main challenges in computational structural biology and structure-based ligand design. Two stories in which novel compounds were discovered through "ligand-guided" receptor pocket modeling followed by virtual screening of large compound libraries are presented. First, we developed models of the androgen receptor in an antagonist-bound conformation. These models were used to computationally discover the secondary activity of antipsychotic drugs. These drugs were then chemically altered and "re-purposed" to lose their binding to the serotonin and dopamine receptors and to improve their anti-androgen properties. The experimental side of this project was performed by the labs of Xiaokun Zhang and James Dalton. Second, in a collaboration with the David Lomas lab at Cambridge, we identified the first small molecules to inhibit pathological polymerization of an alpha1-antitrypsin mutant, which is the most common genetic cause of a lethal liver disease in childhood. Computationally, this project was particularly difficult because the target of the small molecule was a dynamic protein-protein interface. In addition, we developed a protocol for protein-protein docking which produced the winning overall predictions in two consecutive CAPRI competitions.

Finally, a new way to disseminate structural and functional information in structural proteomics, developed in collaboration with the Oxford Center for Structural Genomics, is presented.


An improved Gibbs sampling method for motif discovery via sequence weighting
Tao Jiang, University of California at Riverside, USA

The discovery of motifs in DNA sequences remains a fundamental and challenging problem in computational molecular biology and regulatory genomics, although a large number of computational methods have been proposed in the past decade. Among these methods, the Gibbs sampling strategy has shown great promise and is routinely used for finding regulatory motif elements in the promoter regions of co-expressed genes. In this paper, we present an enhancement to the Gibbs sampling method when the expression data of the concerned genes is given. A sequence weighting scheme is proposed by explicitly taking gene expression variation into account in Gibbs sampling. That is, every putative motif element is assigned a weight proportional to the fold change in the expression level of its downstream gene under a single experimental condition, and a position specific scoring matrix (PSSM) is estimated from these weighted putative motif elements. Such an estimated PSSM might represent a more accurate motif model since motif elements with dramatic fold changes in gene expression are more likely to represent true motifs. This weighted Gibbs sampling method has been implemented and successfully tested on both simulated and biological sequence data. Our experimental results demonstrate that the use of sequence weighting has a profound impact on the performance of a Gibbs motif sampling algorithm.
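The weighting step described above can be sketched as follows: each putative motif element contributes its fold-change weight, rather than a count of one, to the PSSM column frequencies. This is a minimal sketch of the estimation step only, not the full sampler and not the authors' implementation; the function name, the pseudocount and the toy fold changes are assumptions.

```python
BASES = "ACGT"

def weighted_pssm(sites, weights, pseudocount=0.5):
    """Estimate per-position base frequencies from sites weighted by fold change.

    sites   -- equal-length putative motif elements, one per promoter
    weights -- non-negative weights, e.g. proportional to expression fold change
    The pseudocount value is an illustrative assumption.
    """
    width = len(sites[0])
    norm = sum(weights) + pseudocount * len(BASES)
    pssm = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site, weight in zip(sites, weights):
            counts[site[pos]] += weight
        pssm.append({b: counts[b] / norm for b in BASES})
    return pssm

# Hypothetical putative elements and fold changes of their downstream genes.
sites = ["ACGT", "ACGA", "TCGT"]
fold_changes = [4.0, 1.0, 1.0]

weighted = weighted_pssm(sites, fold_changes)
unweighted = weighted_pssm(sites, [1.0, 1.0, 1.0])
```

With these toy numbers, the first site dominates the estimate: the frequency of A at the first position rises from 0.5 in the unweighted matrix to 0.6875 in the weighted one. In a Gibbs sampler, a matrix of this kind would be re-estimated each time a site position is resampled.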

Joint work with Xin Chen (School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore)


Computational prediction of regulatory elements by comparative sequence analysis
Martin Tompa, University of Washington, USA

With many vertebrate genomes now completely sequenced, the most promising methods for predicting functional sequence elements are based on comparison of sequences from multiple species. We focus on problems that arise when using such tools on a genome-wide scale in the vertebrates. These problems include difficulties in finding reliably homologous promoter sequences, difficulties in choosing the best tool and parameters to apply to these sequences, and difficulties in assessing the significance of the predictions produced. Solutions are offered to each of these problems, though they are far from complete.
