Workshop on BioAlgorithmics
(12 - 14 July 2006)
~ Abstracts ~
A parsimony approach to
genome-wide ortholog assignment
Tao Jiang, University of California at Riverside, USA
The assignment of orthologous genes between a pair of
genomes is a fundamental and challenging problem in
comparative genomics. Existing methods that assign orthologs
based on the similarity between DNA or protein sequences may
make erroneous assignments when sequence similarity does not
clearly delineate the evolutionary relationship among genes
of the same families. In this paper, we present a new
approach to ortholog assignment that takes into account both
sequence similarity and evolutionary events at the genome
level, where orthologous genes are assumed to correspond to
each other in the most parsimonious evolutionary scenario
under genome rearrangement and gene duplication.
Ortholog assignment is then formulated as the problem of
computing the signed reversal distance with duplicates
between the two genomes of interest, for which an efficient
heuristic algorithm is given by introducing two new
optimization problems: minimum common partition and maximum
cycle decomposition. Following
this approach, we have
implemented a high-throughput system for assigning orthologs
on a genome scale, called MSOAR, and tested it on both
simulated data and real genome sequence data. Our predicted
orthologs between the human and mouse genomes are strongly
supported by ortholog and protein function information in
authoritative databases, and by predictions made by other
key ortholog assignment methods such as Ensembl, Homologene,
INPARANOID, and HGNC. The simulation results demonstrate
that MSOAR generally performs better than David Sankoff's
iterated exemplar algorithm in terms of identifying true
exemplar genes.
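The reversal-distance formulation can be illustrated on the duplicate-free case. The sketch below (illustrative only; the actual heuristic handles duplicated genes via minimum common partition and cycle decomposition) counts breakpoints of a signed permutation relative to the identity, which yields the classical lower bound on reversal distance because one reversal removes at most two breakpoints:

```python
def breakpoints(perm):
    """Count breakpoints of a signed permutation relative to the identity.

    A pair of adjacent elements (a, b) is an adjacency iff b == a + 1;
    otherwise it is a breakpoint.  The permutation is framed with 0 and n+1.
    """
    n = len(perm)
    framed = [0] + list(perm) + [n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b != a + 1)

def reversal_lower_bound(perm):
    # One reversal removes at most two breakpoints, so ceil(b/2) is a
    # lower bound on the reversal distance.
    b = breakpoints(perm)
    return (b + 1) // 2
```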
Joint work with X. Chen, Z. Fu, J. Zheng, V. Vacic, P.
Nan, Y. Zhong, and S. Lonardi.
Linear-time haplotype inference on
pedigree without recombinations
Bethany Chan, Hong Kong University, Hong Kong
In diploid organisms, such as humans, chromosomes come in
pairs, and experiments often yield genotypes, which
blend haplotypes for the chromosome pair. This gives rise to
the problem of inferring haplotypes from genotypes.
Given a pedigree P in which each node is associated with a
genotype, the Haplotyping Pedigree Data (with No
Recombinations) Problem (HPD-NR) is to find a consistent
haplotype configuration (CHC) for P such that the haplotypes
of each child are Mendelian-consistent, i.e. one of the
child's haplotypes is exactly the same as one of its
father's and the other is the same as one of its mother's.
Li and Jiang formulated the problem as a system of linear
equations over an mn x mn matrix and solved HPD-NR by
Gaussian elimination in O(m^3n^3) time,
where n is the number of
individuals in the pedigree and m is the number of loci for
each individual. In this paper, we propose a new
algorithm that can either find a CHC solution or report 'no
solution' in O(mn) time, which is optimal.
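The no-recombination Mendelian condition is easy to state in code. A minimal sketch (haplotypes as tuples of alleles; this is only the consistency check, not the O(mn) inference algorithm of the talk):

```python
def mendelian_consistent(child, father, mother):
    """Check the no-recombination Mendelian rule: one of the child's
    haplotypes must equal one of the father's and the other must equal
    one of the mother's.  Each parent is a pair of haplotypes, and each
    haplotype is a tuple of alleles (0/1)."""
    c1, c2 = child
    return ((c1 in father and c2 in mother) or
            (c2 in father and c1 in mother))
```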
Joint work with Mee Yee Chan, W.T. Chan, Francis Chin,
Stanley P.Y. Fung and Ming-Yang Kao.
Reconstructing protein-protein
interaction map of Saccharomyces cerevisiae
Kui Lin, Beijing Normal University, PRC
A map of protein-protein interactions provides valuable
insight into the cellular function and machinery of a
proteome. By measuring the similarity between two Gene
Ontology (GO) terms with a relative specificity semantic
relation, we propose a new method of reconstructing a
yeast protein-protein interaction map that fully explores
the knowledge buried in two GO annotations for the yeast
genome, namely the biological process (BP) and cellular
component (CC) annotations. The effectiveness of our method
was validated using high-quality interaction datasets.
Based on a Z-score analysis, a gold
standard positive (GSP) dataset with the highest level of
confidence that covered 78% of the high-quality interaction
dataset and a gold standard negative (GSN) dataset with the
lowest level of confidence were derived. In addition, we
assessed four high-throughput experimental interaction
datasets using the GSPs and GSNs. Our predicted network
reconstructed from GSPs consists of 40,753 interactions
among 2,259 proteins, and forms 16 connected components. We
mapped all of the MIPS complexes except for homodimers onto
the predicted network. As a result, ~35% of complexes were
identified as interconnected. The method is expected to
provide a new approach for predicting protein-protein
interaction maps from other completely sequenced genomes
with high-quality GO-based annotations.
Joint work with Xiaomei Wu, Lei Zhu, Jie Guo and Da-Yong
Zhang.
In search of replication origins in
herpesviruses: a computational approach
David Chew, National University of Singapore
Replication origins are places on the DNA molecules where
replication processes are initiated. As DNA
replication is the central step in the reproduction of many
viruses, understanding the molecular mechanisms
involved in DNA replication is of great importance in
developing strategies to control the growth and spread of
viruses. Knowledge of the locations of these replication
origins will enhance the development of antiviral agents by
blocking viral DNA replication or by interfering with the
infection process. With the increasing availability of
genomic DNA sequence data, the value of using computational
methods to predict likely locations of replication origins
before the experimental search has already been recognized
although no prediction scheme that works for all DNA in
general is available to date. In this talk, we will
describe two computational approaches to locating potential
replication origin sites in herpesviruses in silico. The
first looks at palindromic pattern concentration: I will
describe several scoring schemes to measure the spatial
abundance of palindromes in a genome, taking into account
the count, length and base-pair composition of the
palindromes. The second looks at AT content concentration:
applying a score-based approach and making use of
statistical theory developed by Karlin and his
collaborators, we locate regions of high AT concentration
in the genomes of the viruses.
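The palindrome-based idea can be sketched as follows; the windowed count below is an illustrative stand-in, not the actual scoring schemes of the talk:

```python
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def revcomp(s):
    """Reverse complement of a DNA string."""
    return "".join(COMP[b] for b in reversed(s))

def palindrome_positions(seq, length):
    """Start positions of DNA palindromes (substrings equal to their own
    reverse complement) of the given even length."""
    return [i for i in range(len(seq) - length + 1)
            if seq[i:i + length] == revcomp(seq[i:i + length])]

def window_counts(seq, length, window):
    """Count palindrome starts in consecutive non-overlapping windows;
    windows with unusually high counts are candidate origin regions."""
    starts = palindrome_positions(seq, length)
    return [sum(1 for s in starts if i <= s < i + window)
            for i in range(0, len(seq) - window + 1, window)]
```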
Joint work with Kwok Pui Choi and Ming-Ying Leung.
New generation homology search
Ming Li, University of Waterloo, Canada
Traditional homology search technology was a heuristic
science. Given a gene sequence, the search is either too
slow (dynamic programming) or not sensitive enough. When it
does return something, the results are simply some fragments
of alignments.
We wish to make homology search an exact science. The search
has to be fast as well as fully sensitive. The result needs
to be the complete gene structure match, instead of
fragmented alignments. I will discuss the ideas of
optimized spaced seeds introduced in the PatternHunter
papers, as well as newer ideas of integrating HMM with
homology search.
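The spaced-seed idea can be sketched in a few lines: only the '1' positions of the seed must match, which is what makes an optimized spaced seed more sensitive than a contiguous seed of the same weight. The seed strings below are short illustrative examples, not PatternHunter's optimized seeds:

```python
def seed_hits(query, text, seed):
    """Report (query_pos, text_pos) pairs where the two sequences agree
    on every '1' position of the spaced seed; '0' positions are ignored."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)

    def key(s, pos):
        return tuple(s[pos + i] for i in care)

    # Index all seed keys of the text, then scan the query.
    index = {}
    for j in range(len(text) - span + 1):
        index.setdefault(key(text, j), []).append(j)

    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(key(query, i), []):
            hits.append((i, j))
    return hits
```

A mismatch at an ignored ('0') position does not destroy the hit, whereas a contiguous seed of the same length misses it.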
Joint work with Bin Ma, John Tromp, X.F. Cui, T. Vinar,
and D. Shasha.
Reconstructing contiguous regions of
an ancestral genome
Jian Ma, Pennsylvania State University, USA
We recently analyzed mammalian genome rearrangements
(human, mouse, rat and dog) at higher resolution than has
been published to date. We identify 1,338 conserved segments
that may contain large insertions or deletions, but that are
free of chromosome fissions or fusions as well as inversions
or translocations over 50 kb in length, in the lineages
leading to human, mouse, rat and dog from their most recent
common ancestor. We describe a new method for predicting the
ancestral order and orientation of those intervals from
their observed adjacencies in
modern species. We produce a map of an early mammalian
genome that accounts for 96.8% of the available human genome
sequence data. The precision is further increased by mapping
inversions as small as 31 bp. Analysis of the predicted
evolutionary breakpoints in the human lineage confirms
certain published observations but disagrees with
others. Although only a few mammalian genomes are currently
sequenced to high precision, our theoretical analyses and
computer simulations indicate that our results are
reasonably accurate, and that they will become highly
accurate in the foreseeable future.
Joint work with Webb Miller, David Haussler, Louxin Zhang,
Mathieu Blanchette, Jim Kent, Richard Burhans, Brian Raney,
Bernard Suh.
A graph-based approach to
inferring protein function from heterogeneous data sources
Kenny Hon Nian Chua, National University of Singapore
Sequence homology has been widely used to shed light on
the functions of proteins. However, statistical constraints
impose limits on the sensitivity of sequence homology.
Recent works introduce other sources of biological data such
as protein interaction and gene expression profiles to
reinforce protein function inference. In our current work,
we infer protein function from multiple heterogeneous data
sources by reformulating these data sources into functional
linkage graphs. Each data source is modeled as a graph
representing a network of inter-protein similarity observed
from that particular source of evidence. These graphs can be
combined to form a more complete graph in a probabilistic
manner. The edges in such functional linkage graphs, which
correspond to binary protein relationships, can be further
weighted by topology using established techniques in graph
theory. Functions are inferred for a protein from its known
neighbours in the graph using a simple but effective
weighted averaging technique. Using this approach, we
combined sequence homology from BLAST searches, protein
interactions from GRID (General Repository of Interaction
Data) and co-occurrence of protein names in PubMed abstracts.
Our experiments show that the data sources complement one
another and combine to infer protein function with greater
sensitivity and precision.
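The weighted-averaging inference step can be sketched as follows; the data structures are assumed for illustration, not the authors' exact scoring:

```python
def predict_functions(protein, graph, annotations, k=3):
    """Score candidate functions for `protein` by the weighted frequency
    of each function among its neighbours in the combined functional
    linkage graph.  `graph[p]` maps each neighbour of p to an edge
    weight; `annotations[p]` is the set of known functions of p.
    Returns the top-k functions by score."""
    scores = {}
    for neighbour, w in graph.get(protein, {}).items():
        for func in annotations.get(neighbour, ()):
            scores[func] = scores.get(func, 0.0) + w
    return sorted(scores, key=scores.get, reverse=True)[:k]
```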
Computational prediction of
operons in Synechococcus sp. WH8102
Xin Chen, Nanyang Technological University
We computationally predict operons in the Synechococcus
sp. WH8102 genome based on three types of genomic data:
intergenic distances, COG gene functions and phylogenetic
profiles. In the proposed method, we first estimate a
log-likelihood distribution for each type of genomic data,
and then fuse this distribution information using a perceptron
to discriminate pairs of genes within operons (WO pairs)
from those across transcription unit borders (TUB pairs).
Computational experiments demonstrated that WO pairs tend to
have shorter intergenic distances, a higher probability of
being in the same COG functional categories and more similar
phylogenetic profiles than TUB pairs, indicating their
powerful capabilities for operon prediction. By testing the
method on 236 known operons of Escherichia coli K12, an
overall accuracy of 83.8% is obtained by joint learning from
multiple types of genomic data, whereas the individual
information sources yield accuracies of 80.4%, 74.4%, and
70.6%, respectively. We have applied this new approach, in
conjunction with our previous comparative genome
analysis-based approach, to predict 556 (putative) operons
in WH8102.
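The perceptron fusion step can be sketched as follows; the toy feature vectors stand in for the per-source log-likelihood scores, with label +1 for WO pairs and -1 for TUB pairs:

```python
def train_perceptron(samples, epochs=50, lr=0.1):
    """Train a perceptron that fuses per-feature scores into a single
    WO-vs-TUB decision.  `samples` is a list of (feature_vector, label)
    pairs with label +1 (within operon) or -1 (transcription-unit
    border).  Standard mistake-driven updates."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```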
This is joint work with Z. Su, Y. Xu and T. Jiang.
Exploiting indirect neighbours and
topological weight to predict protein function from
protein-protein interactions
Lim Soon Wong, National University of Singapore
Most approaches in predicting protein function from
protein-protein interaction data utilize the observation that
a protein often shares functions with proteins that interact
with it (its level-1 neighbours). However, proteins that
interact with the same proteins (i.e. level-2 neighbours)
may also have a greater likelihood of sharing similar
physical or biochemical characteristics. We speculate that
two separate forms of functional association account for
this phenomenon, and that a protein is likely to share
functions with its level-1 and/or level-2 neighbours. We are
interested in finding out how significant functional
association between level-2 neighbours is and how it can be
exploited for protein function prediction. We conducted a
statistical study of recent interaction data and observed
that functional association between level-2 neighbours is
clearly observable. A substantial number of proteins are
observed to share functions with level-2 neighbours but not
with level-1 neighbours. We develop an algorithm that
predicts the functions of a protein in two steps: (1) assign
a weight to each of its level-1 and level-2 neighbours by
estimating its functional similarity with the protein using
the local topology of the interaction network as well as the
reliability of experimental sources; (2) score each
function based on its weighted frequency among these neighbours.
Using leave-one-out cross validation, we compare the
performance of our method against that of several other
existing approaches and show that our method performs well.
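The level-1/level-2 distinction can be sketched directly on an adjacency-list interaction graph (here level-2 is taken strictly: proteins sharing an interaction partner with the query, excluding the query itself and its direct partners):

```python
def level_neighbours(graph, protein):
    """Return (level-1, level-2) neighbour sets of `protein`.
    Level-1 neighbours interact with it directly; level-2 neighbours
    share an interaction partner with it but are not level-1."""
    level1 = set(graph.get(protein, ()))
    level2 = set()
    for p in level1:
        level2.update(graph.get(p, ()))
    level2.discard(protein)
    level2 -= level1
    return level1, level2
```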
This is joint work with Hon Nian Chua and Wing-Kin Sung.
High resolution genetic image
analysis with wavelets
Yuping Wang, University of Missouri-Kansas City
Reaping the fruit of genome sequencing, high resolution
genetic probes such as the array CGH and FISH have been
developed. These probes enable the detection of subtle and
cryptic chromosomal abnormalities or genome rearrangements
at a resolution unobtainable with the conventional
approaches. At the same time, they bring about computational
challenges for image and data analysis. As a powerful tool
developed in applied mathematics and signal processing,
wavelet analysis has shown strong promise in solving these
challenges. In this talk, I will show how wavelets offer an
effective way of addressing a variety of genetic imaging
problems such as enhancement, compression, registration and
classification.
Computing bi-clusters for
microarray data analysis
Lusheng Wang, City University of Hong Kong, Hong Kong
One of the main goals in the analysis of microarray data
is to identify groups of genes and groups of experimental
conditions (including environments, individuals and
tissues), that exhibit similar expression patterns. This is
the so-called bi-clustering problem. In this talk, we
present approximation algorithms and heuristics for the
bi-clustering problem.
Protein subcellular localization
prediction based on compartment-specific biological features
Wen Lian Hsu, Institute of Information Science,
Academia Sinica, Taiwan
Prediction of subcellular localization of proteins is
important for genome annotation, protein function
prediction, and drug discovery. We present a prediction
method for Gram-negative bacteria that uses ten
one-versus-one support vector machine (SVM) classifiers,
where compartment-specific biological features are selected
as input to each SVM classifier. In Gram-negative bacterial
secretory pathways, proteins localized to a particular
subcellular compartment have distinct biological properties.
We construct our classification framework to mimic the
translocation process of bacterial secretory pathways. To
distinguish between proteins translocated to different
compartments, we consider several biological input features
including amino acid composition, dipeptide composition,
solvent accessibility, secondary structure elements, signal
peptides, transmembrane α-helices, twin-arginine signal
peptides, transmembrane β-barrels, and non-classical protein
secretion. To encode solvent accessibility and secondary
structure elements, every protein sequence is represented by
a specific feature vector constructed from encoded
representations of residue properties. We use three
descriptors, composition, transition, and distribution, to
encode the global characteristic of a given biological
feature in a protein. Composition is the number of amino
acids of a particular property divided by the total number
of amino acids. Transition characterizes the percent
frequency with which amino acids of a particular property
are followed by amino acids of a different property.
Distribution measures the protein length within which the
first, 25, 50, 75 and 100% of the amino acids of a
particular property are located, respectively. In addition, we
also apply a statistical approach to identify the
significant amino acid coupling sequence patterns in
proteins located in different compartments. The amino acid
coupling sequence pattern is defined as any two types of
amino acids separated by one or more amino acids. We use
amino acid coupling patterns and term frequency–inverse
document frequency (TFIDF) to extract the sequence
information as features. The extracted features are then
reduced for dimension by using latent semantic indexing
(LSI). The proteins are represented as a term-document matrix
M for conceptual feature extraction by a mathematical
construction, the singular value decomposition (SVD). Based
on the assumption that concepts with small singular values
are considered as noise and thus can be neglected, we derive
an approximation of M by taking only the largest singular
values as features. The final prediction of localization
sites is determined by integrating the results from ten
binary classifiers using a combination of majority votes and
a probabilistic method. We demonstrate that feature
selection guided by biological knowledge and insights in
one-versus-one SVM classifiers can lead to a significant
improvement in the prediction performance. Some
localization-related features can be captured by residue properties
encoded by composition, transition, and distribution
descriptors. We also found that proteins located in
different compartments exhibit significant bias in their
amino acid coupling patterns represented by TFIDF and
reduced by LSI. The overall accuracy reaches 91.4%, which is
1.6% better than the state-of-the-art system, in a ten-fold
cross-validation evaluation on a benchmark data set. Our
model is also used to produce highly accurate prediction of
92.8% overall accuracy for proteins of dual localizations.
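The composition, transition and distribution descriptors described above can be sketched for a single binary residue property (the example property mapping in the test is illustrative; at least one residue must carry the property for the distribution to be defined):

```python
def ctd(seq, prop):
    """Composition/Transition/Distribution descriptors for one binary
    residue property.  `prop` maps each residue to 0 or 1.
    Composition: fraction of residues with property 1.
    Transition:  fraction of adjacent residue pairs whose property differs.
    Distribution: relative positions (0..1] of the first, 25%, 50%, 75%
    and 100% occurrences of property-1 residues along the sequence."""
    bits = [prop[r] for r in seq]
    n = len(bits)
    ones = [i for i, b in enumerate(bits) if b == 1]
    comp = len(ones) / n
    trans = sum(1 for a, b in zip(bits, bits[1:]) if a != b) / (n - 1)
    dist = []
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        k = max(1, round(frac * len(ones)))  # index of the frac-quantile occurrence
        dist.append((ones[k - 1] + 1) / n)
    return comp, trans, dist
```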
Joint work with Chia-Yu Su, Allan Lo, Hua-Sheng Chiu,
Jia-Ming Chang, and Ting-Yi Sung.
Visual exploration of multivariate
image data through ultra-fast computation of similarity maps
with the Lasagne tool
Peter Serocka, Shanghai Institutes for Biological
Sciences, Shanghai, PRC
Multivariate image data are becoming increasingly available
in numerous fields, including fluorescence microscopy. As
new methods for fast analysis are needed, we present
Lasagne, a visualisation tool that aims at revealing
geometric structures in multivariate images which represent
regions exhibiting similar pixel data. When the user
specifies a pixel-of-interest in the image domain, a new
picture is generated that shows the similarity of all other
pixels to the given one. Thereby one obtains a fast overview
of the shape of the regions of further interest. Due to
hardware-accelerated image generation, this process works at
interactive speed. We show the application of Lasagne in
Topological Proteomics research (MELK microscopy).
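The per-pixel similarity computation can be sketched as follows (the inverse-distance similarity function is an illustrative choice, not necessarily the one used in Lasagne, which additionally exploits hardware acceleration):

```python
def similarity_map(image, poi):
    """Given a multivariate image (2-D grid of per-pixel feature tuples)
    and a pixel-of-interest (row, col), return a grid of similarity
    values in (0, 1], computed here from Euclidean distance in feature
    space: identical pixels score 1.0, distant pixels approach 0."""
    ref = image[poi[0]][poi[1]]

    def sim(px):
        d = sum((a - b) ** 2 for a, b in zip(px, ref)) ** 0.5
        return 1.0 / (1.0 + d)

    return [[sim(px) for px in row] for row in image]
```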
Analysis of genome rearrangement
using reversals and block-interchanges
Yen Jen Lin, National Tsing Hua University
With increasing amounts of genomic data (DNA, RNA, and
protein sequences) becoming available, the study of genome
rearrangement, which measures the evolutionary difference
between two organisms by conducting a large-scale
comparison of their genomic data, has received a lot of
attention in both computational biology and bioinformatics.
One of the most promising ways to do this research is to
compare the orders of identical landmarks in two different
genomes, where the identical landmarks can be the
homologous/conserved regions (including genes) shared by the
sequences. The considered genomes are usually denoted by a
set of ordered (signed or unsigned) integers with each
integer representing an identical landmark in the genomes
and its sign (+ or -) indicating the transcriptional
orientation. Given two permutations representing two
linear/circular chromosomes, the genome rearrangement study
computes the rearrangement distance, defined by the most
parsimonious evolutionary scenario required to transform one
chromosome into the other. Among the proposed operations, we
seek an economical explanation of such a transformation
using reversals, which invert a segment of a permutation,
and block-interchanges, which have the effect of swapping
two non-overlapping segments of a permutation, based on the
breakpoint graph approach. Moreover, our
experimental results on the γ-Proteobacteria almost coincide
with the previous study considering reversals only, but
slightly modify the evolutionary distance of one species
with respect to the previous one, which indicates that
considering reversals and block-interchanges together seems
to play a significant role in evolutionary reconstruction.
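The two operations can be written out concretely for signed permutations (the 0-based Python slice indices are an illustrative convention):

```python
def reversal(perm, i, j):
    """Invert the segment perm[i:j] and flip its signs (signed reversal)."""
    return perm[:i] + [-x for x in reversed(perm[i:j])] + perm[j:]

def block_interchange(perm, i, j, k, l):
    """Swap the two non-overlapping segments perm[i:j] and perm[k:l]
    (requires i <= j <= k <= l); signs are unchanged."""
    return perm[:i] + perm[k:l] + perm[j:k] + perm[i:j] + perm[l:]
```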
A semantic approach for mining
biological databases
Srinivasa K G, M S Ramaiah Institute of Technology
A variety of biological databases are currently available
to researchers in the XML format. Homology-related querying
on such databases poses several problems, as most available
exhaustive mining techniques do not
incorporate the semantic relationships inherent to these
data collections. Our work identifies an index-based
approach to mining such data and explores the improvement
achieved in the quality of query results by the application
of genetic algorithms. Our experiments confirm the widely
accepted advantages of index- and vector-space-based models
for biological data and, specifically, show that the
application of genetic algorithms optimizes the
search and achieves higher levels of precision and accuracy
in heterogeneous databases and faster query execution across
all data collections.
The similarity metric and the
distance metric
Kaizhong Zhang, University of Western Ontario
We consider general requirements of similarity measures
and propose a general definition. We then consider the
relationship between a similarity metric and a distance
metric. With these basic definitions and properties, we give
general results on how to normalize a similarity metric and
a distance metric. These results have applications in
bioinformatics, where similarity measures and distance
measures are routinely used.
Large-scale phylogenetic
reconstruction, the tree of life, and the CIPRES project
Bernard Moret, University of New Mexico, USA and
Swiss Federal Institute of Technology, Switzerland
In this talk, I will review current computational
activities aimed at reconstructing the Tree of Life, the
evolutionary history of all living organisms (an estimated
10-100 million). Researchers and funding agencies worldwide
have put renewed emphasis on the establishment of
evolutionary relationships among living species; such
relationships
are fundamental to research in medicine, drug design,
agriculture, ecology, and many other areas. The CIPRES
project (Cyber Infrastructure for Phylogenetic Research) was
founded to develop the informatics tools required to attempt
a reconstruction of the Tree of Life. I will sketch the goal
and current achievements of CIPRES, comment on future needs,
and relate its work to that of other research efforts in
phylogeny. I will then focus on specific challenges that
arise in reconstructing trees with 100,000 or more leaves,
with particular emphasis on sources of error and on the
methodological advances we will need to evaluate the quality
of such reconstructions.
Rearrangements in genomes with
unequal content
Melvin Zhang, National University of Singapore
The phylogeny of a group of organisms is a tree
representation of their evolutionary relationships.
Phylogenetic analysis gives us insights into the evolution
of traits within species as well as helps us understand the
process of evolution. Most modern phylogenies are determined
by comparing the DNA sequences of conserved genes but in
recent years there has been a growing interest in the use of
gene order rearrangements between genomes for phylogenetic
reconstruction. Early programs, such as GRAPPA and MGR,
which make use of gene order rearrangements, are limited
to genomes with the same set of genes. We propose an
extension of the MGR algorithm (Bourque and Pevzner, 2002)
to handle genomes with different sets of genes. Experiments
on simulated data showed that the accuracy of our extension
with regard to the tree topology is a significant
improvement over the naïve method of first removing the
genes which are not common to all the genomes and then
running the MGR program.
Fusion genes as putative microbial
drug targets in H. pylori
Meena Kishore Sakharkar, Nanyang Technological
University
Fusion genes have been reported as a means of enabling
the development of novel or enhanced functions. In this
report, we analyzed fusion genes in the genomes of two
Helicobacter pylori strains (26695 and J99) and identified
32 fusion genes that are present as neighbours in one strain
(components) and are fused in the second (composite),
and vice versa. The mechanism for each case of gene fusion
is explored. All the genes identified as fusion products in
this analysis were reported as essential genes in this
bacterium in the previously documented transposon
mutagenesis of H. pylori strain G27. This observation
suggests the potential of the products of fusion genes as
putative microbial drug targets. These results underscore
the utility of bacterial genomic sequence comparisons for
understanding gene evolution and for in silico drug target
identification in the post-genomic era.
A new approach to identifying
conserved genes in a path from commensalism to pathogenicity
Zhiwei Cao, Shanghai Center for Bioinformation
Technology
Staphylococcus epidermidis, long regarded as an innocuous
commensal bacterium of the human skin, is recognized as the
most frequent cause of nosocomial sepsis and infections
associated with implanted medical devices. Such a
conditional pathogen provides a model of choice to study
some genome landmarks correlated with the transition between
commensalism and pathogenicity. Traditional investigations
stress differences in gene content. We focused here on
conserved genes that have accumulated small mutation
differences during the transition. A comparison of S.
epidermidis strain ATCC12228, a non-biofilm forming,
non-infection associated strain and S. epidermidis strain
RP62A, a methicillin-resistant biofilm clinical isolate,
revealed consistent variation, mostly single-nucleotide
polymorphisms (SNPs), in orthologous genes in addition to
the previously investigated global changes in gene clusters.
This polymorphism, scattered throughout the genome, may
reveal genes that contribute to adaptation of the bacteria
to different environmental stimuli, allowing them to shift
from commensalism to pathogenicity. The dN/dS ratios for
virulence factors and surface proteins differed
significantly from that of average SNPs. Of 931 gene pairs
with SNPs, 40 showed a disproportionate distribution of dN
vs dS. We suggest, however, that these are likely to belong
to surface proteins or to be considered, as a priority,
important for pathogenicity. Our study proposed a new
approach to identify
genes involved in pathogenic processes and provided some
insight about the molecular mechanisms leading a commensal
inhabitant to become an invasive pathogen.
Compressed text indexing and local
alignment
Tak Wah Lam, Hong Kong University, Hong Kong
In recent years, the theoretical computer science
community has made dramatic improvements to compressed
indexes for exact string matching. It is now feasible to
build and store a compressed index for a human genome using
a PC. Yet exact string matching has limited applications,
and local alignment or approximate string matching are
needed more often. In this talk, we report some experimental
results on applying Burrows-Wheeler Transform to speed up
local alignment and approximate string matching. Our results
reveal that the new tool is practically sound.
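The Burrows-Wheeler Transform itself can be sketched naively via sorted rotations (the compressed indexes in the talk use far more compact constructions; this is only to show the transform and its invertibility):

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations.  `text` must end
    with a unique sentinel (here '$') that sorts before all other
    characters."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last):
    """Invert the BWT by repeatedly prepending the last column and
    sorting; the row ending in '$' is the original text."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    return next(row for row in table if row.endswith("$"))
```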
Pathway finder
Chee Keong Kwoh, Nanyang Technological University
The importance of biological pathways has seen the
creation of more than 200 online databases (e.g. Kyoto
Encyclopedia of Genes and Genomes (KEGG) and Biocarta) in
this area. Although useful metabolic or regulatory path
co-occurrence data may be obtained from these pathways, few
tools exist that perform the function of pathway matching.
This talk describes the project that was initiated to
address the issue of finding path co-occurrences in
different biological processes with pathway data from KEGG.
A program was developed to be able to detect such path
co-occurrences across multiple pathways, given a
user-specified input.
On simultaneous knowledge
extraction from protein protein interaction data
Hugo Willy, Institute for Infocomm Research
An important class of interaction switches for biological
circuits and disease pathways is short binding motifs.
However, experimentally discovering such motifs is not an
easy task. Computationally, binding motifs may be found by
applying standard motif discovery algorithms to a set of
protein sequences that interact with a common protein or a
group of proteins with similar properties. However, such an
approach fails if a protein interacts with very few other
proteins, or when prior knowledge of such protein grouping
is unavailable or erroneous. Experimental noise in the input
interaction data can further degrade the performance of
such approaches.
We propose a novel approach that effectively circumvents the
above-mentioned limitations to find correlated short
sequence motifs from protein-protein interaction data.
Correlated motifs are those motifs that consistently
co-occur in the interacting sequences. We adopted the (l,d)-motif
model to formulate the new problem as an (l,d)-motif pair
finding problem. We present both an exact algorithm,
D-MOTIF, as well as its approximation algorithm, D-STAR to
solve this problem. Evaluation on extensive simulated data
showed that our approach is robust against highly noisy
input data. On real biological datasets, our algorithm was
able to extract correlated motifs that corresponded to the
actual binding interfaces of two different classes of
proteins.
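The (l,d)-motif model can be sketched as follows; `correlated_pair` is an illustrative support check for a candidate motif pair, not the D-MOTIF/D-STAR algorithms themselves:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def motif_occurrences(seq, motif, d):
    """Start positions where `motif` occurs in `seq` with at most `d`
    mismatches -- the (l, d)-motif model with l = len(motif)."""
    l = len(motif)
    return [i for i in range(len(seq) - l + 1)
            if hamming(seq[i:i + l], motif) <= d]

def correlated_pair(seq1, seq2, m1, m2, d):
    """A candidate correlated motif pair (m1, m2) is supported by an
    interacting sequence pair iff m1 occurs in one sequence and m2 in
    the other, each with at most d mismatches."""
    return bool((motif_occurrences(seq1, m1, d) and motif_occurrences(seq2, m2, d))
                or (motif_occurrences(seq1, m2, d) and motif_occurrences(seq2, m1, d)))
```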
Extracting protein-protein
interactions from the literature using the hidden vector
state model
Yulan He, Nanyang Technological University
Biochips are widely used in genetic, toxicological,
protein, and biomedical research. To make biochips truly
useful, it is important to determine the presence and/or
amount (referred to as quantitation) of certain proteins in
biological samples. This requires different capture agents,
which can be found through knowledge of protein-protein
interactions. However, a large quantity of this kind of
knowledge is still hidden in the
biomedical literature. We present an information extraction
system which employs a semantic parser using the Hidden
Vector State (HVS) model for protein-protein interactions.
Unlike other hierarchical parsing models which require fully
annotated treebank data for training, the HVS model can be
trained using only lightly annotated data whilst
simultaneously retaining sufficient ability to capture the
hierarchical structure needed to robustly extract semantics
in the task domain.
When applied to extracting protein-protein interaction
information from medical literature, we found that it
performed better than other established statistical methods
and achieved 58.3% and 76.8% in recall and precision
respectively. Moreover, the statistical nature of the purely
data-driven HVS model makes it intrinsically robust and
easily adaptable to other domains, a property rarely
possessed by rule-based approaches. In
addition, the HVS model can also be trained directly from
un-annotated data by employing a traditional K-nearest
neighbor method to automatically generate the semantic
annotations from the un-annotated corpus.
This is joint work with Deyu Zhou and Chee Keong Kwoh.
« Back...
Computational methods for refinement
and expansion of signaling pathways
Ron Shamir, Tel Aviv University, Israel
One of the major challenges in systems biology is the
analysis of large scale genome-wide data vis-à-vis prior
knowledge about a particular cellular network. Here we
introduce an extended computational framework that combines
formalization of available qualitative knowledge in a
probabilistic model, and integration of high throughput
experimental data. Using our methods, it is possible to
learn an improved model with better fit to the experiments.
In particular, we improve the model in terms of (a) refining
the relations among model components, and (b) expanding the
model to include new components that are transcriptionally
regulated by the model. We demonstrate our methodology on
the yeast response to hyper-osmotic shock. We show that our
integrative approach synergistically combines qualitative
and quantitative evidence into concrete novel
biological hypotheses.
Joint work with Irit Gat-Viks.
« Back...
Reconstructing recombination
networks from sequence data: small parsimony problem
Nguyen Cam Thach, National University of Singapore
This work formalizes the parsimony score for reconstructing
recombination networks from sequence data. It is proved that the small
parsimony problem becomes NP-hard even for galled networks. We also
present a dynamic programming method for the problem, which is
efficient when the number of recombination nodes is small.
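The tree case provides useful background here: on a phylogenetic tree (no recombination nodes), the small parsimony problem is solved in linear time per character by Fitch's classical algorithm, which the network method generalizes. A minimal sketch of the tree case (illustrative background only, not the authors' network algorithm; the root node is assumed to be keyed "root"):

```python
# Fitch's small-parsimony algorithm on a binary tree, for one character.
# Background sketch for the tree case, not the authors' network DP.

def fitch(tree, leaf_states):
    """tree: dict mapping internal node -> (left, right); leaves are
    absent from the dict. Returns (parsimony score, candidate state
    sets per node). Assumes the root is named "root"."""
    sets, score = {}, 0

    def post(node):
        nonlocal score
        if node not in tree:                 # leaf: fixed state
            sets[node] = {leaf_states[node]}
            return
        left, right = tree[node]
        post(left); post(right)
        inter = sets[left] & sets[right]
        if inter:
            sets[node] = inter               # agreement: no mutation needed
        else:
            sets[node] = sets[left] | sets[right]
            score += 1                       # union step costs one mutation

    post("root")
    return score, sets

# Toy example: leaf states ((A,C),(A,A)) need exactly one change.
tree = {"root": ("n1", "n2"), "n1": ("a", "b"), "n2": ("c", "d")}
states = {"a": "A", "b": "C", "c": "A", "d": "A"}
score, _ = fitch(tree, states)
print(score)  # 1
```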
Joint work with Bao Nguyen, Wing-Kin Sung and Louxin Zhang.
« Back...
Optimizing barbecues: algorithms
and applications
Axel Mosig, Shanghai Institutes for Biological
Sciences, Shanghai, PRC
Starting with a problem that occurs in the context of
discovering so-called cis regulatory modules in promoter
sequences, we discuss geometric and combinatorial versions
of the best barbecue problem. In its combinatorial version,
the best barbecue problem asks for the largest intersection
of n sets that are taken one from each of
n given collections of subsets of some universal set. The
problem being NP-complete in general, we first present
branch-and-bound algorithms. As it turns out, one can borrow
several ideas from clique algorithms to obtain improved
methods for solving the best barbecue problem. As a second
approach, we present integer linear
programming formulations; these make it particularly easy to
integrate biologically relevant constraints and allow
state-of-the-art ILP solvers to be used for tackling
instances occurring in applications.
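For tiny instances, the combinatorial version can be solved directly by exhaustive search over one choice per collection; the branch-and-bound algorithms mentioned above would prune such a search with an upper bound on the achievable intersection size. A minimal brute-force sketch (the function name and data are mine, for illustration only):

```python
from itertools import product

def best_barbecue(collections):
    """Pick one set from each collection so that their common
    intersection is as large as possible. Exhaustive search; fine
    for tiny instances only (the problem is NP-complete)."""
    best, best_choice = set(), None
    for choice in product(*collections):
        inter = set.intersection(*choice)
        if len(inter) > len(best):
            best, best_choice = inter, choice
    return best, best_choice

# Three collections of subsets of a small universe.
cols = [
    [{1, 2, 3}, {4, 5}],
    [{2, 3, 6}, {1, 4}],
    [{2, 3, 9}, {5, 6}],
]
inter, _ = best_barbecue(cols)
print(sorted(inter))  # [2, 3]
```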
Finally, we briefly address a related problem occuring in
the context of aligning certain flourescence microscopy data
sets ("Toponome images"), leading to the problem of
topologically aligning set-labelled planar graphs.
References:
Algorithms for the Combinatorial Best Barbeque Problem,
Patric R.J. Östergård, Vesa Vaskelainen, Axel Mosig;
submitted (2006).
Discovering cis-Regulatory Modules by Optimizing Barbeques,
A. Mosig, T. Biyikoglu, S.J. Prohaska, P.F. Stadler;
submitted (2005).
P. R. J. Östergård, A fast algorithm for the maximum clique
problem, Discrete Appl. Math. 120 (2002).
« Back...
Dan Gusfield's "galled trees" and
the block decomposition of metrics
Andreas Dress, Shanghai Institutes for Biological
Sciences, PRC
It will be shown that the concept of a "galled tree"
developed by Dan Gusfield and Wang is closely related to
that of "block decomposition", a concept that naturally
arises in the study of split and, more generally, of
coherent decompositions of finite metrics. This allows us in
particular to develop a much more general and convenient
framework for studying reticulate evolution and related
phenomena. In the lecture, I will present the basic facts
regarding
block decomposition and how this relates to Dan Gusfield's
work.
« Back...
Contribution of bioinformatics to
understanding stem cell biology
Bing Lim, Genome Institute of Singapore
Stem cells are attracting huge interest because they are
the basis for a new field of medicine known as Regenerative
Medicine. The ability to harness the use of stem cells
requires a good understanding of some of their unique
biological properties. Three questions dominate the field:
1) What is self-renewal, the ability to proliferate and
regenerate new stem cells without differentiating?
2) What is pluripotency, the ability to differentiate into
many lineages?
3) What is commitment, the ability to make lineage-fate
decisions?
These cellular characteristics have been observed and
described for over 30 years, but the molecular basis for
these cellular responses and behaviours is only just
beginning to be uncovered. The use of new genomics
technology in combination with bioinformatic analysis has
made a major impact in moving this front forward.
The talk will cover some of our recent work on the genome
wide profiling of the stem cell transcriptome and the
mapping of transcription factor binding sites to demonstrate
this point.
Key Collaborators: Huck Hui Ng, Yijun Ruan, Chialin Wei, Ken
Sung, Visensius Vega, Guillaume Bourque, Paul Robson, Larry
Stanton, Hasan Otu, Isidore Rigoutsos
« Back...
Motif discovery from both ends:
combinatorial and probabilistic
Mehmet M. Dalkilic, Indiana University, USA
The motif prediction problem is to predict short,
conserved subsequences that are part of a family of
sequences, and it is a very important biological problem.
Gibbs is one of the first successful motif algorithms; it
runs very fast compared to other algorithms, and its search
behavior is based on the well-studied Gibbs random sampling.
However, motif prediction is a very difficult problem, and
Gibbs may not predict true motifs in some cases. We
therefore explored the possibility of improving the
prediction accuracy of Gibbs while retaining its fast
runtime performance. In this paper, we only consider Gibbs
for proteins, not for DNA binding sites. We have developed
IGIBBS, an integrated motif search framework for proteins
that employs two previous techniques of our own: one for
guiding motif search by clustering sequences and another by
pattern refinement. These two techniques are combined into a
new double-clustering approach to guiding motif search. The
unique feature of our framework is that users do not have to
specify the number of motifs to be predicted when motifs
occur in different subsets of the input sequences, since it
automatically clusters the input sequences and
predicts motifs from the clusters. Tests on the PROSITE
database show that our framework improved the prediction
accuracy of Gibbs significantly. Compared to more exhaustive
search methods like MEME, IGIBBS predicted motifs more
accurately and ran one order of magnitude faster. Motif
discovery can also be viewed as an application of the more
general multiple local alignment problem. While conceptually
simple, this is challenging because of the algorithmic
complexity of aligning many sequences. We have introduced a
novel algorithm for multiple local alignment of protein
sequences, based on the de Bruijn graph approach first
proposed by Zhang and Waterman for aligning DNA sequences.
We generalize their approach to protein sequences by
building "approximate" de Bruijn graphs (ADBGs) that allow
gluing similar but not identical amino acids. We then
identify heavily trafficked paths in the ADBGs that
correspond to motifs. We have implemented this algorithm and
tested it using 100 sets of protein sequences from PROSITE.
The results show that our method achieved results comparable
to other popular motif discovery programs, like MEME and
Pratt, while offering advantages in terms of speed. We have
also extended the ADBG to deal effectively with gaps.
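To make the starting point concrete, the classic Gibbs motif sampler that IGIBBS builds on can be sketched in a few lines. This is a generic, simplified sampler of my own (not IGIBBS itself): hold one sequence out, build a profile with pseudocounts from the motif windows of the others, and resample the held-out sequence's motif start from the profile scores.

```python
import random

def gibbs_motif(seqs, w, iters=300, seed=0):
    """Minimal Gibbs-sampling motif search (illustrative sketch).
    Returns one start position per sequence for a shared width-w motif."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    alphabet = sorted(set("".join(seqs)))
    for _ in range(iters):
        i = rng.randrange(len(seqs))                  # hold sequence i out
        # Profile with pseudocounts from the other sequences' windows
        counts = [{a: 1.0 for a in alphabet} for _ in range(w)]
        for j, s in enumerate(seqs):
            if j != i:
                for k in range(w):
                    counts[k][s[pos[j] + k]] += 1.0
        # Score every window of the held-out sequence, sample a new start
        s = seqs[i]
        weights = []
        for start in range(len(s) - w + 1):
            p = 1.0
            for k in range(w):
                p *= counts[k][s[start + k]]
            weights.append(p)
        pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

# Plant a strong 5-residue motif in six random background sequences.
rng = random.Random(1)
seqs = []
for _ in range(6):
    s = "".join(rng.choice("ACDEFGHIK") for _ in range(30))
    p = rng.randrange(26)
    seqs.append(s[:p] + "WWWWW" + s[p + 5:])
starts = gibbs_motif(seqs, 5)
print([seqs[i][st:st + 5] for i, st in enumerate(starts)])
```

With a strongly planted motif the sampler usually converges on it, but as the abstract notes, such samplers can miss true motifs, which is the weakness IGIBBS addresses.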
« Back...
Topological and functional
discovery in a gene co-expression meta-network of gastric
cancer
Patrick Boon Ooi Tan, Genome Institute of Singapore
Gastric adenocarcinoma is a leading cause of global
cancer mortality, surpassed only by lung and breast cancer.
We established an International Consortium to assemble a
clinically annotated gastric cancer expression database
comprising >300 patient tissue samples at various
pathological stages, and used a rank-based statistical
methodology to identify gene coexpression interactions
conserved across multiple patient populations and array
platforms. The resulting co-expression meta-network (the "gastrome"),
comprising >12,000 interactions across >2000 genes, follows
a hierarchical scale-free assembly, containing several
deeply embedded coexpression modules of both known and novel
functions. The different modules exhibited highly distinct
topologies, with some modules (e.g. cellular proliferation)
being highly connected and integrated with other components,
while others (e.g. ribosomal biosynthesis) were relatively
isolated and
autonomous. One such highly isolated module contained a
sub-network of intestinal differentiation genes, suggesting
a topological basis underlying the frequent appearance of
intestinal metaplasia in gastric cancer. The gastrome also
revealed a novel interaction between PLA2G2A, previously
identified as a prognostic marker in gastric cancer, and
EphB2, a target of the Wnt signaling pathway. We validated
this association at the protein level using tissue
microarrays, and found that PLA2G2A, like EphB2, is likely
to be a target gene of the Wnt signaling pathway. These
results deepen our understanding of the molecular circuitry
of gastric cancer. More generally, they demonstrate how
meta-analytical approaches can play useful roles in
biological discovery, by identifying both higher-level, as
well as subtle but potentially important biological
relationships relevant to human disease.
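The abstract does not name the exact rank statistic used, but Spearman correlation is the standard rank-based coexpression measure, and a conservation criterion would keep a gene pair only if it is correlated in every patient population. A pure-Python sketch of the pairwise score (function names and data are mine, for illustration):

```python
def ranks(xs):
    """Average ranks, 1-based, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend the tie group
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Two genes rising together across five samples: perfect rank agreement.
g1 = [0.1, 0.4, 0.5, 0.9, 1.2]
g2 = [2.0, 2.2, 3.1, 3.5, 4.0]
print(spearman(g1, g2))  # 1.0
```

An edge of the meta-network would then correspond to a gene pair whose score exceeds a significance threshold in each of the contributing datasets.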
« Back...
QNet: a method to construct
circular phylogenetic networks from quartet data
Stefan Grünewald, Shanghai Institutes for Biological
Sciences, Shanghai, PRC
Phylogenetic networks are a generalization of
phylogenetic trees that allow the representation of
conflicting signal or alternative evolutionary histories in
a single diagram. Networks are commonly used in case the
underlying history is not treelike. For example,
recombination, hybridization, and gene transfer can all lead
to histories that are not adequately represented by a single
tree. Moreover, even when the history is treelike, parallel
evolution, model heterogeneity, and sampling error can make
it hard to find a unique tree. In such cases networks
provide a tool for representing ambiguity or for visualizing
a collection of feasible trees. Circular splits graphs form
a special class of phylogenetic networks that has become
popular recently. I will present QNet, a method that
constructs circular splits graphs from weighted quartets.
« Back...
Algorithms for irredundant
motif discovery
Alberto Apostolico, Georgia Institute of Technology,
USA and University of Padova, Italy
Among the pattern discovery problems of computational
biology and other domains, the extraction from a sequence or
sequence ensemble of patterns consisting of intermixed solid
and don't care characters elicits persistent attention and
study. In the context of motifs arising from sequence self-
correlations, notions of maximality and redundancy have been
formulated that can alleviate the burden of discovery. Such
notions support algebraic-flavored constructs, whereby all
motifs in a sequence can be expressed by those contained in
a typically much smaller set called the 'basis'. The motifs
of the basis are said to be 'irredundant'. This talk reviews
ideas and algorithms involved in these developments.
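To make the notion concrete, here is a toy sketch (my own illustration, not an algorithm from the talk) of how patterns of solid and don't-care characters arise from sequence self-correlations: align a sequence against each of its own shifts and keep the agreeing positions.

```python
def autocorrelation_motifs(s):
    """Patterns of solid and don't-care ('.') characters obtained by
    aligning a sequence with each of its own shifts -- the raw material
    from which maximal/irredundant motifs are then selected."""
    motifs = set()
    for d in range(1, len(s)):
        # Position-wise agreement between s and its shift by d
        pat = "".join(a if a == b else "." for a, b in zip(s, s[d:]))
        pat = pat.strip(".")                  # trim don't-care ends
        if len(pat.replace(".", "")) >= 2:    # keep >= 2 solid characters
            motifs.add(pat)
    return motifs

print(sorted(autocorrelation_motifs("AXBAYB")))  # ['A.B']
```

A basis algorithm would go further, keeping only the maximal patterns that cannot be deduced from (are not "covered" by) other motifs in the set.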
« Back...
High resolution genetic image analysis with wavelets
Wang Yu-Ping, University of Missouri-Kansas City, USA
Reaping the fruit of genome sequencing, high resolution genetic probes such as array CGH and FISH have been developed. These probes enable the detection of subtle and cryptic chromosomal abnormalities or genome rearrangements at a resolution unobtainable with conventional approaches. At the same time, they bring about computational challenges for image and data analysis. As a powerful tool developed in applied mathematics and signal processing, wavelet analysis has shown strong promise in solving these challenges. In this talk, I will show how wavelets offer an effective way of addressing a variety of genetic imaging problems such as enhancement, compression, registration and classification.
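As a concrete illustration of the kind of wavelet machinery involved (a generic textbook sketch, not the speaker's method), a one-level Haar transform with detail thresholding performs simple signal denoising in a few lines; the same idea extends to 2D images row- and column-wise.

```python
def haar1d(signal):
    """One level of the (unnormalised) Haar transform: pairwise
    averages (approximation) and differences (detail)."""
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    det = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avg, det

def ihaar1d(avg, det):
    """Exact inverse of haar1d."""
    out = []
    for a, d in zip(avg, det):
        out += [a + d, a - d]
    return out

def denoise(signal, thresh):
    """Shrink small detail coefficients to zero, then reconstruct."""
    avg, det = haar1d(signal)
    det = [d if abs(d) > thresh else 0.0 for d in det]
    return ihaar1d(avg, det)

sig = [10.0, 10.5, 20.0, 19.5]        # small jitter on two flat levels
print(denoise(sig, 0.6))  # [10.25, 10.25, 19.75, 19.75]
```

Real applications cascade several transform levels and choose the threshold from an estimate of the noise level, but the shrink-and-reconstruct pattern is the same.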
« Back...