Workshop on BioAlgorithmics
(12 - 14 July 2006)
~ Abstracts ~
A parsimony approach to
genome-wide ortholog assignment
Tao Jiang, University of California at Riverside, USA
The assignment of orthologous genes between a pair of
genomes is a fundamental and challenging problem in
comparative genomics. Existing methods that assign orthologs
based on the similarity between DNA or protein sequences may
make erroneous assignments when sequence similarity does not
clearly delineate the evolutionary relationship among genes
of the same families. In this paper, we present a new
approach to ortholog assignment that takes into account both
sequence similarity and evolutionary events at the genome
level, where orthologous genes are assumed to correspond to
each other in the most parsimonious evolutionary scenario
under genome rearrangement and gene duplication.
Ortholog assignment is then formulated as the problem of
computing the signed reversal distance with duplicates
between the two genomes of interest, for which an efficient
heuristic algorithm is given by introducing two new
optimization problems: minimum common partition and maximum
cycle decomposition. Following
this approach, we have
implemented a high-throughput system for assigning orthologs
on a genome scale, called MSOAR, and tested it on both
simulated data and real genome sequence data. Our predicted
orthologs between the human and mouse genomes are strongly
supported by ortholog and protein function information in
authoritative databases, and by predictions made by other
key ortholog assignment methods such as Ensembl, Homologene,
INPARANOID, and HGNC. The simulation results demonstrate
that MSOAR generally performs better than David Sankoff's
iterated exemplar algorithm in terms of identifying true
exemplar genes.
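The reversal-distance formulation can be illustrated on the duplicate-free case. The sketch below (illustrative only; the actual heuristic handles duplicated genes via minimum common partition and cycle decomposition) counts breakpoints of a signed permutation relative to the identity, which yields the classical lower bound on reversal distance because one reversal removes at most two breakpoints:

```python
def breakpoints(perm):
    """Count breakpoints of a signed permutation relative to the identity.

    A pair of adjacent elements (a, b) is an adjacency iff b == a + 1;
    otherwise it is a breakpoint.  The permutation is framed with 0 and n+1.
    """
    n = len(perm)
    framed = [0] + list(perm) + [n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b != a + 1)

def reversal_lower_bound(perm):
    # One reversal removes at most two breakpoints, so ceil(b/2) is a
    # lower bound on the reversal distance.
    b = breakpoints(perm)
    return (b + 1) // 2
```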
Joint work with X. Chen, Z. Fu, J. Zheng, V. Vacic, P.
Nan, Y. Zhong, and S. Lonardi.
Linear-time haplotype inference on
pedigree without recombinations
Bethany Chan, Hong Kong University, Hong Kong
In diploid organisms, such as humans, chromosomes come in
pairs, and experiments often yield genotypes, which
blend haplotypes for the chromosome pair. This gives rise to
the problem of inferring haplotypes from genotypes.
Given a pedigree P in which each node is associated with a
genotype, the Haplotyping Pedigree Data (with No
Recombinations) Problem (HPD-NR) is to find a consistent
haplotype configuration (CHC) for P such that the haplotypes
of each child are Mendelian-consistent, i.e. one of the
child's haplotypes is exactly the same as one of its
father's and the other is the same as one of its mother's.
Li and Jiang formulated the problem as a system of linear
equations over an mn x mn matrix and solved HPD-NR by
Gaussian elimination in O(m^3n^3) time,
where n is the number of
individuals in the pedigree and m is the number of loci for
each individual. In this paper, we propose a new
algorithm that can either find a CHC solution or report 'no
solution' in O(mn) time, which is optimal.
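The no-recombination Mendelian condition is easy to state in code. A minimal sketch (haplotypes as tuples of alleles; this is only the consistency check, not the O(mn) inference algorithm of the talk):

```python
def mendelian_consistent(child, father, mother):
    """Check the no-recombination Mendelian rule: one of the child's
    haplotypes must equal one of the father's and the other must equal
    one of the mother's.  Each parent is a pair of haplotypes, and each
    haplotype is a tuple of alleles (0/1)."""
    c1, c2 = child
    return ((c1 in father and c2 in mother) or
            (c2 in father and c1 in mother))
```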
Joint work with Mee Yee Chan, W.T. Chan, Francis Chin,
Stanley P.Y. Fung and Ming-Yang Kao.
Reconstructing protein-protein
interaction map of Saccharomyces cerevisiae
Kui Lin, Beijing Normal University, PRC
A map of protein-protein interactions provides valuable
insight into the cellular function and machinery of a
proteome. By measuring the similarity between two Gene
Ontology (GO) terms with a relative specificity semantic
relation, we propose a new method of reconstructing a
yeast protein-protein interaction map that fully explores
the knowledge buried in two GO annotations for the yeast
genome, namely the biological process (BP) and cellular
component (CC) annotations. The effectiveness of our method
was validated using high-quality interaction datasets.
Based on a Z-score analysis, a gold
standard positive (GSP) dataset with the highest level of
confidence that covered 78% of the high-quality interaction
dataset and a gold standard negative (GSN) dataset with the
lowest level of confidence were derived. In addition, we
assessed four high-throughput experimental interaction
datasets using the GSPs and GSNs. Our predicted network
reconstructed from GSPs consists of 40,753 interactions
among 2,259 proteins, and forms 16 connected components. We
mapped all of the MIPS complexes except for homodimers onto
the predicted network. As a result, ~35% of complexes were
identified as interconnected. The method is expected to
provide a new approach for predicting protein-protein
interaction maps from other completely sequenced genomes
with high-quality GO-based annotations.
Joint work with Xiaomei Wu, Lei Zhu, Jie Guo and Da-Yong
Zhang.
In search of replication origins in
herpesviruses: a computational approach
David Chew, National University of Singapore
Replication origins are places on the DNA molecules where
replication processes are initiated. As DNA
replication is the central step in the reproduction of many
viruses, understanding the molecular mechanisms
involved in DNA replication is of great importance in
developing strategies to control the growth and spread of
viruses. Knowledge of the locations of these replication
origins will enhance the development of antiviral agents by
blocking viral DNA replication or by interfering with the
infection process. With the increasing availability of
genomic DNA sequence data, the value of using computational
methods to predict likely locations of replication origins
before the experimental search has already been recognized
although no prediction scheme that works for all DNA in
general is available to date. In this talk, we will
describe two computational approaches to locating potential
replication origin sites in herpesviruses in silico. The
first looks at palindromic pattern concentration: I will
describe several scoring schemes to measure the spatial
abundance of palindromes in a genome, taking into account
the count, length and base-pair composition of the
palindromes. The second looks at AT content concentration:
applying a score-based approach and making use of
statistical theory developed by Karlin and his
collaborators, we locate regions of high AT concentration
in the genomes of the viruses.
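The palindrome-based idea can be sketched as follows; the windowed count below is an illustrative stand-in, not the actual scoring schemes of the talk:

```python
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def revcomp(s):
    """Reverse complement of a DNA string."""
    return "".join(COMP[b] for b in reversed(s))

def palindrome_positions(seq, length):
    """Start positions of DNA palindromes (substrings equal to their own
    reverse complement) of the given even length."""
    return [i for i in range(len(seq) - length + 1)
            if seq[i:i + length] == revcomp(seq[i:i + length])]

def window_counts(seq, length, window):
    """Count palindrome starts in consecutive non-overlapping windows;
    windows with unusually high counts are candidate origin regions."""
    starts = palindrome_positions(seq, length)
    return [sum(1 for s in starts if i <= s < i + window)
            for i in range(0, len(seq) - window + 1, window)]
```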
Joint work with Kwok Pui Choi and Ming-Ying Leung.
New generation homology search
Ming Li, University of Waterloo, Canada
Traditional homology search technology was a heuristic
science. Given a gene sequence, the search is either too
slow (dynamic programming) or not sensitive enough. When it
does return something, the results are simply some fragments
of alignments.
We wish to make homology search an exact science. The search
has to be fast as well as fully sensitive. The result needs
to be the complete gene structure match, instead of
fragmented alignments. I will discuss the ideas of
optimized spaced seeds introduced in the PatternHunter
papers, as well as newer ideas of integrating HMM with
homology search.
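The spaced-seed idea can be sketched in a few lines: only the '1' positions of the seed must match, which is what makes an optimized spaced seed more sensitive than a contiguous seed of the same weight. The seed strings below are short illustrative examples, not PatternHunter's optimized seeds:

```python
def seed_hits(query, text, seed):
    """Report (query_pos, text_pos) pairs where the two sequences agree
    on every '1' position of the spaced seed; '0' positions are ignored."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)

    def key(s, pos):
        return tuple(s[pos + i] for i in care)

    # Index all seed keys of the text, then scan the query.
    index = {}
    for j in range(len(text) - span + 1):
        index.setdefault(key(text, j), []).append(j)

    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(key(query, i), []):
            hits.append((i, j))
    return hits
```

A mismatch at an ignored ('0') position does not destroy the hit, whereas a contiguous seed of the same length misses it.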
Joint work with Bin Ma, John Tromp, X.F. Cui, T. Vinar,
and D. Shasha.
Reconstructing contiguous regions of
an ancestral genome
Jian Ma, Pennsylvania State University, USA
We recently analyzed mammalian genome rearrangements
(human, mouse, rat and dog) at higher resolution than has
been published to date. We identify 1,338 conserved segments
that may contain large insertions or deletions, but that are
free of chromosome fissions or fusions as well as inversions
or translocations over 50 kb in length, in the lineages
leading to human, mouse, rat and dog from their most recent
common ancestor. We describe a new method for predicting the
ancestral order and orientation of those intervals from
their observed adjacencies in
modern species. We produce a map of an early mammalian
genome that accounts for 96.8% of the available human genome
sequence data. The precision is further increased by mapping
inversions as small as 31 bp. Analysis of the predicted
evolutionary breakpoints in the human lineage confirms
certain published observations but disagrees with
others. Although only a few mammalian genomes are currently
sequenced to high precision, our theoretical analyses and
computer simulations indicate that our results are
reasonably accurate, and that they will become highly
accurate in the foreseeable future.
Joint work with Webb Miller, David Haussler, Louxin Zhang,
Mathieu Blanchette, Jim Kent, Richard Burhans, Brian Raney,
Bernard Suh.
A graph-based approach to
inferring protein function from heterogeneous data sources
Kenny Hon Nian Chua, National University of Singapore
Sequence homology has been widely used to shed light on
the functions of proteins. However, statistical constraints
impose limits on the sensitivity of sequence homology.
Recent works introduce other sources of biological data such
as protein interaction and gene expression profiles to
reinforce protein function inference. In our current work,
we infer protein function from multiple heterogeneous data
sources by reformulating these data sources into functional
linkage graphs. Each data source is modeled as a graph
representing a network of inter-protein similarity observed
from that particular source of evidence. These graphs can be
combined to form a more complete graph in a probabilistic
manner. The edges in such functional linkage graphs, which
correspond to binary protein relationships, can be further
weighted by topology using established techniques in graph
theory. Functions are inferred for a protein from its known
neighbours in the graph using a simple but effective
weighted averaging technique. Using this approach, we
combined sequence homology from BLAST searches, protein
interactions from GRID (General Repository of Interaction
Data) and co-occurrence of protein names in PubMed abstracts.
Our experiments show that the data sources complement one
another and combine to infer protein function with greater
sensitivity and precision.
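The weighted-averaging inference step can be sketched as follows; the data structures are assumed for illustration, not the authors' exact scoring:

```python
def predict_functions(protein, graph, annotations, k=3):
    """Score candidate functions for `protein` by the weighted frequency
    of each function among its neighbours in the combined functional
    linkage graph.  `graph[p]` maps each neighbour of p to an edge
    weight; `annotations[p]` is the set of known functions of p.
    Returns the top-k functions by score."""
    scores = {}
    for neighbour, w in graph.get(protein, {}).items():
        for func in annotations.get(neighbour, ()):
            scores[func] = scores.get(func, 0.0) + w
    return sorted(scores, key=scores.get, reverse=True)[:k]
```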
Computational prediction of
operons in Synechococcus sp. WH8102
Xin Chen, Nanyang Technological University
We computationally predict operons in the Synechococcus
sp. WH8102 genome based on three types of genomic data:
intergenic distances, COG gene functions and phylogenetic
profiles. In the proposed method, we first estimate a
log-likelihood distribution for each type of genomic data,
and then fuse this distribution information using a perceptron
to discriminate pairs of genes within operons (WO pairs)
from those across transcription unit borders (TUB pairs).
Computational experiments demonstrated that WO pairs tend to
have shorter intergenic distances, a higher probability of
being in the same COG functional categories and more similar
phylogenetic profiles than TUB pairs, indicating their
powerful capabilities for operon prediction. By testing the
method on 236 known operons of Escherichia coli K12, an
overall accuracy of 83.8% is obtained by joint learning from
multiple types of genomic data, whereas the individual
information sources yield accuracies of 80.4%, 74.4%, and
70.6%, respectively. We have applied this new approach, in
conjunction with our previous comparative genome
analysis-based approach, to predict 556 (putative) operons
in WH8102.
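The perceptron fusion step can be sketched as follows; the toy feature vectors stand in for the per-source log-likelihood scores, with label +1 for WO pairs and -1 for TUB pairs:

```python
def train_perceptron(samples, epochs=50, lr=0.1):
    """Train a perceptron that fuses per-feature scores into a single
    WO-vs-TUB decision.  `samples` is a list of (feature_vector, label)
    pairs with label +1 (within operon) or -1 (transcription-unit
    border).  Standard mistake-driven updates."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```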
This is joint work with Z. Su, Y. Xu and T. Jiang.
Exploiting indirect neighbours and
topological weight to predict protein function from
protein-protein interactions
Lim Soon Wong, National University of Singapore
Most approaches in predicting protein function from
protein-protein interaction data utilize the observation that
a protein often shares functions with proteins that interact
with it (its level-1 neighbours). However, proteins that
interact with the same proteins (i.e. level-2 neighbours)
may also have a greater likelihood of sharing similar
physical or biochemical characteristics. We speculate that
two separate forms of functional association account for
this phenomenon, and that a protein is likely to share
functions with its level-1 and/or level-2 neighbours. We are
interested in finding out how significant functional
association between level-2 neighbours is and how it can be
exploited for protein function prediction. We conducted a
statistical study of recent interaction data and observed
that functional association between level-2 neighbours is
clearly observable. A substantial number of proteins are
observed to share functions with level-2 neighbours but not
with level-1 neighbours. We develop an algorithm that
predicts the functions of a protein in two steps: (1) assign
a weight to each of its level-1 and level-2 neighbours by
estimating its functional similarity with the protein using
the local topology of the interaction network as well as the
reliability of experimental sources; (2) score each
function based on its weighted frequency among these neighbours.
Using leave-one-out cross validation, we compare the
performance of our method against that of several other
existing approaches and show that our method performs well.
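The level-1/level-2 distinction can be sketched directly on an adjacency-list interaction graph (here level-2 is taken strictly: proteins sharing an interaction partner with the query, excluding the query itself and its direct partners):

```python
def level_neighbours(graph, protein):
    """Return (level-1, level-2) neighbour sets of `protein`.
    Level-1 neighbours interact with it directly; level-2 neighbours
    share an interaction partner with it but are not level-1."""
    level1 = set(graph.get(protein, ()))
    level2 = set()
    for p in level1:
        level2.update(graph.get(p, ()))
    level2.discard(protein)
    level2 -= level1
    return level1, level2
```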
This is joint work with Hon Nian Chua and Wing-Kin Sung.
High resolution genetic image
analysis with wavelets
Yuping Wang, University of Missouri-Kansas City
Reaping the fruit of genome sequencing, high resolution
genetic probes such as the array CGH and FISH have been
developed. These probes enable the detection of subtle and
cryptic chromosomal abnormalities or genome rearrangements
at a resolution unobtainable with the conventional
approaches. At the same time, they bring about computational
challenges for image and data analysis. As a powerful tool
developed in applied mathematics and signal processing,
wavelet analysis has shown strong promise in solving these
challenges. In this talk, I will show how wavelets offer an
effective way of addressing a variety of genetic imaging
problems such as enhancement, compression, registration and
classification.
Computing bi-clusters for
microarray data analysis
Lusheng Wang, City University of Hong Kong, Hong Kong
One of the main goals in the analysis of microarray data
is to identify groups of genes and groups of experimental
conditions (including environments, individuals and
tissues), that exhibit similar expression patterns. This is
the so-called bi-clustering problem. In this talk, we
present approximation algorithms and heuristics for the
bi-clustering problem.
Protein subcellular localization
prediction based on compartment-specific biological features
Wen Lian Hsu, Institute of Information Science,
Academia Sinica, Taiwan
Prediction of subcellular localization of proteins is
important for genome annotation, protein function
prediction, and drug discovery. We present a prediction
method for Gram-negative bacteria that uses ten
one-versus-one support vector machine (SVM) classifiers,
where compartment-specific biological features are selected
as input to each SVM classifier. In Gram-negative bacterial
secretory pathways, proteins localized to a particular
subcellular compartment have distinct biological properties.
We construct our classification framework to mimic the
translocation process of bacterial secretory pathways. To
distinguish between proteins translocated to different
compartments, we consider several biological input features
including amino acid composition, dipeptide composition,
solvent accessibility, secondary structure elements, signal
peptides, transmembrane α-helices, twin-arginine signal
peptides, transmembrane β-barrels, and non-classical protein
secretion. To encode solvent accessibility and secondary
structure elements, every protein sequence is represented by
a specific feature vector constructed from encoded
representations of residue properties. We use three
descriptors, composition, transition, and distribution, to
encode the global characteristic of a given biological
feature in a protein. Composition is the number of amino
acids of a particular property divided by the total number
of amino acids. Transition characterizes the percent
frequency with which amino acids of a particular property
are followed by amino acids of a different property.
Distribution measures the protein length within which the
first, 25, 50, 75 and 100% of the amino acids of a
particular property are located, respectively. In addition, we
also apply a statistical approach to identify the
significant amino acid coupling sequence patterns in
proteins located in different compartments. The amino acid
coupling sequence pattern is defined as any two types of
amino acids separated by one or more amino acids. We use
amino acid coupling patterns and term frequency–inverse
document frequency (TFIDF) to extract the sequence
information as features. The extracted features are then
reduced for dimension by using latent semantic indexing
(LSI). The proteins are represented as a term-document matrix
M for conceptual feature extraction by a mathematical
construction, the singular value decomposition (SVD). Based
on the assumption that concepts with small singular values
are considered as noise and thus can be neglected, we derive
an approximation of M by taking only the largest singular
values as features. The final prediction of localization
sites is determined by integrating the results from ten
binary classifiers using a combination of majority votes and
a probabilistic method. We demonstrate that feature
selection guided by biological knowledge and insights in
one-versus-one SVM classifiers can lead to a significant
improvement in the prediction performance. Some
localization-related features can be captured by residue properties
encoded by composition, transition, and distribution
descriptors. We also found that proteins located in
different compartments exhibit significant bias in their
amino acid coupling patterns represented by TFIDF and
reduced by LSI. The overall accuracy reaches 91.4%, which is
1.6% better than the state-of-the-art system, in a ten-fold
cross-validation evaluation on a benchmark data set. Our
model is also used to produce highly accurate prediction of
92.8% overall accuracy for proteins of dual localizations.
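The composition, transition and distribution descriptors described above can be sketched for a single binary residue property (the example property mapping in the test is illustrative; at least one residue must carry the property for the distribution to be defined):

```python
def ctd(seq, prop):
    """Composition/Transition/Distribution descriptors for one binary
    residue property.  `prop` maps each residue to 0 or 1.
    Composition: fraction of residues with property 1.
    Transition:  fraction of adjacent residue pairs whose property differs.
    Distribution: relative positions (0..1] of the first, 25%, 50%, 75%
    and 100% occurrences of property-1 residues along the sequence."""
    bits = [prop[r] for r in seq]
    n = len(bits)
    ones = [i for i, b in enumerate(bits) if b == 1]
    comp = len(ones) / n
    trans = sum(1 for a, b in zip(bits, bits[1:]) if a != b) / (n - 1)
    dist = []
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        k = max(1, round(frac * len(ones)))  # index of the frac-quantile occurrence
        dist.append((ones[k - 1] + 1) / n)
    return comp, trans, dist
```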
Joint work with Chia-Yu Su, Allan Lo, Hua-Sheng Chiu,
Jia-Ming Chang, and Ting-Yi Sung.
Visual exploration of multivariate
image data through ultra-fast computation of similarity maps
with the Lasagne tool
Peter Serocka, Shanghai Institutes for Biological
Sciences, Shanghai, PRC
Multivariate image data are becoming increasingly available
in numerous fields, including fluorescence microscopy. As
new methods for fast analysis are needed, we present
Lasagne, a visualisation tool that aims at revealing
geometric structures in multivariate images which represent
regions exhibiting similar pixel data. When the user
specifies a pixel-of-interest in the image domain, a new
picture is generated that shows the similarity of all other
pixels to the given one. Thereby one obtains a fast overview
of the shape of the regions of further interest. Due to
hardware-accelerated image generation, this process works at
interactive speed. We show the application of Lasagne in
Topological Proteomics research (MELK microscopy).
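The per-pixel similarity computation can be sketched as follows (the inverse-distance similarity function is an illustrative choice, not necessarily the one used in Lasagne, which additionally exploits hardware acceleration):

```python
def similarity_map(image, poi):
    """Given a multivariate image (2-D grid of per-pixel feature tuples)
    and a pixel-of-interest (row, col), return a grid of similarity
    values in (0, 1], computed here from Euclidean distance in feature
    space: identical pixels score 1.0, distant pixels approach 0."""
    ref = image[poi[0]][poi[1]]

    def sim(px):
        d = sum((a - b) ** 2 for a, b in zip(px, ref)) ** 0.5
        return 1.0 / (1.0 + d)

    return [[sim(px) for px in row] for row in image]
```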
Analysis of genome rearrangement
using reversals and block-interchanges
Yen Jen Lin, National Tsing Hua University
With increasing amounts of genomic data (DNA, RNA, and
protein sequences) becoming available, the study of genome
rearrangement, which measures the evolutionary difference
between two organisms by conducting a large-scale
comparison of their genomic data, has received a lot of
attention in both computational biology and bioinformatics.
One of the most promising ways to do this research is to
compare the orders of identical landmarks in two different
genomes, where the identical landmarks can be the
homologous/conserved regions (including genes) shared by the
sequences. The considered genomes are usually denoted by a
set of ordered (signed or unsigned) integers with each
integer representing an identical landmark in the genomes
and its sign (+ or -) indicating the transcriptional
orientation. Given two permutations representing two
linear/circular chromosomes, the genome rearrangement study
computes the rearrangement distance, defined by the most
parsimonious evolutionary scenario required to transform one
chromosome into the other. Among the proposed operations, we
seek an economical explanation of such a transformation
using reversals, which invert a segment of a permutation,
and block-interchanges, which have the effect of swapping
two non-overlapping segments of a permutation, based on the
breakpoint graph approach. Moreover, our
experimental results on the γ-Proteobacteria almost coincide
with the previous study considering reversals only, but
slightly modify the evolutionary distance of one species
with respect to the previous one, which indicates that
considering reversals and block-interchanges together seems
to play a significant role in evolutionary reconstruction.
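The two operations can be written out concretely for signed permutations (the 0-based Python slice indices are an illustrative convention):

```python
def reversal(perm, i, j):
    """Invert the segment perm[i:j] and flip its signs (signed reversal)."""
    return perm[:i] + [-x for x in reversed(perm[i:j])] + perm[j:]

def block_interchange(perm, i, j, k, l):
    """Swap the two non-overlapping segments perm[i:j] and perm[k:l]
    (requires i <= j <= k <= l); signs are unchanged."""
    return perm[:i] + perm[k:l] + perm[j:k] + perm[i:j] + perm[l:]
```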
A semantic approach for mining
biological databases
Srinivasa K G, M S Ramaiah Institute of Technology
A variety of biological databases are currently available
to researchers in the XML format. Homology-related querying
on such databases poses several problems, as most available
exhaustive mining techniques do not
incorporate the semantic relationships inherent to these
data collections. Our work identifies an index-based
approach to mining such data and explores the improvement
achieved in the quality of query results by the application
of genetic algorithms. Our experiments confirm the widely
accepted advantages of index- and vector-space-based models
for biological data and, specifically, show that the
application of genetic algorithms optimizes the
search and achieves higher levels of precision and accuracy
in heterogeneous databases and faster query execution across
all data collections.
The similarity metric and the
distance metric
Kaizhong Zhang, University of Western Ontario
We consider general requirements of similarity measures
and propose a general definition. We then consider the
relationship between a similarity metric and a distance
metric. With these basic definitions and properties, we give
general results on how to normalize a similarity metric and
a distance metric. These results have applications in
bioinformatics, where similarity measures and distance
measures are routinely used.
Large-scale phylogenetic
reconstruction, the tree of life, and the CIPRES project
Bernard Moret, University of New Mexico, USA and
Swiss Federal Institute of Technology, Switzerland
In this talk, I will review current computational
activities aimed at reconstructing the Tree of Life, the
evolutionary history of all living organisms (an estimated
10-100 million). Researchers and funding agencies worldwide
have put renewed emphasis on the establishment of
evolutionary relationships among living species; such
relationships
are fundamental to research in medicine, drug design,
agriculture, ecology, and many other areas. The CIPRES
project (Cyber Infrastructure for Phylogenetic Research) was
founded to develop the informatics tools required to attempt
a reconstruction of the Tree of Life. I will sketch the goal
and current achievements of CIPRES, comment on future needs,
and relate its work to that of other research efforts in
phylogeny. I will then focus on specific challenges that
arise in reconstructing trees with 100,000 or more leaves,
with particular emphasis on sources of error and on the
methodological advances we will need to evaluate the quality
of such reconstructions.
Rearrangements in genomes with
unequal content
Melvin Zhang, National University of Singapore
The phylogeny of a group of organisms is a tree
representation of their evolutionary relationships.
Phylogenetic analysis gives us insights into the evolution
of traits within species as well as helps us understand the
process of evolution. Most modern phylogenies are determined
by comparing the DNA sequences of conserved genes but in
recent years there has been a growing interest in the use of
gene order rearrangements between genomes for phylogenetic
reconstruction. Early programs, such as GRAPPA and MGR,
which make use of gene order rearrangements, are limited
to genomes with the same set of genes. We propose an
extension of the MGR algorithm (Bourque and Pevzner, 2002)
to handle genomes with different sets of genes. Experiments
on simulated data showed that the accuracy of our extension
with regard to the tree topology is a significant
improvement over the naïve method of first removing the
genes which are not common to all the genomes and then
running the MGR program.
Fusion genes as putative microbial
drug targets in H. pylori
Meena Kishore Sakharkar, Nanyang Technological
University
Fusion genes have been reported as a means of enabling
the development of novel or enhanced functions. In this
report, we analyzed fusion genes in the genomes of two
Helicobacter pylori strains (26695 and J99) and identified
32 fusion genes that are present as neighbours in one strain
(components) and are fused in the second (composite),
and vice versa. The mechanism for each case of gene fusion
is explored. All the genes identified as fusion products in
this analysis were reported as essential genes in this
bacterium in the previously documented transposon
mutagenesis of H. pylori strain G27. This observation
suggests the potential of the products of fusion genes as
putative microbial drug targets. These results underscore
the utility of bacterial genomic sequence comparisons for
understanding gene evolution and for in silico drug target
identification in the post-genomic era.
A new approach to identifying
conserved genes in a path from commensalism to pathogenicity
Zhiwei Cao, Shanghai Center for Bioinformation
Technology
Staphylococcus epidermidis, long regarded as an innocuous
commensal bacterium of the human skin, is recognized as the
most frequent cause of nosocomial sepsis and infections
associated with implanted medical devices. Such a
conditional pathogen provides a model of choice to study
some genome landmarks correlated with the transition between
commensalism and pathogenicity. Traditional investigations
stress differences in gene content. We focused here on
conserved genes that have accumulated small mutation
differences during the transition. A comparison of S.
epidermidis strain ATCC12228, a non-biofilm forming,
non-infection associated strain and S. epidermidis strain
RP62A, a methicillin-resistant biofilm clinical isolate,
revealed consistent variation, mostly single-nucleotide
polymorphisms (SNPs), in orthologous genes in addition to
the previously investigated global changes in gene clusters.
This polymorphism, scattered throughout the genome, may
reveal genes that contribute to adaptation of the bacteria
to different environmental stimuli, allowing them to shift
from commensalism to pathogenicity. The dN/dS ratios for
virulence factors and surface proteins differed
significantly from that of average SNPs. Of 931 gene pairs
with SNPs, 40 showed a disproportionate distribution of dN
vs dS. We suggest, however, that these are likely to belong
to surface proteins or to be considered, as a priority,
important for pathogenicity. Our study proposed a new
approach to identify
genes involved in pathogenic processes and provided some
insight about the molecular mechanisms leading a commensal
inhabitant to become an invasive pathogen.
Compressed text indexing and local
alignment
Tak Wah Lam, Hong Kong University, Hong Kong
In recent years, the theoretical computer science
community has made dramatic improvements to compressed
indexes for exact string matching. It is now feasible to
build and store a compressed index for a human genome using
a PC. Yet exact string matching has limited applications,
and local alignment or approximate string matching are
needed more often. In this talk, we report some experimental
results on applying Burrows-Wheeler Transform to speed up
local alignment and approximate string matching. Our results
reveal that the new tool is practically sound.
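The Burrows-Wheeler Transform itself can be sketched naively via sorted rotations (the compressed indexes in the talk use far more compact constructions; this is only to show the transform and its invertibility):

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations.  `text` must end
    with a unique sentinel (here '$') that sorts before all other
    characters."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last):
    """Invert the BWT by repeatedly prepending the last column and
    sorting; the row ending in '$' is the original text."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    return next(row for row in table if row.endswith("$"))
```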
Pathway finder
Chee Keong Kwoh, Nanyang Technological University
The importance of biological pathways has seen the
creation of more than 200 online databases (e.g. Kyoto
Encyclopedia of Genes and Genomes (KEGG) and Biocarta) in
this area. Although useful metabolic or regulatory path
co-occurrence data may be obtained from these pathways, few
tools exist that perform the function of pathway matching.
This talk describes the project that was initiated to
address the issue of finding path co-occurrences in
different biological processes with pathway data from KEGG.
A program was developed to be able to detect such path
co-occurrences across multiple pathways, given a
user-specified input.
On simultaneous knowledge
extraction from protein protein interaction data
Hugo Willy, Institute for Infocomm Research
An important class of interaction switches for biological
circuits and disease pathways is short binding motifs.
However, experimentally discovering such motifs is not an
easy task. Computationally, binding motifs may be found by
applying standard motif discovery algorithms to a set of
protein sequences that interact with a common protein or a
group of proteins with similar properties. However, such an
approach fails if a protein interacts with very few other
proteins, or when prior knowledge of such protein grouping
is unavailable or erroneous. Experimental noise in the input
interaction data can further degrade the performance of
such approaches.
We propose a novel approach that effectively circumvents the
above-mentioned limitations to find correlated short
sequence motifs from protein-protein interaction data.
Correlated motifs are those motifs that consistently
co-occur in the interacting sequences. We adopted the (l,d)-motif
model to formulate the new problem as an (l,d)-motif pair
finding problem. We present both an exact algorithm,
D-MOTIF, as well as its approximation algorithm, D-STAR to
solve this problem. Evaluation on extensive simulated data
showed that our approach is robust against highly noisy
input data. On real biological datasets, our algorithm was
able to extract correlated motifs that corresponded to the
actual binding interfaces of two different classes of
proteins.
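The (l,d)-motif model can be sketched as follows; `correlated_pair` is an illustrative support check for a candidate motif pair, not the D-MOTIF/D-STAR algorithms themselves:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def motif_occurrences(seq, motif, d):
    """Start positions where `motif` occurs in `seq` with at most `d`
    mismatches -- the (l, d)-motif model with l = len(motif)."""
    l = len(motif)
    return [i for i in range(len(seq) - l + 1)
            if hamming(seq[i:i + l], motif) <= d]

def correlated_pair(seq1, seq2, m1, m2, d):
    """A candidate correlated motif pair (m1, m2) is supported by an
    interacting sequence pair iff m1 occurs in one sequence and m2 in
    the other, each with at most d mismatches."""
    return bool((motif_occurrences(seq1, m1, d) and motif_occurrences(seq2, m2, d))
                or (motif_occurrences(seq1, m2, d) and motif_occurrences(seq2, m1, d)))
```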
Extracting protein-protein
interactions from the literature using the hidden vector
state model
Yulan He, Nanyang Technological University
Biochips are widely used in genetic, toxicological,
protein, and biomedical research. To make biochips truly
useful, it is important to determine the presence and/or
amount (referred to as quantitation) of certain proteins in
biological samples. This requires different capture agents,
which can be found through knowledge of protein-protein
interactions. However, a large quantity of this kind of
knowledge is still hidden in the
biomedical literature. We present an information extraction
system which employs a semantic parser using the Hidden
Vector State (HVS) model for protein-protein interactions.
Unlike other hierarchical parsing models which require fully
annotated treebank data for training, the HVS model can be
trained using only lightly annotated data whilst
simultaneously retaining sufficient ability to capture the
hierarchical structure needed to robustly extract semantics
in the task domain.
When applied to extracting protein-protein interaction
information from medical literature, we found that it
performed better than other established statistical methods
and achieved 58.3% and 76.8% in recall and precision
respectively. Moreover, the statistical nature of the purely
data-driven HVS model makes it intrinsically robust and
easily adaptable to other domains, a property rarely
possessed by rule-based approaches. In
addition, the HVS model can also be trained directly from
un-annotated data by employing a traditional K-nearest
neighbor method to automatically generate the semantic
annotations from the un-annotated corpus.
This is joint work with Deyu Zhou and Chee Keong Kwoh.
« Back...
Computational methods for refinement
and expansion of signaling pathways
Ron Shamir, Tel Aviv University, Israel
One of the major challenges in systems biology is the
analysis of large scale genome-wide data vis-à-vis prior
knowledge about a particular cellular network. Here we
introduce an extended computational framework that combines
formalization of available qualitative knowledge in a
probabilistic model, and integration of high throughput
experimental data. Using our methods, it is possible to
learn an improved model with better fit to the experiments.
In particular, we improve the model in terms of (a) refining
the relations among model components, and (b) expanding the
model to include new components that are transcriptionally
regulated by the model. We demonstrate our methodology on
the yeast response to hyper-osmotic shock. We show that our
integrative approach synergistically combines qualitative
and quantitative evidence into concrete novel
biological hypotheses.
Joint work with Irit Gat-Viks.
« Back...
Reconstructing recombination
networks from sequence data: small parsimony problem
Nguyen Cam Thach, National University of Singapore
This work formalizes the parsimony score for reconstructing
recombination networks from sequence data. It is proved that the small
parsimony problem becomes NP-hard even for galled networks. We also
present a dynamic programming method for the problem, which is
efficient when the number of recombination nodes is small.
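The tree case provides useful background here: on a phylogenetic tree (no recombination nodes), the small parsimony problem is solved in linear time per character by Fitch's classical algorithm, which the network method generalizes. A minimal sketch of the tree case (illustrative background only, not the authors' network algorithm; the root node is assumed to be keyed "root"):

```python
# Fitch's small-parsimony algorithm on a binary tree, for one character.
# Background sketch for the tree case, not the authors' network DP.

def fitch(tree, leaf_states):
    """tree: dict mapping internal node -> (left, right); leaves are
    absent from the dict. Returns (parsimony score, candidate state
    sets per node). Assumes the root is named "root"."""
    sets, score = {}, 0

    def post(node):
        nonlocal score
        if node not in tree:                 # leaf: fixed state
            sets[node] = {leaf_states[node]}
            return
        left, right = tree[node]
        post(left); post(right)
        inter = sets[left] & sets[right]
        if inter:
            sets[node] = inter               # agreement: no mutation needed
        else:
            sets[node] = sets[left] | sets[right]
            score += 1                       # union step costs one mutation

    post("root")
    return score, sets

# Toy example: leaf states ((A,C),(A,A)) need exactly one change.
tree = {"root": ("n1", "n2"), "n1": ("a", "b"), "n2": ("c", "d")}
states = {"a": "A", "b": "C", "c": "A", "d": "A"}
score, _ = fitch(tree, states)
print(score)  # 1
```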
Joint work with Bao Nguyen, Wing-Kin Sung and Louxin Zhang.
« Back...
Optimizing barbecues: algorithms
and applications
Axel Mosig, Shanghai Institutes for Biological
Sciences, Shanghai, PRC
Starting with a problem that occurs in the context of
discovering so-called cis regulatory modules in promoter
sequences, we discuss geometric and combinatorial versions
of the best barbecue problem. In its combinatorial version,
the best barbecue problem asks for the largest intersection
of n sets that are taken one from each of
n given collections of subsets of some universal set. The
problem being NP-complete in general, we first present
branch-and-bound algorithms. As it turns out, one can borrow
several ideas from clique algorithms to obtain improved
methods for solving the best barbecue problem. As a second
approach, we present integer linear
programming formulations; these make it particularly easy to
integrate biologically relevant constraints and allow
state-of-the-art ILP solvers to be used for tackling
instances occurring in applications.
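For tiny instances, the combinatorial version can be solved directly by exhaustive search over one choice per collection; the branch-and-bound algorithms mentioned above would prune such a search with an upper bound on the achievable intersection size. A minimal brute-force sketch (the function name and data are mine, for illustration only):

```python
from itertools import product

def best_barbecue(collections):
    """Pick one set from each collection so that their common
    intersection is as large as possible. Exhaustive search; fine
    for tiny instances only (the problem is NP-complete)."""
    best, best_choice = set(), None
    for choice in product(*collections):
        inter = set.intersection(*choice)
        if len(inter) > len(best):
            best, best_choice = inter, choice
    return best, best_choice

# Three collections of subsets of a small universe.
cols = [
    [{1, 2, 3}, {4, 5}],
    [{2, 3, 6}, {1, 4}],
    [{2, 3, 9}, {5, 6}],
]
inter, _ = best_barbecue(cols)
print(sorted(inter))  # [2, 3]
```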
Finally, we briefly address a related problem occuring in
the context of aligning certain flourescence microscopy data
sets ("Toponome images"), leading to the problem of
topologically aligning set-labelled planar graphs.
References:
Algorithms for the Combinatorial Best Barbeque Problem,
Patric R.J. Östergård, Vesa Vaskelainen, Axel Mosig;
submitted (2006).
Discovering cis-Regulatory Modules by Optimizing Barbeques,
A. Mosig, T. Biyikoglu, S.J. Prohaska, P.F. Stadler;
submitted (2005).
P. R. J. Östergård, A fast algorithm for the maximum clique
problem, Discrete Appl. Math. 120 (2002).
« Back...
Dan Gusfield's "galled trees" and
the block decomposition of metrics
Andreas Dress, Shanghai Institutes for Biological
Sciences, PRC
It will be shown that the concept of a "galled tree"
developed by Dan Gusfield and Wang is closely related to
that of "block decomposition", a concept that naturally
arises in the study of split and, more generally, of
coherent decompositions of finite metrics. This allows us in
particular to develop a much more general and convenient
framework for studying reticulate evolution and related
phenomena. In the lecture, I will present the basic facts
regarding
block decomposition and how this relates to Dan Gusfield's
work.
« Back...
Contribution of bioinformatics to
understanding stem cell biology
Bing Lim, Genome Institute of Singapore
Stem cells are attracting huge interest because they are
the basis for a new field of medicine known as Regenerative
Medicine. The ability to harness the use of stem cells
requires a good understanding of some of their unique
biological properties. Three questions dominate the field:
1) What is self-renewal, the ability to proliferate and
regenerate new stem cells without differentiating?
2) What is pluripotency, the ability to differentiate into
many lineages?
3) What is commitment, the ability to make lineage-fate
decisions?
These cellular characteristics have been observed and
described for over 30 years, but the molecular basis for
these cellular responses and behaviours is only just
beginning to be uncovered. The use of new genomics
technology in combination with bioinformatic analysis has
made a major impact in moving this front forward.
The talk will cover some of our recent work on the genome
wide profiling of the stem cell transcriptome and the
mapping of transcription factor binding sites to demonstrate
this point.
Key Collaborators: Huck Hui Ng, Yijun Ruan, Chialin Wei, Ken
Sung, Visensius Vega, Guillaume Bourque, Paul Robson, Larry
Stanton, Hasan Otu, Isidore Rigoutsos
« Back...
Motif discovery from both ends:
combinatorial and probabilistic
Mehmet M. Dalkilic, Indiana University, USA
The motif prediction problem is to predict short,
conserved subsequences that are part of a family of
sequences, and it is a very important biological problem.
Gibbs is one of the first successful motif algorithms; it
runs very fast compared to other algorithms, and its search
behavior is based on the well-studied Gibbs random sampling.
However, motif prediction is a very difficult problem, and
Gibbs may not predict true motifs in some cases. We
therefore explored the possibility of improving the
prediction accuracy of Gibbs while retaining its fast
runtime performance. In this paper, we only consider Gibbs
for proteins, not for DNA binding sites. We have developed
IGIBBS, an integrated motif search framework for proteins
that employs two previous techniques of our own: one for
guiding motif search by clustering sequences and another by
pattern refinement. These two techniques are combined into a
new double-clustering approach to guiding motif search. The
unique feature of our framework is that users do not have to
specify the number of motifs to be predicted when motifs
occur in different subsets of the input sequences, since it
automatically clusters the input sequences and
predicts motifs from the clusters. Tests on the PROSITE
database show that our framework improved the prediction
accuracy of Gibbs significantly. Compared to more exhaustive
search methods like MEME, IGIBBS predicted motifs more
accurately and ran one order of magnitude faster. Motif
discovery can also be viewed as an application of the more
general multiple local alignment problem. While conceptually
simple, this is challenging because of the algorithmic
complexity of aligning many sequences. We have introduced a
novel algorithm for multiple local alignment of protein
sequences, based on the de Bruijn graph approach first
proposed by Zhang and Waterman for aligning DNA sequences.
We generalize their approach to protein sequences by
building "approximate" de Bruijn graphs (ADBGs) that allow
gluing similar but not identical amino acids. We then
identify heavily trafficked paths in the ADBGs that
correspond to motifs. We have implemented this algorithm and
tested it using 100 sets of protein sequences from PROSITE.
The results show that our method achieved results comparable
to other popular motif discovery programs, like MEME and
Pratt, while offering advantages in terms of speed. We have
also extended the ADBG to deal effectively with gaps.
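To make the starting point concrete, the classic Gibbs motif sampler that IGIBBS builds on can be sketched in a few lines. This is a generic, simplified sampler of my own (not IGIBBS itself): hold one sequence out, build a profile with pseudocounts from the motif windows of the others, and resample the held-out sequence's motif start from the profile scores.

```python
import random

def gibbs_motif(seqs, w, iters=300, seed=0):
    """Minimal Gibbs-sampling motif search (illustrative sketch).
    Returns one start position per sequence for a shared width-w motif."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    alphabet = sorted(set("".join(seqs)))
    for _ in range(iters):
        i = rng.randrange(len(seqs))                  # hold sequence i out
        # Profile with pseudocounts from the other sequences' windows
        counts = [{a: 1.0 for a in alphabet} for _ in range(w)]
        for j, s in enumerate(seqs):
            if j != i:
                for k in range(w):
                    counts[k][s[pos[j] + k]] += 1.0
        # Score every window of the held-out sequence, sample a new start
        s = seqs[i]
        weights = []
        for start in range(len(s) - w + 1):
            p = 1.0
            for k in range(w):
                p *= counts[k][s[start + k]]
            weights.append(p)
        pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

# Plant a strong 5-residue motif in six random background sequences.
rng = random.Random(1)
seqs = []
for _ in range(6):
    s = "".join(rng.choice("ACDEFGHIK") for _ in range(30))
    p = rng.randrange(26)
    seqs.append(s[:p] + "WWWWW" + s[p + 5:])
starts = gibbs_motif(seqs, 5)
print([seqs[i][st:st + 5] for i, st in enumerate(starts)])
```

With a strongly planted motif the sampler usually converges on it, but as the abstract notes, such samplers can miss true motifs, which is the weakness IGIBBS addresses.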
« Back...
Topological and functional
discovery in a gene co-expression meta-network of gastric
cancer
Patrick Boon Ooi Tan, Genome Institute of Singapore
Gastric adenocarcinoma is a leading cause of global
cancer mortality, surpassed only by lung and breast cancer.
We established an International Consortium to assemble a
clinically annotated gastric cancer expression database
comprising >300 patient tissue samples at various
pathological stages, and used a rank-based statistical
methodology to identify gene coexpression interactions
conserved across multiple patient populations and array
platforms. The resulting co-expression meta-network (the "gastrome"),
comprising >12,000 interactions across >2000 genes, follows
a hierarchical scale-free assembly, containing several
deeply embedded coexpression modules of both known and novel
functions. The different modules exhibited highly distinct
topologies, with some modules (e.g. cellular proliferation)
being highly connected and integrated with other components,
while others (e.g. ribosomal biosynthesis) were relatively
isolated and
autonomous. One such highly isolated module contained a
sub-network of intestinal differentiation genes, suggesting
a topological basis underlying the frequent appearance of
intestinal metaplasia in gastric cancer. The gastrome also
revealed a novel interaction between PLA2G2A, previously
identified as a prognostic marker in gastric cancer, and
EphB2, a target of the Wnt signaling pathway. We validated
this association at the protein level using tissue
microarrays, and found that PLA2G2A, like EphB2, is likely
to be a target gene of the Wnt signaling pathway. These
results deepen our understanding of the molecular circuitry
of gastric cancer. More generally, they demonstrate how
meta-analytical approaches can play useful roles in
biological discovery, by identifying both higher-level, as
well as subtle but potentially important biological
relationships relevant to human disease.
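The abstract does not name the exact rank statistic used, but Spearman correlation is the standard rank-based coexpression measure, and a conservation criterion would keep a gene pair only if it is correlated in every patient population. A pure-Python sketch of the pairwise score (function names and data are mine, for illustration):

```python
def ranks(xs):
    """Average ranks, 1-based, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend the tie group
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Two genes rising together across five samples: perfect rank agreement.
g1 = [0.1, 0.4, 0.5, 0.9, 1.2]
g2 = [2.0, 2.2, 3.1, 3.5, 4.0]
print(spearman(g1, g2))  # 1.0
```

An edge of the meta-network would then correspond to a gene pair whose score exceeds a significance threshold in each of the contributing datasets.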
« Back...
QNet: a method to construct
circular phylogenetic networks from quartet data
Stefan Grünewald, Shanghai Institutes for Biological
Sciences, Shanghai, PRC
Phylogenetic networks are a generalization of
phylogenetic trees that allow the representation of
conflicting signal or alternative evolutionary histories in
a single diagram. Networks are commonly used in case the
underlying history is not treelike. For example,
recombination, hybridization, and gene transfer can all lead
to histories that are not adequately represented by a single
tree. Moreover, even when the history is treelike, parallel
evolution, model heterogeneity, and sampling error can make
it hard to find a unique tree. In such cases networks
provide a tool for representing ambiguity or for visualizing
a collection of feasible trees. Circular splits graphs form
a special class of phylogenetic networks that has become
popular recently. I will present QNet, a method that
constructs circular splits graphs from weighted quartets.
« Back...
Algorithms for irredundant
motif discovery
Alberto Apostolico, Georgia Institute of Technology,
USA and University of Padova, Italy
Among the pattern discovery problems of computational
biology and other domains, the extraction from a sequence or
sequence ensemble of patterns consisting of intermixed solid
and don't care characters elicits persistent attention and
study. In the context of motifs arising from sequence self-
correlations, notions of maximality and redundancy have been
formulated that can alleviate the burden of discovery. Such
notions support algebraic-flavored constructs, whereby all
motifs in a sequence can be expressed by those contained in
a typically much smaller set called the 'basis'. The motifs
of the basis are said to be 'irredundant'. This talk reviews
ideas and algorithms involved in these developments.
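To make the notion concrete, here is a toy sketch (my own illustration, not an algorithm from the talk) of how patterns of solid and don't-care characters arise from sequence self-correlations: align a sequence against each of its own shifts and keep the agreeing positions.

```python
def autocorrelation_motifs(s):
    """Patterns of solid and don't-care ('.') characters obtained by
    aligning a sequence with each of its own shifts -- the raw material
    from which maximal/irredundant motifs are then selected."""
    motifs = set()
    for d in range(1, len(s)):
        # Position-wise agreement between s and its shift by d
        pat = "".join(a if a == b else "." for a, b in zip(s, s[d:]))
        pat = pat.strip(".")                  # trim don't-care ends
        if len(pat.replace(".", "")) >= 2:    # keep >= 2 solid characters
            motifs.add(pat)
    return motifs

print(sorted(autocorrelation_motifs("AXBAYB")))  # ['A.B']
```

A basis algorithm would go further, keeping only the maximal patterns that cannot be deduced from (are not "covered" by) other motifs in the set.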
« Back...
High resolution genetic image analysis with wavelets
Wang Yu-Ping, University of Missouri-Kansas City, USA
Reaping the fruit of genome sequencing, high resolution genetic probes such as array CGH and FISH have been developed. These probes enable the detection of subtle and cryptic chromosomal abnormalities or genome rearrangements at a resolution unobtainable with conventional approaches. At the same time, they bring about computational challenges for image and data analysis. As a powerful tool developed in applied mathematics and signal processing, wavelet analysis has shown strong promise in solving these challenges. In this talk, I will show how wavelets offer an effective way of addressing a variety of genetic imaging problems such as enhancement, compression, registration and classification.
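As a concrete illustration of the kind of wavelet machinery involved (a generic textbook sketch, not the speaker's method), a one-level Haar transform with detail thresholding performs simple signal denoising in a few lines; the same idea extends to 2D images row- and column-wise.

```python
def haar1d(signal):
    """One level of the (unnormalised) Haar transform: pairwise
    averages (approximation) and differences (detail)."""
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    det = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avg, det

def ihaar1d(avg, det):
    """Exact inverse of haar1d."""
    out = []
    for a, d in zip(avg, det):
        out += [a + d, a - d]
    return out

def denoise(signal, thresh):
    """Shrink small detail coefficients to zero, then reconstruct."""
    avg, det = haar1d(signal)
    det = [d if abs(d) > thresh else 0.0 for d in det]
    return ihaar1d(avg, det)

sig = [10.0, 10.5, 20.0, 19.5]        # small jitter on two flat levels
print(denoise(sig, 0.6))  # [10.25, 10.25, 19.75, 19.75]
```

Real applications cascade several transform levels and choose the threshold from an estimate of the noise level, but the shrink-and-reconstruct pattern is the same.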
« Back...