Research and training in molecular biology and genomics focus on several high-dimensional (High-D) discrete inference problems, including genome rearrangements, analysis of repeated sequences in genomes, and RNA secondary structure. Data from repeated genome sequences are well suited to the probabilistic models that are the mainstay of our group’s work. Because the source of a next-generation sequencing read is ambiguous among repeated genomic sequences, deterministic methods fail in repeat regions. However, subtle sequence differences between repeat copies give rise to probabilistic differences, and their effects are amplified by the multiple independent reads available from next-generation sequencing.
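
As a rough illustration of how independent reads amplify a small difference between repeat copies, the sketch below computes the posterior probability that a set of reads originated from each copy. It is a toy under stated assumptions, not our group's model: the function name `copy_posterior`, the uniform prior over copies, the single per-base error rate, and the requirement that every read is ungapped and spans the same window as the copies are all chosen here for illustration.

```python
import math


def copy_posterior(reads, copies, error_rate=0.01):
    """Posterior probability that all reads came from each repeat copy.

    Assumptions (illustrative only): uniform prior over copies, independent
    reads, independent per-base errors at a fixed rate, and reads that are
    ungapped and aligned to the same window as each copy.
    """
    log_match = math.log(1.0 - error_rate)
    log_mismatch = math.log(error_rate / 3.0)  # error to one of 3 other bases

    log_likelihoods = []
    for copy in copies:
        total = 0.0
        for read in reads:
            for read_base, copy_base in zip(read, copy):
                total += log_match if read_base == copy_base else log_mismatch
        log_likelihoods.append(total)

    # Normalize in log space to avoid underflow with many reads.
    largest = max(log_likelihoods)
    weights = [math.exp(value - largest) for value in log_likelihoods]
    normalizer = sum(weights)
    return [weight / normalizer for weight in weights]


# Two repeat copies that differ at a single position (the last base).
copies = ["ACGTACGTAC", "ACGTACGTAT"]
one_read = copy_posterior(["ACGTACGTAC"], copies)
three_reads = copy_posterior(["ACGTACGTAC"] * 3, copies)
```

With a single distinguishing base, each additional independent read multiplies the likelihood ratio between the two copies by the same factor, so the posterior for the true copy sharpens quickly as coverage grows.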

Analysis of genome rearrangements involving repeats is a major component of our work in this area. It involves inference of structural variations, including deletions, duplications, inversions, and gene conversion events, among repeated sequences of the human genome. The main mechanism behind these rearrangements is non-allelic homologous recombination (NAHR) between pairs of repeated regions. These events occur as a result of errors in the most common form of recombination, which normally involves recombination of the two alleles from sister chromatids. Errors arise in repeated sequences when an allele from one chromatid mistakenly recombines with a repeated copy of its sister allele. We use graphical models to analyze these interconnected events. This work is in collaboration with Professor Ben Raphael of the CS Department at Brown. Another important project involving repeated genome sequences concerns how the epigenetic silencing of repeated regions changes with age. This work is done in collaboration with Professors John Sedivy and Nicola Neretti of Brown’s MCB Department.

Our group has worked on statistical inference of RNA secondary structure (SS) for many years. Our most important contributions in this area have been the development of exact sampling algorithms that characterize the shape of the Boltzmann-weighted ensemble of RNA secondary structures. Currently we have a project that uses information theory to identify all of the major probable RNA stems and all probable combinations of these stems. These stems are the major determinants of ensemble shape. Another important contribution is a Gibbs sampling model and algorithm for joint inference of the secondary structure of similarly folding RNA sequences and the alignment of those sequences. We also collaborate with Professor Robert Reenan of Brown’s MCB Department on RNA editing. Contrary to the central dogma of molecular biology, RNA molecules transcribed from the DNA of a genome are not always identical copies of the genome’s DNA sequence, because important enzymes edit the sequence as it is being transcribed. This editing changes some of the bases in the RNA sequence. The most important of these enzymes, ADAR, changes adenosine (A) bases to inosine (I) bases, which the cell machinery interprets as guanosine (G) bases. An important component of this work is synergistic with our work on repeats, because both require a probabilistic algorithm for aligning reads from next-generation sequencing to a reference genome. Our group also works on the analysis of sequences of data from stratigraphic records concerning the history of climate change.
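
Because inosine is read as guanosine, A-to-I editing shows up in sequencing data as A-to-G mismatches against the reference. The sketch below counts such mismatches at reference A positions. It is a deliberately simple threshold-based illustration, not the probabilistic alignment approach described above: the function name `editing_candidates`, the `(start, sequence)` read format, the ungapped same-strand alignments, and both thresholds are assumptions made for this example, and a genomic A/G variant would produce the same signal.

```python
from collections import Counter


def editing_candidates(reference, aligned_reads, min_fraction=0.2, min_depth=3):
    """Return {position: fraction of reads showing G} at reference A sites.

    aligned_reads is a list of (start, sequence) pairs, assumed ungapped and
    on the same strand as the reference (hypothetical input format).
    """
    depth = Counter()    # reads covering each reference A
    g_count = Counter()  # reads showing G at each reference A

    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if position >= len(reference) or reference[position] != "A":
                continue
            depth[position] += 1
            if base == "G":
                g_count[position] += 1

    candidates = {}
    for position, covered in depth.items():
        fraction = g_count[position] / covered
        if covered >= min_depth and fraction >= min_fraction:
            candidates[position] = fraction
    return candidates


reference = "CATGACA"
reads = [(0, "CGTGACA"), (0, "CGTGACA"), (0, "CATGACA"), (0, "CATGACA")]
sites = editing_candidates(reference, reads)
```

In this toy input, half of the reads show G at the reference A in position 1 (0-based), so that position is reported as a candidate site, while the other A positions show no G and are not.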