# Efficient mixed models for next-generation sequencing data

Efficient mixed models for next-generation sequencing data | |
---|---|
Status | finished |
Master | Bioinformatics |
Student name | David Vossen |
Dates | |
Start | 2011/04/01 |
End | 2011/09/01 |
Supervision | |
Supervisor | Jelle Goeman, Erik van Zwet |
Second reader | Sanne Abeln |
Company | LUMC - Medical Statistics and BioInformatics |
Thesis | Media:Thesis.pdf |
Poster | Media:Media:Posternaam.pdf |

## Abstract

Next-generation sequencing is a new technology that allows sequencing of millions of short strands of DNA or RNA obtained from a biological tissue. After alignment to the human genome, the data from a next-generation sequencing experiment come in the form of counts of aligned reads; each count is the number of measured sequences that aligned to a certain location or region of the genome. The number of such counts can easily run to several million. If several subjects are measured, the resulting data set is a matrix of counts with millions of rows and one column per subject.

In our department we have a large next-generation sequencing data set from the Dutch famine study. In this study, the 96 subjects form 48 sibling pairs; in each pair, one sibling was exposed to famine in utero during the Dutch famine (winter 1944-1945) and the other was not. The focus of this study is genomic methylation. This was measured using bisulfite conversion, which (we omit the details) results in a double measurement for each individual: for each subject and each genomic location (CpG) there is both a methylated and an unmethylated count. The methylation fraction is the methylated count as a proportion of the two counts combined, and the research question for these data is to find those CpG sites and regions for which the methylation fraction is systematically different between exposed cases and their unexposed sibling controls.
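As a minimal sketch of the quantities described above, the following Python computes a methylation fraction per subject and the within-pair difference at a single CpG site. It assumes the fraction is methylated / (methylated + unmethylated); the function names and the toy counts are hypothetical and are not taken from the study. Sites without coverage in either sibling are simply skipped here, which is one possible choice among several.

```python
def methylation_fraction(methylated, unmethylated):
    """Return methylated / (methylated + unmethylated), or None without coverage."""
    total = methylated + unmethylated
    if total == 0:
        return None  # no reads at this site for this subject
    return methylated / total


def pair_differences(exposed, control):
    """Exposed-minus-control methylation fraction for each sibling pair at one CpG.

    exposed, control: lists of (methylated, unmethylated) count tuples,
    aligned so that position i in both lists belongs to the same sibling pair.
    Pairs in which either sibling has zero coverage are dropped.
    """
    diffs = []
    for (m_e, u_e), (m_c, u_c) in zip(exposed, control):
        f_e = methylation_fraction(m_e, u_e)
        f_c = methylation_fraction(m_c, u_c)
        if f_e is None or f_c is None:
            continue
        diffs.append(f_e - f_c)
    return diffs


# Toy counts for three sibling pairs at one site (made up for illustration).
exposed = [(8, 2), (5, 5), (0, 0)]
control = [(6, 4), (5, 5), (3, 7)]
print(pair_differences(exposed, control))
```

A raw difference like this ignores that a fraction based on 10 reads is far noisier than one based on 1,000, which is the reason a model with separate technical and biological variance components is needed.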

Statistically, such count data have two sources of variation. There is technical variation because the measured methylation fraction differs from the true methylation fraction for each individual, and biological variation because methylation fractions vary between individuals within a population. To model both sources of variation simultaneously, the classical model is a generalized linear mixed model, typically a logistic or loglinear model with additional random effects to model the between-subjects variation. Such models are quite well-behaved (although they may be unstable for low counts), but a major problem in sequencing applications is computation time. Although each model fit may take only about a second, repeating the calculation for millions of genomic sites is burdensome (a million one-second fits already amount to more than eleven days of serial computation), especially if permutation testing requires the whole analysis to be repeated at least 100 to 1,000 times.
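One common way to write such a model for a single CpG site is sketched below; the notation is an assumption made for illustration and is not taken from the thesis. Here $m_s$ and $n_s$ are the methylated and total counts of subject $s$, $x_s$ indicates famine exposure, and $b_s$ is a subject-level random effect. A fuller version for the sibling design would probably add a term shared within each pair.

```latex
\begin{align*}
  m_s \mid p_s &\sim \mathrm{Binomial}(n_s,\, p_s)
      && \text{(technical variation)} \\
  \operatorname{logit}(p_s) &= \alpha + \beta x_s + b_s,
      \qquad b_s \sim \mathcal{N}(0, \tau^2)
      && \text{(biological variation)}
\end{align*}
```

In this formulation the research question corresponds to testing $\beta = 0$ at each site, and the cost comes from having to integrate over $b_s$ numerically in every fit.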

In this project we explore a statistical solution to this computational problem. We generalize the model of DerSimonian and Laird (1986, Controlled Clinical Trials) to binary data in general. That model was originally proposed for meta-analysis in clinical trials, at a time when fitting even a single mixed model was beyond contemporary computing power.
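For orientation, the classical DerSimonian-Laird estimator from the meta-analysis setting can be sketched in a few lines of Python. This is the standard 1986 moment estimator, not the generalization developed in this project; the function name is hypothetical, and how per-subject counts would be turned into the `estimates` and `variances` inputs is left open here.

```python
import math


def dersimonian_laird(estimates, variances):
    """Classical DerSimonian-Laird random-effects pooling.

    estimates: per-unit effect estimates y_i
    variances: their within-unit sampling variances v_i (all > 0)
    Returns (pooled estimate, its standard error, between-unit variance tau^2).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]           # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))

    # Moment estimator of tau^2, truncated at zero.
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to every within-unit variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)
    return pooled, se, tau2


# Toy input, made up for illustration: two units with equal sampling variance.
print(dersimonian_laird([0.0, 4.0], [1.0, 1.0]))
```

The appeal for sequencing data is that everything here is closed-form: there is no iterative likelihood maximization, so the same arithmetic can be applied to millions of sites and repeated under permutation at low cost.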