2017 ASHS Annual Conference
Improving an RNA-Seq Analysis Workflow in Blueberry
Friday, September 22, 2017
Kona Ballroom (Hilton Waikoloa Village)
RNA-Seq is a powerful tool for monitoring changes in the cell transcriptome, with applications in assaying gene expression and sequence variants, among many others. A typical RNA-Seq analysis consists of (i) trimming adapter sequences from the reads, with optional filtering of reads by quality and/or size, (ii) mapping the reads to a reference genome or transcriptome, including de novo assemblies, and (iii) quantifying gene counts for use in downstream differential expression analyses. Multiple software options are available at each step of the process, and these choices ultimately affect the overall outcome of the analysis. When a complete genome sequence is available for the organism under study, the performance of different software packages is usually less variable than when working with a non-model organism. Here, blueberry (Vaccinium spp.) was used as a non-model organism for which a draft genome sequence is available for diploid V. corymbosum. However, there are multiple Vaccinium species of interest for breeding and cultivation, many of which differ in ploidy level, which increases the complexity of their analysis and calls into question the utility of the V. corymbosum reference for other species. To explore this question, we used deep RNA-Seq data from two species of blueberry (diploid V. arboreum and tetraploid V. corymbosum) to test several possible RNA-Seq analysis workflows, including mapping to the reference genome versus de novo transcriptome assemblies. At the read processing level, the combination of k-mer correction and two trimming programs (Trimmomatic and Skewer) generated four sets of reads for downstream use. These were then mapped to the reference genome or to de novo assemblies using multiple programs (Bowtie2, STAR, HISAT2, GSNAP, Stampy and Salmon).
Less stringent trimming with Skewer retained more reads than Trimmomatic, but of lower quality; this negatively affected the performance of de novo assemblies, although it increased the number of mapped reads. At the next step, the performance of the mapping programs varied widely. GSNAP/STAR and Salmon provided the highest number of total counts when used against the reference genome and de novo assemblies, respectively. Here we show how the selection of different programs throughout the workflow affects the ability to correctly assign reads to transcripts and properly interpret the RNA-Seq analysis. Although beyond the scope of this work, it remains an open question whether the results from each pipeline affect the count profiles and thus influence subsequent differential expression analyses.
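To give a sense of how quickly the number of candidate workflows grows, the comparison described above can be sketched as a small enumeration. This is only an illustration under stated assumptions: it supposes that the four read sets arise from k-mer correction being applied or not, crossed with the two trimmers, and that every read set could be paired with every mapping program. The abstract does not state either point, and the function names `read_sets` and `workflows` are hypothetical.

```python
from itertools import product

# Assumption: the four read sets come from k-mer correction (off/on)
# crossed with the two trimming programs named in the abstract.
KMER_CORRECTION = (False, True)
TRIMMERS = ("trimmomatic", "skewer")

# Mapping/quantification programs named in the abstract.
MAPPERS = ("bowtie2", "STAR", "HISAT2", "GSNAP", "stampy", "salmon")


def read_sets():
    """Return one descriptor per processed read set (hypothetical layout)."""
    return [
        {"kmer_corrected": corrected, "trimmer": trimmer}
        for corrected, trimmer in product(KMER_CORRECTION, TRIMMERS)
    ]


def workflows():
    """Pair every read set with every mapper.

    Assumption: a full cross product. The study may have used only a
    subset of these pairings (for example, some programs only against
    the reference genome and others only against de novo assemblies).
    """
    return [
        {**reads, "mapper": mapper}
        for reads, mapper in product(read_sets(), MAPPERS)
    ]


if __name__ == "__main__":
    for wf in workflows():
        label = "corrected" if wf["kmer_corrected"] else "uncorrected"
        print(f"{label:>11} | {wf['trimmer']:<11} | {wf['mapper']}")
```

Under these assumptions the sketch lists 24 read-set and mapper pairings, before the choice between the reference genome and de novo assemblies is taken into account.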