The candidate fusion split reads were re-mapped against the reference genome

To increase the chance of identifying recurrent fusion transcripts across the cohorts, the fusion candidate templates produced by the sample-based strategy were combined at the first step of the cohort-based analysis. However, in recognition of inter-cohort differences in block archive age and library quality, the expression profiling step was carried out separately within each cohort. The average insert size and complexity of the Providence cohort libraries are higher than those of the Rush cohort libraries. Here we describe results from the Providence RNA-Seq dataset to illustrate the performance of the cohort-based computational approach. Briefly, 50 bp single-end reads were mapped to the human reference genome to provide candidate reads split across potential fusion junctions, in a manner similar to GSTRUCT-fusion and GFP. The candidate fusion split reads were then re-mapped against the human reference genome using GSNAP parameters that favor local alignments. Any reads that aligned locally, and were therefore not split across the fusion junction, were discarded. This alignment re-testing step eliminated 28% of the distant spliced junctions identified in Step 1. The RefSeq annotation file was used to annotate the remaining distant spliced junctions. Only junctions mapping to two different annotated genes were kept, and 80% of the distant spliced junctions identified in Step 2 were eliminated during this annotation step. Next, candidate fusion junctions having at least one supporting read were combined from the two cohorts and further tested using the cohort-based strategy. The donor and acceptor mRNA or pre-mRNA template sequences were used as controls for the sequence homology search and to generate read alignments in the cohort-based approach. This step removed 27% of the potential false positive fusion junctions from Step 3. The remaining five template sets were combined and constructed into a single template index.
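The annotation filter described above (keep only junctions whose two sides map to two different annotated genes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Junction` record, its field names, and the placeholder gene labels are assumptions introduced here, and the lookup of gene names from the RefSeq annotation file is assumed to have happened already.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Junction:
    """Hypothetical record for one distant spliced junction."""
    donor_gene: Optional[str]     # annotated gene at the donor side, None if unannotated
    acceptor_gene: Optional[str]  # annotated gene at the acceptor side, None if unannotated
    supporting_reads: int         # number of split reads supporting the junction


def keep_inter_gene_junctions(junctions: List[Junction]) -> List[Junction]:
    """Keep junctions whose two sides map to two different annotated genes."""
    return [
        j for j in junctions
        if j.donor_gene is not None
        and j.acceptor_gene is not None
        and j.donor_gene != j.acceptor_gene
    ]


# Toy input with placeholder gene labels (not real data).
candidates = [
    Junction("GENE_A", "GENE_B", 3),  # two different genes: kept
    Junction("GENE_A", "GENE_A", 5),  # same gene on both sides: dropped
    Junction("GENE_C", None, 1),      # one side unannotated: dropped
]
kept = keep_inter_gene_junctions(candidates)
```

Under this reading, junctions within a single gene and junctions with an unannotated side are both discarded, which is one plausible interpretation of "mapping to two different annotated genes".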
For each RNA-Seq library, all short reads mapping near any junction site in the template index, as well as reads not mapped in Step 1, were aligned to the template index. Fusion templates with at least one supporting short read were selected for further cohort-based analysis.
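The final selection step can be illustrated with a short sketch. It assumes that per-library supporting-read counts for each fusion template are already available as a nested dictionary, and that "at least one supporting short read" means at least one read in total across the libraries of a cohort; the function name, the `min_support` parameter, and the library and template labels are all hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def select_supported_templates(
    counts_by_library: Dict[str, Dict[str, int]],
    min_support: int = 1,
) -> Set[str]:
    """Return template IDs whose summed supporting-read count meets min_support."""
    totals: Counter = Counter()
    for library_counts in counts_by_library.values():
        # Counter.update adds the counts from each library to the running totals.
        totals.update(library_counts)
    return {template for template, n in totals.items() if n >= min_support}


# Toy counts with placeholder labels (not real data).
counts = {
    "library_1": {"template_1": 0, "template_2": 2},
    "library_2": {"template_1": 0, "template_3": 1},
}
selected = select_supported_templates(counts)
```

With a threshold of one read, summing across libraries and requiring support in any single library give the same selection; the two readings diverge only if `min_support` is raised.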
