Next-Generation Sequencing (NGS) Bioinformatic Pipelines

This document describes the bioinformatic pipelines and data processing: workflow steps, quality
control analysis, and the data and reports available to clients. Champions data pipelines are run
and monitored in the Amazon Elastic Compute Cloud environment.

Current Pipelines Available: RNAseq, Whole-Exome Sequencing (WES)

Data Pipelines Workflow Overview:

[Figure: Workflow Overview]

Illumina-Demultiplex Pre-Pipeline

The illumina-demultiplex Nextflow pipeline serves as the ‘pre-pipeline’ that runs once the wet-lab
work and in-house sequencing (RNAseq or WES) are complete. It takes raw human or mouse sequencing
data and a valid sample sheet as input.

Steps:

  1. Data from the sequencing instrument is transferred to the AWS cloud environment.
  2. The raw data from the Illumina NextSeq 2000 instrument (*.bcl files) is converted to
    fastq.gz files.
  3. The fastq.gz files undergo an extensive quality control process in which the sequence PHRED
    quality score, %PF (reads passing filter), read indexing score, and read count must each
    meet a minimum threshold to pass QC.
  4. Fastq.gz sequence files that pass QC are then concatenated if multiple files share the same
    sample ID.
  5. A pipeline run manifest for RNAseq or WES is generated and used to launch the respective
    pipeline.
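
The QC gate and the grouping of files for concatenation (steps 3 and 4) can be sketched as
follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code:
the threshold values, the field names, and the helpers passes_qc and group_for_concatenation are
hypothetical, and the read indexing score is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical minimum thresholds; the real cut-offs are not stated in this document.
MIN_THRESHOLDS = {
    "mean_phred": 30.0,
    "pct_pf": 80.0,
    "read_count": 1_000_000,
}


@dataclass(frozen=True)
class FastqQC:
    sample_id: str
    path: str
    mean_phred: float
    pct_pf: float
    read_count: int


def passes_qc(qc, thresholds=MIN_THRESHOLDS):
    """Return True only if every metric meets its minimum threshold."""
    return all(getattr(qc, name) >= minimum for name, minimum in thresholds.items())


def group_for_concatenation(records):
    """Map each sample ID to the QC-passing files that share it, in input order."""
    groups = defaultdict(list)
    for qc in records:
        if passes_qc(qc):
            groups[qc.sample_id].append(qc.path)
    return dict(groups)


records = [
    FastqQC("S1", "S1_L001.fastq.gz", 35.1, 92.0, 5_000_000),
    FastqQC("S1", "S1_L002.fastq.gz", 34.8, 91.5, 4_800_000),
    FastqQC("S2", "S2_L001.fastq.gz", 22.0, 60.0, 200_000),  # fails every threshold
]
groups = group_for_concatenation(records)
```

Any sample ID that maps to more than one path would then be concatenated into a single fastq.gz
file; gzip files can be joined byte for byte and still form a valid gzip file.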

Reports/Output Available for Clients:

  • Fastq.gz Quality Control Report.

RNAseq Pipeline

The RNAseq Nextflow pipeline uses established bioinformatic tools to analyze and aggregate data
from fastq.gz sequence files. This pipeline accepts PDX samples (which are de-moused in silico) or
human samples. The raw data generated in the RNAseq pipeline undergoes post-pipeline processing to
output normalised expression of gene transcripts and gene fusions.
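
As an illustration of what normalised expression can look like, the sketch below computes
transcripts per million (TPM) from raw counts and transcript lengths. This document does not state
which normalisation the post-pipeline processing applies, so TPM is an assumption made for the
example, and the function name tpm and the input values are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from raw read counts and transcript lengths (bp)."""
    # Length-normalise first: reads per base for each transcript.
    rates = {t: counts[t] / lengths_bp[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    # Scale so that the values of one sample sum to one million.
    return {t: rate / total * 1e6 for t, rate in rates.items()}


counts = {"geneA": 100, "geneB": 300}
lengths_bp = {"geneA": 1000, "geneB": 3000}
values = tpm(counts, lengths_bp)
```

Here both transcripts have the same number of reads per base, so they receive the same TPM even
though their raw counts differ.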

Steps:

  1. QC of fastq files using FastQC.
  2. Fusion analysis using FusionCatcher with the most up-to-date human genome references.
  3. Trimming: Fastq files are trimmed for low quality sequences and sequencing adaptors.
  4. Alignment: After trimming, fastq sequence files are aligned to a combined index of human
    and mouse reference genomes (GRCh37 and GRCm38) using STAR aligner (v2.4.2a).
  5. De-mousing: Fastq files undergo a de-mousing step, in which the mouse ‘host’ is removed.
  6. Alignment: Fastq files are aligned to bacterial and viral genomes to check whether bacteria
    and/or viruses are present.
  7. Expression quantification using featureCounts (v1.4.3) and RSEM.
  8. Sample Haplotype: Derive a sample haplotype against a given VCF file.
  9. QC of data generated using Picard (v1.83), as well as de-mousing statistics.
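
The idea behind de-mousing against a combined human and mouse index (steps 4 and 5) can be
sketched as follows. This is a simplified illustration, not the pipeline's implementation: the
mm_ prefix on mouse contig names, the use of a single alignment score, the tie rule, and the
function name demouse are all assumptions made for the example.

```python
def demouse(alignments, mouse_prefix="mm_"):
    """Return the IDs of reads whose best alignment is to a human contig.

    alignments: iterable of (read_id, contig, score) tuples from a combined index.
    """
    best = {}  # read_id -> [best human score, best mouse score]
    for read_id, contig, score in alignments:
        scores = best.setdefault(read_id, [float("-inf"), float("-inf")])
        idx = 1 if contig.startswith(mouse_prefix) else 0
        scores[idx] = max(scores[idx], score)
    # Keep a read only when its human score is strictly better than its mouse score.
    return {read_id for read_id, (human, mouse) in best.items() if human > mouse}


alignments = [
    ("r1", "chr1", 98), ("r1", "mm_chr4", 60),   # human wins
    ("r2", "mm_chr2", 99),                        # mouse only
    ("r3", "chr7", 75), ("r3", "mm_chr7", 75),   # tie: treated as ambiguous and dropped
]
human_reads = demouse(alignments)
```

Dropping ties is the conservative choice in this sketch; a real workflow may handle ambiguous
reads differently.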

Reports Available for Clients:

  • RNAseq QC Metrics Summary Report

Whole-Exome Sequencing (WES) Pipeline

The whole-exome sequencing (WES) Nextflow pipeline uses the well-known GATK pipeline for local
realignment around INDELs and base quality score recalibration, and performs SNP analysis with
multiple callers (MuTect, Lofreq and Strelka) against a normal cell line sample (NA12878) as a
control. This pipeline accepts PDX samples (which are de-moused) or human samples. The raw data
generated in the WES pipeline is processed to produce filtered mutations (SNVs and INDELs), copy
number variants (CNV), MSI and TMB (tumor mutation burden) scores, and HLA typing.

Steps:

  1. Trimming: Fastq files are trimmed for low quality sequences and sequencing adaptors.
  2. AGeNT trimming and FastQC: SureSelect XT HS2 libraries, in which a molecular tag is added to
    monitor read duplication rate, are trimmed with AGeNT, followed by FastQC of the tagged
    fastq files.
  3. Alignment: Alignment is performed with the BWA aligner against a combined index of human and
    mouse reference genomes (GRCh37 and GRCm38).
  4. De-mousing: Combined BAM files undergo a de-mousing step, in which the mouse ‘host’ is
    removed.
  5. Alignment: Alignment is performed again using the BWA aligner.
  6. Mark duplicates: Applies to samples generated with the Agilent library prep kit and to
    human-only samples.
  7. De-duplication: De-duplication to remove duplicate reads.
  8. INDEL realignment: GATK3 is used for INDEL realignment.
  9. Add Read Groups: Read groups are added after INDEL realignment using Picard Tools.
  10. Base recalibration: GATK3 base recalibration is performed on post-alignment output (BAM).
  11. Clip overlap: Overlapping reads in read pairs are clipped in the post-alignment output (BAM)
    of both the normal samples and the tumor ‘normal’ sample included on the run.
  12. Collect exon metrics: Exon metrics are collected using the Picard ‘CollectHsMetrics’
    function and the latest human reference sequence.
  13. Call INDELs: Call INDELs using Scalpel v0.5.4, Strelka v2.9.10, and Pindel v0.2.5b9.
  14. Call SNVs: Call SNVs using Lofreq v2.1.4, MuTect v1.1.7, Strelka v2.9.10, and Pindel v0.2.5b9.
  15. Post-processing: GATK3 is used for variant selection and filtration.
  16. Combine Variants: Combine filtered VCF (Variant Call Format) files from different callers.
  17. Annotation: Annotate the combined VCF file using SnpSift and VEP.
  18. Detect microsatellite instability: MSISensor2 for microsatellite instability detection.
  19. CNV: Detect copy number variations with EXCAVATOR2 (v1.1.2).
  20. HLA Typing: HLA typing is performed using OptiType.
  21. Sample Haplotype: Derive a sample haplotype against a given VCF file.
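
Combining filtered calls from several callers (step 16) can be pictured as a union of variants
that records which callers support each one. This document does not describe the merge rule the
pipeline uses, so the sketch below is an assumption; the function name combine_calls and the
example variants are hypothetical.

```python
from collections import defaultdict


def combine_calls(calls_by_caller):
    """Merge per-caller variant sets into one table keyed by variant.

    Each variant is a (chrom, pos, ref, alt) tuple; the value is the sorted
    list of callers that reported it.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    return {variant: sorted(callers) for variant, callers in sorted(support.items())}


calls = {
    "Lofreq": {("1", 12345, "A", "T"), ("2", 500, "G", "C")},
    "MuTect": {("1", 12345, "A", "T")},
    "Strelka": {("1", 12345, "A", "T"), ("3", 42, "C", "CT")},
}
combined = combine_calls(calls)
```

Keeping the list of supporting callers makes it possible to filter later, for example by
requiring agreement between at least two callers.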

Reports Available for Clients:

  • WES QC Metrics Summary Report

Post-Pipeline QC & Data Processing

Quality control and data processing are run after each RNAseq or WES run. All data generated in
each respective pipeline undergoes extensive QC and data processing before release to a client
or upload to our Lumin Bioinformatics platform. Post-pipeline data processing is specific to each
pipeline and produces its own output types.

Steps:

  1. Aggregation of output data from pipeline run.
  2. QC of aggregated data.
  3. Data that passes QC will undergo further data processing using in-house scripts to produce
    the following final data output:

    RNAseq:

      • Expression
      • Fusion
      • Haplotype

    WES:

      • Mutations
      • Copy Number Variations (CNV)
      • HLA Typing
      • MSI Scores
      • TMB (tumor mutation burden)
      • Haplotypes

  4. Final processed data is prepared for delivery to the client or is prepared for upload to our
    Lumin Bioinformatics platform.
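
As an example of one WES output, TMB is commonly reported as the number of mutations per megabase
of sequence covered. The sketch below shows that arithmetic only; which mutations are counted and
how the covered region is defined are not stated in this document, so the function name
tumor_mutation_burden and the input values are hypothetical.

```python
def tumor_mutation_burden(n_mutations, covered_bases):
    """Mutations per megabase of covered sequence."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_mutations / (covered_bases / 1_000_000)


# Hypothetical example: 350 counted mutations over 35 Mb of covered exome.
tmb = tumor_mutation_burden(n_mutations=350, covered_bases=35_000_000)
```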