Bioinformatics Services

The Bioinformatics Core provides two tiers of bioinformatics services: 1) basic base calling and QC services required for NGS; and 2) functional analysis for project-specific applications. These are outlined in more detail below:

  1. QC Analysis, Basecalling, and Demultiplexing

    NGS generates gigabase- to terabase-scale output as millions to billions of short raw reads of 100-300 bases each. Every sequencer run must meet specific QA/QC standards before any further analysis. The Bioinformatics Core has established QA/QC pipelines for each sequencing platform; they run automatically after each run and are monitored by the Core's bioinformatics staff. The QC metrics include, but are not limited to: number of reads, number of bases, overall percent GC content, average read length, position-specific percent variation of A/C/G/T, and levels of duplication and adapter contamination in the sequences. Most of these QC steps were developed in-house, but we also use broadly accepted QC software such as FastQC. The results of the QC pipeline (metrics and graphics) are posted on the secure FTP server. Sample QA/QC output is illustrated in Fig. 1.

    Fig. 1. QA/QC Analysis of an Illumina HiSeq2500 Run. Panels A: Quality Score vs Read Length; B: Percent Reads at >Q30 per Position; C: Percent A/G/C/T per Position; D: Percent GC of Reads; E: Percent Reads at Given Quality; F: Percent Read Duplicates (all reads or high quality reads).
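
    Several of the run-level metrics listed above can be computed in a single pass over the reads. The following is a minimal sketch, not the Core's in-house pipeline: the function names `parse_fastq` and `qc_summary` are hypothetical, and Phred+33 quality encoding and exact-sequence duplicate detection are assumptions made for illustration.

```python
from collections import Counter


def parse_fastq(lines):
    """Yield (sequence, quality_string) for each 4-line FASTQ record."""
    lines = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]


def qc_summary(records, phred_offset=33):
    """Compute simple run-level QC metrics from (sequence, quality) pairs.

    Assumes Phred+33 qualities; duplicates are reads with identical sequence.
    """
    n_reads = n_bases = gc = q30 = 0
    seen = Counter()      # how often each exact sequence occurs
    per_position = []     # one Counter of base counts per read position
    for seq, qual in records:
        seq = seq.upper()
        n_reads += 1
        n_bases += len(seq)
        gc += sum(1 for base in seq if base in "GC")
        q30 += sum(1 for ch in qual if ord(ch) - phred_offset >= 30)
        seen[seq] += 1
        for pos, base in enumerate(seq):
            if pos == len(per_position):
                per_position.append(Counter())
            per_position[pos][base] += 1
    duplicates = sum(count - 1 for count in seen.values())
    return {
        "reads": n_reads,
        "bases": n_bases,
        "mean_length": n_bases / n_reads if n_reads else 0.0,
        "percent_gc": 100.0 * gc / n_bases if n_bases else 0.0,
        "percent_q30": 100.0 * q30 / n_bases if n_bases else 0.0,
        "percent_duplicates": 100.0 * duplicates / n_reads if n_reads else 0.0,
        "per_position": per_position,
    }


example = ["@r1", "ACGT", "+", "IIII",
           "@r2", "ACGT", "+", "II!!",
           "@r3", "GGCC", "+", "IIII"]
summary = qc_summary(parse_fastq(example))
print(summary["reads"], summary["bases"], round(summary["percent_gc"], 2))
```

    The `per_position` counters hold the raw counts behind a plot such as panel C of Fig. 1; dividing each count by the number of reads reaching that position gives the percentage.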

    After basic QA/QC, bases are called and the reads are demultiplexed using Illumina’s proprietary CASAVA (Consensus Assessment of Sequence And VAriation) pipeline (for Illumina reads). The demultiplexing step separates the reads into discrete sequence files, one per sample.
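
    The idea behind demultiplexing can be sketched in a few lines. This is an illustration of the general technique only, not CASAVA's implementation: the function names, the one-mismatch tolerance, and the "undetermined" bin are assumptions made for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, sample_barcodes, max_mismatches=1):
    """Assign reads to samples by their index (barcode) sequence.

    reads: iterable of (read_id, index_sequence, read_sequence)
    sample_barcodes: dict mapping sample name -> expected barcode
    Reads matching zero or several barcodes go to "undetermined".
    """
    bins = {sample: [] for sample in sample_barcodes}
    bins["undetermined"] = []
    for read_id, index_seq, sequence in reads:
        hits = [
            sample
            for sample, barcode in sample_barcodes.items()
            if len(barcode) == len(index_seq)
            and hamming(barcode, index_seq) <= max_mismatches
        ]
        target = hits[0] if len(hits) == 1 else "undetermined"
        bins[target].append((read_id, sequence))
    return bins


barcodes = {"sampleA": "ACGTAC", "sampleB": "TTGCAA"}
reads = [
    ("r1", "ACGTAC", "AAAA"),  # exact match to sampleA
    ("r2", "ACGTAA", "CCCC"),  # one mismatch to sampleA
    ("r3", "GGGGGG", "TTTT"),  # matches nothing
    ("r4", "TTGCAA", "GGGG"),  # exact match to sampleB
]
for sample, assigned in demultiplex(reads, barcodes).items():
    print(sample, [read_id for read_id, _ in assigned])
```

    Sending ambiguous reads to a separate bin rather than guessing keeps cross-sample contamination out of the per-sample files.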

    Output files are provided as compressed FASTQ for Illumina’s HiSeq/MiSeq, SFF for Roche 454, etc., and can be downloaded securely from an FTP server.

    Thus, only high-quality, demultiplexed reads are made available to our users.

  2. Functional Analysis

    If requested, the Bioinformatics Core provides in-depth data analysis for a variety of sequencing protocols including but not limited to the following:

    1. Genome Assembly

      The MiSeq is the preferred platform for small-to-medium sized genome assembly projects, for both cost and efficiency. It provides 2 x 300 base paired-end reads, which are amenable to efficient assembly. The reads undergo thorough quality filtering as outlined above and trimming to eliminate sequences of clearly inappropriate length, followed by a low-complexity purge and poly-A/T clipping, before being assembled using CLC Bio Assembly Cell (or other) software. We also have extensive experience assembling Roche 454 reads with Roche’s proprietary Newbler software. The high-quality reads are mapped back to the assembly to calculate coverage metrics.
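
      Once reads are mapped back to the assembly, coverage metrics reduce to simple counting. The sketch below assumes alignments have already been reduced to (contig, 0-based start, aligned length) tuples; the function name `coverage_metrics` and the two metrics reported (mean depth and breadth) are illustrative choices, not necessarily those the Core reports.

```python
def coverage_metrics(contig_lengths, alignments):
    """Per-contig mean depth and breadth of coverage.

    contig_lengths: dict mapping contig name -> length in bases
    alignments: iterable of (contig, start, aligned_length), start 0-based
    """
    depth = {name: [0] * length for name, length in contig_lengths.items()}
    for contig, start, aligned_length in alignments:
        track = depth[contig]
        end = min(start + aligned_length, len(track))
        for pos in range(max(start, 0), end):
            track[pos] += 1
    metrics = {}
    for name, track in depth.items():
        if not track:  # skip zero-length contigs
            continue
        metrics[name] = {
            # average number of reads covering each base
            "mean_depth": sum(track) / len(track),
            # fraction of bases covered by at least one read
            "breadth": sum(1 for d in track if d > 0) / len(track),
        }
    return metrics


print(coverage_metrics({"contig1": 10}, [("contig1", 0, 5), ("contig1", 3, 5)]))
```

      A per-base list is fine for small genomes; for larger assemblies a difference array or an existing depth tool would be more memory-efficient.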

    2. Whole Genome/Exome Sequencing with Variant Calling

      Specialized Illumina kits are used for whole genome/exome sequencing of Homo sapiens, Mus musculus, and samples from other well-characterized genomes. Optimized in-house pipelines are in place for effective and accurate variant calling using the GATK (Genome Analysis Toolkit); the workflow involves read alignment, duplicate removal, indel realignment, base recalibration, SNP/indel calling, variant recalibration, and filtering. Variants can be further annotated using ANNOVAR or similar software.
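
      To illustrate one step of this workflow, the sketch below shows the idea behind duplicate removal: alignments sharing the same chromosome, start position, and strand are treated as copies of one original fragment, and only the best-scoring one is kept. This is a simplified stand-in, not the tool used in the Core's pipeline; the dictionary keys and the `score` field are assumptions made for the example.

```python
def mark_duplicates(alignments):
    """Return the names of reads flagged as duplicates.

    alignments: list of dicts with keys
        name, chrom, pos (5' start), strand ("+" or "-"), score
    Within each (chrom, pos, strand) group the highest-scoring read is
    kept; on ties the first one seen is kept.
    """
    best = {}
    for aln in alignments:
        key = (aln["chrom"], aln["pos"], aln["strand"])
        if key not in best or aln["score"] > best[key]["score"]:
            best[key] = aln
    kept = {aln["name"] for aln in best.values()}
    return {aln["name"] for aln in alignments if aln["name"] not in kept}


alignments = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "score": 60},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "score": 72},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "score": 50},
    {"name": "r4", "chrom": "chr2", "pos": 100, "strand": "+", "score": 40},
]
print(sorted(mark_duplicates(alignments)))
```

      Removing such duplicates before SNP/indel calling avoids counting PCR copies of one fragment as independent evidence for a variant.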

    3. RNA-Seq

      Reads from an RNA-Seq experiment are generally processed using the Tuxedo suite of tools, including TopHat2, Cufflinks, and Cuffdiff. The steps are exon-aware alignment, gene- and isoform-level FPKM expression measurement, and case-control differential expression testing. Results can also be visualized using CummeRbund.
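
      FPKM (fragments per kilobase of transcript per million mapped fragments) normalizes a raw fragment count by both transcript length and sequencing depth. The sketch below applies the textbook formula to made-up counts; Cufflinks itself estimates abundances with a statistical model, so its values will differ from this simple calculation. The function name and inputs are hypothetical.

```python
def fpkm(fragment_counts, lengths_bp):
    """Textbook FPKM: count * 1e9 / (length_bp * total_mapped_fragments).

    fragment_counts: dict mapping gene -> mapped fragment count
    lengths_bp: dict mapping gene -> transcript length in bases
    """
    total = sum(fragment_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in fragment_counts}
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in fragment_counts.items()
    }


counts = {"geneA": 500, "geneB": 1500}       # illustrative values only
lengths = {"geneA": 1000, "geneB": 3000}
print(fpkm(counts, lengths))
```

      In this example geneB has three times the fragments of geneA but is also three times as long, so both genes receive the same FPKM.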

    4. ChIP-Seq

      MACS2 is run for peak analysis of ChIP-Seq experiments. The pipeline performs alignment to a reference genome, cross-correlation analysis, and binding site prediction. The Integrative Genomics Viewer (IGV) is used for visualization.
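
      Strand cross-correlation rests on the observation that forward-strand and reverse-strand read starts pile up on opposite sides of a binding site, separated by roughly the fragment length. The sketch below computes the Pearson correlation between the two strand profiles at each shift and reports the best one. It is a toy illustration, not the implementation used by MACS2 or by the Core's pipeline; all function names are hypothetical and both profiles are assumed to have equal length.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(x)
    if n == 0:
        return 0.0
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(plus, minus, max_shift):
    """Correlate plus[i] with minus[i + shift] for shift = 0..max_shift.

    plus, minus: per-position counts of read 5' ends on each strand.
    """
    return {
        shift: pearson(plus[: len(plus) - shift], minus[shift:])
        for shift in range(max_shift + 1)
    }


def estimate_fragment_length(plus, minus, max_shift):
    """Shift with the highest strand cross-correlation."""
    scores = strand_cross_correlation(plus, minus, max_shift)
    return max(scores, key=scores.get)


plus_starts = [0, 0, 5, 0, 0, 0, 0, 0, 0, 0]
minus_starts = [0, 0, 0, 0, 0, 5, 0, 0, 0, 0]
print(estimate_fragment_length(plus_starts, minus_starts, max_shift=5))
```

      A strong correlation peak at a plausible fragment length is commonly read as a sign of good enrichment, while a peak only at the read length suggests a weak or failed ChIP.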

    5. microRNA

      As microRNAs tend to be in the 20-50 nt size range, i.e., shorter than a standard read length, customized trimming is required before any downstream analysis. Our pipeline clips off any Illumina adapter sequences before genome alignment, or before miRNA quantification by alignment against miRBase.
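
      Because the insert is shorter than the read, the sequencer reads through the miRNA into the 3' adapter, which then has to be clipped. The sketch below shows exact-match 3' adapter clipping, including the case where only the start of the adapter fits at the end of the read. It is a simplified illustration rather than the Core's pipeline: the function name, the `min_overlap` parameter, and the adapter sequence used in the example are assumptions, and production trimmers also tolerate sequencing errors in the adapter.

```python
def clip_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter from a read using exact matching.

    If the full adapter occurs, everything from its start is removed.
    Otherwise, if the read ends with at least `min_overlap` leading bases
    of the adapter, that partial adapter is removed.
    """
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    return read


# Adapter sequence assumed for illustration; use the one for your library kit.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"
mirna = "TGAGGTAGTAGGTTGTATAGTT"          # a 22 nt insert
full = mirna + ADAPTER + "AAAAAAA"        # read runs through the whole adapter
partial = mirna + ADAPTER[:8]             # read ends inside the adapter

print(clip_adapter(full, ADAPTER) == mirna)
print(clip_adapter(partial, ADAPTER) == mirna)
```

      After clipping, reads that are too short or too long for a plausible miRNA can be discarded before alignment.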

    6. CustomSeq

      The BCCL also provides customized bioinformatics support, such as:

      1. Aligning raw FASTQ reads using Bowtie2/BWA/CLC Bio to generate SAM/BAM files
      2. Gene calling and annotation of bacterial or small eukaryotic genomes
      3. Running BLAST with NCBI nt/nr databases