Finding novel transcripts/genes with TopHat and Cufflinks
This example will use a small portion of a genome with incomplete annotation
Upload reference, annotation, and fastq files to Galaxy
- insect_genome.fa (fasta format)
- insect.gtf (gtf format)
- treated1_1.fastq (fastqsanger format)
- treated1_2.fastq (fastqsanger format)
- treated2_1.fastq (fastqsanger format)
- treated2_2.fastq (fastqsanger format)
- untreated1_1.fastq (fastqsanger format)
- untreated1_2.fastq (fastqsanger format)
- untreated2_1.fastq (fastqsanger format)
- untreated2_2.fastq (fastqsanger format)
Example of uploading gtf file:
Run TopHat on each pair of fastq files (4 runs)
TopHat will align reads to the reference fasta.
Use default parameters, except for “Mean Inner Distance between Mate Pairs”:
- treated1 – 110
- treated2 – 102
- untreated1 – 109
- untreated2 – 99
Example of a TopHat run:
Rename TopHat output files
Change attributes so name reflects sample number:
Run Cufflinks on each TopHat “accepted_hits” file (4 runs)
Cufflinks will assemble transcripts based on pattern of reads aligned.
Note that we are using the existing annotation (gtf file) as a guide.
Example of a Cufflinks run:
Run Cuffmerge on all the Cufflinks output (1 run)
Cuffmerge will merge the Cufflinks assemblies,
again using the existing gtf annotation as a guide.
Run Cuffcompare on the Cuffmerge output
This will compare the newly assembled transcripts to the original annotation.
Take a look at the cuffcompare transcript accuracy output:
Sn = “sensitivity” = true positives / (true positives + false negatives)
and represents, for example, the percentage of coding bases that have been
correctly predicted as coding.
Sp = “specificity” = true positives / (true positives + true negatives)
and represents, for example, the percentage of noncoding bases that have been
correctly predicted as non-coding.
(Burset and Guigo, 1996, Evaluation of Gene Structure Prediction Programs, Genomics(34):353-367)
Run Cuffdiff to identify differentially expressed genes
Here we will compare untreated1 vs. treated1 without including the other replicates.
If you have time, try re-running with all the samples to see if the gene list changes.
Remember to change Min Alignment Count from 1000 -> 10
Sort the Cuffdiff output by q-value
The file we’ll be sorting is “gene differential expression testing”.
Differentially expressed genes are tagged with “yes” in the far-right column.
The names of the genes are in column 3 (e.g., ATPsyn-beta, et al.).
Entries that only have “-” in column 3 are potential novel genes.
Exploring the revised transcript set
If time allows, you can generate a fasta file with all the transcripts in your new gtf file
(cuffcompare combined transcripts gtf file). Try using the “Extract Genomic DNA” tool in
Galaxy. (Note: the fasta headers will have genomic locations to identify the sequences.)
You can then download the resulting fasta file and blast some of the novel sequences.
Identifying novel transcripts using IGV
These files are needed for upload to IGV. You may need to download them from Galaxy.
- Cuffcompare: combined transcripts gtf file
- untreated1: accepted_hits (TopHat output, download both .bam and .bai files)
- treated1: accepted_hits (TopHat output, download both .bam and .bai files)
Start IGV and import the insect_genome.fa (File / Import Genome).
Include insect.gtf as the “Gene file”
Load the new files from Galaxy (File/ Load from File) – the Cuffcompare gtf file,
and both bam files (the corresponding bai files will load automatically).
Zoom to the coordinates of the differentially expressed genes (see sorted toptable), including those shown below: