Overview

Assembly and annotation of Mnemiopsis leidyi transcriptome.

Data Sources

Time point Year Collected Replicate Name left(1)/right(2) Filename
1hpf 2012 N7 1 1hpf_2012_N7_1.fastq.gz
1hpf 2012 N7 2 1hpf_2012_N7_2.fastq.gz
1hpf 2013 21 1 1hpf_2013_21_1.fastq.gz
1hpf 2013 21 2 1hpf_2013_21_2.fastq.gz
1hpf 2013 41 1 1hpf_2013_41_1.fastq.gz
1hpf 2013 41 2 1hpf_2013_41_2.fastq.gz
2hpf 2012 N6 1 2hpf_2012_N6_1.fastq.gz
2hpf 2012 N6 2 2hpf_2012_N6_2.fastq.gz
2hpf 2013 22 1 2hpf_2013_22_1.fastq.gz
2hpf 2013 22 2 2hpf_2013_22_2.fastq.gz
2hpf 2013 32 1 2hpf_2013_32_1.fastq.gz
2hpf 2013 32 2 2hpf_2013_32_2.fastq.gz
2hpf 2013 42 1 2hpf_2013_42_1.fastq.gz
2hpf 2013 42 2 2hpf_2013_42_2.fastq.gz
3hpf 2012 N5 1 3hpf_2012_N5_1.fastq.gz
3hpf 2012 N5 2 3hpf_2012_N5_2.fastq.gz
3hpf 2013 23 1 3hpf_2013_23_1.fastq.gz
3hpf 2013 23 2 3hpf_2013_23_2.fastq.gz
3hpf 2013 33 1 3hpf_2013_33_1.fastq.gz
3hpf 2013 33 2 3hpf_2013_33_2.fastq.gz
4hpf 2012 N4 1 4hpf_2012_N4_1.fastq.gz
4hpf 2012 N4 2 4hpf_2012_N4_2.fastq.gz
4hpf 2013 24 1 4hpf_2013_24_1.fastq.gz
4hpf 2013 24 2 4hpf_2013_24_2.fastq.gz
4hpf 2013 34 1 4hpf_2013_34_1.fastq.gz
4hpf 2013 34 2 4hpf_2013_34_2.fastq.gz
5hpf 2012 14 1 5hpf_2012_14_1.fastq.gz
5hpf 2012 14 2 5hpf_2012_14_2.fastq.gz
5hpf 2012 N3 1 5hpf_2012_N3_1.fastq.gz
5hpf 2012 N3 2 5hpf_2012_N3_2.fastq.gz
5hpf 2013 25 1 5hpf_2013_25_1.fastq.gz
5hpf 2013 25 2 5hpf_2013_25_2.fastq.gz
6hpf 2012 23 1 6hpf_2012_23_1.fastq.gz
6hpf 2012 23 2 6hpf_2012_23_2.fastq.gz
6hpf 2012 N2 1 6hpf_2012_N2_1.fastq.gz
6hpf 2012 N2 2 6hpf_2012_N2_2.fastq.gz
6hpf 2013 26 1 6hpf_2013_26_1.fastq.gz
6hpf 2013 26 2 6hpf_2013_26_2.fastq.gz
6hpf 2013 46 1 6hpf_2013_46_1.fastq.gz
6hpf 2013 46 2 6hpf_2013_46_2.fastq.gz
7hpf 2012 N1 1 7hpf_2012_N1_1.fastq.gz
7hpf 2012 N1 2 7hpf_2012_N1_2.fastq.gz
7hpf 2013 17 1 7hpf_2013_17_1.fastq.gz
7hpf 2013 17 2 7hpf_2013_17_2.fastq.gz
7hpf 2013 27 1 7hpf_2013_27_1.fastq.gz
7hpf 2013 27 2 7hpf_2013_27_2.fastq.gz
7hpf 2013 47 1 7hpf_2013_47_1.fastq.gz
7hpf 2013 47 2 7hpf_2013_47_2.fastq.gz
8hpf 2012 18 1 8hpf_2012_18_1.fastq.gz
8hpf 2012 18 2 8hpf_2012_18_2.fastq.gz
8hpf 2012 N8B 1 8hpf_2012_N8B_1.fastq.gz
8hpf 2012 N8B 2 8hpf_2012_N8B_2.fastq.gz
8hpf 2013 18 1 8hpf_2013_18_1.fastq.gz
8hpf 2013 18 2 8hpf_2013_18_2.fastq.gz
8hpf 2013 28 1 8hpf_2013_28_1.fastq.gz
8hpf 2013 28 2 8hpf_2013_28_2.fastq.gz
8hpf 2013 48 1 8hpf_2013_48_1.fastq.gz
8hpf 2013 48 2 8hpf_2013_48_2.fastq.gz
9hpf 2012 37 1 9hpf_2012_37_1.fastq.gz
9hpf 2012 37 2 9hpf_2012_37_2.fastq.gz
9hpf 2013 49 1 9hpf_2013_49_1.fastq.gz
9hpf 2013 49 2 9hpf_2013_49_2.fastq.gz

Assembly

Left and right reads were separately concatenated and then assembled with Trinity version trinityrnaseq_r20140717 (Grabherr et al. 2011).

Trinity \  
--normalize_reads \  
--max_memory 500G \  
--CPU 16 \  
--SS_lib_type FR \  
--left \  
lefts.fastq \  
--right \  
rights.fastq  

Initial Assembly Statistics

Number of Sequences: 233,327
Total Length: 160,677,750
Average Length: 688
Longest Sequence: 29,348
Shortest Sequence: 201
%GC: 40%
N50: 1,152
N90: 266

Sequence Cleaning

Adapter and contaminate was removed from assembly using SeqClean (Cleaner 2017). Additionally, sequences shorter than 300 bp were discarded.

seqclean \  
Mnemiopsis_leidyi_trinity_20140802_unfiltered.nt \  
-v UniVec \  
-l 300 \  
-c 16 \  
-s Hs_GRCh38.nt,ecoli.nt  
Output:
**************************************************
Sequences analyzed:    233327
-----------------------------------
                   valid:    141346  (24371 trimmed)
                 trashed:     91981
**************************************************
----= Trashing summary =------
               by 'short':    90500
        by 'Hs_GRCh38.nt':       11
              by 'shortq':      140
            by 'ecoli.nt':       17
              by 'UniVec':     1313
------------------------------

Cleaned Assembly Statistics

Number of Sequences: 141,346
Total Length: 137,906,212
Average Length: 975
Longest Sequence: 29,348
Shortest Sequence: 300
%GC: 41%
N50: 1,486
N90: 405

Genome Filtering

In order to produce a more accurate transcriptome set, we filtered out sequences not present in the available M. leidyi genome. We used the very permissive requirement that any transcript aligning with at least 90% identity over any length of genomic sequence ought to be retained.

Genome Filtered Transcript Statistics

Number of Sequences: 44,438
Total Length: 62,917,788
Average Length: 1,415
Longest Sequence: 29,348
Shortest Sequence: 300
%GC: 41%
%N: 0%
N50: 2,168
N90: 601

Additional Sequence Cleaning

NCBI detected additional contaminate in our sequences, primarily vector/adapter. We sequentially ran SeqClean to remove concatenated adapter.

Cleaned Sequence Statistics

Number of Sequences: 44,388
Total Length: 62,859,653
Average Length: 1,416
Longest Sequence: 29,348
Shortest Sequence: 300
%GC: 41%
%N: 0%
N50: 2,167
N90: 601

Transcript collapse

For RNAseq we wished to create a non-redundant set of transcripts by collapsing isoforms using CD-HIT (Li, Jaroszewski, and Godzik 2001).

cd-hit-est \  
-i in_genome_clean_r6.nt \  
-o cdhit.nt \  
-r 0 \  
-c .95 \  
-G 0 \  
-aS .5 \  
-M 0 \  
-T 0  

Non-redundant Sequence Statistics

Number of Sequences: 31,067
Total Length: 47,344,147
Average Length: 1,523
Longest Sequence: 29,348
Shortest Sequence: 300
%GC: 41%
%N: 0%
N50: 2,200
N90: 692

Sequence Renaming

Sequences were renamed with a “Mle” prefix and zero padded numbers. Names are in the format of ‘Mle_000000’.

Annotation

Swissprot / Uniprot

Best BLASTX hits to Swissprot with a maximum evalue of 0.001 were used to annotate genes (Camacho et al. 2009).

blastx \
-query Mleidyi_20160525.nt \  
-db swissprot \  
-max_target_seqs 1 \  
-max_hsps 1 \  
-evalue .001 \  
-outfmt 6 \  
-out ml_v_sp.blastx \  
-num_threads 32

Pfam

Sequences were translated with Transdecoder and then domains were assigned to transcripts using hmmscan against the Pfam-A database (Eddy 2011).
Transdecoder.LongOrfs \  
-S \
-m 60 \
-t Mleidyi_20160525.aa.fa

All ORFs of 60aa or greater were retained.

hmmscan \  
-o Mleidyi_20160525.aa.fa.pfam.out \  
--domtblout Mleidyi_20160525.aa.fa.pfam.table \  
-E .01 \  
--domE .01 \  
--cpu 16 \  
../db/Pfam-A.hmm \  

awk '{print $4 "\t" $2}' Mleidyi_20160525.aa.fa.pfam.table | perl -p -e 's/(Mle_\d+)\|.*(\t.+)/$1$2/' | sort -u > mleidyi2pfam.txt

Software

Program Version
Trinity trinityrnaseq_r20140717
Seqclean Downloaded Feb 22, 2011
BLAST+ 2.3.0
Transdecoder 2.0.1
cd-hit CD-HIT version 4.6
HMMeR 3.1b2

Accession Numbers

BioProject: PRJNA344880
TSA: GFAT00000000

References

Camacho, Christiam, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L Madden. 2009. “BLAST+: Architecture and Applications.” BMC Bioinformatics 10 (1). Springer Nature: 421. doi:10.1186/1471-2105-10-421.

Cleaner. 2017. “Sequence Cleaner.” SourceForge. https://sourceforge.net/projects/seqclean/.

Eddy, Sean R. 2011. “Accelerated Profile HMM Searches.” Edited by William R. Pearson. PLoS Computational Biology 7 (10). Public Library of Science (PLoS): e1002195. doi:10.1371/journal.pcbi.1002195.

Grabherr, Manfred G, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, et al. 2011. “Full-Length Transcriptome Assembly from RNA-Seq Data Without a Reference Genome.” Nature Biotechnology 29 (7). Springer Nature: 644–52. doi:10.1038/nbt.1883.

Li, W., L. Jaroszewski, and A. Godzik. 2001. “Clustering of Highly Homologous Sequences to Reduce the Size of Large Protein Databases.” Bioinformatics 17 (3). Oxford University Press (OUP): 282–83. doi:10.1093/bioinformatics/17.3.282.

Privacy Policy

Terms of Use