Overview

Assembly and annotation of the Mnemiopsis leidyi transcriptome.

Data Sources

Time point	Year Collected	Replicate Name	left(1)/right(2)	Filename
1hpf	2012	N7	1	1hpf_2012_N7_1.fastq.gz
1hpf	2012	N7	2	1hpf_2012_N7_2.fastq.gz
1hpf	2013	21	1	1hpf_2013_21_1.fastq.gz
1hpf	2013	21	2	1hpf_2013_21_2.fastq.gz
1hpf	2013	41	1	1hpf_2013_41_1.fastq.gz
1hpf	2013	41	2	1hpf_2013_41_2.fastq.gz
2hpf	2012	N6	1	2hpf_2012_N6_1.fastq.gz
2hpf	2012	N6	2	2hpf_2012_N6_2.fastq.gz
2hpf	2013	22	1	2hpf_2013_22_1.fastq.gz
2hpf	2013	22	2	2hpf_2013_22_2.fastq.gz
2hpf	2013	32	1	2hpf_2013_32_1.fastq.gz
2hpf	2013	32	2	2hpf_2013_32_2.fastq.gz
2hpf	2013	42	1	2hpf_2013_42_1.fastq.gz
2hpf	2013	42	2	2hpf_2013_42_2.fastq.gz
3hpf	2012	N5	1	3hpf_2012_N5_1.fastq.gz
3hpf	2012	N5	2	3hpf_2012_N5_2.fastq.gz
3hpf	2013	23	1	3hpf_2013_23_1.fastq.gz
3hpf	2013	23	2	3hpf_2013_23_2.fastq.gz
3hpf	2013	33	1	3hpf_2013_33_1.fastq.gz
3hpf	2013	33	2	3hpf_2013_33_2.fastq.gz
4hpf	2012	N4	1	4hpf_2012_N4_1.fastq.gz
4hpf	2012	N4	2	4hpf_2012_N4_2.fastq.gz
4hpf	2013	24	1	4hpf_2013_24_1.fastq.gz
4hpf	2013	24	2	4hpf_2013_24_2.fastq.gz
4hpf	2013	34	1	4hpf_2013_34_1.fastq.gz
4hpf	2013	34	2	4hpf_2013_34_2.fastq.gz
5hpf	2012	14	1	5hpf_2012_14_1.fastq.gz
5hpf	2012	14	2	5hpf_2012_14_2.fastq.gz
5hpf	2012	N3	1	5hpf_2012_N3_1.fastq.gz
5hpf	2012	N3	2	5hpf_2012_N3_2.fastq.gz
5hpf	2013	25	1	5hpf_2013_25_1.fastq.gz
5hpf	2013	25	2	5hpf_2013_25_2.fastq.gz
6hpf	2012	23	1	6hpf_2012_23_1.fastq.gz
6hpf	2012	23	2	6hpf_2012_23_2.fastq.gz
6hpf	2012	N2	1	6hpf_2012_N2_1.fastq.gz
6hpf	2012	N2	2	6hpf_2012_N2_2.fastq.gz
6hpf	2013	26	1	6hpf_2013_26_1.fastq.gz
6hpf	2013	26	2	6hpf_2013_26_2.fastq.gz
6hpf	2013	46	1	6hpf_2013_46_1.fastq.gz
6hpf	2013	46	2	6hpf_2013_46_2.fastq.gz
7hpf	2012	N1	1	7hpf_2012_N1_1.fastq.gz
7hpf	2012	N1	2	7hpf_2012_N1_2.fastq.gz
7hpf	2013	17	1	7hpf_2013_17_1.fastq.gz
7hpf	2013	17	2	7hpf_2013_17_2.fastq.gz
7hpf	2013	27	1	7hpf_2013_27_1.fastq.gz
7hpf	2013	27	2	7hpf_2013_27_2.fastq.gz
7hpf	2013	47	1	7hpf_2013_47_1.fastq.gz
7hpf	2013	47	2	7hpf_2013_47_2.fastq.gz
8hpf	2012	18	1	8hpf_2012_18_1.fastq.gz
8hpf	2012	18	2	8hpf_2012_18_2.fastq.gz
8hpf	2012	N8B	1	8hpf_2012_N8B_1.fastq.gz
8hpf	2012	N8B	2	8hpf_2012_N8B_2.fastq.gz
8hpf	2013	18	1	8hpf_2013_18_1.fastq.gz
8hpf	2013	18	2	8hpf_2013_18_2.fastq.gz
8hpf	2013	28	1	8hpf_2013_28_1.fastq.gz
8hpf	2013	28	2	8hpf_2013_28_2.fastq.gz
8hpf	2013	48	1	8hpf_2013_48_1.fastq.gz
8hpf	2013	48	2	8hpf_2013_48_2.fastq.gz
9hpf	2012	37	1	9hpf_2012_37_1.fastq.gz
9hpf	2012	37	2	9hpf_2012_37_2.fastq.gz
9hpf	2013	49	1	9hpf_2013_49_1.fastq.gz
9hpf	2013	49	2	9hpf_2013_49_2.fastq.gz

Assembly

Left and right reads were separately concatenated and then assembled with Trinity version trinityrnaseq_r20140717 (Grabherr et al. 2011).

Trinity \
--normalize_reads \
--max_memory 500G \
--CPU 16 \
--SS_lib_type FR \
--left \
lefts.fastq \
--right \
rights.fastq
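
The concatenation step itself is not shown here. A minimal sketch, assuming the files listed under Data Sources sit in one directory and follow the `_1`/`_2` naming shown there; the function name `concat_mate` and the `reads` directory are hypothetical. Both mates are expanded in the same sorted order, which keeps read pairs in step between the two output files.

```shell
# Hypothetical helper: decompress and concatenate every FASTQ file for one
# mate (1 = left, 2 = right) found in a directory.
# The shell expands the glob in sorted order, so both mates visit the
# samples in the same order and read pairs stay in sync.
concat_mate() (
    dir=$1 mate=$2 out=$3
    : > "$out"                          # create or truncate the output file
    for f in "$dir"/*_"$mate".fastq.gz; do
        [ -e "$f" ] || continue         # the glob matched nothing
        gzip -dc "$f" >> "$out"
    done
)

# Example (paths are assumptions):
#   concat_mate reads 1 lefts.fastq
#   concat_mate reads 2 rights.fastq
```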

Initial Assembly Statistics

Number of Sequences:	233,327
Total Length:	160,677,750
Average Length:	688
Longest Sequence:	29,348
Shortest Sequence:	201
%GC:	40%
N50:	1,152
N90:	266
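
The tool used to compute these statistics is not stated. As a hedged illustration of what N50 and N90 mean: Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the total assembly length. The function names `fasta_lengths` and `nx` below are hypothetical, and the sketch assumes a plain FASTA file as input.

```shell
# Print one sequence length per line for a (possibly multi-line) FASTA file.
fasta_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Read lengths on stdin and print Nx for the percentage given as $1:
# walk the lengths from longest to shortest and stop at the first length
# where the running sum reaches x% of the total.
nx() {
    sort -nr | awk -v x="$1" '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += len[i]
                if (run * 100 >= total * x) { print len[i]; exit }
            }
        }'
}

# Example (file name is an assumption):
#   fasta_lengths assembly.fasta | nx 50
```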

Sequence Cleaning

Adapter and contaminant sequences were removed from the assembly using SeqClean (Cleaner 2017). Additionally, sequences shorter than 300 bp were discarded.

seqclean \
Mnemiopsis_leidyi_trinity_20140802_unfiltered.nt \
-v UniVec \
-l 300 \
-c 16 \
-s Hs_GRCh38.nt,ecoli.nt

Output:
**************************************************
Sequences analyzed:    233327
-----------------------------------
                   valid:    141346  (24371 trimmed)
                 trashed:     91981
**************************************************
----= Trashing summary =------
               by 'short':    90500
        by 'Hs_GRCh38.nt':       11
              by 'shortq':      140
            by 'ecoli.nt':       17
              by 'UniVec':     1313
------------------------------
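
As a quick consistency check on the summary above, the per-category trash counts should add up to the trashed total, and valid plus trashed should equal the number of sequences analyzed. The variable names are illustrative only.

```shell
# Counts copied from the SeqClean output above.
# short + Hs_GRCh38.nt + shortq + ecoli.nt + UniVec
trashed=$((90500 + 11 + 140 + 17 + 1313))
# valid + trashed
analyzed=$((141346 + trashed))
echo "trashed=$trashed analyzed=$analyzed"
# prints: trashed=91981 analyzed=233327
```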

Cleaned Assembly Statistics

Number of Sequences:	141,346
Total Length:	137,906,212
Average Length:	975
Longest Sequence:	29,348
Shortest Sequence:	300
%GC:	41%
N50:	1,486
N90:	405

Genome Filtering

To produce a more accurate transcript set, we removed sequences not present in the available M. leidyi genome. The criterion was deliberately permissive: any transcript aligning to genomic sequence with at least 90% identity, over any alignment length, was retained.
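
The alignment command is not given in this document. As an illustration only, assuming the transcripts were aligned to the genome with a tool that writes BLAST tabular output (`-outfmt 6`, where column 1 is the query ID and column 3 is the percent identity), the retention rule could be applied as sketched below. The function names and file names are hypothetical.

```shell
# List query IDs that have at least one hit at or above a minimum
# percent identity. $1 = tabular hits file, $2 = minimum identity.
ids_with_hit() {
    awk -F '\t' -v min="$2" '$3 + 0 >= min + 0 { print $1 }' "$1" | sort -u
}

# Print only the FASTA records whose ID (first word of the header,
# without ">") appears in the ID list. $1 = ID list, $2 = FASTA file.
filter_fasta() {
    awk -v ids="$1" '
        BEGIN { while ((getline line < ids) > 0) keep[line] = 1 }
        /^>/  { id = substr($1, 2); on = (id in keep) }
        on' "$2"
}

# Example (file names are assumptions):
#   ids_with_hit transcripts_v_genome.tsv 90 > keep_ids.txt
#   filter_fasta keep_ids.txt transcripts.nt > in_genome.nt
```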

Genome Filtered Transcript Statistics

Number of Sequences:	44,438
Total Length:	62,917,788
Average Length:	1,415
Longest Sequence:	29,348
Shortest Sequence:	300
%GC:	41%
%N:	0%
N50:	2,168
N90:	601

Additional Sequence Cleaning

NCBI detected additional contaminants in our sequences, primarily vector and adapter sequence. We ran SeqClean in successive rounds to remove concatenated adapters.

Cleaned Sequence Statistics

Number of Sequences:	44,388
Total Length:	62,859,653
Average Length:	1,416
Longest Sequence:	29,348
Shortest Sequence:	300
%GC:	41%
%N:	0%
N50:	2,167
N90:	601

Transcript collapse

For RNA-seq analysis, we created a non-redundant set of transcripts by collapsing isoforms with CD-HIT (Li, Jaroszewski, and Godzik 2001).

cd-hit-est \
-i in_genome_clean_r6.nt \
-o cdhit.nt \
-r 0 \
-c .95 \
-G 0 \
-aS .5 \
-M 0 \
-T 0
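
In this command, `-c .95` sets the identity threshold, `-G 0` selects local rather than global identity, `-aS .5` requires the alignment to cover at least half of the shorter sequence, `-r 0` restricts comparison to the same strand, and `-M 0` and `-T 0` lift the memory limit and use all available CPUs. CD-HIT also writes a cluster report next to the output FASTA (here `cdhit.nt.clstr`). A small sketch for summarizing that report follows; the function name `clstr_summary` is hypothetical.

```shell
# Summarize a CD-HIT .clstr file: the total number of clusters and the
# number of clusters holding more than one sequence.
# Each cluster starts with a ">Cluster N" line, followed by one line
# per member sequence.
clstr_summary() {
    awk '/^>Cluster/ { total++; if (n > 1) multi++; n = 0; next }
         NF          { n++ }
         END         { if (n > 1) multi++
                       printf "clusters=%d multi_member=%d\n", total, multi }' "$1"
}

# Example:
#   clstr_summary cdhit.nt.clstr
```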

Non-redundant Sequence Statistics

Number of Sequences:	31,067
Total Length:	47,344,147
Average Length:	1,523
Longest Sequence:	29,348
Shortest Sequence:	300
%GC:	41%
%N:	0%
N50:	2,200
N90:	692

Sequence Renaming

Sequences were renamed with an “Mle” prefix followed by a zero-padded number, in the format ‘Mle_000000’.
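
The renaming script is not included here. A minimal sketch of one way to produce names in this format, assuming numbering starts at 1 and follows the input order (neither is stated in this document); the function name `rename_fasta` and the file names are hypothetical.

```shell
# Replace each FASTA header with "<prefix>_<six-digit counter>".
# Assumption: the counter starts at 1 and follows input order.
# $1 = input FASTA, $2 = prefix (for example Mle).
rename_fasta() {
    awk -v prefix="$2" '
        /^>/ { printf ">%s_%06d\n", prefix, ++n; next }
        { print }' "$1"
}

# Example (file names are assumptions):
#   rename_fasta cdhit.nt Mle > renamed.nt
```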

Annotation

Swiss-Prot / UniProt

Best BLASTX hits to Swiss-Prot, with a maximum E-value of 0.001, were used to annotate genes (Camacho et al. 2009).

blastx \
-query Mleidyi_20160525.nt \
-db swissprot \
-max_target_seqs 1 \
-max_hsps 1 \
-evalue .001 \
-outfmt 6 \
-out ml_v_sp.blastx \
-num_threads 32
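
With `-outfmt 6`, each output line has twelve tab-separated columns, of which column 1 is the query ID, column 2 the subject ID, and column 11 the E-value. A short sketch for turning the result into a transcript-to-hit table, keeping the first line reported for each query; the function name `blast_best_hits` is hypothetical.

```shell
# Reduce BLAST tabular output to: query ID, subject ID, E-value.
# Only the first line seen for each query is kept.
blast_best_hits() {
    awk -F '\t' '!seen[$1]++ { print $1 "\t" $2 "\t" $11 }' "$1"
}

# Example:
#   blast_best_hits ml_v_sp.blastx > ml_v_sp.best.tsv
```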

Pfam

Sequences were translated with TransDecoder, and domains were then assigned to transcripts using hmmscan against the Pfam-A database (Eddy 2011).

TransDecoder.LongOrfs \
-S \
-m 60 \
-t Mleidyi_20160525.aa.fa

All ORFs of 60 aa or longer were retained.

hmmscan \
-o Mleidyi_20160525.aa.fa.pfam.out \
--domtblout Mleidyi_20160525.aa.fa.pfam.table \
-E .01 \
--domE .01 \
--cpu 16 \
../db/Pfam-A.hmm \


awk '!/^#/ {print $4 "\t" $2}' Mleidyi_20160525.aa.fa.pfam.table | perl -p -e 's/(Mle_\d+)\|.*(\t.+)/$1$2/' | sort -u > mleidyi2pfam.txt

Software

Program	Version
Trinity	trinityrnaseq_r20140717
SeqClean	Downloaded Feb 22, 2011
BLAST+	2.3.0
TransDecoder	2.0.1
CD-HIT	4.6
HMMER	3.1b2

Accession Numbers

BioProject: PRJNA344880
TSA: GFAT00000000

References

Camacho, Christiam, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L Madden. 2009. “BLAST+: Architecture and Applications.” BMC Bioinformatics 10 (1). Springer Nature: 421. doi:10.1186/1471-2105-10-421.

Cleaner. 2017. “Sequence Cleaner.” SourceForge. https://sourceforge.net/projects/seqclean/.

Eddy, Sean R. 2011. “Accelerated Profile HMM Searches.” Edited by William R. Pearson. PLoS Computational Biology 7 (10). Public Library of Science (PLoS): e1002195. doi:10.1371/journal.pcbi.1002195.

Grabherr, Manfred G, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, et al. 2011. “Full-Length Transcriptome Assembly from RNA-Seq Data Without a Reference Genome.” Nature Biotechnology 29 (7). Springer Nature: 644–52. doi:10.1038/nbt.1883.

Li, W., L. Jaroszewski, and A. Godzik. 2001. “Clustering of Highly Homologous Sequences to Reduce the Size of Large Protein Databases.” Bioinformatics 17 (3). Oxford University Press (OUP): 282–83. doi:10.1093/bioinformatics/17.3.282.

Transcriptome

Eric Ross