Install and test Trinity, the de novo assembly tool

This article describes how to install and test Trinity, the RNA-Seq de novo assembly tool, on macOS [1].

Install

Trinity's .travis.yml [2] is a good reference for the required build steps.

First, we need a compiler that supports OpenMP [3] (Open Multi-Processing). The default gcc on macOS is actually Apple clang, which does not support OpenMP out of the box. GCC 8 (gcc-8, g++-8), the latest release at the time of writing, does support it.

So we use GCC 8 here.

$ brew install gcc@8
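
To confirm that a given compiler really accepts OpenMP code, a small check like the following may help. It is only a sketch: the function name supports_openmp and the candidate compiler names in the loop are illustrative assumptions, not part of Trinity or Homebrew.

```shell
# Return 0 if the given compiler can build a tiny OpenMP program, non-zero otherwise.
# supports_openmp is a hypothetical helper name used only for this example.
supports_openmp() {
  compiler="$1"
  # A compiler that is not on PATH cannot support anything.
  command -v "$compiler" >/dev/null 2>&1 || return 1
  tmpdir=$(mktemp -d) || return 1
  printf '%s\n' \
    '#include <omp.h>' \
    'int main(void) { return omp_get_max_threads() > 0 ? 0 : 1; }' \
    > "$tmpdir/omp_check.c"
  # -fopenmp is the GCC flag that enables OpenMP.
  "$compiler" -fopenmp "$tmpdir/omp_check.c" -o "$tmpdir/omp_check" >/dev/null 2>&1
  status=$?
  rm -rf "$tmpdir"
  return "$status"
}

# Compare the default gcc with gcc-8 (names assumed; adjust to your setup).
for cc in gcc gcc-8; do
  if supports_openmp "$cc"; then
    echo "$cc: OpenMP OK"
  else
    echo "$cc: no OpenMP support (or not installed)"
  fi
done
```

On a Mac where gcc is Apple clang, the first line would be expected to report no support and the second to report OK once GCC 8 is installed.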

Then install htslib, Samtools, Bowtie2, Jellyfish, Salmon and numpy as dependency packages.
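
Before building, it can be useful to check which of these dependencies are already available. The sketch below assumes the usual executable names (bgzip standing in for htslib, and python as the interpreter that has numpy); these names and the helper missing_commands are assumptions for illustration and may differ on your system.

```shell
# Print each command from the argument list that is not found on PATH.
# missing_commands is a hypothetical helper name used only for this example.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Executable names are assumptions: htslib is represented by its bgzip tool.
missing=$(missing_commands bgzip samtools bowtie2 jellyfish salmon python)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names are joined on one line.
  echo "Not found on PATH:" $missing
else
  echo "All dependency commands found"
fi

# numpy is a Python module, so it is checked by importing it.
python -c 'import numpy' 2>/dev/null || echo "numpy is not importable"
```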

Install Trinity from the source code.

$ git clone git@github.com:trinityrnaseq/trinityrnaseq.git

$ cd trinityrnaseq

$ CC=gcc-8 CXX=g++-8 make

$ CC=gcc-8 CXX=g++-8 make plugins

$ ./Trinity --help



###############################################################################
#

     ______  ____   ____  ____   ____  ______  __ __
    |      ||    \ |    ||    \ |    ||      ||  |  |
    |      ||  D  ) |  | |  _  | |  | |      ||  |  |
    |_|  |_||    /  |  | |  |  | |  | |_|  |_||  ~  |
      |  |  |    \  |  | |  |  | |  |   |  |  |___, |
      |  |  |  .  \ |  | |  |  | |  |   |  |  |     |
      |__|  |__|\_||____||__|__||____|  |__|  |____/

    Trinity-v2.8.4


#
#
# Required:
#
#  --seqType <string>      :type of reads: ('fa' or 'fq')
#
#  --max_memory <string>      :suggested max memory to use by Trinity where limiting can be enabled. (jellyfish, sorting, etc)
#                            provided in Gb of RAM, ie.  '--max_memory 10G'
#
#  If paired reads:
#      --left  <string>    :left reads, one or more file names (separated by commas, no spaces)
#      --right <string>    :right reads, one or more file names (separated by commas, no spaces)
#
#  Or, if unpaired reads:
#      --single <string>   :single reads, one or more file names, comma-delimited (note, if single file contains pairs, can use flag: --run_as_paired )
#
#  Or,
#      --samples_file <string>         tab-delimited text file indicating biological replicate relationships.
#                                   ex.
#                                        cond_A    cond_A_rep1    A_rep1_left.fq    A_rep1_right.fq
#                                        cond_A    cond_A_rep2    A_rep2_left.fq    A_rep2_right.fq
#                                        cond_B    cond_B_rep1    B_rep1_left.fq    B_rep1_right.fq
#                                        cond_B    cond_B_rep2    B_rep2_left.fq    B_rep2_right.fq
#
#                      # if single-end instead of paired-end, then leave the 4th column above empty.
#
####################################
##  Misc:  #########################
#
#  --include_supertranscripts      :yield supertranscripts fasta and gtf files as outputs.
#
#  --SS_lib_type <string>          :Strand-specific RNA-Seq read orientation.
#                                   if paired: RF or FR,
#                                   if single: F or R.   (dUTP method = RF)
#                                   See web documentation.
#
#  --CPU <int>                     :number of CPUs to use, default: 2
#  --min_contig_length <int>       :minimum assembled contig length to report
#                                   (def=200)
#
#  --long_reads <string>           :fasta file containing error-corrected or circular consensus (CCS) pac bio reads
#                                   (** note: experimental parameter **, this functionality continues to be under development)
#
#  --genome_guided_bam <string>    :genome guided mode, provide path to coordinate-sorted bam file.
#                                   (see genome-guided param section under --show_full_usage_info)
#
#  --jaccard_clip                  :option, set if you have paired reads and
#                                   you expect high gene density with UTR
#                                   overlap (use FASTQ input file format
#                                   for reads).
#                                   (note: jaccard_clip is an expensive
#                                   operation, so avoid using it unless
#                                   necessary due to finding excessive fusion
#                                   transcripts w/o it.)
#
#  --trimmomatic                   :run Trimmomatic to quality trim reads
#                                        see '--quality_trimming_params' under full usage info for tailored settings.
#
#
#  --no_normalize_reads            :Do *not* run in silico normalization of reads. Defaults to max. read coverage of 200.
#                                       see '--normalize_max_read_cov' under full usage info for tailored settings.
#                                       (note, as of Sept 21, 2016, normalization is on by default)
#
#  --no_distributed_trinity_exec   :do not run Trinity phase 2 (assembly of partitioned reads), and stop after generating command list.
#
#
#  --output <string>               :name of directory for output (will be
#                                   created if it doesn't already exist)
#                                   default( your current working directory: "/Users/jun.aruga/git/trinityrnaseq/trinity_out_dir"
#                                    note: must include 'trinity' in the name as a safety precaution! )
#
#  --workdir <string>              :where Trinity phase-2 assembly computation takes place (defaults to --output setting).
#                                  (can set this to a node-local drive or RAM disk)
#
#  --full_cleanup                  :only retain the Trinity fasta file, rename as ${output_dir}.Trinity.fasta
#
#  --cite                          :show the Trinity literature citation
#
#  --verbose                       :provide additional job status info during the run.
#
#  --version                       :reports Trinity version (Trinity-v2.8.4) and exits.
#
#  --show_full_usage_info          :show the many many more options available for running Trinity (expert usage).
#
#
###############################################################################
#
#  *Note, a typical Trinity command might be:
#
#        Trinity --seqType fq --max_memory 50G --left reads_1.fq  --right reads_2.fq --CPU 6
#
#            (if you have multiple samples, use --samples_file ... see above for details)
#
#    and for Genome-guided Trinity, provide a coordinate-sorted bam:
#
#        Trinity --genome_guided_bam rnaseq_alignments.csorted.bam --max_memory 50G
#                --genome_guided_max_intron 10000 --CPU 6
#
#     see: /Users/jun.aruga/git/trinityrnaseq/sample_data/test_Trinity_Assembly/
#          for sample data and 'runMe.sh' for example Trinity execution
#
#     For more details, visit: http://trinityrnaseq.github.io
#
###############################################################################

Optionally, if you want to install Trinity system-wide, run "make install".

Files are installed to /usr/local/bin/trinityrnaseq/ in this case.

$ sudo make install

$ export PATH=/usr/local/bin/trinityrnaseq:$PATH

$ which Trinity
/usr/local/bin/trinityrnaseq/Trinity

Test

$ TRINITY_HOME=$(pwd) make test -C sample_data/test_Trinity_Assembly
...
##### Done Running Trinity #####
...
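
After the test finishes, one quick sanity check is to count the assembled transcripts in the resulting FASTA file. This is a sketch: the output path below is an assumption based on Trinity's default output directory name (trinity_out_dir) and may differ between versions, and count_fasta_records is a hypothetical helper name.

```shell
# Count FASTA records (header lines starting with ">") in a file.
# count_fasta_records is a hypothetical helper name used only for this example.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Assumed location of the test assembly; adjust if your version writes elsewhere.
fasta="sample_data/test_Trinity_Assembly/trinity_out_dir/Trinity.fasta"
if [ -s "$fasta" ]; then
  echo "transcripts: $(count_fasta_records "$fasta")"
else
  echo "no assembly found at $fasta"
fi
```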

Usage

The documentation is at [4] and [5].

An introduction video is also available on YouTube.

There are two cases: paired reads and unpaired (single) reads [6].

On my Mac, the number of CPUs is 4.

$ sysctl -n hw.ncpu
4

Paired reads

$ Trinity --seqType fq --max_memory 1G --left reads_1.fq --right reads_2.fq --CPU 4

Unpaired reads

$ Trinity --seqType fq --max_memory 1G --single reads.fq --CPU 4
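
The two cases above can be wrapped in a small helper that picks the right options from the number of read files. This is only a sketch: trinity_command is a hypothetical name, the file names are the placeholders used above, and the function prints the command rather than running it so that it can be reviewed first.

```shell
# Print a Trinity command line: one reads file means unpaired, two means paired.
# trinity_command is a hypothetical helper name used only for this example.
trinity_command() {
  cpus="$1"
  shift
  case $# in
    1) echo "Trinity --seqType fq --max_memory 1G --single $1 --CPU $cpus" ;;
    2) echo "Trinity --seqType fq --max_memory 1G --left $1 --right $2 --CPU $cpus" ;;
    *) echo "usage: trinity_command CPUS READS [READS_2]" >&2; return 1 ;;
  esac
}

# sysctl -n hw.ncpu is macOS-specific; fall back to 2, the --CPU default shown in the help.
cpus=$(sysctl -n hw.ncpu 2>/dev/null || echo 2)
trinity_command "$cpus" reads_1.fq reads_2.fq
trinity_command "$cpus" reads.fq
```

To actually run the assembly, the printed line can be pasted into the shell once the options look right.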

References

[1] Trinity GitHub: https://github.com/trinityrnaseq/trinityrnaseq
[2] Trinity .travis.yml: https://github.com/trinityrnaseq/trinityrnaseq/blob/master/.travis.yml
[3] OpenMP: https://www.openmp.org/
[4] Trinity documentation: https://github.com/trinityrnaseq/trinityrnaseq/wiki
[5] Trinity documentation - Running Trinity: https://github.com/trinityrnaseq/trinityrnaseq/wiki/Running-Trinity
[6] Paired-End vs. Single-Read Sequencing Technology

Jun's Blog
