Jun's Blog

Output, activities, memo and etc.

Sequencing: Trimming FASTQ file

This article explains how to trim .fastq files.

.fastq files output by a sequencer can include base calls with low quality scores. In that case, we trim the low-quality data.
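As a rough illustration of what a "quality score" means here, the following Python sketch decodes a FASTQ quality string into Phred scores. It assumes the Phred+33 encoding (the usual one for recent Illumina data); the function names are only for this example. The sample string is the tail of a quality line that appears later in this post.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    # Assumption: Phred+33 encoding, so '#' (ASCII 35) means score 2.
    return [ord(ch) - offset for ch in quality_line]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong: 10^(-Q/10)."""
    return 10 ** (-score / 10)


tail = "CBB####"
scores = phred_scores(tail)
print(scores)                 # [34, 33, 33, 2, 2, 2, 2]
print(error_probability(20))  # 0.01
```

A score of 2 (the `#` characters) means the base call is almost certainly unreliable, while a score of 28 corresponds to an error chance of roughly 0.16%.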

This article uses the same data as the previous blog post: Genome Sequencing: Quality Check by fastqc - Another Japan in the World

The tool used for trimming is "Trim Galore!" [1][2].

Install

Install Trim Galore!

This time, I downloaded the development version with git clone and simply added the directory containing the command to PATH.

$ cd GIT_DIR

$ git clone git@github.com:FelixKrueger/TrimGalore.git

$ cd TrimGalore

$ pwd
/Users/jun.aruga/git/TrimGalore

$ ls ./trim_galore
./trim_galore

$ export PATH=$PATH:$HOME/git/TrimGalore

$ trim_galore --version

                          Quality-/Adapter-/RRBS-Trimming
                               (powered by Cutadapt)
                                  version 0.4.5_dev

                             Last update: 19 03 2018

These steps are still not enough to run the command. You also have to install the cutadapt [3] command.

trim_galore is a wrapper command around fastqc and cutadapt.
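Since trim_galore depends on other commands, it can help to check that everything is on PATH before running it. Below is a minimal Python sketch of such a check; the function name `missing_tools` is made up for this example, and fastqc is only needed when the --fastqc option is used.

```python
import shutil


def missing_tools(tools=("trim_galore", "cutadapt", "fastqc")):
    """Return the names of the commands that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required commands are available.")
```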

Install cutadapt

cutadapt [3] is a Python package. There are several ways to install it, such as pip or conda. This time I installed it with the pip command.

$ which python3
/usr/local/python-3.6.1/bin/python3

$ sudo /usr/local/python-3.6.1/bin/pip3 install cutadapt

$ pip3 list | grep cutadapt
cutadapt          1.16

$ which cutadapt
/usr/local/python-3.6.1/bin/cutadapt

$ cutadapt --version
1.16

Run

$ trim_galore -q 28 --paired --illumina --fastqc -o trimmed_by_quality_28/ SRR747784_1.fastq SRR747784_2.fastq

Options (from trim_galore --help):

  • -q: Phred quality score threshold. Low-quality ends below this score are trimmed from the reads. Default is 20.
  • --paired: This option performs length trimming of quality/adapter/RRBS trimmed reads for paired-end files.
  • --illumina: Adapter sequence to be trimmed is the first 13bp of the Illumina universal adapter 'AGATCGGAAGAGC' instead of the default auto-detection of adapter sequence.
  • --fastqc: Run FastQC in the default mode on the FastQ file once trimming is complete.
  • -o trimmed_by_quality_28/: output directory.
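To make the -q option more concrete, here is a minimal Python sketch of quality trimming at the 3' end. It follows the BWA-style running-sum approach that the cutadapt documentation describes, but it is only an illustration, not cutadapt's actual code. The function name `quality_trim_index` and the Phred+33 offset are assumptions of this sketch. The quality string is the second read of the data shown later in this post.

```python
def quality_trim_index(qualities, cutoff, offset=33):
    """Return how many characters to keep after trimming the 3' end."""
    running = 0           # running sum of (cutoff - score), from the 3' end
    best = 0              # largest running sum seen so far
    keep = len(qualities)
    for i in reversed(range(len(qualities))):
        running += cutoff - (ord(qualities[i]) - offset)
        if running < 0:
            # Good bases now outweigh the bad ones; stop scanning.
            break
        if running > best:
            best = running
            keep = i      # cut here: everything from index i onward is removed
    return keep


read2_quality = (
    "C@CFFFFFHHHHHGIJIJJJJGIIEDB>?BDDDDABDDDDDD6DD@DD"
    "B8@0:8@57?>ACDD0;;(2?<?C99B@<@34>CBB################"
)
print(quality_trim_index(read2_quality, cutoff=28))  # 49
```

With a cutoff of 28, this sketch keeps 49 bases of that read, while the Trim Galore! output shown below keeps 48. The one-base difference is probably due to adapter trimming: base 49 of the read is an A, which matches the first base of the adapter 'AGATCGGAAGAGC', and the default --stringency of 1 removes even a 1 bp overlap.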

Comparing with the result of running fastqc directly on the untrimmed data, you can see that the data in the yellow quality score area (<=28) have been trimmed.

The result of running fastqc directly (before trimming): [image]

The result after trimming (trim_galore): [image]

The trimmed FASTQ files are created in trimmed_by_quality_28/*.fq.

$ ls trimmed_by_quality_28/*.fq
trimmed_by_quality_28/SRR747784_1_val_1.fq  trimmed_by_quality_28/SRR747784_2_val_2.fq
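With --paired, the two output files should still list the same reads in the same order. The following Python sketch checks that. It assumes that both mates share the same read ID in the header line, as in this SRA data; the function names and the tiny in-memory reads are made up for the example.

```python
import io
from itertools import zip_longest


def read_ids(handle):
    """Yield the read ID from each header line (every 4th line)."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 0 and line.strip():
            # "@SRR747784.1 1 length=100" -> "SRR747784.1"
            yield line[1:].split()[0]


def pairs_in_sync(handle_1, handle_2):
    """Return True when both files list the same read IDs in the same order."""
    # zip_longest pads the shorter file with None, so a length mismatch fails.
    return all(a == b for a, b in zip_longest(read_ids(handle_1), read_ids(handle_2)))


# Tiny in-memory stand-ins for the two output files (made-up reads).
mate_1 = io.StringIO("@read.1 1\nACGT\n+\nIIII\n@read.2 2\nGGCC\n+\nIIII\n")
mate_2 = io.StringIO("@read.1 1\nTTGA\n+\nIIII\n@read.2 2\nCCAA\n+\nIIII\n")
print(pairs_in_sync(mate_1, mate_2))  # True
```

For the real files, open SRR747784_1_val_1.fq and SRR747784_2_val_2.fq and pass the two file handles instead of the StringIO objects.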

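The data below show the first reads "before trimming" and "after trimming". In the "after trimming" output, lines 5-8 (the second read) have been trimmed: the sequence and quality lines are shorter. Remember that the FASTQ format uses 4 lines per record.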

$ head -8 SRR747784_1.fastq
@SRR747784.1 1 length=100
GCACTCAAGATACATAACTCTCCTCCTTTTTCAGAAAATGACTACAAAAAGGGAATTTTGACATATTGCAGGATGCATCACCAGGTGATGTTACTCTTCT
+SRR747784.1 1 length=100
CCCFFFFFHHHHHJJGHJJJJJJJJJJJJJJJIJIJJIJJIJJJJJJIJJJJJJJJJJJJJJJJJJJIIIJJHHHHHHFFFFFDD>CEDDEDEDDDDDDD
@SRR747784.2 2 length=100
AGGCGGAAGCGGGTCCCGGGGAGGCGGAATCGGGTTCCGGGGAGGCGGAAGCGGGTCCTGGGGAGGCAGAAGCGGGTCTTGGGGGGGCGGAAGGGGGTTC
+SRR747784.2 2 length=100
C@CFFFFFHHHHHGIJIJJJJGIIEDB>?BDDDDABDDDDDD6DD@DDB8@0:8@57?>ACDD0;;(2?<?C99B@<@34>CBB################

$ head -8 trimmed_by_quality_28/SRR747784_1_val_1.fq
@SRR747784.1 1 length=100
GCACTCAAGATACATAACTCTCCTCCTTTTTCAGAAAATGACTACAAAAAGGGAATTTTGACATATTGCAGGATGCATCACCAGGTGATGTTACTCTTCT
+SRR747784.1 1 length=100
CCCFFFFFHHHHHJJGHJJJJJJJJJJJJJJJIJIJJIJJIJJJJJJIJJJJJJJJJJJJJJJJJJJIIIJJHHHHHHFFFFFDD>CEDDEDEDDDDDDD
@SRR747784.2 2 length=100
AGGCGGAAGCGGGTCCCGGGGAGGCGGAATCGGGTTCCGGGGAGGCGG
+SRR747784.2 2 length=100
C@CFFFFFHHHHHGIJIJJJJGIIEDB>?BDDDDABDDDDDD6DD@DD
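The comparison above can also be done programmatically. The following Python sketch reads FASTQ records 4 lines at a time and compares the second read before and after trimming. The `FastqRecord` type and `parse_fastq` function are names made up for this example, not part of any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    separator: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per 4 lines; an incomplete trailing record is dropped."""
    stripped = (line.rstrip("\n") for line in handle)
    # Passing the same iterator 4 times makes zip consume lines in groups of 4.
    for header, sequence, separator, quality in zip(*[stripped] * 4):
        yield FastqRecord(header, sequence, separator, quality)


BEFORE = (
    "@SRR747784.2 2 length=100\n"
    "AGGCGGAAGCGGGTCCCGGGGAGGCGGAATCGGGTTCCGGGGAGGCGGAAGCGGGTCCTGGGGAGGCAGAAGCGGGTCTTGGGGGGGCGGAAGGGGGTTC\n"
    "+SRR747784.2 2 length=100\n"
    "C@CFFFFFHHHHHGIJIJJJJGIIEDB>?BDDDDABDDDDDD6DD@DDB8@0:8@57?>ACDD0;;(2?<?C99B@<@34>CBB################\n"
)
AFTER = (
    "@SRR747784.2 2 length=100\n"
    "AGGCGGAAGCGGGTCCCGGGGAGGCGGAATCGGGTTCCGGGGAGGCGG\n"
    "+SRR747784.2 2 length=100\n"
    "C@CFFFFFHHHHHGIJIJJJJGIIEDB>?BDDDDABDDDDDD6DD@DD\n"
)

raw = next(parse_fastq(io.StringIO(BEFORE)))
trimmed = next(parse_fastq(io.StringIO(AFTER)))
print(len(raw.sequence), "->", len(trimmed.sequence))
```

Note that the header still says length=100 after trimming, so the actual read length has to be measured from the sequence line itself.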

References