This article explains how to trim .fastq files.
.fastq files output by a sequencer can include data with a low quality score. In that case, we trim the low-quality data.
This post uses the same data as the previous blog post, "Genome Sequencing: Quality Check by fastqc" (Another Japan in the World).
The tool used here is "Trim Galore!" [1][2].
Install
Install Trim Galore!
This time, I downloaded the development version with git clone and simply added its directory to PATH.
$ cd GIT_DIR
$ git clone git@github.com:FelixKrueger/TrimGalore.git
$ cd TrimGalore
$ pwd
/Users/jun.aruga/git/TrimGalore
$ ls ./trim_galore
./trim_galore
$ export PATH=$PATH:$HOME/git/TrimGalore
$ trim_galore --version
Quality-/Adapter-/RRBS-Trimming
(powered by Cutadapt)
version 0.4.5_dev
Last update: 19 03 2018
These steps are still not enough to run the command.
You also have to install the cutadapt [3] command, because trim_galore is a wrapper around fastqc and cutadapt.
Install cutadapt
cutadapt [3] is a Python package.
There are several ways to install it, such as pip or conda.
This time I installed it with the pip command.
$ which python3
/usr/local/python-3.6.1/bin/python3
$ sudo /usr/local/python-3.6.1/bin/pip3 install cutadapt
$ pip3 list | grep cutadapt
cutadapt 1.16
$ which cutadapt
/usr/local/python-3.6.1/bin/cutadapt
$ cutadapt --version
1.16
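Before running trim_galore, it can help to confirm that everything it wraps is actually on PATH. The following is a minimal sketch; the function name check_deps is my own and is not part of Trim Galore! or cutadapt.

```shell
#!/bin/sh
# check_deps is a hypothetical helper, not part of any of the tools.
# It reports which of the given commands can be found on PATH and
# returns non-zero if at least one of them is missing.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# trim_galore wraps cutadapt, and fastqc is needed for the --fastqc option.
check_deps trim_galore cutadapt fastqc || echo "Install the missing commands first."
```

If any line says "missing", go back to the install steps above before continuing.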
Run
$ trim_galore -q 28 --paired --illumina --fastqc -o trimmed_by_quality_28/ SRR747784_1.fastq SRR747784_2.fastq
Options (from trim_galore --help):
- -q: Phred quality score threshold. Low-quality ends with a score below this value are trimmed from the reads. Default is 20.
- --paired: This option performs length trimming of quality/adapter/RRBS trimmed reads for paired-end files.
- --illumina: Adapter sequence to be trimmed is the first 13bp of the Illumina universal adapter 'AGATCGGAAGAGC' instead of the default auto-detection of adapter sequence.
- --fastqc: Run FastQC in the default mode on the FastQ file once trimming is complete.
- -o trimmed_by_quality_28/: output directory.
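If you want to try several quality thresholds or samples, the command above can be generated from two parameters. This is only a sketch: trim_cmd is a hypothetical helper of mine, it assumes the input files are named <prefix>_1.fastq and <prefix>_2.fastq as in this article, and it only prints the command instead of running it.

```shell
#!/bin/sh
# trim_cmd is a hypothetical helper. It prints (does not run) the
# trim_galore command for one paired-end sample.
# $1: quality threshold for -q
# $2: sample prefix; inputs are assumed to be <prefix>_1.fastq and <prefix>_2.fastq
trim_cmd() {
    quality=$1
    sample=$2
    echo "trim_galore -q $quality --paired --illumina --fastqc" \
         "-o trimmed_by_quality_$quality/ ${sample}_1.fastq ${sample}_2.fastq"
}

# Prints the same command that was used in this article.
trim_cmd 28 SRR747784
```

Printing first makes it easy to check the options before spending time on a real run.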
Compared with the result of running fastqc directly, you can see that the data in the yellow quality score area (<=28) has been trimmed.
The result of running fastqc directly:
The result after trimming (trim_galore):
The trimmed FASTQ files are created in trimmed_by_quality_28/*.fq.
$ ls trimmed_by_quality_28/*.fq
trimmed_by_quality_28/SRR747784_1_val_1.fq
trimmed_by_quality_28/SRR747784_2_val_2.fq
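The output names in the listing follow a visible pattern: the .fastq extension is dropped and _val_1.fq or _val_2.fq is appended. A later script could derive those names like this. It is a sketch based only on the listing above; val_name is a hypothetical helper, and other input extensions such as .fq or .fastq.gz are not handled.

```shell
#!/bin/sh
# val_name is a hypothetical helper. It derives the expected name of a
# trimmed paired-end output file, following the pattern seen in the
# listing above.
# $1: input FASTQ path, $2: read number (1 or 2), $3: output directory
val_name() {
    base=$(basename "$1")
    base=${base%.fastq}          # drop the .fastq extension
    echo "${3%/}/${base}_val_$2.fq"
}

val_name SRR747784_1.fastq 1 trimmed_by_quality_28/
val_name SRR747784_2.fastq 2 trimmed_by_quality_28/
```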
The data below shows the reads "before trimming" and "after trimming". In the "after trimming" output, you can see that lines 5-8 (the second read) have been trimmed. Remember that the FASTQ file format stores each read as a set of 4 lines.
$ head -8 SRR747784_1.fastq
@SRR747784.1 1 length=100
GCACTCAAGATACATAACTCTCCTCCTTTTTCAGAAAATGACTACAAAAAGGGAATTTTGACATATTGCAGGATGCATCACCAGGTGATGTTACTCTTCT
+SRR747784.1 1 length=100
CCCFFFFFHHHHHJJGHJJJJJJJJJJJJJJJIJIJJIJJIJJJJJJIJJJJJJJJJJJJJJJJJJJIIIJJHHHHHHFFFFFDD>CEDDEDEDDDDDDD
@SRR747784.2 2 length=100
AGGCGGAAGCGGGTCCCGGGGAGGCGGAATCGGGTTCCGGGGAGGCGGAAGCGGGTCCTGGGGAGGCAGAAGCGGGTCTTGGGGGGGCGGAAGGGGGTTC
+SRR747784.2 2 length=100
C@CFFFFFHHHHHGIJIJJJJGIIEDB>?BDDDDABDDDDDD6DD@DDB8@0:8@57?>ACDD0;;(2?<?C99B@<@34>CBB################

$ head -8 trimmed_by_quality_28/SRR747784_1_val_1.fq
@SRR747784.1 1 length=100
GCACTCAAGATACATAACTCTCCTCCTTTTTCAGAAAATGACTACAAAAAGGGAATTTTGACATATTGCAGGATGCATCACCAGGTGATGTTACTCTTCT
+SRR747784.1 1 length=100
CCCFFFFFHHHHHJJGHJJJJJJJJJJJJJJJIJIJJIJJIJJJJJJIJJJJJJJJJJJJJJJJJJJIIIJJHHHHHHFFFFFDD>CEDDEDEDDDDDDD
@SRR747784.2 2 length=100
AGGCGGAAGCGGGTCCCGGGGAGGCGGAATCGGGTTCCGGGGAGGCGG
+SRR747784.2 2 length=100
C@CFFFFFHHHHHGIJIJJJJGIIEDB>?BDDDDABDDDDDD6DD@DD
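To compare reads before and after trimming without reading the raw lines by eye, the 4-line structure can be used directly. The sketch below prints each read's ID and length, and converts a quality character to its Phred score. Both function names are hypothetical helpers of mine, the tiny sample file is a toy example and not real data, and the conversion assumes Phred+33 encoding.

```shell
#!/bin/sh
# fastq_lengths is a hypothetical helper. It prints "<read id> <length>"
# for each 4-line FASTQ record: line 1 is the header, line 2 the sequence.
fastq_lengths() {
    awk 'NR % 4 == 1 { id = substr($1, 2) }
         NR % 4 == 2 { print id, length($0) }' "$1"
}

# phred_score is a hypothetical helper. It converts one quality character
# to its Phred score, assuming Phred+33 encoding.
phred_score() {
    code=$(printf '%d' "'$1")    # numeric character code
    echo $((code - 33))
}

# Toy input for illustration only (not real sequencing data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
@read1
ACGTACGT
+
CCCFFFFF
@read2
ACGT
+
C@C#
EOF

fastq_lengths "$sample"
phred_score '#'
rm -f "$sample"
```

Under the Phred+33 assumption, the '#' characters at the end of the second read before trimming correspond to a very low score, which is consistent with that tail being removed by -q 28.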
References
- [1] Trim Galore! Official web: https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
- [2] Trim Galore! Source: https://github.com/FelixKrueger/TrimGalore
- [3] cutadapt: https://github.com/marcelm/cutadapt