diff --git a/CITATIONS.md b/CITATIONS.md deleted file mode 100755 index eb266486d60dd8e0213bf40a39ba749e21f93fd1..0000000000000000000000000000000000000000 --- a/CITATIONS.md +++ /dev/null @@ -1,46 +0,0 @@ -# Citations - - -## Main - -[Nextflow](https://www.nextflow.io/docs/latest/index.html) - -[Singularity](https://docs.sylabs.io/guides/latest/user-guide/) - - -## Basecalling - -[pod5](https://pypi.org/project/pod5/) - -[dorado](https://github.com/nanoporetech/dorado) - - -## Quality Control - -[PycoQC](https://github.com/a-slide/pycoQC) - -[MultiQC](https://multiqc.info/) - - -## Alignment - -[Minimap2](https://github.com/lh3/minimap2) - - -## Methylation Calling - -[Modkit](https://github.com/nanoporetech/modkit) - - -## Other Genomics Tools - -[Samtools](https://github.com/samtools/samtools) - - -## Other - -[Conda](https://docs.conda.io/en/latest/) - -[Bioconda](https://bioconda.github.io/) - -[pip](https://pypi.org/project/pip/) diff --git a/LICENSE b/LICENSE deleted file mode 100755 index 766501c64b426586083f72b5fb9155268e224274..0000000000000000000000000000000000000000 --- a/LICENSE +++ /dev/null @@ -1,21 +0,0 @@ -MIT License - -Copyright (c) 2023 Bernardo Aguzzoli Heberle - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. diff --git a/README.md b/README.md index 7528f406953d89a6946271a4645393f14e4e709f..878cb6defdfd03cf7ea8683d41afc604a52c4dfb 100755 --- a/README.md +++ b/README.md @@ -1,246 +1,323 @@ -# DCNL_NANOPORE_PIPELINE -NextFlow pipeline used by the Developmental Cognitive Neuroscience Lab (DCNL) to process Oxford Nanopore (ONT) DNA methylation data +# nanopore +NextFlow pipeline used by the Developmental Cognitive Neuroscience Lab (DCNL) to process Oxford Nanopore (ONT) DNA methylation data. This repository is currently maintained by the DCNL and the Artificial Intelligence and Data Science Center (CIACD) at PUC-RS. +## Table of Contents + +1. [Getting Started](#getting-started) +1. [Pipeline parameters](#pipeline-parameters) +1. [Pipeline output directory](#pipeline-output-directory) +1. [Examples](#examples) +1. [Useful links](#useful-links) ## Getting Started -### 1) Have a functioning version of Nextflow in your Path. +1. This pipeline assumes you're running a **GNU/Linux** distribution, such as Debian or Ubuntu. -- Information on how to install NextFlow can be found [here](https://www.nextflow.io/docs/latest/getstarted.html). - -### 2) Have a functioning version of Singularity on your Path. +1. Install `git`, `java`, `nextflow` and `apptainer`: -- Information on how to install Singularity cna be found [here](https://docs.sylabs.io/guides/3.0/user-guide/installation.html) - - -### 3) Clone this github repo using the command below + - Install Java: install either [OpenJRE/JDK][openjava] (**recommended, see below**) or [OracleJRE/JDK][oraclejava]. 
To install both OpenJRE and OpenJDK on Debian/Ubuntu: -``` -git clone https://github.com/bernardo-heberle/DCNL_NANOPORE_PIPELINE -``` + ```sh + sudo apt install default-jre default-jdk + ``` + - Install [NextFlow][nextflow-docs-install] (skip Java installation) + - Install [Apptainer][apptainer-docs-install-deb] -### 4) Make sure you have all the sequencing files and reference genomes/assemblies files you need to run the pipeline. - -- ".fast5" or ".pod5" files. +1. Check that all dependencies are accessible via your user's `$PATH`: -- refecence/assembly ".fa" file specific to your organism of interest. - + ```sh + which {git,java,apptainer,nextflow} + ``` -### 5) Set NXF_SINGULARITY_CACHEDIR environment variable to your desired directory: + ```txt + /usr/bin/git + /usr/bin/java + /usr/bin/apptainer + /home/$USER/.local/bin/nextflow + ``` -Substitute `////` in the codeblock below for the path to the directory you would like to store your singularity images in. Make sure the directory exists before executing the pipeline. +1. Clone this repository and change directory to it: -``` -echo "" >> ~/.bash_profile && echo 'NXF_SINGULARITY_CACHEDIR="////"' >> ~/.bash_profile && echo 'export NXF_SINGULARITY_CACHEDIR' >> ~/.bash_profile && . ~/.bash_profile -``` + ```sh + git clone https://gmapsrv.pucrs.br/gitlab/ccd-public/nanopore.git + cd nanopore/ + ``` -# -# Pipeline parameters: +1. Make sure you have both the sequencing files and the reference genome/assembly files you need to run the pipeline. By convention, the sequencing files (`.fast5` or `.pod5` format) should be stored in `data/` (`mkdir data`), while the reference files (`.fa` format) should be stored in `references/`. Reference files are specific to the organism under study (human, rat, etc.). -## -## Parameters for step 1 (Basecalling) +1. 
Set the `NXF_APPTAINER_CACHEDIR` environment variable to your user's `apptainer` cache directory, as follows: -Many of the parameters for this step are based on dorado basecaller, see their [documentation](https://github.com/nanoporetech/dorado) to understand it better. + ```sh + NXF_APPTAINER_CACHEDIR="$HOME/apptainer/cache/" + mkdir -p "$NXF_APPTAINER_CACHEDIR" + echo "export NXF_APPTAINER_CACHEDIR=\"$NXF_APPTAINER_CACHEDIR\"" | tee -a ~/.bashrc + source ~/.bashrc + ``` -``` ---step +1. You should now be able to run the `nextflow` pipeline (`workflow/main.nf`). See [Pipeline parameters](#pipeline-parameters) and [Examples](#examples) for details. - ---basecall_path +[openjava]:https://openjdk.org/install/ +[oraclejava]:https://www.java.com/en/download/linux_manual.jsp +[nextflow-docs-install]:https://www.nextflow.io/docs/latest/install.html#install-nextflow +[apptainer-docs-install-deb]:https://apptainer.org/docs/admin/main/installation.html#install-debian-packages ---basecall_speed +[top](#table-of-contents) ---basecall_mods +## Pipeline parameters ---basecall_compute +### Step 1: Basecalling ---basecall_config +Many of the parameters for this step are based on the dorado basecaller; see its [documentation](https://github.com/nanoporetech/dorado) for details. ---basecall_trim +```txt +--step ---qscore_thresh + +``` +```txt +--basecall_path ---demux + +``` ---trim_barcodes +```txt +--basecall_speed ---gpu_devices +``` ---prefix +```txt +--basecall_mods ---out_dir /" in the directory you submitted the pipeline from. - Default: "output_directory"> + ``` -## -## Parameters for step 2 (Alignment Filtering and Quality Control): -``` ---step +```txt +--basecall_compute ---steps_2_and_3_input_directory ". 
Default = "None"> + +``` ---qscore_thresh +```txt +--basecall_config ---mapq + +``` ---min_mapped_reads_thresh +```txt +--basecall_trim ---is_barcoded + +``` -``` +```txt +--qscore_thresh + +``` -## -## Parameters for step 3 (Methylation Calling and MultiQC): +```txt +--demux + ``` ---step - ---steps_2_and_3_input_directory ". Default = "None"> ---multiqc_config +```txt +--trim_barcodes + ``` -# -# Submission examples: - -## -## STEP 1: GPU basecalling without demultiplexing +```txt +--gpu_devices + ``` -## -## STEP 2: Alignment Filtering and Quality Control from MinKNOW basecalling and alignment (bam files were generated by MinKNOW) +```txt +--out_dir -``` -nextflow ../DCNL_NANOPORE_PIPELINE/workflow/main.nf \ - --steps_2_and_3_input_directory "./results/test_basecall_gpu_no_demux_mouse/" \ - --min_mapped_reads_thresh 500 \ - --is_barcoded "True" \ - --qscore_thresh 9 --mapq 10 --step "2_from_step_1" -resume +/" in the directory you submitted the pipeline from. Default: "output_directory"> ``` -## -## STEP 3: Methylation calling and MultiQC report: +### Step 2: Alignment Filtering and Quality Control -``` -nextflow ../DCNL_NANOPORE_PIPELINE/workflow/main.nf \ - --steps_2_and_3_input_directory "./results/test_basecall_gpu_no_demux_mouse/" \ - --multiqc_config "../DCNL_NANOPORE_PIPELINE/references/multiqc_config.yaml" --step 3 -resume +```txt +--step + ``` -## -## Pipeline output directory description: +```txt +--steps_2_and_3_input_directory -1. **fast5_to_pod5** - One directory per sample. Only exists for sample that had any fast5 files converted into pod5 files for more efficient basecalling with Dorado. +". Default = "None"> +``` -2. **basecalling_output** - Dorado basecalling output. One ".bam" file per sample (already mapped to the reference genome of choice and sorted). - Also includes one sequencing summary file per sample. Reads for the same run will be separated into different fastq files - based on barcode when demultiplexing is enabled. - -3. 
**pycoqc_no_filter** - Includes pycoQC quality control reports for each sample with metrics prior to alignment filtering by MAPQ. - PycoQC reports are output in both ".html" and ".json" format. The ".html" files can be imported into - a personal computer and opened using any internet browser to provide a quick glance basic statistics from the sequencing run. +```txt +--qscore_thresh -4. **pycoqc_filtered** - Includes pycoQC quality control reports for each sample with metrics post alignment filtering by MAPQ. - PycoQC reports are output in both ".html" and ".json" format. The ".html" files can be imported into - a personal computer and opened using any internet browser to provide a quick glance basic statistics from the sequencing run. + +``` -5. **multiqc_input/minimap2** - Includes ".flagstat" and ".idxstat" files generate with samtools from before and after alignment filtering. These files show number - of reads per sample and number of reads per chromosome. This information is integrated in the final multiQC report. +```txt +--mapq -6. **bam_filtering** - Output from filtering bam files. Filtered files only include primary alignments with MAPQ greater than or equal to what the user specified. - This directory includes sorted ".bam" files from before and after filtering and their respective index ".bai" files. + +``` -7. **intermediate_qc_reports** - Intermediate quality control reports for each sample separated into 3 directories: - "read_length", "number_of_reads", "quality_score_thresholds". +```txt +--min_mapped_reads_thresh -8. **modkit** - Directory with methylation calls, bed file pileup, and summary files generated using modkit. - See [documentation](https://nanoporetech.github.io/modkit/quick_start.html) for more information. + +``` +```txt +--is_barcoded -9. **num_reads_report** - Three reports, one with number of reads for each sample, other with reads length, and another with - MAPQ and PHRED quality scores used to filter the files. 
+ +``` +### Step 3: Methylation Calling and MultiQC -10. **multiQC_output** - MultiQC output files, most importantly the ".html" report showing summary statistics for all file. +```txt +--step -11. **calculate_coverage** - Two .tsv files containing the average coverage for each sample across every chromosome of the reference genome used. If a value is not present for a sample - that means that chromosome had 0 coverage in that sample. + +``` +```txt +--steps_2_and_3_input_directory -13. **minknow_converted_input** - Merged .bam files and sequencing_summary.txt files for each barcode. +". Default = "None"> +``` + +```txt +--multiqc_config + +``` +[top](#table-of-contents) + +## Pipeline output directory + +1. `fast5_to_pod5`: One directory per sample. Only exists for samples that had any fast5 files converted into pod5 files for more efficient basecalling with Dorado. +1. `basecalling_output`: Dorado basecalling output. One ".bam" file per sample (already mapped to the reference genome of choice and sorted). Also includes one sequencing summary file per sample. Reads for the same run will be separated into different fastq files based on barcode when demultiplexing is enabled. +1. `pycoqc_no_filter`: Includes pycoQC quality control reports for each sample with metrics prior to alignment filtering by MAPQ. PycoQC reports are output in both ".html" and ".json" format. The ".html" files can be copied to a personal computer and opened in any web browser to provide a quick glance at basic statistics from the sequencing run. +1. `pycoqc_filtered`: Includes pycoQC quality control reports for each sample with metrics post alignment filtering by MAPQ. PycoQC reports are output in both ".html" and ".json" format. The ".html" files can be copied to a personal computer and opened in any web browser to provide a quick glance at basic statistics from the sequencing run. +1. 
`multiqc_input/minimap2`: Includes ".flagstat" and ".idxstat" files generated with samtools from before and after alignment filtering. These files show the number of reads per sample and the number of reads per chromosome. This information is integrated into the final multiQC report. +1. `bam_filtering`: Output from filtering bam files. Filtered files only include primary alignments with MAPQ greater than or equal to what the user specified. This directory includes sorted ".bam" files from before and after filtering and their respective index ".bai" files. +1. `intermediate_qc_reports`: Intermediate quality control reports for each sample separated into 3 directories: "read_length", "number_of_reads", "quality_score_thresholds". +1. `modkit`: Directory with methylation calls, bed file pileup, and summary files generated using modkit. See [documentation](https://nanoporetech.github.io/modkit/quick_start.html) for more information. +1. `num_reads_report`: Three reports: one with the number of reads for each sample, one with read lengths, and one with the MAPQ and PHRED quality scores used to filter the files. +1. `multiQC_output`: MultiQC output files, most importantly the ".html" report showing summary statistics for all files. +1. `calculate_coverage`: Two .tsv files containing the average coverage for each sample across every chromosome of the reference genome used. If a value is not present for a sample, that chromosome had 0 coverage in that sample. +1. `minknow_converted_input`: Merged .bam files and sequencing_summary.txt files for each barcode. + +[top](#table-of-contents) + +## Examples + +The following examples assume your current directory is the root directory of the project (`nanopore/`). + +1. 
Set the following variables for your test run (see examples in the comments): + + ```sh + # BASECALL_PATH="./data/test_data_minial/" + export BASECALL_PATH="" + # REFERENCE_FILE="./references/mouse_reference.fa" + export REFERENCE_FILE="" + # OUTPUT_DIR_NAME="test_gpu" + export OUTPUT_DIR_NAME="" + ``` + +1. STEP 1: GPU basecalling without demultiplexing + + ```sh + nextflow ./workflow/main.nf \ + --basecall_path "$BASECALL_PATH" \ + --basecall_speed "hac" \ + --step 1 \ + --ref "$REFERENCE_FILE" \ + --gpu_devices "all" \ + --basecall_mods "5mC_5hmC" \ + --qscore_thresh 9 \ + --basecall_config "False" \ + --basecall_trim "none" \ + --basecall_compute "gpu" \ + --basecall_demux "False" \ + --queue_size 1 \ + --out_dir "$OUTPUT_DIR_NAME" \ + -resume + ``` + +1. STEP 2A: Alignment Filtering and Quality Control from STEP 1 + + ```sh + nextflow ./workflow/main.nf \ + --steps_2_and_3_input_directory "./results/$OUTPUT_DIR_NAME/" \ + --min_mapped_reads_thresh 500 \ + --qscore_thresh 9 \ + --mapq 10 \ + --step "2_from_step_1" \ + -resume + ``` + +1. STEP 2B (MinKNOW): Alignment Filtering and Quality Control from MinKNOW basecalling and alignment (bam files were generated by MinKNOW) + + ```sh + nextflow ./workflow/main.nf \ + --steps_2_and_3_input_directory "./results/$OUTPUT_DIR_NAME/" \ + --min_mapped_reads_thresh 500 \ + --is_barcoded "True" \ + --qscore_thresh 9 \ + --mapq 10 \ + --step "2_from_step_1" \ + -resume + ``` + +1. 
STEP 3: Methylation calling and MultiQC report + + ```sh + nextflow ./workflow/main.nf \ + --steps_2_and_3_input_directory "./results/$OUTPUT_DIR_NAME/" \ + --multiqc_config "./references/multiqc_config.yaml" \ + --step 3 \ + -resume + ``` + +[top](#table-of-contents) + +## Useful links + +- Main + - [Nextflow](https://www.nextflow.io) + - [Apptainer](https://apptainer.org)/[Singularity](https://docs.sylabs.io) +- Basecalling + - [pod5](https://pypi.org/project/pod5/) + - [dorado](https://github.com/nanoporetech/dorado) +- Quality Control + - [PycoQC](https://github.com/a-slide/pycoQC) + - [MultiQC](https://multiqc.info/) +- Alignment + - [Minimap2](https://github.com/lh3/minimap2) +- Methylation Calling + - [Modkit](https://github.com/nanoporetech/modkit) +- Other Genomics Tools + - [Samtools](https://github.com/samtools/samtools) +- Other Software + - [Conda](https://docs.conda.io/en/latest/) + - [Bioconda](https://bioconda.github.io/) + - [pip](https://pypi.org/project/pip/) + +[top](#table-of-contents)
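
The dependency check from Getting Started and the variables set in Examples can be combined into a small pre-flight script. The following is only a sketch under stated assumptions: it assumes a POSIX-compatible shell, and the helper names `missing_commands`, `unset_variables` and `preflight` are hypothetical and not part of this repository.

```sh
# Pre-flight helpers (illustrative sketch; names are hypothetical).

# Print the name of every command in "$@" that is not found on $PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Print the name of every variable in "$@" that is unset or empty.
unset_variables() {
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Return 0 when every tool and variable used in this README is available,
# otherwise list the problems on stderr and return 1.
preflight() {
    problems="$(
        missing_commands git java apptainer nextflow
        unset_variables NXF_APPTAINER_CACHEDIR BASECALL_PATH REFERENCE_FILE OUTPUT_DIR_NAME
    )"
    if [ -n "$problems" ]; then
        printf 'Missing commands or unset variables:\n%s\n' "$problems" >&2
        return 1
    fi
    echo "All checks passed."
}
```

After sourcing a file containing these functions, calling `preflight` before the STEP 1 command reports anything that is still missing instead of letting the pipeline fail partway through.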