diff --git a/CHANGELOG.md b/CHANGELOG.md index 8fbf432c..a9bd714c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -15,6 +15,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - [#507](https://github.com/nf-core/funcscan/pull/507) Updated to nf-core template v3.5.1 (by @jfy133) - [#510](https://github.com/nf-core/funcscan/pull/510) Fixed code to make Nextflow strict-syntax compliant (by @jfy133) - [#521](https://github.com/nf-core/funcscan/pull/521) Added option to turn on RGI's own cleanup of intermediate files (❤️ to @SamD28 for requesting, added by @jfy133) +- [#519](https://github.com/nf-core/funcscan/pull/519) Added BiG-SLiCE (`bigslice`) as a new BGC clustering tool in the BGC subworkflow. BiG-SLiCE clusters BGC sequences detected by antiSMASH and/or GECCO into Gene Cluster Families (GCFs) using an HMM-based approach. It is activated with `--bgc_bigslice_run` and requires `--bgc_bigslice_db`. (by @SkyLexS) ### `Fixed` diff --git a/README.md b/README.md index 0e39fbf4..f64eb44f 100644 --- a/README.md +++ b/README.md @@ -39,7 +39,7 @@ The nf-core/funcscan AWS full test dataset are contigs generated by the MGnify s 4. Annotation of coding sequences from 3. to obtain general protein families and domains with [`InterProScan`](https://github.com/ebi-pf-team/interproscan) 5. Screening contigs for antimicrobial peptide-like sequences with [`ampir`](https://cran.r-project.org/web/packages/ampir/index.html), [`Macrel`](https://github.com/BigDataBiology/macrel), [`HMMER`](http://hmmer.org/), [`AMPlify`](https://github.com/bcgsc/AMPlify) 6. Screening contigs for antibiotic resistance gene-like sequences with [`ABRicate`](https://github.com/tseemann/abricate), [`AMRFinderPlus`](https://github.com/ncbi/amr), [`fARGene`](https://github.com/fannyhb/fargene), [`RGI`](https://card.mcmaster.ca/analyze/rgi), [`DeepARG`](https://bench.cs.vt.edu/deeparg). 
[`argNorm`](https://github.com/BigDataBiology/argNorm) is used to map the outputs of `DeepARG`, `AMRFinderPlus`, and `ABRicate` to the [`Antibiotic Resistance Ontology`](https://www.ebi.ac.uk/ols4/ontologies/aro) for consistent ARG classification terms. -7. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/) +7. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`BiG-SLiCE`](https://github.com/medema-group/bigslice), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/) 8. Screening contigs for carbohydrate-active enzymes (CAZymes), CAZyme gene clusters and substrates with [run_dbcan](https://github.com/bcb-unl/run_dbcan). 9. Creating aggregated reports for all samples across the workflows with [`AMPcombi`](https://github.com/paleobiotechnology/AMPcombi) for AMPs, [`hAMRonization`](https://github.com/pha4ge/hAMRonization) for ARGs, and [`comBGC`](https://raw.githubusercontent.com/nf-core/funcscan/master/bin/comBGC.py) for BGCs 10. Software version and methods text reporting with [`MultiQC`](http://multiqc.info/) diff --git a/conf/modules.config b/conf/modules.config index c8a394a9..a748fc6f 100644 --- a/conf/modules.config +++ b/conf/modules.config @@ -541,6 +541,14 @@ process { ] } + withName: BIGSLICE { + publishDir = [ + path: { "${params.outdir}/bgc/bigslice/" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? 
null : filename }, + ] + } + withName: HAMRONIZATION_ABRICATE { publishDir = [ path: { "${params.outdir}/arg/hamronization/abricate" }, diff --git a/docs/output.md b/docs/output.md index 5229e037..3802526d 100644 --- a/docs/output.md +++ b/docs/output.md @@ -6,7 +6,7 @@ The output of nf-core/funcscan provides reports for each of the functional group - **antibiotic resistance genes** (tools: [ABRicate](https://github.com/tseemann/abricate), [AMRFinderPlus](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder), [DeepARG](https://bitbucket.org/gusphdproj/deeparg-ss/src/master), [fARGene](https://github.com/fannyhb/fargene), [RGI](https://card.mcmaster.ca/analyze/rgi) – summarised by [hAMRonization](https://github.com/pha4ge/hAMRonization). Results from ABRicate, AMRFinderPlus, and DeepARG are normalised to [ARO](https://obofoundry.org/ontology/aro.html) by [argNorm](https://github.com/BigDataBiology/argNorm).) - **antimicrobial peptides** (tools: [Macrel](https://github.com/BigDataBiology/macrel), [AMPlify](https://github.com/bcgsc/AMPlify), [ampir](https://ampir.marine-omics.net), [hmmsearch](http://hmmer.org) – summarised by [AMPcombi](https://github.com/paleobiotechnology/AMPcombi)) -- **biosynthetic gene clusters** (tools: [antiSMASH](https://docs.antismash.secondarymetabolites.org), [DeepBGC](https://github.com/Merck/deepbgc), [GECCO](https://gecco.embl.de), [hmmsearch](http://hmmer.org) – summarised by [comBGC](#combgc)) +- **biosynthetic gene clusters** (tools: [antiSMASH](https://docs.antismash.secondarymetabolites.org), [BiGSLiCE](https://github.com/medema-group/bigslice), [DeepBGC](https://github.com/Merck/deepbgc), [GECCO](https://gecco.embl.de), [hmmsearch](http://hmmer.org) – summarised by [comBGC](#combgc)) - **carbohydrate-active enzymes (CAZymes)**, CAZyme gene clusters and substrates (tools: [run_dbcan](https://github.com/bcb-unl/run_dbcan)) As a general workflow, we recommend to first look at the summary reports 
([ARGs](#hamronization), [AMPs](#ampcombi), [BGCs](#combgc)), to get a general overview of what hits have been found across all the tools of each functional group, after which you can explore the specific output directories of each tool to get more detailed information about each result. The tool-specific output directories also include the output from the functional annotation steps of either [prokka](https://github.com/tseemann/prokka), [pyrodigal](https://github.com/althonos/pyrodigal), [prodigal](https://github.com/hyattpd/Prodigal), or [Bakta](https://github.com/oschwengers/bakta) if the `--save_annotations` flag was set. Additionally, taxonomic classifications from [MMseqs2](https://github.com/soedinglab/MMseqs2) are saved if the `--taxa_classification_mmseqs_db_savetmp` and `--taxa_classification_mmseqs_taxonomy_savetmp` flags are set. @@ -39,6 +39,7 @@ results/ | └── rgi/ ├── bgc/ | ├── antismash/ +| ├── bigslice/ | ├── deepbgc/ | ├── gecco/ | └── hmmsearch/ @@ -101,6 +102,7 @@ Antimicrobial Peptides (AMPs): Biosynthetic Gene Clusters (BGCs): - [antiSMASH](#antismash) – biosynthetic gene cluster detection. +- [BiGSLiCE](#bigslice) – clustering of detected biosynthetic gene clusters into gene cluster families (GCFs). - [deepBGC](#deepbgc) – biosynthetic gene cluster detection, using a deep learning model. - [GECCO](#gecco) – biosynthetic gene cluster detection, using Conditional Random Fields (CRFs). - [hmmsearch](#hmmsearch) – biosynthetic gene cluster detection, based on hidden Markov models. @@ -393,7 +395,7 @@ Output Summaries: ### BGC detection tools -[antiSMASH](#antismash), [deepBGC](#deepbgc), [GECCO](#gecco), [hmmsearch](#hmmsearch). +[antiSMASH](#antismash), [BiGSLiCE](#bigslice), [deepBGC](#deepbgc), [GECCO](#gecco), [hmmsearch](#hmmsearch). Note that the BGC tools are run on a set of annotations generated only on long contigs (3000 bp or longer) by default. 
These specific filtered FASTA files are under `bgc/seqkit/`, and annotation files are under `annotation//long/`, if the corresponding saving flags are specified (see [parameter docs](https://nf-co.re/funcscan/parameters)). However, the same annotations _should_ also be present in the sister `all/` directory. @@ -435,6 +437,25 @@ Note that filtered FASTA is only used for BGC workflow for run-time optimisation [antiSMASH](https://docs.antismash.secondarymetabolites.org) (**anti**biotics & **S**econdary **M**etabolite **A**nalysis **SH**ell) is a tool for rapid genome-wide identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial genomes. It identifies biosynthetic loci covering all currently known secondary metabolite compound classes in a rule-based fashion using profile HMMs and aligns the identified regions at the gene cluster level to their nearest relatives from a database containing experimentally verified gene clusters (MIBiG). +#### BiGSLiCE + +<details markdown="1"> +<summary>Output files</summary> + +- `bigslice/` + - `/` + - `result/` + - `data.db`: SQLite database containing results for BGCs, CDSs, Gene Cluster Families (GCFs), HMMs and HSPs. + - `tmp/` + - `/` + - `*.fa`: predicted biosynthetic features as FASTA files, one file per hit HMM. + +</details> + +[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs). +It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments. +BiG-SLiCE is activated with `--bgc_bigslice_run` and requires the HMM database to be supplied via `--bgc_bigslice_db`. At least one BGC source must be enabled: antiSMASH, or GECCO with its convert step set to the `bigslice` format. + #### deepBGC
diff --git a/docs/usage.md b/docs/usage.md index 552195c2..a10e2116 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -169,6 +169,17 @@ When the annotation is run with Prokka, the resulting `.gbk` file passed to anti If antiSMASH is run for BGC detection, we recommend to **not** run Prokka for annotation but instead use the default annotation tool (Pyrodigal), or switch to Prodigal or (for bacteria only!) Bakta. ::: +### BiGSLiCE + +[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). +It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled: + +- antiSMASH (the default BGC tool). +- GECCO with `--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice`. + +BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input. +The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline. + ## Databases and reference files Various tools of nf-core/funcscan use databases and reference files to operate. @@ -527,6 +538,25 @@ deepbgc_db/ └── myDetectors*.pkl ``` +### BiGSLiCE + +BiG-SLiCE requires its own HMM database. Unlike most other tools, the pipeline does **not** auto-download this database — it **must** be supplied manually with `--bgc_bigslice_db`. + +Download the pre-built database archive from the BiG-SLiCE GitHub releases page: + +```bash +wget https://github.com/medema-group/bigslice/releases/download/v2.0.0rc/bigslice-models.2022-11-30.tar.gz +tar -xzf bigslice-models.2022-11-30.tar.gz +``` + +Then supply the extracted directory to the pipeline: + +```bash +--bgc_bigslice_db '////' +``` + +The extracted database directory should contain subdirectories such as `biosynthetic_pfams/` and `sub_pfams/` at the top level. 
+ ### InterProScan [InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow with `--run_protein_annotation` will download and unzip the [InterPro database](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.72-103.0/) version 5.72-103.0. The database can be saved in the output directory `/databases/interproscan/` if the `--save_db` is turned on. diff --git a/modules.json b/modules.json index 72952b97..69d365d1 100644 --- a/modules.json +++ b/modules.json @@ -70,6 +70,11 @@ "git_sha": "72c983560c9b9c2a02ff636451a5e5008f7d020b", "installed_by": ["modules"] }, + "bigslice": { + "branch": "master", + "git_sha": "875cf13d1c974d62483fddd55a02456880363b5c", + "installed_by": ["modules"] + }, "deeparg/downloaddata": { "branch": "master", "git_sha": "81880787133db07d9b4c1febd152c090eb8325dc", diff --git a/modules/nf-core/bigslice/environment.yml b/modules/nf-core/bigslice/environment.yml new file mode 100644 index 00000000..de8fdfbb --- /dev/null +++ b/modules/nf-core/bigslice/environment.yml @@ -0,0 +1,7 @@ +--- +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json +channels: + - conda-forge + - bioconda +dependencies: + - "bioconda::bigslice=2.0.2" diff --git a/modules/nf-core/bigslice/main.nf b/modules/nf-core/bigslice/main.nf new file mode 100644 index 00000000..dc88bcc5 --- /dev/null +++ b/modules/nf-core/bigslice/main.nf @@ -0,0 +1,54 @@ +process BIGSLICE { + tag "$meta.id" + label 'process_medium' + + // WARN: Version information not provided correctly by tool on CLI. Please update version string below when bumping container versions. + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/bigslice:2.0.2--pyh8ed023e_0': + 'biocontainers/bigslice:2.0.2--pyh8ed023e_0' }" + + input: + tuple val(meta), path(bgc, stageAs: 'bgc_files/*') + path(hmmdb) + + output: + tuple val(meta), path("${prefix}/result/data.db") , emit: db + tuple val(meta), path("${prefix}/result/tmp/**/*.fa"), emit: fa + tuple val("${task.process}"), val('bigslice'), eval("echo 2.0.2"), topic: versions, emit: versions_bigslice + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}" + def sample = meta.id + """ + mkdir -p input/dataset/${sample} input/taxonomy + cp bgc_files/* input/dataset/${sample}/ + + printf "# dataset_name\\tdataset_path\\ttaxonomy_path\\tdescription\\n" > input/datasets.tsv + printf "dataset\\tdataset\\ttaxonomy/taxonomy.tsv\\tBGC dataset\\n" >> input/datasets.tsv + + touch input/taxonomy/taxonomy.tsv + + bigslice \\ + $args \\ + --num_threads ${task.cpus} \\ + -i input \\ + --program_db_folder ${hmmdb} \\ + ${prefix} + """ + + stub: + def args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}" + """ + echo $args + + mkdir -p ${prefix}/result/tmp/2e555308dfc411186cf012334262f127 + touch ${prefix}/result/data.db + touch ${prefix}/result/tmp/2e555308dfc411186cf012334262f127/test.fa + """ +} diff --git a/modules/nf-core/bigslice/meta.yml b/modules/nf-core/bigslice/meta.yml new file mode 100644 index 00000000..518f4bf3 --- /dev/null +++ b/modules/nf-core/bigslice/meta.yml @@ -0,0 +1,90 @@ +name: "bigslice" +description: | + A scalable tool for large-scale analysis of Biosynthetic Gene Clusters (BGCs). + It takes genome regions in GenBank format along with an HMM database and produces a SQLite database and FASTA outputs of predicted features. 
+keywords: + - biosynthetic gene clusters + - genomics + - analysis +tools: + - "bigslice": + description: A highly scalable, user-interactive tool for the large scale analysis + of Biosynthetic Gene Clusters data + homepage: "https://github.com/medema-group/bigslice" + documentation: "https://github.com/medema-group/bigslice" + tool_dev_url: "https://github.com/medema-group/bigslice" + doi: "10.1093/gigascience/giaa154" + licence: ["AGPL v3-or-later"] + identifier: "" + +input: + - - meta: + type: map + description: | + Groovy Map containing sample information + e.g. `[ id:'sample1' ]` + - bgc: + type: directory + description: | + Path to a folder containing genomic regions in GenBank format, structured for BiG-SLiCE. + Each genome should have its own subfolder with region `.gbk` files. + The folder should also contain a datasets.tsv, and a taxonomy folder, with TSV taxonomy files per dataset. + See the tool's wiki for more information: https://github.com/medema-group/bigslice/wiki/Input-folder + pattern: "*" + - hmmdb: + type: directory + description: | + Path to the BiG-SLiCE HMM database folder containing biosynthetic and sub Pfams for annotation, in the required BiG-SLiCE format. + An example directory in compressed archive format can be found here: https://github.com/medema-group/bigslice/releases/download/v2.0.0rc/bigslice-models.2022-11-30.tar.gz + +output: + db: + - - meta: + type: map + description: Groovy Map containing sample/dataset information + - ${prefix}/result/data.db: + type: file + description: | + The results SQLite database. Contains various tables relevant to result + BGCs, CDSs, GCFs, HMMs and HSPs. + pattern: "data.db" + ontologies: + - edam: "http://edamontology.org/format_3621" # SQLite format + fa: + - - meta: + type: map + description: Groovy Map containing sample/dataset information + - ${prefix}/result/tmp/**/*.fa: + type: file + description: | + Predicted features as FASTA files. One file per hit HMM. 
+ pattern: "*.fa" + ontologies: + - edam: "http://edamontology.org/format_1929" # FASTA + versions_bigslice: + - - ${task.process}: + type: string + description: The name of the process + - bigslice: + type: string + description: The name of the tool + - echo 2.0.2: + type: eval + description: The expression to obtain the version of the tool + +topics: + versions: + - - ${task.process}: + type: string + description: The name of the process + - bigslice: + type: string + description: The name of the tool + - echo 2.0.2: + type: eval + description: The expression to obtain the version of the tool + +authors: + - "@vagkaratzas" +maintainers: + - "@vagkaratzas" diff --git a/modules/nf-core/bigslice/tests/main.nf.test b/modules/nf-core/bigslice/tests/main.nf.test new file mode 100644 index 00000000..19bcb011 --- /dev/null +++ b/modules/nf-core/bigslice/tests/main.nf.test @@ -0,0 +1,117 @@ +nextflow_process { + + name "Test Process BIGSLICE" + script "../main.nf" + process "BIGSLICE" + config "./nextflow.config" + + tag "modules" + tag "modules_nfcore" + tag "bigslice" + tag "aria2" + tag "untar" + + setup { + run("ARIA2", alias: "ARIA2_HMMDB") { + script "../../aria2/main.nf" + process { + """ + input[0] = [ + [ id:'test_hmm_db' ], + 'https://github.com/medema-group/bigslice/releases/download/v2.0.0rc/bigslice-models.2022-11-30.tar.gz' // https URL + ] + """ + } + } + + run("UNTAR", alias: "UNTAR_HMMDB") { + script "../../untar/main.nf" + process { + """ + input[0] = ARIA2_HMMDB.out.downloaded_file + """ + } + } + + run("ARIA2", alias: "ARIA2_GBK") { + script "../../aria2/main.nf" + process { + """ + input[0] = [ + [ id:'test_gbk' ], + params.modules_testdata_base_path + 'genomics/prokaryotes/streptomyces_coelicolor/fixtures_bigslice_gbk.tar.gz' // https URL + ] + """ + } + } + + run("UNTAR", alias: "UNTAR_GBK") { + script "../../untar/main.nf" + process { + """ + input[0] = ARIA2_GBK.out.downloaded_file + """ + } + } + } + + test("streptomyces_coelicolor - bigslice - 
gbk") { + + when { + process { + """ + // Flatten the GBK directory into a list of individual GBK files with meta + input[0] = UNTAR_GBK.out.untar.map { meta, dir -> + def gbk_files = [] + dir.eachFileRecurse { if (it.name.endsWith('.gbk')) gbk_files << it } + [ meta, gbk_files ] + } + input[1] = UNTAR_HMMDB.out.untar.map{ it -> it[1] } + """ + } + } + + then { + assert process.success + assertAll( + { assert snapshot( + file(process.out.db[0][1]).name, + process.out.fa[0][1].size(), + process.out.findAll { key, val -> key.startsWith("versions")} + ).match() } + ) + } + + } + + test("streptomyces_coelicolor - bigslice - gbk - stub") { + + options "-stub" + + when { + process { + """ + // Flatten the GBK directory into a list of individual GBK files with meta + input[0] = UNTAR_GBK.out.untar.map { meta, dir -> + def gbk_files = [] + dir.eachFileRecurse { if (it.name.endsWith('.gbk')) gbk_files << it } + [ meta, gbk_files ] + } + input[1] = UNTAR_HMMDB.out.untar.map{ it -> it[1] } + """ + } + } + + then { + assert process.success + assertAll( + { assert snapshot( + process.out, + process.out.findAll { key, val -> key.startsWith("versions")} + ).match() } + ) + } + + } + +} diff --git a/modules/nf-core/bigslice/tests/main.nf.test.snap b/modules/nf-core/bigslice/tests/main.nf.test.snap new file mode 100644 index 00000000..c678d7f8 --- /dev/null +++ b/modules/nf-core/bigslice/tests/main.nf.test.snap @@ -0,0 +1,88 @@ +{ + "streptomyces_coelicolor - bigslice - gbk - stub": { + "content": [ + { + "0": [ + [ + { + "id": "test_gbk" + }, + "data.db:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "1": [ + [ + { + "id": "test_gbk" + }, + "test.fa:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "2": [ + [ + "BIGSLICE", + "bigslice", + "2.0.2" + ] + ], + "db": [ + [ + { + "id": "test_gbk" + }, + "data.db:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "fa": [ + [ + { + "id": "test_gbk" + }, + "test.fa:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "versions_bigslice": [ + [ 
+ "BIGSLICE", + "bigslice", + "2.0.2" + ] + ] + }, + { + "versions_bigslice": [ + [ + "BIGSLICE", + "bigslice", + "2.0.2" + ] + ] + } + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.10.3" + }, + "timestamp": "2026-03-04T09:47:43.387153103" + }, + "streptomyces_coelicolor - bigslice - gbk": { + "content": [ + "data.db", + 40, + { + "versions_bigslice": [ + [ + "BIGSLICE", + "bigslice", + "2.0.2" + ] + ] + } + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.10.3" + }, + "timestamp": "2026-03-04T09:47:30.918713387" + } +} \ No newline at end of file diff --git a/modules/nf-core/bigslice/tests/nextflow.config b/modules/nf-core/bigslice/tests/nextflow.config new file mode 100644 index 00000000..2986e346 --- /dev/null +++ b/modules/nf-core/bigslice/tests/nextflow.config @@ -0,0 +1,5 @@ +process { + withName: BIGSLICE { + ext.prefix = "test_bigslice" + } +} diff --git a/nextflow.config b/nextflow.config index e59e3afa..93dfc9c2 100644 --- a/nextflow.config +++ b/nextflow.config @@ -258,6 +258,10 @@ params { bgc_gecco_convertmode = 'clusters' bgc_gecco_convertformat = 'gff' + + bgc_bigslice_run = false + bgc_bigslice_db = null + bgc_run_hmmsearch = false bgc_hmmsearch_models = null bgc_hmmsearch_savealignments = false @@ -578,5 +582,13 @@ validation { monochromeLogs = params.monochrome_logs } +report { + overwrite = true +} + +timeline { + overwrite = true +} + // Load modules.config for DSL2 module specific options includeConfig 'conf/modules.config' diff --git a/nextflow_schema.json b/nextflow_schema.json index ce096772..0777a1d7 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -164,14 +164,14 @@ }, "taxa_classification_mmseqs_taxonomy_sensitivity": { "type": "number", - "default": 5, + "default": 5.0, "help_text": "This flag specifies the speed and sensitivity of the taxonomic search. It stands for how many kmers should be produced during the preliminary seeding stage. A very fast search requires a low value e.g. 
'1.0' and a very sensitive search requires e.g. '7.0'. More details can be found in the [documentation](https://mmseqs.com/latest/userguide.pdf).\n\n> Modifies tool parameter(s):\n> - mmseqs taxonomy: `-s`", "description": "Specify the speed and sensitivity for taxonomy assignment.", "fa_icon": "fas fa-history" }, "taxa_classification_mmseqs_taxonomy_orffilters": { "type": "number", - "default": 2, + "default": 2.0, "help_text": "This flag specifies the sensitivity used for prefiltering the query ORF. Before the taxonomy-assigning step, MMseqs2 searches the predicted ORFs against the provided database. This value influences the speed with which the search is carried out. More details can be found in the [documentation](https://mmseqs.com/latest/userguide.pdf).\n\n> Modifies tool parameter(s):\n> - mmseqs taxonomy: `--orf-filter-s`", "description": "Specify the ORF search sensitivity in the prefilter step.", "fa_icon": "fas fa-history" @@ -388,7 +388,7 @@ "default": "Bacteria", "fa_icon": "fas fa-crown", "description": "Specify the kingdom that the input represents.", - "help_text": "Specifies the kingdom that the input sample is derived from and/or you wish to screen for\n\n> ⚠️ Prokka cannot annotate Eukaryotes.\n\nFor more information please check the Prokka [documentation](https://github.com/tseemann/prokka).\n\n> Modifies tool parameter(s):\n> - Prokka: `--kingdom`", + "help_text": "Specifies the kingdom that the input sample is derived from and/or you wish to screen for\n\n> \u26a0\ufe0f Prokka cannot annotate Eukaryotes.\n\nFor more information please check the Prokka [documentation](https://github.com/tseemann/prokka).\n\n> Modifies tool parameter(s):\n> - Prokka: `--kingdom`", "enum": ["Archaea", "Bacteria", "Mitochondria", "Viruses"] }, "annotation_prokka_gcode": { @@ -409,7 +409,7 @@ }, "annotation_prokka_evalue": { "type": "number", - "default": 0.000001, + "default": 1e-6, "description": "E-value cut-off.", "help_text": "Specify the maximum E-value 
used for filtering the alignment hits.\n\nFor more information please check the Prokka [documentation](https://github.com/tseemann/prokka).\n\n> Modifies tool parameter(s):\n> - Prokka: `--evalue`", "fa_icon": "fas fa-sort-amount-down" @@ -710,7 +710,7 @@ "amp_ampcombi_db": { "type": "string", "description": "The path to the folder containing the reference database files.", - "help_text": "The path to the folder containing the reference database files (`*.fasta` and `*.tsv`); a fasta file and the corresponding table with structural, functional and if reported taxonomic classifications. AMPcombi will then generate the corresponding `mmseqs2` directory, in which all binary files are prepared for the downstream alignment of the recovered AMPs with [MMseqs2](https://github.com/soedinglab/MMseqs2). These can also be provided by the user by setting up an mmseqs2 compatible database using `mmseqs createdb *.fasta` in a directory called `mmseqs2`.\n\nExample file structure for the reference database supplied by the user:\n\n```bash\namp_DRAMP_database/\n├── general_amps_2024_11_13.fasta\n├── general_amps_2024_11_13.txt\n└── mmseqs2\n ├── ref_DB\n ├── ref_DB.dbtype\n ├── ref_DB_h\n ├── ref_DB_h.dbtype\n ├── ref_DB_h.index\n ├── ref_DB.index\n ├── ref_DB.lookup\n └── ref_DB.source```\n\nFor more information check the AMPcombi [documentation](https://ampcombi.readthedocs.io/en/main/usage.html#parse-tables)." + "help_text": "The path to the folder containing the reference database files (`*.fasta` and `*.tsv`); a fasta file and the corresponding table with structural, functional and if reported taxonomic classifications. AMPcombi will then generate the corresponding `mmseqs2` directory, in which all binary files are prepared for the downstream alignment of the recovered AMPs with [MMseqs2](https://github.com/soedinglab/MMseqs2). 
These can also be provided by the user by setting up an mmseqs2 compatible database using `mmseqs createdb *.fasta` in a directory called `mmseqs2`.\n\nExample file structure for the reference database supplied by the user:\n\n```bash\namp_DRAMP_database/\n\u251c\u2500\u2500 general_amps_2024_11_13.fasta\n\u251c\u2500\u2500 general_amps_2024_11_13.txt\n\u2514\u2500\u2500 mmseqs2\n \u251c\u2500\u2500 ref_DB\n \u251c\u2500\u2500 ref_DB.dbtype\n \u251c\u2500\u2500 ref_DB_h\n \u251c\u2500\u2500 ref_DB_h.dbtype\n \u251c\u2500\u2500 ref_DB_h.index\n \u251c\u2500\u2500 ref_DB.index\n \u251c\u2500\u2500 ref_DB.lookup\n \u2514\u2500\u2500 ref_DB.source```\n\nFor more information check the AMPcombi [documentation](https://ampcombi.readthedocs.io/en/main/usage.html#parse-tables)." }, "amp_ampcombi_parsetables_cutoff": { "type": "number", @@ -728,7 +728,7 @@ }, "amp_ampcombi_parsetables_dbevalue": { "type": "number", - "default": 5, + "default": 5.0, "description": "Remove all DRAMP annotations that have an e-value greater than this value.", "help_text": "This e-value is used as a cut-off for the annotations from the internal Diamond alignment step (against the DRAMP database by default). Any e-value below this value will only remove the DRAMP classification and not the entire hit.\n\n> Modifies tool parameter(s):\n> - AMPCOMBI: `--db_evalue`", "fa_icon": "fas fa-sort-numeric-down" @@ -799,14 +799,14 @@ "properties": { "amp_ampcombi_cluster_covmode": { "type": "number", - "default": 0, + "default": 0.0, "description": "MMseqs2 coverage mode.", "help_text": "This assigns the coverage mode to the MMseqs2 cluster module. This determines how AMPs are grouped into clusters. 
More details can be found in the [MMseqs2 documentation](https://mmseqs.com/latest/userguide.pdf).\n\n> Modifies tool parameter(s):\n> - AMPCOMBI: `--cluster_cov_mode`", "fa_icon": "far fa-circle" }, "amp_ampcombi_cluster_sensitivity": { "type": "number", - "default": 4, + "default": 4.0, "description": "MMseqs2 cluster sensitivity.", "help_text": "This assigns the sensitivity of alignment to the MMseqs2 cluster module. This determines how AMPs are grouped into clusters. More information can be obtained in the [MMseqs2 documentation](https://mmseqs.com/latest/userguide.pdf).\n\n> Modifies tool parameter(s):\n> - AMPCOMBI: `--cluster_sensitivity`", "fa_icon": "fas fa-arrows-alt-h" @@ -820,7 +820,7 @@ }, "amp_ampcombi_cluster_mode": { "type": "number", - "default": 1, + "default": 1.0, "description": "MMseqs2 clustering mode.", "help_text": "This assigns the cluster mode to the MMseqs2 cluster module. This determines how AMPs are grouped into clusters. More information can be obtained in the [MMseqs2 documentation](https://mmseqs.com/latest/userguide.pdf).\n\n> Modifies tool parameter(s):\n> - AMPCOMBI: `--cluster_mode`", "fa_icon": "fas fa-circle" @@ -868,7 +868,7 @@ }, "arg_amrfinderplus_identmin": { "type": "number", - "default": -1, + "default": -1.0, "help_text": "Specify the minimum percentage amino-acid identity to reference protein or nucleotide identity for nucleotide reference must have if a BLAST alignment (based on methods: BLAST or PARTIAL) was detected, otherwise NA.\n\n If you specify `-1`, this means use a curated threshold if it exists and `0.9` otherwise.\n\nSetting this value to something other than `-1` will override any curated similarity cutoffs. For BLAST: alignment is > 90% of length and > 90% identity to a protein in the AMRFinderPlus database. 
For PARTIAL: alignment is > 50% of length, but < 90% of length and > 90% identity to the reference, and does not end at a contig boundary.\n\nFor more information check the AMRFinderPlus [documentation](https://github.com/ncbi/amr/wiki/Running-AMRFinderPlus#--organism-option).\n\n> Modifies tool parameter(s):\n> - AMRFinderPlus: `--ident_min`", "description": "Minimum percent identity to reference sequence.", "fa_icon": "fas fa-angle-left" @@ -1465,6 +1465,22 @@ }, "fa_icon": "fas fa-angle-double-right" }, + "bgc_bigslice": { + "title": "BGC: BiG-SLiCE", + "type": "object", + "default": "", + "properties": { + "bgc_bigslice_run": { + "type": "boolean", + "description": "Run BiG-SLiCE to cluster detected BGCs into gene cluster families (GCFs)." + }, + "bgc_bigslice_db": { + "type": "string", + "description": "Path to the pre-downloaded BiG-SLiCE HMM database directory." + } + }, + "description": "Parameters for BiG-SLiCE clustering of biosynthetic gene clusters (BGCs) into gene cluster families (GCFs). 
More info: https://github.com/medema-group/bigslice" + }, "bgc_hmmsearch": { "title": "BGC: hmmsearch", "type": "object", @@ -1782,6 +1798,9 @@ { "$ref": "#/$defs/bgc_gecco" }, + { + "$ref": "#/$defs/bgc_bigslice" + }, { "$ref": "#/$defs/bgc_hmmsearch" }, diff --git a/subworkflows/local/bgc.nf b/subworkflows/local/bgc.nf index e12c21c0..ed728c68 100644 --- a/subworkflows/local/bgc.nf +++ b/subworkflows/local/bgc.nf @@ -13,6 +13,7 @@ include { COMBGC } from '../../modules/local/com include { TABIX_BGZIP as BGC_TABIX_BGZIP } from '../../modules/nf-core/tabix/bgzip' include { MERGE_TAXONOMY_COMBGC } from '../../modules/local/merge_taxonomy_combgc' include { GECCO_CONVERT } from '../../modules/nf-core/gecco/convert' +include { BIGSLICE } from '../../modules/nf-core/bigslice' workflow BGC { take: @@ -116,6 +117,30 @@ workflow BGC { GECCO_CONVERT(ch_gecco_clusters_and_gbk, params.bgc_gecco_convertmode, params.bgc_gecco_convertformat) ch_versions = ch_versions.mix(GECCO_CONVERT.out.versions) } + // BIGSLICE + if (params.bgc_bigslice_run) { + + ch_bigslice_hmmdb = Channel.fromPath(params.bgc_bigslice_db, checkIfExists: true) + .first() + + def gecco_bigslice = !params.bgc_skip_gecco && params.bgc_gecco_runconvert && params.bgc_gecco_convertformat == 'bigslice' + + if (!params.bgc_skip_antismash && gecco_bigslice) { + ch_bigslice_input = ANTISMASH_ANTISMASH.out.gbk_results.mix(GECCO_CONVERT.out.bigslice) + } else if (!params.bgc_skip_antismash) { + ch_bigslice_input = ANTISMASH_ANTISMASH.out.gbk_results + } else { + ch_bigslice_input = GECCO_CONVERT.out.bigslice + } + + ch_bigslice_grouped = ch_bigslice_input + .groupTuple() + .map { meta, files -> + [meta, files.flatten()] + } + + BIGSLICE(ch_bigslice_grouped, ch_bigslice_hmmdb) + } // HMMSEARCH if (params.bgc_run_hmmsearch) { if (params.bgc_hmmsearch_models) { diff --git a/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf b/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf index ea48c2f8..ab697d7b 100644 
--- a/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf +++ b/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf @@ -173,6 +173,14 @@ def validateInputParameters() { error("[nf-core/funcscan] ERROR: when specifying --bgc_gecco_convertmode 'clusters', --bgc_gecco_convertformat can only be set to 'gff'. You specified --bgc_gecco_convertformat '${params.bgc_gecco_convertformat}'. Check input!") } } + if (params.run_bgc_screening && params.bgc_bigslice_run) { + if (params.bgc_skip_antismash && (params.bgc_skip_gecco || !params.bgc_gecco_runconvert || params.bgc_gecco_convertformat != 'bigslice')) { + error('[nf-core/funcscan] ERROR: BiG-SLiCE requires at least one of: (1) antiSMASH enabled, or (2) GECCO enabled with GECCO convert in bigslice format. Please check your parameters.') + } + if (!params.bgc_bigslice_db) { + error('[nf-core/funcscan] ERROR: --bgc_bigslice_run requires a BiG-SLiCE HMM database to be supplied via --bgc_bigslice_db. Please check input.') + } + } } //
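Taken together, the parameters introduced in this diff can be exercised with a minimal params sketch. This is an illustrative fragment only: all parameter names come from the changes above, while the database path is a placeholder standing in for wherever the BiG-SLiCE models archive was extracted.

```groovy
// Illustrative Nextflow config fragment (e.g. passed with `-c bigslice.config`)
// enabling BiG-SLiCE clustering in nf-core/funcscan. The database path below
// is a placeholder, not a real location.
params {
    run_bgc_screening       = true  // run the BGC subworkflow at all
    bgc_bigslice_run        = true  // activate BiG-SLiCE clustering
    bgc_bigslice_db         = '/path/to/bigslice-models.2022-11-30'  // placeholder

    // Optional: also feed GECCO hits to BiG-SLiCE (in addition to antiSMASH)
    bgc_gecco_runconvert    = true
    bgc_gecco_convertmode   = 'gbk'
    bgc_gecco_convertformat = 'bigslice'
}
```

Equivalently, the same switches can be given on the command line as `--bgc_bigslice_run --bgc_bigslice_db <dir>`; the validation added in `utils_nfcore_funcscan_pipeline/main.nf` will reject a run where `--bgc_bigslice_run` is set without a database or without any enabled BGC source.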