Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Added`

- Add SNV scoring by MIVMIR, GICAM models [#812](https://github.com/nf-core/raredisease/pull/812)

### `Changed`

- Replace `ch_publish`/`subworkflow_results` with named typed channel emits for qc_bam subworkflow [#853](https://github.com/nf-core/raredisease/pull/853)
Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -175,7 +175,7 @@ For more details about the output files and reports, please refer to the

nf-core/raredisease was written in a collaboration between the Clinical Genomics nodes in Sweden, with major contributions from [Ramprasad Neethiraj](https://github.com/ramprasadn), [Anders Jemt](https://github.com/jemten), [Lucia Pena Perez](https://github.com/Lucpen), and [Mei Wu](https://github.com/projectoriented) at Clinical Genomics Stockholm.

Additional contributors were [Sima Rahimi](https://github.com/sima-r), [Gwenna Breton](https://github.com/Gwennid) and [Emma Västerviga](https://github.com/EmmaCAndersson) (Clinical Genomics Gothenburg); [Halfdan Rydbeck](https://github.com/hrydbeck) and [Lauri Mesilaakso](https://github.com/ljmesi) (Clinical Genomics Linköping); [Subazini Thankaswamy Kosalai](https://github.com/sysbiocoder) (Clinical Genomics Örebro); [Annick Renevey](https://github.com/rannick), [Peter Pruisscher](https://github.com/peterpru) and [Eva Caceres](https://github.com/fevac) (Clinical Genomics Stockholm); [Ryan Kennedy](https://github.com/ryanjameskennedy) (Clinical Genomics Lund); [Anders Sune Pedersen](https://github.com/asp8200) (Danish National Genome Center) and [Lucas Taniguti](https://github.com/lmtani).
Additional contributors were [Sima Rahimi](https://github.com/sima-r), [Gwenna Breton](https://github.com/Gwennid) and [Emma Västerviga](https://github.com/EmmaCAndersson) (Clinical Genomics Gothenburg); [Halfdan Rydbeck](https://github.com/hrydbeck) and [Lauri Mesilaakso](https://github.com/ljmesi) (Clinical Genomics Linköping); [Subazini Thankaswamy Kosalai](https://github.com/sysbiocoder) (Clinical Genomics Örebro); [Annick Renevey](https://github.com/rannick), [Peter Pruisscher](https://github.com/peterpru), [Eva Caceres](https://github.com/fevac) and [Tor Björgen](https://github.com/torbjorgen) (Clinical Genomics Stockholm); [Ryan Kennedy](https://github.com/ryanjameskennedy) (Clinical Genomics Lund); [Anders Sune Pedersen](https://github.com/asp8200) (Danish National Genome Center) and [Lucas Taniguti](https://github.com/lmtani).
Comment thread
torbjorgen marked this conversation as resolved.

We thank the nf-core community for their extensive assistance in the development of this pipeline.

Expand Down
121 changes: 121 additions & 0 deletions assets/rank_model_genmod_gicam.ini
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
[Version]
version = 1
name = rank_model_for_gicam

[Categories]

[[inheritance_models]]
category_aggregation = min

[[variant_call_quality_filter]]
category_aggregation = sum

[model_score]
category = variant_call_quality_filter
data_type = integer
description = Inheritance model score
field = INFO
info_key = ModelScore
record_rule = min
separators = ',',':',

[[not_reported]]
score = 0

[[low_qual]]
score = -5
lower = 0
upper = 10

[[medium_qual]]
score = -2
lower = 10
upper = 20

[[high_qual]]
score = 0
lower = 20
upper = 300

[genetic_models]
category = inheritance_models
data_type = string
description = Inheritance models followed for the variant
field = INFO
info_key = GeneticModels
record_rule = max
separators = ',', ':', '|',

[[ad]]
priority = 1
score = 1
string = 'AD'

[[ad_dn]]
score = 1
priority = 1
string = 'AD_dn'

[[ar]]
score = 1
priority = 1
string = 'AR_hom'

[[ar_dn]]
score = 1
priority = 1
string = 'AR_hom_dn'

[[ar_comp]]
score = 1
priority = 1
string = 'AR_comp'

[[ar_comp_dn]]
score = 1
priority = 1
string = 'AR_comp_dn'

[[xr]]
score = 1
priority = 1
string = 'XR'

[[xr_dn]]
score = 1
priority = 1
string = 'XR_dn'

[[xd]]
score = 1
priority = 1
string = 'XD'

[[xd_dn]]
score = 1
priority = 1
string = 'XD_dn'

[[not_reported]]
score = -12

[filter]
category = variant_call_quality_filter
data_type = string
description = The filters for the variant
field = FILTER
record_rule = min
separators = ';',

[[not_reported]]
score = 0

[[pass]]
score = 3
priority = 1
string = 'PASS'

[[dot]]
score = 3
priority = 2
string = '.'
21 changes: 20 additions & 1 deletion conf/modules/rank_variants.config
Original file line number Diff line number Diff line change
Expand Up @@ -80,9 +80,28 @@ process {

withName: '.*RANK_VARIANTS_SNV:TABIX_BGZIPTABIX' {
ext.args2 = '-p vcf'
ext.prefix = { "${meta.id}_snv_ranked_${meta.set}" }
// Avoid BCFTOOLS i/o nameclash by appending genmod_ when ranking with mivmir, gicam
// VCF name is overridden to fit naming convention BCFTOOLS_APPEND_MIVMIR_GICAM
ext.prefix = { "${meta.id}_snv_ranked_${params.rank_with_mivmir_gicam ? 'genmod_': ''}${meta.set}" }
}

withName: '.*RANK_VARIANTS_SNV:MIVMIR_INFER' {
// VCF key name change tweak for GRCh38
ext.args = '--override_vcf_input_keys GNOMADAF_popmax:GNOMADAF_grpmax'
}

withName: '.*RANK_VARIANTS_SNV:TABIX_BGZIPTABIX_GICAM' {
ext.prefix = { "${meta.id}_snv_ranked_gicam_${meta.set}" }
}

withName: '.*RANK_VARIANTS_SNV:BCFTOOLS_APPEND_MIVMIR_GICAM' {
ext.args = { [
'--columns MivmirScore,MivmirExplanation,GicamScore',
'--write-index=tbi',
'--output-type z'
].join(' ') }
ext.prefix = { "${meta.id}_snv_ranked_${meta.set}" }
}
}

//
Expand Down
96 changes: 96 additions & 0 deletions docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Filtering and ranking](#filtering-and-ranking)
- [Filter_vep](#filter_vep)
- [GENMOD](#genmod)
- [MIVMIR, GICAM](#mivmir-gicam)
- [Mobile element analysis](#mobile-element-analysis)
- [Calling mobile elements](#calling-mobile-elements)
- [Annotating mobile elements](#annotating-mobile-elements)
Expand Down Expand Up @@ -527,6 +528,101 @@ We recommend using vcfanno to annotate SNVs with precomputed CADD scores (files

</details>

#### MIVMIR, GICAM

[MIVMIR](../modules/local/mivmir/meta.yml) and [GICAM](../modules/local/gicam/meta.yml) are two machine learning models used to
infer a pathogenicity score for SNVs, INDELs. In essence, MIVMIR infer SNV pathogenicity and GICAM improves precision
for duo, trio, ... analysis.

The models are enabled as a single entity, to achieve the best performance.
GICAM has a dependency on MIVMIR, acting as an input feature to GICAM.

MIVMIR, GICAM can be enabled by setting the `--rank_with_mivmir_gicam` feature flag.

Only `<outdir>/rank_and_filter/<case_id>_snv_ranked_<research|clinical>.vcf.gz` contains the output annotations.

MIVMIR and GICAM [source code and performance metrics available here](https://github.com/Clinical-Genomics/rdds/tree/master/src/rdds/).

##### MIVMIR

MIVMIR is a deep neural network regression type model trained on ClinVar pathogenic
variants and an Ashkenazim trio negative background. The purpose of the model
is to infer pathogenicity/ rank scores from a set of input annotations.
The model is scoring every variant in isolation.

Model is optimized to generalize well on Scilifelab Clinical Genomics Stockholm cohort,
and does have some custom input features (`SWEGENAF`, `Frq`) as by our
local pipeline configuration.

> It is possible to run model inference on variants without SWEGENAF, Frq
> annotations. These annotations are dropped out during training to simulate
> lack of previous clinical evidence in order to regularize the model.
> However, such configuration is highly experimental, unsupported
> and at your own discretion.

###### Requirements

MIVMIR relies on the following VCF annotations/keys as variant feature inputs:

- `CSQ/PolyPhen`
- `CSQ/SIFT`
- `CSQ/CLINVAR_CLNREVSTAT`
- `CSQ/CLINVAR_CLNSIG`
- `CSQ/MaxEntScan_alt`
- `CSQ/MaxEntScan_diff`
- `CSQ/MES-SWA_acceptor_alt`
- `CSQ/MES-SWA_donor_alt`
- `CSQ/MES-SWA_donor_diff`
- `CSQ/SpliceAI_pred_DS_AL`
- `CSQ/SpliceAI_pred_DS_DG`
- `CSQ/SpliceAI_pred_DS_DL`
- `CSQ/REVEL_score`
- `CSQ/LoFtool`
- `CSQ/GERP++_RS`
- `CSQ/phastCons100way_vertebrate`
- `CSQ/phyloP100way_vertebrate`
- `CSQ/SpliceAI_pred_DS_AG`
- `most_severe_consequence`
- `CADD`
- `SWEGENAF` ([SweGen Variant Frequency Dataset](https://swefreq.nbis.se/dataset/SweGen))
- `GNOMADAF_popmax`
- `Frq` (variant frequency from a local database)

> If any annotation is missing, it will be **_*silently*_** interpreted as `''` or `0.0` depending on data type.

> For GRCh38, GnomAD has renamed the `GNOMADAF_popmax` to `GNOMADAF_grpmax`.
> Adjust this via the `--override_vcf_input_keys [OLD_KEY:NEW_KEY],[...]` flag.

The tool adds two keys to the VCF

- `MivmirScore`, a score in range (0, 1) where 1.0 inferred pathogenic.
- `MivmirExplanation`, a key value array type str-float explaining the additive contribution of each annotation
to the final variant score. The explanations are sorted in decreasing order, the largest input feature
contributor is listed first. This key is only available for high scoring variants for compute performance
reasons.

##### GICAM

Model for SNV ranking in conjunction with MIVMIR.

This model improves precision for duo, trio, ... analysis situations by reducing MIVMIR scores for variants that's
not following the appropriate GENMOD genetic inheritance model. Applied to MIVMIR scores as a post-processing step.

###### Requirements

GICAM relies on the following VCF annotations/keys as variant feature inputs:

- `MivmirScore`, (0, 1)
- `RankScoreNormalized`, (0, 1) as produced by GENMOD using rank_model_genmod_gicam.ini custom scoring config

The tool adds one key to the VCF

- `GicamScore` (0, 1) where 1.0 inferred pathogenic.

> GICAM is optimized for the GENMOD scoring config present in the gicam module directory, that
> generates RankScoreNormalized.
> Modifications to the GENMOD scoring config will break inference (unless GICAM is first retrained on the new config).

### Mobile element analysis

#### Calling mobile elements
Expand Down
3 changes: 2 additions & 1 deletion docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -290,7 +290,7 @@ The mandatory and optional parameters for each category are tabulated below.
| vcfanno_toml<sup>3</sup> | vep_filters/vep_filters_scout_fmt<sup>10</sup> |
| vep_cache_version | cadd_resources<sup>11</sup> |
| vep_cache<sup>4</sup> | run_vcfanno_db_sanity_check<sup>12</sup> |
| gnomad_af<sup>5</sup> | |
| gnomad_af<sup>5</sup> | rank_with_mivmir_gicam<sup>13</sup> |
| score_config_snv<sup>6</sup> | |
| variant_consequences_snv<sup>7</sup> | |
| vep_plugin_files<sup>8</sup> | |
Expand All @@ -310,6 +310,7 @@ no header and the following columns: `CHROM POS REF_ALLELE,ALT_ALLELE AF`. Sampl
<sup>10</sup> This file contains a list of candidate genes (with [HGNC](https://www.genenames.org/) IDs) that is used to split the variants into candidate variants and research variants. Research variants contain all the variants, while candidate variants are a subset of research variants and are associated with candidate genes. Sample file [here](https://github.com/nf-core/test-datasets/blob/raredisease/reference/hgnc.txt). Not required if `--skip_subworkflows generate_clinical_set` is set.<br />
<sup>11</sup>Path to a folder containing cadd annotations. Equivalent of the data/annotations/ folder described [here](https://github.com/kircherlab/CADD-scripts/#manual-installation), and it is used to calculate CADD scores for small indels. <br />
<sup>12</sup>When set to `true`, each vcfanno database file listed in `vcfanno_resources` is checked for records (non-header lines). Any database with zero records is removed from the vcfanno TOML config before annotation runs to prevent vcfanno from crashing on default resource files. Default: `false`.<br />
<sup>13</sup> Enable variant SNV-INDEL scoring using MIVMIR, GICAM machine learning models.

:::note
We use CADD only to annotate small indels. To annotate SNVs with precomputed CADD scores, pass the file containing CADD scores as a resource to vcfanno instead. Files containing the precomputed CADD scores for SNVs can be downloaded from [here](https://cadd.gs.washington.edu/download) (download files listed under the description: "All possible SNVs of GRCh3<7/8>/hg3<7/8>")
Expand Down
6 changes: 6 additions & 0 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -110,6 +110,7 @@ workflow NFCORE_RAREDISEASE {
val_readcount_intervals
val_reduced_penetrance
val_rtg_truthvcfs
val_rank_with_mivmir_gicam
val_run_mt_for_wes
val_run_rtgvcfeval
val_run_vcfanno_db_sanity_check
Expand Down Expand Up @@ -220,6 +221,8 @@ workflow NFCORE_RAREDISEASE {
ch_score_config_mt = channelFromPath(val_score_config_mt, true)
ch_score_config_snv = channelFromPath(val_score_config_snv, true)
ch_score_config_sv = channelFromPath(val_score_config_sv, true)
// ch_genmod_gicam_score_config is integral to GICAM inference; it cannot be changed without retraining gicam
ch_score_config_genmod_gicam = channel.fromPath("$projectDir/assets/rank_model_genmod_gicam.ini", checkIfExists: true).collect()
Comment thread
torbjorgen marked this conversation as resolved.
ch_vcf2cytosure_blacklist = channelFromPath(val_vcf2cytosure_blacklist, true)
ch_vcfanno_lua = channelFromPath(val_vcfanno_lua, true)
ch_vcfanno_toml = channelFromPath(val_vcfanno_toml, true)
Expand Down Expand Up @@ -436,6 +439,7 @@ workflow NFCORE_RAREDISEASE {
ch_score_config_mt,
ch_score_config_snv,
ch_score_config_sv,
ch_score_config_genmod_gicam,
ch_sdf,
ch_sentieon_pcr_indel_model,
ch_subdepth,
Expand Down Expand Up @@ -518,6 +522,7 @@ workflow NFCORE_RAREDISEASE {
val_multiqc_samples,
val_outdir,
val_platform,
val_rank_with_mivmir_gicam,
val_run_mt_for_wes,
val_run_rtgvcfeval,
val_run_vcfanno_db_sanity_check,
Expand Down Expand Up @@ -673,6 +678,7 @@ workflow {
params.readcount_intervals,
params.reduced_penetrance,
params.rtg_truthvcfs,
params.rank_with_mivmir_gicam,
params.run_mt_for_wes,
params.run_rtgvcfeval,
params.run_vcfanno_db_sanity_check,
Expand Down
33 changes: 33 additions & 0 deletions modules/local/gicam/main.nf
Comment thread
torbjorgen marked this conversation as resolved.
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
process GICAM_INFER {
// https://github.com/Clinical-Genomics/rdds/tree/master/src/rdds/gicam

tag "${meta.id}"
label 'process_high'

container "docker.io/clinicalgenomics/rdds_mivmir:v1.13.0"

beforeScript "mkdir ${task.workDir}/rdds-tmp"
afterScript "rm -r ${task.workDir}/rdds-tmp"
containerOptions {[
workflow.containerEngine.equals("singularity") ? "--bind ${task.workDir}/rdds-tmp:/rdds/tmp" : "",
workflow.containerEngine.equals("docker") ? "--tmpfs /rdds/tmp": "",
""
].minus("").join(" ")}
Comment thread
fellen31 marked this conversation as resolved.

input:
tuple val(meta), path(input_vcf)

output:
tuple val(meta), path('*-predictions.vcf'), emit: vcf
Comment thread
torbjorgen marked this conversation as resolved.
tuple val("${task.process}"), val('gicam'), val('v1.13.0'), topic: versions, emit: versions_gicam

when:
task.ext.when == null || task.ext.when

script:
"""
. /opt/pyenv/bin/activate
export PYTHONPATH=/rdds/src
python3 -m rdds.gicam infer-vcf --cpu_cores ${task.cpus} ${input_vcf}
"""
}
Loading
Loading