Add a fuzzy matching parameter (--max-barcode-errors) to the demux algorithm for higher barcode demux classifications #1580

Open
uzbit wants to merge 3 commits into nanoporetech:master from uzbit:feat_demuxfuzz

Conversation

@uzbit

uzbit commented Mar 13, 2026

It was pointed out to me that the default demux for dorado is not classifying many reads due to exact matching. Perhaps I'm missing something, but judging from the application's help docs, there aren't any flags to modify demux parameters. This PR adds --max-barcode-errors N to allow fuzzy barcode matching with up to N errors, measured by edit distance (insertions, deletions, and substitutions). See below for test results. Note in particular that with my test data and --max-barcode-errors 3, 84.3% of reads are classified into barcodes, versus 4.9% with the default parameters.
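
For readers unfamiliar with the idea, here is a minimal Python sketch of matching a barcode within N errors by edit distance. It is not the code in this PR and not how dorado scores barcodes. The function names, the toy barcode sequences, and the read window are all made up for illustration. It uses a standard semi-global dynamic program that lets the barcode start anywhere in the window at no cost.

```python
def min_edit_distance_in_window(barcode: str, window: str) -> int:
    """Smallest edit distance between `barcode` and any substring of `window`."""
    # prev[i] is the cost of aligning barcode[:i] to a substring of the
    # window that ends at the previous window position.
    prev = list(range(len(barcode) + 1))
    best = prev[-1]
    for base in window:
        curr = [0]  # the match may start at any window position for free
        for i, bc_base in enumerate(barcode, start=1):
            curr.append(min(
                prev[i - 1] + (bc_base != base),  # match or substitution
                prev[i] + 1,                      # extra base in the read
                curr[i - 1] + 1,                  # barcode base missing from the read
            ))
        best = min(best, curr[-1])
        prev = curr
    return best


def classify(window: str, barcodes: dict[str, str], max_errors: int) -> str:
    """Return the best-scoring barcode name, or 'unclassified' above the limit."""
    scores = {name: min_edit_distance_in_window(seq, window)
              for name, seq in barcodes.items()}
    best_name = min(scores, key=scores.get)
    return best_name if scores[best_name] <= max_errors else "unclassified"


# Toy data (hypothetical sequences, not real kit barcodes).
toy_barcodes = {"bc01": "ACGTACGT", "bc02": "TTGGCCAA"}
toy_window = "GGGACGTTCGTGGG"  # bc01 with one substitution, padded with G

print(classify(toy_window, toy_barcodes, max_errors=1))
print(classify(toy_window, toy_barcodes, max_errors=0))
```

This sketch breaks ties by dictionary order and ignores where in the read the match falls, so it is only meant to show what "up to N errors" means.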

DEMUX DEFAULT v1.4:

../../../dorado/build/bin/dorado demux --kit-name SQK-NBD114-96 -v --emit-fastq --emit-summary --no-trim -o demuxed_reads_default/ calls_v1.4.bam
[2026-03-13 15:21:34.337] [info] Running: "demux" "--kit-name" "SQK-NBD114-96" "-v" "--emit-fastq" "--emit-summary" "--no-trim" "-o" "demuxed_reads_default/" "calls_v1.4.bam"
[2026-03-13 15:21:34.338] [info] num input files: 1
[2026-03-13 15:21:34.338] [debug] > barcoding threads 15, writer threads 1
[2026-03-13 15:21:34.338] [info] - Note: FASTQ output is not recommended as not all data can be preserved.
[2026-03-13 15:21:34.338] [debug] Creating output folder: 'demuxed_reads_default'. Length:44
[2026-03-13 15:21:34.350] [info] > starting barcode demuxing
[2026-03-13 15:21:34.351] [debug] > Kits to evaluate: 1
[2026-03-13 15:21:35.342] [debug] Processed 50000 reads
[2026-03-13 15:21:36.792] [debug] Processed 100000 reads
[2026-03-13 15:21:38.228] [debug] Processed 150000 reads
[2026-03-13 15:21:39.676] [debug] Processed 200000 reads
[2026-03-13 15:21:41.073] [debug] Processed 250000 reads
[2026-03-13 15:21:42.552] [debug] Processed 300000 reads
[2026-03-13 15:21:43.967] [debug] Processed 350000 reads
[2026-03-13 15:21:45.405] [debug] Processed 400000 reads
[2026-03-13 15:21:46.817] [debug] Processed 450000 reads
[2026-03-13 15:21:48.249] [debug] Processed 500000 reads
[2026-03-13 15:21:49.708] [debug] Processed 550000 reads
[2026-03-13 15:21:51.205] [debug] Processed 600000 reads
[2026-03-13 15:21:52.651] [debug] Processed 650000 reads
[2026-03-13 15:21:54.083] [debug] Processed 700000 reads
[2026-03-13 15:21:55.570] [debug] Processed 750000 reads
[2026-03-13 15:21:57.026] [debug] Processed 800000 reads
[2026-03-13 15:21:58.485] [debug] Processed 850000 reads
[2026-03-13 15:21:59.739] [debug] Total reads processed: 892590
[2026-03-13 15:22:00.191] [info] > Finished in (ms): 25853
[2026-03-13 15:22:00.191] [info] > Reads written: 892590
[2026-03-13 15:22:00.191] [info] > 892590 reads demuxed @ classifications/s: 3.452559e+04
[2026-03-13 15:22:00.191] [debug] Barcode distribution :
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode23 : 1
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode37 : 1
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode41 : 9375
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode42 : 1130
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode43 : 362
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode44 : 7525
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode45 : 73
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode46 : 212
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode47 : 95
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode48 : 7
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode49 : 78
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode50 : 127
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode51 : 3837
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode52 : 3646
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode53 : 1032
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode54 : 185
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode55 : 355
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode56 : 2996
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode57 : 2987
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode58 : 29
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode59 : 4338
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode60 : 95
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode61 : 3549
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode62 : 146
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode63 : 1399
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode64 : 76
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode65 : 1
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode66 : 1
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode67 : 2
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode69 : 1
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode79 : 1
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode84 : 4
[2026-03-13 15:22:00.191] [debug] SQK-NBD114-96_barcode90 : 6
[2026-03-13 15:22:00.191] [debug] unclassified : 848918
[2026-03-13 15:22:00.191] [debug] Classified rate 4.8927307%
[2026-03-13 15:22:00.191] [info] > finished barcode demuxing

NEW FUZZY BARCODE MATCH v1.4:

../../../dorado/build/bin/dorado demux --kit-name SQK-NBD114-96 -v --emit-fastq --emit-summary --no-trim --max-barcode-errors 3 -o demuxed_reads_fuzzy/ calls_v1.4.bam
[2026-03-13 15:27:48.892] [info] Running: "demux" "--kit-name" "SQK-NBD114-96" "-v" "--emit-fastq" "--emit-summary" "--no-trim" "--max-barcode-errors" "3" "-o" "demuxed_reads_fuzzy/" "calls_v1.4.bam"
[2026-03-13 15:27:48.893] [info] num input files: 1
[2026-03-13 15:27:48.893] [debug] > barcoding threads 15, writer threads 1
[2026-03-13 15:27:48.893] [info] - Note: FASTQ output is not recommended as not all data can be preserved.
[2026-03-13 15:27:48.893] [debug] Creating output folder: 'demuxed_reads_fuzzy'. Length:42
[2026-03-13 15:27:48.905] [info] > starting barcode demuxing
[2026-03-13 15:27:48.906] [debug] > Kits to evaluate: 1
[2026-03-13 15:27:53.338] [debug] Processed 50000 reads
[2026-03-13 15:28:00.084] [debug] Processed 100000 reads
[2026-03-13 15:28:06.741] [debug] Processed 150000 reads
[2026-03-13 15:28:13.573] [debug] Processed 200000 reads
[2026-03-13 15:28:20.468] [debug] Processed 250000 reads
[2026-03-13 15:28:27.474] [debug] Processed 300000 reads
[2026-03-13 15:28:34.160] [debug] Processed 350000 reads
[2026-03-13 15:28:40.945] [debug] Processed 400000 reads
[2026-03-13 15:28:47.783] [debug] Processed 450000 reads
[2026-03-13 15:28:54.782] [debug] Processed 500000 reads
[2026-03-13 15:29:01.908] [debug] Processed 550000 reads
[2026-03-13 15:29:09.121] [debug] Processed 600000 reads
[2026-03-13 15:29:16.343] [debug] Processed 650000 reads
[2026-03-13 15:29:23.593] [debug] Processed 700000 reads
[2026-03-13 15:29:30.727] [debug] Processed 750000 reads
[2026-03-13 15:29:37.827] [debug] Processed 800000 reads
[2026-03-13 15:29:45.007] [debug] Processed 850000 reads
[2026-03-13 15:29:51.206] [debug] Total reads processed: 892590
[2026-03-13 15:29:53.392] [info] > Finished in (ms): 124499
[2026-03-13 15:29:53.392] [info] > Reads written: 892590
[2026-03-13 15:29:53.392] [info] > 892590 reads demuxed @ classifications/s: 7.169455e+03
[2026-03-13 15:29:53.392] [debug] Barcode distribution :
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode12 : 1
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode29 : 1
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode41 : 53149
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode42 : 29729
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode43 : 6985
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode44 : 39717
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode45 : 6559
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode46 : 3361
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode47 : 781
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode48 : 6369
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode49 : 19349
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode50 : 14554
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode51 : 39214
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode52 : 46481
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode53 : 17797
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode54 : 11315
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode55 : 94278
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode56 : 58064
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode57 : 29393
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode58 : 24865
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode59 : 42423
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode60 : 43877
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode61 : 54342
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode62 : 75994
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode63 : 17570
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode64 : 16580
[2026-03-13 15:29:53.392] [debug] SQK-NBD114-96_barcode65 : 2
[2026-03-13 15:29:53.393] [debug] SQK-NBD114-96_barcode66 : 5
[2026-03-13 15:29:53.393] [debug] SQK-NBD114-96_barcode67 : 2
[2026-03-13 15:29:53.393] [debug] SQK-NBD114-96_barcode68 : 5
[2026-03-13 15:29:53.393] [debug] SQK-NBD114-96_barcode81 : 1
[2026-03-13 15:29:53.393] [debug] unclassified : 139827
[2026-03-13 15:29:53.393] [debug] Classified rate 84.33469%
[2026-03-13 15:29:53.393] [info] > finished barcode demuxing
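
The headline numbers in the two logs can be re-derived from the read counts and run times they report. The short Python sketch below does that arithmetic. The only inputs are figures copied from the logs above, and the variable and function names are my own.

```python
TOTAL_READS = 892_590  # "Total reads processed" in both runs

# Unclassified counts and elapsed times copied from the two logs.
runs = {
    "default": {"unclassified": 848_918, "elapsed_ms": 25_853},
    "fuzzy (--max-barcode-errors 3)": {"unclassified": 139_827, "elapsed_ms": 124_499},
}


def summarize(total: int, unclassified: int, elapsed_ms: int) -> dict:
    """Derive classified count, classified percentage, and throughput."""
    classified = total - unclassified
    return {
        "classified": classified,
        "classified_pct": 100.0 * classified / total,
        "reads_per_s": total / (elapsed_ms / 1000.0),
    }


summaries = {name: summarize(TOTAL_READS, **run) for name, run in runs.items()}
slowdown = (runs["fuzzy (--max-barcode-errors 3)"]["elapsed_ms"]
            / runs["default"]["elapsed_ms"])

for name, s in summaries.items():
    print(f"{name}: {s['classified']} classified "
          f"({s['classified_pct']:.2f}%), {s['reads_per_s']:.0f} reads/s")
print(f"fuzzy run took {slowdown:.1f}x as long as the default run")
```

On this dataset the fuzzy run classifies far more reads but takes several times longer than the default run.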

@palakpsheth

@uzbit love this

@malton-ont
Collaborator

Hi @uzbit,

Could you explain what issue this PR is attempting to resolve? If you are having issues with classification rates I would recommend opening an issue so we can take a look at your data in the first instance - there may be valid reasons why dorado is refusing to classify these reads.

> It was pointed out to me that the default demux for dorado is not classifying many reads due to exact matching

This is incorrect - dorado does not require exact matching of barcodes. It already allows a number of mismatches (see here where we check the barcode penalty against m_scoring_params.max_barcode_penalty).

> there aren't any flags to modify demux parameters

This is true, but scoring parameters can be overridden by creating a custom barcode configuration - see the documentation here. Note that this needs to be a full custom configuration, including the barcode sequences. We're looking at making this easier in a future release.

Regarding the PR itself, your barcode_fuzzy function skips a number of checks that the standard barcoding requires: barcode proximity to the ends of the read, flank scores, and midstrand barcode detection (in fact, it seems to directly permit and classify based on midstrand barcodes?). It is also not suitable for double-ended barcode kits, but it does not guard against their use.
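
To illustrate the kind of positional checks mentioned here, the following is a rough Python sketch. It is not dorado's implementation. The function names, the `end_proximity` parameter, the kit name, and every number in the example are hypothetical. It labels a barcode hit as front, rear, or midstrand by its distance from the read ends, and it refuses double-ended kits outright.

```python
def barcode_location(read_len: int, start: int, end: int, end_proximity: int) -> str:
    """Label a hit by position; `start`/`end` are 0-based, `end` exclusive."""
    if start < 0 or end > read_len or start >= end:
        raise ValueError("hit must lie inside the read")
    if start <= end_proximity:
        return "front"       # close enough to the start of the read
    if read_len - end <= end_proximity:
        return "rear"        # close enough to the end of the read
    return "midstrand"       # a candidate for rejection rather than classification


def check_kit_supported(kit_name: str, double_ended: bool) -> None:
    """Guard: fail early instead of silently mis-handling double-ended kits."""
    if double_ended:
        raise ValueError(
            f"fuzzy matching sketch does not support double-ended kit {kit_name!r}"
        )


# Illustrative values only.
print(barcode_location(read_len=1000, start=10, end=34, end_proximity=75))
print(barcode_location(read_len=1000, start=480, end=504, end_proximity=75))
```

A fuzzy matcher that calls `barcode_location` before classifying could at least report midstrand hits separately instead of counting them as classified reads.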

@uzbit
Author

uzbit commented Mar 25, 2026

> Could you explain what issue this PR is attempting to resolve? If you are having issues with classification rates I would recommend opening an issue so we can take a look at your data in the first instance - there may be valid reasons why dorado is refusing to classify these reads.

I will verify I can share this dataset with you before opening an issue. Thanks for the offer and I'll get back to you about this soon.

> Regarding the PR itself, your barcode_fuzzy function is skipping a number of checks that the standard barcoding requires - barcode proximity to the ends of the read, flank scores, midstrand barcode detection (in fact, it seems to directly permit and classify based on midstrand barcodes?). It is also not suitable for double-ended barcode kits, but does not make any guards against using these.

Granted, this function does loosen all of the restrictions on matching barcodes to sequences, and that is the point. It is intended as a diagnostic tool for examining a given dataset. At least for my dataset, the default parameters seem too strict. I understand if you'd like to close this, and I look forward to user-modifiable demuxing parameters.
