Add a fuzzy matching parameter (--max-barcode-errors) to the demux algorithm for higher barcode demux classifications #1580
@uzbit love this
Hi @uzbit, could you explain what issue this PR is attempting to resolve? If you are having issues with classification rates, I would recommend opening an issue so we can take a look at your data in the first instance - there may be valid reasons why dorado is refusing to classify these reads.
This is incorrect - dorado does not require exact matching of barcodes. It already allows a number of mismatches (see here where we check the barcode penalty against
This is true, but scoring parameters can be overridden by creating a custom barcode configuration - see the documentation here. Note that this needs to be a full custom configuration, including the barcode sequences. We're looking at making this easier in a future release. Regarding the PR itself, your
I will verify I can share this dataset with you before opening an issue. Thanks for the offer and I'll get back to you about this soon.
Granted, this function does loosen all of the restrictions on matching barcodes to sequences, and that is the point. It is intended as a diagnostic tool for examining a given dataset. At least for my dataset, the default parameters seem too strict. I understand if you'd like to close this, and I look forward to when you add user-modifiable demuxing parameters.
It was pointed out to me that the default demux for dorado is not classifying many reads, apparently because it requires exact matching. Perhaps I'm missing something, but from the application's help docs there don't appear to be any flags for modifying demux parameters. This PR adds --max-barcode-errors N to allow fuzzy barcode matching with up to N errors, measured by edit distance (indels + substitutions). See below for test results. Note in particular that with my test data and max-barcode-errors=3, 84.3% of reads are classified into barcodes, versus 4.9% with the default parameters.
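For readers unfamiliar with the idea, here is a minimal Python sketch of edit-distance barcode matching with an error cap. It is not dorado's code and not the code in this PR; the function names, the barcode sequences, and the tie-breaking rule (ambiguous reads stay unclassified) are all assumptions made for illustration.

```python
from typing import Dict, Optional


def min_edit_distance_in_window(barcode: str, window: str) -> int:
    """Smallest edit distance (substitutions + indels) between `barcode`
    and any substring of `window` (semi-global alignment)."""
    # Row 0 is all zeros: the barcode may start anywhere in the window for free.
    prev = [0] * (len(window) + 1)
    for i, b in enumerate(barcode, start=1):
        # Column 0: the first i barcode bases are unmatched before the window starts.
        curr = [i] + [0] * len(window)
        for j, w in enumerate(window, start=1):
            curr[j] = min(
                prev[j - 1] + (b != w),  # match or substitution
                prev[j] + 1,             # barcode base missing from the read
                curr[j - 1] + 1,         # extra base in the read
            )
        prev = curr
    # The barcode may also end anywhere in the window.
    return min(prev)


def classify(
    read: str,
    barcodes: Dict[str, str],
    max_errors: int,
    window_size: Optional[int] = None,
) -> Optional[str]:
    """Return the name of the best-matching barcode, or None if unclassified.

    Hypothetical policy: accept the closest barcode only if its distance is
    at most `max_errors` and no other barcode ties with it.
    """
    window = read[:window_size]  # window_size=None searches the whole read
    scored = sorted(
        (min_edit_distance_in_window(seq, window), name)
        for name, seq in barcodes.items()
    )
    if not scored:
        return None
    best_dist, best_name = scored[0]
    if best_dist > max_errors:
        return None
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # ambiguous: two barcodes match equally well
    return best_name


# Made-up barcodes and a read carrying bc01 with one substitution.
barcodes = {"bc01": "ACGTACGTAC", "bc02": "TTGGCCAATT"}
read = "GG" + "ACGTTCGTAC" + "GGGGGGGG"
print(classify(read, barcodes, max_errors=1))
print(classify(read, barcodes, max_errors=0))
```

Raising `max_errors` trades specificity for yield: more reads are assigned, but each barcode's acceptance radius grows, so the minimum distance between barcodes in the kit bounds how large the cap can safely be.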
DEMUX DEFAULT v1.4:
NEW FUZZY BARCODE MATCH v1.4: