Versatile, fast, exact FR/CDR annotation of TCR and BCR sequences — mRNA and protein in FASTA, and reads in FASTQ from both amplicon and bulk RNA-seq — for nucleotide and amino-acid input, across all loci at once.
arda does the expensive IgBLAST work once, offline — building a pre-aligned
reference database of every in-frame V·J germline scaffold with FR1–4 / CDR1–3
markup — then at runtime maps your sequences to that database with MMseqs2 and
transfers the markup through the alignment in a small C++ hot path. The result
is an AIRR-formatted annotation that matches
IgBLAST (≈97% region concordance on real GenBank mRNA), from a plain CLI + Python
library — no Docker, no workflow engine.
IgBLAST is the gold standard but is slow to invoke per-batch and awkward to embed.
arda keeps IgBLAST-quality region calls while being:
- Fast & scalable — MMseqs2 search + a C++ projection step; multiprocessing and SLURM-friendly from small FASTA to large FASTQ.
- Embeddable —
import arda; arda.annotate_sequences(...). - Easy to install — conda for the
mmseqsbinary,pip install -e .for the package + C++ extension; IgBLAST is fetched into a gitignoredbin/and is only needed to (re)build the reference DB, not at runtime.
bash setup.sh # creates conda env `arda`, fetches IgBLAST, pip install -e .
conda activate ardaFlags: --no-conda (use the active env), --build-db (rebuild references after
install), --tests (run the fast suites). The committed database/vdj/<organism>/
references mean most users never need to build anything.
Supported organisms: human, mouse (full IG + TR), rat, rabbit, rhesus_monkey (IG only — IgBLAST ships no TR internal annotation for these).
arda info # resolved paths + tool availability
arda annotate -i reads.fastq -o out.airr.tsv --organism human --seqtype nt
arda annotate -i prot.fasta -o out.airr.tsv --organism human --seqtype aa
arda annotate -i reads.fastq -o out.airr.tsv --strand forward # plus-strand only
arda build-db --organism all # rebuild references (needs IgBLAST)
arda build-index --organism all # (re)build the precompiled mmseqs DBs
arda slurm -i big.fastq -o big.airr.tsv --shards 50 --partition cpu # cluster scaleSee examples/ for a runnable per-locus demo and
benchmarks/RESULTS.md for measured speed/accuracy.
The reference database ships with precompiled MMseqs2 indexes
(database/vdj/<organism>/mmseqs/), so annotation runs out of the box with no
build step. They are used automatically when the local MMseqs2 version matches the
shipped one; otherwise arda transparently rebuilds a private cache on first run
(arda build-index regenerates the shipped DBs for your version).
Input may be FASTA or FASTQ, plain or gzipped. Nucleotide input is searched on
both strands by default (reverse-complement reads are re-oriented and flagged
rev_comp=T); a single search annotates a mixed bulk RNA-seq file across all loci.
import arda
records = arda.annotate_sequences(
["GACGTGCAG...", ("clone7", "CAGGTG...")], # strings or (id, seq) pairs
seqtype="nt", organism="human",
)
# -> list of AIRR record dicts: v_call, d_call/d2_call, j_call, fwr1..fwr4,
# cdr1..cdr3, *_start/*_end (1-based closed), *_aa, junction(_aa), np1/np2/np3,
# v_sequence_end, j_sequence_start, productive, rev_comp, ...There is no coverage filter, so a V-only or J-only query maps to its
scaffold and only the regions inside the query's coverage are returned. This lets
you annotate isolated germline V or J alleles without synthesising a
rearrangement — a bare V yields fwr1..fwr3, a bare J yields fwr4:
from arda.annotate.mapper import annotate_records
recs = annotate_records(
[("TRBV9*01", v_germline_nt), ("TRBJ2-7*01", j_germline_nt)],
organism="human", seqtype="nt", strand="forward", map_d=False,
)
# V record -> fwr1/cdr1/fwr2/cdr2/fwr3 (+ v_sequence_end = CDR3 start)
# J record -> fwr4 (+ j_sequence_start = CDR3 end / FR4 start)(mirpy uses exactly this to bake per-allele FR/CDR subsequences into its gene
library; see tests/synthetic/test_germline_segments.py.)
- Reference build (
arda.refbuild, offline): download IMGT/V-QUEST germlines → enumerate deduplicated in-frame V×J scaffolds (D only affects CDR3 interior, so it isn't enumerated) → annotate withigblastn -outfmt 19→ translate → writedatabase/vdj/<organism>/{alleles.fasta, alleles.aa.fasta, markup.tsv, markup.aa.tsv, combinations.tsv, build.log}. - Runtime (
arda.annotate): MMseqs2 search query→scaffolds → best hit → C++transfer_regionsprojects scaffold region coordinates onto the query (handling indels, truncation, mid-codon alignment starts, reverse strand) → for VDJ loci a gapless C++ local alignment of the CDR3 interior against the D germlines addsd_call/d2_call+np*→ AIRR TSV. Out-of-frame junctions are reported with an N-bridge (_) so FR4 still reads.
See memory/ for design rationale and gotchas. Fast sequence
primitives (translate, detect_coding_frame, reverse_complement,
back_translate) live in the C++ extension and are re-exported from
arda.refbuild.translate — mirpy-API-compatible, so mirpy can import arda and
reuse them.
Exact annotation that matches IgBLAST while being several times faster, scaling to
large FASTQ. Synthetic human IGH, 16 threads (scripts/bench_vs_igblast.py):
| sequences | arda | arda rate | speedup vs IgBLAST | region concordance |
|---|---|---|---|---|
| 10,000 | 5.5s | ~1.8k/s | 4.4× | 98.9% |
| 50,000 | 16s | ~3.0k/s | 7.3× | |
| 100,000 | 30s | ~3.3k/s | 7.9× |
On ~7.3k real GenBank mRNA records spanning all five organisms and their loci
(committed, gzipped test fixtures), region concordance with IgBLAST on productive
records is 98–99.7% per organism; junction_aa/cdr3_aa match IgBLAST ~99%
and satisfy the AIRR invariants exactly. V-gene assignment agrees ~100%. (GenBank
also contains genomic/partial/non-productive entries that confuse both tools; those
are excluded from the comparison.)
Bulk RNA-seq is much faster than amplicon, because mmseqs prefilters by k-mer
matching — reads with no receptor k-mer are rejected before alignment. At 150 nt
reads, 16 threads (scripts/bench_prefilter.py):
| receptor content | throughput |
|---|---|
| 100% (amplicon) | ~5.7k reads/s |
| 10% | ~19k reads/s |
| 1% (blood RNA-seq) | ~25k reads/s |
Extrapolated to a 32-core node, a 30M-read bulk RNA-seq library (~1% receptor)
annotates in roughly 10–20 min — the same order of magnitude as a STAR genome
alignment pass on the same data (STAR is faster per read, but arda maps only to a
tiny germline DB and the non-receptor majority costs just prefilter rejection).
Large FASTQ is streamed in bounded chunks (a background reader prefetches the
next chunk while the current one is annotated), so memory stays flat regardless of
input size — --chunk-size tunes it.
See ROADMAP.md. Done: V·J reference build (5 organisms), MMseqs2
mapping, C++ markup transfer, reverse-complement, all-loci querying, streaming I/O,
out-of-frame junctions, D-segment mapping incl. D-D fusions, precompiled
indexes, multi-node (SLURM) sharding. Next: full AIRR productivity.
pip install -e . # rebuilds the C++ ext on import
python -m pytest tests/unit tests/synthetic -q # fast suite
env ARDA_REALWORLD=1 python -m pytest tests/realworld -s # vs IgBLAST (network)
env RUN_BENCHMARK=1 python -m pytest tests/benchmark -s # timing/memory/scalingLayout: src/arda/{refbuild,annotate}, C++ in src/_markup/markup.cpp,
references in database/, downloads in gitignored bin/ + data/.