Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
55 changes: 46 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -171,18 +171,54 @@ Alias: `strt-seq`

### QC aggregation

Run from the parent directory containing per-sample output folders:
```bash
starsolo qc <dir1> [dir2 … dirN] [options]
```

Pass one or more STARsolo output directories as positional arguments:

```bash
starsolo qc | column -t
starsolo qc sample1/ sample2/ sample3/
starsolo qc results/GSM* | column -t
starsolo qc sample1/ sample2/ --species human
starsolo qc sample1/ sample2/ --no-contamination-check
```

Checks for:
- Incomplete runs (`_STARtmp` still present)
- Barcode cross-contamination between samples
- Low mapping percentages
#### QC options

| Flag | Description | Default |
|:--|:--|:--|
| `-s, --species <name>` | Override species label in output (e.g. `human`, `mouse`) | guessed from `genomeDir` in `Log.out` |
| `--no-contamination-check` | Skip cross-sample barcode overlap check | off (check runs by default) |
| `-h, --help` | Show help | — |

Outputs a tab-separated table with read counts, mapping rates, cell counts, and configuration for each sample.
#### Checks performed

- **Completion** — warns if `_STARtmp` is still present (run did not finish)
- **Cross-contamination** — compares filtered barcodes between every pair of samples; warns if overlap exceeds 20%
- **Low mapping** — warns if `GeneFull` unique mapping rate is ≤ 20%

#### Output columns

Outputs a tab-separated table to stdout (pipe through `column -t` for aligned display):

| Column | Description |
|:--|:--|
| `Sample` | Directory name |

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation says the Sample column is the "Directory name", but run_qc currently prints the directory argument verbatim (which may include paths like results/GSM123). Either update the docs to say it's the provided path, or change QC output to print basename(<dir>) (and ensure uniqueness if doing so).

Suggested change
| `Sample` | Directory name |
| `Sample` | Provided directory path |

Copilot uses AI. Check for mistakes.
| `Rd_all` | Total reads |
| `Rd_in_cells` | Reads in cells (GeneFull) |
| `Frc_in_cells` | Fraction of reads in cells |
| `UMI_in_cells` | UMIs in cells |
| `Cells` | Estimated number of cells |
| `Med_nFeature` | Median genes per cell |
| `Good_BC` | Fraction of reads with valid barcodes |
| `WL` | Detected whitelist (`v1`, `v2`, `v3`, `v4-3p`, `v4-5p`, `arc`) |
| `Species` | Species label |
| `Paired` | `Single` or `Paired` |
| `Strand` | Strand specificity |
| `all_u+m` / `all_u` | Genome mapping rate (unique+multi / unique) |
| `exon_u+m` / `exon_u` | Exonic (Gene) mapping rate |
| `full_u+m` / `full_u` | Full-gene (GeneFull) mapping rate |

## Configuration

Expand All @@ -202,7 +238,7 @@ CLI flags always take precedence over environment variables, which take preceden
| Script | Purpose |
|:--|:--|
| [scripts/bbduk_trim.sh](scripts/bbduk_trim.sh) | Adapter/polyA trimming via BBMap |
| [scripts/bsub_submit.sh](scripts/bsub_submit.sh) | Submit any starsolo command as an LSF job |
| [scripts/bsub_submit.sh](scripts/bsub_submit.sh) | Submit any `starsolo` command as an LSF job |

### LSF submission

Expand Down Expand Up @@ -238,7 +274,8 @@ STARsolo/
│ ├── platform_dropseq.sh # Drop-seq logic
│ ├── platform_rhapsody.sh # BD Rhapsody logic
│ ├── platform_indrops.sh # inDrops logic
│ └── platform_strt.sh # STRT-seq logic
│ ├── platform_strt.sh # STRT-seq logic
│ └── qc.sh # QC aggregation logic
├── etc/
│ └── defaults.conf # Default configuration
├── scripts/
Expand Down
18 changes: 6 additions & 12 deletions bin/starsolo
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@ source "$STARSOLO_ROOT/lib/platform_dropseq.sh"
source "$STARSOLO_ROOT/lib/platform_rhapsody.sh"
source "$STARSOLO_ROOT/lib/platform_indrops.sh"
source "$STARSOLO_ROOT/lib/platform_strt.sh"
source "$STARSOLO_ROOT/lib/qc.sh"

# ---------- Top-level help ---------------------------------------------------

Expand All @@ -60,7 +61,7 @@ Platforms:
rhapsody BD Rhapsody (3 barcode segments, 8 bp UMI)
indrops inDrops (2 barcode segments + adapter, 6 bp UMI)
strt STRT-seq (96-barcode whitelist, 8 bp CB, 8 bp UMI)
qc Aggregate QC stats across completed runs
qc Aggregate QC stats across completed runs (see: starsolo qc --help)

Global options (available for all platforms):
-s, --species <name> Species (e.g. human, mouse); resolves reference
Expand All @@ -85,7 +86,7 @@ Examples:
starsolo rhapsody /data/fastqs SAMPLE1 --species human
starsolo indrops /data/fastqs SAMPLE1 --species human --adapter GAGTGATTGCTTGTGACGCCAA
starsolo strt /data/fastqs SAMPLE1 --species human --strand Reverse
starsolo qc
starsolo qc sample1/ sample2/ --species human
EOF
}

Expand Down Expand Up @@ -193,16 +194,6 @@ parse_droplet_args() {
BAM=$(resolve_bam_flags "$BAM_MODE")
}

# ---------- QC subcommand ---------------------------------------------------

run_qc() {
local QC_SCRIPT="$STARSOLO_ROOT/scripts/solo_qc.sh"
if [[ ! -x "$QC_SCRIPT" ]]; then
die "QC script not found: $QC_SCRIPT"
fi
exec bash "$QC_SCRIPT" "$@"
}

# ---------- Dispatch --------------------------------------------------------

main() {
Expand Down Expand Up @@ -266,6 +257,9 @@ main() {
run_strt "$FQDIR" "$TAG" "$CPUS" "$REF" "$WL" "$BAM" "$STRAND"
;;
qc)
if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
show_qc_help; exit 0
fi
run_qc "$@"
;;
*)
Expand Down
242 changes: 242 additions & 0 deletions lib/qc.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,242 @@
#!/bin/bash
# =============================================================================
# STARsolo CLI - QC aggregation
# =============================================================================
# Aggregate QC statistics for one or more STARsolo output directories.
# =============================================================================

# ---------- Helpers ----------------------------------------------------------

_qc_guess_species() {
local logfile="$1"
local genomedir
genomedir=$(grep "^genomeDir" "$logfile" | tail -n1)
if echo "$genomedir" | grep -q "/human/"; then
echo "Human"
elif echo "$genomedir" | grep -q "/mouse/"; then
echo "Mouse"
else
echo "Other"
fi
}

_qc_guess_wl() {
local logfile="$1"
local wlfile
wlfile=$(grep "^soloCBwhitelist" "$logfile" | tail -n1 | awk '{print $NF}')
wlfile=$(basename "$wlfile")
case "$wlfile" in
3M-february-2018.txt) echo "v3" ;;
3M-3pgex-may-2023.txt) echo "v4-3p" ;;
3M-5pgex-jan-2023.txt) echo "v4-5p" ;;
737K-august-2016.txt) echo "v2" ;;
737K-april-2014_rc.txt) echo "v1" ;;
737K-arc-v1.txt) echo "arc" ;;
*) echo "Undef" ;;
esac
}

# ---------- Main QC workflow -------------------------------------------------

run_qc() {
local FORCE_SPECIES="" CHECK_CONTAMINATION=1
local -a DIRS=()

Comment on lines +41 to +44

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI currently exercises each platform via .github/workflows/test.yml, but there is no coverage for the newly refactored starsolo qc subcommand. Add a workflow step (or a small fixture under data/test) that runs bin/starsolo qc ... and asserts basic output/header behavior (and optionally --no-contamination-check) to prevent regressions in argument parsing and file detection.

Copilot uses AI. Check for mistakes.
# --- Argument parsing ---
while (( $# > 0 )); do
case "$1" in
-h|--help)
show_qc_help; exit 0 ;;
-s|--species|--specie)
FORCE_SPECIES="$2"; shift 2 ;;
--no-contamination-check)
Comment on lines +48 to +52

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The --species option assumes a following argument (FORCE_SPECIES="$2"; shift 2). With set -u, calling starsolo qc --species (missing value) will raise an unbound parameter error / shift out of range instead of a clean message. Validate that $2 is present and not another flag before shifting, and fail with die if missing.

Also, --specie is accepted here but not documented in show_qc_help; either document it as an alias or remove it to avoid an accidental/typo API surface.

Copilot uses AI. Check for mistakes.
CHECK_CONTAMINATION=0; shift ;;
-*)
die "Unknown option: $1. Run 'starsolo qc --help'." ;;
*)
DIRS+=("$1"); shift ;;
esac
done

if (( ${#DIRS[@]} == 0 )); then
show_qc_help
exit 0
fi

# Strip trailing slashes
local -a CLEAN_DIRS=()
local d
for d in "${DIRS[@]}"; do
CLEAN_DIRS+=("${d%/}")
done

# --- Check completion status ---
log_info "Checking that all STARsolo jobs went to completion …"
local i
for i in "${CLEAN_DIRS[@]}"; do
if [[ -d "$i/output" && -s "$i/Log.final.out" ]]; then
if [[ -d "$i/_STARtmp" ]]; then
log_warn "Sample $i did not run to completion: _STARtmp is still present!"
fi
fi
done

# --- Cross-contamination check ---
if (( CHECK_CONTAMINATION )); then
log_info "Checking potential sample cross-contamination …"
local TMPDIR_SORT
TMPDIR_SORT=$(mktemp -d)

Comment on lines +86 to +89

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TMPDIR_SORT is cleaned up only at the end via rm -rf "$TMPDIR_SORT". If any command in the contamination loop fails (e.g., zcat/sort/comm) and the script exits due to set -e, the temp directory will be left behind. Consider installing a trap (e.g., on RETURN within run_qc) to ensure cleanup on all exit paths.

Copilot uses AI. Check for mistakes.
local j local_i local_j N1 N2 MIN COMM PCT
for i in "${CLEAN_DIRS[@]}"; do
if [[ -s "$i/output/Gene/filtered/barcodes.tsv.gz" ]]; then
for j in "${CLEAN_DIRS[@]}"; do
if [[ -s "$j/output/Gene/filtered/barcodes.tsv.gz" && "$i" != "$j" ]]; then
local_i="$TMPDIR_SORT/$(basename "$i").barcodes.tsv"
local_j="$TMPDIR_SORT/$(basename "$j").barcodes.tsv"
Comment on lines +91 to +96

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temp filenames are derived from basename "$i" / basename "$j". Since starsolo qc now accepts arbitrary directory arguments, two different directories with the same basename (e.g., /run1/sampleA and /run2/sampleA) will collide and corrupt comparisons. Use a collision-free naming scheme (e.g., an index in CLEAN_DIRS, a hash of the full path, or mktemp per sample).

Suggested change
for i in "${CLEAN_DIRS[@]}"; do
if [[ -s "$i/output/Gene/filtered/barcodes.tsv.gz" ]]; then
for j in "${CLEAN_DIRS[@]}"; do
if [[ -s "$j/output/Gene/filtered/barcodes.tsv.gz" && "$i" != "$j" ]]; then
local_i="$TMPDIR_SORT/$(basename "$i").barcodes.tsv"
local_j="$TMPDIR_SORT/$(basename "$j").barcodes.tsv"
declare -A SORTED_BARCODE_FILES=()
for i in "${CLEAN_DIRS[@]}"; do
if [[ -s "$i/output/Gene/filtered/barcodes.tsv.gz" ]]; then
for j in "${CLEAN_DIRS[@]}"; do
if [[ -s "$j/output/Gene/filtered/barcodes.tsv.gz" && "$i" != "$j" ]]; then
if [[ -z "${SORTED_BARCODE_FILES["$i"]}" ]]; then
SORTED_BARCODE_FILES["$i"]=$(mktemp "$TMPDIR_SORT/barcodes.XXXXXX.tsv")
fi
if [[ -z "${SORTED_BARCODE_FILES["$j"]}" ]]; then
SORTED_BARCODE_FILES["$j"]=$(mktemp "$TMPDIR_SORT/barcodes.XXXXXX.tsv")
fi
local_i="${SORTED_BARCODE_FILES["$i"]}"
local_j="${SORTED_BARCODE_FILES["$j"]}"

Copilot uses AI. Check for mistakes.
Comment on lines +95 to +96

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use unique temp-file keys for contamination inputs

Naming the sorted barcode cache files with basename conflates different directories that share a sample folder name (for example batch1/sampleA and batch2/sampleA). In that case both samples write/read the same temp path, so comm can compare a file to itself and report near-100% overlap (plus concurrent writes to one file), producing false cross-contamination warnings for unrelated samples.

Useful? React with 👍 / 👎.

if [[ ! -s "$local_i" ]]; then
zcat "$i/output/Gene/filtered/barcodes.tsv.gz" | sort > "$local_i" &
fi
if [[ ! -s "$local_j" ]]; then
zcat "$j/output/Gene/filtered/barcodes.tsv.gz" | sort > "$local_j" &
fi
wait
N1=$(wc -l < "$local_i")
N2=$(wc -l < "$local_j")
MIN=$(( N1 <= N2 ? N1 : N2 ))
COMM=$(comm -12 "$local_i" "$local_j" | wc -l)
PCT=$(echo "$COMM" | awk -v v="$MIN" '{printf "%d\n",100*$1/v+0.5}')

Copilot AI Apr 22, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PCT divides by MIN, but MIN can be 0 if either barcode list is empty (wc -l == 0). That will cause an awk division-by-zero error under set -e and abort QC. Guard against MIN==0 (e.g., treat overlap as 0% and continue).

Suggested change
PCT=$(echo "$COMM" | awk -v v="$MIN" '{printf "%d\n",100*$1/v+0.5}')
if (( MIN == 0 )); then
PCT=0
else
PCT=$(echo "$COMM" | awk -v v="$MIN" '{printf "%d\n",100*$1/v+0.5}')
fi

Copilot uses AI. Check for mistakes.
if (( PCT >= 20 )); then
log_warn "Samples $i ($N1 barcodes) and $j ($N2 barcodes) share $COMM ($PCT%) barcodes — higher than expected!"
fi
fi
done
elif [[ -d "$i/output" && -s "$i/Log.out" ]]; then
log_info "Sample $i: no filtered output detected (probably plate-based)."
fi
done
rm -rf "$TMPDIR_SORT"
else
log_info "Skipping cross-contamination check (--no-contamination-check)."
fi

# --- Extract and output statistics ---
log_info "Extracting STARsolo stats …"
>&2 echo

echo -e "Sample\tRd_all\tRd_in_cells\tFrc_in_cells\tUMI_in_cells\tCells\tMed_nFeature\tGood_BC\tWL\tSpecies\tPaired\tStrand\tall_u+m\tall_u\texon_u+m\texon_u\tfull_u+m\tfull_u"

local REF WL R1 B G1 G2 E1 E2 F1 F2 C R2 CF R3 GC ST PAIRED F2PCT
for i in "${CLEAN_DIRS[@]}"; do

if [[ -d "$i/output/Gene/filtered/" && -s "$i/Log.out" ]]; then
# --- Droplet-based ---
PAIRED="Single"
if grep -q "clip5pNbases 39 0" "$i/Log.out"; then
PAIRED="Paired"
fi

if [[ -n "$FORCE_SPECIES" ]]; then
REF="$FORCE_SPECIES"
else
REF=$(_qc_guess_species "$i/Log.out")
fi

WL=$(_qc_guess_wl "$i/Log.out")

R1=$(grep "Number of Reads," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
B=$( grep "Reads With Valid Barcodes," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
G1=$(grep "Reads Mapped to Genome: Unique+Multiple," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
G2=$(grep "Reads Mapped to Genome: Unique," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
E1=$(grep "Reads Mapped to Gene: Unique+Multip.*e Gene," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
E2=$(grep "Reads Mapped to Gene: Unique Gene," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
F1=$(grep "Reads Mapped to GeneFull: Unique+Multip.*e GeneFull," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}')
F2=$(grep "Reads Mapped to GeneFull: Unique GeneFull," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}')
C=$( grep "Estimated Number of Cells," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}')
R2=$(grep "Unique Reads in Cells Mapped to GeneFull," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}')
CF=$(echo "$R1" | awk -v v="$R2" '{printf "%.3f\n",v/$1}')
R3=$(grep "UMIs in Cells," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}')
GC=$(grep "Median GeneFull per Cell," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}')
ST=$(grep "^soloStrand" "$i/Log.out" | grep RE-DEFINED | awk '{print $2}')

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Handle absent RE-DEFINED strand markers safely

starsolo enables set -euo pipefail, so this pipeline returns non-zero when Log.out does not contain RE-DEFINED; the command then exits before the fallback ST="Undef" logic can run. That turns a previously tolerated condition into a hard failure and causes starsolo qc to abort on valid logs that omit this marker.

Useful? React with 👍 / 👎.


if [[ -z "$ST" ]]; then
ST="Undef"
elif [[ "$PAIRED" == "Paired" ]]; then
ST="Reverse"
fi

F2PCT=$(echo "$F2" | awk '{printf "%d\n",$1*100+0.5}')
if (( F2PCT <= 20 )); then
log_warn "Sample $i: GeneFull percentage ($F2PCT%) is too low!"
fi

echo -e "$i\t$R1\t$R2\t$CF\t$R3\t$C\t$GC\t$B\t$WL\t$REF\t$PAIRED\t$ST\t$G1\t$G2\t$E1\t$E2\t$F1\t$F2"

elif [[ -d "$i/output" && -s "$i/Log.out" ]]; then
# --- Plate-based (Smart-seq2 etc.) ---
PAIRED="Single"
if grep -q "mate 2" "$i/Log.out"; then
PAIRED="Paired"
fi

if [[ -n "$FORCE_SPECIES" ]]; then
REF="$FORCE_SPECIES"
else
REF=$(_qc_guess_species "$i/Log.out")
fi

WL="Undef"
if grep -q "manifest" "$i/Log.out"; then
WL="None(Smart-seq)"
fi

R1=$(grep "Number of Reads," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
B=$( grep "Reads With Valid Barcodes," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
G1=$(grep "Reads Mapped to Genome: Unique+Multiple," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
G2=$(grep "Reads Mapped to Genome: Unique," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
E1=$(grep "Reads Mapped to Gene: Unique+Multip.*e Gene," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
E2=$(grep "Reads Mapped to Gene: Unique Gene," "$i/output/Gene/Summary.csv" | awk -F "," '{print $2}')
F1=$(grep "Reads Mapped to GeneFull: Unique+Multip.*e GeneFull," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}')
F2=$(grep "Reads Mapped to GeneFull: Unique GeneFull," "$i/output/GeneFull/Summary.csv" | awk -F "," '{print $2}')
C=$( zcat "$i/output/Gene/raw/barcodes.tsv.gz" | wc -l)
R2=$(grep "yessubWLmatch_UniqueFeature" "$i/output/GeneFull/Features.stats" | awk '{print $2}')
CF=$(echo "$R1" | awk -v v="$R2" '{printf "%.3f\n",v/$1}')
R3=$(grep "yesUMIs" "$i/output/GeneFull/Features.stats" | awk '{print $2}')
GC=$(zcat "$i/output/GeneFull/raw/matrix.mtx.gz" | awk 'NR>3 {print $2}' | uniq -c | awk '{print $1}' | sort -n | awk '{a[i++]=$1} END {print a[int(i/2)]}')
ST=$(grep "^soloStrand" "$i/Log.out" | grep RE-DEFINED | awk '{print $2}')

[[ -z "$ST" ]] && ST="Undef"

F2PCT=$(echo "$F2" | awk '{printf "%d\n",$1*100+0.5}')
if (( F2PCT <= 20 )); then
log_warn "Sample $i: GeneFull percentage ($F2PCT%) is too low!"
fi

echo -e "$i\t$R1\t$R2\t$CF\t$R3\t$C\t$GC\t$B\t$WL\t$REF\t$PAIRED\t$ST\t$G1\t$G2\t$E1\t$E2\t$F1\t$F2"
fi
done
}

# ---------- Help -------------------------------------------------------------

show_qc_help() {
cat <<'EOF'
Usage: starsolo qc <dir1> [dir2 … dirN] [options]

Aggregate QC statistics across one or more STARsolo output directories.

Arguments:
<dir> One or more STARsolo output directories.

Options:
-s, --species <name> Override species label in output (e.g. human, mouse).
By default guessed from genomeDir in Log.out.
--no-contamination-check Skip cross-sample barcode overlap check.
-h, --help Show this help and exit.

Examples:
starsolo qc sample1/ sample2/ | column -t
starsolo qc results/GSM* --species human
starsolo qc sample1/ sample2/ --no-contamination-check
EOF
}
Loading
Loading