FASTQ to Microreact for Streptococcus pneumoniae

This tutorial is specifically tailored for Streptococcus pneumoniae and provides a step-by-step workflow for transforming raw read sequencing files (FASTQ) into an interactive Microreact visualisation. The tutorial begins with software environment setup. It then covers quality control and the generation of in silico data (such as GPSCs, serotypes, MLSTs, and AMR profiles) using the GPS Pipeline, followed by phylogenetic tree construction, and Microreact instance creation.

Table of Contents

Prerequisites & Environment Setup
- Required Tools
- Installing Tools
  - GPS Pipeline
  - Tree Building Tools (snippy, snp-sites, fasttree, ete3, gubbins)
Quality Control & Generating in silico Data
- Run GPS Pipeline
- Filtering samples
Building Phylogenetic Tree
Creating Microreact Instance

Prerequisites & Environment Setup

You can follow this tutorial on Linux, Windows (with WSL2), or macOS. We recommend using a system with at least 16GB of RAM.

Required Tools

Installing Tools

GPS Pipeline

Install dependencies
1. Install OpenJDK 17 or later, up to 25 (use SDKMAN! to install an appropriate version of the Temurin distribution)
2. Install Docker or Singularity/Apptainer
  - Linux: Docker Engine / Apptainer
  - macOS: Docker Desktop for macOS
  - Windows with WSL2: Docker Desktop for WSL2
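
Once the dependencies are installed, a quick preflight check can confirm that they are visible on your PATH. The sketch below is illustrative only: the function names `have` and `check_gps_prereqs` are made up for this example, and the check only tests that the commands exist, not that their versions are suitable.

```shell
# Succeed if the given command is available on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether Java and at least one container engine are present.
# Returns non-zero if anything is missing. Versions are NOT checked.
check_gps_prereqs() {
    local missing=0
    if have java; then
        echo "java: found"
    else
        echo "java: MISSING"
        missing=1
    fi
    if have docker || have singularity || have apptainer; then
        echo "container engine: found"
    else
        echo "container engine: MISSING (need docker, singularity or apptainer)"
        missing=1
    fi
    return "$missing"
}
```

Running `check_gps_prereqs` prints one line per requirement. `java -version` can then be used to confirm that the installed version falls within the supported range.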
Clone and initialise the pipeline

git should come preinstalled on most systems. If not, follow the official installation guide.
1. Go to the directory where you want to keep the pipeline
```
cd /path/of/your/choice
```
2. Clone the repository
```
git clone https://github.com/GlobalPneumoSeq/gps-pipeline.git
```
3. Go into the directory of the pipeline
```
cd gps-pipeline
```
4. (Optional) Initialise the pipeline, so it can be used any time without the Internet
  
  If using Singularity/Apptainer, add -profile singularity to the below command
```
./run_pipeline --init
```

Tree Building Tools (`snippy`, `snp-sites`, `fasttree`, `ete3`, `gubbins`)

Install Conda
Install snippy, snp-sites, fasttree, ete3 in a Conda environment named buildtree

If using Mac computers with Apple Silicon, append --platform osx-64 to the below command.
```
conda create -n buildtree "snippy>=4.6.0" "snp-sites>=2.5.1" "fasttree>=2.2.0" "ete3>=3.1.3"
```
Install gubbins in a Conda environment named gubbins

gubbins needs to be installed in a separate Conda environment because of package conflicts with the other tools.
```
conda create -n gubbins "gubbins>=3.4.3"
```

Quality Control & Generating in silico Data

Run GPS Pipeline

We use the GPS Pipeline to process raw read sequencing files (FASTQ) of Streptococcus pneumoniae samples. It performs quality control (QC) and in silico typing (including serotype, MLST, GPSC, and AMR profile) fully automatically.

Save the FASTQ files of all samples in a single directory
Run the pipeline by specifying where the FASTQ files are stored (--reads) and the output directory (--output)

If using Singularity/Apptainer, add -profile singularity to the below command.
```
./run_pipeline --reads /path/to/fastqs --output /path/to/output
```
Once the run is completed, you will find results.csv in your specified output directory

Filtering samples

Before building a tree, we filter out samples that failed QC in the GPS Pipeline. We save a copy of results.csv containing only QC-passed samples as results_qcpass.csv, and a list of those samples as ids_qcpass.txt

Go into GPS Pipeline output directory
```
cd /path/to/output
```
Extract the GPS Pipeline results of QC-passed samples. The resulting results_qcpass.csv will be uploaded to your Microreact instance as metadata
```
awk -F , ' FNR==1 || $6=="PASS" ' results.csv > results_qcpass.csv
```

Extract the names of QC-passed samples
```
awk -F , ' FNR > 1 { print $1 } ' results_qcpass.csv > ids_qcpass.txt
```
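
The filter above relies on the overall QC result being the sixth column of results.csv. If you would rather not depend on column position, the column can be looked up by its header name instead. The sketch below assumes that the header is called `Overall_QC`, so check this against the first line of your own results.csv. The function names `filter_qc_pass` and `qc_tally` are made up for this example. Like the command above, it splits on every comma and so does not handle quoted fields that contain commas.

```shell
# Print the header plus every row whose value in the named column is "PASS".
# The default column name "Overall_QC" is an assumption; check your header.
# Usage: filter_qc_pass results.csv [column_name] > results_qcpass.csv
filter_qc_pass() {
    awk -F , -v col="${2:-Overall_QC}" '
        FNR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        $idx == "PASS"
    ' "$1"
}

# Tally how many samples carry each value of that column (e.g. PASS, FAIL).
# Usage: qc_tally results.csv [column_name]
qc_tally() {
    awk -F , -v col="${2:-Overall_QC}" '
        FNR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        idx { n[$idx]++ }
        END { for (k in n) print k "\t" n[k] }
    ' "$1" | sort
}
```

Running `qc_tally results.csv` before filtering shows how many samples will be dropped. `filter_qc_pass results.csv > results_qcpass.csv` then produces the same kind of file as the positional filter.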

Building Phylogenetic Tree

We use the snippy-multi script to run snippy on the reads of every QC-passed sample against the same reference.

Prepare `snippy-multi` input files

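Run the following commands to create an input file named snippy_multi_input.tsv

Update the values of READS_DIR and IDS_QCPASS to match your setup.
```
READS_DIR="/path/to/fastqs"
IDS_QCPASS="/path/to/ids_qcpass.txt"

# Write one tab-separated row per sample: ID, forward reads, reverse reads.
# Redirecting the whole loop overwrites the file, so re-running does not duplicate rows.
while read -r ID; do
    printf '%s\t%s\t%s\n' "${ID}" "${READS_DIR}/${ID}_1.fastq.gz" "${READS_DIR}/${ID}_2.fastq.gz"
done < "${IDS_QCPASS}" > snippy_multi_input.tsv
```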
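
The loop above trusts that every sample has both read files in READS_DIR. As an optional safeguard, the sketch below writes a row only when both files exist and reports the rest on stderr. The function name `make_snippy_input` is made up for this example; the `_1.fastq.gz` / `_2.fastq.gz` naming is taken from the commands above.

```shell
# Print one "ID<TAB>R1<TAB>R2" row per sample whose two FASTQ files both exist.
# Samples with a missing file are reported on stderr and left out.
# Usage: make_snippy_input READS_DIR IDS_FILE > snippy_multi_input.tsv
make_snippy_input() {
    local reads_dir=$1 ids_file=$2 id r1 r2
    while IFS= read -r id; do
        [ -n "$id" ] || continue    # skip blank lines
        r1="${reads_dir}/${id}_1.fastq.gz"
        r2="${reads_dir}/${id}_2.fastq.gz"
        if [ -f "$r1" ] && [ -f "$r2" ]; then
            printf '%s\t%s\t%s\n' "$id" "$r1" "$r2"
        else
            echo "skipping ${id}: missing ${r1} or ${r2}" >&2
        fi
    done < "$ids_file"
}
```

For example, `make_snippy_input /path/to/fastqs /path/to/ids_qcpass.txt > snippy_multi_input.tsv` would produce the input file while listing any incomplete samples in the terminal.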

Get a reference sequence

The command below downloads a general reference sequence. You should use a suitable reference genome if you are working on a lineage/GPSC-specific analysis.
```
wget https://raw.githubusercontent.com/GlobalPneumoSeq/gps-pipeline/refs/heads/master/data/ATCC_700669_v1.fa
```

Run `snippy-multi`

Activate buildtree Conda environment
```
conda activate buildtree
```
Run snippy-multi to generate the script runme.sh, which will create the alignment in the next step

--cpus $(nproc) uses all available CPU cores when running the generated script.

Update the value of --ref if you are using an alternative reference sequence.
```
snippy-multi snippy_multi_input.tsv --ref ATCC_700669_v1.fa --cpus $(nproc) > runme.sh 
```
Grant execution permission to the newly created file so that you can run it
```
chmod +x runme.sh 
```
Run the script
```
./runme.sh
```
Clean alignment (see here for why the cleanup is needed)
```
snippy-clean_full_aln core.full.aln > clean.full.aln
```

Remove intermediate directories
```
cut -f1 snippy_multi_input.tsv | xargs rm -rf
```
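
Before choosing a path, it may be worth confirming that the cleaned alignment contains what you expect: one sequence per QC-passed sample, plus the reference (which appears under the name Reference, as the pruning steps later in this tutorial rely on). The sketch below is a minimal check; the function name `check_aln_count` is made up for this example.

```shell
# Compare the number of sequences in a FASTA alignment with the number expected:
# one per sample ID, plus one for the reference.
# Usage: check_aln_count clean.full.aln ids_qcpass.txt
check_aln_count() {
    local n_seq n_ids
    # grep -c exits non-zero on zero matches but still prints 0, hence "|| true"
    n_seq=$(grep -c '^>' "$1" || true)
    n_ids=$(grep -c . "$2" || true)
    echo "sequences: ${n_seq}, expected: $((n_ids + 1))"
    [ "$n_seq" -eq $((n_ids + 1)) ]
}
```

`check_aln_count clean.full.aln ids_qcpass.txt` prints both counts and returns a non-zero exit status when they differ, which usually means at least one sample did not make it into the alignment.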

Crossroads: Choose Your Path

Depending on the purpose of your tree building and the nature of your samples, you might choose one of the following paths:

Path A:
- Species-wide analysis (as gubbins only works on samples within a strain or lineage), or quick preliminary test
- Faster
Path B:
- Strain/lineage-specific analysis
- Slower

Path A - `snp-sites` → `fasttree` → `ete3`

Run snp-sites to identify and extract the SNPs from the alignment
```
snp-sites -o clean.full.SNPs.aln clean.full.aln
```

Run fasttree to build a tree
```
FastTree -nt -gtr clean.full.SNPs.aln > clean.full.SNPs.aln.tree
```

Run ete3 via Python to prune the reference from the tree file
```
echo -e "from ete3 import Tree\nt = Tree('clean.full.SNPs.aln.tree')\nt.search_nodes(name='Reference')[0].detach()\nt.write(outfile='clean.full.SNPs.aln.noref.tree')" | python3
```
The resulting clean.full.SNPs.aln.noref.tree will be uploaded to your Microreact instance as the tree file.

Path B - `gubbins` → `ete3`

Activate gubbins Conda environment
```
conda activate gubbins
```
Run gubbins to build a tree with recombination filtering

--threads $(nproc) uses all available CPU cores when running gubbins
```
run_gubbins.py --mar --threads $(nproc) -p gubbins_output clean.full.aln
```
Activate buildtree Conda environment
```
conda activate buildtree
```

Run ete3 via Python to prune the reference from the tree file
```
echo -e "from ete3 import Tree\nt = Tree('gubbins_output.final_tree.tre')\nt.search_nodes(name='Reference')[0].detach()\nt.write(outfile='gubbins_output.final_tree.noref.tre')" | python3
```
The resulting gubbins_output.final_tree.noref.tre will be uploaded to your Microreact instance as the tree file.
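
Whichever path you took, a rough check of the pruned tree can catch problems before uploading. The sketch below uses only standard shell tools and assumes a file holding a single Newick tree with unquoted labels. Under that assumption, the number of leaves equals the number of commas plus one. The function names `newick_leaf_count` and `newick_has_leaf` are made up for this example, and the leaf name is used as a regular expression, so names with special characters may match loosely.

```shell
# Count leaves in a single-tree Newick file: leaves = commas + 1
# (valid only when labels are unquoted and contain no commas).
newick_leaf_count() {
    local commas
    commas=$(tr -cd ',' < "$1" | wc -c)
    echo $((commas + 1))
}

# Succeed if a leaf with this name is present. A leaf name follows "(" or ","
# and is followed by ":", "," or ")".
# Usage: newick_has_leaf TREE_FILE LEAF_NAME
newick_has_leaf() {
    grep -Eq "[(,]$2[:,)]" "$1"
}
```

After pruning, `newick_leaf_count` on the `.noref` tree should match the number of lines in ids_qcpass.txt, and `newick_has_leaf` with the name Reference should fail.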

Creating Microreact Instance

You can refer to the official Microreact documentation for details on how to use it. In brief, you can create an informative instance in just a few simple steps:

Go to https://microreact.org/upload using your preferred web browser
Drag and drop results_qcpass.csv, and clean.full.SNPs.aln.noref.tree (Path A) or gubbins_output.final_tree.noref.tre (Path B)
Keep pressing the Continue buttons until you see the visualisation
Pick a suitable colour column to change what the colours of the nodes are representing (e.g. Serotype)

In the GPS Project, we have standardised colour palettes for GPSCs and serotypes. You can add a <FIELD NAME>__colour column (e.g. GPSC__colour) to your metadata (results_qcpass.csv in this case) to specify the colour palette of a column.
Add metadata blocks to visualise the in silico data generated by the GPS Pipeline to provide context to the tree
Save and share the project
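
The `<FIELD NAME>__colour` column mentioned in the steps above can be added on the command line before uploading. The sketch below is one possible approach, not part of the GPS tooling. It assumes a hypothetical two-column palette file (`value,hex colour`, no header) that you supply yourself; the GPS standard palette is not reproduced here. The function name `add_colour_column` is made up for this example, and it uses the same simple comma splitting as the filtering step.

```shell
# Append a "<FIELD>__colour" column to a CSV by looking up each row's FIELD value
# in a two-column palette file (value,hex colour; no header; must not be empty).
# Values without a palette entry get an empty cell.
# Usage: add_colour_column metadata.csv FIELD palette.csv > metadata_coloured.csv
add_colour_column() {
    awk -F , -v OFS=, -v field="$2" '
        NR == FNR { colour[$1] = $2; next }     # first file: load the palette
        FNR == 1 {                              # metadata header: locate FIELD
            for (i = 1; i <= NF; i++) if ($i == field) idx = i
            if (!idx) { print "column not found: " field > "/dev/stderr"; exit 1 }
            print $0, field "__colour"
            next
        }
        { print $0, colour[$idx] }              # data row: append the colour
    ' "$3" "$1"
}
```

For example, `add_colour_column results_qcpass.csv GPSC gpsc_palette.csv > results_qcpass_coloured.csv` would produce a metadata file with an extra GPSC__colour column, where `gpsc_palette.csv` is a placeholder name for your own palette file.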
