Quick Start¶
This guide walks through a complete first run of gmlst, from downloading a scheme to typing FASTA and FASTQ samples.
If you have not installed the tool yet, start with Installation.
Prerequisites¶
Before you begin, make sure:
- gmlst is installed and on PATH
- at least one backend is available, for example blastn
- you have internet access for scheme download
- your sample data is in FASTA or FASTQ format
Useful checks:
gmlst --version
gmlst utils check -b blastn
gmlst scheme list -p pubmlst -t mlst
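If you want to run the PATH part of these checks from a script, a small helper along the following lines may be handy. It is only a sketch: `missing_tools` is a hypothetical helper name, not part of gmlst, and it relies solely on the POSIX `command -v` builtin rather than on any gmlst subcommand.

```shell
# Print each named tool that cannot be found on PATH, one per line.
# Prints nothing when every tool is available.
# NOTE: missing_tools is an illustrative helper, not a gmlst command.
missing_tools() {
    for tool in "$@"; do
        # `command -v` succeeds only when the name resolves to a command.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: check gmlst plus the backend you plan to use.
missing=$(missing_tools gmlst blastn)
if [ -n "$missing" ]; then
    # Unquoted on purpose so each missing tool gets its own message line.
    printf 'not found on PATH: %s\n' $missing >&2
fi
```

This only confirms that the executables resolve; gmlst utils check remains the way to confirm that gmlst itself can use a backend.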
Step 1: Download an MLST scheme¶
First, list schemes so you know what names are available:
gmlst scheme list
gmlst scheme list -p pubmlst -t mlst
Download a scheme by its canonical scheme name. Here we use saureus_1:
gmlst scheme download -s saureus_1
What this does:
- resolves the provider for saureus_1
- downloads the allele FASTA files and profile table
- stores the scheme in the local cache for later reuse
You can inspect the downloaded scheme afterward:
gmlst scheme show -s saureus_1
Step 2: Type your first sample¶
For an assembled genome in FASTA format, the most direct command is:
gmlst typing mlst -s saureus_1 sample.fasta
This runs MLST typing against the downloaded scheme using the default blastn backend.
Typical TSV output looks like this:
FILE SCHEME ST arcC aroE glpF gmk pta tpi yqiL
sample saureus_1 1 1 1 1 1 1 1 1
If you prefer a more human-readable one-line summary for a quick check:
gmlst typing mlst -s saureus_1 --format pretty sample.fasta
Example output:
sample: ST=1
Step 3: Try different backends¶
gmlst supports multiple alignment backends. The best choice depends on your input type and the balance you want between speed and sensitivity.
blastn¶
Good default for assembled genomes.
gmlst typing mlst -s saureus_1 -b blastn sample.fasta
kma¶
Works with FASTA and FASTQ. Often useful when comparing read-based runs.
gmlst typing mlst -s saureus_1 -b kma sample.fasta
minimap2¶
Supports FASTA and FASTQ. FASTQ mode includes targeted validation on uncertain loci.
gmlst typing mlst -s saureus_1 -b minimap2 sample.fasta
nucmer¶
Assembly-oriented whole-genome alignment from MUMmer4.
gmlst typing mlst -s saureus_1 -b nucmer sample.fasta
Compare backends in one run¶
Use the benchmark utility if you want a side-by-side comparison:
gmlst utils benchmark -s saureus_1 -b blastn,kma,minimap2,nucmer sample.fasta
Step 4: Batch processing and output files¶
You can type multiple samples in one command.
Write TSV output¶
gmlst typing mlst -s saureus_1 samples/*.fasta -o results.tsv
This writes a tab-separated table that is easy to open in spreadsheets or parse in scripts.
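As an illustration of parsing the table in a script, the sketch below pulls the sample name and ST out of a results file with awk. It assumes the file is tab-separated, has a header row, keeps the sample name in the first column, and labels the sequence type column ST, as in the example output earlier in this guide. `sample_sts` is a hypothetical helper name.

```shell
# Print "FILE<TAB>ST" for every sample row of a gmlst TSV result file.
# Assumptions: tab-separated, one header row, a column literally named "ST".
sample_sts() {
    awk -F '\t' '
        # Header row: remember which column holds the ST, then skip it.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "ST") st = i; next }
        # Data rows: print the first column and the ST column.
        st { print $1 "\t" $st }
    ' "$1"
}
```

For example, `sample_sts results.tsv` prints one line per sample. Looking the column up by header name rather than position keeps the helper working if extra columns are ever added before ST.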
Write JSON output¶
gmlst typing mlst -s saureus_1 --format json samples/*.fasta -o results.json
JSON output is useful for downstream automation, reporting, or novel allele extraction.
Increase sample-level parallelism¶
If you are processing many samples, you can run multiple samples in parallel:
gmlst typing mlst -s saureus_1 --max-workers 4 samples/*.fasta -o results.tsv
Step 5: FASTQ paired-end input¶
gmlst can auto-detect common paired-end naming patterns and pass them to supported backends as true paired reads.
Recognized patterns include:
- sample_R1.fastq.gz and sample_R2.fastq.gz
- sample_1.fq.gz and sample_2.fq.gz
- sample.1.fastq.gz and sample.2.fastq.gz
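Before starting a large run it can help to confirm that every forward read file has its mate. The sketch below maps an R1 name to the expected R2 name for the three patterns listed above. It is not gmlst's own detection logic, which this guide does not show, and `mate_of` is a hypothetical helper name.

```shell
# Print the expected R2 file name for an R1 file name, using the three
# naming patterns listed above. Prints nothing and returns 1 when the
# name matches none of them.
# NOTE: illustrative only; gmlst performs its own pairing internally.
mate_of() {
    case "$1" in
        *_R1.fastq.gz) printf '%s\n' "${1%_R1.fastq.gz}_R2.fastq.gz" ;;
        *_1.fq.gz)     printf '%s\n' "${1%_1.fq.gz}_2.fq.gz" ;;
        *.1.fastq.gz)  printf '%s\n' "${1%.1.fastq.gz}.2.fastq.gz" ;;
        *)             return 1 ;;
    esac
}
```

A typical use is a loop such as `for r1 in reads/*_R1.fastq.gz; do [ -e "$(mate_of "$r1")" ] || echo "no mate for $r1"; done`.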
Paired-end MLST with kma¶
gmlst typing mlst -s saureus_1 -b kma reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz
Paired-end MLST with minimap2¶
gmlst typing mlst -s saureus_1 -b minimap2 reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz
Notes:
- paired reads are not pre-merged into one temporary file
- kma and minimap2 support FASTQ directly
- blastn and nucmer are generally assembly-oriented backends
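In a mixed directory of assemblies and reads, the notes above can be turned into a simple rule for picking the -b value. The sketch below is one possible simplification: it suggests kma for FASTQ file names and blastn otherwise. `suggest_backend` is a hypothetical helper, the extension list is an assumption, and minimap2 would be an equally valid choice for reads.

```shell
# Suggest a backend from the input file name: kma for FASTQ reads,
# blastn for everything else (treated as an assembly).
# NOTE: the extension list is an assumption, not taken from gmlst.
suggest_backend() {
    case "$1" in
        *.fastq|*.fq|*.fastq.gz|*.fq.gz) echo kma ;;
        *) echo blastn ;;
    esac
}
```

It can be combined with the typing command shown earlier, for example `gmlst typing mlst -s saureus_1 -b "$(suggest_backend "$f")" "$f"`.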
Step 6: Run cgMLST typing¶
Use typing cgmlst for larger schemes with many loci:
gmlst typing cgmlst -s vparahaemolyticus_3 sample.fasta
The default backend for typing cgmlst is minimap2.
You can select different cgMLST runtime modes depending on speed and sensitivity needs:
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode standard sample.fasta
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-fast sample.fasta
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-ultrafast sample.fasta
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-balanced sample.fasta
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-bsr sample.fasta
For FASTQ inputs, typing cgmlst automatically favors a kma-based path even if you requested minimap2, because the chew-style optimizations are FASTA-oriented.
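To compare modes on the same sample, it can be convenient to generate the five commands instead of typing them. The sketch below only prints the commands, using the mode names listed above; it does not run gmlst. `cgmlst_mode_commands` is a hypothetical helper name.

```shell
# Print (without running) one typing command per cgMLST mode named above.
# $1 = scheme name, $2 = input FASTA file.
cgmlst_mode_commands() {
    for mode in standard chew-fast chew-ultrafast chew-balanced chew-bsr; do
        printf 'gmlst typing cgmlst -s %s --cgmlst-mode %s %s\n' \
            "$1" "$mode" "$2"
    done
}
```

Review the printed lines first; once they look right, `cgmlst_mode_commands vparahaemolyticus_3 sample.fasta | sh` would execute them one after another.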
Step 7: Run scheme-free typing¶
If you want to type samples without selecting a predefined scheme, use tgmlst:
gmlst typing tgmlst sample.fasta
Useful options include:
gmlst typing tgmlst --format json sample.fasta -o tgmlst.json
gmlst typing tgmlst --stats sample.fasta
gmlst typing tgmlst --save-scheme discovered_scheme.json sample.fasta
Scheme-free mode is helpful for exploratory workflows and cases where you want to derive a typing scheme from the data itself.
Understanding results¶
TSV output format¶
The default output format for typing mlst and typing cgmlst is TSV:
FILE SCHEME ST arcC aroE glpF gmk pta tpi yqiL
sample1 saureus_1 1 1 1 1 1 1 1 1
sample2 saureus_1 - 1 1 ~2 1 1 15? -
Columns mean:
- FILE: sample identifier
- SCHEME: scheme name used for typing
- ST: sequence type, or - when unresolved
- remaining columns: per-locus allele calls
Call type markers¶
gmlst uses compact markers in the allele columns:
| Marker | Meaning |
|---|---|
| 23 | exact match to allele 23 |
| ~19 | closest known allele, or a novel call represented against the nearest allele ID |
| 15? | partial call, allele 15 found with incomplete coverage |
| - | missing locus |
Practical interpretation:
- no prefix or suffix means an exact call
- ~ means the locus is not a clean exact match
- ? means coverage is incomplete
- - means the locus was not found
If any locus is non-exact or unresolved, ST may be reported as -.
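When post-processing results in a script, the marker rules above reduce to a small pattern match. The following sketch classifies a single allele cell into one of four labels. `classify_call` and the label names are hypothetical, and how a cell combining several markers should be treated is an assumption (here the ~ prefix wins).

```shell
# Classify one allele call from a gmlst TSV cell using the markers above.
# Labels (missing, closest, partial, exact) are illustrative names only.
classify_call() {
    case "$1" in
        -)    echo missing ;;   # locus was not found
        "~"*) echo closest ;;   # not a clean exact match
        *"?") echo partial ;;   # incomplete coverage
        *)    echo exact ;;     # no prefix or suffix
    esac
}
```

For instance, `classify_call '15?'` prints partial and `classify_call 23` prints exact.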
Multicopy loci¶
Some organisms can show multicopy housekeeping gene signals. In those cases you may see comma-separated allele notation.
1,2¶
This means the same locus has conflicting high-confidence hits to multiple alleles.
1,1¶
This means the same allele appears to be present in multiple copies. This notation is only reported when same-copy counting is enabled.
Enable same-copy counting¶
gmlst typing mlst -s vparahaemolyticus_1 -b blastn --count-same-copy sample.fna
Current behavior:
- 1,2 conflicting multicopy calls can appear without extra flags
- 1,1 same-allele copy counting currently applies to blastn
- conflicting multicopy loci force ST to - to avoid overconfident typing
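If you need to tell the two multicopy notations apart in a script, counting the distinct values in the cell is enough. The sketch below does that with standard tools. `multicopy_kind` and its labels are hypothetical, and it assumes the cell holds plain comma-separated allele IDs as in the 1,2 and 1,1 examples.

```shell
# Classify a comma-separated multicopy cell such as "1,2" or "1,1".
# Labels (single, conflicting, same-copy) are illustrative names only.
multicopy_kind() {
    case "$1" in
        *,*) ;;                         # multicopy notation, keep going
        *) echo single; return ;;       # no comma: ordinary single call
    esac
    # Count distinct alleles among the comma-separated values.
    distinct=$(printf '%s\n' "$1" | tr ',' '\n' | sort -u | grep -c .)
    if [ "$distinct" -gt 1 ]; then
        echo conflicting
    else
        echo same-copy
    fi
}
```

With this, `multicopy_kind 1,2` prints conflicting and `multicopy_kind 1,1` prints same-copy.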
Recommended workflow for multicopy-prone organisms¶
For organisms or schemes where multicopy signals are common, a two-pass workflow is safer.
Pass 1: routine typing¶
Use a fast default run without same-copy counting:
gmlst typing mlst -s vparahaemolyticus_1 -b minimap2 samples/*.fna -o pass1.tsv
Pass 2: targeted review of flagged samples¶
Re-run only the problematic samples with blastn and explicit copy counting:
gmlst typing mlst -s vparahaemolyticus_1 -b blastn --count-same-copy flagged_sample.fna
This keeps the routine batch fast while still letting you inspect ambiguous loci carefully.
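Picking the samples for pass 2 can be scripted from the pass 1 table. The sketch below lists every sample whose ST is - or whose allele columns contain a ~, ?, comma, or - marker. It assumes the TSV layout shown under "TSV output format" (tab-separated, header row, FILE, SCHEME, ST, then one column per locus); `flagged_samples` is a hypothetical helper name.

```shell
# List samples from a pass-1 TSV that deserve a second look: ST is "-",
# or any allele column carries a ~, ? or comma marker, or is missing.
# Assumes columns FILE, SCHEME, ST, then one column per locus.
flagged_samples() {
    awk -F '\t' '
        NR == 1 { next }                      # skip the header row
        {
            flag = ($3 == "-")                # unresolved sequence type
            for (i = 4; i <= NF; i++)
                if ($i ~ /[~?,]/ || $i == "-") flag = 1
            if (flag) print $1
        }
    ' "$1"
}
```

Running `flagged_samples pass1.tsv` prints one sample identifier per line. Mapping each identifier back to its input file (for example samples/NAME.fna) depends on how your files are named and is left to you.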
Next steps¶
- Read the full Command Reference
- Review Installation if you need backend setup help
- Explore the repository overview