Quick Start¶

This guide walks through a complete first run of gmlst, from downloading a scheme to typing FASTA and FASTQ samples.

If you have not installed the tool yet, start with Installation.

Prerequisites¶

Before you begin, make sure:

gmlst is installed and on PATH
at least one backend is available, for example blastn
you have internet access for scheme download
your sample data is in FASTA or FASTQ format

Useful checks:

gmlst --version
gmlst utils check -b blastn
gmlst scheme list -p pubmlst -t mlst

Step 1: Download an MLST scheme¶

First, list schemes so you know what names are available:

gmlst scheme list
gmlst scheme list -p pubmlst -t mlst

Download a scheme by its canonical scheme name. Here we use saureus_1:

gmlst scheme download -s saureus_1

What this does:

resolves the provider for saureus_1
downloads the allele FASTA files and profile table
stores the scheme in the local cache for later reuse

You can inspect the downloaded scheme afterward:

gmlst scheme show -s saureus_1

Step 2: Type your first sample¶

For an assembled genome in FASTA format, the most direct command is:

gmlst typing mlst -s saureus_1 sample.fasta

This runs MLST typing against the downloaded scheme using the default blastn backend.

Typical TSV output looks like this:

FILE    SCHEME  ST  arcC    aroE    glpF    gmk pta tpi yqiL
sample  saureus_1   1   1   1   1   1   1   1   1

If you prefer a more human-readable one line summary for a quick check:

gmlst typing mlst -s saureus_1 --format pretty sample.fasta

Example output:

sample: ST=1

Step 3: Try different backends¶

gmlst supports multiple alignment backends. The best choice depends on your input type and the balance you want between speed and sensitivity.

`blastn`¶

Good default for assembled genomes.

gmlst typing mlst -s saureus_1 -b blastn sample.fasta

`kma`¶

Works with FASTA and FASTQ. Often useful when comparing read-based runs.

gmlst typing mlst -s saureus_1 -b kma sample.fasta

`minimap2`¶

Supports FASTA and FASTQ. FASTQ mode includes targeted validation on uncertain loci.

gmlst typing mlst -s saureus_1 -b minimap2 sample.fasta

`nucmer`¶

Assembly-oriented whole genome alignment from MUMmer4.

gmlst typing mlst -s saureus_1 -b nucmer sample.fasta

Compare backends in one run¶

Use the benchmark utility if you want a side by side comparison:

gmlst utils benchmark -s saureus_1 -b blastn,kma,minimap2,nucmer sample.fasta

Step 4: Batch processing and output files¶

You can type multiple samples in one command.

Write TSV output¶

gmlst typing mlst -s saureus_1 samples/*.fasta -o results.tsv

This writes a tab-separated table that is easy to open in spreadsheets or parse in scripts.

Write JSON output¶

gmlst typing mlst -s saureus_1 --format json samples/*.fasta -o results.json

JSON output is useful for downstream automation, reporting, or novel allele extraction.

Increase sample-level parallelism¶

If you are processing many samples, you can run multiple samples in parallel:

gmlst typing mlst -s saureus_1 --max-workers 4 samples/*.fasta -o results.tsv

Step 5: FASTQ paired-end input¶

gmlst can auto-detect common paired-end naming patterns and pass them to supported backends as true paired reads.

Recognized patterns include:

sample_R1.fastq.gz and sample_R2.fastq.gz
sample_1.fq.gz and sample_2.fq.gz
sample.1.fastq.gz and sample.2.fastq.gz

Paired-end MLST with `kma`¶

gmlst typing mlst -s saureus_1 -b kma reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz

Paired-end MLST with `minimap2`¶

gmlst typing mlst -s saureus_1 -b minimap2 reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz

Notes:

paired reads are not pre-merged into one temporary file
kma and minimap2 support FASTQ directly
blastn and nucmer are generally assembly-oriented backends

Step 6: Run cgMLST typing¶

Use typing cgmlst for larger schemes with many loci:

gmlst typing cgmlst -s vparahaemolyticus_3 sample.fasta

The default backend for typing cgmlst is minimap2.

You can select different cgMLST runtime modes depending on speed and sensitivity needs:

gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode standard sample.fasta
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-fast sample.fasta
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-ultrafast sample.fasta
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-balanced sample.fasta
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-bsr sample.fasta

For FASTQ inputs, typing cgmlst automatically favors a kma-based path even if you requested minimap2, because the chew-style optimizations are FASTA-oriented.

Step 7: Run scheme-free typing¶

If you want to type samples without selecting a predefined scheme, use tgmlst:

gmlst typing tgmlst sample.fasta

Useful options include:

gmlst typing tgmlst --format json sample.fasta -o tgmlst.json
gmlst typing tgmlst --stats sample.fasta
gmlst typing tgmlst --save-scheme discovered_scheme.json sample.fasta

Scheme-free mode is helpful for exploratory workflows and cases where you want to derive a typing scheme from the data itself.

Understanding results¶

TSV output format¶

The default output format for typing mlst and typing cgmlst is TSV:

FILE    SCHEME  ST  arcC    aroE    glpF    gmk pta tpi yqiL
sample1 saureus_1   1   1   1   1   1   1   1   1
sample2 saureus_1   -   1   1   ~2  1   1   15? -

Columns mean:

FILE: sample identifier
SCHEME: scheme name used for typing
ST: sequence type, or - when unresolved
remaining columns: per-locus allele calls

Call type markers¶

gmlst uses compact markers in the allele columns:

Marker	Meaning
`23`	exact match to allele 23
`~19`	closest known allele, or a novel call represented against the nearest allele ID
`15?`	partial call, allele 15 found with incomplete coverage
`-`	missing locus

Practical interpretation:

no prefix or suffix means an exact call
~ means the locus is not a clean exact match
? means coverage is incomplete
- means the locus was not found

If any locus is non-exact or unresolved, ST may be reported as -.

Multicopy loci¶

Some organisms can show multicopy housekeeping gene signals. In those cases you may see comma-separated allele notation.

`1,2`¶

This means the same locus has conflicting high-confidence hits to multiple alleles.

`1,1`¶

This means the same allele appears to be present in multiple copies. This notation is only reported when same-copy counting is enabled.

Enable same-copy counting¶

gmlst typing mlst -s vparahaemolyticus_1 -b blastn --count-same-copy sample.fna

Current behavior:

1,2 conflicting multicopy calls can appear without extra flags
1,1 same-allele copy counting currently applies to blastn
conflicting multicopy loci force ST to - to avoid overconfident typing

Recommended workflow for multicopy-prone organisms¶

For organisms or schemes where multicopy signals are common, a two-pass workflow is safer.

Pass 1: routine typing¶

Use a fast default run without same-copy counting:

gmlst typing mlst -s vparahaemolyticus_1 -b minimap2 samples/*.fna -o pass1.tsv

Pass 2: targeted review of flagged samples¶

Re-run only the problematic samples with blastn and explicit copy counting:

gmlst typing mlst -s vparahaemolyticus_1 -b blastn --count-same-copy flagged_sample.fna

This keeps the routine batch fast while still letting you inspect ambiguous loci carefully.

Next steps¶

Read the full Command Reference
Review Installation if you need backend setup help
Explore the repository overview