Novel Allele Workflow Guide¶
This guide covers the private MLST workflow in gmlst, from novel allele detection during typing to building a reusable custom scheme. For the core command reference, see commands.md. For a shorter getting-started example, see quickstart.md.
Overview¶
Novel allele detection helps you capture sequences that are close enough to a target locus to be trusted, but too different from the public database to be called as a known allele. That matters when you are working with local collections, outbreak sets, or long-running surveillance projects where your lab sees diversity before it appears in a public scheme.
The workflow has two goals:
- save novel allele sequences and complete novel profiles from typing output
- turn those discoveries into a private custom scheme that you can type against later
gmlst keeps that workflow CLI-first and file-based, so you can inspect every intermediate artifact and decide when to promote novel data into a reusable local scheme.
Complete Pipeline¶
# 1. Type samples and write machine-readable results
gmlst typing mlst -s saureus_1 --format json *.fasta -o typing_results.json
# 2. Extract novel alleles and complete novel profiles
gmlst utils extract -i typing_results.json --novel-allele --novel-profile \
--data-dir novel_data/
# 3. Create a private custom scheme from the extracted data
gmlst scheme create -t mlst -s saureus_1 --data-dir novel_data/ \
--desc "Lab collection 2024"
# 4. Add more novel data later
gmlst scheme update-custom -s custom_1 --data-dir more_novel_data/
# 5. Export the custom scheme for MST visualization
gmlst scheme export -s custom_1 --format grapetree -o mst.tsv
Typical result layout after the workflow:
novel_data/
├── arcC_novel.fasta
├── glpF_novel.fasta
└── profiles_novel.txt
~/.cache/gmlst/local/custom_1/
├── arcC.tfa
├── aroE.tfa
├── glpF.tfa
├── custom_1.txt
└── .meta.json
Step 1: Typing with Novel Detection¶
Start by typing your samples against a public MLST scheme and keep the output in JSON. JSON is the preferred input for downstream extraction because it preserves per-locus details, including novel sequence data.
gmlst typing mlst -s saureus_1 --format json *.fasta -o typing_results.json
If you want typing plus extraction-ready flags in one run, use:
gmlst typing mlst -s saureus_1 --format json \
--novel-allele --novel-profile \
--data-dir novel_data/ \
*.fasta -o typing_results.json
Why these flags matter:
--novel-allelesaves loci with good coverage but low identity as novel candidates--novel-profilerecords complete profiles that include those novel alleles--data-dirkeeps the novel outputs in a dedicated directory instead of mixing them with your typing table
Novel allele detection follows this rule of thumb:
- good coverage, but identity below 95 percent, becomes a novel allele candidate
Representative JSON-style example:
{
"sample_id": "isolate_042",
"scheme": "saureus_1",
"st": null,
"allele_calls": {
"arcC": {"allele_id": "1", "call_type": "exact"},
"aroE": {"allele_id": "7", "call_type": "exact"},
"glpF": {
"allele_id": "19",
"call_type": "novel",
"novel_sequence": "ATGAAACT..."
}
}
}
Step 2: Extracting Novel Data¶
Once you have typing output, extract the novel alleles and complete novel profiles into a reusable data directory.
From JSON typing output¶
gmlst utils extract -i typing_results.json --novel-allele --novel-profile \
--data-dir novel_data/
Typical output:
Novel alleles written:
glpF: novel_data/glpF_novel.fasta
gmk: novel_data/gmk_novel.fasta
Novel profiles written: novel_data/profiles_novel.txt
From TSV typing output¶
TSV works as a fallback when JSON is not available. In that case, gmlst needs the scheme name and access to the original sample FASTA files so it can recover the underlying sequence context.
gmlst utils extract -i typing_results.tsv -s ecoli_1 \
--novel-allele --novel-profile \
--samples-dir ./samples \
--data-dir novel_data/
Use the TSV path when you already have a typing table from an earlier run, but did not save JSON at the time.
Naming rules¶
- novel alleles use
{locus}_n{number}, for examplednaN_n1 - novel profiles use
N{number}, for exampleN1 - allele numbering is per locus, while profile numbering is global within the output set
Important behavior across repeated runs:
profiles_novel.txtis append-safe across runs{locus}_novel.fastafiles are generated fresh per run, so use a new--data-dirfor each batch or regenerate from a combined JSON file
Step 3: Creating Custom Schemes¶
After extraction, build a private scheme that merges the original public database with your new allele and profile data.
gmlst scheme create -t mlst -s saureus_1 --data-dir novel_data/ \
--desc "Lab collection 2024"
Why this step matters:
- it turns ad hoc novel files into a typeable scheme
- it preserves the original public alleles and STs
- it lets later samples match your lab-specific alleles directly
Custom schemes are auto-numbered as custom_1, custom_2, and so on. They are stored under the local provider and show up in the normal scheme list.
gmlst scheme list -p local
Realistic output:
PROVIDER TYPE SCHEME DESCRIPTION
local mlst custom_1 Lab collection 2024
The resulting scheme directory looks like this:
~/.cache/gmlst/local/custom_1/
├── arcC.tfa
├── aroE.tfa
├── glpF.tfa
├── custom_1.txt
└── .meta.json
Step 4: Updating Custom Schemes¶
When new isolates produce more novel alleles, update the existing custom scheme instead of creating a fresh one.
gmlst scheme update-custom -s custom_1 --data-dir more_novel_data/
This keeps numbering and metadata continuity in one place. It is the right choice when you are extending the same internal project or surveillance collection.
Typical update pattern:
# Type the next batch
gmlst typing mlst -s saureus_1 --format json batch2/*.fasta -o batch2.json
# Extract new novel data
gmlst utils extract -i batch2.json --novel-allele --novel-profile \
--data-dir more_novel_data/
# Merge it into the existing custom scheme
gmlst scheme update-custom -s custom_1 --data-dir more_novel_data/
After the update, custom_1 can be used for typing like any other scheme:
gmlst typing mlst -s custom_1 new_isolate.fasta
Step 5: Exporting for Visualization¶
Once your custom scheme contains the profile set you need, export it in GrapeTree-compatible TSV format for MST work.
gmlst scheme export -s custom_1 --format grapetree -o mst.tsv
That output is designed for downstream visualization, including the local web UI described in visual_guide.md.
Representative first lines of the exported file:
#Strain arcC aroE glpF gmk pta tpi yqiL
ST1 1 1 1 1 1 1 1
STN1 n1 7 3 9 4 2 1
The first column is a strain label derived from the profile identifier. Known alleles remain numeric, while novel alleles remain in nX form in the exported table.
Allele Extraction¶
The same utils extract command can also pull called alleles directly from a sample FASTA. This is useful when you want a per-sample MLST allele FASTA for inspection, alignment, or downstream analysis.
gmlst utils extract -i genome.fasta -s ecoli_1 > genome_mlst.fasta
To restrict extraction to specific loci:
gmlst utils extract -i genome.fasta -s ecoli_1 --allele dnaN,tsvA,abcN
Realistic output:
>dnaN_12 sample=genome
ATGGCTAACAAAGT...
>tsvA_4 sample=genome
ATGCGTATCGGTTA...
>abcN_18 sample=genome
ATGGATTTACCGAA...
Sequence Concatenation¶
If you want a concatenated allele sequence for phylogenetic or distance-based workflows, use utils concat on the extracted FASTA records.
gmlst utils concat -i genome_mlst.fasta -o genome_mlst_concat.fasta
This gives you one combined sequence per sample, which is often easier to pass into tree-building or alignment pipelines.
File Formats¶
Novel allele FASTA¶
File name pattern:
{locus}_novel.fasta
Example:
>dnaN_n1 sample=isolate_A1
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
>dnaN_n2 sample=isolate_B2
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCA
Novel profile file¶
File name:
profiles_novel.txt
Example:
ST sample dnaN gyrB abcZ
N1 isolate_A1 n1 5 12
N2 isolate_B2 n2 5 12
Custom scheme metadata¶
File name:
.meta.json
Representative schema:
{
"scheme": "custom_1",
"provider": "local",
"based_on": "saureus_1",
"based_on_provider": "pubmlst",
"scheme_type": "mlst",
"description": "Lab collection 2024",
"loci": ["arcC", "aroE", "glpF", "gmk", "pta", "tpi", "yqiL"],
"novel_alleles": {"arcC": ["n1"], "glpF": ["n1", "n2"]},
"novel_profiles": ["N1", "N2"],
"last_allele_number": {"arcC": 1, "glpF": 2}
}
The exact metadata can grow over time, but you should expect it to record where the custom scheme came from and how much novel content has been merged.
Best Practices¶
- Prefer JSON typing output for novel workflows. It preserves
novel_sequenceand avoids re-typing. - Keep one
novel_data/directory per project or batch series, not one giant shared scratch folder. - If you process multiple batches, keep separate allele extraction directories or rebuild from a combined JSON file, because novel allele FASTA files are not append-safe across runs.
- Add clear
--desctext when creating a custom scheme so the local catalog stays readable. - Reuse one custom scheme with
scheme update-customwhen you are extending the same collection. - Export GrapeTree TSV only after the custom scheme reflects the allele set you want to compare.
- Keep raw typing output, extracted novel data, and final custom schemes together in your project record.
Troubleshooting¶
--novel-profile did not produce a profile file¶
--novel-profile depends on --novel-allele, and only complete profiles can be written. If a sample has missing or partial loci, it will not become a novel profile.
My novel allele FASTA files changed between runs¶
That is expected if you reuse the same --data-dir across separate extractions. profiles_novel.txt appends safely, but {locus}_novel.fasta files are rewritten for each run.
Recommended fix:
# Keep each batch separate
gmlst utils extract -i batch1.json --novel-allele --novel-profile --data-dir novel_batch1/
gmlst utils extract -i batch2.json --novel-allele --novel-profile --data-dir novel_batch2/
I only have a TSV typing file¶
Use the TSV fallback path with both --samples-dir and -s/--scheme:
gmlst utils extract -i typing_results.tsv -s ecoli_1 \
--novel-allele --novel-profile \
--samples-dir ./samples --data-dir novel_data/
custom_1 does not appear in my normal scheme list¶
List local schemes explicitly:
gmlst scheme list -p local
Custom schemes belong to the local provider.
I want to compare samples in a tree¶
Export the custom scheme in GrapeTree format:
gmlst scheme export -s custom_1 --format grapetree -o mst.tsv
Then load that TSV into the visual workflow in visual_guide.md.