EMBOSS¶

Mark Renton 更新于: 2021-02-01

emboss

内容简介

EMBOSS 是欧洲分子生物学组织开发的 Unix/Linux 下的生物学分析工具。EMBOSS 包含工具众多，这里只介绍与微生物基因组分析可能会用到的一些工具，所有的软件和其文档参考官方文档。

由于 EMBOSS 的命令行方式比较传统，类似phylip的交互模式，和当前许多高通量测序工具的参数设置等不一致，加上许多新的工具代替，因此使用的机会不是很多。这里介绍部分与微生物基因组分析可能会用到的一些工具，所有的软件和其文档参考官方文档。

1. 安装 EMBOSS¶

# ubuntu 包含 emboss 发行版
$ sudo apt install emboss

# 通过 conda 安装
$ conda create -n emboss emboss
$ conda activate emboss

EMBOSS 工具集包含众多基于命令行的工具，可以集成到分析工具流中。EMBOSS 中的许多可以直接访问远程数据库，但这需要默认配置。如果是通过 conda 安装，则要将所在conda虚拟环境路径中的share/EMBOSS/emboss.default.template复制为emboss.default。如果是通过系统级安装，要在当前用户环境中配置数据库，则可以将emboss.default.template复制为~/.embossrc。

# 建立配置文件
(emboss)$ cd $CONDA_PREFIX/EMBOSS
(emboss)$ cp emboss.default.template emboss.default
(emboss)$ vim emboss.default
# 将需要激活的数据库注释符号取出，比如embl

# 查看可以使用的数据库
(emboss)$ showdb

2. 使用¶

查看各个程序¶

wossname: EMBOSS 包含众多应用程序，为方便查询，有一个专门的程序wossname可以查看所有程序及功能。

# 显示所有程序及功能
(emboss)$ wossname
# 显示某一个类别的程序
# 显示alignment比对相关的应用程序
(emboss)$ wossname alignment

# 按功能分组显示应用
(emboss)$ wossname -search "" | less

# 按字母顺序显示应用
(emboss)$ wossname -search "" -alphabetic | less

数据库相关¶

showdb: 显示可用的数据库 dbtell: 显示数据库相关信息

# 显示当前可用数据库
(emboss)$ showdb -full

# 显示embl数据库信息
(emboss)$ dbtell embl

序列处理¶

seqret: 生成序列或格式化序列

# 读入序列，以60字符为长度将其格式化为规范的fasta格式数据
(emboss)$ seqret 1.fas 2.fas

# 序列个是转换 fasta 格式转换为其他格式
(emboss)$ seqret fasta::1.fas gcg::1.gcg

# 直接从服务器读取序列
(emboss)$ seqret embl:x52524

# 通过交互方式进行操作
(emboss)$ seqret
Read and write (return) sequences
Input (gapped) sequence(s):         <--- 输入序列文件名称
output sequence(s) [...]:           <--- 输入序列保存的文件名称

showfeat: 显示序列信息

# 显示组装子各个 contigs 长度
(emboss)$ showfeat assembly.fasta result.showfeat
(emboss)$ less result.showfeat

transeq: DNA/RNA序列翻译成氨基酸序列，文件以 .pep 后缀保存
backtranseq: 氨基酸序列转换成DNA序列。

# fasta个是的DNA序列转换成氨基酸序列
(emboss)$ transeq input.fasta output.pep
# 将 input.pep 氨基酸序列转换成碱基格式保存成 output.fas DNA序列
(emboss)$ backtranseq input.pep output.fas

引物设计¶

eprimer3: primersearch:

双序列比对¶

双序列(Pairwise Alignment)比对的软件

needle: local alignment 序列比对 water: global alignment 序列比对，基于 water-smith 算法。

# 对2条序列 seq1.fas 和 seq2.fas 进行全局比对，生成 seq1v2 alignment
(emboss)$ water seq1.fas seq2.fas seq1v2.sw
(emboss)$ needle seq1.fas seq2.fas seq1v2.nl

needleall

多序列比对¶

emma: 基于 clustalW 的多序列比对程序 edialign: Local 多序列比对

(emboss)$ emma msa1.fas msa1.phy
(emboss)$ edialign

数据可视化¶

density: 绘制核酸密度图。

# 生成序列密度图
(emboss)$ density -seqall input.fasta -display D -gragh ps

工具列表¶

具体信息参见：工具列表

程序名称	用途
aaindexextract	从AAINDEX中提取氨基酸属性
abiview	显示 ABI 测序数据峰值
acdc	测试 ACD 文件
acdpretty	修正 ACD 格式文件
acdtable	从 ACD 文件应用中生成 HTML 格式的表格
acdtrace	Trace processing of an application ACD file (for testing)
acdvalid	Validate an application ACD file
aligncopy	Read and write alignments
aligncopypair	Read and write pairs from alignments
antigenic	Find antigenic sites in proteins
assemblyget	Get assembly of sequence reads
backtranambig	Back-translate a protein sequence to ambiguous nucleotide sequence
backtranseq	Back-translate a protein sequence to a nucleotide sequence
banana	Plot bending and curvature data for B-DNA
biosed	Replace or delete sequence sections
btwisted	Calculate the twisting in a B-DNA sequence
cachedas	Generate server cache file for DAS servers or for the DAS registry
cachedbfetch	Generate server cache file for Dbfetch/WSDbfetch data sources
cacheebeyesearch	Generate server cache file for EB-eye search domains
cacheensembl	Generate server cache file for an Ensembl server
cai	计算 condon adaptation index
chaos	Draw a chaos game representation plot for a nucleotide sequence
charge	Draw a protein charge plot
checktrans	Report STOP codons and ORF statistics of a protein
chips	Calculate Nc codon usage statistic
cirdna	Draw circular map of DNA constructs
codcmp	Codon usage table comparison
codcopy	Copy and reformat a codon usage table
coderet	Extract CDS, mRNA and translations from feature tables
compseq	Calculate the composition of unique words in sequences
cons	Create a consensus sequence from a multiple alignment
consambig	Create an ambiguous consensus sequence from a multiple alignment
cpgplot	Identify and plot CpG islands in nucleotide sequence(s)
cpgreport	Identify and report CpG-rich regions in nucleotide sequence(s)
cusp	Create a codon usage table from nucleotide sequence(s)
cutgextract	Extract codon usage tables from CUTG database
cutseq	Remove a section from a sequence
dan	Calculate nucleic acid melting temperature
dbiblast	Index a BLAST database
dbifasta	Index a fasta file database
dbiflat	Index a flat file database
dbigcg	Index a GCG formatted database
dbtell	Display information about a public database
dbxcompress	Compress an uncompressed dbx index
dbxedam	Index the EDAM ontology using b+tree indices
dbxfasta	Index a fasta file database using b+tree indices
dbxflat	Index a flat file database using b+tree indices
dbxgcg	Index a GCG formatted database using b+tree indices
dbxobo	Index an obo ontology using b+tree indices
dbxreport	Validate index and report internals for dbx databases
dbxresource	Index a data resource catalogue using b+tree indices
dbxstat	Dump statistics for dbx databases
dbxtax	Index NCBI taxonomy using b+tree indices
dbxuncompress	Uncompress a compressed dbx index
degapseq	Remove non-alphabetic (e.g. gap) characters from sequences
density	Draw a nucleic acid density plot
descseq	Alter the name or description of a sequence
diffseq	Compare and report features of two similar sequences
distmat	Create a distance matrix from a multiple sequence alignment
dotmatcher	Draw a threshold dotplot of two sequences
dotpath	Draw a non-overlapping wordmatch dotplot of two sequences
dottup	Display a wordmatch dotplot of two sequences
dreg	Regular expression search of nucleotide sequence(s)
drfinddata	Find public databases by data type
drfindformat	Find public databases by format
drfindid	Find public databases by identifier
drfindresource	Find public databases by resource
drget	Get data resource entries
drtext	Get data resource entries complete text
edamdef	Find EDAM ontology terms by definition
edamhasinput	Find EDAM ontology terms by has_input relation
edamhasoutput	Find EDAM ontology terms by has_output relation
edamisformat	Find EDAM ontology terms by is_format_of relation
edamisid	Find EDAM ontology terms by is_identifier_of relation
edamname	Find EDAM ontology terms by name
edialign	Local multiple alignment of sequences
einverted	Find inverted repeats in nucleotide sequences
embossdata	Find and retrieve EMBOSS data files
embossupdate	Checks for more recent updates to EMBOSS
embossversion	Report the current EMBOSS version number
emma	Multiple sequence alignment (ClustalW wrapper)
emowse	Search protein sequences by digest fragment molecular weight
entret	Retrieve sequence entries from flatfile databases and files
epestfind	Find PEST motifs as potential proteolytic cleavage sites
eprimer3	Pick PCR primers and hybridization oligos
eprimer32	Pick PCR primers and hybridization oligos
equicktandem	Find tandem repeats in nucleotide sequences
est2genome	Align EST sequences to genomic DNA sequence
etandem	Find tandem repeats in a nucleotide sequence
extractalign	Extract regions from a sequence alignment
extractfeat	Extract features from sequence(s)
extractseq	Extract regions from a sequence
featcopy	Read and write a feature table
featmerge	Merge two overlapping feature tables
featreport	Read and write a feature table
feattext	Return a feature table original text
findkm	Calculate and plot enzyme reaction data
freak	Generate residue/base frequency table or plot
fuzznuc	Search for patterns in nucleotide sequences
fuzzpro	Search for patterns in protein sequences
fuzztran	Search for patterns in protein sequences (translated)
garnier	Predict protein secondary structure using GOR method
geecee	Calculate fractional GC content of nucleic acid sequences
getorf	Find and extract open reading frames (ORFs)
godef	Find GO ontology terms by definition
goname	Find GO ontology terms by name
helixturnhelix	Identify nucleic acid-binding motifs in protein sequences
hmoment	Calculate and plot hydrophobic moment for protein sequence(s)
iep	Calculate the isoelectric point of proteins
infoalign	Display basic information about a multiple sequence alignment
infoassembly	Display information about assemblies
infobase	Return information on a given nucleotide base
inforesidue	Return information on a given amino acid residue
infoseq	Display basic information about sequences
isochore	Plot isochores in DNA sequences
jaspextract	Extract data from JASPAR
jaspscan	Scan DNA sequences for transcription factors
lindna	Draw linear maps of DNA constructs
listor	Write a list file of the logical OR of two sets of sequences
makenucseq	Create random nucleotide sequences
makeprotseq	Create random protein sequences
marscan	Find matrix/scaffold recognition (MRS) signatures in DNA sequences
maskambignuc	Mask all ambiguity characters in nucleotide sequences with N
maskambigprot	Mask all ambiguity characters in protein sequences with X
maskfeat	Write a sequence with masked features
maskseq	Write a sequence with masked regions
matcher	Waterman-Eggert local alignment of two sequences
megamerger	Merge two large overlapping DNA sequences
merger	Merge two overlapping sequences
msbar	Mutate a sequence
mwcontam	Find weights common to multiple molecular weights files
mwfilter	Filter noisy data from molecular weights file
needle	Needleman-Wunsch 全局比对
needleall	两两双序列比对
newcpgreport	Identify CpG islands in nucleotide sequence(s)
newcpgseek	Identify and report CpG-rich regions in nucleotide sequence(s)
newseq	Create a sequence file from a typed-in sequence
nohtml	Remove mark-up (e.g. HTML tags) from an ASCII text file
noreturn	Remove carriage return from ASCII files
nospace	Remove whitespace from an ASCII text file
notab	Replace tabs with spaces in an ASCII text file
notseq	Write to file a subset of an input stream of sequences
nthseq	Write to file a single sequence from an input stream of sequences
nthseqset	Read and write (return) one set of sequences from many
octanol	Draw a White-Wimley protein hydropathy plot
oddcomp	Identify proteins with specified sequence word composition
ontocount	Count ontology term(s)
ontoget	Get ontology term(s)
ontogetcommon	Get common ancestor for terms
ontogetdown	Get ontology term(s) by parent id
ontogetobsolete	Get ontology ontology terms
ontogetroot	Get ontology root terms by child identifier
ontogetsibs	Get ontology term(s) by id with common parent
ontogetup	Get ontology term(s) by id of child
ontoisobsolete	Report whether an ontology term id is obsolete
ontotext	Get ontology term(s) original full text
palindrome	Find inverted repeats in nucleotide sequence(s)
pasteseq	Insert one sequence into another
patmatdb	Search protein sequences with a sequence motif
patmatmotifs	Scan a protein sequence with motifs from the PROSITE database
pepcoil	Predict coiled coil regions in protein sequences
pepdigest	Report on protein proteolytic enzyme or reagent cleavage sites
pepinfo	Plot amino acid properties of a protein sequence in parallel
pepnet	Draw a helical net for a protein sequence
pepstats	Calculate statistics of protein properties
pepwheel	Draw a helical wheel diagram for a protein sequence
pepwindow	Draw a hydropathy plot for a protein sequence
pepwindowall	Draw Kyte-Doolittle hydropathy plot for a protein alignment
plotcon	Plot conservation of a sequence alignment
plotorf	Plot potential open reading frames in a nucleotide sequence
polydot	Draw dotplots for all-against-all comparison of a sequence set
preg	Regular expression search of protein sequence(s)
prettyplot	Draw a sequence alignment with pretty formatting
prettyseq	Write a nucleotide sequence and its translation to file
primersearch	Search DNA sequences for matches with primer pairs
printsextract	Extract data from PRINTS database for use by pscan
profit	Scan one or more sequences with a simple frequency matrix
prophecy	Create frequency matrix or profile from a multiple alignment
prophet	Scan one or more sequences with a Gribskov or Henikoff profile
prosextract	Process the PROSITE motif database for use by patmatmotifs
pscan	Scan protein sequence(s) with fingerprints from the PRINTS database
psiphi	Calculates phi and psi torsion angles from protein coordinates
rebaseextract	Process the REBASE database for use by restriction enzyme applications
recoder	Find restriction sites to remove (mutate) with no translation change
redata	Retrieve information from REBASE restriction enzyme database
refseqget	Get reference sequence
remap	Display restriction enzyme binding sites in a nucleotide sequence
restover	Find restriction enzymes producing a specific overhang
restrict	Report restriction enzyme cleavage sites in a nucleotide sequence
revseq	Reverse and complement a nucleotide sequence
seealso	Find programs with similar function to a specified program
seqcount	Read and count sequences
seqmatchall	All-against-all word comparison of a sequence set
seqret	Read and write (return) sequences
seqretsetall	Read and write (return) many sets of sequences
seqretsplit	Read sequences and write them to individual files
seqxref	Retrieve all database cross-references for a sequence entry
seqxrefget	Retrieve all cross-referenced data for a sequence entry
servertell	Display information about a public server
showalign	Display a multiple sequence alignment in pretty format
showdb	显示 EMBOSS 工具支持的数据库，一些工具可以直接操作远程数据库
showfeat	显示序列的属性，如长度等信息
showorf	Display a nucleotide sequence and translation in pretty format
showpep	Display protein sequences with features in pretty format
showseq	Display sequences with features in pretty format
showserver	Display information on configured servers
shuffleseq	Shuffle a set of sequences maintaining composition
sigcleave	Report on signal cleavage sites in a protein sequence
silent	Find restriction sites to insert (mutate) with no translation change
sirna	Find siRNA duplexes in mRNA
sixpack	Display a DNA sequence with 6-frame translation and ORFs
sizeseq	Sort sequences by size
skipredundant	Remove redundant sequences from an input set
skipseq	Read and write (return) sequences, skipping first few
splitsource	Split sequence(s) into original source sequences
splitter	Split sequence(s) into smaller sequences
stretcher	Needleman-Wunsch rapid global alignment of two sequences
stssearch	Search a DNA database for matches with a set of STS primers
supermatcher	Calculate approximate local pair-wise alignments of larger sequences
syco	Draw synonymous codon usage statistic plot for a nucleotide sequence
taxget	Get taxon(s)
taxgetdown	Get descendants of taxon(s)
taxgetrank	Get parents of taxon(s)
taxgetspecies	Get all species under taxon(s)
taxgetup	Get parents of taxon(s)
tcode	Identify protein-coding regions using Fickett TESTCODE statistic
textget	Get text data entries
textsearch	Search the textual description of sequence(s)
tfextract	Process TRANSFAC transcription factor database for use by tfscan
tfm	Display full documentation for an application
tfscan	Identify transcription factor binding sites in DNA sequences
tmap	Predict and plot transmembrane segments in protein sequences
tranalign	Generate an alignment of nucleic coding regions from aligned proteins
transeq	Translate nucleic acid sequences
trimest	Remove poly-A tails from nucleotide sequences
trimseq	Remove unwanted characters from start and end of sequence(s)
trimspace	Remove extra whitespace from an ASCII text file
twofeat	Find neighbouring pairs of features in sequence(s)
union	Concatenate multiple sequences into a single sequence
urlget	Get URLs of data resources
variationget	Get sequence variations
vectorstrip	Remove vectors from the ends of nucleotide sequence(s)
water	Smith-Waterman local alignment of sequences
whichdb	Search all sequence databases for an entry and retrieve it
wobble	Plot third base position variability in a nucleotide sequence
wordcount	Count and extract unique words in molecular sequence(s)
wordfinder	Match large sequences against one or more other sequences
wordmatch	Find regions of identity (exact matches) of two sequences
wossdata	Find programs by EDAM data
wossinput	Find programs by EDAM input data
wossname	Find programs by keywords in their short description
wossoperation	Find programs by EDAM operation
wossoutput	Find programs by EDAM output data
wossparam	Find programs by EDAM parameter
wosstopic	Find programs by EDAM topic
yank	Add a sequence reference (a full USA) to a list file