EMBOSS¶
Updated: 2021-02-01

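Overview
EMBOSS (the European Molecular Biology Open Software Suite) is an open-source collection of sequence analysis tools for Unix/Linux. It contains a large number of programs; this page covers only those likely to be useful for microbial genome analysis. For the complete program list and per-program documentation, see the official documentation.
EMBOSS uses a rather traditional command-line style with interactive prompts, similar to PHYLIP, which differs from the way most current high-throughput sequencing tools take their parameters. Many of its programs have also been superseded by newer tools, so it is not used very often today.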
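1. Installing EMBOSS¶
# Ubuntu ships EMBOSS in its package repositories
$ sudo apt install emboss
# Install with conda (the emboss package is provided by the bioconda channel)
$ conda create -n emboss -c conda-forge -c bioconda emboss
$ conda activate emboss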
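EMBOSS is a collection of command-line tools that can be integrated into analysis pipelines. Many EMBOSS programs can query remote databases directly, but the databases must first be enabled in a configuration file. For a conda installation, copy share/EMBOSS/emboss.default.template inside the conda environment directory to emboss.default in the same directory. For a system-wide installation, to configure databases for the current user only, copy emboss.default.template to ~/.embossrc.
# Create the configuration file
(emboss)$ cd $CONDA_PREFIX/share/EMBOSS
(emboss)$ cp emboss.default.template emboss.default
(emboss)$ vim emboss.default
# Uncomment the entries for the databases you want to enable, e.g. embl
# List the databases that are now available
(emboss)$ showdb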
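2. Usage¶
Listing the programs¶
wossname: EMBOSS contains a large number of applications. To make them easier to find, the dedicated program wossname lists programs together with a short description of what they do.
# Run with no arguments: wossname asks for a keyword; leave it blank to list every program and its function
(emboss)$ wossname
# Show the programs matching a keyword
# e.g. the applications related to alignment
(emboss)$ wossname alignment
# List all applications grouped by function
(emboss)$ wossname -search "" | less
# List all applications in alphabetical order
(emboss)$ wossname -search "" -alphabetic | less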
Databases¶
showdb: lists the available databases
dbtell: shows information about a database
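Sequence handling¶
seqret: reads sequences and writes them back out, optionally reformatted
# Read a sequence file and rewrite it as standard FASTA with 60 characters per line
(emboss)$ seqret 1.fas 2.fas
# Format conversion: convert FASTA to another format
(emboss)$ seqret fasta::1.fas gcg::1.gcg
# Read a sequence directly from a remote server
(emboss)$ seqret embl:x52524
# Run interactively
(emboss)$ seqret
Read and write (return) sequences
Input (gapped) sequence(s):         <--- enter the input sequence file name
output sequence(s) [...]:           <--- enter the name of the file to save the sequences to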
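The interactive prompts can be avoided by naming every parameter on the command line, which is more convenient in scripts. The following is only a sketch under stated assumptions: the file names demo_seqs.fasta and demo_seqs.gb are made up for illustration, and the EMBOSS call is skipped when seqret is not on the PATH. It uses the -sequence and -outseq qualifiers, -osformat2 to choose the output format, and -auto to suppress prompts.

```shell
# Write a tiny two-record FASTA file to work with (hypothetical demo data).
cat > demo_seqs.fasta <<'EOF'
>seq_a
ATGGCCAAATAA
>seq_b
ATGCCCGGGTAA
EOF

# Convert FASTA to GenBank format without any interactive prompts.
if command -v seqret >/dev/null 2>&1; then
    seqret -sequence demo_seqs.fasta -outseq demo_seqs.gb -osformat2 genbank -auto
    head -n 5 demo_seqs.gb
else
    echo "seqret not found; activate the emboss environment first" >&2
fi
```

The same conversion can be written with the format::file notation shown earlier (fasta::demo_seqs.fasta genbank::demo_seqs.gb); the named qualifiers are simply easier to read in a longer pipeline.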
showfeat: displays the features of a sequence
# Show the length of each contig in an assembly
(emboss)$ showfeat assembly.fasta result.showfeat
(emboss)$ less result.showfeat
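If only the contig names and lengths are needed, a plain table can be easier to read than the showfeat report. The sketch below is an illustration, not part of the original workflow: demo_assembly.fasta is a made-up file, the awk one-liner is a generic cross-check that does not depend on EMBOSS, and the infoseq call (with its -only, -name and -length qualifiers) runs only when infoseq is installed.

```shell
# A small stand-in for an assembly: two contigs, the first split over two lines.
cat > demo_assembly.fasta <<'EOF'
>contig_1
ACGTACGTAC
GTACGT
>contig_2
GGGCCC
EOF

# Plain awk cross-check: write "<name><TAB><length>" for every FASTA record.
awk '/^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
     { len += length($0) }
     END { if (name != "") print name "\t" len }' demo_assembly.fasta > demo_assembly.lengths
cat demo_assembly.lengths

# The EMBOSS equivalent: report only the name and length columns.
if command -v infoseq >/dev/null 2>&1; then
    infoseq -auto -only -name -length demo_assembly.fasta
fi
```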
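transeq: translates DNA/RNA sequences into amino acid sequences; the default output file uses the .pep suffix
backtranseq: back-translates amino acid sequences into DNA sequences.
# Translate DNA sequences in FASTA format into amino acid sequences
(emboss)$ transeq input.fasta output.pep
# Back-translate the amino acid sequences in input.pep into DNA sequences saved as output.fas
(emboss)$ backtranseq input.pep output.fas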
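For microbial genomes it is usually worth stating the reading frame and genetic code explicitly instead of relying on the defaults. The sketch below assumes a made-up input file, demo_gene.fasta, and skips the EMBOSS call when transeq is not installed; -frame and -table are transeq qualifiers, and table 11 is the bacterial, archaeal and plant plastid code.

```shell
# A minimal coding sequence, ATG GCC AAA TAA (hypothetical demo data).
cat > demo_gene.fasta <<'EOF'
>demo_gene
ATGGCCAAATAA
EOF

if command -v transeq >/dev/null 2>&1; then
    # -frame 1  : translate the first forward reading frame
    # -table 11 : bacterial, archaeal and plant plastid genetic code
    # -auto     : do not prompt for anything
    transeq -sequence demo_gene.fasta -outseq demo_gene.pep -frame 1 -table 11 -auto
    cat demo_gene.pep
else
    echo "transeq not found; activate the emboss environment first" >&2
fi
```

Setting -frame 6 instead translates all six reading frames, which can help when the strand of a gene is not yet known.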
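Primer design¶
eprimer3: picks PCR primers and hybridization oligos
primersearch: searches DNA sequences for matches with primer pairs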
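Pairwise alignment¶
Programs for pairwise sequence alignment
needle: global alignment of two sequences, based on the Needleman-Wunsch algorithm.
water: local alignment of two sequences, based on the Smith-Waterman algorithm.
# Align the two sequences seq1.fas and seq2.fas and write the seq1v2 alignments
# water produces a local alignment, needle a global alignment
(emboss)$ water seq1.fas seq2.fas seq1v2.sw
(emboss)$ needle seq1.fas seq2.fas seq1v2.nl
needleall: many-to-many pairwise alignments of two sequence sets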
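Both aligners ask for gap penalties when they are not given, so scripted runs normally pass them explicitly. The following sketch uses two made-up sequences (demo_seq1.fas and demo_seq2.fas) and runs the aligners only if they are installed. The penalties 10 and 0.5 are commonly used values and are assumed here to match what the programs suggest at their prompts.

```shell
# Two short, nearly identical sequences (hypothetical demo data).
cat > demo_seq1.fas <<'EOF'
>seq1
ATGGCCAAATTTGGGCCCTAA
EOF
cat > demo_seq2.fas <<'EOF'
>seq2
ATGGCCAAATTCGGGCCTAA
EOF

if command -v needle >/dev/null 2>&1 && command -v water >/dev/null 2>&1; then
    # Global alignment (Needleman-Wunsch) with explicit gap penalties.
    needle -asequence demo_seq1.fas -bsequence demo_seq2.fas \
           -gapopen 10 -gapextend 0.5 -outfile demo_1v2.needle -auto
    # Local alignment (Smith-Waterman) with the same penalties.
    water -asequence demo_seq1.fas -bsequence demo_seq2.fas \
          -gapopen 10 -gapextend 0.5 -outfile demo_1v2.water -auto
    # Both reports are plain text; view them side by side.
    cat demo_1v2.needle demo_1v2.water
else
    echo "needle/water not found; activate the emboss environment first" >&2
fi
```

Comparing the two reports shows the difference in practice: needle aligns both sequences end to end, while water reports only the best-scoring local region.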
Multiple sequence alignment¶
emma: multiple sequence alignment program built on ClustalW
edialign: local multiple sequence alignment
Data visualization¶
density: draws a nucleic acid density plot.
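Tool list¶
For details, see: tool list
| Program | Purpose | 
|---|---|
| aaindexextract | Extract amino acid property data from AAINDEX | 
| abiview | Display the trace peaks in an ABI sequencer file | 
| acdc | Test an application ACD file | 
| acdpretty | Correctly reformat an application ACD file | 
| acdtable | Generate an HTML table from an application ACD file | 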
| acdtrace | Trace processing of an application ACD file (for testing) | 
| acdvalid | Validate an application ACD file | 
| aligncopy | Read and write alignments | 
| aligncopypair | Read and write pairs from alignments | 
| antigenic | Find antigenic sites in proteins | 
| assemblyget | Get assembly of sequence reads | 
| backtranambig | Back-translate a protein sequence to ambiguous nucleotide sequence | 
| backtranseq | Back-translate a protein sequence to a nucleotide sequence | 
| banana | Plot bending and curvature data for B-DNA | 
| biosed | Replace or delete sequence sections | 
| btwisted | Calculate the twisting in a B-DNA sequence | 
| cachedas | Generate server cache file for DAS servers or for the DAS registry | 
| cachedbfetch | Generate server cache file for Dbfetch/WSDbfetch data sources | 
| cacheebeyesearch | Generate server cache file for EB-eye search domains | 
| cacheensembl | Generate server cache file for an Ensembl server | 
| cai | Calculate codon adaptation index | 
| chaos | Draw a chaos game representation plot for a nucleotide sequence | 
| charge | Draw a protein charge plot | 
| checktrans | Report STOP codons and ORF statistics of a protein | 
| chips | Calculate Nc codon usage statistic | 
| cirdna | Draw circular map of DNA constructs | 
| codcmp | Codon usage table comparison | 
| codcopy | Copy and reformat a codon usage table | 
| coderet | Extract CDS, mRNA and translations from feature tables | 
| compseq | Calculate the composition of unique words in sequences | 
| cons | Create a consensus sequence from a multiple alignment | 
| consambig | Create an ambiguous consensus sequence from a multiple alignment | 
| cpgplot | Identify and plot CpG islands in nucleotide sequence(s) | 
| cpgreport | Identify and report CpG-rich regions in nucleotide sequence(s) | 
| cusp | Create a codon usage table from nucleotide sequence(s) | 
| cutgextract | Extract codon usage tables from CUTG database | 
| cutseq | Remove a section from a sequence | 
| dan | Calculate nucleic acid melting temperature | 
| dbiblast | Index a BLAST database | 
| dbifasta | Index a fasta file database | 
| dbiflat | Index a flat file database | 
| dbigcg | Index a GCG formatted database | 
| dbtell | Display information about a public database | 
| dbxcompress | Compress an uncompressed dbx index | 
| dbxedam | Index the EDAM ontology using b+tree indices | 
| dbxfasta | Index a fasta file database using b+tree indices | 
| dbxflat | Index a flat file database using b+tree indices | 
| dbxgcg | Index a GCG formatted database using b+tree indices | 
| dbxobo | Index an obo ontology using b+tree indices | 
| dbxreport | Validate index and report internals for dbx databases | 
| dbxresource | Index a data resource catalogue using b+tree indices | 
| dbxstat | Dump statistics for dbx databases | 
| dbxtax | Index NCBI taxonomy using b+tree indices | 
| dbxuncompress | Uncompress a compressed dbx index | 
| degapseq | Remove non-alphabetic (e.g. gap) characters from sequences | 
| density | Draw a nucleic acid density plot | 
| descseq | Alter the name or description of a sequence | 
| diffseq | Compare and report features of two similar sequences | 
| distmat | Create a distance matrix from a multiple sequence alignment | 
| dotmatcher | Draw a threshold dotplot of two sequences | 
| dotpath | Draw a non-overlapping wordmatch dotplot of two sequences | 
| dottup | Display a wordmatch dotplot of two sequences | 
| dreg | Regular expression search of nucleotide sequence(s) | 
| drfinddata | Find public databases by data type | 
| drfindformat | Find public databases by format | 
| drfindid | Find public databases by identifier | 
| drfindresource | Find public databases by resource | 
| drget | Get data resource entries | 
| drtext | Get data resource entries complete text | 
| edamdef | Find EDAM ontology terms by definition | 
| edamhasinput | Find EDAM ontology terms by has_input relation | 
| edamhasoutput | Find EDAM ontology terms by has_output relation | 
| edamisformat | Find EDAM ontology terms by is_format_of relation | 
| edamisid | Find EDAM ontology terms by is_identifier_of relation | 
| edamname | Find EDAM ontology terms by name | 
| edialign | Local multiple alignment of sequences | 
| einverted | Find inverted repeats in nucleotide sequences | 
| embossdata | Find and retrieve EMBOSS data files | 
| embossupdate | Checks for more recent updates to EMBOSS | 
| embossversion | Report the current EMBOSS version number | 
| emma | Multiple sequence alignment (ClustalW wrapper) | 
| emowse | Search protein sequences by digest fragment molecular weight | 
| entret | Retrieve sequence entries from flatfile databases and files | 
| epestfind | Find PEST motifs as potential proteolytic cleavage sites | 
| eprimer3 | Pick PCR primers and hybridization oligos | 
| eprimer32 | Pick PCR primers and hybridization oligos | 
| equicktandem | Find tandem repeats in nucleotide sequences | 
| est2genome | Align EST sequences to genomic DNA sequence | 
| etandem | Find tandem repeats in a nucleotide sequence | 
| extractalign | Extract regions from a sequence alignment | 
| extractfeat | Extract features from sequence(s) | 
| extractseq | Extract regions from a sequence | 
| featcopy | Read and write a feature table | 
| featmerge | Merge two overlapping feature tables | 
| featreport | Read and write a feature table | 
| feattext | Return a feature table original text | 
| findkm | Calculate and plot enzyme reaction data | 
| freak | Generate residue/base frequency table or plot | 
| fuzznuc | Search for patterns in nucleotide sequences | 
| fuzzpro | Search for patterns in protein sequences | 
| fuzztran | Search for patterns in protein sequences (translated) | 
| garnier | Predict protein secondary structure using GOR method | 
| geecee | Calculate fractional GC content of nucleic acid sequences | 
| getorf | Find and extract open reading frames (ORFs) | 
| godef | Find GO ontology terms by definition | 
| goname | Find GO ontology terms by name | 
| helixturnhelix | Identify nucleic acid-binding motifs in protein sequences | 
| hmoment | Calculate and plot hydrophobic moment for protein sequence(s) | 
| iep | Calculate the isoelectric point of proteins | 
| infoalign | Display basic information about a multiple sequence alignment | 
| infoassembly | Display information about assemblies | 
| infobase | Return information on a given nucleotide base | 
| inforesidue | Return information on a given amino acid residue | 
| infoseq | Display basic information about sequences | 
| isochore | Plot isochores in DNA sequences | 
| jaspextract | Extract data from JASPAR | 
| jaspscan | Scan DNA sequences for transcription factors | 
| lindna | Draw linear maps of DNA constructs | 
| listor | Write a list file of the logical OR of two sets of sequences | 
| makenucseq | Create random nucleotide sequences | 
| makeprotseq | Create random protein sequences | 
| marscan | Find matrix/scaffold recognition (MRS) signatures in DNA sequences | 
| maskambignuc | Mask all ambiguity characters in nucleotide sequences with N | 
| maskambigprot | Mask all ambiguity characters in protein sequences with X | 
| maskfeat | Write a sequence with masked features | 
| maskseq | Write a sequence with masked regions | 
| matcher | Waterman-Eggert local alignment of two sequences | 
| megamerger | Merge two large overlapping DNA sequences | 
| merger | Merge two overlapping sequences | 
| msbar | Mutate a sequence | 
| mwcontam | Find weights common to multiple molecular weights files | 
| mwfilter | Filter noisy data from molecular weights file | 
| needle | Needleman-Wunsch global alignment of two sequences | 
| needleall | Many-to-many pairwise alignments of two sequence sets | 
| newcpgreport | Identify CpG islands in nucleotide sequence(s) | 
| newcpgseek | Identify and report CpG-rich regions in nucleotide sequence(s) | 
| newseq | Create a sequence file from a typed-in sequence | 
| nohtml | Remove mark-up (e.g. HTML tags) from an ASCII text file | 
| noreturn | Remove carriage return from ASCII files | 
| nospace | Remove whitespace from an ASCII text file | 
| notab | Replace tabs with spaces in an ASCII text file | 
| notseq | Write to file a subset of an input stream of sequences | 
| nthseq | Write to file a single sequence from an input stream of sequences | 
| nthseqset | Read and write (return) one set of sequences from many | 
| octanol | Draw a White-Wimley protein hydropathy plot | 
| oddcomp | Identify proteins with specified sequence word composition | 
| ontocount | Count ontology term(s) | 
| ontoget | Get ontology term(s) | 
| ontogetcommon | Get common ancestor for terms | 
| ontogetdown | Get ontology term(s) by parent id | 
| ontogetobsolete | Get obsolete ontology terms | 
| ontogetroot | Get ontology root terms by child identifier | 
| ontogetsibs | Get ontology term(s) by id with common parent | 
| ontogetup | Get ontology term(s) by id of child | 
| ontoisobsolete | Report whether an ontology term id is obsolete | 
| ontotext | Get ontology term(s) original full text | 
| palindrome | Find inverted repeats in nucleotide sequence(s) | 
| pasteseq | Insert one sequence into another | 
| patmatdb | Search protein sequences with a sequence motif | 
| patmatmotifs | Scan a protein sequence with motifs from the PROSITE database | 
| pepcoil | Predict coiled coil regions in protein sequences | 
| pepdigest | Report on protein proteolytic enzyme or reagent cleavage sites | 
| pepinfo | Plot amino acid properties of a protein sequence in parallel | 
| pepnet | Draw a helical net for a protein sequence | 
| pepstats | Calculate statistics of protein properties | 
| pepwheel | Draw a helical wheel diagram for a protein sequence | 
| pepwindow | Draw a hydropathy plot for a protein sequence | 
| pepwindowall | Draw Kyte-Doolittle hydropathy plot for a protein alignment | 
| plotcon | Plot conservation of a sequence alignment | 
| plotorf | Plot potential open reading frames in a nucleotide sequence | 
| polydot | Draw dotplots for all-against-all comparison of a sequence set | 
| preg | Regular expression search of protein sequence(s) | 
| prettyplot | Draw a sequence alignment with pretty formatting | 
| prettyseq | Write a nucleotide sequence and its translation to file | 
| primersearch | Search DNA sequences for matches with primer pairs | 
| printsextract | Extract data from PRINTS database for use by pscan | 
| profit | Scan one or more sequences with a simple frequency matrix | 
| prophecy | Create frequency matrix or profile from a multiple alignment | 
| prophet | Scan one or more sequences with a Gribskov or Henikoff profile | 
| prosextract | Process the PROSITE motif database for use by patmatmotifs | 
| pscan | Scan protein sequence(s) with fingerprints from the PRINTS database | 
| psiphi | Calculates phi and psi torsion angles from protein coordinates | 
| rebaseextract | Process the REBASE database for use by restriction enzyme applications | 
| recoder | Find restriction sites to remove (mutate) with no translation change | 
| redata | Retrieve information from REBASE restriction enzyme database | 
| refseqget | Get reference sequence | 
| remap | Display restriction enzyme binding sites in a nucleotide sequence | 
| restover | Find restriction enzymes producing a specific overhang | 
| restrict | Report restriction enzyme cleavage sites in a nucleotide sequence | 
| revseq | Reverse and complement a nucleotide sequence | 
| seealso | Find programs with similar function to a specified program | 
| seqcount | Read and count sequences | 
| seqmatchall | All-against-all word comparison of a sequence set | 
| seqret | Read and write (return) sequences | 
| seqretsetall | Read and write (return) many sets of sequences | 
| seqretsplit | Read sequences and write them to individual files | 
| seqxref | Retrieve all database cross-references for a sequence entry | 
| seqxrefget | Retrieve all cross-referenced data for a sequence entry | 
| servertell | Display information about a public server | 
| showalign | Display a multiple sequence alignment in pretty format | 
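| showdb | Display the databases configured for EMBOSS; some tools can query remote databases directly | 
| showfeat | Display the features and properties of a sequence, such as its length | 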
| showorf | Display a nucleotide sequence and translation in pretty format | 
| showpep | Display protein sequences with features in pretty format | 
| showseq | Display sequences with features in pretty format | 
| showserver | Display information on configured servers | 
| shuffleseq | Shuffle a set of sequences maintaining composition | 
| sigcleave | Report on signal cleavage sites in a protein sequence | 
| silent | Find restriction sites to insert (mutate) with no translation change | 
| sirna | Find siRNA duplexes in mRNA | 
| sixpack | Display a DNA sequence with 6-frame translation and ORFs | 
| sizeseq | Sort sequences by size | 
| skipredundant | Remove redundant sequences from an input set | 
| skipseq | Read and write (return) sequences, skipping first few | 
| splitsource | Split sequence(s) into original source sequences | 
| splitter | Split sequence(s) into smaller sequences | 
| stretcher | Needleman-Wunsch rapid global alignment of two sequences | 
| stssearch | Search a DNA database for matches with a set of STS primers | 
| supermatcher | Calculate approximate local pair-wise alignments of larger sequences | 
| syco | Draw synonymous codon usage statistic plot for a nucleotide sequence | 
| taxget | Get taxon(s) | 
| taxgetdown | Get descendants of taxon(s) | 
| taxgetrank | Get parents of taxon(s) | 
| taxgetspecies | Get all species under taxon(s) | 
| taxgetup | Get parents of taxon(s) | 
| tcode | Identify protein-coding regions using Fickett TESTCODE statistic | 
| textget | Get text data entries | 
| textsearch | Search the textual description of sequence(s) | 
| tfextract | Process TRANSFAC transcription factor database for use by tfscan | 
| tfm | Display full documentation for an application | 
| tfscan | Identify transcription factor binding sites in DNA sequences | 
| tmap | Predict and plot transmembrane segments in protein sequences | 
| tranalign | Generate an alignment of nucleic coding regions from aligned proteins | 
| transeq | Translate nucleic acid sequences | 
| trimest | Remove poly-A tails from nucleotide sequences | 
| trimseq | Remove unwanted characters from start and end of sequence(s) | 
| trimspace | Remove extra whitespace from an ASCII text file | 
| twofeat | Find neighbouring pairs of features in sequence(s) | 
| union | Concatenate multiple sequences into a single sequence | 
| urlget | Get URLs of data resources | 
| variationget | Get sequence variations | 
| vectorstrip | Remove vectors from the ends of nucleotide sequence(s) | 
| water | Smith-Waterman local alignment of sequences | 
| whichdb | Search all sequence databases for an entry and retrieve it | 
| wobble | Plot third base position variability in a nucleotide sequence | 
| wordcount | Count and extract unique words in molecular sequence(s) | 
| wordfinder | Match large sequences against one or more other sequences | 
| wordmatch | Find regions of identity (exact matches) of two sequences | 
| wossdata | Find programs by EDAM data | 
| wossinput | Find programs by EDAM input data | 
| wossname | Find programs by keywords in their short description | 
| wossoperation | Find programs by EDAM operation | 
| wossoutput | Find programs by EDAM output data | 
| wossparam | Find programs by EDAM parameter | 
| wosstopic | Find programs by EDAM topic | 
| yank | Add a sequence reference (a full USA) to a list file |