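EMBOSS
Updated: 2021-02-01

Overview
EMBOSS (the European Molecular Biology Open Software Suite) is a collection of command-line sequence analysis tools for Unix/Linux. It contains a large number of programs; this page covers only those likely to be useful for microbial genome analysis. For all other programs and their documentation, refer to the official documentation.
EMBOSS uses a fairly traditional command-line style with interactive prompts, similar to phylip, which differs from the parameter conventions of most current high-throughput sequencing tools. Many of its programs have also been superseded by newer tools, so in practice it is not used very often.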
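1. Installing EMBOSS
# Ubuntu ships EMBOSS as a distribution package
$ sudo apt install emboss
# Install with conda (the emboss package comes from the bioconda channel)
$ conda create -n emboss emboss
$ conda activate emboss
EMBOSS is a set of command-line tools that can be integrated into analysis pipelines. Many of them can query remote databases directly, but this requires a configuration file. For a conda installation, copy share/EMBOSS/emboss.default.template under the conda environment path to emboss.default in the same directory. For a system-wide installation, you can configure databases for the current user by copying emboss.default.template to ~/.embossrc.
# Create the configuration file
(emboss)$ cd $CONDA_PREFIX/share/EMBOSS
(emboss)$ cp emboss.default.template emboss.default
(emboss)$ vim emboss.default
# Uncomment the databases you want to enable, e.g. embl
# List the databases that are available
(emboss)$ showdb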
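A quick sanity check after installation can be scripted. The sketch below only tests whether a few of the programs named on this page are on the PATH and whether emboss.default exists at the conda location described above. The list of programs checked is an arbitrary choice, and nothing is modified.

```shell
# Count how many of a few EMBOSS programs are missing from PATH.
missing=0
for prog in embossversion seqret showdb wossname; do
  if command -v "$prog" >/dev/null 2>&1; then
    echo "found:   $prog"
  else
    echo "missing: $prog"
    missing=$((missing + 1))
  fi
done
echo "programs missing: $missing"

# Check for the configuration file at the conda location described above.
# Outside a conda environment CONDA_PREFIX is unset and the file is reported as absent.
conf="${CONDA_PREFIX:-}/share/EMBOSS/emboss.default"
if [ -f "$conf" ]; then
  echo "configuration found: $conf"
else
  echo "no emboss.default at: $conf"
fi
```

If the configuration file is reported as absent, showdb may list few or no databases until the template has been copied and edited.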
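2. Usage
Listing the programs
wossname: EMBOSS contains many applications. To make them easier to find, the dedicated program wossname lists the programs and their functions.
# Prompt for a keyword; an empty answer lists all programs and their functions
(emboss)$ wossname
# List the programs matching a keyword,
# e.g. those whose short description mentions "alignment"
(emboss)$ wossname alignment
# List the applications grouped by function
(emboss)$ wossname -search "" | less
# List the applications in alphabetical order
(emboss)$ wossname -search "" -alphabetic | less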
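EMBOSS programs accept the global qualifiers -auto (suppress the interactive prompts) and -help (print the program's qualifiers), so program discovery can also run inside a script. A minimal sketch, assuming EMBOSS is on the PATH; the keyword and the output file name are arbitrary, and the EMBOSS commands are skipped if wossname is not found.

```shell
workdir=$(mktemp -d)
keyword="alignment"
outfile="$workdir/apps_${keyword}.txt"

if command -v wossname >/dev/null 2>&1; then
  # -auto turns off the prompts, so the command does not wait for input.
  wossname -search "$keyword" -auto > "$outfile"
  # -help prints the qualifiers of a single program (here: needle).
  needle -help 2>&1 | head -n 30
else
  echo "wossname not found on PATH; skipping"
fi
echo "program list (if any) written to: $outfile"
```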
Databases
showdb: lists the available databases
dbtell: displays information about a database
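Sequence handling
seqret: reads sequences and writes them back out, optionally in another format
# Read the sequences and write them out as standard FASTA, wrapped at 60 characters per line
(emboss)$ seqret 1.fas 2.fas
# Format conversion: convert FASTA to another format
(emboss)$ seqret fasta::1.fas gcg::1.gcg
# Read a sequence directly from a server
(emboss)$ seqret embl:x52524
# Run interactively
(emboss)$ seqret
Read and write (return) sequences
Input (gapped) sequence(s): <--- name of the input sequence file
output sequence(s) [...]: <--- name of the file to save the sequences to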
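The interactive session above can be replaced by named qualifiers, which is usually more convenient in a pipeline. The following is a sketch under assumptions: the two toy sequences and all file names are invented for illustration, and the seqret calls are skipped when EMBOSS is not installed.

```shell
workdir=$(mktemp -d)

# Toy input: two short, made-up nucleotide sequences.
printf '>contig1\nATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG\n>contig2\nATGAAACGCATTAGCACCACCATTACCACC\n' \
  > "$workdir/1.fas"
nseq=$(grep -c '^>' "$workdir/1.fas")
echo "input sequences: $nseq"

if command -v seqret >/dev/null 2>&1; then
  # Named qualifiers instead of prompts; -auto suppresses all questions.
  seqret -sequence "$workdir/1.fas" -outseq "$workdir/1.gb" -osformat2 genbank -auto
  # Write to standard output so the result can be piped into another tool.
  seqret -sequence "$workdir/1.fas" -stdout -auto | grep -c '^>'
else
  echo "seqret not found on PATH; skipping"
fi
```

Here -osformat2 sets the output format of the second parameter (the output file), which is equivalent to the format::file syntax used above.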
showfeat: displays sequence features
# Show the length of each contig in an assembly
(emboss)$ showfeat assembly.fasta result.showfeat
(emboss)$ less result.showfeat
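If only contig names and lengths are needed, infoseq (listed in the tool table as "Display basic information about sequences") gives a more compact table than showfeat. The sketch below assumes a toy assembly.fasta created on the spot. The awk fallback is not part of EMBOSS; it is included only so that the lengths are available even without an installation.

```shell
workdir=$(mktemp -d)

# Toy assembly: contig2 is split over two lines on purpose.
printf '>contig1\nATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG\n>contig2\nACGTACGTAC\nGGTT\n' \
  > "$workdir/assembly.fasta"

if command -v infoseq >/dev/null 2>&1; then
  # -only switches all columns off; -name, -length and -pgc switch three back on.
  infoseq -sequence "$workdir/assembly.fasta" -only -name -length -pgc -auto
fi

# Plain awk fallback: sum the residue lines under each header.
awk '
  /^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (name != "") print name "\t" len }
' "$workdir/assembly.fasta" > "$workdir/contig_lengths.tsv"
cat "$workdir/contig_lengths.tsv"
```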
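transeq: translates DNA/RNA sequences into amino acid sequences; the output is conventionally saved with a .pep suffix
backtranseq: back-translates amino acid sequences into DNA sequences.
# Translate a FASTA-format DNA sequence into an amino acid sequence
(emboss)$ transeq input.fasta output.pep
# Back-translate the amino acid sequences in input.pep and save the resulting DNA sequences to output.fas
(emboss)$ backtranseq input.pep output.fas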
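For microbial sequences the reading frame and the genetic code usually need to be set explicitly. A hedged sketch: the toy gene and the file names are made up, genetic code table 11 (bacterial, archaeal and plant plastid) is chosen as an assumption about the organism, and the transeq calls are skipped when EMBOSS is not installed.

```shell
workdir=$(mktemp -d)

# Toy coding sequence (13 codons, invented for illustration).
printf '>gene1\nATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG\n' > "$workdir/input.fasta"

# A length that is not a multiple of 3 often points to a partial or frame-shifted gene.
len=$(grep -v '^>' "$workdir/input.fasta" | tr -d '\n ' | wc -c | tr -d ' ')
echo "sequence length: $len (remainder mod 3: $((len % 3)))"

if command -v transeq >/dev/null 2>&1; then
  # Frame 1 only, genetic code table 11.
  transeq -sequence "$workdir/input.fasta" -outseq "$workdir/output.pep" \
          -frame 1 -table 11 -auto
  # All six reading frames, e.g. when the strand and frame are unknown.
  transeq -sequence "$workdir/input.fasta" -outseq "$workdir/sixframe.pep" \
          -frame 6 -table 11 -auto
else
  echo "transeq not found on PATH; skipping"
fi
```

Note that back-translation with backtranseq picks codons from a codon usage table, so it yields one plausible coding sequence rather than the original DNA.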
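Primer design
eprimer3: picks PCR primers and hybridization oligos
primersearch: searches DNA sequences for matches with primer pairs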
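primersearch reads the primer pairs from a text file with one pair per line: a name followed by the two primer sequences, separated by whitespace. The sketch below builds such a file from a made-up template so that both primers are guaranteed to match; the template, the primer length of 18 and the file names are assumptions for illustration only.

```shell
workdir=$(mktemp -d)

# Made-up 60 nt template.
seq="ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGAT"
printf '>template\n%s\n' "$seq" > "$workdir/template.fas"

# Forward primer: first 18 nt. Reverse primer: reverse complement of the last 18 nt.
fwd=$(printf '%s' "$seq" | cut -c1-18)
rev_primer=$(printf '%s' "$seq" | cut -c43-60 | rev | tr ACGT TGCA)
printf 'pair1\t%s\t%s\n' "$fwd" "$rev_primer" > "$workdir/primers.txt"
cat "$workdir/primers.txt"

if command -v primersearch >/dev/null 2>&1; then
  # -mismatchpercent 0 asks for exact primer matches only.
  primersearch -seqall "$workdir/template.fas" -infile "$workdir/primers.txt" \
               -mismatchpercent 0 -outfile "$workdir/hits.primersearch" -auto
  cat "$workdir/hits.primersearch"
else
  echo "primersearch not found on PATH; skipping"
fi
```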
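Pairwise alignment
Programs for pairwise alignment of two sequences
needle: global alignment, based on the Needleman-Wunsch algorithm
water: local alignment, based on the Smith-Waterman algorithm
# Align the two sequences seq1.fas and seq2.fas locally (water) and globally (needle), producing the seq1v2 alignments
(emboss)$ water seq1.fas seq2.fas seq1v2.sw
(emboss)$ needle seq1.fas seq2.fas seq1v2.nl
needleall: many-to-many pairwise alignments of two sets of sequences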
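The same two alignments can be run without prompts and with explicit gap penalties, which makes the difference between the global and the local result easy to compare. A minimal sketch: the two toy sequences and the file names are invented, the gap penalties shown are the usual EMBOSS defaults, and the alignments are skipped when EMBOSS is not installed.

```shell
workdir=$(mktemp -d)

# Two made-up sequences that share a core region but differ at the ends.
printf '>seq1\nTTTTATGGCTAGCAAAGGAGAAGAACTTTTCACTGG\n' > "$workdir/seq1.fas"
printf '>seq2\nATGGCTAGCAAGGGAGAAGAACTTTTCACTGGCCCC\n' > "$workdir/seq2.fas"
ninputs=$(cat "$workdir/seq1.fas" "$workdir/seq2.fas" | grep -c '^>')

if command -v needle >/dev/null 2>&1 && command -v water >/dev/null 2>&1; then
  # Global alignment (Needleman-Wunsch): both sequences are aligned end to end.
  needle -asequence "$workdir/seq1.fas" -bsequence "$workdir/seq2.fas" \
         -gapopen 10 -gapextend 0.5 -outfile "$workdir/seq1v2.nl" -auto
  # Local alignment (Smith-Waterman): only the best-matching region is reported.
  water  -asequence "$workdir/seq1.fas" -bsequence "$workdir/seq2.fas" \
         -gapopen 10 -gapextend 0.5 -outfile "$workdir/seq1v2.sw" -auto
  # Compare the summary lines of the two reports.
  grep -E '^# (Length|Identity|Gaps|Score)' "$workdir/seq1v2.nl" "$workdir/seq1v2.sw"
else
  echo "needle/water not found on PATH; skipping"
fi
```

The global alignment is expected to be at least as long as the local one, because it has to include the unmatched ends.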
Multiple sequence alignment
emma: multiple sequence alignment program based on ClustalW
edialign: local multiple sequence alignment
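emma is a wrapper, so it needs a ClustalW executable in addition to EMBOSS itself. The sketch below assumes that the executable is called clustalw and is on the PATH; the three toy sequences and the file names are made up, and the alignment is skipped if either program is missing.

```shell
workdir=$(mktemp -d)

# Three short, made-up sequences in one FASTA file.
printf '>s1\nATGGCTAGCAAAGGAGAAGAACTT\n>s2\nATGGCTAGCAAGGGAGAAGAACTT\n>s3\nATGGCTTCCAAAGGAGAAGAGCTT\n' \
  > "$workdir/seqs.fas"
nseq=$(grep -c '^>' "$workdir/seqs.fas")
echo "sequences to align: $nseq"

if command -v emma >/dev/null 2>&1 && command -v clustalw >/dev/null 2>&1; then
  # -outseq receives the aligned sequences, -dendoutfile the guide tree.
  emma -sequence "$workdir/seqs.fas" -outseq "$workdir/seqs.aln" \
       -dendoutfile "$workdir/seqs.dnd" -auto
  cat "$workdir/seqs.aln"
else
  echo "emma or clustalw not found on PATH; skipping"
fi
```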
Data visualization
density: draws a nucleic acid density plot.
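Tool list
For details, see: tool list
| Program | Purpose |
|---|---|
| aaindexextract | Extract amino acid properties from AAINDEX |
| abiview | Display the trace peaks of ABI sequencing data |
| acdc | Test an ACD file |
| acdpretty | Correctly reformat an ACD file |
| acdtable | Generate an HTML table from an application ACD file |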
| acdtrace | Trace processing of an application ACD file (for testing) |
| acdvalid | Validate an application ACD file |
| aligncopy | Read and write alignments |
| aligncopypair | Read and write pairs from alignments |
| antigenic | Find antigenic sites in proteins |
| assemblyget | Get assembly of sequence reads |
| backtranambig | Back-translate a protein sequence to ambiguous nucleotide sequence |
| backtranseq | Back-translate a protein sequence to a nucleotide sequence |
| banana | Plot bending and curvature data for B-DNA |
| biosed | Replace or delete sequence sections |
| btwisted | Calculate the twisting in a B-DNA sequence |
| cachedas | Generate server cache file for DAS servers or for the DAS registry |
| cachedbfetch | Generate server cache file for Dbfetch/WSDbfetch data sources |
| cacheebeyesearch | Generate server cache file for EB-eye search domains |
| cacheensembl | Generate server cache file for an Ensembl server |
| cai | Calculate codon adaptation index |
| chaos | Draw a chaos game representation plot for a nucleotide sequence |
| charge | Draw a protein charge plot |
| checktrans | Report STOP codons and ORF statistics of a protein |
| chips | Calculate Nc codon usage statistic |
| cirdna | Draw circular map of DNA constructs |
| codcmp | Codon usage table comparison |
| codcopy | Copy and reformat a codon usage table |
| coderet | Extract CDS, mRNA and translations from feature tables |
| compseq | Calculate the composition of unique words in sequences |
| cons | Create a consensus sequence from a multiple alignment |
| consambig | Create an ambiguous consensus sequence from a multiple alignment |
| cpgplot | Identify and plot CpG islands in nucleotide sequence(s) |
| cpgreport | Identify and report CpG-rich regions in nucleotide sequence(s) |
| cusp | Create a codon usage table from nucleotide sequence(s) |
| cutgextract | Extract codon usage tables from CUTG database |
| cutseq | Remove a section from a sequence |
| dan | Calculate nucleic acid melting temperature |
| dbiblast | Index a BLAST database |
| dbifasta | Index a fasta file database |
| dbiflat | Index a flat file database |
| dbigcg | Index a GCG formatted database |
| dbtell | Display information about a public database |
| dbxcompress | Compress an uncompressed dbx index |
| dbxedam | Index the EDAM ontology using b+tree indices |
| dbxfasta | Index a fasta file database using b+tree indices |
| dbxflat | Index a flat file database using b+tree indices |
| dbxgcg | Index a GCG formatted database using b+tree indices |
| dbxobo | Index an obo ontology using b+tree indices |
| dbxreport | Validate index and report internals for dbx databases |
| dbxresource | Index a data resource catalogue using b+tree indices |
| dbxstat | Dump statistics for dbx databases |
| dbxtax | Index NCBI taxonomy using b+tree indices |
| dbxuncompress | Uncompress a compressed dbx index |
| degapseq | Remove non-alphabetic (e.g. gap) characters from sequences |
| density | Draw a nucleic acid density plot |
| descseq | Alter the name or description of a sequence |
| diffseq | Compare and report features of two similar sequences |
| distmat | Create a distance matrix from a multiple sequence alignment |
| dotmatcher | Draw a threshold dotplot of two sequences |
| dotpath | Draw a non-overlapping wordmatch dotplot of two sequences |
| dottup | Display a wordmatch dotplot of two sequences |
| dreg | Regular expression search of nucleotide sequence(s) |
| drfinddata | Find public databases by data type |
| drfindformat | Find public databases by format |
| drfindid | Find public databases by identifier |
| drfindresource | Find public databases by resource |
| drget | Get data resource entries |
| drtext | Get data resource entries complete text |
| edamdef | Find EDAM ontology terms by definition |
| edamhasinput | Find EDAM ontology terms by has_input relation |
| edamhasoutput | Find EDAM ontology terms by has_output relation |
| edamisformat | Find EDAM ontology terms by is_format_of relation |
| edamisid | Find EDAM ontology terms by is_identifier_of relation |
| edamname | Find EDAM ontology terms by name |
| edialign | Local multiple alignment of sequences |
| einverted | Find inverted repeats in nucleotide sequences |
| embossdata | Find and retrieve EMBOSS data files |
| embossupdate | Checks for more recent updates to EMBOSS |
| embossversion | Report the current EMBOSS version number |
| emma | Multiple sequence alignment (ClustalW wrapper) |
| emowse | Search protein sequences by digest fragment molecular weight |
| entret | Retrieve sequence entries from flatfile databases and files |
| epestfind | Find PEST motifs as potential proteolytic cleavage sites |
| eprimer3 | Pick PCR primers and hybridization oligos |
| eprimer32 | Pick PCR primers and hybridization oligos |
| equicktandem | Find tandem repeats in nucleotide sequences |
| est2genome | Align EST sequences to genomic DNA sequence |
| etandem | Find tandem repeats in a nucleotide sequence |
| extractalign | Extract regions from a sequence alignment |
| extractfeat | Extract features from sequence(s) |
| extractseq | Extract regions from a sequence |
| featcopy | Read and write a feature table |
| featmerge | Merge two overlapping feature tables |
| featreport | Read and write a feature table |
| feattext | Return a feature table original text |
| findkm | Calculate and plot enzyme reaction data |
| freak | Generate residue/base frequency table or plot |
| fuzznuc | Search for patterns in nucleotide sequences |
| fuzzpro | Search for patterns in protein sequences |
| fuzztran | Search for patterns in protein sequences (translated) |
| garnier | Predict protein secondary structure using GOR method |
| geecee | Calculate fractional GC content of nucleic acid sequences |
| getorf | Find and extract open reading frames (ORFs) |
| godef | Find GO ontology terms by definition |
| goname | Find GO ontology terms by name |
| helixturnhelix | Identify nucleic acid-binding motifs in protein sequences |
| hmoment | Calculate and plot hydrophobic moment for protein sequence(s) |
| iep | Calculate the isoelectric point of proteins |
| infoalign | Display basic information about a multiple sequence alignment |
| infoassembly | Display information about assemblies |
| infobase | Return information on a given nucleotide base |
| inforesidue | Return information on a given amino acid residue |
| infoseq | Display basic information about sequences |
| isochore | Plot isochores in DNA sequences |
| jaspextract | Extract data from JASPAR |
| jaspscan | Scan DNA sequences for transcription factors |
| lindna | Draw linear maps of DNA constructs |
| listor | Write a list file of the logical OR of two sets of sequences |
| makenucseq | Create random nucleotide sequences |
| makeprotseq | Create random protein sequences |
| marscan | Find matrix/scaffold recognition (MRS) signatures in DNA sequences |
| maskambignuc | Mask all ambiguity characters in nucleotide sequences with N |
| maskambigprot | Mask all ambiguity characters in protein sequences with X |
| maskfeat | Write a sequence with masked features |
| maskseq | Write a sequence with masked regions |
| matcher | Waterman-Eggert local alignment of two sequences |
| megamerger | Merge two large overlapping DNA sequences |
| merger | Merge two overlapping sequences |
| msbar | Mutate a sequence |
| mwcontam | Find weights common to multiple molecular weights files |
| mwfilter | Filter noisy data from molecular weights file |
| needle | Needleman-Wunsch global alignment of two sequences |
| needleall | Many-to-many pairwise alignments of two sequence sets |
| newcpgreport | Identify CpG islands in nucleotide sequence(s) |
| newcpgseek | Identify and report CpG-rich regions in nucleotide sequence(s) |
| newseq | Create a sequence file from a typed-in sequence |
| nohtml | Remove mark-up (e.g. HTML tags) from an ASCII text file |
| noreturn | Remove carriage return from ASCII files |
| nospace | Remove whitespace from an ASCII text file |
| notab | Replace tabs with spaces in an ASCII text file |
| notseq | Write to file a subset of an input stream of sequences |
| nthseq | Write to file a single sequence from an input stream of sequences |
| nthseqset | Read and write (return) one set of sequences from many |
| octanol | Draw a White-Wimley protein hydropathy plot |
| oddcomp | Identify proteins with specified sequence word composition |
| ontocount | Count ontology term(s) |
| ontoget | Get ontology term(s) |
| ontogetcommon | Get common ancestor for terms |
| ontogetdown | Get ontology term(s) by parent id |
| ontogetobsolete | Get obsolete ontology terms |
| ontogetroot | Get ontology root terms by child identifier |
| ontogetsibs | Get ontology term(s) by id with common parent |
| ontogetup | Get ontology term(s) by id of child |
| ontoisobsolete | Report whether an ontology term id is obsolete |
| ontotext | Get ontology term(s) original full text |
| palindrome | Find inverted repeats in nucleotide sequence(s) |
| pasteseq | Insert one sequence into another |
| patmatdb | Search protein sequences with a sequence motif |
| patmatmotifs | Scan a protein sequence with motifs from the PROSITE database |
| pepcoil | Predict coiled coil regions in protein sequences |
| pepdigest | Report on protein proteolytic enzyme or reagent cleavage sites |
| pepinfo | Plot amino acid properties of a protein sequence in parallel |
| pepnet | Draw a helical net for a protein sequence |
| pepstats | Calculate statistics of protein properties |
| pepwheel | Draw a helical wheel diagram for a protein sequence |
| pepwindow | Draw a hydropathy plot for a protein sequence |
| pepwindowall | Draw Kyte-Doolittle hydropathy plot for a protein alignment |
| plotcon | Plot conservation of a sequence alignment |
| plotorf | Plot potential open reading frames in a nucleotide sequence |
| polydot | Draw dotplots for all-against-all comparison of a sequence set |
| preg | Regular expression search of protein sequence(s) |
| prettyplot | Draw a sequence alignment with pretty formatting |
| prettyseq | Write a nucleotide sequence and its translation to file |
| primersearch | Search DNA sequences for matches with primer pairs |
| printsextract | Extract data from PRINTS database for use by pscan |
| profit | Scan one or more sequences with a simple frequency matrix |
| prophecy | Create frequency matrix or profile from a multiple alignment |
| prophet | Scan one or more sequences with a Gribskov or Henikoff profile |
| prosextract | Process the PROSITE motif database for use by patmatmotifs |
| pscan | Scan protein sequence(s) with fingerprints from the PRINTS database |
| psiphi | Calculates phi and psi torsion angles from protein coordinates |
| rebaseextract | Process the REBASE database for use by restriction enzyme applications |
| recoder | Find restriction sites to remove (mutate) with no translation change |
| redata | Retrieve information from REBASE restriction enzyme database |
| refseqget | Get reference sequence |
| remap | Display restriction enzyme binding sites in a nucleotide sequence |
| restover | Find restriction enzymes producing a specific overhang |
| restrict | Report restriction enzyme cleavage sites in a nucleotide sequence |
| revseq | Reverse and complement a nucleotide sequence |
| seealso | Find programs with similar function to a specified program |
| seqcount | Read and count sequences |
| seqmatchall | All-against-all word comparison of a sequence set |
| seqret | Read and write (return) sequences |
| seqretsetall | Read and write (return) many sets of sequences |
| seqretsplit | Read sequences and write them to individual files |
| seqxref | Retrieve all database cross-references for a sequence entry |
| seqxrefget | Retrieve all cross-referenced data for a sequence entry |
| servertell | Display information about a public server |
| showalign | Display a multiple sequence alignment in pretty format |
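| showdb | Display the databases available to the EMBOSS tools; some tools can query remote databases directly |
| showfeat | Display sequence features and properties such as length |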
| showorf | Display a nucleotide sequence and translation in pretty format |
| showpep | Display protein sequences with features in pretty format |
| showseq | Display sequences with features in pretty format |
| showserver | Display information on configured servers |
| shuffleseq | Shuffle a set of sequences maintaining composition |
| sigcleave | Report on signal cleavage sites in a protein sequence |
| silent | Find restriction sites to insert (mutate) with no translation change |
| sirna | Find siRNA duplexes in mRNA |
| sixpack | Display a DNA sequence with 6-frame translation and ORFs |
| sizeseq | Sort sequences by size |
| skipredundant | Remove redundant sequences from an input set |
| skipseq | Read and write (return) sequences, skipping first few |
| splitsource | Split sequence(s) into original source sequences |
| splitter | Split sequence(s) into smaller sequences |
| stretcher | Needleman-Wunsch rapid global alignment of two sequences |
| stssearch | Search a DNA database for matches with a set of STS primers |
| supermatcher | Calculate approximate local pair-wise alignments of larger sequences |
| syco | Draw synonymous codon usage statistic plot for a nucleotide sequence |
| taxget | Get taxon(s) |
| taxgetdown | Get descendants of taxon(s) |
| taxgetrank | Get parents of taxon(s) |
| taxgetspecies | Get all species under taxon(s) |
| taxgetup | Get parents of taxon(s) |
| tcode | Identify protein-coding regions using Fickett TESTCODE statistic |
| textget | Get text data entries |
| textsearch | Search the textual description of sequence(s) |
| tfextract | Process TRANSFAC transcription factor database for use by tfscan |
| tfm | Display full documentation for an application |
| tfscan | Identify transcription factor binding sites in DNA sequences |
| tmap | Predict and plot transmembrane segments in protein sequences |
| tranalign | Generate an alignment of nucleic coding regions from aligned proteins |
| transeq | Translate nucleic acid sequences |
| trimest | Remove poly-A tails from nucleotide sequences |
| trimseq | Remove unwanted characters from start and end of sequence(s) |
| trimspace | Remove extra whitespace from an ASCII text file |
| twofeat | Find neighbouring pairs of features in sequence(s) |
| union | Concatenate multiple sequences into a single sequence |
| urlget | Get URLs of data resources |
| variationget | Get sequence variations |
| vectorstrip | Remove vectors from the ends of nucleotide sequence(s) |
| water | Smith-Waterman local alignment of sequences |
| whichdb | Search all sequence databases for an entry and retrieve it |
| wobble | Plot third base position variability in a nucleotide sequence |
| wordcount | Count and extract unique words in molecular sequence(s) |
| wordfinder | Match large sequences against one or more other sequences |
| wordmatch | Find regions of identity (exact matches) of two sequences |
| wossdata | Find programs by EDAM data |
| wossinput | Find programs by EDAM input data |
| wossname | Find programs by keywords in their short description |
| wossoperation | Find programs by EDAM operation |
| wossoutput | Find programs by EDAM output data |
| wossparam | Find programs by EDAM parameter |
| wosstopic | Find programs by EDAM topic |
| yank | Add a sequence reference (a full USA) to a list file |