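[Comparative Genomics] Notes on the Official OrthoFinder Documentation

Analysis of homologous genes (homologs) is a staple of genome evolution and gene family studies. Homologs are in turn divided into orthologs and paralogs, as shown in the figure: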

[Figure: homologs]

Photo courtesy of: Popo H. Liao (Own work), via Wikimedia Commons
For how to tell them apart, see: Homology Terminology: Never Say the Wrong Word Again

Terminology

Orthologues, Orthogroups & Paralogues

Orthologs are pairs of genes that descended from a single gene in the last common ancestor (LCA) of two species (Figure 2A & B). An orthogroup is the extension of the concept of orthology to groups of species. An orthogroup is the group of genes descended from a single gene in the LCA of a group of species (Figure 2A).

The example in Figure 2 contains an orthogroup from three species: human, mouse and chicken. Human and mouse each have one gene in this orthogroup (HuA and MoA, respectively) while chicken has two genes (ChA1 and ChA2). The human and mouse genes are a pair of genes descended from a single gene in the last common ancestor of the two species, therefore these two genes are orthologs and there is a one-to-one orthology relationship between the two genes.

The two chicken genes arose from a gene duplication event after the lineage leading to chicken split from the lineage leading to human and mouse. As gene duplication events give rise to paralogs, ChA1 and ChA2 are paralogs of each other. However, both chicken genes are descended from a single gene in the last common ancestor of the three species. Therefore, both chicken genes are orthologs of the human gene and the mouse gene. Although they are orthologs, sometimes these complex relationships are referred to as co-orthologs (e.g. ChA1 and ChA2 are co-orthologs of HuA). In this case there is a many-to-one orthology relationship between the chicken genes and the human gene. There are only three kinds of orthology relationships: one-to-one, many-to-one, and many-to-many. All of these relationships are identified by OrthoFinder.

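In plain language: 👴🏻, 🐭 and 🐔 share a last common ancestor (LCA), from which human, mouse and chicken diverged. One gene of the LCA, gene A, was duplicated once in 🐔 to give ChA1 and ChA2, while in the mammals it stayed as a single copy: HuA in 👴🏻 and MoA in 🐭. Let's sort out how these four genes relate to one another: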

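  • Orthologs: genes in different species that trace back to the same single gene in the ancestor (the LCA). In Figure 2B these are HuA and MoA; HuA with each of ChA1 and ChA2; and MoA with each of ChA1 and ChA2.
  • Orthogroup: the four genes together form one orthogroup (somewhat like a gene family).
  • Paralogs: the gene was duplicated in 🐔, so the two chicken genes, ChA1 and ChA2, are paralogs of each other.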
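To make the three bullets concrete, here is a minimal Python sketch that labels every pair among the four genes. It relies on a shortcut that only holds for this toy example (the single duplication is chicken-specific, so same species means paralogs and different species means orthologs); real data need gene trees, which is what OrthoFinder does. The names `ORTHOGROUP`, `classify_pairs` and `relationship` are made up for illustration.

```python
from itertools import combinations

# Toy orthogroup from the Figure 2 example: gene name -> species.
ORTHOGROUP = {"HuA": "human", "MoA": "mouse", "ChA1": "chicken", "ChA2": "chicken"}


def classify_pairs(genes):
    """Label every gene pair as orthologs or paralogs.

    Shortcut valid only for this toy case, where the one duplication
    happened after chicken split from the mammals.
    """
    labels = {}
    for a, b in combinations(sorted(genes), 2):
        labels[(a, b)] = "paralogs" if genes[a] == genes[b] else "orthologs"
    return labels


def relationship(genes, species_1, species_2):
    """Describe the orthology relationship between two species.

    Assumes both species have at least one gene in the orthogroup.
    """
    def word(species):
        count = sum(1 for s in genes.values() if s == species)
        return "one" if count == 1 else "many"

    return f"{word(species_1)}-to-{word(species_2)}"


for pair, label in classify_pairs(ORTHOGROUP).items():
    print(pair, label)
print("chicken vs human:", relationship(ORTHOGROUP, "chicken", "human"))
print("human vs mouse:", relationship(ORTHOGROUP, "human", "mouse"))
```

Of the six possible pairs, only ChA1 and ChA2 come out as paralogs; the other five are orthologs, matching the bullets above.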

About the software

Software repository: davidemms/OrthoFinder
Official analysis workflow: OrthoFinder Tutorials

What OrthoFinder can do

OrthoFinder is a fast, accurate and comprehensive platform for comparative genomics. It:

  • finds orthogroups [1];
  • finds orthologs;
  • infers rooted gene trees for the orthogroups;
  • identifies gene duplication events [2];
  • infers the species tree;
  • maps the gene duplication events onto the species tree.

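Installation

It comes down to two points:

  • Linux and Mac: use conda. Note that the Python environment for OrthoFinder here is Python 2. I first installed it into a Python 3 env on my Mac and got all sorts of odd errors (diamond not found on PATH, Python errors and so on); once I installed it under Python 2 they stopped immediately...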
## Best to create a fresh conda environment for OrthoFinder.
conda create -n orthofinder orthofinder=2.2.7
  • Windows: Docker is the better option. I have not tried it, so no comment; the official instructions should work.
docker pull davidemms/orthofinder
docker run -it --rm davidemms/orthofinder orthofinder -h
docker run --ulimit nofile=1000000:1000000 -it --rm -v /full/path/to/fastas:/input:Z davidemms/orthofinder orthofinder -f /input
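For the "diamond not on PATH" kind of error mentioned above, a quick check from inside the activated environment can save some head-scratching. The sketch below uses only the standard library; the tool list is an assumption (orthofinder and diamond appear in this post, mcl is the clustering program used later), so adjust it to whatever your install actually needs.

```python
import shutil

# Assumed tool names, not an official dependency list.
REQUIRED_TOOLS = ["orthofinder", "diamond", "mcl"]


def find_missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not on PATH:", ", ".join(missing))
else:
    print("All required tools found on PATH.")
```

Run it with the same Python that lives in the conda environment, otherwise it checks the wrong PATH.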

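Usage

The official command is simple as well: put the protein sequences of the species you want to analyse into an otherwise empty directory, then run: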

OrthoFinder/orthofinder -f OrthoFinder/ExampleDataset &

This is only the most basic usage; there are many more parameters that can be tuned.
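Before launching a long run it can help to check what is actually in the input directory. The sketch below is my own helper, not part of OrthoFinder: it counts sequences per FASTA file and builds the same `-f` command shown above. It assumes one protein FASTA file per species and the file extensions listed in `FASTA_EXTENSIONS`; the demo files and the names `summarize_input_dir` and `build_command` are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed set of accepted FASTA extensions.
FASTA_EXTENSIONS = {".fa", ".faa", ".fasta", ".fas", ".pep"}


def count_sequences(fasta_path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_input_dir(input_dir):
    """Map each FASTA file name in the directory to its number of sequences."""
    counts = {}
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_EXTENSIONS:
            counts[path.name] = count_sequences(path)
    return counts


def build_command(input_dir):
    """Build the basic command from this post as an argument list."""
    return ["orthofinder", "-f", str(input_dir)]


# Demo on a throwaway directory with two made-up species files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "human.fa").write_text(">HuA\nMKT\n")
    Path(tmp, "chicken.fa").write_text(">ChA1\nMKT\n>ChA2\nMKS\n")
    Path(tmp, "notes.txt").write_text("not a FASTA file\n")
    demo_counts = summarize_input_dir(tmp)
    demo_command = build_command(tmp)

print(demo_counts)
print(" ".join(demo_command[:2]), "<input dir>")
```

The argument list can be handed to `subprocess.run` once the counts look sensible.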

How the analysis works

I tried reading this part myself and could not follow it, so here I quote the blog post by 徐州更 (hoptop):

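  • All-vs-all BLAST search. BLASTP is run with an e-value cutoff of 1e-3 (10^-3) to find candidate homologs. (Besides BLAST, DIAMOND or MMseqs2 can also be used.)
  • Normalise the BLAST bit scores for gene length and phylogenetic distance.
  • Use RBNHs (reciprocal best normalised hits) to set the sequence-similarity threshold for orthogroup membership.
  • Build the orthogroup graph, which serves as the input to MCL.
  • Cluster the genes with MCL to delineate the orthogroups.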
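The middle steps are easier to follow on a toy example. The sketch below is my simplified reading of them, not OrthoFinder's code: it takes already-normalised scores, finds reciprocal best hits between species, uses each gene's lowest reciprocal-best score as its own cutoff for adding edges, and then groups genes by connected components as a plain stand-in for MCL. All scores are invented, and the genes HuB and MoB are hypothetical additions so that there is a second group.

```python
from collections import defaultdict

SPECIES = {"HuA": "human", "HuB": "human", "MoA": "mouse", "MoB": "mouse",
           "ChA1": "chicken", "ChA2": "chicken"}

# Invented, already-normalised similarity scores (higher = more similar).
SCORES = {
    ("HuA", "MoA"): 0.95, ("HuA", "ChA1"): 0.80, ("HuA", "ChA2"): 0.78,
    ("MoA", "ChA1"): 0.79, ("MoA", "ChA2"): 0.77, ("ChA1", "ChA2"): 0.85,
    ("HuB", "MoB"): 0.90, ("HuA", "MoB"): 0.20, ("HuB", "MoA"): 0.15,
}


def symmetric(scores):
    """Make the score table usable in both directions."""
    full = {}
    for (a, b), s in scores.items():
        full[(a, b)] = s
        full[(b, a)] = s
    return full


def reciprocal_best_hits(scores, species):
    """Return gene pairs from different species that are each other's best hit."""
    full = symmetric(scores)
    best = {}  # (gene, other species) -> best-scoring gene in that species
    for (a, b), s in full.items():
        if species[a] == species[b]:
            continue
        key = (a, species[b])
        if key not in best or s > full[(a, best[key])]:
            best[key] = b
    return {frozenset((a, b)) for (a, _), b in best.items()
            if best.get((b, species[a])) == a}


def orthogroup_graph(scores, species):
    """Connect each gene to every hit scoring at least its lowest reciprocal best hit."""
    full = symmetric(scores)
    cutoff = {}
    for pair in reciprocal_best_hits(scores, species):
        a, b = tuple(pair)
        s = full[(a, b)]
        cutoff[a] = min(cutoff.get(a, s), s)
        cutoff[b] = min(cutoff.get(b, s), s)
    graph = defaultdict(set)
    for (a, b), s in full.items():
        if a in cutoff and s >= cutoff[a]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group genes by connectivity (a simple stand-in for MCL clustering)."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


groups = connected_components(orthogroup_graph(SCORES, SPECIES))
print(groups)
```

Note how ChA2 has no reciprocal best hit of its own (HuA and MoA both prefer ChA1), yet it still joins the group through its high within-species score to ChA1. MCL itself does more than connected components: it can split weakly linked groups, which this stand-in cannot.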

OrthoFinder2 builds on OrthoFinder by adding inference of the species phylogeny. The workflow is:

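  • Infer a gene tree for each orthogroup.
  • Use the STAG algorithm to infer an unrooted species tree from the unrooted gene trees.
  • Use the STRIDE algorithm to root the species tree.
  • Use the rooted species tree in turn to help root the gene trees.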
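As I understand it, the core idea of the STAG step is to turn each gene tree into a species-level distance matrix by keeping, for every pair of species, the distance between their closest genes. The sketch below shows only that reduction on invented distances for the four genes from earlier; it does not build or root a tree, and it is not OrthoFinder's implementation. The names `GENE_DISTANCES` and `species_distance_matrix` are hypothetical.

```python
GENE_SPECIES = {"HuA": "human", "MoA": "mouse", "ChA1": "chicken", "ChA2": "chicken"}

# Invented pairwise distances between genes, as if read off one gene tree.
GENE_DISTANCES = {
    frozenset(("HuA", "MoA")): 0.10,
    frozenset(("HuA", "ChA1")): 0.40,
    frozenset(("HuA", "ChA2")): 0.45,
    frozenset(("MoA", "ChA1")): 0.42,
    frozenset(("MoA", "ChA2")): 0.47,
    frozenset(("ChA1", "ChA2")): 0.15,
}


def species_distance_matrix(gene_species, gene_distances):
    """For each species pair, keep the smallest distance between their genes."""
    closest = {}
    for pair, dist in gene_distances.items():
        a, b = tuple(pair)
        species_pair = frozenset((gene_species[a], gene_species[b]))
        if len(species_pair) < 2:
            continue  # skip within-species (paralog) distances
        closest[species_pair] = min(closest.get(species_pair, dist), dist)
    return closest


matrix = species_distance_matrix(GENE_SPECIES, GENE_DISTANCES)
for species_pair, dist in sorted(matrix.items(), key=lambda item: item[1]):
    print(" vs ".join(sorted(species_pair)), dist)
```

With these numbers human and mouse are closest, and chicken sits further from both, which is the shape a species tree built from this matrix would reflect.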

-- To be continued...


  1. What are orthogroups, orthologs & paralogs?

  2. Gene duplication