「WGS」VEP进行SNP注释

背景

软件安装

建议用conda新建环境安装,直接在其他环境下安装可能依赖会出问题。

# 新建一个叫vep的环境并且安装ensembl-vep
conda create -n vep -c bioconda ensembl-vep -y
# 激活环境
conda activate vep
# 测试
(vep) shawn@bio-Super-Server:~/02.project/01.maize_fatty_acid/04.result$ vep
# 如果成功会出现下面的提示
Possible precedence issue with control flow operator at /home/data/shawn/miniconda3/envs/vep/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
#----------------------------------#
# ENSEMBL VARIANT EFFECT PREDICTOR #
#----------------------------------#

Versions:
  ensembl              : 101.856c8e8
  ensembl-funcgen      : 101.b918a49
  ensembl-io           : 101.943b6c2
  ensembl-variation    : 101.819eef2
  ensembl-vep          : 101.0

Help: dev@ensembl.org , helpdesk@ensembl.org
Twitter: @ensembl

http://www.ensembl.org/info/docs/tools/vep/script/index.html

Usage:
./vep [--cache|--offline|--database] [arguments]

Basic options
=============

--help                 Display this message and quit

-i | --input_file      Input file
-o | --output_file     Output file
--force_overwrite      Force overwriting of output file
--species [species]    Species to use [default: "human"]

--everything           Shortcut switch to turn on commonly used options. See web
                       documentation for details [default: off]
--fork [num_forks]     Use forking to improve script runtime

For full option documentation see:
http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html

问题与解决方案

根据本次流程的搭建解决下面问题:

问题1 如果物种没有被官方收录怎么做

参考链接:vep custom annotation

VEP can integrate custom annotation from standard format files into your results by using the --custom flag.

These files may be hosted locally or remotely, with no limit to the number or size of the files. The files must be indexed using the tabix utility (BED, GFF, GTF, VCF); bigWig files contain their own indices.

Annotations typically appear as key=value pairs in the Extra column of the VEP output; they will also appear in the INFO column if using VCF format output. The value for a particular annotation is defined as the identifier for each feature; if not available, an identifier derived from the coordinates of the annotation is used. Annotations will appear in each line of output for the variant where multiple lines exist.

通过增加--custom标记来使用我们自己提供的注释文件,包括了GFF,GTF,VCF,BED,BigWig等格式的文件。

本次使用.gff文件作为注释文件,首先需要对.gff文件清洗一下,包括去掉注释行,对染色体排序,将.gff文件压缩成bgzip格式,最后构建tabix索引

Custom annotation files must be prepared in a particular way in order to work with tabix and therefore with VEP. Files must be stripped of comment lines, sorted in chromosome and position order, compressed using bgzip and finally indexed using tabix.

# myData.gff为原始的gff文件 数据清洗过程
grep -v "#" myData.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > myData.gff.gz
# 构建索引
tabix -p gff myData.gff.gz

在使用的时候需要标清楚

./vep [...] --custom Filename , Short_name , File_type , Annotation_type , Force_report_coordinates , VCF_fields

这样你就可以使用多个注释文件同时对你的重测序结果进行注释了。

最终的使用方法, 官方的例子:

# Compressed VCF file
curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
# Index file
curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi

/vep [...] --custom clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN

## Where the selected ClinVar INFO fields (from the ClinVar VCF file) are:
# - CLNSIG:     Clinical significance for this single variant
# - CLNREVSTAT: ClinVar review status for the Variation ID
# - CLNDN:      ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB
# Of course you can select the INFO fields you want in the ClinVar VCF file

# Quick example on GRCh38:
./vep --id "1  230710048 230710048 A/G 1" --species homo_sapiens -o /path/to/output/output.txt --cache --offline --assembly GRCh38 --custom /path/to/custom_files/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN

在使用转录本注释的GFF时,如果搞不明白缓存cache也可以不加,直接使用--gff调用gff文件也可以,也不需要使用--custom来对gff文件定义,例如:

VEP can use transcript annotations defined in GFF or GTF files. The files must be bgzipped and indexed with tabix and a FASTA file containing the genomic sequence is required in order to generate transcript models.

Your GFF or GTF file must be sorted in chromosomal order. VEP does not use header lines so it is safe to remove them.

grep -v "#" data.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > data.gff.gz
tabix -p gff data.gff.gz
# 直接进行注释,注意这里出来的结果为tab分割的.txt文件
./vep -i input.vcf --gff data.gff.gz --fasta genome.fa.gz

问题2 如何提高效率

参考链接:Getting VEP to run faster

官方共提出了12点可以提高运行效率的方案,如果有精力可以都尝试一下,这里我采用了两个建议:

第一点: 设置--fork flag

Using forking enables VEP to run multiple parallel "threads", with each thread processing a subset of your input. Most modern computers have more than one processor core, so running VEP with forking enabled can give huge speed increases (3-4x faster in most cases). Even computers with a single core will see speed benefits due to overheads associated with using object-oriented code in Perl.

To use forking, you must choose a number of forks to use with the --fork flag. We recommend using 4 forks:

./vep -i my_input.vcf --fork 4 --offline

but depending on various factors specific to your setup you may see faster performance with fewer or more forks.

被这个folk 整的有些懵逼,显然他不是线程数,而作者最后说由于设置不同,增加或者减少forks的数量都有可能提速....

**第二点:**按染色体拆分.vcf.gz,并行

For very large files (for example those from whole-genome sequencing), VEP process can be easily parallelised by dividing your file into chunks (e.g. by chromosome). VEP will also work with tabix-indexed, bgzipped VCF files, and so the tabix utility could be used to divide the input file:

问题3 输出文件格式

参考链接:Variant Effect Predictorimg Data formats

如果不指定的话默认tab输出,但是为了方便后续,我们希望直接把注释信息写入vcf文件的info部分。

VEP can return the results in different formats:

Along with the results VEP computes and returns some statistics.

这里统计了VEP的输出格式,我们重点看下vcf输出

VCF output

The VEP script can also generate VCF output using the --vcf flag.

Main information about the specificity of the VEP VCF output format:

  • Consequences are added in the INFO field of the VCF file, using the key "CSQ" (you can change it using --vcf_info_field).
  • Data fields are encoded separated by the character "|" (pipe). The order of fields is written in the VCF header. Unpopulated fields are represented by an empty string.
  • Output fields in the "CSQ" INFO field can be configured by using --fields.
  • Each prediction, for a given variant, is separated by the character "," in the CSQ INFO field (e.g. when a variant overlaps more than 1 transcript)

​ Here is a list of the (default) fields you can find within the CSQ field:

Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID

Some fields are not reported by default and need configuring including HGVS, Symbol and Existing_variation.

Example of VEP command using the --vcf and --fields options:

./vep -i examples/homo_sapiens_GRCh38.vcf --cache --force_overwrite --vcf --fields "Allele,Consequence,Feature_type,Feature"

VCFs produced by VEP can be filtered by filter_vep.pl in the same way as standard format output files.

简单来说就是在最后加上--vcf的flag,可以用--fields 来控制INFO部分CSQ的格式,如果不想麻烦,就直接加一个--vcf的flag就可以了。最后通过-o参数自定义输出vcf的文件名通过--sf自定义输出html统计结果的文件名。

实战代码

看下工作目录的结构:

(vep) shawn@bio-Super-Server:~/working_dir/04.result$ tree
.
├── B73_RefGen_v2.fa -> /home/data/public/01.database/01.Database/01.genome_db/01.maize/03.B73_v2/B73_RefGen_v2.fa # B73_v2的参考基因组 
├── ZmB73_5a.59_WGS.gff3 -> /home/data/public/01.database/01.Database/01.genome_db/01.maize/03.B73_v2/ZmB73_5a.59_WGS.gff3 # MaizeGDB下载B73_v2的gff3注释文件
├── snp.all.filter.vcf.gz -> ../03.process/01.bamfile/vcf/snp.all.filter.vcf.gz # gatk call snp的结果
├── snp.all.filter.vcf.gz.tbi -> ../03.process/01.bamfile/vcf/snp.all.filter.vcf.gz.tbi # tabix 索引

第一步,分析环境建立与软件安装:

## 建立重测序分析环境
conda create -n wgs -c bioconda gatk samtools bedtools bcftools bwa sambamba fastqc -y
## 安装vep 
conda create -n vep -c bioconda ensembl-vep=101.0 -y
# 用tmux新建一个窗口,方便放后台跑
tmux new -t vep

第二步,格式化GFF文件

# 将GFF文件软连接到工作目录下
ln -s /home/data/public/01.database/01.Database/01.genome_db/01.maize/03.B73_v2/ZmB73_5a.59_WGS.gff3 ./
# 格式化
grep -v "#" ZmB73_5a.59_WGS.gff3 | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > data.gff.gz
# tabix索引
tabix -p gff data.gff.gz

第三步,拆分vcf文件,参考链接 Easy way to split VCF file by chromosome--biostar

mkdir split
# 脚本来自于 https://www.biostars.org/p/9506417/
# 首先写一个运行脚本
vim spilt.sh
# 按i进入编辑模式,写入一下代码
vcf_in=$1

vcf_out_stem=$2

for i in {1..10} #10条染色体
do
        bcftools view ${vcf_in} --regions Chr${i} -o ${vcf_out_stem}_${i}.vcf.gz -Oz$
done
# 按esc退出编辑模式,按:进入命令模式,输入wq回车保存
# 授予脚本执行权限
chmod +x split.sh
# 运行脚本
./split.sh snp_all.vcf.gz split/snp_chr &

# 跑完后查看结果
tree split
split/
├── snp_chr_1.vcf.gz
├── snp_chr_10.vcf.gz
├── snp_chr_2.vcf.gz
├── snp_chr_3.vcf.gz
├── snp_chr_4.vcf.gz
├── snp_chr_5.vcf.gz
├── snp_chr_6.vcf.gz
├── snp_chr_7.vcf.gz
├── snp_chr_8.vcf.gz
└── snp_chr_9.vcf.gz

第四步,进行注释

## 写一个vep注释的单行脚本
vim run_vep_anno.sh
vep -i split/snp_chr_${1}.vcf.gz --gff data.gff.gz --fasta B73_RefGen_v2.fa --fork 4 --vcf -o split/snp_anno_chr_${1}.vcf --sf split/snp_anno_chr_${1}.html
## wq保存退出
## parallel 并行注释
parallel bash run_vep_anno.sh ::: 1 2 3 4 5 6 7 8 9 10 &

第五步,查看结果

split/
├── snp_anno_chr_1.html # 统计结果报告
├── snp_anno_chr_1.vcf # 带注释的vcf文件
├── snp_anno_chr_1.vcf_warnings.txt # 运行过程的警告信息
├── snp_anno_chr_10.html
├── snp_anno_chr_10.vcf
├── snp_anno_chr_10.vcf_warnings.txt
├── snp_anno_chr_2.html
├── snp_anno_chr_2.vcf
├── snp_anno_chr_2.vcf_warnings.txt
├── snp_anno_chr_3.html
├── snp_anno_chr_3.vcf
├── snp_anno_chr_3.vcf_warnings.txt
├── snp_anno_chr_4.html
├── snp_anno_chr_4.vcf
├── snp_anno_chr_4.vcf_warnings.txt
├── snp_anno_chr_5.html
├── snp_anno_chr_5.vcf
├── snp_anno_chr_5.vcf_warnings.txt
├── snp_anno_chr_6.html
├── snp_anno_chr_6.vcf
├── snp_anno_chr_6.vcf_warnings.txt
├── snp_anno_chr_7.html
├── snp_anno_chr_7.vcf
├── snp_anno_chr_7.vcf_warnings.txt
├── snp_anno_chr_8.html
├── snp_anno_chr_8.vcf
├── snp_anno_chr_8.vcf_warnings.txt
├── snp_anno_chr_9.html
├── snp_anno_chr_9.vcf
├── snp_anno_chr_9.vcf_warnings.txt
├── snp_chr_1.vcf.gz
├── snp_chr_10.vcf.gz
├── snp_chr_2.vcf.gz
├── snp_chr_3.vcf.gz
├── snp_chr_4.vcf.gz
├── snp_chr_5.vcf.gz
├── snp_chr_6.vcf.gz
├── snp_chr_7.vcf.gz
├── snp_chr_8.vcf.gz
└── snp_chr_9.vcf.gz


查看下结果发现INFO以及加入了注释信息。

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  L01_114 L01_115 L01_117 L01_25  L01_26  L01_28  L01_29  L01_30  L01_32  L01_33  L01_34  L01_35  L01_36  L01_37  L01_38  L01_39  L01_49  L01_50
Chr1    15      .       A       G       289.14  PASS    AC=5;AF=0.139;AN=36;BaseQRankSum=1.53;DP=174;ExcessHet=0;FS=2.831;InbreedingCoeff=0.5213;MLEAC=4;MLEAF=0.111;MQ=51.54;MQRankSum=0.536;QD=12.57;ReadPosRankSum=-0.438;SOR=1.179;CSQ=G|downstream_gene_variant|MODIFIER|GRMZM2G059865|GRMZM2G059865|Transcript|GRMZM2G059865_T01|protein_coding|||||||||||4839|-1||||data.gff.gz|,G|downstream_gene_variant|MODIFIER|GRMZM2G059865|GRMZM2G059865|Transcript|GRMZM2G059865_T02|protein_coding|||||||||||4842|-1||||data.gff.gz|,G|downstream_gene_variant|MODIFIER|GRMZM2G059865|GRMZM2G059865|Transcript|GRMZM2G059865_T03|protein_coding|||||||||||4842|-1||||data.gff.gz|,G|5_prime_UTR_variant|MODIFIER|GRMZM2G060082|GRMZM2G060082|Transcript|GRMZM2G060082_T01|protein_coding|1/5||||13|||||||1||||data.gff.gz|,G|upstream_gene_variant|MODIFIER|GRMZM2G060082|GRMZM2G060082|Transcript|GRMZM2G060082_T02|protein_coding|||||||||||2592|1||||data.gff.gz|    GT:AD:DP:GQ:PL  0/0:5,0:5:15:0,15,197   0/1:14,4:18:99:113,0,429        0/0:3,0:3:9:0,9,126     0/0:6,0:6:18:0,18,190   0/0:4,0:4:12:0,12,169   0/0:3,0:3:9:0,9,89      0/0:12,0:12:36:0,36,464 0/0:6,0:6:18:0,18,199   0/0:11,0:11:33:0,33,425 0/0:9,0:9:27:0,27,287   0/0:9,0:9:0:0,0,252     0/0:5,0:5:15:0,15,192   0/0:5,0:5:15:0,15,191   0/0:4,0:4:12:0,12,148   1/1:0,2:2:6:85,6,0      1/1:0,3:3:9:126,9,0     0/0:11,0:11:33:0,33,412 0/0:10,0:10:30:0,30,318

建议输出表格形式,方便信息查阅

## 只需要把脚本里的代码修改一下
vep -i split/snp_chr_${1}.vcf.gz  --gff data.gff.gz --fasta B73_RefGen_v2.fa --fork 4 --tab -o
split/snp_anno_chr_${1}.xls split/snp_anno_chr_${1}.html 

=== END ===