背景
- 参考基因组: 玉米B73_v2,数据来源-MaizeGDB
- 基因注释文件:ZmB73_5a.59_WGS.gff3.gz
- 数据源:本课题自测的20个玉米的40x重测序数据。
软件安装
建议用conda新建环境安装,直接在其他环境下安装可能依赖会出问题。
# 新建一个叫vep的环境并且安装ensembl-vep
conda create -n vep -c bioconda ensembl-vep -y
# 激活环境
conda activate vep
# 测试
(vep) shawn@bio-Super-Server:~/02.project/01.maize_fatty_acid/04.result$ vep
# 如果成功会出现下面的提示
Possible precedence issue with control flow operator at /home/data/shawn/miniconda3/envs/vep/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
#----------------------------------#
# ENSEMBL VARIANT EFFECT PREDICTOR #
#----------------------------------#
Versions:
ensembl : 101.856c8e8
ensembl-funcgen : 101.b918a49
ensembl-io : 101.943b6c2
ensembl-variation : 101.819eef2
ensembl-vep : 101.0
Help: dev@ensembl.org , helpdesk@ensembl.org
Twitter: @ensembl
http://www.ensembl.org/info/docs/tools/vep/script/index.html
Usage:
./vep [--cache|--offline|--database] [arguments]
Basic options
=============
--help Display this message and quit
-i | --input_file Input file
-o | --output_file Output file
--force_overwrite Force overwriting of output file
--species [species] Species to use [default: "human"]
--everything Shortcut switch to turn on commonly used options. See web
documentation for details [default: off]
--fork [num_forks] Use forking to improve script runtime
For full option documentation see:
http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html
问题与解决方案
根据本次流程的搭建解决下面问题:
问题1 如果物种没有被官方收录怎么做
VEP can integrate custom annotation from standard format files into your results by using the --custom flag.
These files may be hosted locally or remotely, with no limit to the number or size of the files. The files must be indexed using the tabix utility (BED, GFF, GTF, VCF); bigWig files contain their own indices.
Annotations typically appear as key=value pairs in the Extra column of the VEP output; they will also appear in the INFO column if using VCF format output. The value for a particular annotation is defined as the identifier for each feature; if not available, an identifier derived from the coordinates of the annotation is used. Annotations will appear in each line of output for the variant where multiple lines exist.
通过增加--custom
标记来使用我们自己提供的注释文件,包括了GFF,GTF,VCF,BED,BigWig
等格式的文件。
本次使用.gff
文件作为注释文件,首先需要对.gff
文件清洗一下,包括去掉注释行,对染色体排序,将.gff
文件压缩成bgzip
格式,最后构建tabix
索引
Custom annotation files must be prepared in a particular way in order to work with tabix and therefore with VEP. Files must be stripped of comment lines, sorted in chromosome and position order, compressed using bgzip and finally indexed using tabix.
# myData.gff为原始的gff文件 数据清洗过程
grep -v "#" myData.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > myData.gff.gz
# 构建索引
tabix -p gff myData.gff.gz
在使用的时候需要标清楚
./vep [...] --custom Filename , Short_name , File_type , Annotation_type , Force_report_coordinates , VCF_fields
这样你就可以使用多个注释文件同时对你的重测序结果进行注释了。
最终的使用方法, 官方的例子:
# Compressed VCF file
curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
# Index file
curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi
/vep [...] --custom clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN
## Where the selected ClinVar INFO fields (from the ClinVar VCF file) are:
# - CLNSIG: Clinical significance for this single variant
# - CLNREVSTAT: ClinVar review status for the Variation ID
# - CLNDN: ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB
# Of course you can select the INFO fields you want in the ClinVar VCF file
# Quick example on GRCh38:
./vep --id "1 230710048 230710048 A/G 1" --species homo_sapiens -o /path/to/output/output.txt --cache --offline --assembly GRCh38 --custom /path/to/custom_files/clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN
在使用转录本注释的GFF
时,如果搞不明白缓存cache
也可以不加,直接使用--gff
调用gff
文件也可以,也不需要使用--custom
来对gff
文件定义,例如:
VEP can use transcript annotations defined in GFF or GTF files. The files must be bgzipped and indexed with tabix and a FASTA file containing the genomic sequence is required in order to generate transcript models.
Your GFF or GTF file must be sorted in chromosomal order. VEP does not use header lines so it is safe to remove them.
grep -v "#" data.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > data.gff.gz
tabix -p gff data.gff.gz
# 直接进行注释,注意这里出来的结果为tab分割的.txt文件
./vep -i input.vcf --gff data.gff.gz --fasta genome.fa.gz
问题2 如何提高效率
参考链接:Getting VEP to run faster
官方共提出了12点可以提高运行效率的方案,如果有精力可以都尝试一下,这里我采用了两个建议:
第一点: 设置--fork
flag
Using forking enables VEP to run multiple parallel "threads", with each thread processing a subset of your input. Most modern computers have more than one processor core, so running VEP with forking enabled can give huge speed increases (3-4x faster in most cases). Even computers with a single core will see speed benefits due to overheads associated with using object-oriented code in Perl.
To use forking, you must choose a number of forks to use with the --fork flag. We recommend using 4 forks:
./vep -i my_input.vcf --fork 4 --offline
but depending on various factors specific to your setup you may see faster performance with fewer or more forks.
被这个folk
整的有些懵逼,显然他不是线程数,而作者最后说由于设置不同,增加或者减少forks
的数量都有可能提速....
**第二点:**按染色体拆分.vcf.gz
,并行
For very large files (for example those from whole-genome sequencing), VEP process can be easily parallelised by dividing your file into chunks (e.g. by chromosome). VEP will also work with tabix-indexed, bgzipped VCF files, and so the tabix utility could be used to divide the input file:
问题3 输出文件格式
参考链接:Variant Effect Predictor Data formats
如果不指定的话默认tab输出,但是为了方便后续,我们希望直接把注释信息写入vcf
文件的info
部分。
VEP can return the results in different formats:
Along with the results VEP computes and returns some statistics.
这里统计了VEP
的输出格式,我们重点看下vcf
输出
VCF output
The VEP script can also generate VCF output using the --vcf flag.
Main information about the specificity of the VEP VCF output format:
- Consequences are added in the INFO field of the VCF file, using the key "CSQ" (you can change it using --vcf_info_field).
- Data fields are encoded separated by the character "|" (pipe). The order of fields is written in the VCF header. Unpopulated fields are represented by an empty string.
- Output fields in the "CSQ" INFO field can be configured by using --fields.
- Each prediction, for a given variant, is separated by the character "," in the CSQ INFO field (e.g. when a variant overlaps more than 1 transcript)
Here is a list of the (default) fields you can find within the CSQ field:
Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID
Some fields are not reported by default and need configuring including HGVS, Symbol and Existing_variation.
Example of VEP command using the --vcf and --fields options:
./vep -i examples/homo_sapiens_GRCh38.vcf --cache --force_overwrite --vcf --fields "Allele,Consequence,Feature_type,Feature"
VCFs produced by VEP can be filtered by filter_vep.pl in the same way as standard format output files.
简单来说就是在最后加上--vcf
的flag,可以用--fields
来控制INFO
部分CSQ
的格式,如果不想麻烦,就直接加一个--vcf
的flag就可以了。最后通过-o
参数自定义输出vcf
的文件名通过--sf
自定义输出html
统计结果的文件名。
实战代码
看下工作目录的结构:
(vep) shawn@bio-Super-Server:~/working_dir/04.result$ tree
.
├── B73_RefGen_v2.fa -> /home/data/public/01.database/01.Database/01.genome_db/01.maize/03.B73_v2/B73_RefGen_v2.fa # B73_v2的参考基因组
├── ZmB73_5a.59_WGS.gff3 -> /home/data/public/01.database/01.Database/01.genome_db/01.maize/03.B73_v2/ZmB73_5a.59_WGS.gff3 # MaizeGDB下载B73_v2的gff3注释文件
├── snp.all.filter.vcf.gz -> ../03.process/01.bamfile/vcf/snp.all.filter.vcf.gz # gatk call snp的结果
├── snp.all.filter.vcf.gz.tbi -> ../03.process/01.bamfile/vcf/snp.all.filter.vcf.gz.tbi # tabix 索引
第一步,分析环境建立与软件安装:
## 建立重测序分析环境
conda create -n wgs -c bioconda gatk samtools bedtools bcftools bwa sambamba fastqc -y
## 安装vep
conda create -n vep -c bioconda ensembl-vep=101.0 -y
# 用tmux新建一个窗口,方便放后台跑
tmux new -t vep
第二步,格式化GFF
文件
# 将GFF文件软连接到工作目录下
ln -s /home/data/public/01.database/01.Database/01.genome_db/01.maize/03.B73_v2/ZmB73_5a.59_WGS.gff3 ./
# 格式化
grep -v "#" ZmB73_5a.59_WGS.gff3 | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > data.gff.gz
# tabix索引
tabix -p gff data.gff.gz
第三步,拆分vcf
文件,参考链接 Easy way to split VCF file by chromosome--biostar
mkdir split
# 脚本来自于 https://www.biostars.org/p/9506417/
# 首先写一个运行脚本
vim spilt.sh
# 按i进入编辑模式,写入一下代码
vcf_in=$1
vcf_out_stem=$2
for i in {1..10} #10条染色体
do
bcftools view ${vcf_in} --regions Chr${i} -o ${vcf_out_stem}_${i}.vcf.gz -Oz$
done
# 按esc退出编辑模式,按:进入命令模式,输入wq回车保存
# 授予脚本执行权限
chmod +x split.sh
# 运行脚本
./split.sh snp_all.vcf.gz split/snp_chr &
# 跑完后查看结果
tree split
split/
├── snp_chr_1.vcf.gz
├── snp_chr_10.vcf.gz
├── snp_chr_2.vcf.gz
├── snp_chr_3.vcf.gz
├── snp_chr_4.vcf.gz
├── snp_chr_5.vcf.gz
├── snp_chr_6.vcf.gz
├── snp_chr_7.vcf.gz
├── snp_chr_8.vcf.gz
└── snp_chr_9.vcf.gz
第四步,进行注释
## 写一个vep注释的单行脚本
vim run_vep_anno.sh
vep -i split/snp_chr_${1}.vcf.gz --gff data.gff.gz --fasta B73_RefGen_v2.fa --fork 4 --vcf -o split/snp_anno_chr_${1}.vcf --sf split/snp_anno_chr_${1}.html
## wq保存退出
## parallel 并行注释
parallel bash run_vep_anno.sh ::: 1 2 3 4 5 6 7 8 9 10 &
第五步,查看结果
split/
├── snp_anno_chr_1.html # 统计结果报告
├── snp_anno_chr_1.vcf # 带注释的vcf文件
├── snp_anno_chr_1.vcf_warnings.txt # 运行过程的警告信息
├── snp_anno_chr_10.html
├── snp_anno_chr_10.vcf
├── snp_anno_chr_10.vcf_warnings.txt
├── snp_anno_chr_2.html
├── snp_anno_chr_2.vcf
├── snp_anno_chr_2.vcf_warnings.txt
├── snp_anno_chr_3.html
├── snp_anno_chr_3.vcf
├── snp_anno_chr_3.vcf_warnings.txt
├── snp_anno_chr_4.html
├── snp_anno_chr_4.vcf
├── snp_anno_chr_4.vcf_warnings.txt
├── snp_anno_chr_5.html
├── snp_anno_chr_5.vcf
├── snp_anno_chr_5.vcf_warnings.txt
├── snp_anno_chr_6.html
├── snp_anno_chr_6.vcf
├── snp_anno_chr_6.vcf_warnings.txt
├── snp_anno_chr_7.html
├── snp_anno_chr_7.vcf
├── snp_anno_chr_7.vcf_warnings.txt
├── snp_anno_chr_8.html
├── snp_anno_chr_8.vcf
├── snp_anno_chr_8.vcf_warnings.txt
├── snp_anno_chr_9.html
├── snp_anno_chr_9.vcf
├── snp_anno_chr_9.vcf_warnings.txt
├── snp_chr_1.vcf.gz
├── snp_chr_10.vcf.gz
├── snp_chr_2.vcf.gz
├── snp_chr_3.vcf.gz
├── snp_chr_4.vcf.gz
├── snp_chr_5.vcf.gz
├── snp_chr_6.vcf.gz
├── snp_chr_7.vcf.gz
├── snp_chr_8.vcf.gz
└── snp_chr_9.vcf.gz
查看下结果发现INFO
以及加入了注释信息。
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT L01_114 L01_115 L01_117 L01_25 L01_26 L01_28 L01_29 L01_30 L01_32 L01_33 L01_34 L01_35 L01_36 L01_37 L01_38 L01_39 L01_49 L01_50
Chr1 15 . A G 289.14 PASS AC=5;AF=0.139;AN=36;BaseQRankSum=1.53;DP=174;ExcessHet=0;FS=2.831;InbreedingCoeff=0.5213;MLEAC=4;MLEAF=0.111;MQ=51.54;MQRankSum=0.536;QD=12.57;ReadPosRankSum=-0.438;SOR=1.179;CSQ=G|downstream_gene_variant|MODIFIER|GRMZM2G059865|GRMZM2G059865|Transcript|GRMZM2G059865_T01|protein_coding|||||||||||4839|-1||||data.gff.gz|,G|downstream_gene_variant|MODIFIER|GRMZM2G059865|GRMZM2G059865|Transcript|GRMZM2G059865_T02|protein_coding|||||||||||4842|-1||||data.gff.gz|,G|downstream_gene_variant|MODIFIER|GRMZM2G059865|GRMZM2G059865|Transcript|GRMZM2G059865_T03|protein_coding|||||||||||4842|-1||||data.gff.gz|,G|5_prime_UTR_variant|MODIFIER|GRMZM2G060082|GRMZM2G060082|Transcript|GRMZM2G060082_T01|protein_coding|1/5||||13|||||||1||||data.gff.gz|,G|upstream_gene_variant|MODIFIER|GRMZM2G060082|GRMZM2G060082|Transcript|GRMZM2G060082_T02|protein_coding|||||||||||2592|1||||data.gff.gz| GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,197 0/1:14,4:18:99:113,0,429 0/0:3,0:3:9:0,9,126 0/0:6,0:6:18:0,18,190 0/0:4,0:4:12:0,12,169 0/0:3,0:3:9:0,9,89 0/0:12,0:12:36:0,36,464 0/0:6,0:6:18:0,18,199 0/0:11,0:11:33:0,33,425 0/0:9,0:9:27:0,27,287 0/0:9,0:9:0:0,0,252 0/0:5,0:5:15:0,15,192 0/0:5,0:5:15:0,15,191 0/0:4,0:4:12:0,12,148 1/1:0,2:2:6:85,6,0 1/1:0,3:3:9:126,9,0 0/0:11,0:11:33:0,33,412 0/0:10,0:10:30:0,30,318
建议输出表格形式,方便信息查阅
## 只需要把脚本里的代码修改一下
vep -i split/snp_chr_${1}.vcf.gz --gff data.gff.gz --fasta B73_RefGen_v2.fa --fork 4 --tab -o
split/snp_anno_chr_${1}.xls split/snp_anno_chr_${1}.html
=== END ===