Hapmap and VCF formats and its integration with onemap
Written by Cristiane Hayumi Taniguti, Letícia Aparecida de Castro Lara, and Rodrigo Rampazo Amadeu
Hapmap (“Haplotype Map”) and VCF (“Variant Call Format”) formats were developed by International Consortiums to create an expressive database for polymorphisms in the human genome. They have unique format data which are our goal to describe in this article. These established data format are also used in the study of another species. The object here is described both formats, how to to convert one into another, and how to make the call in the mapping softwares.
Hapmap
The Hapmap Project was initiated in 2001 by the International HapMap Consortium. Its database is freely available to the public through the NCBI database dbSNP. The Project is also described at Nature 426 :789-796, 2003 [PMID: 14685227]. The file format estabilished through the project is also used in others projects and species.
The Hapmap file format is a table which consists of 11 columns plus one column for each sample genotyped. The first row contains the header labels of your samples, and each additional row contains all the information associated with a single SNP. You can get a Hapmap file by chromosome or a general file.
The current release consists of attributes, as follows:
rs# alleles chrom pos strand assembly# center protLSID assayLSID panelLSID QCcode sample1 |
rs#
contains the SNP identifier;alleles
contains SNP alleles according to NCBI database dbSNP;chrom
contains the chromosome that the SNP was mapped;pos
contains the respective position of this SNP on chromosome;strand
contains the orientation of the SNP in the DNA strand. Thus, SNPs could be in the forward (+) or in the reverse (-) orientation relative to the reference genome;assembly#
contains the version of reference sequence assembly (from NCBI);center
contains the name of genotyping center that produced the genotypes;protLSID
contains the identifier for HapMap protocol;assayLSID
contain the identifier HapMap assay used for genotyping;panelLSID
contains the identifier for panel of individuals genotyped;QCcode
contains the quality control for all entries;- subsequently, the list of sample names.
Below it is an example of VCF file:
rs# alleles chrom pos strand assembly# center protLSID assayLSID panelLSID QCcode Parental_1 Parental_2 ID_1 ID_2
44509 A/G 02 5565755 + NA NA NA NA NA NA AG AA AA AA
38019 G/A 02 43878360 + NA NA NA NA NA NA GG GA GG GG
89440 T/C 04 25220824 + NA NA NA NA NA NA TC TT NN TT
The codification of the genotypes follows bellow:
Code | Meaning |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
R | A or G |
Y | C or T |
S | G or C |
W | A or T |
K | G or T |
M | A or C |
B | C or G or T |
D | A or G or T |
H | A or C or T |
V | A or C or G |
N | any base |
. or - | gap |
VCF
The VCF data format was developed for the 1000 Genomes Project by the International 1000 Genomes Consortium. For further informations please access the 1000 Genome Project webpage where not only are included several new populations, but also included the populations of the HapMap Project. This guide was based upon the wiki of 1000 Genomes Project .
The VCF format is divided in meta-information lines, header lines, and data lines.
Below it is an example of VCF file:
##fileformat=VCFv4.0
##fileDate=20160306
##source="Stacks v1.37"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth">
##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihood">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Parental_1 Parental_2 ID_1 ID_2 ID_3
Chr02 5565755 44509 A G . PASS . GT:DP:AD:GL 0/1:31:23,8:.,42.98,. 0/0:13:9,0:.,18.02,. 0/0:30:30,0:.,41.59,. 0/0:27:27,0:.,37.43,. 0/0:25:25,0:.,34.66,.
The Meta-information are rows where the first one is mandatory and has
the VCF format version. Depending upon the VCF format version your data
will use one form or another to fill the parameters options. Pay
attention in what version your data and programs are working with and if
they are compatible. The first row can be followed by more, but not
mandatory, lines about the file origin as date of file creation, source,
reference, phasing, etc. After them, it may contain Information lines
##INFO
. The ##INFO
contains the following entries:
ID
name of the variable (a string);Number
: number of values (an integer);Type
: type of variable (integer, float, character, string, and flag);Description
: a string with the descrition.
And has the following syntax:
##INFO=<ID=ID,Number=”number”,Type=”type”,Description=”description”>
In the example, the line
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
has the identificator NS
, “Number of Samples With Data”
, represented
by 1
Integer
.
After the ##INFO
lines, it should contain the ##FORMAT
lines which
indicate each type of variable of your data. The ##FORMAT
contains the
following entries:
ID
name of the variable (a string);Number
number of values (an integer if version=4.0);Type
type of variable (integer, float, character, string, and flag);Description
: a String with the description.
And has the following syntax:
##FORMAT=<ID=ID,Number=number,Type=type,Description=”description”>
In the example, the line
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
has the identificator GT
, a “Genotype”
, represented by 1
String
.
After the Meta-information lines containing all the information and
format data, there is the header line. The header line has mandatorily 8
fixed columns and are tab limited. If it has genotypic data, it also has
the FORMAT
column followed by the sample IDs. The columns and its
respective data information are:
#CHROM
chromosome (alphanumeric string);POS
position in chromosome (integer);ID
identificator of the line (alphanumeric string);REF
reference base(s) (A,C,G,T,N);ALT
list of alternate non-reference base(s) (A,C,G,T,N);QUAL
phred-scaled quality score for the assertion made in ALT;FILTER
PASS if it has passed in all filters, if not, it should contain the reason;INFO
it is about the all INFO earlier described separated by semicolon;FORMAT
all the format IDs separated by colon;IDs
a tab separated list with sample identificators.
In the example, the header is:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Parental_1 Parental_2 ID_1 ID_2 ID_3
Where indicates all the columns and the list of sample identificators
(in the case: Parental_1
,Parental_2
,ID_1
,ID_2
,ID_3
).
After the header line, it has the data lines tab limited where ‘.’ represents missing data. In the example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Parental_1 Parental_2 ID_1 ID_2 ID_3
Chr02 5565755 44509 A G . PASS . GT:DP:AD:GL 0/1:31:23,8:.,42.98,. 0/0:13:9,0:.,18.02,. 0/0:30:30,0:.,41.59,. 0/0:27:27,0:.,37.43,. 0/0:25:25,0:.,34.66,.
It is a SNP at chromosome 2 in the position 5565755 with identificator equals to 44506. The reference allele is an A and has one alternative allele G. There is no information about the quality of this data collected. It passed through the filter. There is no additional information about it. The samples are represented in the order GT:DP:AD:GL (genotype, read depth, allele depth, and genotype likelihood). The sample Euc_201 cell is:
0/1:31:23,8:.,42.98,.
In other words, Euc_202 sample genotype is 0/1, (i.e. A/G) with a read depth of 31. It has an allele depth of 23 for the first allele and 8 for the second allele with a genotype likelihood of 42.98.
The genotype can also be represented in terms of insertion and deletion, please see the wiki.
VCF x HapMap
Both formats can be easily interchangeable using the software TASSEL 4. Using the GUI version, just import the file and export as the type you want. Using the standalone version, first make sure that your TASSEL directory are in the PATH:
export PATH=$PATH:~/Programs/TASSEL_standalone
And then,
- VCF to Hapmap:
./run_pipeline.pl -Xmx5g -fork1 -vcf example.vcf -export -exportType Hapmap -runfork1C
- Hapmap to VCF:
/run_pipeline.pl -Xmx5g -fork1 -h example.hmp.txt -export -exportType VCF -runfork1
Using the software TASSEL 5:
git clone https://bitbucket.org/tasseladmin/tasselk-5-standalone.git
And then,
- VCF to Hapmap:
./run_pipeline.pl -Xmx5g -fork1 -vcf example.vcf -export -exportType Hapmap -fork1c
- Hapmap to VCF:
./run_pipeline.pl -Xmx5g -fork1 -h example.hmp.txt -export -exportType VCF -fork1c
Besides that, it can be found several others user made scripts available at github platform to do the same conversion.
VCF format is more informative than the Hapmap format. VCF can have unlimited information about the data as INFO, FORMAT, and file description. Hapmap is simpler and easier to edit and handle. However, Hapmap can store less data and versatile than VCF. When converting one in another be careful about the data you are missing in the process - mainly about the INFO and FORMAT fields if VCF. If converting Hapmap to VCF you can add information about the data after the converstion.
OneMap - vcf2raw function
Onemap is a software to do genetic linkage analysis and to build genetic
maps in experimental crosses: F2, backcrosses, RILs, and
full-sib. Its stable version has been available at
CRAN. Its
last version was updated on 2013-09-09. However, new functions are being
added for a new version. They are still in development and available at
a github repository. One of
these new functions called “vcf2raw” allows to convert a vcf file to an
OneMap raw file. At this tutorial we will explore the vcf2raw
function.
First, it is important to know about the OneMap raw format. For F2, backcrosses, and RILs, OneMap uses exactly the same as the MAPMAKER/EXP input file format. Here we will describe a brief overview about the format based upon the OneMap tutorial, but you can find more information at MAPMAKER/EXP manual. The file should look like this:
data type f2 intercross
10 5 2
*M1 A B H H A - B A A B
*M2 C - C C C - - C C A
*M3 D B D D - - B D D B
*M4 C C C - A C C A A C
*M5 C C C C C C C C C C
*weight 10.2 - 9.4 11.3 11.9 8.9 - 11.2 7.8 8.1
*length 1.7 2.1 1.8 2.0 1.0 - 1.7 1.0 - 1.1
The first line in the file should have the information about the population type, that can be denoted as “f2 backcross” for backcross populations type, or “f2 intercross” for F2, “ri self” for RIL produced by selfing, “ri sib” for RIL produced by sib mating. The second line contains the number of individuals in the progeny, the number of markers, and the number of quantitative traits. The phenotypic information may be present at the file, but it will not be used by OneMap. After these two lines, the file must have the genotypic information for each marker. The character * indicates the beginning of information of a marker, followed by the marker name.
The codification for genotypes is the following:
Code | Meaning |
---|---|
A | homozygous for allele A (from parental 1 - AA) |
B | homozygous for allele B (from parental 2 - BB) |
H | heterozygous carrying both alleles (AB) |
C | Not homozygous for allele A (Not AA) |
D | Not homozygous for allele B (Not BB) |
- | Missing data for the individual at this marker |
The quantitative trait data should come after the genotypic data and have a similar format. The trait values for each individual must be separate by at least one space, a tab or a line break. A dash (-) indicates missing data.
With this file format we can use the OneMap function called
read_mapmaker
to import the data to OneMap.
?read_mapmaker #For more informations about this function
onemap_file_f2 <- read_mapmaker(dir="C:/workingdirectory",
file="your_data_file.raw")
Constructing maps for outcrossing populations OneMap implements the Wu et al (2002a) strategy and uses its marker type classification described at the following table.
Parent | Offspring | |||
---|---|---|---|---|
Crosstype | Cross | Observed bands | Segregation | |
A | 1 | ab x cd | ac,ad,bc,bd | 1:1:1:1 |
2 | ab x ac | a,ac,ba,bc | 1:1:1:1 | |
3 | ab x co | ac,a,bc,b | 1:1:1:1 | |
4 | ao x bo | ab,a,b,o | 1:1:1:1 | |
B.1 | 5 | ab x ao | ab,2a,b | 1:2:1 |
B.2 | 6 | ao x ab | ab,2a,b | 1:2:1 |
B.3 | 7 | ab x ab | a,2ab,b | 1:2:1 |
C | 8 | ao x ao | 3a,o | 3:1 |
D.1 | 9 | ab x cc | ac,bc | 1:1 |
10 | ab x aa | a,ab | 1:1 | |
11 | ab x oo | a,b | 1:1 | |
12 | bo x aa | ab.a | 1:1 | |
13 | ao x oo | a,o | 1:1 | |
D.2 | 14 | cc x ab | ac,bc | 1:1 |
15 | aa x ab | a,ab | 1:1 | |
16 | oo x ab | a,ab | 1:1 | |
17 | aa x bo | ab,a | 1:1 | |
18 | oo x ao | a,o | 1:1 |
Letters A, B, C, and D indicate the segregation type (i.e., 1:1:1:1, 1:2:1, 3:1 or 1:1, respectively), while the number after the dot (e. g., A.1) indicates the observed bands in the offspring. Therefore, this information has to be in the input file at each line after the marker ID. The coding for marker genotypes used by OneMap is also the same one proposed by Wu et al (2002a) and the possible values vary according to the specific marker type.
Below, an example of input file which must be saved as text format:
cross type outcross
10 5 0 0 0
ID1 ID2 ID3 ID5 ID6 ID7 ID8 ID9 ID10
*M1 B3.7 ab,ab,-,ab,b,ab,ab,-,ab,b
*M2 D2.18 o,-,a,a,-,o,a,-,o,o
*M3 D1.13 o,a,a,o,o,-,a,o,a,o
*M4 A.4 ab,b,-,ab,a,b,ab,b,-,a
*M5 D2.18 a,a,o,-,o,o,a,o,o,o
Which can be imported to OneMap by the function read_onemap
:
?read_onemap #For more information about this function
onemap_file_outcross <- read_onemap("C:/workingdirectory","example.out.txt")
The first argument of read_onemap
function is the path to the file
directory and the second is the file name.
With the new function vcf2raw
it is possible to convert VCF file
format to OneMap format for any of these experimental populations. Plus,
it will keep the position information of the marker at the genome (if it
is available in the VCF) and consider it when building the map.
For some VCF header formats, the function will require a compressed and indexed VCF file. It can be done by the package tabix that is part of the Samtools software. It is also available at the Bioconductor inside the package RSamtools.
#Instaling RSamtolls package
source("https://bioconductor.org/biocLite.R")
biocLite("Rsamtools")
#Usage
library(Rsamtools)
zipped_vcf <- bgzip("example.vcf", "example.vcf.gz") #Compressing
idx <- indexTabix("example.vcf.gz", "vcf") #Indexing
The vcf2raw function have the parameters:
input
path to the input VCF file;output
type of cross. Must be one of: “outcross” for full-sibs; “f2 intercross”,for an F2 intercross progeny; “backcross”; “riself” for recombinant inbred lines by self-mating; or “risib” for recombinant inbred lines by sib-mating;cross
path to the output OneMap file;parent1
string or vector of strings specifying sample ID(s) of the first parent;parent2
string or vector of strings specifying sample ID(s) of the second parent;min_class
a real number between 0.0 and 1.0. For each parent and each variant site, defines the proportion of parent samples that must be of the same genotype for it to be assigned to the corresponding parent.
In our example we have a outcrossing population with parents Euc_201
and Euc_202. We do not have replicates for the parents, so the
“min_class” parameter will not be used for us. To call the file in R
:
?vcf2raw #For more informations about the function
vcf2raw(input = "example.vcf.gz", output = "example.raw", cross = "outcross", parent1 = "Parental_1", parent2 = "Parental_2")
Then we get the example.raw file:
data type outcross
3 15 1 1 0
ID_1 ID_2 ID_3
*44509 D1.10 a a a
*38019 D2.15 a a a
*89440 D1.10 a a a
*115628 D2.15 a a a
*124447 D1.10 ab - a
*132982 D2.15 - ab -
*155807 D1.10 ab a ab
*161174 B3.7 ab ab ab
*169671 B3.7 ab ab ab
*196363 B3.7 ab ab ab
*198491 B3.7 ab ab ab
*226621 B3.7 ab ab ab
*226621 B3.7 ab ab ab
*255557 B3.7 ab a ab
*296873 D1.10 ab - -
*CHROM Chr02 Chr02 Chr04 Chr05 Chr05 Chr06 Chr06 Chr07 Chr07 Chr08 Chr08 Chr09 Chr09 Chr11 scaffold_721
*POS 5565755 43878360 25220824 44192326 64399699 16254627 56636423 1462089 32868350 37209043 41026941 28459570 28459594 12942418 994
The closing lines (begining with *CHROM
and *POS
) keep the reference
sequence ID and position for each variant site. The new version of
OneMap will be able to deal with this information in order to increase
the precision of the genetic map. This file is a input file to
read_onemap
function.
Now you are good to go build your map on onemap package!