Database of Genomic Variants

Frequently Asked Questions

How are the boundaries of CNVs identified and reported?
How is a gain or loss defined, and how can I get the variant frequency?
How do I compare the data in DGV to my patient cohort?
What types of filters are applied to the data before they are added to DGV?
What if I want to look at the entries that have been filtered/removed from DGV?
Can I just look at variants found in HapMap samples?
Why are some variants mapped to hg18 but not hg19?
How do I cite the database?
What is the data model used for DGV?
When I search for 'external sample id' = "NA18510", I do not get any results, but I know that sample is in the database. Why is this?
What is the difference between an esv, essv, nsv, nssv and dgv accession?

Answers

How are the boundaries of CNVs identified and reported?
We report the start and end coordinates of each variant as given in the original study. Depending on the method used to detect the CNV, the reported boundaries may differ considerably from those of the actual underlying variant. This is evident in regions where many different studies have reported the same variant with different coordinates. The data must therefore be interpreted with this in mind. An apparent overlap between a reported CNV and a gene may not be real, as the CNV may be much smaller than reported. Some studies also merge nearby variants into larger regions, and this process may combine separate CNVs into one large variant.
Data from BAC clone CGH arrays:
Coordinates from studies using BAC arrays tend to overestimate the boundaries of CNVs. BACs are vectors containing large inserts of DNA, generally 150-250 kb in size. Studies detecting CNVs using this approach always report the start and end of the BAC clones that give a result indicative of a variant. However, BAC arrays are highly sensitive, and variants as small as 20-30 kb may be detected. A CNV of this size may therefore reside anywhere between the start and end coordinates of the clone, even though the actual variant is significantly smaller than the reported region.
Data from SNP arrays and oligonucleotide CGH arrays:
The probes on oligo and SNP arrays are very short and therefore do not suffer from the same bias as BAC clone arrays. Overall, high-resolution SNP arrays tend to give more accurate boundary information, and are more likely to underestimate than overestimate the size of CNVs.
How is a gain or loss defined, and how can I get the variant frequency?
There are several things to consider when interpreting CNV data and CNV genotypes. It is important to keep in mind that CNV data is always relative. A CNV call can be relative to a specific reference sample, to a pool of reference samples, or to the reference assembly. Since different reference samples may have been used in different studies, what is called a gain in one study may be called a loss in another.
Insertions and duplications:
Some gains in the database are annotated as only one base pair in size. This means that there is an insertion into the reference sequence at that coordinate. The estimated size of the insertion is given on the detailed information page for the variant. When a gain is not annotated as an insertion into the reference, the highlighted region represents the sequence that is duplicated. Importantly, most current technologies provide no information about the location of the duplicated sequence, and it could theoretically be located anywhere in the genome. However, for most duplications that have been characterized in detail, the additional copy has been found in tandem with, or at least near, the original sequence.
CNV genotypes:
Another limitation of many studies to date is that they have not been able to correctly identify CNV genotypes. Calls are simply made as gains or losses relative to a given reference. The actual number of copies present, or whether gains or losses are homozygous or heterozygous, often cannot be accurately determined with existing tools. Therefore, the frequencies we report in the database are not allele frequencies, but counts of gains and losses for each variant, which have to be interpreted in relation to the total sample size of the study.
Frequencies:
The frequency of a variation is defined by the authors and can be a relative measure compared to the number of samples tested, or if there is genotype data available, this could be represented as an allele frequency.
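As a rough illustration of the point above, the following minimal sketch turns observed gain and loss counts into per-sample frequencies. The function name and the counts are hypothetical and are not taken from DGV. The result is the fraction of samples carrying a call, not an allele frequency.

```python
def carrier_frequency(observed_gains, observed_losses, sample_size):
    """Return (gain_fraction, loss_fraction) relative to the samples tested.

    These are fractions of samples with a gain or loss call, not allele
    frequencies, because copy number genotypes are usually unknown.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be a positive number of samples")
    return observed_gains / sample_size, observed_losses / sample_size


# Hypothetical study: 3 gains and 12 losses among 270 samples.
gain_fraction, loss_fraction = carrier_frequency(3, 12, 270)
print(f"gain: {gain_fraction:.3f}, loss: {loss_fraction:.3f}")
```

The same counts from a study of 30 samples would give fractions nine times larger, which is why the counts always need to be read together with the study's sample size.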
How do I compare the data in DGV to my patient cohort?
The database contains only data originally described in healthy controls. However, this does not mean the database should be used as a substitute for running a control set with your patient samples. The database is meant to serve as a guide: it will tell you whether a common variant has been reported in your region of interest. Just because a variant is annotated in the database does not mean that a similar variant cannot be disease causing in your patient sample. Similarly, a lack of variants in a specific region of the database does not necessarily mean there are no common variants at that locus. Factors such as probe coverage and resolution may differ significantly between platforms. Since the boundaries of variants reported in DGV are often inaccurate, it is also often difficult to know for sure whether a variant found using a different experimental approach is exactly the same as one annotated in DGV. Some of the older studies are also less reliable and did not include an estimate of the false discovery rate. DGV therefore does contain some false positives. As a rule of thumb, regions identified in many studies or by independent methods are most likely real. Large variants identified in a single sample by a single study are either extremely rare variants or false positives.
For a current review on the interpretation of array data, please see the following publication: Diagnostic interpretation of array data using public databases and internet sources
What types of filters are applied to the data before they are added to DGV?
The data undergoes a systematic review prior to inclusion in the database. We run a number of quality assurance steps to ensure that high-quality data is presented to users.
Many of the processing steps may be dependent on the study or method applied, and some of the more common steps are outlined here.
1. Study specific filters (request made by author to remove specific variants, variants detected in patient samples). If a study includes both cases and controls, we filter out all case-related data.
2. Chromosome Mapping
  1. Only variants mapped to one of the autosomes (1-22) or sex chromosomes (X,Y) are kept. Variants mapped to chrM, chrR, chr6_hap or chrUN are removed.
  2. Variants mapped to chromosome Y in female samples are removed.
3. Merging
  1. For studies which have analysed multiple samples, DGV will merge sample-level calls that share at least 70% reciprocal overlap, measured by length and position.
4. Size/Location
  1. Copy number variants larger than (or equal to) 50bp and smaller than 3Mb are kept, and inversions larger than 10Mb are removed.
  2. Variants which span gaps in the reference assembly are removed.
  3. Variants which correspond to Decipher Genomic Disorders are removed (> 70% shared length)
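The chromosome and size filters listed above can be sketched as follows. This is a minimal illustration, not DGV's actual pipeline: the function name, the `"inversion"` type label, the `"female"` sex label and the assumption of 1-based inclusive coordinates are all hypothetical. The assembly gap and Decipher filters are omitted because they need external annotation.

```python
# Autosomes 1-22 plus the sex chromosomes, using UCSC-style names.
ALLOWED_CHROMS = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY"}

MIN_CNV_BP = 50                  # CNVs must be >= 50 bp
MAX_CNV_BP = 3_000_000           # CNVs must be < 3 Mb
MAX_INVERSION_BP = 10_000_000    # inversions > 10 Mb are removed


def passes_filters(chrom, start, end, variant_type, sample_sex=None):
    """Return True if a call survives the chromosome and size filters.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if chrom not in ALLOWED_CHROMS:
        return False  # e.g. chrM, chrUn, alternate haplotypes
    if chrom == "chrY" and sample_sex == "female":
        return False
    length = end - start + 1
    if variant_type == "inversion":
        return length <= MAX_INVERSION_BP
    return MIN_CNV_BP <= length < MAX_CNV_BP


print(passes_filters("chr1", 1000, 1999, "CNV"))   # 1 kb CNV on an autosome
print(passes_filters("chrM", 1000, 1999, "CNV"))   # mitochondrial call
```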
What if I want to look at the entries that have been filtered/removed from DGV?
You can obtain a GFF3 file of the filtered variants on the Downloads page, under the Filtered Variants heading.
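Once downloaded, a GFF3 file can be read line by line. The sketch below relies only on the generic GFF3 layout (nine tab-separated columns, with `key=value` attributes separated by semicolons). The example line and its attribute names are made up for illustration and may not match the attributes used in the DGV file.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict, or return None for
    blank lines and comment/directive lines starting with '#'."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in fields[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attributes.
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


# Made-up example line; real attribute names may differ.
example = "chr1\texample\tCNV\t10001\t22118\t.\t.\t.\tID=example_variant_1;Name=example_variant_1"
record = parse_gff3_line(example)
print(record["seqid"], record["start"], record["end"], record["attributes"]["ID"])
```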
Can I just look at variants found in HapMap samples?
Using the Query tool, go to the samples tab and filter by cohort. By selecting the HapMap cohort, and the filter all option, only data derived from the HapMap samples will be presented.
Why are some variants mapped to hg18 but not hg19?
When the variation data was mapped to hg19, we did our best to come up with a process that would result in a low error rate while maximizing the number of variants kept in hg19. Due to changes in the underlying assembly, some regions are rearranged while others contain novel sequence, thus changing the structure of the region. In most cases the assembly has not changed enough to cause difficulty in remapping, but there are some regions where we could no longer map the variant accurately.
How do I cite the database?
When citing the Database of Genomic Variants, please refer to: MacDonald JR, Ziman R, Yuen RK, Feuk L, Scherer SW. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 2013 Oct 29. PubMed PMID: 24174537
What is the data model used for DGV?
The data model for the DGV can be found here.
When I search for 'external sample id' = "NA18510", I do not get any results, but I know that sample is in the database. Why is this?
There can be more than one 'external sample id' associated with a given sample, so when searching for a specific sample, please use the wildcard search function, "~".
What is the difference between an esv, essv, nsv, nssv and dgv accession?
Each study in DGV has been archived and accessioned by one of two groups: dbVar has assigned nsv/nssv accessions, while DGVa has assigned esv/essv accessions. An esv is an EBI structural variant, and an essv is an EBI supporting structural variant. An nsv is an NCBI structural variant, and an nssv is an NCBI supporting structural variant.
Supporting structural variants ("ssv") can also be described as sample-level variants, where each ssv represents the variant called in a single sample/individual. If many samples are analysed in a study and many of them have the same variant, there will be multiple ssvs with the same start and end coordinates. These sample-level variants are then merged to form a representative variant that highlights the common variant found in that study. This is called a structural variant ("sv") record.
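The prefix alone is enough to tell which archive assigned an accession and whether it is a sample-level or merged record. A small sketch of that lookup follows. The function name and the example accessions are hypothetical, and the wording of the labels is ours.

```python
# Accession prefix -> (assigning group, record level).
ACCESSION_PREFIXES = {
    "essv": ("DGVa (EBI)", "supporting variant, sample level"),
    "nssv": ("dbVar (NCBI)", "supporting variant, sample level"),
    "esv": ("DGVa (EBI)", "structural variant, merged per study"),
    "nsv": ("dbVar (NCBI)", "structural variant, merged per study"),
    "dgv": ("DGV", "structural variant, merged by DGV"),
}


def classify_accession(accession):
    """Return (assigning group, record level) based on the accession prefix."""
    lowered = accession.strip().lower()
    for prefix, description in ACCESSION_PREFIXES.items():
        if lowered.startswith(prefix):
            return description
    raise ValueError(f"unrecognised accession prefix: {accession!r}")


# Made-up accessions, for illustration only.
for accession in ("essv12345", "nsv67890"):
    group, level = classify_accession(accession)
    print(f"{accession}: {group}, {level}")
```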
DGV has always provided this type of summary/merged variant and we have continued to do so in cases where there are a number of overlapping supporting variants that are almost identical, but may be slightly different due to the inherent variability within the experiment. The start/stop of variants in different samples may be offset or skewed to a certain degree based on the performance/accuracy of the experiment. If there are clusters of variants that share at least 70% reciprocal overlap in size/location, we will merge these together and provide an sv record that has our internal "dgv"-prefixed identifier.
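The 70% reciprocal overlap criterion can be illustrated with a short sketch. It uses a common definition of reciprocal overlap (the shared length must cover at least the threshold fraction of both variants) and assumes 1-based inclusive coordinates. The function names are hypothetical, and the exact calculation and clustering procedure used by DGV may differ.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions for two intervals.

    Coordinates are assumed to be 1-based and inclusive. A value of 0.7
    means the shared region covers at least 70% of both intervals.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    len_a = a_end - a_start + 1
    len_b = b_end - b_start + 1
    return min(overlap / len_a, overlap / len_b)


def should_merge(variant_a, variant_b, threshold=0.7):
    """Decide whether two (chrom, start, end) calls meet the threshold."""
    if variant_a[0] != variant_b[0]:
        return False
    return reciprocal_overlap(variant_a[1], variant_a[2],
                              variant_b[1], variant_b[2]) >= threshold


# Two 100 bp calls offset by 20 bp share 80 bp.
print(should_merge(("chr1", 100, 199), ("chr1", 120, 219)))
# A 100 bp call nested in a 400 bp call covers only 25% of the larger one.
print(should_merge(("chr1", 100, 199), ("chr1", 100, 499)))
```

Requiring the overlap to be reciprocal prevents a small variant from being merged into a much larger one simply because it lies entirely inside it.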