Entity name | Primary key attributes | # Attributes | Description |
Analysis | Analysis_ID | 13 | This table captures information about each Analysis that is performed in one Study.
A Study may perform one or more Analyses within that Study.
An Analysis is the comparison of the Study's Dataset to a Reference. |
Cohort | Cohort_ID | 6 | This table contains information about a given Cohort .
A Cohort could be part of one Study or a Cohort could be examined in more than one Study. For example, Samples from the HapMap Project or the Human Genome Diversity Panel are Cohorts that are examined in several Studies. |
Copy_Number | Copy_Number_ID | 11 | This table captures descriptions of each Copy of a Copy Number Variant (CNV). At present, this information is not provided in a Study (primarily due to existing technological constraints). When these data become available, they can be captured in this table. |
Coverage | ID | 7 | This stand-alone table is used for generating the Statistics charts on the DGV website. If users had access to the database (e.g. super-user access to write queries), they would not typically query this table. It is not part of the DGV per se, but is part of the same schema so queries can be written against the DGV and this table more efficiently. |
Cytoband_Overlap | ID | 5 | This stand-alone table is used for generating the Statistics charts on the DGV website. If users had access to the database (e.g. super-user access to write queries), they would not typically query this table. It is not part of the DGV per se, but is part of the same schema so queries can be written against the DGV and this table more efficiently. |
Dataset | Dataset_ID | 8 | This table represents an instance of output, or Dataset, generated by processing one or more Samples on a given Platform (employing a given Method) within a Study.
One Study can have many Datasets. |
Dataset_Analysis | Dataset_Analysis_ID | 9 | This table describes each Dataset that is compared to one Reference during one Analysis within one Study.
One or more Analyses may be performed on the same Dataset in a Study. One or more Datasets may be Analyzed in one Study.
An Analysis is the comparison of the Study's Dataset to a Reference using an Analysis Tool. The output of the Analysis is the Variant set (described primarily in the 'Variant' table and secondarily in the 'Variant_Type', 'Copy_Number', and 'Variant_Mapping' tables).
For example, if one Analyses examined three Datasets, there would be three records in this table, each having the same 'Analysis_ID' but different 'Dataset_ID'. |
Dataset_Xref | Dataset_Xref_ID | 8 | This table captures any external identifers used to reference a Dataset that exists in the DGV. This information is currently not being captured by DGVa or dbVar.
For example, DGVa Dataset accession numbers would be recorded here. Those identifiers would cross-referecnce Datasets in the DGVa. |
Feature_Overlap | ID | 9 | This stand-alone table is used for generating the Statistics charts on the DGV website. If users had access to the database (e.g. super-user access to write queries), they would not typically query this table. It is not part of the DGV per se, but is part of the same schema so queries can be written against the DGV and this table more efficiently. |
Filter | Filter_ID | 9 | This table stores all records that are filtered out during the DGV Filtering steps, which is part of the automated DGV-curation pipeline. Each record that is filtered out is recorded with its Primary Key (unique table ID), table name, reason for being filtered out, and at what step the record was filtered out. Timestamps on each record help keep track of when, specifically, a record was filtered out. |
Gene_Overlap | ID | 5 | This stand-alone table is used for generating the Statistics charts on the DGV website. If users had access to the database (e.g. super-user access to write queries), they would not typically query this table. It is not part of the DGV per se, but is part of the same schema so queries can be written against the DGV and this table more efficiently. |
Merged_Analysis | Merged_Analysis_ID | 7 | Data from multiple Analyses can be merged and further Analyzed within one Study. This table captures the relationshjps between these multiple Analyses.
For example, Analysis 1, Analysis 2, and Analysis 3 could then be merged into Analysis 4. This table captures the relationship between Analyses 1, 2, and 3 and Analysis 4.
Analyses 1, 2, 3 are called "Sub Analyses" and Analysis 4 is called a "Super Analysis". |
Method_Platform_Sample | Method_Platform_Sample_ID | 9 | This table captures information about an instance of a Method and of a Platform applied to a given Sample within a Study. Also captured is the Dataset to which the Sample-Method-Platform combination belongs.
A Study can employ one or more Methods and one or more Platforms, in examining one or more Samples. Within one Study, the same Sample or Samples can be examined by one or more Methods or Platforms. This table captures the complex relationship between Method, Platform, and Sample. |
Method_Study | Method_Study_ID | 12 | This table captures each instance of a Method employed within one Study. One Study may employ one or more Methods, and/or one or more instances of the same Method. |
Platform_Study | Platform_Study_ID | 12 | This table captures each instance of a Platform being employed in a Study. One Study may employ one or more Platforms. |
Reference | Reference_ID | 11 | This table describes the Reference applied to one Dataset in one Study.
One or more Datasets are compared against one or more References. Each instance of this is captured in this table.
A Reference is that to which the Dataset set in a given Study is compared by Analysis. |
Reference_Assembly | Reference_Assembly_ID | 3 | This table is used as a back-end to serve a drop down list on the new DGV website for a user to select a given Reference Assembly in which to view the data. |
Sample | Sample_ID | 17 | This table contains all of the Samples that have been examined in any Study that is part of the DGV.
Some Studies do not report Sample IDs but instead only a Variant ID. In these cases, it will not be possible to associate a given Variant to a Sample. This would typically be due to the need to protect the privacy of the individual who provided the Sample e.g. in cases where there is non-consent to share Sample information |
Sample_Pooled | Sample_Pooled_ID | 7 | This table captures the mapping between a Pooled Sample and the individual Samples that comprise the Pooled Sample - when this information is provided for a given a Study. |
Sample_Study | Sample_Study_ID | 8 | This table resolves the many-to-many relationship between the Study and Sample tables. One Study may examine one or more Samples. One Sample may be examined in one or more Studies. |
Sample_Xref | Sample_Xref_ID | 10 | This table captures external identifiers used to refer to Samples stored in the DGV.
For example, a Sample may be assigned an (external) identifier when that Sample is analyzed in a Study. When that Sample is loaded to the DGV, it will receive a DGV database identifier. This table will capture the cross-reference between the external sample identifiers/accession numbers and the DGV identifiers.
There could be external identifiers from other systems or projects, such as the HapMap project, the Coriell repository, or the Human Genome Diversity Panel. This table resolves the cross-referencing of these identifiers/accession numbers and the DGV Sample identifiers. |
Study | Study_ID | 20 | This table captures information about the Study in which structural variants were discovered and validated (not all discovered variants are validated in a Study).
The 'universal identifier' of the study, usually the 'PubMed ID' (PMID) is captured in this table; the intention of this ID is to uniquely identify the Study across other databases.
The 1000 Genomes Project Studies generate multiple input files per Study, or a "Study group". As each of those input files are loaded to the DGV, each will be assigned its own Study_ID. In these cases, all files from a Study group will have the same 'Study_Accession' value (e.g. "estd59" represents many input files from the 1000 Genomes project; one Study Accession number per multiple files).
Sometimes an external database will be referenced in a record in the database. Below is a translation of the abbreviations one might find in a record in this database, and the full names of those abbreviations/external databases:
AE = EBI Array Express
CLONE = NCBI CLONE
dbSNP = NCBI dbSNP
dbGaP = NCBI dbGAP
GENBANK = NCBI GENBANK
GENE = NCBI GENE
GEO = NCBI GEO
HPO = Human Phenotype Ontology
MeSH = Medical Subject Headings
OMIM = Online Mendelian Inheritance in Man
PROBE = NCBI PROBE
SRA = NCBI Sequence Read Archive (to be discontinued in 2011-2012)
ERA = EBI SRA
TRACE = NCBI TRACE Archive (to be discontinued in 2011-2012) |
Study_File_Prefix | Study_File_Prefix_ID | 6 | |
Supporting_Merged_Variant | Supporting_Merged_Variant_ID | 7 | This table describes which Supporting Variants comprise a given Merged Variant. Each record in this table will include the Merged Variant ID and one of its corresponding Supporting Variant IDs; there will be 'n' number of records for every 'n' number of Supporting Variants that comprise a Merged Variant.
Note: The Variant table stores both Supporting and Merged Variant records. One Merged Variant is comprised of one or more Supporting Variants; Supporting Variants are not always reported in a Study (i.e. sometimes only the Merged, or "asserted" Variants are reported). This, the 'Sample_Merged_Variant' table resolves the one-to-many relationship of Supporting to Merged Variants that exist in the 'Variant' table.
Both dbVar and DGVa describe what DGV calls a "Merged Variant" as an "asserted Variant" (SV; e.g. esv12345) and all three systems call the Variants the comprise a Merged Variant, "Supporting Variants" (SSV; e.g. nssv12324). |
Translocation_Mapping | Translocation_Mapping_ID | 29 | This table captures the mapping coordinates of a Translocation Variant or any other Variant that may have mappings across two Chromosomes. e.g. translocation, transposon
A translocation can span more than one Chromosome, creating a one-to-many relationship between a Variant and its respective mapping records.
Note: For other Variant Types, their respective mappings are stored in the 'Variant_Mapping' table. Translocation Mapping information is not likely to be provided in a Study until technology is refined enough to detect such events. |
Variant | Variant_ID | 23 | This table captures all Variants reported in Studies that have been loaded to the DGV. This table stores both Supporting and Merged Variants.
For Studies that were accessioned by DGVa, the Variants are assigned a DGVa accession number that starts with "e"; for Studies that were accessioned by dbVar, the Variants are assigned a dbVar accession number that starts with "n".
For example, a Supporting Variant, in DGVa will be assigned an accession number like "essv12345"; merged Variants, or what DGVa and dbVar call "asserted" Variants, would be assigned an accession numer like "esv45678". For dbVar, these examples, instead, would be, respectively, "nssv12345" and "nsv45678". |
Variant_Analysis | Variant_Analysis_ID | 7 | One or more Variants can be related to one or more 'Dataset_Analysis' records. This table resolves the M:M relationship between the 'Dataset_Analysis' and 'Variant' tables. This is in order to support the relationships defined in the dbVar/DGVa XSD.
For each 'Dataset_Analysis' record to which a Variant is associated, this table will store a record for each instance of this. For example, Variant_1 is associated to Dataset_Analysis 1, 2, and 3. So in the 'Variant_Analysis' table there would be three records, each containing the same 'Variant_ID' and each with a different 'Dataset_Analysis_ID', i.e. 1, 2, or 3. |
Variant_Mapping | Variant_Mapping_ID | 29 | This table captures the mapping coordinates of a Variant.
The purpose of capturing these data in a separate table is to afford mappings to any of the human genome reference assemblies (e.g. hg17, hg18) for a given Variant |
Variant_Type | Variant_Type_ID | 21 | This table contains a list of all possible Variant Types that can be attributed to a given Variant.
A Variant could have more than one type, e.g. Inversion + Deletion. As technologies develop further, more refined Variant Type descriptions could arise. This table will allow for multiple Variant Types to be ascribed to one Variant.
NB The ICGC nomenclature already describes Variant Types with sub-categories. This is an example of how a Variant could be described with multiple Variant Types. |
Variant_Xref | Variant_Xref_ID | 9 | This table captures any external identifers used to reference a Variant that exists in the DGV.
For example, DGVa 'esv' (asserted variant) and 'essv' (supporting variant) accession numbers would be recorded here. Those identifiers would cross-referecnce Variants in the DGVa.
This table can also capture IDs from external databases for records that relate to a given Variant in this table. For example, if a Variant's sequence has been deposited to NCBI's Trace Archive (formally discontinued in February, 2011), that sequence will receive one or more TEMPLATE_IDs, e.g. TEMPLATE_ID = G248P82926C7. In this example, the 'External_ID_Source' = "Trace Archive" and the 'External_Variant_ID' = "G248P82926C7", with the corresponding 'Variant_ID' for that Variant from the 'Variant' table. |