DeZign for Databases v5.2

Logical model

List of entities

Entity name	Primary key attributes	# Attributes	Description
Analysis	Analysis_ID	13	This table captures information about each Analysis that is performed in one Study. A Study may perform one or more Analyses within that Study. An Analysis is the comparison of the Study's Dataset to a Reference.
Cohort	Cohort_ID	6	This table contains information about a given Cohort . A Cohort could be part of one Study or a Cohort could be examined in more than one Study. For example, Samples from the HapMap Project or the Human Genome Diversity Panel are Cohorts that are examined in several Studies.
Copy_Number	Copy_Number_ID	11	This table captures descriptions of each Copy of a Copy Number Variant (CNV). At present, this information is not provided in a Study (primarily due to existing technological constraints). When these data become available, they can be captured in this table.
Coverage	ID	7	This stand-alone table is used for generating the Statistics charts on the DGV website. If users had access to the database (e.g. super-user access to write queries), they would not typically query this table. It is not part of the DGV per se, but is part of the same schema so queries can be written against the DGV and this table more efficiently.
Cytoband_Overlap	ID	5	This stand-alone table is used for generating the Statistics charts on the DGV website. If users had access to the database (e.g. super-user access to write queries), they would not typically query this table. It is not part of the DGV per se, but is part of the same schema so queries can be written against the DGV and this table more efficiently.
Dataset	Dataset_ID	8	This table represents an instance of output, or Dataset, generated by processing one or more Samples on a given Platform (employing a given Method) within a Study. One Study can have many Datasets.
Dataset_Analysis	Dataset_Analysis_ID	9	This table describes each Dataset that is compared to one Reference during one Analysis within one Study. One or more Analyses may be performed on the same Dataset in a Study. One or more Datasets may be Analyzed in one Study. An Analysis is the comparison of the Study's Dataset to a Reference using an Analysis Tool. The output of the Analysis is the Variant set (described primarily in the 'Variant' table and secondarily in the 'Variant_Type', 'Copy_Number', and 'Variant_Mapping' tables). For example, if one Analyses examined three Datasets, there would be three records in this table, each having the same 'Analysis_ID' but different 'Dataset_ID'.
Dataset_Xref	Dataset_Xref_ID	8	This table captures any external identifers used to reference a Dataset that exists in the DGV. This information is currently not being captured by DGVa or dbVar. For example, DGVa Dataset accession numbers would be recorded here. Those identifiers would cross-referecnce Datasets in the DGVa.
Feature_Overlap	ID	9	This stand-alone table is used for generating the Statistics charts on the DGV website. If users had access to the database (e.g. super-user access to write queries), they would not typically query this table. It is not part of the DGV per se, but is part of the same schema so queries can be written against the DGV and this table more efficiently.
Filter	Filter_ID	9	This table stores all records that are filtered out during the DGV Filtering steps, which is part of the automated DGV-curation pipeline. Each record that is filtered out is recorded with its Primary Key (unique table ID), table name, reason for being filtered out, and at what step the record was filtered out. Timestamps on each record help keep track of when, specifically, a record was filtered out.
Gene_Overlap	ID	5	This stand-alone table is used for generating the Statistics charts on the DGV website. If users had access to the database (e.g. super-user access to write queries), they would not typically query this table. It is not part of the DGV per se, but is part of the same schema so queries can be written against the DGV and this table more efficiently.
Merged_Analysis	Merged_Analysis_ID	7	Data from multiple Analyses can be merged and further Analyzed within one Study. This table captures the relationshjps between these multiple Analyses. For example, Analysis 1, Analysis 2, and Analysis 3 could then be merged into Analysis 4. This table captures the relationship between Analyses 1, 2, and 3 and Analysis 4. Analyses 1, 2, 3 are called "Sub Analyses" and Analysis 4 is called a "Super Analysis".
Method_Platform_Sample	Method_Platform_Sample_ID	9	This table captures information about an instance of a Method and of a Platform applied to a given Sample within a Study. Also captured is the Dataset to which the Sample-Method-Platform combination belongs. A Study can employ one or more Methods and one or more Platforms, in examining one or more Samples. Within one Study, the same Sample or Samples can be examined by one or more Methods or Platforms. This table captures the complex relationship between Method, Platform, and Sample.
Method_Study	Method_Study_ID	12	This table captures each instance of a Method employed within one Study. One Study may employ one or more Methods, and/or one or more instances of the same Method.
Platform_Study	Platform_Study_ID	12	This table captures each instance of a Platform being employed in a Study. One Study may employ one or more Platforms.
Reference	Reference_ID	11	This table describes the Reference applied to one Dataset in one Study. One or more Datasets are compared against one or more References. Each instance of this is captured in this table. A Reference is that to which the Dataset set in a given Study is compared by Analysis.
Reference_Assembly	Reference_Assembly_ID	3	This table is used as a back-end to serve a drop down list on the new DGV website for a user to select a given Reference Assembly in which to view the data.
Sample	Sample_ID	17	This table contains all of the Samples that have been examined in any Study that is part of the DGV. Some Studies do not report Sample IDs but instead only a Variant ID. In these cases, it will not be possible to associate a given Variant to a Sample. This would typically be due to the need to protect the privacy of the individual who provided the Sample e.g. in cases where there is non-consent to share Sample information
Sample_Pooled	Sample_Pooled_ID	7	This table captures the mapping between a Pooled Sample and the individual Samples that comprise the Pooled Sample - when this information is provided for a given a Study.
Sample_Study	Sample_Study_ID	8	This table resolves the many-to-many relationship between the Study and Sample tables. One Study may examine one or more Samples. One Sample may be examined in one or more Studies.
Sample_Xref	Sample_Xref_ID	10	This table captures external identifiers used to refer to Samples stored in the DGV. For example, a Sample may be assigned an (external) identifier when that Sample is analyzed in a Study. When that Sample is loaded to the DGV, it will receive a DGV database identifier. This table will capture the cross-reference between the external sample identifiers/accession numbers and the DGV identifiers. There could be external identifiers from other systems or projects, such as the HapMap project, the Coriell repository, or the Human Genome Diversity Panel. This table resolves the cross-referencing of these identifiers/accession numbers and the DGV Sample identifiers.
Study	Study_ID	20	This table captures information about the Study in which structural variants were discovered and validated (not all discovered variants are validated in a Study). The 'universal identifier' of the study, usually the 'PubMed ID' (PMID) is captured in this table; the intention of this ID is to uniquely identify the Study across other databases. The 1000 Genomes Project Studies generate multiple input files per Study, or a "Study group". As each of those input files are loaded to the DGV, each will be assigned its own Study_ID. In these cases, all files from a Study group will have the same 'Study_Accession' value (e.g. "estd59" represents many input files from the 1000 Genomes project; one Study Accession number per multiple files). Sometimes an external database will be referenced in a record in the database. Below is a translation of the abbreviations one might find in a record in this database, and the full names of those abbreviations/external databases: AE = EBI Array Express CLONE = NCBI CLONE dbSNP = NCBI dbSNP dbGaP = NCBI dbGAP GENBANK = NCBI GENBANK GENE = NCBI GENE GEO = NCBI GEO HPO = Human Phenotype Ontology MeSH = Medical Subject Headings OMIM = Online Mendelian Inheritance in Man PROBE = NCBI PROBE SRA = NCBI Sequence Read Archive (to be discontinued in 2011-2012) ERA = EBI SRA TRACE = NCBI TRACE Archive (to be discontinued in 2011-2012)
Study_File_Prefix	Study_File_Prefix_ID	6
Supporting_Merged_Variant	Supporting_Merged_Variant_ID	7	This table describes which Supporting Variants comprise a given Merged Variant. Each record in this table will include the Merged Variant ID and one of its corresponding Supporting Variant IDs; there will be 'n' number of records for every 'n' number of Supporting Variants that comprise a Merged Variant. Note: The Variant table stores both Supporting and Merged Variant records. One Merged Variant is comprised of one or more Supporting Variants; Supporting Variants are not always reported in a Study (i.e. sometimes only the Merged, or "asserted" Variants are reported). This, the 'Sample_Merged_Variant' table resolves the one-to-many relationship of Supporting to Merged Variants that exist in the 'Variant' table. Both dbVar and DGVa describe what DGV calls a "Merged Variant" as an "asserted Variant" (SV; e.g. esv12345) and all three systems call the Variants the comprise a Merged Variant, "Supporting Variants" (SSV; e.g. nssv12324).
Translocation_Mapping	Translocation_Mapping_ID	29	This table captures the mapping coordinates of a Translocation Variant or any other Variant that may have mappings across two Chromosomes. e.g. translocation, transposon A translocation can span more than one Chromosome, creating a one-to-many relationship between a Variant and its respective mapping records. Note: For other Variant Types, their respective mappings are stored in the 'Variant_Mapping' table. Translocation Mapping information is not likely to be provided in a Study until technology is refined enough to detect such events.
Variant	Variant_ID	23	This table captures all Variants reported in Studies that have been loaded to the DGV. This table stores both Supporting and Merged Variants. For Studies that were accessioned by DGVa, the Variants are assigned a DGVa accession number that starts with "e"; for Studies that were accessioned by dbVar, the Variants are assigned a dbVar accession number that starts with "n". For example, a Supporting Variant, in DGVa will be assigned an accession number like "essv12345"; merged Variants, or what DGVa and dbVar call "asserted" Variants, would be assigned an accession numer like "esv45678". For dbVar, these examples, instead, would be, respectively, "nssv12345" and "nsv45678".
Variant_Analysis	Variant_Analysis_ID	7	One or more Variants can be related to one or more 'Dataset_Analysis' records. This table resolves the M:M relationship between the 'Dataset_Analysis' and 'Variant' tables. This is in order to support the relationships defined in the dbVar/DGVa XSD. For each 'Dataset_Analysis' record to which a Variant is associated, this table will store a record for each instance of this. For example, Variant_1 is associated to Dataset_Analysis 1, 2, and 3. So in the 'Variant_Analysis' table there would be three records, each containing the same 'Variant_ID' and each with a different 'Dataset_Analysis_ID', i.e. 1, 2, or 3.
Variant_Mapping	Variant_Mapping_ID	29	This table captures the mapping coordinates of a Variant. The purpose of capturing these data in a separate table is to afford mappings to any of the human genome reference assemblies (e.g. hg17, hg18) for a given Variant
Variant_Type	Variant_Type_ID	21	This table contains a list of all possible Variant Types that can be attributed to a given Variant. A Variant could have more than one type, e.g. Inversion + Deletion. As technologies develop further, more refined Variant Type descriptions could arise. This table will allow for multiple Variant Types to be ascribed to one Variant. NB The ICGC nomenclature already describes Variant Types with sub-categories. This is an example of how a Variant could be described with multiple Variant Types.
Variant_Xref	Variant_Xref_ID	9	This table captures any external identifers used to reference a Variant that exists in the DGV. For example, DGVa 'esv' (asserted variant) and 'essv' (supporting variant) accession numbers would be recorded here. Those identifiers would cross-referecnce Variants in the DGVa. This table can also capture IDs from external databases for records that relate to a given Variant in this table. For example, if a Variant's sequence has been deposited to NCBI's Trace Archive (formally discontinued in February, 2011), that sequence will receive one or more TEMPLATE_IDs, e.g. TEMPLATE_ID = G248P82926C7. In this example, the 'External_ID_Source' = "Trace Archive" and the 'External_Variant_ID' = "G248P82926C7", with the corresponding 'Variant_ID' for that Variant from the 'Variant' table.