dbGaP Molecular Data Submission Guide

Summary of the Molecular Data Submission Process

If you have a question, search through the commonly asked questions below. Otherwise, start with Which data types can be submitted as "Molecular data" to the dbGaP Submission Portal?

1. Which data types can be submitted as "Molecular Data" to the dbGaP Submission Portal?

Molecular data are data generated with molecular technologies (e.g., DNA/RNA/protein microarrays, DNA/RNA/protein sequencing, PCR), with the exception of BAM, CRAM, and FASTQ data. Do not submit BAM, CRAM, or FASTQ files as the "Molecular Data" type to the dbGaP Submission Portal. High throughput human sequence data and alignment information should be submitted through a separate process: High throughput sequencing submission instructions. For the specific requirements of each Molecular Data (non-SRA) type, see the corresponding question below.

2. When, where, and how should Molecular data be submitted?

Molecular data should be submitted to the dbGaP Submission Portal under the section "Other files" with type "Molecular Data". It should be submitted along with the phenotype data or as early as possible so that it enters a dbGaP genotype curator's queue.

Please include a README with a brief description of the data that you are submitting. It should minimally include genotyping steps, genome build, and technology if applicable.

To compress and bundle files, zip first then tar. Do not tar first then zip as this will significantly delay the processing time.
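The "compress first, then tar" order can be sketched as follows. This is a minimal illustration, not a dbGaP tool: it assumes that "zip" here means compressing each file on its own (gzip is used as a stand-in), and the function name `compress_then_tar` is hypothetical. VCFs should be compressed with bgzip instead, as described below.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def compress_then_tar(files, tar_path):
    """Gzip each file on its own, then bundle the .gz files in a plain tar."""
    compressed = []
    for name in files:
        src = Path(name)
        dst = src.with_name(src.name + ".gz")
        # Assumption: per-file gzip stands in for the "zip" step.
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        compressed.append(dst)
    # Mode "w" writes an uncompressed tar, so nothing is compressed twice.
    with tarfile.open(tar_path, "w") as tar:
        for path in compressed:
            tar.add(path, arcname=path.name)
    return [p.name for p in compressed]
```

Because the tar itself is left uncompressed, individual members can be read without decompressing the whole bundle.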

For VCFs, the files should be compressed using bgzip instead of zip, as bgzip's block compression method can be used directly with VCFtools and BCFtools. This enables dbGaP to run QC checks quickly and report any errors back to you. For VCF files larger than 300GB, please split by chromosome, then tar the set of VCFs and submit as a single tarball.

3. What are the Sample ID requirements for all individual level "Molecular Data"?

Essential requirement: Sample IDs must be de-identified. Every sample ID found in an individual level Molecular Data file must be mapped to a consented subject in the Subject Sample Mapping (SSM) dataset. See SAMPLE_ID in Glossary for full requirement details. Sample IDs that do not follow the requirements will not be processed. If sample IDs are modified, please also modify the corresponding Sample Attributes dataset.

  • The sample ID is ideally the final aliquot used for a sequencing run or well on an array plate. A person with a given subject ID can have many samples.
  • If a sample ID is a technical control such as a Coriell HapMap sample or a publicly available control, it must be mapped to a subject ID in the Subject Sample Mapping (SSM) DS and that subject ID must be explicitly marked as CONSENT=0 in the Subject Consent DS.
  • Single cells or multiplexed single cells should each be given a unique sample ID.
  • Sample IDs in sequence derived genotypes (VCFs) must be identical to the sample IDs used in the corresponding sequence data (BAMs).
  • Include a File Sample Mapping (FSM) file to map sample IDs to single sample data files.
  • Include a README to describe the content of the data files and any QC anomalies, especially if the content is not in one of the formats listed and fits into the "Other" category.
  • Check that files are not truncated.
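The mapping requirement above can be checked before submission with a short script. The sketch below assumes a tab-delimited SSM table whose columns are named SUBJECT_ID and SAMPLE_ID (SUBJECT_ID is an assumed column name); the helper names `load_ssm` and `unmapped_sample_ids` are hypothetical.

```python
import csv


def load_ssm(handle):
    """Read a tab-delimited SSM table into {SAMPLE_ID: SUBJECT_ID}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["SAMPLE_ID"]: row["SUBJECT_ID"] for row in reader}


def unmapped_sample_ids(data_sample_ids, ssm):
    """Return sample IDs found in a data file but absent from the SSM."""
    return sorted(set(data_sample_ids) - set(ssm))
```

An empty result means every sample ID in the data file has an SSM entry. It does not by itself confirm that the mapped subject is consented.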

4. How should Genotype Array data be formatted?

PLINK formatted genotype files are the preferred format for submitting genotype array data. They are submitted as binary (.bed/.bim/.fam) or text (.map/.ped or .tfam/.tped) sets. Please see http://zzz.bwh.harvard.edu/plink/ and https://www.cog-genomics.org/plink/1.9 for PLINK specifications. The alleles should be encoded as ACGT for automated processing; otherwise, please be prepared for a longer processing time. Raw genotype data (Illumina .idat and Affymetrix .cel) should also be submitted if available. If Illumina's individual genotype reports or comparable reports are submitted without PLINK formatted sets, dbGaP will generate a PLINK formatted multisample set from the reports to include with the submitted files. Please do not submit VCFs for chip data.

  • See sample ID requirements. Every sample ID found in an individual level Molecular data file must be mapped to a consented subject in the Subject Sample Mapping (SSM) dataset. If each sample has multiple files, you may also create a File Sample Mapping (FSM) that has one column for the sample ID and the other column for the full filename including extensions.
  • Marker or Probe

    • Provide the manufacturer's array manifest if available. This should include the .bpm or text readable file, which contains the SNP or probe content on the array or assay. dbGaP will provide novel array manifests to dbSNP. Columns marked with * are required.
      col_num col_name
      1 IlmnID [unique ID]*
      2 Name [Marker name]*
      3 IlmnStrand
      4 SNP*
      5 AddressA_ID
      6 AlleleA_ProbeSeq*
      7 AddressB_ID
      8 AlleleB_ProbeSeq*
      9 GenomeBuild*
      10 Chr*
      11 MapInfo*
      12 Ploidy
      13 Species
      14 Source
      15 SourceVersion
      16 SourceStrand
      17 SourceSeq
      18 TopGenomicSeq
      19 BeadSetID
      20 Exp_Clusters
      21 RefStrand*
    • If not available, submit an array annotation file with comprehensive marker/probe information (SNP, flanks, chr, position, genome build, reference strand, etc.)
    • The marker or probe information will be included in the release as a 'sample-info' component.
  • PLINK

    • .bed files
    • .fam/.ped/.tfam files
      • Annotate with IIDs (sample IDs) using the same sample IDs listed in the SSM
      • If dataset contains duplicated samples, create a sample-level PLINK set. List the subject IDs as the Family IDs and the sample IDs as the IIDs.
      • Indicate which of the duplicates you recommend to use for GWAS or analyses. A list of duplicates can be submitted as a separate file.
    • .bim/.map/.tped files
      • Annotate variants with ACGT alleles
      • DO NOT manually modify marker level information in the BIM file
      • Sample and marker filter (.keep, .extract) files may be provided
  • Raw Genotype Data (Illumina .idat or Affymetrix .cel)
    • Provide single sample genotypes in the format of .idat or .cel files
    • .idat files should include both green and red intensity files
    • Provide a File Sample Mapping (FSM) file which explicitly maps each report name to the sample ID listed in the SSM
  • Genotype Reports
    • Individual genotype reports may be single sample or multisample reports
    • Provide Illumina's final reports or comparable reports. Required columns: SNP Name, Sample ID, alleles, intensities, genotype call quality scores, B allele frequencies, and other relevant information
    • Provide dictionary to describe columns
    • When PLINK formatted genotype files are not provided, single sample or multisample reports will be combined into a single PLINK formatted multisample set for release.

Example of a single sample header and report from Illumina

[Header]
GSGT Version,1.9.4
Processing Date,2/25/2014 4:59 AM
Content,HumanOmni5Exome-4v1-1_A.bpm
Num SNPs,4641218
Total SNPs,4641218
Num Samples,1200
Total Samples,4181
File,534 of 1200
[Data]

Red is required and blue is recommended.

SNP Name
GC Score
Allele1 - Forward
Allele2 - Forward
Allele1 - Top
Allele2 - Top
Allele1 - Design
Allele2 - Design
Allele1 - AB
Allele2 - AB
Theta
R
X intensity
Y intensity
X Raw
Y Raw
B Allele Freq
Log R Ratio
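The header block of such a report can be read programmatically, for example to confirm the manifest name and sample counts before submission. The sketch below assumes the layout shown in the example above (comma-separated key/value lines between `[Header]` and `[Data]`); the function name `parse_report_header` is hypothetical.

```python
def parse_report_header(lines):
    """Collect the key,value pairs between [Header] and [Data]."""
    header = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "[Header]":
            in_header = True
        elif line == "[Data]":
            break
        elif in_header and line:
            # Split on the first comma only, so values may contain commas.
            key, _, value = line.partition(",")
            header[key] = value
    return header
```

All values are returned as strings; converting counts such as "Num SNPs" to integers is left to the caller.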

  • QC your data to identify sample switches, contaminated DNA, unexpected duplicates and relatedness, and samples with high MCR
    • Exclude sample IDs and markers without genotype calls, where missing call rate (MCR) = 100%. Run PLINK command --missing
    • Verify genotype sex and phenotype sex are identical. Run PLINK command --check-sex. Resolve by excluding problematic sample IDs or providing evidence in the form of a README for samples with known sex chromosome anomalies.
    • Verify IBD results from the PLINK set are consistent with known relationships provided in the Pedigree DS. If there is more than one PLINK set, merge the sets before running IBD checks. Run PLINK commands --freq and --genome. Note that monozygotic twins should be marked in the Pedigree DS. Correct issues with unexpected duplicates or relatedness in the pedigree and genotype data, OR provide a README documenting the issue and the reason why it cannot be resolved.
  • Include description and data (IBD results, .genome file, thresholds, etc.) resulting from your data cleaning process. The QC data will be included in the release as a 'genotype-qc' component.
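As a rough pre-check before running PLINK, per-sample missing call rates and allele coding can be inspected directly in a text .ped file. This is a minimal sketch, not a replacement for `--missing`: it assumes the standard six leading .ped columns (FID, IID, PID, MID, SEX, PHENO) followed by allele pairs, with "0" as the missing allele code, and the function name `ped_missing_call_rates` is hypothetical.

```python
def ped_missing_call_rates(ped_lines):
    """Return ({sample_id: MCR}, allele codes that are not A/C/G/T/0)."""
    valid = {"A", "C", "G", "T", "0"}
    rates = {}
    odd_codes = set()
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue
        iid = fields[1]
        alleles = fields[6:]
        # Alleles come in pairs; a genotype is missing if either allele is "0".
        pairs = list(zip(alleles[0::2], alleles[1::2]))
        missing = sum(1 for a, b in pairs if a == "0" or b == "0")
        rates[iid] = missing / len(pairs) if pairs else 1.0
        odd_codes.update(a for a in alleles if a not in valid)
    return rates, odd_codes
```

Samples with a rate of 1.0 have no genotype calls and are candidates for exclusion; a non-empty set of odd codes suggests the alleles are not encoded as ACGT.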

5. How should SNP, CNV, and structural variants derived from sequence data be formatted?

The Variant Call Format (VCF) is the preferred format to submit SNP, CNV, and structural variants. VCFs can be derived from Whole Genome Sequences (WGS), Whole Exome Sequences (WXS), or targeted sequences (Targeted-Capture or OTHER). Please see https://samtools.github.io/hts-specs/ for VCF specifications.

  • See sample ID requirements. Every sample ID found in an individual level Molecular data file must be mapped to a consented subject in the Subject Sample Mapping (SSM) dataset.
  • Marker or Probe
    • Provide Marker Annotations as a separate file from the VCFs and include gene, gene_family, other identifiers and details
  • VCF Header
    • Information relevant for genotype calls should not be included in the Marker Annotations, but rather in the VCF header: ##INFO
    • Genome Build (e.g., GRCh38) should be included in the VCF header: ##reference
    • Exclude long internal paths to the individual data
  • Multisample VCFs vs. single sample VCFs
    • Multisample VCFs are preferred when variants are called across many samples.
    • Submit single sample VCFs only when the project calls variants for each sample independently against a reference genome and the variants are not compared across samples.
    • Merging single sample VCFs that were called individually against the reference into multisample VCFs is recommended ONLY if the KNOWN homozygous reference genotypes can be included for all samples/variants covered by sequencing. As per the VCF specification, the genotype string './.' should be used for any UNKNOWN genotypes.
    • Sequence derived genotypes should be identified with the same sample IDs as the sequence data (.bam, .cram, .fastq) they were derived from
    • Final VCFs can be processed by standard processing software (PSEQ, BCFTOOLS, VCFTOOLS, TABIX)
      • If possible, submit tabix indexes along with the VCFs
      • Use bgzip to compress VCF files
    • Set FILTER=PASS for markers with high quality data
    • Adhere to VCF specifications for missing data (including chrX genotypes for male samples)
  • QC VCFs to identify sample switches, contaminated DNA, unexpected duplicates and relatedness, and samples with high MCR
    • Exclude sample IDs and markers without genotype calls, where missing call rate (MCR) = 100%. Run PLINK command --missing
    • Verify genotype sex and phenotype sex are identical. Run PLINK command --check-sex. Resolve by excluding problematic sample IDs or providing evidence in the form of a README for samples with known sex chromosome anomalies.
    • Verify IBD results from VCFs are consistent with known relationships provided in the Pedigree DS. If there is more than one multisample or single sample VCF, merge the VCFs before running IBD checks. Note that monozygotic twins should be marked in the Pedigree DS. Correct issues with unexpected duplicates or relatedness in the pedigree and genotype data, OR provide a README documenting the issue and the reason why it cannot be resolved.
    • dbGaP will create PLINK files from the VCFs to run QC checks, but will not release temporarily generated PLINK files.
  • Include description and data (IBD results, thresholds, etc.) resulting from your data cleaning process. The QC data will be included in the release as a 'genotype-qc' component.
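Two of the header requirements above (a `##reference` line and sample IDs that match the SSM) can be inspected without specialized tools. The sketch below is illustrative only and the function name `inspect_vcf_header` is hypothetical; it relies on the VCF layout in which nine fixed columns (CHROM through FORMAT) precede the sample columns on the `#CHROM` line.

```python
def inspect_vcf_header(lines):
    """Return (reference value or None, sample IDs) from VCF header lines."""
    reference = None
    samples = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##reference="):
            reference = line[len("##reference="):]
        elif line.startswith("#CHROM"):
            # Nine fixed columns (CHROM..FORMAT) precede the sample IDs.
            samples = line.split("\t")[9:]
            break
    return reference, samples
```

For a bgzip-compressed file, the lines can be supplied by `gzip.open(path, "rt")` from the Python standard library, since bgzip output is gzip-compatible. The returned sample IDs can then be compared against the SSM.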

For VCFs, the files should be compressed using bgzip instead of zip, as bgzip's block compression method can be used directly with VCFtools and BCFtools. This enables dbGaP to run QC checks quickly and report any errors back to you. For VCF files larger than 300GB, please split by chromosome, then tar the set of VCFs and submit as a single tarball.

6. How should Imputations be formatted?

Imputed genotype data can be submitted if they were generated with PLINK, BCFTOOLS, VCFTOOLS, IMPUTE2, or MACH/MINIMAC. Please discuss with a genotype curator if another format needs to be submitted.

  • See sample ID requirements. Every sample ID found in an individual level Molecular data file must be mapped to a consented subject in the Subject Sample Mapping (SSM) dataset.
  • Name and version of the reference panel used for imputations should be included in the Experiment Description or Report file
  • If not in the Experiment Description or Report file, include separately in a README
    • Genotype set that was used as input
    • Software used
    • Thresholds or filters that were applied
  • Large datasets, greater than 10GB, should be split by chromosomes for faster processing time

7. How should Expression and Epigenetic data be formatted?

RNA microarray, RNA-seq derived expression, and methylation data may be submitted in the form of expression/methylation levels, exon/transcript/gene reads (number of reads overlapping a given feature such as an exon/transcript/gene), RPKMs (reads per kilobase million), or TPKMs (transcripts per kilobase million). If your data do not require controlled access, please submit to NCBI GEO, which is an unrestricted access database. To submit to dbGaP:

  • See sample ID requirements. Every sample ID found in an individual level Molecular data file must be mapped to a consented subject in the Subject Sample Mapping (SSM) dataset.
  • Data should be tab-delimited text-formatted multisample matrices that have markers listed as the first column in each row and samples as column headers.
  • There should be the same number of columns in each row.
  • Meta information and datasets should be submitted together to make submission MIAME compliant: https://www.ncbi.nlm.nih.gov/geo/info/MIAME.html.
  • Meta information as txt formatted files should include:
    • General description of the experiments and datasets
    • Normalization procedures
    • Sample and marker filters
  • For arrays submitted as Illumina .idat or Affymetrix .cel files, follow instructions under "Raw Genotype Data" above.
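The matrix layout described above (tab-delimited, markers in the first column, samples as column headers, the same number of columns in every row) lends itself to a simple automated check. The sketch below is one possible way to do this; the function name `check_matrix` and the message wording are hypothetical.

```python
def check_matrix(lines, known_sample_ids):
    """Return a list of problems found in a tab-delimited multisample matrix."""
    problems = []
    header = lines[0].rstrip("\n").split("\t")
    # The first header cell labels the marker column; the rest are sample IDs.
    unknown = [s for s in header[1:] if s not in known_sample_ids]
    if unknown:
        problems.append("sample IDs not in SSM: " + ", ".join(unknown))
    for number, raw in enumerate(lines[1:], start=2):
        count = len(raw.rstrip("\n").split("\t"))
        if count != len(header):
            problems.append(
                f"line {number}: {count} columns, expected {len(header)}"
            )
    return problems
```

An empty list indicates that the header samples are all known and every row has the expected width. Blank trailing lines would be reported as short rows.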

8. How should Somatic and/or Germline Mutation Annotations be formatted?

Mutation Annotation Format (MAF) is a tab-delimited text file with aggregated mutation information from VCF files. It is used to describe genomic variations between tumor-normal tissues in cancer research. The column headers can be found here.

  • See sample ID requirements. Every sample ID found in an individual level Molecular data file must be mapped to a consented subject in the Subject Sample Mapping (SSM) dataset.
  • If any column in a submitted MAF file is different from those in the description, provide a data dictionary for your MAF file.

9. How should Molecular data, including -omics data, in non-standard format be formatted?

This section applies to molecular data that cannot be submitted in any of the formats listed above, for example, individual and summary level data, -omics, single cell, UCSC BED format, and gVCFs.

  • See sample ID requirements. Every sample ID found in an individual level Molecular data file must be mapped to a consented subject in the Subject Sample Mapping (SSM) dataset.
  • Provide platform, probe/marker set or genome build for the reference genome in a README or with the experiment description.
  • Include a File Sample Mapping (FSM) file to map sample IDs to single sample data files. For multisamples, mark the column containing sample IDs and let us know how to split the content on sample IDs if it is not obviously labeled. Generally for multisample matrices, data should be tab-delimited, text-formatted, and have markers listed as the first column in each row and samples as column headers.
  • There should be the same number of columns in each row.
  • Provide data dictionary for the column headers.
  • Include chr/pos/ref_allele/alt_allele and any number of relevant meta information.
  • Remove failed and excluded samples (do not mark/highlight) from data sheets.
  • Indicate which files are supplementary or summary-level results.
  • Large datasets, greater than 10GB, should be split by chromosomes for faster processing time
  • Include description and data (IBD results, .genome file, thresholds, etc.) resulting from your data cleaning process. The QC data will be included in the release as a 'genotype-qc' component.
  • Data that cannot be QC'ed will be split by consents and packed.

10. What are common errors to check for and what will happen after I submit Molecular data?

We expect submitters to have checked all of the consistency issues below prior to submitting. dbGaP genotype curators will verify and package the molecular data for release. Any exceptional case, such as loss of heterozygosity (LOH), Mendelian violations that cannot be resolved, or policy issues with distributing pedigree information, should be submitted with an additional README or other form of documentation. Please specify whether this documentation can be provided to dbGaP users or should stay internal to dbGaP.

Sample identity is verified

  • Expected or Unexpected duplicates: samples found to have nearly identical genotypes are expected to belong to the same person unless the samples belong to a set of twins demarcated in the Pedigree DS. If a set of samples are expected duplicates, it means that the subject IDs (aka individual identifier (IID)) linked to the sample IDs in the SSM DS will be identical.
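The rule above can be expressed as a small check on the output of a duplicate search. This sketch assumes the caller already has a list of sample pairs with nearly identical genotypes, an SSM mapping of sample ID to subject ID, and a list of twin subject pairs from the Pedigree DS; the function name `unexpected_duplicates` is hypothetical.

```python
def unexpected_duplicates(duplicate_pairs, ssm, twin_pairs=()):
    """Return duplicate sample pairs that are neither same-subject nor twins."""
    twins = {frozenset(pair) for pair in twin_pairs}
    flagged = []
    for sample_a, sample_b in duplicate_pairs:
        subject_a, subject_b = ssm[sample_a], ssm[sample_b]
        if subject_a == subject_b:
            continue  # expected duplicate: both samples map to one subject
        if frozenset((subject_a, subject_b)) in twins:
            continue  # twins demarcated in the Pedigree DS
        flagged.append((sample_a, sample_b))
    return flagged
```

Any pair returned here would need to be corrected in the data or explained in a README.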

Sex of the samples is verified

  • Sex is checked using PLINK software and/or dbGaP GRAF using X chromosome heterozygosity rates and verified against the phenotype data if sex is reported.
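Submitters who run `--check-sex` themselves can list the discordant samples from the resulting table. The sketch below assumes the PLINK 1.9 .sexcheck layout, a whitespace-delimited table with a header containing IID and STATUS columns where STATUS is OK or PROBLEM; the function name `sexcheck_problems` is hypothetical.

```python
def sexcheck_problems(lines):
    """Return IIDs whose STATUS is PROBLEM in a PLINK .sexcheck table."""
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    # Locate columns by name rather than position.
    iid = header.index("IID")
    status = header.index("STATUS")
    return [row[iid] for row in body if row[status] == "PROBLEM"]
```

Each listed sample should either be excluded or documented in a README, as described in the QC steps above.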

Pedigree relations are verified

  • Pedigree relations are checked using IBD and/or dbGaP GRAF and verified against the phenotype data if a Pedigree DS is provided.

SNP filtering

  • Minor allele frequencies (MAF), missing call rates (MCR), and Mendelian errors are checked using PLINK and other software.

Ancestry-specific allele frequencies are verified

  • dbGaP subjects that have genomic data and have been designated "non-sensitive" for release of Genomic Summary Results (GSR) in the dbGaP Submission System will also be analyzed using GRAF-pop and included in the ALFA (Allele Frequency Aggregator) project. Studies may be contacted to correct the submitted data or provide a README if:

    1. They contain allele frequencies that deviate from the expected range of known allele frequencies for the 12 diverse populations and/or
    2. The submitted ancestry or population deviates from the computed ancestry for a large number of samples.

Results of the checks may require the submitter to correct molecular data, phenotype data, or both. The most common error is that the IDs do not match between the molecular data and the phenotype data. Other common errors include missing samples and chromosomes, data to sample mapping errors, and data formatting errors.

Once all the QC checks pass, the individual level genotype data will be parsed by consents as demarcated in the Subject Consent DS and packed in a tar. Publicly available controls such as Coriell HapMaps will be included in a separate .MULTI tar file if there are multiple consent groups or with the individual level data if there is a single consent group.
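To preview how samples would fall into consent groups, the SSM and Subject Consent DS can be joined on subject ID. This is only an illustration of the grouping idea, not dbGaP's actual packaging code; it assumes dictionaries built from those two datasets, and the function name `group_samples_by_consent` is hypothetical.

```python
from collections import defaultdict


def group_samples_by_consent(ssm, consent):
    """Group sample IDs by the consent value of their mapped subject.

    ssm maps sample ID -> subject ID; consent maps subject ID -> CONSENT.
    """
    groups = defaultdict(list)
    for sample_id, subject_id in ssm.items():
        groups[consent[subject_id]].append(sample_id)
    return {value: sorted(ids) for value, ids in groups.items()}
```

Samples that land under CONSENT=0 should be limited to technical or publicly available controls, per the sample ID requirements.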

All annotation and QC data that were submitted or generated by dbGaP to process and analyze the data are packed within download tars:

  • genotype-qc
  • sample-info
  • marker-info

Last updated: 2024-04-12T20:18:35Z