BioProject Hierarchy at Google Cloud Platform

ALPHA RELEASE: This resource is under active development. While we strive to maintain correctness, results may be unstable, unavailable, or incorrect at times. Please contact us by email at pd-help@ncbi.nlm.nih.gov before relying on this data for production analyses.

What data is available on the Google Cloud?

For a list of all resources see Pathogen Detection Resources at Google Cloud Platform

A dump of the BioProject hierarchy is available at Google Cloud Platform (GCP) in the Google BigQuery table ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy. The table includes all data BioProjects for isolates in the Pathogen Detection browsers, as well as any parent umbrella BioProjects they are linked to. It lets you identify all the isolates under a given parent BioProject. The same data is also available on our FTP site (see the ReadMe.txt for details).

Update Frequency

The bioproject_hierarchy table at Google Cloud BigQuery is updated daily. The same information is also updated daily on our FTP site at https://ftp.ncbi.nlm.nih.gov/pathogen/Results/BioProject_Hierarchy/ where the file latest.bioproject_hierarchy.txt holds the most recent dump.

Getting started with BigQuery

Our Getting started with BigQuery page has instructions on how to run queries with BigQuery.

What is ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy?

The bioproject_hierarchy table contains information about bioprojects and their parents. A given bioproject may have multiple parents and each parent may have multiple children, so it is not a strictly tree-like structure. The organization of the bioprojects and their membership is determined by submitters and is not generated by NCBI or Pathogen Detection; data and bioproject labelling may be inconsistent.
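
To illustrate why the hierarchy is a graph rather than a strict tree, here is a minimal Python sketch that groups (bioproject_acc, parent_bioproject_acc) pairs by child and reports which BioProjects have more than one parent. The accessions and rows are made up for illustration and are not records from the table; the function names are likewise hypothetical.

```python
# Hypothetical (bioproject_acc, parent_bioproject_acc) pairs.
# None stands in for a NULL parent, i.e. a bioproject with no parent.
ROWS = [
    ("PRJ_A", None),
    ("PRJ_B", None),
    ("PRJ_C", "PRJ_A"),
    ("PRJ_C", "PRJ_B"),  # PRJ_C has a second parent, so this is not a tree
    ("PRJ_D", "PRJ_C"),
]


def parents_by_child(rows):
    """Map each bioproject accession to the set of its direct parents."""
    parents = {}
    for child, parent in rows:
        direct = parents.setdefault(child, set())
        if parent is not None:
            direct.add(parent)
    return parents


def multi_parent_bioprojects(rows):
    """Return the accessions that are linked to more than one parent."""
    grouped = parents_by_child(rows)
    return sorted(acc for acc, direct in grouped.items() if len(direct) > 1)


print(multi_parent_bioprojects(ROWS))  # ['PRJ_C']
```

Any code that walks the hierarchy should therefore expect to reach the same BioProject by more than one path.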

Fields

Column                    Description
bioproject_id             BioProject ID
bioproject_acc            BioProject accession
bioproject_name           BioProject name
bioproject_title          BioProject title
top_organization          "Top organization" or primary organization associated with this submission
parent_bioproject_id      BioProject ID of the parent bioproject (if any) otherwise NULL
parent_bioproject_acc     BioProject accession of the parent bioproject (if any) otherwise NULL
parent_bioproject_name    BioProject name of the parent bioproject (if any) otherwise NULL
parent_bioproject_title   BioProject title of the parent bioproject (if any) otherwise NULL
parent_top_organization   Top organization of the parent bioproject (if any) otherwise NULL

Examples

Search for all the isolates belonging to a given umbrella bioproject

The following Google BigQuery Standard SQL query identifies all the isolates for the umbrella BioProject PRJNA514048:

-- Build every (ancestor, descendant) pair by following parent links recursively.
WITH RECURSIVE child_bioprojects AS (
  -- Base case: the direct parent/child links.
  SELECT parent_bioproject_acc, bioproject_acc
  FROM `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy`
  UNION ALL
  -- Recursive step: link each child of a known descendant to the same ancestor.
  SELECT b.parent_bioproject_acc, a.bioproject_acc
  FROM `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy` a
  JOIN child_bioprojects b ON b.bioproject_acc = a.parent_bioproject_acc
)
-- DISTINCT guards against repeated rows, since a bioproject with several
-- parents can appear on several rows of the hierarchy table.
SELECT DISTINCT cb.parent_bioproject_acc, parent_bp.top_organization AS parent_organization,
  cb.bioproject_acc, child_bp.bioproject_name, child_bp.top_organization,
  isolates.target_acc, isolates.taxgroup_name, isolates.biosample_acc
FROM child_bioprojects cb
JOIN `ncbi-pathogen-detect.pdbrowser.isolates` isolates ON isolates.bioproject_acc = cb.bioproject_acc
JOIN `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy` parent_bp ON parent_bp.bioproject_acc = cb.parent_bioproject_acc
JOIN `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy` child_bp ON child_bp.bioproject_acc = cb.bioproject_acc
WHERE cb.parent_bioproject_acc = 'PRJNA514048'
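
For readers who would rather do the traversal client-side, for example after loading the hierarchy dump into memory, the following Python sketch shows one way to get the same kind of result. It assumes the hierarchy has already been reduced to (bioproject_acc, parent_bioproject_acc) pairs and the isolates to (target_acc, bioproject_acc) pairs. All accessions, rows, and function names are hypothetical, and this is not how NCBI computes the result.

```python
from collections import defaultdict, deque

# Hypothetical (bioproject_acc, parent_bioproject_acc) pairs; None means no parent.
HIERARCHY = [
    ("PRJ_UMBRELLA", None),
    ("PRJ_CHILD1", "PRJ_UMBRELLA"),
    ("PRJ_CHILD2", "PRJ_UMBRELLA"),
    ("PRJ_GRANDCHILD", "PRJ_CHILD1"),
    ("PRJ_GRANDCHILD", "PRJ_CHILD2"),  # reachable through two parents
    ("PRJ_OTHER", None),
]

# Hypothetical (target_acc, bioproject_acc) pairs standing in for the isolates table.
ISOLATES = [
    ("ISO_1", "PRJ_CHILD1"),
    ("ISO_2", "PRJ_GRANDCHILD"),
    ("ISO_3", "PRJ_OTHER"),
]


def descendants(hierarchy, umbrella):
    """Return every bioproject reachable from `umbrella` via parent links."""
    children = defaultdict(set)
    for child, parent in hierarchy:
        if parent is not None:
            children[parent].add(child)

    seen = set()  # avoids revisiting a bioproject reached by several paths
    queue = deque([umbrella])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def isolates_under(hierarchy, isolates, umbrella):
    """Return sorted target accessions whose bioproject descends from `umbrella`."""
    wanted = descendants(hierarchy, umbrella)
    return sorted(target for target, bioproject in isolates if bioproject in wanted)


print(isolates_under(HIERARCHY, ISOLATES, "PRJ_UMBRELLA"))  # ['ISO_1', 'ISO_2']
```

Like the SQL query above, this sketch returns isolates from descendant BioProjects only, not isolates attached directly to the umbrella BioProject itself.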