BioProject Hierarchy at Google Cloud Platform

ALPHA RELEASE: This resource is under active development. While we strive to maintain correctness, results may be unstable, unavailable, or incorrect at times. Please contact us by email at pd-help@ncbi.nlm.nih.gov before relying on this data for production analyses.

What data is available on the Google Cloud?
- Pathogen Detection Resources available on the Google Cloud
- Update Frequency
Getting started with BigQuery
What is ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy
Fields
Examples
- Search for all the isolates belonging to a given umbrella bioproject

What data is available on the Google Cloud?

For a list of all resources, see Pathogen Detection Resources at Google Cloud Platform.

A dump of the BioProject hierarchy is available on Google Cloud Platform (GCP) in the ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy table in Google BigQuery. The table includes every data BioProject for isolates in the Pathogen Detection browsers, along with any parent umbrella BioProjects they are linked to, so you can identify all the isolates under a given parent BioProject. The same data is also available on our FTP site (see the ReadMe.txt for details).

Pathogen Detection Resources available on the Google Cloud

Update Frequency

The bioproject_hierarchy table in Google Cloud BigQuery is updated daily. The information is also updated daily on our FTP site at https://ftp.ncbi.nlm.nih.gov/pathogen/Results/BioProject_Hierarchy/, where latest.bioproject_hierarchy.txt contains the most recent dump.

Getting started with BigQuery

Our Getting started with BigQuery page has instructions on how to run queries with BigQuery.

What is `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy`

The bioproject_hierarchy table contains information about BioProjects and their parents. A given BioProject may have multiple parents, and each parent may have multiple children, so the hierarchy is not strictly a tree. The organization of the BioProjects and their membership is determined by submitters and is not generated by NCBI or Pathogen Detection; data and BioProject labelling may be inconsistent.
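
To make the multi-parent structure concrete, here is a minimal Python sketch that collects every descendant of a BioProject from (parent, child) accession pairs. The accessions in `rows` are made-up placeholders rather than real BioProjects, and `descendants` is a hypothetical helper, not part of any NCBI tool. Because a child can be reached through more than one parent, the traversal tracks visited accessions so that each descendant is reported once.

```python
from collections import defaultdict, deque

# Hypothetical (parent_bioproject_acc, bioproject_acc) pairs; None means no parent.
rows = [
    (None, "PRJ_TOP"),
    ("PRJ_TOP", "PRJ_A"),
    ("PRJ_TOP", "PRJ_B"),
    ("PRJ_A", "PRJ_C"),
    ("PRJ_B", "PRJ_C"),  # PRJ_C has two parents, so this is not a tree
    ("PRJ_C", "PRJ_D"),
]


def descendants(rows, root):
    """Return the set of all BioProjects below `root`, following every parent link."""
    children = defaultdict(set)
    for parent, child in rows:
        if parent is not None:
            children[parent].add(child)

    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:  # skip nodes already reached via another parent
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(descendants(rows, "PRJ_TOP")))
# prints ['PRJ_A', 'PRJ_B', 'PRJ_C', 'PRJ_D']
```

PRJ_C appears once in the result even though two paths lead to it, which is the de-duplication a query over this table may also need.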

Fields

Column	Description
bioproject_id	BioProject ID
bioproject_acc	BioProject accession
bioproject_name	BioProject name
bioproject_title	BioProject title
top_organization	"Top organization" or primary organization associated with this submission
parent_bioproject_id	BioProject ID of the parent BioProject, or NULL if there is none
parent_bioproject_acc	BioProject accession of the parent BioProject, or NULL if there is none
parent_bioproject_name	BioProject name of the parent BioProject, or NULL if there is none
parent_bioproject_title	BioProject title of the parent BioProject, or NULL if there is none
parent_top_organization	Top organization of the parent BioProject, or NULL if there is none

Examples

Search for all the isolates belonging to a given umbrella bioproject

The following Google BigQuery Standard SQL query identifies all the isolates for umbrella BioProject PRJNA514048:

-- Recursively pair every BioProject with each of its ancestors.
WITH RECURSIVE child_bioprojects AS (
  -- Base case: direct parent/child links.
  SELECT parent_bioproject_acc, bioproject_acc
  FROM `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy`
  UNION ALL
  -- Recursive step: extend each known link by one more generation of children.
  SELECT b.parent_bioproject_acc, a.bioproject_acc
  FROM `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy` a
  JOIN child_bioprojects b ON b.bioproject_acc = a.parent_bioproject_acc
)
SELECT cb.parent_bioproject_acc, parent_bp.top_organization AS parent_organization,
  cb.bioproject_acc, child_bp.bioproject_name, child_bp.top_organization,
  isolates.target_acc, isolates.taxgroup_name, isolates.biosample_acc
FROM child_bioprojects cb
-- Attach isolates, then look up details for the ancestor and the child BioProject.
JOIN `ncbi-pathogen-detect.pdbrowser.isolates` isolates ON isolates.bioproject_acc = cb.bioproject_acc
JOIN `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy` parent_bp ON parent_bp.bioproject_acc = cb.parent_bioproject_acc
JOIN `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy` child_bp ON child_bp.bioproject_acc = cb.bioproject_acc
-- Keep only descendants of the umbrella BioProject of interest.
WHERE cb.parent_bioproject_acc = 'PRJNA514048'
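
To look up a different umbrella BioProject without editing the SQL, the accession can be passed as a query parameter. The sketch below is one possible way to do this from Python. It assumes the google-cloud-bigquery package is installed and that default Google Cloud credentials and a billing project are configured. `UMBRELLA_QUERY` and `isolates_for_umbrella` are hypothetical names. The query is a trimmed variant of the one above: it selects fewer columns and adds DISTINCT (an addition here, not part of the original query) to drop repeated rows that can arise when a BioProject is reachable through more than one parent.

```python
# Parameterized variant of the umbrella BioProject query; @umbrella_acc is
# filled in at run time instead of being hard-coded in the SQL text.
UMBRELLA_QUERY = """
WITH RECURSIVE child_bioprojects AS (
  SELECT parent_bioproject_acc, bioproject_acc
  FROM `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy`
  UNION ALL
  SELECT b.parent_bioproject_acc, a.bioproject_acc
  FROM `ncbi-pathogen-detect.pdbrowser.bioproject_hierarchy` a
  JOIN child_bioprojects b ON b.bioproject_acc = a.parent_bioproject_acc
)
SELECT DISTINCT cb.bioproject_acc, isolates.target_acc, isolates.biosample_acc
FROM child_bioprojects cb
JOIN `ncbi-pathogen-detect.pdbrowser.isolates` isolates
  ON isolates.bioproject_acc = cb.bioproject_acc
WHERE cb.parent_bioproject_acc = @umbrella_acc
"""


def isolates_for_umbrella(umbrella_acc):
    """Run UMBRELLA_QUERY for one accession and return the rows as dicts.

    Hypothetical helper: assumes google-cloud-bigquery is installed and
    that credentials and a default project are already configured.
    """
    # Imported lazily so this module loads even without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("umbrella_acc", "STRING", umbrella_acc)
        ]
    )
    result = client.query(UMBRELLA_QUERY, job_config=job_config).result()
    return [dict(row.items()) for row in result]
```

Calling `isolates_for_umbrella("PRJNA514048")` would submit the query to BigQuery under your own project, so the usual BigQuery quotas and charges apply.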