GRAF Software Documentation

GRAF (Genetic Relationship And Fingerprinting) is a package useful for analyzing and visualizing genotype data from genome-wide association studies. The latest version, GRAF 2.4, includes three main features:

    1. subject relationship inference (GRAF-rel);
    2. subject ancestry (or population structure) inference (GRAF-pop);
    3. subject sex determination using genotypes (GRAF-sex).
Both relationship and ancestry inferences are based on the genotypes of 10,000 pre-selected fingerprint (FP) SNPs extracted from the input dataset. Sex check is done using 1,000 pre-selected FP SNPs on X-chromosome. The GRAF package includes a main C++ program graf to calculate the relationships, predict subject ancestry and determine sexes, and four auxiliary Perl programs: PlotGraf.pl, PlotPopulations.pl and PlotSexCheckResults.pl to visualize the results, and SetRsIdsInBimFile.pl to convert SNP IDs in .bim file into RS IDs so that it can be taken by graf. Note that PlotGraf.pl, PlotPopulations.pl and PlotSexCheckResults.pl require that GD Graphics Library (http://search.cpan.org/~lds/GD-1.38/GD.pm) be installed.

The above executables and other files are included in the GRAF package GrafPkg.tar.gz and visible as separate files after the user executes the command:

    tar zxvf GrafPkg.tar.gz

GRAF-rel: inferring subject relationships using genotypes

GRAF-rel analyzes the genotypes of all the 10,000 fingerprinting SNPs (distributed as the file FP_SNPs.txt) and calculates the all genotype mismatch rate (AGMR) and the homozygous genotype mismatch rate (HGMR) for each pair of samples (Jin et al., 2017). AGMR is the percentage of SNPs on which the two genotypes are not identical, while HGMR is the genotype mismatch rate when only the SNPs with homozygous calls for both samples are considered.

graf compares the genotypes of all pairs of subjects and finds and reports the closely related pairs, while PlotGraf.pl takes the file generated by graf and plots graphs to show the distributions of HGMR and AGMR values.

Input files

In most usages, graf expects as input one or more genotype datasets in PLINK format, i.e., .bed, .bim, and .fam files that share a prefix in their names. However, since multiple samples can be collected from one subject and the subject sample mapping information is not stored in datasets in PLINK, graf reads subject sample mapping information from an SSM file and pedigree information from a pedigree file . The IDs (second column, no column header) in the PLINK .fam file are read as sample IDs by graf. If subject IDs are the same as sample IDs, then no SSM file is necessary, and graf will read the pedigree information from the PLINK .fam file.

However, if any of the sample IDs are different from their corresponding subject IDs, then an SSM file should be passed to graf. The SSM file should be a tab-delimited plain text file with a sample column and a subject column (with column headers—see dbGaP submission guide). When an SSM file is provided, a pedigree file (see dbGaP submission guide) should also be provided to pass the pedigree information to graf. The pedigree file should be a tab-delimited plain text file with at least the following 5 columns (with a column header row):

    1. FAMILY_ID
    2. SUBJECT_ID
    3. FATHER
    4. MOTHER
    5. SEX (1 = male; 2 = female; 0 or NULL = unknown)

    'SUBJECT_ID', 'FATHER', and 'MOTHER' are IDs of subjects (persons), not samples.

The SSM format is a two-column tab-delimited text file that establishes a mapping from sample IDs to subject IDs. The columns should have the headers SUBJECT_ID and SAMPLE_ID, respectively. An example SSM format file is included in the GRAF distribution with the name affy_hapmap_ssm.txt.

If there are identical twins in the datasets, the twin information should be entered in the optional 6th column MZ_TWIN_ID, where the same twin ID (can be an integer or a string) is used to indicate that subjects are identical twins. For example, if three subjects A, B, C are identical triplets, a unique ID, e.g., the integer 18, can be created for them and entered into the MZ_TWIN_ID column for subjects A, B, C.

The sample genotypes can also be stored in datasets with GRAF format. GRAF uses a single .fpg file to store the sample genotypes. An .fpg file is a plain text file with three columns: the first column is the dataset ID (integer) column; the second one is the sample ID column; and the third column stores sample genotypes in strings of hexadecimal numbers. Each hexadecimal number represents genotypes of two fingerprinting SNPs. The first hexadecimal number stores genotypes of the first two fingerprinting SNPs; the second number keeps genotypes of fingerprinting SNPs #3 and #4, and so on. If the hexadecimal number is converted to a binary number, then the first two bits keep the genotype of the first SNP and the last two bits are for the second SNP, with the following code meanings:

    00: 0 reference alleles
    01: 1 reference allele
    10: 2 reference alleles
    11: missing genotype

The .fpg file can be generated using the -geno option of the graf program and reused as input to the program in a subsequent run.

Included in the distribution are two sample datasets for which the file names have prefixes affy_hapmap and perlegen_hapmap. Both sets of sample files come in byte-encoded PLINK format, meaning that there are three files with suffixes {fam,bim,bed}.

Running graf to find closely related subjects

graf is a command line executable that can be run under GNU/LINUX 64 bit systems. Brief instructions are given when the program is executed without parameters:

$ graf

Usage: graf [options]
    -plink  PLINK set root:  File root of PLINK .bed, .bim, and .fam files
    -geno   fpg file:        Specify GRAF .fpg file
    -exfp   PLINK set list:  Extract fingerprinting genotypes from a list of PLINK sets (file roots) separated by commas
    -pop    output file:     Check subject populations and save results to the output file
    -sex    output file:     Do sex check using fingerprint genotypes on X-chromosome and save results to the output file
    -out    output file:     Output file to save the results
    -appd   DS No.:          Append extracted fingerprinting genotypes to the output file.  The integer is dataset No.
                             of the first PLINK set
    -ssrs   SS-RS mapping:   Specify SS# to RS# mapping file (Two columns: SS# and RS# without column headers)
    -ped    pedigree file:   Specify pedigree file of subject IDs (with column headers)
    -ssm    SSM file:        Specify dbGaP subject sample mapping file
    -maxhm  max HGMR value:  Specify maximum HGMR values for a pair of subjects to be reported by GRAF
    -xpmr   type:            Specify how expected HGMR and AGMR values are calculated for each type of relationship (default 1)
                             1: Use input dataset to calculate the expected HGMR and AGMR values
                             2: Use average HGMR and AGMR values in dbGaP database for the expected values
    -type   relation_type:   Specify relation type.  Acceptable values are 1, 2, 3, or 4 (default 3)
                             1: Find all duplicates and PO pairs
                             2: Find all duplicates, PO, and FS pairs
                             3: Find all duplicates, PO, FS, and second degree relatives
                             4: Compare all 10,000 SNPs to find all the related subjects

NOTE:
    1. Exactly one of the following two options should be selected: -plink or –geno.
    2. When option -exfp is selected, -out must also be selected and output file should have .fpg extension.
    3. When multiple PLINK sets are used, each dataset will be assigned an integer dataset ID starting with 1.
    4. The above PLINK set starting index can be specified using option -appd.
       When -appd is selected, the out file should be an existing GRAF .fpg file.
    5. Multiple datasets can be combined into a single geno file using the –exfp and –appd options.
    6. When multiple datasets are used, the program does pairwise comparisons to find related samples both within and across datasets.

Below are more detailed descriptions (with examples) of these options.

-plink

Allows the user to specify the name of the genotype dataset in PLINK .bed, .bim, .fam format. The parameter should be the file root of the plink set. In this example, graf will try to find the following three files: affy_hapmap.bed, aff_hapmap.bim, and affy_hapmap.fam. Example:

$ graf -plink data/affy_hapmap

-exfp

Extracts fingerprinting genotypes from multiple PLINK sets and saves the results to the file name specified by -out option. The datasets will be given integer dataset IDs starting from 1. The output file name should be new. Example:

$ graf -exfp data/affy_hapmap,data/perlegen_hapmap -out data/comb_hapmap.fpg

-exfp -appd

Extracts fingerprinting genotypes from a PLINK set and appends the results to an existing output file, with dataset ID specified by –appd option. Example (two steps):

$ graf -exfp data/affy_hapmap -out data/comb_hapmap2.fpg
$ graf -exfp data/perlegen_hapmap -out data/comb_hapmap2.fpg –appd 2

-geno

Allows the user to specify the name of the genotype dataset in GRAF format. Example:

$ graf -geno data/comb_hapmap.fpg

-ssm

Allows the user to specify the name of the subject sample mapping file in dbGaP format. When sample IDs are different from subject IDs, a subject sample mapping file is required. The subject sample mapping file should list all the sample IDs in the PLINK .fam file and their corresponding subject IDs. Example:

$ graf -plink data/affy_hapmap -ssm data/affy_hapmap_ssm.txt

-ped

Allows the user to specify the pedigree file in dbGaP format. When a pedigree file is specified with -ped option, graf will ignore the pedigree information in the PLINK .fam file and read the information from the pedigree file. The IDs in the pedigree file should be subject IDs. This option can take only one dataset at a time. Example:

$ graf -plink data/affy_hapmap -ssm data/affy_hapmap_ssm.txt -ped data/affy_hapmap_fake_pedigree.txt

-out

Allows the user to specify the name of the output file for saving the related pairs of samples detected by graf. If the output file is not specified, the output will be saved to a default file graf_rel_yyyymmdd_hhmm.txt, where yyyymmdd_hhmm is the current local time in this format. Example:

$ graf -plink data/affy_hapmap –out data/aff_hapmap_rels.txt

-maxhm

Sets the maximum HGMR value for related pairs outputted by graf. Subject pairs with HGMR greater than this value will be treated by graf as unrelated and will not be saved to the output file. The default maximum HGMR is 20. Example:

$ graf -plink data/affy_hapmap –out data/aff_hapmap_rels_m_15.txt –maxhm 15

-xpmr

Allows the user to specify how the expected HGMR and AGMR values are calculated. For each pair of subjects, graf estimates the allele frequency distribution of the fingerprinting SNPs of the population where the subjects are sampled, and then uses these allele frequencies to calculate the expected HGMR and AGMR values. Assuming all of the subjects in the input file(s) are sampled from the same population, graf uses the allele frequencies of all subjects in the input datasets to estimate the allele frequencies in the population. In cases when the sample size is small (fewer than 100 subjects) in the input datasets, graf uses the allele frequencies of all the subjects in dbGaP Fingerprint Collection to estimate the population allele frequencies. The user can use -xpmr option (1 or 2) to let graf choose one of the above two options to estimate the population allele frequencies. When the selection -xpmr 1 is combined with choices of -geno or -exfp that combine multiple datasets, then the allele frequencies are combined as a weighted average of all the participating datasets and the same weighted average is used for all pairwise comparisons. Example:

$ graf -plink data/affy_hapmap –xpmr 2

-type

Usage of graf involves a tradeoff between running time and prediction accuracy. To obtain high sensitivity, the program needs to check more SNPs, at the expense of a longer running time. The -type option allows the user to specify the relative type for which graf should try to find all the pairs. The type should be an integer from 1 to 4, with the code meanings shown in the above short description. The greater the type value, the more SNPs graf will check, and hence the more related samples it will find and the more time it will spend. The default type value is 3. Example:

$ graf -plink data/affy_hapmap –type 2

-ssrs

When the marker IDs in the PLINK .bim file are SS IDs, the user can use the -ssrs option to specify an SS to RS mapping file so that graf can convert the SS IDs to RS IDs. Example (assuming PLINK set DsWithSs.* exists):

$ graf -plink DsWithSs –SsToRs.txt

Output files

graf requires that an input genotype file, either in PLINK format (with -plink option) or in GRAF format (with -geno option) should be specified. When -exfp option is selected, the -out option should also be selected to specify the name of the output file. The output file is the genotype dataset in GRAF format (.fpg file), as described above.

The output file should have the extracted genotypes of the fingerprinting SNPs and can be passed back to graf as an input file in a later run.

When -exfp option is not selected, graf will use the genotype information in the input genotype dataset, find the related subjects or determine population structures, and will save the results to the output file.

If any related subjects are found by GRAF-rel, the results will be saved to the output file, which is a plain text file with the following columns:

            Sample1:       ID of the first sample in each pair
            Sample2:       ID of the second sample in each pair
            Subject1:      subject ID of the first sample in each pair
            Subject2:      subject ID of the second sample in each pair
            Sex11:         gender of the first subject in each pair, 1=male; 2=female
            Sex12:         gender of the second subject in each pair, 1=male; 2=female
            HG match:      number of SNPs with matched genotypes when only homozygous SNPs are counted
            HG miss:       number of SNPs with mismatched genotypes when only homozygous SNPs are counted
            HGMR:          Homozygous Genotype Mismatch Rate (%)
            AG match:      number of SNPs with matched genotypes when all SNPs are counted
            AG miss:       number of SNPs with mismatched genotypes when all SNPs are counted
            AGMR:          All Genotype Mismatch Rate (%)
            Geno relation: relationship determined by sample genotypes. See above for code meanings
            Ped relation:  relationship derived from subject sample mapping file and pedigree file (See Table 1 for code meanings).
            p_value:       probability that the genetic relationship is NOT the predicted type

Table 1. Pedigree relationships and the expected genetic relationships

Table1.

When multiple PLINK sets are checked pairwise, the output file will have two extra columns, DS1 and DS2, showing the dataset IDs for the pair of PLINK sets.

Running PlotGraf.pl to plot closely related subjects

PlotGraf.pl is a perl script that plots graphs to show the distributions of HGMR and AGMR values of the related pairs of subjects. It shows brief instructions when it is executed without parameters:

$ PlotGraf.pl

Usage: PlotGraf.pl <input related subject file> <output png file> <graph type> [Options]

Note:
    Valid graph types are:
        1 = HGMR histogram
        2 = AGMR histogram
        3 = HGMR + AGMR scatter plot

Options:
    -gw     graph width:  Set graph width in pixels
    -gh     graph height: Set graph height in pixels
    -xmax   max x value:  Set maximum HGMR or AGMR on x-axis of the histogram
    -ymax   max y value:  Set maximum number of pairs on y-axis of the histogram
    -dot    size:         Set dot size in pixels on the scatter plot
    -hfd    size:         Set dot size in pixels for HF (half sibling + full cousin) pairs

It takes three required parameters. The first parameter should be the name of the file that is generated by graf and contains related subject pairs. The second is the output .png file which shows the graph. The third is an integer representing the graph type. The options should be entered after the required parameters. Below are some examples showing how to run the script.

$ graf -plink data/affy_hapmap -maxhm 15 -ssm data/affy_hapmap_ssm.txt -ped data/affy_hapmap_fake_pedigree.txt -out data/affy_hapmap_rels_15.txt
$ PlotGraf.pl data/affy_hapmap_rels_15.txt data/affy_hapmap_hgmr.png 1
$ PlotGraf.pl data/affy_hapmap_rels_15.txt data/affy_hapmap_agmr.png 2
$ PlotGraf.pl data/affy_hapmap_rels_15.txt data/affy_hapmap_scatter.png 3

In the first step the C++ program finds related pairs and saves the results to affy_hapmap_rels_15.txt. Then PlotGraf.pl takes the results and plots histograms to show distributions of HGMR values of the related subjects, AGMR values of the duplicates, and a scatter plot to show distribution of both values.

In both histograms, the colored bars represent different types of relationships derived from the SSM and pedigree files (see Table 1 for the meanings of the two-letter abbreviations). The cyan lines show the cutoff values suggested by GRAF to separate different types of relationships determined by comparing the genotypes. In the scatter plot, each contour line shows the area that is predicted to contain 95% of the pairs for each relatedness type, assuming that all 10,000 fingerprinting SNPs are genotyped for all of the subjects in a large, homogeneous, random mating population. Note that the HapMap samples were collected from human individuals from very different populations, and GRAF is more accurate when predicting relatedness for subjects from a homogeneous population.

$ graf -plink data/affy_hapmap -maxhm 15 -ssm data/affy_hapmap_fake_ssm.txt -ped data/affy_hapmap_fake_pedigree.txt -out data/affy_hapmap_fake_rels.txt
$ PlotGraf.pl data/affy_hapmap_fake_rels.txt data/affy_hapmap_hgmr_f1.png 1 -gw 1000 -gh 500
$ PlotGraf.pl data/affy_hapmap_fake_rels.txt data/affy_hapmap_agmr_f1.png 2 -xmax 60 -ymax 20
$ PlotGraf.pl data/affy_hapmap_fake_rels.txt data/affy_hapmap_scatter_f1.png 3 -dot 5

The above examples show that graph size, axis limits and the scatter plot dot size can be adjusted by users. In the first step a fake pedigree and a fake SSM file are used to show how GRAF finds and reports errors in the pedigree and SSM files. The HGMR histogram generated in the second step shows that some of the related pairs reported by the pedigree and SSM file don’t match the genetic relatedness determined by GRAF. It also shows that the graph size can be adjusted by using options -gw and -gh. The AGMR histogram also shows the mismatches between the relationship types reported in the input files and those determined by GRAF. The axis limits can be adjusted by using -xmax and -ymax options. The scatter plot shows that the dot size can be adjusted using -dot option.

Multiple genotype datasets can be combined into one .fpg file and passed to graf for determining genetic relationships, e.g.,

$ graf -exfp data/affy_hapmap,data/perlegen_hapmap -out data/comb_hapmap.fpg
$ graf -geno data/comb_hapmap.fpg -out data/comb_hapmap_rels.txt -maxhm 15 -ped data/affy_hapmap_fake_pedigree.txt -ssm data/comb_hapmap_ssm.txt
$ PlotGraf.pl data/comb_hapmap_rels.txt data/comb_hapmap_hgmr.png 1
$ PlotGraf.pl data/comb_hapmap_rels.txt data/comb_hapmap_agmr.png 2
$ PlotGraf.pl data/comb_hapmap_rels.txt data/comb_hapmap_scatter.png 3

When multiple datasets are used, if there are no SSM and pedigree files, it is not required that the sample and subject IDs be unique across datasets. GRAF uses both DS# and subject/sample IDs to identify subjects or samples. In the output table, GRAF shows both the DS# and ID for each subject or sample. However, when there are SSM and pedigree files, it is required that IDs be unique across datasets. GRAF doesn’t take multiple SSM or pedigree files. The user needs to combine multiple SSM or pedigree files into one, and each ID in the combined SSM or pedigree file should represent only one sample or subject. Neither the SSM file nor the pedigree file has DS# columns.

The -hfd option of PlotGraf.pl lets the user set the dot size for the half sibling + full cousin pairs (HF, see Table 1) in the scatter plot. The HF relationship is genetically remoter than full sibling but closer than second degree relatives. In the scatter plot, these pairs are predicted to be between FS and D2 pairs. In the rare cases when there are HF pairs, the user can use -hfd option to highlight the HF pairs by setting different dot sizes for them.

GRAF-pop: inferring subject ancestry using genotypes

GRAF-pop calculates genetic distances from each subject to several reference populations and estimates subject ancestry and ancestral proportions based on these distances (Jin et al., 2019). Four genetic distances scores, GD1, GD2, GD3, GD4, are used in ancestry inference in the current version of GRAF. Subjects in the input datasets are clustered using these scores and plotted on scatter plots.

GRAF-pop assumes that each subject is an admixture of three ancestries: European (E), African (F), and Asian (A), and estimates ancestral proportions Pe, Pf, Pa based on GD1 and GD2 scores using barycentric coordinates. It also assigns a population ID (PopID) to each subject using the cutoff values shown in Tables 2 and 3.

Table 2. Grouping subjects based on the ancestry proportions

PopID Population Cutoff standard
1 European Pe ≥ 87%
2 African Pf ≥ 95%
3 East Asian Pa ≥ 95%
4 African American 40% ≤ Pf < 95% and Pa < 13%
5 Latin American 1 Pf < 40% and Pe < 87% and Pa < 13% and Pf ≥ Pa
6,7,8 (Three populations) Pa < 95% and Pe < 87% and Pf < 13% and Pf < Pa
9 Other Pa ≥ 13% and Pf ≥ 13%

Table 3. Separating Asians and Latin Americans using GD1 and GD4 scores

PopID Population Cutoff standard
7 Asian-Pacific Islander GD1 > 30 × (GD4)2 + 1.73
8 South Asian GD4 > 5 × (GD1 -1.69)2 + 0.042
6 Latin American 2 GD4 < 0 and PopID is not 7

Input files

As in GRAF-rel, GRAF-pop takes genotype datasets in either PLINK format (.fam, .bim, .bed) or GRAF format (.fpg). In addition, GRAF-pop can read self-reported ancestries from the input file and compare the ancestries inferred from genotypes with the self-reported ones. The input file should be a plain text file with two columns (without column header), containing subject ID and the self-reported ancestry, respectively.

Running graf to infer subject ancestry

Option -pop is used by graf to infer subject ancestry:

$ graf

Usage: graf [options]
    -pop    output file:     Check subject populations and save results to the output file

The following command determines population structures and saves results to the output file:

graf -plink data/G1000FpGeno -pop data/G1000_sbj_scores.txt

Running PlotPopulations.pl to plot population results

The results generated by graf can be passed to PlotPopulations.pl for further processing. The following instructions are displayed on the screen when the script is run without parameters:

$ PlotPopulations.pl

Usage: PlotPopulations.pl <input file> <output file> [Options]

Note:
    Output file should be either a .png file or a .txt file.

    If the output file is a .png file, the script will plot the results to a graph and save the graph to the file.
    If the output file is a .txt file, the script will save the calculated subject ancestry components to the file.

Options:
    Set window size in pixels
        -gw      graph width

    Set graph axis limits
        -xmin    min x value
        -xmax    max x value
        -ymin    min y value
        -ymax    max y value

    Set a rectangle area to retrieve subjects for graph of GD1 vs. GD2
        -xcmin   min x value
        -xcmax   max x value
        -ycmin   min y value
        -ycmax   max y value
        -isByd   0 or 1
                 0:  retrieve subjects whose values are within the above rectangle (default value)
                 1:  retrieve subjects whose values are beyond the above rectangle

    Set population cutoff lines
        -ecut   proportion: cutoff European proportion dividing Europeans from other population. Default 87%.
        -fcut   proportion: cutoff African proportion dividing Africans from other population. Default 95%.
                            Set it to -1 to combine African and African American populations
        -acut   proportion: cutoff East Asian proportion dividing East Asians from other populations. Default 95%.
                            Set it to -1 to combine East Asian and Asian-Pacific Islander populations
        -ohcut  proportion: cutoff African proportion dividing Latin Americans from Other population. Default 13%.
        -fhcut  proportion: cutoff African proportion dividing Latin Americans from African Americans. Default 40%.

    Select some self-reported populations (by IDs) to be highlighted on the graph
        -pops   comma-separated population IDs, e.g., -pops 1,3,4 -> highlight populations #1, #3, and #4

    Select self-reported populations (by IDs) to show areas including 95% dbGaP subjects with genotypes of at least 4000 fingerprint SNPs
        -areas  comma-separated dbGaP self-population IDs, e.g., -areas 1,3
            -> show areas that include 95% dbGaP subjects with self-reported populations #1 and #3
              1: European
              2: African
              3: East Asian
              4: African American
              5: Latin American 1
              6: Latin American 2
              7: Asian-Pacific Islander
              8: South Asian

    Select which score to show on the y-axis
        -gd4     1 or 0.  1: show GD4 on y-axis;  0: show GD2

    Set population cutoff lines
        -cutoff  1 or 0.  1: show cutoff lines;  0: hide cutoff lines

    Rotate the plot with respect to the x-axis by a certain angle
        -rotx    angle in degrees

    Set the size (diameter) of each dot that represents each subject
        -dot     pixels

    The input file with self-reported subject race information
        -spf     a file with two columns: subject and self-reported population

The script takes two required parameters, which must be the first two arguments and are not preceded by flags, unlike all the optional arguments, which are preceded by a flag. The first parameter should be the name of the file that is generated by graf -pop option and contains subject genetic distance scores. The second parameter is the output file, expected to be either a .png or .txt file. If the output file is a .png file, the script processes the scores and saves the results to the output file. The default graph is GD1 vs. GD2, e.g.,

$ graf -plink data/G1000FpGeno -pop data/G1000_sbj_scores.txt
$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops.png

When option -gd4 is set to 1, the script generates a graph of GD1 vs. GD4:

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_gd4.png -gd4 1

If the output file is a .txt file, the script processes the data and saves the results to the output file in a format of a rectangular table.

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_list.txt

In the output file, columns P_e, P_f, P_a show each subject’s African, European, and East Asian proportions Pe, Pf, Pa, in percentages. The populations determined by GRAF-pop are included in the last two columns as an identifier and as the full name of the population.

When self-reported ancestries are available, the information can be passed to the script with -spf option so that the script can color-code the subjects using the self-reported ancestries, e.g.,

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_sp.png -spf data/G1000SbjSuperPop.txt

The format of the input ancestry file is described above. In the graph generated by the script, the ancestries are numbered and color coded.

The cutoff lines used to partition the subjects are drawn on the graphs when option -cutoff is set, e.g.,

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_cut.png -spf data/G1000SbjSuperPop.txt -cutoff 1
$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_cut_gd4.png -spf data/G1000SbjSuperPop.txt -gd4 1 -cutoff 1

If multiple subjects appear at the same locations in the x-y plane, the user can use option -pops to bring some ancestries to the front, while setting some ancestries to the back and fade them out in the graph. For example, the following command generates a graph with the ancestry No. 5 (AMR, standing for Ad Mixed American) in the back and colored yellow. The assignments of colors to populations are currently hard-coded.

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_1234.png -spf data/G1000SbjSuperPop.txt -pops 1,3,2,4

The ancestry numbers following -pops should be separated by commas without spaces.

One can also use the -rotx option to rotate the graph of GD2 vs. GD1 around x-axis by a certain angle specified in degrees (can be any real number). For example, the following command generates a graph showing the subjects rotated by 90°:

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_90.png -spf data/G1000SbjSuperPop.txt -rotx 90

Options -gw, -xmin, -xmax, -ymin, -ymax, -dot, similar to those in PlotGraf.pl, can be used to adjust the graph size, specify axis limits, and set the dot size, e.g.,

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_gw.png -spf data/G1000SbjSuperPop.txt -gw 800 -ymin 1.1 -dot 5
$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_gw_gd4.png -spf data/G1000SbjSuperPop.txt -gw 800 -gd4 1 -ymin -0.2

One can use the option -areas to select populations to show the expected oval areas that include 95% of dbGaP subjects with at least 4000 fingerprint SNPs with genotypes, e.g.,

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_a.png -spf data/G1000SbjSuperPop.txt -areas 1,4,7

The integers in the comma-delimited string represent the eight self-reported ancestry groups in dbGaP, with most common ancestry terms in each group shown below:

            1: European
            2: African
            3: East Asian
            4: African American
            5: Latin American 1
            6: Latin American 2
            7: Asian-Pacific Islander
            8: South Asian

GRAF-pop uses the ancestry proportions shown in Tables 2 and 3 as default cutoff values. The user can use options -ecut, -fcut, -acut, -ohcut, -ahcut, -fhcut to set the cutoff values to different numbers, e.g.,

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_ucut.png -spf data/G1000SbjSuperPop.txt -cutoff 1 -fcut 85 -ahcut 80 -ohcut 15.5

When -fcut or -acut are set to negative values, the African or East Asian cutoff line is not plotted on the graph, and the script does not distinguish Africans from African Americans, or East Asians from Asian-Pacific Islanders, e.g.,

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_pops_nf.png -spf data/G1000SbjSuperPop.txt -cutoff 1 -fcut -1

As mentioned above, when the second parameter (the output file) is a .txt file, the script saves subjects and ancestry proportions into a rectangular table. Options -xcmin, -xcmax, -ycmin, -ycmax, -isByd can be used to specify a rectangular area and let the script retrieve subjects whose x(GD1), y(GD2) scores are either within or beyond this area. For example, the following command saves all subjects with 1.8 < GD1 < ∞ and -∞ < GD2 < 1.2, which are all the EAS (East Asian) subjects:

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_list_cut.txt -spf data/G1000SbjSuperPop.txt -xcmin 1.8 -ycmax 1.2

When option -isByd is set to 1, the script retrieves subjects whose values are beyond the rectangular area specified by options -xcmin, -xcmax, -ycmin, -ycmax. For example, the following command excludes most of the 1000 Genomes Project’s subjects with super populations AMR (Ad Mixed American) and SAS (South Asian):

$ PlotPopulations.pl data/G1000_sbj_scores.txt data/G1000_sbj_list_cutb.txt -spf data/G1000SbjSuperPop.txt -xcmin 1.64 -xcmax 1.8 -ycmin 1.24 -ycmax 1.36 -isByd 1

GRAF-sex: determining subject sexes using genotypes of pre-selected SNPs

GRAF-sex calculates heterozygosity rate of genotypes of pre-selected 1000 SNPs on X-chromosome and determines the genotype sex and compares it with the self-reported phenotype sex for each subject.

Input files

As in GRAF-rel and GRAF-pop, GRAF-sex takes genotype datasets in either PLINK format (.fam, .bim, .bed) or GRAF format (.fpg). In addition, GRAF-rel can read self-reported sexes either from the PLINK .fam file or a separate input file and compares the sexes determined using genotypes with the self-reported phenotyep sexes. The input file should be a plain text file with two columns, containing sample ID and the self-reported phenotype sexes, respectively.

Running graf to determine genotype sex

Option -sex is used by graf to determine genotype sex:

$ graf

Usage: graf [options]
    -sex    output file:     Do sex check using fingerprint genotypes on X-chromosome and save results to the output file

The following command determines sample genotype sexes and saves results to the output file:

$ graf -plink data/SexTestDs01 -sex data/SexTestRes01.txt

Running PlotSexCheckResults.pl to plot sex check results

The results generated by graf can be passed to PlotSexCheckResults.pl for plotting. The following instructions are displayed on the screen when the script is run without parameters:

$ PlotSexCheckResults.pl

Usage: PlotSexCheckResults.pl [Options]

Options:
    Specify input and output files
        -in      (required) input file generated by graf with -sex option
        -out     (required) output .png file to save the graph plotted by the script
        -sum     output .txt file to save the summary generated by the script
        -sf      two-column plain text file including sample sexes (1=male; 2=female)

    Set window size in pixels and axis scale
        -gw      graph width
        -gh      graph height
        -ymax    max number of samples to plot on y-axis

    Set cutoff heterozygosity rates (%)
        -mhr     maximum heterozygosity rate of male samples
        -fhr1    minimum heterozygosity rate of female samples
        -fhr2    maximum heterozygosity rate of female samples

The script takes two required parameters, the input and output files, passed in though -in and -out options.

$ graf -plink data/SexTestDs01 -sex data/SexTestRes01.txt
$ PlotSexCheckResults.pl -in data/SexTestRes01.txt -out data/SexTestRes01.png 

The script plots the results in input file data/SexTestRes01.txt and saves the graph to data/SexTestRes01.png.

By default, PlotSexCheckResults.pl reads self-reported phenotype sexes from the .fam file of the PLINK set. The user can specify another file for the script to read the phenotype sexes, e.g.,

$ PlotSexCheckResults.pl -in data/SexTestRes01.txt -out data/SexTestRes01b.png -sf data/SexTestSmpSex02.txt

The user can use option -sum to save the summary into a specified file, e.g.,

$ PlotSexCheckResults.pl -in data/SexTestRes01.txt -out data/SexTestRes01b.png -sf data/SexTestSmpSex02.txt -sum data/SexTestRes01_sum.txt 

Option -ymax can be used to adjust the y-axis scale, e.g.,

$ PlotSexCheckResults.pl -in data/SexTestRes01.txt -out data/SexTestRes01c.png -sf data/SexTestSmpSex02.txt -sum data/SexTestRes01_sum.txt -ymax 30

Graph size can be adjusted using options -gw and -gh, e.g.,

$ PlotSexCheckResults.pl -in data/SexTestRes01.txt -out data/SexTestRes01d.png -sf data/SexTestSmpSex02.txt -ymax 25 -gw 800 -gh 400

The user can set cutoff heterozygosity rates (in percentage) using options -mhr (maximum heterozygosity rate of male samples), -fhr1 (minimum heterozygosity rate of female samples) and -fhr2 (maximum heterozygosity rate of female samples), e.g.,

$ PlotSexCheckResults.pl -in data/SexTestRes01.txt -out data/SexTestRes01e.png -sf data/SexTestSmpSex02.txt -ymax 25 -fhr1 34 -fhr2 52 -sum data/SexTestRes01e_sum.txt

Running SetRsIdsInBimFile.pl to convert SNP IDs in .bim file into RS IDs

graf requires that SNP IDs in the .bim be RS IDs. If they are not RS IDs, the user can use SetRsIdsInBimFile.pl to convert them into RS IDs.

$ SetRsIdsInBimFile.pl

This script checks chromosome and position for each SNPs in the PLINK .bim file to determine
if it is a GRAF fingerprint SNPs. Assuming genome build GRCh 37 or 38, the script does the following:

    if the SNP is a GRAF fingerprint SNP then
        if the SNP ID is an RS ID then
            if the SNP ID is different from that expected based on chromosome and position then
                replace the SNP ID as "RS_ID_conflict"
                report it in the WARNING message
        else
            replace the SNP ID with rs ID

If output file is specified, it saves the new results to the output file. Otherwise it updates the input file and saves the original .bim file to <plink_set_basename>_orig.bim

Usage: SetRsIdsInBimFile.pl <plink_bim_file> [output_bim_file]

For example, no sex check results are generated after the following command is run, since graf doesn’t find any fingerprint SNPs on X-chromosome:

$ graf -plink data/SexTestDs02 -sex data/SexTestRes02.txt

We can do sex check in the following two steps:

$ SetRsIdsInBimFile.pl data/SexTestDs02.bim
$ graf -plink data/SexTestDs02 -sex data/SexTestRes02.txt

SetRsIdsInBimFile.pl converts fingerprint SNPs to RS IDs on all chromosomes, which means it can be used for relationship, population and sex checks.

References

Jin Y, Schäffer AA, Sherry ST, and Feolo M (2017). Quickly identifying identical and closely related subjects in large databases using genotype data. PLoS One. 12(6):e0179106.

Jin Y, Schäffer AA, Feolo M, Holmes JB and Kattman BL (2019). GRAF-pop: A Fast Distance-based Method to Infer Subject Ancestry from Multiple Genotype Datasets without Principal Components Analysis. G3: Genes | Genomes | Genetics. August 1, 2019 vol. 9 no. 8 2447-2461.