Load & Graph Data
Here we’ll bring together much of our Julia knowledge to actually analyze a large dataset.
The example here will be based on a dataset from this paper:
McCallum, Q., K. Askelson, F.F. Fogarty, L. Natola, E. Nikelski, A. Huang, and D. Irwin. 2024. Pronounced differentiation on the Z chromosome and parts of the autosomes in crowned sparrows contrasts with mitochondrial paraphyly: implications for speciation. Journal of Evolutionary Biology, voae004, https://doi.org/10.1093/jeb/voae004
We will load in a large dataset of genotypes throughout the genomes of Golden-crowned Sparrows and two forms of White-crowned Sparrows: pugetensis and gambelii. We will then use my own custom-designed Julia package called SNPlots.jl to analyze and graph aspects of this dataset.
To work through this example, start with a new Pluto notebook. Give it a name and save it, so your work will be continuously saved. You will need a set of files that I will send you–these include one file for the SNPlots.jl
package of functions and four data files within a folder called Sparrow_data_McCallumetal2024
.
Set the working directory
First, set the working directory to match the folder that I sent you called JuliaPlutoDemoSNPplots
:
cd("PATHNAME/JuliaPlutoDemoSNPplots") # where PATHNAME is appropriate to your computer directory structure
(The Julia function cd()
changes directory.)
Check that the working directory is now set correctly, using the pwd()
function (print working directory):
pwd()
Load the SNPlots
package
I wrote this set of functions to facilitate the analysis of and graphing of genomic data. The scope is quite modest so far, mainly targeted towards producing certain figures that I wanted in our papers.
The SNPlots package is not yet released to the world, so please don’t spread widely yet. I anticipate releasing it when the first paper is published using figures it produces.
We first need to load two packages used by SNPlots:
using MultivariateStats # for doing Principal Compponents Analysis
using CairoMakie # the plotting package used by SNPlots.jl
Now load SNPlots:
begin
include("SNPlots.jl")
using .SNPlots
end
This way of loading a package is appropriate when the source file is on your local computer (rather than being a registered Julia package).
To confirm that the SNPlots package is loaded, enter this:
greet_SNPlots() SNPlots.
You should receive a greeting. 😄
Set up the paths and data file names
begin
= "Sparrow_data_McCallumetal2024/SparrowDemo_genotypes.012"
genotype_file_name = genotype_file_name * ".indv"
individuals_file_name = genotype_file_name * ".pos"
position_file_name = "Sparrow_data_McCallumetal2024/SparrowDemo.Fst_groups.txt"
metadataFile end
Load three more packages for importing data
These are for storing data as type DataFrame
, and reading in delimited files:
using DataFrames, CSV, DelimitedFiles
Enough setup, let’s load data!
First import metadata about thee samples:
= DataFrame(CSV.File(metadataFile)) metadata
Above, the CSV.file()
function interprets the correct delimitation format of the data file, and the DataFrame()
function creates a DataFrame
object (a wonderful way to store data where columns are variables and rows are individuals).
Now, Import the list of individuals (row names) of the genotype matrix:
begin
= DataFrame(CSV.File(individuals_file_name; header=["ind"], types=[String]))
ind = size(ind, 1) # number of individuals
indNum end
Check that the metadata file and .indv file have same number of individuals:
if nrow(metadata) != indNum
println("WARNING: number of rows in metadata file different than number of individuals in .indv file")
else println("Good news: number of rows in metadata file matches the number of individuals in .indv file")
end
Combine individual names and metadata into one data structure (enabling confirmation that names match and are in correct order):
= hcat(ind, metadata) ind_with_metadata
Load SNP positions:
= DataFrame(CSV.File(position_file_name; header=["chrom", "position"], types=[String, Int])) pos_whole_genome
Load the genotype matrix:
begin # read in genotype data
@time geno = readdlm(genotype_file_name, '\t', Int16, '\n')
= size(geno, 2) - 1 # because the first column is not a SNP (just a count from zero)
loci_count print(string("Read in genotypic data at ", loci_count," loci for ", indNum, " individuals. \n"))
= geno[:, Not(1)] #remove first column, which was just a row index
genosOnly end
In the data matrix, rows represent individuals and columns represent SNPs, and genotypes are as follows:
0: homozygous reference; 1: heterozygous; 2: homozygous alternate; -1: missing genotype
[If you want to filter individuals or SNPs with too much missing data, I have scripts that can be added here]
Choose our groups to plot, and their colours
= ["GCSP","PSWS","GWCS"] groups_to_plot
= ["gold","red","blue"] group_colors
To do a PCA, we first need to impute missing genotypes
Prepare data matrix to be ready for imputation, by replacing -1
with missing
data type:
begin
= Matrix{Union{Missing, Int16}}(genosOnly)
genosOnly_with_missing .== -1] .= missing;
genosOnly_with_missing[genosOnly_with_missing = Matrix{Union{Missing, Float32}}(genosOnly_with_missing)
genosOnly_with_missing end
Load Impute
package:
using Impute
Now actually impute the missing genotype values:
@time imputed_genos = Impute.svd(genosOnly_with_missing)
Let’s make the PCA plot!
begin
= SNPlots.plotPCA(imputed_genos, ind_with_metadata,
PCAmodel
groups_to_plot, group_colors;="Zonotrichia sparrows", regionText="whole_genome",
sampleSet=false, flip2=false, showPlot=true)
flip1
PCAmodel.PCAfigend
Gold symbols represent Golden-crowned Sparrows. Blue represent the gambelii form of White-crowned Sparrows. Red represent that the pugetensis form of White-crowned Sparrows.
Now let’s produce a genotype-by-individual plot
Some initial setup of parameters for the plot:
begin
= "GCSP" # the alleles most common in this group will be assigned the same color in the graph
group1 = "GCSP_PSWS" # The groups to compare for the Fst filter below
groupsToCompare = 0.8
Fst_cutoff = 0.2
missingFractionAllowed end
Calculate allele frequencies and sample sizes per group:
= SNPlots.getFreqsAndSampleSizes(genosOnly_with_missing,
freqs, sampleSizes ind_with_metadata.Fst_group, groups_to_plot)
Calculate Fst (allele frequency differentiation) between pairs of groups:
= SNPlots.getFst(freqs,
Fst, FstNumerator, FstDenominator, pairwiseNamesFst sampleSizes, groups_to_plot)
Choose the chromosome to plot:
begin
= "CM018231.2" # the name of a scaffold in the assembly
chr = SNPlots.chooseChrRegion(pos_whole_genome, chr) #; positionMin=1) # this gets the maximum position for the chromosome
regionInfo end
OK, let’s make the plot!
begin
= SNPlots.plotGenotypeByIndividual(groupsToCompare, Fst_cutoff,
plotInfo
missingFractionAllowed, regionInfo, pos_whole_genome, Fst, pairwiseNamesFst,
genosOnly_with_missing, ind_with_metadata, freqs,
groups_to_plot, group_colors)1] # this outputs the plot above
plotInfo[end
The figure shows the genotypes of individuals at SNPs that have Fst higher than Fst_cutoff
in the groups being compared (given by groupsToCompare
). The two alleles have different shades of purple, with triangles indicating heterozygotes. The bottom of the plot indicates the location of SNPs on the chromosome, and the blue shading indicates SNP density across the chromosome.
If the function works on your computer like it does on mine, it will create a separate plot window and show the figure both there and in your Pluto notebook.
Can you write a loop to cycle through a number of different chromosomes and make a plot for each? You would use a lot of the code in the previous two cells above. You could do this for these scaffolds, expressed here a a vector of strings: ["CM018230.2", "CM018231.2", "CM018232.2", "CM018233.2", "CM018234.2"]
A take-home message:
The SNPlots.plotGenotypeByIndividual()
function produces this complex plot from the repeated use of low-level plotting functions to simply draw lines, filled shapes, and text in the plotting window. If you learn the commands to draw such simple elements and the logic by which to determine where to draw them, you can design your own functions to make all sorts of complex figures.