Randomization in Simulation and Statistics

Author

Darren Irwin

The concept of random events is central to our understanding of many biological processes (e.g. mutation, genetic drift) and to hypothesis testing using data from a sample. Hence, we biologists often need to use random numbers in our computer programming. Random numbers enable us to model probabilistic processes, in which each possible outcome occurs some pre-defined proportion of the time over a very large number of repetitions.

An interesting fact, though, is that computers cannot pick truly random numbers. Rather, they use deterministic algorithms that do a great job of producing series of numbers that have little apparent relationship to each other, and that appear to occur with equal probability over some range of values. These are often called ‘pseudorandom numbers’.

These algorithms start with a random number seed; if you give a certain algorithm the same seed and run it twice, it will produce the same series of ‘random’ numbers. This can be useful if, for instance, you are testing a simulation and you want to ensure it produces the same output after you’ve made some small change to the code. Other times, though, you want your random numbers to be as unpredictable as possible. For this we can use an arbitrary seed that we have never tried before; different seeds will produce totally different random sequences. A neat trick is to look up the time on the computer’s clock and use the digits representing the number of milliseconds (after the last whole second) as the random number seed; this way, even the programmer cannot predict the seed before running the code.
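
Here is a minimal sketch of that trick, using the Dates standard library (the Xoshiro generator used here is introduced later in this section):

using Dates, Random

seed = millisecond(now())  # milliseconds past the last whole second: 0 to 999
rng = Xoshiro(seed)        # seed a generator with this hard-to-predict value
rand(rng)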

Most of the time, though, we don’t need to think at all about the way Julia is generating random numbers, as it uses an excellent algorithm seeded with difficult-to-predict aspects of the computer’s state. So generating random numbers in Julia is easy!

The rand() function

To get a random number between 0 and 1, we simply write the rand() function with no arguments:

rand()
0.6662159128202503

The output of this default use of the function is a floating-point number (that is, it is of type Float64).

Try running that a few times to see that the output is indeed random.

Better yet, let’s run it 10 times in one command:

rand(10)
10-element Vector{Float64}:
 0.32080769173684354
 0.35812885471992595
 0.076340426570781
 0.6571857567992063
 0.006958385038272619
 0.3523660956487894
 0.5011025970058208
 0.33354840725976453
 0.35110333575689334
 0.7804233147344463

This shows that including an integer as the one argument tells Julia to generate that many random numbers (each a Float64 between 0 and 1).

But the rand() function makes great use of multiple dispatch. If we include a type as the one argument, it gives us a random value for that type:

rand(Int8)
-5

The returned value here is a random integer within the range of possible Int8 values.
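
If you’re curious about that range, Julia can report it directly:

typemin(Int8), typemax(Int8)
(-128, 127)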

To randomly produce true or false:

rand(Bool)
true

If we instead put a range as the one argument, Julia returns a number in that range:

rand(1:6)  
1

The above simulates the roll of a 6-sided die. Try it a few times.
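
As a quick sanity check that the simulated die is fair (a small sketch; your counts will differ from run to run), we can roll it many times and tally how often each face appears:

rolls = rand(1:6, 6_000)                  # roll the die 6,000 times
[count(==(face), rolls) for face in 1:6]  # each face should appear roughly 1,000 times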

Or we can give a vector or tuple containing a set of values to choose from:

rand([0, pi, 2pi])  # the brackets indicate a vector (i.e., an array)
6.283185307179586
rand(('A', 'C', 'G', 'T'))  # the parentheses indicate a tuple
'T': ASCII/Unicode U+0054 (category Lu: Letter, uppercase)

We can even write the above example more concisely, by providing all the Chars together as a String:

rand("ACGT")  
'T': ASCII/Unicode U+0054 (category Lu: Letter, uppercase)

There is great flexibility in how we can use the rand() function:

rand([1, "DNA", false], (2,5))  
2×5 Matrix{Any}:
 1   "DNA"      1  false  false
 1  1       false      1  false

Above, we provided as arguments both the set of possible outcomes and a tuple indicating the dimensions and size of the matrix to produce.

Setting up to get random numbers that are reproducible (!?!)

If we want our code to produce and use the same sequence of ‘random’ numbers, we need to load the Random module (part of Julia’s standard library, so there is nothing to install):

using Random

The rand() function can accept an optional first argument that specifies the random number generator. Here, we will use the Xoshiro algorithm (now the default in Julia) and give it an arbitrary seed of 321:

rand(Xoshiro(321), 5)
5-element Vector{Float64}:
 0.6893036297831823
 0.22036324422357434
 0.7799494887616285
 0.2330428384897486
 0.8716295743088799

Let’s see what happens if we run that again:

rand(Xoshiro(321), 5)
5-element Vector{Float64}:
 0.6893036297831823
 0.22036324422357434
 0.7799494887616285
 0.2330428384897486
 0.8716295743088799

We get the same sequence of numbers. The numbers are ‘random’ in some sense, but repeatable. Try changing the value of the seed, and see what happens.

Instead of setting up a new random number generator each time we call the rand() function, we might want to create one generator and then use it throughout our program. That way, the sequence of numbers never repeats within a single run of the program, but the whole program produces the same output each time it is run. For this, we set up a random number generator (‘rng’) and pass it to each call:

rng = Xoshiro(321)
rand(rng, 5)
5-element Vector{Float64}:
 0.6893036297831823
 0.22036324422357434
 0.7799494887616285
 0.2330428384897486
 0.8716295743088799
rand(rng, 5)
5-element Vector{Float64}:
 0.02894933089311269
 0.07474632097497813
 0.38509511194135615
 0.6081412740086088
 0.17766823446325286

Now our two sets of numbers above are different (so ‘random’ with respect to each other), but the overall sequence is reproducible. (You can test this by rerunning the three lines of code above.)
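
A related pattern, noted here as an aside (we won’t rely on it below), is to seed Julia’s default global generator with Random.seed!(); plain rand() calls are then reproducible without passing a generator around:

Random.seed!(321)  # seed the default global generator
rand(5)            # reproducible, without an explicit rng argument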

We can use our generator to choose among specified options using a construction like this:

rand(rng, ["yes","no","maybe"], 5)
5-element Vector{String}:
 "maybe"
 "maybe"
 "no"
 "yes"
 "maybe"
Make some DNA

Write a function that produces a pseudo-random DNA sequence (using the letters A, C, G, T) of a length provided to the function. Ensure that your function always provides the same sequence, such that the only thing that differs between function calls is the length of the sequence. (You might find the join() function useful: it assembles a string out of the elements of a vector, e.g. join([2,"A"]) returns "2A".)

Normally-distributed random numbers

Often in the context of biology, we want normally-distributed random numbers. To draw from a normal distribution with mean 0 and standard deviation 1, use the randn() function:

randn()
0.32090945029647727
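
To draw from a normal distribution with some other mean and standard deviation, shift and scale the output of randn(). Here is a quick sketch, with mu and sigma as arbitrary example values:

mu, sigma = 10.0, 2.5
mu .+ sigma .* randn(5)  # five draws from a normal with mean 10 and standard deviation 2.5
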
Make a cell move

Make a graph showing the path of a cell that has moved randomly in 2-dimensional space. (Hint: Start the cell at a certain location given by an x and y value, and then repeatedly alter the x and y values by adding independent random values.)

Make a null distribution

Now that we know how to randomly sample from a set of outcomes, we can generate our own null distributions and use them in statistical tests of our data. Here is a simple example, borrowed from the UBC BIOL 300 teaching material (thanks to Mike Whitlock) and based on the data in this paper:

Hill, R.A., and R.A. Barton. 2005. Red enhances human performance in contests. Nature 435:293.

In the 2004 Olympics, athletes in four combat sports (boxing, taekwondo, Greco-Roman wrestling, and freestyle wrestling) were randomly assigned red or blue clothing. Given the random assignment, if color has no effect on the probability of winning, then we expect about half of matches to be won by each color. In actuality, there were more red than blue winners in 16 rounds of competition, and more blue than red winners in only 4 rounds. Is this convincing evidence that color had a psychological effect on the athletes, or is it plausible that this outcome occurred just by chance?

Well, let’s generate a null distribution for the number of rounds won by red (out of 20 rounds), assuming color had no actual effect. To do this we can simulate red randomly winning or losing, do that 20 times, and record the number of red wins as one possible result when color has no effect. We can then do that whole process a large number of times and observe the proportion of times that we get a result of 16 red wins or something equally or more extreme. (This is an estimate of the P-value.)

function getSimulatedRedWins()
    redWins = 0  # initialize the counter at zero
    for i in 1:20  # loop through 20 rounds
        winner = rand(["red","blue"])  # pick a random winner for one round
        if winner == "red"  # if red winner, add one to the count of red wins
            redWins = redWins + 1 
        end
    end
    return redWins
end

getSimulatedRedWins()
12
Try another way of writing the above function

A fun thing about programming is there are many ways to solve a problem. Try different approaches for writing the function above. Can you figure out a way that doesn’t use a for loop?

Also, can you make the above function more general in some way? For instance, can you make it such that the number of rounds (20) is not fixed but can be entered as an argument to the function?

We now have a function to get one simulated value for the number of red wins under the null hypothesis of no advantage for red. Let’s run this many times and generate a null distribution:

function getManySimResults(numSims)
    simRedWins = Vector{Int8}(undef, numSims) 
    for i in 1:numSims  # loop through simulations
        simRedWins[i] = getSimulatedRedWins()
    end
    return simRedWins
end

nullValues = getManySimResults(1_000_000)

using Plots

histogram(nullValues, bins = (0:21) .- 0.5, legend = false,
    xlabel = "Number of red wins", ylabel = "Number of simulations")

The graph shows the distribution of the number of red wins out of 20 trials, under the null hypothesis that the probabilities of red vs. blue winning are both 50%.

Estimate a P-value

We can now compare our real data (16/20 wins were for red) and ask how much of our null distribution is that extreme or even more extreme. Hence we’ll add up the proportion of simulations with 16 or more red wins. Then, we’ll multiply by 2 to include the other extremes on the blue-winning side (making this a two-tailed test):

numExtremeSims = 2sum(nullValues .>= 16)
11972

Our estimated P-value will be that number divided by the number of simulations:

pValue = numExtremeSims / length(nullValues)
0.011972

If you have a statistics background, you might have noticed that the above situation is perfectly suited for a binomial test. In fact, the distribution that we simulated is extremely close to a binomial distribution. The only reason it differs is that our number of simulated outcomes is finite, meaning there are chance differences between our simulated distribution and the theoretical binomial distribution.
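
Because the null model here is just 20 fair coin flips, we can compute the exact two-tailed probability in base Julia (a quick check; binomial() is Julia’s built-in binomial-coefficient function):

2 * sum(binomial(20, k) * 0.5^20 for k in 16:20)  # P(X >= 16), doubled
0.01181793212890625

This is close to our simulation-based estimate and rounds to the 0.0118 reported by the binomial test below.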

We can compare our estimated P-value above with the P-value produced by a binomial test. For this, we’ll add a package to our Julia environment and then use it:

using Pkg  # this loads the Pkg package
Pkg.add("HypothesisTests")  # this downloads and installs the HypothesisTests package 
using HypothesisTests
BinomialTest(16, 20, 0.5)
Binomial test
-------------
Population details:
    parameter of interest:   Probability of success
    value under h_0:         0.5
    point estimate:          0.8
    95% confidence interval: (0.5634, 0.9427)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0118

Details:
    number of observations: 20
    number of successes:    16

This command shows that the p-value of a precise binomial test (0.0118) is extremely close to our estimate based on a large number of simulations. (Doing an even larger number of simulations will tend to make the estimate even closer to the precise value from the binomial test.)

The P-value is below the usual significance threshold of 0.05, so the authors rejected the null hypothesis of no effect of color on winning, and concluded that wearing red rather than blue provides an advantage.

Permutation (or Randomization) test

Above, we used simulated data sets to generate a null distribution for a single variable (the number of red wins out of 20 trials). We can use a related approach when we are interested in the association between two variables. We do this by assuming the null hypothesis of no association is true, and then repeatedly and randomly shuffling the values in the dataset to make many new datasets. We calculate a summary statistic from each shuffled dataset, and the distribution of those summary statistics is our null distribution for that statistic. If the statistic from our real data lies in the extreme tails of that distribution, we can reject the null and conclude that there is in fact an association between the variables. This procedure is called a permutation test or randomization test.

Let’s do a permutation test on data from this publication:

Johnson et al. (1999) Female remating propensity contingent on sexual cannibalism in sagebrush crickets, Cyphoderris strepitans: a mechanism of cryptic female choice. Behavioral Ecology 10: 227-233.

When these crickets mate, the male offers his hindwings for the female to eat. The question was: Why would they do such a thing?!? One idea: if the females get some nutrients from the wings, maybe they wait longer to remate (with another male), the implication being that males who feed females their wings might father more offspring.

Let’s load a dataset containing times to remating for two experimental groups of females: those mated initially with a winged male (and who mostly ate his wings) and those mated initially with a wingless male (where wings had been surgically removed).

using CSV, DataFrames
cricketData = DataFrame(CSV.File("example_data/cricketData.csv"))
25×2 DataFrame
 Row │ Male_state     Log_remating_time
     │ String15       Float64
─────┼──────────────────────────────────
   1 │ Male_wingless  0.0
   2 │ Male_wingless  0.7
   3 │ Male_wingless  0.7
   4 │ Male_wingless  1.4
   5 │ Male_wingless  1.6
   6 │ Male_wingless  1.8
   7 │ Male_wingless  1.9
   8 │ Male_wingless  1.9
   9 │ Male_wingless  1.9
  10 │ Male_wingless  2.2
  11 │ Male_wingless  2.1
  12 │ Male_wingless  2.1
  13 │ Male_winged    1.4
  14 │ Male_winged    1.6
  15 │ Male_winged    1.9
  16 │ Male_winged    2.3
  17 │ Male_winged    2.6
  18 │ Male_winged    2.8
  19 │ Male_winged    2.8
  20 │ Male_winged    2.8
  21 │ Male_winged    3.1
  22 │ Male_winged    3.8
  23 │ Male_winged    3.9
  24 │ Male_winged    4.5
  25 │ Male_winged    4.7

Let’s make some quick histograms showing the distribution of remating time for each group:

using Plots
p1 = histogram(cricketData.Log_remating_time[cricketData.Male_state .== "Male_wingless"], bins=0:0.5:5, xlims=[0,5], title="females mated to wingless males", ylabel="frequency")

p2 = histogram(cricketData.Log_remating_time[cricketData.Male_state .== "Male_winged"], bins=0:0.5:5, xlims=[0,5], title="females mated to winged males", xlabel="Remating time", ylabel="frequency")

plot(p1, p2,
    layout = @layout [p
                      p])

If we want to test whether the two groups differ (beyond chance differences due to sampling error), we might be uncomfortable with a 2-sample t-test because that would require assumptions of normality and equal variance. Instead, we can do a permutation test.

The permutation test assumes nothing about the underlying distributions; it simply tests the null hypothesis that the distribution of values is the same in the two categories. We’ll do this by generating a null distribution for the difference in mean remating time between the two categories, and then comparing our real difference to that null distribution.

Let’s make a function to calculate the difference in means:

function calculateCricketMeanDiffs(data::DataFrame)
    winged = subset(data, :Male_state => x -> x .== "Male_winged").Log_remating_time
    wingless = subset(data, :Male_state => x -> x .== "Male_wingless").Log_remating_time
    sum(winged)/length(winged) - sum(wingless)/length(wingless)
end
calculateCricketMeanDiffs (generic function with 1 method)
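
As an aside, the mean() function from the Statistics standard library does the sum-divided-by-length calculation for us. Here is an equivalent sketch (the function name calculateCricketMeanDiffs2 and the logical-indexing style are just one alternative among many):

using Statistics  # provides mean()

function calculateCricketMeanDiffs2(data::DataFrame)
    winged = data.Log_remating_time[data.Male_state .== "Male_winged"]
    wingless = data.Log_remating_time[data.Male_state .== "Male_wingless"]
    return mean(winged) - mean(wingless)
end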

Now use that to calculate the difference in real group means:

realDiff = calculateCricketMeanDiffs(cricketData)
1.4134615384615383

So we have our real difference. Let’s compare that to the differences we would get if there were no real association between male state and remating time. We’ll shuffle the values of one variable (using the well-named shuffle() function) and construct a null distribution of mean differences:

using Random  # needed for the shuffle() function
function permuteOnce(data::DataFrame)
    permutedData = copy(data)  # this makes a copy of the dataframe, so original not altered
    shuffle!(permutedData.Log_remating_time)  # can specify a random number generator as a first argument, if desired
    return permutedData
end

# Test the function above:
permuteOnce(cricketData)
25×2 DataFrame
 Row │ Male_state     Log_remating_time
     │ String15       Float64
─────┼──────────────────────────────────
   1 │ Male_wingless  1.6
   2 │ Male_wingless  0.7
   3 │ Male_wingless  2.8
   4 │ Male_wingless  1.6
   5 │ Male_wingless  4.7
   6 │ Male_wingless  0.7
   7 │ Male_wingless  3.1
   8 │ Male_wingless  1.9
   9 │ Male_wingless  2.6
  10 │ Male_wingless  4.5
  11 │ Male_wingless  2.2
  12 │ Male_wingless  1.4
  13 │ Male_winged    1.9
  14 │ Male_winged    1.9
  15 │ Male_winged    2.8
  16 │ Male_winged    2.1
  17 │ Male_winged    1.9
  18 │ Male_winged    3.9
  19 │ Male_winged    3.8
  20 │ Male_winged    1.4
  21 │ Male_winged    2.3
  22 │ Male_winged    2.8
  23 │ Male_winged    0.0
  24 │ Male_winged    1.8
  25 │ Male_winged    2.1

function permuteMany(data::DataFrame, reps::Int)
    nullDiffs = Vector{Float64}(undef, reps) 
    for i in 1:reps
        permutedData = permuteOnce(data)
        nullDiffs[i] = calculateCricketMeanDiffs(permutedData)
    end
    return nullDiffs
end

permutedDiffs = permuteMany(cricketData, 10_000)
10000-element Vector{Float64}:
 -0.47756410256410176
  0.17948717948717974
 -0.25320512820512775
  0.1474358974358969
  0.2435897435897436
 -0.012820512820513219
  0.7564102564102562
 -0.18910256410256432
 -0.5576923076923082
 -0.4615384615384617
 -0.23717948717948767
  0.08333333333333304
 -0.060897435897436125
  ⋮
 -0.6057692307692311
 -0.28525641025641013
 -0.6698717948717945
  0.3878205128205132
  0.27564102564102555
  0.35576923076923084
 -0.25320512820512864
 -0.04487179487179471
  0.2596153846153846
  0.3237179487179489
 -0.12499999999999956
  0.5

Now let’s plot our null distribution and our real value:

histogram(permutedDiffs)
plot!([realDiff, realDiff], [0, 200], color="red", width=5)

So our real value (shown in red) is quite extreme compared to the null distribution of our test statistic under the null hypothesis of no association between the experimental group and the response variable. This causes us to doubt the null. We can estimate the p-value by determining what proportion of permuted differences were at least as extreme as our actual value (multiplied by 2 to make it a two-tailed test):

pValue = 2sum(permutedDiffs .>= realDiff) / length(permutedDiffs)
0.0006

Our estimate of the p-value tends to become more precise if we increase the number of permutations (see several cells up, where we called the permuteMany() function).
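
A rough gauge of that precision (a sketch that treats each permutation as an independent yes/no draw) is the standard error of a proportion:

se = sqrt(pValue * (1 - pValue) / length(permutedDiffs))  # ≈ 0.0002 here

With 10,000 permutations, our estimate of 0.0006 is therefore uncertain by only a few ten-thousandths, which is plenty precise for comparison against 0.05.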

The p-value is far below the usual standard alpha of 0.05, so we are comfortable in rejecting the null and concluding that experimental removal of male hindwings does appear to shorten the remating time of females mated to that male (at least in this species of cricket).

Have fun with randomness!

I find experimenting with randomness to be one of the most enjoyable aspects of writing computer simulations. I think this is one reason humans enjoy games of chance: there is a surprising aspect to randomness, but also a strange sense of order that can go along with it. Using functions like rand(), randn(), and shuffle(), I encourage you to have fun incorporating randomness into your programming.