Arrays & DataFrames

Author

Darren Irwin

Two Julia object types that are especially useful for storing and manipulating datasets are Arrays (which we learned a bit about in the quick introduction) and DataFrames. On this page we’ll build an understanding of each.

Arrays

You can think of an Array as a big box that contains one or more small boxes in a grid arrangement. They can have zero dimensions (i.e., one small box), one dimension (i.e., a stack of small boxes; also called a Vector), two dimensions (i.e., a grid of small boxes; also called a Matrix), or even more dimensions. We can store things in each small box, and refer to what is in that box by its indices (the box number along each dimension).
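As a quick check of this terminology, the sketch below builds a zero-, one-, and two-dimensional array and asks Julia about each. The variable names are made up for illustration; fill, ndims, and size are standard Julia functions.

```julia
# Hypothetical example arrays, one per dimensionality
zeroDim = fill(5)          # zero dimensions: a single box holding 5
oneDim = [10, 20, 30]      # one dimension: a Vector
twoDim = [1 2 3
          4 5 6]           # two dimensions: a Matrix

ndims(zeroDim), ndims(oneDim), ndims(twoDim)   # number of dimensions of each
size(twoDim)               # (rows, columns)
zeroDim[]                  # empty brackets read the single box
oneDim isa Vector          # Vector is another name for a 1-dimensional Array
twoDim isa Matrix          # Matrix is another name for a 2-dimensional Array
```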

For example, here we create a 2-dimensional array with 3 rows and 4 columns:

A = [11 21 31 41
     0 0 0 0
     99 88 77 66]
3×4 Matrix{Int64}:
 11  21  31  41
  0   0   0   0
 99  88  77  66

Julia shows us the type of this object as being Matrix{Int64} meaning that it is a Matrix (a 2-dimensional Array) with elements of type Int64.

Let’s look in just one box using indexing:

A[3,2]
88

That gives us the value in row 3 and column 2.

Or we can take a slice of the array:

A[1:3,2:3]
3×2 Matrix{Int64}:
 21  31
  0   0
 88  77
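Two shorthand forms are often handy when slicing: a bare colon selects everything along a dimension, and end stands for the last index along a dimension. A minimal sketch, rebuilding the same A so the block runs on its own (the names row3, lastCol, and corner are made up for illustration):

```julia
A = [11 21 31 41
     0 0 0 0
     99 88 77 66]

row3 = A[3, :]           # all columns of row 3, returned as a Vector
lastCol = A[:, end]      # all rows of the last column
corner = A[2:end, 1:2]   # rows 2 to 3, columns 1 to 2
```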

Creating arrays

Often in programming, it is good practice to set up an array at its full size and then later fill it with meaningful values. This promotes efficiency (minimizing memory use and maximizing speed) compared to building up an array one small piece at a time.

We can initialize arrays in a number of ways. Here are a few of the possibilities:

B = ones(3, 7)
3×7 Matrix{Float64}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0
C = zeros(1,5)
1×5 Matrix{Float64}:
 0.0  0.0  0.0  0.0  0.0
D = fill(3.7, 3, 5, 2)
3×5×2 Array{Float64, 3}:
[:, :, 1] =
 3.7  3.7  3.7  3.7  3.7
 3.7  3.7  3.7  3.7  3.7
 3.7  3.7  3.7  3.7  3.7

[:, :, 2] =
 3.7  3.7  3.7  3.7  3.7
 3.7  3.7  3.7  3.7  3.7
 3.7  3.7  3.7  3.7  3.7

This last one is a 3-dimensional array. You can think of the dimensions as rows, columns, and pages or layers.
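Indexing a 3-dimensional array works the same way as for a matrix, just with a third index for the layer. A small sketch (the name layer2 is made up for illustration):

```julia
D = fill(3.7, 3, 5, 2)

size(D)               # (3, 5, 2): rows, columns, layers
ndims(D)              # 3
D[2, 4, 1] = 0.0      # change row 2, column 4 of layer 1
layer2 = D[:, :, 2]   # extract the second layer as a 3×5 Matrix
```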

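There are many times when we don’t care what the initial values in our array are. A time-saving method is to create the array without initializing it, so that each element holds an arbitrary value determined by whatever bits were already in the memory being used (these happen to display as 0.0 below, but that is not guaranteed):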

@time E = Array{Float64}(undef, 1000, 1000)
  0.000006 seconds (3 allocations: 7.629 MiB)
1000×1000 Matrix{Float64}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮                        ⋮              ⋱            ⋮                   
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
@time F = ones(1000,1000)
  0.004793 seconds (3 allocations: 7.629 MiB)
1000×1000 Matrix{Float64}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 ⋮                        ⋮              ⋱            ⋮                   
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0

Those two commands both create objects of the same size and type (Matrix{Float64}), but the second takes longer to run because it has to set every value to 1.0.
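An uninitialized array is only safe to use if every element is overwritten before it is read. The sketch below shows that pattern next to the simpler approach of growing a vector one element at a time. Both function names are hypothetical, and the example is meant to illustrate the idea rather than to serve as a benchmark.

```julia
# Preallocate, then overwrite every element before reading any of them
function squares_preallocated(n)
    result = Vector{Float64}(undef, n)   # contents are arbitrary at this point
    for i in 1:n
        result[i] = i^2                  # every box gets a meaningful value
    end
    return result
end

# Grow the vector one element at a time (simpler, but the array is resized repeatedly)
function squares_grown(n)
    result = Float64[]
    for i in 1:n
        push!(result, i^2)
    end
    return result
end

squares_preallocated(5) == squares_grown(5)   # same contents either way
```

Wrapping each call in @time with a large n is one way to compare the two approaches on your own machine.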

Arrays can store just about anything!

G = [1 1.0 "one"
     1//1 '1' BigInt(1)]
2×3 Matrix{Any}:
  1    1.0    "one"
 1//1   '1'  1

In fact, you can even put an array inside an array:

G[1,1] = [1 0
          0 1]
G
2×3 Matrix{Any}:
   [1 0; 0 1]  1.0    "one"
 1//1           '1'  1

The object G is of type Matrix{Any} which means that it is a matrix that can store anything in its small boxes. In contrast, the object A (which we created far above) is of type Matrix{Int64} which means that it can only store integers. If we try to put something else in its boxes, we get an error:

A[2,3] = "A string, not an integer"

Julia responds: MethodError: Cannot `convert` an object of type String to an object of type Int64

The error arises because A is set up as a matrix of integers, and we cannot put a string into it. We can fix this, though, by changing the type of the array:

A = convert(Matrix{Any}, A)  # converts A to type Matrix{Any}
A[2,3] = "A string, not an integer"
A
3×4 Matrix{Any}:
 11  21  31                            41
  0   0    "A string, not an integer"   0
 99  88  77                            66
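Converting to Matrix{Any} is the most permissive fix. When the new values are still numbers, a narrower conversion is often enough. The sketch below, using made-up names M and Mfloat, shows that Julia will accept a value it can convert exactly to the element type, and that it raises an InexactError when it cannot.

```julia
M = [1 2
     3 4]          # Matrix{Int64} on a typical 64-bit machine

M[1, 1] = 7.0      # works: 7.0 converts exactly to the integer 7
M[1, 1]            # 7

# 2.5 has no exact integer equivalent, so the assignment throws an InexactError
failed = try
    M[1, 2] = 2.5
    false
catch err
    err isa InexactError
end

Mfloat = convert(Matrix{Float64}, M)   # a narrower fix than Matrix{Any}
Mfloat[1, 2] = 2.5                     # now allowed
```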

Memory and speed efficiency of arrays

When we created the arrays A and G, Julia made a best guess as to what types it should allow in the boxes. In the case of A, when we first created it we put in only integers, so Julia made the assumption that we would always want only integers in it. Why would it make that limitation? Well, there are huge benefits in terms of the way the array is stored in memory. If only integers will be stored, then Julia knows how much memory to allocate. If anything could be stored in an array, then it has to set it up in memory in a more flexible way that is not as efficient.
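To see the kind of guess Julia makes, we can ask for the element type that it chose for a few small arrays. This is only an illustrative sketch with made-up variable names; eltype and isconcretetype are standard Julia functions.

```julia
ints = [1, 2, 3]          # all integers
mixedNums = [1, 2.5]      # an integer and a float are promoted to a common type
anything = [1, "two"]     # no common concrete type

eltype(ints)              # Int64 on a typical 64-bit machine
eltype(mixedNums)         # Float64
eltype(anything)          # Any

# A concrete element type means every box has the same known size in memory
isconcretetype(eltype(ints))       # true
isconcretetype(eltype(anything))   # false
```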

Julia gives us the option of thinking a lot about efficiency in our coding, but it is also quite fast even when you don’t write efficient code, by making its best guess about how to store data. As you become a better programmer and work with larger datasets, it will become more useful to think about storing data efficiently.

As one example, let’s say we need to store a big matrix of values, let’s say 10,000 rows by 10,000 columns, so 100 million values. We could just venture forth with a simple command, like this:

bigMatrix = fill(1, 10_000, 10_000)    
10000×10000 Matrix{Int64}:
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 ⋮              ⋮              ⋮        ⋱        ⋮              ⋮           
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1

This creates a 10,000 by 10,000 array and fills it with 1s. We can check the memory size of this array like this:

sizeof(bigMatrix)   
800000000

This tells us that bigMatrix uses 800 million bytes of memory. This makes sense: it is of type Matrix{Int64}, so each of its 100 million elements is stored as a 64-bit integer, which uses 8 bytes (there are 8 bits in a byte).

If we think about our needs though, we might realize we will not need to store any very large or very small integers in this matrix. If our matrix will only be used to store integers ranging from -128 to 127, then those integers can all be stored in only 8 bits (1 byte). So let’s tell Julia that we want our Matrix set up like that:

bigMatrix2 = fill(Int8(1), 10_000, 10_000)   
10000×10000 Matrix{Int8}:
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 ⋮              ⋮              ⋮        ⋱        ⋮              ⋮           
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1

Now, we have filled our new matrix with Int8(1) which is the value 1 encoded as type Int8, meaning it takes up 8 bits, or 1 byte:

sizeof(bigMatrix2)   
100000000

This matrix takes 1/8 the memory of the first, but encodes exactly the same information. (The trade-off is that it can store only a narrower range of numbers.)
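The limits of a small integer type are worth checking before committing to it. The sketch below uses a smaller matrix (with the made-up name small) so that it runs quickly. It shows the range of Int8, what happens when we try to store a value outside that range, and the fact that arithmetic on Int8 values wraps around silently.

```julia
typemin(Int8), typemax(Int8)     # (-128, 127)
sizeof(Int8), sizeof(Int64)      # (1, 8) bytes per element

small = fill(Int8(1), 100, 100)
sizeof(small)                    # 10000 bytes: 100 × 100 elements × 1 byte

# Storing a value outside the Int8 range throws an InexactError
outOfRange = try
    small[1, 1] = 200
    false
catch err
    err isa InexactError
end

# Arithmetic on Int8 values wraps around silently rather than raising an error
Int8(127) + Int8(1)              # -128
```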

Usually, information stored with a lower memory footprint will also be quicker to access, meaning your programs will be faster.

BitArrays

A super efficient way to store a set of binary values (e.g. true/false, 1/0, on/off) is as a BitArray, wherein each element is stored as a single bit (the smallest memory unit in a computer). This means we can store 64 values in the same memory space as a default integer stored as Int64 would take:

myBitArray = trues(10_000, 10_000)  # trues() sets up a BitArray with all values set to 1
10000×10000 BitMatrix:
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 ⋮              ⋮              ⋮        ⋱        ⋮              ⋮           
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1  1  1  1  1
sizeof(myBitArray)
12500000

Now our array of 1s takes only 12.5 million bytes, which is 1/64 the memory size of our bigMatrix of 1s above.
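For comparison, an ordinary array of Bool values uses one byte per element, so a BitArray is also 8 times smaller than that. A minimal sketch with made-up names, using a length that is a multiple of 64 so the byte counts come out evenly:

```julia
n = 6400
boolVec = fill(true, n)    # Vector{Bool}: one byte per element
bitVec = trues(n)          # BitVector: one bit per element

sizeof(boolVec)            # 6400 bytes
sizeof(bitVec)             # 800 bytes (6400 bits / 8)

bitVec[1:10] .= false      # elements are read and written like any other array
count(bitVec)              # number of true values: 6390
```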

Logical indexing

One way to choose specific elements from arrays is logical indexing, in which a series of true/false values (or 1/0 values in a BitArray) indicates which elements to choose. To show this, let’s start with a simple matrix:

m1 = reshape(1:16, 4, 4)  # reshape() turns a Vector into a Matrix
4×4 reshape(::UnitRange{Int64}, 4, 4) with eltype Int64:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

Now we can use a vector of Boolean values to choose rows:

m1[[true, false, true, false], :]
2×4 Matrix{Int64}:
 1  5   9  13
 3  7  11  15

Or choose rows and columns like this:

m1[[true, false, true, false], [false, true, true, false]]
2×2 Matrix{Int64}:
 5   9
 7  11

An equivalent expression using BitVectors is:

m1[BitVector([1, 0, 1, 0]), BitVector([0, 1, 1, 0])]
2×2 Matrix{Int64}:
 5   9
 7  11

This lets us do things like choose the even rows and odd columns:

m1[iseven.(1:4), isodd.(1:4)]
2×2 Matrix{Int64}:
 2  10
 4  12

In the above expression, iseven.(1:4) produces a BitVector that indicates whether each integer from 1 to 4 is even. See that here:

iseven.(1:4)
4-element BitVector:
 0
 1
 0
 1

Above we used logical indexing to determine the rows and columns to include. We can also use it to pick elements more directly:

selectionMask = (m1 .% 3 .== 0)  # Chooses values divisible by 3
4×4 BitMatrix:
 0  0  1  0
 0  1  0  0
 1  0  0  1
 0  0  1  0

The above made a BitMatrix indicating which elements satisfied the condition. Now let’s use that to choose those elements:

m1[selectionMask]
5-element Vector{Int64}:
  3
  6
  9
 12
 15

This method—of specifying a condition for elements to satisfy, constructing a BitMatrix, and using that BitMatrix to index the array—is used often in data analysis.
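The same mask can also be used to count, locate, or overwrite the selected elements. The sketch below rebuilds m1 and selectionMask so that it runs on its own; the names positions and m2 are made up for illustration. Note that m1 is a reshaped range and cannot be modified, so we first copy it into an ordinary Matrix with collect.

```julia
m1 = reshape(1:16, 4, 4)
selectionMask = (m1 .% 3 .== 0)

count(selectionMask)                 # how many elements satisfy the condition: 5
positions = findall(selectionMask)   # their (row, column) positions as CartesianIndex values

# m1 is a read-only view of the range 1:16, so copy it into an ordinary Matrix first
m2 = collect(m1)
m2[selectionMask] .= 0               # overwrite only the selected elements
sum(m2)                              # 136 - (3 + 6 + 9 + 12 + 15) = 91
```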

Build a structure for storing genotypes

Your research involves genotypic data, and you want to efficiently keep track of both real and simulated diploid genotypes for multiple individuals and loci (i.e., genes). You can assume that there are only two alleles at each locus. Can you come up with a data object to store your data, and then store some example data in your object? (Hint: you have 3 dimensions of data: individuals, loci, and the two alleles at each locus.)

If that goes well, then write code that will choose only the heterozygous individuals at a given locus.

DataFrames

These are also hugely useful in biological data analysis. A DataFrame can be thought of as a series of same-length Vectors arranged as columns into a single table of data. Usually, each row represents an individual, whereas each column represents a distinct variable. Importantly, the different columns can have different types of elements (for example, one column might have Strings, one might have Ints, another might have Float64s, etc.). DataFrames store these different types of vectors efficiently. Furthermore, we can designate names for each column and refer to them by those names.

To use DataFrames, we must download and install a package. Type ] to enter the package mode, then enter this:

add DataFrames

Now press delete or backspace to return to the normal REPL mode, and enter this:

using DataFrames

Let’s enter some example data:

data = DataFrame(species = ["warbler", "wren", "sparrow", "flameback"],
          mass_g = [11, 9, 28, 300],
          random_num = rand(4))
4×3 DataFrame
 Row │ species    mass_g  random_num
     │ String     Int64   Float64
─────┼───────────────────────────────
   1 │ warbler        11    0.715291
   2 │ wren            9    0.87526
   3 │ sparrow        28    0.321188
   4 │ flameback     300    0.639962

The REPL now shows us our dataframe in a nice table format. We see it has interpreted our input data as we would like, with the three columns containing elements of type String, Int64, and Float64.
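Beyond reading the printed table, we can ask for the same information programmatically. This is a small sketch that rebuilds a two-column version of the example data so that it runs on its own. It assumes the DataFrames package is installed as described above.

```julia
using DataFrames

data = DataFrame(species = ["warbler", "wren", "sparrow", "flameback"],
                 mass_g = [11, 9, 28, 300])

size(data)                 # (4, 2): rows and columns
nrow(data), ncol(data)     # the same information as separate numbers
names(data)                # column names as strings
eltype.(eachcol(data))     # element type of each column
```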

We can now refer to columns in a convenient way:

data.species
4-element Vector{String}:
 "warbler"
 "wren"
 "sparrow"
 "flameback"
data.mass_g
4-element Vector{Int64}:
  11
   9
  28
 300

This convenient reference allows us to choose subsets of the data, borrowing our knowledge from the Logical Indexing section above:

data.species[data.mass_g .> 20]
2-element Vector{String}:
 "sparrow"
 "flameback"

We can add a new column quite easily:

data.size_category = fill("undetermined", size(data, 1))  # the size(data,1) function gets the number of rows.
data
4×4 DataFrame
 Row │ species    mass_g  random_num  size_category
     │ String     Int64   Float64     String
─────┼──────────────────────────────────────────────
   1 │ warbler        11    0.715291  undetermined
   2 │ wren            9    0.87526   undetermined
   3 │ sparrow        28    0.321188  undetermined
   4 │ flameback     300    0.639962  undetermined

Now let’s fill in our new column with meaningful values:

data.size_category[data.mass_g .>= 100] .= "BIG"            # 100 g or more
data.size_category[100 .> data.mass_g .>= 15] .= "medium"   # 15 g up to (but not including) 100 g
data.size_category[15 .> data.mass_g] .= "small"            # less than 15 g
data
4×4 DataFrame
 Row │ species    mass_g  random_num  size_category
     │ String     Int64   Float64     String
─────┼──────────────────────────────────────────────
   1 │ warbler        11    0.715291  small
   2 │ wren            9    0.87526   small
   3 │ sparrow        28    0.321188  medium
   4 │ flameback     300    0.639962  BIG
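Another way to get the same column, shown here only as a sketch, is to write the rule as a small function and broadcast it over the mass column. The helper name size_category is hypothetical, and the choice to put masses of exactly 15 or 100 into the larger category is an assumption. The last line then uses logical indexing on rows to keep only the small birds.

```julia
using DataFrames

# Hypothetical helper; the thresholds (15 g and 100 g) follow the text,
# with values exactly on a threshold assigned to the larger category.
size_category(mass) = mass >= 100 ? "BIG" : mass >= 15 ? "medium" : "small"

data = DataFrame(species = ["warbler", "wren", "sparrow", "flameback"],
                 mass_g = [11, 9, 28, 300])

data.size_category = size_category.(data.mass_g)   # apply the rule to every row at once

smallBirds = data[data.size_category .== "small", :]   # keep only the rows labelled "small"
```

Writing the rule as one function keeps the thresholds in a single place, which makes them easier to change later.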

The DataFrames.jl package is extremely capable, and we are just scratching the surface here of what it can do. A good source to learn more is: https://dataframes.juliadata.org/stable/man/basics/#First-Steps-with-DataFrames.jl