Benchmarking code in Julia


Julia is known to be a highly performant language, with benchmark results quite close to those of C. While that's certainly impressive, it would be even better if we could easily quantify how fast our own code runs. Whether you are a newbie or a seasoned programmer, benchmarking can help identify potential bottlenecks in your code. It can also be used to stress test your shiny new PC (or laptop, in my case). Luckily, there is an excellent package for exactly this, BenchmarkTools.jl, which is what I will be using in the examples shown below.

Hardware

I will do a comparison of the benchmarks between two different platforms that I currently have access to:

  • Laptop - Dell XPS 13 Plus running Ubuntu 22.04, equipped with an Intel Core i5-1240P (12 cores, 16 threads), 16 GB LPDDR5 RAM and 1 TB NVMe SSD.

  • Desktop - AMD Ryzen 5 3600 (6 cores, 12 threads), 32 GB DDR4 RAM, 512 GB NVMe SSD, running on Elementary OS 7 Horus.

The RAM and SSD are unlikely to have any significant impact on the results, so this is mostly going to be a CPU showdown: Intel vs AMD. Note that this is not strictly a one-to-one comparison, since the 12th-gen Intel mobile chip has a limited power budget, which means it might not be able to boost all cores to their maximum clocks. The cooling capacity of such a thin laptop is also rather limited, which could lead to thermal throttling. The desktop-class AMD processor, on the other hand, has much more power headroom to play with. Cooling is not an issue for the desktop either, so all CPU cores should be able to sustain their maximum clocks under load.

Add package

Start the Julia REPL (I am currently using version 1.8.5). Enter the Pkg prompt by typing ], and then add the package:

add BenchmarkTools
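
Alternatively, the same can be done from the regular julia> prompt via the Pkg API:

julia> using Pkg; Pkg.add("BenchmarkTools")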

Load the package

For our tests, we will also make use of LinearAlgebra, which ships with Julia as part of the standard library (so no need to add it separately).

julia> using BenchmarkTools, LinearAlgebra

Test function

Let's create a function that does some operations on a matrix. To summarize: we generate two random matrices, multiply them, invert the product, apply the matrix tangent, and finally factorize the result, which yields an LU decomposition.

We can use the size of the matrix as an input argument. This will prove useful later, since it lets us perform benchmark runs of increasing complexity.

julia> function do_something(size::Int64)
           # generate two random size×size matrices with entries in [0, 1)
           A = rand(size, size)
           B = rand(size, size)

           # multiply them, invert the product, apply the matrix tangent,
           # and factorize the result (an LU decomposition for a general matrix)
           C = A * B
           return C |> inv |> tan |> factorize
       end
do_something (generic function with 1 method)

Sample run:

julia> do_something(5)
LU{Float64, Matrix{Float64}, Vector{Int64}}
L factor:
5×5 Matrix{Float64}:
  1.0         0.0         0.0         0.0      0.0
  0.283496    1.0         0.0         0.0      0.0
  0.0128731  -0.241892    1.0         0.0      0.0
 -0.148442   -0.598821   -0.950991    1.0      0.0
 -0.296558    0.0199563  -0.0782248  -0.42072  1.0
U factor:
5×5 Matrix{Float64}:
 1.51572   0.308015   0.235375  -0.975098  -0.42535
 0.0      -1.1294    -0.103104   0.970615   0.390145
 0.0       0.0       -0.395555   0.688296   0.0435477
 0.0       0.0        0.0        0.893948  -0.129836
 0.0       0.0        0.0        0.0        0.351661

Results will be different on every run since we haven't fixed the seed for random number generation. However, rand(size, size) always generates a square matrix with entries between 0 and 1, and the sequence of operations remains unchanged once the function is defined.
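
If reproducible matrices are ever needed (they are not required for the benchmarks here), the global random number generator can be seeded via the Random standard library:

julia> using Random
julia> Random.seed!(42)   # any fixed integer seed makes subsequent rand() calls reproducible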

Setting up the benchmark

BenchmarkTools.jl allows the use of several configuration parameters. Essentially, we want to execute the above function a certain number of times (samples) so that we have enough statistical accuracy. However, the benchmark might then end up running for too long, so we also limit the total time (seconds) it is allowed to take. A good rule of thumb is to start with a small sample size and then gradually ramp it up. Apart from the extra wall-clock time, there is no real downside to collecting a large number of samples; in this case, the more the merrier!
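
As an aside, BenchmarkTools.jl also offers the @btime and @benchmark macros for quick one-off measurements; the @benchmarkable form used below is convenient when the same benchmark needs to be configured, tuned and re-run:

julia> @btime do_something(100);   # prints the minimum time and allocations using default settings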

Set up the benchmark to run 50 samples within a time budget of 120 seconds. The matrix size is 100.

julia> b_N100_S50 = @benchmarkable do_something(100) seconds=120 samples=50

Test run:
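
A benchmark defined with @benchmarkable is not executed immediately. Roughly, running it looks like this (tune! estimates a suitable number of evaluations per sample before the actual run):

julia> tune!(b_N100_S50)
julia> run(b_N100_S50)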

That was quite fast. Let's increase the sample size to 100.
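
The 100-sample version can be defined in the same way (the variable name below is just illustrative):

julia> b_N100_S100 = @benchmarkable do_something(100) seconds=120 samples=100
julia> run(b_N100_S100)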


Results

Benchmark results are shown as a distribution over all the samples. In principle, executing the same code with a fixed input should always take the same amount of time. In practice, however, there are other sources of noise on the host machine (background processes, frequency scaling, and so on), which only ever add to the execution time. Therefore, instead of the mean, it is often argued that one should look at the minimum time to get an idea of the true baseline. In any case, we get a very nice visual representation of the time taken across all runs, including useful metrics such as the min, max, mean and median.
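
Besides the histogram view, the trial object returned by run can also be queried directly; a minimal sketch, reusing the benchmark defined earlier:

julia> results = run(b_N100_S50);
julia> minimum(results)   # minimum time across all samples, often the most robust estimate
julia> maximum(results)   # worst-case sample, usually dominated by system noise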

Size = 200

Intel

AMD

Results are comparable to each other. Let's see what happens when we increase the size of the matrix.

Size = 400

Intel

Doubling the matrix size quadruples the benchmark time.

AMD

The Ryzen CPU appears to be faster in this case.

Size = 600

Intel

Let's see if we can improve on the statistics with samples = 200.

Not a huge difference in the results, so it seems 100 samples are already good enough.

AMD

The trend continues with the AMD processor being almost twice as fast.

Size = 1000

Intel

AMD

Using multiple threads

By default, Julia starts with only 1 thread. This can be checked by doing the following:

julia> Threads.nthreads()
1

However, the function is able to use 8 threads, presumably because the BLAS library that LinearAlgebra calls into is multithreaded independently of Julia's own threads.
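
Assuming it is indeed the BLAS library doing the multithreaded work, its thread count can be inspected (and changed) independently of Julia's threads:

julia> BLAS.get_num_threads()    # threads used by the BLAS library, not affected by Threads.nthreads()
julia> BLAS.set_num_threads(4)   # optionally pin BLAS to a different thread count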

We can start Julia with multiple threads using the --threads or -t argument:

vikas@Mugetsu:~$ julia --threads 12
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.5 (2023-01-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> Threads.nthreads()
12

Performance does not seem to be impacted by starting Julia with 12 threads, which makes sense here: the heavy lifting is done by BLAS/LAPACK calls, which are multithreaded independently of Julia's thread count. However, there are cases where multiple Julia threads are very useful. I will leave that discussion for another post.


Conclusion

The desktop-class AMD processor easily outperformed the Intel CPU, which is not surprising considering that the Intel chip is limited by power and cooling. The Intel chip did have more cores (and threads), but that didn't help with these benchmarks. I hope you enjoyed reading this post and learned something new in the process. Do subscribe to this series for similar content in the future. Also, please don't forget to give it a heart and share it with other Julia enthusiasts.


References

  1. BenchmarkTools.jl - https://github.com/JuliaCI/BenchmarkTools.jl