GPUInspector.MultiLoggerType

MultiLogger struct which combines normal and error output streams. Useful if you want your normal and error output that is printed to the terminal to also be saved to different files.

Arguments:

  • normal_file_name: Path to normal output file.
  • error_file_name: Path to error output file.
GPUInspector.StressTestBatchedType

GPU stress test (matrix multiplications) in which we try to run for a given time period. We try to keep the CUDA stream continuously busy with matmuls at any point in time. Concretely, we submit batches of matmuls and record a CUDA event after half of them have been submitted. On the host, after submitting a batch, we (non-blockingly) synchronize on, i.e. wait for, the CUDA event and, if we haven't exceeded the desired duration already, submit another batch.

GPUInspector.StressTestEnforcedType

GPU stress test (matrix multiplications) in which we run almost precisely for a given time period (duration is enforced).

GPUInspector.StressTestFixedIterType

GPU stress test (matrix multiplications) in which we run for a given number of iterations, or try to run for a given time period (with potentially high uncertainty!). In the latter case, we estimate how long a synced matmul takes and set niter accordingly.

GPUInspector.StressTestStoreResultsType

GPU stress test (matrix multiplications) in which we store all matmul results and try to run as many iterations as possible for a certain memory limit (default: 90% of free memory).

This stress test is somewhat inspired by gpu-burn by Ville Timonen.

GPUInspector.UnitPrefixedBytesType

Abstract type representing an amount of data, i.e. a certain number of bytes, with a unit prefix (also "metric prefix"). Examples include the SI prefixes, like KB, MB, and GB, but also the binary prefixes (ISO/IEC 80000), like KiB, MiB, and GiB.

See https://en.wikipedia.org/wiki/Binary_prefix for more information.

GPUInspector.alloc_memMethod
alloc_mem(memsize::UnitPrefixedBytes; devs=(CUDA.device(),), dtype=Float32)

Allocates memory on the devices whose IDs are provided via devs. Returns a vector of memory handles (i.e. CuArrays).

Examples:

alloc_mem(MiB(1024)) # allocate on the currently active device
alloc_mem(B(40_000_000); devs=(0,1)) # allocate on GPU0 and GPU1
GPUInspector.bytesMethod
bytes(x::Number)

Returns an appropriate UnitPrefixedBytes object, representing the number of bytes.

Note: This function is type unstable by construction!

See simplify for what "appropriate" means here.
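
Example (output assuming the default base-2 simplification described under simplify):

julia> bytes(40_000_000)
~38.15 MiB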

GPUInspector.bytesMethod
bytes(x::UnitPrefixedBytes)

Return the number of bytes (without prefix) as Float64.
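
Example (a sketch; 1 MiB corresponds to 1024^2 = 1_048_576 bytes):

julia> bytes(MiB(1))
1.048576e6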

GPUInspector.change_baseMethod

Toggle between

  • Base 10, SI prefixes, i.e. factors of 1000
  • Base 2, ISO/IEC prefixes, i.e. factors of 1024

Example:

julia> change_base(KB(13))
~12.7 KiB

julia> change_base(KiB(13))
~13.31 KB
GPUInspector.functionalFunction

Check if CUDA/GPU is available and functional. If not, print some (hopefully useful) debug information.

GPUInspector.get_cpu_statsMethod

Get information about all CPU cores. Returns a vector of vectors. The outer index corresponds to the CPU cores. The inner vectors contain the following entries (in this order):

user nice system idle iowait irq softirq steal guest ?

See proc(5) for more information.

GPUInspector.get_cpu_utilizationFunction
get_cpu_utilization(core=getcpuid(); Δt=0.01)

Get the utilization (in percent) of the given CPU core over the time interval Δt.

GPUInspector.get_cpu_utilizationsFunction
get_cpu_utilizations(cores=0:Sys.CPU_THREADS-1; Δt=0.01)

Get the utilization (in percent) of the given CPU cores over the time interval Δt.

GPUInspector.get_temperatureFunction
get_temperature(device=CUDA.device())

Get current temperature of the given CUDA device in degrees Celsius.

GPUInspector.get_temperaturesFunction
get_temperatures(devices=CUDA.devices())

Get the current temperatures of the given CUDA devices in degrees Celsius.

GPUInspector.gpuidFunction

Get GPU index of the given device.

Note: GPU indices start with zero.

GPUInspector.gpuinfoMethod
gpuinfo(deviceid::Integer)

Print out detailed information about the GPU with the given deviceid.

Heavily inspired by the CUDA sample "deviceQueryDrv.cpp".

GPUInspector.host2device_bandwidthFunction
host2device_bandwidth([memsize::UnitPrefixedBytes=GiB(0.5)]; kwargs...)

Performs a host-to-device memory copy benchmark (time measurement) and returns the host-to-device bandwidth estimate (in GiB/s) derived from it.

Keyword arguments:

  • nbench (default: 10): number of time measurements (i.e. host-to-device memcopies)
  • verbose (default: true): set to false to turn off any printing.
  • stats (default: false): when true, shows statistical information about the benchmark.
  • times (default: false): toggle printing of measured times.
  • dtype (default: Cchar): used data type.
  • io (default: stdout): set the stream where the results should be printed.

Examples:

host2device_bandwidth()
host2device_bandwidth(MiB(1024))
host2device_bandwidth(KiB(20_000); dtype=Int32)
GPUInspector.livemonitor_powerusageMethod
livemonitor_powerusage(duration) -> times, powerusage

Monitor the power usage of GPU(s) (in Watts) over a given time period, as specified by duration (in seconds). Returns the (relative) times as a Vector{Float64} and the power usage as a Vector{Vector{Float64}}.

For general keyword arguments, see livemonitor_something.

GPUInspector.livemonitor_somethingMethod
livemonitor_something(f, duration) -> times, values

Monitor some property of GPU(s), as specified through the function f, over a given time period, as specified by duration (in seconds). Returns the (relative) times as a Vector{Float64} and the measured values as a Vector{Vector{Float64}}.

The function f will be called on a vector of devices (CuDevices or NVML.Devices) and should return a vector of Float64 values.

Keyword arguments:

  • freq (default: 1): polling rate in Hz.
  • devices (default: NVML.devices()): CuDevices or NVML.Devices to consider.
  • plot (default: false): Create a unicode plot after the monitoring.
  • liveplot (default: false): Create and update a unicode plot during the monitoring. Use optional ylims to specify fixed y limits.
  • title (default: ""): Title used in unicode plots.
  • ylabel (default: "Values"): y label used in unicode plots.

See: livemonitor_temperature, livemonitor_powerusage
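
A minimal usage sketch (assuming that get_temperatures, which maps a vector of devices to a vector of Float64, can be passed directly as f):

times, values = livemonitor_something(get_temperatures, 5; devices=CUDA.devices(), freq=2)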

GPUInspector.livemonitor_temperatureMethod
livemonitor_temperature(duration) -> times, temperatures

Monitor the temperature of GPU(s) over a given time period, as specified by duration (in seconds). Returns the (relative) times as a Vector{Float64} and the temperatures as a Vector{Vector{Float64}}.

For general keyword arguments, see livemonitor_something.

GPUInspector.memory_bandwidthFunction
memory_bandwidth([memsize; kwargs...])

Tries to estimate the peak memory bandwidth of a GPU in GiB/s by measuring the time it takes to perform a memcpy of a certain amount of data (as specified by memsize).

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Cchar): element type of the vectors.
  • verbose (default: true): toggle printing.
  • io (default: stdout): set the stream where the results should be printed.

See also: memory_bandwidth_scaling.
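
Example calls (a sketch; memsize is assumed to be a UnitPrefixedBytes, as for host2device_bandwidth):

memory_bandwidth()                       # default memsize
memory_bandwidth(GiB(1); dtype=Float32)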

GPUInspector.memory_bandwidth_saxpyMethod

Tries to estimate the peak memory bandwidth of a GPU in GiB/s by measuring the time it takes to perform a SAXPY, i.e. a * x[i] + y[i].

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Float32): element type of the vectors.
  • size (default: 2^20 * 10): length of the vectors.
  • nbench (default: 5): number of measurements to be performed the best of which is used for the GiB/s computation.
  • verbose (default: true): toggle printing.
  • cublas (default: true): toggle between CUDA.axpy! and a custom saxpy_gpu_kernel!.
  • io (default: stdout): set the stream where the results should be printed.

See also: memory_bandwidth_saxpy_scaling.
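
For illustration, a minimal sketch of the kind of SAXPY kernel and bandwidth estimate this function is based on. The kernel and variable names are made up for this example and are not the package's actual saxpy_gpu_kernel!:

using CUDA

function my_saxpy_kernel!(z, a, x, y)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(z)
        @inbounds z[i] = a * x[i] + y[i]
    end
    return nothing
end

n = 2^20 * 10  # same as the default size
x = CUDA.rand(Float32, n); y = CUDA.rand(Float32, n); z = CUDA.zeros(Float32, n)
# warm-up launch (compilation); a real benchmark would take the best of nbench runs
@cuda threads=1024 blocks=cld(n, 1024) my_saxpy_kernel!(z, 3.1f0, x, y); CUDA.synchronize()
t = @elapsed CUDA.@sync begin
    @cuda threads=1024 blocks=cld(n, 1024) my_saxpy_kernel!(z, 3.1f0, x, y)
end
# three vectors are touched per element: read x, read y, write z
bw_gibs = 3 * n * sizeof(Float32) / t / 2^30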

GPUInspector.monitoring_startMethod
monitoring_start(; devices=CUDA.devices(), verbose=true)

Start monitoring of GPU temperature, utilization, power usage, etc.

Keyword arguments:

  • freq (default: 1): polling rate in Hz.
  • devices (default: CUDA.devices()): CuDevices or NVML.Devices to monitor.
  • thread (default: Threads.nthreads()): id of the Julia thread that should run the monitoring.
  • verbose (default: true): toggle verbose output.

See also monitoring_stop.

GPUInspector.monitoring_stopMethod
monitoring_stop(; verbose=true) -> results

Stops the GPU monitoring and returns the measured values. Specifically, results is a named tuple with the following keys:

  • time: the (relative) times at which we measured
  • temperature, power, compute, mem

See also monitoring_start and plot_monitoring_results.
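
A typical usage sketch, combining the monitoring functions with a stress test (cf. the stresstest examples below):

monitoring_start(; freq=2)
stresstest(CUDA.devices(); duration=10, verbose=false)
results = monitoring_stop()
plot_monitoring_results(results)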

GPUInspector.multi_logFunction

Logging function for MultiLogger. Use this in combination with MultiLogger if you want your normal and error output that is printed to the terminal to also be saved to different files.

Arguments:

  • MultiLogger: Instance of MultiLogger struct.
  • text: Text to be printed to terminal and written to file.
  • is_error (default: false): Flag which decides whether text should be written into normal or error stream.
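
A hedged usage sketch (the constructor call is inferred from the argument list of MultiLogger above; whether is_error is a keyword or an optional positional argument is an assumption here):

ml = MultiLogger("normal.log", "error.log")
multi_log(ml, "Starting benchmark...")
multi_log(ml, "Something went wrong!"; is_error=true)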
GPUInspector.p2p_bandwidthFunction
p2p_bandwidth([memsize::UnitPrefixedBytes]; kwargs...)

Performs a peer-to-peer memory copy benchmark (time measurement) and returns an inter-gpu memory bandwidth estimate (in GiB/s) derived from it.

Keyword arguments:

  • src (default: 0): source device
  • dst (default: 1): destination device
  • nbench (default: 5): number of time measurements (i.e. p2p memcopies)
  • verbose (default: true): set to false to turn off any printing.
  • hist (default: false): when true, a UnicodePlots-based histogram is printed.
  • times (default: false): toggle printing of measured times.
  • alternate (default: false): alternate src and dst, i.e. copy data back and forth.
  • dtype (default: Float32): see alloc_mem.
  • io (default: stdout): set the stream where the results should be printed.

Examples:

p2p_bandwidth()
p2p_bandwidth(MiB(1024))
p2p_bandwidth(KiB(20_000); dtype=Int32)
GPUInspector.peakflops_gpuMethod
peakflops_gpu(; tensorcores=hastensorcores(), kwargs...)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform

  • _kernel_fma_nfmas() * size many FMAs on CUDA cores (if tensorcores == false)
  • _kernel_wmma_nwmmas() many WMMAs on Tensor Cores (if tensorcores == true)

For more keyword argument options see peakflops_gpu_fmas and peakflops_gpu_wmmas.
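
Example calls (based on the signature above):

peakflops_gpu()                     # use Tensor Cores if available
peakflops_gpu(; tensorcores=false)  # force CUDA cores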

GPUInspector.peakflops_gpu_fmasMethod
peakflops_gpu_fmas(; size::Integer=5_000_000, dtype=Float32, nbench=5, nkernel=5, device=CUDA.device(), verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_fma_nfmas() * size many FMAs on CUDA cores.

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Float32): element type of the matrices.
  • size (default: 5_000_000): length of vectors.
  • nkernel (default: 5): number of kernel calls that make up one benchmarking sample.
  • nbench (default: 5): number of measurements to be performed the best of which is used for the TFLOP/s computation.
  • verbose (default: true): toggle printing.
  • io (default: stdout): set the stream where the results should be printed.
GPUInspector.peakflops_gpu_matmulMethod
peakflops_gpu_matmul(; device, dtype=Float32, size=2^14, nmatmuls=5, nbench=5, verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform nmatmuls many (in-place) matrix-matrix multiplications.

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Float32): element type of the matrices.
  • size (default: 2^14): matrices will have dimensions (size, size).
  • nmatmuls (default: 5): number of matmuls that will make up the kernel to be timed.
  • nbench (default: 5): number of measurements to be performed the best of which is used for the TFLOP/s computation.
  • verbose (default: true): toggle printing.
  • io (default: stdout): set the stream where the results should be printed.

See also: peakflops_gpu_matmul_scaling, peakflops_gpu_matmul_graphs.
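
As a rough illustration of the underlying arithmetic (a sketch, not the package's implementation): an n×n matmul performs about 2n^3 FLOPs, so timing a loop of matmuls yields a TFLOP/s estimate as follows.

using CUDA, LinearAlgebra

n = 8192
A = CUDA.rand(Float32, n, n); B = CUDA.rand(Float32, n, n); C = similar(A)
mul!(C, A, B); CUDA.synchronize()  # warm-up (compilation, CUBLAS initialization)
nmatmuls = 5
t = @elapsed CUDA.@sync begin
    for _ in 1:nmatmuls
        mul!(C, A, B)
    end
end
tflops = nmatmuls * 2 * n^3 / t / 1e12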

GPUInspector.peakflops_gpu_matmul_scalingMethod
peakflops_gpu_matmul_scaling(peakflops_func = peakflops_gpu_matmul; verbose=true) -> sizes, flops

Assesses the scaling of the given peakflops_func (defaults to peakflops_gpu_matmul) with increasing matrix size. If verbose=true (default), displays a unicode plot. Returns the considered sizes and the corresponding TFLOP/s. For further options, see peakflops_gpu_matmul.

GPUInspector.peakflops_gpu_wmmasMethod
peakflops_gpu_wmmas()

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_wmma_nwmmas() many WMMAs on Tensor Cores.

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Float16): element type of the matrices. We currently only support Float16 (Int8, :TensorFloat32, :BFloat16, and Float64 might or might not work).
  • nkernel (default: 10): number of kernel calls that make up one benchmarking sample.
  • nbench (default: 5): number of measurements to be performed the best of which is used for the TFLOP/s computation.
  • threads (default: max. threads per block): how many threads to use per block (part of the kernel launch configuration).
  • blocks (default: 2048): how many blocks to use (part of the kernel launch configuration).
  • verbose (default: true): toggle printing.
  • io (default: stdout): set the stream where the results should be printed.
GPUInspector.plot_monitoring_resultsFunction
plot_monitoring_results(r::MonitoringResults, symbols=keys(r.results))

Plot the quantities specified through symbols of a MonitoringResults object. Will generate a textual in-terminal / in-logfile plot using UnicodePlots.jl.

GPUInspector.savefig_monitoring_resultsFunction
savefig_monitoring_results(r::MonitoringResults, symbols=keys(r.results))

Save plots of the quantities specified through symbols of a MonitoringResults object to disk. Note: Only available if CairoMakie.jl is loaded next to GPUInspector.jl.

GPUInspector.simplifyMethod
simplify(x::UnitPrefixedBytes[; base])

Given a UnitPrefixedBytes number x, finds a more appropriate UnitPrefixedBytes representation, i.e. one that represents the same number of bytes with a more suitable prefix and hence a smaller numeric value.

The optional keyword argument base can be used to switch between base 2, i.e. ISO/IEC prefixes (default), and base 10. Allowed values are 2, 10, :SI, :ISO, and :IEC.

Note: This function is type unstable by construction!

Example:

julia> simplify(B(40_000_000))
~38.15 MiB

julia> simplify(B(40_000_000); base=10)
40.0 MB
GPUInspector.stresstestMethod
stresstest(device_or_devices)

Run a GPU stress test (matrix multiplications) on one or multiple GPU devices, as specified by the positional argument. If no argument is provided, (only) the currently active GPU will be used.

Keyword arguments:

Choose one of the following (or none):

  • duration: stress test will take about the given time in seconds. (StressTestBatched)
  • enforced_duration: stress test will take almost precisely the given time in seconds. (StressTestEnforced)
  • approx_duration: stress test will hopefully take approximately the given time in seconds. No promises made! (StressTestFixedIter)
  • niter: stress test will run the given number of matrix-multiplications, however long that will take. (StressTestFixedIter)
  • mem: number (<:Real) between 0 and 1, indicating the fraction of the available GPU memory that should be used, or a <:UnitPrefixedBytes indicating an absolute memory limit. (StressTestStoreResults)

General settings:

  • dtype (default: Float32): element type of the matrices
  • monitoring (default: false): enable automatic monitoring, in which case a MonitoringResults object is returned.
  • size (default: 2048): matrices of size (size, size) will be used
  • verbose (default: true): toggle printing of information
  • parallel (default: true): If true, will (try to) run each GPU test on a different Julia thread. Make sure to have enough Julia threads.
  • threads (default: nothing): If parallel == true, this argument may be used to specify the Julia threads to use.
  • clearmem (default: false): If true, we call clear_all_gpus_memory after the stress test.
  • io (default: stdout): set the stream where the results should be printed.

When duration is specified (i.e. StressTestBatched), there is also:

  • batch_duration (default: ceil(Int, duration/10)): desired duration of one batch of matmuls.
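
Example calls (a sketch based on the keyword options above):

stresstest(CUDA.device(); duration=60)                    # ~60 s on the current GPU
stresstest(CUDA.devices(); duration=60, monitoring=true)  # all GPUs; returns a MonitoringResults object
stresstest(CUDA.devices(); niter=100, clearmem=true)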
GPUInspector.stresstest_cpuMethod
stresstest_cpu(core_or_cores)

Run a CPU stress test (matrix multiplications) on one or multiple CPU cores, as specified by the positional argument. If no argument is provided, (only) the currently active CPU core will be used.

Keyword arguments:

  • duration: stress test will take about the given time in seconds.
  • dtype (default: Float64): element type of the matrices
  • size (default: floor(Int, sqrt(L2_cachesize() / sizeof(dtype)))): matrices of size (size, size) will be used
  • verbose (default: true): toggle printing of information
  • parallel (default: true): If true, will (try to) run each CPU core test on a different Julia thread. Make sure to have enough Julia threads.
  • threads (default: nothing): If parallel == true, this argument may be used to specify the Julia threads to use.
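
Example calls (a sketch; cores are addressed by their (zero-based) IDs, cf. getcpuid):

stresstest_cpu(0; duration=10)                    # a single core
stresstest_cpu(0:Sys.CPU_THREADS-1; duration=10)  # all cores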
GPUInspector.theoretical_peakflops_gpuMethod

Estimates the theoretical peak performance of a CUDA device in TFLOP/s.

Keyword arguments:

  • tensorcores (default: hastensorcores()): toggle usage of tensor cores. If false, CUDA cores will be used.
  • verbose (default: true): toggle printing of information
  • device (default: device()): CUDA device to be analyzed
  • dtype (default: tensorcores ? Float16 : Float32): element type of the matrices
  • io (default: stdout): set the stream where the results should be printed.
GPUInspector.toggle_tensorcoremathFunction
toggle_tensorcoremath([enable::Bool; verbose=true])

Switches the CUDA.math_mode between CUDA.FAST_MATH (enable=true) and CUDA.DEFAULT_MATH (enable=false). For matmuls of CuArray{Float32}s, this should have the effect of enabling (FAST_MATH) or disabling (DEFAULT_MATH) the use of tensor cores. Of course, this only works on supported devices and CUDA versions.

If no arguments are provided, this function toggles between the two math modes.

GPUInspector.@unrollMacro
@unroll N expr

Takes a for loop as expr and informs the LLVM unroller to unroll it N times, if it is safe to do so.

GPUInspector.@unrollMacro

@unroll expr

Takes a for loop as expr and informs the LLVM unroller to fully unroll it, if it is safe to do so and the loop count is known.
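
A usage sketch of the N-times form (illustrative only; the helper function is made up for this example):

function unrolled_sum(x)
    s = zero(eltype(x))
    @unroll 4 for i in eachindex(x)
        s += x[i]
    end
    return s
end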

GPUInspector.@workerMacro
@worker pid ex

Spawns the given command on the given worker process.

Examples:

@worker 3 GPUInspector.functional()
@worker 3 stresstest(CUDA.devices(); duration=10, verbose=false)
GPUInspector.@workerMacro
@worker ex

Creates a worker process, spawns the given command on it, and kills the worker process once the command has finished execution.

Implementation: a Julia thread (we use @spawn) will be used to wait on the task and kill the worker.

Examples:

@worker GPUInspector.functional()
@worker stresstest(CUDA.devices(); duration=10, verbose=false)
GPUInspector.@worker_createMacro
@worker_create n -> pids

Create n workers (i.e. separate Julia processes) and execute using GPUInspector, CUDA on all of them. Returns the pids of the created workers.
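
Example (sketch):

pids = @worker_create 2

The returned pids can subsequently be used with the @worker macro.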