GPUInspector.B — Type
Bytes.
GPUInspector.GB — Type
Gigabytes, i.e. 10^9 = 1000^3 bytes.
GPUInspector.GiB — Type
Gibibytes, i.e. 2^30 = 1024^3 bytes.
GPUInspector.KB — Type
Kilobytes, i.e. 10^3 = 1000 bytes.
GPUInspector.KiB — Type
Kibibytes, i.e. 2^10 = 1024 bytes.
GPUInspector.MB — Type
Megabytes, i.e. 10^6 = 1000^2 bytes.
GPUInspector.MiB — Type
Mebibytes, i.e. 2^20 = 1024^2 bytes.
GPUInspector.MultiLogger — Type
MultiLogger struct which combines normal and error output streams. Useful if you want the normal and error output that is printed to the terminal to also be saved to different files.
Arguments:
- normal_file_name: Path to the normal output file.
- error_file_name: Path to the error output file.
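A minimal usage sketch combining MultiLogger with multi_log; the exact constructor call and the way is_error is passed are assumptions, since neither signature is spelled out in this reference:

```julia
# Assumption: the constructor takes the two file paths positionally,
# in the argument order documented above. File names are placeholders.
logger = MultiLogger("normal.log", "error.log")

multi_log(logger, "benchmark started")                    # terminal + normal.log
multi_log(logger, "GPU 1 not responding"; is_error=true)  # terminal + error.log
                                                          # (is_error as a keyword is an assumption)
```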
GPUInspector.StressTestBatched — Type
GPU stress test (matrix multiplications) in which we try to run for a given time period. We try to keep the CUDA stream continuously busy with matmuls at any point in time. Concretely, we submit batches of matmuls and, after half of them, we record a CUDA event. On the host, after submitting a batch, we (non-blockingly) synchronize on, i.e. wait for, the CUDA event and, if we haven't already exceeded the desired duration, submit another batch.
GPUInspector.StressTestEnforced — Type
GPU stress test (matrix multiplications) in which we run almost precisely for a given time period (the duration is enforced).
GPUInspector.StressTestFixedIter — Type
GPU stress test (matrix multiplications) in which we run for a given number of iterations, or try to run for a given time period (with potentially high uncertainty!). In the latter case, we estimate how long a synced matmul takes and set niter accordingly.
GPUInspector.StressTestStoreResults — Type
GPU stress test (matrix multiplications) in which we store all matmul results and try to run as many iterations as possible within a certain memory limit (default: 90% of the free memory).
This stress test is somewhat inspired by gpu-burn by Ville Timonen.
GPUInspector.TB — Type
Terabytes, i.e. 10^12 = 1000^4 bytes.
GPUInspector.TiB — Type
Tebibytes, i.e. 2^40 = 1024^4 bytes.
GPUInspector.UnitPrefixedBytes — Type
Abstract type representing an amount of data, i.e. a certain number of bytes, with a unit prefix (also "metric prefix"). Examples include the SI prefixes, like KB, MB, and GB, but also the binary prefixes (ISO/IEC 80000), like KiB, MiB, and GiB.
See https://en.wikipedia.org/wiki/Binary_prefix for more information.
GPUInspector.alloc_mem — Method
alloc_mem(memsize::UnitPrefixedBytes; devs=(CUDA.device(),), dtype=Float32)
Allocates memory on the devices whose IDs are provided via devs. Returns a vector of memory handles (i.e. CuArrays).
Examples:
alloc_mem(MiB(1024)) # allocate on the currently active device
alloc_mem(B(40_000_000); devs=(0,1)) # allocate on GPU0 and GPU1
GPUInspector.bytes — Method
bytes(x::Number)
Returns an appropriate UnitPrefixedBytes object representing the given number of bytes.
Note: This function is type unstable by construction!
See simplify for what "appropriate" means here.
GPUInspector.bytes — Method
bytes(x::UnitPrefixedBytes)
Returns the number of bytes (without prefix) as a Float64.
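To illustrate the two methods (the numeric values follow directly from the prefix definitions above):

```julia
bytes(KiB(1))      # 1024.0, since 1 KiB = 2^10 bytes
bytes(MiB(1))      # 1.048576e6, since 1 MiB = 2^20 bytes
bytes(40_000_000)  # some appropriate UnitPrefixedBytes object; type unstable by design
```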
GPUInspector.change_base — Method
Toggle between
- Base 10, SI prefixes, i.e. factors of 1000
- Base 2, ISO/IEC prefixes, i.e. factors of 1024
Example:
julia> change_base(KB(13))
~12.7 KiB
julia> change_base(KiB(13))
~13.31 KB
GPUInspector.clear_all_gpus_memory — Function
Reclaim the unused memory of all available GPUs.
GPUInspector.clear_gpu_memory — Function
Reclaim the unused memory of the currently active GPU (i.e. device()).
GPUInspector.functional — Function
Check if CUDA/GPU is available and functional. If not, print some (hopefully useful) debug information.
GPUInspector.get_cpu_stats — Method
Get information about all CPU cores. Returns a vector of vectors. The outer index corresponds to CPU cores. The inner vector contains the following information (in that order):
user nice system idle iowait irq softirq steal guest ?
See proc(5) for more information.
GPUInspector.get_cpu_utilization — Function
get_cpu_utilization(core=getcpuid(); Δt=0.01)
Get the utilization (in percent) of the given CPU core over a certain time interval Δt.
GPUInspector.get_cpu_utilizations — Function
get_cpu_utilizations(cores=0:Sys.CPU_THREADS-1; Δt=0.01)
Get the utilization (in percent) of the given CPU cores over a certain time interval Δt.
GPUInspector.get_cpusocket_temperatures — Method
Tries to get the temperatures of the available CPUs (sockets, not cores) in degrees Celsius.
Based on cat /sys/class/thermal/thermal_zone*/temp.
GPUInspector.get_gpu_utilization — Function
get_gpu_utilization(device=CUDA.device())
Get the current utilization of the given CUDA device in percent.
GPUInspector.get_gpu_utilizations — Function
get_gpu_utilizations(devices=CUDA.devices())
Get the current utilization of the given CUDA devices in percent.
GPUInspector.get_power_usage — Method
get_power_usage(device=CUDA.device())
Get current power usage of the given CUDA device in Watts.
GPUInspector.get_power_usages — Function
get_power_usages(devices=CUDA.devices())
Get current power usage of the given CUDA devices in Watts.
GPUInspector.get_temperature — Function
get_temperature(device=CUDA.device())
Get current temperature of the given CUDA device in degrees Celsius.
GPUInspector.get_temperatures — Function
get_temperatures(devices=CUDA.devices())
Get current temperature of the given CUDA devices in degrees Celsius.
GPUInspector.gpuid — Function
Get the GPU index of the given device.
Note: GPU indices start at zero.
GPUInspector.gpuinfo — Method
gpuinfo(deviceid::Integer)
Print out detailed information about the GPU with the given deviceid.
Heavily inspired by the CUDA sample "deviceQueryDrv.cpp".
GPUInspector.gpuinfo_p2p_access — Method
Query peer-to-peer (i.e. inter-GPU) access support.
GPUInspector.gpus — Method
List the available GPUs.
GPUInspector.hastensorcores — Function
Checks whether the given CuDevice has Tensor Cores.
GPUInspector.host2device_bandwidth — Function
host2device_bandwidth([memsize::UnitPrefixedBytes=GiB(0.5)]; kwargs...)
Performs a host-to-device memory copy benchmark (time measurement) and returns the host-to-device bandwidth estimate (in GiB/s) derived from it.
Keyword arguments:
- nbench (default: 10): number of time measurements (i.e. memcopies).
- verbose (default: true): set to false to turn off any printing.
- stats (default: false): when true, shows statistical information about the benchmark.
- times (default: false): toggle printing of measured times.
- dtype (default: Cchar): used data type.
- io (default: stdout): set the stream where the results should be printed.
Examples:
host2device_bandwidth()
host2device_bandwidth(MiB(1024))
host2device_bandwidth(KiB(20_000); dtype=Int32)
GPUInspector.ismonitoring — Method
Checks if we are currently monitoring.
GPUInspector.kernel_fma — Method
Dummy kernel doing _kernel_fma_nfmas() many FMAs (default: 100_000).
GPUInspector.livemonitor_powerusage — Method
livemonitor_powerusage(duration) -> times, powerusage
Monitor the power usage of GPU(s) (in Watts) over a given time period, as specified by duration (in seconds). Returns the (relative) times as a Vector{Float64} and the power usage as a Vector{Vector{Float64}}.
For general keyword arguments, see livemonitor_something.
GPUInspector.livemonitor_something — Method
livemonitor_something(f, duration) -> times, values
Monitor some property of GPU(s), as specified through the function f, over a given time period, as specified by duration (in seconds). Returns the (relative) times as a Vector{Float64} and the values as a Vector{Vector{Float64}}.
The function f will be called on a vector of devices (CuDevices or NVML.Devices) and should return a vector of Float64 values.
Keyword arguments:
- freq (default: 1): polling rate in Hz.
- devices (default: NVML.devices()): CuDevices or NVML.Devices to consider.
- plot (default: false): create a unicode plot after the monitoring.
- liveplot (default: false): create and update a unicode plot during the monitoring. Use the optional ylims to specify fixed y limits.
- title (default: ""): title used in unicode plots.
- ylabel (default: "Values"): y label used in unicode plots.
GPUInspector.livemonitor_temperature — Method
livemonitor_temperature(duration) -> times, temperatures
Monitor the temperature of GPU(s) over a given time period, as specified by duration (in seconds). Returns the (relative) times as a Vector{Float64} and the temperatures as a Vector{Vector{Float64}}.
For general keyword arguments, see livemonitor_something.
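For example, one might watch the temperatures for half a minute while a workload runs elsewhere; the indexing convention in the comments is an assumption based on the documented return types:

```julia
times, temps = livemonitor_temperature(30; plot=true)  # monitor for 30 seconds
# times :: Vector{Float64}, temps :: Vector{Vector{Float64}}
# temps[i] holds one temperature per monitored device at time times[i] (assumed layout)
```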
GPUInspector.load_monitoring_results — Method
Given an HDF5 file created with save_monitoring_results, restore the saved monitoring results (i.e. the output of monitoring_stop).
GPUInspector.memory_bandwidth — Function
memory_bandwidth([memsize; kwargs...])
Tries to estimate the peak memory bandwidth of a GPU in GiB/s by measuring the time it takes to perform a memcpy of a certain amount of data (as specified by memsize).
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Cchar): element type of the vectors.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
See also: memory_bandwidth_scaling.
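A brief usage sketch based on the signature above:

```julia
memory_bandwidth()                            # default data size, prints and returns GiB/s
bw = memory_bandwidth(GiB(1); verbose=false)  # 1 GiB memcpy, no printing
```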
GPUInspector.memory_bandwidth_saxpy — Method
Tries to estimate the peak memory bandwidth of a GPU in GiB/s by measuring the time it takes to perform a SAXPY, i.e. a * x[i] + y[i].
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the vectors.
- size (default: 2^20 * 10): length of the vectors.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the GiB/s computation.
- verbose (default: true): toggle printing.
- cublas (default: true): toggle between CUDA.axpy! and a custom saxpy_gpu_kernel!.
- io (default: stdout): set the stream where the results should be printed.
See also: memory_bandwidth_saxpy_scaling.
GPUInspector.memory_bandwidth_saxpy_scaling — Method
memory_bandwidth_saxpy_scaling() -> sizes, bandwidths
Measures the memory bandwidth (via memory_bandwidth_saxpy) as a function of vector length. If verbose=true (default), displays a unicode plot. Returns the considered lengths and GiB/s. For further options, see memory_bandwidth_saxpy.
GPUInspector.memory_bandwidth_scaling — Method
memory_bandwidth_scaling() -> datasizes, bandwidths
Measures the memory bandwidth (via memory_bandwidth) as a function of data size. If verbose=true (default), displays a unicode plot. Returns the considered data sizes and GiB/s. For further options, see memory_bandwidth.
GPUInspector.monitoring_start — Method
monitoring_start(; devices=CUDA.devices(), verbose=true)
Start monitoring of GPU temperature, utilization, power usage, etc.
Keyword arguments:
- freq (default: 1): polling rate in Hz.
- devices (default: CUDA.devices()): CuDevices or NVML.Devices to monitor.
- thread (default: Threads.nthreads()): id of the Julia thread that should run the monitoring.
- verbose (default: true): toggle verbose output.
See also monitoring_stop.
GPUInspector.monitoring_stop — Method
monitoring_stop(; verbose=true) -> results
Stops the GPU monitoring and returns the measured values. Specifically, results is a named tuple with the following keys:
- time: the (relative) times at which we measured
- temperature, power, compute, mem
See also monitoring_start and plot_monitoring_results.
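A typical monitoring workflow, using only functions documented in this reference (the filename is a placeholder):

```julia
monitoring_start(; freq=2)               # poll at 2 Hz on a separate Julia thread
stresstest(CUDA.device(); duration=10)   # the workload we want to observe
results = monitoring_stop()

plot_monitoring_results(results)            # unicode plots in the terminal
save_monitoring_results("run.h5", results)  # persist to HDF5 ("run.h5" is a placeholder)
```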
GPUInspector.multi_log — Function
Logging function for MultiLogger. Use this in combination with MultiLogger if you want the normal and error output that is printed to the terminal to also be saved to different files.
Arguments:
- MultiLogger: instance of the MultiLogger struct.
- text: text to be printed to the terminal and written to file.
- is_error (default: false): flag which decides whether text should be written into the normal or the error stream.
GPUInspector.p2p_bandwidth — Function
p2p_bandwidth([memsize::UnitPrefixedBytes]; kwargs...)
Performs a peer-to-peer memory copy benchmark (time measurement) and returns an inter-GPU memory bandwidth estimate (in GiB/s) derived from it.
Keyword arguments:
- src (default: 0): source device.
- dst (default: 1): destination device.
- nbench (default: 5): number of time measurements (i.e. p2p memcopies).
- verbose (default: true): set to false to turn off any printing.
- hist (default: false): when true, a UnicodePlots-based histogram is printed.
- times (default: false): toggle printing of measured times.
- alternate (default: false): alternate src and dst, i.e. copy data back and forth.
- dtype (default: Float32): see alloc_mem.
- io (default: stdout): set the stream where the results should be printed.
Examples:
p2p_bandwidth()
p2p_bandwidth(MiB(1024))
p2p_bandwidth(KiB(20_000); dtype=Int32)
GPUInspector.p2p_bandwidth_all — Method
p2p_bandwidth_all(args...; kwargs...)
Run p2p_bandwidth for all combinations of devices. Returns a matrix with the p2p memory bandwidth estimates.
GPUInspector.p2p_bandwidth_bidirectional — Function
Same as p2p_bandwidth but measures the bidirectional bandwidth (copying data back and forth).
GPUInspector.p2p_bandwidth_bidirectional_all — Method
Same as p2p_bandwidth_all but measures the bidirectional bandwidth (copying data back and forth).
GPUInspector.peakflops_gpu — Method
peakflops_gpu(; tensorcores=hastensorcores(), kwargs...)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform
- _kernel_fma_nfmas() * size many FMAs on CUDA cores (if tensorcores == false)
- _kernel_wmma_nwmmas() many WMMAs on Tensor Cores (if tensorcores == true)
For more keyword argument options see peakflops_gpu_fmas and peakflops_gpu_wmmas.
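Based on the signature above:

```julia
peakflops_gpu()                     # Tensor Cores if available (hastensorcores())
peakflops_gpu(; tensorcores=false)  # force the FMA benchmark on CUDA cores
```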
GPUInspector.peakflops_gpu_fmas — Method
peakflops_gpu_fmas(; size::Integer=5_000_000, dtype=Float32, nbench=5, nkernel=5, device=CUDA.device(), verbose=true)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_fma_nfmas() * size many FMAs on CUDA cores.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the matrices.
- size (default: 5_000_000): length of the vectors.
- nkernel (default: 5): number of kernel calls that make up one benchmarking sample.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
GPUInspector.peakflops_gpu_matmul — Method
peakflops_gpu_matmul(; device, dtype=Float32, size=2^14, nmatmuls=5, nbench=5, verbose=true)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform nmatmuls many (in-place) matrix-matrix multiplications.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the matrices.
- size (default: 2^14): matrices will have dimensions (size, size).
- nmatmuls (default: 5): number of matmuls that will make up the kernel to be timed.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
See also: peakflops_gpu_matmul_scaling, peakflops_gpu_matmul_graphs.
GPUInspector.peakflops_gpu_matmul_graphs — Method
Same as peakflops_gpu_matmul but uses CUDA's graph API to define and launch the kernel.
See also: peakflops_gpu_matmul_scaling.
GPUInspector.peakflops_gpu_matmul_scaling — Method
peakflops_gpu_matmul_scaling(peakflops_func=peakflops_gpu_matmul; verbose=true) -> sizes, flops
Measures the scaling of the given peakflops_func (defaults to peakflops_gpu_matmul) with increasing matrix size. If verbose=true (default), displays a unicode plot. Returns the considered sizes and TFLOP/s. For further options, see peakflops_gpu_matmul.
GPUInspector.peakflops_gpu_wmmas — Method
peakflops_gpu_wmmas()
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_wmma_nwmmas() many WMMAs on Tensor Cores.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float16): element type of the matrices. We currently only support Float16 (Int8, :TensorFloat32, :BFloat16, and Float64 might or might not work).
- nkernel (default: 10): number of kernel calls that make up one benchmarking sample.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- threads (default: max. threads per block): how many threads to use per block (part of the kernel launch configuration).
- blocks (default: 2048): how many blocks to use (part of the kernel launch configuration).
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
GPUInspector.plot_monitoring_results — Function
plot_monitoring_results(r::MonitoringResults, symbols=keys(r.results))
Plot the quantities specified through symbols of a MonitoringResults object. Will generate a textual in-terminal / in-logfile plot using UnicodePlots.jl.
GPUInspector.save_monitoring_results — Method
save_monitoring_results(filename::String, r::MonitoringResults; overwrite=false)
Store the given MonitoringResults (output of monitoring_stop) to disk as an HDF5 file with the name filename.
GPUInspector.savefig_monitoring_results — Function
savefig_monitoring_results(r::MonitoringResults, symbols=keys(r.results))
Save plots of the quantities specified through symbols of a MonitoringResults object to disk. Note: Only available if CairoMakie.jl is loaded next to GPUInspector.jl.
GPUInspector.simplify — Method
simplify(x::UnitPrefixedBytes[; base])
Given a UnitPrefixedBytes number x, finds a more appropriate UnitPrefixedBytes representation of the same number of bytes but with a smaller value.
The optional keyword argument base can be used to switch between base 2, i.e. ISO/IEC prefixes (default), and base 10. Allowed values are 2, 10, :SI, :ISO, and :IEC.
Note: This function is type unstable by construction!
Example:
julia> simplify(B(40_000_000))
~38.15 MiB
julia> simplify(B(40_000_000); base=10)
40.0 MB
GPUInspector.stresstest — Method
stresstest(device_or_devices)
Run a GPU stress test (matrix multiplications) on one or multiple GPU devices, as specified by the positional argument. If no argument is provided, (only) the currently active GPU will be used.
Keyword arguments:
Choose one of the following (or none):
- duration: the stress test will take about the given time in seconds. (StressTestBatched)
- enforced_duration: the stress test will take almost precisely the given time in seconds. (StressTestEnforced)
- approx_duration: the stress test will hopefully take approximately the given time in seconds. No promises made! (StressTestFixedIter)
- niter: the stress test will run the given number of matrix multiplications, however long that will take. (StressTestFixedIter)
- mem: a number (<:Real) between 0 and 1, indicating the fraction of the available GPU memory that should be used, or a <:UnitPrefixedBytes indicating an absolute memory limit. (StressTestStoreResults)
General settings:
- dtype (default: Float32): element type of the matrices.
- monitoring (default: false): enable automatic monitoring, in which case a MonitoringResults object is returned.
- size (default: 2048): matrices of size (size, size) will be used.
- verbose (default: true): toggle printing of information.
- parallel (default: true): if true, will (try to) run each GPU test on a different Julia thread. Make sure to have enough Julia threads.
- threads (default: nothing): if parallel == true, this argument may be used to specify the Julia threads to use.
- clearmem (default: false): if true, we call clear_all_gpus_memory after the stress test.
- io (default: stdout): set the stream where the results should be printed.
When duration is specified (i.e. StressTestBatched) there is also:
- batch_duration (default: ceil(Int, duration/10)): desired duration of one batch of matmuls.
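Some example invocations, derived from the options above:

```julia
stresstest(CUDA.device(); duration=60)  # ~60 seconds on the active GPU (StressTestBatched)
stresstest(CUDA.devices(); niter=1000)  # fixed number of matmuls on all GPUs (StressTestFixedIter)
results = stresstest(CUDA.device(); duration=30, monitoring=true)  # also returns MonitoringResults
```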
GPUInspector.stresstest_cpu — Method
stresstest_cpu(core_or_cores)
Run a CPU stress test (matrix multiplications) on one or multiple CPU cores, as specified by the positional argument. If no argument is provided, (only) the currently active CPU core will be used.
Keyword arguments:
- duration: the stress test will take about the given time in seconds.
- dtype (default: Float64): element type of the matrices.
- size (default: floor(Int, sqrt(L2_cachesize() / sizeof(dtype)))): matrices of size (size, size) will be used.
- verbose (default: true): toggle printing of information.
- parallel (default: true): if true, will (try to) run each CPU core test on a different Julia thread. Make sure to have enough Julia threads.
- threads (default: nothing): if parallel == true, this argument may be used to specify the Julia threads to use.
GPUInspector.theoretical_memory_bandwidth — Function
theoretical_memory_bandwidth(; device::CuDevice=CUDA.device(), verbose=true)
Estimates the theoretical maximal GPU memory bandwidth in GiB/s.
GPUInspector.theoretical_peakflops_gpu — Method
Estimates the theoretical peak performance of a CUDA device in TFLOP/s.
Keyword arguments:
- tensorcores (default: hastensorcores()): toggle usage of Tensor Cores. If false, CUDA cores will be used.
- verbose (default: true): toggle printing of information.
- device (default: device()): CUDA device to be analyzed.
- dtype (default: tensorcores ? Float16 : Float32): element type of the matrices.
- io (default: stdout): set the stream where the results should be printed.
GPUInspector.toggle_tensorcoremath — Function
toggle_tensorcoremath([enable::Bool]; verbose=true)
Switches the CUDA.math_mode between CUDA.FAST_MATH (enable=true) and CUDA.DEFAULT_MATH (enable=false). For matmuls of CuArray{Float32}s, this should have the effect of enabling and disabling the use of tensor cores. Of course, this only works on supported devices and CUDA versions.
If no argument is provided, this function toggles between the two math modes.
GPUInspector.@unroll — Macro
@unroll N expr
Takes a for loop as expr and informs the LLVM unroller to unroll it N times, if it is safe to do so.
GPUInspector.@unroll — Macro
@unroll expr
Takes a for loop as expr and informs the LLVM unroller to fully unroll it, if it is safe to do so and the loop count is known.
GPUInspector.@worker — Macro
@worker pid ex
Spawns the given command on the given worker process.
Examples:
@worker 3 GPUInspector.functional()
@worker 3 stresstest(CUDA.devices(); duration=10, verbose=false)
GPUInspector.@worker — Macro
@worker ex
Creates a worker process, spawns the given command on it, and kills the worker process once the command has finished execution.
Implementation: a Julia thread (we use @spawn) will be used to wait on the task and kill the worker.
Examples:
@worker GPUInspector.functional()
@worker stresstest(CUDA.devices(); duration=10, verbose=false)
GPUInspector.@worker_create — Macro
@worker_create n -> pids
Create n workers (i.e. separate Julia processes) and execute using GPUInspector, CUDA on all of them. Returns the pids of the created workers.
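A sketch of the worker lifecycle; whether @worker accepts a non-literal pid expression such as pids[1] is an assumption (the documented examples use literal ids):

```julia
pids = @worker_create 2                    # two worker processes with GPUInspector and CUDA loaded
@worker pids[1] GPUInspector.functional()  # assumed: pid may be given as an expression
@worker_killall                            # clean up all workers when done
```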
GPUInspector.@worker_killall — Macro
Kills all Julia workers.