GPUInspector.B — Type
Bytes.
GPUInspector.GB — Type
Gigabytes, i.e. 10^9 = 1000^3 bytes.
GPUInspector.GiB — Type
Gibibytes, i.e. 2^30 = 1024^3 bytes.
GPUInspector.KB — Type
Kilobytes, i.e. 10^3 = 1000 bytes.
GPUInspector.KiB — Type
Kibibytes, i.e. 2^10 = 1024 bytes.
GPUInspector.MB — Type
Megabytes, i.e. 10^6 = 1000^2 bytes.
GPUInspector.MiB — Type
Mebibytes, i.e. 2^20 = 1024^2 bytes.
GPUInspector.MultiLogger — Type
MultiLogger struct which combines normal and error output streams. Useful if you want the normal and error output that is printed to the terminal to also be saved to different files.
Arguments:
- normal_file_name: Path to the normal output file.
- error_file_name: Path to the error output file.
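A minimal usage sketch combining MultiLogger with multi_log; the exact constructor call and the way is_error is passed are assumptions, since neither signature is spelled out in this reference:

```julia
# Assumption: the constructor takes the two file paths positionally,
# in the argument order documented above. File names are placeholders.
logger = MultiLogger("normal.log", "error.log")

multi_log(logger, "benchmark started")                    # terminal + normal.log
multi_log(logger, "GPU 1 not responding"; is_error=true)  # terminal + error.log
                                                          # (is_error as a keyword is an assumption)
```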
GPUInspector.StressTestBatched — Type
GPU stress test (matrix multiplications) in which we try to run for a given time period. We try to keep the CUDA stream continuously busy with matmuls at any point in time. Concretely, we submit batches of matmuls and, after half of them, we record a CUDA event. On the host, after submitting a batch, we (non-blockingly) synchronize on, i.e. wait for, the CUDA event and, if we haven't already exceeded the desired duration, submit another batch.
GPUInspector.StressTestEnforced — Type
GPU stress test (matrix multiplications) in which we run almost precisely for a given time period (the duration is enforced).
GPUInspector.StressTestFixedIter — Type
GPU stress test (matrix multiplications) in which we run for a given number of iterations, or try to run for a given time period (with potentially high uncertainty!). In the latter case, we estimate how long a synced matmul takes and set niter accordingly.
GPUInspector.StressTestStoreResults — Type
GPU stress test (matrix multiplications) in which we store all matmul results and try to run as many iterations as possible within a certain memory limit (default: 90% of the free memory).
This stress test is somewhat inspired by gpu-burn by Ville Timonen.
GPUInspector.TB — Type
Terabytes, i.e. 10^12 = 1000^4 bytes.
GPUInspector.TiB — Type
Tebibytes, i.e. 2^40 = 1024^4 bytes.
GPUInspector.UnitPrefixedBytes — Type
Abstract type representing an amount of data, i.e. a certain number of bytes, with a unit prefix (also "metric prefix"). Examples include the SI prefixes, like KB, MB, and GB, but also the binary prefixes (ISO/IEC 80000), like KiB, MiB, and GiB.
See https://en.wikipedia.org/wiki/Binary_prefix for more information.
GPUInspector.alloc_mem — Method
alloc_mem(memsize::UnitPrefixedBytes; devs=(CUDA.device(),), dtype=Float32)
Allocates memory on the devices whose IDs are provided via devs. Returns a vector of memory handles (i.e. CuArrays).
Examples:
alloc_mem(MiB(1024)) # allocate on the currently active device
alloc_mem(B(40_000_000); devs=(0,1)) # allocate on GPU0 and GPU1
GPUInspector.bytes — Method
bytes(x::Number)
Returns an appropriate UnitPrefixedBytes object representing the given number of bytes.
Note: This function is type unstable by construction!
See simplify for what "appropriate" means here.
GPUInspector.bytes — Method
bytes(x::UnitPrefixedBytes)
Returns the number of bytes (without prefix) as a Float64.
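To illustrate the two methods (the numeric values follow directly from the prefix definitions above):

```julia
bytes(KiB(1))      # 1024.0, since 1 KiB = 2^10 bytes
bytes(MiB(1))      # 1.048576e6, since 1 MiB = 2^20 bytes
bytes(40_000_000)  # some appropriate UnitPrefixedBytes object; type unstable by design
```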
GPUInspector.change_base — Method
Toggle between
- Base 10, SI prefixes, i.e. factors of 1000
- Base 2, ISO/IEC prefixes, i.e. factors of 1024
Example:
julia> change_base(KB(13))
~12.7 KiB
julia> change_base(KiB(13))
~13.31 KB
GPUInspector.clear_all_gpus_memory — Function
Reclaim the unused memory of all available GPUs.
GPUInspector.clear_gpu_memory — Function
Reclaim the unused memory of the currently active GPU (i.e. device()).
GPUInspector.functional — Function
Check if CUDA/GPU is available and functional. If not, print some (hopefully useful) debug information.
GPUInspector.get_cpu_stats — Method
Get information about all CPU cores. Returns a vector of vectors. The outer index corresponds to CPU cores. The inner vector contains the following information (in that order):
user nice system idle iowait irq softirq steal guest ?
See proc(5) for more information.
GPUInspector.get_cpu_utilization — Function
get_cpu_utilization(core=getcpuid(); Δt=0.01)
Get the utilization (in percent) of the given CPU core over a certain time interval Δt.
GPUInspector.get_cpu_utilizations — Function
get_cpu_utilizations(cores=0:Sys.CPU_THREADS-1; Δt=0.01)
Get the utilization (in percent) of the given CPU cores over a certain time interval Δt.
GPUInspector.get_cpusocket_temperatures — Method
Tries to get the temperatures of the available CPUs (sockets, not cores) in degrees Celsius.
Based on cat /sys/class/thermal/thermal_zone*/temp.
GPUInspector.get_gpu_utilization — Function
get_gpu_utilization(device=CUDA.device())
Get the current utilization of the given CUDA device in percent.
GPUInspector.get_gpu_utilizations — Function
get_gpu_utilizations(devices=CUDA.devices())
Get the current utilization of the given CUDA devices in percent.
GPUInspector.get_power_usage — Method
get_power_usage(device=CUDA.device())
Get current power usage of the given CUDA device in Watts.
GPUInspector.get_power_usages — Function
get_power_usages(devices=CUDA.devices())
Get current power usage of the given CUDA devices in Watts.
GPUInspector.get_temperature — Function
get_temperature(device=CUDA.device())
Get current temperature of the given CUDA device in degrees Celsius.
GPUInspector.get_temperatures — Function
get_temperatures(devices=CUDA.devices())
Get current temperature of the given CUDA devices in degrees Celsius.
GPUInspector.gpuid — Function
Get the GPU index of the given device.
Note: GPU indices start at zero.
GPUInspector.gpuinfo — Method
gpuinfo(deviceid::Integer)
Print out detailed information about the GPU with the given deviceid.
Heavily inspired by the CUDA sample "deviceQueryDrv.cpp".
GPUInspector.gpuinfo_p2p_access — Method
Query peer-to-peer (i.e. inter-GPU) access support.
GPUInspector.gpus — Method
List the available GPUs.
GPUInspector.hastensorcores — Function
Checks whether the given CuDevice has Tensor Cores.
GPUInspector.host2device_bandwidth — Function
host2device_bandwidth([memsize::UnitPrefixedBytes=GiB(0.5)]; kwargs...)
Performs a host-to-device memory copy benchmark (time measurement) and returns the host-to-device bandwidth estimate (in GiB/s) derived from it.
Keyword arguments:
- nbench (default: 10): number of time measurements (i.e. memcopies).
- verbose (default: true): set to false to turn off any printing.
- stats (default: false): when true, shows statistical information about the benchmark.
- times (default: false): toggle printing of measured times.
- dtype (default: Cchar): used data type.
- io (default: stdout): set the stream where the results should be printed.
Examples:
host2device_bandwidth()
host2device_bandwidth(MiB(1024))
host2device_bandwidth(KiB(20_000); dtype=Int32)
GPUInspector.ismonitoring — Method
Checks if we are currently monitoring.
GPUInspector.kernel_fma — Method
Dummy kernel doing _kernel_fma_nfmas() many FMAs (default: 100_000).
GPUInspector.livemonitor_powerusage — Method
livemonitor_powerusage(duration) -> times, powerusage
Monitor the power usage of GPU(s) (in Watts) over a given time period, as specified by duration (in seconds). Returns the (relative) times as a Vector{Float64} and the power usage as a Vector{Vector{Float64}}.
For general keyword arguments, see livemonitor_something.
GPUInspector.livemonitor_something — Method
livemonitor_something(f, duration) -> times, values
Monitor some property of GPU(s), as specified through the function f, over a given time period, as specified by duration (in seconds). Returns the (relative) times as a Vector{Float64} and the values as a Vector{Vector{Float64}}.
The function f will be called on a vector of devices (CuDevices or NVML.Devices) and should return a vector of Float64 values.
Keyword arguments:
- freq (default: 1): polling rate in Hz.
- devices (default: NVML.devices()): CuDevices or NVML.Devices to consider.
- plot (default: false): create a unicode plot after the monitoring.
- liveplot (default: false): create and update a unicode plot during the monitoring. Use the optional ylims to specify fixed y limits.
- title (default: ""): title used in unicode plots.
- ylabel (default: "Values"): y label used in unicode plots.
GPUInspector.livemonitor_temperature — Method
livemonitor_temperature(duration) -> times, temperatures
Monitor the temperature of GPU(s) over a given time period, as specified by duration (in seconds). Returns the (relative) times as a Vector{Float64} and the temperatures as a Vector{Vector{Float64}}.
For general keyword arguments, see livemonitor_something.
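For example, one might watch the temperatures for half a minute while a workload runs elsewhere; the indexing convention in the comments is an assumption based on the documented return types:

```julia
times, temps = livemonitor_temperature(30; plot=true)  # monitor for 30 seconds
# times :: Vector{Float64}, temps :: Vector{Vector{Float64}}
# temps[i] holds one temperature per monitored device at time times[i] (assumed layout)
```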
GPUInspector.load_monitoring_results — Method
Given an HDF5 file created with save_monitoring_results, restore the saved monitoring results (i.e. the output of monitoring_stop).
GPUInspector.memory_bandwidth — Function
memory_bandwidth([memsize; kwargs...])
Tries to estimate the peak memory bandwidth of a GPU in GiB/s by measuring the time it takes to perform a memcpy of a certain amount of data (as specified by memsize).
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Cchar): element type of the vectors.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
See also: memory_bandwidth_scaling.
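A brief usage sketch based on the signature above:

```julia
memory_bandwidth()                            # default data size, prints and returns GiB/s
bw = memory_bandwidth(GiB(1); verbose=false)  # 1 GiB memcpy, no printing
```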
GPUInspector.memory_bandwidth_saxpy — Method
Tries to estimate the peak memory bandwidth of a GPU in GiB/s by measuring the time it takes to perform a SAXPY, i.e. a * x[i] + y[i].
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the vectors.
- size (default: 2^20 * 10): length of the vectors.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the GiB/s computation.
- verbose (default: true): toggle printing.
- cublas (default: true): toggle between CUDA.axpy! and a custom saxpy_gpu_kernel!.
- io (default: stdout): set the stream where the results should be printed.
See also: memory_bandwidth_saxpy_scaling.
GPUInspector.memory_bandwidth_saxpy_scaling — Method
memory_bandwidth_saxpy_scaling() -> sizes, bandwidths
Measures the memory bandwidth (via memory_bandwidth_saxpy) as a function of vector length. If verbose=true (default), displays a unicode plot. Returns the considered lengths and GiB/s. For further options, see memory_bandwidth_saxpy.
GPUInspector.memory_bandwidth_scaling — Method
memory_bandwidth_scaling() -> datasizes, bandwidths
Measures the memory bandwidth (via memory_bandwidth) as a function of data size. If verbose=true (default), displays a unicode plot. Returns the considered data sizes and GiB/s. For further options, see memory_bandwidth.
GPUInspector.monitoring_start — Method
monitoring_start(; devices=CUDA.devices(), verbose=true)
Start monitoring of GPU temperature, utilization, power usage, etc.
Keyword arguments:
- freq (default: 1): polling rate in Hz.
- devices (default: CUDA.devices()): CuDevices or NVML.Devices to monitor.
- thread (default: Threads.nthreads()): id of the Julia thread that should run the monitoring.
- verbose (default: true): toggle verbose output.
See also monitoring_stop.
GPUInspector.monitoring_stop — Method
monitoring_stop(; verbose=true) -> results
Stops the GPU monitoring and returns the measured values. Specifically, results is a named tuple with the following keys:
- time: the (relative) times at which we measured
- temperature, power, compute, mem
See also monitoring_start and plot_monitoring_results.
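A typical monitoring workflow, using only functions documented in this reference (the filename is a placeholder):

```julia
monitoring_start(; freq=2)               # poll at 2 Hz on a separate Julia thread
stresstest(CUDA.device(); duration=10)   # the workload we want to observe
results = monitoring_stop()

plot_monitoring_results(results)            # unicode plots in the terminal
save_monitoring_results("run.h5", results)  # persist to HDF5 ("run.h5" is a placeholder)
```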
GPUInspector.multi_log — Function
Logging function for MultiLogger. Use this in combination with MultiLogger if you want the normal and error output that is printed to the terminal to also be saved to different files.
Arguments:
- MultiLogger: instance of the MultiLogger struct.
- text: text to be printed to the terminal and written to file.
- is_error (default: false): flag which decides whether text should be written into the normal or the error stream.
GPUInspector.p2p_bandwidth — Function
p2p_bandwidth([memsize::UnitPrefixedBytes]; kwargs...)
Performs a peer-to-peer memory copy benchmark (time measurement) and returns an inter-GPU memory bandwidth estimate (in GiB/s) derived from it.
Keyword arguments:
- src (default: 0): source device.
- dst (default: 1): destination device.
- nbench (default: 5): number of time measurements (i.e. p2p memcopies).
- verbose (default: true): set to false to turn off any printing.
- hist (default: false): when true, a UnicodePlots-based histogram is printed.
- times (default: false): toggle printing of measured times.
- alternate (default: false): alternate src and dst, i.e. copy data back and forth.
- dtype (default: Float32): see alloc_mem.
- io (default: stdout): set the stream where the results should be printed.
Examples:
p2p_bandwidth()
p2p_bandwidth(MiB(1024))
p2p_bandwidth(KiB(20_000); dtype=Int32)
GPUInspector.p2p_bandwidth_all — Method
p2p_bandwidth_all(args...; kwargs...)
Run p2p_bandwidth for all combinations of devices. Returns a matrix with the p2p memory bandwidth estimates.
GPUInspector.p2p_bandwidth_bidirectional — Function
Same as p2p_bandwidth but measures the bidirectional bandwidth (copying data back and forth).
GPUInspector.p2p_bandwidth_bidirectional_all — Method
Same as p2p_bandwidth_all but measures the bidirectional bandwidth (copying data back and forth).
GPUInspector.peakflops_gpu — Method
peakflops_gpu(; tensorcores=hastensorcores(), kwargs...)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform
- _kernel_fma_nfmas() * size many FMAs on CUDA cores (if tensorcores == false)
- _kernel_wmma_nwmmas() many WMMAs on Tensor Cores (if tensorcores == true)
For more keyword argument options see peakflops_gpu_fmas and peakflops_gpu_wmmas.
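Based on the signature above:

```julia
peakflops_gpu()                     # Tensor Cores if available (hastensorcores())
peakflops_gpu(; tensorcores=false)  # force the FMA benchmark on CUDA cores
```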
GPUInspector.peakflops_gpu_fmas — Method
peakflops_gpu_fmas(; size::Integer=5_000_000, dtype=Float32, nbench=5, nkernel=5, device=CUDA.device(), verbose=true)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_fma_nfmas() * size many FMAs on CUDA cores.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the matrices.
- size (default: 5_000_000): length of the vectors.
- nkernel (default: 5): number of kernel calls that make up one benchmarking sample.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
GPUInspector.peakflops_gpu_matmul — Method
peakflops_gpu_matmul(; device, dtype=Float32, size=2^14, nmatmuls=5, nbench=5, verbose=true)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform nmatmuls many (in-place) matrix-matrix multiplications.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the matrices.
- size (default: 2^14): matrices will have dimensions (size, size).
- nmatmuls (default: 5): number of matmuls that will make up the kernel to be timed.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
See also: peakflops_gpu_matmul_scaling, peakflops_gpu_matmul_graphs.
GPUInspector.peakflops_gpu_matmul_graphs — Method
Same as peakflops_gpu_matmul but uses CUDA's graph API to define and launch the kernel.
See also: peakflops_gpu_matmul_scaling.
GPUInspector.peakflops_gpu_matmul_scaling — Method
peakflops_gpu_matmul_scaling(peakflops_func=peakflops_gpu_matmul; verbose=true) -> sizes, flops
Measures the scaling of the given peakflops_func (defaults to peakflops_gpu_matmul) with increasing matrix size. If verbose=true (default), displays a unicode plot. Returns the considered sizes and TFLOP/s. For further options, see peakflops_gpu_matmul.
GPUInspector.peakflops_gpu_wmmas — Method
peakflops_gpu_wmmas()
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_wmma_nwmmas() many WMMAs on Tensor Cores.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float16): element type of the matrices. We currently only support Float16 (Int8, :TensorFloat32, :BFloat16, and Float64 might or might not work).
- nkernel (default: 10): number of kernel calls that make up one benchmarking sample.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- threads (default: max. threads per block): how many threads to use per block (part of the kernel launch configuration).
- blocks (default: 2048): how many blocks to use (part of the kernel launch configuration).
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
GPUInspector.plot_monitoring_results — Function
plot_monitoring_results(r::MonitoringResults, symbols=keys(r.results))
Plot the quantities specified through symbols of a MonitoringResults object. Will generate a textual in-terminal / in-logfile plot using UnicodePlots.jl.
GPUInspector.save_monitoring_results — Method
save_monitoring_results(filename::String, r::MonitoringResults; overwrite=false)
Store the given MonitoringResults (output of monitoring_stop) to disk as an HDF5 file with the name filename.
GPUInspector.savefig_monitoring_results — Function
savefig_monitoring_results(r::MonitoringResults, symbols=keys(r.results))
Save plots of the quantities specified through symbols of a MonitoringResults object to disk. Note: Only available if CairoMakie.jl is loaded next to GPUInspector.jl.
GPUInspector.simplify — Method
simplify(x::UnitPrefixedBytes[; base])
Given a UnitPrefixedBytes number x, finds a more appropriate UnitPrefixedBytes representation of the same number of bytes but with a smaller value.
The optional keyword argument base can be used to switch between base 2, i.e. ISO/IEC prefixes (default), and base 10. Allowed values are 2, 10, :SI, :ISO, and :IEC.
Note: This function is type unstable by construction!
Example:
julia> simplify(B(40_000_000))
~38.15 MiB
julia> simplify(B(40_000_000); base=10)
40.0 MB
GPUInspector.stresstest — Method
stresstest(device_or_devices)
Run a GPU stress test (matrix multiplications) on one or multiple GPU devices, as specified by the positional argument. If no argument is provided, (only) the currently active GPU will be used.
Keyword arguments:
Choose one of the following (or none):
- duration: the stress test will take about the given time in seconds. (StressTestBatched)
- enforced_duration: the stress test will take almost precisely the given time in seconds. (StressTestEnforced)
- approx_duration: the stress test will hopefully take approximately the given time in seconds. No promises made! (StressTestFixedIter)
- niter: the stress test will run the given number of matrix multiplications, however long that will take. (StressTestFixedIter)
- mem: a number (<:Real) between 0 and 1, indicating the fraction of the available GPU memory that should be used, or a <:UnitPrefixedBytes indicating an absolute memory limit. (StressTestStoreResults)
General settings:
- dtype (default: Float32): element type of the matrices.
- monitoring (default: false): enable automatic monitoring, in which case a MonitoringResults object is returned.
- size (default: 2048): matrices of size (size, size) will be used.
- verbose (default: true): toggle printing of information.
- parallel (default: true): if true, will (try to) run each GPU test on a different Julia thread. Make sure to have enough Julia threads.
- threads (default: nothing): if parallel == true, this argument may be used to specify the Julia threads to use.
- clearmem (default: false): if true, we call clear_all_gpus_memory after the stress test.
- io (default: stdout): set the stream where the results should be printed.
When duration is specified (i.e. StressTestBatched) there is also:
- batch_duration (default: ceil(Int, duration/10)): desired duration of one batch of matmuls.
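Some example invocations, derived from the options above:

```julia
stresstest(CUDA.device(); duration=60)  # ~60 seconds on the active GPU (StressTestBatched)
stresstest(CUDA.devices(); niter=1000)  # fixed number of matmuls on all GPUs (StressTestFixedIter)
results = stresstest(CUDA.device(); duration=30, monitoring=true)  # also returns MonitoringResults
```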
GPUInspector.stresstest_cpu — Method
stresstest_cpu(core_or_cores)
Run a CPU stress test (matrix multiplications) on one or multiple CPU cores, as specified by the positional argument. If no argument is provided, (only) the currently active CPU core will be used.
Keyword arguments:
- duration: the stress test will take about the given time in seconds.
- dtype (default: Float64): element type of the matrices.
- size (default: floor(Int, sqrt(L2_cachesize() / sizeof(dtype)))): matrices of size (size, size) will be used.
- verbose (default: true): toggle printing of information.
- parallel (default: true): if true, will (try to) run each CPU core test on a different Julia thread. Make sure to have enough Julia threads.
- threads (default: nothing): if parallel == true, this argument may be used to specify the Julia threads to use.
GPUInspector.theoretical_memory_bandwidth — Function
theoretical_memory_bandwidth(; device::CuDevice=CUDA.device(), verbose=true)
Estimates the theoretical maximal GPU memory bandwidth in GiB/s.
GPUInspector.theoretical_peakflops_gpu — Method
Estimates the theoretical peak performance of a CUDA device in TFLOP/s.
Keyword arguments:
- tensorcores (default: hastensorcores()): toggle usage of Tensor Cores. If false, CUDA cores will be used.
- verbose (default: true): toggle printing of information.
- device (default: device()): CUDA device to be analyzed.
- dtype (default: tensorcores ? Float16 : Float32): element type of the matrices.
- io (default: stdout): set the stream where the results should be printed.
GPUInspector.toggle_tensorcoremath — Function
toggle_tensorcoremath([enable::Bool]; verbose=true)
Switches the CUDA.math_mode between CUDA.FAST_MATH (enable=true) and CUDA.DEFAULT_MATH (enable=false). For matmuls of CuArray{Float32}s, this should have the effect of enabling and disabling the use of tensor cores. Of course, this only works on supported devices and CUDA versions.
If no argument is provided, this function toggles between the two math modes.
GPUInspector.@unroll — Macro
@unroll N expr
Takes a for loop as expr and informs the LLVM unroller to unroll it N times, if it is safe to do so.
GPUInspector.@unroll — Macro
@unroll expr
Takes a for loop as expr and informs the LLVM unroller to fully unroll it, if it is safe to do so and the loop count is known.
GPUInspector.@worker — Macro
@worker pid ex
Spawns the given command on the given worker process.
Examples:
@worker 3 GPUInspector.functional()
@worker 3 stresstest(CUDA.devices(); duration=10, verbose=false)
GPUInspector.@worker — Macro
@worker ex
Creates a worker process, spawns the given command on it, and kills the worker process once the command has finished execution.
Implementation: a Julia thread (we use @spawn) will be used to wait on the task and kill the worker.
Examples:
@worker GPUInspector.functional()
@worker stresstest(CUDA.devices(); duration=10, verbose=false)
GPUInspector.@worker_create — Macro
@worker_create n -> pids
Create n workers (i.e. separate Julia processes) and execute using GPUInspector, CUDA on all of them. Returns the pids of the created workers.
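A sketch of the worker lifecycle; whether @worker accepts a non-literal pid expression such as pids[1] is an assumption (the documented examples use literal ids):

```julia
pids = @worker_create 2                    # two worker processes with GPUInspector and CUDA loaded
@worker pids[1] GPUInspector.functional()  # assumed: pid may be given as an expression
@worker_killall                            # clean up all workers when done
```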
GPUInspector.@worker_killall — Macro
Kills all Julia workers.