Workflow

A typical approach for porting or developing an application for the GPU is as follows:

1. develop an application using generic array functionality, and test it on the CPU with the Array type
2. port your application to the GPU by switching to the CuArray type
3. disallow the CPU fallback ("scalar indexing") to find operations that are not implemented for or incompatible with GPU execution
4. (optional) use lower-level, CUDA-specific interfaces to implement missing functionality or optimize performance
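
As a rough illustration of the first two steps, the sketch below writes a function against AbstractMatrix using only broadcasting and reductions, then tests it on the CPU. The function name normalize_columns and the sample data are made up for this example. The GPU part is left as comments because it assumes CUDA.jl is installed and a working GPU is available.

```julia
# Generic array code: only broadcasting and reductions, no element-by-element loops,
# so the same method should work for Array and for CuArray inputs.
# `normalize_columns` is a hypothetical example function, not part of any library.
function normalize_columns(x::AbstractMatrix)
    # Euclidean norm of each column, computed as a reduction along dims=1.
    norms = sqrt.(sum(abs2, x; dims=1))
    # Broadcasting divides every column by its norm.
    return x ./ norms
end

# Step 1: develop and test on the CPU with the Array type.
x_cpu = [3.0 0.0; 4.0 2.0]
y_cpu = normalize_columns(x_cpu)

# Step 2 (assumes CUDA.jl and a functional GPU): only the input type changes.
# using CUDA
# x_gpu = CuArray(x_cpu)
# y_gpu = normalize_columns(x_gpu)
# Array(y_gpu) ≈ y_cpu    # copy back to the CPU to compare results
```

Keeping the function generic over AbstractMatrix means the CPU test exercises the same code path that later runs on the GPU.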

Scalar indexing

To facilitate porting code, CuArray supports executing so-called "scalar code" which processes one element at a time, e.g., in a for loop. Each such element access is a separate round trip between the CPU and the GPU, so this is extremely slow and will negate any performance benefit of using a GPU. As such, you will be warned when performing this kind of iteration:

julia> using CUDA

julia> a = CuArray([1])
1-element CuArray{Int64,1,Nothing}:
 1

julia> a[1] += 1
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays GPUArrays/src/indexing.jl:16
2

Once you've verified that your application executes correctly on the GPU, you should disallow scalar indexing and use GPU-friendly array operations instead:

julia> CUDA.allowscalar(false)

julia> a[1] .+ 1
ERROR: scalar getindex is disallowed
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] assertscalar(::String) at GPUArrays/src/indexing.jl:14
 [3] getindex(::CuArray{Int64,1,Nothing}, ::Int64) at GPUArrays/src/indexing.jl:54
 [4] top-level scope at REPL[5]:1

julia> a .+ 1
1-element CuArray{Int64,1,Nothing}:
 2
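
The same contrast applies to whole functions. The sketch below shows a scalar-style loop next to a broadcast-based equivalent; both function names are hypothetical. It runs on plain Arrays so it can be tried without a GPU. On a CuArray with scalar indexing disallowed, the loop version would be expected to fail with the error shown above, while the broadcast version should run as a single GPU operation.

```julia
# Scalar style: touches one element at a time.
# On a CuArray this relies on scalar indexing.
function add_one_loop!(a::AbstractVector)
    for i in eachindex(a)
        a[i] += 1
    end
    return a
end

# GPU-friendly style: one in-place broadcast over the whole array.
function add_one_broadcast!(a::AbstractVector)
    a .+= 1
    return a
end

# Both versions agree on the CPU.
v1 = add_one_loop!([1, 2, 3])
v2 = add_one_broadcast!([1, 2, 3])
```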

However, many array operations are themselves implemented using scalar indexing. As a result, calling a seemingly GPU-friendly array operation might raise an error:

julia> a = CuArray([1,2])
2-element CuArray{Int64,1,Nothing}:
 1
 2

julia> using Statistics

julia> var(a)
0.5

julia> var(a,dims=1)
ERROR: scalar getindex is disallowed

To avoid such issues, CUDA.jl replaces many array operations for CuArray with GPU-friendly alternatives. If you run into a case that is not yet covered, check the CUDA.jl issue tracker and file a bug report if there isn't one yet.
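
In the meantime, an operation can often be rewritten in terms of broadcasting and reductions. The sketch below does this for a variance along one dimension. The helper var_dims is hypothetical and not part of CUDA.jl or Statistics; it assumes dims is a single integer and uses the same n - 1 correction that var applies by default. It is checked against Statistics.var on the CPU only; whether it avoids scalar indexing on a given CUDA.jl version would need to be verified on a GPU.

```julia
using Statistics

# Hypothetical helper: variance along one dimension, built only from
# broadcasting and reductions. Assumes `dims` is a single integer.
function var_dims(x::AbstractArray; dims::Integer)
    n = size(x, dims)
    # Mean along `dims`, kept as an array so it broadcasts against `x`.
    μ = sum(x; dims=dims) ./ n
    # Sum of squared deviations, divided by n - 1 (corrected sample variance).
    return sum(abs2, x .- μ; dims=dims) ./ (n - 1)
end

a = [1, 2]
v = var_dims(a; dims=1)

m = [1.0 2.0; 3.0 5.0]
vm = var_dims(m; dims=1)
```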