CIFAR-10

Description from the original website

The CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

Overview

The MLDatasets.CIFAR10 sub-module provides a programmatic interface to download, load, and work with the CIFAR-10 dataset.

using MLDatasets

# load full training set
train_x, train_y = CIFAR10.traindata()

# load full test set
test_x,  test_y  = CIFAR10.testdata()

The provided functions also allow for optional arguments, such as the directory dir where the dataset is located, or the specific observation indices that one wants to work with. For more information on the interface take a look at the documentation (e.g. ?CIFAR10.traindata).

Function                              Description
download([dir])                       Trigger interactive download of the dataset
classnames()                          Return the class names as a vector of strings
traintensor([T], [indices]; [dir])    Load the training images as an array of eltype T
trainlabels([indices]; [dir])         Load the labels for the training images
testtensor([T], [indices]; [dir])     Load the test images as an array of eltype T
testlabels([indices]; [dir])          Load the labels for the test images
traindata([T], [indices]; [dir])      Load images and labels of the training data
testdata([T], [indices]; [dir])       Load images and labels of the test data

This module also provides utility functions to make working with the CIFAR-10 dataset in Julia more convenient.

Function                   Description
convert2features(array)    Convert the CIFAR-10 tensor to a flat feature matrix
convert2image(array)       Convert the CIFAR-10 tensor/matrix to a colorant array

You can use the function convert2features to convert the given CIFAR-10 tensor to a feature matrix (or feature vector in the case of a single image). The purpose of this function is to drop the spatial dimensions such that traditional ML algorithms can process the dataset.

julia> CIFAR10.convert2features(CIFAR10.traintensor()) # full training data
3072×50000 Array{N0f8,2}:
[...]
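
The idea behind dropping the spatial dimensions can be sketched with plain Julia arrays. This is a minimal illustration that assumes the conversion amounts to a reshape; it uses a small synthetic tensor as a stand-in for the real data and is not the package's actual implementation.

```julia
# Stand-in for a small CIFAR-10 tensor: 32×32 pixels, 3 channels, 5 images.
# (Synthetic data; the real tensor would come from CIFAR10.traintensor.)
tensor = rand(Float32, 32, 32, 3, 5)

# Collapse the three per-image dimensions into a single feature dimension.
# reshape shares memory with `tensor`, so no pixel data is copied.
features = reshape(tensor, 32 * 32 * 3, size(tensor, 4))

size(features)  # (3072, 5)

# A single image (3D array) flattens to a feature vector in the same way.
feature_vector = vec(tensor[:, :, :, 1])
```

Each column of features then holds one image, with the same element ordering as the flattened single image.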

To visualize an image or a prediction, use the function convert2image to convert the given CIFAR-10 horizontal-major tensor (or feature matrix) to a vertical-major Colorant array.

julia> CIFAR10.convert2image(CIFAR10.traintensor(1)) # first training image
32×32 Array{RGB{N0f8},2}:
[...]

API Documentation

Trainingset

MLDatasets.CIFAR10.traintensor - Function
traintensor([T = N0f8], [indices]; [dir]) -> Array{T}

Return the CIFAR-10 training images corresponding to the given indices as a multi-dimensional array of eltype T. If the corresponding labels are required as well, it is recommended to use CIFAR10.traindata instead.

The image data is returned in the native horizontal-major memory layout as a single numeric array. If T <: Integer, all values lie between 0 and 255; otherwise the values are scaled to lie between 0 and 1.
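
The effect of the element type on the value range can be illustrated on a few hand-picked byte values. The sketch below uses hypothetical pixel values and plain arithmetic; it only mirrors the described behaviour and does not call the package.

```julia
# Hypothetical raw pixel bytes, as stored in the dataset files (0 to 255).
raw = UInt8[0, 51, 128, 255]

# Integer eltype: values stay in the 0..255 range.
as_int = Int.(raw)

# Floating-point eltype: values are rescaled into the 0..1 range.
as_float = Float32.(raw) ./ 255f0
```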

If the parameter indices is omitted or is an AbstractVector, the images are returned as a 4D array (i.e. an Array{T,4}), in which the first dimension corresponds to the pixel rows (x) of the image, the second dimension to the pixel columns (y) of the image, the third dimension to the RGB color channels, and the fourth dimension to the index of the image.

julia> CIFAR10.traintensor() # load all training images
32×32×3×50000 Array{N0f8,4}:
[...]

julia> CIFAR10.traintensor(Float32, 1:3) # first three images as Float32
32×32×3×3 Array{Float32,4}:
[...]

If indices is an Integer, the single image is returned as Array{T,3} in horizontal-major layout, which means that the first dimension denotes the pixel rows (x), the second dimension denotes the pixel columns (y), and the third dimension the RGB color channels of the image.

julia> CIFAR10.traintensor(1) # load first training image
32×32×3 Array{N0f8,3}:
[...]

As mentioned above, the images are returned in the native horizontal-major layout to preserve the original feature ordering. You can use the utility function convert2image to convert a CIFAR-10 array into a vertical-major Julia image with the appropriate RGB eltype.

julia> CIFAR10.convert2image(CIFAR10.traintensor(1)) # convert to column-major colorant array
32×32 Array{RGB{N0f8},2}:
[...]
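
The layout change itself can be sketched without any image package. The example below assumes that going from horizontal-major to vertical-major means swapping the two spatial dimensions; wrapping the three channels into an RGB eltype, which convert2image also does, is left out because it needs a colour type package.

```julia
# Synthetic horizontal-major image: dimension 1 is x, dimension 2 is y,
# and dimension 3 holds the R, G, B channels.
img_hm = zeros(Float32, 32, 32, 3)
img_hm[5, 1, 1] = 1f0   # one red pixel at x = 5, y = 1

# Swap the two spatial dimensions so that dimension 1 is y and dimension 2 is x.
img_vm = permutedims(img_hm, (2, 1, 3))

img_vm[1, 5, 1]  # the same pixel, now addressed as (y, x)
```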

The dataset's resource files are expected to be located in the specified directory dir. If dir is omitted, the directories in DataDeps.default_loadpath are searched for an existing CIFAR10 subfolder. If no such subfolder is found, dir defaults to ~/.julia/datadeps/CIFAR10. If dir does not yet exist, a download prompt is triggered. You can also call CIFAR10.download([dir]) explicitly to pre-download (or re-download) the dataset. See the documentation of the package DataDeps.jl for more detail and configuration options.

MLDatasets.CIFAR10.trainlabels - Function
trainlabels([indices]; [dir])

Returns the CIFAR-10 trainset labels corresponding to the given indices as an Int or Vector{Int}. The values of the labels denote the zero-based class-index that they represent (see CIFAR10.classnames for the corresponding names). If indices is omitted, all labels are returned.

julia> CIFAR10.trainlabels() # full training set
50000-element Array{Int64,1}:
 6
 9
 ⋮
 1
 1

julia> CIFAR10.trainlabels(1:3) # first three labels
3-element Array{Int64,1}:
 6
 9
 9

julia> CIFAR10.trainlabels(1) # first label
6

julia> CIFAR10.classnames()[CIFAR10.trainlabels(1) + 1] # corresponding name
"frog"
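
Because the labels are zero-based while Julia arrays are one-based, a whole label vector can be mapped to names by broadcasting the same + 1 shift. The sketch below hard-codes the ten CIFAR-10 class names as a stand-in for CIFAR10.classnames() and reuses the first three training labels shown above; the variable names are illustrative.

```julia
# CIFAR-10 class names in label order (stand-in for CIFAR10.classnames()).
class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

# Zero-based labels, as returned by CIFAR10.trainlabels(1:3) above.
labels = [6, 9, 9]

# Shift by one to index into the one-based Julia vector.
label_names = class_names[labels .+ 1]
```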

The dataset's resource files are expected to be located in the specified directory dir. If dir is omitted, the directories in DataDeps.default_loadpath are searched for an existing CIFAR10 subfolder. If no such subfolder is found, dir defaults to ~/.julia/datadeps/CIFAR10. If dir does not yet exist, a download prompt is triggered. You can also call CIFAR10.download([dir]) explicitly to pre-download (or re-download) the dataset. See the documentation of the package DataDeps.jl for more detail and configuration options.

MLDatasets.CIFAR10.traindata - Function
traindata([T = N0f8], [indices]; [dir]) -> images, labels

Returns the CIFAR-10 training set corresponding to the given indices as a two-element tuple. If indices is omitted, the full training set is returned. The first element of the tuple is the images as a multi-dimensional array, and the second element is the corresponding labels as integers.

The image data is returned in the native horizontal-major memory layout as a single numeric array of eltype T. If T <: Integer, all values lie between 0 and 255; otherwise the values are scaled to lie between 0 and 1. The integer labels are zero-based class indices (see CIFAR10.classnames for the corresponding names).

train_x, train_y = CIFAR10.traindata() # full dataset
train_x, train_y = CIFAR10.traindata(2) # only second observation
train_x, train_y = CIFAR10.traindata(dir="./CIFAR10") # custom folder
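
Since indices may also be a vector or range, one way to use it is to load the training set in chunks. The following sketch only builds the index ranges with Base functions; the batch size of 128 is an arbitrary choice for illustration, and the commented call shows how a range could then be passed on.

```julia
n_train = 50_000      # number of training images in CIFAR-10
batch_size = 128      # arbitrary choice for this illustration

# Split 1:n_train into consecutive index ranges of at most batch_size elements.
batches = collect(Iterators.partition(1:n_train, batch_size))

length(batches)  # 391, the last range being shorter than the others

# With MLDatasets installed, a single batch could then be loaded with:
#   batch_x, batch_y = CIFAR10.traindata(Float32, batches[1])
```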

The dataset's resource files are expected to be located in the specified directory dir. If dir is omitted, the directories in DataDeps.default_loadpath are searched for an existing CIFAR10 subfolder. If no such subfolder is found, dir defaults to ~/.julia/datadeps/CIFAR10. If dir does not yet exist, a download prompt is triggered. You can also call CIFAR10.download([dir]) explicitly to pre-download (or re-download) the dataset. See the documentation of the package DataDeps.jl for more detail and configuration options.

Take a look at CIFAR10.traintensor and CIFAR10.trainlabels for more information.

Testset

MLDatasets.CIFAR10.testtensor - Function
testtensor([T = N0f8], [indices]; [dir]) -> Array{T}

Return the CIFAR-10 test images corresponding to the given indices as a multi-dimensional array of eltype T. If the corresponding labels are required as well, it is recommended to use CIFAR10.testdata instead.

The image data is returned in the native horizontal-major memory layout as a single numeric array. If T <: Integer, all values lie between 0 and 255; otherwise the values are scaled to lie between 0 and 1.

If the parameter indices is omitted or is an AbstractVector, the images are returned as a 4D array (i.e. an Array{T,4}), in which the first dimension corresponds to the pixel rows (x) of the image, the second dimension to the pixel columns (y) of the image, the third dimension to the RGB color channels, and the fourth dimension to the index of the image.

julia> CIFAR10.testtensor() # load all test images
32×32×3×10000 Array{N0f8,4}:
[...]

julia> CIFAR10.testtensor(Float32, 1:3) # first three images as Float32
32×32×3×3 Array{Float32,4}:
[...]

If indices is an Integer, the single image is returned as Array{T,3} in horizontal-major layout, which means that the first dimension denotes the pixel rows (x), the second dimension denotes the pixel columns (y), and the third dimension the RGB color channels of the image.

julia> CIFAR10.testtensor(1) # load first test image
32×32×3 Array{N0f8,3}:
[...]

As mentioned above, the images are returned in the native horizontal-major layout to preserve the original feature ordering. You can use the utility function convert2image to convert a CIFAR-10 array into a vertical-major Julia image with the appropriate RGB eltype.

julia> CIFAR10.convert2image(CIFAR10.testtensor(1)) # convert to column-major colorant array
32×32 Array{RGB{N0f8},2}:
[...]

The dataset's resource files are expected to be located in the specified directory dir. If dir is omitted, the directories in DataDeps.default_loadpath are searched for an existing CIFAR10 subfolder. If no such subfolder is found, dir defaults to ~/.julia/datadeps/CIFAR10. If dir does not yet exist, a download prompt is triggered. You can also call CIFAR10.download([dir]) explicitly to pre-download (or re-download) the dataset. See the documentation of the package DataDeps.jl for more detail and configuration options.

MLDatasets.CIFAR10.testlabels - Function
testlabels([indices]; [dir])

Returns the CIFAR-10 testset labels corresponding to the given indices as an Int or Vector{Int}. The values of the labels denote the zero-based class-index that they represent (see CIFAR10.classnames for the corresponding names). If indices is omitted, all labels are returned.

julia> CIFAR10.testlabels() # full test set
10000-element Array{Int64,1}:
 3
 8
 ⋮
 1
 7

julia> CIFAR10.testlabels(1:3) # first three labels
3-element Array{Int64,1}:
 3
 8
 8

julia> CIFAR10.testlabels(1) # first label
3

julia> CIFAR10.classnames()[CIFAR10.testlabels(1) + 1] # corresponding name
"cat"

The dataset's resource files are expected to be located in the specified directory dir. If dir is omitted, the directories in DataDeps.default_loadpath are searched for an existing CIFAR10 subfolder. If no such subfolder is found, dir defaults to ~/.julia/datadeps/CIFAR10. If dir does not yet exist, a download prompt is triggered. You can also call CIFAR10.download([dir]) explicitly to pre-download (or re-download) the dataset. See the documentation of the package DataDeps.jl for more detail and configuration options.

MLDatasets.CIFAR10.testdata - Function
testdata([T = N0f8], [indices]; [dir]) -> images, labels

Returns the CIFAR-10 test set corresponding to the given indices as a two-element tuple. If indices is omitted, the full test set is returned. The first element of the tuple is the images as a multi-dimensional array, and the second element is the corresponding labels as integers.

The image data is returned in the native horizontal-major memory layout as a single numeric array of eltype T. If T <: Integer, all values lie between 0 and 255; otherwise the values are scaled to lie between 0 and 1. The integer labels are zero-based class indices (see CIFAR10.classnames for the corresponding names).

test_x, test_y = CIFAR10.testdata() # full dataset
test_x, test_y = CIFAR10.testdata(2) # only second observation
test_x, test_y = CIFAR10.testdata(dir="./CIFAR10") # custom folder

The dataset's resource files are expected to be located in the specified directory dir. If dir is omitted, the directories in DataDeps.default_loadpath are searched for an existing CIFAR10 subfolder. If no such subfolder is found, dir defaults to ~/.julia/datadeps/CIFAR10. If dir does not yet exist, a download prompt is triggered. You can also call CIFAR10.download([dir]) explicitly to pre-download (or re-download) the dataset. See the documentation of the package DataDeps.jl for more detail and configuration options.

Take a look at CIFAR10.testtensor and CIFAR10.testlabels for more information.

Utilities

MLDatasets.CIFAR10.download - Function
download([dir]; [i_accept_the_terms_of_use])

Trigger the (interactive) download of the full dataset into "dir". If no dir is provided the dataset will be downloaded into "~/.julia/datadeps/CIFAR10".

This function will display an interactive dialog unless either the keyword parameter i_accept_the_terms_of_use or the environment variable DATADEPS_ALWAYS_ACCEPT is set to true. Note that using the data responsibly and respecting copyright/terms-of-use remains your responsibility.
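
For scripted or unattended runs, the two ways of skipping the dialog described above might look as follows. This is a sketch: only the environment variable is actually set, and the package calls are left as comments so that the snippet runs without MLDatasets installed or any download taking place.

```julia
# Accept the terms of use up front so that DataDeps does not prompt.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"

# With MLDatasets installed, the download could then be triggered with:
#   using MLDatasets
#   CIFAR10.download()
# or, without touching the environment, via the keyword parameter:
#   CIFAR10.download(i_accept_the_terms_of_use = true)
```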

MLDatasets.CIFAR10.convert2features - Function
convert2features(array)

Convert the given CIFAR-10 tensor to a feature matrix (or feature vector in the case of a single image). The purpose of this function is to drop the spatial dimensions such that traditional ML algorithms can process the dataset.

julia> CIFAR10.convert2features(CIFAR10.traintensor(Float32)) # full training data
3072×50000 Array{Float32,2}:
[...]

julia> CIFAR10.convert2features(CIFAR10.traintensor(Float32,1)) # first observation
3072-element Array{Float32,1}:
[...]

MLDatasets.CIFAR10.convert2image - Function
convert2image(array) -> Array{RGB}

Convert the given CIFAR-10 horizontal-major tensor (or feature vector/matrix) to a vertical-major RGB array.

julia> CIFAR10.convert2image(CIFAR10.traintensor()) # full training dataset
32×32×50000 Array{RGB{N0f8},3}:
[...]

julia> CIFAR10.convert2image(CIFAR10.traintensor(1)) # first training image
32×32 Array{RGB{N0f8},2}:
[...]

References