MLJScientificTypes.jl

Implementation of a convention for scientific types, as used in the MLJ universe.

This package makes a distinction between machine type and scientific type of a Julia object:

  • The machine type refers to the Julia type being used to represent the object (for instance, Float64).

  • The scientific type is one of the types defined in ScientificTypes.jl reflecting how the object should be interpreted (for instance, Continuous or Multiclass).
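
As a minimal sketch of this distinction (assuming MLJScientificTypes is installed), typeof reports the machine type while scitype reports the scientific type under the MLJ convention:

```julia
using MLJScientificTypes

x = 3.14
typeof(x)     # machine type: Float64
scitype(x)    # scientific type: Continuous

n = 42
typeof(n)     # machine type: Int (Int64 on 64-bit systems)
scitype(n)    # scientific type: Count
```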

A scientific type convention is an assignment of a scientific type to every Julia object, articulated by overloading the scitype method. The MLJ convention is the one adopted in the MLJ ecosystem, although it may be used in other scientific/statistical software.

This package additionally defines tools for type coercion (the coerce method) and scientific type "guessing" (the autotype method).

Developers interested in implementing a different convention will instead import ScientificTypes.jl, following the documentation there, possibly using this repo as a template.

Type hierarchy

The supported scientific types have the following hierarchy:

Finite{N}
├─ Multiclass{N}
└─ OrderedFactor{N}

Infinite
├─ Continuous
└─ Count

Image{W,H}
├─ ColorImage{W,H}
└─ GrayImage{W,H}

ScientificTimeType
├─ ScientificDate
├─ ScientificTime
└─ ScientificDateTime

Table{K}

Textual

Unknown

Additionally, we regard the Julia native types Missing and Nothing as scientific types as well.
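
The hierarchy can be checked directly with Julia's subtype operator. The following is a small sketch, assuming the scientific types are exported as listed above:

```julia
using MLJScientificTypes

# Finite branch: both subtypes share the level-count parameter N
Multiclass{3}    <: Finite{3}     # true
OrderedFactor{3} <: Finite{3}     # true
Multiclass{3}    <: Finite{4}     # false, since the parameters differ

# Infinite branch
Continuous <: Infinite            # true
Count      <: Infinite            # true

# Image branch: parameterized by width and height
GrayImage{10,10}  <: Image{10,10} # true
ColorImage{10,10} <: Image{10,10} # true
```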

Getting started

This documentation focuses on properties of the scitype method specific to the MLJ convention. The scitype method satisfies certain universal properties, with respect to its operation on tuples, arrays and tables, set out in the ScientificTypes readme, but only implicitly described here.

To get the scientific type of a Julia object defined by the MLJ convention, call scitype:

using MLJScientificTypes # or using MLJ
scitype(3.14)
Continuous

For a vector, you can use scitype or elscitype (which will give you a scitype corresponding to the elements):

scitype([1,2,3,missing])
AbstractArray{Union{Missing, Count},1}
elscitype([1,2,3,missing])
Union{Missing, Count}

Occasionally, you may want to find the union of all scitypes of elements of an arbitrary iterable, which you can do with scitype_union:

scitype_union((ifelse(isodd(i), i, missing) for i in 1:5))
Union{Missing, Count}

Note that calling scitype_union on a large array, for example, is typically much slower than calling scitype or elscitype.

Summary of the MLJ convention

The table below summarizes the MLJ convention for representing scientific types:

Type T | scitype(x) for x::T | package required
-------|---------------------|-----------------
Missing | Missing |
Nothing | Nothing |
AbstractFloat | Continuous |
Integer | Count |
String | Textual |
CategoricalValue | Multiclass{N} where N = nlevels(x), provided x.pool.ordered == false | CategoricalArrays
CategoricalString | Multiclass{N} where N = nlevels(x), provided x.pool.ordered == false | CategoricalArrays
CategoricalValue | OrderedFactor{N} where N = nlevels(x), provided x.pool.ordered == true | CategoricalArrays
CategoricalString | OrderedFactor{N} where N = nlevels(x), provided x.pool.ordered == true | CategoricalArrays
Date | ScientificDate | Dates
Time | ScientificTime | Dates
DateTime | ScientificDateTime | Dates
AbstractArray{<:Gray,2} | GrayImage{W,H} where (W, H) = size(x) | ColorTypes
AbstractArray{<:AbstractRGB,2} | ColorImage{W,H} where (W, H) = size(x) | ColorTypes
any table type T supported by Tables.jl | Table{K} where K=Union{column_scitypes...} | Tables

Here nlevels(x) = length(levels(x.pool)).

Notes

  • We regard the built-in Julia types Missing and Nothing as scientific types.
  • Finite{N}, Multiclass{N} and OrderedFactor{N} are all parameterized by the number of levels N. We export the alias Binary = Finite{2}.
  • Image{W,H}, GrayImage{W,H} and ColorImage{W,H} are all parameterized by the image width and height dimensions, (W, H).
  • On objects for which the MLJ convention has nothing to say, the scitype function returns Unknown.
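
A brief sketch of the notes above. The Widget type is hypothetical and introduced only to show the Unknown fallback:

```julia
using MLJScientificTypes

# Binary is an alias for Finite{2}, so it covers both two-level subtypes
Binary == Finite{2}            # true
Multiclass{2}    <: Binary     # true
OrderedFactor{2} <: Binary     # true

# Missing and Nothing are their own scientific types
scitype(missing)               # Missing
scitype(nothing)               # Nothing

# A hypothetical type about which the MLJ convention has nothing to say
struct Widget end
scitype(Widget())              # Unknown
```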

Special note on binary data

MLJScientificTypes does not define a separate "binary" scientific type. Rather, when binary data has an intrinsic "true" class (for example pass/fail in a product test), then it should be assigned an OrderedFactor{2} scitype, while data with no such class (e.g., gender) should be assigned a Multiclass{2} scitype. In the OrderedFactor{2} case MLJ adopts the convention that the "true" class come after the "false" class in the ordering (corresponding to the usual assignment "false=0" and "true=1"). Of course, Finite{2} covers both cases of binary data.
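
To illustrate, here is a sketch using CategoricalArrays with made-up data. It relies on the default level ordering of categorical, which sorts the levels, so that "fail" precedes "pass":

```julia
using MLJScientificTypes, CategoricalArrays

# Binary data with an intrinsic "true" class: use an ordered factor,
# with the "true" class ("pass") last in the ordering
outcome = categorical(["fail", "pass", "pass", "fail"], ordered=true)
levels(outcome)        # ["fail", "pass"]
elscitype(outcome)     # OrderedFactor{2}

# Binary data with no intrinsic "true" class: leave it unordered
group = categorical(["a", "b", "a"])
elscitype(group)       # Multiclass{2}

# Both cases are covered by Finite{2}
elscitype(outcome) <: Finite{2} && elscitype(group) <: Finite{2}   # true
```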

Type coercion for tabular data

A common two-step workflow is:

  1. Inspect the schema of some table, and the column scitypes in particular.

  2. Provide pairs of column names and scitypes (or a dictionary) that change the column machine types to reflect the desired scientific interpretation (scitype).

using DataFrames, Tables
X = DataFrame(
    name=["Siri", "Robo", "Alexa", "Cortana"],
    height=[152, missing, 148, 163],
    rating=[1, 5, 2, 1])
schema(X)
┌─────────┬───────────────────────┬───────────────────────┐
│ _.names │ _.types               │ _.scitypes            │
├─────────┼───────────────────────┼───────────────────────┤
│ name    │ String                │ Textual               │
│ height  │ Union{Missing, Int64} │ Union{Missing, Count} │
│ rating  │ Int64                 │ Count                 │
└─────────┴───────────────────────┴───────────────────────┘
_.nrows = 4

For further analysis of the data in X, a more likely interpretation is that :name is Multiclass, :height is Continuous, and :rating is an OrderedFactor. The types can be corrected with coerce:

Xfixed = coerce(X, :name=>Multiclass,
                   :height=>Continuous,
                   :rating=>OrderedFactor)
schema(Xfixed).scitypes
(Multiclass{4}, Union{Missing, Continuous}, OrderedFactor{3})

Note that because missing values were encountered in height, an "imperfect" type coercion to Union{Missing,Continuous} has been performed, and a warning issued. To avoid the warning, coerce to Union{Missing,Continuous} instead.
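
A minimal sketch of the warning-free variant, using a small column table with made-up values in place of the DataFrame:

```julia
using MLJScientificTypes

X = (height = [152, missing, 148, 163],)

# Name the Missing union explicitly so the coercion is "perfect"
Xc = coerce(X, :height => Union{Missing,Continuous})
elscitype(Xc.height)   # Union{Missing, Continuous}
```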

"Global" replacements based on existing scientific types are also possible, and can be mixed with the name-based replacements:

X  = (x = [1, 2, 3],
      y = [:A, :B, :A],
      z = [10, 20, 30])
Xfixed = coerce(X, Count=>Continuous, :y=>OrderedFactor)
schema(Xfixed).scitypes
(Continuous, OrderedFactor{2}, Continuous)

Finally there is a coerce! method that does in-place coercion provided the data structure supports it.
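
As a sketch, assuming a DataFrame (a structure that supports in-place column replacement) and the same name=>scitype pair syntax as coerce:

```julia
using MLJScientificTypes, DataFrames

df = DataFrame(x = [1, 2, 3], y = [10.0, 20.0, 30.0])

# Mutates df rather than returning a coerced copy
coerce!(df, :x => Continuous)
elscitype(df.x)   # Continuous
```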

Type coercion for image data

To have a scientific type of Image, a Julia object must be a two-dimensional array whose element type is a subtype of Gray or AbstractRGB (color types from the ColorTypes.jl package). MLJ models typically expect collections of images to be vectors of such two-dimensional arrays. Implementations of coerce allow the conversion of some common image formats into one of these. The eltype in these other formats can be any subtype of Real, which includes the FixedPoint type from the FixedPointNumbers.jl package.
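
For arrays that already have a color element type, no coercion is needed. The following sketch, which assumes ColorTypes is installed, builds such arrays directly from random values:

```julia
using MLJScientificTypes, ColorTypes

# A 4 x 6 matrix with element type Gray{Float64}
gray = Gray.(rand(4, 6))
scitype(gray)     # GrayImage{4,6}

# A 4 x 6 matrix with element type RGB{Float64}
color = [RGB(rand(), rand(), rand()) for _ in 1:4, _ in 1:6]
scitype(color)    # ColorImage{4,6}
```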

Coercing a single image

Coercing a gray image, represented as a Real matrix (W x H format):

img = rand(10, 10)
coerce(img, GrayImage) |> scitype
GrayImage{10,10}

Coercing a color image, represented as a Real 3-D array (W x H x C format):

img = rand(10, 10, 3)
coerce(img, ColorImage) |> scitype
ColorImage{10,10}

Coercing collections of images

Coercing a collection of gray images, represented as a Real 3-D array (W x H x N format):

imgs = rand(10, 10, 3)
coerce(imgs, GrayImage) |> scitype
AbstractArray{GrayImage{10,10},1}

Coercing a collection of gray images, represented as a Real 4-D array (W x H x {1} x N format):

imgs = rand(10, 10, 1, 3)
coerce(imgs, GrayImage) |> scitype
AbstractArray{GrayImage{10,10},1}

Coercing a collection of color images, represented as a Real 4-D array (W x H x C x N format):

imgs = rand(10, 10, 3, 5)
coerce(imgs, ColorImage) |> scitype
AbstractArray{ColorImage{10,10},1}

Detailed usage examples

Basics

using CategoricalArrays
scitype((2.718, 42))
Tuple{Continuous,Count}

In the MLJ convention, to construct arrays with categorical scientific element type one needs to use CategoricalArrays:

v = categorical(['a', 'c', 'a', missing, 'b'], ordered=true)
scitype(v[1])
OrderedFactor{3}
elscitype(v)
Union{Missing, OrderedFactor{3}}

Coercing to Multiclass:

w = coerce(v, Union{Missing,Multiclass})
elscitype(w)
Union{Missing, Multiclass{3}}

Working with tables

While schema is convenient for inspecting the column scitypes of a table, there is also a scitype for the tables themselves:

data = (x1=rand(10), x2=rand(10))
schema(data)
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1      │ Float64 │ Continuous │
│ x2      │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 10
scitype(data)
Table{AbstractArray{Continuous,1}}

Similarly, any table implementing the Tables interface has scitype Table{K}, where K is the union of the scitypes of its columns.

Table scitypes are useful for dispatch and type checks, as shown here, with the help of a constructor for Table scitypes provided by ScientificTypes.jl:

Table(Continuous, Count)
Table{<:Union{AbstractArray{<:Continuous},AbstractArray{<:Count}}}
scitype(data) <: Table(Continuous)
true
scitype(data) <: Table(Infinite)
true
data = (x=rand(10), y=collect(1:10), z = [1,2,3,1,2,3,1,2,3,1])
data = coerce(data, :z=>OrderedFactor)
scitype(data) <: Table(Continuous,Count,OrderedFactor)
true

Note that Table(Continuous,Finite) is a type union and not a Table instance.
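
One way to use such a check in practice is as an input guard. In this sketch, accepts_infinite is a hypothetical helper and not part of the package:

```julia
using MLJScientificTypes

# Hypothetical guard: true only if every column has Infinite element scitype
accepts_infinite(X) = scitype(X) <: Table(Infinite)

numeric = (x = rand(5), y = collect(1:5))           # Continuous and Count columns
mixed   = (x = rand(5), s = ["a", "b", "c", "d", "e"])  # includes a Textual column

accepts_infinite(numeric)   # true
accepts_infinite(mixed)     # false
```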

Tuples and arrays

The behavior of scitype on tuples is as you would expect:

scitype((1, 4.5))
Tuple{Count,Continuous}

For performance reasons, the behavior of scitype on arrays has some wrinkles, in the case of missing values:

The scitype of an array. The scitype of an AbstractArray, A, is always AbstractArray{U} where U is the union of the scitypes of the elements of A, with one exception: If typeof(A) <: AbstractArray{Union{Missing,T}} for some T different from Any, then the scitype of A is AbstractArray{Union{Missing, U}}, where U is the union over all non-missing elements, even if A has no missing elements.

julia> v = [1.3, 4.5, missing]
julia> scitype(v)
AbstractArray{Union{Missing, Continuous},1}
julia> scitype(v[1:2])
AbstractArray{Union{Missing, Continuous},1}
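
The following sketch shows where the Missing in the second result comes from, and one way (an illustration, not a package recommendation) to drop it by narrowing the machine element type with a broadcast:

```julia
using MLJScientificTypes

v = [1.3, 4.5, missing]
w = v[1:2]                  # no missing values, but the eltype still allows them
eltype(w)                   # Union{Missing, Float64}
scitype(w)                  # AbstractArray{Union{Missing, Continuous},1}
scitype_union(w)            # Continuous (inspects every element)

narrowed = identity.(w)     # broadcasting narrows the eltype to Float64
scitype(narrowed)           # AbstractArray{Continuous,1}
```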

Automatic type conversion

The autotype function uses specific rules to guess appropriate scientific types for tabular data. Such rules would typically be more constraining than the ones implied by the active convention. When autotype is used, a dictionary mapping each column to a suggested type is returned; if none of the specified rules applies, the ambient convention is used as a "fallback".

The function is called as:

autotype(X)

If the keyword only_changes is set to true, then only the column names for which the suggested type differs from that provided by the convention are returned.

autotype(X; only_changes=true)

To specify which rules are to be applied, use the rules keyword and specify a tuple of symbols referring to specific rules. The default rule is :few_to_finite, which applies a heuristic to columns that have relatively few values; these columns are then encoded with an appropriate Finite type. It is important to note that the order in which the rules are specified matters; rules will be applied in that order.

autotype(X; rules=(:few_to_finite,))

Finally, you can also use the following shorthands:

autotype(X, :few_to_finite)
autotype(X, (:few_to_finite, :discrete_to_continuous))

Available rules

Rule symbol | scitype suggestion
------------|-------------------
:few_to_finite | an appropriate Finite subtype for columns with few distinct values
:discrete_to_continuous | if not Finite, then Continuous for any Count or Integer scitypes/types
:string_to_multiclass | Multiclass for any string-like column

The autotype function can be used in conjunction with coerce:

X_coerced = coerce(X, autotype(X))

Examples

By default, autotype applies only the :few_to_finite rule:

n = 50
X = (a = rand("abc", n),         # 3 values, not number        --> Multiclass
     b = rand([1,2,3,4], n),     # 4 values, number            --> OrderedFactor
     c = rand([true,false], n),  # 2 values, number            --> OrderedFactor
     d = randn(n),               # many values                 --> unchanged
     e = rand(collect(1:n), n))  # many values                 --> unchanged
autotype(X, only_changes=true)
Dict{Symbol,Type} with 3 entries:
  :a => Multiclass
  :b => OrderedFactor
  :c => OrderedFactor

For example, we could first apply the :discrete_to_continuous rule, followed by the :few_to_finite rule. The first rule applies to b and e, but the subsequent application of the second rule means we get the same result as before, apart from e (which will be Continuous):

autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite))
Dict{Symbol,Type} with 4 entries:
  :a => Multiclass
  :b => OrderedFactor
  :e => Continuous
  :c => OrderedFactor

One should check and possibly modify the returned dictionary before passing to coerce.
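
A sketch of that check-then-coerce step, with made-up data. The override of :b is purely illustrative:

```julia
using MLJScientificTypes

n = 50
X = (b = rand([1, 2, 3, 4], n),   # few distinct integer values
     d = randn(n))                # many distinct values

# Inspect the suggestions, then override any we disagree with
suggestions = autotype(X, only_changes=true)
suggestions[:b] = Continuous      # illustrative override of the suggestion for :b

Xc = coerce(X, suggestions)
elscitype(Xc.b)   # Continuous
elscitype(Xc.d)   # Continuous (not in the dictionary, so left as is)
```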

API reference

ScientificTypes.scitype - Function
scitype(X)

The scientific type (interpretation) of X, as distinct from its machine type, as specified by the active convention.

Examples from the MLJ convention

julia> using MLJScientificTypes # or `using MLJ`
julia> scitype(3.14)
Continuous

julia> scitype([1, 2, 3, missing])
AbstractArray{Union{Missing, Count},1}

julia> scitype((5, "beige"))
Tuple{Count, Textual}

julia> using CategoricalArrays
julia> X = (gender = categorical([:M, :M, :F, :M, :F]),
            ndevices = [1, 3, 2, 3, 2])
julia> scitype(X)
Table{Union{AbstractArray{Count,1}, AbstractArray{Multiclass{2},1}}}

The specific behavior of scitype is governed by the active convention, as returned by ScientificTypes.convention(). The MLJScientificTypes.jl documentation details the convention demonstrated above.

ScientificTypes.elscitype - Function
elscitype(A)

Return the element scientific type of an abstract array A. By definition, if scitype(A) = AbstractArray{S,N}, then elscitype(A) = S.
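
For instance, a minimal sketch with a matrix, under the MLJ convention:

```julia
using MLJScientificTypes

A = [1.0 2.0; 3.0 4.0]
scitype(A)     # AbstractArray{Continuous,2}
elscitype(A)   # Continuous
```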

MLJScientificTypes.coerce - Function
coerce(A, ...; tight=false, verbosity=1)

Given a table A, return a copy of A ensuring that the scitypes of the columns match new specifications. The specifications can be given as a series of colname=>Scitype pairs or as a dictionary whose keys are names and values are scientific types:

coerce(X, col1=>scitype1, col2=>scitype2, ... ; verbosity=1)
coerce(X, d::AbstractDict; verbosity=1)

One can also specify pairs of type T1=>T2 in which case all columns with scientific element type subtyping Union{T1,Missing} will be coerced to the new specified scitype T2.

Examples

Specifying (name, scitype) pairs:

using CategoricalArrays, DataFrames, Tables
X = DataFrame(name=["Siri", "Robo", "Alexa", "Cortana"],
              height=[152, missing, 148, 163],
              rating=[1, 5, 2, 1])
Xc = coerce(X, :name=>Multiclass, :height=>Continuous, :rating=>OrderedFactor)
schema(Xc).scitypes # (Multiclass{4}, Union{Missing, Continuous}, OrderedFactor{3})

Specifying (T1, T2) pairs:

X  = (x = [1, 2, 3],
      y = rand(3),
      z = [10, 20, 30])
Xc = coerce(X, Count=>Continuous)
schema(Xc).scitypes # (Continuous, Continuous, Continuous)

MLJScientificTypes.autotype - Function
autotype(X; kw...)

Return a dictionary of suggested scitypes for each column of X, a table or an array, based on rules.

Kwargs

  • only_changes=true: if true, return only a dictionary of the names for which applying autotype differs from just using the ambient convention. When coercing with autotype, only_changes should be true.
  • rules=(:few_to_finite,): the set of rules to apply.