MultiData
The aim of this package is to provide a simple and comfortable interface for managing multimodal data. It is built on top of DataFrames.jl with Machine learning applications in mind.
Installation
Currently this packages is still not registered so you need to run the following commands in a Julia REPL to install it:
import Pkg
Pkg.add("MultiData")
To install the developement version, run:
import Pkg
Pkg.add("https://github.com/aclai-lab/MultiData.jl#dev")
Usage
To instantiate a multimodal dataset, use the MultiDataset
constructor by providing: a) a DataFrame
containing all variables from different modalities, and b) a Vector{Vector{Union{Symbol,String,Int64}}}
object representing a grouping of some of the variables (identified by column index or name) into different modalities.
julia> using MultiData
julia> ts_cos = [cos(i) for i in 1:50000];
julia> ts_sin = [sin(i) for i in 1:50000];
julia> df_data = DataFrame(
:id => [1, 2],
:age => [30, 9],
:name => ["Python", "Julia"],
:stat => [deepcopy(ts_sin), deepcopy(ts_cos)]
)
2×4 DataFrame
Row │ id age name stat
│ Int64 Int64 String Array…
─────┼─────────────────────────────────────────────────────────
1 │ 1 30 Python [0.841471, 0.909297, 0.14112, -0…
2 │ 2 9 Julia [0.540302, -0.416147, -0.989992,…
julia> grouped_variables = [[2,3], [4]]; # group 2nd and 3rd variables in the first modality
# the 4th variable in the second modality and
# leave the first variable as a "spare variable"
julia> md = MultiDataset(df_data, grouped_variables)
● MultiDataset
└─ dimensionalities: (0, 1)
- Modality 1 / 2
└─ dimensionality: 0
2×2 SubDataFrame
Row │ age name
│ Int64 String
─────┼───────────────
1 │ 30 Python
2 │ 9 Julia
- Modality 2 / 2
└─ dimensionality: 1
2×1 SubDataFrame
Row │ stat
│ Array…
─────┼───────────────────────────────────
1 │ [0.841471, 0.909297, 0.14112, -0…
2 │ [0.540302, -0.416147, -0.989992,…
- Spare variables
└─ dimensionality: 0
2×1 SubDataFrame
Row │ id
│ Int64
─────┼───────
1 │ 1
2 │ 2
Now md
holds a MultiDataset
and all of its modalities can be conveniently iterated as elements of a Vector
:
julia> for (i, f) in enumerate(md)
println("Modality: ", i)
println(f)
println()
end
Modality: 1
2×2 SubDataFrame
Row │ age name
│ Int64 String
─────┼───────────────
1 │ 30 Python
2 │ 9 Julia
Modality: 2
2×1 SubDataFrame
Row │ stat
│ Array…
─────┼───────────────────────────────────
1 │ [0.841471, 0.909297, 0.14112, -0…
2 │ [0.540302, -0.416147, -0.989992,…
Note that each element of a MultiDataset
is a SubDataFrame
:
julia> eltype(md)
SubDataFrame
Spare variables will never be seen when accessing a MultiDataset
through its iterator interface. To access them see sparevariables
.