AMLPipelineBase.jl



AMLPipelineBase.jl is the Base package of TSML.jl, AutoMLPipeline.jl, and Lale.jl.

AMLPipelineBase is written in pure Julia. It exposes the abstract types shared by TSML and AutoMLPipeline, and it contains basic data preprocessing routines and learners for rapid prototyping. TSML extends AMLPipelineBase by specializing in time-series workflows, while AutoMLPipeline focuses on ML pipeline optimization. Because AMLPipelineBase and its dependencies are written in pure Julia, a future goal is to exploit Julia's native multi-threading with thread-safe Julia ML libraries for scalability and performance.

AMLPipelineBase declares the following abstract data types:

abstract type Machine end
abstract type Computer    <: Machine  end
abstract type Workflow    <: Machine  end
abstract type Learner     <: Computer end
abstract type Transformer <: Computer end

AMLPipelineBase dynamically dispatches the fit! and transform! functions, which must be overloaded by the concrete subtypes of Machine. The generic fallbacks raise an error when a subtype does not provide its own method:

function fit!(mc::Machine, input::DataFrame, output::Vector)
   error(typeof(mc), " not implemented")
end

function transform!(mc::Machine, input::DataFrame)
   error(typeof(mc), " not implemented")
end
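
The effect of these fallbacks can be seen in a minimal, package-free sketch. The type and function names mirror the declarations above, but everything is redefined locally, the input is left untyped instead of being a DataFrame, and `Doubler` and `Untrained` are hypothetical types used only for illustration.

```julia
# Local stand-ins mirroring the hierarchy above (not the package's own definitions).
abstract type Machine end
abstract type Computer    <: Machine  end
abstract type Transformer <: Computer end

# Generic fallbacks: reached only when a subtype defines no method of its own.
fit!(mc::Machine, input, output::Vector) = error(typeof(mc), " not implemented")
transform!(mc::Machine, input)           = error(typeof(mc), " not implemented")

# Hypothetical transformer that overloads both functions.
mutable struct Doubler <: Transformer
    fitted::Bool
    Doubler() = new(false)
end
fit!(d::Doubler, input, output::Vector=[]) = (d.fitted = true; nothing)
transform!(d::Doubler, input) = 2 .* input

# Hypothetical transformer that overloads nothing.
struct Untrained <: Transformer end

d = Doubler()
fit!(d, [1, 2, 3])
doubled = transform!(d, [1, 2, 3])   # dispatches to the Doubler method

# Dispatch falls through to the Machine fallback, which raises an error.
failed = try
    transform!(Untrained(), [1, 2, 3])
    false
catch err
    err isa ErrorException
end
```

Because dispatch always picks the most specific method, a new subtype only has to add its own fit! and transform! methods; no existing code needs to change.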

Motivations

  • To provide a Base package for common functions and abstractions shared by:
    • AutoMLPipeline: A package for ML Pipeline Optimization
    • TSML: A package for Time-Series ML
    • Lale: A Julia wrapper of Python's Lale for semi-automated data science.
  • To implement an efficient multi-threaded reduction workflow

Package Features

  • Symbolic pipeline API for high-level description and easy expression of complex pipeline structures and workflows
  • Easily extensible architecture by overloading just two main interfaces: fit! and transform!
  • Meta-ensembles that allow composition of ensembles of ensembles (recursively if needed) for robust prediction routines
  • Categorical and numerical feature selectors for specialized preprocessing routines based on types
  • Normalizers (zscore, unitrange, pca, fa) and Ensemble learners (voting, stacks, best)

Extending AMLPipelineBase

Adding your own filter, transformer, or learner is straightforward. Note that filters and transformers process only the input features and ignore the target output, while learners process both the input features and the target output passed to the fit! function. The transform! function always expects a single input argument.

First, import the abstract types and define your own mutable structure as subtype of either Learner or Transformer. Next, import the fit! and transform! functions to be overloaded. Also, load the DataFrames package to be used for data interchange.

using DataFrames
using AMLPipelineBase.AbsTypes

# import fit! and transform! for function overloading 
import AMLPipelineBase.AbsTypes: fit!, transform!  

# export new definitions for dynamic dispatch
export fit!, transform!, MyFilter

# define your filter structure
mutable struct MyFilter <: Transformer
  name::String
  model::Dict

  function MyFilter(args::Dict=Dict())
      ....
  end
end

# filters and transformers ignore the target argument. 
# learners process both the input features and target argument.
function fit!(fl::MyFilter, inputfeatures::DataFrame, target::Vector=Vector())
     ....
end

# transform! function expects an input dataframe and outputs a dataframe
function transform!(fl::MyFilter, inputfeatures::DataFrame)::DataFrame
     ....
end

Note that the main data interchange format is a dataframe, so the input to fit! and transform! and the output of transform! should always be dataframes. This is necessary so that the pipeline passes data consistently between its filters, transformers, and learners. Once you have created a filter, you can use it as part of a pipeline together with the other learners and filters.
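
As a concrete illustration of the template above, here is a minimal sketch of a complete filter. `MeanCenter` and its `:means` entry are hypothetical and not part of the package. The sketch assumes that every column is numeric and that the abstract type is reachable as `AbsTypes.Transformer`; it uses the qualified name so that it does not depend on what the submodule exports.

```julia
using DataFrames
using Statistics: mean
using AMLPipelineBase: AbsTypes
import AMLPipelineBase.AbsTypes: fit!, transform!

# Hypothetical filter: centers every column on its training mean.
mutable struct MeanCenter <: AbsTypes.Transformer
  name::String
  model::Dict{Symbol,Any}
  MeanCenter(name::String="meancenter") = new(name, Dict{Symbol,Any}())
end

# Filters ignore the target; only the column means are stored.
function fit!(mc::MeanCenter, features::DataFrame, target::Vector=Vector())
  mc.model[:means] = Dict(n => mean(features[!, n]) for n in names(features))
  return nothing
end

# Returns a new dataframe; the input is left unchanged.
function transform!(mc::MeanCenter, features::DataFrame)::DataFrame
  centered = copy(features)
  for (n, m) in mc.model[:means]
    centered[!, n] = features[!, n] .- m
  end
  return centered
end

df = DataFrame(a=[1.0, 2.0, 3.0], b=[10.0, 20.0, 30.0])
mc = MeanCenter()
fit!(mc, df)
centered = transform!(mc, df)
```

Storing the fitted state in `model` keeps transform! free of any dependence on the training data, so the same fitted filter can be applied to new dataframes with the same columns.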

Installation

AMLPipelineBase is in the official Julia package registry. The latest release can be installed from the Julia prompt using Julia's package manager, which is activated by pressing ] at the prompt:

julia> ]
pkg> update
pkg> add AMLPipelineBase

The steps below outline a typical way to preprocess and model a dataset.

1. Load Data, Extract Input (X) and Target (Y)
# Make sure that the input feature is a dataframe and the target output is a 1-D vector.
using AMLPipelineBase
profbdata = getprofb()
X = profbdata[:,2:end] 
Y = profbdata[:,1] |> Vector;
head(x)=first(x,5)
head(profbdata)

2. Load Filters, Transformers, and Learners

#### categorical preprocessing
ohe = OneHotEncoder()

#### Column selector
catf = CatFeatureSelector() 
numf = NumFeatureSelector()

#### Learners
rf    = RandomForest()
ada   = Adaboost()
pt    = PrunedTree()
stack = StackEnsemble()
best  = BestLearner()
vote  = VoteEnsemble()

3. Filter categories and one-hot encode them

pohe = catf |> ohe
tr   = fit_transform!(pohe,X,Y)

4. Filter numeric features

pdec = numf
tr   = fit_transform!(pdec,X,Y)
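
Conceptually, steps 3 and 4 split the columns by element type and expand each categorical column into indicator columns. The sketch below reproduces that idea without the package, using a NamedTuple of vectors as a stand-in for a dataframe. `split_by_type` and `onehot` are hypothetical helpers, and the actual selectors and encoder may differ in details such as column naming and how they decide that a column is categorical.

```julia
# Stand-in for a dataframe: a NamedTuple of equally long column vectors.
table = (color = ["red", "blue", "red"], size = [1.5, 2.0, 0.5])

# Hypothetical helper: group column names by whether their elements are numbers.
function split_by_type(tbl::NamedTuple)
    isnum(c) = eltype(tbl[c]) <: Number
    cols = collect(keys(tbl))
    return (categorical = filter(!isnum, cols), numeric = filter(isnum, cols))
end

# Hypothetical helper: one indicator vector per distinct level of a column.
function onehot(column::AbstractVector)
    levels = sort(unique(column))
    return Dict(level => Float64.(column .== level) for level in levels)
end

groups  = split_by_type(table)
encoded = onehot(table.color)
```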

5. A Pipeline for the Voting Ensemble Classification

# take all categorical columns and hot-bit encode each, 
# concatenate them to the numerical features,
# and feed them to the voting ensemble
pvote = (catf |> ohe) + (numf) |> vote
pred  = fit_transform!(pvote,X,Y)
sc    = score(:accuracy,pred,Y)
println(sc)
### cross-validate
acc(X,Y) = score(:accuracy,X,Y)
crossvalidate(pvote,X,Y,acc,10,true)
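
For readers unfamiliar with what the last line estimates, here is a minimal, package-free sketch of k-fold cross-validated accuracy. The helpers `accuracy`, `kfolds`, `majority`, and `cv_accuracy` are hypothetical, the "learner" is a trivial majority-class predictor, and scaling the score to percent is an assumption. The package's own crossvalidate may partition, scale, and report results differently.

```julia
using Random
using Statistics: mean, std

# Percentage of positions where prediction and truth agree.
accuracy(pred, truth) = 100 * mean(pred .== truth)

# Hypothetical helper: shuffle 1:n and deal the indices into k folds.
function kfolds(n::Int, k::Int; rng=MersenneTwister(1))
    idx = shuffle(rng, 1:n)
    return [idx[i:k:n] for i in 1:k]
end

# Trivial stand-in learner: always predict the most frequent training label.
function majority(labels)
    levels = unique(labels)
    return levels[argmax([count(==(l), labels) for l in levels])]
end

# Train on k-1 folds, score on the held-out fold, repeat for every fold.
function cv_accuracy(y::Vector, k::Int)
    folds  = kfolds(length(y), k)
    scores = map(folds) do test
        train = setdiff(1:length(y), test)
        guess = majority(y[train])
        accuracy(fill(guess, length(test)), y[test])
    end
    return (mean = mean(scores), std = std(scores), folds = k)
end

y = repeat(["a", "a", "a", "b"], 10)   # 30 "a" labels and 10 "b" labels
result = cv_accuracy(y, 5)
```

Since every fold has the same size and the majority label is always "a", the mean score equals the share of "a" labels in the data, which makes the sketch easy to check by hand.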

6. Use @pipelinex instead of @pipeline to print the corresponding function calls in 5

julia> @pipelinex (catf |> ohe) + (numf) |> vote
:(Pipeline(ComboPipeline(Pipeline(catf, ohe), numf), vote))

# another way is to use @macroexpand with @pipeline
julia> @macroexpand @pipeline (catf |> ohe) + (numf) |> vote
:(Pipeline(ComboPipeline(Pipeline(catf, ohe), numf), vote))
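
One way to picture this translation is a small expression walker that replaces operator calls with constructor calls. The sketch below is not the package's macro: `OPMAP` and `rewrite` are hypothetical names, and only the |> and + operators shown above are handled.

```julia
# Hypothetical mapping from operators to constructor names.
const OPMAP = Dict(:|> => :Pipeline, :+ => :ComboPipeline)

# Symbols and literals are left untouched.
rewrite(x) = x

# Operator calls are replaced by constructor calls, recursively.
function rewrite(ex::Expr)
    args = map(rewrite, ex.args)
    if ex.head == :call && haskey(OPMAP, ex.args[1])
        return Expr(:call, OPMAP[ex.args[1]], args[2:end]...)
    end
    return Expr(ex.head, args...)
end

expr = rewrite(:((catf |> ohe) + (numf) |> vote))
println(expr)
```

The grouping follows Julia's operator precedence: + binds more tightly than |>, so the sum of the two branches is built first and then piped into vote.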

7. A Pipeline for the Random Forest (RF) Classification

# hot-bit encode the categorical columns,
# combine them with the numerical features,
# and feed all to the random forest classifier
prf  = (catf|> ohe) + numf   |> rf
pred = fit_transform!(prf,X,Y)
score(:accuracy,pred,Y) |> println
crossvalidate(prf,X,Y,acc,10)

9. A Pipeline for Random Forest Regression

using Statistics
iris      = getiris()
Xreg      = iris[:,1:3]
Yreg      = iris[:,4] |> Vector
rfreg     = (catf |> ohe) + (numf) |> rf
pred      = fit_transform!(rfreg,Xreg,Yreg)
rmse(X,Y) = mean((X .- Y).^2) |> sqrt
res       = crossvalidate(rfreg,Xreg,Yreg,rmse,10,true)

Note: More examples can be found in the TSML and AutoMLPipeline packages. Since the code is written in Julia, you are highly encouraged to read the source code and feel free to extend or adapt the package to your problem. Please feel free to submit PRs to improve the package features.

10. Performance Comparison of Several Learners

10.1 Sequential Processing
using Random
using DataFrames
Random.seed!(1)

disc   = CatNumDiscriminator()
catf   = CatFeatureSelector()
numf   = NumFeatureSelector()
ohe    = OneHotEncoder()
rf     = RandomForest()
ada    = Adaboost()
tree   = PrunedTree()
stack  = StackEnsemble()
best   = BestLearner()
vote   = VoteEnsemble()

learners = DataFrame()
for learner in [rf,ada,tree,stack,vote,best]
    pcmc = disc |> ((catf |> ohe) + numf) |> learner
    println(learner.name)
    mu,sd,_   = crossvalidate(pcmc,X,Y,acc,10,true)
    learners  = vcat(learners,DataFrame(name=learner.name,mean=mu,sd=sd))
end;
@show learners;

10.2 Parallel Processing
using Random
using DataFrames
using Distributed

nprocs() == 1 && addprocs()
@everywhere using DataFrames
@everywhere using AMLPipelineBase

rf     = RandomForest()
ada    = Adaboost()
tree   = PrunedTree()
stack  = StackEnsemble()
best   = BestLearner()
vote   = VoteEnsemble()
disc   = CatNumDiscriminator()
catf   = CatFeatureSelector()
numf   = NumFeatureSelector()
ohe    = OneHotEncoder()

@everywhere acc(X,Y) = score(:accuracy,X,Y)

learners = @distributed (vcat) for learner in [rf,ada,tree,stack,vote,best]
    pcmc = disc |> ((catf |> ohe) + (numf)) |> learner
    println(learner.name)
    mean,sd,_ = crossvalidate(pcmc,X,Y,acc,10,true)
    DataFrame(name=learner.name,mean=mean,sd=sd)
end
@show learners;
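
The reduction pattern used above can be tried in isolation with only the Distributed standard library. In this sketch `evaluate` is a hypothetical stand-in for a cross-validation run, and no worker processes are added, so the loop simply runs on the main process. Calling addprocs() first would spread the iterations across workers.

```julia
using Distributed

# Hypothetical stand-in for an expensive evaluation such as a cross-validation run.
# @everywhere defines it on every process, including workers added earlier.
@everywhere evaluate(k) = (name = "model$k", score = k^2)

# Each iteration returns a one-element vector; vcat merges them into one result.
rows = @distributed (vcat) for k in 1:4
    [evaluate(k)]
end
```

Passing a reducer such as vcat makes @distributed wait for all iterations and return the combined value, which is why the loop result can be assigned directly.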

11. Automatic Selection of Best Learner

You can use the * operator as a selector function that outputs the result of the best learner among its operands. With the same preprocessing pipeline as in 10, the cross-validated accuracy of the selector is expected to be close to that of the best individual learner among rf, ada, and tree.

Random.seed!(1)
pcmc = disc |> ((catf |> ohe) + (numf)) |> (rf * ada * tree)
crossvalidate(pcmc,X,Y,acc,10,true)
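
The idea behind such a selector can be sketched without the package: score every candidate on held-out data and keep the one that does best. Everything below is hypothetical (the toy data, the two candidate "learners", and `select_best`), and it is not a description of how the * operator is implemented.

```julia
using Statistics: mean

# Toy data: the label is "hi" exactly when the feature exceeds 5.
x = collect(1:10)
y = [v > 5 ? "hi" : "lo" for v in x]

accuracy(pred, truth) = 100 * mean(pred .== truth)

# Hypothetical candidates: each maps training data to a prediction function.
always_lo(xtr, ytr) = v -> "lo"
function threshold(xtr, ytr)
    cut = mean(xtr)
    return v -> v > cut ? "hi" : "lo"
end

# Fit every candidate on the odd rows, score it on the even rows, keep the best.
function select_best(candidates, x, y)
    train, test = 1:2:length(x), 2:2:length(x)
    scores = map(candidates) do (name, build)
        predict = build(x[train], y[train])
        name => accuracy(predict.(x[test]), y[test])
    end
    return last(sort(scores; by = last))   # pair with the highest score
end

candidates = ["always_lo" => always_lo, "threshold" => threshold]
best = select_best(candidates, x, y)
```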

12. Learners as Transformers

It is also possible to use learners in the middle of an expression, where they serve as transformers whose outputs become inputs to the final learner, as illustrated below.

expr = ( 
         ((numf)+(catf |> ohe) |> rf) +
         ((numf)+(catf |> ohe) |> ada) +
         ((numf)+(catf |> ohe) |> tree) 
       ) |> ohe |> rf;                
crossvalidate(expr,X,Y,acc,10,true)

One can even include a selector function as part of the transformer preprocessing routine:

pjrf = disc |> ((catf |> ohe) + (numf |> rf)) |> 
               ((rf * ada ) + (rf * tree * vote)) |> ohe |> ada
crossvalidate(pjrf,X,Y,acc,10,true)

Note: The ohe is necessary in both examples because the outputs of the learners and of the selector function are categorical values that need to be hot-bit encoded before being fed to the final learner (rf in the first example, ada in the second).

13. Tree Visualization of the Pipeline Structure

You can visualize the pipeline using the AbstractTrees Julia package.

# package installation 
using Pkg
Pkg.update()
Pkg.add("AbstractTrees") 

# load the packages
using AbstractTrees
using AMLPipelineBase

julia> expr = @pipelinex (catf |> ohe) + (numf) |> rf
:(Pipeline(ComboPipeline(Pipeline(catf, ohe), numf), rf))

julia> print_tree(stdout, expr)
:(Pipeline(ComboPipeline(Pipeline(catf, ohe), numf), rf))
├─ :Pipeline
├─ :(ComboPipeline(Pipeline(catf, ohe), numf))
  ├─ :ComboPipeline
  ├─ :(Pipeline(catf, ohe))
    ├─ :Pipeline
    ├─ :catf
    └─ :ohe
  └─ :numf
└─ :rf
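
If AbstractTrees is not installed, the same nesting can be inspected with a few lines of plain Julia. `showtree` below is a hypothetical helper, not part of AMLPipelineBase or AbstractTrees, and it assumes the expression consists only of nested constructor calls like the one above.

```julia
# Hypothetical helper: print one node per line, indented by nesting depth.
function showtree(io::IO, ex, depth::Int=0)
    indent = "  " ^ depth
    if ex isa Expr && ex.head == :call
        println(io, indent, ex.args[1])   # constructor name, e.g. Pipeline
        foreach(a -> showtree(io, a, depth + 1), ex.args[2:end])
    else
        println(io, indent, ex)           # leaf, e.g. catf
    end
end

expr = :(Pipeline(ComboPipeline(Pipeline(catf, ohe), numf), rf))
showtree(stdout, expr)
```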

Feature Requests and Contributions

We welcome contributions, feature requests, and suggestions. Please open an issue for any problems you encounter. If you want to contribute, please follow the guidelines in the contributors page.

Help usage

Usage questions can be posted in: