CBFV

CBFV.generatefeatures - Function
generatefeatures(data; elementdata,dropduplicate,combine,sumfeatures,returndataframe)
generatefeatures(dataname; kwargs...)

This is the primary function for generating CBFV features for a dataset of formulas, with or without existing features. It processes the input data, loads the requested element database, and then assigns features following the CBFV approach. If returndataframe=true, the function returns a DataFrame that includes the added columns :target and :formula.
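
To make the "CBFV approach" more concrete, the sketch below computes a few composition-based statistics for a single formula. It is not the package's implementation: the property table toyprops, its values, and the helper cbfv_sketch are made up for illustration, and the choice of statistics (fraction-weighted average, count-weighted sum, range) is an assumption based on the Python CBFV package.

```julia
# Toy element properties (made-up numbers, NOT taken from any CBFV database).
# Each element maps to a vector of two hypothetical properties.
toyprops = Dict(
    "Cd" => [1.0, 10.0],
    "N"  => [3.0, 2.0],
)

# Hypothetical helper: composition maps element => count, e.g. Cd3N2.
function cbfv_sketch(composition::Dict{String,Int}, props::Dict{String,Vector{Float64}})
    elements = collect(keys(composition))
    counts = [composition[el] for el in elements]
    fractions = counts ./ sum(counts)
    # One row per element, one column per property.
    table = reduce(vcat, [permutedims(props[el]) for el in elements])
    avg = vec(sum(table .* fractions; dims=1))   # fraction-weighted mean
    total = vec(sum(table .* counts; dims=1))    # count-weighted sum
    rng = vec(maximum(table; dims=1) .- minimum(table; dims=1))
    return (avg=avg, sum=total, range=rng)
end

stats = cbfv_sketch(Dict("Cd" => 3, "N" => 2), toyprops)
```

Each statistic yields one value per element property, which is why the generated feature set is several times wider than the element database itself.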

Note

CBFV.jl does not use an OrderedDict, so the column names are arranged according to the native Dict ordering.
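
If a reproducible column order matters, one option is to sort the names after the fact. The snippet below is a minimal sketch of that idea using a plain Dict; the feature names and values are hypothetical and only stand in for generated columns.

```julia
# Hypothetical feature names and values, for illustration only.
features = Dict("avg_b" => 2.0, "avg_a" => 1.0, "avg_c" => 3.0)

# Dict iteration order is an implementation detail, so sort the keys
# to obtain a reproducible column order.
names_sorted = sort(collect(keys(features)))
values_sorted = [features[n] for n in names_sorted]
```

For a returned DataFrame df, the same idea could be written as select(df, sort(names(df))), assuming alphabetical order is acceptable.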

Arguments

  • data::DataFrame: The data set to be featurized.
  • elementdata::Union{String,FileName} or Union{String,DataFrame}: The name of an internal database, or the file path and name of an external database.

  • dropduplicate::Bool=true: Option to drop duplicate entries.
  • combine::Bool=false: Option to combine existing features in data with the generated feature set.
  • sumfeatures::Bool=false: Option to include the sum_ feature columns.
  • returndataframe::Bool=true: Option to return a DataFrame. Will include :target and :formula columns.

Returns

  • generatedataframe::DataFrame (when returndataframe=true)
  • formulae::Vector{String}, features::Array{Number,2}, targets::Vector{Number} (otherwise)

The following featurization schemes are included within CBFV.jl:

  • oliynyk (default)
  • magpie
  • mat2vec
  • jarvis
  • onehot
  • random_200
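
Example

using DataFrames
using CBFV
# The input data needs a :formula column and a :target column.
d = DataFrame(:formula => ["Tc1V1", "Cu1Dy1", "Cd3N2"], :target => [248.539, 66.8444, 91.5034])
# Featurize with the default element database (oliynyk).
generatefeatures(d)
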
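
The keyword arguments listed under Arguments can be combined with the same input. The following is a sketch of two such calls, assuming the keywords behave as documented; it needs the CBFV and DataFrames packages installed.

```julia
using DataFrames
using CBFV

d = DataFrame(:formula => ["Tc1V1", "Cu1Dy1", "Cd3N2"],
              :target => [248.539, 66.8444, 91.5034])

# Use the magpie element database and also request the sum_ columns.
df = generatefeatures(d; elementdata="magpie", sumfeatures=true)

# With returndataframe=false the three return values can be destructured.
formulae, features, targets = generatefeatures(d; returndataframe=false)
```

The second form may be convenient when the features are passed straight to a model that expects a plain array.
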
CBFV.processelementdatabase - Method
processelementdatabase(data)

Takes the element feature dataframe and processes it, returning a dictionary with values of type Array{String,N} and an Array representation of the entire database.

Arguments

  • data::DataFrame: element feature dataframe from database file

Returns

  • elementproperties::Dict{Symbol,Array{String,N}}: dictionary with keys :symbols, :index, and :missing, each mapping to an Array{String,N} value for the dataframe
  • arrayrepresentation::Array{Any,2}: representation of the dataframe
CBFV.processinputdata - Method
processinputdata(datainput,elementdatabase)

Takes the data set that contains the formulas, target values, and any additional features, and extracts the elemental properties from the provided element database. Also returns the names of the columns (features) used in the element properties.

Arguments

  • datainput::DataFrame: data containing columns :formula and :target.
  • elementfeatures::Array{Number,2}: element feature set based on database

Returns

  • elpropnames::Array{String,1}: The names of the properties in elemental database
  • processeddata::Vector{Dict{Symbol,Any}}: The processed input data based on elemental database.
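
Processing the input starts from the strings in the :formula column. As a rough picture of what reading such a string involves, the sketch below splits a flat formula into element amounts with a regular expression. The helper parseformula_sketch is hypothetical, handles only formulas without parentheses, and is not necessarily how CBFV.jl parses formulas internally.

```julia
# Hypothetical helper: split a flat formula such as "Cd3N2" into element => amount.
function parseformula_sketch(formula::AbstractString)
    composition = Dict{String,Float64}()
    # An element symbol is one uppercase letter plus an optional lowercase letter,
    # followed by an optional (possibly decimal) amount.
    for m in eachmatch(r"([A-Z][a-z]?)(\d*\.?\d*)", formula)
        symbol = String(m.captures[1])
        amount = isempty(m.captures[2]) ? 1.0 : parse(Float64, m.captures[2])
        # Accumulate in case the same element appears more than once.
        composition[symbol] = get(composition, symbol, 0.0) + amount
    end
    return composition
end

composition = parseformula_sketch("Cd3N2")
```

A missing amount is treated as 1, so "H2O" and "H2O1" give the same composition in this sketch.
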
CBFV.readdatabasefile - Method
readdatabasefile(pathtofile)

Returns a DataFrame of an elemental database file in databases/.

Arguments

  • pathtofile::String: path to the CSV formatted file to read
  • stringtype::Type{Union{String,InlineString}}=String : CSV.jl string storage type
  • pool::Bool=false : CSV.File will pool String column values.

Returns

  • data::DataFrame: the dataframe representation of the CSV file.

Note

Some of the behaviors of CSV.jl create data types that are inconsistent with several of the function argument types in CBFV. If you use this function to read the data files, the data frame constructed via CSV will work properly.
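
As an illustration of this note, the sketch below reads a database with readdatabasefile and passes the resulting DataFrame as elementdata. The file location is an assumption: it presumes the package keeps its databases in a top-level databases/ folder and that the default one is stored as oliynyk.csv. The function is called with the CBFV. prefix in case it is not exported.

```julia
using DataFrames
using CBFV

# Assumption: databases/oliynyk.csv exists inside the installed package.
dbpath = joinpath(pkgdir(CBFV), "databases", "oliynyk.csv")
elementdb = CBFV.readdatabasefile(dbpath)

d = DataFrame(:formula => ["Tc1V1", "Cu1Dy1"], :target => [248.539, 66.8444])

# According to the argument list, elementdata also accepts a DataFrame.
generated = generatefeatures(d; elementdata=elementdb)
```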

CBFV.FileName - Type

Data type used by generatefeatures for multiple dispatch. Allows passing an external database.
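
A possible way to use this type is sketched below. It assumes that FileName is constructed from a single path string, which is not confirmed by the text above, and it reuses a database shipped with the package (assumed to live at databases/oliynyk.csv) as a stand-in for an external file.

```julia
using DataFrames
using CBFV

# Stand-in for an external database; replace with the path to your own CSV file.
# Assumption: databases/oliynyk.csv exists inside the installed package.
externaldb = joinpath(pkgdir(CBFV), "databases", "oliynyk.csv")

d = DataFrame(:formula => ["Tc1V1", "Cu1Dy1"], :target => [248.539, 66.8444])

# Assumption: FileName wraps the path so that dispatch treats it as a file
# rather than as the name of an internal database.
generated = generatefeatures(d; elementdata=CBFV.FileName(externaldb))
```

Wrapping the path in a dedicated type lets a plain String keep its meaning as the name of an internal database.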