EmbeddingsTools._check_tokens — Method

    _check_tokens(
        words::Vector{String},
        vocab::Vector{String}
    )::Vector{Bool}

Returns a vector of Boolean values indicating whether each of the words is present in the vocabulary vocab.
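The behavior can be sketched in plain Julia (a minimal illustration with a hypothetical name, not the package's implementation):

```julia
# Minimal sketch of a membership check: for each word, test whether it
# occurs in the vocabulary.
check_tokens(words::Vector{String}, vocab::Vector{String})::Vector{Bool} =
    [word in vocab for word in words]

check_tokens(["cat", "dog"], ["cat", "fish"])  # → [true, false]
```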
EmbeddingsTools._ext — Method

    _ext(path::AbstractString)::String

Returns the extension of the file at path, if any, and an empty string otherwise.
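A sketch of this helper using Base.splitext (the name is hypothetical, and whether the package keeps the leading dot is an implementation detail; this version drops it):

```julia
# Sketch: extract a file extension without the leading dot, or return "".
# Base.splitext returns (root, ext), where ext includes the dot.
function ext_of(path::AbstractString)::String
    ext = splitext(path)[2]          # e.g. ".vec" or ""
    isempty(ext) ? "" : ext[2:end]   # drop the leading dot
end

ext_of("embeddings/glove.vec")  # → "vec"
ext_of("README")                # → ""
```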
EmbeddingsTools._get_vocab_index — Method

    _get_vocab_index(
        word::String,
        vocab::Vector{String}
    )::Int

Returns the index of word in the vocabulary vocab. Asserts that word is present in the vocabulary.
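A minimal sketch of such a lookup in plain Julia (hypothetical name; the package's implementation may differ):

```julia
# Sketch: find the position of a word in the vocabulary, asserting presence.
function vocab_index(word::String, vocab::Vector{String})::Int
    i = findfirst(==(word), vocab)
    @assert !isnothing(i) "word not found in vocabulary"
    return i
end

vocab_index("fish", ["cat", "fish", "dog"])  # → 2
```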
EmbeddingsTools._get_vocab_index — Method

    _get_vocab_index(
        word::String,
        vocab::Vector{String}
    )::Int

Returns the index of word in the vocabulary vocab. Asserts that word is present in the vocabulary. A word can be a substring.
EmbeddingsTools._get_vocab_indices — Method

    _get_vocab_indices(
        words::Vector{String},
        vocab::Vector{String}
    )::Vector{Int}

Returns a vector with the index of each word in words in the vocabulary vocab, provided that the word is present in the vocabulary.
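The "provided that the word is present" filtering can be sketched with a comprehension (hypothetical name, not the package's code):

```julia
# Sketch: indices of the words that do occur in the vocabulary, in input
# order; absent words are silently skipped.
vocab_indices(words::Vector{String}, vocab::Vector{String})::Vector{Int} =
    [findfirst(==(w), vocab) for w in words if w in vocab]

vocab_indices(["dog", "fish", "cat"], ["cat", "dog"])  # → [2, 1]
```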
EmbeddingsTools.get_vector — Method

    get_vector(emb::IndexedWordEmbedding, query::String)

get_vector() returns the embedding vector (Float32) for a given token query. It is called with the embedding object rather than with the dictionary. Type-stable: it returns a view of the embedding vector or throws an exception.
EmbeddingsTools.get_vector — Method

    get_vector(emb::WordEmbedding, query::String)

get_vector() returns the embedding vector (Float32) for a given token query. It is called with the embedding object rather than with the dictionary. Type-stable: it returns a view of the embedding vector or throws an exception.
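The "view, or throw" lookup can be sketched with a toy struct (all names here are hypothetical; IndexedWordEmbedding's actual fields may differ):

```julia
# Sketch: a Dict maps tokens to column indices, and a view into the
# embedding matrix is returned (no copy); an unknown query raises KeyError.
struct ToyEmbedding
    vocab::Dict{String,Int}
    embeddings::Matrix{Float32}   # one column per token
end

get_vec(emb::ToyEmbedding, query::String) =
    @view emb.embeddings[:, emb.vocab[query]]

emb = ToyEmbedding(Dict("cat" => 1, "dog" => 2), Float32[1 3; 2 4])
get_vec(emb, "dog")  # → Float32[3.0, 4.0]
```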
EmbeddingsTools.limit — Method

    limit(emb::AbstractEmbedding, n::Integer)::WordEmbedding

The limit() function creates a copy of an existing word embedding that contains only the first n tokens. It is similar to using the max_vocab_size argument of read_embedding(). However, read_vec() + limit() or read_vec() + subspace() is generally faster than the max_vocab_size or keep_words arguments, respectively.
EmbeddingsTools.limit — Method

    limit(emb::IndexedWordEmbedding, n::Integer)::IndexedWordEmbedding

The limit() function creates a copy of an existing indexed word embedding that contains only the first n tokens. It is similar to using the max_vocab_size argument of read_embedding(). However, read_vec() + limit() or read_vec() + subspace() is generally faster than the max_vocab_size or keep_words arguments, respectively.
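Conceptually, limiting amounts to keeping the first n vocabulary entries and the corresponding embedding columns; a toy sketch on plain arrays (hypothetical name, not the package's code):

```julia
# Sketch: keep the first n tokens and their embedding columns.
limit_table(vocab::Vector{String}, M::Matrix{Float32}, n::Integer) =
    (vocab[1:n], M[:, 1:n])

vocab, M = limit_table(["a", "b", "c"], Float32[1 2 3; 4 5 6], 2)
# vocab == ["a", "b"]; M == Float32[1 2; 4 5]
```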
EmbeddingsTools.read_emb — Method

    read_emb(path::AbstractString)::WordEmbedding

Reads word embeddings from local binary embedding table files in .jld and .emb formats. These are Julia binary files that contain a WordEmbedding object under the name "embedding".
EmbeddingsTools.read_embedding — Method

    read_embedding(
        path::AbstractString;
        delim::AbstractChar=' ',
        max_vocab_size::Union{Int,Nothing}=nothing,
        keep_words::Union{Vector{String},Nothing}=nothing
    )::WordEmbedding

The function read_embedding() reads embedding files in the conventional way and creates a WordEmbedding object using CSV.jl. It takes the path to a local embedding file as an argument and accepts three optional keyword arguments: delim, max_vocab_size, and keep_words.

If max_vocab_size is specified, the function limits the vocabulary to that number of tokens. If a vector keep_words is provided, only those words are kept. If a word in keep_words is not found, the function returns a zero vector for that word.

If the file is a WordEmbedding object within a Julia binary file (with extension .jld, or in the specific formats .emb or .wem), the entire embedding is loaded and the keyword arguments are not applicable. You can also use the read_emb() function directly on binary files. For pre-saved indexed embeddings, read_indexed_emb() is currently the only option.

Notes

If you set max_vocab_size ≥ 45k, performance may suffer compared to limit(read_embedding(path), max_vocab_size), because this parameter restricts CSV.jl to a single thread. However, using max_vocab_size allocates less memory than reading the entire file.

In addition, using keep_words with as many as 1k selected words is significantly slower, yet more memory-efficient, than subspace(index(read_embedding(path)), keep_words). For 10k selected words, reading may take more than 5 seconds.
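The conventional text layout these readers consume is one token per line followed by its vector components. A minimal hand-rolled round trip in plain Julia (no CSV.jl, no header line, hypothetical names) illustrates the format:

```julia
# Write a tiny embedding table in a .vec-style text layout
# (token followed by its components, space-delimited) and parse it back.
path = joinpath(mktempdir(), "toy.vec")
write(path, "cat 0.1 0.2\ndog 0.3 0.4\n")

vocab = String[]
cols = Vector{Float32}[]
for line in eachline(path)
    parts = split(line)
    push!(vocab, parts[1])
    push!(cols, parse.(Float32, parts[2:end]))
end
M = reduce(hcat, cols)   # 2×2 Float32 matrix, one column per token
```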
EmbeddingsTools.read_giant_vec — Method

    read_giant_vec(
        path::AbstractString;
        delim::AbstractChar=' ',
        max_vocab_size::Union{Int,Nothing}=nothing,
        keep_words::Union{Vector{String},Nothing}=nothing
    )::WordEmbedding

A conservative version of read_embedding() that handles large embedding tables, such as those used in FastText; it is adapted from a similar function in Embeddings.jl. The function reads a local embedding matrix from path line by line and creates a WordEmbedding object. Additionally, you can provide the delimiter using delim and retain only certain words by specifying a list keep_words. However, this function can be slow, so we recommend setting the max_vocab_size parameter to a value below 150k.
EmbeddingsTools.read_indexed_emb — Method

    read_indexed_emb(path::AbstractString)::WordEmbedding

Reads indexed word embeddings from local binary embedding table files in .jld2 and .iem formats. These are Julia binary files that contain an IndexedWordEmbedding object under the name "embedding".
EmbeddingsTools.read_vec — Method

    read_vec(path::AbstractString; delim::AbstractChar=' ')::WordEmbedding

The function read_vec() reads a local embedding matrix from a text file (.txt, .vec, etc.) at the given path and creates a WordEmbedding object using the CSV.jl package. The delimiter can be set with the delim parameter. This function is a simplified version of read_embedding(): it always reads the entire embedding table, which keeps the logic straightforward.
EmbeddingsTools.reduce_emb — Method

    reduce_emb(emb::AbstractEmbedding, k::Integer; method::String="pca")::WordEmbedding

This function takes an existing word embedding and reduces its embedding vectors to a specified number of dimensions k, returning a new WordEmbedding object. You can choose between two reduction techniques by setting the method parameter to either "pca" for Principal Component Analysis or "svd" for Singular Value Decomposition.
EmbeddingsTools.reduce_emb — Method

    reduce_emb(emb::IndexedWordEmbedding, k::Integer; method::String="pca")::WordEmbedding

This function takes an existing indexed word embedding and reduces its embedding vectors to a specified number of dimensions k, returning a new IndexedWordEmbedding object. You can choose between two reduction techniques by setting the method parameter to either "pca" for Principal Component Analysis or "svd" for Singular Value Decomposition.
EmbeddingsTools.reduce_pca — Function

    reduce_pca(X::Matrix{Float32}, k::Int=2)::Matrix{Float32}

Reduces a matrix using Principal Component Analysis. This function returns the input matrix X transformed using the first k principal components. It assumes that the matrix is transposed, with observations in columns and variables in rows.

Note: this function doesn't use MultivariateStats.jl, in order to avoid unnecessary dependencies. We recommend PCA from MultivariateStats.jl for general-purpose principal component analysis.
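Since the docstring notes that MultivariateStats.jl is avoided, here is a minimal PCA sketch using only the LinearAlgebra and Statistics stdlibs, under the stated layout (variables in rows, observations in columns); the name is hypothetical and this is not the package's exact implementation:

```julia
using LinearAlgebra, Statistics

# Sketch: project X onto its first k principal components via SVD of the
# row-centered matrix; the left singular vectors are the PC directions.
function pca_reduce(X::Matrix{Float32}, k::Int=2)::Matrix{Float32}
    Xc = X .- mean(X; dims=2)   # center each variable (row)
    F = svd(Xc)
    F.U[:, 1:k]' * Xc           # k × n_observations score matrix
end

Y = pca_reduce(Float32[1 2 3; 4 5 6], 1)   # 1×3 matrix of PC-1 scores
```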
EmbeddingsTools.reduce_svd — Function

    reduce_svd(X::Matrix{Float32}, k::Int=2)::Matrix{Float32}

Reduces a matrix using Singular Value Decomposition. This function returns the input matrix X transformed using the first k singular values. It assumes that the matrix is transposed, with observations in columns and variables in rows.
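A rank-k projection of this kind can be sketched with the LinearAlgebra stdlib (hypothetical name; the package's exact formulation may differ):

```julia
using LinearAlgebra

# Sketch: keep the k leading singular directions of X
# (variables in rows, observations in columns).
function svd_reduce(X::Matrix{Float32}, k::Int=2)::Matrix{Float32}
    F = svd(X)
    Diagonal(F.S[1:k]) * F.Vt[1:k, :]   # k × n_observations
end

Y = svd_reduce(Float32[1 2; 3 4], 1)   # 1×2 matrix
```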
EmbeddingsTools.safe_get — Method

    safe_get(emb::IndexedWordEmbedding, query::String)

For internal use only. This function is similar to get_vector() but returns a zero vector if query is not in the vocabulary.
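The fallback behavior maps directly onto Base.get with a default (a Dict-based toy with hypothetical names, not the package's code):

```julia
# Sketch: like a vector lookup, but fall back to a zero vector for
# out-of-vocabulary queries instead of throwing.
safe_lookup(d::Dict{String,Vector{Float32}}, query::String, dim::Int) =
    get(d, query, zeros(Float32, dim))

d = Dict("cat" => Float32[0.1, 0.2])
safe_lookup(d, "dog", 2)  # → Float32[0.0, 0.0]
```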
EmbeddingsTools.subspace — Method

    subspace(emb::AbstractEmbedding, tokens::Vector{String})::WordEmbedding

The subspace() function takes an existing embedding and a subset of its vocabulary as input and creates a new WordEmbedding object. The order of embedding vectors in the new embedding corresponds to the order of the input tokens. If a token is not found in the source embedding's vocabulary, a zero vector is returned for that token.

Note that this method is relatively slow: it doesn't assume the source embedding to be indexed, so it can't take advantage of a lookup dictionary. For better performance, index an embedding before subsetting it.
EmbeddingsTools.subspace — Method

    subspace(emb::IndexedWordEmbedding, tokens::Vector{String})::WordEmbedding

The subspace() function takes an already indexed embedding and a subset of its vocabulary as input and generates a new WordEmbedding object. It requires two arguments: emb, the indexed embedding, and tokens, a vector of strings representing the vocabulary subset.

The order of embedding vectors in the new embedding corresponds to the order of the input tokens. If an out-of-vocabulary token is encountered, a zero vector is returned.

This method assumes that the source embedding is indexed and can use its lookup dictionary, making it relatively fast. It is recommended to index() an embedding before subsetting it.

Note that the output of subspace() is not an indexed embedding; if you want it indexed, you need to index it manually.
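The dictionary-backed subsetting can be sketched on plain arrays (all names hypothetical; this is an illustration of the technique, not the package's code):

```julia
# Sketch: build a sub-table in the order of the requested tokens using a
# token → column-index dictionary; unknown tokens get zero columns.
function sub_table(vocab_idx::Dict{String,Int}, M::Matrix{Float32},
                   tokens::Vector{String})::Matrix{Float32}
    out = zeros(Float32, size(M, 1), length(tokens))
    for (j, t) in enumerate(tokens)
        i = get(vocab_idx, t, 0)
        i > 0 && (out[:, j] = M[:, i])
    end
    out
end

M = Float32[1 3; 2 4]                        # columns: "cat", "dog"
sub_table(Dict("cat" => 1, "dog" => 2), M, ["dog", "fish"])
# → Float32[3 0; 4 0]
```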
EmbeddingsTools.write_embedding — Method

    write_embedding(
        emb::IndexedWordEmbedding,
        path::AbstractString;
        max_vocab_size::Union{Int,Nothing}=nothing,
        keep_words::Union{Vector{String},Nothing}=nothing
    )::Nothing

The write_embedding() function saves an IndexedWordEmbedding object to a binary file specified by path. The vocabulary can be filtered with keep_words and limited to max_vocab_size.
EmbeddingsTools.write_embedding — Method

    write_embedding(
        emb::WordEmbedding,
        path::AbstractString;
        max_vocab_size::Union{Int,Nothing}=nothing,
        keep_words::Union{Vector{String},Nothing}=nothing
    )::Nothing

The write_embedding() function saves a WordEmbedding object to a binary file specified by path. The vocabulary can be filtered with keep_words and limited to max_vocab_size.