EmbeddingsTools._check_tokens — Method

    _check_tokens(
        words::Vector{String},
        vocab::Vector{String}
    )::Vector{Bool}

Returns a vector of Boolean values indicating whether each of the words is present in the vocabulary vocab.
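The behavior can be sketched in plain Julia (a minimal illustration with a hypothetical name, not the package's implementation):

```julia
# Minimal sketch of a membership check: for each word, test whether it
# occurs in the vocabulary.
check_tokens(words::Vector{String}, vocab::Vector{String})::Vector{Bool} =
    [word in vocab for word in words]

check_tokens(["cat", "dog"], ["cat", "fish"])  # → [true, false]
```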
EmbeddingsTools._ext — Method

    _ext(path::AbstractString)::String

Returns the extension of the file at path, if any, and an empty string otherwise.
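A sketch of this helper using Base.splitext (the name is hypothetical, and whether the package keeps the leading dot is an implementation detail; this version drops it):

```julia
# Sketch: extract a file extension without the leading dot, or return "".
# Base.splitext returns (root, ext), where ext includes the dot.
function ext_of(path::AbstractString)::String
    ext = splitext(path)[2]          # e.g. ".vec" or ""
    isempty(ext) ? "" : ext[2:end]   # drop the leading dot
end

ext_of("embeddings/glove.vec")  # → "vec"
ext_of("README")                # → ""
```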
EmbeddingsTools._get_vocab_index — Method

    _get_vocab_index(
        word::String,
        vocab::Vector{String}
    )::Int

Returns the index of word in the vocabulary vocab. Asserts that word is present in the vocabulary.
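A minimal sketch of such a lookup in plain Julia (hypothetical name; the package's implementation may differ):

```julia
# Sketch: find the position of a word in the vocabulary, asserting presence.
function vocab_index(word::String, vocab::Vector{String})::Int
    i = findfirst(==(word), vocab)
    @assert !isnothing(i) "word not found in vocabulary"
    return i
end

vocab_index("fish", ["cat", "fish", "dog"])  # → 2
```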
EmbeddingsTools._get_vocab_index — Method

    _get_vocab_index(
        word::String,
        vocab::Vector{String}
    )::Int

Returns the index of word in the vocabulary vocab. Asserts that word is present in the vocabulary. A word can be a substring.
EmbeddingsTools._get_vocab_indices — Method

    _get_vocab_indices(
        words::Vector{String},
        vocab::Vector{String}
    )::Vector{Int}

Returns a vector with the index of each word in words in the vocabulary vocab, provided that the word is present in the vocabulary.
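The "provided that the word is present" filtering can be sketched with a comprehension (hypothetical name, not the package's code):

```julia
# Sketch: indices of the words that do occur in the vocabulary, in input
# order; absent words are silently skipped.
vocab_indices(words::Vector{String}, vocab::Vector{String})::Vector{Int} =
    [findfirst(==(w), vocab) for w in words if w in vocab]

vocab_indices(["dog", "fish", "cat"], ["cat", "dog"])  # → [2, 1]
```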
EmbeddingsTools.get_vector — Method

    get_vector(emb::IndexedWordEmbedding, query::String)

get_vector() returns the embedding vector (Float32) for a given token query. It is called with the embedding object rather than with the dictionary. Type-stable: it returns a view of the embedding vector or throws an exception.
EmbeddingsTools.get_vector — Method

    get_vector(emb::WordEmbedding, query::String)

get_vector() returns the embedding vector (Float32) for a given token query. It is called with the embedding object rather than with the dictionary. Type-stable: it returns a view of the embedding vector or throws an exception.
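The "view, or throw" lookup can be sketched with a toy struct (all names here are hypothetical; IndexedWordEmbedding's actual fields may differ):

```julia
# Sketch: a Dict maps tokens to column indices, and a view into the
# embedding matrix is returned (no copy); an unknown query raises KeyError.
struct ToyEmbedding
    vocab::Dict{String,Int}
    embeddings::Matrix{Float32}   # one column per token
end

get_vec(emb::ToyEmbedding, query::String) =
    @view emb.embeddings[:, emb.vocab[query]]

emb = ToyEmbedding(Dict("cat" => 1, "dog" => 2), Float32[1 3; 2 4])
get_vec(emb, "dog")  # → Float32[3.0, 4.0]
```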
EmbeddingsTools.limit — Method

    limit(emb::AbstractEmbedding, n::Integer)::WordEmbedding

The limit() function creates a copy of an existing word embedding that contains only the first n tokens. It is similar to using the max_vocab_size argument of read_embedding(). However, read_vec() + limit() or read_vec() + subspace() is generally faster than the max_vocab_size or keep_words arguments, respectively.
EmbeddingsTools.limit — Method

    limit(emb::IndexedWordEmbedding, n::Integer)::IndexedWordEmbedding

The limit() function creates a copy of an existing indexed word embedding that contains only the first n tokens. It is similar to using the max_vocab_size argument of read_embedding(). However, read_vec() + limit() or read_vec() + subspace() is generally faster than the max_vocab_size or keep_words arguments, respectively.
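Conceptually, limiting amounts to keeping the first n vocabulary entries and the corresponding embedding columns; a toy sketch on plain arrays (hypothetical name, not the package's code):

```julia
# Sketch: keep the first n tokens and their embedding columns.
limit_table(vocab::Vector{String}, M::Matrix{Float32}, n::Integer) =
    (vocab[1:n], M[:, 1:n])

vocab, M = limit_table(["a", "b", "c"], Float32[1 2 3; 4 5 6], 2)
# vocab == ["a", "b"]; M == Float32[1 2; 4 5]
```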
EmbeddingsTools.read_emb — Method

    read_emb(path::AbstractString)::WordEmbedding

Reads word embeddings from local binary embedding table files in .jld and .emb formats. These are Julia binary files that contain a WordEmbedding object under the name "embedding".
EmbeddingsTools.read_embedding — Method

    read_embedding(
        path::AbstractString;
        delim::AbstractChar=' ',
        max_vocab_size::Union{Int,Nothing}=nothing,
        keep_words::Union{Vector{String},Nothing}=nothing
    )::WordEmbedding

The function read_embedding() reads embedding files in the conventional way and creates a WordEmbedding object using CSV.jl. It takes the path to a local embedding file as an argument and accepts three optional keyword arguments: delim, max_vocab_size, and keep_words.

If max_vocab_size is specified, the function limits the vocabulary to that number of tokens. If a vector keep_words is provided, only those words are kept. If a word in keep_words is not found, the function returns a zero vector for that word.

If the file is a WordEmbedding object within a Julia binary file (with extension .jld, or in the specific formats .emb or .wem), the entire embedding is loaded and the keyword arguments are not applicable. You can also use the read_emb() function directly on binary files. For pre-saved indexed embeddings, read_indexed_emb() is currently the only option.

Notes

If you set max_vocab_size ≥ 45k, performance may suffer compared to limit(read_embedding(path), max_vocab_size), because this parameter restricts CSV.jl to a single thread. However, using max_vocab_size allocates less memory than reading the entire file.

In addition, using keep_words with as many as 1k selected words is significantly slower, yet more memory-efficient, than subspace(index(read_embedding(path)), keep_words). For 10k selected words, reading may take more than 5 seconds.
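The conventional text layout these readers consume is one token per line followed by its vector components. A minimal hand-rolled round trip in plain Julia (no CSV.jl, no header line, hypothetical names) illustrates the format:

```julia
# Write a tiny embedding table in a .vec-style text layout
# (token followed by its components, space-delimited) and parse it back.
path = joinpath(mktempdir(), "toy.vec")
write(path, "cat 0.1 0.2\ndog 0.3 0.4\n")

vocab = String[]
cols = Vector{Float32}[]
for line in eachline(path)
    parts = split(line)
    push!(vocab, parts[1])
    push!(cols, parse.(Float32, parts[2:end]))
end
M = reduce(hcat, cols)   # 2×2 Float32 matrix, one column per token
```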
EmbeddingsTools.read_giant_vec — Method

    read_giant_vec(
        path::AbstractString;
        delim::AbstractChar=' ',
        max_vocab_size::Union{Int,Nothing}=nothing,
        keep_words::Union{Vector{String},Nothing}=nothing
    )::WordEmbedding

A conservative version of read_embedding() that handles large embedding tables, such as those used in FastText; it is adapted from a similar function in Embeddings.jl. The function reads a local embedding matrix from path line by line and creates a WordEmbedding object. Additionally, you can provide the delimiter using delim and retain only certain words by specifying a list keep_words. However, this function can be slow, so we recommend setting the max_vocab_size parameter to a value below 150k.
EmbeddingsTools.read_indexed_emb — Method

    read_indexed_emb(path::AbstractString)::WordEmbedding

Reads indexed word embeddings from local binary embedding table files in .jld2 and .iem formats. These are Julia binary files that contain an IndexedWordEmbedding object under the name "embedding".
EmbeddingsTools.read_vec — Method

    read_vec(path::AbstractString; delim::AbstractChar=' ')::WordEmbedding

The function read_vec() reads a local embedding matrix from a text file (.txt, .vec, etc.) at the given path and creates a WordEmbedding object using the CSV.jl package. The delimiter can be set with the delim parameter. This function is a simplified version of read_embedding(): it always reads the entire embedding table, which keeps the logic straightforward.
EmbeddingsTools.reduce_emb — Method

    reduce_emb(emb::AbstractEmbedding, k::Integer; method::String="pca")::WordEmbedding

This function takes an existing word embedding and reduces its embedding vectors to a specified number of dimensions k, returning a new WordEmbedding object. You can choose between two reduction techniques by setting the method parameter to either "pca" for Principal Component Analysis or "svd" for Singular Value Decomposition.
EmbeddingsTools.reduce_emb — Method

    reduce_emb(emb::IndexedWordEmbedding, k::Integer; method::String="pca")::WordEmbedding

This function takes an existing indexed word embedding and reduces its embedding vectors to a specified number of dimensions k, returning a new IndexedWordEmbedding object. You can choose between two reduction techniques by setting the method parameter to either "pca" for Principal Component Analysis or "svd" for Singular Value Decomposition.
EmbeddingsTools.reduce_pca — Function

    reduce_pca(X::Matrix{Float32}, k::Int=2)::Matrix{Float32}

Reduces a matrix using Principal Component Analysis. This function returns the input matrix X transformed using the first k principal components. It assumes that the matrix is transposed, with observations in columns and variables in rows.

Note: this function doesn't use MultivariateStats.jl, in order to avoid unnecessary dependencies. We recommend PCA from MultivariateStats.jl for general-purpose principal component analysis.
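Since the docstring notes that MultivariateStats.jl is avoided, here is a minimal PCA sketch using only the LinearAlgebra and Statistics stdlibs, under the stated layout (variables in rows, observations in columns); the name is hypothetical and this is not the package's exact implementation:

```julia
using LinearAlgebra, Statistics

# Sketch: project X onto its first k principal components via SVD of the
# row-centered matrix; the left singular vectors are the PC directions.
function pca_reduce(X::Matrix{Float32}, k::Int=2)::Matrix{Float32}
    Xc = X .- mean(X; dims=2)   # center each variable (row)
    F = svd(Xc)
    F.U[:, 1:k]' * Xc           # k × n_observations score matrix
end

Y = pca_reduce(Float32[1 2 3; 4 5 6], 1)   # 1×3 matrix of PC-1 scores
```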
EmbeddingsTools.reduce_svd — Function

    reduce_svd(X::Matrix{Float32}, k::Int=2)::Matrix{Float32}

Reduces a matrix using Singular Value Decomposition. This function returns the input matrix X transformed using the first k singular values. It assumes that the matrix is transposed, with observations in columns and variables in rows.
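A rank-k projection of this kind can be sketched with the LinearAlgebra stdlib (hypothetical name; the package's exact formulation may differ):

```julia
using LinearAlgebra

# Sketch: keep the k leading singular directions of X
# (variables in rows, observations in columns).
function svd_reduce(X::Matrix{Float32}, k::Int=2)::Matrix{Float32}
    F = svd(X)
    Diagonal(F.S[1:k]) * F.Vt[1:k, :]   # k × n_observations
end

Y = svd_reduce(Float32[1 2; 3 4], 1)   # 1×2 matrix
```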
EmbeddingsTools.safe_get — Method

    safe_get(emb::IndexedWordEmbedding, query::String)

For internal use only. This function is similar to get_vector() but returns a zero vector if query is not in the vocabulary.
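The fallback behavior maps directly onto Base.get with a default (a Dict-based toy with hypothetical names, not the package's code):

```julia
# Sketch: like a vector lookup, but fall back to a zero vector for
# out-of-vocabulary queries instead of throwing.
safe_lookup(d::Dict{String,Vector{Float32}}, query::String, dim::Int) =
    get(d, query, zeros(Float32, dim))

d = Dict("cat" => Float32[0.1, 0.2])
safe_lookup(d, "dog", 2)  # → Float32[0.0, 0.0]
```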
EmbeddingsTools.subspace — Method

    subspace(emb::AbstractEmbedding, tokens::Vector{String})::WordEmbedding

The subspace() function takes an existing embedding and a subset of its vocabulary as input and creates a new WordEmbedding object. The order of embedding vectors in the new embedding corresponds to the order of the input tokens. If a token is not found in the source embedding's vocabulary, a zero vector is returned for that token.

Note that this method is relatively slow: it doesn't assume the source embedding to be indexed, so it can't take advantage of a lookup dictionary. For better performance, index an embedding before subsetting it.
EmbeddingsTools.subspace — Method

    subspace(emb::IndexedWordEmbedding, tokens::Vector{String})::WordEmbedding

The subspace() function takes an already indexed embedding and a subset of its vocabulary as input and generates a new WordEmbedding object. It requires two arguments: emb, the indexed embedding, and tokens, a vector of strings representing the vocabulary subset.

The order of embedding vectors in the new embedding corresponds to the order of the input tokens. If an out-of-vocabulary token is encountered, a zero vector is returned.

This method assumes that the source embedding is indexed and can use its lookup dictionary, making it relatively fast. It is recommended to index() an embedding before subsetting it.

Note that the output of subspace() is not an indexed embedding; if you want it indexed, you need to index it manually.
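The dictionary-backed subsetting can be sketched on plain arrays (all names hypothetical; this is an illustration of the technique, not the package's code):

```julia
# Sketch: build a sub-table in the order of the requested tokens using a
# token → column-index dictionary; unknown tokens get zero columns.
function sub_table(vocab_idx::Dict{String,Int}, M::Matrix{Float32},
                   tokens::Vector{String})::Matrix{Float32}
    out = zeros(Float32, size(M, 1), length(tokens))
    for (j, t) in enumerate(tokens)
        i = get(vocab_idx, t, 0)
        i > 0 && (out[:, j] = M[:, i])
    end
    out
end

M = Float32[1 3; 2 4]                        # columns: "cat", "dog"
sub_table(Dict("cat" => 1, "dog" => 2), M, ["dog", "fish"])
# → Float32[3 0; 4 0]
```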
EmbeddingsTools.write_embedding — Method

    write_embedding(
        emb::IndexedWordEmbedding,
        path::AbstractString;
        max_vocab_size::Union{Int,Nothing}=nothing,
        keep_words::Union{Vector{String},Nothing}=nothing
    )::Nothing

The write_embedding() function saves an IndexedWordEmbedding object to a binary file specified by path. The vocabulary can be filtered with keep_words and limited to max_vocab_size.
EmbeddingsTools.write_embedding — Method

    write_embedding(
        emb::WordEmbedding,
        path::AbstractString;
        max_vocab_size::Union{Int,Nothing}=nothing,
        keep_words::Union{Vector{String},Nothing}=nothing
    )::Nothing

The write_embedding() function saves a WordEmbedding object to a binary file specified by path. The vocabulary can be filtered with keep_words and limited to max_vocab_size.