API Reference

A Julia library for working with text, hard-forked from TextAnalysis.jl.

Basic Co-occurrence Matrix (COOM) type.

Fields

  • coom::SparseMatrixCSC{T,Int} the actual COOM; elements represent co-occurrences of two terms within a given window

  • terms::Vector{String} a list of terms that represent the lexicon of the document or corpus

  • column_indices::OrderedDict{String, Int} a map between the terms and the columns of the co-occurrence matrix

CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])

Auxiliary constructor(s) of the CooMatrix type. The type T has to be a subtype of AbstractFloat. The constructor(s) require a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict whose keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.

Basic Document-Term-Matrix (DTM) type.

Fields

  • dtm::SparseMatrixCSC{T,Int} the actual DTM; rows represent terms and columns represent documents

  • terms::Vector{String} a list of terms that represent the lexicon of the corpus associated with the DTM

  • row_indices::OrderedDict{String, Int} a map between the terms and the rows of the dtm

DocumentTermMatrix{T}(docs [,terms] [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Auxiliary constructor(s) of the DocumentTermMatrix type. The type T has to be a subtype of Real. The constructor(s) require a corpus or vector of strings docs and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict whose keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.
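
As a rough illustration of the layout described above (rows are terms, columns are documents), the following sketch builds a small sparse count matrix using only the SparseArrays standard library. The helper name dtm_sketch and the whitespace tokenization are assumptions made for this example; they are not part of StringAnalysis, and the real constructor also handles n-grams and custom tokenizers.

```julia
using SparseArrays

# Hypothetical helper, not part of StringAnalysis: builds a terms × documents
# count matrix from whitespace-tokenized strings.
function dtm_sketch(::Type{T}, docs::Vector{String}, terms::Vector{String}) where {T<:Real}
    row_indices = Dict(term => i for (i, term) in enumerate(terms))
    rows, cols, vals = Int[], Int[], T[]
    for (j, doc) in enumerate(docs), token in split(doc)
        i = get(row_indices, token, 0)
        i == 0 && continue            # token is not in the lexicon
        push!(rows, i); push!(cols, j); push!(vals, one(T))
    end
    # `sparse` sums the values of repeated (row, column) pairs
    return sparse(rows, cols, vals, length(terms), length(docs))
end

X = dtm_sketch(Float32, ["apple pear apple", "pear fruit"], ["apple", "fruit", "pear"])
```

Here X[1, 1] counts the occurrences of "apple" in the first document, and each column can be read as the document-term vector of one document.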

LSAModel{S<:AbstractString, T<:AbstractFloat, A<:AbstractMatrix{T}, H<:Integer}

LSA (latent semantic analysis) model. From a document-term matrix (DTM), it constructs a model that can be used to embed documents in a latent semantic space pertaining to the data. The model requires that the document-term matrix be a DocumentTermMatrix{T<:AbstractFloat} because the elements of the matrices resulting from the SVD operation are floating point numbers and these have to match or be convertible to type T.

Fields

  • vocab::Vector{S} a vector with all the words in the corpus
  • vocab_hash::OrderedDict{S,H} a word to index in word embeddings matrix mapping
  • Σinv::A diagonal of the inverse singular value matrix
  • Uᵀ::A transpose of the word embedding matrix
  • stats::Symbol the statistical measure to use for word importances in documents. Available values are: :count (term count), :tf (term frequency), :tfidf (default, term frequency-inverse document frequency) and :bm25 (Okapi BM25)
  • idf::Vector{T} inverse document frequencies for the words in the vocabulary
  • nwords::T average number of words in a document
  • ngram_complexity::Int ngram complexity
  • κ::Int the κ parameter of the BM25 statistic
  • β::Float64 the β parameter of the BM25 statistic
  • tol::T minimum size of the vector components (default T(1e-15))

SVD matrices U, Σinv and V:

If X is an m×n document-term matrix with n documents and m words, so that X[i,j] represents a statistical indicator of the importance of term i in document j, then:

  • U, Σ, V = svd(X)
  • Σinv = diag(inv(Σ))
  • Uᵀ = U'
  • X ≈ U * Σ * V'

The matrix V of document embeddings is not actually stored in the model.
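
The relations above can be checked numerically with the LinearAlgebra standard library. This is a minimal sketch, not the library's code: the toy matrix X, the truncation rank k and the helper embed are assumptions for illustration, and the projection Σinv .* (Uᵀ * x) is the standard LSA fold-in, which is presumed (not confirmed by this reference) to be what the model applies to a document vector.

```julia
using LinearAlgebra

# Toy 5 terms × 4 documents matrix (illustrative values only)
X = Float64[2 0 1 0;
            1 1 0 0;
            0 3 1 1;
            0 0 2 1;
            1 0 0 2]

F = svd(X)                 # X ≈ F.U * Diagonal(F.S) * F.V'
k = 2                      # number of latent dimensions kept

U    = F.U[:, 1:k]
V    = F.V[:, 1:k]
Σinv = 1 ./ F.S[1:k]       # diagonal of the inverse singular value matrix
Uᵀ   = Matrix(U')          # transpose of the word embedding matrix

# Hypothetical helper: folds a document-term vector into the latent space
embed(x) = Σinv .* (Uᵀ * x)

d3 = embed(X[:, 3])        # embedding of the third document
```

Because Σinv * Uᵀ * X equals V' on the retained dimensions, embedding a column of X recovers the corresponding row of V, which is why V itself does not need to be stored.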

Examples

julia> using StringAnalysis

       doc1 = StringDocument("This is a text about an apple. There are many texts about apples.")
       doc2 = StringDocument("Pears and apples are good but not exotic. An apple a day keeps the doctor away.")
       doc3 = StringDocument("Fruits are good for you.")
       doc4 = StringDocument("This phrase has nothing to do with the others...")
       doc5 = StringDocument("Simple text, little info inside")

       crps = Corpus(AbstractDocument[doc1, doc2, doc3, doc4, doc5])
       prepare!(crps, strip_punctuation)
       update_lexicon!(crps)
       dtm = DocumentTermMatrix{Float32}(crps, collect(keys(crps.lexicon)))

       ### Build LSA Model ###
       lsa_model = LSAModel(dtm, k=3, stats=:tf)

       query = StringDocument("Apples and an exotic fruit.")
       idxs, corrs = cosine(lsa_model, crps, query)

       println("Query: \"$(query.text)\"")
       for (idx, corr) in zip(idxs, corrs)
           println("$corr -> \"$(crps[idx].text)\"")
       end
Query: "Apples and an exotic fruit."
0.9746108 -> "Pears and apples are good but not exotic  An apple a day keeps the doctor away "
0.870703 -> "This is a text about an apple  There are many texts about apples "
0.7122063 -> "Fruits are good for you "
0.22725986 -> "This phrase has nothing to do with the others "
0.076901935 -> "Simple text  little info inside "

RPModel{S<:AbstractString, T<:AbstractFloat, A<:AbstractMatrix{T}, H<:Integer}

Random projection model. From a document-term matrix (DTM), it constructs a model that can be used to embed documents in a random sub-space. The model requires that the document-term matrix be a DocumentTermMatrix{T<:AbstractFloat} because the elements of the matrices resulting from the projection operation are floating point numbers and these have to match or be convertible to type T. The approach is based on the effects of the Johnson-Lindenstrauss lemma.

Fields

  • vocab::Vector{S} a vector with all the words in the corpus
  • vocab_hash::OrderedDict{S,H} a word to index in the random projection matrix mapping
  • R::A the random projection matrix
  • stats::Symbol the statistical measure to use for word importances in documents. Available values are: :count (term count), :tf (term frequency), :tfidf (default, term frequency-inverse document frequency) and :bm25 (Okapi BM25)
  • idf::Vector{T} inverse document frequencies for the words in the vocabulary
  • nwords::T average number of words in a document
  • ngram_complexity::Int ngram complexity
  • κ::Int the κ parameter of the BM25 statistic
  • β::Float64 the β parameter of the BM25 statistic
  • project::Bool specifies whether the model actually performs the projection or not; it is false if the number of dimensions provided is zero or negative

TextHashFunction(hash_function::Function, cardinality::Int)

The basic structure for performing text hashing: uses the hash_function to generate feature vectors of length cardinality.

Details

The hash trick is the use of a hash function instead of a lexicon to determine the columns of a DocumentTermMatrix-like encoding of the data. To produce a DTM for a Corpus for which we do not have an existing lexicon, we need some way to map the terms from each document into column indices. We use the now standard "Hash Trick", in which we hash strings and then reduce the resulting integers modulo N, where N is the number of columns we want our DTM to have. This amounts to doing a non-linear dimensionality reduction with low probability that similar terms hash to the same dimension.

To make things easier, we wrap Julia's hash functions in a new type, TextHashFunction, which maintains information about the desired cardinality of the hashes.
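
The idea can be sketched in a few lines of plain Julia. The helper hashed_counts below is hypothetical and only illustrates the hash-then-modulo step; how StringAnalysis derives the exact index from the hash value is an assumption here, and the positions of the non-zero entries depend on the hash function and Julia version.

```julia
# Hypothetical helper, not part of StringAnalysis: counts tokens into
# `cardinality` buckets chosen by hashing each token.
function hashed_counts(tokens, cardinality::Int)
    v = zeros(Float64, cardinality)
    for t in tokens
        # reduce the hash modulo the number of columns, then shift to 1-based
        idx = Int(hash(t) % UInt(cardinality)) + 1
        v[idx] += 1
    end
    return v
end

v = hashed_counts(split("this is a text"), 13)
```

No lexicon is needed: the vector length is fixed by the cardinality, and two different tokens share a bucket only when their hashes collide modulo 13.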

Examples

julia> doc = StringDocument("this is a text")
       thf = TextHashFunction(hash, 13)
       hash_dtv(doc, thf, Float16)
13-element Array{Float16,1}:
 1.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0
 2.0
 0.0
 0.0
 0.0
 0.0
 0.0

coom(c::CooMatrix)

Access the co-occurrence matrix field coom of a CooMatrix c.

coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true])

Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first have to be created in order for the actual matrix to be accessed.

StringAnalysis.cosine - Function.
cosine(model, docs, doc, n=10)

Return the positions of the n closest neighboring documents to doc found in docs. docs can be a corpus or a document-term matrix. The vector representations of docs and doc are obtained with the model, which can be either an LSAModel or an RPModel.
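
A minimal sketch of the ranking step, assuming the documents have already been embedded as the columns of a matrix. The names cosine_similarity and nearest_sketch are hypothetical and stand in for whatever the library does internally.

```julia
using LinearAlgebra

# Cosine of the angle between two vectors
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

# Hypothetical helper: indices and similarities of the n columns of
# `embeddings` that are closest to the query vector `q`.
function nearest_sketch(embeddings::AbstractMatrix, q::AbstractVector, n::Int=10)
    sims = [cosine_similarity(view(embeddings, :, j), q) for j in axes(embeddings, 2)]
    order = sortperm(sims, rev=true)[1:min(n, length(sims))]
    return order, sims[order]
end

E = [1.0 0.0 1.0;
     0.0 1.0 1.0]            # three embedded documents, one per column
idxs, sims = nearest_sketch(E, [1.0, 0.0], 2)
```

The first document points in the same direction as the query and ranks first; the third document shares one of its two components with the query and ranks second.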

StringAnalysis.dtm - Method.
dtm(d::DocumentTermMatrix)

Access the matrix of a DocumentTermMatrix d.

StringAnalysis.dtm - Method.
dtm(docs::Corpus, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Access the matrix of the DTM associated with the corpus docs. The DocumentTermMatrix{T} will first have to be created in order for the actual matrix to be accessed.

StringAnalysis.dtv - Method.
dtv(d, lex::OrderedDict{String,Int}, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Creates a document-term-vector with elements of type T for document d using the lexicon lex. d can be an AbstractString or an AbstractDocument.

StringAnalysis.dtv - Method.
dtv(crps::Corpus, idx::Int, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Creates a document-term-vector with elements of type T for document idx of the corpus crps.

dtv_regex(d, lex::OrderedDict{String,Int}, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Creates a document-term-vector with elements of type T for document d using the lexicon lex. The tokens of document d are assumed to be regular expressions in text format. d can be an AbstractString or an AbstractDocument.

Examples

julia> dtv_regex(NGramDocument("a..b"), OrderedDict("aaa"=>1, "aaab"=>2, "accb"=>3, "bbb"=>4), Float32)
4-element Array{Float32,1}:
 0.0
 1.0
 1.0
 0.0
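
One way to read the example above: each token of the document is compiled to a regular expression and every lexicon term it matches gets a count. The sketch below reproduces that behavior under the assumption that a pattern must match a whole term; regex_dtv_sketch is a hypothetical helper, and the library may anchor or match patterns differently.

```julia
# Hypothetical helper, not part of StringAnalysis: treats each pattern as a
# regular expression and counts the lexicon terms it matches in full.
function regex_dtv_sketch(patterns, lex::Dict{String,Int})
    v = zeros(Float32, length(lex))
    for p in patterns
        rx = Regex("^(?:" * p * ")\$")     # anchor so the whole term must match
        for (term, i) in lex
            occursin(rx, term) && (v[i] += 1)
        end
    end
    return v
end

v = regex_dtv_sketch(["a..b"], Dict("aaa" => 1, "aaab" => 2, "accb" => 3, "bbb" => 4))
```

The pattern a..b matches the four-character terms "aaab" and "accb", so only positions 2 and 3 are non-zero.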

each_dtv(crps::Corpus [; eltype::Type{U}=DEFAULT_DTM_TYPE, ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Iterates through the columns of the DTM of the corpus crps without constructing it. Useful when the DTM would not fit in memory. eltype specifies the element type of the generated vectors.
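
The memory saving comes from producing one column at a time instead of materializing the whole matrix. A minimal sketch of that pattern with a Julia generator follows; term_vector and lazy_dtvs are hypothetical names, and whitespace tokenization is assumed.

```julia
# Hypothetical helper: dense term-count vector of one tokenized document
function term_vector(tokens, lex::Dict{String,Int})
    v = zeros(Float64, length(lex))
    for t in tokens
        i = get(lex, t, 0)
        i > 0 && (v[i] += 1)      # ignore tokens outside the lexicon
    end
    return v
end

# Generator: each vector is only computed when the iteration reaches it
lazy_dtvs(docs, lex) = (term_vector(split(d), lex) for d in docs)

lex = Dict("a" => 1, "b" => 2, "c" => 3)
for v in lazy_dtvs(["a b a", "b c"], lex)
    println(v)                    # one DTM column at a time
end
```

Only a single vector of length length(lex) is alive per iteration, regardless of the number of documents.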

each_hash_dtv(crps::Corpus [; eltype::Type{U}=DEFAULT_DTM_TYPE, ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Iterates through the columns of the hashed DTM of the corpus crps without constructing it. Useful when the DTM would not fit in memory. eltype specifies the element type of the generated vectors.

embed_document(lm, doc)

Return the vector representation of doc, obtained using the LSA model lm. doc can be an AbstractDocument, Corpus or DTV or DTM.

embed_document(rpm, doc)

Return the vector representation of doc, obtained using the random projection model rpm. doc can be an AbstractDocument, Corpus or DTV or DTM.

frequent_terms(crps::Corpus, alpha)

Returns a vector with frequent terms among all documents. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).
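
A sketch of the thresholding rule, assuming that "frequency" means the fraction of documents in which a term appears (the reference does not spell this out). The helpers document_frequencies, frequent_sketch and sparse_sketch are hypothetical.

```julia
# Fraction of documents that contain each term (assumed meaning of "frequency")
function document_frequencies(docs::Vector{Vector{String}})
    df = Dict{String,Float64}()
    for tokens in docs, t in unique(tokens)
        df[t] = get(df, t, 0.0) + 1
    end
    n = length(docs)
    return Dict(t => c / n for (t, c) in df)
end

# Terms above the threshold are frequent; terms at or below it are sparse
frequent_sketch(docs, alpha) = sort([t for (t, f) in document_frequencies(docs) if f > alpha])
sparse_sketch(docs, alpha)   = sort([t for (t, f) in document_frequencies(docs) if f <= alpha])

docs = [["apple", "pear"], ["apple", "fruit"], ["apple"]]
frequent = frequent_sketch(docs, 0.5)
rare     = sparse_sketch(docs, 0.5)
```

With alpha = 0.5, "apple" (present in all three documents) is frequent, while "pear" and "fruit" (one document each) are sparse.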

frequent_terms(doc, alpha)

Returns a vector with frequent terms in the document doc. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).

get_vector(lm, word)

Returns the vector representation of word from the LSA model lm.

get_vector(rpm, word)

Returns the random projection vector corresponding to word in the random projection model rpm.

hash_dtm(crps::Corpus [,h::TextHashFunction], eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Creates a hashed DTM with elements of type T for corpus crps using the hashing function h. If h is omitted, the hash function of the Corpus is used.

hash_dtv(d, h::TextHashFunction, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Creates a hashed document-term-vector with elements of type T for document d using the hashing function h. d can be an AbstractString or an AbstractDocument.

in_vocabulary(lm, word)

Return true if word is part of the vocabulary of the LSA model lm and false otherwise.

in_vocabulary(rpm, word)

Return true if word is part of the vocabulary of the random projection model rpm and false otherwise.

index(lm, word)

Return the index of word from the LSA model lm.

index(rpm, word)

Return the index of word from the random projection model rpm.

StringAnalysis.lda - Method.
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64)

Perform Latent Dirichlet allocation.

Arguments

  • α Dirichlet dist. hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document.
  • β Dirichlet dist. hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic.

Return values

  • ϕ: ntopics × nwords Sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
  • θ: ntopics × ndocs Dense matrix of probabilities s.t. sum(θ, 1) == 1

load_lsa_model(filename, eltype; [sparse=false])

Loads an LSA model from filename into an LSA model object. The embeddings matrix element type is specified by eltype (default DEFAULT_FLOAT_TYPE) while the keyword argument sparse specifies whether the matrix should be sparse or not.

load_rp_model(filename, eltype; [sparse=true])

Loads a random projection model from filename into a random projection model object. The projection matrix element type is specified by eltype (default DEFAULT_FLOAT_TYPE) while the keyword argument sparse specifies whether the matrix should be sparse or not.

StringAnalysis.lsa - Method.
lsa(X [;k=<num documents>, stats=:tfidf, κ=2, β=0.75, tol=1e-15])

Constructs an LSA model. The input X can be a Corpus or a DocumentTermMatrix. Use ?LSAModel for more details. Vector components smaller than tol will be zeroed out.

StringAnalysis.ngrams - Function.
ngrams(d, n=DEFAULT_NGRAM_COMPLEXITY [; tokenizer=DEFAULT_TOKENIZER])

Access the document text of d as n-gram counts. The ngrams contain at most n tokens which are obtained using tokenizer.
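
To make "at most n tokens" concrete, the sketch below counts every gram of length 1 up to n over a token vector. ngram_counts is a hypothetical helper, and joining tokens with a single space is an assumption about how grams are keyed.

```julia
# Hypothetical helper, not part of StringAnalysis: counts all grams that
# contain between 1 and n consecutive tokens.
function ngram_counts(tokens::Vector{<:AbstractString}, n::Int)
    counts = Dict{String,Int}()
    for len in 1:n, i in 1:(length(tokens) - len + 1)
        gram = join(tokens[i:i+len-1], " ")
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

counts = ngram_counts(["a", "b", "a"], 2)
```

For n = 2 the result contains the unigrams "a" (twice) and "b", plus the bigrams "a b" and "b a".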

ngrams!(d, new_ngrams)

Replace the original n-grams of document d with new_ngrams.

StringAnalysis.rp - Method.
rp(X [;k=m, density=1/sqrt(k), stats=:tfidf, ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, κ=2, β=0.75])

Constructs a random projection model. The input X can be a Corpus or a DocumentTermMatrix with m words in the lexicon. The model does not store the corpus or DTM document embeddings, just the projection matrix. Use ?RPModel for more details.

save(lm, filename)

Saves an LSA model lm to disk in file filename.

save_rp_model(rpm, filename)

Saves a random projection model rpm to disk in file filename.

sentence_tokenize([lang,] s)

Splits string s into sentences using the WordTokenizers.split_sentences function to perform the tokenization. If a language lang is provided, it is ignored.

similarity(model, doc1, doc2)

Return the cosine similarity value between two documents doc1 and doc2 whose vector representations have been obtained using the model, which can be either an LSAModel or an RPModel.

sparse_terms(crps::Corpus, alpha)

Returns a vector with rare terms among all documents. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).

sparse_terms(doc, alpha)

Returns a vector with rare terms in the document doc. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).

text!(d, new_text)

Replace the original text of document d with new_text.

text(d)

Access the text of document d if possible.

tokenize(doc [;method, splitter])

Tokenizes the document doc based on the method (default :default, i.e. a WordTokenizers.jl tokenizer) and the splitter, which is a Regex used if method=:stringanalysis.

tokens!(d, new_tokens)

Replace the original tokens of document d with new_tokens.

tokens(d [; method=DEFAULT_TOKENIZER])

Access the tokens of document d as a token array. The method keyword argument specifies the type of tokenization to perform. Available options are :default and :stringanalysis.

vocabulary(lm)

Return the vocabulary as a vector of words of the LSA model lm.

vocabulary(rpm)

Return the vocabulary as a vector of words of the random projection model rpm.

Base.size - Method.
size(lm)

Return a tuple containing the input and output dimensionalities of the LSA model lm.

Base.size - Method.
size(rpm)

Return a tuple containing the input data and projection sub-space dimensionalities of the random projection model rpm.

Base.summary - Method.
summary(doc)

Shows information about the document doc.

Base.summary - Method.
summary(crps)

Shows information about the corpus crps.

abstract_convert(document::AbstractDocument, parameter::Union{Nothing, Type{T}})

Tries converting document::AbstractDocument to one of the concrete types with which StringAnalysis works, i.e. StringDocument{T}, TokenDocument{T} or NGramDocument{T}. A user-defined convert method between typeof(document) and the concrete types should be defined.

columnindices(terms)

Identical to rowindices. Returns a dictionary that maps each term from the vector terms to an integer index.

coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool)

Basic low-level function that calculates the co-occurrence matrix of a document. Returns a sparse co-occurrence matrix sized n × n, where n = length(vocab), with elements of type T. The document doc is represented by a vector of its terms (in order). The arguments window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions.

Examples

julia> using StringAnalysis
       doc = StringDocument("This is a text about an apple. There are many texts about apples.")
       docv = tokenize(text(doc))
       vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
       StringAnalysis.coo_matrix(Float16, docv, vocab, 5, true)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
  [2, 1]  =  2.0
  [1, 2]  =  2.0
  [3, 2]  =  0.3999
  [2, 3]  =  0.3999
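
The values in the example can be traced with a small stand-alone sketch. The function coo_sketch below is hypothetical and uses only SparseArrays; its symmetric double update (each ordered pair inside the window updates both C[a, b] and C[b, a]) is an assumption chosen so that the sketch agrees with the output shown above, and the library's actual implementation may differ in details.

```julia
using SparseArrays

# Hypothetical re-implementation for illustration only
function coo_sketch(tokens, vocab::Dict{String,Int}, window::Int, normalize::Bool)
    n = length(vocab)
    C = spzeros(Float64, n, n)
    for i in eachindex(tokens)
        a = get(vocab, tokens[i], 0)
        a == 0 && continue                       # center token not in vocab
        for j in max(1, i - window):min(length(tokens), i + window)
            j == i && continue
            b = get(vocab, tokens[j], 0)
            b == 0 && continue                   # context token not in vocab
            # weight by inverse distance between positions when normalizing
            w = normalize ? 1 / abs(i - j) : 1.0
            C[a, b] += w
            a != b && (C[b, a] += w)
        end
    end
    return C
end

tokens = split("This is a text about an apple. There are many texts about apples.")
vocab = Dict("This" => 1, "is" => 2, "apple." => 3)
C = coo_sketch(tokens, vocab, 5, true)
```

"This" and "is" are adjacent (distance 1), giving 2.0; "is" and "apple." are five positions apart, giving 2 × 1/5 = 0.4, which Float16 displays as 0.3999. "This" and "apple." are six positions apart, outside the window, so that entry stays empty.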

embed_word(lm, word)

Return the vector representation of word using the LSA model lm.

random_projection_matrix(k::Int, m::Int, eltype::Type{T<:AbstractFloat}, density::Float64)

Builds a k×m sparse random projection matrix with elements of type T and a non-zero element frequency of density. k and m are the output and input dimensionalities.

Matrix Probabilities

If we note s = 1 / density, the components of the random matrix are drawn from:

  • -sqrt(s) / sqrt(k) with probability 1/(2s)
  • 0 with probability 1 - 1/s
  • +sqrt(s) / sqrt(k) with probability 1/(2s)

No projection hack

If k<=0 no projection is performed and the function returns an identity matrix sized m×m with elements of type T. This is useful if one does not want to embed documents but rather calculate term frequencies, BM25 and other statistical indicators (similar to dtv).
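
The sampling scheme and the identity fallback can be sketched with the standard library alone. rp_matrix_sketch is a hypothetical helper written for this illustration; the fixed seed and the element-by-element sampling are choices made for clarity, not a description of the library's implementation.

```julia
using Random, SparseArrays, LinearAlgebra

# Hypothetical helper: k×m sparse random projection matrix, or an m×m
# identity when k <= 0 (no projection).
function rp_matrix_sketch(k::Int, m::Int, density::Float64; rng=MersenneTwister(0))
    k <= 0 && return sparse(1.0I, m, m)
    s = 1 / density
    v = sqrt(s) / sqrt(k)
    R = spzeros(Float64, k, m)
    for j in 1:m, i in 1:k
        u = rand(rng)
        if u < 1 / (2s)          # probability 1/(2s)
            R[i, j] = -v
        elseif u < 1 / s         # probability 1/(2s)
            R[i, j] = v
        end                      # otherwise 0, with probability 1 - 1/s
    end
    return R
end

R = rp_matrix_sketch(4, 50, 0.25)    # s = 4, so non-zeros are ±sqrt(4)/sqrt(4) = ±1
projected = R * ones(50)             # maps a 50-dimensional vector to 4 dimensions
```

On average a fraction density of the entries is non-zero, which keeps the projection cheap to store and apply.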

remove_patterns!(d, rex)

Removes from the document or corpus d the text matching the pattern described by the regular expression rex.

remove_patterns(s, rex)

Removes from the string s the text matching the pattern described by the regular expression rex.

rowindices(terms)

Returns a dictionary that maps each term from the vector terms to an integer index.

tokenize_default([lang,] s)

Splits string s into tokens on whitespace using the WordTokenizers.tokenize function to perform the tokenization. If a language lang is provided, it is ignored.

tokenize_stringanalysis(doc [;splitter])

Function that quickly tokenizes doc based on the splitting pattern specified by splitter::Regex. Supported types for doc are: AbstractString, Vector{AbstractString}, StringDocument and NGramDocument.
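
Splitting on a Regex can be sketched with Base.split alone. tokenize_sketch is a hypothetical helper, and the default pattern r"\s+" is an assumption; the library's default splitter is not stated in this reference.

```julia
# Hypothetical helper: split a string on a regular expression and drop
# empty pieces; the default pattern is assumed, not taken from the library.
tokenize_sketch(s::AbstractString; splitter::Regex=r"\s+") =
    String.(split(s, splitter; keepempty=false))

words  = tokenize_sketch("a  b\tc")
fields = tokenize_sketch("a,b;c", splitter=r"[,;]")
```

Both calls return the three tokens "a", "b" and "c"; only the separator pattern differs.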