API References

Base.merge!Method
merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T}

Merge one DocumentTermMatrix instance into another. Documents are appended to the end. Terms are re-sorted. For efficiency, this may result in modifications to dtm2 as well.
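
Example

A minimal sketch on two small hypothetical corpora; the documents of dtm2 are appended after those of dtm1 and the terms are merged and re-sorted:

julia> crps1 = Corpus([StringDocument("a b"), StringDocument("b c")]);

julia> crps2 = Corpus([StringDocument("c d")]);

julia> update_lexicon!(crps1); update_lexicon!(crps2);

julia> dtm1 = DocumentTermMatrix(crps1);

julia> dtm2 = DocumentTermMatrix(crps2);

julia> merge!(dtm1, dtm2);  # dtm1 now holds all 3 documents over the merged terms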

TextAnalysis.columnindicesMethod
columnindices(terms::Vector{String})

Creates a column index lookup dictionary from a vector of terms.
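
Example

A hedged illustration, assuming each term is mapped to its position in the input vector:

julia> columnindices(["be", "not", "to"])
Dict{String,Int64} with 3 entries:
  "not" => 2
  "be"  => 1
  "to"  => 3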

TextAnalysis.coo_matrixMethod
coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool)

Basic low-level function that calculates the co-occurrence matrix of a document. Returns a sparse co-occurrence matrix sized n × n where n = length(vocab) with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions.

Example

julia> using TextAnalysis, DataStructures
       doc = StringDocument("This is a text about an apple. There are many texts about apples.")
       docv = TextAnalysis.tokenize(language(doc), text(doc))
       vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
       TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true)

3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
  [2, 1]  =  2.0
  [1, 2]  =  2.0
  [3, 2]  =  0.3999
  [2, 3]  =  0.3999
TextAnalysis.coomMethod
coom(c::CooMatrix)

Access the co-occurrence matrix field coom of a CooMatrix c.

TextAnalysis.coomMethod
coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true])

Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first have to be created in order for the actual matrix to be accessed.

TextAnalysis.cos_similarityMethod
cos_similarity(tfm::AbstractMatrix)

cos_similarity calculates the cosine similarity from a term frequency matrix (typically the tf-idf matrix).

Example

crps = Corpus( StringDocument.([
    "to be or not to be",
    "to sing or not to sing",
    "to talk or to silence"]) )
update_lexicon!(crps)
d = dtm(crps)
tfm = tf_idf(d)
cs = cos_similarity(tfm)
Matrix(cs)
    # 3×3 Array{Float64,2}:
    #  1.0        0.0329318  0.0
    #  0.0329318  1.0        0.0
    #  0.0        0.0        1.0
TextAnalysis.counter2Method
counter2 is used to build the conditional distribution, which is used by the score functions to calculate the conditional frequency distribution.
TextAnalysis.dtmMethod
dtm(crps::Corpus)
dtm(d::DocumentTermMatrix)
dtm(d::DocumentTermMatrix, density::Symbol)

Creates a simple sparse matrix from a DocumentTermMatrix object.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> dtm(DocumentTermMatrix(crps))
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
  [1, 1]  =  1
  [2, 1]  =  1
  [1, 2]  =  2
  [2, 3]  =  2
  [1, 4]  =  1
  [2, 4]  =  1
  [1, 5]  =  1
  [2, 5]  =  1
  [1, 6]  =  1
  [2, 6]  =  1

julia> dtm(DocumentTermMatrix(crps), :dense)
2×6 Array{Int64,2}:
 1  2  0  1  1  1
 1  0  2  1  1  1
TextAnalysis.dtvMethod
dtv(d::AbstractDocument, lex::Dict{String, Int})

Produce a single row of a DocumentTermMatrix.

Since individual documents do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument.

Examples

julia> dtv(crps[1], lexicon(crps))
1×6 Array{Int64,2}:
 1  2  0  1  1  1
TextAnalysis.everygramMethod
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}

Return all possible ngrams generated from a sequence of items, as an Array{String,1}.

Example

julia> seq = ["To","be","or","not"]
julia> a = everygram(seq, min_len=1, max_len=-1)
 10-element Array{Any,1}:
  "or"
  "not"
  "To"
  "be"
  "or not"
  "be or"
  "To be"
  "be or not"
  "To be or"
  "To be or not"
TextAnalysis.extend!Method
extend!(model::NaiveBayesClassifier, dictElement)

Add dictElement to the dictionary of the Classifier model.

TextAnalysis.featuresMethod
features(::AbstractDict, dict)

Compute an Array mapping each element of dict to its corresponding value in the input AbstractDict.
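
Example

A hedged sketch of the intended behavior, assuming a dictionary lookup with a zero default (the data below is hypothetical):

julia> fs = Dict("spam" => 2, "ham" => 1);

julia> [get(fs, w, 0) for w in ["spam", "eggs", "ham"]]
3-element Array{Int64,1}:
 2
 0
 1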

TextAnalysis.fit!Method
fit!(model::NaiveBayesClassifier, str, class)
fit!(model::NaiveBayesClassifier, ::Features, class)
fit!(model::NaiveBayesClassifier, ::StringDocument, class)

Fit the weights for the model on the input data.

TextAnalysis.fmeasure_lcsFunction
fmeasure_lcs(RLCS, PLCS, β)

Compute the F-measure based on WLCS; see the sketch after the argument list.

Arguments

  • RLCS - Recall Factor
  • PLCS - Precision Factor
  • β - Weighting parameter (relative importance of recall)
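
A minimal sketch of the standard LCS-based F-measure (as used in ROUGE-L), assuming this is the formula implemented; the helper name fmeasure is illustrative:

fmeasure(R_lcs, P_lcs, β) = ((1 + β^2) * R_lcs * P_lcs) / (R_lcs + β^2 * P_lcs)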
TextAnalysis.frequent_termsFunction
frequent_terms(crps, alpha=0.95)

Find the frequent terms from the Corpus, occurring in more than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> frequent_terms(crps)
3-element Array{String,1}:
 "is"
 "This"
 "Document"

See also: remove_frequent_terms!, sparse_terms

TextAnalysis.hash_dtmMethod
hash_dtm(crps::Corpus)
hash_dtm(crps::Corpus, h::TextHashFunction)

Represents a Corpus as a matrix with N columns, where N is the cardinality of the hash function.
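
Example

A brief hedged sketch; each row holds one document's hashed term counts across N buckets:

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> hash_dtm(crps, TextHashFunction(10))  # 2×10 matrix of hashed term counts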

TextAnalysis.hash_dtvMethod
hash_dtv(d::AbstractDocument)
hash_dtv(d::AbstractDocument, h::TextHashFunction)

Represents a document as a vector with N entries, where N is the cardinality of the hash function.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

julia> hash_dtv(crps[1], h)
1×10 Array{Int64,2}:
 0  2  0  0  1  3  0  0  0  0

julia> hash_dtv(crps[1])
1×100 Array{Int64,2}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
TextAnalysis.index_hashMethod
index_hash(str, TextHashFunc)

Shows the mapping of a string to an integer index.

Parameters: - str = the string to be hashed - TextHashFunc = TextHashFunction type object

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

julia> index_hash("a", h)
8

julia> index_hash("b", h)
7
TextAnalysis.inverse_indexMethod
inverse_index(crps::Corpus)

Shows the inverse index of a corpus.

If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm.
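
Example

A hedged example; the index must first be built with update_inverse_index!:

julia> crps = Corpus([StringDocument("Name Foo"),
                      StringDocument("Name Bar")])

julia> update_inverse_index!(crps)

julia> inverse_index(crps)
Dict{String,Array{Int64,1}} with 3 entries:
  "Bar"  => [2]
  "Foo"  => [1]
  "Name" => [1, 2]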

TextAnalysis.language!Method
language!(doc, lang::Language)

Set the language of doc to lang.

Example

julia> d = StringDocument("String Document 1")

julia> language!(d, Languages.Spanish())

julia> d.metadata.language
Languages.Spanish()

See also: language, languages, languages!

TextAnalysis.languages!Method
languages!(crps, langs::Vector{Language})
languages!(crps, lang::Language)

Update languages of documents in a Corpus.

If the input is a Vector, the language of the ith document is set to the ith element of the vector. In this case, the number of documents must equal the length of the vector.

See also: languages, language!, language

TextAnalysis.ldaMethod
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet allocation; see the example after the return values.

Required Positional Arguments

  • α Dirichlet dist. hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document.
  • β Dirichlet dist. hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic.

Optional Keyword Arguments

  • showprogress::Bool. Show a progress bar during the Gibbs sampling. Default value: true.

Return Values

  • ϕ: ntopics × nwords Sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
  • θ: ntopics × ndocs Dense matrix of probabilities s.t. sum(θ, 1) == 1
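
Example

A minimal usage sketch with hypothetical hyperparameter values:

julia> crps = Corpus([StringDocument("to be or not to be"),
                      StringDocument("to sing or not to sing")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> ϕ, θ = lda(m, 2, 1000, 0.1, 0.1);  # 2 topics, 1000 Gibbs iterations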
TextAnalysis.lexical_frequencyMethod
lexical_frequency(crps::Corpus, term::AbstractString)

Tells us how often a term occurs across all of the documents.

TextAnalysis.lexiconMethod
lexicon(crps::Corpus)

Shows the lexicon of the corpus.

Lexicon of a corpus consists of all the terms that occur in any document in the corpus.

Example

julia> crps = Corpus([StringDocument("Name Foo"),
                          StringDocument("Name Bar")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> lexicon(crps)
Dict{String,Int64} with 0 entries
TextAnalysis.lookupMethod

Look up a sequence of words in the vocabulary.

Returns an Array of String.

TextAnalysis.lsaMethod
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)

Performs Latent Semantic Analysis or LSA on a corpus.
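
Example

A brief hedged sketch; the returned factorization captures the latent semantic structure of the corpus:

julia> crps = Corpus([StringDocument("this is a text"),
                      StringDocument("another text here")])

julia> update_lexicon!(crps)

julia> lsa(crps)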

TextAnalysis.ngramizeMethod
ngramize(lang, tokens, n)

Compute the ngrams of tokens of the order n.

Example

julia> ngramize(Languages.English(), ["To", "be", "or", "not", "to"], 3)
Dict{AbstractString,Int64} with 3 entries:
  "be or not" => 1
  "or not to" => 1
  "To be or"  => 1
TextAnalysis.ngramizenewMethod
ngramizenew(words::Vector{T}, nlist::Integer...) where {T <: AbstractString}

ngramizenew is used to output the ngrams of the given orders as an array.

Example

julia> seq = ["To","be","or","not","To","not","To","not"]
julia> ngramizenew(seq, 2)
 7-element Array{Any,1}:
  "To be" 
  "be or" 
  "or not"
  "not To"
  "To not"
  "not To"
  "To not"
TextAnalysis.ngramsMethod
ngrams(ngd::NGramDocument, n::Integer)
ngrams(d::AbstractDocument, n::Integer)
ngrams(d::NGramDocument)
ngrams(d::AbstractDocument)

Access the document text as n-gram counts.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> ngrams(sd)
 Dict{String,Int64} with 7 entries:
  "or"   => 1
  "not"  => 1
  "to"   => 1
  "To"   => 1
  "be"   => 1
  "be.." => 1
  "."    => 1
TextAnalysis.onegramizeMethod
onegramize(lang, tokens)

Create the unigrams dict for input tokens.

Example

julia> onegramize(Languages.English(), ["To", "be", "or", "not", "to", "be"])
Dict{String,Int64} with 5 entries:
  "or"  => 1
  "not" => 1
  "to"  => 1
  "To"  => 1
  "be"  => 2
TextAnalysis.padding_ngramMethod
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol="</s>") where {T <: AbstractString}

padding_ngram is used to pad a sentence on the left and/or right and to output the ngrams of order n.

It also pads the original input Array of strings.

Example

julia> example = ["1","2","3","4","5"]

julia> padding_ngram(example, 2, pad_left=true, pad_right=true)
 6-element Array{Any,1}:
  "<s> 1" 
  "1 2"   
  "2 3"   
  "3 4"   
  "4 5"   
  "5 </s>"
TextAnalysis.predictMethod
predict(::NaiveBayesClassifier, str)
predict(::NaiveBayesClassifier, ::Features)
predict(::NaiveBayesClassifier, ::StringDocument)

Predict probabilities for each class on the input Features or String.

TextAnalysis.prepare!Method
prepare!(doc, flags)
prepare!(crps, flags)

Preprocess document or corpus based on the input flags.

List of Flags

  • strip_patterns
  • strip_corrupt_utf8
  • strip_case
  • stem_words
  • tag_part_of_speech
  • strip_whitespace
  • strip_punctuation
  • strip_numbers
  • strip_non_letters
  • strip_indefinite_articles
  • strip_definite_articles
  • strip_articles
  • strip_prepositions
  • strip_pronouns
  • strip_stopwords
  • strip_sparse_terms
  • strip_frequent_terms
  • strip_html_tags

Example

julia> doc = StringDocument("This is a document of mine")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: This is a document of mine
julia> prepare!(doc, strip_pronouns | strip_articles)
julia> text(doc)
"This is   document of "
TextAnalysis.probFunction

Returns the probability of a word given a context.

In other words, for a given context it calculates the conditional frequency distribution of the word.

TextAnalysis.prune!Method
prune!(dtm::DocumentTermMatrix{T}, document_positions; compact::Bool=true, retain_terms::Union{Nothing,Vector{T}}=nothing) where {T}

Delete documents specified by document_positions from a document term matrix. Optionally compact the matrix by removing unreferenced terms.
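
Example

A hedged sketch on a hypothetical corpus, deleting the first document:

julia> crps = Corpus([StringDocument("a b"), StringDocument("b c")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> prune!(m, [1])  # with compact=true, terms occurring only in document 1 are removed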

TextAnalysis.remove_case!Method
remove_case!(doc)
remove_case!(crps)

Convert the text of doc or crps to lowercase. Does not support FileDocument or crps containing FileDocument.

Example

julia> str = "The quick brown fox jumps over the lazy dog"
julia> sd = StringDocument(str)
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: The quick brown fox jumps over the lazy dog
julia> remove_case!(sd)
julia> sd.text
"the quick brown fox jumps over the lazy dog"

See also: remove_case

TextAnalysis.remove_frequent_terms!Function
remove_frequent_terms!(crps, alpha=0.95)

Remove terms in crps occurring in more than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_frequent_terms!(crps)
julia> text(crps[1])
"     1"
julia> text(crps[2])
"     2"

See also: remove_sparse_terms!, frequent_terms

TextAnalysis.remove_html_tags!Method
remove_html_tags!(doc::StringDocument)
remove_html_tags!(crps)

Remove html tags from the StringDocument or documents crps. Does not work for documents other than StringDocument.

Example

julia> html_doc = StringDocument(
             "
               <html>
                   <head><script language=\"javascript\">x = 20;</script></head>
                   <body>
                       <h1>Hello</h1><a href=\"world\">world</a>
                   </body>
               </html>
             "
            )
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet:  <html> <head><s
julia> remove_html_tags!(html_doc)
julia> strip(text(html_doc))
"Hello world"

See also: remove_html_tags

TextAnalysis.remove_patterns!Method
remove_patterns!(doc, rex::Regex)
remove_patterns!(crps, rex::Regex)

Remove patterns matched by rex in document or Corpus. Does not modify FileDocument or Corpus containing FileDocument. See also: remove_patterns
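
Example

A short hedged example, removing runs of digits:

julia> sd = StringDocument("Foo 123 Bar")

julia> remove_patterns!(sd, r"[0-9]+")

julia> text(sd)  # the digit runs are removed; surrounding spaces remain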

TextAnalysis.remove_sparse_terms!Function
remove_sparse_terms!(crps, alpha=0.05)

Remove sparse terms in crps, occurring in less than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_sparse_terms!(crps, 0.5)
julia> crps[1].text
"This is Document "
julia> crps[2].text
"This is Document "

See also: remove_frequent_terms!, sparse_terms

TextAnalysis.remove_whitespace!Method
remove_whitespace!(doc)
remove_whitespace!(crps)

Squash multiple whitespace characters to a single space and remove all leading and trailing whitespace in the document or crps. Is a no-op for FileDocument, TokenDocument or NGramDocument. See also: remove_whitespace
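
Example

An illustration following the description above (hypothetical input):

julia> sd = StringDocument("  this  is  a  text  ")

julia> remove_whitespace!(sd)

julia> text(sd)
"this is a text"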

TextAnalysis.remove_words!Method
remove_words!(doc, words::Vector{AbstractString})
remove_words!(crps, words::Vector{AbstractString})

Remove the occurrences of words from doc or crps.

Example

julia> str = "the quick brown fox jumps over the lazy dog"
julia> sd = StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown   jumps   the lazy dog"
TextAnalysis.scoreFunction
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of a word given a context in an InterpolatedLanguageModel.

Applies Kneser-Ney or Witten-Bell smoothing depending on the subtype.

TextAnalysis.scoreFunction
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of a word given a context in an MLE model.

TextAnalysis.scoreMethod
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of a word given a context.

Applies additive smoothing for Lidstone or Laplace (gammamodel) models.

TextAnalysis.sentence_tokenizeMethod
sentence_tokenize(language, str)

Split str into sentences.

Example

julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.")
2-element Array{SubString{String},1}:
 "Here are few words!"
 "I am Foo Bar."

See also: tokenize

TextAnalysis.sparse_termsFunction
sparse_terms(crps, alpha=0.05)

Find the sparse terms from the Corpus, occurring in less than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> sparse_terms(crps, 0.5)
2-element Array{String,1}:
 "1"
 "2"

See also: remove_sparse_terms!, frequent_terms

TextAnalysis.standardize!Method
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument

Standardize the documents in a Corpus to a common type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
		              TokenDocument("Document 2"),
		              NGramDocument("Document 3")])
A Corpus with 3 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 1 TokenDocument's
 * 1 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens


julia> standardize!(crps, NGramDocument)

julia> crps
A Corpus with 3 documents:
 * 0 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 3 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
TextAnalysis.stem!Method
stem!(doc)
stem!(crps)

Stems the document or documents in crps with a suitable stemmer.

Stemming cannot be done for FileDocument or a Corpus made of these types of documents.
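
Example

A short hedged sketch; the exact stemmed forms depend on the underlying stemmer:

julia> sd = StringDocument("believes believing believed")

julia> stem!(sd)

julia> text(sd)  # each token is reduced to a common stem, e.g. "believ believ believ"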

TextAnalysis.stem!Method
stem!(crps::Corpus)

Stem an entire corpus. Assumes all documents in the corpus have the same language (picked from the first document).

TextAnalysis.summarizeMethod
summarize(doc; ns=5)

Summarizes the document and returns the top ns sentences; ns defaults to 5.

Example

julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")

julia> summarize(s, ns=2)
2-element Array{SubString{String},1}:
 "Assume this Short Document as an example."
 "This has too foo sentences."
TextAnalysis.tag_scheme!Method
tag_scheme!(tags, current_scheme::String, new_scheme::String)

Convert tags from current_scheme to new_scheme.

List of tagging schemes currently supported-

  • BIO1 (BIO)
  • BIO2
  • BIOES

Example

julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-PER"]

julia> tag_scheme!(tags, "BIO1", "BIOES")

julia> tags
8-element Array{String,1}:
 "S-LOC"
 "O"
 "S-PER"
 "B-MISC"
 "E-MISC"
 "B-PER"
 "I-PER"
 "E-PER"
TextAnalysis.textMethod
text(fd::FileDocument)
text(sd::StringDocument)
text(ngd::NGramDocument)

Access the text of Document as a string.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> text(sd)
"To be or not to be..."
TextAnalysis.tf!Method
tf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

tf should have the same nonzeros as dtm.

See also: tf, tf_idf, tf_idf!

TextAnalysis.tf!Method
tf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

Works correctly even if dtm and tf are the same matrix.

See also: tf, tf_idf, tf_idf!

TextAnalysis.tfMethod
tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})

Compute the term-frequency of the input.

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.166667
  [2, 1]  =  0.166667
  [1, 2]  =  0.333333
  [2, 3]  =  0.333333
  [1, 4]  =  0.166667
  [2, 4]  =  0.166667
  [1, 5]  =  0.166667
  [2, 5]  =  0.166667
  [1, 6]  =  0.166667
  [2, 6]  =  0.166667

See also: tf!, tf_idf, tf_idf!

TextAnalysis.tf_idf!Method
tf_idf!(dtm::SparseMatrixCSC{Real}, tfidf::SparseMatrixCSC{AbstractFloat})

Overwrite tfidf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

The arguments must have the same number of nonzeros.

See also: tf, tf!, tf_idf

TextAnalysis.tf_idf!Method
tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat})

Overwrite tf_idf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

dtm and tf_idf must be matrices of the same dimensions.

See also: tf, tf!, tf_idf

TextAnalysis.tf_idfMethod
tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})

Compute tf-idf value (Term Frequency - Inverse Document Frequency) for the input.

In many cases, raw word counts are not appropriate for use because:

  • Some documents are longer than other documents
  • Some words are more frequent than other words

A simple workaround is to perform TF-IDF on a DocumentTermMatrix.

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.0
  [2, 1]  =  0.0
  [1, 2]  =  0.231049
  [2, 3]  =  0.231049
  [1, 4]  =  0.0
  [2, 4]  =  0.0
  [1, 5]  =  0.0
  [2, 5]  =  0.0
  [1, 6]  =  0.0
  [2, 6]  =  0.0

See also: tf, tf!, tf_idf!

TextAnalysis.titles!Method
titles!(crps, vec::Vector{String})
titles!(crps, str)

Update titles of the documents in a Corpus.

If the input is a String, set the same title for all documents. If the input is a Vector, set the title of the ith document to the corresponding ith element of the vector vec. In the latter case, the number of documents must equal the length of the vector.

See also: titles, title!, title

TextAnalysis.tokensMethod
tokens(d::TokenDocument)
tokens(d::(Union{FileDocument, StringDocument}))

Access the document text as a token array.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> tokens(sd)
7-element Array{String,1}:
    "To"
    "be"
    "or"
    "not"
    "to"
    "be.."
    "."
TextAnalysis.weighted_lcsFunction
weighted_lcs(X, Y, weight_score::Bool, returns_string::Bool, weigthing_function::Function)

Compute the Weighted Longest Common Subsequence of X and Y.

WordTokenizers.tokenizeMethod
tokenize(language, str)

Split str into words and other tokens such as punctuation.

Example

julia> tokenize(Languages.English(), "Too foo words!")
4-element Array{String,1}:
 "Too"
 "foo"
 "words"
 "!"

See also: sentence_tokenize

TextAnalysis.CooMatrixType

Basic Co-occurrence Matrix (COOM) type.

Fields

  • coom::SparseMatrixCSC{T,Int} the actual COOM; elements represent co-occurrences of two terms within a given window
  • terms::Vector{String} a list of terms that represent the lexicon of the document or corpus
  • column_indices::OrderedDict{String, Int} a map between the terms and the columns of the co-occurrence matrix

TextAnalysis.CooMatrixMethod
CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])

Auxiliary constructor(s) of the CooMatrix type. The type T has to be a subtype of AbstractFloat. The constructor(s) requires a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict where the keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.
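
Example

A hedged usage sketch based on the signature above:

julia> crps = Corpus([StringDocument("this is a text about an apple")])

julia> C = CooMatrix{Float32}(crps, ["this", "text", "apple"]; window=3)

julia> coom(C)  # 3×3 sparse co-occurrence matrix over the given terms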

TextAnalysis.CorpusMethod
Corpus(docs::Vector{T}) where {T <: AbstractDocument}

Collections of documents are represented using the Corpus type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
		              StringDocument("Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
TextAnalysis.DocumentMetadataMethod
DocumentMetadata(language, title::String, author::String, timestamp::String)

Stores basic metadata about a Document.

Arguments

  • language: What language is the document in? Defaults to Languages.English(), a Language instance defined by the Languages package.
  • title::String : What is the title of the document? Defaults to "Untitled Document".
  • author::String : Who wrote the document? Defaults to "Unknown Author".
  • timestamp::String : When was the document written? Defaults to "Unknown Time".

TextAnalysis.DocumentTermMatrixMethod
DocumentTermMatrix(crps::Corpus)
DocumentTermMatrix(crps::Corpus, terms::Vector{String})
DocumentTermMatrix(crps::Corpus, lex::AbstractDict)
DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int},terms::Vector{String})

Represent documents as a matrix of word counts.

Allows us to apply linear algebra operations and statistical techniques. The lexicon needs to be updated before use (see update_lexicon!).

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> m.dtm
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
  [1, 1]  =  1
  [2, 1]  =  1
  [1, 2]  =  2
  [2, 3]  =  2
  [1, 4]  =  1
  [2, 4]  =  1
  [1, 5]  =  1
  [2, 5]  =  1
  [1, 6]  =  1
  [2, 6]  =  1
TextAnalysis.FileDocumentMethod
FileDocument(pathname::AbstractString)

Represents a document using a plain text file on disk.

Example

julia> pathname = "/usr/share/dict/words"
"/usr/share/dict/words"

julia> fd = FileDocument(pathname)
A FileDocument
 * Language: Languages.English()
 * Title: /usr/share/dict/words
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah
TextAnalysis.KneserNeyInterpolatedMethod
KneserNeyInterpolated(word::Vector{T}, discount::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initializes the type for providing a Kneser-Ney interpolated language model.

The idea to abstract this comes from Chen & Goodman 1995.

TextAnalysis.LaplaceType
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initialize the Laplace type for providing Laplace-smoothed scores.

In addition to the initialization arguments from BaseNgramModel, the count increment gamma is fixed at 1.

TextAnalysis.LidstoneMethod
Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initialize the Lidstone type for providing Lidstone-smoothed scores.

In addition to the initialization arguments from BaseNgramModel, it also requires a number gamma by which to increase the counts.

TextAnalysis.MLEMethod
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initializes the type for providing MLE ngram model scores.

Implementation of the base ngram model.

TextAnalysis.NGramDocumentMethod
NGramDocument(txt::AbstractString, n::Integer=1)
NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString

Represents a document as a bag of n-grams: UTF8 n-grams mapped to counts.

Example

julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
                                     "or" => 1, "not" => 1,
                                     "to" => 1, "be..." => 1)
Dict{String,Int64} with 6 entries:
  "or"    => 1
  "be..." => 1
  "not"   => 1
  "to"    => 1
  "To"    => 1
  "be"    => 2

julia> ngd = NGramDocument(my_ngrams)
A NGramDocument{AbstractString}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
TextAnalysis.NaiveBayesClassifierMethod
NaiveBayesClassifier([dict, ]classes)

A Naive Bayes Classifier for classifying documents.

Example

julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict
julia> m = NaiveBayesClassifier([:spam, :non_spam])
NaiveBayesClassifier{Symbol}(String[], Symbol[:spam, :non_spam], Array{Int64}(0,2))

julia> fit!(m, "this is spam", :spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam"], Symbol[:spam, :non_spam], [2 1; 2 1; 2 1])

julia> fit!(m, "this is not spam", :non_spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam", "not"], Symbol[:spam, :non_spam], [2 2; 2 2; 2 2; 1 2])

julia> predict(m, "is this a spam")
Dict{Symbol,Float64} with 2 entries:
  :spam     => 0.59883
  :non_spam => 0.40117
TextAnalysis.StringDocumentMethod
StringDocument(txt::AbstractString)

Represents a document using a UTF8 String stored in RAM.

Example

julia> str = "To be or not to be..."
"To be or not to be..."

julia> sd = StringDocument(str)
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...
TextAnalysis.TextHashFunctionMethod
TextHashFunction(cardinality)
TextHashFunction(hash_function, cardinality)

The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the Hash Trick, in which we replace terms with their hashed value, using a hash function that outputs integers from 1 to N.

Parameters: - cardinality = Max index used for hashing (default 100) - hash_function = function used for hashing (defaults to the built-in hash)

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
TextAnalysis.TokenDocumentMethod
TokenDocument(txt::AbstractString)
TokenDocument(txt::AbstractString, dm::DocumentMetadata)
TokenDocument(tkns::Vector{T}) where T <: AbstractString

Represents a document as a sequence of UTF8 tokens.

Example

julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
6-element Array{String,1}:
    "To"
    "be"
    "or"
    "not"
    "to"
    "be..."

julia> td = TokenDocument(my_tokens)
A TokenDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
TextAnalysis.VocabularyType
Vocabulary(word, unk_cutoff=1, unk_label="<unk>")

Stores the language model vocabulary. Satisfies two common language modeling requirements for a vocabulary:

  • When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
  • Adds a special "unknown" token which unseen words are mapped to.

Example

julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
julia> vocabulary = Vocabulary(words, 2) 
  Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>") 

julia> vocabulary.vocab
  Dict{String,Int64} with 4 entries:
   "<unk>" => 1
   "c"     => 3
   "a"     => 3
   "d"     => 2

Tokens with counts greater than or equal to the cutoff value will
be considered part of the vocabulary.
julia> vocabulary.vocab["c"]
 3

julia> "c" in keys(vocabulary.vocab)
 true

julia> vocabulary.vocab["d"]
 2

julia> "d" in keys(vocabulary.vocab)
 true

Tokens with frequency counts less than the cutoff value will be considered not
part of the vocabulary even though their entries in the count dictionary are
preserved.
julia> "b" in keys(vocabulary.vocab)
 false

julia> "<unk>" in keys(vocabulary.vocab)
 true

We can look up words in a vocabulary using its `lookup` method.
"Unseen" words (with counts less than cutoff) are looked up as the unknown label.
If given one word (a string) as an input, this method will return a string.
julia> lookup(vocabulary, "a")
 "a"

julia> word = ["a", "-", "d", "c", "a"]

julia> lookup(vocabulary, word)
 5-element Array{Any,1}:
  "a"    
  "<unk>"
  "d"    
  "c"    
  "a"

If given a sequence, it will return an Array{Any,1} of the looked up words as shown above.
   
It's possible to update the counts after the vocabulary has been created.
julia> update(vocabulary, ["b","c","c"])
 1

julia> vocabulary.vocab["b"]
 1
TextAnalysis.WittenBellInterpolatedMethod
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initializes the type for providing the interpolated version of Witten-Bell smoothing.

The idea to abstract this comes from Chen & Goodman 1995.