LSA: Latent Semantic Analysis

Often we want to think about documents from the perspective of semantic content. One standard approach to doing this, is to perform Latent Semantic Analysis or LSA on the corpus.

lsa(crps)
lsa(dtm)

lsa uses tf_idf for statistics.

julia> crps = Corpus([StringDocument("this is a string document"), TokenDocument("this is a token document")])

julia> F1.lsa(crps)
LinearAlgebra.SVD{Float64,Float64,Array{Float64,2}}([1.0 0.0; 0.0 1.0], [0.138629, 0.138629], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 1.0])

lsa can also be performed on a DocumentTermMatrix.

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> F2 = lsa(m)
SVD{Float64,Float64,Array{Float64,2}}([1.0 0.0; 0.0 1.0], [0.138629, 0.138629], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 1.0])

LDA: Latent Dirichlet Allocation

Another way to get a handle on the semantic content of a corpus is to use Latent Dirichlet Allocation:

First we need to produce the DocumentTermMatrix

julia> crps = Corpus([StringDocument("This is the Foo Bar Document"), StringDocument("This document has too Foo words")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)

Latent Dirchlet Allocation has two hyper parameters -

  • α : The hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document.
  • β : The hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic.
julia> k = 2            # number of topics
julia> iterations = 1000 # number of gibbs sampling iterations

julia> α = 0.1      # hyper parameter
julia> β  = 0.1       # hyper parameter

julia> ϕ, θ  = lda(m, k, iterations, α, β)
(
  [2 ,  1]  =  0.333333
  [2 ,  2]  =  0.333333
  [1 ,  3]  =  0.222222
  [1 ,  4]  =  0.222222
  [1 ,  5]  =  0.111111
  [1 ,  6]  =  0.111111
  [1 ,  7]  =  0.111111
  [2 ,  8]  =  0.333333
  [1 ,  9]  =  0.111111
  [1 , 10]  =  0.111111, [0.5 1.0; 0.5 0.0])

See ?lda for more help.