Topic Modeling

Another application of Natural Language Processing (NLP) is Topic Modeling. In this section, we are going to extract the topics of Chapter 18 (The Cave), again using TextAnalysis.jl (Julia's leading NLP library). The model for this task is Latent Dirichlet Allocation (LDA); Latent Semantic Analysis (LSA) is also available in TextAnalysis.jl. To start with, load the data as follows:

julia> using JuliaDB

julia> using PrettyTables

julia> using QuranTree

julia> using TextAnalysis

julia> @ptconf vcrop_mode=:middle tf=tf_compact

julia> crps, tnzl = QuranData() |> load;

julia> crpsdata = table(crps)
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes

Table with 128219 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String
Note

You need to install JuliaDB.jl and PrettyTables.jl to successfully run the code.

using Pkg
Pkg.add("JuliaDB")
Pkg.add("PrettyTables")

Data Preprocessing

The first preprocessing step is the removal of all Disconnected Letters, such as الٓمٓ and الٓمٓصٓ, among others. The function below also filters out prepositions, particles, conjunctions, pronouns, and adverbs, keeping only the content-bearing tokens:

julia> function preprocess(s::String)
           feat = parse(Features, s)
           disletters = isfeature(feat, AbstractDisLetters)
           prepositions = isfeature(feat, AbstractPreposition)
           particles = isfeature(feat, AbstractParticle)
           conjunctions = isfeature(feat, AbstractConjunction)
           pronouns = isfeature(feat, AbstractPronoun)
           adverbs = isfeature(feat, AbstractAdverb)
       
           return !disletters && !prepositions && !particles && !conjunctions && !pronouns && !adverbs
       end
preprocess (generic function with 1 method)

julia> crpstbl = filter(t -> preprocess(t.features), crpsdata[18].data)
Table with 827 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String

Next, we create a copy of the above data so that the original state is preserved, and use the copy for further processing.

julia> crpsnew = deepcopy(crpstbl)
Table with 827 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String

julia> feats = select(crpsnew, :features)
827-element WeakRefStrings.StringArray{String,1}:
 "STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
 "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
 "STEM|POS:V|PERF|(IV)|LEM:>anzala|ROOT:nzl|3MS"
 "STEM|POS:N|LEM:Eabod|ROOT:Ebd|M|GEN"
 "STEM|POS:V|IMPF|LEM:jaEala|ROOT:jEl|3MS|MOOD:JUS"
 "STEM|POS:N|LEM:Eiwaj|ROOT:Ewj|M|NOM"
 "PREFIX|l:PRP+"
 "STEM|POS:V|IMPF|(IV)|LEM:>an*ara|ROOT:n*r|3MS|MOOD:SUBJ"
 "STEM|POS:N|LEM:l~adun|ROOT:ldn|GEN"
 "STEM|POS:V|IMPF|(II)|LEM:bu\$~ira|ROOT:b\$r|3MS|MOOD:SUBJ"
 ⋮
 "STEM|POS:ADJ|LEM:wa`Hid|ROOT:wHd|MS|INDEF|NOM"
 "STEM|POS:V|PERF|LEM:kaAna|ROOT:kwn|SP:kaAn|3MS"
 "STEM|POS:V|IMPF|LEM:yarojuwA@|ROOT:rjw|3MS"
 "STEM|POS:N|LEM:rab~|ROOT:rbb|M|GEN"
 "PREFIX|l:IMPV+"
 "STEM|POS:V|IMPF|LEM:Eamila|ROOT:Eml|3MS|MOOD:JUS"
 "STEM|POS:V|IMPF|(IV)|LEM:>a\$oraka|ROOT:\$rk|3MS|MOOD:JUS"
 "STEM|POS:N|LEM:EibaAdat|ROOT:Ebd|F|GEN"
 "STEM|POS:N|LEM:rab~|ROOT:rbb|M|GEN"

julia> feats = parse.(Features, feats)
827-element Array{AbstractFeature,1}:
 Stem(:N, N, AbstractFeature[Lemma("Hamod"), Root("Hmd"), M, NOM])
 Stem(:PN, PN, AbstractFeature[Lemma("{ll~ah"), Root("Alh"), GEN])
 Stem(:V, V, AbstractFeature[Lemma(">anzala"), Root("nzl"), PERF, IV, 3, M, S, IND, ACT])
 Stem(:N, N, AbstractFeature[Lemma("Eabod"), Root("Ebd"), M, GEN])
 Stem(:V, V, AbstractFeature[Lemma("jaEala"), Root("jEl"), JUS, IMPF, 3, M, S, ACT, I])
 Stem(:N, N, AbstractFeature[Lemma("Eiwaj"), Root("Ewj"), M, NOM])
 Prefix(Symbol("l:PRP+"), PRP)
 Stem(:V, V, AbstractFeature[Lemma(">an*ara"), Root("n*r"), SUBJ, IMPF, IV, 3, M, S, ACT])
 Stem(:N, N, AbstractFeature[Lemma("l~adun"), Root("ldn"), GEN])
 Stem(:V, V, AbstractFeature[Lemma("bu\$~ira"), Root("b\$r"), SUBJ, IMPF, II, 3, M, S, ACT])
 ⋮
 Stem(:ADJ, ADJ, AbstractFeature[Lemma("wa`Hid"), Root("wHd"), M, S, INDEF, NOM])
 Stem(:V, V, AbstractFeature[Lemma("kaAna"), Root("kwn"), Special("kaAn"), PERF, 3, M, S, IND, ACT, I])
 Stem(:V, V, AbstractFeature[Lemma("yarojuwA@"), Root("rjw"), IMPF, 3, M, S, IND, ACT, I])
 Stem(:N, N, AbstractFeature[Lemma("rab~"), Root("rbb"), M, GEN])
 Prefix(Symbol("l:IMPV+"), IMPV)
 Stem(:V, V, AbstractFeature[Lemma("Eamila"), Root("Eml"), JUS, IMPF, 3, M, S, ACT, I])
 Stem(:V, V, AbstractFeature[Lemma(">a\$oraka"), Root("\$rk"), JUS, IMPF, IV, 3, M, S, ACT])
 Stem(:N, N, AbstractFeature[Lemma("EibaAdat"), Root("Ebd"), F, GEN])
 Stem(:N, N, AbstractFeature[Lemma("rab~"), Root("rbb"), M, GEN])

Lemmatization

Using the parsed features above, we then convert the forms of the tokens into their lemmas. This is useful for addressing minimal variations due to inflection.

julia> lemmas = lemma.(feats)
827-element Array{Union{Missing, String},1}:
 "Hamod"
 "{ll~ah"
 ">anzala"
 "Eabod"
 "jaEala"
 "Eiwaj"
 missing
 ">an*ara"
 "l~adun"
 "bu\$~ira"
 ⋮
 "wa`Hid"
 "kaAna"
 "yarojuwA@"
 "rab~"
 missing
 "Eamila"
 ">a\$oraka"
 "EibaAdat"
 "rab~"

julia> forms1 = select(crpsnew, :form)
827-element WeakRefStrings.StringArray{String,1}:
 "Hamodu"
 "l~ahi"
 ">anzala"
 "Eabodi"
 "yajoEal"
 "EiwajaA"
 "l~i"
 "yun*ira"
 "l~aduno"
 "yuba\$~ira"
 ⋮
 "wa`HidN"
 "kaAna"
 "yarojuwA@"
 "rab~i"
 "lo"
 "yaEomalo"
 "yu\$oriko"
 "EibaAdapi"
 "rab~i"

julia> forms1[.!ismissing.(lemmas)] = lemmas[.!ismissing.(lemmas)]
795-element Array{Union{Missing, String},1}:
 "Hamod"
 "{ll~ah"
 ">anzala"
 "Eabod"
 "jaEala"
 "Eiwaj"
 ">an*ara"
 "l~adun"
 "bu\$~ira"
 "Eamila"
 ⋮
 "<ila`h"
 "wa`Hid"
 "kaAna"
 "yarojuwA@"
 "rab~"
 "Eamila"
 ">a\$oraka"
 "EibaAdat"
 "rab~"
Tips

We can also use the Root features instead, which is done by simply replacing lemma.(feats) with root.(feats).

We now put the new forms back into the corpus:

julia> crpsnew = transform(crpsnew, :form => forms1)
Table with 827 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String

julia> crpsnew = CorpusData(crpsnew)
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes

Table with 827 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String

Tokenization

We want to summarize the Qur'an at the verse level, so the tokens will be the verses of the corpus. We further clean these verses by dediacritizing and normalizing the characters:

julia> lem_vrs = verses(crpsnew)
109-element Array{String,1}:
 "Hamod {ll~ah >anzala Eabod jaEala Eiwaj"
 "l~i>an*ara l~adun bu\$~ira Eamila"
 ">an*ara qaAla {t~axa*a {ll~ah"
 "Eilom A^baA' kabura xaraja >afowa`h qaAla"
 "ba`xiE >avar 'aAmana Hadiyv"
 "jaEala >aroD libalawo >aHosan"
 "lajaAEil"
 "Hasiba kahof r~aqiym kaAna 'aAyap"
 ">awaY fitoyap kahof qaAla A^taY l~adun yuhay~i}o >amor"
 "Daraba >u*unN kahof"
 ⋮
 "Hasiba kafara {t~axa*a Eabod duwn >aEotadato jahan~am ka`firuwn"
 "qaAla nab~a>a >axosariyn"
 "Dal~a saEoy Hayaw`p d~unoyaA Hasiba >aHosana"
 "kafara 'aAyap rab~ liqaA^' HabiTa Eamal >aqaAma qiya`map"
 "jazaA^' jahan~am kafara {t~axa*a 'aAyap rasuwl"
 "'aAmana Eamila S~a`liHa`t kaAna jan~ap firodawos"
 "bagaY`"
 "qaAla kaAna baHor kalima`t rab~ lanafida baHor nafida kalima`t rab~ jaA^'a mivol"
 "qaAla ba\$ar mivol >awoHaY`^ <ila`h <ila`h wa`Hid kaAna yarojuwA@ rab~ loEamila >a\$oraka EibaAdat rab~"

julia> vrs = QuranTree.normalize.(dediac.(lem_vrs))
109-element Array{String,1}:
 "Hmd Allh Anzl Ebd jEl Ewj"
 "lAn*r ldn b\$r Eml"
 "An*r qAl Atx* Allh"
 "Elm AbA' kbr xrj AfwAh qAl"
 "bAxE Avr 'Amn Hdyv"
 "jEl ArD lblw AHsn"
 "ljAEl"
 "Hsb khf rqym kAn 'Ayh"
 "Awy ftyh khf qAl Aty ldn yhyy Amr"
 "Drb A*n khf"
 ⋮
 "Hsb kfr Atx* Ebd dwn AEtdt jhnm kAfrwn"
 "qAl nbA Axsryn"
 "Dl sEy HywAh dnyA Hsb AHsn"
 "kfr 'Ayh rb lqA' HbT Eml AqAm qyAmh"
 "jzA' jhnm kfr Atx* 'Ayh rswl"
 "'Amn Eml SAlHAt kAn jnh frdws"
 "bgyA"
 "qAl kAn bHr klmAt rb lnfd bHr nfd klmAt rb jA' mvl"
 "qAl b\$r mvl AwHyA AlAh AlAh wAHd kAn yrjwA@ rb lEml A\$rk EbAdt rb"

Creating a TextAnalysis Corpus

To make use of TextAnalysis.jl's APIs, we need to encode the processed Quranic corpus as a TextAnalysis.jl Corpus. In this case, we create a StringDocument for each of the verses.

julia> crps1 = Corpus(StringDocument.(vrs))
A Corpus with 109 documents:
 * 109 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

We then update the lexicon and inverse index for efficient indexing of the corpus.

julia> update_lexicon!(crps1)

julia> update_inverse_index!(crps1)
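
With the lexicon and inverse index in place, the corpus can be queried directly. As a quick sketch (the exact values depend on the preprocessing above), lexical_frequency gives the relative frequency of a token, and indexing the corpus with a string returns the indices of the documents (verses) containing it:

julia> lexical_frequency(crps1, "qAl")    # relative frequency of the token "qAl" (he said)

julia> crps1["khf"]                       # indices of verses containing "khf" (cave)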

Next, we create a Document-Term Matrix, whose rows correspond to the verses and whose columns correspond to the terms describing them.

julia> m1 = DocumentTermMatrix(crps1)
A 109 X 361 DocumentTermMatrix
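
The underlying sparse count matrix and the vocabulary can be inspected as follows (a sketch; the exact entries depend on the session state):

julia> D = dtm(m1)        # 109×361 sparse matrix of term counts per verse

julia> m1.terms[1:5]      # first few terms of the vocabulary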

Latent Dirichlet Allocation

Finally, run LDA as follows:

julia> k = 3          # number of topics
3

julia> iter = 1000    # number of Gibbs sampling iterations
1000

julia> alpha = 0.1    # Dirichlet prior on the per-document topic distribution
0.1

julia> beta = 0.1     # Dirichlet prior on the per-topic word distribution
0.1

julia> ϕ, θ = lda(m1, k, iter, alpha, beta)
(
  [3,   1]  =  0.0175439
  [3,   2]  =  0.00584795
  [3,   3]  =  0.00584795
  [2,   4]  =  0.00333333
  [1,   5]  =  0.00373134
  [2,   6]  =  0.02
  [3,   7]  =  0.00292398
  [2,   8]  =  0.11
  [3,   8]  =  0.0555556
  ⋮
  [2, 352]  =  0.00333333
  [1, 353]  =  0.00746269
  [1, 354]  =  0.00373134
  [1, 355]  =  0.00373134
  [1, 356]  =  0.00373134
  [3, 357]  =  0.00584795
  [1, 358]  =  0.00373134
  [2, 359]  =  0.00333333
  [2, 360]  =  0.00666667
  [2, 361]  =  0.00333333, [0.0 0.0 … 0.6923076923076923 1.0; 1.0 0.0 … 0.3076923076923077 0.0; 0.0 1.0 … 0.0 0.0])
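
The two outputs are the fitted distributions: ϕ is the topic-to-word probability matrix (stored sparsely), with one row per topic summing to 1 over the vocabulary, and θ is the topic-to-document matrix, where column i gives the topic proportions of verse i. Their dimensions match the Document-Term Matrix above:

julia> size(ϕ), size(θ)
((3, 361), (3, 109))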

Next, extract the ten most probable terms for each topic by sorting each row of ϕ in decreasing order and converting the top terms back to Arabic:

julia> ntopics = 10
10

julia> cluster_topics = Matrix(undef, ntopics, k);

julia> for i = 1:k
           topics_idcs = sortperm(ϕ[i, :], rev=true)
           cluster_topics[:, i] = arabic.(m1.terms[topics_idcs][1:ntopics])
       end

julia> @pt cluster_topics
 -------- -------- --------
  Col. 1   Col. 2   Col. 3
 -------- -------- --------
     قال        ء        ذ
      رب      قال        ء
     كان      اتي      وجد
  استطاع      جعل      قال
     امر       رب     الله
    اراد      كان      اتخ
    اتبع        ل      دون
     لبث      امن        ر
     علم      ارض       رب
     بحر       شي        ا
 -------- -------- --------

Tabulating this properly would give us the following:

Pkg.add("DataFrames")
Pkg.add("Latexify")
using DataFrames: DataFrame
using Latexify

mdtable(convert(DataFrame, cluster_topics), latex=false)
| x1 | x2 | x3 |
| --- | --- | --- |
| قال | ء | ذ |
| رب | قال | ء |
| كان | اتي | وجد |
| استطاع | جعل | قال |
| امر | رب | الله |
| اراد | كان | اتخ |
| اتبع | ل | دون |
| لبث | امن | ر |
| علم | ارض | رب |
| بحر | شي | ا |

As you may have noticed, the result is not good, and this is mainly due to the data preprocessing. Readers are encouraged to improve on this for their own use. This section, however, has demonstrated how TextAnalysis.jl's LDA can be used for Topic Modeling on the QuranTree.jl corpus.
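
As mentioned at the start of this section, LSA is also available in TextAnalysis.jl and can serve as a point of comparison; a minimal sketch on the same Document-Term Matrix (lsa applies a TF-IDF weighting followed by a singular value decomposition):

julia> lsa(m1)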

Finally, the following will extract the topic for each verse:

julia> vrs_topics = []
Any[]

julia> for i = 1:dtm(m1).m
           push!(vrs_topics, sortperm(θ[:, i], rev=true)[1])
       end

julia> @pt vrs_topics
 --------
  Col. 1
 --------
       2
       3
       3
       1
       2
       2
       2
       3
    ⋮
       1
       2
       3
       3
       2
       1
       1
       1
 --------
93 rows omitted
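
To see how the verses distribute over the three topics, a simple tally in base Julia suffices; note that the counts will vary from run to run, since Gibbs sampling is stochastic:

julia> [count(==(t), vrs_topics) for t in 1:k]    # number of verses assigned to each topic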