Text Summarization

This section demonstrates how to use TextAnalysis.jl, Julia's leading NLP library, together with QuranTree.jl. In particular, we will summarize Chapter 18 of the Qur'an (The Cave), a chapter whose story most Muslims know well, since it is recommended to be read every Friday. The algorithm used for summarization is TextRank, an application of the PageRank algorithm to text data.
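Before working through the Julia pipeline, the TextRank idea can be sketched in a few lines: build a sentence-similarity graph, score sentences with a PageRank-style iteration, and keep the top scorers. The sketch below is in Python for illustration only; the sentences, the word-overlap similarity, and the constants are all hypothetical (the pipeline later in this section uses TF-IDF similarity instead).

```python
# Hypothetical mini-corpus of tokenized "sentences"; the real pipeline
# below works on Qur'anic verses with TF-IDF similarity instead.
sentences = [
    ["cave", "youths", "faith"],
    ["youths", "sleep", "cave"],
    ["journey", "knowledge", "patience"],
]
n = len(sentences)

# 1. Similarity graph: sentences sharing more words are more strongly linked.
sim = [[len(set(sentences[i]) & set(sentences[j])) if i != j else 0
        for j in range(n)] for i in range(n)]

# 2. PageRank-style scoring over the similarity graph
#    (0.85 damping, out-degree-normalized contributions).
scores = [1.0 / n] * n
for _ in range(10):
    scores = [0.15 / n + 0.85 * sum(sim[j][i] * scores[j] / max(sum(sim[j]), 1)
                                    for j in range(n))
              for i in range(n)]

# 3. The highest-scoring sentences form the summary.
summary = sorted(range(n), key=lambda i: -scores[i])[:2]
```

Here the first two sentences reinforce each other through shared words, so they outrank the isolated third sentence.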

julia> using JuliaDB

julia> using PrettyTables

julia> using QuranTree

julia> using TextAnalysis

julia> @ptconf tf=tf_compact vcrop_mode=:backend

julia> crps, tnzl = QuranData() |> load;

julia> crpsdata = table(crps)
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes

Table with 128219 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String
Note

You need to install JuliaDB.jl and PrettyTables.jl to successfully run the code.

using Pkg
Pkg.add("JuliaDB")
Pkg.add("PrettyTables")

Data Preprocessing

The first preprocessing step is the removal of all Disconnected Letters, such as الٓمٓ and الٓمٓصٓ, among others; we also filter out prepositions, particles, conjunctions, pronouns, and adverbs. This is done as follows:

julia> function preprocess(s::String)
           feat = parse(Features, s)
           disletters = isfeature(feat, AbstractDisLetters)
           prepositions = isfeature(feat, AbstractPreposition)
           particles = isfeature(feat, AbstractParticle)
           conjunctions = isfeature(feat, AbstractConjunction)
           pronouns = isfeature(feat, AbstractPronoun)
           adverbs = isfeature(feat, AbstractAdverb)
       
           return !disletters && !prepositions && !particles && !conjunctions && !pronouns && !adverbs
       end
preprocess (generic function with 1 method)

julia> crpstbl = filter(t -> preprocess(t.features), crpsdata[18].data)
Table with 827 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String

Next, we create a copy of the above table so that the original state is preserved, and use the copy for further processing.

julia> crpsnew = deepcopy(crpstbl)
Table with 827 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String

julia> feats = select(crpsnew, :features)
827-element WeakRefStrings.StringArray{String,1}:
 "STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
 "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
 "STEM|POS:V|PERF|(IV)|LEM:>anzala|ROOT:nzl|3MS"
 "STEM|POS:N|LEM:Eabod|ROOT:Ebd|M|GEN"
 "STEM|POS:V|IMPF|LEM:jaEala|ROOT:jEl|3MS|MOOD:JUS"
 "STEM|POS:N|LEM:Eiwaj|ROOT:Ewj|M|NOM"
 "PREFIX|l:PRP+"
 "STEM|POS:V|IMPF|(IV)|LEM:>an*ara|ROOT:n*r|3MS|MOOD:SUBJ"
 "STEM|POS:N|LEM:l~adun|ROOT:ldn|GEN"
 "STEM|POS:V|IMPF|(II)|LEM:bu\$~ira|ROOT:b\$r|3MS|MOOD:SUBJ"
 ⋮
 "STEM|POS:ADJ|LEM:wa`Hid|ROOT:wHd|MS|INDEF|NOM"
 "STEM|POS:V|PERF|LEM:kaAna|ROOT:kwn|SP:kaAn|3MS"
 "STEM|POS:V|IMPF|LEM:yarojuwA@|ROOT:rjw|3MS"
 "STEM|POS:N|LEM:rab~|ROOT:rbb|M|GEN"
 "PREFIX|l:IMPV+"
 "STEM|POS:V|IMPF|LEM:Eamila|ROOT:Eml|3MS|MOOD:JUS"
 "STEM|POS:V|IMPF|(IV)|LEM:>a\$oraka|ROOT:\$rk|3MS|MOOD:JUS"
 "STEM|POS:N|LEM:EibaAdat|ROOT:Ebd|F|GEN"
 "STEM|POS:N|LEM:rab~|ROOT:rbb|M|GEN"

julia> feats = parse.(Features, feats)
827-element Array{AbstractFeature,1}:
 Stem(:N, N, AbstractFeature[Lemma("Hamod"), Root("Hmd"), M, NOM])
 Stem(:PN, PN, AbstractFeature[Lemma("{ll~ah"), Root("Alh"), GEN])
 Stem(:V, V, AbstractFeature[Lemma(">anzala"), Root("nzl"), PERF, IV, 3, M, S, IND, ACT])
 Stem(:N, N, AbstractFeature[Lemma("Eabod"), Root("Ebd"), M, GEN])
 Stem(:V, V, AbstractFeature[Lemma("jaEala"), Root("jEl"), JUS, IMPF, 3, M, S, ACT, I])
 Stem(:N, N, AbstractFeature[Lemma("Eiwaj"), Root("Ewj"), M, NOM])
 Prefix(Symbol("l:PRP+"), PRP)
 Stem(:V, V, AbstractFeature[Lemma(">an*ara"), Root("n*r"), SUBJ, IMPF, IV, 3, M, S, ACT])
 Stem(:N, N, AbstractFeature[Lemma("l~adun"), Root("ldn"), GEN])
 Stem(:V, V, AbstractFeature[Lemma("bu\$~ira"), Root("b\$r"), SUBJ, IMPF, II, 3, M, S, ACT])
 ⋮
 Stem(:ADJ, ADJ, AbstractFeature[Lemma("wa`Hid"), Root("wHd"), M, S, INDEF, NOM])
 Stem(:V, V, AbstractFeature[Lemma("kaAna"), Root("kwn"), Special("kaAn"), PERF, 3, M, S, IND, ACT, I])
 Stem(:V, V, AbstractFeature[Lemma("yarojuwA@"), Root("rjw"), IMPF, 3, M, S, IND, ACT, I])
 Stem(:N, N, AbstractFeature[Lemma("rab~"), Root("rbb"), M, GEN])
 Prefix(Symbol("l:IMPV+"), IMPV)
 Stem(:V, V, AbstractFeature[Lemma("Eamila"), Root("Eml"), JUS, IMPF, 3, M, S, ACT, I])
 Stem(:V, V, AbstractFeature[Lemma(">a\$oraka"), Root("\$rk"), JUS, IMPF, IV, 3, M, S, ACT])
 Stem(:N, N, AbstractFeature[Lemma("EibaAdat"), Root("Ebd"), F, GEN])
 Stem(:N, N, AbstractFeature[Lemma("rab~"), Root("rbb"), M, GEN])

Lemmatization

Using the above parsed features, we then convert the form of each token into its lemma. This is useful for addressing minimal variations due to inflection.

julia> lemmas = lemma.(feats)
827-element Array{Union{Missing, String},1}:
 "Hamod"
 "{ll~ah"
 ">anzala"
 "Eabod"
 "jaEala"
 "Eiwaj"
 missing
 ">an*ara"
 "l~adun"
 "bu\$~ira"
 ⋮
 "wa`Hid"
 "kaAna"
 "yarojuwA@"
 "rab~"
 missing
 "Eamila"
 ">a\$oraka"
 "EibaAdat"
 "rab~"

julia> forms1 = select(crpsnew, :form)
827-element WeakRefStrings.StringArray{String,1}:
 "Hamodu"
 "l~ahi"
 ">anzala"
 "Eabodi"
 "yajoEal"
 "EiwajaA"
 "l~i"
 "yun*ira"
 "l~aduno"
 "yuba\$~ira"
 ⋮
 "wa`HidN"
 "kaAna"
 "yarojuwA@"
 "rab~i"
 "lo"
 "yaEomalo"
 "yu\$oriko"
 "EibaAdapi"
 "rab~i"

julia> forms1[.!ismissing.(lemmas)] = lemmas[.!ismissing.(lemmas)]
795-element Array{Union{Missing, String},1}:
 "Hamod"
 "{ll~ah"
 ">anzala"
 "Eabod"
 "jaEala"
 "Eiwaj"
 ">an*ara"
 "l~adun"
 "bu\$~ira"
 "Eamila"
 ⋮
 "<ila`h"
 "wa`Hid"
 "kaAna"
 "yarojuwA@"
 "rab~"
 "Eamila"
 ">a\$oraka"
 "EibaAdat"
 "rab~"
Tips

We can also use the Root features instead, which is done by simply replacing lemma.(feats) with root.(feats).

We now put the new forms back into the corpus:

julia> crpsnew = transform(crpsnew, :form => forms1)
Table with 827 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String

julia> crpsnew = CorpusData(crpsnew)
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes

Table with 827 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String

Tokenization

We want to summarize the Qur'an at the verse level, so the tokens will be the verses of the corpus. We further clean these verses by dediacritizing and normalizing the characters:

julia> lem_vrs = verses(crpsnew)
109-element Array{String,1}:
 "Hamod {ll~ah >anzala Eabod jaEala Eiwaj"
 "l~i>an*ara l~adun bu\$~ira Eamila"
 ">an*ara qaAla {t~axa*a {ll~ah"
 "Eilom A^baA' kabura xaraja >afowa`h qaAla"
 "ba`xiE >avar 'aAmana Hadiyv"
 "jaEala >aroD libalawo >aHosan"
 "lajaAEil"
 "Hasiba kahof r~aqiym kaAna 'aAyap"
 ">awaY fitoyap kahof qaAla A^taY l~adun yuhay~i}o >amor"
 "Daraba >u*unN kahof"
 ⋮
 "Hasiba kafara {t~axa*a Eabod duwn >aEotadato jahan~am ka`firuwn"
 "qaAla nab~a>a >axosariyn"
 "Dal~a saEoy Hayaw`p d~unoyaA Hasiba >aHosana"
 "kafara 'aAyap rab~ liqaA^' HabiTa Eamal >aqaAma qiya`map"
 "jazaA^' jahan~am kafara {t~axa*a 'aAyap rasuwl"
 "'aAmana Eamila S~a`liHa`t kaAna jan~ap firodawos"
 "bagaY`"
 "qaAla kaAna baHor kalima`t rab~ lanafida baHor nafida kalima`t rab~ jaA^'a mivol"
 "qaAla ba\$ar mivol >awoHaY`^ <ila`h <ila`h wa`Hid kaAna yarojuwA@ rab~ loEamila >a\$oraka EibaAdat rab~"

julia> vrs = QuranTree.normalize.(dediac.(lem_vrs))
109-element Array{String,1}:
 "Hmd Allh Anzl Ebd jEl Ewj"
 "lAn*r ldn b\$r Eml"
 "An*r qAl Atx* Allh"
 "Elm AbA' kbr xrj AfwAh qAl"
 "bAxE Avr 'Amn Hdyv"
 "jEl ArD lblw AHsn"
 "ljAEl"
 "Hsb khf rqym kAn 'Ayh"
 "Awy ftyh khf qAl Aty ldn yhyy Amr"
 "Drb A*n khf"
 ⋮
 "Hsb kfr Atx* Ebd dwn AEtdt jhnm kAfrwn"
 "qAl nbA Axsryn"
 "Dl sEy HywAh dnyA Hsb AHsn"
 "kfr 'Ayh rb lqA' HbT Eml AqAm qyAmh"
 "jzA' jhnm kfr Atx* 'Ayh rswl"
 "'Amn Eml SAlHAt kAn jnh frdws"
 "bgyA"
 "qAl kAn bHr klmAt rb lnfd bHr nfd klmAt rb jA' mvl"
 "qAl b\$r mvl AwHyA AlAh AlAh wAHd kAn yrjwA@ rb lEml A\$rk EbAdt rb"

Creating a TextAnalysis Corpus

To make use of TextAnalysis.jl's APIs, we need to encode the processed Quranic corpus as a TextAnalysis.jl Corpus. In this case, we create a StringDocument for each verse.

julia> crps1 = Corpus(StringDocument.(vrs))
A Corpus with 109 documents:
 * 109 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

We then update the lexicon and inverse index for efficient indexing of the corpus.

julia> update_lexicon!(crps1)

julia> update_inverse_index!(crps1)

Next, we create a Document Term Matrix, whose rows correspond to the verses and whose columns correspond to the words describing them.

julia> m1 = DocumentTermMatrix(crps1)
A 109 X 361 DocumentTermMatrix

TF-IDF

Finally, we compute the corresponding TF-IDF, which will serve as the feature matrix.

julia> tfidf = tf_idf(m1)
109×361 SparseArrays.SparseMatrixCSC{Float64,Int64} with 836 stored entries:
  [23 ,   1]  =  0.280174
  [28 ,   1]  =  0.246553
  [38 ,   1]  =  0.342434
  [68 ,   1]  =  0.513652
  [76 ,   1]  =  0.205461
  [16 ,   2]  =  0.210432
  [17 ,   2]  =  0.235188
  [85 ,   3]  =  0.285586
  [89 ,   3]  =  0.571172
  ⋮
  [109, 354]  =  0.312757
  [21 , 355]  =  0.223398
  [81 , 356]  =  0.26063
  [57 , 357]  =  0.333183
  [72 , 357]  =  0.666367
  [18 , 358]  =  0.195473
  [12 , 359]  =  0.670193
  [47 , 360]  =  0.444245
  [51 , 360]  =  0.571172
  [45 , 361]  =  0.469135
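To make the TF-IDF weighting concrete, here is a small, self-contained Python sketch of one common formulation: term frequency normalized by document length, times the log of the inverse document frequency. The mini-corpus of transliterated tokens is made up for illustration, and TextAnalysis.jl's `tf_idf` may differ in its exact smoothing constants.

```python
import math

def tf_idf(docs):
    # tf  = term count / document length
    # idf = log(total docs / docs containing the term)
    # One common convention; TextAnalysis.jl's constants may differ slightly.
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(w in d for d in docs) for w in vocab}
    return vocab, [[(d.count(w) / len(d)) * math.log(n / df[w]) for w in vocab]
                   for d in docs]

# Toy "verses" (transliterated tokens, purely illustrative).
docs = [["hmd", "allh", "anzl"], ["anzl", "qal"], ["qal", "allh", "allh"]]
vocab, weights = tf_idf(docs)
```

Words that occur in every document get an IDF of zero, while words concentrated in a few documents get high weights, which is why the resulting matrix is sparse.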

Summarizing the Qur'an

Using the TF-IDF matrix, we multiply it by its transpose to obtain a square matrix whose elements describe the linkage, or similarity, between pairs of verses.

julia> sim_mat = tfidf * tfidf'
109×109 SparseArrays.SparseMatrixCSC{Float64,Int64} with 5199 stored entries:
  [1  ,   1]  =  2.23101
  [3  ,   1]  =  0.107656
  [6  ,   1]  =  0.202849
  [14 ,   1]  =  0.0753595
  [15 ,   1]  =  0.202285
  [16 ,   1]  =  0.0793258
  [20 ,   1]  =  0.0502397
  [23 ,   1]  =  0.0685086
  [25 ,   1]  =  0.0538282
  ⋮
  [93 , 109]  =  0.00869631
  [94 , 109]  =  0.0445483
  [95 , 109]  =  0.0104356
  [97 , 109]  =  0.0964851
  [100, 109]  =  0.0575444
  [102, 109]  =  0.0173926
  [104, 109]  =  0.025966
  [106, 109]  =  0.0328825
  [108, 109]  =  0.127861
  [109, 109]  =  1.18181
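Each entry of this product is a dot product of two document rows, so verses that share high-weight terms get a large similarity score. A minimal Python sketch (with made-up TF-IDF rows, for illustration only):

```python
def similarity_matrix(rows):
    # S = A * A', so S[i][j] is the dot product of document vectors i and j:
    # large when the two verses share high-weight (TF-IDF) terms.
    n = len(rows)
    return [[sum(a * b for a, b in zip(rows[i], rows[j])) for j in range(n)]
            for i in range(n)]

# Two hypothetical 3-term TF-IDF rows.
rows = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
S = similarity_matrix(rows)
```

Note that the diagonal entries are the squared norms of the rows, which is why `sim_mat[1, 1]` above is much larger than the off-diagonal entries.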

At this point, we can now write the code for the PageRank algorithm:

julia> using LinearAlgebra

julia> function pagerank(A; Niter=20, damping=.15)
           Nmax = size(A, 1)
           r = rand(1, Nmax);              # Generate a random starting rank.
           r = r ./ norm(r, 1);            # Normalize
           a = (1 - damping) ./ Nmax;      # Teleportation term
       
           for i=1:Niter
               s = r * A
               rmul!(s, damping)
               r = s .+ (a * sum(r, dims=2));   # Compute PageRank.
           end
       
           r = r ./ norm(r, 1);
       
           return r
       end
pagerank (generic function with 1 method)
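The same iteration can be rendered in pure Python to make the update explicit. This is a sketch of the function above (r ← damping · rA + (1 − damping)/N · Σr, followed by L1 normalization), using a deterministic uniform start instead of the random start in the Julia version:

```python
def pagerank(A, n_iter=20, damping=0.15):
    # Same update as the Julia version above:
    #   r <- damping * (r @ A) + (1 - damping) / N * sum(r)
    # with a final L1 normalization. Uniform start for determinism.
    n = len(A)
    r = [1.0 / n] * n
    a = (1.0 - damping) / n
    for _ in range(n_iter):
        s = [damping * sum(r[i] * A[i][j] for i in range(n)) for j in range(n)]
        total = sum(r)
        r = [sj + a * total for sj in s]
    norm = sum(abs(x) for x in r)
    return [x / norm for x in r]
```

On a symmetric two-node graph the ranks stay equal, while a node receiving more inbound weight ends up with a higher rank, which is exactly how the verse scores below are produced.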

We now apply this function to the above similarity matrix (sim_mat) to obtain the PageRank scores of all verses. These scores serve as the weights: a higher score suggests that a verse is strongly connected to many other verses, and is therefore representative of the corpus.

julia> p = pagerank(sim_mat)
1×109 Array{Float64,2}:
 0.00259272  0.00258589  0.00270019  …  0.00635843  0.00256724  0.00240216

Now we sort these scores in descending order and take the indices of the top 10 verses:

julia> idx = sortperm(vec(p), rev=true)[1:10]
10-element Array{Int64,1}:
  84
  88
  91
  65
  69
   7
  27
 107
  90
  67

Finally, the following 10 verses best summarize the corpus (Chapter 18) according to TextRank:

julia> verse_nos = verses(CorpusData(crpstbl), number=true, start_end=false)
1-element Array{Tuple{Array{Int64,1},Array{Int64,1}},1}:
 ([18], [1, 2, 4, 5, 6, 7, 8, 9, 10, 11  …  101, 102, 103, 104, 105, 106, 107, 108, 109, 110])

julia> verse_out = String[];

julia> chapter = Int64[];

julia> verse = Int64[];

julia> for v in verse_nos
           verse_out = vcat(verse_out, verses(crpsdata[v[1]][v[2]]))
           chapter = vcat(chapter, repeat(v[1], inner=length(v[2])))
           verse = vcat(verse, v[2])
       end

julia> tbl = table((
           chapter=chapter[idx],
           verse=verse[idx],
           verse_text=arabic.(verse_out[idx])
       ));

julia> @pt tbl
 --------- ------- -------------------------------------------------------------
  chapter   verse                                                              ⋯
    Int64   Int64                                                              ⋯
 --------- ------- -------------------------------------------------------------
       18      85                                                              ⋯
       18      89                                                              ⋯
       18      92                                                              ⋯
       18      66                                                              ⋯
       18      70                                                              ⋯
       18       8                                                              ⋯
       18      28   وَٱصْبِرْ نَفْسَكَ مَعَ ٱلَّذِينَ يَدْعُونَ رَبَّهُم بِٱلْغَدَوٰةِ وَٱلْعَشِىِّ يُرِيدُونَ وَجْهَهُۥ ⋯
       18     108                                                              ⋯
       18      91                                                              ⋯
       18      68                                                              ⋯
 --------- ------- -------------------------------------------------------------
                                                                1 column omitted

The following code renders the above output as a properly formatted table:

Pkg.add("DataFrames")
Pkg.add("IterableTables")
Pkg.add("Latexify")
using DataFrames: DataFrame
using IterableTables
using Latexify

mdtable(DataFrame(tbl), latex=false)
| chapter | verse | verse_text |
|---------|-------|------------|
| 18 | 85 | فَأَتْبَعَ سَبَبًا |
| 18 | 89 | ثُمَّ أَتْبَعَ سَبَبًا |
| 18 | 92 | ثُمَّ أَتْبَعَ سَبَبًا |
| 18 | 66 | قَالَ لَهُۥ مُوسَىٰ هَلْ أَتَّبِعُكَ عَلَىٰٓ أَن تُعَلِّمَنِ مِمَّا عُلِّمْتَ رُشْدًا |
| 18 | 70 | قَالَ فَإِنِ ٱتَّبَعْتَنِى فَلَا تَسْـَٔلْنِى عَن شَىْءٍ حَتَّىٰٓ أُحْدِثَ لَكَ مِنْهُ ذِكْرًا |
| 18 | 8 | وَإِنَّا لَجَٰعِلُونَ مَا عَلَيْهَا صَعِيدًا جُرُزًا |
| 18 | 28 | وَٱصْبِرْ نَفْسَكَ مَعَ ٱلَّذِينَ يَدْعُونَ رَبَّهُم بِٱلْغَدَوٰةِ وَٱلْعَشِىِّ يُرِيدُونَ وَجْهَهُۥ وَلَا تَعْدُ عَيْنَاكَ عَنْهُمْ تُرِيدُ زِينَةَ ٱلْحَيَوٰةِ ٱلدُّنْيَا وَلَا تُطِعْ مَنْ أَغْفَلْنَا قَلْبَهُۥ عَن ذِكْرِنَا وَٱتَّبَعَ هَوَىٰهُ وَكَانَ أَمْرُهُۥ فُرُطًا |
| 18 | 108 | خَٰلِدِينَ فِيهَا لَا يَبْغُونَ عَنْهَا حِوَلًا |
| 18 | 91 | كَذَٰلِكَ وَقَدْ أَحَطْنَا بِمَا لَدَيْهِ خُبْرًا |
| 18 | 68 | وَكَيْفَ تَصْبِرُ عَلَىٰ مَا لَمْ تُحِطْ بِهِۦ خُبْرًا |

The following are the translations of the above verses:

| Chapter | Verse | English Translation |
|---------|-------|---------------------|
| 18 | 85 | So he travelled a course, |
| 18 | 89 | Then he travelled a ˹different˺ course |
| 18 | 92 | Then he travelled a ˹third˺ course |
| 18 | 66 | Moses said to him, “May I follow you, provided that you teach me some of the right guidance you have been taught?” |
| 18 | 70 | He responded, “Then if you follow me, do not question me about anything until I ˹myself˺ clarify it for you.” |
| 18 | 8 | And We will certainly reduce whatever is on it to barren ground. |
| 18 | 28 | And patiently stick with those who call upon their Lord morning and evening, seeking His pleasure. Do not let your eyes look beyond them, desiring the luxuries of this worldly life. And do not obey those whose hearts We have made heedless of Our remembrance, who follow ˹only˺ their desires and whose state is ˹total˺ loss. |
| 18 | 108 | where they will be forever, never desiring anywhere else. |
| 18 | 91 | So it was. And We truly had full knowledge of him. |
| 18 | 68 | And how can you be patient with what is beyond your ˹realm of˺ knowledge?” |