Getting Started

The library includes two datasets: the Quranic Arabic Corpus and the Tanzil data. To load them, run the following:

julia> using QuranTree

julia> data = QuranData()
QuranData(QuranTree.FilePaths("/juliateam/.julia/packages/QuranTree/JFGph/src/../data/quranic-corpus-morphology-0.4.txt", "/juliateam/.julia/packages/QuranTree/JFGph/src/../data/quran-uthmani-final.txt"))

julia> crps, tnzl = load(data);

QuranData() constructs a struct containing the default file paths of the data. The load function returns a tuple containing the Quranic Arabic Corpus and the Tanzil data. The loaded data is stored in an immutable (read-only) array, so users cannot modify it. This is reflected in the type of the object, as shown below:

julia> crps
(CorpusRaw) 128276-element ReadOnlyArrays.ReadOnlyArray{String,1,Array{String,1}}:
 "# PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK"
 "#===================================================================="
 "#"
 "#  Quranic Arabic Corpus (morphology, version 0.4)"
 "#  Copyright (C) 2011 Kais Dukes"
 "#  License: GNU General Public License"
 "#"
 "#  The Quranic Arabic Corpus includes syntactic and morphological"
 "#  annotation of the Quran, and builds on the verified Arabic text"
 "#  distributed by the Tanzil project."
 ⋮
 "(114:5:4:1)\tSuduwri\tN\tSTEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
 "(114:5:5:1)\t{l\tDET\tPREFIX|Al+"
 "(114:5:5:2)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
 "(114:6:1:1)\tmina\tP\tSTEM|POS:P|LEM:min"
 "(114:6:2:1)\t{lo\tDET\tPREFIX|Al+"
 "(114:6:2:2)\tjin~api\tN\tSTEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
 "(114:6:3:1)\twa\tCONJ\tPREFIX|w:CONJ+"
 "(114:6:3:2)\t{l\tDET\tPREFIX|Al+"
 "(114:6:3:3)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"

julia> tnzl
(TanzilRaw) 6266-element ReadOnlyArrays.ReadOnlyArray{String,1,Array{String,1}}:
 "1|1|بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
 "1|2|ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ"
 "1|3|ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
 "1|4|مَٰلِكِ يَوْمِ ٱلدِّينِ"
 "1|5|إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
 "1|6|ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ"
 "1|7|صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُوبِ عَلَيْهِمْ وَلَا ٱلضَّآلِّينَ"
 "2|1|بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ"
 "2|2|ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ"
 "2|3|ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
 ⋮
 "#    track of changes."
 "#"
 "#  - This copyright notice shall be included in all verbatim copies "
 "#    of the text, and shall be reproduced appropriately in all files "
 "#    derived from or containing substantial portion of this text."
 "#"
 "#  Please check updates at: http://tanzil.net/updates/"
 "# "
 "#===================================================================="
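
The read-only behaviour can be pictured with a minimal sketch in plain Julia. ReadOnlyVec below is a hypothetical type written for illustration only, not the ReadOnlyArrays.jl implementation. It forwards reads to a wrapped vector and deliberately defines no setindex! method, so any attempt to assign an element throws:

```julia
# Hypothetical read-only wrapper (illustration only, not ReadOnlyArrays.jl).
struct ReadOnlyVec{T} <: AbstractVector{T}
    data::Vector{T}
end

# Reads are forwarded to the wrapped vector.
Base.size(v::ReadOnlyVec) = size(v.data)
Base.getindex(v::ReadOnlyVec, i::Int) = v.data[i]
# No setindex! method is defined, so element assignment fails.

lines = ReadOnlyVec(["# header", "1|1|text"])

# Returns true if assigning to the first element throws.
function write_is_rejected(v)
    try
        v[1] = "changed"
        return false
    catch
        return true
    end
end

println(lines[2])                  # reading works as usual
println(write_is_rejected(lines))  # writing is rejected
```

The same idea presumably applies to crps and tnzl: they can be indexed and iterated like ordinary arrays, while assignments to their elements are expected to fail.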

To parse the raw data, use the table function:

julia> crpsdata = table(crps);

julia> tnzldata = table(tnzl);

julia> crpsdata
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes

Table with 128219 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String

julia> tnzldata
Tanzil Quran Text (Uthmani)
(C) 2008-2010 Tanzil.net

Table with 6236 rows, 3 columns:
chapter  verse  form
─────────────────────────────────────────────────────────────────────
1        1      "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
1        2      "ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ"
1        3      "ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
1        4      "مَٰلِكِ يَوْمِ ٱلدِّينِ"
1        5      "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
1        6      "ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ"
1        7      "صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُوبِ عَلَيْهِمْ وَلَا ٱلضَّآلِّينَ"
2        1      "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ"
2        2      "ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ"
⋮
113      4      "وَمِن شَرِّ ٱلنَّفَّٰثَٰتِ فِى ٱلْعُقَدِ"
113      5      "وَمِن شَرِّ حَاسِدٍ إِذَا حَسَدَ"
114      1      "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ قُلْ أَعُوذُ بِرَبِّ ٱلنَّاسِ"
114      2      "مَلِكِ ٱلنَّاسِ"
114      3      "إِلَٰهِ ٱلنَّاسِ"
114      4      "مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ"
114      5      "ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ"
114      6      "مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ"
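
To give a rough idea of what this parsing step involves, the sketch below splits raw lines in the two formats shown earlier into named fields using only Julia's Base. It is an illustration under stated assumptions, not QuranTree.jl's actual implementation: parse_corpus_line, parse_tanzil_line, and is_data_line are hypothetical helper names, and the line layouts are inferred from the printed raw data.

```julia
# Parse one raw corpus line of the form "(c:v:w:p)\tform\ttag\tfeatures".
function parse_corpus_line(line::AbstractString)
    location, form, tag, features = split(line, '\t')
    # Drop the surrounding parentheses, then split the location on ':'.
    chapter, verse, word, part =
        parse.(Int, split(strip(location, ['(', ')']), ':'))
    return (chapter=chapter, verse=verse, word=word, part=part,
            form=String(form), tag=String(tag), features=String(features))
end

# Parse one raw Tanzil line of the form "chapter|verse|text".
function parse_tanzil_line(line::AbstractString)
    chapter, verse, form = split(line, '|'; limit=3)
    return (chapter=parse(Int, chapter), verse=parse(Int, verse),
            form=String(form))
end

# Comment lines (starting with "#") and blank lines carry no data.
is_data_line(line) = !isempty(strip(line)) && !startswith(line, "#")

raw = [
    "#  Quranic Arabic Corpus (morphology, version 0.4)",
    "(114:6:3:3)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN",
]
rows = [parse_corpus_line(l) for l in raw if is_data_line(l)]
println(rows[1])

verse = parse_tanzil_line("1|3|ٱلرَّحْمَٰنِ ٱلرَّحِيمِ")
println(verse.chapter, ":", verse.verse)
```

The copyright lines in the raw arrays are skipped by is_data_line here, which is presumably why the parsed tables have fewer rows than the raw arrays have elements.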

The resulting tables are of type CorpusData and TanzilData, respectively, and are built on top of JuliaDB.jl's IndexedTable, which can be accessed with the @data macro or the .data property (for example, @data crpsdata or crpsdata.data). Note, however, that JuliaDB.jl displays only the column metadata when the table is wider than the output pane. For example, crpsdata above has more columns than tnzldata (and is therefore wider), so only its column names and types are shown. To display the contents of a wide table, we recommend PrettyTables.jl:

julia> using PrettyTables

julia> @ptconf vcrop_mode=:middle tf=tf_compact

julia> @pt crpsdata
 --------- ------- ------- ------- ------------ -------- -----------------------
  chapter   verse    word    part         form      tag                        ⋯
    Int64   Int64   Int64   Int64       String   String                        ⋯
 --------- ------- ------- ------- ------------ -------- -----------------------
        1       1       1       1           bi        P                        ⋯
        1       1       1       2         somi        N           STEM|POS:N|L ⋯
        1       1       2       1      {ll~ahi       PN          STEM|POS:PN|L ⋯
        1       1       3       1           {l      DET                        ⋯
        1       1       3       2   r~aHoma`ni      ADJ   STEM|POS:ADJ|LEM:r~a ⋯
        1       1       4       1           {l      DET                        ⋯
        1       1       4       2     r~aHiymi      ADJ     STEM|POS:ADJ|LEM:r ⋯
        1       2       1       1          {lo      DET                        ⋯
     ⋮        ⋮       ⋮       ⋮         ⋮          ⋮                           ⋱
      114       5       5       2       n~aAsi        N         STEM|POS:N|LEM ⋯
      114       6       1       1         mina        P                        ⋯
      114       6       2       1          {lo      DET                        ⋯
      114       6       2       2      jin~api        N         STEM|POS:N|LEM ⋯
      114       6       3       1           wa     CONJ                        ⋯
      114       6       3       2           {l      DET                        ⋯
      114       6       3       3       n~aAsi        N         STEM|POS:N|LEM ⋯
 --------- ------- ------- ------- ------------ -------- -----------------------
                                                1 column and 128204 rows omitted

Note

You need to install PrettyTables.jl to successfully run the code.

using Pkg
Pkg.add("PrettyTables")

Manipulating the Table

As mentioned above, the table is based on JuliaDB.jl's IndexedTable, so any data manipulation is done through JuliaDB.jl's API. To access the underlying table, use the .data property or the @data macro:

julia> crpstbl = @data crpsdata; # or crpsdata.data

julia> tnzltbl = @data tnzldata; # or tnzldata.data

julia> crpstbl
Table with 128219 rows, 7 columns:
Columns:
#  colname   type
───────────────────
1  chapter   Int64
2  verse     Int64
3  word      Int64
4  part      Int64
5  form      String
6  tag       String
7  features  String

julia> tnzltbl
Table with 6236 rows, 3 columns:
chapter  verse  form
─────────────────────────────────────────────────────────────────────
1        1      "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
1        2      "ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ"
1        3      "ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
1        4      "مَٰلِكِ يَوْمِ ٱلدِّينِ"
1        5      "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
1        6      "ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ"
1        7      "صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُوبِ عَلَيْهِمْ وَلَا ٱلضَّآلِّينَ"
2        1      "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ"
2        2      "ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ"
⋮
113      4      "وَمِن شَرِّ ٱلنَّفَّٰثَٰتِ فِى ٱلْعُقَدِ"
113      5      "وَمِن شَرِّ حَاسِدٍ إِذَا حَسَدَ"
114      1      "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ قُلْ أَعُوذُ بِرَبِّ ٱلنَّاسِ"
114      2      "مَلِكِ ٱلنَّاسِ"
114      3      "إِلَٰهِ ٱلنَّاسِ"
114      4      "مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ"
114      5      "ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ"
114      6      "مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ"

Note that crpsdata and crpstbl have different types (as do tnzldata and tnzltbl), as shown below:

julia> typeof(crpsdata)
CorpusData

julia> typeof(crpstbl)
IndexedTables.IndexedTable{StructArrays.StructArray{NamedTuple{(:chapter, :verse, :word, :part, :form, :tag, :features),Tuple{Int64,Int64,Int64,Int64,String,String,String}},1,NamedTuple{(:chapter, :verse, :word, :part, :form, :tag, :features),Tuple{Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1}}},Int64}}
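
The relationship between a wrapper object and the table it holds can be pictured with a small sketch in plain Julia. Everything here is hypothetical and for illustration only: WrappedData stands in for a type like CorpusData, @unwrap stands in for the @data macro, and a vector of named tuples stands in for the IndexedTable. The sketch makes no claim about how QuranTree.jl defines these internally.

```julia
# Hypothetical wrapper: holds a table plus some descriptive metadata.
struct WrappedData{T}
    data::T
    title::String
end

# Hypothetical macro that expands `@unwrap x` into `x.data`.
macro unwrap(x)
    return Expr(:., esc(x), QuoteNode(:data))
end

# A vector of named tuples stands in for the underlying table.
tbl = [(chapter=1, verse=1), (chapter=1, verse=2)]
wrapped = WrappedData(tbl, "Toy table")

println(typeof(wrapped))                     # the wrapper type
println(typeof(@unwrap wrapped))             # the type of the wrapped table
println((@unwrap wrapped) === wrapped.data)  # both routes give the same object
```

The wrapper and the table are distinct types, so functions written for the table have to be given the unwrapped object.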

From here, any data manipulation is done using JuliaDB.jl's API. For example, the following selects the features column of crpstbl:

julia> using JuliaDB

julia> select(crpstbl, :features)
128219-element WeakRefStrings.StringArray{String,1}:
 "PREFIX|bi+"
 "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN"
 "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
 "PREFIX|Al+"
 "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN"
 "PREFIX|Al+"
 "STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
 "PREFIX|l:P+"
 ⋮
 "STEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
 "STEM|POS:P|LEM:min"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
 "PREFIX|w:CONJ+"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"

julia> # or, equivalently,
       select(crpsdata.data, :features)
128219-element WeakRefStrings.StringArray{String,1}:
 "PREFIX|bi+"
 "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN"
 "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
 "PREFIX|Al+"
 "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN"
 "PREFIX|Al+"
 "STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
 "PREFIX|l:P+"
 ⋮
 "STEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
 "STEM|POS:P|LEM:min"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
 "PREFIX|w:CONJ+"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"

Note

You need to install JuliaDB.jl to successfully run the code.

using Pkg
Pkg.add("JuliaDB")
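
Each entry of the features column is a single string whose parts are separated by "|". As a minimal sketch of how such a string could be taken apart with Julia's Base alone, the hypothetical helper parse_features below separates the leading kind (such as STEM or PREFIX), the parts written as key:value, and the remaining bare flags. Treating every part that contains ":" as a key:value pair is an assumption made for illustration, not a statement about the corpus annotation scheme or about QuranTree.jl's own parsing.

```julia
# Hypothetical helper: split a features string into its components.
function parse_features(features::AbstractString)
    parts = split(features, '|')
    kind = String(first(parts))       # leading part, e.g. "STEM" or "PREFIX"
    keyed = Dict{String,String}()     # parts written as key:value
    flags = String[]                  # parts without a ':'
    for p in parts[2:end]
        if occursin(':', p)
            k, v = split(p, ':'; limit=2)
            keyed[k] = v
        else
            push!(flags, p)
        end
    end
    return (kind=kind, keyed=keyed, flags=flags)
end

stem = parse_features("STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN")
println(stem.kind)
println(stem.keyed["ROOT"])
println(stem.flags)

prefix = parse_features("PREFIX|Al+")
println(prefix.kind, " ", prefix.flags)
```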

To keep only the tokens whose features start with PREFIX, use occursin from Julia's Base:

julia> filter(t -> occursin(r"^PREFIX", t.features), crpstbl)
Table with 28670 rows, 7 columns:
chapter  verse  word  part  form   tag     features
───────────────────────────────────────────────────────────
1        1      1     1     "bi"   "P"     "PREFIX|bi+"
1        1      3     1     "{l"   "DET"   "PREFIX|Al+"
1        1      4     1     "{l"   "DET"   "PREFIX|Al+"
1        2      1     1     "{lo"  "DET"   "PREFIX|Al+"
1        2      2     1     "li"   "P"     "PREFIX|l:P+"
1        2      4     1     "{lo"  "DET"   "PREFIX|Al+"
1        3      1     1     "{l"   "DET"   "PREFIX|Al+"
1        3      2     1     "{l"   "DET"   "PREFIX|Al+"
1        4      3     1     "{l"   "DET"   "PREFIX|Al+"
⋮
114      2      2     1     "{l"   "DET"   "PREFIX|Al+"
114      3      2     1     "{l"   "DET"   "PREFIX|Al+"
114      4      3     1     "{lo"  "DET"   "PREFIX|Al+"
114      4      4     1     "{lo"  "DET"   "PREFIX|Al+"
114      5      5     1     "{l"   "DET"   "PREFIX|Al+"
114      6      2     1     "{lo"  "DET"   "PREFIX|Al+"
114      6      3     1     "wa"   "CONJ"  "PREFIX|w:CONJ+"
114      6      3     2     "{l"   "DET"   "PREFIX|Al+"

julia> # or, equivalently,
       filter(t -> occursin(r"^PREFIX", t.features), crpsdata.data)
Table with 28670 rows, 7 columns:
chapter  verse  word  part  form   tag     features
───────────────────────────────────────────────────────────
1        1      1     1     "bi"   "P"     "PREFIX|bi+"
1        1      3     1     "{l"   "DET"   "PREFIX|Al+"
1        1      4     1     "{l"   "DET"   "PREFIX|Al+"
1        2      1     1     "{lo"  "DET"   "PREFIX|Al+"
1        2      2     1     "li"   "P"     "PREFIX|l:P+"
1        2      4     1     "{lo"  "DET"   "PREFIX|Al+"
1        3      1     1     "{l"   "DET"   "PREFIX|Al+"
1        3      2     1     "{l"   "DET"   "PREFIX|Al+"
1        4      3     1     "{l"   "DET"   "PREFIX|Al+"
⋮
114      2      2     1     "{l"   "DET"   "PREFIX|Al+"
114      3      2     1     "{l"   "DET"   "PREFIX|Al+"
114      4      3     1     "{lo"  "DET"   "PREFIX|Al+"
114      4      4     1     "{lo"  "DET"   "PREFIX|Al+"
114      5      5     1     "{l"   "DET"   "PREFIX|Al+"
114      6      2     1     "{lo"  "DET"   "PREFIX|Al+"
114      6      3     1     "wa"   "CONJ"  "PREFIX|w:CONJ+"
114      6      3     2     "{l"   "DET"   "PREFIX|Al+"
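
The predicate passed to filter is an ordinary Julia function of a single row, so it can be tried out without JuliaDB.jl. The sketch below applies the same predicate to a plain vector of named tuples; the four rows are copied from the first entries of the corpus table shown earlier, and the vector is only a stand-in for the IndexedTable. It also compares the regular expression with startswith from Base, which expresses the same condition for this fixed prefix.

```julia
# A few rows copied from the start of the corpus table, as named tuples.
rows = [
    (chapter=1, verse=1, word=1, part=1, form="bi", tag="P",
     features="PREFIX|bi+"),
    (chapter=1, verse=1, word=1, part=2, form="somi", tag="N",
     features="STEM|POS:N|LEM:{som|ROOT:smw|M|GEN"),
    (chapter=1, verse=1, word=2, part=1, form="{ll~ahi", tag="PN",
     features="STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"),
    (chapter=1, verse=1, word=3, part=1, form="{l", tag="DET",
     features="PREFIX|Al+"),
]

# Same predicate as in the text: features must begin with "PREFIX".
by_regex = filter(t -> occursin(r"^PREFIX", t.features), rows)

# startswith expresses the same condition without a regular expression.
by_prefix = filter(t -> startswith(t.features, "PREFIX"), rows)

println(length(by_regex))
println([t.form for t in by_regex])
println(by_regex == by_prefix)
```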

The main point here is that any data manipulation on CorpusData and TanzilData is done through JuliaDB.jl's API, applied to the underlying table.