Getting Started
The library includes two datasets: the Quranic Arabic Corpus and the Tanzil data. To load them, run the following:
julia> using QuranTree
julia> data = QuranData()
QuranData(QuranTree.FilePaths("/juliateam/.julia/packages/QuranTree/JFGph/src/../data/quranic-corpus-morphology-0.4.txt", "/juliateam/.julia/packages/QuranTree/JFGph/src/../data/quran-uthmani-final.txt"))
julia> crps, tnzl = load(data);
QuranData() is a struct containing the default file paths of the data. The load function returns a tuple holding the Quranic Corpus and the Tanzil data. The loaded data is stored in an immutable (read-only) array, so users cannot change it. This is reflected in the type of the object, as shown below:
julia> crps
(CorpusRaw) 128276-element ReadOnlyArrays.ReadOnlyArray{String,1,Array{String,1}}:
"# PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK"
"#===================================================================="
"#"
"# Quranic Arabic Corpus (morphology, version 0.4)"
"# Copyright (C) 2011 Kais Dukes"
"# License: GNU General Public License"
"#"
"# The Quranic Arabic Corpus includes syntactic and morphological"
"# annotation of the Quran, and builds on the verified Arabic text"
"# distributed by the Tanzil project."
⋮
"(114:5:4:1)\tSuduwri\tN\tSTEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
"(114:5:5:1)\t{l\tDET\tPREFIX|Al+"
"(114:5:5:2)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
"(114:6:1:1)\tmina\tP\tSTEM|POS:P|LEM:min"
"(114:6:2:1)\t{lo\tDET\tPREFIX|Al+"
"(114:6:2:2)\tjin~api\tN\tSTEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
"(114:6:3:1)\twa\tCONJ\tPREFIX|w:CONJ+"
"(114:6:3:2)\t{l\tDET\tPREFIX|Al+"
"(114:6:3:3)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
julia> tnzl
(TanzilRaw) 6266-element ReadOnlyArrays.ReadOnlyArray{String,1,Array{String,1}}:
"1|1|بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
"1|2|ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ"
"1|3|ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
"1|4|مَٰلِكِ يَوْمِ ٱلدِّينِ"
"1|5|إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
"1|6|ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ"
"1|7|صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُوبِ عَلَيْهِمْ وَلَا ٱلضَّآلِّينَ"
"2|1|بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ"
"2|2|ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ"
"2|3|ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
⋮
"# track of changes."
"#"
"# - This copyright notice shall be included in all verbatim copies "
"# of the text, and shall be reproduced appropriately in all files "
"# derived from or containing substantial portion of this text."
"#"
"# Please check updates at: http://tanzil.net/updates/"
"# "
"#===================================================================="
To parse these raw data, the table function is used:
julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> crpsdata
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes
Table with 128219 rows, 7 columns:
Columns:
# colname type
───────────────────
1 chapter Int64
2 verse Int64
3 word Int64
4 part Int64
5 form String
6 tag String
7 features String
julia> tnzldata
Tanzil Quran Text (Uthmani)
(C) 2008-2010 Tanzil.net
Table with 6236 rows, 3 columns:
chapter verse form
─────────────────────────────────────────────────────────────────────
1 1 "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
1 2 "ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ"
1 3 "ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
1 4 "مَٰلِكِ يَوْمِ ٱلدِّينِ"
1 5 "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
1 6 "ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ"
1 7 "صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُوبِ عَلَيْهِمْ وَلَا ٱلضَّآلِّينَ"
2 1 "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ"
2 2 "ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ"
⋮
113 4 "وَمِن شَرِّ ٱلنَّفَّٰثَٰتِ فِى ٱلْعُقَدِ"
113 5 "وَمِن شَرِّ حَاسِدٍ إِذَا حَسَدَ"
114 1 "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ قُلْ أَعُوذُ بِرَبِّ ٱلنَّاسِ"
114 2 "مَلِكِ ٱلنَّاسِ"
114 3 "إِلَٰهِ ٱلنَّاسِ"
114 4 "مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ"
114 5 "ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ"
114 6 "مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ"
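To make the relationship between the raw lines and the parsed columns concrete, here is a minimal sketch that splits one raw line of each dataset by hand using only Base Julia. The helpers parse_corpus_line and parse_tanzil_line are hypothetical names introduced for illustration; they are not part of QuranTree and are not meant to reflect how table is implemented.

```julia
# A raw corpus line: location, form, tag and features separated by tabs.
corpus_line = "(114:6:3:3)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"

# Hypothetical helper (not part of QuranTree): one corpus line -> named tuple.
function parse_corpus_line(line::AbstractString)
    location, form, tag, features = split(line, '\t')
    # "(114:6:3:3)" -> chapter, verse, word, part
    ids = parse.(Int, split(strip(location, ['(', ')']), ':'))
    chapter, verse, word, part = ids
    return (chapter=chapter, verse=verse, word=word, part=part,
            form=String(form), tag=String(tag), features=String(features))
end

# Hypothetical helper (not part of QuranTree): one Tanzil line -> named tuple.
# limit=3 keeps any later '|' characters inside the verse text.
function parse_tanzil_line(line::AbstractString)
    chapter, verse, form = split(line, '|'; limit=3)
    return (chapter=parse(Int, chapter), verse=parse(Int, verse), form=String(form))
end

corpus_row = parse_corpus_line(corpus_line)
tanzil_row = parse_tanzil_line("114|6|مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ")
```

The field names of the resulting named tuples match the column names shown in the tables above.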
The tables returned by the table function are of type CorpusData and TanzilData, respectively, and are built on top of JuliaDB.jl's IndexedTable, which can be accessed with the macro @data (for example, @data crpsdata) or through the .data property (crpsdata.data). Note, however, that JuliaDB.jl displays only the column metadata when the table is wider than the output pane. This is the case for crpsdata above, which has more columns (and is thus wider) than tnzldata. To display the data of any wide table, we recommend PrettyTables.jl:
julia> using PrettyTables
julia> @ptconf vcrop_mode=:middle tf=tf_compact
julia> @pt crpsdata
--------- ------- ------- ------- ------------ -------- -----------------------
chapter verse word part form tag ⋯
Int64 Int64 Int64 Int64 String String ⋯
--------- ------- ------- ------- ------------ -------- -----------------------
1 1 1 1 bi P ⋯
1 1 1 2 somi N STEM|POS:N|L ⋯
1 1 2 1 {ll~ahi PN STEM|POS:PN|L ⋯
1 1 3 1 {l DET ⋯
1 1 3 2 r~aHoma`ni ADJ STEM|POS:ADJ|LEM:r~a ⋯
1 1 4 1 {l DET ⋯
1 1 4 2 r~aHiymi ADJ STEM|POS:ADJ|LEM:r ⋯
1 2 1 1 {lo DET ⋯
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
114 5 5 2 n~aAsi N STEM|POS:N|LEM ⋯
114 6 1 1 mina P ⋯
114 6 2 1 {lo DET ⋯
114 6 2 2 jin~api N STEM|POS:N|LEM ⋯
114 6 3 1 wa CONJ ⋯
114 6 3 2 {l DET ⋯
114 6 3 3 n~aAsi N STEM|POS:N|LEM ⋯
--------- ------- ------- ------- ------------ -------- -----------------------
1 column and 128204 rows omitted
You need to install PrettyTables.jl to successfully run the code.
using Pkg
Pkg.add("PrettyTables")
Manipulating the Table
As mentioned above, the table is based on JuliaDB.jl's IndexedTable. Therefore, any data manipulation is done through JuliaDB.jl's APIs. To access the data, use the .data property or the macro @data:
julia> crpstbl = @data crpsdata; # or crpsdata.data
julia> tnzltbl = @data tnzldata; # or tnzldata.data
julia> crpstbl
Table with 128219 rows, 7 columns:
Columns:
# colname type
───────────────────
1 chapter Int64
2 verse Int64
3 word Int64
4 part Int64
5 form String
6 tag String
7 features String
julia> tnzltbl
Table with 6236 rows, 3 columns:
chapter verse form
─────────────────────────────────────────────────────────────────────
1 1 "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
1 2 "ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ"
1 3 "ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
1 4 "مَٰلِكِ يَوْمِ ٱلدِّينِ"
1 5 "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
1 6 "ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ"
1 7 "صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُوبِ عَلَيْهِمْ وَلَا ٱلضَّآلِّينَ"
2 1 "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ"
2 2 "ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ"
⋮
113 4 "وَمِن شَرِّ ٱلنَّفَّٰثَٰتِ فِى ٱلْعُقَدِ"
113 5 "وَمِن شَرِّ حَاسِدٍ إِذَا حَسَدَ"
114 1 "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ قُلْ أَعُوذُ بِرَبِّ ٱلنَّاسِ"
114 2 "مَلِكِ ٱلنَّاسِ"
114 3 "إِلَٰهِ ٱلنَّاسِ"
114 4 "مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ"
114 5 "ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ"
114 6 "مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ"
Note that crpsdata and crpstbl have different types (as do tnzldata and tnzltbl), as shown below:
julia> typeof(crpsdata)
CorpusData
julia> typeof(crpstbl)
IndexedTables.IndexedTable{StructArrays.StructArray{NamedTuple{(:chapter, :verse, :word, :part, :form, :tag, :features),Tuple{Int64,Int64,Int64,Int64,String,String,String}},1,NamedTuple{(:chapter, :verse, :word, :part, :form, :tag, :features),Tuple{Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1}}},Int64}}
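The difference in types follows from one object wrapping the other. The following is a minimal sketch of that pattern in Base Julia, assuming a wrapper struct with a data field. WrappedTable and @unwrap are hypothetical names introduced for illustration; the actual definitions of CorpusData and @data in QuranTree may differ.

```julia
# Hypothetical wrapper type; QuranTree's CorpusData may be defined differently.
struct WrappedTable{T}
    data::T          # the underlying table
    title::String    # metadata kept alongside the table
end

# Hypothetical macro mirroring the `@data x` shorthand for `x.data`.
macro unwrap(x)
    return :($(esc(x)).data)
end

rows = [(chapter=1, verse=1), (chapter=1, verse=2)]
wrapped = WrappedTable(rows, "Example table")

inner = @unwrap wrapped                  # same as wrapped.data
same_object = inner === wrapped.data     # both refer to the same vector
different_types = typeof(wrapped) != typeof(inner)
```

In this sketch the macro and the property access return the very same object, while the wrapper itself has a distinct type.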
From here, any data manipulation is done using JuliaDB.jl's APIs. For example, the following selects the features column of crpstbl:
julia> using JuliaDB
julia> select(crpstbl, :features)
128219-element WeakRefStrings.StringArray{String,1}:
"PREFIX|bi+"
"STEM|POS:N|LEM:{som|ROOT:smw|M|GEN"
"STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
"PREFIX|Al+"
"STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN"
"PREFIX|Al+"
"STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN"
"PREFIX|Al+"
"STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
"PREFIX|l:P+"
⋮
"STEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
"PREFIX|Al+"
"STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
"STEM|POS:P|LEM:min"
"PREFIX|Al+"
"STEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
"PREFIX|w:CONJ+"
"PREFIX|Al+"
"STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
julia> # or, equivalently:
select(crpsdata.data, :features)
128219-element WeakRefStrings.StringArray{String,1}:
"PREFIX|bi+"
"STEM|POS:N|LEM:{som|ROOT:smw|M|GEN"
"STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
"PREFIX|Al+"
"STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN"
"PREFIX|Al+"
"STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN"
"PREFIX|Al+"
"STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
"PREFIX|l:P+"
⋮
"STEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
"PREFIX|Al+"
"STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
"STEM|POS:P|LEM:min"
"PREFIX|Al+"
"STEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
"PREFIX|w:CONJ+"
"PREFIX|Al+"
"STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
You need to install JuliaDB.jl to successfully run the code.
using Pkg
Pkg.add("JuliaDB")
To keep only the tokens whose features are marked as PREFIX, Julia Base's occursin can be used:
julia> filter(t -> occursin(r"^PREFIX", t.features), crpstbl)
Table with 28670 rows, 7 columns:
chapter verse word part form tag features
───────────────────────────────────────────────────────────
1 1 1 1 "bi" "P" "PREFIX|bi+"
1 1 3 1 "{l" "DET" "PREFIX|Al+"
1 1 4 1 "{l" "DET" "PREFIX|Al+"
1 2 1 1 "{lo" "DET" "PREFIX|Al+"
1 2 2 1 "li" "P" "PREFIX|l:P+"
1 2 4 1 "{lo" "DET" "PREFIX|Al+"
1 3 1 1 "{l" "DET" "PREFIX|Al+"
1 3 2 1 "{l" "DET" "PREFIX|Al+"
1 4 3 1 "{l" "DET" "PREFIX|Al+"
⋮
114 2 2 1 "{l" "DET" "PREFIX|Al+"
114 3 2 1 "{l" "DET" "PREFIX|Al+"
114 4 3 1 "{lo" "DET" "PREFIX|Al+"
114 4 4 1 "{lo" "DET" "PREFIX|Al+"
114 5 5 1 "{l" "DET" "PREFIX|Al+"
114 6 2 1 "{lo" "DET" "PREFIX|Al+"
114 6 3 1 "wa" "CONJ" "PREFIX|w:CONJ+"
114 6 3 2 "{l" "DET" "PREFIX|Al+"
julia> # or, equivalently:
filter(t -> occursin(r"^PREFIX", t.features), crpsdata.data)
Table with 28670 rows, 7 columns:
chapter verse word part form tag features
───────────────────────────────────────────────────────────
1 1 1 1 "bi" "P" "PREFIX|bi+"
1 1 3 1 "{l" "DET" "PREFIX|Al+"
1 1 4 1 "{l" "DET" "PREFIX|Al+"
1 2 1 1 "{lo" "DET" "PREFIX|Al+"
1 2 2 1 "li" "P" "PREFIX|l:P+"
1 2 4 1 "{lo" "DET" "PREFIX|Al+"
1 3 1 1 "{l" "DET" "PREFIX|Al+"
1 3 2 1 "{l" "DET" "PREFIX|Al+"
1 4 3 1 "{l" "DET" "PREFIX|Al+"
⋮
114 2 2 1 "{l" "DET" "PREFIX|Al+"
114 3 2 1 "{l" "DET" "PREFIX|Al+"
114 4 3 1 "{lo" "DET" "PREFIX|Al+"
114 4 4 1 "{lo" "DET" "PREFIX|Al+"
114 5 5 1 "{l" "DET" "PREFIX|Al+"
114 6 2 1 "{lo" "DET" "PREFIX|Al+"
114 6 3 1 "wa" "CONJ" "PREFIX|w:CONJ+"
114 6 3 2 "{l" "DET" "PREFIX|Al+"
The main point here is that any data manipulation on CorpusData and TanzilData is done through JuliaDB.jl's APIs.
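The occursin predicate used in the filter above can be tried on plain strings without loading any table. The sketch below applies it to a few feature strings copied from the output shown earlier, and adds a second, hypothetical helper root_of (not part of QuranTree or JuliaDB.jl) that extracts the ROOT entry with a regular expression.

```julia
# Feature strings copied from the corpus output shown earlier.
features = [
    "PREFIX|bi+",
    "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN",
    "PREFIX|Al+",
    "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN",
]

# Same test as in the filter call: the string starts with PREFIX.
has_prefix_feature(f) = occursin(r"^PREFIX", f)
prefixes = filter(has_prefix_feature, features)

# Hypothetical helper: return the root of an entry, or nothing if it has none.
function root_of(f::AbstractString)
    m = match(r"ROOT:([^|]+)", f)
    return m === nothing ? nothing : String(m.captures[1])
end

roots = [root_of(f) for f in features]
```

A function such as has_prefix_feature could then be applied to the features field of each row inside the anonymous function passed to filter, in the same way as the inline occursin call.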