Data Processing
The goal of having a Quranic corpus is to study it computationally, which calls for utilities that preprocess the text further. QuranTree.jl offers functions for processing Arabic text, including character dediacritization and character normalization.
Character Dediacritization
Dediacritization is done using the dediac function. It works on Arabic text as well as on Buckwalter and custom transliterations.
julia> using QuranTree
julia> crps, tnzl = load(QuranData());
julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> avrs = verses(tnzldata[1][1])[1]
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(avrs)
"بسم ٱلله ٱلرحمٰن ٱلرحيم"
julia> bvrs = verses(crpsdata[1][1])[1]
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(bvrs)
"bsm {llh {lrHm`n {lrHym"
julia> dediac(avrs) === arabic(dediac(bvrs))
true
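To make the idea concrete, here is a minimal plain-Julia sketch of dediacritization as a filter over code points. It is not QuranTree.jl's implementation. The helper name strip_basic_diacritics and the range U+064B to U+0652 (tanween, fatha, damma, kasra, shadda, sukun) are assumptions made for illustration, chosen because the outputs above keep the hamzat wasl and the dagger alif.

```julia
# Hypothetical helper, not part of QuranTree.jl.
# Assumption: only the basic diacritics U+064B..U+0652 are removed.
const BASIC_DIACRITICS = '\u064B':'\u0652'

# Keep every character that is not in the diacritic range.
strip_basic_diacritics(s::AbstractString) = filter(c -> !(c in BASIC_DIACRITICS), s)

# "bisomi" in Arabic script: beh, kasra, seen, sukun, meem, kasra.
word = "\u0628\u0650\u0633\u0652\u0645\u0650"
stripped = strip_basic_diacritics(word)

# Only the three base letters remain: beh, seen, meem.
println(stripped == "\u0628\u0633\u0645")
```

Because the filter works one character at a time, the same approach would apply to a transliteration once the range is swapped for the transliterated diacritic symbols.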
Custom transliterations can also be dediacritized, as shown below:
julia> old_keys = collect(keys(BW_ENCODING));
julia> new_vals = reverse(collect(values(BW_ENCODING)));
julia> my_encoder = Dict(old_keys .=> new_vals);
julia> @transliterator my_encoder "MyEncoder"
julia> encode(avrs)
"\"S%gAS zppj[KS zp`j[&gA[r]S zp`j[&SkAS"
julia> arabic(encode(avrs))
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(encode(avrs))
"\"%A zppK zp`&Ar] zp`&kA"
julia> arabic(dediac(encode(avrs)))
"بسم ٱلله ٱلرحمٰن ٱلرحيم"
To reset the transliteration to the default:
julia> @transliterator :default
julia> encode(avrs)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(encode(avrs))
"bsm {llh {lrHm`n {lrHym"
Character Normalization
Normalization is done using the normalize function. It works on Arabic text as well as on Buckwalter and custom transliterations. For example, the following normalizes avrs from above:
julia> normalize(avrs)
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
julia> normalize(dediac(avrs))
"بسم الله الرحمان الرحيم"
julia> dediac(normalize(avrs))
"بسم الله الرحمان الرحيم"
julia> avrs |> dediac |> normalize |> encode  # using pipe notation
"bsm Allh AlrHmAn AlrHym"
Specific characters can be normalized by passing their names:
julia> avrs1 = verses(tnzldata[2][4])[1]
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs1, :alif_maddah)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَا أُنزِلَ إِلَيْكَ وَمَا أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs1, :alif_hamza_above)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ اُنزِلَ إِلَيْكَ وَمَآ اُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs, [:alif_khanjareeya, :hamzat_wasl])
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
The same works with the CorpusData instead of the TanzilData:
julia> avrs2 = arabic(verses(crpsdata[2][15])[1])
"ٱللَّهُ يَسْتَهْزِئُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"
julia> normalize(avrs2, :ya_hamza_above)
"ٱللَّهُ يَسْتَهْزِيُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"