Data Processing

The goal of having a Quranic corpus is to study it computationally. As such, utilities for further data preprocessing are necessary. QuranTree.jl offers functions for processing Arabic texts, including character dediacritization and character normalization.

Character Dediacritization

The dediac function removes diacritics. It works for Arabic, Buckwalter and custom transliterations.

julia> using QuranTree

julia> crps, tnzl = load(QuranData());

julia> crpsdata = table(crps);

julia> tnzldata = table(tnzl);

julia> avrs = verses(tnzldata[1][1])[1]
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"

julia> dediac(avrs)
"بسم ٱلله ٱلرحمٰن ٱلرحيم"

julia> bvrs = verses(crpsdata[1][1])[1]
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"

julia> dediac(bvrs)
"bsm {llh {lrHm`n {lrHym"

julia> dediac(avrs) === arabic(dediac(bvrs))
true
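
To make the idea concrete, here is a minimal sketch of dediacritization written with Base Julia only. It is not QuranTree.jl's implementation: the function name strip_harakat and the choice of code points (U+064B to U+0652) are assumptions made for illustration. That range leaves the superscript alif (U+0670) untouched, which seems consistent with the dediac output shown above.

```julia
# Assumed set of marks: U+064B..U+0652
# (tanwin forms, fatha, damma, kasra, shadda, sukun).
const HARAKAT = '\u064b':'\u0652'

# Hypothetical helper: keep every character that is not in HARAKAT.
strip_harakat(s::AbstractString) = filter(c -> !(c in HARAKAT), s)

# ba + kasra, seen + sukun, meem + kasra
word = "\u0628\u0650\u0633\u0652\u0645\u0650"
bare = strip_harakat(word)   # only the three base letters remain
```

Filtering on a fixed code point range keeps the sketch short, but a real corpus may contain further marks that this range does not cover.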

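Custom transliterations can be dediacritized as well. The example below builds one by reversing the values of BW_ENCODING and registering it with @transliterator: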

julia> old_keys = collect(keys(BW_ENCODING));

julia> new_vals = reverse(collect(values(BW_ENCODING)));

julia> my_encoder = Dict(old_keys .=> new_vals);

julia> @transliterator my_encoder "MyEncoder"

julia> encode(avrs)
"\"S%gAS zppj[KS zp`j[&gA[r]S zp`j[&SkAS"

julia> arabic(encode(avrs))
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"

julia> dediac(encode(avrs))
"\"%A zppK zp`&Ar] zp`&kA"

julia> arabic(dediac(encode(avrs)))
"بسم ٱلله ٱلرحمٰن ٱلرحيم"

To reset the transliteration to the default,

julia> @transliterator :default

julia> encode(avrs)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"

julia> dediac(encode(avrs))
"bsm {llh {lrHm`n {lrHym"

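The custom transliteration above is, at its core, a one-to-one character map that can be inverted. The following sketch illustrates that idea with Base Julia only. The names TOY_ENCODING, TOY_DECODING, toy_encode and toy_decode are hypothetical and are not part of QuranTree.jl, and the table covers just three letters.

```julia
# Hypothetical three-letter table; a real encoding covers the full alphabet.
const TOY_ENCODING = Dict('\u0628' => 'b', '\u0633' => 's', '\u0645' => 'm')

# Inverting the table only works because the map is one-to-one.
const TOY_DECODING = Dict(v => k for (k, v) in TOY_ENCODING)

# Characters missing from the table pass through unchanged.
toy_encode(s::AbstractString) = map(c -> get(TOY_ENCODING, c, c), s)
toy_decode(s::AbstractString) = map(c -> get(TOY_DECODING, c, c), s)

encoded = toy_encode("\u0628\u0633\u0645")   # "bsm"
decoded = toy_decode(encoded)                # back to the Arabic letters
```

Because encoding and decoding are inverse maps, any character-level operation applied to the encoded text can be mapped back to Arabic, which is what the arabic(dediac(encode(avrs))) call above relies on.
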
Character Normalization

Normalization is done with the normalize function, which works for Arabic, Buckwalter and other custom transliterations. For example, the following normalizes the verse avrs defined above:

julia> normalize(avrs)
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"

julia> normalize(dediac(avrs))
"بسم الله الرحمان الرحيم"

julia> dediac(normalize(avrs))
"بسم الله الرحمان الرحيم"

julia> # using pipe notation
       avrs |> dediac |> normalize |> encode
"bsm Allh AlrHmAn AlrHym"

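Conceptually, normalization replaces variant letter forms with a plainer counterpart. The sketch below shows one way this could be done with Base Julia only. It is not how QuranTree.jl implements normalize: SKETCH_RULES and sketch_normalize are hypothetical names, and the code points attached to each rule are assumptions.

```julia
# Hypothetical rule table: each name maps one character to a plainer form.
const SKETCH_RULES = Dict(
    :hamzat_wasl      => ('\u0671' => '\u0627'),  # alif wasla       -> bare alif
    :alif_hamza_above => ('\u0623' => '\u0627'),  # alif hamza above -> bare alif
    :alif_maddah      => ('\u0622' => '\u0627'),  # alif maddah      -> bare alif
)

# Apply the selected rules (all of them by default) character by character.
function sketch_normalize(s::AbstractString, names=keys(SKETCH_RULES))
    table = Dict(SKETCH_RULES[n] for n in names)
    return map(c -> get(table, c, c), s)
end

# Convenience method for a single rule name.
sketch_normalize(s::AbstractString, name::Symbol) = sketch_normalize(s, [name])

# alif wasla + lam + lam + ha, with only the wasla rule applied
plain = sketch_normalize("\u0671\u0644\u0644\u0647", :hamzat_wasl)
```

Since these rules touch letters while dediacritization touches marks, the two steps act on disjoint characters, which is why their order does not matter in the examples above.
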
Specific characters can be normalized by passing their names:

julia> avrs1 = verses(tnzldata[2][4])[1]
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"

julia> normalize(avrs1, :alif_maddah)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَا أُنزِلَ إِلَيْكَ وَمَا أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"

julia> normalize(avrs1, :alif_hamza_above)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ اُنزِلَ إِلَيْكَ وَمَآ اُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"

julia> normalize(avrs, [:alif_khanjareeya, :hamzat_wasl])
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"

The same works on verses taken from the CorpusData instead of the TanzilData,

julia> avrs2 = arabic(verses(crpsdata[2][15])[1])
"ٱللَّهُ يَسْتَهْزِئُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"

julia> normalize(avrs2, :ya_hamza_above)
"ٱللَّهُ يَسْتَهْزِيُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"