CAMeL Tools

In this section, we explore how to use CAMeL Tools, developed at New York University Abu Dhabi, from Julia. CAMeL Tools is a suite of tools for Arabic Natural Language Processing and one of the most feature-rich libraries currently available for general-purpose Arabic NLP. To install the library, follow the installation instructions in the CAMeL Tools documentation.

Setting up

On macOS, it is enough to run the following in the terminal:

pip3 install camel-tools

Then, download the necessary data as follows:

camel_data light

This tutorial uses only the light version of the CAMeL data, which is around 19 MB.

Julia PyCall.jl

Julia can interoperate with Python through the library PyCall.jl. To install, run the following:

julia> using Pkg

julia> Pkg.add("PyCall")
  Resolving package versions...
No Changes to `~/.julia/packages/QuranTree/JFGph/docs/Project.toml`
No Changes to `~/.julia/packages/QuranTree/JFGph/docs/Manifest.toml`

Character Dediacritization

At this point, Julia can connect to Python, and CAMeL Tools can be loaded with the @pyimport macro. For example, the following loads the dediac and normalize modules of the library:

julia> using PyCall

julia> @pyimport camel_tools.utils.dediac as camel_dediac

julia> @pyimport camel_tools.utils.normalize as camel_normalize

Important

If PyCall.jl cannot find Python, or picks up a different Python from the one you intend to use, specify the path to the Python executable through the PYTHON environment variable. After installing PyCall.jl, set the path and rebuild, for example:

ENV["PYTHON"] = "/usr/bin/python3"
Pkg.build("PyCall")

The last line rebuilds PyCall.jl, which then remembers the path. Restart Julia afterwards for the change to take effect.

Important

Make sure the Python installation you configure is the one in which CAMeL Tools was installed.
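One way to find a suitable value for ENV["PYTHON"] is to ask the interpreter itself. The following is a minimal sketch, run with the same python3 that pip3 installed CAMeL Tools into; the helper name describe_python is hypothetical and uses only the Python standard library.

```python
import importlib.util
import sys


def describe_python(module_name="camel_tools"):
    """Return this interpreter's path and whether `module_name` can be imported from it."""
    spec = importlib.util.find_spec(module_name)
    return sys.executable, spec is not None


path, found = describe_python()
print("Python executable:", path)        # candidate value for ENV["PYTHON"]
print("camel_tools importable:", found)  # should be True before rebuilding PyCall.jl
```

If the second line prints False, CAMeL Tools was installed into a different Python, and pointing PyCall.jl at this interpreter would still fail to import it.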

Let's use this and compare the results with QuranTree.jl's built-in dediac function.

julia> using QuranTree

julia> crps, tnzl = load(QuranData());

julia> crpsdata = table(crps);

julia> tnzldata = table(tnzl);

julia> avrs1 = verses(tnzldata[1][1])[1]
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"

julia> dediac(avrs1)
"بسم ٱلله ٱلرحمٰن ٱلرحيم"

The corresponding CAMeL Tools call is:

julia> camel_dediac.dediac_ar(avrs1)

The difference lies in the Alif Khanjareeya (dagger alif): at the moment, QuranTree.jl does not treat it as a diacritic but as one of the characters to be normalized, so dediac leaves it in place, as in the output above.
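To make the distinction concrete, here is a small standard-library sketch of dediacritization with and without the Alif Khanjareeya in the set of removed marks. It illustrates the idea only and is not the implementation used by either library; the names HARAKAT, DAGGER_ALIF, and strip_marks are hypothetical, and the sample word is the third word of the verse above written out as code points.

```python
# Tanween, short vowels, shadda, and sukun: U+064B through U+0652.
HARAKAT = frozenset(chr(cp) for cp in range(0x064B, 0x0653))
DAGGER_ALIF = "\u0670"  # Alif Khanjareeya (superscript alef)


def strip_marks(text, marks):
    """Return `text` with every character found in `marks` removed."""
    return "".join(ch for ch in text if ch not in marks)


# Alef wasla, lam, ra, shadda, fatha, hah, sukun, meem, fatha, dagger alif, noon, kasra.
word = "\u0671\u0644\u0631\u0651\u064E\u062D\u0652\u0645\u064E\u0670\u0646\u0650"

keeps_dagger = strip_marks(word, HARAKAT)                  # dagger alif survives
drops_dagger = strip_marks(word, HARAKAT | {DAGGER_ALIF})  # dagger alif removed too

print(keeps_dagger)
print(drops_dagger)
```

The two results differ by exactly one code point, U+0670, which is the whole difference discussed above.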

Let's try this on CorpusData as well, to see how it handles Buckwalter dediacritization:

julia> vrs1 = verses(crpsdata[1][1])[1]
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"

julia> dediac(vrs1)
"bsm {llh {lrHm`n {lrHym"

julia> camel_dediac.dediac_bw(vrs1)
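The same idea carries over to the Buckwalter transliteration, where each diacritic is an ASCII symbol. The sketch below is again a standard-library illustration rather than either library's code, with hypothetical names; it assumes the usual Buckwalter symbols for the diacritics and a backtick for the Alif Khanjareeya.

```python
# Buckwalter symbols: a (fatha), u (damma), i (kasra), o (sukun), ~ (shadda),
# and F, N, K for the three tanween forms.
BW_HARAKAT = frozenset("auio~FNK")
BW_DAGGER_ALIF = "`"  # Alif Khanjareeya in Buckwalter


def strip_marks(text, marks):
    """Return `text` with every character found in `marks` removed."""
    return "".join(ch for ch in text if ch not in marks)


verse = "bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"

keeps_dagger = strip_marks(verse, BW_HARAKAT)
drops_dagger = strip_marks(verse, BW_HARAKAT | {BW_DAGGER_ALIF})

print(keeps_dagger)
print(drops_dagger)
```

With the backtick left out of the removed set, the result matches the QuranTree.jl output shown above; adding it removes the Alif Khanjareeya as well.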

Character Normalization

To normalize, QuranTree.jl takes an argument specifying which character to normalize. In CAMeL Tools, however, the character is part of the function name:

julia> avrs2 = verses(tnzldata[2][3])[1]
"ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"

julia> normalize(avrs2, :ta_marbuta)
"ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰهَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"

julia> camel_normalize.normalize_teh_marbuta_ar(avrs2)

Another example, normalizing over the Buckwalter encoding:

julia> vrs2 = verses(crpsdata[2][3])[1]
"{l~a*iyna yu&ominuwna bi{logayobi wayuqiymuwna {lS~alaw`pa wamim~aA razaqona`humo yunfiquwna"

julia> normalize(vrs2, :ta_marbuta)
"{l~a*iyna yu&ominuwna bi{logayobi wayuqiymuwna {lS~alaw`ha wamim~aA razaqona`humo yunfiquwna"

julia> camel_normalize.normalize_teh_marbuta_bw(vrs2)
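As a rough illustration of what Ta Marbuta normalization does in both encodings, the following standard-library sketch replaces Ta Marbuta with Ha, mirroring the QuranTree.jl outputs above. The function normalize_teh_marbuta_sketch and its encoding argument are hypothetical and are not part of either library's API.

```python
# (Ta Marbuta, Ha) pairs for each encoding: Arabic script and Buckwalter.
TEH_MARBUTA = {"ar": ("\u0629", "\u0647"), "bw": ("p", "h")}


def normalize_teh_marbuta_sketch(text, encoding="ar"):
    """Replace every Ta Marbuta in `text` with Ha for the given encoding."""
    old, new = TEH_MARBUTA[encoding]
    return text.replace(old, new)


# The word containing Ta Marbuta in the verse above, as code points and in Buckwalter.
word_ar = "\u0671\u0644\u0635\u0651\u064E\u0644\u064E\u0648\u0670\u0629\u064E"
word_bw = "{lS~alaw`pa"

print(normalize_teh_marbuta_sketch(word_ar, "ar"))
print(normalize_teh_marbuta_sketch(word_bw, "bw"))
```

A plain replacement is enough here because Ta Marbuta is a single symbol in each encoding: U+0629 in Arabic script and p in Buckwalter.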