CAMeL Tools Logo

CAMeL Tools Release v1.0.0

Overview

This update marks the first official release of CAMeL Tools.

CAMeL Tools is primarily a collection of Python APIs for Arabic Natural Language Processing (NLP). These APIs are meant to be building blocks for NLP applications.

We also provide a collection of command-line utilities to perform some common Arabic NLP tasks directly in the command line.

In this update, we provide a quick tour of all the features that come in CAMeL Tools v1.0.0.

The results reported below are extracted from the paper by Obeid et al. 2020.

APIs

CAMeL Tools is primarily a set of APIs for Arabic NLP. Below is a description of the top-level submodules that are provided:

camel_tools.utils

A collection of utilities for performing common preprocessing tasks in Arabic NLP. This includes, cleaning, conversions between various transliteration schemes, normalization, and de-diacritization.

camel_tools.morphology

Contains the CAMeL Tools morphological analyzer, generator, and reinflector.

We provide two morphological databases as part of the data download for CAMeL Tools.

We also provide access to two additional databases that require access to the LDC’s Standard Arabic Morphological Analyzer.

The Analyzer class uses a given morphological database to generate possible analyses for a given word.

The Generator produces analyses that fit a certain set of pre-defined morphological features.

The Reinflector combines analysis and generation to allow producing word forms by reinflecting input words given input features.

camel_tools.disambig

Disambiguation is the process of finding the most likely analysis of a word.

In the paper, we discuss two disambiguator implementations: one using maximum likelihood estimation (MLE) and one using multitask learning.

We provide the Diambiguator base class that all diambiguators implement as well as the MLEDisambiguator which implements the MLE model discussed in the paper. Our multitask implementation will be provided through a separate repository.

Comparison of the performance of the CAMeLTools Multitask learning and MLE systems on MSA to MADAMIRA.

We report on the LDC Penn Arabic Treebank and Egyptian Treebank following the Diab et al., 2013 suggested data splits. See Obeid et al. 2020 for more details.

MSA MADAMIRA CAMeL Tools
Multitask
CAMeL Tools
MLE
DIAC 87.7% 90.9% 78.4%
LEX 96.4% 95.4% 95.7%
POS 97.1% 97.2% 95.5%
FULL 85.6% 89.0% 70.0%
ATB TOK 99.0% 99.4% 99.0%

Comparison of the performance of the CAMeLTools MLE system on EGY to MADAMIRA.

EGY MADAMIRA CAMeL Tools
MLE
DIAC 82.8% 78.9%
LEX 86.6% 87.8%
POS 91.7% 91.8%
FULL 76.4% 73.0%
ATB TOK 93.5% 92.8%

camel_tools.taggers

This module provides the Tagger base class and the DefaultTagger class, the default Tagger implementation in CAMeL Tools. A tagger is just a one-to-one mapping from a token in a sentence to some tag type (generally a string). This is a generalization for tagging tasks (POS tagging, diacritization, lemmatization, etc).

The DefaultTagger utilizes a Disambiguator to perform first disambiguate words in a sentence and then outputs a specified tag for each word, providing reasonable defaults when no analyses are returned.

camel_tools.tokenizers

CAMeL Tools provides utilities for two types of tokenization tasks:

For word tokenization, we provide a very simple punctuation and whitespace splitter. For morphological tokenization, we provide the MorphologicalTokenizer class that wraps a disambiguator to output one of the tokenization features from the disambiguated analysis. In this sense, the MorphologicalTokenizer acts like a Tagger except it is a one-to-many map.

camel_tools.dialectid

We provide a dialect identification system based on the system described by Salamah, Bouamor and Habash (2018). It classifies a sentence into one of 25 Arabic city dialects and Modern Standard Arabic. Our system is the only publicly available implementation of their work.

We also provide utility functions to sum the output probability scores in order to output a country dialect probability score and a region dialect probability score.

Below is a comparison of our dialect identification system with the top two systems from the MADAR Shared Task. Our results differ only slightly from Salameh et al (2018)’s results due to minor implementation variance.

Metric CAMeL Tools (Salameh et al 2018) ArbDialectID
F1 67.85% (67.89%) 67.32%
Precision 68.36% (68.41%) 67.60%
Recall 67.71% (67.75%) 67.29%
Acc. City 67.71% (67.75%) 67.29%
Acc. Country 76.42% (76.44%) 75.23%
Acc. Region 85.92% (85.96%) 84.42%

An online demo of this tool can be found here.

camel_tools.sentiment

CAMeL Tools sentiment analyzer performance using AraBERT and mBERT compared to Mazajak over three benchmark datasets. The results are reported in terms of macro F1 score over the positive and negative classes. Please contact the authors for details about the data sets and splits.

  CAMeL Tools
AraBERT
CAMeL Tools
mBERT
Mazajak
ArSAS 92% 89% 90%
ASTD 73% 66% 72%
SemEval 69% 60% 63%

camel-tools.ner

The results of the proposed system when trained and tested on ANERcorp dataset vs. CRF-based system (Benajiba and Rosso, 2008). The exact data and splits are provided here.

CAMeL Tools Precision Recall F1
LOC 88% 92% 90%
MISC 68% 58% 63%
ORG 77% 70% 73%
PERS 89% 85% 87%
Overall 84% 81% 83%
Benajiba & Rosso 2008 Precision Recall F1
LOC 93% 87% 90%
MISC 71% 54% 61%
ORG 84% 54% 66%
PERS 80% 67% 73%
Overall 87% 73% 79%

Command-line Tools

We also provide a collection of command-line tools for performing various tasks directly from the terminal without having to write a script. These are just wrappers over the APIs provided.

Below is a list of the available command-line tools: