This update marks the first official release of CAMeL Tools.
CAMeL Tools is primarily a collection of Python APIs for Arabic Natural Language Processing (NLP). These APIs are meant to be building blocks for NLP applications.
We also provide a collection of command-line utilities to perform some common Arabic NLP tasks directly in the command line.
In this update, we provide a quick tour of all the features that come in CAMeL Tools v1.0.0.
The results reported below are extracted from the paper by Obeid et al. 2020.
Below is a description of the top-level submodules that CAMeL Tools provides:
A collection of utilities for performing common preprocessing tasks in Arabic NLP, including cleaning, conversion between transliteration schemes, normalization, and de-diacritization.
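To make two of these preprocessing steps concrete, here is a minimal standard-library sketch of de-diacritization and alef normalization. The helper names `strip_diacritics` and `normalize_alef` are hypothetical and chosen for illustration; this is not the CAMeL Tools implementation, and the set of characters handled is an assumption that covers only the most common cases.

```python
import re

# Assumption: treat fathatan (U+064B) through sukun (U+0652), plus the
# superscript (dagger) alef (U+0670), as the diacritics to remove.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Assumption: map alef with madda (U+0622), alef with hamza above (U+0623)
# and alef with hamza below (U+0625) to the bare alef (U+0627).
_ALEF_VARIANTS = str.maketrans(
    {"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"}
)


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritical marks (hypothetical helper)."""
    return _DIACRITICS.sub("", text)


def normalize_alef(text: str) -> str:
    """Collapse alef variants into a bare alef (hypothetical helper)."""
    return text.translate(_ALEF_VARIANTS)
```

Both helpers leave non-Arabic text untouched, so they can be chained over mixed-script input.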
Contains the CAMeL Tools morphological analyzer, generator, and reinflector.
We provide two morphological databases as part of the data download for CAMeL Tools.
We also provide access to two additional databases that require access to the LDC’s Standard Arabic Morphological Analyzer.
The Analyzer class uses a given morphological database to generate possible analyses for a given word.
The Generator produces analyses that fit a certain set of pre-defined morphological features.
The Reinflector combines analysis and generation to allow producing word forms by reinflecting input words given input features.
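The relationship between analysis, generation, and reinflection can be sketched with a toy lookup table. Everything below (`TOY_DB`, `analyze`, `generate`, `reinflect`, the feature names, and the English entries) is hypothetical and only illustrates the idea; it does not reflect the actual Analyzer, Generator, or Reinflector APIs or the format of the morphological databases.

```python
from typing import Dict, List

Analysis = Dict[str, str]

# A toy "database": each surface form maps to its possible analyses.
# Entries are illustrative English examples, not CAMeL Tools data.
TOY_DB: Dict[str, List[Analysis]] = {
    "walks": [
        {"lex": "walk", "pos": "verb", "num": "s"},
        {"lex": "walk", "pos": "noun", "num": "p"},
    ],
    "walk": [
        {"lex": "walk", "pos": "verb", "num": "p"},
        {"lex": "walk", "pos": "noun", "num": "s"},
    ],
}


def analyze(word: str) -> List[Analysis]:
    """Return a copy of every analysis the database lists for a word."""
    return [dict(a) for a in TOY_DB.get(word, [])]


def generate(lemma: str, features: Analysis) -> List[str]:
    """Return surface forms of a lemma whose analyses match all features."""
    forms = set()
    for form, analyses in TOY_DB.items():
        for a in analyses:
            if a["lex"] == lemma and all(
                a.get(k) == v for k, v in features.items()
            ):
                forms.add(form)
    return sorted(forms)


def reinflect(word: str, features: Analysis) -> List[str]:
    """Analyze a word, override the given features, and generate forms."""
    forms = set()
    for a in analyze(word):
        target = {k: v for k, v in a.items() if k != "lex"}
        target.update(features)
        forms.update(generate(a["lex"], target))
    return sorted(forms)
```

Because the input word is ambiguous, reinflection in this sketch returns one candidate per analysis rather than a single form, which is why a disambiguation step is useful in practice.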
Disambiguation is the process of finding the most likely analysis of a word.
In the paper, we discuss two disambiguator implementations: one using maximum likelihood estimation (MLE) and one using multitask learning.
We provide the Disambiguator base class that all disambiguators implement, as well as the MLEDisambiguator, which implements the MLE model discussed in the paper. Our multitask implementation will be provided through a separate repository.
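As a rough illustration of the maximum likelihood idea, the sketch below picks, for each word, the analysis seen most often with it in training data. The class `ToyMLEDisambiguator` and its methods are hypothetical; this is a simplified stand-in under that assumption, not the MLEDisambiguator implementation, and analyses are reduced to plain strings.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


class ToyMLEDisambiguator:
    """Pick the analysis most often seen with a word in training data."""

    def __init__(self) -> None:
        # word -> Counter mapping each analysis to its training frequency
        self._counts: Dict[str, Counter] = defaultdict(Counter)

    def train(self, pairs: Iterable[Tuple[str, str]]) -> None:
        """Count (word, analysis) pairs from annotated data."""
        for word, analysis in pairs:
            self._counts[word][analysis] += 1

    def disambiguate(self, word: str) -> Optional[str]:
        """Return the most frequent analysis, or None for unseen words."""
        counts = self._counts.get(word)
        if not counts:
            return None
        return counts.most_common(1)[0][0]
```

Returning `None` for unseen words leaves the fallback policy to the caller, for example a tagger that substitutes a default tag.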
Comparison of the CAMeL Tools multitask learning and MLE systems with MADAMIRA on MSA.
We report results on the LDC Penn Arabic Treebank and the Egyptian Treebank, following the data splits suggested by Diab et al. (2013). See Obeid et al. (2020) for more details.
MSA | MADAMIRA | CAMeL Tools Multitask | CAMeL Tools MLE |
---|---|---|---|
DIAC | 87.7% | 90.9% | 78.4% |
LEX | 96.4% | 95.4% | 95.7% |
POS | 97.1% | 97.2% | 95.5% |
FULL | 85.6% | 89.0% | 70.0% |
ATB TOK | 99.0% | 99.4% | 99.0% |
Comparison of the CAMeL Tools MLE system with MADAMIRA on EGY.
EGY | MADAMIRA | CAMeL Tools MLE |
---|---|---|
DIAC | 82.8% | 78.9% |
LEX | 86.6% | 87.8% |
POS | 91.7% | 91.8% |
FULL | 76.4% | 73.0% |
ATB TOK | 93.5% | 92.8% |
This module provides the Tagger base class and the DefaultTagger class, the default Tagger implementation in CAMeL Tools. A tagger is a one-to-one mapping from each token in a sentence to a tag (generally a string). This generalizes tagging tasks such as POS tagging, diacritization, and lemmatization.
The DefaultTagger uses a Disambiguator to first disambiguate the words in a sentence and then outputs a specified tag for each word, providing reasonable defaults when no analyses are returned.
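The one-to-one behaviour with a fallback can be sketched as follows. The function `tag_sentence`, its parameters, and the `"UNK"` default are hypothetical names used for illustration; the disambiguator is assumed to be any callable that returns a feature dictionary or `None`, which is a simplification of the real classes.

```python
from typing import Callable, Dict, List, Optional

Analysis = Dict[str, str]


def tag_sentence(
    tokens: List[str],
    disambiguate: Callable[[str], Optional[Analysis]],
    feature: str,
    default: str = "UNK",
) -> List[str]:
    """Return exactly one tag per token, falling back to a default."""
    tags = []
    for token in tokens:
        analysis = disambiguate(token)
        if analysis is None:
            # No analysis was returned for this token.
            tags.append(default)
        else:
            # The analysis may still lack the requested feature.
            tags.append(analysis.get(feature, default))
    return tags
```

The output list always has the same length as the input, which is what distinguishes a tagger from a one-to-many tokenizer.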
CAMeL Tools provides utilities for two types of tokenization tasks: word tokenization and morphological tokenization.
For word tokenization, we provide a very simple punctuation and whitespace splitter. For morphological tokenization, we provide the MorphologicalTokenizer class that wraps a disambiguator to output one of the tokenization features from the disambiguated analysis. In this sense, the MorphologicalTokenizer acts like a Tagger except it is a one-to-many map.
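A punctuation and whitespace splitter of the kind described for word tokenization might look like the sketch below. The function name `split_words` is hypothetical, and the choice to treat every Unicode punctuation or symbol character as its own token is an assumption, not necessarily the rule CAMeL Tools applies.

```python
import unicodedata
from typing import List


def split_words(text: str) -> List[str]:
    """Split on whitespace and emit each punctuation mark as a token."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        # Unicode categories starting with P (punctuation) or S (symbol).
        is_punct = unicodedata.category(ch).startswith(("P", "S"))
        if ch.isspace() or is_punct:
            if current:
                tokens.append("".join(current))
                current = []
            if is_punct:
                tokens.append(ch)
        else:
            # Letters, digits and combining marks stay inside the word.
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

Working from Unicode categories rather than an ASCII punctuation list means the Arabic comma (U+060C) is split off, while diacritics, which are combining marks, remain attached to their word.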
We provide a dialect identification system based on the system described by Salameh, Bouamor and Habash (2018). It classifies a sentence as one of 25 Arabic city dialects or Modern Standard Arabic. Our system is the only publicly available implementation of their work.
We also provide utility functions to sum the output probability scores in order to output a country dialect probability score and a region dialect probability score.
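Summing city-level scores into country-level and region-level scores can be sketched as below. The mapping `CITY_INFO` covers only four cities and is illustrative, and the function `aggregate` is a hypothetical name; neither is taken from the CAMeL Tools API.

```python
from collections import defaultdict
from typing import Dict

# Illustrative city -> (country, region) mapping for a handful of cities.
CITY_INFO = {
    "Cairo": ("Egypt", "Nile Basin"),
    "Alexandria": ("Egypt", "Nile Basin"),
    "Beirut": ("Lebanon", "Levant"),
    "Damascus": ("Syria", "Levant"),
}


def aggregate(city_probs: Dict[str, float], level: int) -> Dict[str, float]:
    """Sum city scores by country (level=0) or by region (level=1)."""
    totals: Dict[str, float] = defaultdict(float)
    for city, prob in city_probs.items():
        totals[CITY_INFO[city][level]] += prob
    return dict(totals)
```

Since each city belongs to exactly one country and one region, the aggregated scores still sum to the same total as the city scores.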
Below is a comparison of our dialect identification system with the top two systems from the MADAR Shared Task. Our results differ only slightly from those of Salameh et al. (2018) due to minor implementation variance.
Metric | CAMeL Tools | (Salameh et al 2018) | ArbDialectID |
---|---|---|---|
F1 | 67.85% | (67.89%) | 67.32% |
Precision | 68.36% | (68.41%) | 67.60% |
Recall | 67.71% | (67.75%) | 67.29% |
Acc. City | 67.71% | (67.75%) | 67.29% |
Acc. Country | 76.42% | (76.44%) | 75.23% |
Acc. Region | 85.92% | (85.96%) | 84.42% |
An online demo of this tool can be found here.
CAMeL Tools sentiment analyzer performance using AraBERT and mBERT compared to Mazajak over three benchmark datasets. The results are reported in terms of macro F1 score over the positive and negative classes. Please contact the authors for details about the data sets and splits.
Dataset | CAMeL Tools AraBERT | CAMeL Tools mBERT | Mazajak |
---|---|---|---|
ArSAS | 92% | 89% | 90% |
ASTD | 73% | 66% | 72% |
SemEval | 69% | 60% | 63% |
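
For readers unfamiliar with the metric, macro F1 over the positive and negative classes can be computed as in this sketch. The function names and label strings are hypothetical, and the presence of a third (neutral) class in the data is an assumption made for the example; this is not the evaluation script used to produce the reported numbers.

```python
from typing import List, Sequence


def f1_for_class(gold: List[str], pred: List[str], label: str) -> float:
    """F1 for one class, computed from true/false positives and negatives."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(
    gold: List[str],
    pred: List[str],
    labels: Sequence[str] = ("positive", "negative"),
) -> float:
    """Unweighted mean of per-class F1 over the selected labels only."""
    return sum(f1_for_class(gold, pred, lab) for lab in labels) / len(labels)
```

Errors involving other classes still count: a neutral example predicted as positive lowers the precision of the positive class even though neutral is not averaged.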
Results of the CAMeL Tools named entity recognition system when trained and tested on the ANERcorp dataset, compared with the CRF-based system of Benajiba and Rosso (2008). The exact data and splits are provided here.
CAMeL Tools | Precision | Recall | F1 |
---|---|---|---|
LOC | 88% | 92% | 90% |
MISC | 68% | 58% | 63% |
ORG | 77% | 70% | 73% |
PERS | 89% | 85% | 87% |
Overall | 84% | 81% | 83% |
Benajiba & Rosso 2008 | Precision | Recall | F1 |
---|---|---|---|
LOC | 93% | 87% | 90% |
MISC | 71% | 54% | 61% |
ORG | 84% | 54% | 66% |
PERS | 80% | 67% | 73% |
Overall | 87% | 73% | 79% |
We also provide a collection of command-line tools for performing various tasks directly from the terminal without having to write a script. These are just wrappers over the APIs provided.
Below is a list of the available command-line tools: