Overview¶
What is Maha?¶
Maha is a Python library that provides simple, yet powerful, functions for dealing with Arabic text. It can be used to clean and parse text, files, and folders with or without streaming capability.
Maha also provides a datasets module that offers a simple interface for using the
available datasets.
Important
Maha is currently in beta.
Datasets¶
Note
All datasets are licensed under the Creative Commons Attribution 4.0 International License.
This module provides a number of datasets that can be used with any NLP task. The available datasets are listed below with their descriptions.
Dataset |
Records |
Description |
|---|---|---|
names |
44161 |
List of Arabic names with meaning and origin of most names |
Datasets can be loaded using the load_dataset() function as follows:
>>> from maha.datasets import load_dataset
>>> names = load_dataset("names")
>>> len(names)
44161
>>> names[0]
Name(name='آبوت', description=['اسم علم مذكر إنكليزي: ABOT، معناه الخوري'], origin='انجليزي')
>>> names[0].cleaned_name
'آبوت'
It is also possible to stream a dataset:
>>> from maha.datasets import load_dataset
>>> names = load_dataset("names", streaming=True)
>>> next(iter(names))
Name(name='آبوت', description=['اسم علم مذكر إنكليزي: ABOT، معناه الخوري'], origin='انجليزي')
Cleaners¶
This module provides a number of functions for cleaning Arabic text. These functions are summarized into the following categories:
keep_fn: Functions that keep only specific characters in a text.remove_fn: Functions that remove specific characters from a text.normalize_fn: Functions that converts similar characters into one common character.replace_fn: Functions that replace specific characters with other characters.contains_fn: Functions that check if a text contains specific characters.
These modules expose a simple interface that is used in conjunction with the defined
constants as well as the available expressions.
Custom constants/expressions are also supported and can be easily used.
Examples¶
To keep
ARABIC_LETTERSandHARAKATonly, you can do the following:
>>> from maha.cleaners.functions import keep
>>> sample_text = "البسملة : ( بِسْمِ اللَّـهِ الرَّحْمَـٰنِ الرَّحِيمِ )"
>>> keep(sample_text, arabic_letters=True, harakat=True)
'البسملة بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ'
To remove
PUNCTUATIONS, you can do the following:
>>> from maha.cleaners.functions import remove
>>> sample_text = "البسملة : ( بِسْمِ اللَّـهِ الرَّحْمَـٰنِ الرَّحِيمِ )"
>>> remove(sample_text, punctuations=True)
'البسملة بِسْمِ اللَّـهِ الرَّحْمَـٰنِ الرَّحِيمِ'
To normalize
ALEF_VARIATIONS, you can do the following:
>>> from maha.cleaners.functions import normalize
>>> sample_text = "أنا وأخي علي في المكتبةِ نَطلعُ على موضوع البرمجه"
>>> normalize(sample_text, alef=True)
'انا واخي علي في المكتبةِ نَطلعُ على موضوع البرمجه'
To normalize all but
ALEF_VARIATIONS, you can do the following:
>>> from maha.cleaners.functions import normalize
>>> sample_text = "أنا وأخي علي في المكتبةِ نَطلعُ على موضوع البرمجه"
>>> normalize(sample_text, alef=False, all=True)
'أنا وأخي علي في المكتبهِ نَطلعُ علي موضوع البرمجه'
To check if text contains
ARABIC_LETTERS, you can do the following:
>>> from maha.cleaners.functions import contains
>>> sample_text = "أنا وأخي علي في المكتبةِ نَدرسُ البرمجه"
>>> contains(sample_text, arabic_letters=True)
True
To check if text contains
ARABIC_LETTERSandENGLISH_LETTERS, you can do the following:
>>> from maha.cleaners.functions import contains
>>> sample_text = "أنا وأخي علي في المكتبةِ نَدرسُ البرمجه"
>>> contains(sample_text, arabic_letters=True, english_letters=True)
{'arabic_letters': True, 'english_letters': False}
Cleaners are robust against most real case scenarios and can output clean text. Check the following example:
>>> from maha.cleaners.functions import remove
>>> sample_text = "#بسم_الله_الرحمن_الرحيم"
>>> remove(sample_text, punctuations=True)
'بسم الله الرحمن الرحيم'
In general all cleaners provide the same interface.
Check the full list of functions here
How cleaners work¶
Cleaners rely on regular expressions to clean text. All regular expressions are constructed from your inputs on the fly. They are efficient and fast. We use regex library instead of the built-in re module.
Note
If you need to clean a large amount of text or files, it is recommended to use
processors instead.
Parsers¶
This module provides a number of rules for extracting useful values from Arabic text.
This includes extracting constants and expressions
in addition to extracting specific dimensions such as extracting duration, numeral, or time.
Tip
Before parsing a text, it is recommended to either remove all harakat
using remove() or keep arabic letters only using keep().
The module provides the following simple interfaces to parse text.
The
parse()function is used to parse any of the availableconstantsandexpressions. It is also possible to use custom constants/expressions.The
parse_dimension()function is used to parse any of the available dimensions. The available dimensions areduration,time,numeral,ordinalandnames. To create your own dimension, check out custom dimension. If you would like to contribute your custom dimension to the project, check contribution guidelines.
Examples¶
To extract all
ARABICcharacters, you can do the following:
>>> from maha.parsers.functions import parse
>>> sample_text = "أنا وأخي في المكتبةِ نَدرسُ برمجه Python "
>>> parse(sample_text, arabic=True, include_space=True)
[Dimension(body=أنا وأخي في المكتبةِ نَدرسُ برمجه , value=أنا وأخي في المكتبةِ نَدرسُ برمجه , start=0, end=34, dimension_type=DimensionType.ARABIC)]
To extract emails, you can do the following:
>>> from maha.parsers.functions import parse
>>> sample_text = "الإيميل: test@example.com أو test2@example.com"
>>> parse(sample_text, emails=True)
[Dimension(body=test@example.com, value=test@example.com, start=9, end=25, dimension_type=DimensionType.EMAILS), Dimension(body=test2@example.com, value=test2@example.com, start=29, end=46, dimension_type=DimensionType.EMAILS)]
To extract arabic hashtags and all mentions, you can do the following:
>>> from maha.parsers.functions import parse
>>> sample_text = "#مها #maha @Maha @مها"
>>> parse(sample_text, arabic_hashtags=True, mentions=True)
[Dimension(body=#مها, value=#مها, start=0, end=4, dimension_type=DimensionType.ARABIC_HASHTAGS), Dimension(body=@Maha, value=@Maha, start=11, end=16, dimension_type=DimensionType.MENTIONS), Dimension(body=@مها, value=@مها, start=17, end=21, dimension_type=DimensionType.MENTIONS)]
To extract arabic hashtags and all mentions, you can do the following:
>>> from maha.parsers.functions import parse
>>> sample_text = "#مها #maha @Maha @مها"
>>> parse(sample_text, arabic_hashtags=True, mentions=True)
[Dimension(body=#مها, value=#مها, start=0, end=4, dimension_type=DimensionType.ARABIC_HASHTAGS), Dimension(body=@Maha, value=@Maha, start=11, end=16, dimension_type=DimensionType.MENTIONS), Dimension(body=@مها, value=@مها, start=17, end=21, dimension_type=DimensionType.MENTIONS)]
To extract numerals, you can do the following:
>>> from maha.parsers.functions import parse_dimension
>>> parse_dimension("العام ثلاثمئة وستة وخمسون يوما او 12 شهرا", numeral=True)
[Dimension(body=ثلاثمئة وستة وخمسون, value=356, start=6, end=25, dimension_type=DimensionType.NUMERAL), Dimension(body=12, value=12, start=34, end=36, dimension_type=DimensionType.NUMERAL)]
>>> parse_dimension("مليون واربعمية وستة عشر", numeral=True)[0].value
1000416
>>> parse_dimension("خمسين فاصلة ثلاثة واربعين", numeral=True)[0].value
50.43
>>> parse_dimension("تسعة عشر الف ومئة", numeral=True)[0].value
19100
>>> parse_dimension(
... "تسع وتسعين مليار وتسعة وتسعين الف وتسع مائة وتسع وتسعين ", numeral=True
... )[0].value
99000099999
>>> parse_dimension("100 وخمسين", numeral=True)[0].value
150
>>> parse_dimension("100 و1", numeral=True)[0].value
101
To extract duration, you can do the following:
>>> from maha.parsers.functions import parse_dimension
>>> parse_dimension("العام ثلاثمئة وستة وخمسون يوما او 12 شهرا", duration=True)
[Dimension(body=ثلاثمئة وستة وخمسون يوما, value=DurationValue(values=[ValueUnit(value=356, unit=<DurationUnit.DAYS: 4>)], normalized_unit=<DurationUnit.SECONDS: 1>), start=6, end=30, dimension_type=DimensionType.DURATION), Dimension(body=12 شهرا, value=DurationValue(values=[ValueUnit(value=12, unit=<DurationUnit.MONTHS: 6>)], normalized_unit=<DurationUnit.SECONDS: 1>), start=34, end=41, dimension_type=DimensionType.DURATION)]
>>> parse_dimension("مليون ثانية ودقيقة", duration=True)[0].value
DurationValue(values=[ValueUnit(value=1, unit=<DurationUnit.MINUTES: 2>), ValueUnit(value=1000000, unit=<DurationUnit.SECONDS: 1>)], normalized_unit=<DurationUnit.SECONDS: 1>)
>>> parse_dimension("مليون ثانية ودقيقة", duration=True)[0].value.normalized_value
ValueUnit(value=1000060, unit=<DurationUnit.SECONDS: 1>)
>>> parse_dimension("يوم ونصف دقيقة", duration=True)[0].value
DurationValue(values=[ValueUnit(value=1, unit=<DurationUnit.DAYS: 4>), ValueUnit(value=0.5, unit=<DurationUnit.MINUTES: 2>)], normalized_unit=<DurationUnit.SECONDS: 1>)
>>> parse_dimension("ساعة الا ثلث", duration=True)[0].value.normalized_value
ValueUnit(value=2400, unit=<DurationUnit.SECONDS: 1>)
>>> parse_dimension("يومين و10 ساعات", duration=True)[0].value.values
[ValueUnit(value=2, unit=<DurationUnit.DAYS: 4>), ValueUnit(value=10, unit=<DurationUnit.HOURS: 3>)]
To extract ordinal, you can do the following:
>>> from maha.parsers.functions import parse_dimension
>>> parse_dimension("الأول", ordinal=True)
[Dimension(body=الأول, value=1, start=0, end=5, dimension_type=DimensionType.ORDINAL)]
>>> parse_dimension("الثالث والرابع", ordinal=True)
[Dimension(body=الثالث, value=3, start=0, end=6, dimension_type=DimensionType.ORDINAL), Dimension(body=الرابع, value=4, start=8, end=14, dimension_type=DimensionType.ORDINAL)]
>>> parse_dimension("الخمسين", ordinal=True)[0].value
50
>>> parse_dimension("الثاني والمئة", ordinal=True)
[Dimension(body=الثاني, value=2, start=0, end=6, dimension_type=DimensionType.ORDINAL), Dimension(body=المئة, value=100, start=8, end=13, dimension_type=DimensionType.ORDINAL)]
>>> parse_dimension("المليون والعشرين", ordinal=True)
[Dimension(body=المليون, value=1000000, start=0, end=7, dimension_type=DimensionType.ORDINAL), Dimension(body=العشرين, value=20, start=9, end=16, dimension_type=DimensionType.ORDINAL)]
>>> parse_dimension("الألف والمئتين", ordinal=True)
[Dimension(body=الألف, value=1000, start=0, end=5, dimension_type=DimensionType.ORDINAL), Dimension(body=المئتين, value=200, start=7, end=14, dimension_type=DimensionType.ORDINAL)]
To extract time, you can do the following:
>>> from maha.parsers.functions import parse_dimension
>>> from datetime import datetime
>>> now = datetime(2021, 9, 15)
>>> parse_dimension("غدا", time=True)
[Dimension(body=غدا, value=TimeValue(days=1), start=0, end=3, dimension_type=DimensionType.TIME)]
>>> parse_dimension("غدا الساعة الرابعة مساء", time=True)[0].value + now
datetime.datetime(2021, 9, 16, 16, 0)
>>> parse_dimension("السنة القادمة الموافق 15/9", time=True)[0].value + now
datetime.datetime(2022, 9, 15, 0, 0)
>>> parse_dimension("عام الفين وواحد وعشرين", time=True)[0].value
TimeValue(year=2021)
>>> (
... parse_dimension("الأسبوع الماضي يوم السبت الساعة 11 و30 دقيقة", time=True)[0].value
... + now
... )
datetime.datetime(2021, 9, 11, 11, 30)
>>> output = parse_dimension("الشهر القادم", time=True)[0].value
>>> output.is_months_set()
True
To extract names, you can do the following:
>>> from maha.parsers.functions import parse_dimension
>>> parse_dimension("محمد", names=True)
[Dimension(body=محمد, value=محمد, start=0, end=4, dimension_type=DimensionType.NAME)]
>>> parse_dimension("مريم", names=True)
[Dimension(body=مريم, value=مريم, start=0, end=4, dimension_type=DimensionType.NAME)]
>>> parse_dimension("اسمي فاطمة", names=True)
[Dimension(body=فاطمة, value=فاطمة, start=5, end=10, dimension_type=DimensionType.NAME)]
>>> parse_dimension("انا آنا", names=True)
[Dimension(body=آنا, value=آنا, start=4, end=7, dimension_type=DimensionType.NAME)]
>>> parse_dimension("ادهم في المكتبة", names=True)
[Dimension(body=ادهم, value=ادهم, start=0, end=4, dimension_type=DimensionType.NAME)]
>>> parse_dimension("جاك", names=True)
[Dimension(body=جاك, value=جاك, start=0, end=3, dimension_type=DimensionType.NAME)]
How parsers work¶
Parsers also rely on regular expressions
to extract values from text. Dimension parsers use very sophisticated regular expressions,
which we call Rules. Rules are defined in maha.parsers.rules. All rules are
optimized, compiled and cached to ensure best performance possible.
Rules can be easily extended, check custom dimension.
Processors¶
This module extends the functionality of cleaners to work on files and folders. It provides a way to process files in a streaming fashion, which is useful for large files. Processors are developed to extend any future classes that can work on files and folders.
The available processors are:
TextProcessor: Processes simple texts.FileProcessor: Processes small files.StreamTextProcessor: Processes stream of texts.StreamFileProcessor: Processes files in a streaming fashion.
Examples¶
To process a simple text, you can do the following:
>>> from maha.processors import TextProcessor
>>> sample_text = """
... بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
... الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
... الرَّحْمَنِ الرَّحِيمِ
... مَالِكِ يَوْمِ الدِّينِ
... إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ
... اهدِنَا الصِّرَاطَ الْمُسْتَقِيمَ
... صِرَاطَ الَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ الْمَغْضُوبِ عَلَيْهِمْ وَلاَ الضَّالِّينَ
... """
>>> processor = TextProcessor.from_text(sample_text, sep="\n")
>>> cleaned_text = (
... processor.normalize(alef=True).keep(arabic_letters=True).drop_empty_lines().text
... )
>>> print(cleaned_text)
بسم الله الرحمن الرحيم
الحمد لله رب العالمين
الرحمن الرحيم
مالك يوم الدين
اياك نعبد واياك نستعين
اهدنا الصراط المستقيم
صراط الذين انعمت عليهم غير المغضوب عليهم ولا الضالين
>>> # To print the unique characters
>>> unique_char = processor.get(unique_characters=True)
>>> unique_char.sort()
>>> unique_char
[' ', 'ا', 'ب', 'ت', 'ح', 'د', 'ذ', 'ر', 'س', 'ص', 'ض', 'ط', 'ع', 'غ', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ي']
To streamly process a file, you can do the following:
>>> from pathlib import Path
>>> import maha
>>> # This is a sample file that comes with Maha.
>>> resource_path = Path(maha.__file__).parents[1] / "sample_data/surah_al-ala.txt"
>>> from maha.processors import StreamFileProcessor
>>> proc = StreamFileProcessor(resource_path)
>>> proc = proc.normalize(all=True).keep(arabic_letters=True).drop_empty_lines()
>>> # To start processing the file, run the following commented code.
>>> # proc.process_and_save(Path("output.txt"))