`maha.cleaners.functions`#

Submodules#

Package Contents#

Functions#

`contains`(text[, arabic, english, ...])	Check for certain characters, strings or patterns in the given text.
`contains_expressions`(text, expressions)	Check for matched strings in the given `text` using the input `expressions`
`contain_strings`(text, strings)	Check for the input `strings` in the given `text`
`contains_repeated_substring`(text[, min_repeated])	Check for consecutive substrings that are repeated at least `min_repeated` times.
`contains_single_letter_word`(text[, ...])	Check for a single-letter word.
`keep`(text[, arabic, english, ...])	Keeps only certain characters in the given text and removes everything else.
`keep_strings`(text, strings[, use_space])	Keeps only the input strings `strings` in the given text `text`
`keep_arabic_letters`(text)	Keeps only Arabic letters `ARABIC_LETTERS` in the given text.
`keep_arabic_characters`(text)	Keeps only common Arabic characters `ARABIC` in the given text.
`keep_arabic_with_english_numbers`(text)	Keeps only common Arabic characters `ARABIC` and English numbers `ENGLISH_NUMBERS` in the given text.
`keep_arabic_letters_with_harakat`(text)	Keeps only Arabic letters `ARABIC_LETTERS` and harakat `HARAKAT` in the given text.
`normalize`(text[, lam_alef, alef, waw, yeh, ...])	Normalizes characters in the given text
`normalize_lam_alef`(text[, keep_hamza])	Normalize `LAM_ALEF_VARIATIONS` to `LAM_ALEF_VARIATIONS_NORMALIZED` if `keep_hamza` is True; otherwise, normalize to `LAM` and `ALEF`.
`normalize_small_alef`(text[, keep_madda, ...])	Normalize `ALEF_SUPERSCRIPT` to `ALEF`.
`numbers_to_text`(text[, accusative])	Converts numbers in text to their equivalent text in Arabic.
`remove`(text[, arabic, english, ...])	Removes certain characters from the given text.
`remove_strings`(text, strings[, use_space])	Removes the input strings `strings` in the given text `text`
`remove_extra_spaces`(text[, max_spaces])	Keeps a maximum of `max_spaces` number of spaces when extra spaces are present (more than one space)
`remove_punctuations`(text)	Removes all punctuations `PUNCTUATIONS` from the given text.
`remove_english`(text)	Removes all English characters `ENGLISH` from the given text.
`remove_all_harakat`(text)	Removes all harakat `ALL_HARAKAT` from the given text.
`remove_harakat`(text)	Removes common harakat `HARAKAT` from the given text.
`remove_numbers`(text)	Removes all numbers `NUMBERS` from the given text.
`remove_tatweel`(text)	Removes tatweel symbol `TATWEEL` from the given text.
`remove_expressions`(text, patterns[, ...])	Removes matched characters from the given text `text` using input patterns `patterns`
`remove_emails`(text)	Removes emails using pattern `EXPRESSION_EMAILS` from the given text.
`remove_hashtags`(text)	Removes hashtags (strings that start with # symbol) using pattern `EXPRESSION_HASHTAGS` from the given text.
`remove_links`(text)	Removes links using pattern `EXPRESSION_LINKS` from the given text.
`remove_mentions`(text)	Removes mentions (strings that start with @ symbol) using pattern `EXPRESSION_MENTIONS` from the given text.
`reduce_repeated_substring`(text[, ...])	Reduces consecutive substrings that are repeated at least `min_repeated` times to `reduce_to` times.
`remove_hash_keep_tag`(text)	Removes the hash symbol `HASHTAG` from all hashtags in the given text.
`remove_arabic_letter_dots`(text)	Remove dots from `ARABIC_LETTERS` in the given `text` using the `ARABIC_DOTLESS_MAP`
`replace`(text, strings, with_value)	Replaces the input `strings` in the given text with the given value
`replace_except`(text, strings, with_value)	Replaces everything except the input `strings` in the given text with the given value
`replace_pairs`(text, keys, values)	Replaces each key with its corresponding value in the given text
`replace_expression`(text, expression, with_value)	Matches characters from the input text using the given `expression` and replaces all matched characters with the given value.
`arabic_numbers_to_english`(text)	Converts Arabic numbers `ARABIC_NUMBERS` to the corresponding English numbers `ENGLISH_NUMBERS`
`connect_single_letter_word`(text[, waw, feh, ...])	Connects single-letter word with the letter following it.

contains(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator=None)[source]#

Check for certain characters, strings or patterns in the given text.

To add a new parameter, make sure that its name is the same as the corresponding constant. For patterns, the parameter name is the constant name without the EXPRESSION_ prefix.

Parameters

text (str) – Text to check
arabic (bool, optional) – Check for ARABIC characters, by default False
english (bool, optional) – Check for ENGLISH characters, by default False
arabic_letters (bool, optional) – Check for ARABIC_LETTERS characters, by default False
english_letters (bool, optional) – Check for ENGLISH_LETTERS characters, by default False
english_small_letters (bool, optional) – Check for ENGLISH_SMALL_LETTERS characters, by default False
english_capital_letters (bool, optional) – Check for ENGLISH_CAPITAL_LETTERS characters, by default False
numbers (bool, optional) – Check for NUMBERS characters, by default False
harakat (bool, optional) – Check for HARAKAT characters, by default False
all_harakat (bool, optional) – Check for ALL_HARAKAT characters, by default False
tatweel (bool, optional) – Check for TATWEEL character, by default False
lam_alef_variations (bool, optional) – Check for LAM_ALEF_VARIATIONS characters, by default False
lam_alef (bool, optional) – Check for LAM_ALEF character, by default False
punctuations (bool, optional) – Check for PUNCTUATIONS characters, by default False
arabic_numbers (bool, optional) – Check for ARABIC_NUMBERS characters, by default False
english_numbers (bool, optional) – Check for ENGLISH_NUMBERS characters, by default False
arabic_punctuations (bool, optional) – Check for ARABIC_PUNCTUATIONS characters, by default False
english_punctuations (bool, optional) – Check for ENGLISH_PUNCTUATIONS characters, by default False
arabic_ligatures (bool, optional) – Check for ARABIC_LIGATURES words, by default False
persian (bool, optional) – Check for PERSIAN characters, by default False
arabic_hashtags (bool, optional) – Check for Arabic hashtags using the expression EXPRESSION_ARABIC_HASHTAGS, by default False
arabic_mentions (bool, optional) – Check for Arabic mentions using the expression EXPRESSION_ARABIC_MENTIONS, by default False
emails (bool, optional) – Check for emails using the expression EXPRESSION_EMAILS, by default False
english_hashtags (bool, optional) – Check for English hashtags using the expression EXPRESSION_ENGLISH_HASHTAGS, by default False
english_mentions (bool, optional) – Check for English mentions using the expression EXPRESSION_ENGLISH_MENTIONS, by default False
hashtags (bool, optional) – Check for hashtags using the expression EXPRESSION_HASHTAGS, by default False
links (bool, optional) – Check for links using the expression EXPRESSION_LINKS, by default False
mentions (bool, optional) – Check for mentions using the expression EXPRESSION_MENTIONS, by default False
emojis (bool, optional) – Check for emojis using the expression EXPRESSION_EMOJIS, by default False
custom_strings (Union[List[str], str], optional) – Include any other string(s), by default None
custom_expressions (ExpressionGroup | Expression | None) – Include any other expressions, by default None
operator (str, optional) – When multiple arguments are set to True, this operator is used to combine the output into a single boolean. Takes ‘and’ or ‘or’, by default None

Returns

If one argument is set to True, a boolean value is returned. True if the text contains it, False otherwise.
If operator is set and more than one argument is set to True, a boolean value that combines the results with the given “and”/“or” operator is returned.
If more than one argument is set to True and operator is not set, a dictionary is returned where the keys are the arguments passed as True and the values are booleans: True if the text contains the argument, False otherwise.

Return type

Union[Dict[str, bool], bool]

Raises

ValueError – If no argument is set to True

Examples

>>> from maha.cleaners.functions import contains
>>> text = "مقاييس أداء النماذج في التعلم الآلي Machine Learning ... 🌺"
>>> contains(text, english=True, emails=True, emojis=True)
{'english': True, 'emails': False, 'emojis': True}

>>> from maha.cleaners.functions import contains
>>> text = "قال رسول اللهﷺ إن خير أيامكم يوم الجمعة فأكثروا عليَّ من الصلاة فيه"
>>> contains(text, english=True)
False
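The three return shapes described above can be sketched in plain Python. This is only an illustration of the documented behaviour, not maha's implementation: combine_results is a hypothetical helper name, and the input dictionary is simply the output of the first example.

```python
def combine_results(results, operator=None):
    """Mimic the documented return shapes of contains() (hypothetical helper)."""
    if not results:
        raise ValueError("At least one argument should be True")
    if len(results) == 1:
        # A single check yields a plain boolean.
        return next(iter(results.values()))
    if operator is None:
        # Several checks without an operator yield one boolean per check.
        return dict(results)
    if operator == "and":
        return all(results.values())
    if operator == "or":
        return any(results.values())
    raise ValueError("operator must be 'and', 'or' or None")


checks = {"english": True, "emails": False, "emojis": True}
print(combine_results(checks))                  # per-check dictionary
print(combine_results(checks, operator="and"))  # False
print(combine_results(checks, operator="or"))   # True
```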

contains_expressions(text, expressions)[source]#

Check for matched strings in the given text using the input expressions

Note

Use lookahead/lookbehind when substrings should not be captured or removed.

Parameters

text (str) – Text to check
expressions (Union[ExpressionGroup, Expression, str]) – Expression(s) to use

Returns

True if the pattern is found in the given text, False otherwise.

Return type

bool

Raises

ValueError – If expressions are not of type Expression, ExpressionGroup or str

Example

>>> from maha.cleaners.functions import contains_expressions
>>> text = "علم الهندسة (Engineering)"
>>> contains_expressions(text, r"\([A-Za-z]+\)")
True

contain_strings(text, strings)[source]#

Check for the input strings in the given text

Parameters

text (str) – Text to check
strings (Union[List[str], str]) – String or list of strings to check for

Returns

True if the input string(s) are found in the text, False otherwise

Return type

bool

Raises

ValueError – If no strings are provided

Example

>>> from maha.cleaners.functions import contain_strings
>>> text = "الله أكبر، الحمد لله رب العالمين"
>>> contain_strings(text, "الله")
True

contains_repeated_substring(text, min_repeated=3)[source]#

Check for consecutive substrings that are repeated at least min_repeated times. For example with the default arguments, the text ‘hhhhhh’ should return True

Parameters

text (str) – Text to check
min_repeated (int, optional) – Minimum number of consecutive repeated substring to consider, by default 3

Returns

True if the input text contains consecutively repeated substrings, False otherwise

Return type

bool

Raises

ValueError – If a non-positive integer is passed

Example

>>> from maha.cleaners.functions import contains_repeated_substring
>>> text = "كانت اللعبة حللللللللوة جداً"
>>> contains_repeated_substring(text)
True
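One way to express "a substring repeated at least min_repeated times in a row" is a regular expression with a backreference. The sketch below uses only the standard re module; has_repeated_substring is a hypothetical name, and the pattern is an assumption about how such a check could be written, not maha's own code.

```python
import re


def has_repeated_substring(text, min_repeated=3):
    """Return True if any substring occurs min_repeated or more times in a row."""
    if not isinstance(min_repeated, int) or min_repeated < 1:
        raise ValueError("min_repeated must be a positive integer")
    # (.+?) captures a candidate substring; \1{n,} requires n more copies of it.
    pattern = r"(.+?)\1{%d,}" % (min_repeated - 1)
    return re.search(pattern, text) is not None


print(has_repeated_substring("hhhhhh"))  # True
print(has_repeated_substring("hahaha"))  # True, "ha" occurs three times in a row
print(has_repeated_substring("hello"))   # False, "l" occurs only twice
```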

contains_single_letter_word(text, arabic_letters=False, english_letters=False)[source]#

Check for a single-letter word. For example, “how r u” should return True if english_letters is set to True because it contains two single-letter words, “r” and “u”.

Parameters

text (str) – Text to check
arabic_letters (bool, optional) – Check for all ARABIC_LETTERS, by default False
english_letters (bool, optional) – Check for all ENGLISH_LETTERS, by default False

Returns

True if the input text contains a single-letter word, False otherwise

Return type

bool

Raises

ValueError – If no argument is set to True

Example

>>> from maha.cleaners.functions import contains_single_letter_word
>>> text = "cu later my friend, ك"
>>> contains_single_letter_word(text, arabic_letters=True, english_letters=True)
True
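The idea of a single-letter word can be illustrated with lookarounds from the standard re module. The sketch assumes that a word is delimited by whitespace (or the start or end of the text), which may differ from maha's definition; has_single_letter_word is a hypothetical name.

```python
import re
import string


def has_single_letter_word(text, letters):
    """Return True if one of `letters` stands alone between whitespace."""
    if not letters:
        raise ValueError("no letters provided")
    letter_class = "[" + re.escape("".join(letters)) + "]"
    # (?<!\S) and (?!\S) assert that no non-space character touches the letter.
    pattern = r"(?<!\S)" + letter_class + r"(?!\S)"
    return re.search(pattern, text) is not None


print(has_single_letter_word("how r u", string.ascii_letters))             # True
print(has_single_letter_word("cu later my friend", string.ascii_letters))  # False
```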

keep(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, use_space=True, custom_strings=None)[source]#

Keeps only certain characters in the given text and removes everything else.

To add a new parameter, make sure that its name is the same as the corresponding constant.

Parameters

text (str) – Text to be processed
arabic (bool, optional) – Keep ARABIC characters, by default False
english (bool, optional) – Keep ENGLISH characters, by default False
arabic_letters (bool, optional) – Keep ARABIC_LETTERS characters, by default False
english_letters (bool, optional) – Keep ENGLISH_LETTERS characters, by default False
english_small_letters (bool, optional) – Keep ENGLISH_SMALL_LETTERS characters, by default False
english_capital_letters (bool, optional) – Keep ENGLISH_CAPITAL_LETTERS characters, by default False
numbers (bool, optional) – Keep NUMBERS characters, by default False
harakat (bool, optional) – Keep HARAKAT characters, by default False
all_harakat (bool, optional) – Keep ALL_HARAKAT characters, by default False
punctuations (bool, optional) – Keep PUNCTUATIONS characters, by default False
arabic_numbers (bool, optional) – Keep ARABIC_NUMBERS characters, by default False
english_numbers (bool, optional) – Keep ENGLISH_NUMBERS characters, by default False
arabic_punctuations (bool, optional) – Keep ARABIC_PUNCTUATIONS characters, by default False
english_punctuations (bool, optional) – Keep ENGLISH_PUNCTUATIONS characters, by default False
use_space (bool, optional) – False to not replace with space, check keep_strings() for more information, by default True
custom_strings (List[str], optional) – Include any other string(s), by default None

Returns

Processed text

Return type

str

Raises

ValueError – If no argument is set to True

Example

>>> from maha.cleaners.functions import keep
>>> text = "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
>>> keep(text, arabic_letters=True)
'بسم الله الرحمن الرحيم'

keep_strings(text, strings, use_space=True)[source]#

Keeps only the input strings (strings) in the given text (text).

By default, this works by replacing all strings except the input strings with a space, which means space is kept. This is to help separate texts when unwanted strings are present without spaces. For example, ‘end.start’ will be converted to ‘end start’ if English letters ENGLISH_LETTERS are passed to strings. To disable this behavior, set use_space to False.

Note

Extra spaces (more than one space) are removed by default if use_space is set to True.

Parameters

text (str) – Text to be processed
strings (Union[List[str], str]) – list of strings to keep
use_space (bool) – False to not replace with space, defaults to True

Returns

Text that contains only the input strings.

Return type

str

Raises

ValueError – If no strings are provided

Example

>>> from maha.cleaners.functions import keep_strings
>>> text = "لا حول ولا قوة إلا بالله"
>>> keep_strings(text, "الله")
'الله'
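The effect of use_space can be sketched with the standard re module. The function below is an illustrative approximation, not maha's implementation: keep_only is a hypothetical name, and stripping leading and trailing spaces is an assumption based on the example output above.

```python
import re
import string


def keep_only(text, strings, use_space=True):
    """Keep only the given strings; each removed run becomes one space if use_space."""
    if isinstance(strings, str):
        strings = [strings]
    strings = [s for s in strings if s]
    if not strings:
        raise ValueError("no strings provided")
    # Longest strings first, so a longer string wins over its own prefix.
    wanted = "|".join(re.escape(s) for s in sorted(strings, key=len, reverse=True))
    filler = " " if use_space else ""
    pieces = []
    position = 0
    for match in re.finditer(wanted, text):
        if match.start() > position:
            pieces.append(filler)  # something unwanted sat between two kept strings
        pieces.append(match.group())
        position = match.end()
    if position < len(text):
        pieces.append(filler)
    result = "".join(pieces)
    return result.strip() if use_space else result


letters = list(string.ascii_lowercase)
print(keep_only("end.start", letters))                   # end start
print(keep_only("end.start", letters, use_space=False))  # endstart
```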

keep_arabic_letters(text)[source]#

Keeps only Arabic letters ARABIC_LETTERS in the given text.

Parameters: text (str) – Text to be processed
Returns: Text that contains Arabic letters only.
Return type: str

Example

>>> from maha.cleaners.functions import keep_arabic_letters
>>> text = " 1 يا أحلى mathematicians في العالم"
>>> keep_arabic_letters(text)
'يا أحلى في العالم'

keep_arabic_characters(text)[source]#

Keeps only common Arabic characters ARABIC in the given text.

Parameters: text (str) – Text to be processed
Returns: Text that contains the common Arabic characters only.
Return type: str

Example

>>> from maha.cleaners.functions import keep_arabic_characters
>>> text = "أَلمَانِيَا (بالألمانية: Deutschland) رسمِيّاً جُمهُورِيَّة أَلمَانِيَا الاِتِّحَاديَّة"
>>> keep_arabic_characters(text)
'أَلمَانِيَا بالألمانية رسمِيّاً جُمهُورِيَّة أَلمَانِيَا الاِتِّحَاديَّة'

keep_arabic_with_english_numbers(text)[source]#

Keeps only common Arabic characters ARABIC and English numbers ENGLISH_NUMBERS in the given text.

Parameters: text (str) – Text to be processed
Returns: Text that contains the common Arabic characters and English numbers only.
Return type: str

Example

>>> from maha.cleaners.functions import keep_arabic_with_english_numbers
>>> text = "تتكون من 16 ولاية تُغطي مساحة 357,021 كيلومتر Deutschland"
>>> keep_arabic_with_english_numbers(text)
'تتكون من 16 ولاية تُغطي مساحة 357 021 كيلومتر'

keep_arabic_letters_with_harakat(text)[source]#

Keeps only Arabic letters ARABIC_LETTERS and harakat HARAKAT in the given text.

Parameters: text (str) – Text to be processed
Returns: Text that contains Arabic letters with harakat only.
Return type: str

Example

>>> from maha.cleaners.functions import keep_arabic_letters_with_harakat
>>> text = "إنّ في التّركِ قوة…"
>>> keep_arabic_letters_with_harakat(text)
'إنّ في التّركِ قوة'

normalize(text, lam_alef=None, alef=None, waw=None, yeh=None, teh_marbuta=None, ligatures=None, spaces=None, all=False)[source]#

Normalizes characters in the given text

Parameters

text (str) – Text to process
lam_alef (bool, optional) – Normalize LAM_ALEF_VARIATIONS characters to LAM and ALEF, by default None
alef (bool, optional) – Normalize ALEF_VARIATIONS characters to ALEF, by default None
waw (bool, optional) – Normalize WAW_VARIATIONS characters to WAW, by default None
yeh (bool, optional) – Normalize YEH_VARIATIONS characters to YEH and ALEF, by default None
teh_marbuta (bool, optional) – Normalize TEH_MARBUTA characters to HEH, by default None
ligatures (bool, optional) – Normalize ARABIC_LIGATURES characters to the corresponding indices in ARABIC_LIGATURES_NORMALIZED, by default None
spaces (bool, optional) – Normalize space variations using the expression EXPRESSION_ALL_SPACES, by default None
all (bool, optional) – Do all normalization except the ones that are set to False, by default False

Returns

Processed text

Return type

str

Raises

ValueError – If no argument is set to True

Examples

>>> from maha.cleaners.functions import normalize
>>> text = "عن أبي هريرة"
>>> normalize(text, alef=True, teh_marbuta=True)
'عن ابي هريره'

>>> from maha.cleaners.functions import normalize
>>> text = "قال رسول الله ﷺ"
>>> normalize(text, ligatures=True)
'قال رسول الله صلى الله عليه وسلم'

>>> from maha.cleaners.functions import normalize
>>> text = "قال مؤمن: ﷽ قل هو ﷲ أحد"
... # For space
>>> normalize(text, all=True, waw=False)
'قال مؤمن: بسم الله الرحمن الرحيم قل هو الله احد'

normalize_lam_alef(text, keep_hamza=True)[source]#

Normalize LAM_ALEF_VARIATIONS to LAM_ALEF_VARIATIONS_NORMALIZED if keep_hamza is True. Otherwise, normalize to LAM and ALEF.

Parameters

text (str) – Text to process
keep_hamza (bool, optional) – True to preserve hamza and madda characters, by default True

Returns

Normalized text

Return type

str

Examples

>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "السﻻم عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَﻷلأ وَجْهُه"
>>> normalize_lam_alef(text)
'السلام عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَلألأ وَجْهُه'

>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "اﻵن يا أصحابي"
>>> normalize_lam_alef(text, keep_hamza=False)
'الان يا أصحابي'
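The difference between the two keep_hamza modes can be sketched with str.translate. The table below lists the Unicode lam-alef presentation-form ligatures (U+FEF5 to U+FEFC); whether maha's LAM_ALEF_VARIATIONS covers exactly this set is an assumption, and split_lam_alef is a hypothetical name.

```python
LAM = "\u0644"
ALEF = "\u0627"

# Each ligature (isolated and final form) mapped to LAM plus its ALEF variant.
LIGATURE_TO_LETTERS = {
    "\uFEF5": LAM + "\u0622",  # alef with madda above, isolated form
    "\uFEF6": LAM + "\u0622",  # alef with madda above, final form
    "\uFEF7": LAM + "\u0623",  # alef with hamza above, isolated form
    "\uFEF8": LAM + "\u0623",  # alef with hamza above, final form
    "\uFEF9": LAM + "\u0625",  # alef with hamza below, isolated form
    "\uFEFA": LAM + "\u0625",  # alef with hamza below, final form
    "\uFEFB": LAM + ALEF,      # plain alef, isolated form
    "\uFEFC": LAM + ALEF,      # plain alef, final form
}


def split_lam_alef(text, keep_hamza=True):
    """Replace lam-alef ligatures with separate letters (illustrative sketch)."""
    table = {
        ord(ligature): (letters if keep_hamza else LAM + ALEF)
        for ligature, letters in LIGATURE_TO_LETTERS.items()
    }
    return text.translate(table)


print(split_lam_alef("\uFEF5") == LAM + "\u0622")               # True
print(split_lam_alef("\uFEF5", keep_hamza=False) == LAM + ALEF)  # True
```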

normalize_small_alef(text, keep_madda=True, normalize_end=False)[source]#

Normalize ALEF_SUPERSCRIPT to ALEF. If keep_madda is True and ALEF_SUPERSCRIPT is followed by HAMZA_ABOVE, then normalize to ALEF_MADDA_ABOVE

Parameters

text (str) – Text to process
keep_madda (bool, optional) – True to preserve madda character, by default True
normalize_end (bool, optional) – True to normalize ALEF_SUPERSCRIPT that appears at the end of a word, by default False

Returns

Normalized text

Return type

str

Example

>>> from maha.cleaners.functions import normalize_small_alef
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا"
>>> normalize_small_alef(text)
'وَٱلصَّآفَّاتِ صَفّٗا'

numbers_to_text(text, accusative=False)[source]#

Converts numbers in text to their equivalent text in Arabic.

Parameters

text (str) – Text with numbers to be converted.
accusative (bool, optional) – If True, the number will be converted to its accusative form, by default False

Returns

Text with numbers converted to their equivalent text in Arabic.

Return type

str

remove(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, use_space=True, custom_strings=None, custom_expressions=None)[source]#

Removes certain characters from the given text.

To add a new parameter, make sure that its name is the same as the corresponding constant. For patterns, the parameter name is the constant name without the EXPRESSION_ prefix.

Parameters

text (str) – Text to be processed
arabic (bool, optional) – Remove ARABIC characters, by default False
english (bool, optional) – Remove ENGLISH characters, by default False
arabic_letters (bool, optional) – Remove ARABIC_LETTERS characters, by default False
english_letters (bool, optional) – Remove ENGLISH_LETTERS characters, by default False
english_small_letters (bool, optional) – Remove ENGLISH_SMALL_LETTERS characters, by default False
english_capital_letters (bool, optional) – Remove ENGLISH_CAPITAL_LETTERS characters, by default False
numbers (bool, optional) – Remove NUMBERS characters, by default False
harakat (bool, optional) – Remove HARAKAT characters, by default False
all_harakat (bool, optional) – Remove ALL_HARAKAT characters, by default False
tatweel (bool, optional) – Remove TATWEEL character, by default False
punctuations (bool, optional) – Remove PUNCTUATIONS characters, by default False
arabic_numbers (bool, optional) – Remove ARABIC_NUMBERS characters, by default False
english_numbers (bool, optional) – Remove ENGLISH_NUMBERS characters, by default False
arabic_punctuations (bool, optional) – Remove ARABIC_PUNCTUATIONS characters, by default False
english_punctuations (bool, optional) – Remove ENGLISH_PUNCTUATIONS characters, by default False
arabic_ligatures (bool, optional) – Remove ARABIC_LIGATURES words, by default False
arabic_hashtags (bool, optional) – Remove Arabic hashtags using the expression EXPRESSION_ARABIC_HASHTAGS, by default False
arabic_mentions (bool, optional) – Remove Arabic mentions using the expression EXPRESSION_ARABIC_MENTIONS, by default False
emails (bool, optional) – Remove emails using the expression EXPRESSION_EMAILS, by default False
english_hashtags (bool, optional) – Remove English hashtags using the expression EXPRESSION_ENGLISH_HASHTAGS, by default False
english_mentions (bool, optional) – Remove English mentions using the expression EXPRESSION_ENGLISH_MENTIONS, by default False
hashtags (bool, optional) – Remove hashtags using the expression EXPRESSION_HASHTAGS, by default False
links (bool, optional) – Remove links using the expression EXPRESSION_LINKS, by default False
mentions (bool, optional) – Remove mentions using the expression EXPRESSION_MENTIONS, by default False
emojis (bool, optional) – Remove emojis using the expression EXPRESSION_EMOJIS, by default False
use_space (bool, optional) – False to not replace with space, check remove_strings() for more information, by default True
custom_strings (list[str] | str | None) – Include any other string(s), by default None
custom_expressions (Union[ExpressionGroup, Expression, str]) – Include any other regular expressions, by default None

Returns

Processed text

Return type

str

Raises

ValueError – If no argument is set to True

Examples

>>> from maha.cleaners.functions import remove
>>> text = "ويندوز 11 سيدعم تطبيقات نظام أندرويد. #Windows11"
>>> remove(text, hashtags=True)
'ويندوز 11 سيدعم تطبيقات نظام أندرويد.'

>>> from maha.cleaners.functions import remove
>>> text = "قَالَ رَبِّ اشْرَحْ لِي صَدْرِي.."
>>> remove(text, all_harakat=True, punctuations=True)
'قال رب اشرح لي صدري'

remove_strings(text, strings, use_space=True)[source]#

Removes the input strings (strings) from the given text (text).

This works by replacing all input strings strings with a space, which means space cannot be removed. This is to help separate texts when unwanted strings are present without spaces. For example, ‘end.start’ will be converted to ‘end start’ if dot DOT is passed to strings. To disable this behavior, set use_space to False.

Note

Extra spaces (more than one space) are removed by default if use_space is set to True.

Parameters

text (str) – Text to be processed
strings (Union[List[str], str]) – list of strings to remove
use_space (bool) – False to not replace with space, defaults to True

Returns

Text with input strings removed.

Return type

str

Raises

ValueError – If no strings are provided

Example

>>> from maha.cleaners.functions import remove_strings
>>> text = "ومن الكلمات المحظورة السلاح"
>>> remove_strings(text, "السلاح")
'ومن الكلمات المحظورة'
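The space-substitution behaviour described above can be approximated with the standard re module. This is a sketch under assumptions, not maha's implementation: remove_all is a hypothetical name, and trimming the result is inferred from the example output.

```python
import re


def remove_all(text, strings, use_space=True):
    """Remove the given strings, optionally leaving a single space in their place."""
    if isinstance(strings, str):
        strings = [strings]
    strings = [s for s in strings if s]
    if not strings:
        raise ValueError("no strings provided")
    # Longest strings first, so a longer string wins over its own prefix.
    unwanted = "|".join(re.escape(s) for s in sorted(strings, key=len, reverse=True))
    if not use_space:
        return re.sub(unwanted, "", text)
    spaced = re.sub(unwanted, " ", text)
    # Collapse the extra spaces that the substitution may have introduced.
    return re.sub(r" {2,}", " ", spaced).strip()


print(remove_all("end.start", "."))                   # end start
print(remove_all("end.start", ".", use_space=False))  # endstart
```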

remove_extra_spaces(text, max_spaces=1)[source]#

Keeps a maximum of max_spaces number of spaces when extra spaces are present (more than one space)

Parameters

text (str) – Text to be processed
max_spaces (int, optional) – Maximum number of spaces to keep, by default 1

Returns

Text with extra spaces removed

Return type

str

Raises

ValueError – When a negative or float value is assigned to max_spaces

Example

>>> from maha.cleaners.functions import remove_extra_spaces
>>> text = "وكان صديقنا    العزيز   محمد من أفضل   الأشخاص الذين قابلتهم"
>>> remove_extra_spaces(text)
'وكان صديقنا العزيز محمد من أفضل الأشخاص الذين قابلتهم'
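Capping runs of spaces at max_spaces can be sketched with a single substitution. The helper name squeeze_spaces is hypothetical, and the validation only mirrors the documented ValueError (negative or non-integer values); it is not taken from maha's source.

```python
import re


def squeeze_spaces(text, max_spaces=1):
    """Shorten every run of more than max_spaces spaces to exactly max_spaces."""
    if not isinstance(max_spaces, int) or max_spaces < 0:
        raise ValueError("max_spaces must be a non-negative integer")
    # Match runs that are longer than the allowed maximum.
    pattern = r" {%d,}" % (max_spaces + 1)
    return re.sub(pattern, " " * max_spaces, text)


print(squeeze_spaces("a    b   c"))                # a b c
print(squeeze_spaces("a    b   c", max_spaces=2))  # a  b  c
```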

remove_punctuations(text)[source]#

Removes all punctuations PUNCTUATIONS from the given text.

Parameters: text (str) – Text to be processed
Returns: Text with punctuations removed.
Return type: str

Example

>>> from maha.cleaners.functions import remove_punctuations
>>> text = "مثال على الرموز الخاصة كالتالي $ ^ & * ( ) ! @"
>>> remove_punctuations(text)
'مثال على الرموز الخاصة كالتالي'

remove_english(text)[source]#

Removes all English characters ENGLISH from the given text.

Parameters: text (str) – Text to be processed
Returns: Text with English characters removed.
Return type: str

Example

>>> from maha.cleaners.functions import remove_english
>>> text = "ومن أفضل الجامعات هي جامعة إكسفورد (Oxford University)"
>>> remove_english(text)
'ومن أفضل الجامعات هي جامعة إكسفورد'

remove_all_harakat(text)[source]#

Removes all harakat ALL_HARAKAT from the given text.

Parameters: text (str) – Text to be processed
Returns: Text with all harakat removed.
Return type: str

Example

>>> from maha.cleaners.functions import remove_all_harakat
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا (1) فَٱلزَّٰجِرَٰتِ زَجۡرٗا"
>>> remove_all_harakat(text)
'وٱلصفت صفا (1) فٱلزجرت زجرا'

remove_harakat(text)[source]#

Removes common harakat HARAKAT from the given text.

Parameters: text (str) – Text to be processed
Returns: Text with common harakat removed.
Return type: str

Example

>>> from maha.cleaners.functions import remove_harakat
>>> text = "ألا تَرَى: كلَّ مَنْ تَرجو وتَأمَلُهُ مِنَ البَرِيَّةِ (مسكينُ بْنُ مسكينِ)"
>>> remove_harakat(text)
'ألا ترى: كل من ترجو وتأمله من البرية (مسكين بن مسكين)'

remove_numbers(text)[source]#

Removes all numbers NUMBERS from the given text.

Parameters: text (str) – Text to be processed
Returns: Text with numbers removed.
Return type: str

Example

>>> from maha.cleaners.functions import remove_numbers
>>> text = "ورقم أبو تريكة في نادي الأهلي هو إثنين وعشرين (22)"
>>> remove_numbers(text)
'ورقم أبو تريكة في نادي الأهلي هو إثنين وعشرين ( )'

remove_tatweel(text)[source]#

Removes tatweel symbol TATWEEL from the given text.

Parameters: text (str) – Text to process
Returns: Text with tatweel symbol removed.
Return type: str

Example

>>> from maha.cleaners.functions import remove_tatweel
>>> text = "الحمــــــــد لله رب العــــــــــــالمـــــــيـــــن"
>>> remove_tatweel(text)
'الحمد لله رب العالمين'

remove_expressions(text, patterns, remove_spaces=True)[source]#

Removes matched characters from the given text (text) using the input patterns (patterns).

Note

Use lookahead/lookbehind when substrings should not be captured or removed.

Parameters

text (str) – Text to process
patterns (Expression | ExpressionGroup | str) – Expression(s) to use
remove_spaces (bool, optional) – False to keep extra spaces, defaults to True

Returns

Text with matched characters removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_expressions
>>> text = "الأميرُ الغازي أرطُغرُل، أو اختصارًا أرطغرل (بالتركية: Ertuğrul)"
>>> remove_expressions(text, r"\(.*\)")
'الأميرُ الغازي أرطُغرُل، أو اختصارًا أرطغرل'
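The note about lookahead and lookbehind can be illustrated with the standard re module, assuming string patterns follow Python's regular expression syntax. The sketch calls re.sub directly instead of remove_expressions, so it shows only the matching behaviour, not maha's space handling.

```python
import re

text = "علم الهندسة (Engineering)"

# Without lookarounds the parentheses are part of the match and are removed too.
without_lookaround = re.sub(r"\([A-Za-z]+\)", "", text)

# With a lookbehind and a lookahead the parentheses are only asserted, not
# consumed, so they survive the removal.
with_lookaround = re.sub(r"(?<=\()[A-Za-z]+(?=\))", "", text)

print(repr(without_lookaround))  # 'علم الهندسة '
print(repr(with_lookaround))     # 'علم الهندسة ()'
```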

remove_emails(text)[source]#

Removes emails using pattern EXPRESSION_EMAILS from the given text.

Parameters: text (str) – Text to be processed
Returns: Text with emails removed.
Return type: str

Example

>>> from maha.cleaners.functions import remove_emails
>>> text = "يمكن استخدام الإيميل الشخصي، كمثال user1998@gmail.com"
>>> remove_emails(text)
'يمكن استخدام الإيميل الشخصي، كمثال'

remove_hashtags(text)[source]#

Removes hashtags (strings that start with # symbol) using pattern EXPRESSION_HASHTAGS from the given text.

Parameters: text (str) – Text to be processed
Returns: Text with hashtags removed.
Return type: str

Example

>>> from maha.cleaners.functions import remove_hashtags
>>> text = "ويمكن القول أن مكة المكرمة من أجمل المناطق على وجه الأرض #السعودية"
>>> remove_hashtags(text)
'ويمكن القول أن مكة المكرمة من أجمل المناطق على وجه الأرض'

remove_links(text)[source]#

Removes links using pattern EXPRESSION_LINKS from the given text.

Parameters: text (str) – Text to be processed
Returns: Text with links removed.
Return type: str

Example

>>> from maha.cleaners.functions import remove_links
>>> text = "لمشاهدة آخر التطورات يرجى زيارة الموقع التالي: https://github.com/TRoboto/Maha"
>>> remove_links(text)
'لمشاهدة آخر التطورات يرجى زيارة الموقع التالي:'

remove_mentions(text)[source]#

Removes mentions (strings that start with @ symbol) using pattern EXPRESSION_MENTIONS from the given text.

Parameters: text (str) – Text to be processed
Returns: Text with mentions removed.
Return type: str

Example

>>> from maha.cleaners.functions import remove_mentions
>>> text = "@test لو سمحت صديقنا تزورنا على المعرض لاستلام الجائزة"
>>> remove_mentions(text)
'لو سمحت صديقنا تزورنا على المعرض لاستلام الجائزة'

reduce_repeated_substring(text, min_repeated=3, reduce_to=2)[source]#

Reduces consecutive substrings that are repeated at least min_repeated times to reduce_to times. For example with the default arguments, ‘hhhhhh’ is reduced to ‘hh’

TODO: Maybe change the implementation for a 50x speedup: https://stackoverflow.com/questions/29481088/how-can-i-tell-if-a-string-repeats-itself-in-python/29489919#29489919

Parameters

text (str) – Text to process
min_repeated (int, optional) – Minimum number of consecutive repeated substring to consider, by default 3
reduce_to (int, optional) – Number of substring to keep, by default 2

Returns

Processed text

Return type

str

Raises

ValueError – If a non-positive integer is passed or reduce_to is greater than min_repeated

Examples

>>> from maha.cleaners.functions import reduce_repeated_substring
>>> text = "ههههههههههههههه"
>>> reduce_repeated_substring(text)
'هه'

>>> from maha.cleaners.functions import reduce_repeated_substring
>>> text = "ويييييييييين راححححححححححححوا"
>>> reduce_repeated_substring(text, reduce_to=1)
'وين راحوا'
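The reduction can be sketched with a backreference pattern and a replacement function from the standard re module. This is an assumed approach for illustration (shrink_repeats is a hypothetical name), not necessarily how maha implements it.

```python
import re


def shrink_repeats(text, min_repeated=3, reduce_to=2):
    """Cut substrings repeated min_repeated+ times in a row down to reduce_to copies."""
    for value in (min_repeated, reduce_to):
        if not isinstance(value, int) or value < 1:
            raise ValueError("arguments must be positive integers")
    if reduce_to > min_repeated:
        raise ValueError("reduce_to cannot be greater than min_repeated")
    # (.+?) captures the repeated unit; \1{n,} matches the additional copies.
    pattern = r"(.+?)\1{%d,}" % (min_repeated - 1)
    return re.sub(pattern, lambda match: match.group(1) * reduce_to, text)


print(shrink_repeats("hhhhhh"))                       # hh
print(shrink_repeats("hahahaha"))                     # haha
print(shrink_repeats("sooooo goooood", reduce_to=1))  # so god
```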

remove_hash_keep_tag(text)[source]#

Removes the hash symbol HASHTAG from all hashtags in the given text.

Parameters: text (str) – Text to process
Returns: Text with the hash symbol removed from hashtags (the tag text is kept).
Return type: str

Example

>>> from maha.cleaners.functions import remove_hash_keep_tag
>>> text = "We love #Jordan very much"
>>> remove_hash_keep_tag(text)
'We love Jordan very much'
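A rough equivalent using `re` is shown below. It assumes that a hashtag is a hash symbol immediately followed by a word character; the library's own definition may be different, and `remove_hash_keep_tag_sketch` is a hypothetical name.

```python
import re

# Assumption: "#" counts as part of a hashtag only when a word character follows it.
HASH_BEFORE_TAG = re.compile(r"#(?=\w)")


def remove_hash_keep_tag_sketch(text: str) -> str:
    """Delete the hash symbol but leave the tag word in place."""
    return HASH_BEFORE_TAG.sub("", text)
```

A lone `#` surrounded by spaces is left untouched by this sketch, because the lookahead requires a following word character.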

remove_arabic_letter_dots(text)[source]#

Removes dots from ARABIC_LETTERS in the given text using ARABIC_DOTLESS_MAP.

Parameters: text (str) – Text to be processed
Returns: Text with dotless Arabic letters
Return type: str

Example

>>> from maha.cleaners.functions import remove_arabic_letter_dots
>>> text = "الحَمدُ للهِ الَّذي بنِعمتِه تَتمُّ الصَّالحاتُ"
>>> remove_arabic_letter_dots(text)
'الحَمدُ للهِ الَّدى ٮٮِعمٮِه ٮَٮمُّ الصَّالحاٮُ'
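The mapping idea can be sketched with `str.translate`. The table below is a hypothetical three-letter subset read off the example output above, not the contents of ARABIC_DOTLESS_MAP, and it ignores any position-dependent rules the real map may apply.

```python
# Hypothetical subset of a dotless map; code points are spelled out to avoid ambiguity.
DOTLESS_SUBSET = str.maketrans(
    {
        "\u0628": "\u066E",  # BEH  -> DOTLESS BEH
        "\u062A": "\u066E",  # TEH  -> DOTLESS BEH
        "\u0630": "\u062F",  # THAL -> DAL
    }
)


def remove_dots_sketch(text: str) -> str:
    """Replace dotted letters with dotless shapes, character by character."""
    return text.translate(DOTLESS_SUBSET)
```

Characters that are not in the table, including harakat, pass through unchanged, which matches the example above where the diacritics survive.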

replace(text, strings, with_value)[source]#

Replaces the input strings in the given text with the given value

Parameters

text (str) – Text to process
strings (list[str] | str) – Strings to replace
with_value (str) – Value to replace the input strings with

Returns

Processed text

Return type

str

Examples

>>> from maha.cleaners.functions import replace
>>> text = "حصل الولد على معدل 50%"
>>> replace(text, "%", " بالمئة")
'حصل الولد على معدل 50 بالمئة'

>>> from maha.cleaners.functions import replace
>>> text = "ولقد كلف هذا المنتج 100 $"
>>> replace(text, "$", "دولار")
'ولقد كلف هذا المنتج 100 دولار'
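For intuition, the same effect can be had with a single escaped alternation. This is a sketch with a hypothetical name (`replace_sketch`); whether the library escapes the strings or orders them by length is an assumption here.

```python
import re
from typing import List, Union


def replace_sketch(text: str, strings: Union[List[str], str], with_value: str) -> str:
    """Replace every occurrence of any input string with one value (illustrative)."""
    if isinstance(strings, str):
        strings = [strings]
    # Longer strings first, so a short string cannot split a longer one.
    candidates = sorted((s for s in strings if s), key=len, reverse=True)
    if not candidates:
        return text
    pattern = "|".join(re.escape(s) for s in candidates)
    # A function replacement keeps backslashes in with_value literal.
    return re.sub(pattern, lambda _match: with_value, text)
```

Escaping matters for inputs such as `"$"` or `"+"`, which would otherwise be read as regex metacharacters.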

replace_except(text, strings, with_value)[source]#

Replaces everything except the input strings in the given text with the given value

Parameters

text (str) – Text to process
strings (list[str] | str) – Strings to preserve (not replace)
with_value (str) – Value to replace all other strings with.

Returns

Processed text

Return type

str

Example

>>> from maha.cleaners.functions import replace_except
>>> from maha.constants import ARABIC_LETTERS, SPACE, EMPTY
>>> text = "لَيتَ الذينَ تُحبُّ العيّنَ رؤيَتهم"
>>> replace_except(text, ARABIC_LETTERS + [SPACE], EMPTY)
'ليت الذين تحب العين رؤيتهم'
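The inverse operation can be sketched with `re.split` and a capturing group. The name `replace_except_sketch` is hypothetical, and one detail is an assumption: each maximal run of non-preserved text is replaced once, which makes no difference when the replacement is empty, as in the example above.

```python
import re
from typing import List, Union


def replace_except_sketch(text: str, strings: Union[List[str], str], with_value: str) -> str:
    """Keep the listed strings and replace every other run of text (illustrative)."""
    if isinstance(strings, str):
        strings = [strings]
    kept = sorted((s for s in strings if s), key=len, reverse=True)
    if not kept:
        return with_value if text else text
    pattern = "(" + "|".join(re.escape(s) for s in kept) + ")"
    # With one capturing group, re.split alternates: other, kept, other, kept, ...
    pieces = re.split(pattern, text)
    output = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            output.append(piece)  # a preserved string
        elif piece:
            output.append(with_value)  # a non-empty run of everything else
    return "".join(output)
```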

replace_pairs(text, keys, values)[source]#

Replaces each key with its corresponding value in the given text

Parameters

text (str) – Text to process
keys (list[str]) – Strings to be replaced
values (list[str]) – Replacement strings, one per key, in the same order as keys

Returns

Processed text

Return type

str

Raises

ValueError – If keys and values are of different lengths

Example

>>> from maha.cleaners.functions import replace_pairs
>>> text = 'شلونك يا محمد؟'
>>> replace_pairs(text, ['شلونك'] , ['كيف حالك'])
'كيف حالك يا محمد؟'
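A pairwise replacement can be sketched as a dictionary lookup inside one regex pass. The function name `replace_pairs_sketch` is hypothetical, and the single-pass behaviour is a design choice of this sketch, not something the documentation above confirms.

```python
import re
from typing import List


def replace_pairs_sketch(text: str, keys: List[str], values: List[str]) -> str:
    """Replace each key with the value at the same index (illustrative)."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    mapping = dict(zip(keys, values))
    # Longer keys first, so overlapping keys resolve to the longest match.
    ordered = sorted((k for k in mapping if k), key=len, reverse=True)
    if not ordered:
        return text
    pattern = "|".join(re.escape(k) for k in ordered)
    return re.sub(pattern, lambda match: mapping[match.group(0)], text)
```

Doing everything in one pass means a replacement is never itself replaced, so swapping two strings works: `replace_pairs_sketch("a b", ["a", "b"], ["b", "a"])` gives `'b a'`.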

replace_expression(text, expression, with_value)[source]#

Matches characters from the input text using the given expression and replaces all matched characters with the given value.

Parameters

text (str) – Text to process
expression (Expression | ExpressionGroup | str) – Pattern/Expression used to match characters from the text
with_value (Callable[..., str] | str) – Value to replace the matched characters with

Returns

Processed text

Return type

str

Examples

>>> from maha.cleaners.functions import replace_expression
>>> text = "ولقد حصلت على ١٠ من ١٠ "
>>> replace_expression(text, "١٠", "عشرة")
'ولقد حصلت على عشرة من عشرة '

>>> from maha.cleaners.functions import replace_expression
>>> text = "ذهبت الفتاه إلى المدرسه"
>>> replace_expression(text, "ه( |$)", "ة ").strip()
'ذهبت الفتاة إلى المدرسة'
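The examples above pass a string as with_value; the signature also allows a callable. The sketch below shows what that could look like for a plain string pattern, assuming the callable receives a match object in the same way as with the standard `re.sub`. The name `replace_expression_sketch` is hypothetical, and Expression or ExpressionGroup inputs are not modelled.

```python
import re
from typing import Callable, Union


def replace_expression_sketch(
    text: str, expression: str, with_value: Union[Callable[..., str], str]
) -> str:
    """Replace regex matches with a fixed string or a computed one (illustrative)."""
    return re.sub(expression, with_value, text)


def double_number(match: "re.Match") -> str:
    """Example callable: replace a matched integer with twice its value."""
    return str(int(match.group(0)) * 2)
```

For instance, `replace_expression_sketch("10 of 10", r"\d+", double_number)` gives `'20 of 20'`.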

arabic_numbers_to_english(text)[source]#

Converts Arabic numbers ARABIC_NUMBERS to the corresponding English numbers ENGLISH_NUMBERS

Parameters: text (str) – Text to process
Returns: Processed text with all occurrences of Arabic numbers converted to English numbers
Return type: str

Examples

>>> from maha.cleaners.functions import arabic_numbers_to_english
>>> text = "٣"
>>> arabic_numbers_to_english(text)
'3'

>>> from maha.cleaners.functions import arabic_numbers_to_english
>>> text = "١٠"
>>> arabic_numbers_to_english(text)
'10'
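Digit conversion of this kind maps naturally onto `str.translate`. The sketch assumes that ARABIC_NUMBERS corresponds to the Arabic-Indic digits U+0660 to U+0669; the constant and function names are hypothetical.

```python
# Assumption: the Arabic digits are the ten code points U+0660..U+0669, in order 0-9.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
DIGIT_TABLE = str.maketrans(ARABIC_INDIC_DIGITS, "0123456789")


def arabic_numbers_to_english_sketch(text: str) -> str:
    """Convert Arabic-Indic digits to ASCII digits, leaving other text alone."""
    return text.translate(DIGIT_TABLE)
```

Building the table from code points avoids any doubt about the order in which right-to-left digits were typed.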

connect_single_letter_word(text, waw=None, feh=None, beh=None, lam=None, kaf=None, teh=None, all=None, custom_strings=None)[source]#

Connects each single-letter word with the letter following it.

Parameters

text (str) – Text to process
waw (bool, optional) – Connect WAW letter, by default None
feh (bool, optional) – Connect FEH letter, by default None
beh (bool, optional) – Connect BEH letter, by default None
lam (bool, optional) – Connect LAM letter, by default None
kaf (bool, optional) – Connect KAF letter, by default None
teh (bool, optional) – Connect TEH letter, by default None
all (bool, optional) – Connect all letters except the ones set to False, by default None
custom_strings (Union[List[str], str], optional) – Include any other string(s) to connect, by default None
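The connecting step can be pictured as deleting the whitespace after a standalone letter. The sketch below takes the letters directly as a list instead of the boolean flags above; `connect_single_letters_sketch` and its `letters` parameter are hypothetical, and the library may define a single-letter word differently.

```python
import re
from typing import Iterable


def connect_single_letters_sketch(text: str, letters: Iterable[str]) -> str:
    """Join each standalone letter to the word that follows it (illustrative)."""
    alternatives = "|".join(re.escape(letter) for letter in letters)
    if not alternatives:
        return text
    # Standalone letter: preceded by start-of-text or whitespace,
    # followed by whitespace and then a non-space character.
    pattern = re.compile(r"(?<!\S)(" + alternatives + r")\s+(?=\S)")
    return pattern.sub(r"\1", text)
```

A letter that ends a longer word is not touched, because the lookbehind requires whitespace or the start of the text before it.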
