maha.cleaners.functions#

Package Contents#

Functions#

contains(text[, arabic, english, ...])

Check for certain characters, strings or patterns in the given text.

contains_expressions(text, expressions)

Check for matched strings in the given text using the input expressions

contain_strings(text, strings)

Check for the input strings in the given text

contains_repeated_substring(text[, min_repeated])

Check for consecutive substrings that are repeated at least min_repeated times.

contains_single_letter_word(text[, ...])

Check for a single-letter word.

keep(text[, arabic, english, ...])

Keeps only certain characters in the given text and removes everything else.

keep_strings(text, strings[, use_space])

Keeps only the input strings in the given text.

keep_arabic_letters(text)

Keeps only Arabic letters ARABIC_LETTERS in the given text.

keep_arabic_characters(text)

Keeps only common Arabic characters ARABIC in the given text.

keep_arabic_with_english_numbers(text)

Keeps only common Arabic characters ARABIC and English numbers ENGLISH_NUMBERS in the given text.

keep_arabic_letters_with_harakat(text)

Keeps only Arabic letters ARABIC_LETTERS and harakat HARAKAT in the given text.

normalize(text[, lam_alef, alef, waw, yeh, ...])

Normalizes characters in the given text

normalize_lam_alef(text[, keep_hamza])

Normalize LAM_ALEF_VARIATIONS to LAM_ALEF_VARIATIONS_NORMALIZED if keep_hamza is True.

normalize_small_alef(text[, keep_madda, ...])

Normalize ALEF_SUPERSCRIPT to ALEF.

numbers_to_text(text[, accusative])

Converts numbers in text to their equivalent text in Arabic.

remove(text[, arabic, english, ...])

Removes certain characters from the given text.

remove_strings(text, strings[, use_space])

Removes the input strings from the given text.

remove_extra_spaces(text[, max_spaces])

Keeps a maximum of max_spaces number of spaces when extra spaces are present (more than one space)

remove_punctuations(text)

Removes all punctuations PUNCTUATIONS from the given text.

remove_english(text)

Removes all English characters ENGLISH from the given text.

remove_all_harakat(text)

Removes all harakat ALL_HARAKAT from the given text.

remove_harakat(text)

Removes common harakat HARAKAT from the given text.

remove_numbers(text)

Removes all numbers NUMBERS from the given text.

remove_tatweel(text)

Removes tatweel symbol TATWEEL from the given text.

remove_expressions(text, patterns[, ...])

Removes matched characters from the given text using the input patterns.

remove_emails(text)

Removes emails using pattern EXPRESSION_EMAILS from the given text.

remove_hashtags(text)

Removes hashtags (strings that start with # symbol) using pattern EXPRESSION_HASHTAGS from the given text.

remove_links(text)

Removes links using pattern EXPRESSION_LINKS from the given text.

remove_mentions(text)

Removes mentions (strings that start with @ symbol) using pattern EXPRESSION_MENTIONS from the given text.

reduce_repeated_substring(text[, ...])

Reduces consecutive substrings that are repeated at least min_repeated times to reduce_to times.

remove_hash_keep_tag(text)

Removes the hash symbol HASHTAG from all hashtags in the given text.

remove_arabic_letter_dots(text)

Remove dots from ARABIC_LETTERS in the given text using the ARABIC_DOTLESS_MAP

replace(text, strings, with_value)

Replaces the input strings in the given text with the given value

replace_except(text, strings, with_value)

Replaces everything except the input strings in the given text with the given value

replace_pairs(text, keys, values)

Replaces each key with its corresponding value in the given text

replace_expression(text, expression, with_value)

Matches characters from the input text using the given expression and replaces all matched characters with the given value.

arabic_numbers_to_english(text)

Converts Arabic numbers ARABIC_NUMBERS to the corresponding English numbers ENGLISH_NUMBERS

connect_single_letter_word(text[, waw, feh, ...])

Connects single-letter word with the letter following it.

contains(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator=None)[source]#

Check for certain characters, strings or patterns in the given text.

To add a new parameter, make sure that its name is the same as the corresponding constant. For patterns, the parameter name is the constant name without the EXPRESSION_ prefix.

Parameters
  • text (str) – Text to check

  • arabic (bool, optional) – Check for ARABIC characters, by default False

  • english (bool, optional) – Check for ENGLISH characters, by default False

  • arabic_letters (bool, optional) – Check for ARABIC_LETTERS characters, by default False

  • english_letters (bool, optional) – Check for ENGLISH_LETTERS characters, by default False

  • english_small_letters (bool, optional) – Check for ENGLISH_SMALL_LETTERS characters, by default False

  • english_capital_letters (bool, optional) – Check for ENGLISH_CAPITAL_LETTERS characters, by default False

  • numbers (bool, optional) – Check for NUMBERS characters, by default False

  • harakat (bool, optional) – Check for HARAKAT characters, by default False

  • all_harakat (bool, optional) – Check for ALL_HARAKAT characters, by default False

  • tatweel (bool, optional) – Check for TATWEEL character, by default False

  • lam_alef_variations (bool, optional) – Check for LAM_ALEF_VARIATIONS characters, by default False

  • lam_alef (bool, optional) – Check for LAM_ALEF character, by default False

  • punctuations (bool, optional) – Check for PUNCTUATIONS characters, by default False

  • arabic_numbers (bool, optional) – Check for ARABIC_NUMBERS characters, by default False

  • english_numbers (bool, optional) – Check for ENGLISH_NUMBERS characters, by default False

  • arabic_punctuations (bool, optional) – Check for ARABIC_PUNCTUATIONS characters, by default False

  • english_punctuations (bool, optional) – Check for ENGLISH_PUNCTUATIONS characters, by default False

  • arabic_ligatures (bool, optional) – Check for ARABIC_LIGATURES words, by default False

  • persian (bool, optional) – Check for PERSIAN characters, by default False

  • arabic_hashtags (bool, optional) – Check for Arabic hashtags using the expression EXPRESSION_ARABIC_HASHTAGS, by default False

  • arabic_mentions (bool, optional) – Check for Arabic mentions using the expression EXPRESSION_ARABIC_MENTIONS, by default False

  • emails (bool, optional) – Check for emails using the expression EXPRESSION_EMAILS, by default False

  • english_hashtags (bool, optional) – Check for English hashtags using the expression EXPRESSION_ENGLISH_HASHTAGS, by default False

  • english_mentions (bool, optional) – Check for English mentions using the expression EXPRESSION_ENGLISH_MENTIONS, by default False

  • hashtags (bool, optional) – Check for hashtags using the expression EXPRESSION_HASHTAGS, by default False

  • links (bool, optional) – Check for links using the expression EXPRESSION_LINKS, by default False

  • mentions (bool, optional) – Check for mentions using the expression EXPRESSION_MENTIONS, by default False

  • emojis (bool, optional) – Check for emojis using the expression EXPRESSION_EMOJIS, by default False

  • custom_strings (Union[List[str], str], optional) – Include any other string(s), by default None

  • custom_expressions (ExpressionGroup | Expression | None) – Include any other expressions, by default None

  • operator (str, optional) – When multiple arguments are set to True, this operator is used to combine the results into a single boolean. Takes ‘and’ or ‘or’, by default None

Returns

  • If one argument is set to True, a boolean value is returned. True if the text contains it, False otherwise.

  • If operator is set and more than one argument is set to True, a boolean value that combines the result with the “and/or” operator is returned.

  • If more than one argument is set to True and operator is not set, a dictionary is returned whose keys are the names of the arguments set to True and whose values are booleans: True if the text contains the corresponding item, False otherwise.

Return type

Union[Dict[str, bool], bool]

Raises

ValueError – If no argument is set to True

Examples

>>> from maha.cleaners.functions import contains
>>> text = "مقاييس أداء النماذج في التعلم الآلي Machine Learning ... 🌺"
>>> contains(text, english=True, emails=True, emojis=True)
{'english': True, 'emails': False, 'emojis': True}
>>> from maha.cleaners.functions import contains
>>> text = "قال رسول اللهﷺ إن خير أيامكم يوم الجمعة فأكثروا عليَّ من الصلاة فيه"
>>> contains(text, english=True)
False
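The return rules above can be condensed into a few lines. The following is a minimal sketch and not Maha's implementation: `combine_checks` is a hypothetical helper that receives the per-argument results and mimics how the documented return value is chosen.

```python
from typing import Dict, Optional, Union


def combine_checks(
    results: Dict[str, bool], operator: Optional[str] = None
) -> Union[Dict[str, bool], bool]:
    """Hypothetical helper that mirrors the documented return rules."""
    if not results:
        # Mirrors the documented ValueError when no argument is set to True.
        raise ValueError("At least one argument should be True")
    if len(results) == 1:
        # A single check yields a plain boolean.
        return next(iter(results.values()))
    if operator is None:
        # Several checks and no operator: one boolean per argument name.
        return results
    if operator == "and":
        return all(results.values())
    if operator == "or":
        return any(results.values())
    raise ValueError("operator must be 'and' or 'or'")


per_argument = {"english": True, "emails": False, "emojis": True}
print(combine_checks(per_argument))         # dictionary of booleans
print(combine_checks(per_argument, "and"))  # False
print(combine_checks(per_argument, "or"))   # True
```

Passing `operator` is therefore only meaningful when more than one check is enabled.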
contains_expressions(text, expressions)[source]#

Check for matched strings in the given text using the input expressions

Note

Use lookahead/lookbehind when substrings should not be captured or removed.

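Parameters
  • text (str) – Text to check

  • expressions (Expression | ExpressionGroup | str) – Expression(s) to check for
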
Returns

True if the pattern is found in the given text, False otherwise.

Return type

bool

Raises

ValueError – If expressions are not of type Expression, ExpressionGroup or str

Example

>>> from maha.cleaners.functions import contains_expressions
>>> text = "علم الهندسة (Engineering)"
>>> contains_expressions(text, r"\([A-Za-z]+\)")
True
contain_strings(text, strings)[source]#

Check for the input strings in the given text

Parameters
  • text (str) – Text to check

  • strings (Union[List[str], str]) – String or list of strings to check for

Returns

True if the input string(s) are found in the text, False otherwise

Return type

bool

Raises

ValueError – If no strings are provided

Example

>>> from maha.cleaners.functions import contain_strings
>>> text = "الله أكبر، الحمد لله رب العالمين"
>>> contain_strings(text, "الله")
True
contains_repeated_substring(text, min_repeated=3)[source]#

Check for consecutive substrings that are repeated at least min_repeated times. For example, with the default arguments, the text ‘hhhhhh’ returns True.

Parameters
  • text (str) – Text to check

  • min_repeated (int, optional) – Minimum number of consecutive repetitions of a substring to consider, by default 3

Returns

True if the input text contains consecutive repeated substrings, False otherwise

Return type

bool

Raises

ValueError – If a non-positive integer is passed

Example

>>> from maha.cleaners.functions import contains_repeated_substring
>>> text = "كانت اللعبة حللللللللوة جداً"
>>> contains_repeated_substring(text)
True
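One way to picture the check is a regular expression with a backreference. This is a sketch under the assumption that a "repeated substring" means the same run of characters appearing back to back. `has_repeated_substring` is a hypothetical name and the pattern is not taken from Maha's source.

```python
import re


def has_repeated_substring(text: str, min_repeated: int = 3) -> bool:
    """Hypothetical sketch: detect a substring repeated back to back."""
    if not isinstance(min_repeated, int) or min_repeated < 1:
        raise ValueError("min_repeated must be a positive integer")
    # (.+?) captures the shortest candidate substring; \1{n,} then requires
    # at least n further consecutive copies of it.
    pattern = r"(.+?)\1{%d,}" % (min_repeated - 1)
    return re.search(pattern, text) is not None


print(has_repeated_substring("hhhhhh"))  # True
print(has_repeated_substring("hahaha"))  # True
print(has_repeated_substring("hello"))   # False
```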
contains_single_letter_word(text, arabic_letters=False, english_letters=False)[source]#

Check for a single-letter word. For example, “how r u” should return True if english_letters is set to True because it contains two single-letter words, “r” and “u”.

Parameters
  • text (str) – Text to check

  • arabic_letters (bool, optional) – Check for all ARABIC_LETTERS, by default False

  • english_letters (bool, optional) – Check for all ENGLISH_LETTERS, by default False

Returns

True if the input text contains a single-letter word, False otherwise

Return type

bool

Raises

ValueError – If no argument is set to True

Example

>>> from maha.cleaners.functions import contains_single_letter_word
>>> text = "cu later my friend, ك"
>>> contains_single_letter_word(text, arabic_letters=True, english_letters=True)
True
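The idea can be sketched with plain string operations. Everything here is an assumption made for illustration: words are taken to be whitespace-delimited, `ARABIC_LETTERS_SKETCH` is only a rough stand-in for Maha's ARABIC_LETTERS constant, and `has_single_letter_word` is a hypothetical name.

```python
import string

# Assumption: a rough approximation of Arabic letters
# (U+0621..U+063A and U+0641..U+064A), not Maha's ARABIC_LETTERS.
ARABIC_LETTERS_SKETCH = {chr(c) for c in range(0x0621, 0x063B)} | {
    chr(c) for c in range(0x0641, 0x064B)
}


def has_single_letter_word(
    text: str, arabic_letters: bool = False, english_letters: bool = False
) -> bool:
    """Hypothetical sketch: is any whitespace-delimited word a single letter?"""
    if not (arabic_letters or english_letters):
        raise ValueError("At least one argument should be True")
    letters = set()
    if arabic_letters:
        letters |= ARABIC_LETTERS_SKETCH
    if english_letters:
        letters |= set(string.ascii_letters)
    return any(len(word) == 1 and word in letters for word in text.split())


print(has_single_letter_word("how r u", english_letters=True))      # True
print(has_single_letter_word("how are you", english_letters=True))  # False
```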
keep(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, use_space=True, custom_strings=None)[source]#

Keeps only certain characters in the given text and removes everything else.

To add a new parameter, make sure that its name is the same as the corresponding constant.

Parameters
  • text (str) – Text to be processed

  • arabic (bool, optional) – Keep ARABIC characters, by default False

  • english (bool, optional) – Keep ENGLISH characters, by default False

  • arabic_letters (bool, optional) – Keep ARABIC_LETTERS characters, by default False

  • english_letters (bool, optional) – Keep ENGLISH_LETTERS characters, by default False

  • english_small_letters (bool, optional) – Keep ENGLISH_SMALL_LETTERS characters, by default False

  • english_capital_letters (bool, optional) – Keep ENGLISH_CAPITAL_LETTERS characters, by default False

  • numbers (bool, optional) – Keep NUMBERS characters, by default False

  • harakat (bool, optional) – Keep HARAKAT characters, by default False

  • all_harakat (bool, optional) – Keep ALL_HARAKAT characters, by default False

  • punctuations (bool, optional) – Keep PUNCTUATIONS characters, by default False

  • arabic_numbers (bool, optional) – Keep ARABIC_NUMBERS characters, by default False

  • english_numbers (bool, optional) – Keep ENGLISH_NUMBERS characters, by default False

  • arabic_punctuations (bool, optional) – Keep ARABIC_PUNCTUATIONS characters, by default False

  • english_punctuations (bool, optional) – Keep ENGLISH_PUNCTUATIONS characters, by default False

  • use_space (bool, optional) – False to not replace with space, check keep_strings() for more information, by default True

  • custom_strings (List[str], optional) – Include any other string(s), by default None

Returns

Processed text

Return type

str

Raises

ValueError – If no argument is set to True

Example

>>> from maha.cleaners.functions import keep
>>> text = "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
>>> keep(text, arabic_letters=True)
'بسم الله الرحمن الرحيم'
keep_strings(text, strings, use_space=True)[source]#

Keeps only the input strings in the given text.

By default, this works by replacing all strings except the input strings with a space, which means space is kept. This is to help separate texts when unwanted strings are present without spaces. For example, ‘end.start’ will be converted to ‘end start’ if English letters ENGLISH_LETTERS are passed to strings. To disable this behavior, set use_space to False.

Note

Extra spaces (more than one space) are removed by default if use_space is set to True.

Parameters
  • text (str) – Text to be processed

  • strings (Union[List[str], str]) – String or list of strings to keep

  • use_space (bool) – False to not replace with space, defaults to True

Returns

Text that contains only the input strings.

Return type

str

Raises

ValueError – If no strings are provided

Example

>>> from maha.cleaners.functions import keep_strings
>>> text = "لا حول ولا قوة إلا بالله"
>>> keep_strings(text, "الله")
'الله'
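The effect of use_space can be illustrated with the standard `re` module. This is a sketch of the described behavior, not Maha's code: `keep_only` is a hypothetical name, and collapsing and trimming the leftover spaces is an assumption based on the note above.

```python
import re
import string
from typing import List, Union


def keep_only(text: str, strings: Union[List[str], str], use_space: bool = True) -> str:
    """Hypothetical sketch: keep the given strings, drop everything else."""
    if not strings:
        raise ValueError("'strings' cannot be empty")
    if isinstance(strings, str):
        strings = [strings]
    # Longest strings first so that longer alternatives win over their prefixes.
    wanted = "|".join(re.escape(s) for s in sorted(strings, key=len, reverse=True))
    filler = " " if use_space else ""
    # Group 1 matches a wanted string; the bare "." matches any other character.
    pattern = re.compile(f"({wanted})|.", flags=re.DOTALL)
    result = pattern.sub(lambda m: m.group(1) or filler, text)
    if use_space:
        # Assumption: extra spaces left behind are collapsed and trimmed.
        result = re.sub(r" {2,}", " ", result).strip()
    return result


english = list(string.ascii_letters)
print(keep_only("end.start", english))                   # end start
print(keep_only("end.start", english, use_space=False))  # endstart
```

With use_space set to False the two words merge, which is the situation the default is meant to avoid.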
keep_arabic_letters(text)[source]#

Keeps only Arabic letters ARABIC_LETTERS in the given text.

Parameters

text (str) – Text to be processed

Returns

Text that contains Arabic letters only.

Return type

str

Example

>>> from maha.cleaners.functions import keep_arabic_letters
>>> text = " 1 يا أحلى mathematicians في العالم"
>>> keep_arabic_letters(text)
'يا أحلى في العالم'
keep_arabic_characters(text)[source]#

Keeps only common Arabic characters ARABIC in the given text.

Parameters

text (str) – Text to be processed

Returns

Text that contains the common Arabic characters only.

Return type

str

Example

>>> from maha.cleaners.functions import keep_arabic_characters
>>> text = "أَلمَانِيَا (بالألمانية: Deutschland) رسمِيّاً جُمهُورِيَّة أَلمَانِيَا الاِتِّحَاديَّة"
>>> keep_arabic_characters(text)
'أَلمَانِيَا بالألمانية رسمِيّاً جُمهُورِيَّة أَلمَانِيَا الاِتِّحَاديَّة'
keep_arabic_with_english_numbers(text)[source]#

Keeps only common Arabic characters ARABIC and English numbers ENGLISH_NUMBERS in the given text.

Parameters

text (str) – Text to be processed

Returns

Text that contains the common Arabic characters and English numbers only.

Return type

str

Example

>>> from maha.cleaners.functions import keep_arabic_with_english_numbers
>>> text = "تتكون من 16 ولاية تُغطي مساحة 357,021 كيلومتر Deutschland"
>>> keep_arabic_with_english_numbers(text)
'تتكون من 16 ولاية تُغطي مساحة 357 021 كيلومتر'
keep_arabic_letters_with_harakat(text)[source]#

Keeps only Arabic letters ARABIC_LETTERS and harakat HARAKAT in the given text.

Parameters

text (str) – Text to be processed

Returns

Text that contains Arabic letters with harakat only.

Return type

str

Example

>>> from maha.cleaners.functions import keep_arabic_letters_with_harakat
>>> text = "إنّ في التّركِ قوة…"
>>> keep_arabic_letters_with_harakat(text)
'إنّ في التّركِ قوة'
normalize(text, lam_alef=None, alef=None, waw=None, yeh=None, teh_marbuta=None, ligatures=None, spaces=None, all=False)[source]#

Normalizes characters in the given text

Parameters
  • text (str) – Text to process

  • lam_alef (bool, optional) – Normalize LAM_ALEF_VARIATIONS characters to LAM and ALEF, by default None

  • alef (bool, optional) – Normalize ALEF_VARIATIONS characters to ALEF, by default None

  • waw (bool, optional) – Normalize WAW_VARIATIONS characters to WAW, by default None

  • yeh (bool, optional) – Normalize YEH_VARIATIONS characters to YEH and ALEF, by default None

  • teh_marbuta (bool, optional) – Normalize TEH_MARBUTA characters to HEH, by default None

  • ligatures (bool, optional) – Normalize ARABIC_LIGATURES characters to the corresponding indices in ARABIC_LIGATURES_NORMALIZED, by default None

  • spaces (bool, optional) – Normalize space variations using the expression EXPRESSION_ALL_SPACES, by default None

  • all (bool, optional) – Do all normalization except the ones that are set to False, by default False

Returns

Processed text

Return type

str

Raises

ValueError – If no argument is set to True

Examples

>>> from maha.cleaners.functions import normalize
>>> text = "عن أبي هريرة"
>>> normalize(text, alef=True, teh_marbuta=True)
'عن ابي هريره'
>>> from maha.cleaners.functions import normalize
>>> text = "قال رسول الله ﷺ"
>>> normalize(text, ligatures=True)
'قال رسول الله صلى الله عليه وسلم'
>>> from maha.cleaners.functions import normalize
>>> text = "قال مؤمن: ﷽ قل هو ﷲ أحد"
... # For space
>>> normalize(text, all=True, waw=False)
'قال مؤمن: بسم الله الرحمن الرحيم قل هو الله احد'
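The individual flags default to None rather than False so that an explicit False can be told apart from "not specified" when all is True. The sketch below shows one way such tri-state flags could be resolved. `resolve_flags` and `enable_all` are hypothetical names, and the logic is inferred from the parameter descriptions rather than taken from Maha's source.

```python
from typing import Dict, Optional


def resolve_flags(flags: Dict[str, Optional[bool]], enable_all: bool = False) -> Dict[str, bool]:
    """Hypothetical sketch: turn None/True/False flags into final booleans."""
    # An unspecified flag (None) follows enable_all; explicit values always win.
    resolved = {
        name: enable_all if value is None else value for name, value in flags.items()
    }
    if not any(resolved.values()):
        # Mirrors the documented ValueError when nothing is enabled.
        raise ValueError("At least one argument should be True")
    return resolved


# Comparable to normalize(text, all=True, waw=False): everything except waw.
print(resolve_flags({"alef": None, "waw": False, "spaces": None}, enable_all=True))
# {'alef': True, 'waw': False, 'spaces': True}
```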
normalize_lam_alef(text, keep_hamza=True)[source]#

Normalize LAM_ALEF_VARIATIONS to LAM_ALEF_VARIATIONS_NORMALIZED if keep_hamza is True. Otherwise, normalize to LAM and ALEF.

Parameters
  • text (str) – Text to process

  • keep_hamza (bool, optional) – True to preserve hamza and madda characters, by default True

Returns

Normalized text

Return type

str

Examples

>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "السﻻم عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَﻷلأ وَجْهُه"
>>> normalize_lam_alef(text)
'السلام عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَلألأ وَجْهُه'
>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "اﻵن يا أصحابي"
>>> normalize_lam_alef(text, keep_hamza=False)
'الان يا أصحابي'
normalize_small_alef(text, keep_madda=True, normalize_end=False)[source]#

Normalize ALEF_SUPERSCRIPT to ALEF. If keep_madda is True and ALEF_SUPERSCRIPT is followed by HAMZA_ABOVE, then normalize to ALEF_MADDA_ABOVE

Parameters
  • text (str) – Text to process

  • keep_madda (bool, optional) – True to preserve madda character, by default True

  • normalize_end (bool, optional) – True to normalize ALEF_SUPERSCRIPT that appear at the end of a word, by default False

Returns

Normalized text

Return type

str

Example

>>> from maha.cleaners.functions import normalize_small_alef
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا"
>>> normalize_small_alef(text)
'وَٱلصَّآفَّاتِ صَفّٗا'
numbers_to_text(text, accusative=False)[source]#

Converts numbers in text to their equivalent text in Arabic.

Parameters
  • text (str) – Text with numbers to be converted.

  • accusative (bool, optional) – If True, the number will be converted to its accusative form, by default False

Returns

Text with numbers converted to their equivalent text in Arabic.

Return type

str

remove(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, use_space=True, custom_strings=None, custom_expressions=None)[source]#

Removes certain characters from the given text.

To add a new parameter, make sure that its name is the same as the corresponding constant. For patterns, the parameter name is the constant name without the EXPRESSION_ prefix.

Parameters
  • text (str) – Text to be processed

  • arabic (bool, optional) – Remove ARABIC characters, by default False

  • english (bool, optional) – Remove ENGLISH characters, by default False

  • arabic_letters (bool, optional) – Remove ARABIC_LETTERS characters, by default False

  • english_letters (bool, optional) – Remove ENGLISH_LETTERS characters, by default False

  • english_small_letters (bool, optional) – Remove ENGLISH_SMALL_LETTERS characters, by default False

  • english_capital_letters (bool, optional) – Remove ENGLISH_CAPITAL_LETTERS characters, by default False

  • numbers (bool, optional) – Remove NUMBERS characters, by default False

  • harakat (bool, optional) – Remove HARAKAT characters, by default False

  • all_harakat (bool, optional) – Remove ALL_HARAKAT characters, by default False

  • tatweel (bool, optional) – Remove TATWEEL character, by default False

  • punctuations (bool, optional) – Remove PUNCTUATIONS characters, by default False

  • arabic_numbers (bool, optional) – Remove ARABIC_NUMBERS characters, by default False

  • english_numbers (bool, optional) – Remove ENGLISH_NUMBERS characters, by default False

  • arabic_punctuations (bool, optional) – Remove ARABIC_PUNCTUATIONS characters, by default False

  • english_punctuations (bool, optional) – Remove ENGLISH_PUNCTUATIONS characters, by default False

  • arabic_ligatures (bool, optional) – Remove ARABIC_LIGATURES words, by default False

  • arabic_hashtags (bool, optional) – Remove Arabic hashtags using the expression EXPRESSION_ARABIC_HASHTAGS, by default False

  • arabic_mentions (bool, optional) – Remove Arabic mentions using the expression EXPRESSION_ARABIC_MENTIONS, by default False

  • emails (bool, optional) – Remove emails using the expression EXPRESSION_EMAILS, by default False

  • english_hashtags (bool, optional) – Remove English hashtags using the expression EXPRESSION_ENGLISH_HASHTAGS, by default False

  • english_mentions (bool, optional) – Remove English mentions using the expression EXPRESSION_ENGLISH_MENTIONS, by default False

  • hashtags (bool, optional) – Remove hashtags using the expression EXPRESSION_HASHTAGS, by default False

  • links (bool, optional) – Remove links using the expression EXPRESSION_LINKS, by default False

  • mentions (bool, optional) – Remove mentions using the expression EXPRESSION_MENTIONS, by default False

  • emojis (bool, optional) – Remove emojis using the expression EXPRESSION_EMOJIS, by default False

  • use_space (bool, optional) – False to not replace with space, check remove_strings() for more information, by default True

  • custom_strings (list[str] | str | None) – Include any other string(s), by default None

  • custom_expressions (Union[ExpressionGroup, Expression, str]) – Include any other regular expressions, by default None

Returns

Processed text

Return type

str

Raises

ValueError – If no argument is set to True

Examples

>>> from maha.cleaners.functions import remove
>>> text = "ويندوز 11 سيدعم تطبيقات نظام أندرويد. #Windows11"
>>> remove(text, hashtags=True)
'ويندوز 11 سيدعم تطبيقات نظام أندرويد.'
>>> from maha.cleaners.functions import remove
>>> text = "قَالَ رَبِّ اشْرَحْ لِي صَدْرِي.."
>>> remove(text, all_harakat=True, punctuations=True)
'قال رب اشرح لي صدري'
remove_strings(text, strings, use_space=True)[source]#

Removes the input strings from the given text.

This works by replacing all input strings strings with a space, which means space cannot be removed. This is to help separate texts when unwanted strings are present without spaces. For example, ‘end.start’ will be converted to ‘end start’ if dot DOT is passed to strings. To disable this behavior, set use_space to False.

Note

Extra spaces (more than one space) are removed by default if use_space is set to True.

Parameters
  • text (str) – Text to be processed

  • strings (Union[List[str], str]) – String or list of strings to remove

  • use_space (bool) – False to not replace with space, defaults to True

Returns

Text with input strings removed.

Return type

str

Raises

ValueError – If no strings are provided

Example

>>> from maha.cleaners.functions import remove_strings
>>> text = "ومن الكلمات المحظورة السلاح"
>>> remove_strings(text, "السلاح")
'ومن الكلمات المحظورة'
remove_extra_spaces(text, max_spaces=1)[source]#

Keeps a maximum of max_spaces number of spaces when extra spaces are present (more than one space)

Parameters
  • text (str) – Text to be processed

  • max_spaces (int, optional) – Maximum number of spaces to keep, by default 1

Returns

Text with extra spaces removed

Return type

str

Raises

ValueError – When a negative or float value is assigned to max_spaces

Example

>>> from maha.cleaners.functions import remove_extra_spaces
>>> text = "وكان صديقنا    العزيز   محمد من أفضل   الأشخاص الذين قابلتهم"
>>> remove_extra_spaces(text)
'وكان صديقنا العزيز محمد من أفضل الأشخاص الذين قابلتهم'
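The role of max_spaces can be shown with a one-line regular expression. This is a minimal sketch assuming only the plain space character is affected. `squeeze_spaces` is a hypothetical name, not part of Maha.

```python
import re


def squeeze_spaces(text: str, max_spaces: int = 1) -> str:
    """Hypothetical sketch: cap runs of spaces at max_spaces."""
    if isinstance(max_spaces, bool) or not isinstance(max_spaces, int) or max_spaces < 0:
        # Mirrors the documented ValueError for negative or float values.
        raise ValueError("max_spaces must be a non-negative integer")
    # Any run longer than max_spaces is cut down to exactly max_spaces.
    return re.sub(" {%d,}" % (max_spaces + 1), " " * max_spaces, text)


print(squeeze_spaces("a    b   c"))                # a b c
print(repr(squeeze_spaces("a    b  c", 2)))        # 'a  b  c'
```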
remove_punctuations(text)[source]#

Removes all punctuations PUNCTUATIONS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with punctuations removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_punctuations
>>> text = "مثال على الرموز الخاصة كالتالي $ ^ & * ( ) ! @"
>>> remove_punctuations(text)
'مثال على الرموز الخاصة كالتالي'
remove_english(text)[source]#

Removes all English characters ENGLISH from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with English characters removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_english
>>> text = "ومن أفضل الجامعات هي جامعة إكسفورد (Oxford University)"
>>> remove_english(text)
'ومن أفضل الجامعات هي جامعة إكسفورد'
remove_all_harakat(text)[source]#

Removes all harakat ALL_HARAKAT from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with all harakat removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_all_harakat
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا (1) فَٱلزَّٰجِرَٰتِ زَجۡرٗا"
>>> remove_all_harakat(text)
'وٱلصفت صفا (1) فٱلزجرت زجرا'
remove_harakat(text)[source]#

Removes common harakat HARAKAT from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with common harakat removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_harakat
>>> text = "ألا تَرَى: كلَّ مَنْ تَرجو وتَأمَلُهُ مِنَ البَرِيَّةِ (مسكينُ بْنُ مسكينِ)"
>>> remove_harakat(text)
'ألا ترى: كل من ترجو وتأمله من البرية (مسكين بن مسكين)'
remove_numbers(text)[source]#

Removes all numbers NUMBERS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with numbers removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_numbers
>>> text = "ورقم أبو تريكة في نادي الأهلي هو إثنين وعشرين (22)"
>>> remove_numbers(text)
'ورقم أبو تريكة في نادي الأهلي هو إثنين وعشرين ( )'
remove_tatweel(text)[source]#

Removes tatweel symbol TATWEEL from the given text.

Parameters

text (str) – Text to process

Returns

Text with tatweel symbol removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_tatweel
>>> text = "الحمــــــــد لله رب العــــــــــــالمـــــــيـــــن"
>>> remove_tatweel(text)
'الحمد لله رب العالمين'
remove_expressions(text, patterns, remove_spaces=True)[source]#

Removes matched characters from the given text using the input patterns.

Note

Use lookahead/lookbehind when substrings should not be captured or removed.

Parameters
  • text (str) – Text to process

  • patterns (Expression | ExpressionGroup | str) – Expression(s) to use

  • remove_spaces (bool, optional) – False to keep extra spaces, defaults to True

Returns

Text with matched characters removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_expressions
>>> text = "الأميرُ الغازي أرطُغرُل، أو اختصارًا أرطغرل (بالتركية: Ertuğrul)"
>>> remove_expressions(text, r"\(.*\)")
'الأميرُ الغازي أرطُغرُل، أو اختصارًا أرطغرل'
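The note about lookahead and lookbehind can be demonstrated with the standard `re` module alone. The sketch below uses plain regular expressions rather than Maha's Expression objects, and the sample text is made up for illustration. It contrasts a pattern that consumes the parentheses with one that only asserts them.

```python
import re

text = "price (approx.) 20"

# The parentheses are part of the match, so they are removed with the content.
consuming = re.sub(r"\(.*?\)", "", text)

# Lookbehind and lookahead only assert that the parentheses are there;
# they are not part of the match, so they survive the removal.
lookaround = re.sub(r"(?<=\().*?(?=\))", "", text)

print(repr(consuming))   # 'price  20'
print(repr(lookaround))  # 'price () 20'
```

The first result keeps two adjacent spaces because this sketch does no space cleanup of its own.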
remove_emails(text)[source]#

Removes emails using pattern EXPRESSION_EMAILS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with emails removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_emails
>>> text = "يمكن استخدام الإيميل الشخصي، كمثال user1998@gmail.com"
>>> remove_emails(text)
'يمكن استخدام الإيميل الشخصي، كمثال'
remove_hashtags(text)[source]#

Removes hashtags (strings that start with # symbol) using pattern EXPRESSION_HASHTAGS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with hashtags removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_hashtags
>>> text = "ويمكن القول أن مكة المكرمة من أجمل المناطق على وجه الأرض #السعودية"
>>> remove_hashtags(text)
'ويمكن القول أن مكة المكرمة من أجمل المناطق على وجه الأرض'
remove_links(text)[source]#

Removes links using pattern EXPRESSION_LINKS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with links removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_links
>>> text = "لمشاهدة آخر التطورات يرجى زيارة الموقع التالي: https://github.com/TRoboto/Maha"
>>> remove_links(text)
'لمشاهدة آخر التطورات يرجى زيارة الموقع التالي:'
remove_mentions(text)[source]#

Removes mentions (strings that start with @ symbol) using pattern EXPRESSION_MENTIONS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with mentions removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_mentions
>>> text = "@test لو سمحت صديقنا تزورنا على المعرض لاستلام الجائزة"
>>> remove_mentions(text)
'لو سمحت صديقنا تزورنا على المعرض لاستلام الجائزة'
reduce_repeated_substring(text, min_repeated=3, reduce_to=2)[source]#

Reduces consecutive substrings that are repeated at least min_repeated times to reduce_to times. For example, with the default arguments, ‘hhhhhh’ is reduced to ‘hh’.

TODO: Maybe change the implementation for a 50x speedup; see https://stackoverflow.com/questions/29481088/how-can-i-tell-if-a-string-repeats-itself-in-python/29489919#29489919

Parameters
  • text (str) – Text to process

  • min_repeated (int, optional) – Minimum number of consecutive repetitions of a substring to consider, by default 3

  • reduce_to (int, optional) – Number of repetitions to keep, by default 2

Returns

Processed text

Return type

str

Raises

ValueError – If a non-positive integer is passed or reduce_to is greater than min_repeated

Examples

>>> from maha.cleaners.functions import reduce_repeated_substring
>>> text = "ههههههههههههههه"
>>> reduce_repeated_substring(text)
'هه'

>>> from maha.cleaners.functions import reduce_repeated_substring
>>> text = "ويييييييييين راححححححححححححوا"
>>> reduce_repeated_substring(text, reduce_to=1)
'وين راحوا'
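
One way to picture the reduction is a backreference-based regular expression. The sketch below is a minimal stand-alone approximation, not the library's implementation; the name `reduce_repeats` is hypothetical, and the validation simply mirrors the ValueError conditions listed above.

```python
import re


def reduce_repeats(text: str, min_repeated: int = 3, reduce_to: int = 2) -> str:
    """Shrink runs of a repeated substring down to `reduce_to` copies."""
    if min_repeated < 1 or reduce_to < 1:
        raise ValueError("min_repeated and reduce_to must be positive integers")
    if reduce_to > min_repeated:
        raise ValueError("reduce_to must not be greater than min_repeated")
    # (.+?) lazily captures the shortest repeating unit; \1{n,} then requires
    # at least n further copies, i.e. min_repeated copies in total.
    pattern = re.compile(r"(.+?)\1{%d,}" % (min_repeated - 1))
    return pattern.sub(lambda match: match.group(1) * reduce_to, text)


print(reduce_repeats("hhhhhh"))                       # single-character unit
print(reduce_repeats("hahahaha"))                     # multi-character unit
print(reduce_repeats("weeeeee wooooon", reduce_to=1))
```

Backtracking makes this approach slow on long inputs, which is presumably the concern behind the TODO above.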
remove_hash_keep_tag(text)[source]#

Removes the hash symbol HASHTAG from all hashtags in the given text.

Parameters

text (str) – Text to process

Returns

Text with the hash symbol removed from all hashtags (the tag text is kept).

Return type

str

Example

>>> from maha.cleaners.functions import remove_hash_keep_tag
>>> text = "We love #Jordan very much"
>>> remove_hash_keep_tag(text)
'We love Jordan very much'
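
The difference between dropping only the hash symbol and dropping the whole hashtag can be sketched with two small regular expressions. Both helper names and both patterns are assumptions for illustration; they are not the library's HASHTAG constant or its hashtag expression.

```python
import re


def drop_hash_symbol(text: str) -> str:
    """Remove '#' only where it directly precedes a word character."""
    return re.sub(r"#(?=\w)", "", text)


def drop_whole_hashtag(text: str) -> str:
    """Remove '#' together with the tag that follows it."""
    return re.sub(r"#\w+", "", text).strip()


sample = "We love #Jordan"
print(drop_hash_symbol(sample))    # tag text is kept
print(drop_whole_hashtag(sample))  # tag text is removed as well
```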
remove_arabic_letter_dots(text)[source]#

Removes dots from ARABIC_LETTERS in the given text using ARABIC_DOTLESS_MAP.

Parameters

text (str) – Text to be processed

Returns

Text with dotless Arabic letters

Return type

str

Example

>>> from maha.cleaners.functions import remove_arabic_letter_dots
>>> text = "الحَمدُ للهِ الَّذي بنِعمتِه تَتمُّ الصَّالحاتُ"
>>> remove_arabic_letter_dots(text)
'الحَمدُ للهِ الَّدى ٮٮِعمٮِه ٮَٮمُّ الصَّالحاٮُ'
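
A character-for-character mapping like this can be pictured with `str.translate`. The table below is a small hypothetical subset chosen for this example, not the contents of ARABIC_DOTLESS_MAP, and the real function may treat letter position differently.

```python
# Hypothetical subset of a dotted -> dotless mapping (illustration only).
DOTLESS_SUBSET = {
    "\u0628": "\u066E",  # BEH   -> DOTLESS BEH
    "\u062A": "\u066E",  # TEH   -> DOTLESS BEH
    "\u062B": "\u066E",  # THEH  -> DOTLESS BEH
    "\u0630": "\u062F",  # THAL  -> DAL
    "\u0632": "\u0631",  # ZAIN  -> REH
    "\u0634": "\u0633",  # SHEEN -> SEEN
}
DOTLESS_TABLE = str.maketrans(DOTLESS_SUBSET)


def strip_dots(text: str) -> str:
    """Map each dotted letter in the subset to its dotless counterpart."""
    # Characters missing from the table (harakat, spaces, other letters)
    # pass through unchanged.
    return text.translate(DOTLESS_TABLE)


print(strip_dots("\u0628\u064E\u0630"))  # BEH + FATHA + THAL
```

Because diacritics are not in the table, they survive the conversion, which is consistent with the harakat being preserved in the example output above.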
replace(text, strings, with_value)[source]#

Replaces the input strings in the given text with the given value

Parameters
  • text (str) – Text to process

  • strings (list[str] | str) – Strings to replace

  • with_value (str) – Value to replace the input strings with

Returns

Processed text

Return type

str

Examples

>>> from maha.cleaners.functions import replace
>>> text = "حصل الولد على معدل 50%"
>>> replace(text, "%", " بالمئة")
'حصل الولد على معدل 50 بالمئة'
>>> from maha.cleaners.functions import replace
>>> text = "ولقد كلف هذا المنتج 100 $"
>>> replace(text, "$", "دولار")
'ولقد كلف هذا المنتج 100 دولار'
replace_except(text, strings, with_value)[source]#

Replaces everything except the input strings in the given text with the given value

Parameters
  • text (str) – Text to process

  • strings (list[str] | str) – Strings to preserve (not replace)

  • with_value (str) – Value to replace all other strings with.

Returns

Processed text

Return type

str

Example

>>> from maha.cleaners.functions import replace_except
>>> from maha.constants import ARABIC_LETTERS, SPACE, EMPTY
>>> text = "لَيتَ الذينَ تُحبُّ العيّنَ رؤيَتهم"
>>> replace_except(text, ARABIC_LETTERS + [SPACE], EMPTY)
'ليت الذين تحب العين رؤيتهم'
replace_pairs(text, keys, values)[source]#

Replaces each key with its corresponding value in the given text

Parameters
  • text (str) – Text to process

  • keys (list[str]) – Strings to be replaced

  • values (list[str]) – Strings to be replaced with

Returns

Processed text

Return type

str

Raises

ValueError – If keys and values are of different lengths

Example

>>> from maha.cleaners.functions import replace_pairs
>>> text = 'شلونك يا محمد؟'
>>> replace_pairs(text, ['شلونك'] , ['كيف حالك'])
'كيف حالك يا محمد؟'
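
The key/value pairing and the length check can be sketched as follows. This is an assumption-laden illustration: `replace_pairs_sketch` is a hypothetical name, and the single-pass strategy is this sketch's choice; how the library handles keys that overlap or that appear in other values is not stated here.

```python
import re
from typing import List


def replace_pairs_sketch(text: str, keys: List[str], values: List[str]) -> str:
    """Replace every key with the value at the same index, in one pass."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not keys:
        return text
    mapping = dict(zip(keys, values))
    # Try longer keys first so a key is never shadowed by its own prefix.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in ordered))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


print(replace_pairs_sketch("a b", ["a", "b"], ["b", "a"]))
```

Doing the substitution in a single pass means a replacement value is never itself replaced, so swapping two strings works as expected.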
replace_expression(text, expression, with_value)[source]#

Matches characters from the input text using the given expression and replaces all matched characters with the given value.

Parameters
  • text (str) – Text to process

  • expression (Expression | ExpressionGroup | str) – Pattern/Expression used to match characters from the text

  • with_value (Callable[..., str] | str) – Value to replace the matched characters with

Returns

Processed text

Return type

str

Examples

>>> from maha.cleaners.functions import replace_expression
>>> text = "ولقد حصلت على ١٠ من ١٠ "
>>> replace_expression(text, "١٠", "عشرة")
'ولقد حصلت على عشرة من عشرة '
>>> from maha.cleaners.functions import replace_expression
>>> text = "ذهبت الفتاه إلى المدرسه"
>>> replace_expression(text, "ه( |$)", "ة ").strip()
'ذهبت الفتاة إلى المدرسة'
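
The string-or-callable behaviour of with_value resembles the standard library's `re.sub`. The sketch below uses plain `re` rather than maha's Expression types, and `replace_with_regex` is a hypothetical helper. It also shows a lookahead variant of the second example that needs no trailing `strip()`.

```python
import re
from typing import Callable, Union


def replace_with_regex(
    text: str,
    pattern: str,
    with_value: Union[str, Callable[[re.Match], str]],
) -> str:
    """Thin wrapper: re.sub accepts a replacement string or a callable."""
    return re.sub(pattern, with_value, text)


# The lookahead (?=\s|$) checks for a word boundary without consuming it,
# so the space after the word is left in place.
fixed = replace_with_regex("ذهبت الفتاه إلى المدرسه", r"ه(?=\s|$)", "ة")

# A callable receives the match object and returns the replacement text.
doubled = replace_with_regex("3 of 10", r"\d+", lambda m: str(int(m.group()) * 2))

print(fixed)
print(doubled)
```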
arabic_numbers_to_english(text)[source]#

Converts Arabic numbers ARABIC_NUMBERS to the corresponding English numbers ENGLISH_NUMBERS

Parameters

text (str) – Text to process

Returns

Processed text with all occurrences of Arabic numbers converted to English numbers

Return type

str

Examples

>>> from maha.cleaners.functions import arabic_numbers_to_english
>>> text = "٣"
>>> arabic_numbers_to_english(text)
'3'
>>> from maha.cleaners.functions import arabic_numbers_to_english
>>> text = "١٠"
>>> arabic_numbers_to_english(text)
'10'
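
Digit conversion of this kind is a one-to-one character mapping, which `str.translate` handles directly. The sketch assumes the Arabic-Indic digits U+0660 to U+0669; whether ARABIC_NUMBERS covers exactly this range is not confirmed here, and `to_english_digits` is a hypothetical name.

```python
# Arabic-Indic digits U+0660..U+0669, built from code points to avoid
# any ambiguity in right-to-left rendering.
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
DIGIT_TABLE = str.maketrans(ARABIC_DIGITS, "0123456789")


def to_english_digits(text: str) -> str:
    """Replace each Arabic-Indic digit with its ASCII counterpart."""
    return text.translate(DIGIT_TABLE)


print(to_english_digits("\u0661\u0660"))  # the digits one and zero
```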
connect_single_letter_word(text, waw=None, feh=None, beh=None, lam=None, kaf=None, teh=None, all=None, custom_strings=None)[source]#

Connects each single-letter word with the letter following it.

Parameters
  • text (str) – Text to process

  • waw (bool, optional) – Connect WAW letter, by default None

  • feh (bool, optional) – Connect FEH letter, by default None

  • beh (bool, optional) – Connect BEH letter, by default None

  • lam (bool, optional) – Connect LAM letter, by default None

  • kaf (bool, optional) – Connect KAF letter, by default None

  • teh (bool, optional) – Connect TEH letter, by default None

  • all (bool, optional) – Connect all letters except those set to False, by default None

  • custom_strings (Union[List[str], str], optional) – Include any other string(s) to connect, by default None