maha.cleaners.functions.remove_fn

Functions that operate on a string and remove certain characters.

Module Contents

Functions

remove(text[, arabic, english, ...])

Removes certain characters from the given text.

reduce_repeated_substring(text[, ...])

Reduces consecutive substrings that are repeated at least min_repeated times to reduce_to times.

remove_hash_keep_tag(text)

Removes the hash symbol HASHTAG from all hashtags in the given text.

remove_tatweel(text)

Removes tatweel symbol TATWEEL from the given text.

remove_emails(text)

Removes emails using pattern EXPRESSION_EMAILS from the given text.

remove_hashtags(text)

Removes hashtags (strings that start with # symbol) using pattern EXPRESSION_HASHTAGS from the given text.

remove_links(text)

Removes links using pattern EXPRESSION_LINKS from the given text.

remove_mentions(text)

Removes mentions (strings that start with @ symbol) using pattern EXPRESSION_MENTIONS from the given text.

remove_punctuations(text)

Removes all punctuations PUNCTUATIONS from the given text.

remove_english(text)

Removes all English characters ENGLISH from the given text.

remove_all_harakat(text)

Removes all harakat ALL_HARAKAT from the given text.

remove_harakat(text)

Removes common harakat HARAKAT from the given text.

remove_numbers(text)

Removes all numbers NUMBERS from the given text.

remove_expressions(text, patterns[, ...])

Removes characters matched by the input patterns from the given text.

remove_strings(text, strings[, use_space])

Removes the input strings from the given text.

remove_extra_spaces(text[, max_spaces])

Keeps at most max_spaces consecutive spaces wherever extra spaces (more than one space) are present.

remove_arabic_letter_dots(text)

Removes dots from ARABIC_LETTERS in the given text using ARABIC_DOTLESS_MAP.

remove(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, use_space=True, custom_strings=None, custom_expressions=None)

Removes certain characters from the given text.

To add a new parameter, give it the same name as the corresponding constant, in lowercase. For pattern constants, drop the EXPRESSION_ prefix from the constant name to get the parameter name.

Parameters
  • text (str) – Text to be processed

  • arabic (bool, optional) – Remove ARABIC characters, by default False

  • english (bool, optional) – Remove ENGLISH characters, by default False

  • arabic_letters (bool, optional) – Remove ARABIC_LETTERS characters, by default False

  • english_letters (bool, optional) – Remove ENGLISH_LETTERS characters, by default False

  • english_small_letters (bool, optional) – Remove ENGLISH_SMALL_LETTERS characters, by default False

  • english_capital_letters (bool, optional) – Remove ENGLISH_CAPITAL_LETTERS characters, by default False

  • numbers (bool, optional) – Remove NUMBERS characters, by default False

  • harakat (bool, optional) – Remove HARAKAT characters, by default False

  • all_harakat (bool, optional) – Remove ALL_HARAKAT characters, by default False

  • tatweel (bool, optional) – Remove TATWEEL character, by default False

  • punctuations (bool, optional) – Remove PUNCTUATIONS characters, by default False

  • arabic_numbers (bool, optional) – Remove ARABIC_NUMBERS characters, by default False

  • english_numbers (bool, optional) – Remove ENGLISH_NUMBERS characters, by default False

  • arabic_punctuations (bool, optional) – Remove ARABIC_PUNCTUATIONS characters, by default False

  • english_punctuations (bool, optional) – Remove ENGLISH_PUNCTUATIONS characters, by default False

  • arabic_ligatures (bool, optional) – Remove ARABIC_LIGATURES words, by default False

  • arabic_hashtags (bool, optional) – Remove Arabic hashtags using the expression EXPRESSION_ARABIC_HASHTAGS, by default False

  • arabic_mentions (bool, optional) – Remove Arabic mentions using the expression EXPRESSION_ARABIC_MENTIONS, by default False

  • emails (bool, optional) – Remove emails using the expression EXPRESSION_EMAILS, by default False

  • english_hashtags (bool, optional) – Remove English hashtags using the expression EXPRESSION_ENGLISH_HASHTAGS, by default False

  • english_mentions (bool, optional) – Remove English mentions using the expression EXPRESSION_ENGLISH_MENTIONS, by default False

  • hashtags (bool, optional) – Remove hashtags using the expression EXPRESSION_HASHTAGS, by default False

  • links (bool, optional) – Remove links using the expression EXPRESSION_LINKS, by default False

  • mentions (bool, optional) – Remove mentions using the expression EXPRESSION_MENTIONS, by default False

  • emojis (bool, optional) – Remove emojis using the expression EXPRESSION_EMOJIS, by default False

  • use_space (bool, optional) – Set to False to remove matches without replacing them with a space; see remove_strings() for details, by default True

  • custom_strings (list[str] | str | None) – Additional string(s) to remove, by default None

  • custom_expressions (Union[ExpressionGroup, Expression, str]) – Additional regular expression(s) to remove, by default None

Returns

Processed text

Return type

str

Raises

ValueError – If no argument is set to True

Examples

>>> from maha.cleaners.functions import remove
>>> text = "ويندوز 11 سيدعم تطبيقات نظام أندرويد. #Windows11"
>>> remove(text, hashtags=True)
'ويندوز 11 سيدعم تطبيقات نظام أندرويد.'
>>> from maha.cleaners.functions import remove
>>> text = "قَالَ رَبِّ اشْرَحْ لِي صَدْرِي.."
>>> remove(text, all_harakat=True, punctuations=True)
'قال رب اشرح لي صدري'
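The naming rule described above (a parameter name matches its constant, and pattern constants drop the EXPRESSION_ prefix) can be illustrated with a small stand-alone sketch. This is not the library's implementation: `remove_sketch` and the `CONSTANTS` table are hypothetical, the three constant values are stand-ins chosen for the illustration, and the trimming of leading and trailing spaces is an assumption inferred from the example outputs.

```python
import re

# Hypothetical stand-ins for a few constants; the library's real values may differ.
CONSTANTS = {
    "TATWEEL": "\u0640",
    "ENGLISH_NUMBERS": list("0123456789"),
    "EXPRESSION_MENTIONS": r"@\w+",
}


def remove_sketch(text, use_space=True, **flags):
    """Illustrative only: resolve each enabled flag to a constant by its name."""
    enabled = [name for name, on in flags.items() if on]
    if not enabled:
        raise ValueError("At least one argument should be True")
    replacement = " " if use_space else ""
    for name in enabled:
        key = name.upper()
        if key in CONSTANTS:
            # Character constants: escape each character and match any of them.
            value = CONSTANTS[key]
            chars = value if isinstance(value, list) else [value]
            pattern = "|".join(re.escape(c) for c in chars)
        elif "EXPRESSION_" + key in CONSTANTS:
            # Pattern constants: the flag name is the constant name without the prefix.
            pattern = CONSTANTS["EXPRESSION_" + key]
        else:
            raise ValueError(f"Unknown flag: {name}")
        text = re.sub(pattern, replacement, text)
    if use_space:
        # Collapse the spaces introduced by the replacements (assumed behavior).
        text = re.sub(r" {2,}", " ", text).strip()
    return text
```

For instance, `remove_sketch("hi @bob 42", mentions=True)` resolves `mentions` to `EXPRESSION_MENTIONS`, while `english_numbers=True` resolves directly to `ENGLISH_NUMBERS`.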

reduce_repeated_substring(text, min_repeated=3, reduce_to=2)

Reduces consecutive substrings that are repeated at least min_repeated times to reduce_to times. For example, with the default arguments, ‘hhhhhh’ is reduced to ‘hh’.

TODO: Maybe change the implementation for 50x speed https://stackoverflow.com/questions/29481088/how-can-i-tell-if-a-string-repeats-itself-in-python/29489919#29489919

Parameters
  • text (str) – Text to process

  • min_repeated (int, optional) – Minimum number of consecutive repetitions of a substring to consider, by default 3

  • reduce_to (int, optional) – Number of repetitions to keep, by default 2

Returns

Processed text

Return type

str

Raises

ValueError – If a non-positive integer is passed or reduce_to is greater than min_repeated

Examples

>>> from maha.cleaners.functions import reduce_repeated_substring
>>> text = "ههههههههههههههه"
>>> reduce_repeated_substring(text)
'هه'

>>> from maha.cleaners.functions import reduce_repeated_substring
>>> text = "ويييييييييين راححححححححححححوا"
>>> reduce_repeated_substring(text, reduce_to=1)
'وين راحوا'
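One way to picture the reduction is a regular expression with a backreference. The sketch below is an assumption-laden illustration using only the standard `re` module, not the library's implementation; the name `reduce_repeated_sketch` is hypothetical, and the argument checks mirror the Raises section above.

```python
import re


def reduce_repeated_sketch(text, min_repeated=3, reduce_to=2):
    """Illustrative only: collapse runs of a repeated substring."""
    if min_repeated < 1 or reduce_to < 1 or reduce_to > min_repeated:
        raise ValueError("Arguments must be positive and reduce_to <= min_repeated")
    # (.+?) captures the shortest candidate substring; \1{n,} requires it to
    # repeat at least n more times, so the run has at least min_repeated copies.
    pattern = r"(.+?)\1{%d,}" % (min_repeated - 1)
    return re.sub(pattern, lambda m: m.group(1) * reduce_to, text)


print(reduce_repeated_sketch("hhhhhh"))                     # hh
print(reduce_repeated_sketch("hahaha"))                     # haha
print(reduce_repeated_sketch("woooow yesss", reduce_to=1))  # wow yes
```

The lazy group means multi-character units such as "ha" are handled as well as single characters. Backtracking makes this approach slow on long inputs, which may be what the TODO above refers to.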

remove_hash_keep_tag(text)

Removes the hash symbol HASHTAG from all hashtags in the given text.

Parameters

text (str) – Text to process

Returns

Text with the hash symbol removed from all hashtags.

Return type

str

Example

>>> from maha.cleaners.functions import remove_hash_keep_tag
>>> text = "We love #Jordan very much"
>>> remove_hash_keep_tag(text)
'We love Jordan very much'

remove_tatweel(text)

Removes tatweel symbol TATWEEL from the given text.

Parameters

text (str) – Text to process

Returns

Text with tatweel symbol removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_tatweel
>>> text = "الحمــــــــد لله رب العــــــــــــالمـــــــيـــــن"
>>> remove_tatweel(text)
'الحمد لله رب العالمين'

remove_emails(text)

Removes emails using pattern EXPRESSION_EMAILS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with emails removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_emails
>>> text = "يمكن استخدام الإيميل الشخصي، كمثال user1998@gmail.com"
>>> remove_emails(text)
'يمكن استخدام الإيميل الشخصي، كمثال'

remove_hashtags(text)

Removes hashtags (strings that start with # symbol) using pattern EXPRESSION_HASHTAGS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with hashtags removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_hashtags
>>> text = "ويمكن القول أن مكة المكرمة من أجمل المناطق على وجه الأرض #السعودية"
>>> remove_hashtags(text)
'ويمكن القول أن مكة المكرمة من أجمل المناطق على وجه الأرض'

remove_links(text)

Removes links using pattern EXPRESSION_LINKS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with links removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_links
>>> text = "لمشاهدة آخر التطورات يرجى زيارة الموقع التالي: https://github.com/TRoboto/Maha"
>>> remove_links(text)
'لمشاهدة آخر التطورات يرجى زيارة الموقع التالي:'

remove_mentions(text)

Removes mentions (strings that start with @ symbol) using pattern EXPRESSION_MENTIONS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with mentions removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_mentions
>>> text = "@test لو سمحت صديقنا تزورنا على المعرض لاستلام الجائزة"
>>> remove_mentions(text)
'لو سمحت صديقنا تزورنا على المعرض لاستلام الجائزة'

remove_punctuations(text)

Removes all punctuations PUNCTUATIONS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with punctuations removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_punctuations
>>> text = "مثال على الرموز الخاصة كالتالي $ ^ & * ( ) ! @"
>>> remove_punctuations(text)
'مثال على الرموز الخاصة كالتالي'

remove_english(text)

Removes all English characters ENGLISH from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with English characters removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_english
>>> text = "ومن أفضل الجامعات هي جامعة إكسفورد (Oxford University)"
>>> remove_english(text)
'ومن أفضل الجامعات هي جامعة إكسفورد'

remove_all_harakat(text)

Removes all harakat ALL_HARAKAT from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with all harakat removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_all_harakat
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا (1) فَٱلزَّٰجِرَٰتِ زَجۡرٗا"
>>> remove_all_harakat(text)
'وٱلصفت صفا (1) فٱلزجرت زجرا'

remove_harakat(text)

Removes common harakat HARAKAT from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with common harakat removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_harakat
>>> text = "ألا تَرَى: كلَّ مَنْ تَرجو وتَأمَلُهُ مِنَ البَرِيَّةِ (مسكينُ بْنُ مسكينِ)"
>>> remove_harakat(text)
'ألا ترى: كل من ترجو وتأمله من البرية (مسكين بن مسكين)'
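Removing a fixed set of diacritics can be sketched with `str.translate`. The set used below (U+064B through U+0652, fathatan through sukun) is an assumption made for illustration; the actual contents of HARAKAT and ALL_HARAKAT are defined by the library and may be larger. The name `strip_harakat_sketch` is hypothetical.

```python
# Assumed subset of diacritics: U+064B (fathatan) through U+0652 (sukun).
ASSUMED_HARAKAT = [chr(cp) for cp in range(0x064B, 0x0653)]

# Mapping a code point to None tells str.translate to delete it.
DELETE_TABLE = dict.fromkeys(map(ord, ASSUMED_HARAKAT))


def strip_harakat_sketch(text):
    """Illustrative only: drop the assumed diacritics and keep everything else."""
    return text.translate(DELETE_TABLE)


vocalized = "\u0642\u064e\u0627\u0644\u064e"  # qaf, fatha, alef, lam, fatha
print(strip_harakat_sketch(vocalized) == "\u0642\u0627\u0644")  # True
```

Because diacritics are deleted rather than replaced with a space, the surrounding letters stay joined.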

remove_numbers(text)

Removes all numbers NUMBERS from the given text.

Parameters

text (str) – Text to be processed

Returns

Text with numbers removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_numbers
>>> text = "ورقم أبو تريكة في نادي الأهلي هو إثنين وعشرين (22)"
>>> remove_numbers(text)
'ورقم أبو تريكة في نادي الأهلي هو إثنين وعشرين ( )'

remove_expressions(text, patterns, remove_spaces=True)

Removes characters matched by the input patterns from the given text.

Note

Use lookahead/lookbehind when substrings should not be captured or removed.

Parameters
  • text (str) – Text to process

  • patterns (Expression | ExpressionGroup | str) – Expression(s) to use

  • remove_spaces (bool, optional) – False to keep extra spaces, defaults to True

Returns

Text with matched characters removed.

Return type

str

Example

>>> from maha.cleaners.functions import remove_expressions
>>> text = "الأميرُ الغازي أرطُغرُل، أو اختصارًا أرطغرل (بالتركية: Ertuğrul)"
>>> remove_expressions(text, r"\(.*\)")
'الأميرُ الغازي أرطُغرُل، أو اختصارًا أرطغرل'
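The note about lookahead and lookbehind can be illustrated with the standard `re` module alone. This sketch does not call the library; it only shows how lookarounds leave the delimiters out of the match, assuming a string pattern passed to remove_expressions is interpreted as an ordinary regular expression.

```python
import re

text = "name (alias) end"

# Plain pattern: the parentheses are part of the match, so they are removed too.
whole = re.sub(r"\(.*?\)", "", text)

# Lookbehind/lookahead: the parentheses must be present but are not consumed,
# so only the content between them is removed.
inner = re.sub(r"(?<=\().*?(?=\))", "", text)

print(repr(whole))  # 'name  end'
print(repr(inner))  # 'name () end'
```

The first result keeps two adjacent spaces because this sketch does no space cleanup; in the library that is controlled by remove_spaces.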

remove_strings(text, strings, use_space=True)

Removes the input strings from the given text.

This works by replacing each of the input strings with a space, which means that a space itself cannot be removed. Replacing with a space helps separate words when unwanted strings appear without surrounding spaces. For example, ‘end.start’ is converted to ‘end start’ if the dot DOT is passed to strings. To disable this behavior, set use_space to False.

Note

Extra spaces (more than one space) are removed by default if use_space is set to True.

Parameters
  • text (str) – Text to be processed

  • strings (Union[List[str], str]) – String or list of strings to remove

  • use_space (bool) – Set to False to remove the strings without replacing them with a space, defaults to True

Returns

Text with input strings removed.

Return type

str

Raises

ValueError – If no strings are provided

Example

>>> from maha.cleaners.functions import remove_strings
>>> text = "ومن الكلمات المحظورة السلاح"
>>> remove_strings(text, "السلاح")
'ومن الكلمات المحظورة'
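The replace-with-space behavior described above can be sketched as follows. This is a hedged illustration, not the library's code: `remove_strings_sketch` is a hypothetical name, and trimming leading and trailing spaces is an assumption inferred from the example output.

```python
import re


def remove_strings_sketch(text, strings, use_space=True):
    """Illustrative only: remove literal strings, optionally leaving a space."""
    if isinstance(strings, str):
        strings = [strings]
    strings = [s for s in strings if s]
    if not strings:
        raise ValueError("No strings provided")
    # Escape each string so it is matched literally; try longer strings first.
    ordered = sorted(strings, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    if not use_space:
        return re.sub(pattern, "", text)
    spaced = re.sub(pattern, " ", text)
    # Collapse the extra spaces created by the replacement (assumed behavior).
    return re.sub(r" {2,}", " ", spaced).strip()


print(remove_strings_sketch("end.start", "."))                   # end start
print(remove_strings_sketch("end.start", ".", use_space=False))  # endstart
```

With use_space=True the two words stay separated, which is the motivation given in the description; with use_space=False they are joined.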

remove_extra_spaces(text, max_spaces=1)

Keeps at most max_spaces consecutive spaces wherever extra spaces (more than one space) are present.

Parameters
  • text (str) – Text to be processed

  • max_spaces (int, optional) – Maximum number of spaces to keep, by default 1

Returns

Text with extra spaces removed

Return type

str

Raises

ValueError – When a negative or float value is assigned to max_spaces

Example

>>> from maha.cleaners.functions import remove_extra_spaces
>>> text = "وكان صديقنا    العزيز   محمد من أفضل   الأشخاص الذين قابلتهم"
>>> remove_extra_spaces(text)
'وكان صديقنا العزيز محمد من أفضل الأشخاص الذين قابلتهم'
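A minimal sketch of the space-limiting rule, using only the standard `re` module. The name `remove_extra_spaces_sketch` is hypothetical, and rejecting 0 in addition to negative and float values is an assumption of this sketch; the library's exact validation may differ.

```python
import re


def remove_extra_spaces_sketch(text, max_spaces=1):
    """Illustrative only: cap every run of spaces at max_spaces."""
    if isinstance(max_spaces, bool) or not isinstance(max_spaces, int) or max_spaces < 1:
        raise ValueError("max_spaces must be a positive integer")
    # Match runs longer than max_spaces and replace them with exactly max_spaces.
    pattern = " {%d,}" % (max_spaces + 1)
    return re.sub(pattern, " " * max_spaces, text)


print(remove_extra_spaces_sketch("a    b   c"))             # a b c
print(repr(remove_extra_spaces_sketch("a    b c", 2)))      # 'a  b c'
```

Runs that are already within the limit are left untouched, so single spaces are never removed.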

remove_arabic_letter_dots(text)

Removes dots from ARABIC_LETTERS in the given text using ARABIC_DOTLESS_MAP.

Parameters

text (str) – Text to be processed

Returns

Text with dotless Arabic letters

Return type

str

Example

>>> from maha.cleaners.functions import remove_arabic_letter_dots
>>> text = "الحَمدُ للهِ الَّذي بنِعمتِه تَتمُّ الصَّالحاتُ"
>>> remove_arabic_letter_dots(text)
'الحَمدُ للهِ الَّدى ٮٮِعمٮِه ٮَٮمُّ الصَّالحاٮُ'
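A letter-to-letter mapping like this can be sketched with `str.maketrans`. The table below is hypothetical: it contains only the five substitutions that appear to be visible in the example above, applied without regard to a letter's position in the word. The real ARABIC_DOTLESS_MAP is defined by the library and may cover more letters or treat positions differently.

```python
# Hypothetical partial map, inferred from the example output only.
DOTLESS_SKETCH = str.maketrans({
    "\u0628": "\u066e",  # beh  -> dotless beh
    "\u062a": "\u066e",  # teh  -> dotless beh
    "\u0646": "\u066e",  # noon -> dotless beh
    "\u0630": "\u062f",  # thal -> dal
    "\u064a": "\u0649",  # yeh  -> alef maksura
})


def remove_dots_sketch(text):
    """Illustrative only: swap each mapped letter for its dotless form."""
    return text.translate(DOTLESS_SKETCH)


word = "\u0628\u0646\u062a"  # beh, noon, teh
print(remove_dots_sketch(word) == "\u066e\u066e\u066e")  # True
```

Characters outside the map, including harakat, pass through unchanged, which matches the example where the diacritics survive.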