maha.cleaners.functions.contains_fn

Functions that operate on a string and check for values contained in it

Module Contents

Functions

contains(text[, arabic, english, ...])

Check for certain characters, strings or patterns in the given text.

contains_repeated_substring(text[, min_repeated])

Check for consecutive substrings that are repeated at least min_repeated times.

contains_single_letter_word(text[, ...])

Check for a single-letter word.

contains_expressions(text, expressions)

Check for matched strings in the given text using the input expressions

contain_strings(text, strings)

Check for the input strings in the given text

contains(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator=None)

Check for certain characters, strings or patterns in the given text.

To add a new parameter, make sure that its name matches the name of the corresponding constant. For patterns, the parameter name is the constant name with the EXPRESSION_ prefix removed.

Parameters
  • text (str) – Text to check

  • arabic (bool, optional) – Check for ARABIC characters, by default False

  • english (bool, optional) – Check for ENGLISH characters, by default False

  • arabic_letters (bool, optional) – Check for ARABIC_LETTERS characters, by default False

  • english_letters (bool, optional) – Check for ENGLISH_LETTERS characters, by default False

  • english_small_letters (bool, optional) – Check for ENGLISH_SMALL_LETTERS characters, by default False

  • english_capital_letters (bool, optional) – Check for ENGLISH_CAPITAL_LETTERS characters, by default False

  • numbers (bool, optional) – Check for NUMBERS characters, by default False

  • harakat (bool, optional) – Check for HARAKAT characters, by default False

  • all_harakat (bool, optional) – Check for ALL_HARAKAT characters, by default False

  • tatweel (bool, optional) – Check for TATWEEL character, by default False

  • lam_alef_variations (bool, optional) – Check for LAM_ALEF_VARIATIONS characters, by default False

  • lam_alef (bool, optional) – Check for LAM_ALEF character, by default False

  • punctuations (bool, optional) – Check for PUNCTUATIONS characters, by default False

  • arabic_numbers (bool, optional) – Check for ARABIC_NUMBERS characters, by default False

  • english_numbers (bool, optional) – Check for ENGLISH_NUMBERS characters, by default False

  • arabic_punctuations (bool, optional) – Check for ARABIC_PUNCTUATIONS characters, by default False

  • english_punctuations (bool, optional) – Check for ENGLISH_PUNCTUATIONS characters, by default False

  • arabic_ligatures (bool, optional) – Check for ARABIC_LIGATURES words, by default False

  • persian (bool, optional) – Check for PERSIAN characters, by default False

  • arabic_hashtags (bool, optional) – Check for Arabic hashtags using the expression EXPRESSION_ARABIC_HASHTAGS, by default False

  • arabic_mentions (bool, optional) – Check for Arabic mentions using the expression EXPRESSION_ARABIC_MENTIONS, by default False

  • emails (bool, optional) – Check for emails using the expression EXPRESSION_EMAILS, by default False

  • english_hashtags (bool, optional) – Check for English hashtags using the expression EXPRESSION_ENGLISH_HASHTAGS, by default False

  • english_mentions (bool, optional) – Check for English mentions using the expression EXPRESSION_ENGLISH_MENTIONS, by default False

  • hashtags (bool, optional) – Check for hashtags using the expression EXPRESSION_HASHTAGS, by default False

  • links (bool, optional) – Check for links using the expression EXPRESSION_LINKS, by default False

  • mentions (bool, optional) – Check for mentions using the expression EXPRESSION_MENTIONS, by default False

  • emojis (bool, optional) – Check for emojis using the expression EXPRESSION_EMOJIS, by default False

  • custom_strings (Union[List[str], str], optional) – Include any other string(s), by default None

  • custom_expressions (ExpressionGroup | Expression | None) – Include any other expressions, by default None

  • operator (str, optional) – When multiple arguments are set to True, this operator is used to combine the results into a single boolean. Takes ‘and’ or ‘or’, by default None

Returns

  • If one argument is set to True, a boolean value is returned. True if the text contains it, False otherwise.

  • If operator is set and more than one argument is set to True, a boolean value that combines the result with the “and/or” operator is returned.

  • If more than one argument is set to True and operator is not set, a dictionary is returned whose keys are the names of the arguments set to True and whose values are booleans: True if the text contains the argument, False otherwise.

Return type

Union[Dict[str, bool], bool]

Raises

ValueError – If no argument is set to True

Examples

>>> from maha.cleaners.functions import contains
>>> text = "مقاييس أداء النماذج في التعلم الآلي Machine Learning ... 🌺"
>>> contains(text, english=True, emails=True, emojis=True)
{'english': True, 'emails': False, 'emojis': True}
>>> from maha.cleaners.functions import contains
>>> text = "قال رسول اللهﷺ إن خير أيامكم يوم الجمعة فأكثروا عليَّ من الصلاة فيه"
>>> contains(text, english=True)
False
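
The three return shapes listed under Returns can be sketched in plain Python. This is only an illustration and not maha's implementation: combine_checks is a hypothetical helper that receives the per-argument results, and raising ValueError for an unrecognised operator is an assumption made here.

```python
from typing import Dict, Optional, Union


def combine_checks(
    results: Dict[str, bool], operator: Optional[str] = None
) -> Union[Dict[str, bool], bool]:
    """Hypothetical helper mirroring the documented return shapes."""
    if not results:
        # Documented behaviour: at least one argument must be True.
        raise ValueError("At least one argument should be set to True")
    if len(results) == 1:
        # A single check yields a plain boolean.
        return next(iter(results.values()))
    if operator is None:
        # Several checks and no operator: one boolean per argument.
        return dict(results)
    if operator == "and":
        return all(results.values())
    if operator == "or":
        return any(results.values())
    # Assumption: other operator values are rejected.
    raise ValueError("operator should be 'and' or 'or'")


checks = {"english": True, "emails": False, "emojis": True}
print(combine_checks(checks))
print(combine_checks(checks, operator="and"))
print(combine_checks(checks, operator="or"))
```

With the results from the first example above, the dictionary form is returned when no operator is given, while "and" and "or" reduce it to a single boolean.
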
contains_repeated_substring(text, min_repeated=3)

Check for consecutive substrings that are repeated at least min_repeated times. For example, with the default arguments, the text ‘hhhhhh’ returns True.

Parameters
  • text (str) – Text to check

  • min_repeated (int, optional) – Minimum number of consecutive repetitions of a substring to consider, by default 3

Returns

True if the input text contains a substring repeated consecutively at least min_repeated times, otherwise False

Return type

bool

Raises

ValueError – If min_repeated is not a positive integer

Example

>>> from maha.cleaners.functions import contains_repeated_substring
>>> text = "كانت اللعبة حللللللللوة جداً"
>>> contains_repeated_substring(text)
True
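
One way to detect consecutive repetitions is a regular expression with a backreference. The sketch below uses only the standard re module. The name has_repeated_substring and the exact pattern are assumptions for illustration, and the library may treat edge cases such as repeated whitespace differently.

```python
import re


def has_repeated_substring(text: str, min_repeated: int = 3) -> bool:
    """Hypothetical stand-in: True if some substring repeats back to back."""
    if min_repeated < 1:
        raise ValueError("min_repeated should be a positive integer")
    # (.+?) captures a candidate substring; \1{n,} requires at least n
    # further copies right after it, so n = min_repeated - 1.
    pattern = rf"(.+?)\1{{{min_repeated - 1},}}"
    return re.search(pattern, text) is not None


print(has_repeated_substring("hhhhhh"))
print(has_repeated_substring("hello"))
print(has_repeated_substring("abab", min_repeated=2))
```

The word "hello" contains "l" only twice in a row, so it does not meet the default threshold of three repetitions.
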
contains_single_letter_word(text, arabic_letters=False, english_letters=False)

Check for a single-letter word. For example, “how r u” should return True if english_letters is set to True because it contains two single-letter words, “r” and “u”.

Parameters
  • text (str) – Text to check

  • arabic_letters (bool, optional) – Check for all ARABIC_LETTERS, by default False

  • english_letters (bool, optional) – Check for all ENGLISH_LETTERS, by default False

Returns

True if the input text contains a single-letter word, False otherwise

Return type

bool

Raises

ValueError – If no argument is set to True

Example

>>> from maha.cleaners.functions import contains_single_letter_word
>>> text = "cu later my friend, ك"
>>> contains_single_letter_word(text, arabic_letters=True, english_letters=True)
True
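
The idea can be sketched with the standard re module by looking for one letter that is bounded by whitespace or by the ends of the string. The helper name, the whitespace-only word boundary, and the Arabic range U+0621 to U+064A are assumptions made for this sketch. They are not the library's ARABIC_LETTERS constant or its actual matching rules.

```python
import re

# Assumed character ranges, for illustration only.
ENGLISH_RANGE = "a-zA-Z"
ARABIC_RANGE = "\u0621-\u064A"


def has_single_letter_word(
    text: str, arabic_letters: bool = False, english_letters: bool = False
) -> bool:
    """Hypothetical stand-in: True if a lone letter forms a word."""
    ranges = ""
    if arabic_letters:
        ranges += ARABIC_RANGE
    if english_letters:
        ranges += ENGLISH_RANGE
    if not ranges:
        raise ValueError("At least one argument should be set to True")
    # (?<!\S) and (?!\S): no non-space character on either side.
    pattern = rf"(?<!\S)[{ranges}](?!\S)"
    return re.search(pattern, text) is not None


print(has_single_letter_word("how r u", english_letters=True))
print(has_single_letter_word("cu later my friend, ك", english_letters=True))
print(has_single_letter_word("cu later my friend, ك", arabic_letters=True))
```

In the second call no English word has exactly one letter, so only the Arabic check in the third call finds a match.
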
contains_expressions(text, expressions)

Check for matched strings in the given text using the input expressions

Note

Use lookahead/lookbehind when substrings should not be captured or removed.

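Parameters
  • text (str) – Text to check

  • expressions (Union[ExpressionGroup, Expression, str]) – Expression(s) to check for

Returns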

True if the pattern is found in the given text, False otherwise.

Return type

bool

Raises

ValueError – If expressions are not of type Expression, ExpressionGroup or str

Example

>>> from maha.cleaners.functions import contains_expressions
>>> text = "علم الهندسة (Engineering)"
>>> contains_expressions(text, r"\([A-Za-z]+\)")
True
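
The note about lookahead and lookbehind can be illustrated with the standard re module alone, without assuming anything about how maha wraps patterns. A lookaround asserts that the surrounding characters are present without making them part of the match.

```python
import re

text = "علم الهندسة (Engineering)"

# Plain pattern: the parentheses are part of the matched substring.
with_parens = re.search(r"\([A-Za-z]+\)", text)

# Lookbehind and lookahead: the parentheses must be there,
# but they are excluded from the matched substring.
inner_only = re.search(r"(?<=\()[A-Za-z]+(?=\))", text)

print(with_parens.group())  # (Engineering)
print(inner_only.group())   # Engineering
```

Both patterns report that a match exists. They differ only in which characters belong to the match, which matters when the same expression is reused to extract or remove text.
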
contain_strings(text, strings)

Check for the input strings in the given text

Parameters
  • text (str) – Text to check

  • strings (Union[List[str], str]) – String or list of strings to check for

Returns

True if the input string(s) are found in the text, False otherwise

Return type

bool

Raises

ValueError – If no strings are provided

Example

>>> from maha.cleaners.functions import contain_strings
>>> text = "الله أكبر، الحمد لله رب العالمين"
>>> contain_strings(text, "الله")
True
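
The description above does not say whether every string in a list must be present or whether one is enough. The sketch below makes that choice explicit. text_has_strings and its require_all flag are hypothetical and are not part of maha; check the library's behaviour before relying on either reading.

```python
from typing import List, Union


def text_has_strings(
    text: str, strings: Union[List[str], str], require_all: bool = False
) -> bool:
    """Hypothetical helper: substring membership for one or more strings."""
    if isinstance(strings, str):
        # Accept a single string as well as a list, as the signature allows.
        strings = [strings]
    if not strings:
        raise ValueError("At least one string should be provided")
    check = all if require_all else any
    return check(s in text for s in strings)


text = "الله أكبر، الحمد لله رب العالمين"
print(text_has_strings(text, "الله"))
print(text_has_strings(text, ["الله", "Engineering"]))
print(text_has_strings(text, ["الله", "Engineering"], require_all=True))
```
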