maha.cleaners.functions.contains_fn
Functions that operate on a string and check for values contained in it.
Module Contents
Functions
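contains | Check for certain characters, strings or patterns in the given text.
contains_repeated_substring | Check for consecutive substrings that are repeated at least min_repeated times.
contains_single_letter_word | Check for a single-letter word.
contains_expressions | Check for matched strings in the given text using the input expressions.
contain_strings | Check for the input strings in the given text.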
- contains(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator=None)
Check for certain characters, strings or patterns in the given text.
To add a new parameter, make sure that its name matches the name of the corresponding constant, in lowercase. For patterns, drop the EXPRESSION_ prefix from the parameter name.
- Parameters
text (str) – Text to check
arabic (bool, optional) – Check for ARABIC characters, by default False
english (bool, optional) – Check for ENGLISH characters, by default False
arabic_letters (bool, optional) – Check for ARABIC_LETTERS characters, by default False
english_letters (bool, optional) – Check for ENGLISH_LETTERS characters, by default False
english_small_letters (bool, optional) – Check for ENGLISH_SMALL_LETTERS characters, by default False
english_capital_letters (bool, optional) – Check for ENGLISH_CAPITAL_LETTERS characters, by default False
numbers (bool, optional) – Check for NUMBERS characters, by default False
harakat (bool, optional) – Check for HARAKAT characters, by default False
all_harakat (bool, optional) – Check for ALL_HARAKAT characters, by default False
tatweel (bool, optional) – Check for the TATWEEL character, by default False
lam_alef_variations (bool, optional) – Check for LAM_ALEF_VARIATIONS characters, by default False
lam_alef (bool, optional) – Check for the LAM_ALEF character, by default False
punctuations (bool, optional) – Check for PUNCTUATIONS characters, by default False
arabic_numbers (bool, optional) – Check for ARABIC_NUMBERS characters, by default False
english_numbers (bool, optional) – Check for ENGLISH_NUMBERS characters, by default False
arabic_punctuations (bool, optional) – Check for ARABIC_PUNCTUATIONS characters, by default False
english_punctuations (bool, optional) – Check for ENGLISH_PUNCTUATIONS characters, by default False
arabic_ligatures (bool, optional) – Check for ARABIC_LIGATURES words, by default False
persian (bool, optional) – Check for PERSIAN characters, by default False
arabic_hashtags (bool, optional) – Check for Arabic hashtags using the expression EXPRESSION_ARABIC_HASHTAGS, by default False
arabic_mentions (bool, optional) – Check for Arabic mentions using the expression EXPRESSION_ARABIC_MENTIONS, by default False
emails (bool, optional) – Check for emails using the expression EXPRESSION_EMAILS, by default False
english_hashtags (bool, optional) – Check for English hashtags using the expression EXPRESSION_ENGLISH_HASHTAGS, by default False
english_mentions (bool, optional) – Check for English mentions using the expression EXPRESSION_ENGLISH_MENTIONS, by default False
hashtags (bool, optional) – Check for hashtags using the expression EXPRESSION_HASHTAGS, by default False
links (bool, optional) – Check for links using the expression EXPRESSION_LINKS, by default False
mentions (bool, optional) – Check for mentions using the expression EXPRESSION_MENTIONS, by default False
emojis (bool, optional) – Check for emojis using the expression EXPRESSION_EMOJIS, by default False
custom_strings (Union[List[str], str], optional) – Include any other string(s), by default None
custom_expressions (ExpressionGroup | Expression | None) – Include any other expressions, by default None
operator (str, optional) – When multiple arguments are set to True, this operator is used to combine the output into a boolean. Takes ‘and’ or ‘or’, by default None
- Returns
If one argument is set to True, a boolean value is returned: True if the text contains it, False otherwise.
If operator is set and more than one argument is set to True, a boolean value that combines the results with the “and”/“or” operator is returned.
If more than one argument is set to True and operator is not set, a dictionary is returned whose keys are the arguments passed as True and whose values are booleans: True if the text contains the argument, False otherwise.
- Return type
Union[Dict[str, bool], bool]
- Raises
ValueError – If no argument is set to True
Examples
>>> from maha.cleaners.functions import contains
>>> text = "مقاييس أداء النماذج في التعلم الآلي Machine Learning ... 🌺"
>>> contains(text, english=True, emails=True, emojis=True)
{'english': True, 'emails': False, 'emojis': True}

>>> from maha.cleaners.functions import contains
>>> text = "قال رسول اللهﷺ إن خير أيامكم يوم الجمعة فأكثروا عليَّ من الصلاة فيه"
>>> contains(text, english=True)
False
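The naming rule and the three return shapes described above can be illustrated with a small standard-library sketch. It does not use or reproduce the library's code: the constants, `resolve_pattern` and `contains_sketch` are hypothetical stand-ins, and the email pattern is a simplified assumption rather than the value of EXPRESSION_EMAILS.

```python
import re
from typing import Dict, Optional, Union

# Hypothetical stand-ins for the library's constants; the real values differ.
CONSTANTS = {
    "ENGLISH_LETTERS": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "ENGLISH_NUMBERS": "0123456789",
    "EXPRESSION_EMAILS": r"[\w.+-]+@[\w-]+\.[\w.-]+",  # simplified assumption
}


def resolve_pattern(flag: str) -> str:
    """Map a flag name to a regex using the naming rule: flag == constant name in lowercase."""
    name = flag.upper()
    if name in CONSTANTS:
        # Character constants are turned into a character class.
        return "[" + re.escape(CONSTANTS[name]) + "]"
    # Pattern flags drop the EXPRESSION_ prefix, so add it back for the lookup.
    return CONSTANTS["EXPRESSION_" + name]


def contains_sketch(
    text: str, operator: Optional[str] = None, **flags: bool
) -> Union[Dict[str, bool], bool]:
    chosen = [name for name, enabled in flags.items() if enabled]
    if not chosen:
        raise ValueError("At least one argument should be True")
    results = {
        name: re.search(resolve_pattern(name), text) is not None for name in chosen
    }
    if len(results) == 1:
        # One flag: return a plain boolean.
        return next(iter(results.values()))
    if operator == "and":
        return all(results.values())
    if operator == "or":
        return any(results.values())
    # Several flags and no operator: return one boolean per flag.
    return results
```

With this sketch, `contains_sketch("abc", english_letters=True, english_numbers=True)` yields a dictionary with one entry per flag, while adding `operator="or"` collapses it to a single boolean.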
- contains_repeated_substring(text, min_repeated=3)
Check for consecutive substrings that are repeated at least min_repeated times. For example, with the default arguments, the text ‘hhhhhh’ returns True.
- Parameters
text (str) – Text to check
min_repeated (int, optional) – Minimum number of consecutive repetitions of a substring to consider, by default 3
- Returns
True if the input text contains consecutive repeated substrings, otherwise False
- Return type
bool
- Raises
ValueError – If a non-positive integer is passed
Example
>>> from maha.cleaners.functions import contains_repeated_substring
>>> text = "كانت اللعبة حللللللللوة جداً"
>>> contains_repeated_substring(text)
True
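One way to detect consecutive repetitions is a regular expression with a backreference. The sketch below uses only the standard library; `has_repeated_substring` is a hypothetical name, and the pattern is an assumption about how such a check could work, not necessarily how the library implements it.

```python
import re


def has_repeated_substring(text: str, min_repeated: int = 3) -> bool:
    """Return True if some substring occurs min_repeated or more times in a row."""
    if not isinstance(min_repeated, int) or min_repeated < 1:
        raise ValueError("min_repeated must be a positive integer")
    # (.+?) captures a candidate substring; \1{n,} then requires at least
    # n further consecutive copies of it, where n = min_repeated - 1.
    pattern = r"(.+?)\1{%d,}" % (min_repeated - 1)
    return re.search(pattern, text) is not None
```

Because the captured group may be longer than one character, the same pattern also matches repeated multi-character runs such as "abcabcabc".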
- contains_single_letter_word(text, arabic_letters=False, english_letters=False)
Check for a single-letter word. For example, “how r u” should return True if english_letters is set to True because it contains two single-letter words, “r” and “u”.
- Parameters
text (str) – Text to check
arabic_letters (bool, optional) – Check for all ARABIC_LETTERS, by default False
english_letters (bool, optional) – Check for all ENGLISH_LETTERS, by default False
- Returns
True if the input text contains a single-letter word, False otherwise
- Return type
bool
- Raises
ValueError – If no argument is set to True
Example
>>> from maha.cleaners.functions import contains_single_letter_word
>>> text = "cu later my friend, ك"
>>> contains_single_letter_word(text, arabic_letters=True, english_letters=True)
True
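The idea of a single-letter word can be sketched with the standard library as a letter that has whitespace (or the text boundary) on both sides. Everything here is an assumption for illustration: `has_single_letter_word` is a hypothetical name, the character ranges only approximate ARABIC_LETTERS and ENGLISH_LETTERS, and the library may treat adjacent punctuation differently.

```python
import re

# Approximate character ranges; assumptions, not the library's constants.
ARABIC_CLASS = "\u0621-\u063A\u0641-\u064A"
ENGLISH_CLASS = "a-zA-Z"


def has_single_letter_word(
    text: str, arabic_letters: bool = False, english_letters: bool = False
) -> bool:
    classes = ""
    if arabic_letters:
        classes += ARABIC_CLASS
    if english_letters:
        classes += ENGLISH_CLASS
    if not classes:
        raise ValueError("At least one argument should be True")
    # One letter with no non-whitespace character directly before or after it.
    pattern = r"(?<!\S)[%s](?!\S)" % classes
    return re.search(pattern, text) is not None
```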
- contains_expressions(text, expressions)
Check for matched strings in the given text using the input expressions.
Note
Use lookahead/lookbehind when substrings should not be captured or removed.
- Parameters
text (str) – Text to check
expressions (Union[ExpressionGroup, Expression, str]) – Expression(s) to use
- Returns
True if the pattern is found in the given text, False otherwise.
- Return type
bool
- Raises
ValueError – If expressions is not of type Expression, ExpressionGroup or str
Example
>>> from maha.cleaners.functions import contains_expressions
>>> text = "علم الهندسة (Engineering)"
>>> contains_expressions(text, r"\([A-Za-z]+\)")
True
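The note about lookahead/lookbehind can be illustrated with Python's standard `re` module, independently of the library. In this sketch the first pattern consumes the parentheses, while the second only requires them to be present, so they stay out of the matched substring.

```python
import re

text = "علم الهندسة (Engineering)"

# Plain pattern: the parentheses are part of the match.
with_parens = re.search(r"\([A-Za-z]+\)", text)

# Lookbehind/lookahead: the parentheses must surround the word,
# but they are not included in the matched text.
without_parens = re.search(r"(?<=\()[A-Za-z]+(?=\))", text)

print(with_parens.group())     # (Engineering)
print(without_parens.group())  # Engineering
```

Both patterns answer "is it there?" the same way; the difference only matters when the match is later extracted or removed.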
- contain_strings(text, strings)
Check for the input strings in the given text.
- Parameters
text (str) – Text to check
strings (Union[List[str], str]) – String or list of strings to check for
- Returns
True if the input string(s) are found in the text, False otherwise
- Return type
bool
- Raises
ValueError – If no strings are provided
Example
>>> from maha.cleaners.functions import contain_strings
>>> text = "الله أكبر، الحمد لله رب العالمين"
>>> contain_strings(text, "الله")
True
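A plain-Python approximation of this check is shown below. It assumes that a list matches when any one of its strings occurs in the text; the description above does not say whether any or all strings are required, so treat that as an assumption. `has_any_string` is a hypothetical name, not part of the library.

```python
from typing import List, Union


def has_any_string(text: str, strings: Union[List[str], str]) -> bool:
    """Return True if at least one of the given strings occurs in text."""
    if not strings:
        raise ValueError("No strings were provided")
    if isinstance(strings, str):
        # Accept a single string as well as a list of strings.
        strings = [strings]
    return any(s in text for s in strings)
```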