maha.cleaners.functions
Submodules
Package Contents
Functions
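contains | Check for certain characters, strings or patterns in the given text.
contains_expressions | Check for matched strings in the given text using the input expressions.
contain_strings | Check for the input strings in the given text.
contains_repeated_substring | Check for consecutive substrings that are repeated at least min_repeated times.
contains_single_letter_word | Check for a single-letter word.
keep | Keeps only certain characters in the given text and removes everything else.
keep_strings | Keeps only the input strings in the given text.
keep_arabic_letters | Keeps only Arabic letters ARABIC_LETTERS in the given text.
keep_arabic_characters | Keeps only common Arabic characters ARABIC in the given text.
keep_arabic_with_english_numbers | Keeps only common Arabic characters ARABIC and English numbers ENGLISH_NUMBERS in the given text.
keep_arabic_letters_with_harakat | Keeps only Arabic letters ARABIC_LETTERS and harakat HARAKAT in the given text.
normalize | Normalizes characters in the given text.
normalize_lam_alef | Normalize LAM_ALEF_VARIATIONS to LAM_ALEF_VARIATIONS_NORMALIZED if keep_hamza is True.
normalize_small_alef | Normalize ALEF_SUPERSCRIPT to ALEF.
numbers_to_text | Converts numbers in text to their equivalent text in Arabic.
remove | Removes certain characters from the given text.
remove_strings | Removes the input strings from the given text.
remove_extra_spaces | Keeps a maximum of max_spaces spaces when extra spaces are present.
remove_punctuations | Removes all punctuations PUNCTUATIONS from the given text.
remove_english | Removes all English characters ENGLISH from the given text.
remove_all_harakat | Removes all harakat ALL_HARAKAT from the given text.
remove_harakat | Removes common harakat HARAKAT from the given text.
remove_numbers | Removes all numbers NUMBERS from the given text.
remove_tatweel | Removes tatweel symbol TATWEEL from the given text.
remove_expressions | Removes matched characters from the given text using the input patterns.
remove_emails | Removes emails using pattern EXPRESSION_EMAILS from the given text.
remove_hashtags | Removes hashtags (strings that start with # symbol) using pattern EXPRESSION_HASHTAGS from the given text.
remove_links | Removes links using pattern EXPRESSION_LINKS from the given text.
remove_mentions | Removes mentions (strings that start with @ symbol) using pattern EXPRESSION_MENTIONS from the given text.
reduce_repeated_substring | Reduces consecutive substrings that are repeated at least min_repeated times to reduce_to times.
remove_hash_keep_tag | Removes the hash symbol HASHTAG from all hashtags in the given text.
remove_arabic_letter_dots | Remove dots from ARABIC_LETTERS in the given text using the ARABIC_DOTLESS_MAP.
replace | Replaces the input strings in the given text with the given value.
replace_except | Replaces everything except the input strings in the given text with the given value.
replace_pairs | Replaces each key with its corresponding value in the given text.
replace_expression | Matches characters from the input text using the given expression and replaces all matched characters with the given value.
arabic_numbers_to_english | Converts Arabic numbers ARABIC_NUMBERS to the corresponding English numbers ENGLISH_NUMBERS.
connect_single_letter_word | Connects single-letter word with the letter following it.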
- contains(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator=None)
Check for certain characters, strings or patterns in the given text.
To add a new parameter, make sure that its name is the same as the corresponding constant (in lowercase). For patterns, drop the EXPRESSION_ prefix of the constant to get the parameter name.
- Parameters
text (str) – Text to check
arabic (bool, optional) – Check for ARABIC characters, by default False
english (bool, optional) – Check for ENGLISH characters, by default False
arabic_letters (bool, optional) – Check for ARABIC_LETTERS characters, by default False
english_letters (bool, optional) – Check for ENGLISH_LETTERS characters, by default False
english_small_letters (bool, optional) – Check for ENGLISH_SMALL_LETTERS characters, by default False
english_capital_letters (bool, optional) – Check for ENGLISH_CAPITAL_LETTERS characters, by default False
numbers (bool, optional) – Check for NUMBERS characters, by default False
harakat (bool, optional) – Check for HARAKAT characters, by default False
all_harakat (bool, optional) – Check for ALL_HARAKAT characters, by default False
tatweel (bool, optional) – Check for TATWEEL character, by default False
lam_alef_variations (bool, optional) – Check for LAM_ALEF_VARIATIONS characters, by default False
lam_alef (bool, optional) – Check for LAM_ALEF character, by default False
punctuations (bool, optional) – Check for PUNCTUATIONS characters, by default False
arabic_numbers (bool, optional) – Check for ARABIC_NUMBERS characters, by default False
english_numbers (bool, optional) – Check for ENGLISH_NUMBERS characters, by default False
arabic_punctuations (bool, optional) – Check for ARABIC_PUNCTUATIONS characters, by default False
english_punctuations (bool, optional) – Check for ENGLISH_PUNCTUATIONS characters, by default False
arabic_ligatures (bool, optional) – Check for ARABIC_LIGATURES words, by default False
persian (bool, optional) – Check for PERSIAN characters, by default False
arabic_hashtags (bool, optional) – Check for Arabic hashtags using the expression EXPRESSION_ARABIC_HASHTAGS, by default False
arabic_mentions (bool, optional) – Check for Arabic mentions using the expression EXPRESSION_ARABIC_MENTIONS, by default False
emails (bool, optional) – Check for emails using the expression EXPRESSION_EMAILS, by default False
english_hashtags (bool, optional) – Check for English hashtags using the expression EXPRESSION_ENGLISH_HASHTAGS, by default False
english_mentions (bool, optional) – Check for English mentions using the expression EXPRESSION_ENGLISH_MENTIONS, by default False
hashtags (bool, optional) – Check for hashtags using the expression EXPRESSION_HASHTAGS, by default False
links (bool, optional) – Check for links using the expression EXPRESSION_LINKS, by default False
mentions (bool, optional) – Check for mentions using the expression EXPRESSION_MENTIONS, by default False
emojis (bool, optional) – Check for emojis using the expression EXPRESSION_EMOJIS, by default False
custom_strings (Union[List[str], str], optional) – Include any other string(s), by default None
custom_expressions (ExpressionGroup | Expression | None) – Include any other expressions, by default None
operator (str, optional) – When multiple arguments are set to True, this operator is used to combine the output into a boolean. Takes ‘and’ or ‘or’, by default None
- Returns
If one argument is set to True, a boolean value is returned. True if the text contains it, False otherwise.
If more than one argument is set to True and operator is not set, a dictionary is returned where keys are the True passed arguments and the corresponding values are booleans. True if the text contains the argument, False otherwise.
If operator is set and more than one argument is set to True, a boolean value that combines the results with the “and/or” operator is returned.
- Return type
Union[Dict[str, bool], bool]
- Raises
ValueError – If no argument is set to True
Examples
>>> from maha.cleaners.functions import contains
>>> text = "مقاييس أداء النماذج في التعلم الآلي Machine Learning ... 🌺"
>>> contains(text, english=True, emails=True, emojis=True)
{'english': True, 'emails': False, 'emojis': True}
>>> from maha.cleaners.functions import contains
>>> text = "قال رسول اللهﷺ إن خير أيامكم يوم الجمعة فأكثروا عليَّ من الصلاة فيه"
>>> contains(text, english=True)
False
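The three return shapes described for contains can be sketched in plain Python. This is only an illustration of the documented behaviour, not maha's implementation; combine_results is a hypothetical helper name, and the error messages are made up.

```python
from typing import Dict, Optional, Union


def combine_results(
    results: Dict[str, bool], operator: Optional[str] = None
) -> Union[Dict[str, bool], bool]:
    """Hypothetical helper: collapse per-argument checks as the Returns section describes."""
    if not results:
        # Mirrors the documented ValueError when no argument is set to True.
        raise ValueError("At least one argument should be True")
    if len(results) == 1:
        # A single check yields a plain boolean.
        return next(iter(results.values()))
    if operator is None:
        # Several checks and no operator: return the dictionary unchanged.
        return results
    if operator == "and":
        return all(results.values())
    if operator == "or":
        return any(results.values())
    raise ValueError("operator must be 'and' or 'or'")


checks = {"english": True, "emails": False, "emojis": True}
print(combine_results(checks))         # the dictionary itself
print(combine_results(checks, "and"))  # False
print(combine_results(checks, "or"))   # True
```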
- contains_expressions(text, expressions)
Check for matched strings in the given text using the input expressions.
Note
Use lookahead/lookbehind when substrings should not be captured or removed.
- Parameters
text (str) – Text to check
expressions (Union[ExpressionGroup, Expression, str]) – Expression(s) to use
- Returns
True if the pattern is found in the given text, False otherwise.
- Return type
bool
- Raises
ValueError – If expressions are not of type Expression, ExpressionGroup or str
Example
>>> from maha.cleaners.functions import contains_expressions
>>> text = "علم الهندسة (Engineering)"
>>> contains_expressions(text, r"\([A-Za-z]+\)")
True
- contain_strings(text, strings)
Check for the input strings in the given text.
- Parameters
text (str) – Text to check
strings (Union[List[str], str]) – String or list of strings to check for
- Returns
True if the input string(s) are found in the text, False otherwise
- Return type
bool
- Raises
ValueError – If no strings are provided
Example
>>> from maha.cleaners.functions import contain_strings
>>> text = "الله أكبر، الحمد لله رب العالمين"
>>> contain_strings(text, "الله")
True
- contains_repeated_substring(text, min_repeated=3)
Check for consecutive substrings that are repeated at least min_repeated times. For example, with the default arguments, the text ‘hhhhhh’ returns True.
- Parameters
text (str) – Text to check
min_repeated (int, optional) – Minimum number of consecutive repetitions of a substring to consider, by default 3
- Returns
True if the input text contains consecutive repeated substrings, otherwise False
- Return type
bool
- Raises
ValueError – If a non-positive integer is passed
Example
>>> from maha.cleaners.functions import contains_repeated_substring
>>> text = "كانت اللعبة حللللللللوة جداً"
>>> contains_repeated_substring(text)
True
- contains_single_letter_word(text, arabic_letters=False, english_letters=False)
Check for a single-letter word. For example, “how r u” should return True if english_letters is set to True because it contains two single-letter words, “r” and “u”.
- Parameters
text (str) – Text to check
arabic_letters (bool, optional) – Check for all ARABIC_LETTERS, by default False
english_letters (bool, optional) – Check for all ENGLISH_LETTERS, by default False
- Returns
True if the input text contains a single-letter word, False otherwise
- Return type
bool
- Raises
ValueError – If no argument is set to True
Example
>>> from maha.cleaners.functions import contains_single_letter_word
>>> text = "cu later my friend, ك"
>>> contains_single_letter_word(text, arabic_letters=True, english_letters=True)
True
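The idea of a single-letter word can be illustrated with the standard re module. This sketch covers English letters only and assumes a word is a run of non-space characters; the pattern and the helper name has_single_english_letter_word are illustrative and are not taken from maha.

```python
import re

# Assumption: a single-letter word is one letter with whitespace
# (or the start/end of the text) on both sides.
SINGLE_ENGLISH_LETTER = re.compile(r"(?<!\S)[A-Za-z](?!\S)")


def has_single_english_letter_word(text: str) -> bool:
    """Hypothetical helper: True if any standalone English letter is present."""
    return SINGLE_ENGLISH_LETTER.search(text) is not None


print(has_single_english_letter_word("how r u"))      # True
print(has_single_english_letter_word("how are you"))  # False
```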
- keep(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, use_space=True, custom_strings=None)
Keeps only certain characters in the given text and removes everything else.
To add a new parameter, make sure that its name is the same as the corresponding constant.
- Parameters
text (str) – Text to be processed
arabic (bool, optional) – Keep ARABIC characters, by default False
english (bool, optional) – Keep ENGLISH characters, by default False
arabic_letters (bool, optional) – Keep ARABIC_LETTERS characters, by default False
english_letters (bool, optional) – Keep ENGLISH_LETTERS characters, by default False
english_small_letters (bool, optional) – Keep ENGLISH_SMALL_LETTERS characters, by default False
english_capital_letters (bool, optional) – Keep ENGLISH_CAPITAL_LETTERS characters, by default False
numbers (bool, optional) – Keep NUMBERS characters, by default False
harakat (bool, optional) – Keep HARAKAT characters, by default False
all_harakat (bool, optional) – Keep ALL_HARAKAT characters, by default False
punctuations (bool, optional) – Keep PUNCTUATIONS characters, by default False
arabic_numbers (bool, optional) – Keep ARABIC_NUMBERS characters, by default False
english_numbers (bool, optional) – Keep ENGLISH_NUMBERS characters, by default False
arabic_punctuations (bool, optional) – Keep ARABIC_PUNCTUATIONS characters, by default False
english_punctuations (bool, optional) – Keep ENGLISH_PUNCTUATIONS characters, by default False
use_space (bool, optional) – False to not replace with space, check keep_strings() for more information, by default True
custom_strings (List[str], optional) – Include any other string(s), by default None
- Returns
Processed text
- Return type
str
- Raises
ValueError – If no argument is set to True
Example
>>> from maha.cleaners.functions import keep
>>> text = "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
>>> keep(text, arabic_letters=True)
'بسم الله الرحمن الرحيم'
- keep_strings(text, strings, use_space=True)
Keeps only the input strings in the given text.
By default, this works by replacing all strings except the input strings with a space, which means space is kept. This is to help separate texts when unwanted strings are present without spaces. For example, ‘end.start’ will be converted to ‘end start’ if English letters ENGLISH_LETTERS are passed to strings. To disable this behavior, set use_space to False.
Note
Extra spaces (more than one space) are removed by default if use_space is set to True.
- Parameters
text (str) – Text to be processed
strings (Union[List[str], str]) – list of strings to keep
use_space (bool) – False to not replace with space, defaults to True
- Returns
Text that contains only the input strings.
- Return type
str
- Raises
ValueError – If no strings are provided
Example
>>> from maha.cleaners.functions import keep_strings
>>> text = "لا حول ولا قوة إلا بالله"
>>> keep_strings(text, "الله")
'الله'
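The effect of use_space on the ‘end.start’ case can be reproduced with a small standard-library sketch. It is an assumption-laden stand-in rather than maha's code: keep_only is a hypothetical name, it accepts a list only, and trimming the ends of the result is assumed from the example output above.

```python
import re
import string
from typing import List


def keep_only(text: str, strings: List[str], use_space: bool = True) -> str:
    """Illustrative stand-in (not maha's code): keep only the given strings."""
    if not strings:
        raise ValueError("'strings' should be provided.")
    # Try longer strings first so they win over their own prefixes.
    wanted = "|".join(re.escape(s) for s in sorted(strings, key=len, reverse=True))
    # Each match is either a wanted string (group 1) or one unwanted character.
    token = re.compile("(%s)|." % wanted, flags=re.DOTALL)
    filler = " " if use_space else ""
    result = token.sub(
        lambda m: m.group(1) if m.group(1) is not None else filler, text
    )
    if use_space:
        # Collapse runs of filler spaces and trim the ends (assumed behaviour).
        result = re.sub(r" {2,}", " ", result).strip()
    return result


letters = list(string.ascii_letters)
print(keep_only("end.start", letters))                   # 'end start'
print(keep_only("end.start", letters, use_space=False))  # 'endstart'
```

Replacing with a space first and collapsing afterwards is what keeps 'end' and 'start' from fusing into one token.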
- keep_arabic_letters(text)
Keeps only Arabic letters ARABIC_LETTERS in the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text that contains Arabic letters only.
- Return type
str
Example
>>> from maha.cleaners.functions import keep_arabic_letters
>>> text = " 1 يا أحلى mathematicians في العالم"
>>> keep_arabic_letters(text)
'يا أحلى في العالم'
- keep_arabic_characters(text)
Keeps only common Arabic characters ARABIC in the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text that contains the common Arabic characters only.
- Return type
str
Example
>>> from maha.cleaners.functions import keep_arabic_characters
>>> text = "أَلمَانِيَا (بالألمانية: Deutschland) رسمِيّاً جُمهُورِيَّة أَلمَانِيَا الاِتِّحَاديَّة"
>>> keep_arabic_characters(text)
'أَلمَانِيَا بالألمانية رسمِيّاً جُمهُورِيَّة أَلمَانِيَا الاِتِّحَاديَّة'
- keep_arabic_with_english_numbers(text)
Keeps only common Arabic characters ARABIC and English numbers ENGLISH_NUMBERS in the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text that contains the common Arabic characters and English numbers only.
- Return type
str
Example
>>> from maha.cleaners.functions import keep_arabic_with_english_numbers
>>> text = "تتكون من 16 ولاية تُغطي مساحة 357,021 كيلومتر Deutschland"
>>> keep_arabic_with_english_numbers(text)
'تتكون من 16 ولاية تُغطي مساحة 357 021 كيلومتر'
- keep_arabic_letters_with_harakat(text)
Keeps only Arabic letters ARABIC_LETTERS and harakat HARAKAT in the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text that contains Arabic letters with harakat only.
- Return type
str
Example
>>> from maha.cleaners.functions import keep_arabic_letters_with_harakat
>>> text = "إنّ في التّركِ قوة…"
>>> keep_arabic_letters_with_harakat(text)
'إنّ في التّركِ قوة'
- normalize(text, lam_alef=None, alef=None, waw=None, yeh=None, teh_marbuta=None, ligatures=None, spaces=None, all=False)
Normalizes characters in the given text
- Parameters
text (str) – Text to process
lam_alef (bool, optional) – Normalize LAM_ALEF_VARIATIONS characters to LAM and ALEF, by default None
alef (bool, optional) – Normalize ALEF_VARIATIONS characters to ALEF, by default None
waw (bool, optional) – Normalize WAW_VARIATIONS characters to WAW, by default None
yeh (bool, optional) – Normalize YEH_VARIATIONS characters to YEH and ALEF, by default None
teh_marbuta (bool, optional) – Normalize TEH_MARBUTA characters to HEH, by default None
ligatures (bool, optional) – Normalize ARABIC_LIGATURES characters to the corresponding indices in ARABIC_LIGATURES_NORMALIZED, by default None
spaces (bool, optional) – Normalize space variations using the expression EXPRESSION_ALL_SPACES, by default None
all (bool, optional) – Do all normalization except the ones that are set to False, by default False
- Returns
Processed text
- Return type
str
- Raises
ValueError – If no argument is set to True
Examples
>>> from maha.cleaners.functions import normalize
>>> text = "عن أبي هريرة"
>>> normalize(text, alef=True, teh_marbuta=True)
'عن ابي هريره'
>>> from maha.cleaners.functions import normalize
>>> text = "قال رسول الله ﷺ"
>>> normalize(text, ligatures=True)
'قال رسول الله صلى الله عليه وسلم'
>>> from maha.cleaners.functions import normalize
>>> text = "قال مؤمن: ﷽ قل هو ﷲ أحد"
... # For space
>>> normalize(text, all=True, waw=False)
'قال مؤمن: بسم الله الرحمن الرحيم قل هو الله احد'
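The None/True/False defaults together with all can be read as a three-state switch. The sketch below shows one way such flags could be resolved; it is an assumption about the logic, not maha's code, and resolve_flags and enable_all (standing in for all, to avoid shadowing the builtin) are hypothetical names.

```python
from typing import Dict, Optional


def resolve_flags(enable_all: bool = False, **flags: Optional[bool]) -> Dict[str, bool]:
    """Hypothetical helper: decide which normalizations run.

    With enable_all, every flag runs unless it was explicitly set to False.
    Without it, only flags explicitly set to True run.
    """
    resolved = {
        name: (value is not False) if enable_all else bool(value)
        for name, value in flags.items()
    }
    if not any(resolved.values()):
        # Mirrors the documented ValueError when nothing is enabled.
        raise ValueError("At least one argument should be True")
    return resolved


print(resolve_flags(alef=True, teh_marbuta=True, waw=None))
# {'alef': True, 'teh_marbuta': True, 'waw': False}
print(resolve_flags(enable_all=True, alef=None, waw=False))
# {'alef': True, 'waw': False}
```

Using None rather than False as the default is what lets an explicit False be told apart from "not specified".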
- normalize_lam_alef(text, keep_hamza=True)
Normalize LAM_ALEF_VARIATIONS to LAM_ALEF_VARIATIONS_NORMALIZED if keep_hamza is True. Otherwise, normalize to LAM and ALEF.
- Parameters
text (str) – Text to process
keep_hamza (bool, optional) – True to preserve hamza and madda characters, by default True
- Returns
Normalized text
- Return type
str
Examples
>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "السﻻم عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَﻷلأ وَجْهُه"
>>> normalize_lam_alef(text)
'السلام عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَلألأ وَجْهُه'
>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "اﻵن يا أصحابي"
>>> normalize_lam_alef(text, keep_hamza=False)
'الان يا أصحابي'
- normalize_small_alef(text, keep_madda=True, normalize_end=False)
Normalize ALEF_SUPERSCRIPT to ALEF. If keep_madda is True and ALEF_SUPERSCRIPT is followed by HAMZA_ABOVE, then normalize to ALEF_MADDA_ABOVE.
- Parameters
text (str) – Text to process
keep_madda (bool, optional) – True to preserve madda character, by default True
normalize_end (bool, optional) – True to normalize ALEF_SUPERSCRIPT that appear at the end of a word, by default False
- Returns
Normalized text
- Return type
str
Example
>>> from maha.cleaners.functions import normalize_small_alef
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا"
>>> normalize_small_alef(text)
'وَٱلصَّآفَّاتِ صَفّٗا'
- numbers_to_text(text, accusative=False)
Converts numbers in text to their equivalent text in Arabic.
- Parameters
text (str) – Text with numbers to be converted.
accusative (bool, optional) – If True, the number will be converted to its accusative form, by default False
- Returns
Text with numbers converted to their equivalent text in Arabic.
- Return type
str
- remove(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, use_space=True, custom_strings=None, custom_expressions=None)
Removes certain characters from the given text.
To add a new parameter, make sure that its name is the same as the corresponding constant (in lowercase). For patterns, drop the EXPRESSION_ prefix of the constant to get the parameter name.
- Parameters
text (str) – Text to be processed
arabic (bool, optional) – Remove ARABIC characters, by default False
english (bool, optional) – Remove ENGLISH characters, by default False
arabic_letters (bool, optional) – Remove ARABIC_LETTERS characters, by default False
english_letters (bool, optional) – Remove ENGLISH_LETTERS characters, by default False
english_small_letters (bool, optional) – Remove ENGLISH_SMALL_LETTERS characters, by default False
english_capital_letters (bool, optional) – Remove ENGLISH_CAPITAL_LETTERS characters, by default False
numbers (bool, optional) – Remove NUMBERS characters, by default False
harakat (bool, optional) – Remove HARAKAT characters, by default False
all_harakat (bool, optional) – Remove ALL_HARAKAT characters, by default False
tatweel (bool, optional) – Remove TATWEEL character, by default False
punctuations (bool, optional) – Remove PUNCTUATIONS characters, by default False
arabic_numbers (bool, optional) – Remove ARABIC_NUMBERS characters, by default False
english_numbers (bool, optional) – Remove ENGLISH_NUMBERS characters, by default False
arabic_punctuations (bool, optional) – Remove ARABIC_PUNCTUATIONS characters, by default False
english_punctuations (bool, optional) – Remove ENGLISH_PUNCTUATIONS characters, by default False
arabic_ligatures (bool, optional) – Remove ARABIC_LIGATURES words, by default False
arabic_hashtags (bool, optional) – Remove Arabic hashtags using the expression EXPRESSION_ARABIC_HASHTAGS, by default False
arabic_mentions (bool, optional) – Remove Arabic mentions using the expression EXPRESSION_ARABIC_MENTIONS, by default False
emails (bool, optional) – Remove emails using the expression EXPRESSION_EMAILS, by default False
english_hashtags (bool, optional) – Remove English hashtags using the expression EXPRESSION_ENGLISH_HASHTAGS, by default False
english_mentions (bool, optional) – Remove English mentions using the expression EXPRESSION_ENGLISH_MENTIONS, by default False
hashtags (bool, optional) – Remove hashtags using the expression EXPRESSION_HASHTAGS, by default False
links (bool, optional) – Remove links using the expression EXPRESSION_LINKS, by default False
mentions (bool, optional) – Remove mentions using the expression EXPRESSION_MENTIONS, by default False
emojis (bool, optional) – Remove emojis using the expression EXPRESSION_EMOJIS, by default False
use_space (bool, optional) – False to not replace with space, check remove_strings() for more information, by default True
custom_strings (list[str] | str | None) – Include any other string(s), by default None
custom_expressions (Union[ExpressionGroup, Expression, str]) – Include any other regular expressions, by default None
- Returns
Processed text
- Return type
str
- Raises
ValueError – If no argument is set to True
Examples
>>> from maha.cleaners.functions import remove
>>> text = "ويندوز 11 سيدعم تطبيقات نظام أندرويد. #Windows11"
>>> remove(text, hashtags=True)
'ويندوز 11 سيدعم تطبيقات نظام أندرويد.'
>>> from maha.cleaners.functions import remove
>>> text = "قَالَ رَبِّ اشْرَحْ لِي صَدْرِي.."
>>> remove(text, all_harakat=True, punctuations=True)
'قال رب اشرح لي صدري'
- remove_strings(text, strings, use_space=True)
Removes the input strings from the given text.
This works by replacing all input strings with a space, which means space cannot be removed. This is to help separate texts when unwanted strings are present without spaces. For example, ‘end.start’ will be converted to ‘end start’ if dot DOT is passed to strings. To disable this behavior, set use_space to False.
Note
Extra spaces (more than one space) are removed by default if use_space is set to True.
- Parameters
text (str) – Text to be processed
strings (Union[List[str], str]) – list of strings to remove
use_space (bool) – False to not replace with space, defaults to True
- Returns
Text with input strings removed.
- Return type
str
- Raises
ValueError – If no strings are provided
Example
>>> from maha.cleaners.functions import remove_strings
>>> text = "ومن الكلمات المحظورة السلاح"
>>> remove_strings(text, "السلاح")
'ومن الكلمات المحظورة'
- remove_extra_spaces(text, max_spaces=1)
Keeps a maximum of max_spaces number of spaces when extra spaces are present (more than one space).
- Parameters
text (str) – Text to be processed
max_spaces (int, optional) – Maximum number of spaces to keep, by default 1
- Returns
Text with extra spaces removed
- Return type
str
- Raises
ValueError – When a negative or float value is assigned to max_spaces
Example
>>> from maha.cleaners.functions import remove_extra_spaces
>>> text = "وكان صديقنا العزيز محمد من أفضل الأشخاص الذين قابلتهم"
>>> remove_extra_spaces(text)
'وكان صديقنا العزيز محمد من أفضل الأشخاص الذين قابلتهم'
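Capping runs of spaces at max_spaces can be sketched with a single regular expression. This is a rough illustration and not maha's implementation: collapse_spaces is a hypothetical name, and rejecting 0 is an assumption, since the documentation only mentions negative and float values.

```python
import re


def collapse_spaces(text: str, max_spaces: int = 1) -> str:
    """Hypothetical helper: shorten any run longer than max_spaces to max_spaces."""
    if isinstance(max_spaces, bool) or not isinstance(max_spaces, int) or max_spaces < 1:
        # Assumption: 0 is treated as invalid along with negatives and floats.
        raise ValueError("max_spaces must be a positive integer")
    # Match only runs that are strictly longer than the allowed maximum.
    pattern = " {%d,}" % (max_spaces + 1)
    return re.sub(pattern, " " * max_spaces, text)


print(collapse_spaces("a    b  c"))     # 'a b c'
print(collapse_spaces("a    b  c", 2))  # 'a  b  c'
```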
- remove_punctuations(text)
Removes all punctuations PUNCTUATIONS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with punctuations removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_punctuations
>>> text = "مثال على الرموز الخاصة كالتالي $ ^ & * ( ) ! @"
>>> remove_punctuations(text)
'مثال على الرموز الخاصة كالتالي'
- remove_english(text)
Removes all English characters ENGLISH from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with English removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_english
>>> text = "ومن أفضل الجامعات هي جامعة إكسفورد (Oxford University)"
>>> remove_english(text)
'ومن أفضل الجامعات هي جامعة إكسفورد'
- remove_all_harakat(text)
Removes all harakat ALL_HARAKAT from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with all harakat removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_all_harakat
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا (1) فَٱلزَّٰجِرَٰتِ زَجۡرٗا"
>>> remove_all_harakat(text)
'وٱلصفت صفا (1) فٱلزجرت زجرا'
- remove_harakat(text)
Removes common harakat HARAKAT from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with common harakat removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_harakat
>>> text = "ألا تَرَى: كلَّ مَنْ تَرجو وتَأمَلُهُ مِنَ البَرِيَّةِ (مسكينُ بْنُ مسكينِ)"
>>> remove_harakat(text)
'ألا ترى: كل من ترجو وتأمله من البرية (مسكين بن مسكين)'
- remove_numbers(text)
Removes all numbers NUMBERS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with numbers removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_numbers
>>> text = "ورقم أبو تريكة في نادي الأهلي هو إثنين وعشرين (22)"
>>> remove_numbers(text)
'ورقم أبو تريكة في نادي الأهلي هو إثنين وعشرين ( )'
- remove_tatweel(text)
Removes tatweel symbol TATWEEL from the given text.
- Parameters
text (str) – Text to process
- Returns
Text with tatweel symbol removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_tatweel
>>> text = "الحمــــــــد لله رب العــــــــــــالمـــــــيـــــن"
>>> remove_tatweel(text)
'الحمد لله رب العالمين'
- remove_expressions(text, patterns, remove_spaces=True)
Removes matched characters from the given text using the input patterns.
Note
Use lookahead/lookbehind when substrings should not be captured or removed.
- Parameters
text (str) – Text to process
patterns (Expression | ExpressionGroup | str) – Expression(s) to use
remove_spaces (bool, optional) – False to keep extra spaces, defaults to True
- Returns
Text with matched characters removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_expressions
>>> text = "الأميرُ الغازي أرطُغرُل، أو اختصارًا أرطغرل (بالتركية: Ertuğrul)"
>>> remove_expressions(text, r"\(.*\)")
'الأميرُ الغازي أرطُغرُل، أو اختصارًا أرطغرل'
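The note about lookahead/lookbehind can be seen with the standard re module alone. The sample text and patterns below are invented for illustration; the same kind of pattern string could presumably be passed as patterns, but that is not verified here.

```python
import re

text = "price: 100 USD, weight: 20 KG"

# Plain pattern: the unit is part of the match, so it is removed too.
without_lookahead = re.sub(r"\d+ USD", "", text)
print(without_lookahead)  # 'price: , weight: 20 KG'

# Lookahead: " USD" must follow, but it is not part of the match and survives.
with_lookahead = re.sub(r"\d+(?= USD)", "", text)
print(with_lookahead)     # 'price:  USD, weight: 20 KG'
```

The lookahead leaves a double space behind, which is the kind of residue the remove_spaces option is described as cleaning up.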
- remove_emails(text)
Removes emails using pattern EXPRESSION_EMAILS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with emails removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_emails
>>> text = "يمكن استخدام الإيميل الشخصي، كمثال user1998@gmail.com"
>>> remove_emails(text)
'يمكن استخدام الإيميل الشخصي، كمثال'
- remove_hashtags(text)
Removes hashtags (strings that start with # symbol) using pattern EXPRESSION_HASHTAGS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with hashtags removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_hashtags
>>> text = "ويمكن القول أن مكة المكرمة من أجمل المناطق على وجه الأرض #السعودية"
>>> remove_hashtags(text)
'ويمكن القول أن مكة المكرمة من أجمل المناطق على وجه الأرض'
- remove_links(text)
Removes links using pattern EXPRESSION_LINKS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with links removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_links
>>> text = "لمشاهدة آخر التطورات يرجى زيارة الموقع التالي: https://github.com/TRoboto/Maha"
>>> remove_links(text)
'لمشاهدة آخر التطورات يرجى زيارة الموقع التالي:'
- remove_mentions(text)
Removes mentions (strings that start with @ symbol) using pattern EXPRESSION_MENTIONS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with mentions removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_mentions
>>> text = "@test لو سمحت صديقنا تزورنا على المعرض لاستلام الجائزة"
>>> remove_mentions(text)
'لو سمحت صديقنا تزورنا على المعرض لاستلام الجائزة'
- reduce_repeated_substring(text, min_repeated=3, reduce_to=2)
Reduces consecutive substrings that are repeated at least min_repeated times to reduce_to times. For example, with the default arguments, ‘hhhhhh’ is reduced to ‘hh’.
TODO: Maybe change the implementation for 50x speed https://stackoverflow.com/questions/29481088/how-can-i-tell-if-a-string-repeats-itself-in-python/29489919#29489919
- Parameters
text (str) – Text to process
min_repeated (int, optional) – Minimum number of consecutive repetitions of a substring to consider, by default 3
reduce_to (int, optional) – Number of repetitions to keep, by default 2
- Returns
Processed text
- Return type
str
- Raises
ValueError – If a non-positive integer is passed or reduce_to is greater than min_repeated
Examples
>>> from maha.cleaners.functions import reduce_repeated_substring
>>> text = "ههههههههههههههه"
>>> reduce_repeated_substring(text)
'هه'
>>> from maha.cleaners.functions import reduce_repeated_substring
>>> text = "ويييييييييين راححححححححححححوا"
>>> reduce_repeated_substring(text, reduce_to=1)
'وين راحوا'
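One way to get this behaviour is a backreference pattern. The sketch below is an assumption about a possible approach, not the library's actual implementation (which the TODO above suggests may change); reduce_repeats is a hypothetical name.

```python
import re


def reduce_repeats(text: str, min_repeated: int = 3, reduce_to: int = 2) -> str:
    """Hypothetical helper: shrink runs of a repeated substring."""
    if min_repeated < 1 or reduce_to < 1:
        raise ValueError("min_repeated and reduce_to must be positive integers")
    if reduce_to > min_repeated:
        raise ValueError("reduce_to cannot be greater than min_repeated")
    # (.+?) captures the shortest repeating unit; \1{n,} requires at least
    # n further copies, i.e. min_repeated occurrences in total.
    pattern = re.compile(r"(.+?)\1{%d,}" % (min_repeated - 1))
    return pattern.sub(lambda m: m.group(1) * reduce_to, text)


print(reduce_repeats("hhhhhh"))                       # 'hh'
print(reduce_repeats("hahaha"))                       # 'haha'
print(reduce_repeats("sooooo goooood", reduce_to=1))  # 'so god'
```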
- remove_hash_keep_tag(text)
Removes the hash symbol HASHTAG from all hashtags in the given text.
- Parameters
text (str) – Text to process
- Returns
Text with the hash symbol removed from hashtags; the tag text itself is kept.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_hash_keep_tag
>>> text = "We love #Jordan very much"
>>> remove_hash_keep_tag(text)
'We love Jordan very much'
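Dropping the hash while keeping the tag can be approximated with a lookahead. The sketch assumes a hashtag is a '#' immediately followed by a word character, which may differ from what maha's hashtag expression accepts; strip_hash is a hypothetical name.

```python
import re


def strip_hash(text: str) -> str:
    """Hypothetical helper: remove '#' only when it starts a tag."""
    # The lookahead checks for a following word character without consuming it,
    # so the tag text stays in place.
    return re.sub(r"#(?=\w)", "", text)


print(strip_hash("We love #Jordan very much"))  # 'We love Jordan very much'
print(strip_hash("Item # 4"))                   # 'Item # 4'
```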
- remove_arabic_letter_dots(text)
Remove dots from ARABIC_LETTERS in the given text using the ARABIC_DOTLESS_MAP.
- Parameters
text (str) – Text to be processed
- Returns
Text with dotless Arabic letters
- Return type
str
Example
>>> from maha.cleaners.functions import remove_arabic_letter_dots
>>> text = "الحَمدُ للهِ الَّذي بنِعمتِه تَتمُّ الصَّالحاتُ"
>>> remove_arabic_letter_dots(text)
'الحَمدُ للهِ الَّدى ٮٮِعمٮِه ٮَٮمُّ الصَّالحاٮُ'
- replace(text, strings, with_value)
Replaces the input strings in the given text with the given value.
- Parameters
text (str) – Text to process
strings (list[str] | str) – Strings to replace
with_value (str) – Value to replace the input strings with
- Returns
Processed text
- Return type
str
Examples
>>> from maha.cleaners.functions import replace
>>> text = "حصل الولد على معدل 50%"
>>> replace(text, "%", " بالمئة")
'حصل الولد على معدل 50 بالمئة'
>>> from maha.cleaners.functions import replace
>>> text = "ولقد كلف هذا المنتج 100 $"
>>> replace(text, "$", "دولار")
'ولقد كلف هذا المنتج 100 دولار'
- replace_except(text, strings, with_value)
Replaces everything except the input strings in the given text with the given value.
- Parameters
text (str) – Text to process
strings (list[str] | str) – Strings to preserve (not replace)
with_value (str) – Value to replace all other strings with.
- Returns
Processed text
- Return type
str
Example
>>> from maha.cleaners.functions import replace_except
>>> from maha.constants import ARABIC_LETTERS, SPACE, EMPTY
>>> text = "لَيتَ الذينَ تُحبُّ العيّنَ رؤيَتهم"
>>> replace_except(text, ARABIC_LETTERS + [SPACE], EMPTY)
'ليت الذين تحب العين رؤيتهم'
- replace_pairs(text, keys, values)
Replaces each key with its corresponding value in the given text
- Parameters
text (str) – Text to process
keys (list[str]) – Strings to be replaced
values (list[str]) – Strings to be replaced with
- Returns
Processed text
- Return type
str
- Raises
ValueError – If keys and values are of different lengths
Example
>>> from maha.cleaners.functions import replace_pairs
>>> text = 'شلونك يا محمد؟'
>>> replace_pairs(text, ['شلونك'] , ['كيف حالك'])
'كيف حالك يا محمد؟'
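A pairwise replacement like this can be done in a single pass so that one substitution never feeds into another. The sketch below is an illustrative assumption, not maha's code; swap_pairs is a hypothetical name, and whether the library replaces in one pass or sequentially is not stated in the documentation.

```python
import re
from typing import List


def swap_pairs(text: str, keys: List[str], values: List[str]) -> str:
    """Hypothetical helper: replace each key with the value at the same index."""
    if len(keys) != len(values):
        # Mirrors the documented ValueError for mismatched lengths.
        raise ValueError("keys and values must have the same length")
    if not keys:
        return text
    mapping = dict(zip(keys, values))
    # Longest keys first so a key is not shadowed by one of its prefixes.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in alternatives))
    # One pass over the text: replaced output is never scanned again.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(swap_pairs("a b", ["a", "b"], ["b", "a"]))  # 'b a'
```

Chaining str.replace calls would turn 'a b' into 'a a' here, which is why the single-pass form is used in this sketch.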
- replace_expression(text, expression, with_value)
Matches characters from the input text using the given expression and replaces all matched characters with the given value.
- Parameters
text (str) – Text to process
expression (Expression | ExpressionGroup | str) – Pattern/Expression used to match characters from the text
with_value (Callable[..., str] | str) – Value to replace the matched characters with
- Returns
Processed text
- Return type
str
Examples
>>> from maha.cleaners.functions import replace_expression
>>> text = "ولقد حصلت على ١٠ من ١٠ "
>>> replace_expression(text, "١٠", "عشرة")
'ولقد حصلت على عشرة من عشرة '
>>> from maha.cleaners.functions import replace_expression
>>> text = "ذهبت الفتاه إلى المدرسه"
>>> replace_expression(text, "ه( |$)", "ة ").strip()
'ذهبت الفتاة إلى المدرسة'
- arabic_numbers_to_english(text)
Converts Arabic numbers ARABIC_NUMBERS to the corresponding English numbers ENGLISH_NUMBERS.
- Parameters
text (str) – Text to process
- Returns
Processed text with all occurrences of Arabic numbers converted to English numbers
- Return type
str
Examples
>>> from maha.cleaners.functions import arabic_numbers_to_english
>>> text = "٣"
>>> arabic_numbers_to_english(text)
'3'
>>> from maha.cleaners.functions import arabic_numbers_to_english
>>> text = "١٠"
>>> arabic_numbers_to_english(text)
'10'
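A digit-for-digit conversion of this kind maps naturally onto str.translate. The sketch assumes that ARABIC_NUMBERS corresponds to the Arabic-Indic digits U+0660 to U+0669; to_ascii_digits is a hypothetical name and the code is not taken from maha.

```python
# Assumption: the Arabic digits are the ten code points U+0660..U+0669.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
TO_ASCII_DIGITS = str.maketrans(ARABIC_INDIC_DIGITS, "0123456789")


def to_ascii_digits(text: str) -> str:
    """Hypothetical helper: map each Arabic-Indic digit to its ASCII digit."""
    return text.translate(TO_ASCII_DIGITS)


print(to_ascii_digits("\u0661\u0660"))  # '10'
```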
- connect_single_letter_word(text, waw=None, feh=None, beh=None, lam=None, kaf=None, teh=None, all=None, custom_strings=None)
Connects single-letter word with the letter following it.
- Parameters
text (str) – Text to process
waw (bool, optional) – Connect WAW letter, by default None
feh (bool, optional) – Connect FEH letter, by default None
beh (bool, optional) – Connect BEH letter, by default None
lam (bool, optional) – Connect LAM letter, by default None
kaf (bool, optional) – Connect KAF letter, by default None
teh (bool, optional) – Connect TEH letter, by default None
all (bool, optional) – Connect all letters except the ones set to False, by default None
custom_strings (Union[List[str], str], optional) – Include any other string(s) to connect, by default None
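Since no example is given for this function, here is a rough standard-library sketch of what connecting a single-letter word could look like. Everything in it is an assumption: connect_single_letters is a hypothetical name, the six letters are taken to be what the WAW, FEH, BEH, LAM, KAF and TEH constants stand for, and the actual library may treat boundaries differently.

```python
import re
from typing import Iterable

# Assumption: these six letters stand for WAW, FEH, BEH, LAM, KAF and TEH.
DEFAULT_LETTERS = ("و", "ف", "ب", "ل", "ك", "ت")


def connect_single_letters(text: str, letters: Iterable[str] = DEFAULT_LETTERS) -> str:
    """Hypothetical helper: glue a standalone letter to the word after it."""
    letter_class = "".join(re.escape(letter) for letter in letters)
    # A listed letter with no non-space character before it, followed by
    # whitespace and then the start of another word.
    pattern = re.compile(r"(?<!\S)([%s])\s+(?=\S)" % letter_class)
    return pattern.sub(r"\1", text)


print(connect_single_letters("و قال"))
```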