maha.cleaners.functions.remove_fn
Functions that operate on a string and remove certain characters.
Module Contents
Functions
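remove(text, ...) | Removes certain characters from the given text.
reduce_repeated_substring(text, min_repeated, reduce_to) | Reduces consecutive substrings that are repeated at least min_repeated times to reduce_to times.
remove_hash_keep_tag(text) | Removes the hash symbol HASHTAG from all hashtags in the given text.
remove_tatweel(text) | Removes tatweel symbol TATWEEL from the given text.
remove_emails(text) | Removes emails using pattern EXPRESSION_EMAILS from the given text.
remove_hashtags(text) | Removes hashtags (strings that start with # symbol) using pattern EXPRESSION_HASHTAGS from the given text.
remove_links(text) | Removes links using pattern EXPRESSION_LINKS from the given text.
remove_mentions(text) | Removes mentions (strings that start with @ symbol) using pattern EXPRESSION_MENTIONS from the given text.
remove_punctuations(text) | Removes all punctuation marks PUNCTUATIONS from the given text.
remove_english(text) | Removes all English characters ENGLISH from the given text.
remove_all_harakat(text) | Removes all harakat ALL_HARAKAT from the given text.
remove_harakat(text) | Removes common harakat HARAKAT from the given text.
remove_numbers(text) | Removes all numbers NUMBERS from the given text.
remove_expressions(text, patterns, remove_spaces) | Removes the characters in text that match the input patterns.
remove_strings(text, strings, use_space) | Removes the input strings from the given text.
remove_extra_spaces(text, max_spaces) | Keeps a maximum of max_spaces spaces when extra spaces are present.
remove_arabic_letter_dots(text) | Removes dots from ARABIC_LETTERS in the given text using the ARABIC_DOTLESS_MAP.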
- remove(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, use_space=True, custom_strings=None, custom_expressions=None)
Removes certain characters from the given text.
To add a new parameter, make sure that its name matches the name of the corresponding constant. For patterns, drop the EXPRESSION_ prefix from the constant name to get the parameter name.
- Parameters
text (str) – Text to be processed
arabic (bool, optional) – Remove ARABIC characters, by default False
english (bool, optional) – Remove ENGLISH characters, by default False
arabic_letters (bool, optional) – Remove ARABIC_LETTERS characters, by default False
english_letters (bool, optional) – Remove ENGLISH_LETTERS characters, by default False
english_small_letters (bool, optional) – Remove ENGLISH_SMALL_LETTERS characters, by default False
english_capital_letters (bool, optional) – Remove ENGLISH_CAPITAL_LETTERS characters, by default False
numbers (bool, optional) – Remove NUMBERS characters, by default False
harakat (bool, optional) – Remove HARAKAT characters, by default False
all_harakat (bool, optional) – Remove ALL_HARAKAT characters, by default False
tatweel (bool, optional) – Remove TATWEEL character, by default False
punctuations (bool, optional) – Remove PUNCTUATIONS characters, by default False
arabic_numbers (bool, optional) – Remove ARABIC_NUMBERS characters, by default False
english_numbers (bool, optional) – Remove ENGLISH_NUMBERS characters, by default False
arabic_punctuations (bool, optional) – Remove ARABIC_PUNCTUATIONS characters, by default False
english_punctuations (bool, optional) – Remove ENGLISH_PUNCTUATIONS characters, by default False
arabic_ligatures (bool, optional) – Remove ARABIC_LIGATURES words, by default False
arabic_hashtags (bool, optional) – Remove Arabic hashtags using the expression EXPRESSION_ARABIC_HASHTAGS, by default False
arabic_mentions (bool, optional) – Remove Arabic mentions using the expression EXPRESSION_ARABIC_MENTIONS, by default False
emails (bool, optional) – Remove emails using the expression EXPRESSION_EMAILS, by default False
english_hashtags (bool, optional) – Remove English hashtags using the expression EXPRESSION_ENGLISH_HASHTAGS, by default False
english_mentions (bool, optional) – Remove English mentions using the expression EXPRESSION_ENGLISH_MENTIONS, by default False
hashtags (bool, optional) – Remove hashtags using the expression EXPRESSION_HASHTAGS, by default False
links (bool, optional) – Remove links using the expression EXPRESSION_LINKS, by default False
mentions (bool, optional) – Remove mentions using the expression EXPRESSION_MENTIONS, by default False
emojis (bool, optional) – Remove emojis using the expression EXPRESSION_EMOJIS, by default False
use_space (bool, optional) – False to not replace with space, check remove_strings() for more information, by default True
custom_strings (list[str] | str | None) – Include any other string(s), by default None
custom_expressions (Union[ExpressionGroup, Expression, str]) – Include any other regular expressions, by default None
- Returns
Processed text
- Return type
str
- Raises
ValueError – If no argument is set to True
Examples
>>> from maha.cleaners.functions import remove
>>> text = "ويندوز 11 سيدعم تطبيقات نظام أندرويد. #Windows11"
>>> remove(text, hashtags=True)
'ويندوز 11 سيدعم تطبيقات نظام أندرويد.'

>>> from maha.cleaners.functions import remove
>>> text = "قَالَ رَبِّ اشْرَحْ لِي صَدْرِي.."
>>> remove(text, all_harakat=True, punctuations=True)
'قال رب اشرح لي صدري'
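The naming rule described above (a parameter resolves to the constant of the same name, or to the EXPRESSION_ constant for patterns) can be illustrated with a small sketch. This is not the library's implementation: `remove_sketch`, the `CONSTANTS` table, and the two placeholder values in it are hypothetical and exist only to show the lookup and the ValueError raised when nothing is selected.

```python
import re

# Hypothetical stand-ins for the library's constants; the real values differ.
CONSTANTS = {
    "ENGLISH_NUMBERS": list("0123456789"),
    "EXPRESSION_HASHTAGS": r"#\w+",
}


def remove_sketch(text, **flags):
    """Resolve boolean flags to constants by name, then remove the matches."""
    chars, patterns = [], []
    for name, enabled in flags.items():
        if not enabled:
            continue
        key = name.upper()
        if key in CONSTANTS:
            # Character constants share the parameter's name.
            chars.extend(CONSTANTS[key])
        elif "EXPRESSION_" + key in CONSTANTS:
            # Pattern constants carry the EXPRESSION_ prefix.
            patterns.append(CONSTANTS["EXPRESSION_" + key])
        else:
            raise ValueError(f"No constant matches parameter {name!r}")
    if not chars and not patterns:
        raise ValueError("At least one argument should be True")
    if chars:
        patterns.append("|".join(re.escape(c) for c in chars))
    for pattern in patterns:
        text = re.sub(pattern, " ", text)
    # Collapse the spaces left behind by the replacements.
    return re.sub(r" {2,}", " ", text).strip()


print(remove_sketch("Windows 11 #Windows11", hashtags=True))
```

Under these assumptions, adding a new flag only requires adding a constant with the matching name to the table.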
- reduce_repeated_substring(text, min_repeated=3, reduce_to=2)
Reduces consecutive substrings that are repeated at least min_repeated times to reduce_to times. For example, with the default arguments, ‘hhhhhh’ is reduced to ‘hh’.
TODO: Maybe change the implementation for 50x speed https://stackoverflow.com/questions/29481088/how-can-i-tell-if-a-string-repeats-itself-in-python/29489919#29489919
- Parameters
text (str) – Text to process
min_repeated (int, optional) – Minimum number of consecutive repetitions of a substring to consider, by default 3
reduce_to (int, optional) – Number of repetitions to keep, by default 2
- Returns
Processed text
- Return type
str
- Raises
ValueError – If a non-positive integer is passed or reduce_to is greater than min_repeated
Examples
>>> from maha.cleaners.functions import reduce_repeated_substring
>>> text = "ههههههههههههههه"
>>> reduce_repeated_substring(text)
'هه'

>>> from maha.cleaners.functions import reduce_repeated_substring
>>> text = "ويييييييييين راححححححححححححوا"
>>> reduce_repeated_substring(text, reduce_to=1)
'وين راحوا'
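One way to picture the reduction is a regular expression with a backreference. The sketch below is an approximation written for illustration, not the library's code: `reduce_repeated_substring_sketch` is a hypothetical name, and the argument checks only mirror the documented ValueError conditions.

```python
import re


def reduce_repeated_substring_sketch(text, min_repeated=3, reduce_to=2):
    """Regex-based approximation; not the library's actual implementation."""
    for value in (min_repeated, reduce_to):
        if not isinstance(value, int) or value < 1:
            raise ValueError("min_repeated and reduce_to must be positive integers")
    if reduce_to > min_repeated:
        raise ValueError("reduce_to must not be greater than min_repeated")
    # (.+?) captures the shortest candidate unit; \1{n,} requires at least
    # n further copies of it, so the unit appears min_repeated times or more.
    pattern = re.compile(r"(.+?)\1{%d,}" % (min_repeated - 1))
    return pattern.sub(lambda match: match.group(1) * reduce_to, text)


print(reduce_repeated_substring_sketch("hhhhhh"))
print(reduce_repeated_substring_sketch("hahahaha"))
```

Because the captured unit can be longer than one character, multi-character repetitions such as "hahahaha" are reduced as well.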
- remove_hash_keep_tag(text)
Removes the hash symbol HASHTAG from all hashtags in the given text.
- Parameters
text (str) – Text to process
- Returns
Text with the hash symbols removed and the tags kept.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_hash_keep_tag
>>> text = "We love #Jordan very much"
>>> remove_hash_keep_tag(text)
'We love Jordan very much'
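A minimal sketch of the same idea with the standard `re` module follows. It assumes that a hashtag is a `#` immediately followed by a word character; the library's actual pattern may be stricter, and `remove_hash_keep_tag_sketch` is a hypothetical name.

```python
import re


def remove_hash_keep_tag_sketch(text):
    # Drop '#' only when a word character follows, so a lone '#' is left alone.
    return re.sub(r"#(?=\w)", "", text)


print(remove_hash_keep_tag_sketch("We love #Jordan very much"))
```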
- remove_tatweel(text)
Removes tatweel symbol TATWEEL from the given text.
- Parameters
text (str) – Text to process
- Returns
Text with tatweel symbol removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_tatweel
>>> text = "الحمــــــــد لله رب العــــــــــــالمـــــــيـــــن"
>>> remove_tatweel(text)
'الحمد لله رب العالمين'
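Tatweel (kashida) is a single Unicode character, U+0640, so the effect can be reproduced with plain string replacement. The sketch assumes the TATWEEL constant holds exactly that character; `remove_tatweel_sketch` is a hypothetical name.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida); assumed value of the constant


def remove_tatweel_sketch(text):
    # Tatweel only stretches the word visually, so it is deleted outright.
    return text.replace(TATWEEL, "")


stretched = "\u0627\u0644\u062D\u0645" + TATWEEL * 4 + "\u062F"
print(remove_tatweel_sketch(stretched))
```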
- remove_emails(text)
Removes emails using pattern EXPRESSION_EMAILS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with emails removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_emails
>>> text = "يمكن استخدام الإيميل الشخصي، كمثال user1998@gmail.com"
>>> remove_emails(text)
'يمكن استخدام الإيميل الشخصي، كمثال'
- remove_hashtags(text)
Removes hashtags (strings that start with # symbol) using pattern EXPRESSION_HASHTAGS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with hashtags removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_hashtags
>>> text = "ويمكن القول أن مكة المكرمة من أجمل المناطق على وجه الأرض #السعودية"
>>> remove_hashtags(text)
'ويمكن القول أن مكة المكرمة من أجمل المناطق على وجه الأرض'
- remove_links(text)
Removes links using pattern EXPRESSION_LINKS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with links removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_links
>>> text = "لمشاهدة آخر التطورات يرجى زيارة الموقع التالي: https://github.com/TRoboto/Maha"
>>> remove_links(text)
'لمشاهدة آخر التطورات يرجى زيارة الموقع التالي:'
- remove_mentions(text)
Removes mentions (strings that start with @ symbol) using pattern EXPRESSION_MENTIONS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with mentions removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_mentions
>>> text = "@test لو سمحت صديقنا تزورنا على المعرض لاستلام الجائزة"
>>> remove_mentions(text)
'لو سمحت صديقنا تزورنا على المعرض لاستلام الجائزة'
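The hashtag and mention removers above are pattern based. The sketch below shows the general shape of such a removal with a deliberately simple pattern. The regular expression here is an assumption made for illustration and is almost certainly looser than EXPRESSION_HASHTAGS and EXPRESSION_MENTIONS; `remove_prefixed_tokens_sketch` is a hypothetical name.

```python
import re

# Simplified assumption: a token is '#' or '@' at the start of a word,
# followed by word characters (Arabic letters count as word characters).
PREFIXED_TOKEN = re.compile(r"(?<!\S)[#@]\w+")


def remove_prefixed_tokens_sketch(text):
    without_tokens = PREFIXED_TOKEN.sub(" ", text)
    # Collapse the gap left behind and trim the ends.
    return re.sub(r" {2,}", " ", without_tokens).strip()


print(remove_prefixed_tokens_sketch("@test please visit us #Jordan"))
```

The `(?<!\S)` lookbehind keeps the `@` inside an email address from being treated as a mention.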
- remove_punctuations(text)
Removes all punctuation marks PUNCTUATIONS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with punctuations removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_punctuations
>>> text = "مثال على الرموز الخاصة كالتالي $ ^ & * ( ) ! @"
>>> remove_punctuations(text)
'مثال على الرموز الخاصة كالتالي'
- remove_english(text)
Removes all English characters ENGLISH from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with English characters removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_english
>>> text = "ومن أفضل الجامعات هي جامعة إكسفورد (Oxford University)"
>>> remove_english(text)
'ومن أفضل الجامعات هي جامعة إكسفورد'
- remove_all_harakat(text)
Removes all harakat ALL_HARAKAT from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with all harakat removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_all_harakat
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا (1) فَٱلزَّٰجِرَٰتِ زَجۡرٗا"
>>> remove_all_harakat(text)
'وٱلصفت صفا (1) فٱلزجرت زجرا'
- remove_harakat(text)
Removes common harakat HARAKAT from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with common harakat removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_harakat
>>> text = "ألا تَرَى: كلَّ مَنْ تَرجو وتَأمَلُهُ مِنَ البَرِيَّةِ (مسكينُ بْنُ مسكينِ)"
>>> remove_harakat(text)
'ألا ترى: كل من ترجو وتأمله من البرية (مسكين بن مسكين)'
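The difference between the common and the full set of harakat can be sketched with the standard library alone. Both definitions below are assumptions: the common set is taken to be U+064B through U+0652 (fathatan to sukun), and the broad set is approximated by every nonspacing mark (Unicode category Mn). The library's HARAKAT and ALL_HARAKAT constants may be defined differently, and both function names are hypothetical.

```python
import re
import unicodedata

# Assumed common set: fathatan (U+064B) through sukun (U+0652).
COMMON_HARAKAT = re.compile("[\u064B-\u0652]")


def remove_common_harakat_sketch(text):
    return COMMON_HARAKAT.sub("", text)


def remove_all_marks_sketch(text):
    # Broader approximation: drop every nonspacing combining mark.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


word = "\u0635\u0670"  # sad followed by superscript alef (U+0670)
print(remove_common_harakat_sketch(word) == word)  # mark is outside the common range
print(remove_all_marks_sketch(word) == "\u0635")   # mark is removed by the broad version
```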
- remove_numbers(text)
Removes all numbers NUMBERS from the given text.
- Parameters
text (str) – Text to be processed
- Returns
Text with numbers removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_numbers
>>> text = "ورقم أبو تريكة في نادي الأهلي هو إثنين وعشرين (22)"
>>> remove_numbers(text)
'ورقم أبو تريكة في نادي الأهلي هو إثنين وعشرين ( )'
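The `( )` in the output above is consistent with digits being replaced by a space rather than deleted. The sketch below reproduces that effect under two assumptions: that numbers means ASCII digits plus Arabic-Indic digits (U+0660 to U+0669), and that runs of spaces are collapsed afterwards. `remove_numbers_sketch` is a hypothetical name.

```python
import re

# Assumed digit set: ASCII 0-9 and Arabic-Indic digits U+0660-U+0669.
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")


def remove_numbers_sketch(text):
    # Each run of digits becomes one space, which is why "(22)" turns into "( )".
    spaced = DIGITS.sub(" ", text)
    return re.sub(r" {2,}", " ", spaced).strip()


print(remove_numbers_sketch("number twenty-two (22)"))
```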
- remove_expressions(text, patterns, remove_spaces=True)
Removes the characters in text that match the input patterns.
Note
Use lookahead/lookbehind when substrings should not be captured or removed.
- Parameters
text (str) – Text to process
patterns (Expression | ExpressionGroup | str) – Expression(s) to use
remove_spaces (bool, optional) – False to keep extra spaces, defaults to True
- Returns
Text with matched characters removed.
- Return type
str
Example
>>> from maha.cleaners.functions import remove_expressions
>>> text = "الأميرُ الغازي أرطُغرُل، أو اختصارًا أرطغرل (بالتركية: Ertuğrul)"
>>> remove_expressions(text, r"\(.*\)")
'الأميرُ الغازي أرطُغرُل، أو اختصارًا أرطغرل'
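The note about lookahead/lookbehind can be illustrated with plain `re`. In this sketch the matched text is replaced with an empty string and extra spaces are then collapsed; both choices are assumptions about the behavior, and `remove_expressions_sketch` is a hypothetical stand-in rather than the library function.

```python
import re


def remove_expressions_sketch(text, pattern, remove_spaces=True):
    result = re.sub(pattern, "", text)
    if remove_spaces:
        # Collapse runs of spaces and trim the ends.
        result = re.sub(r" {2,}", " ", result).strip()
    return result


text = "Ertugrul (Turkish: Ertuğrul) was a bey"

# The parentheses are part of the match, so they are removed too.
print(remove_expressions_sketch(text, r"\(.*?\)"))

# Lookbehind/lookahead assert the parentheses without consuming them,
# so only the content between them is removed.
print(remove_expressions_sketch(text, r"(?<=\().*?(?=\))"))
```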
- remove_strings(text, strings, use_space=True)
Removes the input strings from the given text.
This works by replacing each input string with a space, which means a space itself cannot be removed. This helps separate words when unwanted strings appear without surrounding spaces. For example, ‘end.start’ is converted to ‘end start’ if the dot DOT is passed to strings. To disable this behavior, set use_space to False.
Note
Extra spaces (more than one space) are removed by default if use_space is set to True.
- Parameters
text (str) – Text to be processed
strings (Union[List[str], str]) – String or list of strings to remove
use_space (bool) – False to not replace with space, defaults to True
- Returns
Text with input strings removed.
- Return type
str
- Raises
ValueError – If no strings are provided
Example
>>> from maha.cleaners.functions import remove_strings
>>> text = "ومن الكلمات المحظورة السلاح"
>>> remove_strings(text, "السلاح")
'ومن الكلمات المحظورة'
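The replace-with-space behavior described above can be sketched as follows. This is an illustration under stated assumptions (leading and trailing spaces are trimmed, longer strings are matched first), not the library's implementation, and `remove_strings_sketch` is a hypothetical name.

```python
import re


def remove_strings_sketch(text, strings, use_space=True):
    if isinstance(strings, str):
        strings = [strings]
    if not strings:
        raise ValueError("No strings were provided")
    # Escape each string so it is matched literally; try longer strings first.
    ordered = sorted(strings, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    result = re.sub(pattern, " " if use_space else "", text)
    if use_space:
        # Remove the extra spaces introduced by the replacement.
        result = re.sub(r" {2,}", " ", result).strip()
    return result


print(remove_strings_sketch("end.start", "."))
print(remove_strings_sketch("end.start", ".", use_space=False))
```

With use_space=True the two words stay separated; with use_space=False they are joined.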
- remove_extra_spaces(text, max_spaces=1)
Keeps a maximum of max_spaces spaces when extra spaces are present (more than one space).
- Parameters
text (str) – Text to be processed
max_spaces (int, optional) – Maximum number of spaces to keep, by default 1
- Returns
Text with extra spaces removed
- Return type
str
- Raises
ValueError – When a negative or float value is assigned to max_spaces
Example
>>> from maha.cleaners.functions import remove_extra_spaces
>>> text = "وكان صديقنا العزيز محمد من أفضل الأشخاص الذين قابلتهم"
>>> remove_extra_spaces(text)
'وكان صديقنا العزيز محمد من أفضل الأشخاص الذين قابلتهم'
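A minimal sketch of the space-limiting idea, assuming that only the plain space character is affected and that any run longer than max_spaces is shortened to exactly max_spaces. `remove_extra_spaces_sketch` is a hypothetical name, and the validation only mirrors the documented ValueError conditions.

```python
import re


def remove_extra_spaces_sketch(text, max_spaces=1):
    is_int = isinstance(max_spaces, int) and not isinstance(max_spaces, bool)
    if not is_int or max_spaces < 0:
        raise ValueError("max_spaces must be a non-negative integer")
    # Match any run of spaces longer than the allowed maximum.
    too_many = re.compile(" {%d,}" % (max_spaces + 1))
    return too_many.sub(" " * max_spaces, text)


print(remove_extra_spaces_sketch("a   b  c"))
print(remove_extra_spaces_sketch("a   b  c", max_spaces=2))
```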
- remove_arabic_letter_dots(text)
Removes dots from ARABIC_LETTERS in the given text using the ARABIC_DOTLESS_MAP.
- Parameters
text (str) – Text to be processed
- Returns
Text with dotless Arabic letters
- Return type
str
Example
>>> from maha.cleaners.functions import remove_arabic_letter_dots
>>> text = "الحَمدُ للهِ الَّذي بنِعمتِه تَتمُّ الصَّالحاتُ"
>>> remove_arabic_letter_dots(text)
'الحَمدُ للهِ الَّدى ٮٮِعمٮِه ٮَٮمُّ الصَّالحاٮُ'
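A character map of this kind can be applied with `str.translate`. The sketch below uses a small hypothetical subset of mappings inferred from the example output above (beh, teh, and noon to dotless beh, thal to dal, yeh to alef maksura); the real ARABIC_DOTLESS_MAP covers more letters and may differ in detail.

```python
# Hypothetical subset of a dotless map, inferred from the example above.
DOTLESS_SUBSET = {
    "\u0628": "\u066E",  # beh  -> dotless beh
    "\u062A": "\u066E",  # teh  -> dotless beh
    "\u0646": "\u066E",  # noon -> dotless beh
    "\u0630": "\u062F",  # thal -> dal
    "\u064A": "\u0649",  # yeh  -> alef maksura
}
TRANSLATION_TABLE = str.maketrans(DOTLESS_SUBSET)


def remove_dots_sketch(text):
    # Characters without an entry in the table (including harakat) are kept.
    return text.translate(TRANSLATION_TABLE)


print(remove_dots_sketch("\u0628\u064A\u062A"))
```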