maha.cleaners.functions.keep_fn

Functions that operate on a string and remove all but certain characters.

Module Contents

Functions

keep(text[, arabic, english, ...])

Keeps only certain characters in the given text and removes everything else.

keep_arabic_letters(text)

Keeps only Arabic letters (ARABIC_LETTERS) in the given text.

keep_arabic_characters(text)

Keeps only common Arabic characters (ARABIC) in the given text.

keep_arabic_with_english_numbers(text)

Keeps only common Arabic characters (ARABIC) and English numbers (ENGLISH_NUMBERS) in the given text.

keep_arabic_letters_with_harakat(text)

Keeps only Arabic letters (ARABIC_LETTERS) and harakat (HARAKAT) in the given text.

keep_strings(text, strings[, use_space])

Keeps only the input strings (strings) in the given text (text).

keep(text, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, use_space=True, custom_strings=None)

Keeps only certain characters in the given text and removes everything else.

When adding a new parameter, give it the same name as the corresponding constant, in lowercase (for example, arabic_letters for ARABIC_LETTERS).

Parameters
  • text (str) – Text to be processed

  • arabic (bool, optional) – Keep ARABIC characters, by default False

  • english (bool, optional) – Keep ENGLISH characters, by default False

  • arabic_letters (bool, optional) – Keep ARABIC_LETTERS characters, by default False

  • english_letters (bool, optional) – Keep ENGLISH_LETTERS characters, by default False

  • english_small_letters (bool, optional) – Keep ENGLISH_SMALL_LETTERS characters, by default False

  • english_capital_letters (bool, optional) – Keep ENGLISH_CAPITAL_LETTERS characters, by default False

  • numbers (bool, optional) – Keep NUMBERS characters, by default False

  • harakat (bool, optional) – Keep HARAKAT characters, by default False

  • all_harakat (bool, optional) – Keep ALL_HARAKAT characters, by default False

  • punctuations (bool, optional) – Keep PUNCTUATIONS characters, by default False

  • arabic_numbers (bool, optional) – Keep ARABIC_NUMBERS characters, by default False

  • english_numbers (bool, optional) – Keep ENGLISH_NUMBERS characters, by default False

  • arabic_punctuations (bool, optional) – Keep ARABIC_PUNCTUATIONS characters, by default False

  • english_punctuations (bool, optional) – Keep ENGLISH_PUNCTUATIONS characters, by default False

  • use_space (bool, optional) – Whether to replace removed characters with a space; set to False to remove them without inserting a space. See keep_strings() for more information. By default True

  • custom_strings (List[str], optional) – Include any other string(s), by default None

Returns

Processed text

Return type

str

Raises

ValueError – If no argument is set to True

Example

>>> from maha.cleaners.functions import keep
>>> text = "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
>>> keep(text, arabic_letters=True)
'بسم الله الرحمن الرحيم'
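
The naming rule above (each parameter shares its name with a constant) can be illustrated with a minimal sketch. This is not the library's implementation: keep_sketch and the CONSTANTS table are hypothetical stand-ins, and the three character sets shown are simplified assumptions covering only a few of the flags.

```python
import re

# Hypothetical stand-ins for a few of the library's constants.
CONSTANTS = {
    "ENGLISH_SMALL_LETTERS": "abcdefghijklmnopqrstuvwxyz",
    "ENGLISH_CAPITAL_LETTERS": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "ENGLISH_NUMBERS": "0123456789",
}


def keep_sketch(text, use_space=True, **flags):
    """Keep only the characters whose flags are set to True (illustration only)."""
    chars = []
    for name, enabled in flags.items():
        if enabled:
            # A flag is resolved to its constant by upper-casing the flag name.
            chars.extend(CONSTANTS[name.upper()])
    if not chars:
        raise ValueError("At least one argument should be True")
    # Match every run of characters that is NOT in the kept set.
    unwanted = "[^" + re.escape("".join(chars)) + "]+"
    replacement = " " if use_space else ""
    return re.sub(unwanted, replacement, text).strip()


print(keep_sketch("abc DEF 123!", english_small_letters=True, english_numbers=True))
# abc 123
```

Because the lookup is driven by the flag name, supporting a new flag in this sketch only requires adding a constant with the matching upper-case name.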

keep_arabic_letters(text)

Keeps only Arabic letters (ARABIC_LETTERS) in the given text.

Parameters

text (str) – Text to be processed

Returns

Text that contains Arabic letters only.

Return type

str

Example

>>> from maha.cleaners.functions import keep_arabic_letters
>>> text = " 1 يا أحلى mathematicians في العالم"
>>> keep_arabic_letters(text)
'يا أحلى في العالم'

keep_arabic_characters(text)

Keeps only common Arabic characters (ARABIC) in the given text.

Parameters

text (str) – Text to be processed

Returns

Text that contains the common Arabic characters only.

Return type

str

Example

>>> from maha.cleaners.functions import keep_arabic_characters
>>> text = "أَلمَانِيَا (بالألمانية: Deutschland) رسمِيّاً جُمهُورِيَّة أَلمَانِيَا الاِتِّحَاديَّة"
>>> keep_arabic_characters(text)
'أَلمَانِيَا بالألمانية رسمِيّاً جُمهُورِيَّة أَلمَانِيَا الاِتِّحَاديَّة'

keep_arabic_with_english_numbers(text)

Keeps only common Arabic characters (ARABIC) and English numbers (ENGLISH_NUMBERS) in the given text.

Parameters

text (str) – Text to be processed

Returns

Text that contains the common Arabic characters and English numbers only.

Return type

str

Example

>>> from maha.cleaners.functions import keep_arabic_with_english_numbers
>>> text = "تتكون من 16 ولاية تُغطي مساحة 357,021 كيلومتر Deutschland"
>>> keep_arabic_with_english_numbers(text)
'تتكون من 16 ولاية تُغطي مساحة 357 021 كيلومتر'

keep_arabic_letters_with_harakat(text)

Keeps only Arabic letters (ARABIC_LETTERS) and harakat (HARAKAT) in the given text.

Parameters

text (str) – Text to be processed

Returns

Text that contains Arabic letters with harakat only.

Return type

str

Example

>>> from maha.cleaners.functions import keep_arabic_letters_with_harakat
>>> text = "إنّ في التّركِ قوة…"
>>> keep_arabic_letters_with_harakat(text)
'إنّ في التّركِ قوة'
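
To make the kept character classes more concrete, the following sketch approximates the same filtering with a regular expression over Unicode ranges. It is an assumption-laden illustration, not the library's code: the ranges below (basic letters U+0621 to U+063A and U+0641 to U+064A, plus the eight common diacritics U+064B to U+0652) only approximate ARABIC_LETTERS and HARAKAT, and keep_letters_with_harakat_sketch is a hypothetical name.

```python
import re

# Approximate character class: basic Arabic letters plus the common harakat.
# The library's actual ARABIC_LETTERS and HARAKAT constants may differ.
_UNWANTED = re.compile("[^\u0621-\u063A\u0641-\u064A\u064B-\u0652]+")


def keep_letters_with_harakat_sketch(text):
    """Replace every run of other characters with one space, then trim the ends."""
    return _UNWANTED.sub(" ", text).strip()


# Latin letters, digits and punctuation are dropped; the letter and its fatha stay.
print(keep_letters_with_harakat_sketch("abc \u0628\u064e 12"))
```

Replacing each unwanted run with a single space, rather than deleting it, keeps neighbouring words apart, which matches the space handling described for keep_strings().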

keep_strings(text, strings, use_space=True)

Keeps only the input strings (strings) in the given text (text).

By default, everything except the input strings is replaced with a space, so kept strings stay separated even when the unwanted text had no spaces around it. For example, ‘end.start’ becomes ‘end start’ when English letters (ENGLISH_LETTERS) are passed as strings. To disable this behavior, set use_space to False.

Note

When use_space is True, runs of more than one space are collapsed by default.

Parameters
  • text (str) – Text to be processed

  • strings (Union[List[str], str]) – String or list of strings to keep

  • use_space (bool) – False to not replace with space, defaults to True

Returns

Text that contains only the input strings.

Return type

str

Raises

ValueError – If no strings are provided

Example

>>> from maha.cleaners.functions import keep_strings
>>> text = "لا حول ولا قوة إلا بالله"
>>> keep_strings(text, "الله")
'الله'
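
The effect of use_space described above can be sketched without the library. The function below is a hypothetical re-creation for illustration only (keep_strings_sketch is not part of maha, and the real implementation may work differently): it scans for the wanted strings and inserts a single space wherever unwanted text separated two matches.

```python
import re


def keep_strings_sketch(text, strings, use_space=True):
    """Keep only the given strings; illustration of the use_space behavior."""
    if isinstance(strings, str):
        strings = [strings]
    strings = [s for s in strings if s]
    if not strings:
        raise ValueError("No strings were provided")
    # Try longer strings first so they win over their own prefixes.
    ordered = sorted(strings, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    pieces = []
    last_end = 0
    for match in re.finditer(pattern, text):
        # A gap between two matches means unwanted text was skipped.
        if use_space and pieces and match.start() > last_end:
            pieces.append(" ")
        pieces.append(match.group())
        last_end = match.end()
    return "".join(pieces)


letters = list("abcdefghijklmnopqrstuvwxyz")
print(keep_strings_sketch("end.start", letters))
# end start
print(keep_strings_sketch("end.start", letters, use_space=False))
# endstart
```

With use_space=False the two words fuse into one token, which is why the space replacement is the default.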