maha.cleaners.functions.normalize_fn

Functions that map visually similar characters (characters with roughly the same shape) to a single common character.

Module Contents

Functions

normalize(text[, lam_alef, alef, waw, yeh, ...])

Normalizes characters in the given text

normalize_lam_alef(text[, keep_hamza])

Normalize LAM_ALEF_VARIATIONS to LAM_ALEF_VARIATIONS_NORMALIZED if keep_hamza is True; otherwise, normalize to LAM and ALEF.

normalize_small_alef(text[, keep_madda, ...])

Normalize ALEF_SUPERSCRIPT to ALEF.

normalize(text, lam_alef=None, alef=None, waw=None, yeh=None, teh_marbuta=None, ligatures=None, spaces=None, all=False)

Normalizes characters in the given text

Parameters
  • text (str) – Text to process

  • lam_alef (bool, optional) – Normalize LAM_ALEF_VARIATIONS characters to LAM and ALEF, by default None

  • alef (bool, optional) – Normalize ALEF_VARIATIONS characters to ALEF, by default None

  • waw (bool, optional) – Normalize WAW_VARIATIONS characters to WAW, by default None

  • yeh (bool, optional) – Normalize YEH_VARIATIONS characters to YEH and ALEF, by default None

  • teh_marbuta (bool, optional) – Normalize TEH_MARBUTA characters to HEH, by default None

  • ligatures (bool, optional) – Normalize ARABIC_LIGATURES characters to the corresponding entries in ARABIC_LIGATURES_NORMALIZED, by default None

  • spaces (bool, optional) – Normalize space variations using the expression EXPRESSION_ALL_SPACES, by default None

  • all (bool, optional) – Apply every normalization except those explicitly set to False, by default False

Returns

Processed text

Return type

str

Raises

ValueError – If no argument is set to True
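
The interaction between the per-character flags (None, True, or False) and all can be sketched as follows. This is only an illustration of the documented behaviour, not the library's implementation; resolve_flags and do_all are hypothetical names (do_all stands in for all to avoid shadowing the Python builtin).

```python
from typing import Dict, Optional


def resolve_flags(flags: Dict[str, Optional[bool]], do_all: bool = False) -> Dict[str, bool]:
    """Hypothetical helper: decide which normalizations end up enabled.

    A flag left as None follows ``do_all``; an explicit True or False wins.
    """
    # Mirror the documented ValueError: nothing was requested at all.
    if not do_all and not any(flags.values()):
        raise ValueError("no normalization was requested")
    return {name: do_all if value is None else value for name, value in flags.items()}


# all=True with one flag explicitly switched off
print(resolve_flags({"alef": None, "waw": False}, do_all=True))
# only the flags explicitly set to True are applied
print(resolve_flags({"alef": True, "waw": None}))
```

Under this reading, normalize(text, all=True, waw=False) applies everything except the WAW normalization, and calling normalize(text) with no flag set raises ValueError.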

Examples

>>> from maha.cleaners.functions import normalize
>>> text = "عن أبي هريرة"
>>> normalize(text, alef=True, teh_marbuta=True)
'عن ابي هريره'
>>> from maha.cleaners.functions import normalize
>>> text = "قال رسول الله ﷺ"
>>> normalize(text, ligatures=True)
'قال رسول الله صلى الله عليه وسلم'
>>> from maha.cleaners.functions import normalize
>>> text = "قال مؤمن: ﷽ قل هو ﷲ أحد"
... # For space
>>> normalize(text, all=True, waw=False)
'قال مؤمن: بسم الله الرحمن الرحيم قل هو الله احد'
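
For readers without the library at hand, part of the ligature normalization can be approximated with the standard library. The sketch below swaps in Unicode NFKC compatibility decomposition, restricted to the Arabic Presentation Forms blocks; it is not how normalize is implemented, and sketch_normalize_ligatures is a hypothetical name. Some ligatures, such as U+FDFD (﷽), have no compatibility decomposition, so this approach leaves them unchanged where a lookup table would still expand them.

```python
import unicodedata


def _is_presentation_form(ch: str) -> bool:
    # Arabic Presentation Forms-A (U+FB50..U+FDFF) and
    # Arabic Presentation Forms-B (U+FE70..U+FEFC).
    return "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFC"


def sketch_normalize_ligatures(text: str) -> str:
    """Hypothetical stand-in: expand presentation-form ligatures via NFKC.

    Characters outside the presentation-form blocks are left untouched.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if _is_presentation_form(ch) else ch
        for ch in text
    )


print(sketch_normalize_ligatures("\uFDFA"))  # U+FDFA expands to its spelled-out phrase
print(sketch_normalize_ligatures("\uFDFD") == "\uFDFD")  # no decomposition: unchanged
```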

normalize_lam_alef(text, keep_hamza=True)

Normalize LAM_ALEF_VARIATIONS to LAM_ALEF_VARIATIONS_NORMALIZED if keep_hamza is True; otherwise, normalize to LAM and ALEF.

Parameters
  • text (str) – Text to process

  • keep_hamza (bool, optional) – True to preserve hamza and madda characters, by default True

Returns

Normalized text

Return type

str

Examples

>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "السﻻم عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَﻷلأ وَجْهُه"
>>> normalize_lam_alef(text)
'السلام عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَلألأ وَجْهُه'
>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "اﻵن يا أصحابي"
>>> normalize_lam_alef(text, keep_hamza=False)
'الان يا أصحابي'
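
The two modes can be illustrated with a small translation table. The sketch below is an assumption-laden stand-in: the table lists the lam-alef presentation forms from the Unicode Arabic Presentation Forms-B block and is not the library's LAM_ALEF_VARIATIONS constant, and sketch_normalize_lam_alef is a hypothetical name.

```python
LAM = "\u0644"
ALEF = "\u0627"

# Hypothetical table: lam-alef ligatures (isolated and final forms)
# mapped to LAM followed by the matching alef variant.
_LAM_ALEF_TABLE = {
    "\uFEFB": LAM + ALEF,      # lam + alef, isolated
    "\uFEFC": LAM + ALEF,      # lam + alef, final
    "\uFEF7": LAM + "\u0623",  # lam + alef with hamza above, isolated
    "\uFEF8": LAM + "\u0623",  # lam + alef with hamza above, final
    "\uFEF9": LAM + "\u0625",  # lam + alef with hamza below, isolated
    "\uFEFA": LAM + "\u0625",  # lam + alef with hamza below, final
    "\uFEF5": LAM + "\u0622",  # lam + alef with madda above, isolated
    "\uFEF6": LAM + "\u0622",  # lam + alef with madda above, final
}


def sketch_normalize_lam_alef(text: str, keep_hamza: bool = True) -> str:
    """Hypothetical stand-in for the behaviour described above."""
    if keep_hamza:
        mapping = _LAM_ALEF_TABLE
    else:
        # Drop hamza and madda: every ligature becomes plain LAM + ALEF.
        mapping = {ligature: LAM + ALEF for ligature in _LAM_ALEF_TABLE}
    return text.translate(str.maketrans(mapping))


word = "\u0627\uFEF5\u0646"  # alef, lam-alef-with-madda ligature, noon
print(sketch_normalize_lam_alef(word))
print(sketch_normalize_lam_alef(word, keep_hamza=False))
```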

normalize_small_alef(text, keep_madda=True, normalize_end=False)

Normalize ALEF_SUPERSCRIPT to ALEF. If keep_madda is True and ALEF_SUPERSCRIPT is followed by HAMZA_ABOVE, the pair is normalized to ALEF_MADDA_ABOVE instead.

Parameters
  • text (str) – Text to process

  • keep_madda (bool, optional) – True to preserve madda character, by default True

  • normalize_end (bool, optional) – True to also normalize an ALEF_SUPERSCRIPT that appears at the end of a word, by default False

Returns

Normalized text

Return type

str

Example

>>> from maha.cleaners.functions import normalize_small_alef
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا"
>>> normalize_small_alef(text)
'وَٱلصَّآفَّاتِ صَفّٗا'
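
A rough sketch of how keep_madda and normalize_end might interact is given below. It rests on several assumptions: the combining mark paired with the superscript alef is taken to be U+0653 (ARABIC MADDAH ABOVE), "end of a word" is simplified to "followed by whitespace or the end of the string", and sketch_normalize_small_alef is a hypothetical name rather than the library's code.

```python
import re

ALEF = "\u0627"
ALEF_MADDA_ABOVE = "\u0622"
ALEF_SUPERSCRIPT = "\u0670"
FOLLOWING_MARK = "\u0653"  # assumption: ARABIC MADDAH ABOVE


def sketch_normalize_small_alef(
    text: str, keep_madda: bool = True, normalize_end: bool = False
) -> str:
    """Hypothetical stand-in for the behaviour described above."""
    # Handle the superscript-alef + mark pair first so the mark is consumed.
    pair_target = ALEF_MADDA_ABOVE if keep_madda else ALEF
    text = text.replace(ALEF_SUPERSCRIPT + FOLLOWING_MARK, pair_target)
    if normalize_end:
        return text.replace(ALEF_SUPERSCRIPT, ALEF)
    # Simplification: only replace a superscript alef that is followed by
    # a non-whitespace character, i.e. one that is not word-final.
    return re.sub(ALEF_SUPERSCRIPT + r"(?=\S)", ALEF, text)


word = "\u0635\u0670\u0653\u0641\u0670\u062A"
print(sketch_normalize_small_alef(word))
print(sketch_normalize_small_alef(word, keep_madda=False))
```

With the defaults, a word-final superscript alef is left in place; passing normalize_end=True replaces it as well.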