`maha.cleaners.functions.normalize_fn`

Functions that map similar characters (characters with roughly the same shape) to one common character.

Module Contents

Functions

`normalize`(text[, lam_alef, alef, waw, yeh, ...])	Normalizes characters in the given text
`normalize_lam_alef`(text[, keep_hamza])	Normalize `LAM_ALEF_VARIATIONS` to `LAM_ALEF_VARIATIONS_NORMALIZED` if `keep_hamza` is True; otherwise, normalize to `LAM` and `ALEF`.
`normalize_small_alef`(text[, keep_madda, ...])	Normalize `ALEF_SUPERSCRIPT` to `ALEF`.

normalize(text, lam_alef=None, alef=None, waw=None, yeh=None, teh_marbuta=None, ligatures=None, spaces=None, all=False)

Normalizes characters in the given text

Parameters

text (str) – Text to process
lam_alef (bool, optional) – Normalize LAM_ALEF_VARIATIONS characters to LAM and ALEF, by default None
alef (bool, optional) – Normalize ALEF_VARIATIONS characters to ALEF, by default None
waw (bool, optional) – Normalize WAW_VARIATIONS characters to WAW, by default None
yeh (bool, optional) – Normalize YEH_VARIATIONS characters to YEH and ALEF, by default None
teh_marbuta (bool, optional) – Normalize TEH_MARBUTA characters to HEH, by default None
ligatures (bool, optional) – Normalize ARABIC_LIGATURES characters to the entries at the corresponding indices in ARABIC_LIGATURES_NORMALIZED, by default None
spaces (bool, optional) – Normalize space variations using the expression EXPRESSION_ALL_SPACES, by default None
all (bool, optional) – Apply every normalization except those explicitly set to False, by default False

Returns

Processed text

Return type

str

Raises

ValueError – If no argument is set to True
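
The interaction between `all`, the per-option flags, and the `ValueError` above can be sketched as follows. This is only an illustration under stated assumptions: `resolve_flags` is a hypothetical helper, not part of the library, and it assumes that `None` means "not specified" so the option falls back to the value of `all`.

```python
from typing import Dict, Optional


def resolve_flags(all: bool = False, **flags: Optional[bool]) -> Dict[str, bool]:
    """Hypothetical helper: turn tri-state option flags into plain booleans."""
    resolved = {
        # None means "not specified", so the option inherits the value of `all`.
        # An explicit True or False always wins over `all`.
        name: (all if value is None else value)
        for name, value in flags.items()
    }
    if not any(resolved.values()):
        # Mirrors the documented error: nothing was enabled.
        raise ValueError("At least one argument should be True")
    return resolved


# `waw` stays disabled because it was explicitly set to False.
print(resolve_flags(all=True, alef=None, waw=False))
```

Keeping `None` distinct from `False` is what lets a call such as `normalize(text, all=True, waw=False)` enable everything except one option.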

Examples

>>> from maha.cleaners.functions import normalize
>>> text = "عن أبي هريرة"
>>> normalize(text, alef=True, teh_marbuta=True)
'عن ابي هريره'

>>> from maha.cleaners.functions import normalize
>>> text = "قال رسول الله ﷺ"
>>> normalize(text, ligatures=True)
'قال رسول الله صلى الله عليه وسلم'

>>> from maha.cleaners.functions import normalize
>>> text = "قال مؤمن: ﷽ قل هو ﷲ أحد"
... # For space
>>> normalize(text, all=True, waw=False)
'قال مؤمن: بسم الله الرحمن الرحيم قل هو الله احد'
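
The index-based ligature mapping described for the `ligatures` parameter might look roughly like the sketch below. The two tuples are an assumed three-entry excerpt written for illustration, not the library's actual constants, and `sketch_normalize_ligatures` is a hypothetical name.

```python
# Assumed excerpt of two parallel sequences; the real constants may hold more entries.
LIGATURES = ("\uFDF2", "\uFDFA", "\uFDFD")
LIGATURES_NORMALIZED = (
    "الله",
    "صلى الله عليه وسلم",
    "بسم الله الرحمن الرحيم",
)

# Entry i of the first tuple maps to entry i of the second tuple.
_TABLE = str.maketrans(dict(zip(LIGATURES, LIGATURES_NORMALIZED)))


def sketch_normalize_ligatures(text: str) -> str:
    """Replace each single-character ligature with its spelled-out form."""
    return text.translate(_TABLE)


print(sketch_normalize_ligatures("قال رسول الله \uFDFA"))
```

A translation table fits here because every ligature is a single code point, so no regular expression is needed.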

normalize_lam_alef(text, keep_hamza=True)

Normalize LAM_ALEF_VARIATIONS to LAM_ALEF_VARIATIONS_NORMALIZED if keep_hamza is True. Otherwise, normalize to LAM and ALEF.

Parameters

text (str) – Text to process
keep_hamza (bool, optional) – True to preserve hamza and madda characters, by default True

Returns

Normalized text

Return type

str

Examples

>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "السﻻم عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَﻷلأ وَجْهُه"
>>> normalize_lam_alef(text)
'السلام عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَلألأ وَجْهُه'

>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "اﻵن يا أصحابي"
>>> normalize_lam_alef(text, keep_hamza=False)
'الان يا أصحابي'
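
The effect of `keep_hamza` can be illustrated with a small standalone sketch. It assumes that the variations are the eight Unicode lam-alef presentation forms (U+FEF5 to U+FEFC); the tables and the name `sketch_normalize_lam_alef` are hypothetical and written only for this example.

```python
LAM = "\u0644"
ALEF = "\u0627"

# Assumption: each ligature decomposes to LAM plus the matching alef variant.
WITH_HAMZA = {
    "\uFEF5": LAM + "\u0622", "\uFEF6": LAM + "\u0622",  # alef with madda above
    "\uFEF7": LAM + "\u0623", "\uFEF8": LAM + "\u0623",  # alef with hamza above
    "\uFEF9": LAM + "\u0625", "\uFEFA": LAM + "\u0625",  # alef with hamza below
    "\uFEFB": LAM + ALEF, "\uFEFC": LAM + ALEF,          # plain alef
}
# Without hamza, every ligature collapses to plain LAM + ALEF.
PLAIN = {ligature: LAM + ALEF for ligature in WITH_HAMZA}


def sketch_normalize_lam_alef(text: str, keep_hamza: bool = True) -> str:
    """Split lam-alef ligatures into two letters, optionally dropping hamza/madda."""
    table = WITH_HAMZA if keep_hamza else PLAIN
    return text.translate(str.maketrans(table))


# U+FEF5 keeps its madda by default and loses it with keep_hamza=False.
print(sketch_normalize_lam_alef("\u0627\uFEF5\u0646"))
print(sketch_normalize_lam_alef("\u0627\uFEF5\u0646", keep_hamza=False))
```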

normalize_small_alef(text, keep_madda=True, normalize_end=False)

Normalize ALEF_SUPERSCRIPT to ALEF. If keep_madda is True and ALEF_SUPERSCRIPT is followed by HAMZA_ABOVE, then normalize to ALEF_MADDA_ABOVE

Parameters

text (str) – Text to process
keep_madda (bool, optional) – True to preserve madda character, by default True
normalize_end (bool, optional) – True to also normalize an ALEF_SUPERSCRIPT that appears at the end of a word, by default False

Returns

Normalized text

Return type

str

Example

>>> from maha.cleaners.functions import normalize_small_alef
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا"
>>> normalize_small_alef(text)
'وَٱلصَّآفَّاتِ صَفّٗا'
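
A rough standalone sketch of the behaviour described for `keep_madda` and `normalize_end` follows. It rests on several assumptions that the text does not confirm: the description names HAMZA_ABOVE as the trigger, while this sketch uses the combining maddah mark U+0653, which is what the example input appears to contain. "End of a word" is taken to mean "followed by whitespace or the end of the text". With `keep_madda=False` the mark is simply left in place. The function name is hypothetical.

```python
import re

ALEF = "\u0627"
ALEF_MADDA_ABOVE = "\u0622"
ALEF_SUPERSCRIPT = "\u0670"
MADDAH_ABOVE = "\u0653"  # assumption: the mark that triggers the madda form


def sketch_normalize_small_alef(
    text: str, keep_madda: bool = True, normalize_end: bool = False
) -> str:
    """Replace superscript alef with a full alef under the assumptions above."""
    if keep_madda:
        # Superscript alef + maddah mark becomes a single alef-with-madda letter.
        text = text.replace(ALEF_SUPERSCRIPT + MADDAH_ABOVE, ALEF_MADDA_ABOVE)
    if normalize_end:
        # Normalize every remaining occurrence, including word-final ones.
        return text.replace(ALEF_SUPERSCRIPT, ALEF)
    # Assumption: skip occurrences followed by whitespace or the end of the text.
    return re.sub(ALEF_SUPERSCRIPT + r"(?!\s|$)", ALEF, text)


print(sketch_normalize_small_alef("\u0635\u0670\u0653\u0641\u0670\u062A"))
```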
