maha.cleaners.functions.normalize_fn
Special functions that convert similar characters (those with roughly the same shape) into one common character.
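Module Contents
Functions
| normalize | Normalizes characters in the given text |
| normalize_lam_alef | Normalize LAM_ALEF_VARIATIONS to LAM_ALEF_VARIATIONS_NORMALIZED if keep_hamza is True |
| normalize_small_alef | Normalize ALEF_SUPERSCRIPT to ALEF |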
- normalize(text, lam_alef=None, alef=None, waw=None, yeh=None, teh_marbuta=None, ligatures=None, spaces=None, all=False)
Normalizes characters in the given text
- Parameters
text (str) – Text to process
lam_alef (bool, optional) – Normalize LAM_ALEF_VARIATIONS characters to LAM and ALEF, by default None
alef (bool, optional) – Normalize ALEF_VARIATIONS characters to ALEF, by default None
waw (bool, optional) – Normalize WAW_VARIATIONS characters to WAW, by default None
yeh (bool, optional) – Normalize YEH_VARIATIONS characters to YEH and ALEF, by default None
teh_marbuta (bool, optional) – Normalize TEH_MARBUTA characters to HEH, by default None
ligatures (bool, optional) – Normalize ARABIC_LIGATURES characters to the corresponding indices in ARABIC_LIGATURES_NORMALIZED, by default None
spaces (bool, optional) – Normalize space variations using the expression EXPRESSION_ALL_SPACES, by default None
all (bool, optional) – Do all normalization except the ones that are set to False, by default False
- Returns
Processed text
- Return type
str
- Raises
ValueError – If no argument is set to True
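The parameters above are tri-state: None, True, or False. The following is a minimal sketch of one plausible way such flags could be resolved against `all`, written to match the documented behavior (unset flags follow `all`, an explicit False opts out, and a ValueError is raised when nothing ends up enabled). The helper name `resolve_flags` and the error message are hypothetical and are not part of maha.

```python
def resolve_flags(all=False, **flags):
    """Resolve tri-state flags (hypothetical helper, not maha's API).

    A flag left as None inherits the value of `all`;
    an explicit True or False always wins.
    """
    resolved = {
        name: (all if value is None else bool(value))
        for name, value in flags.items()
    }
    # Mirror the documented error: nothing to do if every flag is off.
    if not any(resolved.values()):
        raise ValueError("At least one normalization must be enabled")
    return resolved


# all=True turns on every unset flag, while waw=False stays off.
print(resolve_flags(all=True, alef=None, waw=False))
```

Under this reading, `normalize(text, all=True, waw=False)` applies every normalization except the WAW one.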
Examples
>>> from maha.cleaners.functions import normalize
>>> text = "عن أبي هريرة"
>>> normalize(text, alef=True, teh_marbuta=True)
'عن ابي هريره'

>>> from maha.cleaners.functions import normalize
>>> text = "قال رسول الله ﷺ"
>>> normalize(text, ligatures=True)
'قال رسول الله صلى الله عليه وسلم'

>>> from maha.cleaners.functions import normalize
>>> text = "قال مؤمن: ﷽ قل هو ﷲ أحد"
... # For space
>>> normalize(text, all=True, waw=False)
'قال مؤمن: بسم الله الرحمن الرحيم قل هو الله احد'
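To illustrate the idea behind the `alef` and `teh_marbuta` options, here is a minimal standard-library sketch using `str.translate`. It is not maha's implementation. The set of code points in `ASSUMED_ALEF_VARIATIONS` is an assumption about what ALEF_VARIATIONS contains, and `sketch_normalize` is a hypothetical name.

```python
ALEF = "\u0627"         # ARABIC LETTER ALEF
HEH = "\u0647"          # ARABIC LETTER HEH
TEH_MARBUTA = "\u0629"  # ARABIC LETTER TEH MARBUTA
# Assumption: alef with madda, hamza above, hamza below, and alef wasla.
ASSUMED_ALEF_VARIATIONS = "\u0622\u0623\u0625\u0671"


def sketch_normalize(text, alef=False, teh_marbuta=False):
    """Map similar-looking characters onto one common character."""
    table = {}
    if alef:
        table.update({ord(ch): ALEF for ch in ASSUMED_ALEF_VARIATIONS})
    if teh_marbuta:
        table[ord(TEH_MARBUTA)] = HEH
    # str.translate replaces every code point found in the table.
    return text.translate(table)


# ALEF WITH HAMZA ABOVE followed by TEH MARBUTA becomes ALEF followed by HEH.
assert sketch_normalize("\u0623\u0629", alef=True, teh_marbuta=True) == "\u0627\u0647"
```

A translation table keeps each normalization a one-to-one character mapping, so the options can be combined in a single pass over the text.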
- normalize_lam_alef(text, keep_hamza=True)
Normalize LAM_ALEF_VARIATIONS to LAM_ALEF_VARIATIONS_NORMALIZED if keep_hamza is True. Otherwise, normalize to LAM and ALEF
- Parameters
text (str) – Text to process
keep_hamza (bool, optional) – True to preserve hamza and madda characters, by default True
- Returns
Normalized text
- Return type
str
Examples
>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "السﻻم عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَﻷلأ وَجْهُه"
>>> normalize_lam_alef(text)
'السلام عليكم أحبتي، قالوا في صِفَةِ رَسُولِ الله يتَلألأ وَجْهُه'

>>> from maha.cleaners.functions import normalize_lam_alef
>>> text = "اﻵن يا أصحابي"
>>> normalize_lam_alef(text, keep_hamza=False)
'الان يا أصحابي'
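The effect of `keep_hamza` can be sketched with the Unicode lam-alef ligatures from the Arabic Presentation Forms-B block (U+FEF5 to U+FEFC). This is an illustrative sketch, not the library's code. The mapping below is an assumption about what LAM_ALEF_VARIATIONS and LAM_ALEF_VARIATIONS_NORMALIZED contain, and the function name is hypothetical.

```python
LAM = "\u0644"
ALEF = "\u0627"

# Assumed mapping: each ligature (isolated and final form) to LAM plus the
# matching alef variant.
ASSUMED_LAM_ALEF_MAP = {
    0xFEF5: LAM + "\u0622", 0xFEF6: LAM + "\u0622",  # alef with madda above
    0xFEF7: LAM + "\u0623", 0xFEF8: LAM + "\u0623",  # alef with hamza above
    0xFEF9: LAM + "\u0625", 0xFEFA: LAM + "\u0625",  # alef with hamza below
    0xFEFB: LAM + ALEF,     0xFEFC: LAM + ALEF,      # plain alef
}


def sketch_normalize_lam_alef(text, keep_hamza=True):
    """Expand lam-alef ligatures into separate LAM and ALEF characters."""
    if keep_hamza:
        return text.translate(ASSUMED_LAM_ALEF_MAP)
    # Drop hamza and madda: every ligature becomes LAM followed by bare ALEF.
    return text.translate({cp: LAM + ALEF for cp in ASSUMED_LAM_ALEF_MAP})


# U+FEF5 (lam with alef with madda above) loses its madda when keep_hamza=False.
assert sketch_normalize_lam_alef("\ufef5", keep_hamza=False) == "\u0644\u0627"
```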
- normalize_small_alef(text, keep_madda=True, normalize_end=False)
Normalize ALEF_SUPERSCRIPT to ALEF. If keep_madda is True and ALEF_SUPERSCRIPT is followed by HAMZA_ABOVE, then normalize to ALEF_MADDA_ABOVE
- Parameters
text (str) – Text to process
keep_madda (bool, optional) – True to preserve madda character, by default True
normalize_end (bool, optional) – True to normalize ALEF_SUPERSCRIPT characters that appear at the end of a word, by default False
- Returns
Normalized text
- Return type
str
Example
>>> from maha.cleaners.functions import normalize_small_alef
>>> text = "وَٱلصَّٰٓفَّٰتِ صَفّٗا"
>>> normalize_small_alef(text)
'وَٱلصَّآفَّاتِ صَفّٗا'
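A rough sketch of the behavior described for `keep_madda` and `normalize_end`, using only the standard library. It rests on two labeled assumptions: the mark that triggers the madda form is taken to be U+0653 (ARABIC MADDAH ABOVE), and "end of a word" is taken to mean "followed by whitespace or the end of the text". Neither is confirmed by the documentation above, and `sketch_normalize_small_alef` is a hypothetical name.

```python
import re

ALEF = "\u0627"
ALEF_MADDA_ABOVE = "\u0622"
ALEF_SUPERSCRIPT = "\u0670"
ASSUMED_MADDA_MARK = "\u0653"  # assumption: the mark that follows the small alef


def sketch_normalize_small_alef(text, keep_madda=True, normalize_end=False):
    """Replace superscript alef with a full alef (illustrative only)."""
    if keep_madda:
        # Small alef plus madda mark collapses into ALEF WITH MADDA ABOVE.
        text = text.replace(ALEF_SUPERSCRIPT + ASSUMED_MADDA_MARK, ALEF_MADDA_ABOVE)
    if normalize_end:
        pattern = ALEF_SUPERSCRIPT
    else:
        # Assumed word-end rule: skip a small alef followed by whitespace
        # or by the end of the text.
        pattern = ALEF_SUPERSCRIPT + r"(?!\s|$)"
    return re.sub(pattern, ALEF, text)


# A word-final small alef is left alone unless normalize_end=True.
assert sketch_normalize_small_alef("\u0641\u0670") == "\u0641\u0670"
assert sketch_normalize_small_alef("\u0641\u0670", normalize_end=True) == "\u0641\u0627"
```

With `keep_madda=False` this sketch leaves any following combining mark in place; the library may treat that case differently.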