maha.processors#
Submodules#
Package Contents#
Classes#
BaseProcessor | Base class for all processors. It contains almost all functions needed for the processors.
TextProcessor | For processing text input.
FileProcessor | For processing file input.
StreamTextProcessor | For processing a stream of text input.
StreamFileProcessor | For processing file stream input.
- class BaseProcessor[source]#
Bases: abc.ABC
Base class for all processors. It contains almost all functions needed for the processors.
- Parameters
text (Union[List[str], str]) – A text or list of strings to process
- abstract get_lines(self, n_lines=100)#
Returns a generator of lists of strings, each of length n_lines
- Parameters
n_lines (int) – Number of lines to yield per list, defaults to 100
- Yields
List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.
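The batching described for get_lines can be pictured with a small standalone generator. This is only a sketch of the documented behavior, not maha's implementation; chunk_lines is a hypothetical helper name.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_lines(lines: Iterable[str], n_lines: int = 100) -> Iterator[List[str]]:
    """Yield lists of at most n_lines strings (hypothetical helper, not part of maha)."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, n_lines))
        if not batch:  # the input is exhausted
            return
        yield batch


# Five lines in batches of two: the last batch holds the single leftover line.
batches = list(chunk_lines(["a", "b", "c", "d", "e"], n_lines=2))
```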
- abstract apply(self, fn)#
Applies a function to each line
- Parameters
fn (Callable[[str], str]) – Function to apply
- abstract filter(self, fn)#
Keeps lines for which the input function returns True
- Parameters
fn (Callable[[str], bool]) – Function to check
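The contracts of apply and filter can be sketched on a plain list of strings. The helper names apply_to_lines and filter_lines are hypothetical and stand in for whatever a concrete processor does internally.

```python
from typing import Callable, List


def apply_to_lines(lines: List[str], fn: Callable[[str], str]) -> List[str]:
    """Return a new list with fn applied to every line (illustrative only)."""
    return [fn(line) for line in lines]


def filter_lines(lines: List[str], fn: Callable[[str], bool]) -> List[str]:
    """Keep only the lines for which fn returns True (illustrative only)."""
    return [line for line in lines if fn(line)]


lines = ["  hello ", "", " world"]
stripped = apply_to_lines(lines, str.strip)
non_empty = filter_lines(stripped, lambda line: len(line) > 0)
```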
- get(self, unique_characters=False, character_length=False, word_length=False)#
Returns statistics about the provided text
- Parameters
unique_characters (bool, optional) – Return all unique characters, by default False
character_length (bool, optional) – Return the character length of each string, by default False
word_length (bool, optional) – Return the word length of each string (split by space), by default False
- Returns
If one argument is set to True, its value is returned.
If more than one argument is set to True, a dictionary is returned whose keys are the names of the arguments set to True, with the corresponding values.
- Return type
Union[Dict[str, Any], Any]
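The return convention of get (a bare value for one selected statistic, a dictionary for several) can be sketched as follows. The function get_stats is hypothetical. The sorted list of unique characters and the ValueError raised when nothing is selected are assumptions of this sketch, not documented behavior.

```python
from typing import Any, Dict, List, Union


def get_stats(
    lines: List[str],
    unique_characters: bool = False,
    character_length: bool = False,
    word_length: bool = False,
) -> Union[Dict[str, Any], Any]:
    """Illustrative stand-in for the documented return convention."""
    results: Dict[str, Any] = {}
    if unique_characters:
        results["unique_characters"] = sorted(set("".join(lines)))
    if character_length:
        results["character_length"] = [len(line) for line in lines]
    if word_length:
        # Word length is counted after splitting by a single space.
        results["word_length"] = [len(line.split(" ")) for line in lines]
    if not results:
        # Assumption: the library's behavior in this case is not documented here.
        raise ValueError("Set at least one argument to True")
    if len(results) == 1:
        return next(iter(results.values()))
    return results


lines = ["ab ba", "c"]
single = get_stats(lines, character_length=True)
several = get_stats(lines, character_length=True, word_length=True)
```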
- print_unique_characters(self)#
Prints all unique characters in the text
- keep(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, use_space=True, custom_strings=None)#
Applies keep() to each line
- Parameters
arabic (bool) –
english (bool) –
arabic_letters (bool) –
english_letters (bool) –
english_small_letters (bool) –
english_capital_letters (bool) –
numbers (bool) –
harakat (bool) –
all_harakat (bool) –
punctuations (bool) –
arabic_numbers (bool) –
english_numbers (bool) –
arabic_punctuations (bool) –
english_punctuations (bool) –
use_space (bool) –
custom_strings (list[str] | str | None) –
- normalize(self, lam_alef=None, alef=None, waw=None, yeh=None, teh_marbuta=None, ligatures=None, spaces=None, all=None)#
Applies normalize() to each line
- Parameters
lam_alef (bool | None) –
alef (bool | None) –
waw (bool | None) –
yeh (bool | None) –
teh_marbuta (bool | None) –
ligatures (bool | None) –
spaces (bool | None) –
all (bool | None) –
- connect_single_letter_word(self, waw=None, feh=None, beh=None, lam=None, kaf=None, teh=None, all=None, custom_strings=None)#
Applies connect_single_letter_word() to each line
- Parameters
waw (bool | None) –
feh (bool | None) –
beh (bool | None) –
lam (bool | None) –
kaf (bool | None) –
teh (bool | None) –
all (bool | None) –
custom_strings (list[str] | str | None) –
- replace(self, strings, with_value)#
Applies replace() to each line
- Parameters
strings (list[str] | str) –
with_value (str) –
- replace_expression(self, expression, with_value)#
Applies replace_expression() to each line
- Parameters
expression (Expression | ExpressionGroup | str) –
with_value (Callable[..., str] | str) –
- replace_pairs(self, keys, values)#
Applies replace_pairs() to each line
- Parameters
keys (list[str]) –
values (list[str]) –
- reduce_repeated_substring(self, min_repeated=3, reduce_to=2)#
Applies reduce_repeated_substring() to each line
- Parameters
min_repeated (int) –
reduce_to (int) –
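The parameters min_repeated and reduce_to suggest the following reading: any substring repeated consecutively at least min_repeated times is collapsed to reduce_to copies. The regex below is one possible way to get that effect. It is an assumption for illustration and may differ from the library's actual pattern; reduce_repeats is a hypothetical name.

```python
import re


def reduce_repeats(text: str, min_repeated: int = 3, reduce_to: int = 2) -> str:
    """Collapse runs of a repeated substring (sketch, not maha's implementation)."""
    # (.+?) captures the shortest repeating unit; \1{k,} requires at least
    # k further copies, so the unit occurs min_repeated times or more in total.
    pattern = re.compile(r"(.+?)\1{%d,}" % (min_repeated - 1))
    return pattern.sub(lambda match: match.group(1) * reduce_to, text)


reduced = reduce_repeats("sooooo goooood")
laughter = reduce_repeats("hahahaha")
untouched = reduce_repeats("good")  # only two repetitions, below min_repeated
```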
- remove(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, use_space=True, custom_strings=None, custom_expressions=None)#
Applies remove()
to each line
- Parameters
arabic (bool) –
english (bool) –
arabic_letters (bool) –
english_letters (bool) –
english_small_letters (bool) –
english_capital_letters (bool) –
numbers (bool) –
harakat (bool) –
all_harakat (bool) –
tatweel (bool) –
punctuations (bool) –
arabic_numbers (bool) –
english_numbers (bool) –
arabic_punctuations (bool) –
english_punctuations (bool) –
arabic_ligatures (bool) –
arabic_hashtags (bool) –
arabic_mentions (bool) –
emails (bool) –
english_hashtags (bool) –
english_mentions (bool) –
hashtags (bool) –
links (bool) –
mentions (bool) –
emojis (bool) –
use_space (bool) –
custom_strings (list[str] | str | None) –
custom_expressions (list[str] | str | None) –
- drop_lines_contain(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator='or')#
Drop lines that contain any of the selected strings or patterns.
Note
Use operator='and' to drop lines that contain all selected strings or patterns.
See contains() for a description of the arguments.
- Parameters
arabic (bool) –
english (bool) –
arabic_letters (bool) –
english_letters (bool) –
english_small_letters (bool) –
english_capital_letters (bool) –
numbers (bool) –
harakat (bool) –
all_harakat (bool) –
tatweel (bool) –
lam_alef_variations (bool) –
lam_alef (bool) –
punctuations (bool) –
arabic_numbers (bool) –
english_numbers (bool) –
arabic_punctuations (bool) –
english_punctuations (bool) –
arabic_ligatures (bool) –
persian (bool) –
arabic_hashtags (bool) –
arabic_mentions (bool) –
emails (bool) –
english_hashtags (bool) –
english_mentions (bool) –
hashtags (bool) –
links (bool) –
mentions (bool) –
emojis (bool) –
custom_strings (list[str] | str | None) –
custom_expressions (list[str] | str | None) –
operator (str) –
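The difference between operator='or' and operator='and' can be sketched with plain regular expressions. The function drop_lines_matching and its patterns argument are hypothetical; the library selects its patterns through the boolean flags listed above.

```python
import re
from typing import List


def drop_lines_matching(
    lines: List[str], patterns: List[str], operator: str = "or"
) -> List[str]:
    """Drop lines matching any ('or') or all ('and') patterns (illustrative only)."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    combine = any if operator == "or" else all
    return [
        line
        for line in lines
        if not combine(re.search(pattern, line) for pattern in patterns)
    ]


lines = ["call 123", "mail a@b.co", "id 7 x@y.org", "plain"]
patterns = [r"\d", r"@"]  # a digit, an at sign
dropped_any = drop_lines_matching(lines, patterns, operator="or")
dropped_all = drop_lines_matching(lines, patterns, operator="and")
```

With 'or', a single match is enough to drop a line; with 'and', only the line that contains both a digit and an at sign is dropped.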
- drop_empty_lines(self)#
Drop empty lines.
- drop_lines_below_len(self, length, word_level=False)#
Drop lines with fewer characters/words than the input length
- Parameters
length (int) – Number of characters/words
word_level (bool, optional) – True to switch to word level, which splits the text by space, by default False
- drop_lines_above_len(self, length, word_level=False)#
Drop lines with more characters/words than the input length
- Parameters
length (int) – Number of characters/words
word_level (bool, optional) – True to switch to word level, which splits the text by space, by default False
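The two length filters differ only in the direction of the comparison. A minimal sketch, assuming that lines exactly equal to length are kept in both cases (as the "less than" and "more than" wording implies); the helper names are hypothetical.

```python
from typing import List


def line_length(line: str, word_level: bool = False) -> int:
    """Length in characters, or in space-separated words if word_level is True."""
    return len(line.split(" ")) if word_level else len(line)


def drop_below(lines: List[str], length: int, word_level: bool = False) -> List[str]:
    return [line for line in lines if line_length(line, word_level) >= length]


def drop_above(lines: List[str], length: int, word_level: bool = False) -> List[str]:
    return [line for line in lines if line_length(line, word_level) <= length]


lines = ["hi", "hello world", "a b c d"]
long_enough = drop_below(lines, 3)                     # character level
short_enough = drop_above(lines, 2, word_level=True)   # word level
```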
- drop_lines_contain_repeated_substring(self, repeated=3)#
Drop lines containing a number of consecutive repeated substrings
- Parameters
repeated (int, optional) – Minimum number of repetitions, by default 3
- drop_lines_contain_single_letter_word(self, arabic_letters=False, english_letters=False)#
Drop lines containing a single-letter word (e.g. “محمد و احمد” or “how r u”). In Arabic, single-letter words are rare.
Warning
In English, all lines containing the standalone word “I” will be dropped since it is considered a single-letter word
See contains_single_letter_word(). See also connect_single_letter_word().
- Parameters
arabic_letters (bool) –
english_letters (bool) –
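The warning about “I” is easy to reproduce with a simplified check for English single-letter words. This sketch splits on whitespace and is only an approximation of contains_single_letter_word(); the function names are hypothetical.

```python
import string
from typing import List


def has_single_english_letter_word(line: str) -> bool:
    """True if any whitespace-separated word is one ASCII letter (simplified check)."""
    return any(
        len(word) == 1 and word in string.ascii_letters for word in line.split()
    )


def drop_single_letter_lines(lines: List[str]) -> List[str]:
    return [line for line in lines if not has_single_english_letter_word(line)]


lines = ["how r u", "I agree", "hello there"]
kept = drop_single_letter_lines(lines)
```

Note that the perfectly valid sentence "I agree" is dropped along with "how r u".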
- filter_lines_contain(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator='or')#
Keep lines that contain any of the selected strings or patterns.
Note
Use operator='and' to keep only lines that contain all selected strings or patterns.
See contains() for a description of the arguments.
- Parameters
arabic (bool) –
english (bool) –
arabic_letters (bool) –
english_letters (bool) –
english_small_letters (bool) –
english_capital_letters (bool) –
numbers (bool) –
harakat (bool) –
all_harakat (bool) –
tatweel (bool) –
lam_alef_variations (bool) –
lam_alef (bool) –
punctuations (bool) –
arabic_numbers (bool) –
english_numbers (bool) –
arabic_punctuations (bool) –
english_punctuations (bool) –
arabic_ligatures (bool) –
persian (bool) –
arabic_hashtags (bool) –
arabic_mentions (bool) –
emails (bool) –
english_hashtags (bool) –
english_mentions (bool) –
hashtags (bool) –
links (bool) –
mentions (bool) –
emojis (bool) –
custom_strings (list[str] | str | None) –
custom_expressions (list[str] | str | None) –
operator (str) –
- class TextProcessor(text)[source]#
Bases: maha.processors.base_processor.BaseProcessor
For processing text input.
- Parameters
text (Union[List[str], str]) – A text or list of strings to process
- apply(self, fn)#
Applies a function to each line
- Parameters
fn (Callable[[str], str]) – Function to apply
- filter(self, fn)#
Keeps lines for which the input function returns True
- Parameters
fn (Callable[[str], bool]) – Function to check
- get_lines(self, n_lines=100)#
Returns a generator of lists of strings, each of length n_lines
- Parameters
n_lines (int) – Number of lines to yield per list, defaults to 100
- Yields
List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.
- set_lines(self, text)#
Overrides text
- Parameters
text (Union[List[str], str]) – New text or list of strings
- property text(self)#
Returns the processed text joined by the newline separator \n
- Returns
processed text
- Return type
str
- classmethod from_text(cls, text, sep=None)#
Creates a new processor from the given text. Splits the text by the sep argument if provided.
- Parameters
text (str) – Text to process
sep (str, optional) – Separator used to split the given text, by default None
- Returns
New text processor
- Return type
TextProcessor
- classmethod from_list(cls, lines)#
Creates a new processor from the given list of strings.
- Parameters
lines (List[str]) – list of strings
- Returns
New text processor
- Return type
TextProcessor
- drop_duplicates(self)#
Drops duplicate lines from text
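One common way to drop duplicate lines while keeping the first occurrence of each is dict.fromkeys. Whether the library preserves order in this way is an assumption of the sketch; drop_duplicate_lines is a hypothetical name.

```python
from typing import List


def drop_duplicate_lines(lines: List[str]) -> List[str]:
    """Remove repeated lines, keeping the first occurrence (order preserved)."""
    # Dictionaries keep insertion order, so keys appear in first-seen order.
    return list(dict.fromkeys(lines))


unique_lines = drop_duplicate_lines(["b", "a", "b", "c", "a"])
```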
- class FileProcessor(path)[source]#
Bases: TextProcessor
For processing file input.
Note
For large files (>100 MB), use StreamFileProcessor.
- Parameters
path (Union[str, pathlib.Path]) – Path of the file to process.
- Raises
FileNotFoundError – If the file doesn’t exist.
ValueError – If the file is empty.
- class StreamTextProcessor(lines)[source]#
Bases: maha.processors.base_processor.BaseProcessor
For processing a stream of text input.
- Parameters
lines (Iterable[str]) – An iterable of strings to process
- apply(self, fn)#
Applies a function to each line
- Parameters
fn (Callable[[str], str]) – Function to apply
- filter(self, fn)#
Keeps lines for which the input function returns True
- Parameters
fn (Callable[[str], bool]) – Function to check
- get_lines(self, n_lines=100)#
Returns a generator of lists of strings, each of length n_lines
- Parameters
n_lines (int) – Number of lines to yield per list, defaults to 100
- Yields
List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.
- process(self, n_lines=100)#
Applies all functions in sequence to the given iterable
- Parameters
n_lines (int, optional) – Number of lines to process at a time, by default 100
- Yields
List[str] – A list of processed lines; it can be empty.
- Raises
ValueError – If no functions were selected.
- apply_functions(self, text)#
Applies all functions in sequence to a given list of strings
- Parameters
text (List[str]) – List of strings to process
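The stream processors appear to collect functions first and run them lazily, batch by batch, when process is called. The miniature class below sketches that idea under stated assumptions: TinyStream is hypothetical, and returning self from apply and filter to allow chaining is a choice of this sketch, not documented library behavior.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


class TinyStream:
    """Hypothetical miniature of a stream processor; not maha's implementation."""

    def __init__(self, lines: Iterable[str]) -> None:
        self.lines = lines
        self.functions: List[Callable[[List[str]], List[str]]] = []

    def apply(self, fn: Callable[[str], str]) -> "TinyStream":
        self.functions.append(lambda batch: [fn(line) for line in batch])
        return self

    def filter(self, fn: Callable[[str], bool]) -> "TinyStream":
        self.functions.append(lambda batch: [line for line in batch if fn(line)])
        return self

    def apply_functions(self, text: List[str]) -> List[str]:
        # Run every queued function, in order, on one batch.
        for function in self.functions:
            text = function(text)
        return text

    def process(self, n_lines: int = 100) -> Iterator[List[str]]:
        if not self.functions:
            raise ValueError("No functions were selected.")
        iterator = iter(self.lines)
        while True:
            batch = list(islice(iterator, n_lines))
            if not batch:
                return
            yield self.apply_functions(batch)


stream = TinyStream(["a", "", "b", "", ""]).apply(str.upper).filter(bool)
processed = list(stream.process(n_lines=2))
```

The final batch contains only an empty line, which the filter removes, so process yields an empty list for it. This matches the note that a yielded list can be empty.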
- class StreamFileProcessor(path, encoding='utf8')[source]#
Bases: StreamTextProcessor
For processing file stream input.
- Parameters
path (Union[str, pathlib.Path]) – Path of the file to process.
encoding (str) – File encoding.
- Raises
FileNotFoundError – If the file doesn’t exist.
- get_lines(self, n_lines=100)#
Returns a generator of lists of strings, each of length n_lines
- Parameters
n_lines (int) – Number of lines to yield per list, defaults to 100
- Yields
List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.
- process_and_save(self, path, n_lines=100, override=False)#
Process the input file and save the result in the given path
- Parameters
path (Union[str, pathlib.Path]) – Path to save the file
n_lines (int, optional) – Number of lines to process at a time, by default 100
override (bool, optional) – True to override the file if it exists, by default False
- Raises
FileExistsError – If the file exists and override is False
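The behavior described for process_and_save (read in batches, write the result, refuse to replace an existing file unless override is set) can be sketched with the standard library alone. The function save_processed and its fn parameter are hypothetical; the library applies the functions already queued on the processor rather than taking one as an argument.

```python
import tempfile
from itertools import islice
from pathlib import Path
from typing import Callable, Union


def save_processed(
    source: Union[str, Path],
    target: Union[str, Path],
    fn: Callable[[str], str],
    n_lines: int = 100,
    override: bool = False,
) -> None:
    """Stream source through fn into target, n_lines at a time (sketch only)."""
    target = Path(target)
    if target.exists() and not override:
        raise FileExistsError(f"{target} already exists")
    with open(source, encoding="utf8") as reader, open(
        target, "w", encoding="utf8"
    ) as writer:
        while True:
            batch = [fn(line.rstrip("\n")) for line in islice(reader, n_lines)]
            if not batch:
                break
            writer.write("".join(line + "\n" for line in batch))


with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "in.txt"
    target = Path(folder) / "out.txt"
    source.write_text("a\nb\nc\n", encoding="utf8")
    save_processed(source, target, str.upper, n_lines=2)
    saved = target.read_text(encoding="utf8")
    try:
        # The target now exists, so a second call without override is refused.
        save_processed(source, target, str.upper)
        refused = False
    except FileExistsError:
        refused = True
```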