maha.processors#

Submodules#

Package Contents#

Classes#

BaseProcessor

Base class for all processors. It contains almost all functions needed for the processors.

TextProcessor

For processing text input.

FileProcessor

For processing file input.

StreamTextProcessor

For processing a stream of text input.

StreamFileProcessor

For processing file stream input.

class BaseProcessor[source]#

Bases: abc.ABC

Base class for all processors. It contains almost all functions needed for the processors.

Parameters

text (Union[List[str], str]) – A text or list of strings to process

abstract get_lines(self, n_lines=100)#

Returns a generator of lists of strings, each of length n_lines

Parameters

n_lines (int) – Number of lines to yield. Defaults to 100

Yields

List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.
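The batching contract described above can be sketched in plain Python. This is a standalone illustration and not maha's implementation; the helper name batch_lines is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_lines(lines: Iterable[str], n_lines: int = 100) -> Iterator[List[str]]:
    """Yield consecutive lists of at most n_lines strings (hypothetical helper)."""
    iterator = iter(lines)
    while True:
        # Take the next n_lines items; the final batch may be shorter.
        batch = list(islice(iterator, n_lines))
        if not batch:
            return
        yield batch


batches = list(batch_lines(["a", "b", "c", "d", "e"], n_lines=2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Because the input is consumed lazily, the same pattern works for in-memory lists and for file handles.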

abstract apply(self, fn)#

Applies a function to each line

Parameters

fn (Callable[[str], str]) – Function to apply

abstract filter(self, fn)#

Keeps lines for which the input function is True

Parameters

fn (Callable[[str], bool]) – Function to check
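As a rough sketch of the two contracts above, apply maps a str-to-str function over every line, while filter keeps only the lines for which a predicate returns True. The function names below are hypothetical and operate on plain lists rather than on a processor.

```python
from typing import Callable, List


def apply_to_lines(lines: List[str], fn: Callable[[str], str]) -> List[str]:
    """Return a new list with fn applied to each line."""
    return [fn(line) for line in lines]


def filter_lines(lines: List[str], fn: Callable[[str], bool]) -> List[str]:
    """Return only the lines for which fn(line) is truthy."""
    return [line for line in lines if fn(line)]


lines = ["  hello ", "", "world"]
stripped = apply_to_lines(lines, str.strip)
non_empty = filter_lines(stripped, lambda line: len(line) > 0)
print(non_empty)  # ['hello', 'world']
```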

get(self, unique_characters=False, character_length=False, word_length=False)#

Returns statistics about the provided text

Parameters
  • unique_characters (bool, optional) – Return all unique characters, by default False

  • character_length (bool, optional) – Return the character length of each string, by default False

  • word_length (bool, optional) – Return the word length of each string (split by space), by default False

Returns

  • If one argument is set to True, its value is returned

  • If more than one argument is set to True, a dictionary is returned where keys are the names of the arguments set to True, with the corresponding values

Return type

Union[Dict[str, Any], Any]
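The return convention above (a bare value for one flag, a dictionary for several) can be sketched as follows. The function get_stats is hypothetical and independent of maha. The source does not say what happens when no flag is set, so raising ValueError here is an assumption, as is returning the unique characters as a sorted list.

```python
from typing import Any, Dict, List, Union


def get_stats(
    lines: List[str],
    unique_characters: bool = False,
    character_length: bool = False,
    word_length: bool = False,
) -> Union[Dict[str, Any], Any]:
    """Compute the selected statistics over a list of lines (hypothetical helper)."""
    selected: Dict[str, Any] = {}
    if unique_characters:
        # Assumption: unique characters are reported as a sorted list.
        selected["unique_characters"] = sorted(set("".join(lines)))
    if character_length:
        selected["character_length"] = [len(line) for line in lines]
    if word_length:
        # Word length is computed by splitting on a single space.
        selected["word_length"] = [len(line.split(" ")) for line in lines]
    if not selected:
        # Assumption: the no-flag case is not specified in the reference.
        raise ValueError("At least one statistic must be selected.")
    if len(selected) == 1:
        return next(iter(selected.values()))
    return selected


print(get_stats(["ab c", "a"], character_length=True))  # [4, 1]
```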

print_unique_characters(self)#

Prints all unique characters in the text

keep(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, use_space=True, custom_strings=None)#

Applies keep() to each line

Parameters
  • arabic (bool) –

  • english (bool) –

  • arabic_letters (bool) –

  • english_letters (bool) –

  • english_small_letters (bool) –

  • english_capital_letters (bool) –

  • numbers (bool) –

  • harakat (bool) –

  • all_harakat (bool) –

  • punctuations (bool) –

  • arabic_numbers (bool) –

  • english_numbers (bool) –

  • arabic_punctuations (bool) –

  • english_punctuations (bool) –

  • use_space (bool) –

  • custom_strings (list[str] | str | None) –

normalize(self, lam_alef=None, alef=None, waw=None, yeh=None, teh_marbuta=None, ligatures=None, spaces=None, all=None)#

Applies normalize() to each line

Parameters
  • lam_alef (bool | None) –

  • alef (bool | None) –

  • waw (bool | None) –

  • yeh (bool | None) –

  • teh_marbuta (bool | None) –

  • ligatures (bool | None) –

  • spaces (bool | None) –

  • all (bool | None) –
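The parameter descriptions are empty in this reference, so the following is only a guess at what one of the flags might do. It sketches a common form of alef normalization, mapping alef with madda, hamza above, and hamza below to a bare alef. The exact character set maha uses is an assumption, and normalize_alef is a hypothetical name.

```python
ALEF = "\u0627"  # ARABIC LETTER ALEF
# Assumed set of variants: alef with madda, hamza above, hamza below.
ALEF_VARIATIONS = "\u0622\u0623\u0625"

_ALEF_TABLE = str.maketrans({ch: ALEF for ch in ALEF_VARIATIONS})


def normalize_alef(line: str) -> str:
    """Replace the assumed alef variants with a bare alef (hypothetical helper)."""
    return line.translate(_ALEF_TABLE)


# "\u0623\u062d\u0645\u062f" begins with alef with hamza above.
normalized = normalize_alef("\u0623\u062d\u0645\u062f")
```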

connect_single_letter_word(self, waw=None, feh=None, beh=None, lam=None, kaf=None, teh=None, all=None, custom_strings=None)#

Applies connect_single_letter_word() to each line

Parameters
  • waw (bool | None) –

  • feh (bool | None) –

  • beh (bool | None) –

  • lam (bool | None) –

  • kaf (bool | None) –

  • teh (bool | None) –

  • all (bool | None) –

  • custom_strings (list[str] | str | None) –
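A minimal sketch of the idea, assuming that "connecting" means attaching a standalone single letter to the word that follows it. Only the waw case is shown; the regular expression and the name connect_waw are illustrative and not taken from maha.

```python
import re

WAW = "\u0648"  # ARABIC LETTER WAW

# A waw that stands alone (not preceded by a non-space character),
# followed by whitespace and then another word.
_STANDALONE_WAW = re.compile(rf"(?<!\S)({WAW})\s+(?=\S)")


def connect_waw(line: str) -> str:
    """Attach a standalone waw to the following word (hypothetical helper)."""
    return _STANDALONE_WAW.sub(r"\1", line)


# Three tokens: a word, a standalone waw, and another word.
before = "\u0645\u062d\u0645\u062f \u0648 \u0627\u062d\u0645\u062f"
after = connect_waw(before)
```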

replace(self, strings, with_value)#

Applies replace() to each line

Parameters
  • strings (list[str] | str) –

  • with_value (str) –

replace_expression(self, expression, with_value)#

Applies replace_expression() to each line

Parameters
  • expression –

  • with_value –

replace_pairs(self, keys, values)#

Applies replace_pairs() to each line

Parameters
  • keys (list[str]) –

  • values (list[str]) –

reduce_repeated_substring(self, min_repeated=3, reduce_to=2)#

Applies reduce_repeated_substring() to each line

Parameters
  • min_repeated (int) –

  • reduce_to (int) –
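One way to read the two parameters: any substring repeated consecutively at least min_repeated times is cut down to reduce_to repetitions. The sketch below implements that reading with a backreference. It is a standalone illustration under that assumption, and reduce_repeats is a hypothetical name.

```python
import re


def reduce_repeats(line: str, min_repeated: int = 3, reduce_to: int = 2) -> str:
    """Collapse runs of a repeated substring (hypothetical helper)."""
    # (.+?) captures the shortest repeating unit; \1{k,} requires at least
    # k further copies, so the unit appears min_repeated times or more.
    pattern = re.compile(r"(.+?)\1{%d,}" % (min_repeated - 1))
    return pattern.sub(lambda match: match.group(1) * reduce_to, line)


print(reduce_repeats("heeeello"))  # heello
print(reduce_repeats("hahahaha"))  # haha
```

With the defaults, "hello" is left unchanged because "l" repeats only twice.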

remove(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, use_space=True, custom_strings=None, custom_expressions=None)#

Applies remove() to each line

Parameters
  • arabic (bool) –

  • english (bool) –

  • arabic_letters (bool) –

  • english_letters (bool) –

  • english_small_letters (bool) –

  • english_capital_letters (bool) –

  • numbers (bool) –

  • harakat (bool) –

  • all_harakat (bool) –

  • tatweel (bool) –

  • punctuations (bool) –

  • arabic_numbers (bool) –

  • english_numbers (bool) –

  • arabic_punctuations (bool) –

  • english_punctuations (bool) –

  • arabic_ligatures (bool) –

  • arabic_hashtags (bool) –

  • arabic_mentions (bool) –

  • emails (bool) –

  • english_hashtags (bool) –

  • english_mentions (bool) –

  • hashtags (bool) –

  • links (bool) –

  • mentions (bool) –

  • emojis (bool) –

  • use_space (bool) –

  • custom_strings (list[str] | str | None) –

  • custom_expressions (list[str] | str | None) –

drop_lines_contain(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator='or')#

Drop lines that contain any of the selected strings or patterns.

Note

Use operator='and' to drop lines that contain all selected strings or patterns.

See contains() for arguments description

Parameters
  • arabic (bool) –

  • english (bool) –

  • arabic_letters (bool) –

  • english_letters (bool) –

  • english_small_letters (bool) –

  • english_capital_letters (bool) –

  • numbers (bool) –

  • harakat (bool) –

  • all_harakat (bool) –

  • tatweel (bool) –

  • lam_alef_variations (bool) –

  • lam_alef (bool) –

  • punctuations (bool) –

  • arabic_numbers (bool) –

  • english_numbers (bool) –

  • arabic_punctuations (bool) –

  • english_punctuations (bool) –

  • arabic_ligatures (bool) –

  • persian (bool) –

  • arabic_hashtags (bool) –

  • arabic_mentions (bool) –

  • emails (bool) –

  • english_hashtags (bool) –

  • english_mentions (bool) –

  • hashtags (bool) –

  • links (bool) –

  • mentions (bool) –

  • emojis (bool) –

  • custom_strings (list[str] | str | None) –

  • custom_expressions (list[str] | str | None) –

  • operator (str) –
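The effect of operator can be sketched with plain regular expressions standing in for the boolean flags. With 'or', a line is dropped if any pattern matches; with 'and', only if all of them match. The function drop_lines_matching and its patterns argument are hypothetical and do not mirror maha's signature.

```python
import re
from typing import List


def drop_lines_matching(lines: List[str], patterns: List[str], operator: str = "or") -> List[str]:
    """Drop lines that match any ('or') or all ('and') patterns (hypothetical helper)."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    combine = any if operator == "or" else all
    compiled = [re.compile(pattern) for pattern in patterns]
    return [
        line
        for line in lines
        if not combine(regex.search(line) is not None for regex in compiled)
    ]


lines = ["abc 123", "abc", "123", "!!"]
patterns = [r"[a-z]", r"[0-9]"]
print(drop_lines_matching(lines, patterns, operator="or"))   # ['!!']
print(drop_lines_matching(lines, patterns, operator="and"))  # ['abc', '123', '!!']
```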

drop_empty_lines(self)#

Drop empty lines.

drop_lines_below_len(self, length, word_level=False)#

Drop lines with a number of characters/words less than the input length

Parameters
  • length (int) – Number of characters/words

  • word_level (bool, optional) – True to switch to word level, which splits the text by space, by default False

drop_lines_above_len(self, length, word_level=False)#

Drop lines with a number of characters/words more than the input length

Parameters
  • length (int) – Number of characters/words

  • word_level (bool, optional) – True to switch to word level, which splits the text by space, by default False
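The two length filters can be sketched together. Following the wording above, lines strictly shorter (or strictly longer) than length are dropped, so a line of exactly length is kept. The helper names are hypothetical.

```python
from typing import List


def line_length(line: str, word_level: bool = False) -> int:
    """Length in characters, or in space-separated words when word_level is True."""
    return len(line.split(" ")) if word_level else len(line)


def drop_below(lines: List[str], length: int, word_level: bool = False) -> List[str]:
    """Drop lines whose length is less than the given length."""
    return [line for line in lines if line_length(line, word_level) >= length]


def drop_above(lines: List[str], length: int, word_level: bool = False) -> List[str]:
    """Drop lines whose length is more than the given length."""
    return [line for line in lines if line_length(line, word_level) <= length]


lines = ["hi", "hello world", "a b c"]
print(drop_below(lines, 3))                   # ['hello world', 'a b c']
print(drop_below(lines, 3, word_level=True))  # ['a b c']
print(drop_above(lines, 5))                   # ['hi', 'a b c']
```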

drop_lines_contain_repeated_substring(self, repeated=3)#

Drop lines containing a substring that is repeated consecutively at least the given number of times

Parameters

repeated (int, optional) – Minimum number of repetitions, by default 3

drop_lines_contain_single_letter_word(self, arabic_letters=False, english_letters=False)#

Drop lines containing a single-letter word (e.g. “محمد و احمد” or “how r u”). In Arabic, single-letter words are rare.

Warning

In English, all lines containing the word “I” will be dropped since it is a single-letter word

See contains_single_letter_word(). See also connect_single_letter_word().

Parameters
  • arabic_letters (bool) –

  • english_letters (bool) –
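The warning above is easy to reproduce with a small sketch. It assumes a single-letter word is one English letter delimited by whitespace or the line boundaries; maha's actual pattern may differ, and has_single_letter_word is a hypothetical name.

```python
import re

# One ASCII letter with no non-space character on either side.
_SINGLE_LETTER_WORD = re.compile(r"(?<!\S)[A-Za-z](?!\S)")


def has_single_letter_word(line: str) -> bool:
    """Return True if the line contains a standalone English letter."""
    return _SINGLE_LETTER_WORD.search(line) is not None


lines = ["how r u", "I am here", "how are you"]
kept = [line for line in lines if not has_single_letter_word(line)]
print(kept)  # ['how are you']
```

Note that the grammatical sentence "I am here" is dropped along with "how r u".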

filter_lines_contain(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator='or')#

Keep lines that contain any of the selected strings or patterns.

Note

Use operator='and' to keep lines that contain all selected strings or patterns.

See contains() for arguments description

Parameters
  • arabic (bool) –

  • english (bool) –

  • arabic_letters (bool) –

  • english_letters (bool) –

  • english_small_letters (bool) –

  • english_capital_letters (bool) –

  • numbers (bool) –

  • harakat (bool) –

  • all_harakat (bool) –

  • tatweel (bool) –

  • lam_alef_variations (bool) –

  • lam_alef (bool) –

  • punctuations (bool) –

  • arabic_numbers (bool) –

  • english_numbers (bool) –

  • arabic_punctuations (bool) –

  • english_punctuations (bool) –

  • arabic_ligatures (bool) –

  • persian (bool) –

  • arabic_hashtags (bool) –

  • arabic_mentions (bool) –

  • emails (bool) –

  • english_hashtags (bool) –

  • english_mentions (bool) –

  • hashtags (bool) –

  • links (bool) –

  • mentions (bool) –

  • emojis (bool) –

  • custom_strings (list[str] | str | None) –

  • custom_expressions (list[str] | str | None) –

  • operator (str) –

class TextProcessor(text)[source]#

Bases: maha.processors.base_processor.BaseProcessor

For processing text input.

Parameters

text (Union[List[str], str]) – A text or list of strings to process

apply(self, fn)#

Applies a function to each line

Parameters

fn (Callable[[str], str]) – Function to apply

filter(self, fn)#

Keeps lines for which the input function is True

Parameters

fn (Callable[[str], bool]) – Function to check

get_lines(self, n_lines=100)#

Returns a generator of lists of strings, each of length n_lines

Parameters

n_lines (int) – Number of lines to yield. Defaults to 100

Yields

List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.

set_lines(self, text)#

Overrides text

Parameters

text (Union[List[str], str]) – New text or list of strings

property text(self)#

Returns the processed text joined by the newline separator \n

Returns

processed text

Return type

str

classmethod from_text(cls, text, sep=None)#

Creates a new processor from the given text. If sep is provided, the text is split into lines on that separator.

Parameters
  • text (str) – Text to process

  • sep (str, optional) – Separator used to split the given text, by default None

Returns

New text processor

Return type

TextProcessor
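A rough sketch of how sep and the text property relate: the input is split into lines on sep, and the processed lines are later joined with "\n". The sketch assumes that a missing sep leaves the text as a single line, which the reference does not state, and split_text is a hypothetical name.

```python
from typing import List, Optional


def split_text(text: str, sep: Optional[str] = None) -> List[str]:
    """Split text into lines on sep (hypothetical helper)."""
    # Assumption: without a separator the whole text is treated as one line.
    return [text] if sep is None else text.split(sep)


lines = split_text("first,second,third", sep=",")
joined = "\n".join(lines)  # mirrors what the text property is described to return
print(lines)  # ['first', 'second', 'third']
```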

classmethod from_list(cls, lines)#

Creates a new processor from the given list of strings.

Parameters

lines (List[str]) – list of strings

Returns

New text processor

Return type

TextProcessor

drop_duplicates(self)#

Drops duplicate lines from text
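The reference does not say which duplicate survives or whether order is kept. The sketch below assumes the first occurrence is kept and the original order is preserved; drop_duplicate_lines is a hypothetical name.

```python
from typing import List


def drop_duplicate_lines(lines: List[str]) -> List[str]:
    """Remove repeated lines, keeping the first occurrence of each."""
    # dict preserves insertion order, so the first occurrence wins.
    return list(dict.fromkeys(lines))


print(drop_duplicate_lines(["a", "b", "a", "c", "b"]))  # ['a', 'b', 'c']
```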

class FileProcessor(path)[source]#

Bases: TextProcessor

For processing file input.

Note

For large files (>100 MB), use StreamFileProcessor.

Parameters

path (Union[str, pathlib.Path]) – Path of the file to process.

Raises
  • FileNotFoundError – If the file doesn’t exist.

  • ValueError – If the file is empty.

class StreamTextProcessor(lines)[source]#

Bases: maha.processors.base_processor.BaseProcessor

For processing a stream of text input.

Parameters

lines (Iterable[str]) – An iterable of strings to process

apply(self, fn)#

Applies a function to each line

Parameters

fn (Callable[[str], str]) – Function to apply

filter(self, fn)#

Keeps lines for which the input function is True

Parameters

fn (Callable[[str], bool]) – Function to check

get_lines(self, n_lines=100)#

Returns a generator of lists of strings, each of length n_lines

Parameters

n_lines (int) – Number of lines to yield. Defaults to 100

Yields

List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.

process(self, n_lines=100)#

Applies all functions in sequence to the given iterable

Parameters

n_lines (int, optional) – Number of lines to process at a time, by default 100

Yields

List[str] – A list of processed lines; it can be empty.

Raises

ValueError – If no functions were selected.
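The deferred-execution model described above can be sketched as a small class that queues functions and runs them batch by batch only when process is iterated. MiniStreamProcessor is a hypothetical stand-in and not maha's class; in particular, having apply and filter return self for chaining is an assumption.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


class MiniStreamProcessor:
    """Queues line functions and applies them lazily, one batch at a time."""

    def __init__(self, lines: Iterable[str]) -> None:
        self.lines = lines
        self.functions: List[Callable[[List[str]], List[str]]] = []

    def apply(self, fn: Callable[[str], str]) -> "MiniStreamProcessor":
        self.functions.append(lambda batch: [fn(line) for line in batch])
        return self

    def filter(self, fn: Callable[[str], bool]) -> "MiniStreamProcessor":
        self.functions.append(lambda batch: [line for line in batch if fn(line)])
        return self

    def process(self, n_lines: int = 100) -> Iterator[List[str]]:
        # Raised on first iteration, because this method is a generator.
        if not self.functions:
            raise ValueError("No functions were selected.")
        iterator = iter(self.lines)
        while True:
            batch = list(islice(iterator, n_lines))
            if not batch:
                return
            for function in self.functions:
                batch = function(batch)
            yield batch  # may be empty if every line was filtered out


processor = MiniStreamProcessor(["", "", " x "]).apply(str.strip).filter(bool)
result = list(processor.process(n_lines=2))
print(result)  # [[], ['x']]
```

The first batch comes back empty because both of its lines were filtered out, which matches the note that a yielded list can be empty.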

apply_functions(self, text)#

Applies all functions in sequence to a given list of strings

Parameters

text (List[str]) – List of strings to process

class StreamFileProcessor(path, encoding='utf8')[source]#

Bases: StreamTextProcessor

For processing file stream input.

Parameters
  • path (Union[str, pathlib.Path]) – Path of the file to process.

  • encoding (str) – File encoding.

Raises

FileNotFoundError – If the file doesn’t exist.

get_lines(self, n_lines=100)#

Returns a generator of lists of strings, each of length n_lines

Parameters

n_lines (int) – Number of lines to yield. Defaults to 100

Yields

List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.

process_and_save(self, path, n_lines=100, override=False)#

Process the input file and save the result in the given path

Parameters
  • path (Union[str, pathlib.Path]) – Path to save the file

  • n_lines (int, optional) – Number of lines to process at a time, by default 100

  • override (bool, optional) – True to override the file if it exists, by default False

Raises

FileExistsError – If the file exists and override is False
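The override check and the batch-wise write can be sketched as below. This is a standalone illustration: save_batches is a hypothetical helper, and writing one line per row with a trailing newline in UTF-8 is an assumption about the output format.

```python
from pathlib import Path
from typing import Iterable, List, Union


def save_batches(
    batches: Iterable[List[str]],
    path: Union[str, Path],
    override: bool = False,
) -> None:
    """Write processed batches to path, refusing to overwrite unless override is True."""
    path = Path(path)
    if path.exists() and not override:
        raise FileExistsError(f"{path} already exists; pass override=True to replace it.")
    with path.open("w", encoding="utf8") as file:
        for batch in batches:
            # Assumption: each processed line is written on its own row.
            for line in batch:
                file.write(line + "\n")
```

Since batches are written as they arrive, memory use stays bounded by n_lines rather than by the size of the input file.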