maha.processors

Package Contents

Classes

BaseProcessor

Base class for all processors. It contains almost all functions needed for the processors.

TextProcessor

For processing text input.

FileProcessor

For processing file input.

StreamTextProcessor

For processing a stream of text input.

StreamFileProcessor

For processing file stream input.

class BaseProcessor

Bases: abc.ABC

Base class for all processors. It contains almost all functions needed for the processors.

Parameters

text (Union[List[str], str]) – A text or list of strings to process

abstract get_lines(self, n_lines=100)

Returns a generator that yields lists of strings, each of length n_lines.

Parameters

n_lines (int) – Number of lines per yielded list. Defaults to 100.

Yields

List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.
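
The chunking contract described above can be illustrated with a small standalone generator. This is only a sketch: chunk_lines is a hypothetical helper, not part of maha, and the library's own implementation may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_lines(lines: Iterable[str], n_lines: int = 100) -> Iterator[List[str]]:
    """Yield lists of at most n_lines strings (hypothetical helper, not maha code)."""
    iterator = iter(lines)
    while True:
        # islice pulls up to n_lines items without loading the whole input.
        chunk = list(islice(iterator, n_lines))
        if not chunk:
            return
        yield chunk


# Five lines in chunks of two: the last chunk is shorter than n_lines.
chunks = list(chunk_lines(["a", "b", "c", "d", "e"], n_lines=2))
```

Here chunks holds three lists, and only the final one has fewer than n_lines items.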

abstract apply(self, fn)

Applies a function to each line

Parameters

fn (Callable[[str], str]) – Function to apply

abstract filter(self, fn)

Keeps lines for which the input function returns True

Parameters

fn (Callable[[str], bool]) – Function to check

get(self, unique_characters=False, character_length=False, word_length=False)

Returns statistics about the provided text

Parameters
  • unique_characters (bool, optional) – Return all unique characters, by default False

  • character_length (bool, optional) – Return the character length of each string, by default False

  • word_length (bool, optional) – Return the word length of each string (split by space), by default False

Returns

  • If one argument is set to True, its value is returned.

  • If more than one argument is set to True, a dictionary is returned whose keys are the names of the arguments set to True, mapped to their corresponding values.

Return type

Union[Dict[str, Any], Any]
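
The two return shapes can be sketched with a standalone function. get_stats is a hypothetical stand-in, not maha's implementation. Returning the unique characters as a sorted list, and returning an empty dictionary when nothing is selected, are assumptions made only for this illustration.

```python
from typing import Any, Dict, List, Union


def get_stats(
    lines: List[str],
    unique_characters: bool = False,
    character_length: bool = False,
    word_length: bool = False,
) -> Union[Dict[str, Any], Any]:
    """Hypothetical stand-in showing the documented return shape of get()."""
    results: Dict[str, Any] = {}
    if unique_characters:
        # Assumption: unique characters are reported as a sorted list.
        results["unique_characters"] = sorted(set("".join(lines)))
    if character_length:
        results["character_length"] = [len(line) for line in lines]
    if word_length:
        # Word length counts space-separated tokens.
        results["word_length"] = [len(line.split(" ")) for line in lines]
    if len(results) == 1:
        # A single selected statistic is returned bare, not wrapped in a dict.
        return next(iter(results.values()))
    # Assumption: with nothing selected, this sketch returns an empty dict.
    return results


single = get_stats(["ab c", "a"], character_length=True)
several = get_stats(["ab c", "a"], character_length=True, word_length=True)
```

With one flag set, single is a plain list. With two flags set, several is a dictionary keyed by the argument names.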

print_unique_characters(self)

Prints all unique characters in the text

keep(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, use_space=True, custom_strings=None)

Applies keep() to each line

Parameters
  • arabic (bool) –

  • english (bool) –

  • arabic_letters (bool) –

  • english_letters (bool) –

  • english_small_letters (bool) –

  • english_capital_letters (bool) –

  • numbers (bool) –

  • harakat (bool) –

  • all_harakat (bool) –

  • punctuations (bool) –

  • arabic_numbers (bool) –

  • english_numbers (bool) –

  • arabic_punctuations (bool) –

  • english_punctuations (bool) –

  • use_space (bool) –

  • custom_strings (Union[List[str], str]) –

normalize(self, lam_alef=None, alef=None, waw=None, yeh=None, teh_marbuta=None, ligatures=None, spaces=None, all=None)

Applies normalize() to each line

Parameters
  • lam_alef (bool) –

  • alef (bool) –

  • waw (bool) –

  • yeh (bool) –

  • teh_marbuta (bool) –

  • ligatures (bool) –

  • spaces (bool) –

  • all (bool) –

connect_single_letter_word(self, waw=None, feh=None, beh=None, lam=None, kaf=None, teh=None, all=None, custom_strings=None)

Applies connect_single_letter_word() to each line

Parameters
  • waw (bool) –

  • feh (bool) –

  • beh (bool) –

  • lam (bool) –

  • kaf (bool) –

  • teh (bool) –

  • all (bool) –

  • custom_strings (Union[List[str], str]) –

replace(self, strings, with_value)

Applies replace() to each line

Parameters
  • strings (Union[List[str], str]) –

  • with_value (str) –

replace_expression(self, expression, with_value)

Applies replace_expression() to each line

Parameters
  • expression –

  • with_value –

replace_pairs(self, keys, values)

Applies replace_pairs() to each line

Parameters
  • keys (List[str]) –

  • values (List[str]) –

reduce_repeated_substring(self, min_repeated=3, reduce_to=2)

Applies reduce_repeated_substring() to each line

Parameters
  • min_repeated (int) –

  • reduce_to (int) –
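
The parameters are not described here, so the following sketch rests on an assumed reading: any substring repeated consecutively at least min_repeated times is collapsed to reduce_to repetitions. reduce_repeats is a hypothetical helper built on the standard re module, not the library's function.

```python
import re


def reduce_repeats(text: str, min_repeated: int = 3, reduce_to: int = 2) -> str:
    """Hypothetical sketch: collapse substrings repeated >= min_repeated times."""
    # (.+?) captures the shortest repeating unit; \1{k,} requires at least
    # k further consecutive copies of it, so k = min_repeated - 1.
    pattern = re.compile(r"(.+?)\1{%d,}" % (min_repeated - 1))
    return pattern.sub(lambda match: match.group(1) * reduce_to, text)


collapsed_letters = reduce_repeats("heeeello")
collapsed_syllables = reduce_repeats("hahahaha")
unchanged = reduce_repeats("hello")
```

Under this reading, the double "l" in "hello" is left alone because it falls below the default threshold of three repetitions.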

remove(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, use_space=True, custom_strings=None, custom_expressions=None)

Applies remove() to each line

Parameters
  • arabic (bool) –

  • english (bool) –

  • arabic_letters (bool) –

  • english_letters (bool) –

  • english_small_letters (bool) –

  • english_capital_letters (bool) –

  • numbers (bool) –

  • harakat (bool) –

  • all_harakat (bool) –

  • tatweel (bool) –

  • punctuations (bool) –

  • arabic_numbers (bool) –

  • english_numbers (bool) –

  • arabic_punctuations (bool) –

  • english_punctuations (bool) –

  • arabic_ligatures (bool) –

  • arabic_hashtags (bool) –

  • arabic_mentions (bool) –

  • emails (bool) –

  • english_hashtags (bool) –

  • english_mentions (bool) –

  • hashtags (bool) –

  • links (bool) –

  • mentions (bool) –

  • emojis (bool) –

  • use_space (bool) –

  • custom_strings (Union[List[str], str]) –

  • custom_expressions (Union[List[str], str]) –

drop_lines_contain(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator='or')

Drop lines that contain any of the selected strings or patterns.

Note

Use operator='and' to drop lines that contain all selected strings or patterns.

See contains() for arguments description

Parameters
  • arabic (bool) –

  • english (bool) –

  • arabic_letters (bool) –

  • english_letters (bool) –

  • english_small_letters (bool) –

  • english_capital_letters (bool) –

  • numbers (bool) –

  • harakat (bool) –

  • all_harakat (bool) –

  • tatweel (bool) –

  • lam_alef_variations (bool) –

  • lam_alef (bool) –

  • punctuations (bool) –

  • arabic_numbers (bool) –

  • english_numbers (bool) –

  • arabic_punctuations (bool) –

  • english_punctuations (bool) –

  • arabic_ligatures (bool) –

  • persian (bool) –

  • arabic_hashtags (bool) –

  • arabic_mentions (bool) –

  • emails (bool) –

  • english_hashtags (bool) –

  • english_mentions (bool) –

  • hashtags (bool) –

  • links (bool) –

  • mentions (bool) –

  • emojis (bool) –

  • custom_strings (Union[List[str], str]) –

  • custom_expressions (Union[List[str], str]) –

  • operator (str) –
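
The difference between operator='or' and operator='and' can be sketched with plain regular expressions standing in for the selected patterns. drop_lines_matching and the two sample patterns are hypothetical and are not maha's implementation.

```python
import re
from typing import List


def drop_lines_matching(
    lines: List[str], patterns: List[str], operator: str = "or"
) -> List[str]:
    """Hypothetical sketch of the 'or'/'and' drop semantics using regex patterns."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    # 'or' drops a line if any pattern matches; 'and' only if all of them match.
    combine = any if operator == "or" else all
    return [
        line
        for line in lines
        if not combine(re.search(pattern, line) for pattern in patterns)
    ]


sample = ["call 123", "hello", "a@b.com 42"]
digits, at_sign = r"\d", r"@"

kept_with_or = drop_lines_matching(sample, [digits, at_sign])
kept_with_and = drop_lines_matching(sample, [digits, at_sign], operator="and")
```

With 'or', both lines containing a digit or an at sign are dropped. With 'and', only the line containing both is dropped.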

drop_empty_lines(self)

Drop empty lines.

drop_lines_below_len(self, length, word_level=False)

Drop lines with a number of characters/words less than the input length

Parameters
  • length (int) – Number of characters/words

  • word_level (bool, optional) – True to switch to word level, which splits the text by space, by default False

drop_lines_above_len(self, length, word_level=False)

Drop lines with a number of characters/words more than the input length

Parameters
  • length (int) – Number of characters/words

  • word_level (bool, optional) – True to switch to word level, which splits the text by space, by default False
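
The two length filters can be sketched as follows. The helper names are hypothetical. The only behavior taken from the descriptions above is that lines strictly shorter (or strictly longer) than length are dropped, and that word_level switches to counting space-separated words.

```python
from typing import List


def line_length(line: str, word_level: bool = False) -> int:
    """Length in characters, or in space-separated words when word_level is True."""
    return len(line.split(" ")) if word_level else len(line)


def drop_below(lines: List[str], length: int, word_level: bool = False) -> List[str]:
    # Lines with fewer than `length` units are dropped; equal length is kept.
    return [line for line in lines if line_length(line, word_level) >= length]


def drop_above(lines: List[str], length: int, word_level: bool = False) -> List[str]:
    # Lines with more than `length` units are dropped; equal length is kept.
    return [line for line in lines if line_length(line, word_level) <= length]


lines = ["hi", "hello world", "a b c d"]
at_least_3_chars = drop_below(lines, 3)
at_least_3_words = drop_below(lines, 3, word_level=True)
at_most_2_words = drop_above(lines, 2, word_level=True)
```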

drop_lines_contain_repeated_substring(self, repeated=3)

Drop lines containing a substring that is repeated consecutively at least the given number of times

Parameters

repeated (int, optional) – Minimum number of repetitions, by default 3

drop_lines_contain_single_letter_word(self, arabic_letters=False, english_letters=False)

Drop lines containing a single-letter word (e.g. “محمد و احمد” or “how r u”). In Arabic, single-letter words are rare.

Warning

In English, all lines containing the word “I” will be dropped, since it is considered a single-letter word.

See contains_single_letter_word(). See also connect_single_letter_word().

Parameters
  • arabic_letters (bool) –

  • english_letters (bool) –
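
The warning above is easy to see with a simplified check. has_single_letter_word is a hypothetical helper that treats any whitespace-separated one-letter token as a single-letter word. The library's own detection may be stricter or script-aware.

```python
def has_single_letter_word(line: str) -> bool:
    """Hypothetical check: True if any whitespace-separated token is one letter."""
    return any(len(token) == 1 and token.isalpha() for token in line.split())


texting_style = has_single_letter_word("how r u")
pronoun_i = has_single_letter_word("I agree")
regular = has_single_letter_word("how are you")
arabic_waw = has_single_letter_word("محمد و احمد")
```

Note that pronoun_i is True: an ordinary English sentence using "I" would be dropped by this kind of filter.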

filter_lines_contain(self, arabic=False, english=False, arabic_letters=False, english_letters=False, english_small_letters=False, english_capital_letters=False, numbers=False, harakat=False, all_harakat=False, tatweel=False, lam_alef_variations=False, lam_alef=False, punctuations=False, arabic_numbers=False, english_numbers=False, arabic_punctuations=False, english_punctuations=False, arabic_ligatures=False, persian=False, arabic_hashtags=False, arabic_mentions=False, emails=False, english_hashtags=False, english_mentions=False, hashtags=False, links=False, mentions=False, emojis=False, custom_strings=None, custom_expressions=None, operator='or')

Keep lines that contain any of the selected strings or patterns.

Note

Use operator='and' to keep only lines that contain all selected strings or patterns.

See contains() for arguments description

Parameters
  • arabic (bool) –

  • english (bool) –

  • arabic_letters (bool) –

  • english_letters (bool) –

  • english_small_letters (bool) –

  • english_capital_letters (bool) –

  • numbers (bool) –

  • harakat (bool) –

  • all_harakat (bool) –

  • tatweel (bool) –

  • lam_alef_variations (bool) –

  • lam_alef (bool) –

  • punctuations (bool) –

  • arabic_numbers (bool) –

  • english_numbers (bool) –

  • arabic_punctuations (bool) –

  • english_punctuations (bool) –

  • arabic_ligatures (bool) –

  • persian (bool) –

  • arabic_hashtags (bool) –

  • arabic_mentions (bool) –

  • emails (bool) –

  • english_hashtags (bool) –

  • english_mentions (bool) –

  • hashtags (bool) –

  • links (bool) –

  • mentions (bool) –

  • emojis (bool) –

  • custom_strings (Union[List[str], str]) –

  • custom_expressions (Union[List[str], str]) –

  • operator (str) –

class TextProcessor(text)

Bases: maha.processors.base_processor.BaseProcessor

For processing text input.

Parameters

text (Union[List[str], str]) – A text or list of strings to process

apply(self, fn)

Applies a function to each line

Parameters

fn (Callable[[str], str]) – Function to apply

filter(self, fn)

Keeps lines for which the input function returns True

Parameters

fn (Callable[[str], bool]) – Function to check

get_lines(self, n_lines=100)

Returns a generator that yields lists of strings, each of length n_lines.

Parameters

n_lines (int) – Number of lines per yielded list. Defaults to 100.

Yields

List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.

set_lines(self, text)

Overrides text

Parameters

text (Union[List[str], str]) – New text or list of strings

property text(self)

Returns the processed text joined by the newline separator \n

Returns

processed text

Return type

str

classmethod from_text(cls, text, sep=None)

Creates a new processor from the given text. If sep is provided, the text is split on it.

Parameters
  • text (str) – Text to process

  • sep (str, optional) – Separator used to split the given text, by default None

Returns

New text processor

Return type

TextProcessor
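
The relationship between from_text, apply, filter, and the text property can be sketched with a minimal stand-in class. MiniTextProcessor is hypothetical and is not maha's implementation. Treating unsplit text as a single line and returning self so that calls can be chained are both assumptions of this sketch.

```python
from typing import Callable, List, Optional


class MiniTextProcessor:
    """Hypothetical stand-in for TextProcessor; not maha's implementation."""

    def __init__(self, lines: List[str]) -> None:
        self.lines = list(lines)

    @classmethod
    def from_text(cls, text: str, sep: Optional[str] = None) -> "MiniTextProcessor":
        # Assumption: without sep, the whole text is treated as one line.
        return cls(text.split(sep) if sep is not None else [text])

    @property
    def text(self) -> str:
        # Processed lines joined by the newline separator.
        return "\n".join(self.lines)

    def apply(self, fn: Callable[[str], str]) -> "MiniTextProcessor":
        self.lines = [fn(line) for line in self.lines]
        return self  # Assumption: returning self allows chaining.

    def filter(self, fn: Callable[[str], bool]) -> "MiniTextProcessor":
        self.lines = [line for line in self.lines if fn(line)]
        return self


processor = MiniTextProcessor.from_text("a,b, c", sep=",")
result = processor.apply(str.strip).filter(lambda line: line != "b").text
```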

classmethod from_list(cls, lines)

Creates a new processor from the given list of strings.

Parameters

lines (List[str]) – List of strings

Returns

New text processor

Return type

TextProcessor

drop_duplicates(self)

Drops duplicate lines from text

class FileProcessor(path)

Bases: TextProcessor

For processing file input.

Note

For large files (>100 MB), use StreamFileProcessor.

Parameters

path (Union[str, pathlib.Path]) – Path of the file to process.

Raises
  • FileNotFoundError – If the file doesn’t exist.

  • ValueError – If the file is empty.

class StreamTextProcessor(lines)

Bases: maha.processors.base_processor.BaseProcessor

For processing a stream of text input.

Parameters

lines (Iterable[str]) – An iterable of strings to process

apply(self, fn)

Applies a function to each line

Parameters

fn (Callable[[str], str]) – Function to apply

filter(self, fn)

Keeps lines for which the input function returns True

Parameters

fn (Callable[[str], bool]) – Function to check

get_lines(self, n_lines=100)

Returns a generator that yields lists of strings, each of length n_lines.

Parameters

n_lines (int) – Number of lines per yielded list. Defaults to 100.

Yields

List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.

process(self, n_lines=100)

Applies all functions in sequence to the given iterable

Parameters

n_lines (int, optional) – Number of lines to process at a time, by default 100

Yields

List[str] – A list of processed lines, which can be empty.

Raises

ValueError – If no functions were selected.

apply_functions(self, text)

Applies all functions in sequence to a given list of strings

Parameters

text (List[str]) – List of strings to process
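
The deferred, chunk-by-chunk behavior of process() and apply_functions() can be sketched with a minimal stand-in. MiniStreamProcessor is hypothetical and is not maha's implementation. It assumes that apply and filter only queue work, and that the queued functions run in order on each chunk when process() is iterated.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


class MiniStreamProcessor:
    """Hypothetical stand-in for the queue-then-process pattern."""

    def __init__(self, lines: Iterable[str]) -> None:
        self.lines = lines
        self.functions: List[Callable[[List[str]], List[str]]] = []

    def apply(self, fn: Callable[[str], str]) -> "MiniStreamProcessor":
        # Nothing runs yet; the step is only queued.
        self.functions.append(lambda chunk: [fn(line) for line in chunk])
        return self

    def filter(self, fn: Callable[[str], bool]) -> "MiniStreamProcessor":
        self.functions.append(lambda chunk: [line for line in chunk if fn(line)])
        return self

    def apply_functions(self, text: List[str]) -> List[str]:
        # Run every queued step, in order, on one list of strings.
        for function in self.functions:
            text = function(text)
        return text

    def process(self, n_lines: int = 100) -> Iterator[List[str]]:
        if not self.functions:
            raise ValueError("No functions were selected.")
        iterator = iter(self.lines)
        while True:
            chunk = list(islice(iterator, n_lines))
            if not chunk:
                return
            yield self.apply_functions(chunk)


stream = MiniStreamProcessor(iter([" ", "", "b"])).apply(str.strip).filter(bool)
processed = list(stream.process(n_lines=2))
```

In this run the first chunk is filtered down to nothing, which is why a yielded list can be empty.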

class StreamFileProcessor(path, encoding='utf8')

Bases: StreamTextProcessor

For processing file stream input.

Parameters
  • path (Union[str, pathlib.Path]) – Path of the file to process.

  • encoding (str) – File encoding.

Raises

FileNotFoundError – If the file doesn’t exist.

get_lines(self, n_lines=100)

Returns a generator that yields lists of strings, each of length n_lines.

Parameters

n_lines (int) – Number of lines per yielded list. Defaults to 100.

Yields

List[str] – List of strings of length n_lines. The last list may be shorter than n_lines.

process_and_save(self, path, n_lines=100, override=False)

Process the input file and save the result in the given path

Parameters
  • path (Union[str, pathlib.Path]) – Path to save the file

  • n_lines (int, optional) – Number of lines to process at a time, by default 100

  • override (bool, optional) – True to override the file if it exists, by default False

Raises

FileExistsError – If the file exists and override is False
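
The override guard can be sketched with the standard library alone. save_chunks is a hypothetical helper, not maha's implementation. It assumes that processed chunks are written one line per string and that an existing target is refused unless override is True.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Union


def save_chunks(
    chunks: Iterable[List[str]], path: Union[str, Path], override: bool = False
) -> None:
    """Hypothetical sketch: write processed chunks, refusing to overwrite by default."""
    path = Path(path)
    if path.exists() and not override:
        raise FileExistsError(f"{path} already exists; pass override=True to replace it")
    with path.open("w", encoding="utf8") as file:
        for chunk in chunks:
            # Empty chunks simply contribute no lines.
            for line in chunk:
                file.write(line + "\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out.txt"
    save_chunks([["a", "b"], [], ["c"]], target)
    written = target.read_text(encoding="utf8")
    try:
        save_chunks([["x"]], target)  # Second write without override is refused.
        refused = False
    except FileExistsError:
        refused = True
```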