maha.processors.basic_processors

All basic processors

Module Contents

Classes

TextProcessor

For processing text input.

FileProcessor

For processing file input.

class TextProcessor(text)

Bases: maha.processors.base_processor.BaseProcessor

For processing text input.

Parameters

text (Union[List[str], str]) – A text or list of strings to process

apply(self, fn)

Applies a function to each line

Parameters

fn (Callable[[str], str]) – Function to apply
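
The per-line behaviour described above can be illustrated without the library. The sketch below is a plain-Python stand-in, not the implementation used by TextProcessor; the helper name apply_to_lines is hypothetical.

```python
from typing import Callable, List


def apply_to_lines(lines: List[str], fn: Callable[[str], str]) -> List[str]:
    """Hypothetical stand-in: call fn on every line and keep the results in order."""
    return [fn(line) for line in lines]


lines = ["  hello ", "world  "]
# str.strip matches the Callable[[str], str] shape expected for fn.
stripped = apply_to_lines(lines, str.strip)
```

Any function that takes one string and returns one string fits the documented type of fn.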

filter(self, fn)

Keeps only the lines for which the input function returns True

Parameters

fn (Callable[[str], bool]) – Function to check
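
As a rough illustration of this predicate-based filtering, the following plain-Python sketch keeps the lines for which fn returns True. The helper name keep_lines is hypothetical and stands in for the method; it is not taken from the library.

```python
from typing import Callable, List


def keep_lines(lines: List[str], fn: Callable[[str], bool]) -> List[str]:
    """Hypothetical stand-in: keep only the lines for which fn returns True."""
    return [line for line in lines if fn(line)]


lines = ["first", "", "second", "   "]
# Keep lines that contain at least one non-whitespace character.
non_blank = keep_lines(lines, lambda line: bool(line.strip()))
```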

get_lines(self, n_lines=100)

Returns a generator that yields lists of strings, each of length n_lines

Parameters

n_lines (int) – Number of lines in each yielded list. Defaults to 100.

Yields

List[str] – List of strings with length of n_lines. The last list may be shorter than n_lines.
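
The batching behaviour described here, including the shorter final batch, can be sketched as a small generator. This is an illustration under the assumption that lines are held in a list; batch_lines is a hypothetical name, not part of the library.

```python
from typing import Iterator, List


def batch_lines(lines: List[str], n_lines: int = 100) -> Iterator[List[str]]:
    """Hypothetical stand-in: yield consecutive chunks of at most n_lines lines."""
    for start in range(0, len(lines), n_lines):
        yield lines[start:start + n_lines]


all_lines = [f"line {i}" for i in range(250)]
# 250 lines in batches of 100: two full batches, then the remaining 50.
batch_sizes = [len(batch) for batch in batch_lines(all_lines, n_lines=100)]
```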

set_lines(self, text)

Overrides the current text with the given text

Parameters

text (Union[List[str], str]) – New text or list of strings

property text(self)

Returns the processed text joined by the newline separator \n

Returns

processed text

Return type

str

classmethod from_text(cls, text, sep=None)

Creates a new processor from the given text. The text is split by sep if it is provided.

Parameters
  • text (str) – Text to process

  • sep (str, optional) – Separator used to split the given text, by default None

Returns

New text processor

Return type

TextProcessor
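
The effect of sep can be sketched in plain Python. The example assumes that, when sep is None, the whole text is kept as a single entry; the page does not state this, so treat it as a guess. The helper name split_text is hypothetical.

```python
from typing import List, Optional


def split_text(text: str, sep: Optional[str] = None) -> List[str]:
    """Hypothetical stand-in for the splitting step of from_text."""
    if sep is None:
        # Assumption: without a separator the text stays as one entry.
        return [text]
    return text.split(sep)


with_sep = split_text("a,b,c", sep=",")
without_sep = split_text("a,b,c")
# The text property is documented as joining lines with the newline separator.
joined = "\n".join(with_sep)
```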

classmethod from_list(cls, lines)

Creates a new processor from the given list of strings.

Parameters

lines (List[str]) – List of strings

Returns

New text processor

Return type

TextProcessor

drop_duplicates(self)

Drops duplicate lines from text
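
One common way to drop duplicate lines is shown below. It assumes that the first occurrence of each line is kept and that the original order is preserved; the page does not confirm either point. The helper name unique_lines is hypothetical.

```python
from typing import List


def unique_lines(lines: List[str]) -> List[str]:
    """Hypothetical stand-in: remove repeated lines, keeping first occurrences."""
    # dict preserves insertion order, so the first occurrence of each line wins.
    return list(dict.fromkeys(lines))


deduplicated = unique_lines(["a", "b", "a", "c", "b"])
```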

class FileProcessor(path)

Bases: TextProcessor

For processing file input.

Note

For large files (>100 MB), use StreamFileProcessor.

Parameters

path (Union[str, pathlib.Path]) – Path of the file to process.

Raises
  • FileNotFoundError – If the file doesn’t exist.

  • ValueError – If the file is empty.
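
The two documented error conditions can be sketched with the standard library alone. This is an illustration, not the FileProcessor implementation: the helper name read_lines_checked, the UTF-8 encoding, and the reading of "empty" as "contains no lines" are all assumptions.

```python
import tempfile
from pathlib import Path
from typing import List, Union


def read_lines_checked(path: Union[str, Path]) -> List[str]:
    """Hypothetical stand-in: read a file's lines, raising the documented errors."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"File {file_path} does not exist.")
    # Assumption: the file is UTF-8 text.
    lines = file_path.read_text(encoding="utf-8").splitlines()
    if not lines:
        # Assumption: "empty" means the file contains no lines.
        raise ValueError(f"File {file_path} is empty.")
    return lines


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_text("first\nsecond\n", encoding="utf-8")
    loaded = read_lines_checked(sample)

    empty = Path(tmp_dir) / "empty.txt"
    empty.write_text("", encoding="utf-8")
    try:
        read_lines_checked(empty)
        empty_error = None
    except ValueError as error:
        empty_error = error
```

Reading the whole file at once, as this sketch does, is only reasonable for small inputs, which is consistent with the note above about large files.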