hotpdf.hotpdf.HotPdf

class hotpdf.hotpdf.HotPdf(pdf_file: PurePath | str | IOBase | None = None, password: str = '', page_numbers: list[int] | None = None, extraction_tolerance: int = 4, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False)
__init__(pdf_file: PurePath | str | IOBase | None = None, password: str = '', page_numbers: list[int] | None = None, extraction_tolerance: int = 4, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False) -> None

Initialize the HotPdf class.

Parameters:
  • pdf_file (PurePath | str | IOBase, optional) – The path to the PDF file to be loaded, or a file-like object (IOBase). Defaults to None.

  • password (str, optional) – Password used to unlock the PDF. Defaults to an empty string.

  • page_numbers (list[int], optional) – Page numbers (0-indexed) to load into memory. If not provided, all pages are loaded (default).

  • extraction_tolerance (int, optional) – Tolerance value used during text extraction to adjust the bounding box for capturing text. Defaults to 4.

  • laparams (dict[str, float | bool], optional) – Layout parameters for pdfminer. Defaults to None.

  • include_annotation_spaces (bool, optional) – Add annotation spaces to the memory map. Defaults to False.

  • preserve_pdfminer_coordinates (bool, optional) – Preserve pdfminer y-coordinate values. Defaults to False, which uses natural coordinates.
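
The parameters above can be combined as in the following minimal sketch. The helper names `build_hotpdf_kwargs` and `open_pdf` and the `first_n_pages` argument are hypothetical and not part of hotpdf; only the keyword names and defaults are taken from the signature documented here. The import path follows the qualified name `hotpdf.hotpdf.HotPdf` and is done lazily, so the first helper runs without hotpdf installed.

```python
from __future__ import annotations

from pathlib import PurePath
from typing import Any


def build_hotpdf_kwargs(
    pdf_file: PurePath | str,
    first_n_pages: int | None = None,
    password: str = "",
) -> dict[str, Any]:
    """Collect keyword arguments for HotPdf(...). Hypothetical helper, not part of hotpdf."""
    kwargs: dict[str, Any] = {
        "pdf_file": pdf_file,
        "password": password,
        "extraction_tolerance": 4,  # documented default
        "include_annotation_spaces": False,  # documented default
        "preserve_pdfminer_coordinates": False,  # documented default
    }
    if first_n_pages is not None:
        if first_n_pages < 1:
            raise ValueError("first_n_pages must be at least 1")
        # page_numbers is 0-indexed, so the first N pages are 0 .. N-1.
        kwargs["page_numbers"] = list(range(first_n_pages))
    return kwargs


def open_pdf(
    pdf_file: PurePath | str,
    first_n_pages: int | None = None,
    password: str = "",
) -> Any:
    """Construct a HotPdf instance. Requires the hotpdf package to be installed."""
    # Imported lazily so build_hotpdf_kwargs can be used without hotpdf.
    from hotpdf.hotpdf import HotPdf

    return HotPdf(**build_hotpdf_kwargs(pdf_file, first_n_pages, password))
```

With hotpdf installed, a call such as `open_pdf("report.pdf", first_n_pages=3)` would request pages 0 to 2 only; `report.pdf` is a placeholder file name.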

Raises:

Methods

__init__([pdf_file, password, page_numbers, ...])

Initialize the HotPdf class.

extract_page_text(page)

Extract text from a specified page.

extract_spans(x0, y0, x1, y1[, page, sort])

Extract spans that intersect with the given bounding box.

extract_spans_text(x0, y0, x1, y1[, page])

Extract text from spans that intersect with the given bounding box.

extract_text(x0, y0, x1, y1[, page])

Extract text from a specified bounding box on a page.

find_text(query[, pages, take_span, sort, ...])

Find text within the loaded PDF pages.

load(pdf_file[, password, page_numbers, ...])

Load a PDF file into memory.

merge_multiple(hotpdfs)

Merge multiple HotPdf objects and return a single HotPdf object containing all pages.
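
As a rough illustration of how the extraction methods listed above might be called, the sketch below wraps `extract_text` and `extract_page_text` using only the parameter names shown in the summary. The helpers `region_text` and `page_texts` are hypothetical. The return types are not stated on this page, so they are passed through unchanged, and treating `page` as 0-indexed in these methods is an assumption based on the `page_numbers` parameter.

```python
from __future__ import annotations

from typing import Any


def region_text(doc: Any, x0: int, y0: int, x1: int, y1: int, page: int = 0) -> Any:
    """Return whatever doc.extract_text yields for one bounding box.

    `doc` is expected to be a loaded HotPdf instance. `page` is assumed
    to be 0-indexed, which this page does not confirm for this method.
    """
    return doc.extract_text(x0=x0, y0=y0, x1=x1, y1=y1, page=page)


def page_texts(doc: Any, pages: list[int]) -> dict[int, Any]:
    """Map each requested page number to the result of doc.extract_page_text."""
    return {page: doc.extract_page_text(page=page) for page in pages}
```

Both helpers only forward to the documented methods, so they work with any object exposing the same method names, which keeps them easy to exercise with a stand-in object in tests.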