hotpdf.hotpdf.HotPdf
- class hotpdf.hotpdf.HotPdf(pdf_file: PurePath | str | IOBase | None = None, password: str = '', page_numbers: list[int] | None = None, extraction_tolerance: int = 4, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False)
- __init__(pdf_file: PurePath | str | IOBase | None = None, password: str = '', page_numbers: list[int] | None = None, extraction_tolerance: int = 4, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False) → None
Initialize the HotPdf class.
- Parameters:
pdf_file (PurePath | str | IOBase, optional) – Path to the PDF file to load, or an open binary file-like object. Defaults to None.
password (str, optional) – Password used to unlock the PDF. Defaults to an empty string.
page_numbers (list[int], optional) – 0-indexed page numbers to load into memory. If not provided, all pages are loaded (default).
extraction_tolerance (int, optional) – Tolerance value used during text extraction to adjust the bounding box for capturing text. Defaults to 4.
laparams (dict[str, float | bool], optional) – Layout parameters passed to pdfminer. Defaults to None.
include_annotation_spaces (bool, optional) – Add annotation spaces to the memory map. Defaults to False.
preserve_pdfminer_coordinates (bool, optional) – Preserve pdfminer's y-coordinate values. Defaults to False, which uses natural coordinates.
- Raises:
ValueError – If the page range is invalid.
FileNotFoundError – If the file is not found.
PermissionError – If the file is encrypted or the password is wrong.
RuntimeError – If an unknown error is raised by the transformer.
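As a usage illustration, the sketch below wraps the constructor and maps each of the four documented exceptions to a short message. It assumes hotpdf is installed and importable as `hotpdf.hotpdf.HotPdf`. The helper name `open_hotpdf`, its `loader` parameter, and the message strings are hypothetical and exist only for this example.

```python
from typing import Any, Callable, Optional, Tuple


def open_hotpdf(
    pdf_file: Any,
    loader: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> Tuple[Optional[Any], Optional[str]]:
    """Hypothetical helper: return (document, None) or (None, error message)."""
    if loader is None:
        # Imported lazily so the helper itself does not require hotpdf
        # unless the default loader is actually used.
        from hotpdf.hotpdf import HotPdf

        loader = HotPdf
    try:
        # Keyword arguments such as password, page_numbers or
        # extraction_tolerance are forwarded unchanged to the constructor.
        return loader(pdf_file, **kwargs), None
    except FileNotFoundError:
        return None, "file not found"
    except PermissionError:
        return None, "file is encrypted or the password is wrong"
    except ValueError:
        return None, "invalid page range"
    except RuntimeError:
        return None, "unknown transformer error"
```

A call might look like `doc, error = open_hotpdf("example.pdf", page_numbers=[0, 1])`, where `example.pdf` is a placeholder file name. Returning the error as a value rather than re-raising is a choice made for this sketch, not something the library prescribes.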
Methods
__init__([pdf_file, password, page_numbers, ...]) – Initialize the HotPdf class.
extract_page_text(page) – Extract text from a specified page.
extract_spans(x0, y0, x1, y1[, page, sort]) – Extract spans that intersect with the given bounding box.
extract_spans_text(x0, y0, x1, y1[, page]) – Extract text from spans that intersect with the given bounding box.
extract_text(x0, y0, x1, y1[, page]) – Extract text from a specified bounding box on a page.
find_text(query[, pages, take_span, sort, ...]) – Find text within the loaded PDF pages.
load(pdf_file[, password, page_numbers, ...]) – Load a PDF file into memory.
merge_multiple(hotpdfs) – Merge multiple HotPdf objects and return a single HotPdf object consisting of all pages.
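Several of the methods above take a bounding box (`x0, y0, x1, y1`) and return whatever intersects it. For intuition only, the following sketch shows one common way to test whether two axis-aligned rectangles intersect, with an optional tolerance that grows the query box. It is not taken from hotpdf's source: the `BBox` type, the `intersects` function, and the way the tolerance is applied are all assumptions made for illustration.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Hypothetical axis-aligned rectangle with x0 <= x1 and y0 <= y1."""

    x0: float
    y0: float
    x1: float
    y1: float


def intersects(query: BBox, span: BBox, tolerance: float = 0.0) -> bool:
    """Return True if `span` overlaps `query` after growing `query` by `tolerance`."""
    # Assumed behaviour: the tolerance widens the query box on every side.
    qx0, qy0 = query.x0 - tolerance, query.y0 - tolerance
    qx1, qy1 = query.x1 + tolerance, query.y1 + tolerance
    # Two rectangles intersect only if they overlap on both axes.
    overlaps_x = qx0 <= span.x1 and span.x0 <= qx1
    overlaps_y = qy0 <= span.y1 and span.y0 <= qy1
    return overlaps_x and overlaps_y
```

With a query box of `BBox(0, 0, 10, 10)`, a span at `BBox(12, 0, 20, 10)` misses the box when the tolerance is 0 but is caught when the tolerance is 4, which is the kind of effect a tolerance setting is typically meant to have.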