Usage

HotPdf Class

class hotpdf.HotPdf(pdf_file: PurePath | str | IOBase | None = None, password: str = '', page_numbers: list[int] | None = None, extraction_tolerance: int = 4, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False)
hotpdf.HotPdf.__init__(self, pdf_file: PurePath | str | IOBase | None = None, password: str = '', page_numbers: list[int] | None = None, extraction_tolerance: int = 4, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False) → None

Initialize the HotPdf class.

Parameters:
  • pdf_file (PurePath | str | IOBase, optional) – The path to the PDF file to be loaded, or an open binary file stream. If omitted, load the file later with .load().

  • password (str, optional) – Password used to unlock the PDF.

  • page_numbers (list[int], optional) – Pages to be loaded into memory. (0-indexed). If not provided, will load all pages (default).

  • extraction_tolerance (int, optional) – Tolerance value used during text extraction to adjust the bounding box for capturing text. Defaults to 4.

  • laparams (dict[str, Union[float, bool]], optional) – Layout parameters for pdfminer.

  • include_annotation_spaces (bool, optional) – Add annotation spaces to the memory map. Default: False

  • preserve_pdfminer_coordinates (bool, optional) – Preserve pdfminer y-coordinate values. Default: False (use natural coordinates).

Raises:

The HotPdf class is a wrapper around your PDF that lets you search for text in the document and extract text from it.

from hotpdf import HotPdf
pdf_file_path = "path to your pdf file"

# Load directly from Path
hotpdf_document = HotPdf(pdf_file_path)

# Load from file stream
with open(pdf_file_path, "rb") as f:
    hotpdf_document_2 = HotPdf(f)

Alternatively, you can defer loading and call the .load() function later. The outcome is the same: internally, the HotPdf constructor calls .load().

from hotpdf import HotPdf
pdf_file_path = "path to your pdf file"

# path
hotpdf_document = HotPdf()
# load() returns None, so call it for its effect instead of reassigning
hotpdf_document.load(pdf_file_path)

# file stream
hotpdf_document_2 = HotPdf()
with open(pdf_file_path, "rb") as f:
    hotpdf_document_2.load(f)

hotpdf.HotPdf.load(self, pdf_file: PurePath | str | IOBase, password: str = '', page_numbers: list[int] | None = None, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False) → None

Load a PDF file into memory.

Parameters:
  • pdf_file (PurePath | str | IOBase) – The path to the PDF file to be loaded, or an open binary file stream.

  • password (str, optional) – Password used to unlock the PDF.

  • page_numbers (list[int], optional) – Pages to be loaded into memory. (0-indexed). If not provided, will load all pages (default).

  • laparams (dict[str, Union[float, bool]], optional) – Layout parameters for pdfminer.

  • include_annotation_spaces (bool, optional) – Add annotation spaces to the memory map. Default: False

  • preserve_pdfminer_coordinates (bool, optional) – Preserve pdfminer y-coordinate values. Default: False (use natural coordinates).

Raises:

Exception – If an unknown error is generated by pdfminer.
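
Because page_numbers is 0-indexed, page ranges written the way a reader would say them (1-based, such as "1-3,5") need converting before they are passed in. The helper below is a minimal sketch of that conversion. parse_page_ranges is a hypothetical name used for illustration and is not part of hotpdf.

```python
def parse_page_ranges(spec: str) -> list[int]:
    """Convert a 1-based page spec such as "1-3,5" into sorted 0-indexed page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-based (human) numbering to 0-based page_numbers.
        pages.update(range(start - 1, end))
    return sorted(pages)
```

The result can then be passed as the page_numbers argument, for example HotPdf(pdf_file_path, page_numbers=parse_page_ranges("1-3,5")).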

You can also merge multiple HotPdf objects to get one single HotPdf object!

merged_hotpdf_object = HotPdf.merge_multiple(hotpdfs=[
    hotpdf_document,
    hotpdf_document_2,
])

hotpdf.HotPdf.merge_multiple(hotpdfs: list[HotPdf]) → HotPdf

Merge multiple HotPdf objects and return a single HotPdf object consisting of all pages

Parameters:

hotpdfs (list[HotPdf]) – List of HotPdf objects that will be combined to form one single HotPdf object. All other params will be ignored in this case.

Raises:

HotPdfIsNoneError – If any of the HotPdf objects in the hotpdfs list is None

Returns:

Merged HotPdf object

Return type:

HotPdf

File Operations

Length

The number of pages in the PDF file is the length of the pages property of the HotPdf object.

num_pages = len(hotpdf_document.pages)
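
The page count is mostly useful for looping over the document. As a minimal sketch, the helper below collects the text of every loaded page by combining len(document.pages) with extract_page_text. extract_all_pages is a hypothetical name, not a hotpdf function, and the helper assumes it is given a loaded HotPdf object.

```python
def extract_all_pages(document) -> list[str]:
    """Return the text of every loaded page, in page order."""
    # `document` is expected to be a loaded HotPdf object, or anything that
    # exposes the same `pages` and `extract_page_text` members.
    return [
        document.extract_page_text(page=page_index)
        for page_index in range(len(document.pages))
    ]
```

Calling extract_all_pages(hotpdf_document) would then give one string per page.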

Extraction

extract_text

To extract the text at a specific position in the PDF, use the extract_text function. It returns the text that lies within the specified bounding box on the specified page (default: page 0).

text_in_bbox = hotpdf_document.extract_text(
   x0=0,
   y0=0,
   x1=100,
   y1=10,
   page=0,
)

hotpdf.HotPdf.extract_text(self, x0: int, y0: int, x1: int, y1: int, page: int = 0) → str

Extract text from a specified bounding box on a page.

Parameters:
  • x0 (int) – The left x-coordinate of the bounding box.

  • y0 (int) – The bottom y-coordinate of the bounding box.

  • x1 (int) – The right x-coordinate of the bounding box.

  • y1 (int) – The top y-coordinate of the bounding box.

  • page (int) – The page number. Defaults to 0.

Raises:
Returns:

Extracted text within the bounding box.

Return type:

str
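
When a page has several fixed regions of interest (for example a header and a total on a form), the same call can be repeated per region. The sketch below assumes the bounding boxes are already known. extract_regions and the region names are hypothetical and only illustrate the calling pattern.

```python
def extract_regions(
    document,
    regions: dict[str, tuple[int, int, int, int]],
    page: int = 0,
) -> dict[str, str]:
    """Map each region name to the text extract_text returns for its bounding box."""
    results: dict[str, str] = {}
    for name, (x0, y0, x1, y1) in regions.items():
        # Each region is an (x0, y0, x1, y1) tuple of integer coordinates.
        results[name] = document.extract_text(x0=x0, y0=y0, x1=x1, y1=y1, page=page)
    return results
```

For instance, extract_regions(hotpdf_document, {"header": (0, 0, 100, 10)}) would return a dict with a single "header" entry.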

extract_spans

If you want whole words, or the complete spans that intersect the specified bounds, rather than only the individual characters that lie within them, use the extract_spans function. It extracts every span that intersects the specified bounding box on the specified page (default: page 0).

spans_in_bbox = hotpdf_document.extract_spans(
   x0=0,
   y0=0,
   x1=100,
   y1=10,
   page=0,
)

hotpdf.HotPdf.extract_spans(self, x0: int, y0: int, x1: int, y1: int, page: int = 0, sort: bool = True) → list[Span]

Extract spans that intersect with the given bounding box.

Parameters:
  • x0 (int) – The left x-coordinate of the bounding box.

  • y0 (int) – The bottom y-coordinate of the bounding box.

  • x1 (int) – The right x-coordinate of the bounding box.

  • y1 (int) – The top y-coordinate of the bounding box.

  • page (int, optional) – The page number. Defaults to 0.

  • sort (bool, optional) – Sort the spans by their coordinates. Defaults to True.

Raises:
Returns:

List of spans of hotcharacters that intersect with the given bounding box

Return type:

list[Span]
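
The difference between text that lies within the bounding box and spans that merely intersect it can be pictured with two rectangle predicates. This is an illustration of the geometric idea only, not hotpdf's implementation. It assumes axis-aligned boxes given as (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1, and it treats touching edges as intersecting, which hotpdf may or may not do.

```python
Box = tuple[int, int, int, int]  # (x0, y0, x1, y1)


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely inside `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def intersects(a: Box, b: Box) -> bool:
    """True if the two boxes overlap or touch on both axes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

Under these assumptions, a span that starts inside the query box but runs past its right edge intersects the box without being contained in it, which is the case where a span-based query returns more text than a strictly bounded one.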

extract_spans_text

If you are only interested in the text of the spans and do not want to handle the Span structures yourself, use the extract_spans_text function.

The function takes the same bounding box and page arguments as extract_spans, but returns the text as a str.

spans_text_in_bbox = hotpdf_document.extract_spans_text(
   x0=0,
   y0=0,
   x1=100,
   y1=10,
   page=0,
)

hotpdf.HotPdf.extract_spans_text(self, x0: int, y0: int, x1: int, y1: int, page: int = 0) → str

Extract text from spans that intersect with the given bounding box.

Parameters:
  • x0 (int) – The left x-coordinate of the bounding box.

  • y0 (int) – The bottom y-coordinate of the bounding box.

  • x1 (int) – The right x-coordinate of the bounding box.

  • y1 (int) – The top y-coordinate of the bounding box.

  • page (int, optional) – The page number. Defaults to 0.

Raises:
Returns:

Extracted text that intersects with the bounding box.

Return type:

str

extract_page_text

To get the text of an entire page as a plain str, use the extract_page_text function.

The function takes the page number as its only parameter.

page_text = hotpdf_document.extract_page_text(page=0)

hotpdf.HotPdf.extract_page_text(self, page: int) → str

Extract text from a specified page.

Parameters:

page (int) – The page number.

Raises:

ValueError – If the page number is invalid.

Returns:

Extracted text from the page.

Return type:

str
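
Since extract_page_text raises ValueError for an invalid page number, callers that cannot guarantee the page exists may want to handle that case explicitly. The wrapper below is one possible sketch. page_text_or_default is a hypothetical helper, not part of hotpdf.

```python
def page_text_or_default(document, page: int, default: str = "") -> str:
    """Return the page text, or `default` when the page number is invalid."""
    try:
        return document.extract_page_text(page=page)
    except ValueError:
        # Documented failure mode: the page number is invalid.
        return default
```

Only ValueError is caught here so that any other error still surfaces to the caller.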