Usage

HotPdf Class

class hotpdf.HotPdf(pdf_file: PurePath | str | IOBase | None = None, password: str = '', page_numbers: list[int] | None = None, extraction_tolerance: int = 4, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False)

hotpdf.HotPdf.__init__(self, pdf_file: PurePath | str | IOBase | None = None, password: str = '', page_numbers: list[int] | None = None, extraction_tolerance: int = 4, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False) → None

Initialize the HotPdf class.

Parameters:

pdf_file (PurePath | str | IOBase, optional) – The path to the PDF file to be loaded, or a binary file object (for example one opened with open(path, "rb")).
password (str, optional) – Password used to unlock the PDF.
page_numbers (list[int], optional) – Pages to be loaded into memory. (0-indexed). If not provided, will load all pages (default).
extraction_tolerance (int, optional) – Tolerance value used during text extraction to adjust the bounding box for capturing text. Defaults to 4.
laparams (dict[str, Union[float, bool]], optional) – Layout parameters for pdfminer.
include_annotation_spaces (bool, optional) – Add annotation spaces to the memory map. Default: False
preserve_pdfminer_coordinates (bool, optional) – Preserve pdfminer y-coordinate values. Defaults to False, which uses natural coordinates.

Raises:

ValueError – If the page range is invalid.
FileNotFoundError – If the file is not found.
PermissionError – If the file is encrypted or the password is wrong.
RuntimeError – If an unknown error is generated by the transformer.
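
The exceptions listed above can be handled one by one when a document is opened. The following is a minimal sketch and not part of hotpdf: load_or_report is a hypothetical helper that accepts any loader callable (HotPdf itself would be one), so the error handling can be read and tried without a PDF at hand. The messages simply restate the conditions documented above.

```python
from typing import Any, Callable, Optional, Tuple


def load_or_report(
    loader: Callable[[Any], Any], source: Any
) -> Tuple[Optional[Any], Optional[str]]:
    """Call loader(source) and turn the documented exceptions into messages.

    Hypothetical helper for illustration. Returns (document, None) on
    success and (None, message) on failure.
    """
    try:
        return loader(source), None
    except FileNotFoundError:
        return None, "file not found"
    except PermissionError:
        return None, "file is encrypted or the password is wrong"
    except ValueError:
        return None, "invalid page range"
    except RuntimeError:
        return None, "unexpected error while processing the PDF"


# Intended use, assuming hotpdf is installed:
#   from hotpdf import HotPdf
#   document, error = load_or_report(HotPdf, "path to your pdf file")
```

Returning a message instead of re-raising is only one possible design; in a larger program it may be preferable to let the exceptions propagate.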

The HotPdf class wraps your PDF and lets you search for and extract text from it.

from hotpdf import HotPdf
pdf_file_path = "path to your pdf file"

# Load directly from Path
hotpdf_document = HotPdf(pdf_file_path)

# Load from file stream
with open(pdf_file_path, "rb") as f:
   hotpdf_document_2 = HotPdf(f)

Alternatively, you can defer loading and call the .load() function later. The outcome is the same, because the HotPdf constructor calls .load() internally.

from hotpdf import HotPdf
pdf_file_path = "path to your pdf file"

# Load from a path. load() returns None, so do not reassign its result.
hotpdf_document = HotPdf()
hotpdf_document.load(pdf_file_path)

# Load from a file stream
hotpdf_document_2 = HotPdf()
with open(pdf_file_path, "rb") as f:
   hotpdf_document_2.load(f)

hotpdf.HotPdf.load(self, pdf_file: PurePath | str | IOBase, password: str = '', page_numbers: list[int] | None = None, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False) → None

Load a PDF file into memory.

Parameters:

pdf_file (PurePath | str | IOBase) – The path to the PDF file to be loaded, or a binary file object (for example one opened with open(path, "rb")).
password (str, optional) – Password used to unlock the PDF.
page_numbers (list[int], optional) – Pages to be loaded into memory. (0-indexed). If not provided, will load all pages (default).
laparams (dict[str, Union[float, bool]], optional) – Layout parameters for pdfminer.
include_annotation_spaces (bool, optional) – Add annotation spaces to the memory map.
preserve_pdfminer_coordinates (bool, optional) – Preserve pdfminer y-coordinate values. Defaults to False, which uses natural coordinates.

Raises:

Exception – If an unknown error is generated by pdfminer.
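
Because page_numbers is 0-indexed, page ranges read off a PDF viewer (which usually counts from 1) need to be shifted before they are passed in. The sketch below shows one way to do that; zero_indexed_pages is a hypothetical helper and not part of hotpdf.

```python
from typing import List


def zero_indexed_pages(first_page: int, last_page: int) -> List[int]:
    """Convert an inclusive 1-based page range into 0-indexed page numbers."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return list(range(first_page - 1, last_page))


# Pages 2 to 4 as counted in a viewer are indices 1, 2 and 3.
pages = zero_indexed_pages(2, 4)

# Intended use, assuming hotpdf is installed:
#   hotpdf_document = HotPdf()
#   hotpdf_document.load("path to your pdf file", page_numbers=pages)
```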

You can also merge multiple HotPdf objects into a single HotPdf object.

merged_hotpdf_object = HotPdf.merge_multiple(hotpdfs=[
    hotpdf_document,
    hotpdf_document_2,
])

hotpdf.HotPdf.merge_multiple(hotpdfs: list[HotPdf]) → HotPdf

Merge multiple HotPdf objects and return a single HotPdf object consisting of all pages

Parameters:

hotpdfs (list[HotPdf]) – List of HotPdf objects that will be combined to form one single HotPdf object. All other params will be ignored in this case.

Raises:

HotPdfIsNoneError – If any of the HotPdf objects in the hotpdfs list is None.

Returns:

Merged HotPdf object

Return type:

HotPdf
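
Since merge_multiple raises HotPdfIsNoneError when the list contains None, it can help to filter the list first when some documents are optional. A minimal sketch follows, where drop_missing is a hypothetical helper and not part of hotpdf.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def drop_missing(documents: Sequence[Optional[T]]) -> List[T]:
    """Return the documents that are not None, preserving their order."""
    return [document for document in documents if document is not None]


# Intended use, assuming hotpdf is installed and doc_a, doc_b are HotPdf objects:
#   merged = HotPdf.merge_multiple(hotpdfs=drop_missing([doc_a, None, doc_b]))
```

Dropping entries silently can hide a failed load, so this only suits cases where a missing document is expected.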

File Operations

Length

The number of pages in the PDF file is the length of the pages property of the HotPdf object.

num_pages = len(hotpdf_document.pages)

Search

find_text

To look for a string in the PDF file, use the find_text function. By default it searches the whole PDF; pass pages to restrict the search to specific pages. To get the whole span that contains the string, set take_span to True.

text_occurrences = hotpdf_document.find_text("foo")
text_occurrences_with_span = hotpdf_document.find_text(
   "foo",
   take_span=True,
)

hotpdf.HotPdf.find_text(self, query: str, pages: list[int] | None = None, take_span: bool = False, sort: bool = True, case_sensitive: bool = True) → defaultdict[int, list[list[HotCharacter]]]

Find text within the loaded PDF pages.

Parameters:

query (str) – The text to search for.
pages (list[int], optional) – List of page numbers to search.
take_span (bool, optional) – Take the full span of the text that it is a part of.
sort (bool, optional) – Return elements sorted by their positions.
case_sensitive (bool, optional) – Whether the search should be case-sensitive. Defaults to True.

Raises:

ValueError – If the page number is invalid.

Returns:

A dictionary mapping page numbers to found text coordinates.

Return type:

SearchResult
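
The result maps each page number to a list of matches, and each match is itself a list of HotCharacter objects. The sketch below counts matches per page using only that documented shape. occurrences_per_page is a hypothetical helper, and plain strings stand in for HotCharacter objects so the example runs without a PDF.

```python
from collections import defaultdict
from typing import Dict, Mapping, Sequence


def occurrences_per_page(
    result: Mapping[int, Sequence[Sequence[object]]]
) -> Dict[int, int]:
    """Count the matches on each page of a find_text-style result."""
    return {page: len(matches) for page, matches in sorted(result.items())}


# Stand-in for a find_text result: strings take the place of HotCharacter objects.
sample_result = defaultdict(list)
sample_result[0].append(["f", "o", "o"])
sample_result[2].append(["f", "o", "o"])
sample_result[2].append(["f", "o", "o"])

counts = occurrences_per_page(sample_result)

# Intended use, assuming hotpdf is installed:
#   counts = occurrences_per_page(hotpdf_document.find_text("foo"))
```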

Extraction

extract_text

To extract a string from a specific position in the PDF, use the extract_text function. It extracts the text that lies within the specified bounding box on the specified page (the default is page 0).

text_in_bbox = hotpdf_document.extract_text(
   x0=0,
   y0=0,
   x1=100,
   y1=10,
   page=0,
)

hotpdf.HotPdf.extract_text(self, x0: int, y0: int, x1: int, y1: int, page: int = 0) → str

Extract text from a specified bounding box on a page.

Parameters:

x0 (int) – The left x-coordinate of the bounding box.
y0 (int) – The bottom y-coordinate of the bounding box.
x1 (int) – The right x-coordinate of the bounding box.
y1 (int) – The top y-coordinate of the bounding box.
page (int) – The page number. Defaults to 0.

Raises:

ValueError – If the coordinates are invalid.
ValueError – If the page number is invalid.

Returns:

Extracted text within the bounding box.

Return type:

str
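
The coordinates are typed as int, and a ValueError is raised when they are invalid. This page does not spell out what counts as invalid, so the sketch below rests on an assumption: that swapped corners and negative values are the cases worth guarding against. normalize_bbox is a hypothetical helper, not part of hotpdf, that orders the corners and rounds outwards to integers before the box is passed on.

```python
import math
from typing import Tuple


def normalize_bbox(
    x0: float, y0: float, x1: float, y1: float
) -> Tuple[int, int, int, int]:
    """Return an integer box with x0 <= x1 and y0 <= y1 that covers the input."""
    left, right = sorted((x0, x1))
    low, high = sorted((y0, y1))
    if left < 0 or low < 0:
        raise ValueError("coordinates must not be negative")
    # Round outwards so the integer box never shrinks below the requested area.
    return math.floor(left), math.floor(low), math.ceil(right), math.ceil(high)


# Corners given in the wrong order and as floats.
box = normalize_bbox(100.2, 9.7, 0.5, 0.0)

# Intended use, assuming hotpdf is installed:
#   x0, y0, x1, y1 = box
#   text = hotpdf_document.extract_text(x0=x0, y0=y0, x1=x1, y1=y1, page=0)
```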

extract_spans

extract_text returns only the individual characters that lie within the bounds you specify. If you want full words, or the complete spans that intersect the specified bounds, use the extract_spans function instead. It extracts all spans that intersect the given bounding box on the specified page (the default is page 0).

spans_in_bbox = hotpdf_document.extract_spans(
   x0=0,
   y0=0,
   x1=100,
   y1=10,
   page=0,
)

hotpdf.HotPdf.extract_spans(self, x0: int, y0: int, x1: int, y1: int, page: int = 0, sort: bool = True) → list[Span]

Extract spans that intersect with the given bounding box.

Parameters:

x0 (int) – The left x-coordinate of the bounding box.
y0 (int) – The bottom y-coordinate of the bounding box.
x1 (int) – The right x-coordinate of the bounding box.
y1 (int) – The top y-coordinate of the bounding box.
page (int, optional) – The page number. Defaults to 0.
sort (bool, optional) – Sort the spans by their coordinates. Defaults to True.

Raises:

ValueError – If the coordinates are invalid.
ValueError – If the page number is invalid.

Returns:

List of spans of hotcharacters that intersect with the given bounding box

Return type:

list[Span]
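
The difference between keeping only the characters inside a box and keeping every span that intersects it can be seen in a one-dimensional toy model. This is an illustration of the idea only, not hotpdf's implementation: ToySpan, chars_inside and spans_intersecting are hypothetical names, and every character is assumed to be exactly one unit wide.

```python
from typing import List, NamedTuple


class ToySpan(NamedTuple):
    """A run of characters on one line; each character is one unit wide."""

    x0: int
    text: str

    @property
    def x1(self) -> int:
        return self.x0 + len(self.text)


def chars_inside(spans: List[ToySpan], x0: int, x1: int) -> str:
    """Keep only the characters whose cell lies fully inside [x0, x1]."""
    kept = []
    for span in spans:
        for offset, char in enumerate(span.text):
            position = span.x0 + offset
            if x0 <= position and position + 1 <= x1:
                kept.append(char)
    return "".join(kept)


def spans_intersecting(spans: List[ToySpan], x0: int, x1: int) -> List[str]:
    """Keep every span that overlaps [x0, x1], returned whole."""
    return [span.text for span in spans if span.x0 < x1 and span.x1 > x0]


line = [ToySpan(0, "Invoice"), ToySpan(10, "total"), ToySpan(20, "due")]
inside = chars_inside(line, 4, 12)        # "iceto": fragments of two words
whole = spans_intersecting(line, 4, 12)   # ["Invoice", "total"]: full spans
```

The window from 4 to 12 cuts through two words, so the character-level result is a pair of fragments while the span-level result returns both words whole.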

extract_spans_text

If you are only interested in the text of the spans and do not want to handle the span structures yourself, use the extract_spans_text function instead.

The function takes the same bounding-box arguments as extract_spans, but returns the text of the spans as a str.

spans_text_in_bbox = hotpdf_document.extract_spans_text(
   x0=0,
   y0=0,
   x1=100,
   y1=10,
   page=0,
)

hotpdf.HotPdf.extract_spans_text(self, x0: int, y0: int, x1: int, y1: int, page: int = 0) → str

Extract text from spans that intersect with the given bounding box.

Parameters:

x0 (int) – The left x-coordinate of the bounding box.
y0 (int) – The bottom y-coordinate of the bounding box.
x1 (int) – The right x-coordinate of the bounding box.
y1 (int) – The top y-coordinate of the bounding box.
page (int, optional) – The page number. Defaults to 0.

Raises:

ValueError – If the coordinates are invalid.
ValueError – If the page number is invalid.

Returns:

Extracted text that intersects with the bounding box.

Return type:

str

extract_page_text

If you want the text of an entire page as a plain str, use the extract_page_text function.

The function accepts page as a parameter.

page_text = hotpdf_document.extract_page_text(page=0)

hotpdf.HotPdf.extract_page_text(self, page: int, segment: bool = False) → str

Extract text from a specified page.

Parameters:

page (int) – The page number.
segment (bool) – Group text into layout blocks (recursive XY-cut) before reading, so side-by-side columns are not interleaved row by row. Best-effort; dense forms may over-segment. Defaults to False.

Raises:

ValueError – If the page number is invalid.

Returns:

Extracted text from the page.

Return type:

str
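
Combining extract_page_text with the length of the pages property gives the text of a whole document. The sketch below assumes that every page was loaded, so that valid page numbers run from 0 to len(pages) - 1. extract_document_text is a hypothetical helper and FakeDocument is a stand-in that exposes only the two members used, so the example runs without a PDF.

```python
from typing import List


def extract_document_text(document, separator: str = "\n\n") -> str:
    """Join the text of every loaded page, in page order."""
    page_texts: List[str] = [
        document.extract_page_text(page=page_index)
        for page_index in range(len(document.pages))
    ]
    return separator.join(page_texts)


class FakeDocument:
    """Stand-in exposing only the two members the helper relies on."""

    def __init__(self, texts: List[str]) -> None:
        self.pages = texts

    def extract_page_text(self, page: int, segment: bool = False) -> str:
        return self.pages[page]


full_text = extract_document_text(FakeDocument(["first page", "second page"]))

# Intended use, assuming hotpdf is installed:
#   full_text = extract_document_text(hotpdf_document)
```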
