Usage
HotPdf Class
- class hotpdf.HotPdf(pdf_file: PurePath | str | IOBase | None = None, password: str = '', page_numbers: list[int] | None = None, extraction_tolerance: int = 4, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False)
- hotpdf.HotPdf.__init__(self, pdf_file: PurePath | str | IOBase | None = None, password: str = '', page_numbers: list[int] | None = None, extraction_tolerance: int = 4, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False) -> None
Initialize the HotPdf class.
- Parameters:
pdf_file (PurePath | str | IOBase, optional) – The path to the PDF file to be loaded, or an open binary file stream.
password (str, optional) – Password used to unlock the PDF.
page_numbers (list[int], optional) – Pages to load into memory (0-indexed). If not provided, all pages are loaded (default).
extraction_tolerance (int, optional) – Tolerance value used during text extraction to adjust the bounding box for capturing text. Defaults to 4.
laparams (dict[str, Union[float, bool]], optional) – Layout parameters for pdfminer.
include_annotation_spaces (bool, optional) – Add annotation spaces to the memory map. Default: False
preserve_pdfminer_coordinates (bool, optional) – Preserve pdfminer y-coordinate values. Default: False (use natural coordinates).
- Raises:
ValueError – If the page range is invalid.
FileNotFoundError – If the file is not found.
PermissionError – If the file is encrypted or the password is wrong.
RuntimeError – If an unknown error is generated by the transformer.
The HotPdf class is a wrapper around your PDF that lets you search for and extract text.
from hotpdf import HotPdf
pdf_file_path = "path to your pdf file"
# Load directly from Path
hotpdf_document = HotPdf(pdf_file_path)
# Load from file stream
with open(pdf_file_path, "rb") as f:
    hotpdf_document_2 = HotPdf(f)
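The exceptions listed under Raises can be handled separately when a document is opened. The sketch below uses a hypothetical helper, `open_document`, which is not part of hotpdf. It takes the constructor as an argument (`factory`, for example `HotPdf`) so that it can be tried out without a real PDF; the returned reason strings are illustrative only.

```python
from typing import Any, Callable, Optional, Tuple


def open_document(
    factory: Callable[..., Any],
    pdf_file: Any,
    password: str = "",
) -> Tuple[Optional[Any], Optional[str]]:
    """Return (document, None) on success or (None, reason) on failure.

    `factory` is expected to be a callable such as `HotPdf`; this helper
    is hypothetical and only illustrates the documented exceptions.
    """
    try:
        return factory(pdf_file, password=password), None
    except FileNotFoundError:
        return None, "file not found"
    except PermissionError:
        return None, "encrypted or wrong password"
    except ValueError:
        return None, "invalid page range"
    except RuntimeError:
        return None, "unknown extraction error"
```

With hotpdf installed, a call might look like `document, error = open_document(HotPdf, pdf_file_path)`.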
Alternatively, you can defer loading and call the .load() function instead. The outcome is the same: internally, the HotPdf constructor calls .load().
from hotpdf import HotPdf
pdf_file_path = "path to your pdf file"
# path
hotpdf_document = HotPdf()
# load() returns None, so its result is not assigned back
hotpdf_document.load(pdf_file_path)
# file stream
hotpdf_document_2 = HotPdf()
with open(pdf_file_path, "rb") as f:
    hotpdf_document_2.load(f)
- hotpdf.HotPdf.load(self, pdf_file: PurePath | str | IOBase, password: str = '', page_numbers: list[int] | None = None, laparams: dict[str, float | bool] | None = None, include_annotation_spaces: bool = False, preserve_pdfminer_coordinates: bool = False) -> None
Load a PDF file into memory.
- Parameters:
pdf_file (PurePath | str | IOBase) – The path to the PDF file to be loaded, or an open binary file stream.
password (str, optional) – Password used to unlock the PDF.
page_numbers (list[int], optional) – Pages to load into memory (0-indexed). If not provided, all pages are loaded (default).
laparams (dict[str, Union[float, bool]], optional) – Layout parameters for pdfminer.
include_annotation_spaces (bool, optional) – Add annotation spaces to the memory map.
preserve_pdfminer_coordinates (bool, optional) – Preserve pdfminer y-coordinate values. Default: False (use natural coordinates).
- Raises:
Exception – If an unknown error is generated by pdfminer.
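Because page_numbers is 0-indexed, page numbers as a PDF viewer shows them (starting at 1) need to be shifted before they are passed to the constructor or to load(). A minimal sketch, using a hypothetical helper `to_zero_indexed` that is not part of hotpdf:

```python
def to_zero_indexed(first_page: int, last_page: int) -> list[int]:
    """Convert an inclusive 1-indexed page range to 0-indexed page numbers."""
    if first_page < 1 or last_page < first_page:
        raise ValueError(f"invalid page range: {first_page}-{last_page}")
    return list(range(first_page - 1, last_page))


# Pages 2 to 4 as a viewer numbers them become indices 1, 2 and 3.
print(to_zero_indexed(2, 4))  # [1, 2, 3]
```

The resulting list could then be passed as `page_numbers=to_zero_indexed(2, 4)`.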
You can also merge multiple HotPdf objects into a single HotPdf object.
merged_hotpdf_object = HotPdf.merge_multiple(hotpdfs=[
    hotpdf_document,
    hotpdf_document_2,
])
File Operations
Length
The number of pages in the PDF file is the len of the pages property of the HotPdf object.
num_pages = len(hotpdf_document.pages)
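Several methods on this page raise ValueError for an invalid page number, so it can help to check an index against the page count first. The helper below is a hypothetical sketch that relies only on the pages property; it assumes that valid indices run from 0 to len(pages) - 1.

```python
def is_valid_page(document, page: int) -> bool:
    """Return True if `page` is a valid 0-indexed page of `document`.

    Assumes `document.pages` supports len() and that valid page indices
    run from 0 up to len(document.pages) - 1.
    """
    return 0 <= page < len(document.pages)
```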
Search
find_text
To look for a string anywhere in the PDF file, use the find_text function. By default it searches the whole PDF; pass pages to restrict the search to specific pages. To get the whole span in which the string lies, set take_span to True.
text_occurrences = hotpdf_document.find_text("foo")
text_occurrences_with_span = hotpdf_document.find_text(
    "foo",
    take_span=True,
)
- hotpdf.HotPdf.find_text(self, query: str, pages: list[int] | None = None, take_span: bool = False, sort: bool = True, case_sensitive: bool = True) -> defaultdict[int, list[list[HotCharacter]]]
Find text within the loaded PDF pages.
- Parameters:
query (str) – The text to search for.
pages (list[int], optional) – List of page numbers to search.
take_span (bool, optional) – Return the full span that the matched text is part of.
sort (bool, optional) – Return elements sorted by their positions.
case_sensitive (bool, optional) – Whether the search should be case-sensitive. Defaults to True.
- Raises:
ValueError – If the page number is invalid.
- Returns:
A dictionary mapping page numbers to found text coordinates.
- Return type:
SearchResult
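The return value maps each page number to a list of matches, and each match is a list of HotCharacter objects. A small sketch of counting matches per page follows. `count_matches` is a hypothetical helper that relies only on that mapping shape, so a plain dictionary with strings in place of HotCharacter objects stands in for a real result.

```python
from collections import defaultdict
from typing import Any, Mapping, Sequence


def count_matches(result: Mapping[int, Sequence[Sequence[Any]]]) -> dict[int, int]:
    """Map each page number to its number of matches, skipping empty pages."""
    return {page: len(matches) for page, matches in sorted(result.items()) if matches}


# Stand-in for hotpdf_document.find_text("foo"): page 0 has two matches of
# three characters each, page 2 has none.
example = defaultdict(list, {0: [["f", "o", "o"], ["f", "o", "o"]], 2: []})
print(count_matches(example))  # {0: 2}
```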
Extraction
extract_text
To extract the text at a specific position in the PDF, use the extract_text function. It returns the text that lies within the given bounding box on the given page (page 0 by default).
text_in_bbox = hotpdf_document.extract_text(
    x0=0,
    y0=0,
    x1=100,
    y1=10,
    page=0,
)
- hotpdf.HotPdf.extract_text(self, x0: int, y0: int, x1: int, y1: int, page: int = 0) -> str
Extract text from a specified bounding box on a page.
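- Parameters:
x0 (int) – The left x-coordinate of the bounding box.
y0 (int) – The bottom y-coordinate of the bounding box.
x1 (int) – The right x-coordinate of the bounding box.
y1 (int) – The top y-coordinate of the bounding box.
page (int, optional) – The page number. Defaults to 0.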
- Raises:
ValueError – If the coordinates are invalid.
ValueError – If the page number is invalid.
- Returns:
Extracted text within the bounding box.
- Return type:
str
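The coordinates passed to extract_text often come from an earlier find_text call. The sketch below computes a bounding box around one match. It assumes that each character exposes `x`, `x_end` and `y` attributes; those names are an assumption that this page does not confirm, so the `Char` dataclass is a stand-in rather than hotpdf's HotCharacter, and `bounding_box` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Char:
    """Stand-in for a positioned character (attribute names are assumed)."""
    value: str
    x: int
    x_end: int
    y: int


def bounding_box(match: Sequence[Char], padding: int = 0) -> dict[str, int]:
    """Return x0/y0/x1/y1 values that enclose every character in `match`."""
    if not match:
        raise ValueError("match is empty")
    return {
        "x0": min(c.x for c in match) - padding,
        "y0": min(c.y for c in match) - padding,
        "x1": max(c.x_end for c in match) + padding,
        "y1": max(c.y for c in match) + padding,
    }


match = [Char("f", 10, 15, 40), Char("o", 15, 20, 40), Char("o", 20, 25, 40)]
print(bounding_box(match, padding=1))
# {'x0': 9, 'y0': 39, 'x1': 26, 'y1': 41}
```

The resulting dictionary could then be unpacked into a call such as `hotpdf_document.extract_text(**box, page=0)`.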
extract_spans
If you want full words, or the complete spans that intersect the specified bounds, rather than only the individual characters that lie within them, use the extract_spans function instead. It extracts every span that intersects the given bounding box on the given page (page 0 by default).
spans_in_bbox = hotpdf_document.extract_spans(
    x0=0,
    y0=0,
    x1=100,
    y1=10,
    page=0,
)
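The difference between extract_text and extract_spans comes down to containment versus intersection. The following is a geometric illustration only, not hotpdf's implementation; `Box`, `contains` and `intersects` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: int
    y0: int
    x1: int
    y1: int


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely inside `outer`."""
    return (outer.x0 <= inner.x0 and inner.x1 <= outer.x1
            and outer.y0 <= inner.y0 and inner.y1 <= outer.y1)


def intersects(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one point."""
    return a.x0 <= b.x1 and b.x0 <= a.x1 and a.y0 <= b.y1 and b.y0 <= a.y1


query = Box(0, 0, 100, 10)
span = Box(80, 2, 160, 8)  # starts inside the query box and runs past it
print(contains(query, span))    # False
print(intersects(query, span))  # True
```

A span like this one is only partly inside the query box, yet it still intersects it, which is the condition extract_spans is described as using.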
- hotpdf.HotPdf.extract_spans(self, x0: int, y0: int, x1: int, y1: int, page: int = 0, sort: bool = True) -> list[Span]
Extract spans that intersect with the given bounding box.
- Parameters:
x0 (int) – The left x-coordinate of the bounding box.
y0 (int) – The bottom y-coordinate of the bounding box.
x1 (int) – The right x-coordinate of the bounding box.
y1 (int) – The top y-coordinate of the bounding box.
page (int, optional) – The page number. Defaults to 0.
sort (bool, optional) – Sort the spans by their coordinates. Defaults to True.
- Raises:
ValueError – If the coordinates are invalid.
ValueError – If the page number is invalid.
- Returns:
List of spans of hotcharacters that intersect with the given bounding box
- Return type:
list[Span]
extract_spans_text
If you are only interested in the text of the spans and do not want to handle the span structures yourself, use the extract_spans_text function instead.
The function is the same as extract_spans, except that it returns a list of str.
spans_text_in_bbox = hotpdf_document.extract_spans_text(
    x0=0,
    y0=0,
    x1=100,
    y1=10,
    page=0,
)
- hotpdf.HotPdf.extract_spans_text(self, x0: int, y0: int, x1: int, y1: int, page: int = 0) -> str
Extract text from spans that intersect with the given bounding box.
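- Parameters:
x0 (int) – The left x-coordinate of the bounding box.
y0 (int) – The bottom y-coordinate of the bounding box.
x1 (int) – The right x-coordinate of the bounding box.
y1 (int) – The top y-coordinate of the bounding box.
page (int, optional) – The page number. Defaults to 0.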
- Raises:
ValueError – If the coordinates are invalid.
ValueError – If the page number is invalid.
- Returns:
Extracted text that intersects with the bounding box.
- Return type:
extract_page_text
To get the text of an entire page as a plain str, use the extract_page_text function.
The function accepts page as a parameter.
page_text = hotpdf_document.extract_page_text(page=0)
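To collect the text of the whole document, the per-page call can be combined with the page count from the Length section. A minimal sketch, assuming page indices run from 0 to len(pages) - 1; `extract_all_text` is a hypothetical helper, not a hotpdf function.

```python
def extract_all_text(document, separator: str = "\n") -> str:
    """Join the plain text of every page of `document`, in page order.

    Uses only `document.pages` and `document.extract_page_text(page=...)`.
    """
    return separator.join(
        document.extract_page_text(page=index)
        for index in range(len(document.pages))
    )
```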