========= Usage ========= HotPdf Class ------------------------------------------ .. autoclass:: hotpdf.HotPdf .. autofunction:: hotpdf.HotPdf.__init__ The HotPdf class is the wrapper around your PDF that allows for searching text and extracting text on your PDFs. .. code-block:: python from hotpdf import HotPdf pdf_file_path = "path to your pdf file" # Load directly from Path hotpdf_document = HotPdf(pdf_file_path) # Load from file stream with open(pdf_file_path, "rb") as f: hotpdf_document_2 = HotPdf(f) Alternatively you can defer loading, and use the `.load()` function instead. The outcome is the same, internally the constructor for `HotPdf` calls the `.load()` function .. code-block:: python from hotpdf import HotPdf pdf_file_path = "path to your pdf file" # path hotpdf_document = HotPdf() hotpdf_document = hotpdf_document.load(pdf_file_path) # file stream hotpdf_document_2 = HotPdf() with open(pdf_file_path, "rb") as f: hotpdf_document_2 = hotpdf_document_2.load(f) .. autofunction:: hotpdf.HotPdf.load You can also merge multiple HotPdf objects to get one single HotPdf object! .. code-block:: python merged_hotpdf_object = HotPdf.merge_multiple(hotpdfs=[ hotpdf_document, hotpdf_document2, ]) .. autofunction:: hotpdf.HotPdf.merge_multiple File Operations ------------------------------------------ Length ~~~~~~~~~~~~~~~~~~~ The number of pages in the PDF file can be determined by checking the `len` of `pages` property of the hotpdf object. .. code-block:: python num_pages = len(hotpdf_document.pages) Search ------------------------------------------ find_text ~~~~~~~~~~~~~~~~~~~ To look for a string in the entire PDF File, you can use the `find_text` function. You can also specify what pages you want to search in. By default it will look through the whole PDF. To get the whole span where the string lies in, you can set `take_span` to True. .. code-block:: python text_occurences = hotpdf_document.find_text("foo") text_occurences_with_span = hotpdf_document.find_text( "foo", take_span=True, ) .. autofunction:: hotpdf.HotPdf.find_text Extraction ------------------------------------------ extract_text ~~~~~~~~~~~~~~~~~~~ To extract string from specific positions in the PDF, you can use the `extract_text` function. This will extract the string that lies within the positions that have been specified on the page that it's specified (default is Page 0). .. code-block:: python text_in_bbox = hotpdf_document.extract_text( x0=0, y0=0, x1=100, y1=10, page=0, ) .. autofunction:: hotpdf.HotPdf.extract_text extract_spans ~~~~~~~~~~~~~~~~~~~ Instead of just the individual characters that lay within the bounds that you specify, if you want full words, or the complete spans that intersect within the specified bounds - you can use the `extract_spans` functions instead. This will extract all the spans that intersect with the positions that have been specified on the page that it's specified (default is Page 0). .. code-block:: python spans_in_bbox = hotpdf_document.extract_spans( x0=0, y0=0, x1=100, y1=10, page=0, ) .. autofunction:: hotpdf.HotPdf.extract_spans extract_spans_text ~~~~~~~~~~~~~~~~~~~ Instead of handling the spans structures yourself, if you are only interested in the text of the spans, you can use the `extract_spans_text` function instead. The function is the same as `extract_spans`_ except it returns you a `list` of `str`. .. code-block:: python spans_text_in_bbox = hotpdf_document.extract_spans_text( x0=0, y0=0, x1=100, y1=10, page=0, ) .. autofunction:: hotpdf.HotPdf.extract_spans_text extract_page_text ~~~~~~~~~~~~~~~~~~~ If you want to view the text of an entire page in plaintext `str` format, you can use the `extract_page_text` function. The function accepts `page` as a parameter. .. code-block:: python page_text = hotpdf_document.extract_page_text(page=0,) .. autofunction:: hotpdf.HotPdf.extract_page_text