PyPDF2 Ignores Content, Gets Watermark Only

December 11, 2023

I have thousands of PDF files like this one. I'm trying to use PyPDF2 to convert them to plain text (code is below). But PyPDF2 apparently only 'sees' the watermarks, not the conte

Solution 1:

Sometimes pdfminer3k gives better results. Please check out "How to read pdf file using pdfminer3k?"

I've tested the following code and it extracts text. However, the extraction is not 100% accurate...

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine

# Layout analysis parameters that control how characters are grouped
laparams = LAParams()
laparams.char_margin = 1.0
laparams.word_margin = 1.0

extracted_text = ''

# Open the example file
with open('Decisao_10166720039201098.pdf', 'rb') as fp:
    # Link the parser and the document to each other (pdfminer3k API)
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')  # empty password

    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Collect the text of every text box or text line on each page
    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                extracted_text += lt_obj.get_text()

print(extracted_text)
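
If some watermark text still ends up in the output, one possible cleanup is to drop lines that repeat on almost every page. The following is only a sketch under assumptions: it expects the text to be collected per page (a list of strings rather than one long string), and the function name drop_repeated_lines and the min_fraction threshold are hypothetical, not part of pdfminer or PyPDF2.

```python
from collections import Counter


def drop_repeated_lines(page_texts, min_fraction=0.8):
    """Remove lines that occur on at least min_fraction of the pages.

    Hypothetical helper: assumes a watermark shows up as the same line
    of text on (almost) every page. page_texts is a list of strings,
    one per page.
    """
    if len(page_texts) < 2:
        # With a single page there is nothing to compare against.
        return list(page_texts)

    counts = Counter()
    for text in page_texts:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in text.splitlines() if line.strip()})

    threshold = min_fraction * len(page_texts)
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for text in page_texts:
        kept = [line for line in text.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned
```

This heuristic can also remove legitimate headers or footers that repeat on every page, so the threshold is worth checking against a few sample files first.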

Solution 2:

PyPDF2 is fine for ordinary PDF files, but I recommend using pdfminer instead. Install it with pip install pdfminer.six, then run the following:

from pdfminer import high_level  # high-level text extraction

local_pdf_filename = "path/to/your/pdf/file.pdf"
pages = [0]  # zero-based indices of the pages to extract

extracted_text = high_level.extract_text(local_pdf_filename, password="", page_numbers=pages)
print(extracted_text)
input('\n\nExtraction Complete')
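
Since the question is about thousands of files, the same call can be wrapped in a loop. This is a minimal sketch, not a tested pipeline: the function name convert_folder, its parameters, and the one-.txt-per-PDF layout are assumptions made for illustration. The extractor is passed in as an argument so the loop can be tried without pdfminer installed; by default it falls back to extract_text from pdfminer.six.

```python
from pathlib import Path


def convert_folder(pdf_dir, out_dir, extract=None):
    """Write one .txt file per .pdf in pdf_dir; return the failures.

    Hypothetical helper. extract is any callable that takes a file
    path and returns the text; it defaults to pdfminer.six.
    """
    if extract is None:
        # Imported lazily; requires `pip install pdfminer.six`.
        from pdfminer.high_level import extract_text as extract

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    failed = []
    for pdf_file in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            text = extract(str(pdf_file))
        except Exception as exc:
            # Record the problem and keep going with the other files.
            failed.append((pdf_file.name, str(exc)))
            continue
        (out_path / (pdf_file.stem + ".txt")).write_text(text, encoding="utf-8")
    return failed
```

Collecting failures instead of stopping at the first bad file makes it easier to rerun only the PDFs that could not be read.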

