Python's Library Pdfreader For PDF Extraction Won't Iterate Through Pages
I want to extract text from a PDF file with Python's library called pdfreader. I followed the instructions here:
Solution 1:
Solving this issue required a lot of documentation reading for the Python module pdfreader. I was shocked at the level of difficulty in using this module for simple text extraction. It took hours to figure out a working solution.
The code below will enumerate the text on individual pages. You will still need to do some text cleaning to get your desired output.
I noticed that one of your PDFs has a font-encoding problem during parsing, which triggers a warning message.
import requests
from io import BytesIO
from pdfreader import SimplePDFViewer

pdf_links = [
    # add the URLs of your PDF files here
]

for pdf_link in pdf_links:
    response = requests.get(pdf_link, stream=True)

    # extract text page by page
    with BytesIO(response.content) as data:
        viewer = SimplePDFViewer(data)
        all_pages = [p for p in viewer.doc.pages()]
        number_of_pages = len(all_pages)
        for page_number in range(1, number_of_pages + 1):
            # go to the page and render it so that viewer.canvas holds its content
            viewer.navigate(page_number)
            viewer.render()
            # join the page's strings; a run of spaces is treated as a paragraph break
            page_strings = " ".join(viewer.canvas.strings).replace('     ', '\n\n').strip()
            print(f'Current Page Number: {page_number}')
            print(f'Page Text: {page_strings}')
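For the text cleaning mentioned above, something like the following might be a starting point. It is only a rough sketch: `clean_page_text` is a hypothetical helper name, not part of pdfreader, and it assumes that blank lines mark paragraph breaks and that all other runs of whitespace can be collapsed to a single space. Your PDFs may need different rules.

```python
import re


def clean_page_text(raw: str) -> str:
    """Normalize whitespace in the text extracted from one PDF page.

    Hypothetical helper, not part of pdfreader.
    """
    # Split into paragraphs wherever there is at least one blank line.
    paragraphs = re.split(r'\n\s*\n', raw)
    cleaned = []
    for paragraph in paragraphs:
        # Collapse spaces, tabs and single newlines into one space.
        text = re.sub(r'\s+', ' ', paragraph).strip()
        if text:  # drop paragraphs that were only whitespace
            cleaned.append(text)
    # Rejoin the paragraphs with exactly one blank line between them.
    return '\n\n'.join(cleaned)
```

Inside the page loop you could then call `page_strings = clean_page_text(page_strings)` before printing.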