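How to fix PDF scraping failing to load text with PyPDF2
I am trying to extract all the text from a list of PDFs, but I get an error when extracting text from the page object. Any idea what is causing this?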
import os
import PyPDF2

# resumes is the path of the folder that holds the PDFs (defined elsewhere)
ls = os.listdir(resumes)
pdf = [s for s in ls if '.pdf' in s]
print(pdf)
for p in pdf:
    pdfFileObj = open(os.path.join(resumes, p), 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())
    pdfFileObj.close()
Error:
File "C:\Program Files\Python39\lib\encodings\cp1252.py",line 19,in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0141' in position 305: character maps to <undefined>
Solution
Try this with pdfplumber:
import os
import pdfplumber

resumes = "C:\\path\\to\\resumes\\"
pdf_files = [s for s in os.listdir(resumes) if s.lower().endswith('.pdf')]
for pdf_file in pdf_files:
    pdf_path = os.path.join(resumes, pdf_file)
    with pdfplumber.open(pdf_path) as pdf:  # the file is closed when the block ends
        print(len(pdf.pages))
        alltext = ""  # start from an empty string for each PDF
        for page in pdf.pages:  # extract text from every page of the document
            text = page.extract_text()
            if text is None:  # guard against a page that yields no text
                continue
            alltext += text
    print(alltext)  # print all the text of this PDF
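One more thing worth checking: the traceback in the question ends inside cp1252.py's encode, which suggests the exception is raised when print() encodes the text for the Windows console, not inside the PDF library itself. If that is the case, switching libraries alone may not make the error go away. Below is a minimal sketch under that assumption; the helper names printable and use_utf8_stdout are made up for illustration and are not part of PyPDF2 or pdfplumber.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding cannot
    represent replaced by '?', so that print() will not raise."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


def use_utf8_stdout():
    """Switch stdout to UTF-8 where supported (Python 3.7+).
    Returns True if the stream could be reconfigured."""
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if reconfigure is None:
        return False
    reconfigure(encoding="utf-8", errors="replace")
    return True


# '\u0141' is the character named in the traceback; cp1252 cannot encode it.
sample = "\u0141ukasz"
print(printable(sample, "cp1252"))
```

In the extraction loop you could either call use_utf8_stdout() once at the start of the script, or wrap the output as print(printable(alltext)). Writing the text to a file opened with encoding="utf-8" is another way to avoid the console encoding altogether.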