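Encoding problem when extracting text with PyPDF2

How to fix an encoding problem when extracting text with PyPDF2

I am using PyPDF2 to extract text from a PDF file. It works, but accented characters are not recognized.

Here is my code: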

import PyPDF2

filename = 'document.pdf'

# Open the file in binary mode so PyPDF2 can read it
pdfFileObj = open(filename, 'rb')

# The pdfReader variable is a readable object that will be parsed
# (PdfFileReader, numPages, getPage and extractText are the legacy PyPDF2 names)
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Knowing the number of pages lets us loop over all of them
num_pages = pdfReader.numPages

count = 0
text = ""

# The while loop reads each page and appends its text
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

# Close the file once every page has been read
pdfFileObj.close()

Here is my result:

82 %G’nes dues au bruitEurop”ens expos”s ‹ des seuils 
au del‹ de 55 dB.125 MDes habitants dÕIle de France expos”s ‹ des 
valeurs sup”rieures recommand”s par lÕOMS.
90 %Des fran“ais se disent pr”occup”s par 
les questions relatives au bruit.
82 %Personnes d”clarent ’tre g’n”s par des 
nuisances sonores ‹ leur domicile.
45 %Les effets du bruit caus”s chaque ann”es
Les effets du bruit caus”s chaque ann”es
Personnes g’n”es par le bruit.

This is what the PDF looks like:

Answer

I don't think the problem is your code. I think it is the PDF.

I checked with a sample PDF file: https://www.languagebird.com/wp-content/uploads/2019/10/sample_French_Basics_Grammar_Book-2017-3.pdf

#! /usr/bin/env python
# -*- coding: utf-8 -*-

import PyPDF2

# Note you don't need to manually open a file object
# You can pass a string path to the file instead
pdfReader = PyPDF2.PdfFileReader('sample_French_Basics_Grammar_Book-2017-3.pdf')

text = ""

# Better to loop through the pages using the iterator
# rather than a manual count
for current_page in pdfReader.pages:
    text += current_page.extractText()

# Output the results; write as UTF-8 so accented characters survive on any platform
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)
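One caveat on versions: PdfFileReader and extractText() are the older PyPDF2 names. PyPDF2 3.0 and its successor package pypdf removed them in favour of PdfReader and extract_text(). Here is a minimal sketch of the same loop for those versions. The helper names join_page_text and extract_pdf_text are made up for this example, and the import sits inside the function only so the first helper can be tried without the library installed.

```python
def join_page_text(pages):
    """Concatenate the text of every page object that exposes extract_text()."""
    # "or ''" guards against a page that yields no text at all
    return "".join(page.extract_text() or "" for page in pages)


def extract_pdf_text(path):
    """Read a PDF with the newer PyPDF2 API (assumes PyPDF2 2.x or later)."""
    from PyPDF2 import PdfReader  # with pypdf: from pypdf import PdfReader

    reader = PdfReader(path)
    return join_page_text(reader.pages)


if __name__ == "__main__":
    import sys

    # Usage: python extract.py some_file.pdf
    if len(sys.argv) > 1:
        print(extract_pdf_text(sys.argv[1]))
```

Whichever API you use, the extracted characters should be the same, so this will not by itself repair a PDF whose text layer is wrong.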

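You will notice from the contents of 'output.txt' that the accented characters come out correctly. The only text errors are smart quotes that sit at the wrong code points, and those have to be handled separately.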
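If the wrong code points are consistent, one way to handle them after extraction is a translation table. The sketch below uses the substitutions that can be read off the output posted in the question (” for é, ‹ for à, ’ for ê, “ for ç, Õ for an apostrophe). That mapping is an assumption based on that one sample. It looks like Mac OS Roman bytes being read through a different encoding table, but that is a guess, and the table would also mangle any genuine quotation marks in the text. The names OBSERVED_FIXES and repair_text are made up here.

```python
# Substitutions inferred from the question's sample output.
# Assumption: they may not hold for any other PDF.
OBSERVED_FIXES = {
    "\u201d": "\u00e9",  # right double quote  -> e acute      (expos"s -> exposes)
    "\u2039": "\u00e0",  # single left guillemet -> a grave
    "\u2019": "\u00ea",  # right single quote  -> e circumflex
    "\u201c": "\u00e7",  # left double quote   -> c cedilla
    "\u00d5": "\u2019",  # O with tilde        -> apostrophe
}
_TABLE = str.maketrans(OBSERVED_FIXES)


def repair_text(text):
    """Apply every substitution in one pass.

    A single pass matters: the apostrophe produced from U+00D5 must not be
    mapped again to e circumflex, which chained str.replace calls could do.
    """
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(repair_text("Des fran\u201cais se disent pr\u201doccup\u201ds"))
```

The cleaner fix is still a PDF with a correct text layer. A table like this is only a workaround for a file you cannot regenerate.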

The output of extractText() is a Unicode string, so if the source is encoded correctly there should be no problem with accented characters.
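Because the result is an ordinary Unicode string, you can check which characters PyPDF2 actually handed back instead of trusting how a terminal or editor renders them. Here is a small sketch using only the standard library. The function name non_ascii_report is made up for this example.

```python
import unicodedata


def non_ascii_report(text):
    """List each distinct non-ASCII character with its code point and Unicode name,
    in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in seen]


if __name__ == "__main__":
    # A fragment shaped like the question's output
    for row in non_ascii_report("d\u00d5Ile de France expos\u201ds"):
        print(row)
```

If the report shows quotation marks where the page shows accented letters, the wrong characters are already in the extracted string, which points at the PDF rather than at how you write the text out.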

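A PDF can keep its image layer and its text layer separate. The image layer is usually on top, which makes the page look cleaner. Unfortunately, that means you cannot spot problems in the underlying text by eye. Without seeing the PDF you are working with, I suspect the text was wrong when it was added to the PDF at creation time.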
