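How to fix Russian text being replaced with spaces when extracting from a PDF in Python
I need to extract the text from http://voeikovmgo.ru/images/stories/publications/2020/ejegodnik_zagr_atm_2019.pdf
I tried to extract it with this Python code: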
import PyPDF2

# Open the PDF in binary mode and extract the text of page index 4 (the fifth page)
pl = open('C:/ejegodnik_zagr_atm_2019.pdf', 'rb')
plread = PyPDF2.PdfFileReader(pl)
getpage = plread.getPage(4)
text = getpage.extractText()
print(text)
As a result, I get this string:
'5 \n \n \n 89 \n 131 \n 90 \n 132 \n. 91 \n 133 \n 92 \n 134 \n 93 \n 135 \n 94 \n 136 \n 95 \n 137 \n 96 \n 138 \n 98 \n 139 \n. 99 \n 140 \n 100 \n 142 \n 101 \n\n 143 \n 102 \n 144 \n 103 \n 145 \n 104 \n 146 \n 105 \n 147 \n 106 \n \n 148 \n 108 \n 149 \n 109 \n 150 \n 110 \n-\n \n 151 \n-\n 111 \n 112 \n 152 \n 112 \n 153 \n 114 \n 154 \n 115 \n 155 \n 116 \n 156 \n 117 \n 157 \n 118 \n 158 \n 119 \n 159 \n 120 \n 160 \n 121 \n 161 \n-\n 122 \n 162 \n 124 \n-\n \n 163 \n 125 \n 164 \n 126 \n 165 \n 127 \n 166 \n 129 \n-\n 167 \n 130 \n 168 \n \n \n \n 184 \n 205 \n 187 \n 207 \n 189 \n-\n 209 \n 191 \n 212 \n 194 \n-\n 214 \n 197 217 \n 200 \n 219 \n 202 '
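Only the numbers and punctuation come through; the Russian text is replaced by spaces. I need the Russian text as well. How can I solve this?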
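To confirm that the Cyrillic characters are really absent from the extracted string, and not just failing to display in the console, a small check along these lines might help. This is only a diagnostic sketch using the standard library; `count_cyrillic` is a helper name made up for illustration and is not part of PyPDF2.

```python
import unicodedata


def count_cyrillic(text: str) -> int:
    """Count characters whose Unicode name starts with 'CYRILLIC'."""
    return sum(
        1 for ch in text
        if unicodedata.name(ch, "").startswith("CYRILLIC")
    )


# Short excerpt from the beginning of the extracted string shown above.
sample = '5 \n \n \n 89 \n 131 \n 90 \n 132 \n. 91 \n 133 '

# For comparison: a string that does contain Russian letters.
reference = "Содержание 5"

print("Cyrillic characters in extracted text:", count_cyrillic(sample))
print("Cyrillic characters in reference text:", count_cyrillic(reference))
```

If the count for the extracted text is zero, the letters were lost during extraction itself, so changing the console encoding would not bring them back.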