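How to remove line breaks inside tags in XML with Python
The problem is that some of the XML files I scraped from the SEC contain line breaks inside tags, so these XML files are not well-formed.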
<footnote id="F4">Shares sold on the open market are reported as an average sell price per share of $56.87; breakdown of shares sold and per share sale prices are as follows; 100 at $56.31; 200 at $56.32; 100 at $56.33; 198 at $56.39; 600 at $56.40; 100 at $56.41; 102 at $56.42; 600 at $56.44; 320 at $56.45; 100 at $56.46; 900 at $56.47; 480 at $56.48; 300 at $56.49; 1,200 at $56.50; 400 at $56.51; 1,130 at $56.52; 600 at $56.53; 100 at $56.54; 1,500 at $56.55; 600 at $56.56; 644 at $56.57; 1,656 at $56.58; 1,070 at $56.59; 2069 at $56.60; 1,831 at $56.61; 1,000 at $56.62; 1,000 at $56.63; 492 at $56.64; 1,400 at $56.65; 920 at $56.66; 1,000 at $56.67; 600 at $56.68; 500 at $56.69; 1,200 at $56.70; 500 at $56.71; 582 at $56.72; 400 at $56.73; 1,108 at $56.74; 37 at $56.75; 710 at $56.76; 630 at $56.77; 1,600 at $56.78; 400 at $56.79; 400 at $56.80; 1,500 at $56.81; 1,100 at $56.82; 100 at $56.83; 800 at $56.84; 200 at $56.85; 1,300 at $56.87; additional shares sold continued on Footnote (5).</footnot
e>
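My first thought was that this was caused by the difference between the utf-8 and ISO-8859-1 encodings, but the problem persisted after I changed the encoding. My next attempt was a regular expression that detects those line breaks inside tags, but because they can appear anywhere, that solution is not very reliable.
Do you have any ideas on how to solve this?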
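To find out which files are affected, a parse check along the following lines may help. This is only a minimal sketch: `is_well_formed` is a hypothetical helper name, and it assumes the XML part has already been extracted into a string.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A closing tag split by a newline, as in the footnote above, fails to parse.
good = "<doc><footnote>text</footnote></doc>"
bad = "<doc><footnote>text</footnot\ne></doc>"
print(is_well_formed(good), is_well_formed(bad))
```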
Solution
For this txt file with the XML part inside, it can be done like this:
import re

# Open the txt file
with open("0001112679-10-000086.txt", "r", encoding="utf8") as f:
    txt = f.read()

# Cut the XML part out of the txt file
start = txt.find("<XML>")
end = txt.find("</XML>") + len("</XML>")
xml = txt[start:end]

# Process the XML part: remove each newline that directly follows
# 1023 non-newline characters (i.e. a line break forced by line length)
xml = re.sub(r"([^\n]{1023})\n", r"\1", xml)

# Combine the parts back into a new txt
new_txt = txt[:start] + xml + txt[end:]

# Save the new txt to a file
with open("0001112679-10-000086_output.txt", "w", encoding="utf8") as f:
    f.write(new_txt)
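The same substitution can be tried on a small string to see what it does. The sketch below is illustrative only: `unwrap_long_lines` and its `width` parameter are hypothetical names, and the width of 20 is chosen just to keep the sample short (the answer above uses 1023). Note that the pattern removes a newline after any run of at least `width` non-newline characters, so shorter lines keep their line breaks.

```python
import re
import xml.etree.ElementTree as ET


def unwrap_long_lines(text: str, width: int = 1023) -> str:
    """Drop each newline that directly follows `width` non-newline characters."""
    pattern = re.compile(r"([^\n]{%d})\n" % width)
    return pattern.sub(r"\1", text)


# The second line is exactly 20 characters long and breaks inside </note>.
broken = "<root>\n<note>abcdefghi</not\ne>\n</root>"
fixed = unwrap_long_lines(broken, width=20)

# The repaired text now parses; the short lines kept their newlines.
root = ET.fromstring(fixed)
print(repr(fixed))
print(root.find("note").text)
```

If the files were wrapped at a different width, or some lines are legitimately that long and end in a real line break, this approach would need adjusting.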