How do I use .replace on a .txt file that contains accented characters?

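I have some code that takes a .txt file and reads it into a variable as a string.

I then try to use .replace() on it to change the character "ó" to "o", but it doesn't work! The console prints the same text.

Code: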

def normalize(filename):

    #Ignores errors because I get the .txt from my WhatsApp conversations and emojis raise an error.
    #File says: "Es una rubrica de evaluación." (among many emojis)

    txt_raw = open(filename,"r",errors="ignore")
    txt_read = txt_raw.read()


    #Here,only the "o" is replaced. In the real code,I use a for loop to iterate through all chrs.

    rem_accent_txt = txt_read.replace("ó","o")
    print(rem_accent_txt)

    return

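Expected output:

"Es una rubrica de evaluacion."

Current output:

"Es una rubrica de evaluación."

It doesn't raise an error or print anything else; it just prints the text unchanged.

I think the problem is that the string comes from a file: when I create a string directly in the code and run the same replacement, it works, but when I get the string from the file, it doesn't.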

Edit: Solution!

Thanks to @juanpa.arrivillaga and @das-g, I came up with the following solution:

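from unidecode import unidecode  # third-party package

def get_txt(filename):

    # Open with an explicit encoding; the with block closes the file afterwards.
    with open(filename, encoding="utf8") as txt_raw:
        txt_read = txt_raw.read()

    # unidecode transliterates non-ASCII characters to ASCII approximations.
    txt_decode = unidecode(txt_read)

    print(txt_decode)

    return txt_decode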

Answer

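What is almost certainly happening is that you have a Unicode string that is not normalized. Essentially, there are two ways to represent "ó" in Unicode: as a base letter followed by a combining accent, or as a single precomposed character. They look identical when printed, but .replace() only matches the form you typed in your source code: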

>>> combining = 'o\u0301'  # 'o' followed by a combining acute accent
>>> composed = '\u00f3'    # single precomposed character
>>> len(combining), len(composed)
(2, 1)
>>> list(combining)
['o', '́']
>>> list(composed)
['ó']
>>> import unicodedata
>>> list(map(unicodedata.name, combining))
['LATIN SMALL LETTER O', 'COMBINING ACUTE ACCENT']
>>> list(map(unicodedata.name, composed))
['LATIN SMALL LETTER O WITH ACUTE']

Just normalize your string:

>>> composed == combining
False
>>> composed == unicodedata.normalize("NFC", combining)
True
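
Applied to the original problem, the idea above might look like the following minimal sketch. The function name replace_accented_o and the sample string are hypothetical, chosen only for illustration; the sample assumes the text arrives in decomposed form, which may or may not be what your file actually contains.

```python
import unicodedata

def replace_accented_o(text):
    """Compose the text to NFC, then replace the precomposed 'ó' with 'o'."""
    composed_text = unicodedata.normalize("NFC", text)
    return composed_text.replace("\u00f3", "o")

# Hypothetical input: 'o' + COMBINING ACUTE ACCENT, as it might arrive from a file.
decomposed = "evaluacio\u0301n"

# A plain replace finds no precomposed 'ó', so the string comes back unchanged.
print(decomposed.replace("\u00f3", "o") == decomposed)  # True

# After NFC normalization the replacement matches.
print(replace_accented_o(decomposed))  # evaluacion
```

Normalizing first means the same replacement works whether the file stored the accent in composed or decomposed form.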

Taking a step back, though: do you really want to remove the accents? Or do you just want to normalize to the composed form, as above?
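
If removing accents really is the goal, one common standard-library approach is to decompose the text and drop the combining marks. This is a sketch and not necessarily what the asker ended up using; strip_accents is a hypothetical helper name. Note that it also turns "ñ" into "n", which may not be desirable for Spanish text.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks (accents) while leaving other characters alone."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for characters that are not combining marks.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into NFC.
    return unicodedata.normalize("NFC", without_marks)

print(strip_accents("Es una r\u00fabrica de evaluaci\u00f3n."))  # Es una rubrica de evaluacion.
```

Because the function normalizes before filtering, it handles both the composed and the decomposed representation of each accented letter.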

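As an aside, you shouldn't ignore errors when reading a text file. You should use the correct encoding. I suspect the text file is being written (or read) with the wrong encoding, because you should be able to handle emojis just fine; they are nothing special in Unicode.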

>>> emoji = "\U0001F600"
>>> print(emoji)
😀
>>> unicodedata.name(emoji)
'GRINNING FACE'
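
To illustrate why the encoding matters, here is a small self-contained sketch. It assumes the file is UTF-8 and uses cp1252 purely as an example of a wrong encoding; the actual default on your machine may differ. The sample text and temporary file are hypothetical stand-ins for the exported chat.

```python
import os
import tempfile

sample = "Es una rubrica de evaluaci\u00f3n. \U0001F600"

# Write a UTF-8 file to stand in for the exported chat (hypothetical content).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Decoding with the encoding the file was written in round-trips accents and emoji.
    with open(path, encoding="utf-8") as f:
        good = f.read()

    # Decoding the same bytes as cp1252 (errors ignored, as in the question)
    # raises nothing but produces different characters in place of 'ó'.
    with open(path, encoding="cp1252", errors="ignore") as f:
        bad = f.read()
finally:
    os.remove(path)

print(good == sample)                     # True
print("\u00f3" in bad)                    # False
print(bad.replace("\u00f3", "o") == bad)  # True: there is no 'ó' left to replace
```

With the wrong encoding the read succeeds silently, so .replace() has nothing to match even though no error is reported.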
