Keeping punctuation and casing in gensim WikiCorpus text


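I have a Wiki dump as an xml.bz2 file and want to convert it to txt so that I can process it with BERT later. The goal is to have each sentence on its own line, with a blank line between articles (a requirement for BERT training).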
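
To make the target layout concrete, here is a minimal sketch of the output format described above: one sentence per line and a blank line between articles. The helper names `split_sentences` and `format_for_bert` are hypothetical, and the regex splitter is a deliberately naive assumption; real German text (abbreviations, dates, ordinals) would likely need a proper sentence segmenter.

```python
import re

# Naive assumption: a sentence ends at ., ! or ? followed by whitespace
# and an uppercase letter. This will mis-split abbreviations.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])")


def split_sentences(text):
    """Split one article into sentences with a simple regex."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def format_for_bert(articles):
    """Return one sentence per line, with a blank line between articles."""
    blocks = ["\n".join(split_sentences(article)) for article in articles]
    return "\n\n".join(block for block in blocks if block) + "\n"


if __name__ == "__main__":
    demo = [
        "Der Begriff Heilkunde ist alt. Er wird oft verwendet.",
        "Ein zweiter Artikel.",
    ]
    print(format_for_bert(demo), end="")
```

This only works if the extracted article text still contains punctuation and capital letters, which is exactly what the rest of this question is about.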

I tried following this post (How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?) and did a lot of research on my own. This is what I have so far:

from __future__ import print_function
import bz2
import multiprocessing
import sys
from gensim import utils
from gensim.corpora import WikiCorpus
from gensim.corpora.wikicorpus import *
import six

def tokenize(content):
    #override original method in wikicorpus.py
    return [token.encode('utf8') for token in content.split() 
           if len(token) <= 15 and not token.startswith('_')]

def process_article(args):
    # override original method in wikicorpus.py
    text,lemmatize,title,pageid = args
    text = filter_wiki(text)
    if lemmatize:
        result = utils.lemmatize(text)
    else:
        result = tokenize(text)
    return result,pageid


class MyWikiCorpus(WikiCorpus):
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None, filter_namespaces=('0',)):
        WikiCorpus.__init__(self, fname, processes, lemmatize, dictionary, filter_namespaces)

        def get_texts(self):
            articles,articles_all = 0,0
            positions,positions_all = 0,0
            texts = ((text,self.lemmatize,pageid) for title,text,pageid in extract_pages(bz2.BZ2File(self.fname),self.filter_namespaces))
            pool = multiprocessing.Pool(self.processes)
            for group in utils.chunkize(texts,chunksize=10 * self.processes,maxsize=1):
                for tokens,pageid in pool.imap(process_article,group):  # chunksize=10):
                    articles_all += 1
                    positions_all += len(tokens)
                if len(tokens) < ARTICLE_MIN_WORDS or any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                    continue
                articles += 1
                positions += len(tokens)
                if self.metadata:
                    yield (tokens,(pageid,title))
                else:
                    yield tokens
            pool.terminate()

            print(
                "finished iterating over Wikipedia corpus of %i documents with %i positions"
                " (total %i articles,%i positions before pruning articles shorter than %i words)",articles,positions,articles_all,positions_all,ARTICLE_MIN_WORDS)
            self.length = articles  # cache corpus length

I used the post above to override the functions, and I end up calling the class like this:

def make_corpus2(inp,outp):
    space = " "
    i = 0
    output = open(outp,'w')
    wiki = MyWikiCorpus(inp,lemmatize=False,dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(bytes(' '.join(text),'utf-8').decode('utf-8') + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            print("Saved " + str(i) + " articles")

    output.close()
    print("Finished Saved " + str(i) + " articles")

I call it with make_corpus2("./Wiki_dump_gross.xml.bz2","./pretrain/wiki_dump_sentences.txt").

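There are no errors and the output file gets filled, but the punctuation is still missing. I believe I combined the solutions given in the earlier post correctly, so I wonder where my mistake is. To clarify: I am using a Jupyter Notebook for this.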
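
One thing that may be worth checking, assuming the indentation in the listing matches the notebook: `get_texts` is defined inside `__init__`, not at class level. A `def` nested in `__init__` only creates a local function, so the inherited `WikiCorpus.get_texts` would still be the one that runs, with gensim's default tokenizer (which lowercases and drops punctuation). That would be consistent with seeing no error and unchanged output. The sketch below shows the effect in isolation; the class names `Base`, `Nested` and `Overriding` are hypothetical stand-ins, not gensim classes.

```python
class Base:
    def get_texts(self):
        return "base"


class Nested(Base):
    def __init__(self):
        # This def creates a local function inside __init__.
        # It is discarded when __init__ returns and never becomes a method.
        def get_texts(self):
            return "override"


class Overriding(Base):
    def __init__(self):
        pass

    # Defined at class level, so it replaces Base.get_texts.
    def get_texts(self):
        return "override"


if __name__ == "__main__":
    print(Nested().get_texts())      # prints "base"
    print(Overriding().get_texts())  # prints "override"
```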

An example of the output I get:

der begriff heilkunde bezeichnet die gesamtheit der menschlichen kenntnisse und fähigkeiten über die
entstehung heilung und verhinderung prävention von krankheiten er wird als synonym für medizin im
allgemeinen aber auch innerhalb der der volksheilkunde und jeder form der psychotherapie verwendet
ausübung einer heilkunde die ausübung einer heilkunde genannt auch heilkunst ist in deutschland
österreich und der schweiz rechtlich unterschiedlich geregelt

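I would also like to know whether the original casing of the text can be kept, since capitalization is a meaningful part of the German language.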
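
A possible direction that avoids subclassing, sketched under the assumption that the installed gensim version's `WikiCorpus` accepts the `tokenizer_func`, `token_min_len`, `token_max_len` and `lower` arguments: pass a custom tokenizer that splits on whitespace only and set `lower=False`. The names `keep_case_tokenizer` and `build_corpus` are hypothetical. On older gensim 3.x installs with the `pattern` package present, `lemmatize=False` may also need to be passed so that the tokenizer is actually used; that argument is left out here because its handling differs between versions.

```python
def keep_case_tokenizer(text, token_min_len, token_max_len, lower):
    """Whitespace tokenizer that keeps punctuation attached to tokens.

    Takes the four positional arguments that WikiCorpus is assumed
    to pass to a tokenizer_func.
    """
    if lower:
        text = text.lower()
    return [
        token
        for token in text.split()
        if token_min_len <= len(token) <= token_max_len
        and not token.startswith("_")
    ]


def build_corpus(dump_path):
    """Build a WikiCorpus that uses the tokenizer above (requires gensim)."""
    # Imported lazily so the tokenizer can be tried without gensim installed.
    from gensim.corpora import WikiCorpus

    return WikiCorpus(
        dump_path,
        dictionary={},  # non-None, so no vocabulary pass over the dump
        tokenizer_func=keep_case_tokenizer,
        token_min_len=1,  # keep one-character tokens as well
        lower=False,  # keep the original casing
    )
```

Iterating over `build_corpus(path).get_texts()` should then yield token lists with casing and punctuation intact, which can be joined with spaces and written out as before. The tokenizer is kept as a top-level function because gensim hands it to worker processes.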
