How to optimize searching a large file in Python

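I have a large file containing about 8 million lines of file names, and I am trying to find the file names that contain a certain value. Finding one value is fine, but I need to search for about 50,000 unique values, and the search takes a very long time.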

with open('UniqueValueList.txt') as g:
    uniqueValues = g.read().splitlines()

outF = open("Filenames_With_Unique_Values.txt","w")
with open('Filenames_File.txt') as f:
    fileLine = f.readlines()
    for line in fileLine:
        for value in uniqueValues:
            if value in line:
                outF.write(line)
outF.close()

I can't load the file of file names into memory because it is too large. Is there another way to optimize this search?

Solution

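My two theories were (1) memory-map the file and use a multiline regex for each value search, and (2) spread the work across multiple subprocesses. I combined the two and came up with the following. It may be possible to mmap in the parent and share the mapping, but I took the easy route and mapped the file in each subprocess, assuming the operating system will figure out efficient sharing for you.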

import multiprocessing as mp
import os
import mmap
import re

def _value_find_worker_init(filename):
    """Called when initializing mp.Pool to open an mmaped file in subprocesses.
    The file is `global mmap_file` so that the worker can find it.
    """
    global mmap_file
    filenames_fd = os.open(filename,os.O_RDONLY)
    mmap_file = mmap.mmap(filenames_fd,length=os.stat(filename).st_size,access=mmap.ACCESS_READ)

def _value_find_worker(value):
    """Return a list of matching lines in `global mmap_file`"""
    # multiline regex for findall; re.escape makes the value match literally
    regex = b"(?m)^.*?" + re.escape(value) + b".*?$"
    matched = re.compile(regex).findall(mmap_file)
    return matched

def find_unique():
    with open("UniqueValueList.txt","rb") as g:
        # skip blank lines: an empty value would match every line
        uniqueValues = [line.strip() for line in g if line.strip()]
    with mp.Pool(initializer=_value_find_worker_init,initargs=("Filenames_File.txt",)) as pool:
        matched_values = set()
        for matches in pool.imap_unordered(_value_find_worker,uniqueValues):
            matched_values.update(matches)
    with open("Filenames_With_Unique_Values.txt","wb") as outfile:
        outfile.writelines(value + b"\n" for value in matched_values)

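if __name__ == "__main__":
    # The guard keeps worker processes from re-running the search
    # on platforms where multiprocessing starts a fresh interpreter.
    find_unique()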
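
To see what the multiline regex does on its own, here is a minimal sketch that runs the same pattern over a small in-memory bytes object instead of the mmapped file. The helper name `find_lines` and the sample data are made up for illustration; the pattern is the one used in the worker above, with the value passed through `re.escape` so that characters such as `+` or `.` are matched literally.

```python
import re

def find_lines(buffer, value):
    """Return every full line of `buffer` that contains the literal bytes `value`.

    `buffer` can be any bytes-like object, for example bytes or an mmap.
    """
    # (?m) makes ^ and $ match at line boundaries, so each match is one whole line.
    pattern = re.compile(b"(?m)^.*?" + re.escape(value) + b".*?$")
    return pattern.findall(buffer)

# Hypothetical sample standing in for the contents of the file-name list.
sample = b"dir/report_123.txt\ndir/notes.txt\nother/123_backup.tar\n"
print(find_lines(sample, b"123"))
```

Because `.` does not match a newline, a match can never spill over into the next line, which is why `findall` returns whole lines without the trailing `\n`.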

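We can use the file object as an iterator. The iterator returns the lines one at a time, and each can be processed as it arrives. This does not read the whole file into memory, which makes it suitable for reading large files in Python.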

This clear tutorial may help: How to read huge file with Python
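
A minimal sketch of the iterator approach, assuming the file names from the question. The function names `matching_lines` and `filter_file` are hypothetical. It only addresses memory use: each line is still compared against every value, so it does not reduce the number of comparisons by itself.

```python
def matching_lines(lines, values):
    """Yield each line that contains at least one of `values`, exactly once."""
    for line in lines:
        # any() stops at the first value found, so a line is never emitted twice.
        if any(value in line for value in values):
            yield line

def filter_file(source_path, values_path, output_path):
    """Stream `source_path` line by line and copy the matching lines to `output_path`."""
    with open(values_path) as g:
        # Drop blank entries: an empty string is contained in every line.
        values = [v for v in g.read().splitlines() if v]
    # Iterating over the file object reads one line at a time instead of the whole file.
    with open(source_path) as f, open(output_path, "w") as out:
        out.writelines(matching_lines(f, values))

# Example call with the file names used in the question:
# filter_file("Filenames_File.txt", "UniqueValueList.txt", "Filenames_With_Unique_Values.txt")
```

Only the list of values is held in memory here; the 8-million-line file is consumed lazily.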
