How to sort a large text file quickly, and can PyTables help?
I have two large input files (> 10 GB each, Nx4 columns). The task is to sort each file by its second column as quickly as possible. At the moment I read each file in chunks, sort each chunk, write the sorted chunks to text files, and then merge them (code below). It works, but I need it to be faster.
Is there a quicker approach? Later on I will also have to read the sorted files back in chunks. Could this be done with the PyTables or h5py modules? Any other suggestions?
import os
import glob
from heapq import merge

filename = ['Input-1.txt', 'Input-2.txt']
savename = ['Sort-1.txt', 'Sort-2.txt']
chunksize = 10_000_000  # number of lines per chunk

for findex in range(2):
    nrows = sum(1 for line in open(filename[findex]))  # no. of lines in each file
    # chunk files are stored in /dump
    this_dir = os.path.dirname(__file__)
    path_1 = ["dump/chunk1_{}.tsv", "dump/chunk2_{}.tsv"]  # chunks in .tsv
    path_2 = ["dump/chunk1_*.tsv", "dump/chunk2_*.tsv"]
    path_w = os.path.join(this_dir, path_1[findex])
    path_r = os.path.join(this_dir, path_2[findex])
    fid = 1
    lines = []
    with open(filename[findex], 'r') as f_in:
        # create the chunk file(s)
        f_out = open(path_w.format(fid), 'w')
        for line_num, line in enumerate(f_in, 1):
            # keep appending until we reach the chunk boundary
            lines.append(line)
            if line_num % chunksize == 0:
                # sort the chunk by the second column and flush it
                lines = sorted(lines, key=lambda k: float(k.split(',')[1]))
                f_out.writelines(lines)
                f_out.close()
                lines = []
                fid += 1
                # open the next chunk file
                f_out = open(path_w.format(fid), 'w')
        # last (partial) chunk
        if lines:
            lines = sorted(lines, key=lambda k: float(k.split(',')[1]))
            f_out.writelines(lines)
            f_out.close()
            lines = []
    print(f'==> Writing {savename[findex]}')
    # k-way merge of the sorted chunk files
    chunks = [open(chunk_name, 'r') for chunk_name in glob.glob(path_r)]
    with open(savename[findex], 'w') as f_out:
        f_out.writelines(merge(*chunks, key=lambda k: float(k.split(',')[1])))