How to fix pyarrow 1.0 throwing an out-of-memory error when reading a large number of files with ParquetDataset (works fine with version 0.13)
I split a dataframe and stored it across 5000+ files. I load all the files with ParquetDataset(fnames).read(). After updating pyarrow from 0.13.0 to the latest version 1.0.1, it started throwing "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the old version. My machine has 256 GB of RAM, which is more than enough to hold the data being loaded.
# create a big dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.arange(50000000)})
df['F1'] = np.random.randn(50000000) * 100
df['F2'] = np.random.randn(50000000) * 100
df['F3'] = np.random.randn(50000000) * 100
df['F4'] = np.random.randn(50000000) * 100
df['F5'] = np.random.randn(50000000) * 100
df['F6'] = np.random.randn(50000000) * 100
df['F7'] = np.random.randn(50000000) * 100
df['F8'] = np.random.randn(50000000) * 100
df['F9'] = 'ABCDEFGH'
df['F10'] = 'ABCDEFGH'
df['F11'] = 'ABCDEFGH'
df['F12'] = 'ABCDEFGH01234'
df['F13'] = 'ABCDEFGH01234'
df['F14'] = 'ABCDEFGH01234'
df['F15'] = 'ABCDEFGH01234567'
df['F16'] = 'ABCDEFGH01234567'
df['F17'] = 'ABCDEFGH01234567'
# split and save data to 5000 files
for i in range(5000):
    df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
# use a fresh session to read data
# below code works to read
import pandas as pd
df = []
for i in range(5000):
    df.append(pd.read_parquet(f'{i}.parquet'))
df = pd.concat(df)
# below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
# tried use_legacy_dataset=False, same issue
import pyarrow.parquet as pq
fnames = []
for i in range(5000):
    fnames.append(f'{i}.parquet')
len(fnames)
df = pq.ParquetDataset(fnames).read(use_threads=False)
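For reference, one workaround I am considering (a sketch only, not verified to avoid the error on 1.0.1) is to read each file into an Arrow Table individually and combine them with pa.concat_tables, mirroring the per-file pandas loop above; it reuses the same fnames list:

# workaround sketch: read file-by-file, then concatenate
# (assumption: per-file reads sidestep whatever ParquetDataset.read() does differently in 1.x)
import pyarrow as pa
import pyarrow.parquet as pq

tables = []
for fname in fnames:
    # read one parquet file at a time into an Arrow Table
    tables.append(pq.read_table(fname, use_threads=False))
# combine all per-file tables into a single Table, then convert to pandas
df = pa.concat_tables(tables).to_pandas()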