How to efficiently apply a function to, and query values in, tables with millions/billions of rows
I have Parquet files containing millions to billions of rows, and I am trying to find a faster way to apply a function to these large tables and then query values from them. Example ipynb code for a 1M-row table:
import random
import string

from pyarrow import parquet as pq
from ttictoc import TicToc  # preferred time tracker

t = TicToc()

# Function to be applied to each row. Note that the row argument x is
# never read: the result depends only on n_cols and fresh random draws.
def val_calc(x, n_cols):
    # Mapping process
    abc_vals = string.ascii_uppercase[:n_cols]  # the n_cols alphabet letters used in the df
    x_map = {i: random.randint(j, n_cols + 10) for i, j in zip(abc_vals, range(n_cols))}  # map-dict
    # Calculations... Formula: Σ(Xi*Yi)/ΣYi
    y_vals = {i: j**2 for i, j in x_map.items()}  # Y values to use: Yi = Xi**2
    weights = [x_val * y_val for x_val, y_val in zip(x_map.values(), y_vals.values())]  # Xi*Yi
    result = sum(weights) / sum(y_vals.values())  # Σ(Xi*Yi)/ΣYi
    return result
# Getting the parquet file
file_path = 'C:/XYZ/project/'
file_name = 'gx6c'
# pyarrow parquet -> pandas
large_pq = pq.read_table(file_path + file_name + '.pq').to_pandas()
# Number of letter columns per row, encoded in the file name ('gx6c' -> 6):
n_columns = int(file_name.split('x')[-1].replace('c', ''))
# ------------------------------------------------------- Results and time taken
t.tic()  # start time
# Function applied row by row
large_pq['values'] = large_pq.apply(lambda x: val_calc(x, n_columns), axis=1)
t.toc()  # end time
print(f'Time passed for applying function: {round(t.elapsed, 5)} seconds')
display(large_pq)
# Querying part
t.tic()
queried = large_pq[large_pq['values'].between(12, 13)]
t.toc()
print(f'Time passed for query: {round(t.elapsed, 5)} seconds')
display(queried)
Output:
Time passed for applying function: 17.60126 seconds
abc values
0 AAAAAA 13.258228
1 AAAAAB 10.227642
2 AAAABA 11.264317
3 AAABAA 12.422303
4 AABAAA 13.537634
... ... ...
999995 JJIJJJ 12.620214
999996 JJJIJJ 11.323636
999997 JJJJIJ 10.756757
999998 JJJJJI 10.358811
999999 JJJJJJ 10.896328
1000000 rows × 2 columns
Time passed for query: 0.04801 seconds
abc values
3 AAABAA 12.422303
5 ABAAAA 12.062818
13 AAAAAD 12.762040
16 AADAAA 12.925373
25 AAAAAF 12.661267
... ... ...
999967 IJJJII 12.936667
999972 JIJIJI 12.331742
999986 JIJJJI 12.133333
999993 IJJJJJ 12.179487
999995 JJIJJJ 12.620214
284129 rows × 2 columns
Repeating the same operations on a 13M-row table with 17 letters (or 17 "columns") per row takes about 10 minutes, with the query step at 0.22524 seconds. For larger files I run into memory errors, so I can't reach the billion-row mark. Is there any workaround to run these steps in a much shorter time frame, e.g. around 10 seconds instead of minutes for a 13M-row table?
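For context, one direction I have been looking at: since val_calc never actually reads the row x, the whole 'values' column can in principle be computed in one shot with NumPy instead of a per-row apply. The sketch below is only an illustration of that idea, not a tested replacement (val_calc_vectorized is a name I made up, and the seed parameter is an addition for reproducibility; results will not match the random draws of the original function):

```python
import numpy as np

def val_calc_vectorized(n_rows, n_cols, seed=None):
    """Vectorized form of the formula Σ(Xi*Yi)/ΣYi with Yi = Xi**2.

    Mirrors val_calc's per-column random.randint(j, n_cols + 10) draws,
    but generates them for all rows at once.
    """
    rng = np.random.default_rng(seed)
    lows = np.arange(n_cols)  # per-column lower bounds, as in the dict comprehension
    # random.randint is inclusive on both ends; Generator.integers excludes
    # the upper bound, hence n_cols + 11.
    x = rng.integers(lows, n_cols + 11, size=(n_rows, n_cols))
    y = x ** 2
    return (x * y).sum(axis=1) / y.sum(axis=1)  # Σ(Xi*Yi)/ΣYi per row

# Hypothetical usage in place of the apply call:
# large_pq['values'] = val_calc_vectorized(len(large_pq), n_columns)
```

For the billion-row scale where memory runs out, I suppose this could be combined with reading the Parquet file in chunks (e.g. pyarrow's batch iteration) rather than loading everything via to_pandas() at once, but I have not benchmarked that.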