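How can I improve computation speed in Python?
I am setting up a calculation that adds a new column to a DataFrame named loan.
I need to create a new column 'mob'. The calculation of 'mob' is:
- check whether a row's 'LoanId' is the same as the previous row's, for example whether loan['LoanId'][0] == loan['LoanId'][1];
- if so, check whether the previous row's 'mob' is > 0; if it is, this row's 'mob' is the previous row's value plus 1; if it is not, check whether this row's loan['repay_lbl'] is 1 or 2, and if so this row's 'mob' is 1.
My code is as follows: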
for i in range(1, len(loan['LoanId'])):
    if loan['LoanId'][i-1] == loan['LoanId'][i]:
        if loan['mob'][i-1] > 0:
            loan['mob'][i] = loan['mob'][i-1] + 1
        elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
            loan['mob'][i] = 1
The code takes O(n) time. Is there any way to improve the algorithm and speed it up? I am just a beginner in Python. Thank you very much for your help.
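Solution
Since the value of the mob column in each row depends on the value in the previous row, it depends on all previous rows. This means you cannot run it in parallel, and you are basically stuck at O(n).
So I don't think numpy array operations will be of much use here.
Beyond that, there are the usual tricks for speeding up Python code.
I am not sure whether the first two work with numpy/pandas. In that case you may have to use regular Python lists for your data.
Of course, before digging into any of these, you should consider whether your dataset is large enough to justify the effort.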
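As a rough illustration of the "regular Python lists" suggestion, the sketch below copies the three columns into lists, runs the same rule on them, and writes the result back in a single assignment. The function name mob_with_lists and the sample data are hypothetical and not part of the original post; the sketch also assumes 'mob' already exists with starting values, as in the question.

```python
import pandas as pd

def mob_with_lists(loan):
    """Compute 'mob' on plain Python lists, then write the column back once."""
    loan_ids = loan['LoanId'].tolist()
    repay = loan['repay_lbl'].tolist()
    mob = loan['mob'].tolist()            # starting values, as in the question
    for i in range(1, len(loan_ids)):
        if loan_ids[i - 1] == loan_ids[i]:
            if mob[i - 1] > 0:
                mob[i] = mob[i - 1] + 1
            elif repay[i] in (1, 2):
                mob[i] = 1
    result = loan.copy()                  # leave the caller's frame untouched
    result['mob'] = mob
    return result

# Hypothetical sample data, not from the original post
loan = pd.DataFrame({
    'LoanId':    [7, 7, 7, 7, 9, 9],
    'repay_lbl': [0, 1, 0, 2, 2, 0],
    'mob':       [0, 0, 0, 0, 0, 0],
})
out = mob_with_lists(loan)
print(out['mob'].tolist())   # [0, 1, 2, 3, 0, 0]
```

The loop is still O(n), but every read and write inside it touches a Python list instead of going through pandas indexing.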
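Reducing the time by changing how you loop
Basis for improving loop time
- All N rows are traversed without broadcasting, so the complexity is O(N)
- Although every method is of order N, different looping methods have different constant scaling factors
- The different scaling factors make some methods much faster than others
Inspired by: "Different ways to iterate over rows in a Pandas Dataframe — performance comparison"
Methods
- For loop (the original post)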
- iterrows
- itertuples
- zip
Summary
For 100,000 rows, the zip method is about 93 times faster than the for loop (i.e. the OP's method).
Test code
import pandas as pd
import numpy as np
from random import randint

def create_input(N):
    ' Creates a loan DataFrame with N rows '
    # Though random, N // 4 makes it highly likely that some rows repeat a LoanId
    LoanId = [randint(0, N // 4) for _ in range(N)]
    repay_lbl = [randint(0, 2) for _ in range(N)]
    data = {'LoanId': LoanId, 'repay_lbl': repay_lbl, 'mob': [0] * N}
    return pd.DataFrame(data)

def m_itertuples(loan):
    ' Iterating using itertuples, set single values using at '
    # Copy because the timing harness calls the function several times and
    # the input must not be modified; not necessary in general
    loan = loan.copy()
    prev_loanID, prev_mob = None, None
    for index, row in enumerate(loan.itertuples()):  # iterate over rows with itertuples()
        if prev_loanID is not None:
            if prev_loanID == row.LoanId:
                if prev_mob > 0:
                    loan.at[row.Index, 'mob'] = prev_mob + 1
                elif row.repay_lbl == 1 or row.repay_lbl == 2:
                    loan.at[row.Index, 'mob'] = 1
        # Query the DataFrame for the latest values
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
    return loan
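
def m_for_loop(loan):
    ' For loop over the data frame '
    # Copy because the timing harness calls the function several times and
    # the input must not be modified; not necessary in general
    loan = loan.copy()
    # Chained indexing (loan['mob'][i] = ...) is kept from the original post
    # for the benchmark; it can raise SettingWithCopyWarning and does not
    # update loan when pandas Copy-on-Write is enabled
    for i in range(1, len(loan['LoanId'])):
        if loan['LoanId'][i-1] == loan['LoanId'][i]:
            if loan['mob'][i-1] > 0:
                loan['mob'][i] = loan['mob'][i-1] + 1
            elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
                loan['mob'][i] = 1
    return loan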
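
def m_iterrows(loan):
    ' Iterating using iterrows, set single values using at '
    loan = loan.copy()  # copy so the timing runs do not modify the input
    prev_loanID, prev_mob = None, None
    for index, row in loan.iterrows():  # iterate over rows with iterrows()
        if prev_loanID is not None:
            if prev_loanID == row['LoanId']:
                if prev_mob > 0:
                    loan.at[index, 'mob'] = prev_mob + 1
                elif row['repay_lbl'] == 1 or row['repay_lbl'] == 2:
                    loan.at[index, 'mob'] = 1
        # Query the DataFrame for the latest values
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
    return loan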

def m_zip(loan):
    ' Iterating using zip, set single values using at '
    loan = loan.copy()  # copy so the timing runs do not modify the input
    prev_loanID, prev_mob = None, None
    for index, (loanID, mob, repay_lbl) in enumerate(zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])):
        if prev_loanID is not None:
            if prev_loanID == loanID:
                if prev_mob > 0:
                    mob = loan.at[index, 'mob'] = prev_mob + 1
                elif repay_lbl == 1 or repay_lbl == 2:
                    mob = loan.at[index, 'mob'] = 1
        # Update to latest values
        prev_loanID, prev_mob = loanID, mob
    return loan
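Note: the iterator-based code queries the DataFrame for updated values rather than reading them from the iterator, because of this warning:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.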
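The warning can be seen on a small case. The sketch below uses a hypothetical two-row frame with mixed integer and float columns, so iterrows() has to build a new Series for each row. Writing to that row object leaves the DataFrame untouched, while writing through df.at changes it.

```python
import pandas as pd

# Hypothetical two-row frame; mixed dtypes force iterrows() to build
# a new Series for every row
df = pd.DataFrame({'LoanId': [1, 1], 'rate': [0.5, 0.7], 'mob': [0, 0]})

for index, row in df.iterrows():
    row['mob'] = 99              # writes to the per-row copy only

unchanged = df['mob'].tolist()   # the DataFrame still holds the old values

for index, row in df.iterrows():
    df.at[index, 'mob'] = 5      # writes to the DataFrame itself

changed = df['mob'].tolist()
print(unchanged, changed)        # [0, 0] [5, 5]
```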
The DataFrames were also compared using assert df1.equals(df2) to verify that the different methods produce the same result.
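A check of this kind could look roughly like the sketch below. The two implementations are shortened stand-ins written for this example, and the helper all_equal and the sample data are hypothetical names rather than the code used for the timings. The sketch assumes the default integer index, so positions from enumerate match the labels passed to .at.

```python
import pandas as pd

def mob_for_loop(loan):
    """Index-based loop that reads and writes with .at (stand-in)."""
    loan = loan.copy()
    for i in range(1, len(loan)):
        if loan.at[i - 1, 'LoanId'] == loan.at[i, 'LoanId']:
            if loan.at[i - 1, 'mob'] > 0:
                loan.at[i, 'mob'] = loan.at[i - 1, 'mob'] + 1
            elif loan.at[i, 'repay_lbl'] in (1, 2):
                loan.at[i, 'mob'] = 1
    return loan

def mob_zip(loan):
    """zip-based loop that tracks the previous row in local variables (stand-in)."""
    loan = loan.copy()
    prev_id, prev_mob = None, None
    for index, (loan_id, mob, repay_lbl) in enumerate(
            zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])):
        if prev_id is not None and prev_id == loan_id:
            if prev_mob > 0:
                mob = loan.at[index, 'mob'] = prev_mob + 1
            elif repay_lbl in (1, 2):
                mob = loan.at[index, 'mob'] = 1
        prev_id, prev_mob = loan_id, mob
    return loan

def all_equal(funcs, loan):
    """Return True if every function yields the same DataFrame as the first."""
    reference = funcs[0](loan)
    return all(reference.equals(func(loan)) for func in funcs[1:])

# Hypothetical sample data
sample = pd.DataFrame({
    'LoanId':    [3, 3, 3, 4, 4, 4, 4],
    'repay_lbl': [2, 0, 1, 0, 2, 0, 0],
    'mob':       [0, 0, 0, 0, 0, 0, 0],
})
assert all_equal([mob_for_loop, mob_zip], sample)
```

DataFrame.equals compares values and dtypes, so it also catches a method that silently changes the 'mob' column to a float type.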
Timing code
Using benchit
import benchit

inputs = [create_input(i) for i in 10**np.arange(6)]  # 1 to 10^5 rows
funcs = [m_for_loop, m_iterrows, m_itertuples, m_zip]
t = benchit.timings(funcs, inputs)
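If benchit is not available, a similar comparison can be approximated with the standard-library timeit module. This is only a sketch of the idea and is not how the numbers in this answer were produced. The helper time_funcs and the two toy functions are hypothetical; in practice funcs and inputs would be the lists defined above.

```python
import timeit

def time_funcs(funcs, inputs, repeat=3):
    """Return {function name: [best seconds per call, one entry per input]}."""
    results = {}
    for func in funcs:
        best_times = []
        for data in inputs:
            # One call per measurement, best of `repeat` measurements
            runs = timeit.repeat(lambda: func(data), number=1, repeat=repeat)
            best_times.append(min(runs))
        results[func.__name__] = best_times
    return results

# Hypothetical stand-ins for the functions being compared
def loop_sum(values):
    total = 0
    for v in values:
        total += v
    return total

def builtin_sum(values):
    return sum(values)

toy_inputs = [list(range(10 ** k)) for k in range(4)]   # 1 to 1000 items
timings = time_funcs([loop_sum, builtin_sum], toy_inputs)
for name, seconds in timings.items():
    print(name, ['%.6f' % s for s in seconds])
```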
Results
Run times in seconds
Len      m_for_loop  m_iterrows  m_itertuples     m_zip
1          0.000217    0.000493      0.000781  0.000327
10         0.001070    0.002002      0.001008  0.000353
100        0.007100    0.016501      0.003062  0.000498
1000       0.056940    0.162423      0.021396  0.001057
10000      0.565809    1.625043      0.210858  0.006938
100000     5.890920   16.658842      2.179602  0.062953