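Is there a faster way to do lookups and calculations in Pandas?
I have two DataFrames in pandas. What I want to do is take each 'Name' from DF1 and get the corresponding 'City' and 'State' from DF2.
For example, 'Dwight' in DF1 should return the corresponding values 'Miami' and 'Florida' from DF2.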
DF1
DF1 has roughly 70,000 rows and 3 columns:
          Name  Age  Student
0       Dwight   20      Yes
1      Michael   30       No
2          Pam   55       No
.            .    .        .
70000      Jim   27      Yes
The second DataFrame, DF2, has roughly 320,000 rows:
           Name      City         State
0        Dwight     Miami       Florida
1       Michael  Scranton  Pennsylvania
2           Pam    Austin         Texas
.             .         .             .
325082      Jim  Scranton  Pennsylvania
Currently I have two functions that use a filter to return the 'City' and 'State' values, and I use the apply function to run them over every value.
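def read_city(name):
    # Boolean mask over every row of df2 whose Name matches
    filt = (df2['Name'] == name)
    if filt.any():
        # City of the first matching row
        field = df2[filt]['City'].values[0]
    else:
        field = ""
    return field

def read_state(name):
    # Same scan of df2, repeated for the State column
    filt = (df2['Name'] == name)
    if filt.any():
        field = df2[filt]['State'].values[0]
    else:
        field = ""
    return field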
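The apply step itself is not shown in the question. The following is only a sketch of how the two functions are presumably wired up, assuming the first DataFrame is called df and the results go into city_list and State_list columns. The small frames and the read_field helper are made-up stand-ins, not the asker's data or code.

```python
import pandas as pd

# Made-up stand-ins for the real 70,000-row and 320,000-row frames.
df = pd.DataFrame({"Name": ["Dwight", "Michael", "Angela"],
                   "Age": [20, 30, 40],
                   "Student": ["Yes", "No", "No"]})
df2 = pd.DataFrame({"Name": ["Dwight", "Michael", "Pam"],
                    "City": ["Miami", "Scranton", "Austin"],
                    "State": ["Florida", "Pennsylvania", "Texas"]})

def read_field(name, column):
    # Hypothetical helper with the same logic as read_city / read_state:
    # build a mask over all of df2, then take the first match or "".
    filt = df2["Name"] == name
    return df2.loc[filt, column].values[0] if filt.any() else ""

# Assumed usage: every row of df triggers one full scan of df2, so the
# work grows with len(df) * len(df2).
df["city_list"] = df["Name"].apply(lambda n: read_field(n, "City"))
df["State_list"] = df["Name"].apply(lambda n: read_field(n, "State"))
print(df)
```

With 70,000 names and 320,000 rows to scan per name, this pattern performs on the order of tens of billions of comparisons, which is consistent with a runtime measured in minutes.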
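Computing the results this way takes a long time: it takes about 18 minutes to get back df['city_list'] and df['State_list'].
Is there a faster way to do this? Since I'm new to pandas, I'd like to know whether there is an efficient way to compute it.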
Solution
I believe you can do a map:
s = df2.groupby('Name')[['City','State']].agg(list)
df['city_list'] = df['Name'].map(s['City'])
df['State_list'] = df['Name'].map(s['State'])
Or, once you have s, do a left merge:
df = df.merge(s.add_suffix('_list'),left_on='Name',right_index=True,how='left')
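Note that grouping with agg(list) stores a list of every matching value per name, whereas the functions in the question return only the first match, or an empty string when there is none. If that first-match behaviour is what is wanted, one possible sketch is to drop duplicate names before mapping. The small frames below are made-up sample data, and the variable name first is only illustrative.

```python
import pandas as pd

# Made-up sample data; "Dwight" appears twice in df2, "Angela" not at all.
df = pd.DataFrame({"Name": ["Dwight", "Michael", "Angela"]})
df2 = pd.DataFrame({"Name": ["Dwight", "Michael", "Dwight"],
                    "City": ["Miami", "Scranton", "Tampa"],
                    "State": ["Florida", "Pennsylvania", "Florida"]})

# Keep only the first row per name, then index by name for fast lookups.
first = df2.drop_duplicates("Name").set_index("Name")

# map() looks each name up in the index; names missing from df2 come back
# as NaN, so fill them with "" to mirror the original functions.
df["city_list"] = df["Name"].map(first["City"]).fillna("")
df["State_list"] = df["Name"].map(first["State"]).fillna("")
print(df)
```

The lookup table is built once, so each name is resolved through the index instead of a full scan of df2.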
Alternatively, I think you can do the following:
import pandas as pd

# Dataframe DF1 (dummy data)
DF1 = pd.DataFrame(columns=['Name','Age','Student'],
                   data=[['Dwight',20,'Yes'],['Michael',30,'No'],['Pam',55,'No'],['Jim',27,'Yes']])
print("DataFrame DF1")
print(DF1)
# Dataframe DF2 (dummy data)
DF2 = pd.DataFrame(columns=['Name','City','State'],
                   data=[['Dwight','Miami','Florida'],['Michael','Scranton','Pennsylvania'],
                         ['Pam','Austin','Texas'],['Jim','Scranton','Pennsylvania']])
print("DataFrame DF2")
print(DF2)
# Merge on the 'Name' column, then rename the 'City' and 'State' columns
df = pd.merge(DF1,DF2,on=['Name']).rename(columns={'City': 'city_list','State': 'State_list'})
print("DataFrame final")
print(df)
Output:
DataFrame DF1
      Name  Age  Student
0   Dwight   20      Yes
1  Michael   30       No
2      Pam   55       No
3      Jim   27      Yes
DataFrame DF2
      Name      City         State
0   Dwight     Miami       Florida
1  Michael  Scranton  Pennsylvania
2      Pam    Austin         Texas
3      Jim  Scranton  Pennsylvania
DataFrame final
      Name  Age  Student  city_list    State_list
0   Dwight   20      Yes      Miami       Florida
1  Michael   30       No   Scranton  Pennsylvania
2      Pam   55       No     Austin         Texas
3      Jim   27      Yes   Scranton  Pennsylvania
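One caveat with a plain merge on 'Name': pd.merge defaults to an inner join, so names that are missing from DF2 are dropped, and a name that appears several times in DF2 produces several output rows. A possible variant that keeps exactly one row per DF1 row is sketched below. It assumes that taking the first match per name is acceptable, and the sample frames and the lookup variable are made up for illustration.

```python
import pandas as pd

# Made-up sample data: "Dwight" is duplicated in DF2, "Angela" is absent.
DF1 = pd.DataFrame({"Name": ["Dwight", "Angela"], "Age": [20, 40]})
DF2 = pd.DataFrame({"Name": ["Dwight", "Dwight"],
                    "City": ["Miami", "Tampa"],
                    "State": ["Florida", "Florida"]})

# One row per name, keeping the first occurrence.
lookup = DF2.drop_duplicates("Name")

# how="left" keeps every DF1 row; validate raises an error if the right
# side still contains duplicate keys.
df = (DF1.merge(lookup, on="Name", how="left", validate="many_to_one")
         .rename(columns={"City": "city_list", "State": "State_list"}))
print(df)
```

Names without a match end up as NaN in city_list and State_list; calling fillna("") on those columns would reproduce the empty-string behaviour of the original functions.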