How to create a column that conditionally strips the unwanted part of a string
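I'm new to Python and I'm stuck. I have a dataframe like the one below, and I'm trying to create a new column holding the macro genre of each entry in the Genres column.
Dataframe: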
import pandas as pd
# Sample data from the question (values match the table shown below)
d = {
    'Genres': ['Finance', 'Arcade', 'Business', 'Photography', 'Entertainment;Brain Games', 'Medical', 'Tools', 'Casual;Brain Games', 'Medical', 'Entertainment'],
    'Last Updated': ['March 10,2018', 'May 24,2018', 'April 11,2018', 'November 6,2014', 'March 9,2018', 'May 17,2018', 'June 3,2016', 'April 10,2016', 'July 16,2018', 'May 17,2017'],
}
df = pd.DataFrame(data=d)
df
   Genres                     Last Updated
0  Finance                    March 10,2018
1  Arcade                     May 24,2018
2  Business                   April 11,2018
3  Photography                November 6,2014
4  Entertainment;Brain Games  March 9,2018
5  Medical                    May 17,2018
6  Tools                      June 3,2016
7  Casual;Brain Games         April 10,2016
8  Medical                    July 16,2018
9  Entertainment              May 17,2017
The desired output is:
   Genres                     macro_genres   Last Updated
0  Finance                    Finance        March 10,2018
1  Arcade                     Arcade         May 24,2018
2  Business                   Business       April 11,2018
3  Photography                Photography    November 6,2014
4  Entertainment;Brain Games  Entertainment  March 9,2018
5  Medical                    Medical        May 17,2018
6  Tools                      Tools          June 3,2016
7  Casual;Brain Games         Casual         April 10,2016
8  Medical                    Medical        July 16,2018
9  Entertainment              Entertainment  May 17,2017
What I tried:
def macro_genre(i):
    for i in df['Genres']:
        if ';' in i:
            j = i.split(';')[0]
            return j
        else:
            return i
df['macro_genres'] = df['Genres'].apply(macro_genre)
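But it doesn't work. It creates the column, but repeats the first value down the entire column.
When I run the for part outside the function, it works.
Any hints would be greatly appreciated! Thanks!!!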
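Solution
You can simply use str.split(';'). If the string does not contain ';', nothing is split and you get back a list containing the original string, so you can always take element [0]: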
df['macro_genres'] = df['Genres'].apply(lambda x: x.split(';')[0])
print(df)
Prints:
   Genres                     Last Updated     macro_genres
0  Finance                    March 10,2018    Finance
1  Arcade                     May 24,2018      Arcade
2  Business                   April 11,2018    Business
3  Photography                November 6,2014  Photography
4  Entertainment;Brain Games  March 9,2018     Entertainment
5  Medical                    May 17,2018      Medical
6  Tools                      June 3,2016      Tools
7  Casual;Brain Games         April 10,2016    Casual
8  Medical                    July 16,2018     Medical
9  Entertainment              May 17,2017      Entertainment
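As for why the function in the question repeated the first value: apply already calls the function once per element and passes that element in as i. The for i in df['Genres'] loop inside the function then overwrites i with the first row of the whole column and returns during its first iteration, so every call yields the same value. A minimal corrected sketch, assuming a few rows of the sample data from the question, simply drops the inner loop:

```python
import pandas as pd

# A few rows from the question's sample data.
df = pd.DataFrame({
    'Genres': ['Finance', 'Entertainment;Brain Games', 'Casual;Brain Games'],
})

def macro_genre(genre):
    # apply() passes one cell at a time, so no loop over the column is needed.
    # split(';') returns a one-element list when ';' is absent,
    # so taking [0] is safe in both cases.
    return genre.split(';')[0]

df['macro_genres'] = df['Genres'].apply(macro_genre)
print(df['macro_genres'].tolist())
# ['Finance', 'Entertainment', 'Casual']
```

This behaves the same as the lambda version above; the named function just makes the per-element logic explicit.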
Another option is to use map:
df['macro_genres'] = df['Genres'].astype(str).map(lambda x : x.split(';')[0])
Output:
>>> df
   Genres                     macro_genres   Last Updated
0  Finance                    Finance        March 10,2018
1  Arcade                     Arcade         May 24,2018
2  Business                   Business       April 11,2018
3  Photography                Photography    November 6,2014
4  Entertainment;Brain Games  Entertainment  March 9,2018
5  Medical                    Medical        May 17,2018
6  Tools                      Tools          June 3,2016
7  Casual;Brain Games         Casual         April 10,2016
8  Medical                    Medical        July 16,2018
9  Entertainment              Entertainment  May 17,2017
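The timings that follow also include a third variant built on pandas' vectorized .str accessor, which is not shown as code above. A minimal sketch of it follows; the small Series s, including its missing value, is made up here for illustration. The sketch also shows one way to keep map from failing on missing values, using the na_action='ignore' option of Series.map.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value, for illustration only.
s = pd.Series(['Finance', 'Casual;Brain Games', np.nan])

# Vectorized: split each string on ';' and keep the first piece.
# Missing values are passed through as missing.
via_str = s.str.split(';').str[0]

# map with na_action='ignore' skips missing values instead of
# calling the lambda on them (which would raise AttributeError).
via_map = s.map(lambda x: x.split(';')[0], na_action='ignore')

print(via_str.tolist())
print(via_map.tolist())
```

If the column is known to contain only strings, as in the question's sample, the plain apply or map calls above are sufficient.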
Runtime comparison on a 1k-row dataframe:
# apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
535 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# str split method (slowest)
>>> %timeit -n 1000 df['Genres'].str.split(';').str[0]
1.36 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
527 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Runtime comparison on a 10k-row dataframe:
# apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
3.62 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# str split method (slowest)
>>> %timeit -n 1000 df['Genres'].str.split(';').str[0]
10 ms ± 259 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
3.47 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Runtime comparison on a 50k-row dataframe:
# apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
17 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
16.7 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Runtime comparison on a 100k-row dataframe:
# apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
34.1 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# map method
>>> %timeit -n 1000 df['Genres'].astype(str).map(lambda x : x.split(';')[0])
35.5 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
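These figures come from IPython's %timeit magic on the answerer's machine and will differ elsewhere. As a rough way to repeat the comparison in a plain script, the sketch below uses the standard-library timeit module. The synthetic column (a few sample genres repeated up to 10,000 rows) and the call count of 20 are assumptions chosen for illustration, not the original benchmark setup.

```python
import timeit

import pandas as pd

# Synthetic data: repeat a few sample genres to reach 10,000 rows (assumed size).
genres = ['Finance', 'Arcade', 'Entertainment;Brain Games', 'Casual;Brain Games']
df = pd.DataFrame({'Genres': genres * 2500})

candidates = {
    'apply': lambda: df['Genres'].apply(lambda x: x.split(';')[0]),
    'str.split': lambda: df['Genres'].str.split(';').str[0],
    'map': lambda: df['Genres'].map(lambda x: x.split(';')[0]),
}

# Sanity check: all three approaches should produce the same values.
expected = candidates['apply']().tolist()
for name, func in candidates.items():
    assert func().tolist() == expected, name

for name, func in candidates.items():
    # Average wall-clock time per call over 20 calls.
    seconds = timeit.timeit(func, number=20) / 20
    print(f'{name:>10}: {seconds * 1e3:.2f} ms per call')
```

Absolute numbers depend on hardware and pandas version, so only the relative ordering is worth comparing against the timings above.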