Create a column that conditionally removes the unwanted part of a string

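I'm new to Python and I'm stuck. I have a DataFrame like the one below, and I'm trying to create a new column containing the macro genre derived from the Genres column.

The DataFrame: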

import pandas as pd
d = {
    'Genres': ['Finance', 'Arcade', 'Business', 'Photography', 'Entertainment;Brain Games', 'Medical', 'Tools', 'Casual;Brain Games', 'Medical', 'Entertainment'],
    'Last Updated': ['March 10,2018', 'May 24,2018', 'April 11,2018', 'November 6,2014', 'March 9,2018', 'May 17,2018', 'June 3,2016', 'April 10,2016', 'July 16,2018', 'May 17,2017'],
}
df = pd.DataFrame(data=d)
df

                       Genres        Last Updated
0                     Finance      March 10,2018
1                      Arcade        May 24,2018
2                    Business      April 11,2018
3                 Photography    November 6,2014
4   Entertainment;Brain Games       March 9,2018
5                     Medical        May 17,2018
6                       Tools        June 3,2016
7          Casual;Brain Games      April 10,2016
8                     Medical       July 16,2018
9               Entertainment        May 17,2017

The desired output is:

                       Genres          macro_genres        Last Updated
0                     Finance               Finance      March 10,2018
1                      Arcade                Arcade        May 24,2018
2                    Business              Business      April 11,2018
3                 Photography           Photography    November 6,2014
4   Entertainment;Brain Games         Entertainment       March 9,2018
5                     Medical               Medical        May 17,2018
6                       Tools                 Tools        June 3,2016
7          Casual;Brain Games                Casual      April 10,2016
8                     Medical               Medical       July 16,2018
9               Entertainment         Entertainment        May 17,2017

What I have tried:

def macro_genre(i):
    for i in df['Genres']:
        if ';' in i:
            j = i.split(';')[0]
            return j
        else:
            return i
                    
df['macro_genres'] = df['Genres'].apply(macro_genre)

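But it doesn't work. It creates the column, but repeats the first value for the entire column.

When I run the for loop outside the function, it works.

Any tips would be greatly appreciated! Thanks!!!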

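Solution

You can simply use str.split(';'). If the string contains no ';', nothing is split and the call returns a one-element list holding the original string, so indexing with [0] is always safe: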

df['macro_genres'] = df['Genres'].apply(lambda x: x.split(';')[0])
print(df)

Prints:

                      Genres      Last Updated   macro_genres
0                    Finance    March 10,2018        Finance
1                     Arcade      May 24,2018         Arcade
2                   Business    April 11,2018       Business
3                Photography  November 6,2014    Photography
4  Entertainment;Brain Games     March 9,2018  Entertainment
5                    Medical      May 17,2018        Medical
6                      Tools      June 3,2016          Tools
7         Casual;Brain Games    April 10,2016         Casual
8                    Medical     July 16,2018        Medical
9              Entertainment      May 17,2017  Entertainment
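
As for why the original attempt repeats the first value: apply already calls the function once per cell, but the function ignores its argument, loops over the whole df['Genres'] column again, and returns during the first iteration, so every call yields 'Finance'. A minimal sketch of a corrected named function, assuming a shortened three-row sample frame (the parameter name genre is only illustrative):

```python
import pandas as pd

# Shortened sample data, assumed here for illustration only.
df = pd.DataFrame({'Genres': ['Finance', 'Entertainment;Brain Games', 'Casual;Brain Games']})

def macro_genre(genre):
    # `genre` is a single cell value passed in by apply,
    # so there is no need to loop over df['Genres'] here.
    return genre.split(';')[0]

df['macro_genres'] = df['Genres'].apply(macro_genre)
print(df['macro_genres'].tolist())  # ['Finance', 'Entertainment', 'Casual']
```

The if/else from the question is not needed either, because split returns the whole string as the only list element when no ';' is present.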

Another option is to use map:

df['macro_genres'] = df['Genres'].astype(str).map(lambda x : x.split(';')[0])

Output:

>>> df
                       Genres          macro_genres        Last Updated
0                     Finance               Finance      March 10,2018
1                      Arcade                Arcade        May 24,2018
2                    Business              Business      April 11,2018
3                 Photography           Photography    November 6,2014
4   Entertainment;Brain Games         Entertainment       March 9,2018
5                     Medical               Medical        May 17,2018
6                       Tools                 Tools        June 3,2016
7          Casual;Brain Games                Casual      April 10,2016
8                     Medical               Medical       July 16,2018
9               Entertainment         Entertainment        May 17,2017
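
A third variant, which also appears in the timings in this answer, is the vectorized .str accessor. The sketch below assumes a small Series with one missing value (not part of the original data) to show that the accessor passes missing entries through instead of raising, which a plain x.split(';') inside a lambda would not do:

```python
import pandas as pd

# Hypothetical sample with a missing entry, for illustration only.
genres = pd.Series(['Finance', 'Entertainment;Brain Games', None])

# .str.split returns a list per row; .str[0] picks the first element.
# Missing values stay missing rather than causing an error.
macro = genres.str.split(';').str[0]
print(macro.tolist())
```

This may be slower than apply or map, but it can be convenient when the column is known to contain gaps.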

Runtime comparison on a 1k-row DataFrame:

#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
535 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

#str split method (slowest)
>>> %timeit -n 1000 df['Genres'].str.split(';').str[0]
1.36 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

#map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
527 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

Runtime comparison on a 10k-row DataFrame:

#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
3.62 ms ± 105 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

#str split method (slowest)
>>> %timeit -n 1000 df['Genres'].str.split(';').str[0]
10 ms ± 259 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

#map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
3.47 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

Runtime comparison on a 50k-row DataFrame:

#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
17 ms ± 133 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

#map method 
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
16.7 ms ± 278 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

Runtime comparison on a 100k-row DataFrame:

#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
34.1 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs,1000 loops each)

#map method 
>>> %timeit -n 1000 df['Genres'].astype(str).map(lambda x : x.split(';')[0])
35.5 ms ± 596 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)
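
The answer does not say how the larger DataFrames were built. One way to reproduce a comparison like this outside IPython is sketched below, assuming the test frame is made by repeating a few sample rows; the names base, big and candidates are hypothetical, and absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

# Assumption: a small sample frame repeated to reach 1,000 rows.
base = pd.DataFrame({'Genres': ['Finance', 'Entertainment;Brain Games',
                                'Casual;Brain Games', 'Medical']})
big = pd.concat([base] * 250, ignore_index=True)

# The three approaches compared in the timings, wrapped as callables.
candidates = {
    'apply': lambda: big['Genres'].apply(lambda x: x.split(';')[0]),
    'str.split': lambda: big['Genres'].str.split(';').str[0],
    'map': lambda: big['Genres'].map(lambda x: x.split(';')[0]),
}

runs = 100
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=runs)
    print(f'{name}: {seconds / runs * 1e6:.0f} us per loop')
```

Changing the repeat factor (250 here) scales the frame to 10k, 50k or 100k rows.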
