Converting a pandas DataFrame to a fixed-size segment array

I am struggling to convert a DataFrame into an array of fixed-size segments that I can feed to a convolutional neural network. Specifically, I want to go from a df to a list of m arrays, each containing segments of shape (1, 5, 4), so that at the end I have an (m, 1, 5, 4) array.

To clarify my problem, I will illustrate with this MWE. Suppose this is my df:

df = {
    'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'speed': [17.63, 17.63, 0.17, 1.41, 0.61, 0.32, 0.18, 0.43, 0.30, 0.46, 0.75, 0.37],
    'acc': [0.00, -0.09, 1.24, -0.80, -0.29, -0.14, 0.25, -0.13, 0.16, 0.29, -0.38, 0.27],
    'jerk': [0.00, 0.01, -2.04, 0.51, 0.15, 0.39, -0.38, 0.29, 0.13, -0.67, 0.65, 0.52],
    'bearing': [29.03, 56.12, 18.49, 11.85, 36.75, 27.52, 81.08, 51.06, 19.85, 10.76, 14.51, 24.27],
    'label': [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] }

df = pd.DataFrame.from_dict(df)

To do this, I use the following function:

def df_transformer(dataframe, chunk_size=5):

    grouped = dataframe.groupby('id')

    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])

    # loop over segments (id)
    for _, group in grouped:

        inputs = group.loc[:, 'speed':'bearing'].values
        label = group.loc[:, 'label'].values[0]

        # calculate number of splits
        N = len(inputs) // chunk_size

        if N > 0:
            inputs = np.array_split(inputs, [chunk_size]*N)
        else:
            inputs = [inputs]

        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)), (0, 0)], mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)

    return X, y

The above df has 12 rows, so if it is transformed correctly into the expected form I should get an array of shape (3, 1, 5, 4): the 12 rows split into chunks of 5, 5 and 2 rows, and in the function above any segment with fewer than 5 rows is zero-padded so that every segment ends up with shape (1, 5, 4).

Currently, this function has two problems:

  1. The function only works when my df has fewer than 10 rows.

So this works (the last row gets zero-padded below, as expected):

X, y = df_transformer(df[:9])
X
array([[[[ 1.763e+01,  0.000e+00,  0.000e+00,  2.903e+01],
         [ 1.763e+01, -9.000e-02,  1.000e-02,  5.612e+01],
         [ 1.700e-01,  1.240e+00, -2.040e+00,  1.849e+01],
         [ 1.410e+00, -8.000e-01,  5.100e-01,  1.185e+01],
         [ 6.100e-01, -2.900e-01,  1.500e-01,  3.675e+01]]],


       [[[ 3.200e-01, -1.400e-01,  3.900e-01,  2.752e+01],
         [ 1.800e-01,  2.500e-01, -3.800e-01,  8.108e+01],
         [ 4.300e-01, -1.300e-01,  2.900e-01,  5.106e+01],
         [ 3.000e-01,  1.600e-01,  1.300e-01,  1.985e+01],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00]]]])

But in this case an all-zero array (segment) gets introduced:

X, y = df_transformer(df[:10])
X
array([[[[ 1.763e+01,  0.000e+00,  0.000e+00,  2.903e+01],
         [ 1.763e+01, -9.000e-02,  1.000e-02,  5.612e+01],
         [ 1.700e-01,  1.240e+00, -2.040e+00,  1.849e+01],
         [ 1.410e+00, -8.000e-01,  5.100e-01,  1.185e+01],
         [ 6.100e-01, -2.900e-01,  1.500e-01,  3.675e+01]]],


       [[[ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00]]],


       [[[ 3.200e-01, -1.400e-01,  3.900e-01,  2.752e+01],
         [ 1.800e-01,  2.500e-01, -3.800e-01,  8.108e+01],
         [ 4.300e-01, -1.300e-01,  2.900e-01,  5.106e+01],
         [ 3.000e-01,  1.600e-01,  1.300e-01,  1.985e+01],
         [ 4.600e-01,  2.900e-01, -6.700e-01,  1.076e+01]]]])

  2. The function fails if I pass the whole df (I do not understand the error, but it seems to be related to the padding of segments with fewer than 5 rows).

So in this case I get an index can't contain negative values error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-1fc559db37eb> in <module>()
----> 1 X,y = df_transformer(df)

2 frames
<ipython-input-4-9e1c49985863> in df_transformer(dataframe, chunk_size)
     24             inpt = np.pad(
     25                 inpt, [(0, chunk_size-len(inpt)), (0, 0)],
---> 26                 mode='constant')
     27             # add each inputs split to accumulators
     28             X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)

<__array_function__ internals> in pad(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in pad(array, pad_width, mode, **kwargs)
    746 
    747     # Broadcast to shape (array.ndim, 2)
--> 748     pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
    749 
    750     if callable(mode):

/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in _as_pairs(x, ndim, as_index)
    517 
    518     if as_index and x.min() < 0:
--> 519         raise ValueError("index can't contain negative values")
    520 
    521     # Converting the array with `tolist` seems to improve performance

ValueError: index can't contain negative values
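
As an aside, both symptoms seem to trace back to the same line: np.array_split interprets a list second argument as split positions, not chunk lengths. A minimal sketch of this (my illustration, with made-up variable names):

import numpy as np

rows = np.arange(12).reshape(12, 1)    # stand-in for the 12 input rows
chunks = np.array_split(rows, [5, 5])  # what [chunk_size]*N passes when N == 2
print([len(c) for c in chunks])        # [5, 0, 7]
# the empty chunk pads into an all-zero segment, and the 7-row chunk makes
# chunk_size - len(inpt) == -2, which is the negative pad width np.pad rejects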

Expected output:

X, y = df_transformer(df)
X
array([[[[ 1.763e+01,  0.000e+00,  0.000e+00,  2.903e+01],
         [ 1.763e+01, -9.000e-02,  1.000e-02,  5.612e+01],
         [ 1.700e-01,  1.240e+00, -2.040e+00,  1.849e+01],
         [ 1.410e+00, -8.000e-01,  5.100e-01,  1.185e+01],
         [ 6.100e-01, -2.900e-01,  1.500e-01,  3.675e+01]]],


       [[[ 3.200e-01, -1.400e-01,  3.900e-01,  2.752e+01],
         [ 1.800e-01,  2.500e-01, -3.800e-01,  8.108e+01],
         [ 4.300e-01, -1.300e-01,  2.900e-01,  5.106e+01],
         [ 3.000e-01,  1.600e-01,  1.300e-01,  1.985e+01],
         [ 4.600e-01,  2.900e-01, -6.700e-01,  1.076e+01]]],


       [[[ 7.500e-01, -3.800e-01,  6.500e-01,  1.451e+01],
         [ 3.700e-01,  2.700e-01,  5.200e-01,  2.427e+01],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00]]]])

Can someone help me fix this problem? The MWE above reproduces the error.

EDIT

RichieV's answer also has a bug. Although it works on the given MWE, it does not do the job correctly in the following case (extending the df to twice its size):

df = {
    'id': [1]*12 + [2]*12,
    'speed': [17.63, 17.63, 0.17, 1.41, 0.61, 0.32, 0.18, 0.43, 0.30, 0.46, 0.75, 0.37]*2,
    'acc': [0.00, -0.09, 1.24, -0.80, -0.29, -0.14, 0.25, -0.13, 0.16, 0.29, -0.38, 0.27]*2,
    'jerk': [0.00, 0.01, -2.04, 0.51, 0.15, 0.39, -0.38, 0.29, 0.13, -0.67, 0.65, 0.52]*2,
    'bearing': [29.03, 56.12, 18.49, 11.85, 36.75, 27.52, 81.08, 51.06, 19.85, 10.76, 14.51, 24.27]*2,
    'label': [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]*2 }
df = pd.DataFrame.from_dict(df)

X, y = df_transformer(df, chunk_size=5)
print(X[:3])

[[[[ 1.763e+01  0.000e+00  0.000e+00  2.903e+01]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
   [ 3.700e-01  2.700e-01  5.200e-01  2.427e+01]]]


 [[[ 7.500e-01 -3.800e-01  6.500e-01  1.451e+01]
   [ 3.000e-01  1.600e-01  1.300e-01  1.985e+01]
   [ 4.600e-01  2.900e-01 -6.700e-01  1.076e+01]
   [ 1.800e-01  2.500e-01 -3.800e-01  8.108e+01]
   [ 3.200e-01 -1.400e-01  3.900e-01  2.752e+01]]]


 [[[ 6.100e-01 -2.900e-01  1.500e-01  3.675e+01]
   [ 1.410e+00 -8.000e-01  5.100e-01  1.185e+01]
   [ 1.700e-01  1.240e+00 -2.040e+00  1.849e+01]
   [ 1.763e+01 -9.000e-02  1.000e-02  5.612e+01]
   [ 4.300e-01 -1.300e-01  2.900e-01  5.106e+01]]]]

Note that the first element is different from the one in the answer (here the second, third and fourth rows come out as all zeros).

Solution

You can pad the df once instead of padding on every iteration.

Take the data with a second id:

df = {
    'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    'speed': [17.63, 17.63, 0.17, 1.41, 0.61, 0.32, 0.18, 0.43, 0.30, 0.46, 0.75, 0.37],
    'acc': [0.00, -0.09, 1.24, -0.80, -0.29, -0.14, 0.25, -0.13, 0.16, 0.29, -0.38, 0.27],
    'jerk': [0.00, 0.01, -2.04, 0.51, 0.15, 0.39, -0.38, 0.29, 0.13, -0.67, 0.65, 0.52],
    'bearing': [29.03, 56.12, 18.49, 11.85, 36.75, 27.52, 81.08, 51.06, 19.85, 10.76, 14.51, 24.27],
    'label': [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] }
df = pd.DataFrame.from_dict(df)
print(df)

    id  speed   acc  jerk  bearing  label
0    1  17.63  0.00  0.00    29.03      3
1    1  17.63 -0.09  0.01    56.12      3
2    1   0.17  1.24 -2.04    18.49      3
3    1   1.41 -0.80  0.51    11.85      3
4    1   0.61 -0.29  0.15    36.75      3
5    1   0.32 -0.14  0.39    27.52      3
6    1   0.18  0.25 -0.38    81.08      3
7    1   0.43 -0.13  0.29    51.06      3
8    1   0.30  0.16  0.13    19.85      3
9    2   0.46  0.29 -0.67    10.76      3
10   2   0.75 -0.38  0.65    14.51      3
11   2   0.37  0.27  0.52    24.27      3

And the code:

def df_transformer(df, chunk_size=5):
    ### pad df with 0's so len(df) is exactly a multiple of chunk_size
    df = pd.concat([df,
        pd.DataFrame([[id] + [0] * 5 # add row with zeros
                for id, ct in df.groupby('id').size().iteritems() # for each id
                for row in range(chunk_size - ct % chunk_size)] # as many times as needed
            , columns=df.columns)
    ]).sort_values('id', kind='mergesort', ignore_index=True)
    # print(df)
    X, y = [], []
    for _, group in df.groupby(df.index//5):
        X.append(group.iloc[:, 1:-1].values[np.newaxis, ...])
        y.append(group.iloc[0, -1]) # not sure how you want y to be structured
    return np.array(X), np.array(y)
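
Unpacked for readability, the dense concat/comprehension above is equivalent to the following (my expansion, same behavior):

pad_rows = []
for id, ct in df.groupby('id').size().iteritems():  # row count per id
    for _ in range(chunk_size - ct % chunk_size):   # free slots in the id's last chunk
        pad_rows.append([id] + [0] * 5)             # one zero row (5 non-id columns)
df = pd.concat([df, pd.DataFrame(pad_rows, columns=df.columns)])
df = df.sort_values('id', kind='mergesort', ignore_index=True)  # stable sort keeps row order

One caveat worth noting: when an id's row count is already an exact multiple of chunk_size, ct % chunk_size == 0 and this appends a full extra chunk of zero rows; wrapping the count as (chunk_size - ct % chunk_size) % chunk_size would avoid that.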


X, y = df_transformer(df, chunk_size=5)
print(X)

Output

[[[[ 1.763e+01  0.000e+00  0.000e+00  2.903e+01]
   [ 1.763e+01 -9.000e-02  1.000e-02  5.612e+01]
   [ 1.700e-01  1.240e+00 -2.040e+00  1.849e+01]
   [ 1.410e+00 -8.000e-01  5.100e-01  1.185e+01]
   [ 6.100e-01 -2.900e-01  1.500e-01  3.675e+01]]]

 [[[ 3.200e-01 -1.400e-01  3.900e-01  2.752e+01]
   [ 1.800e-01  2.500e-01 -3.800e-01  8.108e+01]
   [ 4.300e-01 -1.300e-01  2.900e-01  5.106e+01]
   [ 3.000e-01  1.600e-01  1.300e-01  1.985e+01]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]]]

 [[[ 4.600e-01  2.900e-01 -6.700e-01  1.076e+01]
   [ 7.500e-01 -3.800e-01  6.500e-01  1.451e+01]
   [ 3.700e-01  2.700e-01  5.200e-01  2.427e+01]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]]]]

Note that the first two segments come from id==1 and the last one from id==2, each with its own zero padding.
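
A variation that avoids relying on a sort at all (for inputs like the doubled df in the edit above) is to chunk inside each groupby('id') group and pad each id's last chunk directly. This is a sketch of that idea, not code from the original answer; df_transformer_per_id is a hypothetical name and the column layout of the MWE is assumed:

import numpy as np

def df_transformer_per_id(df, chunk_size=5):
    # chunk each id independently; only an id's final chunk gets zero rows
    X, y = [], []
    for _, group in df.groupby('id', sort=False):
        values = group.loc[:, 'speed':'bearing'].to_numpy()
        label = group['label'].iloc[0]
        n_chunks = -(-len(values) // chunk_size)        # ceiling division
        pad_rows = n_chunks * chunk_size - len(values)  # 0 .. chunk_size-1 rows
        values = np.pad(values, [(0, pad_rows), (0, 0)], mode='constant')
        for chunk in values.reshape(n_chunks, chunk_size, -1):
            X.append(chunk[np.newaxis])                 # shape (1, chunk_size, 4)
            y.append(label)
    return np.array(X), np.array(y)

On the doubled MWE this yields six segments of shape (1, 5, 4), three per id, with zero rows only in each id's final segment.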
