How to fix: cannot save a Pandas DataFrame to Parquet when the cell values are lists of floats
I have a data frame with the following structure:
Column1 Column2
0 (0.00030271668219938874,0.0002655923890415579... (0.0016430083196610212,0.0014970217598602176,...
1 (0.00015607803652528673,0.0001314736582571640... (0.0022136708721518517,0.0014974646037444472,...
2 (0.011317798867821693,0.011339936405420303,0... (0.004868391435593367,0.004406007472425699,0...
3 (3.94578673876822e-05,3.075833956245333e-05,... (0.0075020878575742245,0.0096737677231431,0....
4 (0.0004926157998852432,0.0003811710048466921,... (0.010351942852139473,0.008231297135353088,0...
.. ... ...
130 (0.011190211400389671,0.011337820440530777,0... (0.010182800702750683,0.011351295746862888,0...
131 (0.006286659277975559,0.007315031252801418,0... (0.02104150503873825,0.02531484328210354,0.0...
132 (0.0022791570518165827,0.0025983047671616077,... (0.008847278542816639,0.009222050197422504,0...
133 (0.0007059817435219884,0.0009831463685259223,... (0.0028264704160392284,0.0029402063228189945,...
134 (0.0018992726691067219,0.002058899961411953,... (0.0019639385864138603,0.002009353833273053,...
[135 rows x 2 columns]
where each cell contains a list/tuple of float values:
type(psd_res.data_frame['Column1'][0])
<class 'tuple'>
type(psd_res.data_frame['Column1'][0][0])
<class 'numpy.float64'>
(The tuple in every cell contains the same number of entries.)
When I try to save the data frame as Parquet, I get this error (fastparquet):
Can't infer object conversion type: 0 (0.00030271668219938874,0.0002655923890415579...
1 (0.00015607803652528673,0.0001314736582571640...
...
Name: Column1,dtype: object
Full stack trace: https://pastebin.com/8Myu8hNV
I also tried the other engine, pyarrow:
pyarrow.lib.ArrowInvalid: ('Could not convert (0.00030271668219938874,...,0.0002464042045176029)
with type tuple: did not recognize Python value type when inferring an Arrow data type','Conversion failed for column UO-Pumpe with type object')
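I then found this thread: https://github.com/dask/fastparquet/issues/458. It appears to be a bug in fastparquet, and it is supposed to work with pyarrow, yet it fails for me there as well.
I then tried a few things I had come across, such as infer_objects()
and astype(float)
... nothing has worked so far.
Does anyone have a solution for saving my data frame to Parquet?
Solution
The cells of your data frame contain tuples of floats. That is an unusual data type.
So you need to give Arrow a little help in determining the type of your data. To do that, provide the schema of the table explicitly.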
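import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "column1": [(1.0, 2.0), (3.0, 4.0, 5.0)]
    }
)
# Declare the column as a list of float64 so Arrow does not have to infer it
schema = pa.schema([pa.field('column1', pa.list_(pa.float64()))])
# schema is passed through to the pyarrow engine
df.to_parquet('/tmp/hello.pq', engine='pyarrow', schema=schema)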
Note that if you were using lists of floats (instead of tuples), it would work without a schema:
df = pd.DataFrame(
    {
        "column1": [[1.0, 2.0], [3.0, 5.0]]
    }
)
df.to_parquet('/tmp/hello.pq')
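Since lists work where tuples do not, another option is to convert the tuple cells to lists before writing. The following is a minimal sketch, assuming every cell of an affected column is a tuple; the helper name `tuples_to_lists` and the sample values are hypothetical and not part of the original question.

```python
import numpy as np
import pandas as pd

# Small stand-in for the frame in the question: tuples of numpy.float64
df = pd.DataFrame(
    {
        "Column1": [(np.float64(0.1), np.float64(0.2)), (np.float64(0.3), np.float64(0.4))],
        "Column2": [(np.float64(1.0), np.float64(2.0)), (np.float64(3.0), np.float64(4.0))],
    }
)


def tuples_to_lists(frame):
    """Return a copy in which every all-tuple column holds lists of Python floats."""
    out = frame.copy()
    for name in out.columns:
        # Only touch columns whose cells are all tuples
        if out[name].map(lambda cell: isinstance(cell, tuple)).all():
            out[name] = out[name].map(lambda cell: [float(x) for x in cell])
    return out


converted = tuples_to_lists(df)
print(type(converted["Column1"][0]))
```

The converted frame can then be written with `converted.to_parquet(...)` as in the list example above. The original `df` is left unchanged because the helper works on a copy.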