How to control the timestamp schema in pandas.to_parquet

I've noticed that the type of a timestamp column in the Parquet file written by pandas.to_parquet can differ depending on the pandas version, for example:

In [1]: pd.__version__                                                                                                 
Out[1]: '1.0.5'


In [2]: pd.DataFrame([pd.Timestamp('2020-01-01')],columns=['a']).to_parquet('/tmp/test.parquet')                      

In [3]: !parquet-tools schema /tmp/test.parquet                                                                        
message schema {
  optional int64 a (TIMESTAMP_MILLIS);
}

In [4]: !parquet-tools head /tmp/test.parquet                                                                          
a = 1577836800000

In [1]: pd.__version__
Out[1]: '1.1.2'

In [2]: pd.DataFrame([pd.Timestamp('2020-01-01')],columns=['a']).to_parquet('/tmp/test.parquet')

In [3]: !parquet-tools schema /tmp/test.parquet
message schema {
  optional int64 a (TIMESTAMP_MICROS);
}

In [4]: !parquet-tools head /tmp/test.parquet                                                                          
a = 1577836800000000

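As shown above, pandas-1.0.5 writes the timestamp column as TIMESTAMP_MILLIS, while pandas-1.1.2 writes it as TIMESTAMP_MICROS.

I'm using pandas-1.1.2, but I need the type to be TIMESTAMP_MILLIS for the downstream consumer of the Parquet file (it is queried by Presto). How can I do that?

I'm using the pyarrow engine.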

Solution

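This can be configured through pyarrow. Fortunately, pd.to_parquet forwards any keyword arguments it does not recognize to the underlying Parquet library.
Looking at the pyarrow docs for ParquetWriter, we find: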

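coerce_timestamps (str, default None): Cast timestamps to a particular resolution. The default depends on version. For version='1.0' (the default), nanoseconds will be cast to microseconds ('us'), and seconds to milliseconds ('ms') by default. For version='2.0', the original resolution is preserved and no casting is done by default. The casting might result in loss of data, in which case allow_truncated_timestamps=True can be used to suppress the raised exception. Valid values: {None, 'ms', 'us'}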

So this means you can coerce the timestamps to milliseconds:

df.to_parquet(path,coerce_timestamps="ms")

or to microseconds:

df.to_parquet(path,coerce_timestamps="us")

Which makes this the code you need:

pd.DataFrame([pd.Timestamp('2020-01-01')],columns=['a']).to_parquet('/tmp/test.parquet',coerce_timestamps="ms")

Also note this part of the documentation:

allow_truncated_timestamps=True can be used to suppress the raised exception.