Reading bigint (int8) column data from Redshift with pandas without scientific notation

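I am reading data from Redshift using pandas. One of my columns is a bigint (int8) column, and its values come through in exponential notation. I tried the approaches below, but in each case the data gets truncated.

A sample value from the column is 635284328055690862. It is read as 6.352843e+17.

I tried converting it to int64 in Python.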

import numpy as np
df["column_name"] = df["column_name"].astype(np.int64)

In this case the output is 635284328055690880. I am losing data here: the trailing digits have been rounded.

Expected output: 635284328055690862

I get the same result even when I do this.

pd.set_option('display.float_format',lambda x: '%.0f' % x)

Output: 635284328055690880

Expected output: 635284328055690862

This seems to be normal pandas behavior. I even tried building the DataFrame from a list and still got the same result.

import pandas as pd
import numpy as np

pd.set_option('display.float_format',lambda x: '%.0f' % x)
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0.
sample_data = [[635284328055690862,758364950923147626],[np.nan,np.nan],[1,3]]
df = pd.DataFrame(sample_data)
print(df)


Output:
0 635284328055690880 758364950923147648
1                nan                nan
2                  1                  3

What I have noticed is that the problem occurs whenever there is a NaN in the DataFrame.

I am using the following code to fetch the data from Redshift.

from sqlalchemy import create_engine 
import pandas as pd  
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>' 
engine = create_engine(connstr) 
with engine.connect() as conn,conn.begin():     
    df = pd.read_sql('''select * from schema.table_name''',conn)
print(df)

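Please help me solve this problem. Thanks in advance.

Solution

This happens because the standard integer data types provide no way to represent missing data. Since the floating-point data types do provide nan, the old way of handling this was to convert numeric columns with missing data to float.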
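To see why the detour through float changes the value, here is a small sketch in plain Python (no pandas needed). A float64 has a 53-bit significand, so integers above 2**53 can only be represented at a coarser spacing. For values between 2**59 and 2**60, which is where the sample value falls, the representable values are 128 apart.

```python
value = 635284328055690862

# Integers up to 2**53 survive a round trip through float64 unchanged.
assert int(float(2**53)) == 2**53

# Above that, the value is rounded to the nearest representable float.
as_float = float(value)
round_tripped = int(as_float)

print(round_tripped)          # 635284328055690880
print(round_tripped - value)  # 18
print(round_tripped % 128)    # 0, a multiple of the 128-wide spacing at this magnitude
```

This is the same 635284328055690880 that shows up in the question, so the digits are lost at the moment the column becomes float, not when it is displayed.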

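To address this, pandas introduced the nullable integer data type. If you are doing something as simple as reading a CSV, you can specify this type explicitly in the call to read_csv, like so: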

>>> pandas.read_csv('sample.csv',dtype="Int64")
             column_a  column_b
0  635284328055690880     45564
1                <NA>        45
2                   1      <NA>
3                   1         5

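However, the problem persists! It seems that even though 635284328055690862 can be represented as a 64-bit integer, at some point pandas still passes the value through a floating-point conversion step, which changes the value. This is odd, and may even be worth raising as an issue with the pandas developers.

The best workaround I can see in this case is to use the "object" data type, like so: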

>>> pandas.read_csv('sample.csv',dtype="object")
             column_a column_b
0  635284328055690862    45564
1                 NaN       45
2                   1      NaN
3                   1        5

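This preserves the exact value of the large integer and also allows NaN values. However, because these are now arrays of Python objects, there will be a significant performance hit for compute-intensive tasks. Moreover, on closer inspection they turn out to be Python str objects, so we still need another conversion step. To my surprise, there was no straightforward way to do it. This is the best I could come up with: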

def col_to_intNA(col):
    return {ix: pandas.NA if pandas.isnull(v) else int(v)
            for ix,v in col.to_dict().items()}

sample = {col: col_to_intNA(sample[col])
          for col in sample.columns}
sample = pandas.DataFrame(sample,dtype="Int64")

This gives the expected result:

>>> sample
             column_a  column_b
0  635284328055690862     45564
1                <NA>        45
2                   1      <NA>
3                   1         5
>>> sample.dtypes
column_a    Int64
column_b    Int64
dtype: object
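A more compact variant of the same conversion, as a sketch only: it assumes a pandas version that has pandas.array and the "Int64" dtype, and the helper name to_nullable_int is made up for illustration. The idea is unchanged: turn each value into a Python int (or pandas.NA) first, so that nothing ever passes through float.

```python
import pandas as pd

def to_nullable_int(values):
    """Convert digit strings / ints / missing values to a nullable Int64 array.

    Hypothetical helper: each value becomes a Python int or pd.NA before
    pandas sees it, so no float conversion is involved.
    """
    cleaned = [pd.NA if pd.isna(v) else int(v) for v in values]
    return pd.array(cleaned, dtype="Int64")

# Stand-in for one column read with dtype="object" (values arrive as str).
col = to_nullable_int(["635284328055690862", None, "1", "1"])
print(col)
```

It could then be applied per column, for example sample[name] = to_nullable_int(sample[name]).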

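So that solves one problem. But a second one arises: to read from a Redshift database you would normally use read_sql, which provides no way to specify data types.

So we'll have to roll our own! This is based on the code you posted plus some code from the pandas_redshift library. It uses psycopg2 directly rather than sqlalchemy, because I'm not sure whether sqlalchemy offers a cursor_factory argument that accepts RealDictCursor. Note: I have not tested this at all, because I'm too lazy to set up a postgres database just to test a StackOverflow answer! I think it should work, but I'm not sure. Please let me know whether it works and/or what needs correcting.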

import psycopg2
from psycopg2.extras import RealDictCursor  # Return each row as a dict keyed by column name.

import pandas

def row_null_to_NA(row):
    # Replace SQL NULLs (None) with pandas.NA so the columns can stay Int64.
    return {col: pandas.NA if pandas.isnull(val) else val
            for col, val in row.items()}

# psycopg2 expects a libpq-style URI, not SQLAlchemy's 'redshift+psycopg2://' form.
connstr = 'postgresql://<username>:<password>@<cluster_name>/<db_name>'

# Connect outside the try block so `finally` never sees an unbound `conn`.
conn = psycopg2.connect(connstr, cursor_factory=RealDictCursor)
try:  # `with conn:` only closes the transaction, not the connection
    cursor = conn.cursor()
    cursor.execute('''select * from schema.table_name''')

    # The DataFrame constructor accepts generators of dictionary rows.
    df = pandas.DataFrame(
        (row_null_to_NA(row) for row in cursor.fetchall()), dtype="Int64"
    )
finally:
    conn.close()

print(df)

Note that this assumes all of your columns are integer columns. If they are not, you may need to load the data column by column.
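For a table with mixed column types, a per-column conversion might look like the sketch below. It is untested against a real Redshift cluster: the rows list is a made-up stand-in for what a RealDictCursor would return, and rows_to_frame and its int_columns parameter are hypothetical names, not part of pandas or psycopg2.

```python
import pandas as pd

def rows_to_frame(rows, int_columns):
    """Build a DataFrame from dict rows, using Int64 only for int_columns.

    Hypothetical helper: `rows` is a list of dicts keyed by column name,
    `int_columns` is the set of columns known to hold integers.
    """
    columns = list(rows[0].keys()) if rows else []
    data = {}
    for col in columns:
        values = [row[col] for row in rows]
        if col in int_columns:
            # Keep exact Python ints and map SQL NULL (None) to pd.NA.
            data[col] = pd.array(
                [pd.NA if v is None else int(v) for v in values], dtype="Int64"
            )
        else:
            data[col] = values  # leave other columns to pandas' own inference
    return pd.DataFrame(data)

# Made-up rows standing in for cursor.fetchall().
rows = [
    {"id": 635284328055690862, "name": "a"},
    {"id": None, "name": "b"},
    {"id": 1, "name": "c"},
]
df = rows_to_frame(rows, int_columns={"id"})
print(df)
print(df.dtypes)
```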

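One possible fix: instead of select * from schema.table_name, list all the columns explicitly and cast the specific column.

Say the table has 5 columns and col2 is the bigint (int8) column. You can then read it like this: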

from sqlalchemy import create_engine 
import pandas as pd  
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>' 
engine = create_engine(connstr) 
with engine.connect() as conn,conn.begin():     
    df = pd.read_sql('''select col1,cast(col2 as int),col3,col4,col5... from schema.table_name''',conn)
print(df)

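PS: I'm not sure this is the smartest solution, but logically, if Python cannot convert the value to int64 correctly, we can read the already-cast value from SQL itself.

In addition, I would like to try casting int columns dynamically when their length exceeds 17 digits.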
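One way the dynamic cast could be sketched is to build the select list in Python. Two assumptions here are mine, not from the code above: the column names and the set of bigint columns are already known (for example from table metadata), and the cast target is varchar rather than int, because INT in Redshift is a 4-byte type and an 18-digit value would not fit in it. The helper name build_select is hypothetical.

```python
def build_select(table, columns, bigint_columns):
    """Return a SELECT statement that casts the given bigint columns to varchar.

    Hypothetical helper: identifiers are assumed to be trusted names taken
    from table metadata, not user input.
    """
    parts = [
        f"cast({col} as varchar) as {col}" if col in bigint_columns else col
        for col in columns
    ]
    return f"select {', '.join(parts)} from {table}"

query = build_select("schema.table_name", ["col1", "col2", "col3"], {"col2"})
print(query)
# select col1, cast(col2 as varchar) as col2, col3 from schema.table_name
```

The cast column then arrives as digit strings (or None for NULL), which can be converted with int() on the Python side without ever going through float.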
