Python pandas: dynamically clean 4000 CSVs that have varying columns/rows

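I am looking for a way to clean 4000 similarly formatted csvs that have varying numbers of rows/columns, and then combine them into one table (probably SQLite, with over 4 million records). The data are relational origin/destination (O/D) surveys: each csv is a specific route and ticket type and contains counts for multiple stops (e.g. route 101 adult, route 101 child fare, etc.). Each csv is in a similar stepped format, where the csv has the same number of rows as columns (if you do not count the first row, which holds the route info):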

route info
stop1,stop1
stop2,value,stop2
stop3,stop3
stop4,stop4

route info
stop11,stop11
stop32,stop32
....
stop150,.......,stop150

Sample raw data

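However, each csv can have different/more/fewer O/D combinations. The data has no headers, which makes it hard to get to the "intermediary step" I am proposing.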

Intermediary cleaning step - not required if can go directly to final output

Required cleaned output

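I have only just started on a solution, but I ran into problems loading the data into a Pandas DataFrame:

CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3. (Corrected by using sep="\t")
All data seen in a single column (corrected by first opening the data with csv.reader to get the number of columns and assigning a number to each one as its name)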

    import csv
    import pandas as pd

    for dirty_csv in csvs_to_be_cleaned:
        print(dirty_csv)
        # open csv to count its rows so that Pandas can be told how many columns to expect
        with open(dirty_csv, 'r', newline='') as csvfile:
            reader = csv.reader(csvfile)
            # first row contains route/ticket info (which will be populated in 2 new fields),
            # so subtract 1 from the row count to get the actual number of columns
            col_range = len(list(reader)) - 1
        default_cols = [str(i) for i in range(col_range)]  # create some col names
        # delimiter is an alias for sep; recent pandas versions raise an error
        # when both are passed, so only the comma delimiter is kept here
        df = pd.read_csv(dirty_csv, delimiter=",", names=default_cols, header=None)
        print(df)

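Questions:

Is there a more elegant solution that lets Pandas see all of the data in a csv where the data/headers are laid out in a stepped format?
Can Pandas use the first string entry in a column as the column header?

I would like to know whether a similar solution already exists, or whether someone is willing to help.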

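Python modules/process:

glob to collect all of the csvs
process each individual csv with csv/pandas (the full solution is not written at the moment)
output the cleaned csvs to a new folder
merge everything into a single SQLite table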

Solution

Here is one way to reformat the origin/destination files. First, create a sample data set.

from io import StringIO
import pandas as pd

data = '''
route r1
dest 1,Origin 1
dest 2,a,Origin 2
dest 3,b,d,Origin 3
dest 4,c,e,f,Origin 4
'''

Second, parse the file.

with StringIO(data) as handle:
    # get the route (first non-blank line)
    while True:
        line = next(handle).rstrip('\n')
        if line:
            break
    route = line
    
    origins = list()
    bus_trips = list()
    
    for line in handle:
        fields = line.rstrip().split(',')
        fields = [f.strip() for f in fields]
        
        destination = fields[0]
        origins.append(fields[-1])
        
        for origin,count in zip(origins[:-1],fields[1:]):
            t = (route,origin,destination,count)
            bus_trips.append(t)

bus_trips is a list of tuples. You can convert it to a pandas DataFrame and then write it to a database with the pandas .to_sql() method.

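[('route r1','Origin 1','dest 2','a'),('route r1','Origin 1','dest 3','b'),('route r1','Origin 2','dest 3','d'),('route r1','Origin 1','dest 4','c'),('route r1','Origin 2','dest 4','e'),('route r1','Origin 3','dest 4','f')]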

Creating the data frame is now simple, because we have changed the "stepped" format into a rectangular one.

col_names = ['route','origin','destination','ride_count']
df = pd.DataFrame.from_records(bus_trips,columns=col_names)
print(df)

      route    origin destination ride_count
0  route r1  Origin 1      dest 2          a
1  route r1  Origin 1      dest 3          b
2  route r1  Origin 2      dest 3          d
3  route r1  Origin 1      dest 4          c
4  route r1  Origin 2      dest 4          e
5  route r1  Origin 3      dest 4          f
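
To scale this from one sample to the full set of files, the parsing loop can be wrapped in a function and applied to every csv found by glob, appending each result to a single SQLite table with .to_sql(). The following is only a sketch under some assumptions: the function names parse_stepped_csv and load_folder, the table name od_counts, and the folder layout are all made up for illustration, and every file is assumed to follow the stepped layout shown above, with the route info on the first non-blank line.

```python
import glob
import os
import sqlite3
import tempfile

import pandas as pd


def parse_stepped_csv(path):
    """Return (route, origin, destination, count) tuples from one stepped csv.

    Hypothetical helper: assumes the first non-blank line is the route info
    and that each later line is "destination, counts..., new origin".
    """
    with open(path, newline="") as handle:
        # Skip blank lines so leading/trailing empty lines do not break parsing.
        lines = [line.strip() for line in handle if line.strip()]
    if not lines:
        return []
    route, origins, trips = lines[0], [], []
    for line in lines[1:]:
        fields = [f.strip() for f in line.split(",")]
        destination = fields[0]
        origins.append(fields[-1])  # the last field names the newest origin
        # Pair each earlier origin with the count in the same position.
        for origin, count in zip(origins[:-1], fields[1:]):
            trips.append((route, origin, destination, count))
    return trips


def load_folder(folder, db_path, table="od_counts"):
    """Parse every *.csv in folder and append its rows to one SQLite table."""
    col_names = ["route", "origin", "destination", "ride_count"]
    conn = sqlite3.connect(db_path)
    try:
        for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
            df = pd.DataFrame.from_records(parse_stepped_csv(path), columns=col_names)
            # if_exists="append" keeps adding rows to the same table.
            df.to_sql(table, conn, if_exists="append", index=False)
        conn.commit()
    finally:
        conn.close()


if __name__ == "__main__":
    # Small demonstration with two made-up files in a temporary folder.
    sample = "route r1\ndest 1,Origin 1\ndest 2,a,Origin 2\ndest 3,b,d,Origin 3\n"
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("r1_adult.csv", "r1_child.csv"):
            with open(os.path.join(tmp, name), "w") as f:
                f.write(sample)
        db_file = os.path.join(tmp, "od.sqlite")
        load_folder(tmp, db_file)
        conn = sqlite3.connect(db_file)
        print(pd.read_sql_query("SELECT * FROM od_counts", conn))
        conn.close()
```

Appending file by file keeps memory use small, since only one file's rows are held at a time. The counts are stored as text here, because the sample data uses placeholder letters; with real numeric counts a conversion step before .to_sql() would probably be wanted.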
