How to parse a log text file, parse the datetimes and get the sum of the timedeltas

I have tried various ways of opening the file and passing its contents in as a whole, but I cannot get it to work: the output is either zero or an empty set.

I have a log file containing the following data:

Time Log Nitrogen:
5/1/12: 3:39am - 4:43am data file study
        3:57pm - 5:06pm bg ui,combo boxes
        7:44pm - 8:50pm bg ui with scaler; slider
        10:30pm - 12:48am state texts; slider
5/2/12: 10:00am - 12:00am discuss with Blanca about the data file
5/8/12: 11:00pm - 11:40pm mapMC,5/9/12: 3:05pm - 3:42pm wholeMapMC,subMapMC,AS3 functions reading
        10:35pm - 1:33am whole view data; scrollpane; 
5/10/12: 6:10pm - 8:13pm blue slider
5/11/12: 8:45am - 12:10pm purple slider
         1:30pm - 5:00pm Nitrate bar
         11:18pm - 12:03am change NitrogenViewBase to static
5/12/12: 8:06am - 9:47am correct data and change NitrogenViewBase to static
         5:45pm - 8:00pm costs bar,embed font
         9:51pm - 12:31am costs bar
5/13/12: 7:45am - 8:45am read the Nitrogen Game doc
5/15/12: 2:07am - 5:09am corn
         2:06pm - 5:11pm hypoxic zone
5/16/12: 2:53pm - 5:09pm data re-structure
         7:00pm - 9:10pm sub sections watershed data
5/17/12: 12:30am - 2:32am sub sections sliders
         10:30am - 11:45am meet with Dr. Lant and Blanca
         3:09pm - 5:05pm crop yield and sub sections pink bar
         7:00pm - 7:50pm sub sections nitrate to gulf bar
5/18/12: 3:15pm - 3:52pm sub sections slider legend
5/27/12: 5:46pm - 7:30pm feedback fixes
6/20/12: 2:57pm - 5:00pm Teachers' feedback fixes
         7:30pm - 8:30pm 
6/22/12: 3:40pm - 5:00pm
6/25/12: 3:24pm - 5:00pm
6/26/12: 11:24am - 12:35pm
7/4/12:  1:00pm - 10:00pm research on combobox with dropdown subitem - to no avail
7/5/12:  1:30am - 3:00am continue the research
         9:31am - 12:45pm experiment on the combobox-subitem concept
         3:45pm - 5:00pm
         6:23pm - 8:14pm give up
         8:18pm - 10:00pm zone change
         11:07pm - 12:00am
7/10/12: 11:32am - 12:03pm added BASE_X and BASE_Y to the NitrogenSubView
         4:15pm - 5:05pm fine-tune the whole view map
         7:36pm - 8:46pm 
7/11/12: 1:38am - 4:42am
7/31/12: 11:26am - 1:18pm study photoshop path shape
8/1/12:  2:00am - 3:41am collect the coordinates of wetland shapes
         10:31am - 11:40am restorable wetlands implementation
         4:00pm - 5:00pm 
8/2/12:  12:20am - 4:42am
8/10/12: 2:30am - 4:55am sub watersheds color match; wetland color & size change 
3/13/13: 6:06pm - 6:32pm Make the numbers in the triangle sliders bigger and bolder; Larger font on "Crop Yield Reduction"

How do I calculate the total time spent by parsing this time log file? I am unable to parse the whole file.

What I have tried:

import re
import datetime

text = """5/1/12: 3:39am - 4:43am data file study
    3:57pm - 5:06pm bg ui, combo boxes"""

total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)

print(sum([datetime.datetime.strptime(t[1], "%I:%M%p") - datetime.datetime.strptime(t[0], "%I:%M%p") for t in total], datetime.timedelta()))

Doing this, I get negative time values. How do I fix that?
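For example, a midnight-crossing entry such as 10:30pm - 12:48am produces a negative timedelta when the bare times are parsed and subtracted (a minimal illustration of the problem, separate from the code above):

import datetime

start = datetime.datetime.strptime("10:30pm", "%I:%M%p")
end = datetime.datetime.strptime("12:48am", "%I:%M%p")
# both parse to the default date 1900-01-01, so end < start
print(end - start)  # -1 day, 2:18:00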

Solution

To account for entries that span midnight, you have to compute the two parts of the duration (before and after midnight) separately and add them together.
Please refer to the code below.


import re
from datetime import datetime as dt, timedelta as td
strp = dt.strptime
with open("log.txt", "r") as f:
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", f.read())
    print(sum([strp(t[1], "%I:%M%p") - strp(t[0], "%I:%M%p")
               if strp(t[1], "%I:%M%p") > strp(t[0], "%I:%M%p")
               else (strp("11:59pm", "%I:%M%p") - strp(t[0], "%I:%M%p"))
                    + (strp(t[1], "%I:%M%p") - strp("12:00am", "%I:%M%p"))
                    + td(minutes=1)
               for t in total], td()))
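A quick worked check of that wrap-around correction on a single midnight-crossing pair (an illustrative sketch using values from the log, not part of the answer's code):

from datetime import datetime as dt, timedelta as td
strp = dt.strptime

start, end = strp("10:30pm", "%I:%M%p"), strp("12:48am", "%I:%M%p")
# duration from start up to 11:59pm, plus duration from 12:00am up to end,
# plus the one minute between 11:59pm and 12:00am
wrapped = (strp("11:59pm", "%I:%M%p") - start) + (end - strp("12:00am", "%I:%M%p")) + td(minutes=1)
print(wrapped)  # 2:18:00, instead of the negative value a plain end - start would give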

You can parse the log file into a pandas DataFrame and then do the computation easily:

import re
from datetime import timedelta
import pandas as pd
import dateparser

x="""5/1/12: 3:39am - 4:43am data file study
            3:57pm - 5:06pm bg ui,combo boxes
            7:44pm - 8:50pm bg ui with scaler; slider
            10:30pm - 12:48am state texts; slider
    5/2/12: 10:00am - 12:00am discuss with Blanca about the data file
    5/8/12: 11:00pm - 11:40pm mapMC,5/9/12: 3:05pm - 3:42pm wholeMapMC,subMapMC,AS3 functions reading
            10:35pm - 1:33am whole view data; scrollpane; 
    5/10/12: 6:10pm - 8:13pm blue slider
    5/11/12: 8:45am - 12:10pm purple slider
             1:30pm - 5:00pm Nitrate bar
             11:18pm - 12:03am change NitrogenViewBase to static
    5/12/12: 8:06am - 9:47am correct data and change NitrogenViewBase to static
             5:45pm - 8:00pm costs bar,embed font
             9:51pm - 12:31am costs bar
    5/13/12: 7:45am - 8:45am read the Nitrogen Game doc
    5/15/12: 2:07am - 5:09am corn
             2:06pm - 5:11pm hypoxic zone
    5/16/12: 2:53pm - 5:09pm data re-structure
             7:00pm - 9:10pm sub sections watershed data
    5/17/12: 12:30am - 2:32am sub sections sliders
             10:30am - 11:45am meet with Dr. Lant and Blanca
             3:09pm - 5:05pm crop yield and sub sections pink bar
             7:00pm - 7:50pm sub sections nitrate to gulf bar
    5/18/12: 3:15pm - 3:52pm sub sections slider legend
    5/27/12: 5:46pm - 7:30pm feedback fixes
    6/20/12: 2:57pm - 5:00pm Teachers' feedback fixes
             7:30pm - 8:30pm 
    6/22/12: 3:40pm - 5:00pm
    6/25/12: 3:24pm - 5:00pm
    6/26/12: 11:24am - 12:35pm
    7/4/12:  1:00pm - 10:00pm research on combobox with dropdown subitem - to no avail
    7/5/12:  1:30am - 3:00am continue the research
             9:31am - 12:45pm experiment on the combobox-subitem concept
             3:45pm - 5:00pm
             6:23pm - 8:14pm give up
             8:18pm - 10:00pm zone change
             11:07pm - 12:00am
    7/10/12: 11:32am - 12:03pm added BASE_X and BASE_Y to the NitrogenSubView
             4:15pm - 5:05pm fine-tune the whole view map
             7:36pm - 8:46pm 
    7/11/12: 1:38am - 4:42am
    7/31/12: 11:26am - 1:18pm study photoshop path shape
    8/1/12:  2:00am - 3:41am collect the coordinates of wetland shapes
             10:31am - 11:40am restorable wetlands implementation
             4:00pm - 5:00pm 
    8/2/12:  12:20am - 4:42am
    8/10/12: 2:30am - 4:55am sub watersheds color match; wetland color & size change 
    3/13/13: 6:06pm - 6:32pm Make the numbers in the triangle sliders bigger and bolder; Larger font on "Crop Yield Reduction"
"""



# We will store records here
records = []

# Loop through the lines
for line in x.split("\n"):

    # Look for a date in the line
    match_date = re.search(r'(\d+/\d+/\d+)', line)

    if match_date is not None:
        # If a date exists, store it in a variable
        date = match_date.group(1)
    # Extract times
    times = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", line)
    # If there's no valid time in the line, skip it
    if len(times) == 0: continue
    # Parse dates
    start = dateparser.parse(date + " " + times[0][0], languages=['en'])
    end = dateparser.parse(date + " " + times[0][1], languages=['en'])
    content = line.split(times[0][1])[-1].strip()
    # Append records
    records.append(dict(date=date, start=start, end=end, content=content))
    
df = pd.DataFrame(records)

# Correct the end time if it's lower than the start time (the entry spans midnight)
df.loc[df.start > df.end, "end"] = df[df.start > df.end].end + timedelta(days=1)

print("Total spent time :", (df.end - df.start).sum())

Output

Total spent time : 4 days 09:13:00
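Since each record also keeps its date, a per-day breakdown is a one-liner on the same DataFrame (a small follow-up sketch, not part of the original answer):

# total time spent per day, grouping the corrected durations by the date column
per_day = (df.end - df.start).groupby(df.date).sum()
print(per_day)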

You have already received two interesting and workable solutions from Liju and Sebastien D. Here I propose two new variants which are similar to them but bring significant performance advantages.

The two current solutions approach the problem as follows:

  • one_pass solution (proposed by Liju): takes the regex matches and sums a list created by a list comprehension. During that comprehension it parses the same two strings into datetime three times (once to evaluate the >, once to produce the if branch or the else branch).

  • w_dateparser solution (proposed by Sebastien D): takes each line of the text, tries to regex-match the date of that line, then tries to find the start/end times on the same line (this could probably be merged into a single regex, but the regex is not the bottleneck of this solution). It then combines the date and times using dateparser and collects the text description. This is closer to a full parser, but for the timing tests I removed the description feature.

The two new solutions are similar:

  • two_pass solution: similar to one_pass, but in a first pass it only parses the strings into datetime, and in a second pass it evaluates start > end and sums the correct timedeltas. The main advantage is that it parses each time string only once; the drawback is that it has to iterate twice.

  • pure_pandas solution: similar to w_dateparser, but it calls the regex only once and uses pandas' built-in to_datetime for the parsing.

If we compare how all these solutions perform for different text lengths, we see that w_dateparser is by far the slowest solution.

(figure: timings)

If we zoom in to compare the other three solutions, we see that w_pure_pandas is slightly slower than the others for small text lengths, but it shines on longer inputs by leveraging the numpy C implementation for the comparisons (as opposed to the list comprehensions used by the other solutions). Second, two_pass is generally faster than one_pass, and increasingly so for longer texts.

(figure: timings without w_dateparser)

The code for two_pass and w_pure_pandas:

def two_pass(text):
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    total = [
        (datetime.datetime.strptime(t[0], '%I:%M%p'), datetime.datetime.strptime(t[1], '%I:%M%p'))
        for t in total
    ]
    return sum(
        (
            end - start if end > start
            else end - start + datetime.timedelta(days=1)
            for start, end in total
        ), datetime.timedelta()
    )


def w_pure_pandas(text):
    import pandas as pd

    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    df = pd.DataFrame(total, columns=['start', 'end'])
    for col in df:
        # pandas.to_datetime has issues with date compatibility,
        # but since we only care about time deltas,
        # we can just use the default behavior
        df[col] = pd.to_datetime(df[col])

    df.loc[df.start > df.end, 'end'] += datetime.timedelta(days=1)

    return df.diff(axis=1).sum()['end']
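A minimal usage sketch for both functions (assuming the log text is read from a file, here hypothetically named log.txt as in the first answer): two_pass returns a datetime.timedelta and w_pure_pandas a pandas Timedelta, both describing the same total.

import re
import datetime

with open("log.txt") as f:  # hypothetical file containing the log shown above
    text = f.read()

print(two_pass(text))       # e.g. 4 days, 9:13:00 for the full log
print(w_pure_pandas(text))  # same total as a pandas Timedelta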

Full code of all the solutions and the timing tests:

import re
import datetime
import timeit
import pandas as pd
from matplotlib import pyplot as plt

text = '''
Time Log Nitrogen:
5/1/12: 3:39am - 4:43am data file study
        3:57pm - 5:06pm bg ui,combo boxes
        7:44pm - 8:50pm bg ui with scaler; slider
        10:30pm - 12:48am state texts; slider
5/2/12: 10:00am - 12:00am discuss with Blanca about the data file
5/8/12: 11:00pm - 11:40pm mapMC,5/9/12: 3:05pm - 3:42pm wholeMapMC,subMapMC,AS3 functions reading
        10:35pm - 1:33am whole view data; scrollpane; 
5/10/12: 6:10pm - 8:13pm blue slider
5/11/12: 8:45am - 12:10pm purple slider
         1:30pm - 5:00pm Nitrate bar
         11:18pm - 12:03am change NitrogenViewBase to static
5/12/12: 8:06am - 9:47am correct data and change NitrogenViewBase to static
         5:45pm - 8:00pm costs bar,embed font
         9:51pm - 12:31am costs bar
5/13/12: 7:45am - 8:45am read the Nitrogen Game doc
5/15/12: 2:07am - 5:09am corn
         2:06pm - 5:11pm hypoxic zone
5/16/12: 2:53pm - 5:09pm data re-structure
         7:00pm - 9:10pm sub sections watershed data
5/17/12: 12:30am - 2:32am sub sections sliders
         10:30am - 11:45am meet with Dr. Lant and Blanca
         3:09pm - 5:05pm crop yield and sub sections pink bar
         7:00pm - 7:50pm sub sections nitrate to gulf bar
5/18/12: 3:15pm - 3:52pm sub sections slider legend
5/27/12: 5:46pm - 7:30pm feedback fixes
6/20/12: 2:57pm - 5:00pm Teachers' feedback fixes
         7:30pm - 8:30pm 
6/22/12: 3:40pm - 5:00pm
6/25/12: 3:24pm - 5:00pm
6/26/12: 11:24am - 12:35pm
7/4/12:  1:00pm - 10:00pm research on combobox with dropdown subitem - to no avail
7/5/12:  1:30am - 3:00am continue the research
         9:31am - 12:45pm experiment on the combobox-subitem concept
         3:45pm - 5:00pm
         6:23pm - 8:14pm give up
         8:18pm - 10:00pm zone change
         11:07pm - 12:00am
7/10/12: 11:32am - 12:03pm added BASE_X and BASE_Y to the NitrogenSubView
         4:15pm - 5:05pm fine-tune the whole view map
         7:36pm - 8:46pm 
7/11/12: 1:38am - 4:42am
7/31/12: 11:26am - 1:18pm study photoshop path shape
8/1/12:  2:00am - 3:41am collect the coordinates of wetland shapes
         10:31am - 11:40am restorable wetlands implementation
         4:00pm - 5:00pm 
8/2/12:  12:20am - 4:42am
8/10/12: 2:30am - 4:55am sub watersheds color match; wetland color & size change 
3/13/13: 6:06pm - 6:32pm Make the numbers in the triangle sliders bigger and bolder; Larger font on "Crop Yield Reduction"
'''

def one_pass(text):
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    return sum(
        [
            datetime.datetime.strptime(t[1], '%I:%M%p')
            - datetime.datetime.strptime(t[0], '%I:%M%p')
            if datetime.datetime.strptime(t[1], '%I:%M%p') >
                datetime.datetime.strptime(t[0], '%I:%M%p')
            else
            datetime.datetime.strptime('11:59pm', '%I:%M%p')
            - datetime.datetime.strptime(t[0], '%I:%M%p')
            + datetime.datetime.strptime(t[1], '%I:%M%p')
            - datetime.datetime.strptime('12:00am', '%I:%M%p')
            + datetime.timedelta(minutes=1)
            for t in total
        ], start=datetime.timedelta()
    )


def w_dateparser(text):
    import pandas as pd
    import dateparser
    
    # We will store records here
    records = []
    # Loop through the lines
    # t0 = t1 = t2 = 0
    for line in text.split("\n"):
        # Look for a date in the line
        # t0 = time() - t0
        match_date = re.search(r'(\d+/\d+/\d+)', line)
        if match_date is not None:
            # If a date exists, store it in a variable
            date = match_date.group(1)
        # Extract times
        times = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", line)
        # t0 = time() - t0
        # If there's no valid time in the line, skip it
        if len(times) == 0: continue
        # t1 = time() - t1
        # Parse dates
        start = dateparser.parse(date + " " + times[0][0], languages=['en'])
        end = dateparser.parse(date + " " + times[0][1], languages=['en'])
        # content = line.split(times[0][1])[-1].strip()
        # t1 = time() - t1
        # Append records
        # records.append(dict(date=date, start=start, end=end, content=content))
        records.append(dict(date=date, start=start, end=end))
        
    # t2 = time() - t2
    df = pd.DataFrame(records)
    # print(df)
    #Correct end time if it's lower than start time 
    df.loc[df.start>df.end,"end"] = df[df.start>df.end].end + datetime.timedelta(days=1)
    # t2 = time() - t2
    # print(t0,t1,t2)
    return (df.end - df.start).sum()


def two_pass(text):
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    total = [
        (datetime.datetime.strptime(t[0], '%I:%M%p'), datetime.datetime.strptime(t[1], '%I:%M%p'))
        for t in total
    ]
    return sum(
        (
            end - start if end > start
            else end - start + datetime.timedelta(days=1)
            for start, end in total
        ), datetime.timedelta()
    )


def w_pure_pandas(text):
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    df = pd.DataFrame(total, columns=['start', 'end'])
    for col in df:
        # since we only care about time deltas, the default to_datetime behavior is fine
        df[col] = pd.to_datetime(df[col])

    df.loc[df.start > df.end, 'end'] += datetime.timedelta(days=1)

    return df.diff(axis=1).sum()['end']

timings = {}
for l in [1,5,10,50,100]:
    text_long = text * l
    n = 2
    timings[l] = {}
    for func in ['two_pass','one_pass','w_pure_pandas','w_dateparser']:
        t = timeit.timeit(f"{func}(text_long)",number=n,globals=globals()) / n
        timings[l][func] = t

timings = pd.DataFrame(timings).T
timings.info()
print(timings)

timings.plot()
plt.xlabel('multiplier for lines of text')
plt.ylabel('runtime (s)')
plt.grid(True)
plt.show()
plt.close('all')

timings[['two_pass','one_pass','w_pure_pandas']].plot()
plt.xlabel('multiplier for lines of text')
plt.ylabel('runtime (s)')
plt.grid(True)
plt.show()
plt.close('all')

