Removing redundant lines with bash

I have the following lines:

http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

I want to delete every line whose parameters all appear in another line. Take these two lines:

http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

I want to keep only this one:

http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

because it has the same parameters plus more, which makes the first one redundant.

I want to keep these:

http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

I want to delete all other lines that have the same parameters, keeping the line with more parameters rather than the one with fewer.

Another example:

I want to convert this:

http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123

into this:

http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103

The same parameters on different resources must remain separate lines.

If I have:

http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content2/index.cfm?ID=123

I want to keep both of them.

Edit, August 19:


Another example of URLs and how I would like them processed:

https://es.answers.search.yahoo.com/search?p=mixmail+correo&fr2=piv-web
https://es.answers.search.yahoo.com/search?p=educastur+campus&fr2=piv-web
https://techvalidation.dell.com/Default.aspx?id=9d459f5c-8a26-4268-b37c-23980a6ba577&Key=%2fuKb2WS3da4lk%2f34VSXE4F02YqS5LfvbKFGcDXNQxgIvvbodU3o3lHoNm09M67Ut&SRC=QuoteCenter&newsession=true
https://techvalidation.dell.com/technicalvalidationlist.aspx?key=6ivAYJco9bouAJBNkQ8rgtGWPdfLVRumAScf7bIb6DMpj6SYVdWy6bd4ITEPF4tQMkNzNpGshERZndX3Ia%2bbqhJ3CnrC46qJkHJ4TdiyN78%3d&PartnerAffinityId=3341728904&SRC=QuoteCenter
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr
https://wwwmg.pandacn.ford.com/forms/frmservlet?config=pandacn3
https://www.panda.ford.com/forms/frmservlet?config=pandain4
https://www.panda.ford.com/forms/frmservlet?config=pandain3

It should output:

https://es.answers.search.yahoo.com/search?p=mixmail+correo&fr2=piv-web
https://techvalidation.dell.com/Default.aspx?id=9d459f5c-8a26-4268-b37c-23980a6ba577&Key=%2fuKb2WS3da4lk%2f34VSXE4F02YqS5LfvbKFGcDXNQxgIvvbodU3o3lHoNm09M67Ut&SRC=QuoteCenter&newsession=true
https://techvalidation.dell.com/technicalvalidationlist.aspx?key=6ivAYJco9bouAJBNkQ8rgtGWPdfLVRumAScf7bIb6DMpj6SYVdWy6bd4ITEPF4tQMkNzNpGshERZndX3Ia%2bbqhJ3CnrC46qJkHJ4TdiyN78%3d&PartnerAffinityId=3341728904&SRC=QuoteCenter
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr
https://wwwmg.pandacn.ford.com/forms/frmservlet?config=pandacn3
https://www.panda.ford.com/forms/frmservlet?config=pandain4

My approach only works for URLs that have a single parameter:

https://www.panda.ford.com/forms/frmservlet?config=pandain4
https://www.panda.ford.com/forms/frmservlet?config=pandain3

I run cat list.txt | sort -u -t "=" -k 1,1 and the output is:

https://www.panda.ford.com/forms/frmservlet?config=pandain4

But it fails for these:

https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr

where I get

https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF

using cat list.txt | sort -u -t "=" -k 1,1, when what I actually want is the other line

https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr

because it has the same parameters and more.

Thanks!


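Solution

Doing this properly requires a lot of internal sorting, which inside a bash loop spawns a large number of processes and slows the job down too much.

So I switched to perl. Note that this rearranges both the parameters and the lines; if you need the original lines unmodified and/or in their original order, we would have to add another one to three steps. You should also note that you have Knowledge in both capitalized and lowercase form. A URL is case-insensitive up through the port, but the path after that is case-sensitive, so those lines will not register as the same resource even when they have the same parameters.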

#!/usr/bin/env perl

use strict;     # I ALWAYS use strict and warnings unless 
use warnings;   # there is some compelling reason not to.

open my $fh, '<', $ARGV[0] or die "$ARGV[0]: $!";   # read the file named on the command line
my %urlsOUT;
foreach ( <$fh> ) { chomp;
    my %args;                              # clean for each record
    m!^(https?://[^/]+)(/[^?]+)[?](.*)!i;  # catch the base in separate case sensitivities
    my ($base) = lc($1).$2;                # always lowercase the case insensitive part
    @args{ split /[?&]+/,$3 } = ();       # removes duplicate args in a url
    my ( $args ) = join '&',reverse sort keys %args; # reassemble ORDERED
    $urlsOUT{"$base?$args"}='';            # now a unique key
}

my $urlsOUT='';
REC: foreach my $url (reverse sort keys %urlsOUT ) { # ORDERED
       for ( split /[?&]/,$url ) {                  # for each arg
         if ( $urlsOUT !~ /\b$_\b/ ) {               # if new
           $urlsOUT .= "$url\n";                     # keep this
           next REC;                                 # check next
         }
       }
}

print $urlsOUT;

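This consistently reorders and deduplicates the parameters within each URL, deduplicates the resulting records, and then checks each remaining record (in descending order), keeping it only if it contains something that no previously kept record has.

I named the program file tst and created two input files, tst1 and urls.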
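For comparison, the per-URL normalization step can be sketched in bash. This is only an illustration under my own assumptions: normalize_url is a hypothetical helper name, it sorts parameters in ascending byte order rather than the reverse order the perl script uses, and it spawns several processes per URL, which is exactly the overhead mentioned above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original answer): normalize one URL by
# lowercasing scheme and host, then sorting and deduplicating its parameters.
normalize_url() {
    local url=$1 base query
    local re='^([A-Za-z][A-Za-z0-9+.-]*://[^/]+)(/.*)?$'
    # No query string: nothing to normalize.
    [[ $url == *\?* ]] || { printf '%s\n' "$url"; return; }
    base=${url%%\?*}      # everything before the first '?'
    query=${url#*\?}      # everything after the first '?'
    # Lowercase scheme, host and port only; the path stays case-sensitive.
    if [[ $base =~ $re ]]; then
        base=${BASH_REMATCH[1],,}${BASH_REMATCH[2]}
    fi
    # One parameter per line, drop empty ones, sort and dedupe, rejoin with '&'.
    query=$(tr '&' '\n' <<<"$query" | grep -v '^$' | LC_ALL=C sort -u | paste -sd '&' -)
    printf '%s?%s\n' "$base" "$query"
}
```

Calling normalize_url 'http://test/foo?foo&bar&foo' prints http://test/foo?bar&foo, so reordered or repeated parameters collapse to a single spelling before any comparison. It needs bash 4 or later for the ${var,,} lowercase expansion.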

$: cat tst1
http://test/foo?foo
http://test/foo?bar
http://test/foo?foo
http://test2/foo?foo
http://test2/foo?baz
http://test2/foo?foo&bar
http://test2/foo?baz
http://test/foo?foo&bar
http://test/foo?bar&foo
http://test2/foo?bar&foo
http://test3/foo?bar
http://test3/foo?foo&bar&baz
http://test2/foo?foo&bar&baz
http://test/foo?foo&bar&baz

$: ./tst tst1
http://test3/foo?foo&baz&bar
http://test2/foo?foo&baz&bar
http://test/foo?foo&baz&bar

$: cat urls
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?    upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123

$: ./tst urls
http://grouplogic.com:80/store/index.cfm?upTp=2&ptype=FS&prTpID=5&fa=upgrade&UpNewType=2
http://grouplogic.com:80/store/index.cfm?prTpID=5&id=532&fa=PrtSlt
http://grouplogic.com:80/store/index.cfm?fa=conre&cftoken=26157811&cfid=11812682
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/news-events/index.cfm?prod=2&fa=viewRelease&ID=21
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&archive=1&ProdID=1
http://grouplogic.com:80/content/index.cfm?foo=bar&ID=123
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp

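Note that the output is in case-sensitive ASCII sort order, and that trailing and repeated ampersands have been cleaned up.

Doing this in perl, with its internal reads and sorts, is also much faster.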

real    0m0.170s
user    0m0.046s
sys     0m0.092s

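Old version

Although you can at least eliminate redundant comparisons in the nested loop, I don't think there is a more elegant approach than a brute-force double pass.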

lst=( $( sort -ru x ) ) # unique reverse sort once to eliminate simple dups

for (( ndx1=0; ndx1<${#lst[@]}-1; ndx1++ ))       # walk thru once in outer loop
do [[ -n "${lst[ndx1]}" ]] || continue            # ignore removed
   for (( ndx2=ndx1+1; ndx2<${#lst[@]}; ndx2++ )) # inner skips prev,no redux
   do case "${lst[ndx1]}" in                      # case statement string match
      "${lst[ndx2]}"*) unset lst[ndx2] ;;         # remove shorter versions
                    *) continue 2      ;;         # no match,skip ahead
      esac
   done
done

printf "%s\n" "${lst[@]}"                         # print out what's left

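A unique reverse sort eliminates simple duplicates and sets up the comparisons; the result is stored in an array to make the nested loops easier.

The outer loop walks the array once; it does not bother with the last record, because the inner loop will handle it. The inner loop starts at the record after the outer loop's current one. Since the records are sorted, there is no need to check earlier ones again.

Because the inner loop removes records, the outer loop first checks whether the record at the current index is empty and skips it entirely if so.

The case statement checks each record after the current one against the outer loop's record. If the inner record is contained in the current outer record (that is, it is a prefix of it), the shorter version is removed from the array with unset, and the loop proceeds to the next record to check.

When an inner-loop record is no longer part of the outer-loop key, we know we have moved past the related records (because they are sorted), so we skip the rest of the list as pointless and move on to the next outer record with continue 2.

This moving window of related records should keep wasted work to a minimum.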
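The case match above only catches a line that is a literal prefix of a longer one, so reordered parameters slip through. A comparison on parameter sets could be sketched like this; is_redundant is a hypothetical helper of my own, not part of the script above, and it assumes both URLs use '&' as the only separator.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not the original script): succeeds when
# both URLs name the same resource and every parameter of the first URL also
# appears, as a whole name=value item, in the second URL.
is_redundant() {
    local small=$1 big=$2 p
    local -a params
    # Different resource (the text before '?'): never redundant.
    [[ ${small%%\?*} == "${big%%\?*}" ]] || return 1
    # Wrap the larger query in '&' so each item can be matched whole.
    local haystack="&${big#*\?}&"
    IFS='&' read -r -a params <<<"${small#*\?}"
    for p in "${params[@]}"; do
        # Empty items come from doubled or trailing '&' and are ignored.
        [[ -z $p || $haystack == *"&$p&"* ]] || return 1
    done
    return 0
}
```

For example, is_redundant 'http://test/foo?bar&foo' 'http://test/foo?foo&bar&baz' succeeds, while swapping the two arguments fails. Two identical URLs also count as redundant, so duplicates should be removed first and a line should never be compared with itself.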

Finally, it looks like I got it working with this test file:

$ cat file2
test?foo
test?bar
test?foo
test2?foo
test2?baz
test2?foo&bar
test2?baz
test?foo&bar
test?bar&foo
test2?bar&foo
test3?bar
test3?foo&bar&baz
test2?foo&bar&baz
test?foo&bar&baz

Script

#!/bin/bash
declare -A resorces
raw=( $(sort -u $1) )
for url in "${raw[@]}"; { resorces[${url//\?*}]+=" ${url//*\?}"; }
for res in "${!resorces[@]}"; {
    list=( ${resorces[$res]} )
    for i in "${!list[@]}"; {
        par=${list[$i]}
        unset list[$i]
        [[ ${list[@]} =~ $par ]] || result+=("$res?$par")
    }
}
printf '%s\n' "${result[@]}"

Result

$ ./test2 file2
test2?bar&foo
test2?foo&bar&baz
test3?foo&bar&baz
test?bar&foo
test?foo&bar&baz

For this test file:

$ cat file
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar

Result

$ ./test2 file
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
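I can see the following algorithm doing this (unfortunately, I don't know how to implement it):

First, sort the file alphabetically.

Then read the file line by line, and if a line is a substring of the next line, don't put it in the result file.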
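A minimal bash sketch of those two steps might look like the following. The function name drop_prefix_lines is my own invention, and I am reading "substring" as "prefix", which is what sorting makes adjacent.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): reads lines from the files
# given as arguments, or from stdin, and drops every line that is a literal
# prefix of the line that follows it in sorted order.
drop_prefix_lines() {
    local prev='' line have_prev=0
    # LC_ALL=C forces plain byte order, so a line and its extensions are adjacent.
    LC_ALL=C sort -u -- "$@" | {
        while IFS= read -r line; do
            # Print the previous line unless the current one starts with it.
            if (( have_prev )) && [[ $line != "$prev"* ]]; then
                printf '%s\n' "$prev"
            fi
            prev=$line
            have_prev=1
        done
        # The last line has no successor, so it is always kept.
        if (( have_prev )); then
            printf '%s\n' "$prev"
        fi
    }
}
```

It could be run as drop_prefix_lines list.txt. Two caveats apply: it does not notice the same parameters in a different order, and a value that is a prefix of another value (ID=10 before ID=103) would be dropped even though the parameters differ.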
