How to remove redundant lines with bash
I have the following lines:
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412
I want to delete every line whose parameters are all contained in another line. For example, given these two lines:
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412
I only want to keep this one:
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412
because it is the one with more parameters, which makes the first one redundant.
I want to keep these:
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412
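I want to delete all other lines that have the same parameters, keeping the line with more parameters rather than the one with fewer.
Another example:
I want to convert this: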
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123
to this:
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
The same parameters on different resources must stay on separate lines.
If I have:
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content2/index.cfm?ID=123
I want to keep both of them.
Edit, August 19:
Another sample of URLs and how I want them processed:
https://es.answers.search.yahoo.com/search?p=mixmail+correo&fr2=piv-web
https://es.answers.search.yahoo.com/search?p=educastur+campus&fr2=piv-web
https://techvalidation.dell.com/Default.aspx?id=9d459f5c-8a26-4268-b37c-23980a6ba577&Key=%2fuKb2WS3da4lk%2f34VSXE4F02YqS5LfvbKFGcDXNQxgIvvbodU3o3lHoNm09M67Ut&SRC=QuoteCenter&newsession=true
https://techvalidation.dell.com/technicalvalidationlist.aspx?key=6ivAYJco9bouAJBNkQ8rgtGWPdfLVRumAScf7bIb6DMpj6SYVdWy6bd4ITEPF4tQMkNzNpGshERZndX3Ia%2bbqhJ3CnrC46qJkHJ4TdiyN78%3d&PartnerAffinityId=3341728904&SRC=QuoteCenter
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr
https://wwwmg.pandacn.ford.com/forms/frmservlet?config=pandacn3
https://www.panda.ford.com/forms/frmservlet?config=pandain4
https://www.panda.ford.com/forms/frmservlet?config=pandain3
It should output:
https://es.answers.search.yahoo.com/search?p=mixmail+correo&fr2=piv-web
https://techvalidation.dell.com/Default.aspx?id=9d459f5c-8a26-4268-b37c-23980a6ba577&Key=%2fuKb2WS3da4lk%2f34VSXE4F02YqS5LfvbKFGcDXNQxgIvvbodU3o3lHoNm09M67Ut&SRC=QuoteCenter&newsession=true
https://techvalidation.dell.com/technicalvalidationlist.aspx?key=6ivAYJco9bouAJBNkQ8rgtGWPdfLVRumAScf7bIb6DMpj6SYVdWy6bd4ITEPF4tQMkNzNpGshERZndX3Ia%2bbqhJ3CnrC46qJkHJ4TdiyN78%3d&PartnerAffinityId=3341728904&SRC=QuoteCenter
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr
https://wwwmg.pandacn.ford.com/forms/frmservlet?config=pandacn3
https://www.panda.ford.com/forms/frmservlet?config=pandain4
My approach only works for URLs that have a single parameter:
https://www.panda.ford.com/forms/frmservlet?config=pandain4
https://www.panda.ford.com/forms/frmservlet?config=pandain3
I run: cat list.txt | sort -u -t "=" -k 1,1
and the output is:
https://www.panda.ford.com/forms/frmservlet?config=pandain4
But it fails on these:
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr
where I get
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF
with cat list.txt | sort -u -t "=" -k 1,1
when what I actually want is the other line
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr
because it has the same parameter plus more.
Thanks!
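Read literally, the rule above compares parameter names within one resource. The bash sketch below is one possible reading of it, under these assumptions: the resource is the text before the first '?' (compared case-sensitively), a line is redundant when another line for the same resource has all of its parameter names, and among lines with identical name sets the first one wins. The function names are hypothetical, and the quadratic loop is only meant for small lists.

```shell
#!/usr/bin/env bash
# Sketch only; names_of, is_subset and dedupe_by_param_names are hypothetical helpers.

# names_of QUERY: print the sorted, unique parameter names wrapped as "&a&b&".
names_of() {
    local out
    out=$(tr '&' '\n' <<< "$1" | sed -e 's/=.*//' -e '/^$/d' | LC_ALL=C sort -u | tr '\n' '&')
    printf '&%s\n' "$out"
}

# is_subset A B: succeed when every name in wrapped list A also occurs in B.
is_subset() {
    local name rest=${1#\&}
    while [[ -n $rest ]]; do
        name=${rest%%\&*}                 # first remaining name
        rest=${rest#*\&}                  # drop it from the work list
        [[ $2 == *"&$name&"* ]] || return 1
    done
    return 0
}

# Reads URLs on stdin and prints the non-redundant ones in their original order.
dedupe_by_param_names() {
    local -a lines res names
    local line i j redundant
    while IFS= read -r line; do
        [[ -n $line ]] || continue
        lines+=("$line")
        res+=("${line%%\?*}")             # resource: everything before the first '?'
        if [[ $line == *\?* ]]; then
            names+=("$(names_of "${line#*\?}")")
        else
            names+=("&")                  # no query string: empty name set
        fi
    done
    for i in "${!lines[@]}"; do
        redundant=0
        for j in "${!lines[@]}"; do
            (( i == j )) && continue
            [[ ${res[i]} == "${res[j]}" ]] || continue
            is_subset "${names[i]}" "${names[j]}" || continue
            # a strict subset always loses; with equal name sets the earlier line wins
            if [[ ${names[i]} != "${names[j]}" ]] || (( j < i )); then
                redundant=1
                break
            fi
        done
        (( redundant )) || printf '%s\n' "${lines[i]}"
    done
}
```

It could be called as dedupe_by_param_names < list.txt. Because it only looks at parameter names, two lines that differ only in their values count as duplicates of each other.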
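Solution
Doing this properly requires a lot of internal sorting, which inside a bash loop would spawn a great many processes and slow the job down too much.
Switch to perl. Note that this rearranges both the parameters and the lines; if you need the original lines unmodified and/or in their original order, we would have to add another step or three. Also note that you have Knowledge in both capitalized and lowercase forms; a URL is case-insensitive up through the port, but the path after it is case-sensitive, so those lines will not register as the same resource even when they have the same parameters.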
#!/usr/bin/env perl
use strict;   # I ALWAYS use strict and warnings unless
use warnings; # there is some compelling reason not to.

my $file = @ARGV ? $ARGV[0] : 'urls';                 # input file; defaults to 'urls'
open my $fh, '<', $file or die "$file: $!";
my %urlsOUT;
foreach ( <$fh> ) {
    chomp;
    my %args;                                         # clean for each record
    m!^(https?://[^/]+)(/[^?]+)[?](.*)!i or next;     # catch the base in separate case sensitivities
    my $base = lc($1).$2;                             # always lowercase the case insensitive part
    @args{ split /[?&]+/, $3 } = ();                  # removes duplicate args in a url
    my $args = join '&', reverse sort keys %args;     # reassemble ORDERED
    $urlsOUT{"$base?$args"} = '';                     # now a unique key
}

my $urlsOUT = '';
REC: foreach my $url ( reverse sort keys %urlsOUT ) { # ORDERED
    for ( split /[?&]/, $url ) {                      # for each arg
        if ( $urlsOUT !~ /\b\Q$_\E\b/ ) {             # if new (\Q..\E matches the arg literally)
            $urlsOUT .= "$url\n";                     # keep this
            next REC;                                 # check next
        }
    }
}
print $urlsOUT;
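This consistently reorders and de-duplicates all of the arguments within each URL, de-duplicates the resulting records, and then checks each remaining record (in descending order), dropping any record that has nothing an earlier record did not already have.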
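For a single URL, the normalization step described above (lowercased host part, arguments de-duplicated and sorted) can be sketched in bash as well. This is only an illustration for a handful of URLs, since it spawns several processes per call, which is the slowdown mentioned earlier. canon_url is a hypothetical helper name, the sketch sorts ascending rather than in reverse, and it assumes bash 4 or later for mapfile and ${var,,}.

```shell
#!/usr/bin/env bash
# Sketch only; canon_url is a hypothetical helper, not part of the perl script.

# canon_url URL: lowercase the scheme/host/port part and print the URL with its
# arguments de-duplicated, sorted (ascending) and re-joined with single '&'s.
canon_url() {
    local url=$1 base query re='^([A-Za-z][A-Za-z0-9+.-]*://[^/]+)(.*)$'
    local -a args
    if [[ $url != *\?* ]]; then           # no query string: print unchanged
        printf '%s\n' "$url"
        return
    fi
    base=${url%%\?*}                      # everything before the first '?'
    query=${url#*\?}                      # everything after the first '?'
    if [[ $base =~ $re ]]; then           # host part is case-insensitive, path is not
        base=${BASH_REMATCH[1],,}${BASH_REMATCH[2]}
    fi
    # one argument per line, drop empties left by '&&' or a trailing '&', sort uniquely
    mapfile -t args < <(tr '&' '\n' <<< "$query" | sed '/^$/d' | LC_ALL=C sort -u)
    local IFS='&'                         # "${args[*]}" joins on the first IFS character
    printf '%s?%s\n' "$base" "${args[*]}"
}

canon_url 'http://Test2/Foo?foo&bar&&foo&'   # prints http://test2/Foo?bar&foo
```

Two URLs that differ only in host case, argument order, or repeated ampersands then normalize to the same string, so a plain sort -u can remove the duplicates.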
I named the program file tst and created the input files tst1 and urls.
$: cat tst1
http://test/foo?foo
http://test/foo?bar
http://test/foo?foo
http://test2/foo?foo
http://test2/foo?baz
http://test2/foo?foo&bar
http://test2/foo?baz
http://test/foo?foo&bar
http://test/foo?bar&foo
http://test2/foo?bar&foo
http://test3/foo?bar
http://test3/foo?foo&bar&baz
http://test2/foo?foo&bar&baz
http://test/foo?foo&bar&baz
$: ./tst tst1
http://test3/foo?foo&baz&bar
http://test2/foo?foo&baz&bar
http://test/foo?foo&baz&bar
$: cat urls
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm? upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123
$: ./tst urls
http://grouplogic.com:80/store/index.cfm?upTp=2&ptype=FS&prTpID=5&fa=upgrade&UpNewType=2
http://grouplogic.com:80/store/index.cfm?prTpID=5&id=532&fa=PrtSlt
http://grouplogic.com:80/store/index.cfm?fa=conre&cftoken=26157811&cfid=11812682
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/news-events/index.cfm?prod=2&fa=viewRelease&ID=21
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&archive=1&ProdID=1
http://grouplogic.com:80/content/index.cfm?foo=bar&ID=123
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&
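Note that the output is in case-sensitive ASCII sort order, with trailing and doubled/repeated ampersands cleaned up.
Doing this in perl, with its internal reads and sorts, is also a lot faster.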
real 0m0.170s
user 0m0.046s
sys 0m0.092s
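Old version
While you can at least eliminate redundant comparisons in the nested loops, I don't think there is anything more elegant than a brute-force double pass.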
mapfile -t lst < <( sort -ru x )                 # unique reverse sort once to eliminate simple dups
n=${#lst[@]}                                     # fixed bound: ${#lst[@]} shrinks as entries are unset
for (( ndx1=0; ndx1<n-1; ndx1++ ))               # walk thru once in outer loop
do  [[ -n "${lst[ndx1]}" ]] || continue          # ignore removed
    for (( ndx2=ndx1+1; ndx2<n; ndx2++ ))        # inner skips prev, no redux
    do  case "${lst[ndx1]}" in                   # case statement string match
          "${lst[ndx2]}"*) unset 'lst[ndx2]' ;;  # remove shorter versions
          *) continue 2 ;;                       # no match, skip ahead
        esac
    done
done
printf "%s\n" "${lst[@]}"                        # print out what's left
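I sort unique in reverse order to eliminate simple duplicates and set up the comparison, and store the result in an array to make the nested loops easy.
The outer loop walks the array once; it doesn't bother with the last record, because the inner loop will already have handled it. The inner loop starts at the record after the outer loop's current record; since the records are sorted, there is no need to check the earlier ones again.
Because the inner loop removes records, the outer loop first checks whether the record at its current index is empty and, if so, skips it entirely.
The case statement checks each record after the current one in the outer loop. If the inner record is a leading part of the current outer-loop record, the shorter version is removed from the array with unset, and the loop proceeds to the next record to check.
When the inner-loop record is no longer part of the outer-loop record, we know we have moved past the relevant records (because they are sorted), so we skip the now-pointless rest of the list and move on to the next outer record with continue 2.
This moving window of relevant records should keep wasted work to a minimum.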
Finally, it looks like I have it working. For this test file:
$ cat file2
test?foo
test?bar
test?foo
test2?foo
test2?baz
test2?foo&bar
test2?baz
test?foo&bar
test?bar&foo
test2?bar&foo
test3?bar
test3?foo&bar&baz
test2?foo&bar&baz
test?foo&bar&baz
Script
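#!/bin/bash
declare -A resources                         # resource (text before '?') -> its parameter strings
mapfile -t raw < <(sort -u "$1")             # unique sorted lines, read without word splitting or globbing
for url in "${raw[@]}"; do
    resources[${url%%\?*}]+=" ${url#*\?}"
done
for res in "${!resources[@]}"; do
    read -ra list <<< "${resources[$res]}"   # split the parameter strings on whitespace
    for i in "${!list[@]}"; do
        par=${list[i]}
        unset 'list[i]'                      # compare only against the entries that remain
        [[ "${list[*]}" == *"$par"* ]] || result+=("$res?$par")   # keep if no other entry contains it
    done
done
printf '%s\n' "${result[@]}"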
Result
$ ./test2 file2
test2?bar&foo
test2?foo&bar&baz
test3?foo&bar&baz
test?bar&foo
test?foo&bar&baz
For this test file:
$ cat file
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
Result
$ ./test2 file
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
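I see the following algorithm that could do this (unfortunately, I don't know how to implement it):
First, sort the file alphabetically.
Then read the file line by line, and if a line is a substring of the next line, do not put it in the result file.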
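The two steps above can be sketched in bash roughly as follows. This is a minimal illustration rather than a finished solution: drop_contained_lines is a hypothetical name, the sort uses byte order (LC_ALL=C) so that a line lands directly before the longer lines that start with it, and only neighbouring lines are compared.

```shell
#!/usr/bin/env bash
# Sketch only; drop_contained_lines is a hypothetical name.

# Reads lines from the named files (or stdin), sorts them uniquely in byte order,
# and prints every line that is not a substring of the line that follows it.
drop_contained_lines() {
    local prev='' line have_prev=0
    LC_ALL=C sort -u "$@" | {
        while IFS= read -r line; do
            # emit the previous line unless the current one contains it
            if (( have_prev )) && [[ $line != *"$prev"* ]]; then
                printf '%s\n' "$prev"
            fi
            prev=$line
            have_prev=1
        done
        if (( have_prev )); then
            printf '%s\n' "$prev"         # the last line has no successor, so it always stays
        fi
    }
}

# prints test?bar and test?foo&bar
printf '%s\n' 'test?foo' 'test?foo&bar' 'test?bar' | drop_contained_lines
```

Since this compares the text of neighbouring lines only, lines such as test?bar&foo and test?foo&bar both survive; the parameters would have to be put into a consistent order first for those to count as the same.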