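Understanding constraints in agrep fuzzy matching in R
This seems simple, but for some reason I don't understand how agrep behaves for fuzzy matching that involves substitutions. Specifying all=2 produces a match for two substitutions, as expected, but specifying substitutions=2 does not. Why is that?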
# Finds a match, as expected
agrep("abcdeX", "abcdef", value = TRUE, max.distance = list(sub = 1, ins = 0, del = 0))
#> [1] "abcdef"
# Finds no match, as expected
agrep("abcdXX", "abcdef", value = TRUE, max.distance = list(sub = 1, ins = 0, del = 0))
#> character(0)
# Finds a match, as expected
agrep("abcdXX", "abcdef", value = TRUE, max.distance = list(all = 2))
#> [1] "abcdef"
# Finds no match, UNEXPECTEDLY
agrep("abcdXX", "abcdef", value = TRUE, max.distance = list(sub = 2, ins = 0, del = 0))
#> character(0)
Created on 2021-06-03 by the reprex package (v2.0.0)
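Solution
all is an upper limit that always applies, regardless of the other max.distance controls (apart from costs). It defaults to 10% of the pattern length.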
# one character can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE, max.distance = list(sub = 2, ins = 0, del = 0, all = 0.1))
# character(0)
# two characters can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE, max.distance = list(sub = 2, ins = 0, del = 0, all = 0.2))
# [1] "abcdef"
# one substitution and one insertion allowed
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE, max.distance = list(sub = 1, ins = 1, del = 0, all = 0.2))
# [1] "abcdef"
There is one catch: all switches from fraction mode to integer mode at exactly 1.
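# fraction mode: all allows up to 8 edits
agrep(pattern = "abcdXXef", x = "abcdef", value = TRUE, max.distance = list(sub = 0, ins = 2, del = 0, all = 1 - 1e-9))
# [1] "abcdef"
# integer mode: all allows only 1 edit
agrep(pattern = "abcdXXef", x = "abcdef", value = TRUE, max.distance = list(sub = 0, ins = 2, del = 0, all = 1))
# character(0)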
When you suppress it by setting all to just below 1, the limits on the individual distance types apply.
# two substitutions allowed
agrep(pattern = "abcdXX", x = c("abcdef", "abcXdef", "abcefg"), value = TRUE, max.distance = list(sub = 2, ins = 0, del = 0, all = 1 - 1e-9))
# [1] "abcdef"
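The purpose of setting costs is to let you move through the mutation space at different rates in different directions. How you set them depends on your use case. For example, some language dialects may be more likely to add letters, so you might make one deletion cost as much as two insertions. By default, costs = NULL, which is equivalent to costs = c(ins = 1, del = 1, sub = 1).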
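As a rough illustration of asymmetric costs, the sketch below uses adist() from the utils package rather than agrep(), because adist() returns the weighted distance directly. The cost vector (a deletion costing twice as much as an insertion) and the example strings are assumptions chosen for illustration only.

```r
# Hypothetical cost scheme: one deletion costs as much as two insertions.
costs <- c(ins = 1, del = 2, sub = 1)

# adist() returns a matrix of generalized Levenshtein distances from x to y.
# "abc" -> "abcd" needs one insertion; "abcd" -> "abc" needs one deletion.
d_ins <- adist("abc", "abcd", costs = costs)[1, 1]
d_del <- adist("abcd", "abc", costs = costs)[1, 1]

d_ins
d_del
```

Under this scheme the deletion should come out as the more expensive edit, which is the kind of directional bias the costs argument is meant to express.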
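Edit: regarding your comment about why some patterns match and others don't, the 10% refers to the number of characters in the pattern, rounded up.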
agrep(pattern = "01234567XX89",x = "0123456789",del = 0))
# [1] "0123456789"
agrep(pattern = "01234567XX",del = 0))
# character(0)
num_mutations <- nchar(c("01234567XX89","01234567XX")) * 0.1
num_mutations
# [1] 1.2 1.0
ceiling(num_mutations)
# [1] 2 1
The second pattern has only 10 characters, so only one substitution is allowed.
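The rounding rule and the switch between fraction and integer mode can be put into a small helper. This is only a sketch: effective_all_cap is a hypothetical name, not part of base R, and it assumes the default unit costs.

```r
# Hypothetical helper (not part of base R): the overall edit cap implied by
# `all`, assuming the default unit costs.
effective_all_cap <- function(pattern, all = 0.1) {
  if (all < 1) {
    # Fraction mode: a share of the pattern length, rounded up.
    ceiling(all * nchar(pattern))
  } else {
    # Integer mode: `all` is already an absolute number of edits.
    all
  }
}

effective_all_cap("abcdXX")                    # default 10% of 6 characters
effective_all_cap("abcdXX", all = 0.2)         # 20% of 6 characters
effective_all_cap("abcdXXef", all = 1 - 1e-9)  # just below the mode switch
effective_all_cap("abcdXXef", all = 1)         # integer mode
effective_all_cap("01234567XX89")              # 12 characters
effective_all_cap("01234567XX")                # 10 characters
```

Checking the cap this way before calling agrep() makes it easier to see whether a failed match comes from all or from the sub, ins and del limits.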