Rename OptimalStringAlignement to OptimalStringAlignment #57

Merged
merged 1 commit into from
Dec 2, 2021
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ The available distances are:
- Hamming Distance `Hamming() <: SemiMetric`
- [Jaro and Jaro-Winkler Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()` `JaroWinkler() <: SemiMetric`
- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein() <: Metric`
- [Optimal String Alignement Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance) (a.k.a. restricted Damerau-Levenshtein) `OptimalStringAlignement() <: SemiMetric`
- [Optimal String Alignment Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance) (a.k.a. restricted Damerau-Levenshtein) `OptimalStringAlignment() <: SemiMetric`
- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions) `DamerauLevenshtein() <: Metric`
- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp() <: SemiMetric`
- Q-gram distances compare the set of all substrings of length `q` in each string (and which
@@ -72,7 +72,7 @@ The package also adds convenience functions to find elements in a iterator of st
findall(s, itr, dist; min_score = 0.8)
```

The functions `findnearest` and `findall` are particularly optimized for the `Levenshtein` and `OptimalStringAlignement` distances, as these algorithm can stop early if the distance becomes higher than a certain threshold.
The functions `findnearest` and `findall` are particularly optimized for the `Levenshtein` and `OptimalStringAlignment` distances, as these algorithm can stop early if the distance becomes higher than a certain threshold.
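
A minimal sketch of the early-stop idea described in this README line, assuming a plain two-row Levenshtein. `bounded_levenshtein` is a hypothetical helper written for illustration, not the package's optimized routine. It follows the convention noted in `src/distances/edit.jl` of returning `max_dist + 1` once the distance is known to exceed `max_dist`.

```julia
# Hypothetical helper (not part of StringDistances.jl): Levenshtein distance
# that gives up as soon as the result must exceed `max_dist`.
function bounded_levenshtein(s1, s2, max_dist::Integer)
    a, b = collect(s1), collect(s2)
    # The length difference is a lower bound on the distance.
    abs(length(a) - length(b)) > max_dist && return max_dist + 1
    prev = collect(0:length(b))          # distances from the empty prefix of a
    for i in eachindex(a)
        cur = similar(prev)
        cur[1] = i
        for j in eachindex(b)
            cost = a[i] == b[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1,   # deletion
                             cur[j] + 1,        # insertion
                             prev[j] + cost)    # substitution or match
        end
        # Row minima never decrease, so the final distance is at least minimum(cur).
        minimum(cur) > max_dist && return max_dist + 1
        prev = cur
    end
    return min(prev[end], max_dist + 1)
end

bounded_levenshtein("abcdef", "abcxyf", 2)   # within the bound: exact distance
bounded_levenshtein("abcdef", "uvwxyz", 2)   # bound exceeded: stops after three rows
```

A search over many candidates can pass its current best distance as `max_dist`, so most poor candidates are abandoned after a few rows instead of filling the whole table.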



16 changes: 8 additions & 8 deletions benchmark/benchmark.jl
@@ -21,9 +21,9 @@ end
# 0.33s
@time f(Levenshtein(), x, y, min_score = 0.8);
# 0.11
@time f(OptimalStringAlignement(), x, y);
@time f(OptimalStringAlignment(), x, y);
# 0.44s.
@time f(OptimalStringAlignement(), x, y, min_score = 0.8);
@time f(OptimalStringAlignment(), x, y, min_score = 0.8);
# 0.08
@time f(DamerauLevenshtein(), x, y);
# 0.8s
@@ -35,7 +35,7 @@ end

@time findnearest(x[1], y, Levenshtein());
# 0.1
@time findnearest(x[1], y, OptimalStringAlignement());
@time findnearest(x[1], y, OptimalStringAlignment());
# 0.1
@time findnearest(x[1], y, QGram(2));
# 0.75
@@ -44,17 +44,17 @@ end

@time findall(x[1], y, Levenshtein());
# 0.05
@time findall(x[1], y, OptimalStringAlignement());
@time findall(x[1], y, OptimalStringAlignment());
# 0.05
@time findall(x[1], y, Partial(OptimalStringAlignement()));
@time findall(x[1], y, Partial(OptimalStringAlignment()));
# 0.96
@time findall(x[1], y, QGram(2));
# 0.81
@time findall(x[1], y, TokenSort(OptimalStringAlignement()));
@time findall(x[1], y, TokenSort(OptimalStringAlignment()));
# 0.27 (now 0.32)
@time findall(x[1], y, TokenSet(OptimalStringAlignement()));
@time findall(x[1], y, TokenSet(OptimalStringAlignment()));
# 0.55
@time findall(x[1], y, TokenMax(OptimalStringAlignement()));
@time findall(x[1], y, TokenMax(OptimalStringAlignment()));
# 2.25 (now 3.6)


2 changes: 1 addition & 1 deletion src/StringDistances.jl
@@ -42,7 +42,7 @@ Hamming,
Jaro,
JaroWinkler,
Levenshtein,
OptimalStringAlignement,
OptimalStringAlignment,
DamerauLevenshtein,
RatcliffObershelp,
# Qgram distances
28 changes: 15 additions & 13 deletions src/distances/edit.jl
@@ -165,25 +165,25 @@ function (dist::Levenshtein)(s1, s2; max_dist::Union{Integer, Nothing} = nothing
end

"""
OptimalStringAlignement()
OptimalStringAlignment()

Creates the OptimalStringAlignement distance (also known ad the unrestricted DamerauLevenshtein distance).
Creates the OptimalStringAlignment distance (also known as the restricted DamerauLevenshtein distance).

It is the minimum number of operations (consisting of insertions,
deletions or substitutions of a single character, or transposition of two adjacent characters)
required to change one string into the other.
It is the minimum number of operations (consisting of insertions,
deletions or substitutions of a single character, or transposition of two adjacent characters)
required to change one string into the other.

The distance differs slightly from the Damerau-Levenshtein algorithm by imposing
the restriction that no substring is edited more than once. So for example, "CA" to "ABC" has an edit
distance of 2 by a complete application of Damerau-Levenshtein, but a distance of 3 by this method that
uses the optimal string alignment algorithm. In particular, the restricted distance does not satisfy
the triangle inequality.
The distance differs slightly from the Damerau-Levenshtein algorithm by imposing
the restriction that no substring is edited more than once. So for example, "CA" to "ABC" has an edit
distance of 2 by a complete application of Damerau-Levenshtein, but a distance of 3 by this method that
uses the optimal string alignment algorithm. In particular, the restricted distance does not satisfy
the triangle inequality.
"""
struct OptimalStringAlignement <: StringSemiMetric end
struct OptimalStringAlignment <: StringSemiMetric end
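
The "CA" to "ABC" example in the docstring can be checked with a textbook full-table version of the optimal string alignment recurrence. This is a sketch for illustration only: `osa_distance` is a hypothetical name, and the package's own method uses a different, optimized algorithm (see the link in the source comment).

```julia
# Hypothetical helper (not the package's implementation): textbook optimal
# string alignment distance using a full dynamic-programming table.
function osa_distance(s1, s2)
    a, b = collect(s1), collect(s2)
    m, n = length(a), length(b)
    # d[i + 1, j + 1] is the distance between the first i items of a
    # and the first j items of b.
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] .= 0:m
    d[1, :] .= 0:n
    for i in 1:m, j in 1:n
        cost = a[i] == b[j] ? 0 : 1
        d[i + 1, j + 1] = min(d[i, j + 1] + 1,   # deletion
                              d[i + 1, j] + 1,   # insertion
                              d[i, j] + cost)    # substitution or match
        # Adjacent transposition; the swapped pair cannot be edited again.
        if i > 1 && j > 1 && a[i] == b[j - 1] && a[i - 1] == b[j]
            d[i + 1, j + 1] = min(d[i + 1, j + 1], d[i - 1, j - 1] + 1)
        end
    end
    return d[m + 1, n + 1]
end

osa_distance("CA", "ABC")   # 3, as stated in the docstring
# Triangle inequality fails: going through "AC" costs 1 + 1 = 2 < 3.
osa_distance("CA", "AC") + osa_distance("AC", "ABC")
```

The second call shows why the type is a `StringSemiMetric`: the route "CA" to "AC" to "ABC" costs 2, yet the direct distance is 3.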

## http://blog.softwx.net/2015/01/optimizing-damerau-levenshtein_15.html
# Return max_dist + 1 if distance higher than max_dist
function (dist::OptimalStringAlignement)(s1, s2; max_dist::Union{Integer, Nothing} = nothing)
function (dist::OptimalStringAlignment)(s1, s2; max_dist::Union{Integer, Nothing} = nothing)
(s1 === missing) | (s2 === missing) && return missing
len1, len2 = length(s1), length(s2)
if len1 > len2
@@ -246,6 +246,8 @@ function (dist::OptimalStringAlignement)(s1, s2; max_dist::Union{Integer, Nothin
return Int(current)
end

Base.@deprecate_binding OptimalStringAlignement OptimalStringAlignment
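
A small sketch of what this line is intended to achieve: the misspelled name stays available as a deprecated alias of the new one, so existing user code keeps working. `Demo`, `NewName`, and `OldName` are hypothetical names; `Base.@deprecate_binding` is an internal Base macro rather than documented public API, so its exact behaviour may differ between Julia versions.

```julia
# Hypothetical module illustrating a deprecated alias for a renamed type.
module Demo
    struct NewName end
    # Keep the old spelling usable, but flag it as deprecated.
    Base.@deprecate_binding OldName NewName
end

# Both names refer to the same type; with `--depwarn=yes` the old one
# additionally prints a deprecation warning on use.
Demo.OldName === Demo.NewName
```

Running Julia with `--depwarn=error` turns any use of the old name into an error, which can help when locating remaining call sites.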

"""
DamerauLevenshtein()

@@ -370,4 +372,4 @@ function longest_common_pattern!(p, s1, s2, start1, start2, end1, end2)
end
end
return j1, j2, len
end
end
4 changes: 2 additions & 2 deletions src/normalize.jl
@@ -38,7 +38,7 @@ function (dist::Normalized{<:Union{Hamming, DamerauLevenshtein}})(s1, s2; max_di
return out
end

function (dist::Normalized{<:Union{Levenshtein, OptimalStringAlignement}})(s1, s2; max_dist = 1.0)
function (dist::Normalized{<:Union{Levenshtein, OptimalStringAlignment}})(s1, s2; max_dist = 1.0)
(s1 === missing) | (s2 === missing) && return missing
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
@@ -66,4 +66,4 @@ function (dist::Normalized{<:AbstractQGramDistance})(s1, s2; max_dist = 1.0)
end
max_dist !== nothing && out > max_dist && return 1.0
return out
end
end
48 changes: 24 additions & 24 deletions test/distances.jl
@@ -41,28 +41,28 @@ using StringDistances, Unicode, Test, Random
@test ismissing(Levenshtein()("", missing))
end

@testset "OptimalStringAlignement" begin
@test OptimalStringAlignement()("", "") == 0
@test OptimalStringAlignement()("abc", "") == 3
@test OptimalStringAlignement()("bc", "abc") == 1
@test OptimalStringAlignement()("fuor", "four") == 1
@test OptimalStringAlignement()("abcd", "acb") == 2
@test OptimalStringAlignement()("cape sand recycling ", "edith ann graham") == 17
@test OptimalStringAlignement()("jellyifhs", "jellyfish") == 2
@test OptimalStringAlignement()("ifhs", "fish") == 2
@test OptimalStringAlignement()("a cat", "an act") == 2
@test OptimalStringAlignement()("a cat", "an abct") == 4
@test OptimalStringAlignement()("a cat", "a tc") == 3
@test OptimalStringAlignement()("abcdef", "abcxyf") == 2
@test OptimalStringAlignement()("abcdef", "abcxyf"; max_dist = 2) == 2
@testset "OptimalStringAlignment" begin
@test OptimalStringAlignment()("", "") == 0
@test OptimalStringAlignment()("abc", "") == 3
@test OptimalStringAlignment()("bc", "abc") == 1
@test OptimalStringAlignment()("fuor", "four") == 1
@test OptimalStringAlignment()("abcd", "acb") == 2
@test OptimalStringAlignment()("cape sand recycling ", "edith ann graham") == 17
@test OptimalStringAlignment()("jellyifhs", "jellyfish") == 2
@test OptimalStringAlignment()("ifhs", "fish") == 2
@test OptimalStringAlignment()("a cat", "an act") == 2
@test OptimalStringAlignment()("a cat", "an abct") == 4
@test OptimalStringAlignment()("a cat", "a tc") == 3
@test OptimalStringAlignment()("abcdef", "abcxyf") == 2
@test OptimalStringAlignment()("abcdef", "abcxyf"; max_dist = 2) == 2
prefix = "my_prefix"
@test OptimalStringAlignement()(prefix * "alborgów", prefix * "amoniak") == OptimalStringAlignement()("alborgów", "amoniak")
@test OptimalStringAlignement()([1, 2, 3], [1,2, 4]) == 1
@test OptimalStringAlignement()(graphemes("alborgów"), graphemes("amoniak")) == OptimalStringAlignement()("alborgów", "amoniak")
@test OptimalStringAlignement()("bc", "abc") == 1
@test result_type(OptimalStringAlignement(), "hello", "world") == Int
@inferred OptimalStringAlignement()("", "")
@test ismissing(OptimalStringAlignement()("", missing))
@test OptimalStringAlignment()(prefix * "alborgów", prefix * "amoniak") == OptimalStringAlignment()("alborgów", "amoniak")
@test OptimalStringAlignment()([1, 2, 3], [1,2, 4]) == 1
@test OptimalStringAlignment()(graphemes("alborgów"), graphemes("amoniak")) == OptimalStringAlignment()("alborgów", "amoniak")
@test OptimalStringAlignment()("bc", "abc") == 1
@test result_type(OptimalStringAlignment(), "hello", "world") == Int
@inferred OptimalStringAlignment()("", "")
@test ismissing(OptimalStringAlignment()("", missing))
end

@testset "DamerauLevenshtein" begin
@@ -316,7 +316,7 @@ using StringDistances, Unicode, Test, Random
]

solutions = ((Levenshtein(), [2 2 4 1 3 0 3 2 3 3 4 6 17 3 3 2]),
(OptimalStringAlignement(), [1 2 4 1 3 0 3 2 3 3 4 6 17 2 2 2]),
(OptimalStringAlignment(), [1 2 4 1 3 0 3 2 3 3 4 6 17 2 2 2]),
(Jaro(), [0.05555556 0.17777778 0.23333333 0.04166667 1.00000000 0.00000000 1.00000000 0.44444444 0.25396825 0.2805556 0.2285714 0.48809524 0.3916667 0.07407407 0.16666667 0.21666667]),
(QGram(1), [0 3 3 1 3 0 6 4 5 4 4 11 14 0 0 3]),
(QGram(2), [ 6 7 7 1 2 0 4 4 7 8 4 13 32 8 6 5]),
@@ -344,8 +344,8 @@ using StringDistances, Unicode, Test, Random
for i in eachindex(strings)
d = Levenshtein()(strings[i]...)
@test Levenshtein()(strings[i]...; max_dist = d) == d
d = OptimalStringAlignement()(strings[i]...)
@test OptimalStringAlignement()(strings[i]...; max_dist = d) == d
d = OptimalStringAlignment()(strings[i]...)
@test OptimalStringAlignment()(strings[i]...; max_dist = d) == d
end
end

10 changes: 5 additions & 5 deletions test/modifiers.jl
@@ -60,9 +60,9 @@ end
#Levenshtein
compare("aüa", "aua", Levenshtein())
@test compare("ok", missing, Levenshtein()) === missing
compare("aüa", "aua", OptimalStringAlignement())
@test StringDistances.Normalized(Partial(OptimalStringAlignement()))("ab", "cde") == 1.0
@test compare("ab", "de", Partial(OptimalStringAlignement())) == 0
compare("aüa", "aua", OptimalStringAlignment())
@test StringDistances.Normalized(Partial(OptimalStringAlignment()))("ab", "cde") == 1.0
@test compare("ab", "de", Partial(OptimalStringAlignment())) == 0

# RatcliffObershelp
@test compare("New York Mets vs Atlanta Braves", "", RatcliffObershelp()) ≈ 0.0
@@ -115,7 +115,7 @@ end
("ifhs", "fish"),
("leia", "leela"),
]
for dist in (Levenshtein, OptimalStringAlignement)
for dist in (Levenshtein, OptimalStringAlignment)
for i in eachindex(strings)
if compare(strings[i]..., dist()) < 1 / 3
@test compare(strings[i]..., dist() ; min_score = 1/ 3) ≈ 0.0
@@ -150,4 +150,4 @@ end
@test findall("New York", skipmissing(["NewYork", "Newark", missing]), Levenshtein()) == [1]
@test findall("New York", skipmissing(Union{AbstractString, Missing}[missing, missing]), Levenshtein()) == []
end
end
end
4 changes: 2 additions & 2 deletions test/pairwise.jl
@@ -7,7 +7,7 @@ using StringDistances, Unicode, Test, Random
TestStrings1missing = ["", "abc", "bc", missing]
TestStrings2missing = ["mew", missing]

for d in [Jaro(), Levenshtein(), OptimalStringAlignement(), RatcliffObershelp(),
for d in [Jaro(), Levenshtein(), OptimalStringAlignment(), RatcliffObershelp(),
QGram(2), Cosine(2), Jaccard(2), SorensenDice(2), Overlap(2)]

R = pairwise(d, TestStrings1)
@@ -82,4 +82,4 @@ using StringDistances, Unicode, Test, Random
R5 = pairwise(d, TestStrings1missing; preprocess = true)
@test eltype(R5) == Union{result_type(d, String, String), Missing}
end
end
end