Fast and Easy Levenshtein distance using a Trie

RSS Bot@lemmy.bestiver.se · 2 months ago

Fast and Easy Levenshtein distance using a Trie

setsubyou@lemmy.world · 2 months ago

I know it’s 15 years old, but I’m not buying that the trie as such is what improves the performance.

traversing a trie built from a sorted list yields the nodes in the original sorted order, so you’re not optimizing the order
you do have additional pre-computed information in the trie, but it’s basically just the common prefix of the current and previous word, which can be trivially obtained without the trie
subtree pruning basically just translates to skipping until the prefix changes at a shallower level

Basically, yes storing the data in a trie persistently will improve performance (scanning all words is still slow, but not that slow because you are skipping most early and can do binary search too, while the trie approach does a lot of python dict lookups), but the main performance gain comes from the trie forcing him to use properties that the sorted input also has but he just wasn’t using in his naive implementation (specifically, shared prefixes and subtree pruning).

He’s also leaving performance on the table at trie build time, which is not part of the benchmark. He’s building the trie in the straight-forward way, but for sorted input it could be built in O(n).