A database I am working on has a lot of records (40M+) and the guys mining it are currently using python to process it. This is very powerful (nltk) but also quite slow. One of the guys has found that you can do fuzzy string matching straight in Postgres and the feature is quite advanced!
- fuzzystrmatch is a PostgreSQL module providing functions to determine similarities and distances between strings.
- pg_trgm is a PostgreSQL module to determine the similarity of ASCII alphanumeric text based on trigram matching. This module offers indexing options too :)
- pg_similarity is a third party module to support similarity queries on PostgreSQL
The first two can be installed with a simple CREATE EXTENSION command. The last module.. you need to get it, compile it and install it yourself.
- Module import
- PostgreSQL documentation for pg_tgrm and fuzzystrmatch
- The Natural Language Toolkit (python)
- Pg_similarity module
A little experiment: If you find this post and ad below useful, please check the ad out :-)