Fuzzy search #1
Comments
Hi Daniel,

That sounds like useful functionality, though I can't say it would be high on my to-do list. I would have to become more familiar with Elasticsearch first. It would also require support for legacy indexing, which I think I should add anyway. In the meantime, you can do legacy indexing yourself using RCurl and the directions here: http://docs.neo4j.org/chunked/stable/rest-api-indexes.html

Keep me updated on any progress you make. It sounds very interesting and useful, but I won't have time in the near future to implement something like that. I should be able to add legacy indexing pretty soon, though.

Nicole
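For readers who want to try the RCurl route mentioned above, here is a minimal sketch of creating a legacy node index over REST. The server URL, the endpoint path, and the JSON layout are assumptions based on the Neo4j 2.x legacy index REST API as I remember it, so check them against the linked documentation. The helper names `index_payload` and `create_legacy_index` are hypothetical and not part of RNeo4j.

```r
# Assumed endpoint of a local Neo4j 2.x server (not taken from this thread).
neo4j_index_url <- "http://localhost:7474/db/data/index/node"

# Build the JSON body describing the index to create.
# type is either "exact" or "fulltext".
index_payload <- function(index_name, type = c("exact", "fulltext")) {
  type <- match.arg(type)
  sprintf('{"name": "%s", "config": {"type": "%s", "provider": "lucene"}}',
          index_name, type)
}

# Send the request. This only runs when called, and it needs the RCurl
# package plus a reachable Neo4j server.
create_legacy_index <- function(index_name, type = "exact",
                                url = neo4j_index_url) {
  if (!requireNamespace("RCurl", quietly = TRUE)) stop("RCurl is required")
  reader <- RCurl::basicTextGatherer()
  RCurl::curlPerform(
    url = url,
    postfields = index_payload(index_name, type),
    httpheader = c("Content-Type" = "application/json",
                   Accept = "application/json"),
    writefunction = reader$update
  )
  reader$value()
}
```

Calling `create_legacy_index("my_index", "fulltext")` against a running server should return the server's JSON response as a string; the payload builder can be inspected on its own without any server.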
Sure, I would pick that challenge up myself if my curl skills were up to par.
I managed to get the fuzzy search going. Here is a rough working prototype (the code is unclean and should be generalized further):

searchNodes <- function(graph, pattern, label, fuzzy = TRUE) {
  # Convenience function that sets the Lucene similarity factor from the
  # length of each keyword. Longer strings get a lower factor, i.e. a
  # higher tolerance for mismatches.
  fuzzy_factor <- function(x) {
    above100 <- 10 * 10^(1:5)
    breaks <- c(0, 6, 10, 15, above100)
    factors <- c(0.7, 0.6, 0.5, round(0.6 - 0.1 * log10(above100), 1))
    f <- cut(nchar(x), breaks = breaks, labels = factors)
    as.numeric(levels(f))[f]
  }

  # Node properties to search. They must be covered by the fulltext auto index.
  fields <- c("name", "name_long", "name_short", "name_official",
              "name_common", "bbg", "aliases")

  # Split the pattern on punctuation and whitespace, and keep only keywords
  # longer than three characters.
  keywords <- unlist(strsplit(pattern, "[[:punct:]]|[[:space:]]", perl = TRUE))
  keywords <- keywords[nchar(keywords) > 3]

  # Append the fuzzy operator only when fuzzy matching is requested.
  if (fuzzy) {
    keywords <- paste0(keywords, "~", fuzzy_factor(keywords))
  }
  terms <- paste(keywords, collapse = " AND ")
  lucene <- paste0(fields, ":(", terms, ")", collapse = " OR ")

  query <- sprintf("START n=node:node_auto_index('%s')
                    WHERE (n:%s)
                    RETURN n", lucene, label)
  getNodes(graph, query)
}
> searchNodes(graph, pattern="worlt", label="Geography")
[[1]]
Labels: OPERA Geography
$un_m.49
[1] "001"
$name
[1] "World"
$name_OPERA
[1] "Global"

This requires setting up a fulltext-type auto index manually via REST beforehand.
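To make the query construction easier to inspect without a database, the Lucene string can be built by a small standalone function. This is a simplified sketch of the idea in the prototype above, not the prototype itself: the names `similarity_for` and `build_lucene_query` are hypothetical, and the length thresholds are cut off at 0.4 for anything longer than 15 characters.

```r
# Map keyword length to a Lucene similarity factor (lower = more tolerant).
similarity_for <- function(keyword) {
  n <- nchar(keyword)
  ifelse(n <= 6, 0.7, ifelse(n <= 10, 0.6, ifelse(n <= 15, 0.5, 0.4)))
}

# Build a Lucene query that requires every keyword (AND) to fuzzily match
# within at least one of the given fields (OR).
build_lucene_query <- function(pattern, fields, min_chars = 4) {
  keywords <- unlist(strsplit(pattern, "[[:punct:][:space:]]+"))
  keywords <- keywords[nchar(keywords) >= min_chars]
  if (length(keywords) == 0) stop("no keyword is long enough to search for")
  terms <- paste0(keywords, "~", similarity_for(keywords), collapse = " AND ")
  paste0(fields, ":(", terms, ")", collapse = " OR ")
}

build_lucene_query("worlt", c("name", "name_long"))
# "name:(worlt~0.7) OR name_long:(worlt~0.7)"
```

Keeping the string building separate from the Cypher call makes it possible to check the generated query before sending it to the server.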
Daniel, that is awesome work. When I find some time I'll start playing around with it. Maybe we can do a pull request after some polishing.
The user interface could be generalized even further by dispatching from the existing generic:

#' Usual use. This is the default exact match on the pattern.
getUniqueNode(graph, "MyLabel", name = "pattern")

#' This is a fuzzy match on the "pattern" string with similarity factor 0.2 or better.
#' It retrieves the single most similar match (as ranked by the distance measure).
#' `~` is R's formula operator, which happily coincides with Lucene's fuzzy match operator.
#' If the function detects a formula passed to a property via dots, it dispatches to the fuzzy search function.
getUniqueNode(graph, "MyLabel", name = "pattern" ~ 0.2)
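The formula detection described above can be sketched in plain R. This is only an illustration of the dispatch idea under my own assumptions; `parse_match_args` is a hypothetical helper and not an RNeo4j function. A two-sided formula such as `"pattern" ~ 0.2` carries the search string on its left-hand side and the similarity factor on its right-hand side.

```r
# Inspect the properties passed via dots and flag those given as formulas.
parse_match_args <- function(...) {
  props <- list(...)
  lapply(props, function(p) {
    if (inherits(p, "formula")) {
      # A two-sided formula has three elements: `~`, lhs, rhs.
      if (length(p) != 3) stop("expected a two-sided formula: value ~ similarity")
      list(value = p[[2]], fuzzy = TRUE, similarity = p[[3]])
    } else {
      list(value = p, fuzzy = FALSE, similarity = NA_real_)
    }
  })
}

args <- parse_match_args(name = "pattern" ~ 0.2, code = "001")
args$name$fuzzy       # TRUE, so this property would go to the fuzzy search
args$name$similarity  # 0.2
args$code$fuzzy       # FALSE, so this property keeps the exact match
```

A generic could then route to the fuzzy search function whenever any property comes back with `fuzzy = TRUE`, and fall through to the existing exact match otherwise.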
This would be better as a pull request.
Hi Nicole,
just curious: do you intend to develop a fuzzy full-text search function that retrieves a list of node matches in descending order based on some distance measure?
http://linkurio.us/ offers such a search on top of Neo4j, using Elasticsearch. There seems to be an elasticsearch package for R now. Maybe integrating that into RNeo4j would be worth considering.
Fuzzy search is quite a common exploration use case.
Thanks, Daniel
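As an illustration of the kind of ranking asked about here, the sketch below orders candidate strings by edit distance to the search pattern, using `adist` from base R's utils package. It works on a plain character vector rather than on Neo4j nodes, and `rank_matches` is a hypothetical name chosen for this example.

```r
# Rank candidate strings by generalized Levenshtein distance to the pattern,
# closest match first. Case is ignored.
rank_matches <- function(pattern, candidates, max_distance = Inf) {
  d <- as.vector(utils::adist(pattern, candidates, ignore.case = TRUE))
  keep <- d <= max_distance
  out <- data.frame(candidate = candidates[keep], distance = d[keep],
                    stringsAsFactors = FALSE)
  out[order(out$distance), , drop = FALSE]
}

ranked <- rank_matches("worlt", c("Europe", "Word", "World"))
ranked$candidate
# "World" "Word" "Europe"
```

In a graph setting the candidates would be property values pulled from the matching nodes, with the ranking applied on the R side after retrieval.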