This repository has been archived by the owner on May 23, 2019. It is now read-only.

Fuzzy search #1

Closed
danielkrizian opened this issue Jul 4, 2014 · 6 comments

Comments

@danielkrizian

Hi Nicole,
just curious: do you intend to develop a fuzzy full-text search function that returns a list of matching nodes, ranked from best to worst match by some distance measure?
http://linkurio.us/ uses such a search on top of Neo4j, built on elasticsearch. There now seems to be an elasticsearch package for R, so integrating it into RNeo4j might be worth considering.
Fuzzy search is quite a common exploration use case.
Thanks, Daniel

@nicolewhite
Owner

Hi Daniel,

That sounds like useful functionality, though I can't say it would be high on my to-do list. I would have to become more familiar with elasticsearch. This would also require support for legacy indexing, which is something I think I should add anyway. You can go ahead and do legacy indexing yourself using RCurl and the directions here: http://docs.neo4j.org/chunked/stable/rest-api-indexes.html

Keep me updated on any progress you make with this. It sounds very interesting and useful but I won't have the time in the near future to implement something like that. But I should be able to add legacy indexing pretty soon.

Nicole

@danielkrizian
Author

Sure, I would pick that challenge up myself if my curl skills were up to par.
In the meantime, I've just come across two useful links showing curl queries, for any reader volunteering to implement this:
http://www.sinking.in/blog/seven-databases-neo4j-and-misunderstanding-indexes/
http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/

@danielkrizian
Author

I managed to get the fuzzy search going. Here is a rough working prototype (the code is unclean and should be generalized further):

searchNodes <- function(graph, pattern, label, fuzzy = TRUE) {
  # Pick a Lucene similarity factor from the keyword length.
  # Longer strings get a lower factor, i.e. a higher tolerance.
  fuzzy_factor <- function(x) {
    above100 <- 10 * 10^(1:5)
    breaks <- c(0, 6, 10, 15, above100)
    factors <- c(0.7, 0.6, 0.5, 0.6 - 0.1 * log10(above100))
    f <- cut(nchar(x), breaks = breaks, labels = factors)
    as.numeric(levels(f))[f]
  }

  # Indexed properties to search (specific to this graph).
  fields <- c("name", "name_long", "name_short", "name_official",
              "name_common", "bbg", "aliases")

  # Split the pattern into keywords and drop those of 3 characters or fewer.
  keywords <- unlist(strsplit(pattern, "[[:punct:]]|[[:space:]]", perl = TRUE))
  keywords <- keywords[nchar(keywords) > 3]
  if (length(keywords) == 0) stop("pattern contains no keyword longer than 3 characters")

  # Append Lucene's fuzzy operator `~` unless an exact match was requested.
  if (fuzzy) keywords <- paste0(keywords, "~", fuzzy_factor(keywords))
  terms <- paste(keywords, collapse = " AND ")

  # Every keyword must match within a field; any one field may match.
  lucene <- paste0(fields, ":(", terms, ")", collapse = " OR ")
  query <- sprintf("START n=node:node_auto_index('%s')
                WHERE (n:%s)
                RETURN n", lucene, label)

  getNodes(graph, query)
}


> searchNodes(graph, pattern="worlt", label="Geography")
[[1]]
Labels: OPERA Geography

$un_m.49
[1] "001"

$name
[1] "World"

$name_OPERA
[1] "Global"

This requires a fulltext-type node_auto_index to be set up manually via REST beforehand.
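
A rough sketch of what that setup step might look like from R, using RCurl as suggested earlier in the thread. The endpoint (`/index/node`), the JSON payload, and the default `base_url` are assumptions based on the Neo4j 2.x legacy index REST API and should be checked against the documentation linked above. Both function names are hypothetical, and auto-indexing presumably also has to be enabled in the server configuration.

```r
# Sketch only: endpoint and payload are assumed from the Neo4j 2.x legacy
# index REST API; verify them against the server documentation before use.
fulltext_index_payload <- function(index_name = "node_auto_index") {
  sprintf('{"name": "%s", "config": {"type": "fulltext", "provider": "lucene"}}',
          index_name)
}

# Hypothetical wrapper: needs the RCurl package and a running server,
# so it is defined here but not called.
create_fulltext_index <- function(base_url = "http://localhost:7474/db/data") {
  RCurl::getURL(
    paste0(base_url, "/index/node"),
    customrequest = "POST",
    httpheader = c("Content-Type" = "application/json",
                   Accept = "application/json"),
    postfields = fulltext_index_payload()
  )
}

# Inspect the request body without touching the network.
cat(fulltext_index_payload(), "\n")
```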

@nicolewhite
Owner

Daniel, that is awesome work. When I find some time I'll start playing around with it. Maybe we can do a pull request after some polishing.

@danielkrizian
Author

The user interface could be generalized even further by dispatching from the existing generic getUniqueNode. The above searchNodes could then be an unexported internal function.

#' Usual use. This is the default exact match on the pattern
getUniqueNode(graph, "MyLabel", name="pattern") 

#' This is a fuzzy match on the "pattern" string with a similarity factor of 0.2 or better.
#' It retrieves the single most similar match (as ranked by the distance measure).
#' `~` is R's formula operator, which happily coincides with Lucene's fuzzy match operator.
#' If the function detects a formula passed to a property via the dots, it dispatches to the fuzzy search function.
getUniqueNode(graph, "MyLabel", name="pattern" ~0.2)  
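
The formula detection proposed here could be sketched roughly as follows. This is only an illustration of the idea, not RNeo4j's actual getUniqueNode, and `parse_property` and `find_node_sketch` are hypothetical names.

```r
# Hypothetical helper: classify one property value as exact or fuzzy.
parse_property <- function(value) {
  if (inherits(value, "formula") && length(value) == 3) {
    # Two-sided formula: left side is the pattern, right side the similarity.
    env <- environment(value)
    list(pattern = eval(value[[2]], env),
         similarity = eval(value[[3]], env),
         fuzzy = TRUE)
  } else {
    # Anything else is treated as an exact-match value.
    list(pattern = value, similarity = NA_real_, fuzzy = FALSE)
  }
}

# Hypothetical stand-in for the generic: inspects properties passed via dots.
find_node_sketch <- function(label, ...) {
  lapply(list(...), parse_property)
}

str(find_node_sketch("MyLabel", name = "pattern" ~ 0.2))
str(find_node_sketch("MyLabel", name = "pattern"))
```

A real implementation would then route the fuzzy properties to the search function and the rest to the existing exact-match lookup.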

@nicolewhite nicolewhite reopened this Oct 7, 2014
@nicolewhite
Owner

This would be better as a pull request.
