Commit

Updated README.md
zgornel committed Oct 23, 2018
1 parent 930e7c7 commit 6b24ae5
Showing 1 changed file with 17 additions and 1 deletion.
18 changes: 17 additions & 1 deletion README.md
@@ -33,7 +33,7 @@ julia> conceptnet = load_embeddings(file_conceptnet, languages=:en)
# ConceptNet{Languages.English} (compressed): 1 language(s), 150875 embeddings

julia> conceptnet["apple"] # Get embeddings for a single word
# 300×1 Array{Int8,2}:
# 300-element Array{Int8,1}:
# 0
# 0
# 1
@@ -98,11 +98,27 @@ julia> # `keys` returns an iterator for all words
# couvents
```

Document embedding is quite straightforward:
```julia
julia> doc = "embed this document containing X_#-s231 which cannot be embedded"
edoc, idxs_missed = embed_document(conceptnet, doc, language=Languages.English(), keep_size=false)
missed_words = tokenize_for_conceptnet(doc)[idxs_missed]
println("Missed word: $missed_words")
edoc
# Missed word: SubString{String}["X_#-s231"]
# 300×8 Array{Int8,2}:
# 0 0 0 0 0 1 0 0
# -1 -2 -1 -1 -3 -2 -3 0
# 1 5 0 4 6 6 6 2
# ...
```


## Remarks

- fast for retrieving embeddings of exact matches
- fast for retrieving embeddings of wildcard matches (`xyzabcish` is matched to `######ish`)
- fast document embedding
- if neither exact nor wildcard matches exist, retrieval can be based on string distances (slow, see `src/search.jl`)
- for another package handling word embeddings, check out [Embeddings.jl](https://github.com/JuliaText/Embeddings.jl)
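
The lookup order described in the remarks above (exact match first, then wildcard match) can be illustrated with a small sketch. This is not the package's implementation: the `vocab` dictionary, `wildcard_match` and `lookup` are hypothetical names, and treating `#` as "any single character" is an assumption made for illustration. The linear scan also says nothing about how the package achieves its speed.

```julia
# Hypothetical toy vocabulary mapping keys to embedding ids (not the package's data structure).
vocab = Dict("apple" => 1, "######ish" => 2)

# Assumption for this sketch: a '#' in a key matches any single character.
function wildcard_match(pattern::AbstractString, word::AbstractString)
    length(pattern) == length(word) || return false
    return all(p == '#' || p == w for (p, w) in zip(pattern, word))
end

# Try an exact match first, then fall back to wildcard keys.
function lookup(vocab, word)
    haskey(vocab, word) && return vocab[word]
    for (key, value) in vocab
        occursin('#', key) && wildcard_match(key, word) && return value
    end
    return nothing  # a caller could fall back to string distances here
end

lookup(vocab, "apple")      # exact match
lookup(vocab, "xyzabcish")  # matched through "######ish"
```

A word that matches neither way returns `nothing`, which is where a slower string-distance search would take over.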
