Merge pull request #3 from zgornel/improved_word_indexing
Improved word indexing
Corneliu Cofaru authored Sep 30, 2018
2 parents b2b2d89 + ea440d1 commit 8f74013
Showing 8 changed files with 251 additions and 99 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "ConceptnetNumberbatch"
uuid = "2d1d9008-b762-11e8-11f1-375fdd6dca71"
authors = ["Corneliu Cofaru <cornel@oxoaresearch.com>"]
version = "0.0.1"
version = "0.1.0"

[deps]
CodecZlib = "944b1d66-785c-5afd-91f1-9de20f533193"
108 changes: 91 additions & 17 deletions README.md
@@ -10,23 +10,101 @@ A Julia interface to [ConceptNetNumberbatch](https://github.com/commonsense/con

## Introduction

This package is a simple API to *ConceptNetNumberbatch*.
This package is a simple API to **ConceptNetNumberbatch**.


## Documentation

There is little documentation available however these examples illustrate some common usage patterns:
- TO DO: Add usage examples
## Documentation

The following examples illustrate some common usage patterns:

```julia
julia> using ConceptnetNumberbatch, Languages

julia> file_conceptnet = download_embeddings(url=CONCEPTNET_HDF5_LINK,
                                             localfile="./_conceptnet_/conceptnet.h5");
# [ Info: Download ConceptNetNumberbatch to ./_conceptnet_/conceptnet.h5...
# % Total % Received % Xferd Average Speed Time Time Time Current
# Dload Upload Total Spent Left Speed
# 100 127M 100 127M 0 0 3646k 0 0:00:35 0:00:35 --:--:-- 4107k
# "./_conceptnet_/conceptnet.h5"

# Load embeddings
julia> conceptnet = load_embeddings(file_conceptnet, languages=:en)
# ConceptNet{Languages.English} (compressed): 1 language(s), 150875 embeddings

julia> conceptnet["apple"] # Get embeddings for a single word
# 300×1 Array{Int8,2}:
# 0
# 0
# 1
# -4
# ...

julia> conceptnet[["apple", "pear", "cherry"]] # Get embeddings for multiple words
# 300×3 Array{Int8,2}:
# 0 0 0
# 0 0 0
# 1 1 1
# -4 -6 -7
# ...
```

```julia
# Load multiple languages
julia> conceptnet = load_embeddings(file_conceptnet, languages=[:en, :fr])
# ConceptNet{Language} (compressed): 2 language(s), 174184 embeddings

julia> conceptnet["apple"] # fails, language must be specified
# ERROR: ...

julia> [conceptnet[:en, "apple"] conceptnet[:fr, "poire"]]
# 300×2 Array{Int8,2}:
# 0 -2
# 0 -2
# 1 -2
# -4 -7
# ...

# Wildcard matching
julia> conceptnet[:en, "xxyyzish"] # returns embedding for "#####ish"
# 300×1 Array{Int8,2}:
# 5
# -1
# 0
# 1
# ...
```

```julia
# Useful functions
julia> length(conceptnet) # total number of embeddings for all languages
# 174184

julia> size(conceptnet) # embedding vector length, number of embeddings
# (300, 174184)

julia> "apple" in conceptnet # found in the English embeddings
# true

julia> "poire" in conceptnet # found in the French embeddings
# true

julia> # `keys` returns an iterator for all words
for word in Iterators.take(keys(conceptnet),3)
println(word)
end
# définie
# invités
# couvents
```


## Remarks

- pretty fast for retrieving an existing word
- slow for retrieving a mismatch
- could be wrong for mismatches
- retrieval is based on string distances
- it is not possible to retrieve embeddings from multiple distinct languages at the same time (in a single indexing operation)
- decreasing the vocabulary size based on language (i.e. detect the language of the text before searching) may increase performance significantly at the cost of more mismatches for rare words
- fast for retrieving embeddings of exact matches
- fast for retrieving embeddings of wildcard matches (`xyzabcish` is matched to `######ish`)
- if neither exact nor wildcard matches exist, retrieval can fall back to string distances (slow, see `src/search.jl`); a toy sketch of this idea follows below
- for another package handling word embeddings, check out [Embeddings.jl](https://github.com/JuliaText/Embeddings.jl)
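
As a rough illustration of that string-distance fallback, here is a minimal sketch in plain Julia. It is not the package's implementation (that lives in `src/search.jl` and relies on StringDistances.jl); `levenshtein`, `fuzzy_lookup` and the `embeddings` dictionary below are made up for the example:

```julia
# Toy sketch of a string-distance fallback: return the embedding of the stored
# word closest to `query` by Levenshtein distance. Illustration only.
function levenshtein(a::AbstractString, b::AbstractString)
    lb = length(b)
    prev = collect(0:lb)            # distances from the empty prefix of `a`
    curr = similar(prev)
    for (i, ca) in enumerate(a)
        curr[1] = i
        for (j, cb) in enumerate(b)
            cost = ca == cb ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution
        end
        prev, curr = curr, prev
    end
    return prev[lb + 1]
end

function fuzzy_lookup(embeddings::Dict{String, <:AbstractVector}, query::String)
    best_word, best_dist = "", typemax(Int)
    for word in keys(embeddings)
        d = levenshtein(query, word)
        if d < best_dist
            best_word, best_dist = word, d
        end
    end
    return embeddings[best_word]  # embedding of the closest stored word
end

embeddings = Dict("apple" => [0, 0, 1], "pear" => [0, 0, 2])  # dummy vectors
fuzzy_lookup(embeddings, "appel")  # returns the vector stored for "apple"
```

Scanning the whole vocabulary like this is linear in the number of stored words, which is why such lookups are slow compared to exact or wildcard matches.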


## Installation
@@ -35,13 +113,6 @@ The installation can be done through the usual channels (manually by cloning the



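For reference, a minimal sketch of adding the package with Julia ≥ 1.0's package manager; the repository URL below is assumed, not taken from this page:

```julia
using Pkg
# Assumed repository location; adjust if the package is hosted elsewhere.
Pkg.add(PackageSpec(url="https://github.com/zgornel/ConceptnetNumberbatch.jl"))
```
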
## Remarks

At this point this is a work in progress and should NOT be used. For an alternative to this
package (with respect to word embeddings), check out [Embeddings.jl](https://github.com/JuliaText/Embeddings.jl)



## License

This code has an MIT license and therefore it is free.
@@ -51,4 +122,7 @@ This code has an MIT license and therefore it is free.
## References

[1] [ConceptNetNumberbatch GitHub homepage](https://github.com/commonsense/conceptnet-numberbatch)

[2] [ConceptNet GitHub homepage](https://github.com/commonsense/conceptnet5)

[3] [Embeddings.jl GitHub homepage](https://github.com/JuliaText/Embeddings.jl)
2 changes: 1 addition & 1 deletion REQUIRE
@@ -1,4 +1,4 @@
julia 0.7
julia 1.0
TranscodingStreams
CodecZlib
HDF5
103 changes: 64 additions & 39 deletions src/ConceptnetNumberbatch.jl
@@ -1,27 +1,22 @@
################################################################################
# ConceptnetNumberbatch.jl - an interface for ConceptNetNumberbatch #
# written in Julia by Cornel Cofaru at 0x0α Research, 2018 #
# #
# Paper: #
# Robert Speer, Joshua Chin, and Catherine Havasi (2017). #
# "ConceptNet 5.5: An Open Multilingual Graph of General Knowledge." #
# In proceedings of AAAI 2017. #
################################################################################
# MMMMMMMMMMMMMMMMMMMMMMMMMMMWNNKN0KMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
# MMMMMMMMMMMMMMMMMMMMMMMMMW0OMMMMX0MMMMMMMMMMXXNMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMNWMMMMNMMMMMMMMMMMMMMMMMMMMMMMM
# MMMMMMMMMMMMMMMMMMMMMMMMMKMMMMMOKWMMMMMMMN:lkkdXMMNXWMMMWMNNMMMMWXWMMMNNMMMWMNNMMMlXWM;.xMMW.MMMWXWMM0dWWMMMMMMMMMMMM
# MMMMMMMMMMMMMMMMMMMMMMMMNXMMMMMXXMMMMMMMM:oMMMMMX,xkl:MK.xkd'MN,dkxMx:Ox,NW dkd'N0 xOM;OccWW.MW;dOcoN:;kNMMMMMMMMMMMM
# MMMMMMMMMMMMMMMMMMWXXXXXkMMMMMMM0MMMMMMMM:lMMMMMocMMM.XK.MMM Mk;MMMM.lxxlKM MMM;xM.NMM;OMk,X.MO.xxdoMolMMMMMMMMMMMMMM
# MMMMMMMMMMMMMMXXXXNMMMMMoMMMMMMMOMMMMMMMMW:cdxdWN;dxlcMK.MMM MW;oxdMk;dxdWM dxo,NM;lkM;OMMX'.MW:oxd0MO'xXMMMMMMMMMMMM
# MMMMMMMMMMMMXXWMMMMMMMMM0KMMMMMOOMMMMMMMMMMMWWMMMMMWMMMMMMMMMMMMMWMMMMWWMMM MWWMMMMWWMMMMMMMMMMMMWWMMMMWMMMMMMMMMMMMM
# MMMMMMMMMMW0WMMMMMMMMMMMMNMMMM0:WMMMMMMMMXxxNMM0kMMMMMMMMMMMMMMMMMMMMXl0MMMxMMMMMMMMMMMMModMMMMMMMMMMMMW0WMMMMMMMMdoM
# MMMMMMMMMM0MMMMMMMMMMMMMMMMMMk'0MMMMMMMMM0cxlXMxoMNxKMWd0M0xOxdOkxd0MXcxkdkWM0xkx0MNd0xxMlokdxNMNxkxkNNxcxkMXxxxOMdlk
# MMMMMMMMMMKMMMMMMMMMNK0XNMMMM:'OWMMMMMMMM0cNOlKkoMXcOMWckMkcXMxcNMdlMXcOMXcOXcdOxcKNcxWMMloMWooMN0OkcOMOcNMWldMMMMdlW
# NKOkk0KK0OkMMMMMMWd;:kWMWXX0kdo;'OMMMMMMM0cNMKlloMNlxXOckMkcWMxcMMdlMXcxXOcKNlxXXKWNcOMMMllKKlxMloX0cOM0c0XMooKXKMdoM
# MMMMWX0kdl:0MMMMX;'lWMMMMMMWk;'0XWMMMMMMMN0WMMX0KMMX00XKXMX0WMX0MMKKMW0N00XMMWK00KMW0XMMMKXK0KWMX00NKXMWK00MWK00XMKKM
# MMMMMMMMMMMWOOoc'o0KWMMMMMMMMWK0MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

# Remarks:
# ####
# /-----\
# | O ^ |
# | \_/ |
# \___/
# - pretty fast for retrieving an existing word
# - slow for retrieving a mismatch
# - could be wrong for mismatches
# - retrieval is based on string distances
# - decreasing the vocabulary size based on language
# (i.e. detect the language of the text before
# searching) may increase performance significantly at the cost
# of more mismatches for rare words
# ConceptnetNumberbatch.jl - an interface for ConceptNetNumberbatch written in Julia at 0x0α Research,
# by Corneliu Cofaru, 2018
# Paper:
# Robert Speer, Joshua Chin, and Catherine Havasi (2017).
# "ConceptNet 5.5: An Open Multilingual Graph of General Knowledge." in proceedings of AAAI 2017.

module ConceptnetNumberbatch

@@ -32,29 +27,59 @@ using Languages
using StringDistances
using NearestNeighbors

import Base: getindex, size, length, show, keys, values, in
import Base: get, getindex, size, length, show, keys, values, in

# Links pointing to the latest ConceptNetNumberbatch version (v"17.06")
const CONCEPTNET_MULTI_LINK = "https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-17.06.txt.gz"
const CONCEPTNET_EN_LINK = "https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz"
const CONCEPTNET_HDF5_LINK = "https://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.06/mini.h5"

# Accepted languages (map from conceptnet to Languages.Language)
const LANG_MAP = Dict(:en=>Languages.English(),
:fr=>Languages.French(),
:de=>Languages.German(),
:it=>Languages.Italian(),
:fi=>Languages.Finnish(),
:nl=>Languages.Dutch(),
:af=>Languages.Dutch(),
:pt=>Languages.Portuguese(),
:es=>Languages.Spanish(),
:ru=>Languages.Russian(),
:ro=>Languages.Romanian(),
:sw=>Languages.Swedish()
# add more mappings here if needed
# AND supported by Languages.jl
)
const LANGUAGES = Dict(:en=>Languages.English(),
:fr=>Languages.French(),
:de=>Languages.German(),
:it=>Languages.Italian(),
:fi=>Languages.Finnish(),
:nl=>Languages.Dutch(),
:af=>Languages.Dutch(),
:pt=>Languages.Portuguese(),
:es=>Languages.Spanish(),
:ru=>Languages.Russian(),
:sh=>Languages.Serbian(), # and Languages.Croatian()
:sv=>Languages.Swedish(),
:cs=>Languages.Czech(),
:pl=>Languages.Polish(),
:bg=>Languages.Bulgarian(),
:eo=>Languages.Esperanto(),
:hu=>Languages.Hungarian(),
:el=>Languages.Greek(),
:no=>Languages.Nynorsk(),
:sl=>Languages.Slovene(),
:ro=>Languages.Romanian(),
:vi=>Languages.Vietnamese(),
:lv=>Languages.Latvian(),
:tr=>Languages.Turkish(),
:da=>Languages.Danish(),
:ar=>Languages.Arabic(),
:fa=>Languages.Persian(),
:ko=>Languages.Korean(),
:th=>Languages.Thai(),
:ka=>Languages.Georgian(),
:he=>Languages.Hebrew(),
:te=>Languages.Telugu(),
:et=>Languages.Estonian(),
:hi=>Languages.Hindi(),
:lt=>Languages.Lithuanian(),
:uk=>Languages.Ukrainian(),
:be=>Languages.Belarusian(),
:sw=>Languages.Swahili(),
:ur=>Languages.Urdu(),
:ku=>Languages.Kurdish(),
:az=>Languages.Azerbaijani(),
:ta=>Languages.Tamil()
# add more mappings here if needed
# AND supported by Languages.jl
)

export CONCEPTNET_MULTI_LINK,
CONCEPTNET_EN_LINK,
41 changes: 26 additions & 15 deletions src/files.jl
@@ -5,8 +5,8 @@ pointed to by `localfile`.
function download_embeddings(;url=CONCEPTNET_EN_LINK,
localfile=abspath("./_conceptnet_/" *
split(url,"/")[end]))
_dir = join(split(localfile, "/")[1:end-1], "/")
!isempty(_dir) && !isdir(_dir) && mkpath(_dir)
directory = join(split(localfile, "/")[1:end-1], "/")
!isempty(directory) && !isdir(directory) && mkpath(directory)
@info "Download ConceptNetNumberbatch to $localfile..."
if !isfile(localfile)
download(url, localfile)
@@ -29,10 +29,16 @@ function load_embeddings(filepath::AbstractString;
keep_words=String[],
languages::Union{Nothing,
Languages.Language,
Vector{<:Languages.Language}
Vector{<:Languages.Language},
Symbol,
Vector{Symbol}
}=nothing)
if languages == nothing
languages = unique(collect(values(LANG_MAP)))
if languages isa Nothing
languages = unique(collect(values(LANGUAGES)))
elseif languages isa Symbol
languages = LANGUAGES[languages]
elseif languages isa Vector{Symbol}
languages = [LANGUAGES[lang] for lang in languages]
end
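# At this point `languages` has been normalized to `Languages.Language` value(s).
# Illustrative calls (not part of this diff):
#   load_embeddings("conceptnet.h5", languages=:en)
#   load_embeddings("conceptnet.h5", languages=[:en, :fr])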

if any(endswith.(filepath, [".gz", ".gzip"]))
@@ -68,7 +74,7 @@ function _load_gz_embeddings(filepath::S1,
Vector{<:Languages.Language}
}=nothing) where
{S1<:AbstractString, S2<:AbstractString}
local lang_embs, _length::Int, _width::Int, type_lang
local lang_embs, _length::Int, _width::Int, type_lang, fuzzy_words
type_word = String
type_vector = Vector{Float64}
open(filepath, "r") do fid
@@ -79,6 +85,7 @@
keep_words)
lang_embs, languages, type_lang, english_only =
process_language_argument(languages, type_word, type_vector)
fuzzy_words = Dict{type_lang, Vector{type_word}}()
no_custom_words = length(keep_words)==0
lang = :en
cnt = 0
@@ -89,12 +96,14 @@
lang = Symbol(_lang)
end
if word in keep_words || no_custom_words
if lang in keys(LANG_MAP) && LANG_MAP[lang] in languages # use only languages mapped in LANG_MAP
_llang = LANG_MAP[lang]
if lang in keys(LANGUAGES) && LANGUAGES[lang] in languages # use only languages mapped in LANGUAGES
_llang = LANGUAGES[lang]
if !(_llang in keys(lang_embs))
push!(lang_embs, _llang=>Dict{type_word, type_vector}())
push!(fuzzy_words, _llang=>type_word[])
end
_, embedding = _parseline(line, word_only=false)
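# words containing '#' are wildcard entries (e.g. "####ish"); index them per language so wildcard lookups can be matched later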
occursin("#", word) && push!(fuzzy_words[_llang], word)
push!(lang_embs[_llang], word=>embedding)
cnt+=1
if cnt > vocab_size-1
@@ -105,7 +114,7 @@ end
end
close(cfid)
end
return ConceptNet{type_lang, type_word, type_vector}(lang_embs, _width), _length, _width
return ConceptNet{type_lang, type_word, type_vector}(lang_embs, _width, fuzzy_words)
end


@@ -119,6 +128,7 @@ function _load_hdf5_embeddings(filepath::S1,
Vector{<:Languages.Language}
}=nothing) where
{S1<:AbstractString, S2<:AbstractString}
local fuzzy_words
type_word = String
type_vector = Vector{Int8}
payload = h5open(read, filepath)["mat"]
@@ -132,15 +142,18 @@
keep_words)
lang_embs, languages, type_lang, _ =
process_language_argument(languages, type_word, type_vector)
fuzzy_words = Dict{type_lang, Vector{type_word}}()
no_custom_words = length(keep_words)==0
cnt = 0
for (idx, (lang, word)) in enumerate(words)
if word in keep_words || no_custom_words
if lang in keys(LANG_MAP) && LANG_MAP[lang] in languages # use only languages mapped in LANG_MAP
_llang = LANG_MAP[lang]
if lang in keys(LANGUAGES) && LANGUAGES[lang] in languages # use only languages mapped in LANGUAGES
_llang = LANGUAGES[lang]
if !(_llang in keys(lang_embs))
push!(lang_embs, _llang=>Dict{type_word, type_vector}())
push!(fuzzy_words, _llang=>type_word[])
end
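# collect '#' wildcard words for this language as well (same indexing as in the gzip loader)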
occursin("#", word) && push!(fuzzy_words[_llang], word)
push!(lang_embs[_llang], word=>embeddings[:,idx])
cnt+=1
if cnt > vocab_size-1
@@ -149,9 +162,7 @@ end
end
end
end
_length::Int = length(words)
_width::Int = size(embeddings,1)
return ConceptNet{type_lang, type_word, type_vector}(lang_embs, _width), _length, _width
return ConceptNet{type_lang, type_word, type_vector}(lang_embs, size(embeddings,1), fuzzy_words)
end


@@ -167,7 +178,7 @@ function process_language_argument(languages::Nothing,
type_word::T1,
type_vector::T2) where {T1, T2}
return Dict{Languages.Language, Dict{type_word, type_vector}}(),
collect(language for language in LANG_MAP),
collect(language for language in values(LANGUAGES)),
Languages.Language, false
end
