This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Pytorch embedding #2

Closed
wants to merge 2 commits into from

Conversation

DeNeutoy (Contributor):

Switching the word embedding functionality over to use Pytorch. Mainly did this to get a quick idea of how tough it would be to switch layers. Answer: pretty straightforward.


Parameters
----------
num_embeddings:, int:
Contributor:

We should probably figure out Markdown vs. reST sooner rather than later, so we can switch the docstrings now if we're going to switch.

If we're sticking with reST, this should be formatted like:

num_embeddings : ``int``

Note the space after the parameter name - it's important for the code that generates the docs.
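
For reference, a sketch of a full Parameters block in that reST style (the parameter descriptions here are illustrative, not taken from this PR):

Parameters
----------
num_embeddings : ``int``
    The size of the vocabulary, i.e. the number of rows in the embedding matrix.
embedding_dim : ``int``
    The dimensionality of each embedding vector.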

        self.weight.data[self.padding_index].fill_(0)

    def forward(self, input):
        padding_idx = self.padding_idx
Contributor:

self.padding_index
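
For what it's worth, a minimal sketch of forward in current PyTorch using torch.nn.functional.embedding, with the attribute name made consistent (assuming the module stores self.weight, self.padding_index, and self.sparse as in the snippets above):

import torch.nn.functional as F

def forward(self, inputs):
    # Look up each index in self.weight; the entry at padding_index
    # does not receive gradient updates.
    return F.embedding(inputs, self.weight,
                       padding_idx=self.padding_index,
                       sparse=self.sparse)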

        self.sparse = sparse

        if embedding_dim == 1:
            raise ConfigurationError("There is no need to embed tokens if you "
Contributor:

Why not? I could be learning something like a tf-idf weight for each token. Seems odd to put in this restriction, as it's unnecessary.

Contributor (Author):

Sure, I was just porting the previous embedding. I'll remove it.
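
For concreteness, an embedding_dim of 1 is just a learned scalar per token, along the lines of the following usage sketch with the stock torch.nn.Embedding (the vocabulary size here is arbitrary):

import torch

# One learned scalar per vocabulary entry, e.g. a tf-idf-style token weight;
# for input of shape (batch, seq_len) the output has shape (batch, seq_len, 1).
token_weights = torch.nn.Embedding(num_embeddings=10000, embedding_dim=1)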

embedding_dim: int,
weight: torch.FloatTensor=None,
padding_index: int=None,
trainable: bool=True,
Contributor:

This field is missing from the docstring.

        if weight is None:
            weight = torch.FloatTensor(num_embeddings, embedding_dim)
            self.weight = torch.nn.Parameter(weight, requires_grad=trainable)
            self.reset_parameters()
Contributor:

How does this work? This is scary to me.

Contributor:

Oh, you defined this method below. Ok, that makes more sense. When I first saw this, I thought you were calling a super class method, and I was really nervous about using pytorch.

Contributor:

Do you even need this method? Looks like it's just a single line, and you can move the padding fill to below the if/else block, because you do it in both.

Contributor:

And if you do want it, I'd call it something like initialize_parameters.

Contributor (Author):

It looks like there is a technical reason to have this function if we ever need to create parameters inside the call to forward: https://discuss.pytorch.org/t/dynamic-parameter-declaration-in-forward-function/427/7

That isn't the reason it's used here, though; I was wondering why other PyTorch layers define it, and it looks like this might be why.
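
The pattern discussed in that thread is a module that cannot build its parameters until it sees an input. A hypothetical sketch (not code from this PR):

import torch

class LazyLinear(torch.nn.Module):
    """Hypothetical module that creates its weight on the first forward call."""

    def __init__(self, out_features: int) -> None:
        super().__init__()
        self.out_features = out_features
        self.weight = None  # created lazily, once the input size is known

    def forward(self, inputs):
        if self.weight is None:
            in_features = inputs.size(-1)
            weight = inputs.new_empty(self.out_features, in_features)
            torch.nn.init.xavier_uniform_(weight)
            # Assigning a Parameter to an attribute registers it with the Module.
            self.weight = torch.nn.Parameter(weight)
        return inputs @ self.weight.t()

One caveat with this pattern: an optimizer constructed before the first forward call will not see the lazily created parameter.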

Contributor:

Interesting, ok. But it's not a method defined on Module, so we can handle this however we want, right?

Contributor (Author):

Yes, I have removed it for now.
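
With reset_parameters removed and the padding fill moved below the if/else as suggested above, the constructor might look roughly like this (a sketch, not the final code from this PR):

import torch

class Embedding(torch.nn.Module):
    def __init__(self,
                 num_embeddings: int,
                 embedding_dim: int,
                 weight: torch.FloatTensor = None,
                 padding_index: int = None,
                 trainable: bool = True,
                 sparse: bool = False) -> None:
        super().__init__()
        self.num_embeddings = num_embeddings
        self.padding_index = padding_index
        self.sparse = sparse

        if weight is None:
            # No pretrained weight given; allocate and randomly initialize one.
            weight = torch.FloatTensor(num_embeddings, embedding_dim)
            self.weight = torch.nn.Parameter(weight, requires_grad=trainable)
            self.weight.data.normal_(0, 1)
        else:
            self.weight = torch.nn.Parameter(weight, requires_grad=trainable)

        # Zero the padding row once, after either branch.
        if self.padding_index is not None:
            self.weight.data[self.padding_index].fill_(0)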

trainable: bool=True,
log_misses: bool=False):
"""
Reads a pre-trained embedding file and generates a Keras Embedding layer that has weights
Contributor:

Keras?

it a zero vector.

The embeddings file is assumed to be gzipped, formatted as [word] [dim 1] [dim 2] ...
"""
Contributor:

Documenting parameters here would be good.
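
For context, a minimal sketch of reading a file in the format that docstring describes (gzipped text, one token per line followed by its vector components); the function and variable names here are illustrative:

import gzip

import numpy

def read_pretrained_embeddings(embeddings_filename: str) -> dict:
    # Map each token in the file to its vector; the first field on a line
    # is the token, the remaining fields are the vector components.
    embeddings = {}
    with gzip.open(embeddings_filename, 'rb') as embeddings_file:
        for line in embeddings_file:
            fields = line.decode('utf-8').rstrip().split(' ')
            token = fields[0]
            embeddings[token] = numpy.asarray(fields[1:], dtype='float32')
    return embeddings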

embedding_dim = None

# TODO(matt): make this a parameter
embedding_misses_filename = 'embedding_misses.txt'
Contributor:

We can probably just log misses using logger.debug, and remove the log_misses parameter entirely, and this variable.

            assert embedding_dim > 1, "Found embedding size of 1; do you have a header?"
        else:
            if len(fields) - 1 != embedding_dim:
                # Sometimes there are funny unicode parsing problems that lead to different
Contributor:

Can we do better here? Maybe we can just import a library that reads glove vectors already, so we don't have to duplicate that functionality?


# Depending on whether the namespace has PAD and UNK tokens, we start at different indices,
# because you shouldn't have pretrained embeddings for PAD or UNK.
start_index = 0 if namespace.startswith("*") else 2
Contributor:

You should be able to always start at 0. The check for word in embeddings will fail, and everything will be ok, because we just continue in that case. And the namespace check is wrong here, anyways.
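
A sketch of what the weight-filling loop looks like with both simplifications (always start at index 0, and log misses via logger.debug as suggested above instead of writing a misses file); the Vocabulary method names here are assumptions:

import logging

import torch

logger = logging.getLogger(__name__)

def initialize_weight(embeddings: dict, vocab, namespace: str,
                      embedding_dim: int) -> torch.FloatTensor:
    vocab_size = vocab.get_vocab_size(namespace)
    # Random init everywhere; rows with pretrained vectors get overwritten below.
    weight = torch.FloatTensor(vocab_size, embedding_dim).normal_(0, 1)
    for index in range(vocab_size):  # starting at 0 is fine: PAD/UNK just miss below
        token = vocab.get_token_from_index(index, namespace)
        if token not in embeddings:
            logger.debug("Token %s was not found in the pretrained embedding file", token)
            continue
        weight[index] = torch.FloatTensor(embeddings[token])
    return weight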

DeNeutoy closed this Jun 26, 2017
LiyuanLucasLiu referenced this pull request in LiyuanLucasLiu/allennlp Jul 16, 2018
otherwise it would raise an error in line 505 (expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #2 'other')
DeNeutoy pushed a commit that referenced this pull request Jul 16, 2018
otherwise it would raise an error in line 505 (expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #2 'other')
gabrielStanovsky pushed a commit to gabrielStanovsky/allennlp that referenced this pull request Sep 7, 2018
otherwise it would raise an error in line 505 (expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument allenai#2 'other')
brendan-ai2 pushed a commit to brendan-ai2/allennlp that referenced this pull request Feb 27, 2019
segmental conll 2000 dataset reader
joelgrus added a commit to joelgrus/allennlp that referenced this pull request Jul 11, 2019
dirkgr pushed a commit to dirkgr/allennlp that referenced this pull request Dec 11, 2019
schmmd added a commit that referenced this pull request Feb 26, 2020
Call make_app correctly.