Error with ppi.py example - classification.py:1143: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples. #195
I pushed a potential fix. I would appreciate it if you could verify if it works now.
It has removed the F-score metric warning, but I still get the same behaviour, with the loss becoming nan and the micro-F1 staying at 0.0. This was the output from a run:
Mh, seems like you need to help me out on this one. Are you running on CPU or GPU? Do you have an idea which operator produces the NaNs?
So, when I run this on a CPU it runs fine and the issue isn't present. On a GPU I see the problem with nan loss. These are the outputs of the model prediction and the `data.y` values before the loss is calculated:

```
model out : tensor([[ 1.4462, -3.2312, -1.9278, ...,  1.5005,  0.5878, -2.9639],
loss : 0.48732060194015503
data y : tensor([[1., 0., 0., ..., 1., 1., 0.],

model out : tensor([[ 0.0739, -0.1491, -1.0226, ...,  0.9851,  0.1597, -3.0990],
loss : nan
data y : tensor([[1., 0., 0., ..., 1., 1., 0.],

model out : tensor([[nan, nan, nan, ..., nan, nan, nan],
loss : nan
```

It is probably worth noting here the different versions I am using:
Thank you. Can you verify which op outputs NaNs by adding `print(torch.isnan(out).sum())` after each conv call? My guess is that the softmax of GAT may produce NaNs. I am just wondering why I cannot reproduce this issue.
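To make that suggestion concrete, here is a minimal sketch of such instrumentation for a GAT-style model along the lines of the PPI example (the two-layer structure and the layer sizes here are illustrative, not the example's exact architecture; PPI has 50 node features and 121 labels):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Illustrative hyperparameters; the real example may differ.
        self.conv1 = GATConv(50, 256, heads=4)
        self.conv2 = GATConv(4 * 256, 121, heads=6, concat=False)

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))
        # Count NaNs produced by each conv layer to localize the failure.
        print('nan after conv1:', torch.isnan(x).sum().item())
        x = self.conv2(x, edge_index)
        print('nan after conv2:', torch.isnan(x).sum().item())
        return x
```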
Yes, it does appear that some NaNs are appearing after a conv call:

```
model out : tensor([[-0.3082, -0.3048, -1.4101, ...,  1.0179,  0.3648, -3.8706],
sum of nan output : 0
sum of nan output : 0
sum of nan output : 0
loss : 0.4614076018333435
data y : tensor([[1., 0., 0., ..., 1., 1., 0.],

model out : tensor([[-2.8252e-03, -2.5085e+00, -1.2768e+00, ...,  7.4126e-01,
sum of nan output : 0
sum of nan output : 256
sum of nan output : 2057
loss : nan
data y : tensor([[1., 0., 0., ..., 1., 1., 0.],

model out : tensor([[nan, nan, nan, ..., nan, nan, nan],
sum of nan output : 2547712
sum of nan output : 2547712
sum of nan output : 301048
loss : nan
```

Hope this helps.
That definitely helps in narrowing down the problem. Unfortunately, I still don't know exactly what is causing it. Can you debug PyG? In particular, verify the inputs and outputs of `softmax`:

```python
from torch_scatter import scatter_add, scatter_max
from torch_geometric.utils.num_nodes import maybe_num_nodes

def softmax(src, index, num_nodes=None):
    num_nodes = maybe_num_nodes(index, num_nodes)
    print('----------------------------')
    print('1', src.min().item(), src.max().item())
    # Subtract the per-group maximum for numerical stability.
    out = src - scatter_max(src, index, dim=0, dim_size=num_nodes)[0][index]
    print('2', out.min().item(), out.max().item())
    out = out.exp()
    print('3', out.min().item(), out.max().item())
    # Normalize by the per-group sum (the epsilon guards against division by zero).
    out = out / (
        scatter_add(out, index, dim=0, dim_size=num_nodes)[index] + 1e-16)
    print('4', out.min().item(), out.max().item())
    print('----------------------------')
    return out
```

For me, this gives something like:
These are the outputs from the softmax function, along with the number of NaNs after each conv call and the calculated loss. Note there is output before the point where I started copying, i.e. it runs fine for a little while before the NaNs appear.
So, I assume the issue arises from taking the exponential of a large number and then dividing by the resulting inf.
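To illustrate that reasoning with a minimal, PyG-independent sketch: subtracting the true maximum keeps the exponentials bounded in (0, 1], but if the subtracted maximum is wrongly zero, large logits overflow to inf and the normalization produces nan:

```python
import torch

src = torch.tensor([30.0, 60.0, 90.0])

# Stable softmax: subtract the true max first, so exp() stays in (0, 1].
stable = (src - src.max()).exp()
print(stable / stable.sum())    # well-defined probabilities, ~[0., 0., 1.]

# Broken max (all zeros, as observed above): exp(90.) overflows float32
# to inf, the sum becomes inf, and inf / inf = nan.
broken = (src - 0.0).exp()
print(broken / broken.sum())    # tensor([0., 0., nan])
```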
The outputs of scatter_max are all zero, so this is clearly the issue. I do have `torch_scatter` 1.1.2 installed, though, so I am not sure what has gone wrong.
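A quick standalone sanity check for this (a sketch, assuming a CUDA-enabled install of `torch_scatter`):

```python
import torch
from torch_scatter import scatter_max

src = torch.tensor([1., 3., 2., 4.], device='cuda')
index = torch.tensor([0, 0, 1, 1], device='cuda')

# Per-group maxima: group 0 holds {1, 3}, group 1 holds {2, 4}.
out, argmax = scatter_max(src, index, dim=0)
print(out)  # expected tensor([3., 4.]); all zeros would match the bug above
```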
Can you run the test suite of `torch_scatter`?
I installed `torch_scatter` via pip.
You do not need to reinstall, simply run:
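One hedged way to check which `torch_scatter` installation is actually being picked up (a sketch; not necessarily the exact command suggested above, and the `__version__` attribute is assumed to be present):

```python
import torch_scatter

print(torch_scatter.__file__)     # path of the installed package
print(torch_scatter.__version__)  # should report 1.1.2 in this setup
```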
The folder for the `torch_scatter` package only has the following files in it:

I can run
You should clone the `torch_scatter` repository and run the test suite from there.
This didn't run successfully:
Ok, so I have to apologise for wasting your time and thank you for being patient! The issue was with my installation of `torch_scatter`; reinstalling it fixed the failing tests. In turn, this also fixed the issue I was having with the PPI example not working due to the nan loss. Thanks for your help, I wouldn't have got it working otherwise!
Cool :)
Error with ppi.py example
I am getting the error `F-score is ill-defined and being set to 0.0 due to no predicted samples` when running the ppi.py example.
This is the full output when the error occurs; it causes a nan loss and an accuracy of 0.0 afterwards.
```
Downloading https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/ppi.zip
Extracting /home/josh/Documents/Graph_Networks/PyTorchGeometric/Model_Study/pytorch_geometric/data/PPI/ppi.zip
Processing...
Done!
Epoch: 01, Loss: 0.7505, Acc: 0.5000
Epoch: 02, Loss: 0.5345, Acc: 0.5595
/home/josh/.local/lib/python3.6/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
Epoch: 03, Loss: nan, Acc: 0.0000
Epoch: 04, Loss: nan, Acc: 0.0000
Epoch: 05, Loss: nan, Acc: 0.0000
```
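For context, the warning itself is easy to reproduce with a minimal, example-independent sketch: once the logits are nan, thresholding them yields no positive predictions, and scikit-learn then defines the F-score as 0.0 while warning:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 1]])
y_pred = np.zeros_like(y_true)  # nan logits -> (logits > 0) is all False

# With no predicted positives, precision is undefined, so sklearn emits
# UndefinedMetricWarning and sets the F-score to 0.0.
print(f1_score(y_true, y_pred, average='micro'))  # 0.0
```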
I am using the latest version of `pytorch_geometric` and my PyTorch version is 1.0.1.post2.
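For completeness, a quick way to print the versions in play (a sketch; the `__version__` attributes are assumed to be defined in these packages):

```python
import torch
import torch_geometric

print(torch.__version__)           # e.g. 1.0.1.post2
print(torch_geometric.__version__)
```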