Build with tensorflow_cc: tf_trajectories_example, vpnet_test, and alpha_zero_example_test fail #539
Tagging @mrdaliri, any ideas? Have you run into this? |
Small update: I copied the same minimal usage examples from the tensorflow_cc repo to see if building/running within open_spiel would expose any linking (or other) errors. Upon running, I got the following warning from cmake:
but the example ran as expected afterwards:
After running this example, I got new errors from two of the tests that failed above:
and ./alpha_zero_example
tf_trajectories_example still fails as above and all three fail as above when rebuilding, until I run the example again. These appear to be the same errors that popped up here: #172 (comment). I thought the BUILD_WITH_TENSORFLOW_CC PR was the fix for this (see here: #307 (comment)), but it looks like the same error is still happening. Is there a step not mentioned in the linked issue/pr that I could be missing? |
Hi @ngrupen, did you modify the .lds file and build with the protobufs tag in this comment: #172 (comment)? Based on some of the discussion there, it seems you might have to remove the system's protobufs if you have them installed. |
Yes, I followed the steps listed in #172 (comment) to install tensorflow_cc. After rebuilding both tensorflow_cc and open_spiel I am still seeing the errors from the first comment. I also confirmed that system protobufs were uninstalled -- the only one on my machine is the version that is installed with requirements.txt. I also tried installing outside of anaconda to see if the extra wrapper was causing the issue, but got the same errors. In addition to the errors from the first message, I am seeing errors any time tensorflow tries to write a graph. For example, here is the output of running python3 open_spiel/contrib/python/export_graph.py:
|
This is the problem. It is somehow unable to find the CUDA library at runtime. I've seen this before. One super easy thing to try is find where that file lives (let's call it
Also can you try running python3 outside of OpenSpiel and just importing tensorflow? Does it work? |
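One way to test the library-path theory in this comment is to scan the directories listed in LD_LIBRARY_PATH for the file the runtime reports as missing. What follows is a minimal sketch under stated assumptions, not code from OpenSpiel or tensorflow_cc: `find_in_search_path` is a hypothetical helper, and `libfoo.so` is only a placeholder because the actual file name is not shown in this thread.

```python
import os


def find_in_search_path(filename, search_path):
    """Return the directories in a PATH-style string that contain filename."""
    hits = []
    for directory in search_path.split(os.pathsep):
        # Empty entries appear when the variable is unset or has stray separators.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # "libfoo.so" is a placeholder; substitute the file named in the runtime error.
    library = "libfoo.so"
    dirs = find_in_search_path(library, os.environ.get("LD_LIBRARY_PATH", ""))
    print(f"{library} found in: {dirs or 'no LD_LIBRARY_PATH directory'}")
```

If the list comes back empty, the directory that holds the library is probably not on LD_LIBRARY_PATH for the process that runs the test, which would be consistent with a runtime lookup failure.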
Since you compiled TF from scratch it might be somewhere in the build dir (if you didn't run |
Hmm I did run
My issues could stem from this version not being the same as the one installed by tensorflow_cc. You're right, there is a tensorflow directory in the build dir of tensorflow_cc, though I did run make install after compiling, so this is curious... As for CUDA, the previous run was on my GPU-less desktop. I saw previous warnings of the same nature, but at least some of them came with a message saying that it is OK to ignore the warning if you don't have CUDA installed. I am actually compiling tensorflow_cc on a lab machine with CUDA set up now, so I can confirm in a bit if the warning goes away. Update: Confirmed path issue for tensorflow. Running regular python inside the tensorflow_cc build directory allows it to be found/imported. |
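The two suspicions in this comment (a pip TensorFlow that differs from the one tensorflow_cc was built from, and a module that is only importable from a particular directory) can be checked from plain Python without importing TensorFlow itself. The sketch below is illustrative only: the helper names are hypothetical, and the `expected = "2.2"` value is an assumption that should be replaced with whatever version your tensorflow_cc build corresponds to.

```python
import importlib.util
from importlib import metadata


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


def installed_version(distribution):
    """Return the installed version of a pip distribution, or None."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def same_major_minor(a, b):
    """Compare only the major.minor components of two version strings."""
    return a.split(".")[:2] == b.split(".")[:2]


if __name__ == "__main__":
    expected = "2.2"  # assumption: the version the C++ library was built from
    found = installed_version("tensorflow")
    print("tensorflow module file:", module_origin("tensorflow"))
    print("pip tensorflow version:", found)
    if found is not None and not same_major_minor(found, expected):
        print(f"mismatch: pip has {found}, expected {expected}.x")
```

Running it once from the home directory and once from inside the build directory shows whether the module resolves only in the latter, which is the symptom described above.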
Cool, so does the LD_LIBRARY_PATH trick let you run Edit: Wait -- if that's the case maybe you need to add that dir to PYTHONPATH instead. |
If you didn't run |
Two quick updates:
Also had to switch the pip3 tensorflow version to 2.2 from 2.4.1. Still seeing
Looking into this further. |
Another update: after downgrading to TF=2.2 and fixing issues in previous update, I've essentially caught up to the original issue (see: #172 (comment)) on my local (no gpu, no cuda) machine.
The issue hints that this could be due to further mismatches (see last sentence of #172 (comment)) but I haven't found exactly which yet. On the lab machine (gpu+cuda),
which is similar to the errors we've seen above. I have to move to something else for a bit but will keep digging. |
One more quick update: export_graph.py fails because it is using deprecated functions that have since been removed. Replacing: |
Hmm, I thought we had fixed those last two problems (the ref count and isAligned()). @mrdaliri @michalsustr judging from going back through the conversations, it seems like those might still be outstanding issues....? (see #539 (comment)). @mrdaliri I thought you were using it for your research, did you ever manage to get TF AlphaZero fully working? If not, we should clearly mark it somewhere in the documentation as not fully working. |
As a sanity check, I re-installed tensorflow_cc and OpenSpiel from a blank slate. I wanted to try out using @mrdaliri 's Anyways, here's what's left to tackle after building with TF=2.2.0 and including everything we've discussed so far:
|
Yes sorry, I gave up on TF and just use libtorch. |
@ngrupen I contacted @mrdaliri and he confirmed that he never got it fully working. I had misunderstood and thought that he had; my apologies for not making that clear. I assume the furthest point he got to is the same place you are now; @mrdaliri can you please confirm? I will update the documentation to reflect this state of the C++ TF AlphaZero. It would be wonderful if you can find a fix for these last two problems so we can finally get it to work. It'd also be great if it was using a more recent version of TF. I understand if you're not motivated to continue on this. If that's the case, I suggest using the libtorch-based C++ AlphaZero. I know this one works, at least one person is using it for their research, and we're about to build a C++ DQN based on the same code as well (which may involve refactoring some of it to make more uses of deep RL via C++ libtorch cleaner). So there's a very good chance that we continue to support that code and build on it, whereas I expect to eventually remove the TF C++ code if nobody can get it to work. |
Hi @lanctot @ngrupen, However, one of the later releases of OpenSpiel switched to a different data type (from float to a special double or something like that if I remember correctly) which again introduced some linking issues. I couldn't solve that error, and eventually, I ran out of time and had to give up on TF C++. |
Hi @ngrupen, can you submit a small PR fix for this? |
@lanctot Apologies for the delay -- I was tied up at the end of last week / weekend responding to reviewers :-) Anyways, just submitted this PR for the minor fix: #547 (comment) As for libtorch vs. tensorflow_cc, I am planning to use OpenSpiel for research as well, so I have started using the libtorch version for now just to start making progress. I can try to pick up where we left off with tensorflow_cc if I have an extra moment. |
Ubuntu 18.04
cmake version 3.18.4
clang version 10.0.0-4ubuntu1~18.04.2
bazel 2.0.0
Python 3.8.8
Build succeeds when following standard installation instructions (i.e. BUILD_WITH_TENSORFLOW_CC=OFF). Following instructions from alphazero.md and #172 I installed @mrdaliri 's open_spiel branch of tensorflow_cc and followed installation instructions. Tensorflow_cc built/installed fine, as did the usage examples. After running:
BUILD_WITH_TENSORFLOW_CC=ON ./install.sh
and
CXX=/usr/bin/clang++ BUILD_WITH_TENSORFLOW_CC=ON ./open_spiel/scripts/build_and_run_tests.sh
I noticed that tf_trajectories_example, vpnet_test, and alpha_zero_example_test were each failing with the same error:
tf_trajectories_example:
vpnet_test:
alpha_zero_example_test:
After searching around a little bit, it looks like this is usually caused by a version mismatch between tf and the tf model server, but I haven't seen any dependency on tf model server for open_spiel or tensorflow_cc. I have a feeling there is a simple/key piece I am missing here. Have you seen this error before?