Build with tensorflow_cc: tf_trajectories_example, vpnet_test, and alpha_zero_example_test fail #539

Closed
ngrupen opened this issue Mar 18, 2021 · 19 comments

Comments

@ngrupen
Contributor

ngrupen commented Mar 18, 2021

Ubuntu 18.04
cmake version 3.18.4
clang version 10.0.0-4ubuntu1~18.04.2
bazel 2.0.0
Python 3.8.8

Build succeeds when following the standard installation instructions (i.e. BUILD_WITH_TENSORFLOW_CC=OFF). Following the instructions from alphazero.md and #172, I installed @mrdaliri's open_spiel branch of tensorflow_cc. tensorflow_cc built and installed fine, as did its usage examples. After running BUILD_WITH_TENSORFLOW_CC=ON ./install.sh and CXX=/usr/bin/clang++ BUILD_WITH_TENSORFLOW_CC=ON ./open_spiel/scripts/build_and_run_tests.sh, I noticed that tf_trajectories_example, vpnet_test, and alpha_zero_example_test were each failing with the same error:

tf_trajectories_example:

1/189 Test #94: tf_trajectories_example ...........................Child aborted***Exception: 0.75 sec
2021-03-17 23:21:02.238356: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-03-17 23:21:02.357149: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3600000000 Hz
2021-03-17 23:21:02.357492: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x24a7d20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-17 23:21:02.357505: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-17 23:21:02.499277: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: NodeDef mentions attr 'allowed_devices' not in Op<name=VarHandleOp; signature= -> resource:resource; attr=container:string,default=""; attr=shared_name:string,default=""; attr=dtype:type; attr=shape:shape; is_stateful=true>; NodeDef: {{node beta1_power}}. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
2021-03-17 23:21:02.508800: F /.../open_spiel/open_spiel/contrib/tf_trajectories.cc:114] Non-OK-status: tf_session_->Run({}, {}, {"init_all_vars_op"}, nullptr) status: Invalid argument: NodeDef mentions attr 'allowed_devices' not in Op<name=VarHandleOp; signature= -> resource:resource; attr=container:string,default=""; attr=shared_name:string,default=""; attr=dtype:type; attr=shape:shape; is_stateful=true>; NodeDef: {{node beta1_power}}. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
[[beta1_power]]

vpnet_test:

2021-03-17 23:21:12.672911: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: NodeDef mentions attr 'allowed_devices' not in Op<name=VarHandleOp; signature= -> resource:resource; attr=container:string,default=""; attr=shared_name:string,default=""; attr=dtype:type; attr=shape:shape; is_stateful=true>; NodeDef: {{node beta1_power}}. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
2021-03-17 23:21:12.697392: F /.../open_spiel/open_spiel/algorithms/alpha_zero/vpnet.cc:92] Non-OK-status: tf_session_->Run({}, {}, {"init_all_vars_op"}, nullptr) status: Invalid argument: NodeDef mentions attr 'allowed_devices' not in Op<name=VarHandleOp; signature= -> resource:resource; attr=container:string,default=""; attr=shared_name:string,default=""; attr=dtype:type; attr=shape:shape; is_stateful=true>; NodeDef: {{node beta1_power}}. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
[[beta1_power]]

alpha_zero_example_test:

2021-03-17 23:21:45.369748: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: NodeDef mentions attr 'allowed_devices' not in Op<name=VarHandleOp; signature= -> resource:resource; attr=container:string,default=""; attr=shared_name:string,default=""; attr=dtype:type; attr=shape:shape; is_stateful=true>; NodeDef: {{node beta1_power}}. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
2021-03-17 23:21:45.371901: F /.../open_spiel/open_spiel/algorithms/alpha_zero/vpnet.cc:92] Non-OK-status: tf_session_->Run({}, {}, {"init_all_vars_op"}, nullptr) status: Invalid argument: NodeDef mentions attr 'allowed_devices' not in Op<name=VarHandleOp; signature= -> resource:resource; attr=container:string,default=""; attr=shared_name:string,default=""; attr=dtype:type; attr=shape:shape; is_stateful=true>; NodeDef: {{node beta1_power}}. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
[[beta1_power]]

After searching around a little bit, it looks like this is usually caused by a version mismatch between tf and the tf model server, but I haven't seen any dependency on tf model server for open_spiel or tensorflow_cc. I have a feeling there is a simple/key piece I am missing here. Have you seen this error before?
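For anyone triaging the same message, here is a minimal sketch (not part of OpenSpiel or TensorFlow) that pulls the unknown attribute, the op name, and the node name out of a log line like the ones above. The helper name `parse_attr_mismatch` and the regular expression are my own assumptions, based only on the message format quoted in this issue.

```python
import re

# Pattern for the "NodeDef mentions attr ... not in Op<name=...>" message
# quoted in the logs above. The helper name below is hypothetical.
_ATTR_ERROR = re.compile(
    r"NodeDef mentions attr '(?P<attr>[^']+)' not in Op<name=(?P<op>[^;>]+);"
    r".*?NodeDef: \{\{node (?P<node>[^}]+)\}\}"
)


def parse_attr_mismatch(log_line):
    """Return (attr, op, node) if the line carries the mismatch error, else None."""
    match = _ATTR_ERROR.search(log_line)
    if match is None:
        return None
    return match.group("attr"), match.group("op"), match.group("node")


if __name__ == "__main__":
    sample = (
        "Invalid argument: NodeDef mentions attr 'allowed_devices' not in "
        "Op<name=VarHandleOp; signature= -> resource:resource; "
        "attr=dtype:type; attr=shape:shape; is_stateful=true>; "
        "NodeDef: {{node beta1_power}}."
    )
    print(parse_attr_mismatch(sample))
```

Knowing which attribute the loading binary rejects narrows the search to the TensorFlow release that wrote the graph versus the one that reads it.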

@lanctot
Collaborator

lanctot commented Mar 18, 2021

Tagging @mrdaliri, any ideas? Have you run into this?

@ngrupen
Contributor Author

ngrupen commented Mar 23, 2021

Small update: I copied the same minimal usage examples from the tensorflow_cc repo to see if building/running within open_spiel would expose any linking (or other) errors. Upon running, I got the following warning from cmake:

-- Configuring done
CMake Warning (dev) at CMakeLists.txt:4 (add_executable):
Policy CMP0003 should be set before this line. Add code such as

if(COMMAND cmake_policy)
  cmake_policy(SET CMP0003 NEW)
endif(COMMAND cmake_policy)

as early as possible but after the most recent call to
cmake_minimum_required or cmake_policy(VERSION). This warning appears
because target "example" links to some libraries for which the linker must
search:

dl, pthread

and other libraries with known full path:

/usr/local/lib/libtensorflow_cc.so.2

CMake is adding directories in the second list to the linker search path in
case they are needed to find libraries from the first list (for backwards
compatibility with CMake 2.4). Set policy CMP0003 to OLD or NEW to enable
or disable this behavior explicitly. Run "cmake --help-policy CMP0003" for
more information.
This warning is for project developers. Use -Wno-dev to suppress it.

but the example ran as expected afterwards:

2021-03-23 00:12:59.925127: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-03-23 00:12:59.951152: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3600000000 Hz
2021-03-23 00:12:59.951499: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55aaccca5d20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-23 00:12:59.951528: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Session successfully created.

After running this example, I got new errors from two of the tests that failed above:
./vpnet_test

TestModelCreation: mlp
WARNING:tensorflow:From /.../anaconda3/envs/spiel/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0323 00:14:54.595646 139746188588864 deprecation.py:506] From /.../anaconda3/envs/spiel/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2021-03-23 00:14:54.826483: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-03-23 00:14:54.826519: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-23 00:14:54.826554: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (b-612): /proc/driver/nvidia/version does not exist
2021-03-23 00:14:54.826739: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-03-23 00:14:54.851159: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3600000000 Hz
2021-03-23 00:14:54.851730: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f1850000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-23 00:14:54.851766: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Game: tic_tac_toe()
Model type: mlp(32, 2)
Model size: 4394 variables
Variables:
torso_0_dense/kernel:0: (27, 32)
torso_0_dense/bias:0: (32,)
torso_1_dense/kernel:0: (32, 32)
torso_1_dense/bias:0: (32,)
policy_dense/kernel:0: (32, 32)
policy_dense/bias:0: (32,)
policy/kernel:0: (32, 9)
policy/bias:0: (9,)
value_dense/kernel:0: (32, 32)
value_dense/bias:0: (32,)
value/kernel:0: (32, 1)
value/bias:0: (1,)
2021-03-23 00:14:55.004728: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-03-23 00:14:55.017015: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3600000000 Hz
2021-03-23 00:14:55.017309: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2cdfd20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-23 00:14:55.017322: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-23 00:14:55.062097: F /usr/local/include/tensorflow/bazel-bin/tensorflow/include/tensorflow/core/platform/refcount.h:90] Check failed: ref_.load() == 0 (1 vs. 0)
Aborted (core dumped)

and ./alpha_zero_example

Logging directory: /tmp/az
Overwriting existing model: /tmp/az/vpnet.pb
WARNING:tensorflow:From /.../anaconda3/envs/spiel/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0323 00:16:12.188916 139938278864704 deprecation.py:506] From /.../anaconda3/envs/spiel/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2021-03-23 00:16:12.209343: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-03-23 00:16:12.209378: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-23 00:16:12.209402: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (b-612): /proc/driver/nvidia/version does not exist
2021-03-23 00:16:15.225238: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-03-23 00:16:15.247157: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3600000000 Hz
2021-03-23 00:16:15.247675: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f450c000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-23 00:16:15.247697: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Playing game: tic_tac_toe
2021-03-23 00:16:16.168392: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-03-23 00:16:16.180859: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3600000000 Hz
2021-03-23 00:16:16.181090: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x38a3d20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-23 00:16:16.181101: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-23 00:16:17.362159: F tensorflow/core/framework/tensor.cc:672] Check failed: IsAligned() Aligned and single element
2021-03-23 00:16:17.362160: F tensorflow/core/framework/tensor.cc:672] Check failed: IsAligned() Aligned and single element
2021-03-23 00:16:17.362194: F tensorflow/core/framework/tensor.cc:672] Check failed: IsAligned() Aligned and single element
Aborted (core dumped)

tf_trajectories_example still fails as above and all three fail as above when rebuilding, until I run the example again. These appear to be the same errors that popped up here: #172 (comment). I thought the BUILD_WITH_TENSORFLOW_CC PR was the fix for this (see here: #307 (comment)), but it looks like the same error is still happening. Is there a step not mentioned in the linked issue/pr that I could be missing?

@lanctot
Collaborator

lanctot commented Mar 23, 2021

Hi @ngrupen, did you modify the .lds file and build with the protobufs tag in this comment: #172 (comment)

It seems based on some discussion that you might have to remove the system's protobufs if you have them installed.

@ngrupen
Contributor Author

ngrupen commented Mar 23, 2021

Yes I followed the steps listed in #172 (comment) to install tensorflow_cc. After rebuilding both tensorflow_cc and open_spiel I am still seeing the errors from the first comment. I also confirmed that system protobufs were uninstalled -- the only one on my machine is the version that is installed with requirements.txt. I also tried installing outside of anaconda to see if the extra wrapper was causing the issue, but got the same errors.

In addition to the errors from the first message, I am seeing errors anytime tensorflow tries to write a graph. For example, here is the output of running python3 open_spiel/contrib/python/export_graph.py:

2021-03-23 12:58:47.034648: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-03-23 12:58:47.034690: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-03-23 12:58:47.920939: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-23 12:58:47.921052: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-03-23 12:58:47.921061: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-23 12:58:47.921074: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (b-612): /proc/driver/nvidia/version does not exist
2021-03-23 12:58:47.921277: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-23 12:58:47.921859: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
/home/niko/Documents/playground/venv/lib/python3.6/site-packages/tensorflow/python/keras/legacy_tf_layers/core.py:171: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.
  warnings.warn('`tf.layers.dense` is deprecated and '
/home/niko/Documents/playground/venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer_v1.py:1719: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '
WARNING:tensorflow:From open_spiel/contrib/python/export_graph.py:95: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
W0323 12:58:48.040002 139662656255808 deprecation.py:339] From open_spiel/contrib/python/export_graph.py:95: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
WARNING:tensorflow:From /home/niko/Documents/playground/venv/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py:247: initialize_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.variables_initializer` instead.
W0323 12:58:48.040196 139662656255808 deprecation.py:339] From /home/niko/Documents/playground/venv/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py:247: initialize_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.variables_initializer` instead.
Writing file: /tmp/graph.pb
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Operation'>):
<tf.Operation 'init_all_vars_op' type=NoOp>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "open_spiel/contrib/python/export_graph.py", line 103, in <module>
    app.run(main)  File "/home/niko/Documents/playground/venv/lib/python3.6/site-packages/absl/app.py", line 322, in run
    raise  File "/home/niko/Documents/playground/venv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))  File "open_spiel/contrib/python/export_graph.py", line 99, in main
    sess.graph_def, FLAGS.dir, FLAGS.filename, as_text=False)  File "/home/niko/Documents/playground/venv/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 249, in wrapped
    error_in_function=error_in_function)
==================================
E0323 12:58:48.071783 139662656255808 tf_should_use.py:90] ==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Operation'>):
<tf.Operation 'init_all_vars_op' type=NoOp>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "open_spiel/contrib/python/export_graph.py", line 103, in <module>
    app.run(main)  File "/home/niko/Documents/playground/venv/lib/python3.6/site-packages/absl/app.py", line 322, in run
    raise  File "/home/niko/Documents/playground/venv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))  File "open_spiel/contrib/python/export_graph.py", line 99, in main
    sess.graph_def, FLAGS.dir, FLAGS.filename, as_text=False)  File "/home/niko/Documents/playground/venv/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 249, in wrapped
    error_in_function=error_in_function)
==================================

@lanctot
Collaborator

lanctot commented Mar 23, 2021

2021-03-23 12:58:47.921052: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory

This is the problem. It is somehow unable to find the CUDA library at runtime. I've seen this before. One super easy thing to try is to find where that file lives (let's call it /path/to/dir) and then add it to LD_LIBRARY_PATH. So:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/path/to/dir"
python3 open_spiel/contrib/python/export_graph.py

Also can you try running python3 outside of OpenSpiel and just importing tensorflow? Does it work?
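To see whether any directory already on the search path holds the library, something like the following could help. It is only a sketch: `find_library` is a hypothetical helper, and it checks plain file existence, which is not the same as what the dynamic loader does (it ignores the ld.so cache and default system directories).

```python
import os
from pathlib import Path


def find_library(filename, search_path):
    """Return the directories in a path-style string that contain filename."""
    hits = []
    for entry in search_path.split(os.pathsep):
        # Skip empty entries produced by leading/trailing separators.
        if entry and (Path(entry) / filename).exists():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print(find_library("libcuda.so.1", current))
```

An empty list means none of the listed directories contains the file, so the directory found by hand would still need to be appended as in the export line above.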

@lanctot
Collaborator

lanctot commented Mar 23, 2021

Since you compiled TF from scratch it might be somewhere in the build dir (if you didn't run make install)

@ngrupen
Contributor Author

ngrupen commented Mar 23, 2021

Hmm I did run make install, but tf is indeed failing to load when I use python3 outside of OpenSpiel. The version loaded by the venv is from its site packages:

import tensorflow as tf
tf.__file__
'/.../Documents/playground/open_spiel/venv/lib/python3.6/site-packages/tensorflow/__init__.py'

My issues could stem from this version not being the same as the one installed by tensorflow_cc. You're right, there is a tensorflow directory in the build dir of tensorflow_cc, though I did make install after compiling so this is curious...

As for CUDA, the previous run was on my GPU-less desktop. I saw previous warnings of the same nature, but at least some of them have come with a message saying that it is ok to ignore the warning if you don't have CUDA installed. I am actually compiling tensorflow_cc on a lab machine with CUDA setup now, so I can confirm in a bit if the warning goes away.

Update: Confirmed path issue for tensorflow. Running regular python inside tensorflow_cc build directory allows it to be found/imported.
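A quick way to check which copy of a package the interpreter would pick up, without actually importing it, is sketched below. `module_origin` is a hypothetical helper name; it only relies on the standard-library `importlib.util.find_spec` and is meant for top-level module names.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would load from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Run this from different working directories or virtual environments
    # to compare which tensorflow installation gets resolved.
    print(module_origin("tensorflow"))
```

Running it inside and outside the venv, and inside the tensorflow_cc build directory, should show whether the resolved path changes with the working directory.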

@lanctot
Collaborator

lanctot commented Mar 23, 2021

Cool, so does the LD_LIBRARY_PATH trick let you run export_graph.py?

Edit: Wait -- if that's the case maybe you need to add that dir to PYTHONPATH instead.

@lanctot
Collaborator

lanctot commented Mar 23, 2021

If you didn't run sudo make install (i.e., as root, presumably because it's a lab machine) then it will not have been moved to a directory that's a standard load path for dynamic libraries. You'd need to add where it was installed to some path, I guess maybe to both LD_LIBRARY_PATH and PYTHONPATH (maybe just the latter...?)

@ngrupen
Contributor Author

ngrupen commented Mar 23, 2021

Two quick updates:

  1. CUDA
    No more issues finding libcuda.so.1 on machine with CUDA:

2021-03-23 16:21:52.757493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-23 16:21:52.760504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:04:00.0 name: GeForce GTX 1070 computeCapability: 6.1

I also had to switch the pip3 tensorflow version from 2.4.1 to 2.2. I am still seeing export_graph.py fail with the same errors, though.

  2. Linking TF build from source
    On local machine, TF doesn't seem to like being imported from its source path -- e.g. export_graph.py fails with:

Traceback (most recent call last):
  File "export_graph.py", line 28, in <module>
    import tensorflow.compat.v1 as tf
  File "/home/niko/Documents/playground/tensorflow_cc/tensorflow_cc/build/tensorflow/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
  File "/home/niko/Documents/playground/tensorflow_cc/tensorflow_cc/build/tensorflow/tensorflow/python/__init__.py", line 50, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/niko/Documents/playground/tensorflow_cc/tensorflow_cc/build/tensorflow/tensorflow/python/pywrap_tensorflow.py", line 25, in <module>
    from tensorflow.python.platform import self_check
  File "/home/niko/Documents/playground/tensorflow_cc/tensorflow_cc/build/tensorflow/tensorflow/python/platform/self_check.py", line 27, in <module>
    raise ImportError("Could not import tensorflow. Do not import tensorflow "
ImportError: Could not import tensorflow. Do not import tensorflow from its source directory; change directory to outside the TensorFlow source tree, and relaunch your Python interpreter from there.

Looking into this further.
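Since the pip version had to be changed from 2.4.1 to 2.2 above, a small check like this can flag a pip-installed TensorFlow that differs from the release tensorflow_cc was built from. This is a sketch under assumptions: the helper names are hypothetical, `importlib.metadata` needs Python 3.8 or newer, and comparing only major.minor is my own simplification.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def same_major_minor(version_a, version_b):
    """Compare only the leading 'major.minor' components of two version strings."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


if __name__ == "__main__":
    # "2.2.0" is an assumed target here; adjust it to the TensorFlow release
    # your tensorflow_cc build was compiled from.
    expected = "2.2.0"
    found = installed_version("tensorflow")
    if found is None:
        print("tensorflow is not installed in this environment")
    elif same_major_minor(found, expected):
        print(found, "matches", expected)
    else:
        print(found, "differs from", expected)
```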

@ngrupen
Contributor Author

ngrupen commented Mar 23, 2021

Another update: after downgrading to TF=2.2 and fixing issues in previous update, I've essentially caught up to the original issue (see: #172 (comment)) on my local (no gpu, no cuda) machine.

**tf_trajectories_example**
Works as expected!

**vpnet_test**

2021-03-23 18:00:29.763857: F /usr/local/include/tensorflow/bazel-bin/tensorflow/include/tensorflow/core/platform/refcount.h:90] Check failed: ref_.load() == 0 (1 vs. 0)
Aborted (core dumped)

**alpha_zero_example**

2021-03-23 18:00:50.037764: F tensorflow/core/framework/tensor.cc:672] Check failed: IsAligned() Aligned and single element
Aborted (core dumped)

The issue hints that this could be due to further mismatches (see last sentence of #172 (comment)) but I haven't found exactly which yet.

On the lab machine (gpu+cuda), alpha_zero_example and vpnet_test behave similarly, but I get this for tf_trajectories_example:

2021-03-23 18:11:38.578086: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2021-03-23 18:11:38.579639: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-03-23 18:11:38.581147: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-03-23 18:11:38.581388: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-03-23 18:11:38.583199: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-03-23 18:11:38.584041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-03-23 18:11:38.587469: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-23 18:11:38.588810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2021-03-23 18:12:40.119304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-23 18:12:40.119330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
2021-03-23 18:12:40.119352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
2021-03-23 18:12:40.120813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7334 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:04:00.0, compute capability: 6.1)
2021-03-23 18:12:40.122737: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x162d5330 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-23 18:12:40.122753: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1070, Compute Capability 6.1
2021-03-23 18:12:40.148163: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: NodeDef mentions attr 'allowed_devices' not in Op<name=VarHandleOp; signature= -> resource:resource; attr=container:string,default=""; attr=shared_name:string,default=""; attr=dtype:type; attr=shape:shape; is_stateful=true>; NodeDef: {{node beta1_power}}. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
2021-03-23 18:12:40.148282: F /home/cornell.edu/nag83/Documents/playground/open_spiel/open_spiel/contrib/tf_trajectories.cc:114] Non-OK-status: tf_session_->Run({}, {}, {"init_all_vars_op"}, nullptr) status: Invalid argument: NodeDef mentions attr 'allowed_devices' not in Op<name=VarHandleOp; signature= -> resource:resource; attr=container:string,default=""; attr=shared_name:string,default=""; attr=dtype:type; attr=shape:shape; is_stateful=true>; NodeDef: {{node beta1_power}}. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
[[beta1_power]]
Aborted (core dumped)

which is similar to the errors we've seen above. I have to move to something else for a bit but will keep digging.

@ngrupen
Contributor Author

ngrupen commented Mar 23, 2021

One more quick update: export_graph.py fails because it is using deprecated functions that have since been removed. Replacing:
init = tf.initialize_variables(tf.all_variables(), name="init_all_vars_op")
with
init = tf.variables_initializer(tf.global_variables(), name="init_all_vars_op")
does the trick. Using tf.global_variables_initializer() would also work in this case. Either way, it fixes the errors from up here: #539 (comment)
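In case other scripts have the same problem, here is a rough sketch for spotting these two names in a Python source file. The rename table comes straight from the deprecation warnings quoted earlier in this issue; the helper `find_deprecated_calls` is hypothetical, and it matches on attribute names only, so it may flag unrelated objects that happen to share those names.

```python
import ast

# Old name -> replacement, taken from the deprecation warnings quoted
# earlier in this thread.
RENAMES = {
    "all_variables": "global_variables",
    "initialize_variables": "variables_initializer",
}


def find_deprecated_calls(source):
    """Return sorted (line, old_name, new_name) tuples for deprecated attributes."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr in RENAMES:
            hits.append((node.lineno, node.attr, RENAMES[node.attr]))
    return sorted(hits)


if __name__ == "__main__":
    line = 'init = tf.initialize_variables(tf.all_variables(), name="init_all_vars_op")\n'
    for lineno, old, new in find_deprecated_calls(line):
        print(f"line {lineno}: tf.{old} -> tf.{new}")
```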

@lanctot
Collaborator

lanctot commented Mar 23, 2021

Hmm, I thought we had fixed those last two problems (the ref count and isAligned()).

@mrdaliri @michalsustr judging from going back through the conversations, it seems like those might still be outstanding issues....? (see #539 (comment)). @mrdaliri I thought you were using it for your research, did you ever manage to get TF AlphaZero fully working? If not, we should clearly mark it somewhere in the documentation as not fully working.

@ngrupen
Contributor Author

ngrupen commented Mar 24, 2021

As a sanity check, I re-installed tensorflow_cc and OpenSpiel from a blank slate. I wanted to try out using @mrdaliri 's open_spiel branch of the tensorflow_cc fork, because I thought it was conspicuous that the link to that repo in the AlphaZero docs went to the open_spiel branch and not the mrdaliri-tf-mirror branch that's listed in the steps we've referenced so far (i.e. #172 (comment)). I was able to get to the same point using the open_spiel branch so, unless I'm mistaken, I think @mrdaliri accounted for the fixes up to this point in that branch, which removes the need to do the steps individually each time.

Anyways, here's what's left to tackle after building with TF=2.2.0 and including everything we've discussed so far:

The following tests FAILED:
23 - vpnet_test (Child aborted)
32 - alpha_zero_example_test (Child aborted)
100 - python_evaluator_test (Child aborted)
101 - python_model_test (Child aborted)
105 - python_deep_cfr_test (Child aborted)
106 - python_deep_cfr_tf2_test (Child aborted)
110 - python_eva_test (Child aborted)
113 - python_exploitability_descent_test (Child aborted)
136 - python_rcfr_test (Child aborted)

vpnet_test and alpha_zero_example_test failed for the reasons discussed above. I think the other failures might be the result of package mismatches (they didn't fail previously and I've only changed the TF version thus far), but I'll look into this more.

@michalsustr
Collaborator

@mrdaliri @michalsustr judging from going back through the conversations, it seems like those might still be outstanding issues....

Yes sorry, I gave up on TF and just use libtorch.

@lanctot
Collaborator

lanctot commented Mar 24, 2021

@ngrupen I contacted @mrdaliri and he confirmed that he never got it fully working. I had mistakenly understood that he had; my apologies for not making that clear. I assume the furthest point he got to is the same place you are now; @mrdaliri can you please confirm?

I will update the documentation to reflect this state of the C++ TF AlphaZero. It would be wonderful if you can find a fix for these last two problems so we can finally get it to work. It'd also be great if it was using a more recent version of TF. I understand if you're not motivated to continue on this. If that's the case, I suggest using the libtorch based C++ AlphaZero. I know this one works, at least one person is using it for their research, and we're about to build a C++ DQN based on the same code as well (which may involve refactoring some of it to make more uses of deep RL via C++ libtorch cleaner). So there's a very good chance that we continue to support that code and build on it, whereas I expect to eventually remove the TF C++ code if nobody can get it to work.

@mrdaliri
Contributor

Hi @lanctot @ngrupen,
Back in the day, I got tf_trajectories_example, vpnet_test and alpha_zero_example compiled successfully but only tf_trajectories_example ran without errors. (see this comment on #172)

However, one of the later releases of OpenSpiel switched to a different data type (from float to a special double or something like that if I remember correctly) which again introduced some linking issues. I couldn't solve that error, and eventually, I ran out of time and had to give up on TF C++.

@lanctot
Collaborator

lanctot commented Mar 26, 2021

One more quick update: export_graph.py fails because it is using deprecated functions that have since been removed. Replacing:
init = tf.initialize_variables(tf.all_variables(), name="init_all_vars_op")
with
init = tf.variables_initializer(tf.global_variables(), name="init_all_vars_op")
does the trick. Using tf.global_variables_initializer() would also work in this case. Either way, it fixes the errors from up here: #539 (comment)

Hi @ngrupen, can you submit a small PR fix for this?

@ngrupen
Contributor Author

ngrupen commented Mar 29, 2021

@lanctot Apologies for the delay -- I was tied up at the end of last week / weekend responding to reviewers :-) Anyways, just submitted this PR for the minor fix: #547 (comment)

As for libtorch vs. tensorflow_cc, I am planning to use OpenSpiel for research as well, so I have started using the libtorch version for now just to start making progress. I can try to pick up where we left off with tensorflow_cc if I have an extra moment.
