
link tensorflow with open_spiel. #195

Closed
Liuweiming opened this issue Apr 25, 2020 · 5 comments
@Liuweiming

Dear authors, I am currently trying the C++ version of AlphaZero, and I have been trying to link TensorFlow with open_spiel for some time. I tried to compile the TensorFlow C++ libs, but linking them from open_spiel failed; the reported problems are about the Absl lib and the Eigen lib.
I also tried to link with the prebuilt TensorFlow c_api lib. It works without linking errors. However, the program may get stuck somewhere; it seems the problem is caused by absl::Mutex.
I guess it is because open_spiel and TensorFlow depend on different versions of Absl, like below.

open_spiel -----> Absl-lts_2020_02_25
└─--------------> TensorFlow -----> Absl_some_old_version

What do you think? Could you please give me some advice?

@Liuweiming
Author

I noticed the issue (#172), and that is why I am trying TensorFlow c_api lib. If you are interested in the c_api lib and how to reproduce the problem, I can provide more details.

@lanctot
Collaborator

lanctot commented Apr 25, 2020

Hi @Liuweiming , as mentioned in that issue you linked, I've been working on getting Tensorflow compiled with CMake using tensorflow_cc (see https://github.com/FloopCZ/tensorflow_cc).

Compiling against a prebuilt Tensorflow c_api sounds promising; I like simple solutions. I would be quite sad if it turned out to be because of two different versions of absl. Yes, I am quite curious about the details, so I would be happy to try to reproduce the problem. No rush, though: I unfortunately don't have much time to work on this, so responses might be slow.

@lanctot
Collaborator

lanctot commented Apr 25, 2020

I also tried to link with the prebuilt TensorFlow c_api lib. It works without linking errors. However, the program may be stuck somehow. It seems the problem is caused by absl::Mutex.

Curious what program you ran to assess this, and what makes you think it's caused by absl::Mutex? Was it a simple example or our C++ AlphaZero? I'd love to know if it runs for a very simple example. You can also look at tf_trajectories in contrib which basically just runs batch inference (no training). I'm curious if it would work on that.

@Liuweiming
Author

@lanctot , I wrote some code based on the C++ AlphaZero. Actually, I wrote a c_api wrapper so that I can call the c_api lib easily. If you are interested, please take a look here. The implementation is based on these two projects:

But to make the problem clear, I also wrote a very simple project, which depends only on the Absl lib and the TensorFlow c_api lib. I put the code here. The Absl version is lts_2020_02_25. The c_api lib was downloaded from https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-gpu-linux-x86_64-1.15.0.tar.gz (GPU version). The system and compiler are Ubuntu 16.04 and gcc-7.5.0. The GPU is a 1080 Ti and the driver version is 440.64.00.
Four settings were tested:

  1. No absl::Mutex, No GPU.
  2. absl::Mutex, No GPU.
  3. No absl::Mutex, GPU.
  4. absl::Mutex, GPU.

Here are the results.

example 1: Load a simple model and run an inference. The model takes input_a and input_b; the result = input_a + input_b. The Python code is shown below.

import tensorflow as tf  # TF 1.x API

# Two simple inputs
a = tf.placeholder(tf.float32, shape=(1, 100), name="input_a")
b = tf.placeholder(tf.float32, shape=(1, 100), name="input_b")

# Output
c = tf.add(a, b, name='result')

# To add an init operation to the model
i = tf.initializers.global_variables()

# Write the model definition
with open('loading_example.pb', 'wb') as f:
    f.write(tf.get_default_graph().as_graph_def().SerializeToString())

Then I import the graph and run an inference in C++. The code is here.

Please note the second line in the main function, where I created an absl::Mutex object: absl::Mutex m;. I wanted to find out what happens when I compile with or without that line. This example works well in all four settings:

absl::Mutex | GPU | Work?
------------|-----|------
No          | No  | Yes
No          | Yes | Yes
Yes         | No  | Yes
Yes         | Yes | Yes

example 2: I build a one-layer neural network and try to run inference and train it from C++. The Python code is still simple:

import tensorflow as tf  # TF 1.x API
tfkl = tf.keras.layers

# inputs
input_ = tf.placeholder(tf.float32, shape=(1, 100), name="input")
target_ = tf.placeholder(tf.float32, shape=(1, 3), name="target")

# Output
output_ = tfkl.Activation("relu")(tfkl.Dense(64)(input_))
output_ = tfkl.Dense(3)(output_)
output_ = tf.identity(output_, name="output")

loss = 0.5 * tf.reduce_mean(tf.squared_difference(output_, target_))
loss = tf.identity(loss, name="loss")

optimizer = tf.train.AdamOptimizer(0.01)
train = optimizer.minimize(loss, name="train")

init = tf.variables_initializer(
    tf.global_variables(), name="init")

# Write the model definition
with open('training.pb', 'wb') as f:
    f.write(tf.get_default_graph().as_graph_def().SerializeToString())

The C++ code is here.

In this example, the result is different.

absl::Mutex | GPU | Work?
------------|-----|------
No          | No  | Yes
No          | Yes | Yes
Yes         | No  | Yes
Yes         | Yes | No

When running normally, the output should look like this:

Session starting: init
Session ending: init
Session starting: output
Session ending: output
before training, output = [0.412944 -0.634564 -1.20341 ]
Session starting: train
Session ending: train
after training, output = [-0.00074295 1.00136 2.00789 ]

However, when running on GPU, it gets stuck here:

2020-04-26 08:29:02.183289: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-26 08:29:02.185364: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-26 08:29:02.187103: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-04-26 08:29:02.187548: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-04-26 08:29:02.211719: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-04-26 08:29:02.213615: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-04-26 08:29:02.228917: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-26 08:29:02.239596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-04-26 08:29:02.244991: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-26 08:29:02.904258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-26 08:29:02.904521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-04-26 08:29:02.904596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-04-26 08:29:02.912069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3186 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
Session starting: init
Session ending: init
Session starting: output
2020-04-26 08:29:04.950677: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

But anyway, I don't plan to dig into this issue further, because I am getting busy now.

@Liuweiming
Author

Update: I tried to rename the namespace of the Absl lib by doing:

sed -i "s/namespace absl/namespace my_absl/g" `grep "namespace absl" -rl .`
sed -i "s/absl::/my_absl::/g" `grep "absl::" -rl .`

It compiled and linked perfectly, but the problem remains the same. So I tried to debug into the TensorFlow lib. The screenshot shows that libtensorflow_framework.so is linked to my_absl::lts_2020_02_25!

Please note the function AbslInternalPerThreadSemWait; the code in the Absl lib shows

extern "C" {
void AbslInternalPerThreadSemPost(
    my_absl::base_internal::ThreadIdentity* identity);
bool AbslInternalPerThreadSemWait(
    my_absl::synchronization_internal::KernelTimeout t);
}  // extern "C"

So this function is exported as a C function with unmangled linkage, and obviously the namespace rename does not affect it at all.
