
link tensorflow with open_spiel. #195

Closed
Liuweiming opened this issue Apr 25, 2020 · 5 comments
@Liuweiming

Dear authors, I am currently trying the C++ version of AlphaZero, and I have been trying to link TensorFlow with open_spiel for some time. I tried to compile the TensorFlow C++ libs, but linking them from open_spiel failed; the reported problems are about the Absl lib and the Eigen lib.
I also tried to link with the prebuilt TensorFlow c_api lib. It works without linking errors. However, the program may get stuck somewhere; it seems the problem is caused by absl::Mutex.
I guess it is because open_spiel and TensorFlow depend on different versions of Absl, like below.

open_spiel -----> Absl-lts_2020_02_25
└─--------------> TensorFlow -----> Absl_some_old_version

What do you think? Could you please give me some advice?

@Liuweiming
Author

I noticed the issue (#172), and that is why I am trying TensorFlow c_api lib. If you are interested in the c_api lib and how to reproduce the problem, I can provide more details.

@lanctot
Collaborator

lanctot commented Apr 25, 2020

Hi @Liuweiming , as mentioned in that issue you linked, I've been working on getting Tensorflow compiled with CMake using tensorflow_cc (see https://github.com/FloopCZ/tensorflow_cc).

Compiling against a prebuilt Tensorflow c_api sounds promising; I like simple solutions. I would be quite sad if it turned out to be because of two different versions of absl. Yes, I am quite curious about the details, so I would be happy to try to reproduce the problem. No rush, though: I unfortunately don't have much time to work on this, so responses might be slow.

@lanctot
Collaborator

lanctot commented Apr 25, 2020

I also tried to link with the prebuilt TensorFlow c_api lib. It works without linking errors. However, the program may be stuck somehow. It seems the problem is caused by absl::Mutex.

Curious what program you ran to assess this, and what makes you think it's caused by absl::Mutex? Was it a simple example or our C++ AlphaZero? I'd love to know if it runs for a very simple example. You can also look at tf_trajectories in contrib which basically just runs batch inference (no training). I'm curious if it would work on that.

@Liuweiming
Author

@lanctot , I wrote some code based on the C++ AlphaZero. Actually, I wrote a c_api wrapper so that I can call the c_api lib easily. If you are interested, please take a look here. The implementation is based on these two projects:

But to make the problem clear, I also wrote a very simple project, which depends only on the Absl lib and the TensorFlow c_api lib. I put the code here. The Absl version is lts_2020_02_25. The c_api lib was downloaded from https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-gpu-linux-x86_64-1.15.0.tar.gz (GPU version). The system and compiler are Ubuntu 16.04 and gcc-7.5.0. The GPU is a 1080 Ti and the driver version is 440.64.00.
Four settings were tested:

  1. No absl::Mutex, No GPU.
  2. absl::Mutex, No GPU.
  3. No absl::Mutex, GPU.
  4. absl::Mutex, GPU.

Here are the results.

example 1: Load a simple model and run an inference. The model takes input_a and input_b; the result = input_a + input_b. The Python code is shown below.

import tensorflow as tf  # TF 1.x API

# Two simple inputs
a = tf.placeholder(tf.float32, shape=(1, 100), name="input_a")
b = tf.placeholder(tf.float32, shape=(1, 100), name="input_b")

# Output
c = tf.add(a, b, name='result')

# To add an init operation to the model
i = tf.initializers.global_variables()

# Write the model definition
with open('loading_example.pb', 'wb') as f:
    f.write(tf.get_default_graph().as_graph_def().SerializeToString())

Then I import the graph and run an inference in C++. The code is here.

Please note the second line in the main function, where I created an absl::Mutex object: absl::Mutex m;. I wanted to find out what happens when I compile with or without that line. This example works well in all four settings:

absl::Mutex | GPU | Work?
------------|-----|------
No          | No  | Yes
No          | Yes | Yes
Yes         | No  | Yes
Yes         | Yes | Yes

example 2: I build a one-layer neural network and try to run inference and train it from C++. The Python code is still simple:

import tensorflow as tf  # TF 1.x API
tfkl = tf.keras.layers

# inputs
input_ = tf.placeholder(tf.float32, shape=(1, 100), name="input")
target_ = tf.placeholder(tf.float32, shape=(1, 3), name="target")

# Output
output_ = tfkl.Activation("relu")(tfkl.Dense(64)(input_))
output_ = tfkl.Dense(3)(output_)
output_ = tf.identity(output_, name="output")

loss = 0.5 * tf.reduce_mean(tf.squared_difference(output_, target_))
loss = tf.identity(loss, name="loss")

optimizer = tf.train.AdamOptimizer(0.01)
train = optimizer.minimize(loss, name="train")

init = tf.variables_initializer(
    tf.global_variables(), name="init")

# Write the model definition
with open('training.pb', 'wb') as f:
    f.write(tf.get_default_graph().as_graph_def().SerializeToString())

The C++ code is here.

In this example, the result is different.

absl::Mutex | GPU | Work?
------------|-----|------
No          | No  | Yes
No          | Yes | Yes
Yes         | No  | Yes
Yes         | Yes | No

When running normally, the output should look like this:

Session starting: init
Session ending: init
Session starting: output
Session ending: output
before training, output = [0.412944 -0.634564 -1.20341 ]
Session starting: train
Session ending: train
after training, output = [-0.00074295 1.00136 2.00789 ]

However, when running on GPU, it gets stuck here:

2020-04-26 08:29:02.183289: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-26 08:29:02.185364: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-26 08:29:02.187103: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-04-26 08:29:02.187548: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-04-26 08:29:02.211719: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-04-26 08:29:02.213615: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-04-26 08:29:02.228917: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-26 08:29:02.239596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-04-26 08:29:02.244991: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-26 08:29:02.904258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-26 08:29:02.904521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-04-26 08:29:02.904596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-04-26 08:29:02.912069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3186 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
Session starting: init
Session ending: init
Session starting: output
2020-04-26 08:29:04.950677: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

But anyway, I don't plan to dig into this issue further, because I am getting busy now.

@Liuweiming
Author

Update: I tried to rename the namespace of the Absl lib by doing:

sed -i "s/namespace absl/namespace my_absl/g" `grep "namespace absl" -rl .`
sed -i "s/absl::/my_absl::/g" `grep "absl::" -rl .`

It compiled and linked perfectly, but the problem remains the same. So I tried to debug into the TensorFlow lib. The screenshot shows that libtensorflow_framework.so is linked to my_absl::lts_2020_02_25!

Please note the function AbslInternalPerThreadSemWait; the code in the Absl lib shows

extern "C" {
void AbslInternalPerThreadSemPost(
    my_absl::base_internal::ThreadIdentity* identity);
bool AbslInternalPerThreadSemWait(
    my_absl::synchronization_internal::KernelTimeout t);
}  // extern "C"

So this function is exported as a C function with unmangled linkage, and obviously the namespace rename does not affect it at all.
