Installing TensorFlow with distributed GPU support.

Today, I wrote my first “Hello World” script using the freshly open-sourced version of TensorFlow with distributed GPU support. At the time of this writing, the binary releases of TensorFlow do not include distributed GPU support, so I had to build TensorFlow from source. All the documentation needed to do this already exists, but it is scattered across multiple websites. Here is a condensed version of the install process (on Ubuntu 14.04).

To build TensorFlow, you first need to install a few basic tools:

$ sudo apt-get install pkg-config zip g++ zlib1g-dev unzip swig git

You also need to install Java 8. OpenJDK 8 is not packaged for Ubuntu 14.04, so the simplest option is to install Oracle's Java 8 from a PPA:

$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

The last tool needed is Bazel. Again, it only takes two commands:

$ wget
$ sudo dpkg -i bazel_0.2.0-linux-x86_64.deb

If all the commands were successful, you are ready to build TensorFlow... no wait, I said distributed GPU!


Please refer to NVIDIA's installation instructions for CUDA and for cuDNN (to download cuDNN, you will need to register for the Accelerated Computing Developer Program).

Assuming a standard installation in /usr/local/cuda and the cuDNN archive cudnn-7.0-linux-x64-v3.0-prod.tgz, simply run:

$ tar -xf cudnn-7.0-linux-x64-v3.0-prod.tgz
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
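
Before launching a long build, it can be worth checking that the cuDNN files actually landed where the commands above copy them. The following is a minimal sketch, assuming the standard /usr/local/cuda layout used in this post; the function name check_cudnn_install is my own and not part of any TensorFlow or NVIDIA tool.

```python
import glob
import os


def check_cudnn_install(cuda_home="/usr/local/cuda"):
    """Return a list of problems; an empty list means the files are in place."""
    problems = []

    # The header copied by: sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
    header = os.path.join(cuda_home, "include", "cudnn.h")
    if not os.path.isfile(header):
        problems.append("missing header: " + header)

    # The libraries copied by: sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
    lib_dir = os.path.join(cuda_home, "lib64")
    if not glob.glob(os.path.join(lib_dir, "libcudnn*")):
        problems.append("no libcudnn* files in " + lib_dir)

    return problems


if __name__ == "__main__":
    issues = check_cudnn_install()
    print("cuDNN files look fine" if not issues else "\n".join(issues))
```

This only checks that the files exist; it does not verify that the cuDNN version matches what the TensorFlow build expects.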

Now you are ready to build TensorFlow!


TensorFlow uses gRPC for inter-process communication. To build the server binary, first clone the TensorFlow repository:

$ git clone --recurse-submodules

NOTE: The initial commit of the open-source distributed TensorFlow runtime is 00986d48bb646daab659503ad3a713919865f32d.

Then, cd into the TensorFlow repository and run the ./configure script, enabling GPU support when prompted. Now you can build the server binary with:

$ bazel build -c opt --config=cuda //tensorflow/core/distributed_runtime/rpc:grpc_tensorflow_server


To build the pip package with GPU support, just run:

$ bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

and install it with pip:

# The name of the .whl file will depend on your platform.
$ pip install /tmp/tensorflow_pkg/tensorflow-0.7.1-py2-none-linux_x86_64.whl
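
Since the wheel name depends on your platform and TensorFlow version, a small helper can locate it instead of typing the name by hand. This is a sketch assuming the /tmp/tensorflow_pkg output directory used above; find_wheels is a hypothetical helper name, not a TensorFlow or pip command.

```python
import glob
import os


def find_wheels(pkg_dir="/tmp/tensorflow_pkg"):
    """Return the TensorFlow .whl files found in pkg_dir, newest first."""
    wheels = glob.glob(os.path.join(pkg_dir, "tensorflow-*.whl"))
    return sorted(wheels, key=os.path.getmtime, reverse=True)


if __name__ == "__main__":
    found = find_wheels()
    if found:
        # Print the command rather than running it, so you can review it first.
        print("pip install " + found[0])
    else:
        print("no wheel found; did build_pip_package succeed?")
```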


First, start a TensorFlow server as a single-process “cluster”:

$ bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='local|localhost:2222' --job_name=local --task_id=0 &

Then start a Python interpreter, create a remote session, and run a simple “Hello World !” constant through it:

$ python
>>> import tensorflow as tf
>>> c = tf.constant("Hello World !")
>>> sess = tf.Session("grpc://localhost:2222")
>>> sess.run(c)
'Hello World !'
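
Because the server is started in the background, it may not be accepting connections yet when the session above is created. One way to avoid racing against it is to poll the port first. The sketch below uses only the Python standard library; wait_for_port is a hypothetical helper, and it only checks that something accepts TCP connections on the port, not that the gRPC service behind it is healthy.

```python
import socket
import time


def wait_for_port(host="localhost", port=2222, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.time() + timeout
    while True:
        try:
            conn = socket.create_connection((host, port), timeout=interval)
            conn.close()
            return True
        except (socket.error, socket.timeout):
            # Connection refused or timed out: retry until the deadline passes.
            if time.time() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    if wait_for_port("localhost", 2222):
        print("server is listening, safe to create the session")
    else:
        print("nothing is listening on localhost:2222")
```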

Now repeat the process on the other nodes of your cluster and start playing with TensorFlow!
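
When you go beyond one node, every server needs the same cluster description but its own job name and task id. The sketch below generates one command line per task. The single-task form 'local|localhost:2222' is taken from the command above; the multi-task form (tasks separated by ';' and jobs separated by ',') is my assumption about how the format generalizes, so please check it against the README for your TensorFlow version. The host names node0 and node1 are placeholders.

```python
SERVER_BINARY = (
    "bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server"
)


def build_cluster_spec(jobs):
    """jobs is an ordered list of (job_name, [\"host:port\", ...]) pairs.

    Assumption: tasks within a job are joined with ';' and jobs with ','.
    """
    return ",".join(
        "%s|%s" % (name, ";".join(addresses)) for name, addresses in jobs
    )


def server_commands(jobs, binary=SERVER_BINARY):
    """Return (address, command) pairs, one for each task in the cluster."""
    spec = build_cluster_spec(jobs)
    commands = []
    for name, addresses in jobs:
        for task_id, address in enumerate(addresses):
            command = "%s --cluster_spec='%s' --job_name=%s --task_id=%d" % (
                binary, spec, name, task_id)
            commands.append((address, command))
    return commands


if __name__ == "__main__":
    # Hypothetical two-node cluster; replace with your own host names.
    cluster = [("worker", ["node0:2222", "node1:2222"])]
    for address, command in server_commands(cluster):
        print("on %s run:\n  %s" % (address, command))
```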



As you can see, it is really easy to set up a cluster supporting distributed Deep Learning with TensorFlow. If you want to know more about what is possible, please refer to the README. TensorFlow takes a fairly low-level approach to distributed support, which lets the user tune every step of the learning process.

What about the others?

CNTK claims a large performance advantage, especially in the distributed GPU setting. After a quick look at the documentation, it is not easy to understand their distribution policy or how to reproduce their tests. Digging into the GitHub repository, I found this configuration file: Multigpu.cntk. Apparently, the only option for parallelism is a DataParallelSGD approach.

MXNet seems to be a serious competitor in the distributed setting. Their approach distributes training through a distributed key-value store that workers use to exchange gradients and parameters. It is straightforward, yet flexible enough in practice to switch from synchronous to asynchronous learning.
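
To make the synchronous versus asynchronous distinction concrete, here is a toy, in-process key-value store. It is not MXNet's API and does no networking; the class name ToyKVStore, the push/pull method names, and the plain SGD update are all my own simplifications. In synchronous mode the store waits for one gradient from every worker and applies their average; in asynchronous mode it applies each gradient as soon as it arrives.

```python
import threading


class ToyKVStore(object):
    """In-process stand-in for a parameter store (illustration only)."""

    def __init__(self, weights, num_workers, lr=0.1, synchronous=True):
        self.weights = dict(weights)
        self.num_workers = num_workers
        self.lr = lr
        self.synchronous = synchronous
        self._pending = {}  # key -> gradients received in the current round
        self._lock = threading.Lock()

    def push(self, key, grad):
        """Send one worker's gradient for `key` to the store."""
        with self._lock:
            if not self.synchronous:
                # Asynchronous: update immediately with this single gradient.
                self.weights[key] -= self.lr * grad
                return
            # Synchronous: buffer until every worker has contributed.
            self._pending.setdefault(key, []).append(grad)
            if len(self._pending[key]) == self.num_workers:
                grads = self._pending.pop(key)
                self.weights[key] -= self.lr * sum(grads) / len(grads)

    def pull(self, key):
        """Fetch the current value of `key`."""
        with self._lock:
            return self.weights[key]


if __name__ == "__main__":
    for mode in (True, False):
        store = ToyKVStore({"w": 1.0}, num_workers=2, synchronous=mode)
        store.push("w", 1.0)
        after_first = store.pull("w")
        store.push("w", 3.0)
        print("synchronous=%s: after one push w=%r, after two w=%r"
              % (mode, after_first, store.pull("w")))
```

The only difference between the two modes is when the update is applied, which is why a single store abstraction can support both.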

And you? What is your experience with distributed Deep Learning?

