Solving the problem of TensorFlow not finding the GPU in HPCC Systems

I installed and configured the environment following the setup instructions provided by TensorFlow and CUDA, but HPCC Systems could not detect the GPU. This blog post documents how I resolved the issue.

I ran the following test code:

IMPORT PYTHON3 AS PYTHON;

STRING GPUtest() := EMBED(Python)
import tensorflow as tf
# Report whether TensorFlow can see a GPU from inside HPCC
if tf.test.is_gpu_available():
    return 'available'
else:
    return 'unavailable'
ENDEMBED;

res := GPUtest();
OUTPUT(res, NAMED('res'));

Reason: In my previous environment setup, I had installed TensorFlow and CUDA as root, but I only configured the environment variables in the .bashrc of my own user. HPCC Systems, however, creates a dedicated user named "hpcc" and runs with that user's environment. For the "hpcc" user, LD_LIBRARY_PATH and the other required variables were not set, so CUDA and the GPU could not be found.

I first set a password for the hpcc user so that I can switch to it:

sudo passwd hpcc

In Ubuntu, there are two methods to switch to another user:

  • su user: requires the target user's password (unless you are already root). Without the - option it only switches to the target user's identity and inherits the current user's environment variables; the target user's login environment is not loaded.
  • sudo -i -u user: switches to the target user through a login shell, so the target user's complete login environment (including /etc/profile) is loaded. The difference is easy to verify once the variables are in place, as shown further below.

According to the environment variable setting rules in Linux, I have added the previously set environment variables into /etc/profile:

alias python='python3'

# CUDA toolkit libraries and binaries
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-11.8/lib64
export PATH=/usr/local/cuda-11.8/bin${PATH:+:${PATH}}

# cuDNN libraries installed by the nvidia-cudnn pip package
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDNN_PATH/lib

Using sudo -i -u hpcc to switch to the hpcc user and typing env, I found that the new settings had taken effect. Note that at this point you should not use su hpcc to switch users, because that would inherit the calling user's environment rather than load hpcc's login environment.
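
For reference, this is a minimal sketch of the check (assuming the hpcc user's password has been set as above and the current user has sudo rights); the first command does not source /etc/profile, while the second does:

# Non-login switch: /etc/profile is NOT sourced, so the newly added CUDA paths do not appear
su hpcc -c 'env | grep -i cuda'

# Login shell as hpcc: /etc/profile IS sourced, so the CUDA and cuDNN paths appear
sudo -i -u hpcc env | grep -i cuda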

However, even after making these changes, HPCC Systems still could not recognize the GPU.
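
A process only picks up environment variables when it starts, so the HPCC Systems services also need to be restarted after /etc/profile is changed. The exact command depends on the platform version and how it was installed; on a bare-metal Ubuntu install it is typically along the lines of the following (check the HPCC Systems documentation for the exact service name on your version):

# Restart the HPCC Systems platform so its processes start with the new environment
sudo /etc/init.d/hpcc-init restart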

So I restarted HPCC Systems and then ran the following code in HPCC to retrieve the environment variables that the embedded Python actually sees:

IMPORT PYTHON3 AS PYTHON;

STRING GPUtest() := EMBED(Python)
import tensorflow as tf
import subprocess

# Dump the environment that this embedded Python process actually sees
command = "env"
result = subprocess.run(command, shell=True, capture_output=True, text=True)
exit_code = result.returncode
output = result.stdout
error = result.stderr
return output
ENDEMBED;

res := GPUtest();
OUTPUT(res, NAMED('res'));

I found that the environment variable LD_LIBRARY_PATH had not been set correctly. Could it be that the CUDNN_PATH line in /etc/profile relies on a Python command that did not execute correctly? To test this, I changed CUDNN_PATH to

/usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib

which is the value that command prints when run in a terminal. After testing again, the GPU was recognized correctly.
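
For reference, the relevant lines in /etc/profile then look roughly like this (a sketch: the exact path depends on your Python version and on where pip installed the nvidia-cudnn package, and the trailing /lib is assumed to no longer be appended separately, since the hard-coded value already ends in /lib):

# Hard-coded cuDNN library directory, taken from a terminal on this machine
CUDNN_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDNN_PATH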

Additionally, I later stumbled upon an error:

/etc/profile: line 33: python: command not found
dirname: missing operand
Try 'dirname --help' for more information.

It turned out that the python command was not found when /etc/profile was sourced: the alias python='python3' defined at the top of the file does not take effect in a non-interactive shell, so the command substitution failed. I resolved this by calling python3 directly in that line:

CUDNN_PATH=$(dirname $(python3 -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))

After making this change, the code ran successfully without any issues.
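
As a final sanity check outside of ECL, one can also query TensorFlow directly from the hpcc user's login environment (a sketch, assuming TensorFlow is installed system-wide for python3; tf.config.list_physical_devices is the non-deprecated replacement for tf.test.is_gpu_available):

# Open a login shell as the hpcc user, then ask TensorFlow which GPUs it sees
sudo -i -u hpcc
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"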