I installed and configured the environment following the setup instructions provided by TensorFlow and CUDA, but HPCC Systems could not detect the GPU. This blog post documents how I resolved the issue.
I ran the test code:
IMPORT PYTHON3 AS PYTHON;
Reason: In my previous environment setup, I installed TensorFlow and
CUDA as root, but I only configured the environment variables in the
.bashrc file of the current user.
However, HPCC Systems creates a new user named "hpcc" and uses the
environment variables from that user. As a result, in the "hpcc" user,
LD_LIBRARY_PATH and other environment variables were
not present, causing CUDA and GPU recognition to fail.
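A quick way to confirm this diagnosis is to check which of the expected variables are actually set for the account in question. The following is only a minimal sketch: the helper name missing_env_vars is hypothetical, and the variable names are the ones mentioned in this post.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    required: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`.

    When `env` is omitted, the environment of the current process is used,
    so the result depends on which user runs the script.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Variable names taken from this post; adjust them to your own setup.
    print(missing_env_vars(["CUDNN_PATH", "LD_LIBRARY_PATH"]))
```

Running the script once as the current user and once as the hpcc user should show whether the variables are present for one account but not for the other.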
I first set a password for the hpcc user:
sudo passwd hpcc
In Ubuntu, there are two methods to switch to another user:
- su user: The su command asks for the password of the target user (unless you are already root). Without the login option, it does not load the target user's full environment: it switches to the target user's identity but keeps most of the current user's environment variables.
- sudo -i -u user: This command starts a login shell as the target user, so the target user's login environment is loaded. It asks for your own password rather than the target user's, and your account must have sudo privileges.
Following the Linux rules for setting environment variables, I added
the previously set environment variables to /etc/profile. After running
sudo -i -u hpcc to enter the hpcc user and typing
env, I found that the settings had taken effect.
Please note that at this point, you should not use su
to enter the user, as it would load incorrect environment variables.
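To see exactly how the two ways of switching users differ, you can save the output of env from each session and compare them. This is a rough sketch only: parse_env and env_diff are hypothetical helper names, and the parser assumes one NAME=value pair per line, so multi-line values are not handled.

```python
from typing import Dict, Optional, Tuple


def parse_env(text: str) -> Dict[str, str]:
    """Parse the output of `env` (one NAME=value per line) into a dict."""
    result: Dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:  # skip lines that are not NAME=value pairs
            result[name] = value
    return result


def env_diff(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing variable to (value in a, value in b).

    A value of None means the variable is absent from that environment.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

For example, pasting the env output from a su session into one string and from a sudo -i -u hpcc session into another, then calling env_diff(parse_env(first), parse_env(second)), lists only the variables that differ between the two.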
However, even after making these settings, HPCC still could not recognize the GPU.
So I tried running code inside HPCC to retrieve the environment variables it actually sees. I restarted HPCC Systems and then ran:
IMPORT PYTHON3 AS PYTHON;
I found that the expected environment variable
was not loaded correctly. Could it be because
CUDNN_PATH is set by a Python statement that was not
executed correctly? To test this, I replaced the CUDNN_PATH expression with a hard-coded path,
which is the value obtained from the terminal, and after testing, I found that the GPU could be recognized correctly.
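A check like this can be repeated from Python in whatever process you want to inspect. The sketch below is an illustration under assumptions rather than the code used in this post: visible_gpus is a hypothetical helper name, and it relies on TensorFlow's tf.config.list_physical_devices call.

```python
from typing import List, Optional


def visible_gpus() -> Optional[List[str]]:
    """Return the names of the GPUs TensorFlow can see in this process.

    Returns None when TensorFlow cannot be imported, and an empty list when
    TensorFlow loads but finds no GPU (for example, when the CUDA or cuDNN
    libraries are not on LD_LIBRARY_PATH).
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    print(visible_gpus())
```

An empty list from the HPCC side combined with a non-empty list from the terminal points to a difference in environment rather than a problem with the driver or hardware.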
Additionally, I later stumbled upon an error:
/etc/profile: line 33: python: command not found
It turned out that the
python command was not recognized.
I resolved this issue by changing python to
python3 in the following setting:
CUDNN_PATH=$(dirname $(python3 -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
After making this change, the code ran successfully without any issues.
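If you want to verify the value before putting it into a profile file, the same lookup can be done in Python with explicit handling for a missing package. This is a minimal sketch, assuming the cuDNN pip package exposes the nvidia.cudnn module as in the setting above; package_dir is a hypothetical helper name.

```python
import importlib.util
import os
from typing import Optional


def package_dir(module_name: str = "nvidia.cudnn") -> Optional[str]:
    """Return the directory containing an importable package, or None.

    This mirrors the shell one-liner, but returns None instead of failing
    when the package (or its parent package) is not installed.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return None
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


if __name__ == "__main__":
    # Prints None when nvidia.cudnn is not installed for this interpreter.
    print(package_dir())
```

Running it with the same python3 interpreter that the profile file invokes shows whether the CUDNN_PATH expression can resolve for that account.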