Common Problems
From UF HPC Wiki
Introduction
Problems that are specific to a particular program may be documented on that program's page in this Wiki rather than here, so please check those pages as well. If you had to work out a fix of your own to get something running properly, please either let us know, or create an account in the Wiki and add it yourself!
Bad interpreter
Why do I get messages like "/bin/bash: bad interpreter: No such file or directory" from my shell and/or job scripts?
For example:
-bash: /var/spool/PBS/mom_priv/jobs/<pbs_job_ib>.pbs.local.SC: /bin/bash : bad interpreter: No such file or directory
One common cause is a script that was created on a Windows machine and then copied over to the HPC cluster via scp, ftp, etc. Text files created on Windows, including PBS job scripts and shell scripts, end each line with an invisible carriage return character (displayed as ^M). The shells on UNIX/Linux machines treat that character as part of the interpreter path, so they look for "/bin/bash^M" and fail.
Use the dos2unix program to "scrub" the scripts you create on a Windows host. To use it, simply type the command dos2unix <filename>, where <filename> is the file to be scrubbed.
dos2unix will rewrite the file in UNIX format. See the dos2unix manpage via "man dos2unix" for more information.
Another alternative is to simply create a new file on the HPC cluster with your favorite text editor and copy/paste the contents of the script into the new file.
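If dos2unix is not installed where you are working, the same cleanup can be approximated with standard tools. The following is a minimal sketch, not a replacement for dos2unix: has_cr and strip_cr are hypothetical helper names, and strip_cr removes every carriage return in the file, not only those at line ends.

```shell
# has_cr FILE: succeeds if FILE contains at least one carriage return (^M)
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# strip_cr FILE: remove all carriage returns from FILE in place.
# The result is written back with cat so the original file permissions
# (for example the executable bit) are preserved.
strip_cr() {
    tr -d '\r' < "$1" > "$1.unix" &&
        cat "$1.unix" > "$1" &&
        rm -f "$1.unix"
}
```

For example, "has_cr myjob.pbs && strip_cr myjob.pbs" would clean a job script only if it needs it (myjob.pbs is a placeholder name).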
Walltime
What happens if I do not specify the walltime in my job submission?
The default walltime is 12 hours, so if you do not specify a walltime, that is the maximum walltime your job will have. If you have a good idea of how long your job will take, it is better to include a walltime in your submission, as this helps the Maui scheduler figure out when best to schedule your job.
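As a minimal sketch, a Torque/PBS job script can request a walltime with a "-l walltime" directive in HH:MM:SS form. The job name, the two-hour limit, and the node request below are placeholder values chosen for illustration, not site recommendations.

```shell
#!/bin/bash
#PBS -N walltime_example
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=1

# Placeholder job body: change to the submission directory (or stay put
# when run outside of PBS) and report where the job is running.
cd "${PBS_O_WORKDIR:-.}"
echo "Job started on $(uname -n)"
```

The same request can also be given on the command line at submission time, for example "qsub -l walltime=02:00:00 myjob.pbs" (myjob.pbs is a placeholder name).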
SSH Keys
Passwordless logins
Our group is working on a project, and we need to run about 5000 simulations intermittently during the season. We are trying to automate the process, which requires us to automate the call to HPC. But as the accounts are password enabled, I was wondering how we could go about this.
One thing you can do is copy your RSA public key into the authorized_keys file in your HPC account. (The known_hosts file stores the host keys of servers you connect to, not the keys that are allowed to log in.) On Linux, you would do this as follows:
- Find your RSA public key. Typically this is located in the .ssh directory under your home directory. In the following example it is called id_rsa.pub.
jka@puppy:~/.ssh$ pwd
/home/jka/.ssh
jka@puppy:~/.ssh$ ls -l
total 48
-rw------- 1 jka jka   392 2007-01-16 15:40 cise.pub
-rw------- 1 jka jka   887 2005-12-16 14:27 id_rsa
-rw-r--r-- 1 jka jka   356 2008-06-06 10:48 id_rsa.keystore
-rw-r--r-- 1 jka jka   222 2005-12-16 14:27 id_rsa.pub
-rw-r--r-- 1 jka jka 31048 2008-06-05 16:22 known_hosts
- Copy the contents of that file into the ~/.ssh/authorized_keys file in your HPC account. If this file does not exist, create it with the contents of your RSA public key. If it does exist, append your RSA public key to the end of the file.
- By default, when your HPC account is created, we copy the public RSA key that is generated for it into the authorized_keys file so that jobs work more smoothly with your account on the various nodes in the cluster.
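The append step can be sketched as a small shell function. This is an illustration under stated assumptions: install_key is a hypothetical helper name, it is meant to be run on the HPC side after the public key file has been copied there, and it targets ~/.ssh/authorized_keys, which is the file OpenSSH reads to decide which keys may log in.

```shell
# install_key PUBKEY_FILE SSH_DIR
# Append the public key in PUBKEY_FILE to SSH_DIR/authorized_keys,
# skipping the append if that exact key line is already present.
install_key() {
    pubkey_file=$1
    ssh_dir=$2

    # sshd ignores keys when the directory or file is too permissive
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"

    if ! grep -qxF -f "$pubkey_file" "$ssh_dir/authorized_keys" 2>/dev/null; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}
```

A typical call would be "install_key id_rsa.pub ~/.ssh". Where it is installed on your local machine, OpenSSH's ssh-copy-id tool performs a similar append over the network.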
Changed Keys
Recently we changed the SSH host keys on submit.hpc.ufl.edu. As a result, anyone who logs in to this machine via ssh and still has the old key stored will see a warning similar to this:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
ab:b9:cd:21:50:df:55:f5:23:79:45:c4:e5:68:4a:6e.
Please contact your system administrator.
Add correct host key in /ufl/qtp/jlk/sa/.ssh/known_hosts to get rid of this message.
Offending key in /ufl/qtp/jlk/sa/.ssh/known_hosts:4
RSA host key for submit.hpc.ufl.edu has changed and you have requested strict checking.
Host key verification failed.
If this happens, you will have to remove the offending key from your known_hosts file and then accept the new key. In the above case, the warning itself tells us that the offending key is on line 4 of the file /ufl/qtp/jlk/sa/.ssh/known_hosts. All you have to do in this case is edit the file, go to line 4, and delete that line.
Once you have deleted the line and saved the file, you should be able to connect to the system via ssh. The first time you do this it will ask if you want to accept the new key. Do this, and it will not ask again.
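If you prefer not to open an editor, deleting the offending line can be sketched with sed. The helper name remove_known_hosts_line is hypothetical; the file and line number are whatever the warning reports after "Offending key in".

```shell
# remove_known_hosts_line FILE LINE
# Delete line number LINE from FILE, keeping a backup copy in FILE.bak.
remove_known_hosts_line() {
    file=$1
    line=$2
    cp "$file" "$file.bak" &&
        sed "${line}d" "$file.bak" > "$file"
}

# OpenSSH can also remove every stored key for a host by name:
#   ssh-keygen -R submit.hpc.ufl.edu
```

For the warning shown above, the call would be "remove_known_hosts_line ~/.ssh/known_hosts 4".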
Infiniband Issues
Because some nodes have Infiniband, some nodes do not, and some nodes have IB cards that are not yet connected to the fabric (awaiting switches), you might want to add the following logic to your submission scripts to set your "mpirun" command. Use this if you use csh/tcsh scripting:
set IbEnabled = `/usr/local/sbin/IbEnabled`
if ( $IbEnabled ) then
    echo "Running on IB-enabled node set"
    set MPIRUN = "mpirun --mca btl openib"
else
    echo "Running on GigE-enabled node set"
    set MPIRUN = "mpirun --mca btl ^udapl,openib --mca btl_tcp_if_include eth0"
endif
Use this if you use BASH scripting:
# IbEnabled is expected to print a non-zero number on IB-connected nodes
IbEnabled=$(/usr/local/sbin/IbEnabled)
if [ "$IbEnabled" -gt 0 ]; then
    echo "Running on IB-enabled node set"
    MPIRUN="mpirun --mca btl openib"
else
    echo "Running on GigE-enabled node set"
    MPIRUN="mpirun --mca btl ^udapl,openib --mca btl_tcp_if_include eth0"
fi
Note that this only applies to MPI applications built using OpenMPI (the cluster default). This will avoid some potential problems.
Another thing to watch out for with OpenMPI and Torque is the use of the machinefile directive. Using it will typically result in an error similar to the following:
[r5a-s11.ufhpc:30484] pls:tm: failed to poll for a spawned proc, return status = 17002
[r5a-s11.ufhpc:30484] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
[r5a-s11.ufhpc:30484] mpirun: spawn failed with errno=-11
Do not use the machinefile directive, which typically looks like this: machinefile $PBS_O_WORKDIR/pbsnodes. The purpose of this directive is already handled by OpenMPI and Torque.
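As a rough sketch of a launch line that leaves host selection to OpenMPI and Torque, the job script can simply omit any machinefile option. The application name ./my_mpi_app is a placeholder, and counting lines of $PBS_NODEFILE to obtain the process count is an assumption about how you want to size the run; MPIRUN falls back to plain mpirun if it was not set earlier in the script.

```shell
# Use the MPIRUN value chosen earlier in the script, or plain mpirun
MPIRUN=${MPIRUN:-mpirun}

# Torque lists one line per allocated processor slot in $PBS_NODEFILE;
# outside of a Torque job, fall back to a single process.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    NP=$(grep -c . "$PBS_NODEFILE")
else
    NP=1
fi

# No machinefile option is passed here.
LAUNCH_CMD="$MPIRUN -np $NP ./my_mpi_app"
echo "Would run: $LAUNCH_CMD"
# Inside a real job, run the command instead of only printing it:
# $LAUNCH_CMD
```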
