Frequently Asked Questions
From UF HPC Wiki
Introduction
Common problems with a specific program may be documented on that program's page in this Wiki rather than here, so please be sure to check those pages as well. If you had to work something out in order to get things running properly, please either let us know, or create an account in the Wiki and add it yourself!
Account Functions
Logging In
How do I log into the HPC Center?
The login host for the HPC Center is submit.hpc.ufl.edu, and you use ssh to log in. Please see the Getting Started page for more information.
Password
How do I reset my password?
There are two ways in which you can reset your password. The first is from within your account while you are logged in, in which case you would use the passwd command like so:
[jka@submit ~]$ passwd
Changing password for user jka.
Enter login(LDAP) password:
New UNIX password:
Retype new UNIX password:
New password:
Re-enter new password:
LDAP password information changed for jka
passwd: all authentication tokens updated successfully.
Yes, you have to put in your new password a total of four times. This is caused by a problem we are having with LDAP. Also note that it asks for your current password the first time through.
If you cannot remember what your password is, you can reset it through the web as well. Simply go to the Password Reset Page and authenticate via Gatorlink and you should have no problem with resetting your password.
If these two methods fail, you can still email support@hpc.ufl.edu and ask the administrators to reset your password for you.
Security
How secure is my account on the system? Can anyone access or see my directory?
It is only as secure as the Linux/UNIX permissions you use to protect it. If you don't want other users or members of your group to see your data, you need to set your file permissions accordingly.
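As a minimal sketch, assuming a hypothetical directory named private_data, the standard chmod command can restrict a directory to its owner:

```shell
# Hypothetical directory name, used only for illustration.
dir="private_data"
mkdir -p "$dir"

# Remove all access for group members and other users;
# the owner keeps read, write, and search (execute) access.
chmod 700 "$dir"

# Show the resulting mode; it should begin with drwx------
ls -ld "$dir"
```

Files created inside the directory keep their own permissions, but nobody other than the owner can reach them through a directory with mode 700.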
Compilers
Why do we use an Intel compiler on AMD chips?
When the first half of the current incarnation of the UF HPC Center was purchased back in the fall of 2005, we purchased a license for the Pathscale compilers at the same time. For the x86_64 CPU architecture (Opterons), Pathscale was a good choice at the time for a highly optimizing compiler suite. We had access to licenses for the Intel compilers at this time, as well. That said, Pathscale was the HPC Center compiler of choice until the fall of 2006 or so. When it came time to renew the license, we elected not to do so.
Why? Well, there are a few reasons. First, there is money. The Pathscale and Intel compilers are not free, and the HPC Center operations budget is meager. Second, in the experience we gained with the compilers over the first year, we found no compelling reason to favor Pathscale over the Intel compilers. Third, we submitted a bug report to Pathscale during that first year, and they never got back to us with a fix or with any substantial communication at all. So we dropped Pathscale.
The Intel compilers generate optimized code for the x86_64 architecture just fine in our experience.
Of course, GCC is always there if you want to use it.
Scratch Space
Can I submit jobs from /scratch or do I have to submit them from home? I had always been submitting them from home.
Yes! This is not a problem at all and in reality you should be doing this, as the home area is hugely inefficient for this sort of work.
Can the program my scripts run be located on /scratch rather than home? I had the program in home, but if I can move it to scratch and submit from /scratch, that would be the simplest solution.
Sure thing. This will probably help in the long run, as the scratch filesystem is faster than your home area.
Should I avoid jobs reading data in my home directory, too?
Yes. You should both read and write data for your programs from the scratch area as it is much more efficient.
If I copy the scripts to my /scratch directory, what modifications to my script will need to be made?
You will have to at least put in a cd command to change directory to the right place. Otherwise the commands may not work properly on standard input.
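A minimal sketch of that cd step, assuming the job should run in the directory it was submitted from. Torque exports that directory as PBS_O_WORKDIR; the function name and the fallback to the current directory are illustrative additions, not part of any site-provided script:

```shell
#!/bin/bash
# Change to the directory the job was submitted from. Inside a running
# job, Torque sets PBS_O_WORKDIR to that directory. The fallback to the
# current directory is an assumption made here so the snippet also works
# when run by hand outside the batch system.
enter_submit_dir() {
    cd "${PBS_O_WORKDIR:-$PWD}" || return 1
}

enter_submit_dir || exit 1
echo "Running in $(pwd)"
```

With this at the top of a script submitted from a directory under /scratch, relative file names in the rest of the script resolve against that directory.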
Test Nodes
I see from your Wiki that there are machines that can be used to run short test scripts in order to reduce the chances of me hosing up anything important. How exactly is this done?
- To access the "test" nodes you first ssh to submit.hpc.ufl.edu (our primary login host). From there, you can ssh into any or all of them as you wish. Since the "test" nodes are on our private network, they are not accessible directly from outside hosts.
- On the test nodes you can do pretty much anything you wish as long as you are considerate of other users and don't monopolize the resources. These nodes are intended for you to develop, test, and debug scripts and programs as needed.
- You can do this interactively initially and when you are ready to submit through the queue, you can test your submission script on the test nodes as well since they are dedicated to the "testq" queue. In other words, get your code and submission scripts working properly using the "test" nodes and when you are sure everything works, you can submit your jobs, be they a few or many, with confidence that once they are scheduled they will run correctly.
Job Manipulation
I just submitted a bunch of jobs, and now I realize that there is a mistake in the job submission script for all of them! How do I delete all of the jobs I currently have submitted to the queue?
While there is no command to do this in Moab or Torque, we have run across this problem more than once and we now have a command that will do it for you. The command qdelmine will delete any job that you currently have in the queue, running or waiting to be run.
Infiniband versus Ethernet
I am running a parallel program on the HPC cluster using 4 processors. It works, but I got this message:
--------------------------------------------------------------------------
[0,1,0]: OpenIB on host r6b-s35.ufhpc was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: OpenIB on host r6b-s35.ufhpc was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,2]: OpenIB on host r6b-s35.ufhpc was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,3]: OpenIB on host r6b-s35.ufhpc was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
You are getting that message because your job was scheduled on Ethernet-only nodes (r6). InfiniBand is the default and preferred transport for OpenMPI. OpenMPI also works with non-IB transports, but you then get the message above. You can avoid the message in two ways.
- Request IB Nodes:
#PBS -l nodes=n:ppn=1:infiniband
- Add the following logic to your submission script:
set IbEnabled = `/usr/local/sbin/IbEnabled`
if ( $IbEnabled ) then
echo "Running on IB-enabled node set"
set MPIRUN = "mpirun --mca btl openib"
else
echo "Running on GigE-enabled node set"
set MPIRUN = "mpirun --mca btl ^udapl,openib --mca btl_tcp_if_include eth0"
endif
Use this if you use BASH scripting:
IbEnabled=`/usr/local/sbin/IbEnabled`
if [ $IbEnabled -gt 0 ]; then
echo "Running on IB-enabled node set"
MPIRUN="mpirun --mca btl openib"
else
echo "Running on GigE-enabled node set"
MPIRUN="mpirun --mca btl ^udapl,openib --mca btl_tcp_if_include eth0"
fi
Note that this only applies to MPI applications built using OpenMPI (the cluster default). This will avoid some potential problems.
Another thing to watch out for with OpenMPI and Torque is the use of the machinefile directive. This is not a good thing to do, as it will typically result in an error similar to the following:
[r5a-s11.ufhpc:30484] pls:tm: failed to poll for a spawned proc, return status = 17002
[r5a-s11.ufhpc:30484] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
[r5a-s11.ufhpc:30484] mpirun: spawn failed with errno=-11
The idea is to not use the machinefile directive, which typically will look like this: machinefile $PBS_O_WORKDIR/pbsnodes, as the purpose of this directive is already handled by OpenMPI and Torque.
Problems with Bugzilla
If you are having problems with Bugzilla, please see the Bugzilla wiki entry.
Bad interpreter
Why do I get messages like "/bin/bash: bad interpreter: No such file or directory" from my shell and/or job scripts?
For example:
-bash: /var/spool/PBS/mom_priv/jobs/<pbs_job_ib>.pbs.local.SC: /bin/bash : bad interpreter: No such file or directory
One common way this can happen is if the script was created on a Windows machine and then copied over to the HPC cluster via scp, ftp, etc. Any text file created on a Windows machine, including PBS job scripts and shell scripts, will have invisible characters (such as ^M carriage returns) in it that the shells on UNIX/Linux machines cannot interpret.
Use the dos2unix program to "scrub" the scripts you create on a Windows host. To use it, simply type the command dos2unix <filename>, where <filename> is the file to be scrubbed.
dos2unix will rewrite the file in UNIX format. See the dos2unix manpage via "man dos2unix" for more information.
Another alternative is to simply create a new file on the HPC cluster with your favorite text editor and copy/paste the contents of the script into the new file.
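To check whether a script is affected, or to clean one where dos2unix is not installed, the standard grep and tr tools are enough. The following is a minimal sketch; the file names are hypothetical and the sample script exists only to demonstrate the effect:

```shell
# Create a small script with Windows (CRLF) line endings for illustration.
printf '#!/bin/bash\r\necho hello\r\n' > crlf_example.sh

# Hold a literal carriage return in a variable so it can be used in a pattern.
cr=$(printf '\r')

# Count the lines that end in a carriage return; a nonzero count
# means the file is in DOS format.
grep -c "${cr}\$" crlf_example.sh

# Delete every carriage return character and write the result to a new file.
tr -d '\r' < crlf_example.sh > unix_example.sh
```

Unlike dos2unix, this writes a second file rather than rewriting the original in place, so the cleaned copy has to be made executable and renamed by hand.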
Walltime
What happens if I do not specify the walltime in my job submission?
A default walltime of 12 hours exists, so if you do not specify a walltime, this is the maximum walltime your job will have. It is better to include a walltime in your job if you have a good idea of how long it will take, as this helps the Maui scheduler figure out when best to schedule your job.
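A walltime is requested with the standard Torque resource directive near the top of the job script. A minimal fragment, with the two-hour value chosen only as an example:

```shell
#!/bin/bash
# Ask the scheduler for at most 2 hours of wall-clock time (HH:MM:SS).
# The two-hour figure is only an illustration; pick a value that fits your job.
#PBS -l walltime=02:00:00
```

The same resource can be given on the command line as qsub -l walltime=02:00:00 followed by the script name.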
SSH Keys
Passwordless logins
Our group is working on a project, and we need to run about 5000 simulations intermittently during the season. We are trying to automate the process, which requires us to automate the call to HPC. But as the accounts are password enabled, I was wondering how we could go about this.
One thing you can do is copy your RSA public key into the authorized_keys file in your HPC account. On Linux, you would do this as follows:
- Find your RSA public key. Typically this is located in your .ssh directory under your home directory. In the following example it is called id_rsa.pub.
jka@puppy:~/.ssh$ pwd
/home/jka/.ssh
jka@puppy:~/.ssh$ ls -l
total 48
-rw------- 1 jka jka   392 2007-01-16 15:40 cise.pub
-rw------- 1 jka jka   887 2005-12-16 14:27 id_rsa
-rw-r--r-- 1 jka jka   356 2008-06-06 10:48 id_rsa.keystore
-rw-r--r-- 1 jka jka   222 2005-12-16 14:27 id_rsa.pub
-rw-r--r-- 1 jka jka 31048 2008-06-05 16:22 known_hosts
- All you need to do is copy the contents of that file into your ~/.ssh/authorized_keys file in your HPC account. If this file does not exist, just create it with the contents of that RSA key. If it does exist, append the contents of your RSA key to the end of that file.
- By default, when your HPC account is created we copy the public RSA key that is generated for you into the authorized_keys file in order to enable jobs to work more smoothly with your account on the various nodes in the cluster.
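OpenSSH reads the public keys that are allowed to log in without a password from ~/.ssh/authorized_keys on the machine being logged into. The append step can be sketched as below; install_pubkey is a hypothetical helper written for this example, not a command provided on the cluster:

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# unless that exact line is already present.
#   $1 = path to the public key file (e.g. id_rsa.pub)
#   $2 = path to the .ssh directory on the receiving account
install_pubkey() {
    pubkey_file=$1
    ssh_dir=$2

    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"

    # -x matches whole lines, -F treats the key as a fixed string,
    # -f reads the pattern(s) from the public key file.
    if ! grep -qxFf "$pubkey_file" "$ssh_dir/authorized_keys"; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi

    # Keep the file private to its owner; sshd commonly refuses to use
    # it if other users can write to it.
    chmod 600 "$ssh_dir/authorized_keys"
}
```

After copying id_rsa.pub to the HPC account, it would be called as install_pubkey ~/id_rsa.pub ~/.ssh. Where the ssh-copy-id utility is installed on your local machine, it performs the copy and the append in one step.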
Changed Keys
Recently we changed the SSH keys on submit.hpc.ufl.edu. As a result, anyone who tries to log in to this machine via ssh with the old key still stored will see a warning similar to this:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
ab:b9:cd:21:50:df:55:f5:23:79:45:c4:e5:68:4a:6e.
Please contact your system administrator.
Add correct host key in /ufl/qtp/jlk/sa/.ssh/known_hosts to get rid of this message.
Offending key in /ufl/qtp/jlk/sa/.ssh/known_hosts:4
RSA host key for submit.hpc.ufl.edu has changed and you have requested strict checking.
Host key verification failed.
If this happens, you will have to remove the offending key from your known_hosts file and then accept the new key. In the above case, we know that the offending key is in the file /ufl/qtp/jlk/sa/.ssh/known_hosts at line 4 (because the message says so). All we have to do in this case is edit the file, go to line 4, and delete that line.
Once you have deleted the line and saved the file, you should be able to connect to the system via ssh. The first time you do this it will ask if you want to accept the new key. Do this, and it will not ask again.
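The same deletion can be done without opening an editor. The sketch below works on a throwaway file so it is safe to try; the file names and the five placeholder entries are made up for illustration:

```shell
# Build a throwaway file that stands in for ~/.ssh/known_hosts.
for i in 1 2 3 4 5; do
    echo "host$i.example ssh-rsa PLACEHOLDERKEY$i"
done > known_hosts.example

# Drop line 4, the line number reported after "Offending key in ...:4",
# and write the result to a new file so the original stays untouched.
sed '4d' known_hosts.example > known_hosts.fixed

# Show what is left.
cat known_hosts.fixed
```

On a real account, OpenSSH's ssh-keygen -R submit.hpc.ufl.edu removes the stored keys for that host from your known_hosts file in one step.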
Wine
For questions about Wine please look at the Wine wiki page.
Floating Point Precision
For questions about floating point precision and the differences that may occur between different architectures, see our page on Floating Point Precision.
