TorqueHowto
From UF HPC Wiki
Introduction
PBS stands for "Portable Batch System". It is a popular networked subsystem for submitting, monitoring, and controlling a workload of batch jobs on one or more systems. PBS has a long history, and is available nowadays in three flavors:
* OpenPBS - the original PBS developed for NASA in the early to mid-1990s, available as open source
* PBS Pro - a commercial version of PBS from Altair Engineering
* Torque - the open source successor to OpenPBS
OpenPBS still works, and you can use it, but you should be aware that the focus of the open source development effort has moved on to Torque for some time now. If you want to run OpenPBS and support is important to you, you may purchase it from Altair Engineering. Personally, I am not aware of a technical argument to prefer OpenPBS to Torque.
PBS Pro is a fine product. It is reasonably priced compared to competing commercial batch systems and has dedicated support from Altair Engineering should you desire it. We use it where I work, and I have no serious complaints. The most recent releases of PBS Pro (versions 7.x and later) have features that OpenPBS and Torque do not, as well as a more flexible resource specification than its open source counterparts. If I recall correctly, you should be able to get a trial version of PBS Pro before you buy - contact Altair for info. Prior to versions 7.x, you could plug in the MAUI scheduler in place of the FIFO scheduler that is included with PBS Pro. With versions 7.x and later, this no longer seems to work.
As mentioned previously, Torque is the open source PBS project which is being actively developed, and the user community has followed suit. In its current 2.x versions, it has also matured into a quality product and should be more than capable of scaling to typical Tier3 cluster sizes. In addition, you have the option of using the more flexible and open source MAUI scheduler in place of the FIFO scheduler included with Torque, should you wish it.
The above are my personal opinions. In this document, I will specifically describe deployment and configuration of Torque and its FIFO scheduler as the batch system backing an OSG computing element. The configuration steps outlined below, however, should apply nearly equally well to OpenPBS or PBS Pro.
Useful Links
It is always handy to have a few reference links at your fingertips, so I enclose a few here. As always, Google is your friend and a wealth of information.
* Cluster Resource's Torque page: http://www.clusterresources.com/pages/products/torque-resource-manager.php
* Torque Admin Manual: http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki
* Torque Quickstart Manual: http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:l_torque_quickstart_guide
* Altair's PBS Pro front page: http://www.altair.com/software/pbspro.htm
* OpenPBS versus PBS Pro - "Which PBS is for you?": http://www.openpbs.org/which_pbs.html
* OpenPBS Mini-HOWTO: http://dcwww.camp.dtu.dk/pbs.html
Obtaining Torque
Torque is distributed as source files packaged as a tarball that you unpack and build yourself. To get this tarball, point your web browser to http://clusterresources.com/downloads/torque. Select the most recent version and save it to a file (at the time of this writing, version 2.1.7 is the latest release). Copy the tarball to the node you intend to use as your OSG computing element headnode.
Building Torque
We have a couple of options in ways to build Torque: you can either build binaries as one normally does from a tarball, or you can build RPMs. In my opinion, if you are running a Red Hat/RPM-based Linux distribution, you should build the RPMs - having RPMs makes deploying Torque components on multiple machines easy, and the RPMs can likely be integrated into a cluster management tool like Rocks very trivially.
Unpack the tarball and change directory to the top level of the source tree. You may do this as a normal user, if you like. If you are *not* going to make RPMs, you will find it convenient later to do this in an NFS-shared area. If you are going to make RPMs, it doesn't matter where you do this.
tar xvzf torque-2.1.7.tar.gz
cd torque-2.1.7
You may wish to glance at the README.torque and README.configure files at this point. If you have built open source software before, you may also wish to examine the myriad of configuration options of the build by executing
./configure --help
But you don't have to if you aren't too curious. :-)
Some words about build dependencies: you will need a basic development environment installed in order to build Torque. The GNU compiler collection is just what the doctor ordered; you need the C, C++, and Fortran 77 compilers installed to build everything. Specifically, make sure you have the programs gcc, g++, and g77 installed. In addition, you should have SSH clients installed (does anyone use rsh anymore?); the configuration step prior to compilation will check to see if scp is available - if so, it will be used as the "remote copy program" that relays job stdout and stderr back to the user from the compute nodes. Torque also comes with Tcl/Tk-based GUI programs to monitor jobs and batch system status. To build these, you will need a Tcl/Tk development environment, as well.
On a Red Hat-based machine, everything will be built if you have these packages (and their associated dependencies) installed:
* make
* gcc
* gcc-c++
* gcc-g77
* gawk
* glibc-devel
* bison
* flex
* groff
* openssh-clients
* tcl
* tcl-devel
* tclx
* tclx-devel
* tk
* tk-devel
* xorg-x11-xauth
Probably you have (almost?) all this stuff installed already.
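If you would rather check than assume, a small shell loop can report which of the programs named above are missing from your PATH. This is only a sketch: the function name missing_tools is made up for illustration, and it looks for executables (gcc, g++, g77, scp and so on), not for RPM package names.

```shell
#!/bin/sh
# missing_tools: print each named program that is NOT found on PATH.
# (Hypothetical helper name; it checks executables, not RPM packages.)
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Programs the Torque build relies on, per the text above.
missing_tools make gcc g++ g77 scp
```

No output means everything was found; any name printed is a program you still need to install.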
Ok, we have all of that installed and available. Now we configure the source tree to prepare for building the package. Most folks are probably happy to accept the default installation path (you can control it with the --prefix=... option if you like) and other options, so can say:
./configure
At the UF HPC Center, for our x86_64 nodes, we use the following configure command:
./configure --prefix=/usr --with-default-server=torque.ufhpc --with-rcp=scp --disable-rpp --libdir=/usr/lib64
A bunch of test output will spew forth. Ultimately, this step generates all the Makefiles needed to build the package. Now we just say
make
or, if you are on a Red Hat-based machine, you should do the following *as root*:
make rpm
The former command will simply build binaries. The latter command will build binary RPM packages. I am on a Red Hat-based machine, so I use make rpm. On any modern machine, this will take a few minutes at most. In the end, I am left with the following files:
/usr/src/redhat/RPMS/i386/torque-2.1.7-1cri.i386.rpm
/usr/src/redhat/RPMS/i386/torque-debuginfo-2.1.7-1cri.i386.rpm
/usr/src/redhat/RPMS/i386/torque-docs-2.1.7-1cri.i386.rpm
/usr/src/redhat/RPMS/i386/torque-scheduler-2.1.7-1cri.i386.rpm
/usr/src/redhat/RPMS/i386/torque-server-2.1.7-1cri.i386.rpm
/usr/src/redhat/RPMS/i386/torque-mom-2.1.7-1cri.i386.rpm
/usr/src/redhat/RPMS/i386/torque-client-2.1.7-1cri.i386.rpm
/usr/src/redhat/RPMS/i386/torque-gui-2.1.7-1cri.i386.rpm
/usr/src/redhat/RPMS/i386/torque-localhost-2.1.7-1cri.i386.rpm
/usr/src/redhat/RPMS/i386/torque-devel-2.1.7-1cri.i386.rpm
/usr/src/redhat/RPMS/i386/torque-pam-2.1.7-1cri.i386.rpm
Installing Torque on the Head Node
Now we are ready to deploy some packages on our head node. If you didn't create RPMs in the steps above, you need to *become root* on the head node and type make install from the top level of the Torque source tree. Everything will be installed into /usr/local unless you specified an alternate installation path at configure time.
On the other hand, if you have RPMs, read on. On our head node, we at least want to install the Torque server and scheduler, and have commands available to submit, monitor, and remove jobs. If you want your head node to be able to run batch jobs, you will also need to install the "MOM" package here. You may also like to install the documentation and GUI tools. To do all this, *become root*, go to the directory where the RPMs reside, and type:
rpm -Uvh torque-server-2.1.7-1cri.i386.rpm \
torque-scheduler-2.1.7-1cri.i386.rpm \
torque-mom-2.1.7-1cri.i386.rpm \
torque-client-2.1.7-1cri.i386.rpm \
torque-docs-2.1.7-1cri.i386.rpm \
torque-gui-2.1.7-1cri.i386.rpm \
torque-2.1.7-1cri.i386.rpm
Installing Torque on the Compute Nodes
On the compute nodes, we need to be able to run jobs. As an administrator, I also find it occasionally convenient to query the batch system from compute nodes, so installing client tools would be handy, too.
If you *don't* have RPMs, *become root* on one of your compute nodes, go to the top level directory of your NFS-shared Torque source tree, and type make install. Repeat this step for every compute node you want to add to the batch system.
If you *do* have RPMs, go to the directory where they reside (copy them over if you need to), and say:
rpm -Uvh torque-mom-2.1.7-1cri.i386.rpm \
torque-client-2.1.7-1cri.i386.rpm \
torque-2.1.7-1cri.i386.rpm
Configuring SSH for PBS
The above Torque packages will use ssh for process transport and scp for copying of output files. Torque must be able to use these tools with password-less authentication between the head node and the compute nodes. There are two means of accomplishing this with OpenSSH - hostbased authentication, and user public key authentication. While using individual user public keys certainly works, it is a pain to manage in my opinion, and error prone for any local users. So in this section, I will describe how to set up hostbased authentication for OpenSSH in order to satisfy PBS's requirements.
Let's start on the head node. First, create the /etc/hosts.equiv file and populate it with fully qualified domain names corresponding to your head node's LAN interface and all your compute nodes:
headnode.internal.domain
computenode1.internal.domain
computenode2.internal.domain
...
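For more than a handful of nodes, the file is easier to generate than to type. The following is a minimal sketch that assumes your compute nodes are numbered sequentially, as in the example above; the function name and the computenode/internal.domain naming scheme are illustrative only, so substitute your own.

```shell
#!/bin/sh
# gen_hosts_equiv HEADNODE PREFIX DOMAIN COUNT
# Print one fully qualified host name per line: the head node first,
# then PREFIX1.DOMAIN through PREFIXCOUNT.DOMAIN. (Hypothetical helper.)
gen_hosts_equiv() {
    headnode=$1 prefix=$2 domain=$3 count=$4
    echo "$headnode.$domain"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "$prefix$i.$domain"
        i=$((i + 1))
    done
}

# Preview the result; redirect it to /etc/hosts.equiv (as root) once it looks right.
gen_hosts_equiv headnode computenode internal.domain 3
```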
In the /etc/ssh/ssh_config file, make sure the following are on so ssh clients can try hostbased authentications:
HostbasedAuthentication yes
EnableSSHKeysign yes
In the /etc/ssh/sshd_config, we want the server to permit hostbased authentication attempts (only from hosts in our /etc/hosts.equiv):
HostbasedAuthentication yes
IgnoreRhosts yes
IgnoreUserKnownHosts yes
Hostbased authentications will succeed for hosts whose public keys are in the /etc/ssh/ssh_known_hosts file. So you will have to collect the keys for every host you intend to put in the batch system. One way to do that is to use the hosts.equiv file you just populated:
ssh-keyscan -t rsa,rsa1,dsa -f /etc/hosts.equiv > /etc/ssh/ssh_known_hosts
Now restart sshd with /etc/init.d/sshd restart. Copy the /etc/ssh/ssh_config, /etc/ssh/sshd_config, /etc/ssh/ssh_known_hosts, and /etc/hosts.equiv files to each of your compute nodes and restart sshd there, too. You should now be able to authenticate as a normal user between hosts across your cluster without having to enter a password.
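A quick way to confirm this is to attempt a non-interactive login to every host listed in /etc/hosts.equiv. The sketch below is not part of Torque: check_hostbased and the SSH_CMD override (there only so the loop can be dry-run) are names invented here. It relies on the standard OpenSSH client options BatchMode and ConnectTimeout, and it assumes each line of the file begins with a plain host name.

```shell
#!/bin/sh
# check_hostbased FILE: try a password-less login to each host listed in FILE.
# SSH_CMD is a hypothetical override used for dry runs; it defaults to ssh.
check_hostbased() {
    failed=0
    while read -r host rest; do
        # skip blank lines and comments
        case $host in ''|'#'*) continue ;; esac
        # -n keeps ssh from consuming the host list on stdin;
        # BatchMode=yes makes ssh fail instead of prompting for a password
        if ${SSH_CMD:-ssh} -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done < "$1"
    return "$failed"
}

# Usage (as a normal user on the head node):
#   check_hostbased /etc/hosts.equiv
```

Any host reported as FAIL either prompted for a password or could not be reached within the timeout.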
Configuring the Head Node
Here, we will initialize the server and create some queues. You will need to do these steps *as root*. My recommendation is to create a routing queue which can submit jobs to one or more execution queues, where the jobs will actually run. The presence of the routing queue will allow some flexibility in the case you wish to have multiple execution queues. In the "usual" case, where a dual-homed head node has a public interface exposed to the WAN and a private interface dedicated for LAN traffic, it makes sense to configure Torque to use the LAN interface. I will assume this is the case.
There are three PBS services currently deployed and enabled on your head node: pbs_server, pbs_sched, and pbs_mom. pbs_server runs the show, so to speak - it instantiates the batch system on your cluster. pbs_sched is the scheduler - it makes the decisions of which jobs to run. pbs_mom (MOM: Machine Oriented Mini-server) actually runs the jobs.
You will need to run the server and scheduler services. In my experience, MOM is not usually run on the head node, as head nodes usually have plenty to do already. They typically offer lots of key cluster infrastructure services like NFS, DNS, LDAP/NIS, etcetera; allowing compute and memory-intensive jobs to run on the head node can have adverse effects upon the entire cluster. So feel free to simply turn MOM off on the head node:
chkconfig pbs_mom off
If you wish to allow your head node to run jobs, you should make sure the /var/spool/torque/server_name file contains the hostname corresponding to the LAN address of your head node.
The Torque configuration files live underneath /var/spool/torque by default. On the head node, you will see the following files and subdirectories underneath /var/spool/torque:
aux/         mom_logs/  pbs_environment  sched_priv/   server_name   spool/
checkpoint/  mom_priv/  sched_logs/      server_logs/  server_priv/  undelivered/
In the configuration steps which follow, we will be poking around this area on the head and compute nodes.
PBS Server
We can initialize the batch system thusly:
pbs_server -t create [-h <hostname>] [-S <hostname>] [-M <hostname>]
The -h <hostname>, -M <hostname> and -S <hostname> options are only useful if the hostname corresponding to your head node's LAN address is *not* the same as the output of the hostname command; these options tell the server to use the supplied <hostname> for server, scheduler, and MOM communications, respectively. In this case, you will also want to create /etc/sysconfig/pbs_server with the following contents:
PBS_DAEMON="/usr/local/sbin/pbs_server -h <hostname> -S <hostname> -M <hostname>"
where <hostname> is the same hostname you used when you created the server instance.
We can now configure the server. This is done with the qmgr command. If qmgr is executed without any options, it will put you in an interactive shell from which you can just type in PBS commands. But you can also feed commands to qmgr with the -c option. Let's turn on scheduling, create a routing queue and an execution queue, and take care of some defaults:
qmgr -c 'set server scheduling=true'
qmgr -c 'create queue defaultq'
qmgr -c 'set queue defaultq queue_type = route'
qmgr -c 'create queue batchq'
qmgr -c 'set queue batchq queue_type = execution'
qmgr -c 'set queue defaultq started = true'
qmgr -c 'set queue defaultq route_destinations = batchq'
qmgr -c 'set queue defaultq enabled = true'
qmgr -c 'set server default_queue = defaultq'
qmgr -c 'set queue batchq started = true'
qmgr -c 'set queue batchq enabled = true'
qmgr -c 'set server resources_default.ncpus = 1'
qmgr -c 'set server resources_default.nodes = 1'
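Because qmgr also reads directives from standard input, the same setup can be produced by a function and piped in, which is handy if you want to review or reuse the configuration. A sketch, with queue_config being a made-up helper name; the directives themselves are the ones shown above, with the queue names turned into parameters.

```shell
#!/bin/sh
# queue_config ROUTEQ EXECQ: print qmgr directives for a routing queue
# that feeds a single execution queue. (Hypothetical helper name.)
queue_config() {
    routeq=$1 execq=$2
    cat <<EOF
set server scheduling = true
create queue $routeq
set queue $routeq queue_type = route
create queue $execq
set queue $execq queue_type = execution
set queue $routeq route_destinations = $execq
set queue $routeq enabled = true
set queue $routeq started = true
set queue $execq enabled = true
set queue $execq started = true
set server default_queue = $routeq
set server resources_default.ncpus = 1
set server resources_default.nodes = 1
EOF
}

# Review the directives first; when satisfied, apply them as root with:
#   queue_config defaultq batchq | qmgr
queue_config defaultq batchq
```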
Once this is done, you can restart the pbs_server from the init.d script:
/etc/init.d/pbs_server restart
PBS Scheduler
The scheduler configuration file can be found at /var/spool/torque/sched_priv/sched_config. It is documented. There is no need to change any of the values at this time, though. If you ever do change them, be sure to restart the scheduler.
If you wish to bind the scheduler to an interface that resolves to a hostname other than the output of the hostname command, as you may have done for pbs_server above, create the file /etc/sysconfig/pbs_sched with the following contents:
PBS_DAEMON="/bin/sh -c 'h=`hostname`; hostname <hostname> ; /usr/local/sbin/pbs_sched; hostname \$h'"
where <hostname> is the same hostname you used for the pbs_server setup. This is a dirty trick. But the fact is that the built-in Torque scheduler will only listen on the interface that corresponds to the output of gethostname, so the gloves may have to come off.
Start the scheduler *as root* with /etc/init.d/pbs_sched start. The scheduler logs will go to /var/spool/torque/sched_logs/<yyyymmdd>; each day will have its own scheduler log.
Configuring the Compute Nodes
As mentioned above, /var/spool/torque/server_name on each compute node needs to contain the correct value for the PBS server hostname. Additional MOM configuration directives can be entered into /var/spool/torque/mom_priv/config. For now, we will not put anything into the MOM config file, but we do need to create it or the MOM startup will complain. *As root*, do the following:
touch /var/spool/torque/mom_priv/config
/etc/init.d/pbs_mom start
The MOM logs will go to /var/spool/torque/mom_logs/<yyyymmdd> (one log file per day).
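Because the log files are named by date, today's file can be located with the date command. A small sketch, assuming the default /var/spool/torque location; todays_log is a name invented for this example.

```shell
#!/bin/sh
# todays_log DIR: print the path of today's log file under DIR,
# using the <yyyymmdd> naming described above. (Hypothetical helper name.)
todays_log() {
    echo "$1/$(date +%Y%m%d)"
}

# Usage (as root on a compute node):
#   tail -f "$(todays_log /var/spool/torque/mom_logs)"
todays_log /var/spool/torque/mom_logs
```

The same helper works for the scheduler and server logs by passing the sched_logs or server_logs directory instead.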
Adding Compute Nodes to PBS
Now we are ready to add compute nodes to the batch system. Go back to the head node, and *as root* add compute nodes via qmgr:
qmgr -c 'create node <fqdn> np=<ncpus>'
where <fqdn> is the fully qualified domain name of your compute node and <ncpus> is the number of processors it has. Do this for each compute node you want to add to PBS.
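With more than a few nodes, a loop saves typing. The sketch below reads lines of the form "<fqdn> <ncpus>" from a file and prints the matching qmgr commands so that you can inspect them before running anything. The node-list file format and the helper name are assumptions made for this example.

```shell
#!/bin/sh
# node_commands FILE: for each "<fqdn> <ncpus>" line in FILE, print the
# qmgr command that would add that node. (Hypothetical helper; dry run only.)
node_commands() {
    while read -r fqdn ncpus; do
        # skip blank lines and comments
        case $fqdn in ''|'#'*) continue ;; esac
        printf "qmgr -c 'create node %s np=%s'\n" "$fqdn" "$ncpus"
    done < "$1"
}

# Usage: inspect the output, then pipe it to sh as root on the head node:
#   node_commands nodes.txt
#   node_commands nodes.txt | sh
```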
Testing the Installation
Query the Server and Queue Configuration
The following command will dump the server and queue configuration to stdout:
qmgr -c 'print server'
You should see something similar to:
#
# Create queues and set their attributes.
#
#
# Create and define queue defaultq
#
create queue defaultq
set queue defaultq queue_type = Route
set queue defaultq route_destinations = batchq
set queue defaultq enabled = True
set queue defaultq started = True
#
# Create and define queue batchq
#
create queue batchq
set queue batchq queue_type = Execution
set queue batchq enabled = True
set queue batchq started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = defaultq
set server log_events = 511
set server mail_from = adm
set server resources_default.ncpus = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.1.7
Querying Node Status
You can query node status with the pbsnodes command. pbsnodes -a will show node status for all compute nodes in the batch system. You should see something like the following for each node in your batch system:
invigo.local
state = free
np = 4
ntype = cluster
status = opsys=linux,uname=Linux invigo.local 2.6.9-34.ELsmp #1 SMP \
Wed Mar 8 00:27:03 CST 2006 i686,sessions=? 0,nsessions=? 0, \
nusers=0,idletime=535854,totmem=5223572kb,availmem=5036844kb, \
physmem=2074840kb,ncpus=4,loadave=0.00,netload=3206850556, \
state=free,jobs=? 0,rectime=1172691859
The state *free* means that the machine has capacity to run jobs, and *np* signifies the number of processors offered.
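On a larger cluster the full listing gets long, and a short awk filter can condense it to one line per node. This sketch assumes that, in real pbsnodes output, each node name starts in the first column and its attribute lines are indented beneath it; summarize_nodes is a name invented here.

```shell
#!/bin/sh
# summarize_nodes: read "pbsnodes -a" output on stdin and print
# "<node> <state> np=<np>" for each node. (Hypothetical helper name.)
summarize_nodes() {
    awk '
        function emit() { if (node != "") printf "%s %s np=%s\n", node, state, np }
        # a line starting in column one names a new node
        /^[A-Za-z0-9]/ { emit(); node = $1; state = "?"; np = "?"; next }
        # indented "state = ..." and "np = ..." lines belong to the current node
        $1 == "state" && $2 == "=" { state = $3 }
        $1 == "np"    && $2 == "=" { np = $3 }
        END { emit() }
    '
}

# Usage:
#   pbsnodes -a | summarize_nodes
```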
Creating and Submitting a Job
A PBS job is just a script - typically, a shell script. In each job script, you can tell things to PBS with special shell comments. As a *normal user*, fire up your favorite editor and let's create a job script with the following contents:
#! /bin/sh
#PBS -N testjob
#PBS -o testjob.out
#PBS -e testjob.err
#PBS -M <your_email_here>
#PBS -l walltime=00:01:00
date
hostname
sleep 20
date
I saved the above into a file called testjob.job. As you can see, the job just runs the date command, the hostname command, sleeps for 20 seconds, and runs the date command again. The funny #PBS comments at the top of the script are directives for PBS. #PBS -N sets the job name. #PBS -o sets the filename for the job's stdout, and #PBS -e sets the filename for its stderr. The #PBS -M directive sets the email address to use for job summary reports. Finally, #PBS -l walltime=00:01:00 is a resource request that asks PBS for one minute of walltime for the job. There are many types of resource requests you can make; PBS will kill the job if it does not complete before a requested resource, such as walltime, is exhausted.
You use the qsub command to submit the job to the batch system; just give qsub the name of your job script, like so:
qsub testjob.job
The PBS id of the job will return to you on standard output.
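In scripts it is often useful to capture that ID, since the numeric part on its own is enough for follow-up commands such as qstat -f. A minimal sketch; job_number is a hypothetical helper, and the sample ID is the one used in this walkthrough.

```shell
#!/bin/sh
# job_number ID: strip the server suffix from a PBS job ID,
# e.g. "0.osg.local" becomes "0". (Hypothetical helper name.)
job_number() {
    echo "${1%%.*}"
}

# Usage with a real submission:
#   jobid=$(qsub testjob.job)
#   qstat -f "$(job_number "$jobid")"
job_number 0.osg.local
```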
Check out the Job Submission Queues and Sample Scripts pages for more information.
Querying Job and Queue Status
The qstat program is used to query jobs and queue status. If we just type qstat, we will see the job we just submitted (submit it again if it is already finished):
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
0.osg               testjob          prescott               0 R batchq
We see its full job ID, job name, user who submitted the job, how much time it has used, the state of the job (R means "running") and what queue it is running in. In our setup, the job was routed by the routing queue into the "batchq" execution queue, where it currently runs.
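When the queue is busy, a count of jobs per state is often more useful than the raw listing. The sketch below tallies the S column of default qstat output; it assumes the two header lines and the column order shown above, and jobs_by_state is a made-up name.

```shell
#!/bin/sh
# jobs_by_state: read default "qstat" output on stdin and print
# "<state> <count>" per job state. (Hypothetical helper name.)
jobs_by_state() {
    # skip the two header lines; the state is the next-to-last column
    awk 'NR > 2 && NF >= 6 { count[$(NF - 1)]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Usage:
#   qstat | jobs_by_state
```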
We can get oodles of information about a job with qstat -f <job_id>. In our case, qstat -f 0 yields:
Job Id: 0.osg.local
Job_Name = testjob
Job_Owner = prescott@osg.local
job_state = R
queue = batchq
server = osg.local
Checkpoint = u
ctime = Wed Feb 28 20:37:21 2007
Error_Path = osg.hpc.ufl.edu:/home/prescott/testjob.err
exec_host = invigo.local/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
Mail_Users = prescott@hpc.ufl.edu
mtime = Wed Feb 28 20:37:21 2007
Output_Path = osg.hpc.ufl.edu:/home/prescott/testjob.out
Priority = 0
qtime = Wed Feb 28 20:37:21 2007
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 00:01:00
session_id = 1325
Variable_List = PBS_O_HOME=/home/prescott,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=prescott,
PBS_O_PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/
bin,PBS_O_MAIL=/var/spool/mail/prescott,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=osg.hpc.ufl.edu,PBS_O_WORKDIR=/home/prescott,
PBS_O_QUEUE=defaultq
comment = Job started on Wed Feb 28 at 20:37
etime = Wed Feb 28 20:37:21 2007
Job Output and Summary
Our example job executed a few simple programs that wrote to stdout. In our job script, we specified the filename that contains our job's standard output as testjob.out. We can cat testjob.out to see what our job did:
Wed Feb 28 20:37:21 EST 2007
invigo.local
Wed Feb 28 20:37:41 EST 2007
We also can see that there was no standard error output from the job; our standard error output file has zero length. Finally, we asked PBS to send us an email with a job summary when our job completed - if you set up mail delivery, it should be waiting for you.
Job Post-Mortem and Accounting
Torque's accounting logs are located in /var/spool/torque/server_priv/accounting; there is one file per day, named by date just like the PBS server, scheduler, and MOM logs. This is the definitive source of accounting info from PBS. You can use the pbsaccounting package from http://pbsaccounting.sourceforge.net/ to generate reports and graphs from these logs. There are other packages you can find to do this, as well, and it is not even too difficult to write your own accounting log processor to generate custom reports.
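As an illustration of how little is needed, the sketch below totals walltime per user from the job-end records. It assumes the usual accounting layout of semicolon-separated fields (timestamp; record type; job ID; space-separated key=value pairs), with E marking a finished job; check a line of your own logs before relying on it. The function name is made up.

```shell
#!/bin/sh
# walltime_by_user: read accounting log lines on stdin and print
# "<user> <seconds>" of walltime summed over job-end (E) records.
# (Hypothetical helper; assumes "date time;type;jobid;key=value ..." lines.)
walltime_by_user() {
    awk -F';' '
        $2 == "E" {
            user = ""; secs = 0
            n = split($4, kv, " ")
            for (i = 1; i <= n; i++) {
                val = substr(kv[i], index(kv[i], "=") + 1)
                if (kv[i] ~ /^user=/) user = val
                if (kv[i] ~ /^resources_used\.walltime=/) {
                    split(val, t, ":")            # HH:MM:SS
                    secs = t[1] * 3600 + t[2] * 60 + t[3]
                }
            }
            if (user != "") total[user] += secs
        }
        END { for (u in total) printf "%s %d\n", u, total[u] }
    ' | sort
}

# Usage (as root on the head node):
#   cat /var/spool/torque/server_priv/accounting/* | walltime_by_user
```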
Sometimes, you may want to query what PBS did with an already completed job ID - you can use the tracejob command. It will pick through the PBS accounting logs and give you a timeline for the job in question. For example, tracejob 0 (0 was the job ID for our test job) yields:
Job: 0.osg.local
02/28/2007 17:31:43 A queue=defaultq
02/28/2007 17:31:43 A queue=batchq
02/28/2007 17:39:12 A requestor=root@osg.local
02/28/2007 20:37:21 S enqueuing into defaultq, state 1 hop 1
02/28/2007 20:37:21 S dequeuing from defaultq, state QUEUED
02/28/2007 20:37:21 S enqueuing into batchq, state 1 hop 1
02/28/2007 20:37:21 S Job Queued at request of prescott@osg.local, owner =
prescott@osg.local, job name = testjob, queue =
batchq
02/28/2007 20:37:21 S Job Modified at request of Scheduler@osg.local
02/28/2007 20:37:21 L Job Run
02/28/2007 20:37:21 S Job Run at request of Scheduler@osg.local
02/28/2007 20:37:21 A queue=defaultq
02/28/2007 20:37:21 A queue=batchq
02/28/2007 20:37:21 A user=prescott group=hpcadmin jobname=testjob
queue=batchq ctime=1172713041 qtime=1172713041
etime=1172713041 start=1172713041
exec_host=invigo.local/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1
02/28/2007 20:37:41 S Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=3304kb resources_used.vmem=16848kb
resources_used.walltime=00:00:20
02/28/2007 20:37:41 S dequeuing from batchq, state COMPLETE
02/28/2007 20:37:41 A user=prescott group=hpcadmin jobname=testjob
queue=batchq ctime=1172713041 qtime=1172713041
etime=1172713041 start=1172713041
exec_host=invigo.local/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 session=1325 end=1172713061
Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=3304kb resources_used.vmem=16848kb
resources_used.walltime=00:00:20
Final Words
Hopefully this note will help you get your Torque batch system up and running, and give you a bit of familiarity with the typical procedures and tools available. If you have further questions, I highly recommend looking at the Admin Manual and the numerous man pages included with the Torque packages, and consulting the torqueusers mailing list (archives at http://www.supercluster.org/pipermail/torqueusers/). Torque is highly configurable; in this short tutorial, we have only done enough to get you started. While what we've done so far may be perfectly adequate for many environments, you should be aware that configuration options exist to add user and group ACLs, resource attributes handy for heterogeneous environments, optimizations for job output relay, multiple execution queues with their own scheduling priorities, considerations for running parallel jobs, dropping in of powerful third party schedulers such as Maui, etcetera. Good luck!
