Big Bytes

Mail notifications for HPC jobs

We have introduced mail notifications within PBS to allow our users to be notified when a job begins, aborts and completes. The following additional directives will need to be set at the top of your job shell submission script.

#PBS -m bae 

#PBS -M e-mail@domain.com 

The first directive, "-m", sets the notification options: (b) job begins, (a) job aborts, (e) job ends. You can set all of the options or just one. So if you only want to be notified when jobs complete, specify just the "e" option within the directive.

#PBS -m e

The second directive, "-M", specifies the e-mail address that the notifications are sent to.
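
Putting the two directives together, a minimal submission script might look like the sketch below. The job name mail_demo and the echoed message are placeholders invented for illustration, and the address is the same dummy address as above. Because #PBS lines are ordinary comments to the shell, the script also runs outside the scheduler, which is handy for a quick check before submitting.

```shell
#!/bin/bash
#PBS -N mail_demo
#PBS -m bae
#PBS -M e-mail@domain.com

# PBS sets PBS_O_WORKDIR to the directory qsub was run from.
# Fall back to the current directory so the script also runs outside PBS.
cd "${PBS_O_WORKDIR:-$PWD}"

# Placeholder workload: replace with the real commands for your job.
job_msg="Job started on $(uname -n)"
echo "$job_msg"
```

Submit the script with qsub followed by the script's file name.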

Depending on the number of jobs being submitted, we would strongly suggest creating a separate mail folder into which your notifications can be filtered. Nobody likes a clunky inbox :-) 

Happy qsub'ing

 

Musings on Torque/PBS and OpenMPI

We've spent most of this week working on mpiBlast, and have bumped our heads against a few problems.  Fortunately we have a very patient user in computational biology who's assisted in running test jobs.  It's been a learning experience so we thought we'd jot down a few notes... 

Running OpenMPI jobs in Torque/PBS is not quite the same as running them directly from a head node.  Firstly the initial worker node that the job is launched from is considered a 'head node' from OpenMPI's perspective.  This means that when setting up key sharing in the cluster a many-to-many relationship is required between worker nodes.

Additionally, the way that PBS and mpirun are invoked is slightly different.  When dealing with OpenMPI jobs it's best to specify only the number of cores the job needs.  However, in order to do this, the nodes argument to the PBS -l parameter is considered to be a count of CPUs, not servers.

There are two other crucial elements to bear in mind.  Firstly, the machine or host file should be referenced from PBS, rather than user-created.  This is done by using the $PBS_NODEFILE variable.  Secondly, PBS should be allowed to supply the cores, rather than requesting them via mpirun's -np argument.  The number of nodes versus threads that users can consume can be controlled via the maui.cfg file.
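
The two points above can be sketched as a job script. This is a minimal example under assumptions: ./my_mpi_program is a hypothetical binary, nodes=5 is an arbitrary core count (read as CPUs, as described above), and the dry-run branch exists only so the script can be checked outside the scheduler.

```shell
#!/bin/bash
#PBS -N mpi_demo
#PBS -l nodes=5

# Hypothetical MPI binary; replace with your own program.
program=./my_mpi_program

# Note there is no -np argument: the host file written by PBS
# decides how many processes are started.
mpi_cmd="mpirun -machinefile \$PBS_NODEFILE $program"

if [ -n "$PBS_JOBID" ]; then
    # Inside a PBS job: start from the submission directory and launch.
    cd "$PBS_O_WORKDIR" || exit 1
    mpirun -machinefile "$PBS_NODEFILE" "$program"
else
    # Outside PBS (dry run): only show the command that would be used.
    echo "Would run: $mpi_cmd"
fi
```

Run outside PBS, the script just prints the mpirun command line; submitted via qsub, it launches the program on whatever cores PBS listed in the node file.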

Below is a screenshot of multiple MPI jobs seeking 5 CPUs each on any series worker node.  The starting node for the initial job was unspecified and turned out to be 300.  Nodes 300, 206, 205 and 204 have high CPU but no threads advertised as they're just winding down from 3 completed jobs, totaling 15 cores.  The 1 thread on 204 is the first of 5 spread "left" into 203.

Distributed MPI jobs

Another item to consider is heterogeneous environments.  Not all clusters are composed of identical equipment, hence allowing auto-assignment of resources in MPI jobs can produce unpredictable results.  In the image above the 300 series CPUs are taking longer to spool down than the 200 series.  To constrain where a job runs, you can use the free-form node_spec tag in the nodes file.  Remember, however, that here nodes = servers once again, so you'll need the ppn directive.

So to reserve 20 cores on the BL460 servers use the directive:

#PBS -l nodes=4:series400:ppn=5
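
As a rough sketch of how this fits together: the series400 tag is assumed to be attached to the blades in the server's nodes file (the host name in the comment is hypothetical), and the total is simply nodes times ppn. Inside a job, $PBS_NODEFILE should then hold one line per granted core, which gives a cheap sanity check.

```shell
#!/bin/bash
#PBS -l nodes=4:series400:ppn=5

# Assumed entry in the server's nodes file (host name is hypothetical):
#   node401 np=12 series400

nodes=4
ppn=5
expected_cores=$((nodes * ppn))   # 4 servers x 5 cores each
echo "Requested cores: $expected_cores"

# Inside a job, compare against what PBS actually granted.
if [ -n "$PBS_NODEFILE" ] && [ -r "$PBS_NODEFILE" ]; then
    granted_cores=$(wc -l < "$PBS_NODEFILE")
    echo "Granted cores: $granted_cores"
fi
```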

If there are any inaccuracies in the above please feel free to point them out.  Here are two articles we found useful for running OpenMPI under Torque.

Odd queuing behaviour in Grid cluster

This evening we observed some weirdness on our Grid cluster.  Users could submit jobs but they were immediately queued.  Initially it was suspected that only one worker node was affected, however we soon realized that all three worker nodes were exhibiting the same issue.  Oddly some jobs (short term test jobs submitted via EUMed) were running.

Restarting the pbs_server daemon on the head node had no effect, other than to cause all worker nodes to register a down status.  Checking the worker nodes revealed that all pbs_mom daemons were in a running but dead state.  Restarting all pbs_mom daemons allowed some jobs to be submitted, however this was only on 2 of the worker nodes.  It was then noted that there was an old SAGrid job that was still in a queued state from several days ago.  Killing this job put the queues back into a happy state.

Not sure exactly what the issue was, possibly a malformed queue submission or JDL causing a hang up in the scheduler.  Currently we are considering increasing the level of monitoring to test the state of the pbs daemons.
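
One possible starting point for such monitoring is sketched below. It is an assumption rather than what was deployed: it only checks that a process with the daemon's name exists, using pgrep. As the incident above shows, a process can be present yet unresponsive, so a real check would also need to query the daemon itself.

```shell
#!/bin/bash
# Report whether a process with the given exact name is running.
# Returns 0 if found, 1 if not found, 2 if pgrep is unavailable.
check_daemon() {
    local name="$1"
    if ! command -v pgrep >/dev/null 2>&1; then
        echo "UNKNOWN: pgrep not available, cannot check $name"
        return 2
    fi
    if pgrep -x "$name" >/dev/null 2>&1; then
        echo "OK: $name is running"
        return 0
    fi
    echo "CRITICAL: $name is not running"
    return 1
}

# pbs_server and maui run on the head node, pbs_mom on each worker node;
# check whichever daemons apply to the host this script runs on.
for daemon in pbs_server pbs_mom maui; do
    check_daemon "$daemon"
done
```

The OK / CRITICAL / UNKNOWN wording and the return codes are chosen so the function could later be wrapped by a monitoring system, but that mapping is a design choice here, not a requirement.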

New worker nodes added to ICTS HPC

Over the last week ICTS engineers have been adding 5 new BL460 blades to the cluster.  The OS install took place on Tuesday, applications on Wednesday and our first user is already running jobs on the CPUs.

New worker nodes

The 400 series have two 6 core Xeon 2.8GHz CPUs and 24GB of RAM each.  This increases our available core count from 40 to 100 cores.  We will hopefully also be extending the 200 series over the next few weeks as upgrades take place in our data center.  We have also increased the queue size accordingly to allow more jobs to run simultaneously.

The new servers also mount the /home NFS and hence can accommodate parallel MPI jobs.  However the version of mpicc on the 400 series differs from that on the 200 and 300 series, hence it is non-trivial to spread jobs across the entire cluster.  Later in the year we will embark on an OS and application streamlining project to bring all the versions up to date.

New ICTS HPC cluster

Installed and configured a new cluster, srvslnhpc001.  The head node is an HP BL20P blade with 2 dual core 3.6GHz CPUs and 8GB RAM.  The 3 worker nodes are BL20P blades with 2 dual core 3.6GHz CPUs and 4GB RAM each.  Currently it is being configured with a shared 500 GB of SAN disk space for software and data.  While small, the cluster will serve as a first step in designing a fully fledged HPC system.  Tests are currently being conducted with OpenMPI and GCC.  Further software to be deployed includes the GNU suite of C++ and Fortran 77/90, Matlab, R, Auto07p, NAMD, BLAST and FFT.  The cluster uses the latest versions of Torque and MAUI for job scheduling.

hpc001
Above is a snapshot from the dashboard, showing node 202 undergoing severe memory stress testing.

PBS POC upgrade test

Tested the upgrade path of Torque from 2.5.3 to 3.0.0. Fairly easy, except one needs to remember a few things:

  1. shut down the pbs daemons before starting.
  2. run torque.setup to sort out the symlinks.
  3. restart the maui daemon.

All seems to be working fine now.

HPC queue longevity

Increased the maximum run time of the default research queue (UCTlong) from 2700 to 5400 hours.
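
For reference, a change like this is typically made with qmgr. The sketch below only prints the command rather than running it, and it assumes the limit in question is the queue's resources_max.walltime attribute, which the post does not state.

```shell
#!/bin/bash
queue=UCTlong
hours=5400

# Torque expects walltime as HH:MM:SS.
walltime=$(printf '%d:00:00' "$hours")

# Assumption: the limit being raised is resources_max.walltime.
qmgr_cmd="qmgr -c \"set queue $queue resources_max.walltime = $walltime\""
echo "$qmgr_cmd"
```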

Installed and tested a proof of concept cluster

Installed Oracle VirtualBox on a laptop and created 3 Scientific Linux virtual machines.  Installed Torque PBS with one head node and 2 worker nodes.  Installed the worker node software on the head node to test pbs_mom, which worked fine.  Replaced the default scheduler with Maui.

Definitions

Below are some snippets to help non-specialists understand some of the terms used in this blog.  Much of this has been taken from Wikipedia:

High Performance Computing (HPC) - uses supercomputers and computer clusters to solve advanced computation problems.

Grid computing - The Grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. What distinguishes grid computing from conventional high performance computing systems such as cluster computing is that grids tend to be more loosely coupled, heterogeneous, and geographically dispersed. Although a grid can be dedicated to a specialized application, it is more common that a single grid will be used for a variety of different purposes. Grids are often constructed with the aid of general-purpose grid software libraries known as middleware.

Cluster - A computer cluster is a group of linked computers, working together closely thus in many respects forming a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.

Cloud computing - Cloud computing is location independent computing, whereby shared servers provide resources, software, and data to computers and other devices on demand, as with the electricity grid. Or more simply, remote computing. Cloud computing is a natural evolution of the widespread adoption of virtualization, service-oriented architecture and utility computing. Details are abstracted from consumers, who no longer have need for expertise in, or control over, the technology infrastructure "in the cloud" that supports them.

Compute Element (CE) - Also called a Head Node.  In the Grid paradigm its main functions are to manage job submissions and update the WMS on the status of the jobs.  In HPC it often hosts the home directories of the users, and shares these directories with the worker nodes via NFS.

Workload Management System (WMS) - Accepts and satisfies requests for job management and resources from users.  It will pass the job to an appropriate CE for execution taking into account the job requirements and the preferences expressed in the job description.

Worker Node (WN) - A host (computer) normally with a large number of powerful CPUs and a significant amount of RAM.  This is where the actual jobs are run.  The submission of jobs to the WN and the return of the results is managed by the CE.

Information Systems (TopBDII and SiteBDII) - In the grid environment each site has an information index, the SiteBDII, which publishes resources at the site to the TopBDII.  This provides a central point of resource allocation for a Grid VO.

Virtual Organization (VO) - A group of individuals or institutions who share the computing resources of a "grid" for a common goal.

PBS \ Torque \ Maui - Portable Batch System (or simply PBS) is the name of computer software that performs job scheduling. Its primary task is to allocate computational tasks, i.e., batch jobs, among the available computing resources. It is often used in conjunction with UNIX cluster environments.  Torque is a derivative of OpenPBS that is actively developed, supported and maintained by Cluster Resources Inc.  Maui is an open source job scheduler for clusters and supercomputers often used to replace the default PBS scheduler.