Big Bytes

MPI standardization

Our cluster is a heterogeneous mixture of 3 blade types and 2 operating systems, the latter being Scientific Linux 5.4 and 5.5.  Unfortunately these two OS versions ship with slightly different versions of openmpi.  To allow jobs to span all blade architectures we have bypassed SL's install of openmpi and upgraded it manually to 1.4-4.el5.

Pros: Users can now create a hostfile referencing all hosts in the 200, 300 and 400 series.  This will allow jobs to span up to 96 cores.  Run mpi-selector-menu from the CLI to select the installed version of openmpi.

Cons: Memory size and CPU speeds differ between series which will cause a discrepancy in the completion times depending on the critical resource (RAM or MHz) defined by a job's algorithm.  Until all threads are finished all nodes will be marked as in use.

 

Parallel poison solution

Having installed Gromacs last week we decided to run some tests to see how well it performed on our cluster.  We downloaded a tutorial which analysed solutions of funnel web spider venom in water.  Given the bounding block size of the virtualized water space (1 angstrom around the protein) Gromacs refused to go beyond 4 simultaneous cores as it was not possible to break this job into more than four meaningful problem sets.

It quickly became clear that large problem sets would benefit from high speed networking.  The communication requirement between servers over a public network is fairly high, topping out at around 5% of a GB connection.  In an environment where a large number of jobs (100+) are competing for bandwidth this would introduce significant latency.

Gromacs NIC and CPU load

Note that the load for the parallel openMPI jobs (the first three jobs) is lower than for the OMP jobs.  This is because the load for OMP is kept on one server only, while openMPI distributes the load over as many servers as the job parameters allow.  The trade off however is that the network communication between the distributed processes is high.  Despite this the completion times between OpenMPI and OMP were fairly similar:

GMtimes

We were expecting the non-openMPI time for one core to exceed the completion time for the openMPI job "distributed" over 1 core and were surprised to see that the times were almost identical.  After some head scratching we realized that this was because the non-openMPI job was still a parallel job, using OMP, as the software had been compiled with this option enabled.  In fact by not specifying the -nt parameter the job automatically grabs as many free cores as possible.

Gromacs data files can be exported to PDB format.  The image below shows the final solution state after 20 picoseconds and 25000 iterations.  The red and white "sticks" show the water molecules (2 white hydrogen atoms and 1 red oxygen atom) and immersed in this is the peptide toxin of the funnel web spider, omega-Aga-IVB.

spider venom peptide

6 million monkeys

In order to test the parallel version of MrBayes we ran several simulations of the primates.nex tutorial provided with the installation.  This compares the evolutionary species change and divergence between various primates.

Primates

Running the parallel version produced predictable results.  "Distributing" the problem over only 1 core slowed down the computation slightly compared with the serial computation on one processor, due to the MPI overhead.  However once more cores were brought in the computation time decreased markedly until the limit of work that could be distributed was reached.  This occurred at 5 cores and was in line with the documented observation that the number of cores should not exceed the number of chains.  Providing more cores did not assist in computation and merely led to more overhead in setting up MPI.

MrBayes MPI

The efficiency of the cluster is related to the problem being analysed as can be seen from the documentation:  "In our experience, heating is essential for problems with more than about 50 taxa, whereas smaller problems often can be analysed successfully without heating.  The time complexity of the analysis is directly proportional to the number of chains used (unless MrBayes runs out of physical RAM memory, in which case the analysis will suddenly become much slower), but the cold and heated chains can be distributed among processors in a cluster of computers using the MPI version of the program greatly speeding up the calculations."

Hence the more data (groups of organisms) one analyses the more cores one can use in order to speed up the computation.

Introducing Mr Bayes

One of our students requested that we install MrBayes, software for the estimation of phylogeny on a genetic level.  This is a fairly popular package which has been ported to numerous platforms so we thought we'd have a crack at installing it.  Building the package on Scientific Linux was fairly easy with a few caveats.  The base install is insufficient and requires readline-devel and ncurses-devel.  Once these were installed the package compiled fine.  Copying the package to a shared NFS software area is sufficient for all worker nodes to be able to run it.  We compiled it with MPI options and it seems to respond happily to mpirun instantiations.

Getting Gromacs

Gromacs is used to perform molecular dynamics simulations.  It's a popular tool and we decided to install it on our cluster based on the needs of two researchers, one at UCT and the other in Kenya.  Gromacs is designed to work on many platforms from desktops to high end clustered systems.  It is optimized to run in multi threaded systems as well as over multiple nodes via MPI.

There are a number of different ways to install the software.  As we are running Scientific Linux we decided to install via the EPEL repository rather than download an RPM or compile from scratch.  While the instructions on the Gromacs site claim that installation is simple, getting the package to interact with FFT and to compile with double precision support, multi-threading and MPI capability is not trivial.  Gromacs' simple installation instructions can be found here and we decided that, while they make for an interesting read, we'd rather not grapple with them.  So a quick "yum install gromacs" was all that was required.  Right?

No, not quite so simple.  The EPEL repo has a separate install for MPI, so another "yum install gromacs-openmpi" was required.  Then it was a matter of discovering that under the new EPEL compile of Gromacs 4.5.3 the binaries have all been renamed and now start with "g_" so mdrun becomes g_mdrun. However once that hurdle was overcome running the software was fairly simple. Right?

Well, not quite.  While the program ran (and here one of our researchers at UCT was very helpful in providing some test cases), the software seemed to perform a bit, well, greedily.  Running on a single worker node, or even multiple worker nodes, it grabbed as many cores as were available, which is not great for the scheduling software as it loses track of how many cores are free for other applications.  Some research indicated that mdrun no longer needed its own -np argument as Gromacs version 4 and above makes use of mpirun's -np argument.  However it was only after we introduced the -nt argument for mdrun that we finally regained control over the number of cores used.  Problem solved.  Right?

Not really.  After examining the output our UCT user noticed that multiple output files were being created, almost as if independent instances of the software were being run on each node.  After some investigation it turned out that this was exactly what was happening, and then it dawned on us that this was why the software had not behaved properly in the first instance and required the -nt argument.  The problem was that we were running a non MPI aware compilation that had been configured only for multi-threading.  After doing some searching through the file system we found what we were looking for, g_mdrun_openmpi, which strangely wasn't on the search path, which was why we'd missed it initially.  Running this with mpirun would surely solve the problem.  Right?

Well, almost.  Turns out that EPEL's RPM was compiled without shared library support.  Trying to run the software even directly on the worker node gave the following nasty message: Error while loading shared libraries: libgmxpreprocess_openmpi.so.6: cannot open shared object file.  Manually adding the library directory to the LD_LIBRARY_PATH environment variable fixed the problem, which should allow one to submit jobs to the scheduler.  Right?

Well technically yes.  However the jobs still failed because of the shared library problem.  Exporting the environment variable in the shell script or even placing it in the profile didn't seem to work.  We're not quite sure why but it may have something to do with how mpirun connects to the worker nodes.  Feeling a bit desperate now we pulled a sneaky trick and used mpirun's environment argument -x to set the LD_LIBRARY_PATH variable.

Finally success.  Jobs start on multiple cores, ignore the -nt argument, and the number of threads started equals the -np argument, correctly spread over the available cores in the cluster.  There is only one log file, one topology file and one energy file, as expected.  The log file indicates that there are a number of nodes being used equivalent to the -np argument.  Running on more nodes also seems to introduce a significant speed improvement.  The only fly in the ointment seems to be the visualization files with two of them being updated simultaneously.  These are meant to be used with the Grace visualization package, are not compulsory and can even be disabled.

Running a job now shows how rapidly the threads are distributed over the cores in the following image, snapshots taken 10 seconds apart.

gromacs jobs

More investigation is still to be done in looking at the efficiency of multiple threads.  For instance, is the relationship between threads and job speed inversely geometric as expected, and if so what is the minimum number of cores that will achieve at least a 90% speed improvement?  The communication latency between nodes will also be examined as multiple cores seem to consume considerable bandwidth.

A stressful day

Patched kernels on HPC servers to 2.6.18-238.1.1.el5.  All went fine except for the head node, which has an issue with the latest kernel (dies at boot with a kernel panic), so we are booting it into the older version 2.6.18-194.1.1.el5 until we can sort this out.

Fixed an issue with openmpi on grid cluster - usual problem with openmpi.conf looking for Infiniband

Investigated unexpected computation speeds for differing chips with MPI.  Slower CPU cores were completing jobs in less time than faster CPUs.  However this seems to boil down to the nature of the jobs (large array computations), the size of the CPU cache and the communication latency between parallel processes.

Some interesting links to Intel CPU types:
List_of_Intel_microprocessors
Comparison_of_Intel_processors
Intel_Cores

 

Testing the new cluster with MPI

Two new servers have been added to the cluster bringing the core count to 20.

Below is a set of simple tests carried out with a C program compiled with mpicc which runs a large number of floating point computations on a matrix.  The speed improvement over greater numbers of cores is gratifying, but not linear due to thread\RAM communication latency as well as the fact that a mixed architecture is being used.  Node 300 has 1.6 GHz cores as opposed to the 3.6 GHz cores of the 200 series.

MPI graph

Another disadvantage of having a mixed environment is that cores that complete early are still assigned to a job via the scheduling system and are not released until the entire job completes, making them temporarily unavailable to other users.  The solution will be to partition the 200, 300 and 400 series of worker nodes.

MPI CPUs

A potential problem, caused by misunderstanding the scheduling system, can also be seen.  Here the submitted job has reserved 4 cores on each node yet it only runs 4 threads in total.  The PBS reservation directive should closely match the mpirun arguments and machine file parameters.  Here mpirun has been called with the argument "-np 4", while the PBS directive "-l nodes=srvslnhpc200:ppn=4+ srvslnhpc201:ppn=4+ srvslnhpc400:ppn=4" reserves a total of 12 cores.  This is wasteful as it blocks other users from submitting jobs to idle cores.  In the above example nodes 201 and 202 were not considered as they were running a separate job.

First Light

OpenMPI is now configured on the new cluster.  There was an issue with the installation, in that the package was pre-configured to expect Infiniband which we do not have (yet).  However after several hours spent battling with it we found the configuration parameter to bypass this requirement and MPI jobs are now running.

Our newest HPC user is currently submitting jobs on the cluster using MPI compiled C code.  Things seem to be running smoothly and we'll continue to monitor the job progress over the weekend.  While we've been running live user jobs for the last two months this is actually a major step for us as it represents a maturation in our ability to provision an independent cluster from the ground up, with user and software support in under 48 hours.

We are also anticipating increasing the CPU count by 8 early next week with the addition of two extra servers.  We will use the new kit in a proof of concept arrangement to test partitioning the cluster to segregate resources for specific user groups.

OpenMPI

Installed openmpi 1.4-4 as well as the MPI development suite on the test cluster.

Configured host based authentication to obviate the ssh password authentication.

Compiled a test case program.  Seeing a 50% speed improvement when splitting the workload over 3 worker nodes (all communication is over simulated Ethernet in a virtual environment).

In our current model the executable does not need to be copied to the worker nodes as it appears in the NFS mounted file system.  The user only needs to execute mpirun with the correct parameters.

Work to do:

 - test with torque\pbs and qsub directives in order to reserve worker nodes and CPUs appropriately.

 - develop and practice data sharing methodologies.

Multi threaded programming and SAGrid CRL error

Investigated mixed cluster options with multiple queues for Torque.

Brief test of OMP - successful.

Started investigating MPI.

Fixed an issue with SAGrid:  The WMS had lost access to the internet due to a proxy change, hence CRL downloads were failing.  This resulted in an end point failure when users attempted to create proxies.  This has now been fixed and CRLs are once again being downloaded.  gLite services have been restarted and users can now delegate proxies.

Maths jobs submitted to HPC cluster

Our first user has started submitting jobs to the cluster.  The software being used is auto07p for continuation and bifurcation problems in ordinary differential equations.  Auto07 can be compiled with OMP to make it multi-cpu aware.  Currently each job makes use of all 8 cpus on the worker node it runs on.

Parallel code, benefits and pit-falls

Most high end platforms for high performance computing are equipped with multi-core CPUs.  In order to fully utilize the CPUs multiple jobs must be run on each platform or the code must be changed to utilize multiple CPUs.  There are several methods used to take advantage of multiple CPUs: OpenMP, MPI, MPICH etc.  The simpler approaches utilize one server and all its CPUs in a shared memory model.  The more complex approach is to split the code across several servers with a master process handling communication between the shared memories and aggregating the results.  Either way, well written code split across multiple CPUs can generally increase job efficiency.

There are obviously several caveats: some code cannot be 'parallelized' due to the nature of the algorithm, the code should be correctly optimized, disk IO should be reduced and, when the code is split across several servers, network latency can become a significant delaying factor.

Below is a graph of job completion times, where a lower (faster) result is better.  The first bar is the time for the job to complete using only one processor.  This is a simple array calculation compiled in C++ running on a BL460 blade with dual quad cores.  The single CPU iterative job completes in 20 seconds.  Next the code is compiled with the omp.h library allowing it to parallelize the array calculation loops.  Unexpectedly the time to complete is longer than the iterative job.  This is because the job was only allowed to run on one core.  The overhead of the omp library managing multi-threading in the core is what caused the increase in run-time.
MPI

By increasing the number of cores on which the job is allowed to run we see an immediate increase in speed and reduction of job time.  This is unfortunately not a linear improvement due to communication latency, in this case in the processor cache.  OMP allows more threads to run than there are physical cores, which is fine for the purpose of testing.  Additionally one can run more than one multi-threaded job per server.  These practices however should be avoided as they cause processor contention as the tasks are switched in and out of CPU context.  This behaviour is clearly seen in the last two job runs.

C vs Fortran

Surprisingly, given the proliferation of C code, Fortran out-performs C in many areas.  Fortran allows better numerical array manipulation, provides a rich set of highly optimized precision numeric functions making it more predictable and faster than C and also provides extremely efficient IO functionality.

However something to keep in mind when writing Fortran applications or porting code is "row versus column order".  Fortran and C differ in their methods of storing arrays in linear memory.  Fortran uses column-major order (as does Matlab) while C uses row-major order.  While there is no intrinsic benefit in either approach, a lack of understanding of row versus column ordering can lead to speed degradation.  This is because the elements of the array that are being traversed in RAM are not contiguous when using the incorrect method and for very large arrays the data may not be cached.  This is especially true for large higher dimension arrays.

As a simple example consider the 2 dimensional array:

Array

Fortran would store the array in memory as follows:

While C would store the array as follows:

A programmer should take care when formulating "for loops" to ensure that the array traversal variables i and j are ordered correctly. For example the C code:

    for (i = 0; i < MAXi; i++) {
        for (j = 0; j < MAXj; j++) {
            /* arithmetic calculation on array[i][j] */
        }
    }

...is optimal in C because the inner loop steps through j, the last index, so consecutive accesses touch adjacent elements in memory ("row major order").

The graph below shows what happens when identical array computations are performed using the incorrect array addressing scheme (a lower number is better).  The blue graph shows the time taken to complete a program compiled in Fortran.  Here column major addressing is clearly faster than row major.  The red graph shows time taken to run the same code compiled in C.  Here it is clear that the inverse is true: the column major scheme takes slightly longer to run than row major.

C vs Fortran

Additionally, it can be seen that the Fortran code is significantly and consistently faster than C.  These tests were performed repeatedly and an average taken to avoid inconsistencies caused by data caching.  Additionally the tests were run using the OMP library to make use of multiple processors.  The findings were consistent and independent of the number of cores used.  The ICTS cluster provides both GNU Fortran and C compilers.

Catania application porting

The ICTS HPC team spent a month in Catania at the Institute for Nuclear Physics, working as part of a South African scientific application porting team.  Once again the trip was supported by EPIKH and was extremely successful.

After the Africa-2 application course in Johannesburg earlier in the year a number of applications were put forward by South African scientists for conversion to Grid format.  The South African team consisted of the ICTS specialists, Andrew Lewis and Timothy Carr, as well as Albert van Eck from the University of the Free State.  The team was headed up by Dr Bruce Becker of the Meraka Institute.  By working at the INFN the team had direct access to Grid specialists in porting, software and gLite middleware.

The conversion course was extremely successful and saw a number of applications being converted and deployed to South African grid sites as packaged RPM modules.  Some progress was also made in understanding how MPI is used in the Grid environment.  Unfortunately during the course the SEACOM link was interrupted, however we still had access to the GILDA training laboratory.

NAMD ebola

Portion of the Ebola virus rendered with NAMD

Once again the team took the opportunity to do some sightseeing and take in as much of the local culture and history as time allowed.  We visited Taormina in the North of Sicily, as well as Castelmola.  The highlight was a weekend trip to mainland Italy to visit Sorrento and the ancient Roman town of Pompeii, an unforgettable experience.

Pompeii

Pompeii, with Vesuvius in the background

We would like to thank the INFN team, Andrea Cort, Fabrizio Pistgna, Emidio Giorgio, Valeria Ardezonne and Dr Roberto Barbera for their hospitality and enthusiastic support.


Clockwise from left: Timothy Carr, Andrew Lewis, Valeria Ardizonne, Albert van Eck, Dr Roberto Barbera.

We would also like to thank Sakkie van Rensburg, Andre le Roux and Eugene van Rooyen for making this opportunity available to us.

EGEE User Forum in Sweden

Andrew Lewis and Timothy Carr from UCT accompanied by Dr Bruce Becker attended the 5th and final EGEE User Forum in Uppsala, Sweden.  It was a great opportunity to meet other Grid users and project coordinators and we made some very useful contacts.  It was also fun to meet up with some of the people we'd been working with in Europe for the past year.  The key theme from the conference was one of more collaboration in order to avoid duplication of effort, especially in the MPI and portal domains.
Uppsala University
It took a while to get used to the weather, frequently well below zero with occasional light snow.  As it was the last EGEE forum (the project phase is now closing and it will now be EGI) the coordinators decided to make it a memorable event and on Tuesday night we enjoyed a spectacular dinner in Uppsala Castle.  Over the course of the forum the South African delegates attended discussions in bioinformatics, Earth observation, scientific gateways and portals, computational chemistry, high energy physics, parallel processing and MPI architecture.  Dr Bruce Becker gave a very well received talk on the South African National Grid in the Regional Projects session.  Two windfalls to come out of the forum were the implementation of gqsub (a qsub like wrapper for gLite developed by a Scottish University) and a graphical monitoring tool for the grid using Google Earth.

EGEE participants
Unfortunately on the 2nd last day of the conference ash from the Icelandic volcano Eyjafjallajökull shut down air traffic over Europe and we found ourselves stranded indefinitely in Sweden.  Luckily we had not booked out of our hotel yet, or found ourselves stuck in the 'departure hall limbo' as so many unfortunate travellers did.  The Swedish home affairs were extremely helpful and granted us temporary residency.  We took full advantage of the situation and visited the HPC data center at Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX), went on a brief visit to Stockholm and a walking tour of a Viking village at Gamla Uppsala.  We also visited the spectacular Uppsala University Museum housing a collection of Viking and Egyptian exhibits, a huge display dedicated to Carl Linnaeus as well as first editions of Newton's Principia and Galileo's Sidereus Nuncius.
Uppmax.se
On the 2nd Friday we took a chance and managed to wangle our way onto an aircraft heading for Doha.  48 hours later we were back in SA to find that our director had kindly given us the Monday off in consolation for all our hardships ;-)
