Big Bytes

ZA-UCT-ICTS status

Our Grid cluster currently has an issue with gLite services. Although the rest of our infrastructure is fine (SAGrid site advertising and WMS are working properly), no jobs will be able to run on UCT's Grid cluster until this is resolved. We have logged a call via GGUS with the Africa ROC.

Jobs are still running on the HPC cluster as this cluster is not part of the Grid infrastructure.

Site BDII issue at ZA-UCT-ICTS

This morning at about 06:00 our Site BDII stopped working. This was traced to the BDII's slapd process, which had died.

[root@srvslngrd002 ~]# /etc/init.d/bdii status
BDII slapd PID file exists but the process died  [FAILED]


This has since been restarted and our site is advertising again.  This has been an ongoing issue (failing once every 3 or so months) for about a year now.

The Site BDII advertises services and availability of a Grid Site's cluster and software.  When the process is not running no jobs will be sent to the site, even if the site is still functional.
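
Because this failure keeps recurring, a small watchdog run from cron could catch it sooner. What follows is only a sketch, not something we have deployed: the function names are made up, it assumes the init script keeps printing the FAILED marker shown in the status output above, and it assumes the script accepts a restart action.

```shell
# Hypothetical helper: print "restart" if the status text carries the
# FAILED marker seen in the status output, otherwise print "ok".
bdii_action() {
    case "$1" in
        *FAILED*) echo "restart" ;;
        *)        echo "ok" ;;
    esac
}

# Hypothetical cron entry point; only meaningful on the Site BDII host.
check_bdii() {
    # The status command may exit non-zero when the BDII is down, so ignore its exit code.
    status_text=$(/etc/init.d/bdii status 2>&1) || true
    if [ "$(bdii_action "$status_text")" = "restart" ]; then
        echo "BDII reported a failure, restarting"
        /etc/init.d/bdii restart   # assumption: the init script has a restart action
    fi
}
```

Calling check_bdii every few minutes from root's crontab would be one way to use it. Keeping the text check in its own function means it can be tried against saved status output without touching the service.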

WMS cert

The certificate for the WMS (srvslngrd010.uct.ac.za) has been updated.

ZA-UCT-CERN downtime

The middleware is being upgraded to EMI 1 (Kebnekaise), so the site will be down for a day or so. UCT-CERN will now become the test bed for an MoU initiative with EMI to assist with product support and training.

Certificate update

The certs on our CE and Site BDII have just been updated.  Jobs are running on our CE currently so things look OK.

Odd queuing behaviour in Grid cluster

This evening we observed some weirdness on our Grid cluster. Users could submit jobs but they were immediately queued. Initially it was suspected that only one worker node was affected; however, we soon realized that all three worker nodes were exhibiting the same issue. Oddly, some jobs (short term test jobs submitted via EUMed) were running.

Restarting the pbs_server daemon on the head node had no effect, other than to cause all worker nodes to register a down status. Checking the worker nodes revealed that all pbs_mom daemons were in a running but dead state. Restarting all pbs_mom daemons allowed some jobs to be submitted, but only on two of the worker nodes. It was then noted that an old SAGrid job from several days ago was still in a queued state. Killing this job put the queues back into a happy state.

We are not sure exactly what the issue was; possibly a malformed queue submission or JDL caused a hang in the scheduler. We are currently considering increasing the level of monitoring to test the state of the PBS daemons.
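
As a starting point for monitoring the PBS daemons, here is a minimal sketch that lists worker nodes whose state includes down or offline. It assumes the usual Torque `pbsnodes -a` layout (node name on an unindented line, attributes such as `state = free` indented beneath it), and the function name is hypothetical. It only reports nodes the server already considers down, so on its own it would not have spotted the stuck queued job.

```shell
# Hypothetical helper: read `pbsnodes -a` style output on stdin and print
# the name of every node whose state contains "down" or "offline".
down_nodes() {
    awk '
        # An unindented, non-empty line is taken to be a node name.
        /^[^ \t]/ { node = $1 }
        # Indented attribute lines look like "state = free".
        $1 == "state" && $2 == "=" && $3 ~ /down|offline/ { print node }
    '
}
```

On the head node this would be run as `pbsnodes -a | down_nodes`, and any output could be mailed to the site admins.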

Fully utilized

Our Grid cluster is seeing heavy usage at present: three users are running a total of 53 jobs (some of them queued). The longest running job is still going at 1213 hours. The jobs are using Matlab, Auto07 and Root, a particle physics package.


UCT site bdii

Problem:
Top bdii stopped publishing information for the ZA-UCT-ICTS site on Sunday 3rd April, 13:00.

Cause:
The reason for this was that slapd on the site bdii had stopped responding, even though the daemon was running.

Impact:
No jobs could be submitted to the site via the WMS.

Solution:
Restarted slapd, site now being published again at 4th April 12:00.
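
Since slapd was running but not answering, checking for the process would not have caught this; a probe that actually queries the Site BDII is more telling. The sketch below assumes the OpenLDAP `ldapsearch` client is installed and that the Site BDII listens on port 2170. The function names and the 10 second limit are our own choices for illustration.

```shell
# Base DN under which a Site BDII publishes, e.g. mds-vo-name=ZA-UCT-ICTS,o=grid.
site_base_dn() {
    echo "mds-vo-name=$1,o=grid"
}

# Hypothetical probe: print "responding" if the query returns an entry,
# otherwise "not responding". Arguments: Site BDII host, site name.
probe_site_bdii() {
    host=$1
    site=$2
    # -x simple bind, -LLL plain LDIF output, -l 10 search time limit in seconds,
    # -s base fetches only the top entry for the site.
    if ldapsearch -x -LLL -l 10 -H "ldap://$host:2170" \
            -b "$(site_base_dn "$site")" -s base 2>/dev/null | grep -q '^dn:'; then
        echo "responding"
    else
        echo "not responding"
    fi
}
```

For example, `probe_site_bdii <site_bdii_fqdn> ZA-UCT-ICTS` could be run from another host. A slapd that accepts the connection and then hangs may still stall the client, so wrapping the call in an external timeout would be worth considering.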

 

Long term proxies with gLite

A persistent problem with proxies expiring during long term jobs on the SAGrid VO has been resolved. A user must create a proxy in order to submit jobs to the Grid. However, the proxy expires after a pre-determined time (default 12 hours), and the job is terminated if it is still running at that point. To allow jobs to run for longer, a method of proxy renewal must be used. In the past we found that renewal did not work, but only recently were we able to determine that the CNAF MyProxy server did not appear to be honouring renewal requests from SAGrid proxies. Changing the MyProxy server to the INFN server (myproxy.ct.infn.it) resolved this issue.

Creating a long term proxy is simple.  Create a short term (1 hour) local proxy:
   voms-proxy-init --voms sagrid -valid 1:00

Now create a long term (168 hours) proxy on the MyProxy server:
   myproxy-init -s myproxy.ct.infn.it -d -n

You will need to put the following line in your JDL file:
   MyProxyServer = "myproxy.ct.infn.it";

Then submit your job as normal.  You will need a valid short term proxy to carry out any local commands or to check the status of your long term proxy.  If your local proxy expires just create another one.

To check the status of your long term proxy use the following command:
   myproxy-info -s myproxy.ct.infn.it -d

If your long term proxy looks like it will expire before your job has finished then you can extend its lifetime by running the init command again:
   myproxy-init -s myproxy.ct.infn.it -d -n

NB. When using long term proxies you do not need to create a delegation user ID.
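
To judge whether the long term proxy will outlast a job, the time left can be compared with the hours the job still needs. This is a minimal sketch that assumes `myproxy-info` prints a line of the form `timeleft: HH:MM:SS`; the function names and the sample figures are hypothetical.

```shell
# Hypothetical helper: read myproxy-info output on stdin and print the
# whole hours left on the stored credential.
proxy_hours_left() {
    awk '$1 == "timeleft:" { split($2, t, ":"); print t[1] + 0; exit }'
}

# Print "renew" if the hours left ($1) are fewer than the hours still
# needed by the job ($2), otherwise print "ok".
renewal_advice() {
    if [ "$1" -lt "$2" ]; then
        echo "renew"
    else
        echo "ok"
    fi
}
```

A typical use would be `left=$(myproxy-info -s myproxy.ct.infn.it -d | proxy_hours_left)` followed by `renewal_advice "$left" 48` for a job expected to need another 48 hours. If the answer is renew, run the myproxy-init command again.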

Example:
1) The user creates a proxy certificate on the local user interface / portal.

2) A long term proxy is created on the remote proxy server and ‘signed’ with the user’s credentials.  This will become the official proxy certificate for the job.

3) Using the local proxy the user submits the job to the WMS which in turn submits it to the relevant CE.

4) The user’s local proxy expires; however, the job proxy renewal is now dealt with automatically via the long term proxy server. The user can create a new local proxy at a later stage for job status monitoring or retrieval.

Integration of 2 or more similar node types using gLite 3.2

This week SAGrid Core Services added an additional computing element (CE) to their site. This may sound like a complicated task but it can be achieved in a few easy steps. You would want this setup in order to serve different clusters from one site. Other node types, for example LFC, AMGA and SE, could be published in the same way to provide failover.

1. Configure your new computing element (CE) exactly as you would normally configure a computing element. A few variables need to be updated in your site-info.def file.
NB: Make a copy of your primary computing element's site-info.def file and make the changes below.

SITE_NAME=<Site_Name_New_Name>
CE_HOST=<FQDN>
BDII_REGIONS="PROD_CERN_CREAM_CE"
BDII_PROD_CERN_CREAM_CE_URL="ldap://$CE_HOST:2170/mds-vo-name=resource,o=grid"

2. Reconfigure your Site BDII (site-info.def) to include the new computing element as part of the GRIS.

The BDII region names are not fixed values prescribed by the grid middleware. The GRIS will identify the service after a successful yaim run, so you can use these names as descriptive labels for the services at your site.

Add the following to your site-info.def file.

BDII_REGIONS="PROD_CERN_CREAM_CE"
BDII_PROD_CERN_CREAM_CE_URL="ldap://$CERN_CE_HOST:2170/mds-vo-name=resource,o=grid"

* Execute yaim to refresh your sBDII services
* View the /opt/glite/etc/gip/site-urls.conf file to validate that the information was populated with the correct values.

NOTE: Ensure that the distinguished name (DN) is syntactically correct. Check the /opt/glite/var/tmp/gip/log/site-urls.conf log to confirm that the configuration is error free.
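
Because the region names are free-form, it is easy to list a region and forget its URL. The sketch below is a hypothetical pre-flight check to run before yaim: it assumes the variables from site-info.def have already been loaded into the shell, and that every name in BDII_REGIONS should have a matching BDII_<REGION>_URL variable, as in the lines above.

```shell
# Hypothetical helper: print each region listed in BDII_REGIONS that has
# no corresponding BDII_<REGION>_URL variable set.
missing_region_urls() {
    for region in $BDII_REGIONS; do
        # Look up the variable named BDII_<region>_URL (empty if unset).
        eval "url=\${BDII_${region}_URL:-}"
        [ -n "$url" ] || echo "$region"
    done
}
```

After sourcing site-info.def in a shell, running `missing_region_urls` should print nothing if every region has a URL.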

3. Configuration of Top BDII

Finally, add an entry to bdii.conf so that the second computing element appears in the information index.

ZA-UCT-CERN ldap://<site_bdii_fqdn>:2170/mds-vo-name=<Site_Name>,o=grid

UI storage element issue corrected

Fixed an issue on the UCT user interface relating to lcg-infosites that was preventing the storage elements from being published.

Top level BDII error

Corrected a config error which was preventing the Wits University site from publishing its details to the core SAGrid services. Wits is now visible again.

NWU site admin training course

Timothy and Andrew attended a site administrators course in Potch over the last 3 days.  The focus was on doing higher level site admin training and to thrash out some problems we've been experiencing on SAGrid.  The closing session included some very interesting talks by NWU staff on the use of their local HPC cluster in chemical engineering and aerodynamic design for high performance gliders.  We'd also like to extend our thanks to Hannes, Zuko and Attie for their kind hospitality.

Fixed issue relating to WMS

We have finally resolved the issue of the high load on the WMS. It was caused by test jobs submitted by Eumed Grid that were unable to access a required CE and so remained in the WMS queue. The SAGrid WMS has now been placed in a bypass list.

UI and CE upgrade

The following elements of the UCT Grid infrastructure have been upgraded:

UI - portal.sagrid.ac.za
CE and Worker Nodes

Upgrade from SL4, gLite 3.1 to SL5, gLite 3.2
