Head node reboot
This weekend the head node rebooted unexpectedly. We're still not sure what caused it. However, it looks as if the running jobs were not affected.
We're currently investigating a memory issue with some of the worker nodes. Memory is not being freed up after jobs complete.
UPDATE - 4 July:
Turns out it's not a memory error. The problem is the way that net-snmp monitors and returns memory information. Unfortunately, net-snmp lumps real and cached memory together. Linux uses otherwise idle RAM for its page cache and hands it back when applications need it, so over time it looks as if available memory drops to zero even though most of it can be reclaimed. See this post for more info. Some jiggery-pokery may be required to fix this.
UPDATE - 5 July:
The issue is now resolved. By creating a custom SNMP OID backed by a Perl script that gets its information straight from /proc, we get a far better idea of the actual memory in use. This is displayed on the dashboard and also tracked in real time by our monitoring systems.
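For anyone curious about the arithmetic, here is a minimal sketch in Python (not the Perl script we actually run). It assumes the figures come from /proc/meminfo, the standard Linux memory report, and that "actual" use means total minus free minus buffers minus cache. The function names and the sample numbers are made up for illustration.

```python
# Hypothetical sample in /proc/meminfo format; the values are invented.
SAMPLE = """\
MemTotal:       16435000 kB
MemFree:          210000 kB
Buffers:          325000 kB
Cached:         14100000 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_in_use_kb(info):
    """Return (naive, actual) memory in use, both in kB.

    naive  counts buffers and page cache as used, which is the view
           that made the nodes look as if they were running out.
    actual subtracts buffers and cache, since the kernel can reclaim
           them when applications need the memory.
    """
    naive = info["MemTotal"] - info["MemFree"]
    actual = naive - info.get("Buffers", 0) - info.get("Cached", 0)
    return naive, actual


naive, actual = memory_in_use_kb(parse_meminfo(SAMPLE))
print("naive: %d kB, actual: %d kB" % (naive, actual))
```

On a real node, replace SAMPLE with the contents of /proc/meminfo. Hooking the result up to snmpd as a custom OID is a separate step and is not shown here.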
Patched kernels on the HPC servers to 2.6.18-238.1.1.el5. All went fine except for the head node, which has an issue with the latest kernel (it dies at boot with a kernel panic), so we are booting it into the older version, 2.6.18-194.1.1.el5, until we can sort this out.
Fixed an issue with Open MPI on the grid cluster: the usual problem of openmpi.conf looking for InfiniBand.
Investigated unexpected computation speeds on different chips with MPI. Slower CPU cores were completing jobs in less time than faster ones. However, this seems to boil down to the nature of the jobs (large array computations), the size of the CPU cache and the communication latency between parallel processes.
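To illustrate the cache part of that explanation, here is a rough back-of-envelope sketch in Python. It is not a measurement from our jobs: the array size, process count, element size and cache sizes are all hypothetical, and it assumes the array is split evenly across the MPI processes.

```python
def per_process_working_set(total_elements, n_procs, bytes_per_element=8):
    """Bytes each process touches if the array is split evenly.

    Uses ceiling division so the largest chunk is the one reported.
    """
    per_rank = -(-total_elements // n_procs)
    return per_rank * bytes_per_element


def fits_in_cache(total_elements, n_procs, cache_bytes, bytes_per_element=8):
    """True if one process's chunk of the array fits in the given cache."""
    chunk = per_process_working_set(total_elements, n_procs, bytes_per_element)
    return chunk <= cache_bytes


MIB = 1024 * 1024

# Hypothetical job: 4 million doubles shared between 8 processes.
chunk = per_process_working_set(4000000, 8)
print("chunk per process: %d bytes" % chunk)
print("fits in 6 MiB cache:", fits_in_cache(4000000, 8, 6 * MIB))
print("fits in 2 MiB cache:", fits_in_cache(4000000, 8, 2 * MIB))
```

With these made-up numbers, a lower-clocked chip with the larger cache keeps each chunk in cache, while a faster chip with the smaller cache has to go out to main memory, which is one way the slower cores could come out ahead.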
Some interesting links to Intel CPU types:
List_of_Intel_microprocessors
Comparison_of_Intel_processors
Intel_Cores