[GE users] Scheduler stops transferring queued jobs after GDI error
neil at futurity.co.uk
Fri Jan 8 15:33:35 GMT 2010
I was wondering if anyone may be able to help me?
We're using Grid Engine 6.1u3 and are experiencing a problem where queued jobs aren't transferred from state "qw" (queue waiting) to machines to be run. This has been going on for the last few months. At first it occurred only about once every 2 weeks, but since the new year it has started to happen multiple times a day. Rebooting the host machines doesn't make it happen any less frequently.
When the qmaster is soft stopped and started again, the queued jobs then transfer and run fine until the problem reoccurs.
The sequence of events leading up to the problem is as follows:
1. Everything on the grid is working fine.
2. A user receives the error message "error: failed receiving gdi request".
3. Subsequent job submissions appear to work without the gdi error being received.
4. Jobs already in state "qw", and jobs submitted after step 2, stay in state "qw" and are never transferred.
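
To catch step 4 before users notice it, a small check along these lines might help. This is only a sketch, not something from our setup: the function name `stuck_pending_jobs`, the 30-minute threshold, and the sample listing are all made up for illustration, and it assumes the usual plain-text qstat column layout (job-ID, prior, name, user, state, submit date, submit time) with job names that contain no spaces.

```python
from datetime import datetime, timedelta

# Hypothetical sample listing, shaped like plain-text qstat output.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    101 0.55500 render_a   alice        r     01/08/2010 14:02:10 all.q@node01                       1
    102 0.55500 render_b   alice        qw    01/08/2010 14:05:00                                    1
    103 0.55500 render_c   bob          qw    01/08/2010 15:30:00                                    1
"""


def stuck_pending_jobs(qstat_text, now, max_wait=timedelta(minutes=30)):
    """Return (job_id, submit_time) for jobs that have sat in 'qw' longer than max_wait."""
    stuck = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and blank or short lines.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        job_id, state = fields[0], fields[4]
        if state != "qw":
            continue
        # Assumed timestamp layout: MM/DD/YYYY HH:MM:SS.
        submitted = datetime.strptime(fields[5] + " " + fields[6], "%m/%d/%Y %H:%M:%S")
        if now - submitted > max_wait:
            stuck.append((job_id, submitted))
    return stuck


if __name__ == "__main__":
    for job_id, submitted in stuck_pending_jobs(SAMPLE, datetime(2010, 1, 8, 15, 33, 35)):
        print(job_id, "pending since", submitted)
```

In practice the text would come from running qstat for all users (for example `qstat -u '*'`) from cron. Jobs can legitimately wait a long time when the grid is simply full, so the threshold would need tuning and an alert from this is only a hint that the scheduler may have stopped dispatching.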
We haven't modified our grid configuration for 6 months, possibly a year, and it had been running without any problems whatsoever for months before this started to happen.
Disk space is fine (7GB free). Top shows that the machine's load is negligible both when the grid is working fine and when it is in this problem state.
Has anyone else experienced this problem, or does anyone have any suggestions?
Would upgrading to 6.1u6 help?