[GE users] Illumina/Solexa pipeline on large SGE 6.2 systems?

Jesse Becker beckerjes at mail.nih.gov
Fri Sep 5 20:49:39 BST 2008



Chris Dagdigian wrote:

> The Illumina analysis pipeline is kinda clever in that it has scripts  
> that generate massive Makefiles that control all the tasks associated  
> with the analysis run. To run the pipeline on a local server you just  
> kick off the prep script, navigate to a directory and type "make".
> 
> Pretty cool.

Yeah, it is a pretty interesting use of makefiles.  The dependency tree is 
really impressive to see (there are a few scripts out there that parse 
Makefiles and generate Graphviz files to help visualize them).
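In case anyone wants to try this, a minimal sketch of that idea (hypothetical 
script, not one of the tools I mentioned) is just to scrape "target: deps" 
lines out of a Makefile and emit DOT edges:

```shell
# Sketch: build a Graphviz DOT graph from a Makefile's rule lines.
# The demo.mk here is a toy stand-in for a real generated Makefile.
cat > demo.mk <<'EOF'
foo: bar baz
bar:
baz:
EOF

awk -F': *' '
BEGIN { print "digraph deps {" }
# Match "target:" rule lines; skip ":=" variable assignments.
/^[A-Za-z0-9_.\/-]+ *:([^=]|$)/ {
    n = split($2, deps, " ")
    for (i = 1; i <= n; i++)
        if (deps[i] != "") printf "  \"%s\" -> \"%s\";\n", $1, deps[i]
}
END { print "}" }
' demo.mk > deps.dot

cat deps.dot
# Render with: dot -Tpng deps.dot -o deps.png
```

Dedicated tools handle pattern rules, includes, etc. much better, but even 
this crude version gives you a picture of the tree.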


> Of course the nicest thing about a workflow based on unix make is that  
> you can "parallelize" your workflow by replacing unix make with SGE  
> qmake and presto, you have a "cluster aware" analysis pipeline with  
> very little effort.

Unfortunately, it wasn't quite that simple when we originally tried to do 
this.  We're running SGE 6.0u8.  The Makefiles themselves are valid, but we 
ran headlong into the two "Known Problems" mentioned in the qmake man page.

The biggest issue has to do with multiple commands in a single rule.  The 
generated Makefiles contained a few instances of the problem described in the 
qmake man page: each command line in a rule is dispatched as its own remote 
task, so shell state such as the working directory is not preserved between 
lines.  Briefly, a target such as:

   foo:
	cd bar
	cc -o foo foo.c

won't work with qmake.  It has to be rewritten as something like:

   foo:
	cd bar; \
	cc -o foo foo.c

Instead of modifying the Makefiles after they were generated, we left them 
as-is and created a PE that allocates N slots per host (where N is the number of 
cores).  We then call 'make' directly against each lane using something like this:

    qsub -cwd -pe make-dedicated 4 -b y  make -j 4
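For reference, the PE definition we use looks roughly like this (field values 
are from memory and the name is ours; the important part is allocation_rule 
$pe_slots, which forces all of a job's slots onto one host so that 'make -j 4' 
actually has 4 local cores):

```
$ qconf -sp make-dedicated
pe_name            make-dedicated
slots              256
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
```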

While not ideal, we do get 8-way parallelism across lanes (one make job per 
lane), and 4-way parallelism within each lane.  This has worked well enough for 
the last 10 months.  However, we are starting the process of re-evaluating this 
method, along with an upgrade to 6.2.

Also, as Sean Davis mentioned elsewhere, the pipeline is very IO intensive, 
and can cause problems for the NFS servers.  We had trouble with attribute 
(e.g. timestamp) caching, which broke some of make's dependency checks.
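One mitigation worth knowing about (a sketch, not our exact configuration; 
server and mount point names are made up) is to shorten or disable attribute 
caching on the clients' NFS mounts, at the cost of extra GETATTR traffic:

```
# /etc/fstab fragment: cache NFS attributes for at most 1 second so make
# sees fresh timestamps.  "noac" disables attribute caching entirely.
nfssrv:/export/seqdata  /mnt/seqdata  nfs  rw,hard,intr,actimeo=1  0 0
```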

> Given that a big feature of SGE 6.2 is a new totally-internal  
> implementation of interactive job support I'm wondering if there is  
> anyone on this list who has found that running under 6.2 has made the  
> pipeline runs smoother and less prone to resource or bottleneck  
> related SGE errors.

We actually wound up writing our own "pipeline" around the Illumina pipeline. 
It's more extensive, in that we use it for moving data off the sequencers, 
checking that the raw data (images) are valid, and actually running the 
Illumina pipeline.  After each cycle, we also run a "partial" analysis on a 
subset of tiles, just to see how things are doing.  But as I said, this is 
using 6.0u8, not 6.2.

I'll let you know how the upgrade goes.


-- 
Jesse Becker
NHGRI Linux support (Digicon Contractor)


