[GE users] Illumina/Solexa pipeline on large SGE 6.2 systems?
beckerjes at mail.nih.gov
Fri Sep 5 20:49:39 BST 2008
Chris Dagdigian wrote:
> The Illumina analysis pipeline is kinda clever in that it has scripts
> that generate massive Makefiles that control all the tasks associated
> with the analysis run. To run the pipeline on a local server you just
> kick off the prep script, navigate to a directory and type "make".
> Pretty cool.
Yeah, it is a pretty interesting use of makefiles. The dependency tree is
really impressive to see (there are a few scripts out there that parse
Makefiles and generate graphviz files to help visualize them).
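As a rough illustration of the idea (not one of the scripts mentioned above),
a minimal sketch that turns a Makefile's "target: prerequisites" lines into a
Graphviz DOT graph might look like this; it deliberately ignores pattern
rules, variables, and line continuations:

```python
import re

def makefile_to_dot(makefile_text):
    """Convert simple 'target: prereq ...' lines into Graphviz DOT.

    Only a sketch: recipe lines (tab-indented), comments, and variable
    assignments are skipped; nothing else in Makefile syntax is handled.
    """
    out = ["digraph deps {"]
    for line in makefile_text.splitlines():
        # Skip recipes, comments, and variable assignments.
        if line.startswith("\t") or line.lstrip().startswith("#") or "=" in line:
            continue
        m = re.match(r"^([\w./-]+)\s*:\s*(.*)$", line)
        if not m:
            continue
        target, prereqs = m.group(1), m.group(2).split()
        for p in prereqs:
            out.append('  "%s" -> "%s";' % (target, p))
    out.append("}")
    return "\n".join(out)

if __name__ == "__main__":
    demo = "foo: foo.o bar.o\n\tcc -o foo foo.o bar.o\nfoo.o: foo.c\n\tcc -c foo.c\n"
    print(makefile_to_dot(demo))
```

Feed the result to "dot -Tpng" to render the dependency tree.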
> Of course the nicest thing about a workflow based on unix make is that
> you can "parallelize" your workflow by replacing unix make with SGE
> qmake and presto, you have a "cluster aware" analysis pipeline with
> very little effort.
Unfortunately, it wasn't quite that simple when we originally tried to do
this. We're running SGE 6.0u8. The Makefiles themselves are valid, but we
ran headlong into the two "Known Problems" mentioned in the qmake man page.
The biggest issue has to do with multiple commands in a single rule. There
were a few instances of the problem described in the manpage that qmake can't
handle. The man page has an example, but briefly, the problem is that qmake
dispatches each command line of a rule as a separate job, so a rule whose
commands look like:

	cd bar
	cc -o foo foo.c

won't work with qmake (the "cd" has no effect on the second command, since it
runs in a separate job). It has to be rewritten as a single command line,
something like:

	cd bar; \
	cc -o foo foo.c
Instead of modifying makefiles after they were generated, we instead left them
as-is and created a PE to allocate N slots per host (where N is the number of
cores). We then call 'make' directly against each lane using something like this:
qsub -cwd -pe make-dedicated 4 -b y make -j 4
While not ideal, we do get 8-way parallelism across the lanes (one job per
lane of the flow cell), and 4-way parallelism within each lane. This has
worked well enough for the past 10 months.
However, we are starting the process of re-evaluating this method, along
with an upgrade to 6.2.
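To sketch the per-lane submission described above (the PE name
"make-dedicated" matches our setup, but the lane directory layout here is a
placeholder; the function only prints the qsub commands rather than
submitting them, so drop the "echo" to use it for real):

```shell
#!/bin/sh
# Submit one SGE job per lane; each job gets a 4-slot parallel
# environment on a single host and runs make with matching -j.
submit_lanes() {
    slots=4
    for lane in 1 2 3 4 5 6 7 8; do
        # "lane$lane" stands in for the real per-lane run directory.
        echo qsub -cwd -pe make-dedicated $slots -b y \
            make -j $slots -C "lane$lane"
    done
}
submit_lanes
```

Keeping the PE slot count and the -j value in one variable avoids
oversubscribing the host if the slot allocation ever changes.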
Also, as Sean Davis mentioned elsewhere, the pipeline is very IO intensive,
and you can have issues with the NFS servers. We had some issues with
attribute (e.g. timestamp) caching that caused some dependency problems.
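For reference, the client-side attribute cache can be tuned at mount time; a
sketch of an fstab entry (server name and paths are placeholders), assuming a
Linux NFS client:

```
# Shorten the attribute cache timeout so make sees fresh timestamps
# sooner.  actimeo is in seconds; "noac" disables the cache entirely
# but costs a lot of performance on an IO-heavy run.
nfsserver:/export/seqdata  /seqdata  nfs  rw,hard,intr,actimeo=3  0  0
```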
> Given that a big feature of SGE 6.2 is a new totally-internal
> implementation of interactive job support I'm wondering if there is
> anyone on this list who has found that running under 6.2 has made the
> pipeline runs smoother and less prone to resource or bottleneck
> related SGE errors.
We actually wound up writing our own "pipeline" around the Illumina pipeline.
It's more extensive, in that we use it for moving data off the sequencers,
checking that the raw data (images) are valid, and actually running the
Illumina pipeline. After each cycle, we also run a "partial" analysis on a
subset of tiles, just to see how things are doing. But as I said, this is
using 6.0u8, not 6.2.
I'll let you know how the upgrade goes.
NHGRI Linux support (Digicon Contractor)