This shows you the differences between two versions of the page.
array_jobs [2014/02/16 20:17] rdiazgar |
array_jobs [2014/02/18 12:04] (current) rdiazgar old revision restored (2014/02/17 16:31) |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | $$$$$$$$$$$$ | + | ====== Array Jobs ====== |
- | ====== Array jobs for clusters running SGE ======$$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | Array jobs are essentially a mechanism for executing the same script several times. Say, for instance, that you need to run a job script N times (for example, to apply an action to each of N images). You would typically call the same script N times and change just one parameter (the image index i). With array jobs, you specify the index range you want to execute and SGE takes care of the rest. This has several advantages: |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | - Simplification of the job script |
- | $$$$$$$$$$$$ | + | - Better queue management in the head node of the cluster |
- | Say that we want to run the exact same script several times, but with different parameters each time. The naive solution would be calling qsub N times, but this is impractical. Instead, array jobs are the solution.$$$$$$$$$$$$ | + | - Better job organization: a single JOBID is created for the whole array job and each task gets its own TASKID |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | To submit an array job, add the following to a qsub call or a script header: |
- | $$$$$$$$$$$$ | + | |
- | Array jobs are created with the -t parameter (either in the qsub call or in the script header). You must specify a range from 1 to N. The variable SGE_TASK_ID will indicate the i-th call of the array job as in the example below:$$$$$$$$$$$$ | + | <code>qsub -t 1-N ... |
- | $$$$$$$$$$$$ | + | </code> |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | Here 1-N is the range of task indices you want to cover (note that a range starting at 0, such as 0-N, is invalid). |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | On the script side, the variable SGE_TASK_ID tells us which call of the script (the i-th) is running. Take this as an example: |
- | $$$$$$$$$$$$ | + | |
- | <code>$$$$$$$$$$$$ | + | <code> |
- | $$$$$$$$$$$$ | + | #!/bin/bash |
- | #!/bin/sh$$$$$$$$$$$$ | + | # |
- | #$ -t 1-10000$$$$$$$$$$$$ | + | # MatchDist.sh |
- | SEEDFILE=~/data/seeds$$$$$$$$$$$$ | + | # Match a list of key files in a distributed fashion |
- | SEED=$(cat $SEEDFILE | head -n $SGE_TASK_ID | tail -n 1)$$$$$$$$$$$$ | + | |
- | ~/programs/simulation -s $SEED -o ~/results/output.$SGE_TASK_ID$$$$$$$$$$$$ | + | export PATH=~/software/bundler_sfm/bin:$PATH |
- | $$$$$$$$$$$$ | + | export LD_LIBRARY_PATH=~/software/bundler_sfm/bin:$LD_LIBRARY_PATH |
- | </code>$$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | WORKDIR=$4 |
- | $$$$$$$$$$$$ | + | LIST=$1 |
- | === What if you number files from 0 instead of 1? ===$$$$$$$$$$$$ | + | TMPDIR=$5 |
- | $$$$$$$$$$$$ | + | OUTDIR=$6 |
- | $$$$$$$$$$$$ | + | mkdir -p $TMPDIR |
- | $$$$$$$$$$$$ | + | mkdir -p $OUTDIR |
- | $$$$$$$$$$$$ | + | RATIO=$3 |
- | The '-t' option will not accept 0 as part of the range, i.e. #$ -t 0-99 is invalid, and will generate an error. However, you can label the input files from 0 to n−1. That’s easy to deal with:$$$$$$$$$$$$ | + | I=$((${SGE_TASK_ID}-1)) |
- | $$$$$$$$$$$$ | + | OUT=$(echo $(printf "%d" $I)_${2}) |
- | $$$$$$$$$$$$ | + | |
- | <code>$$$$$$$$$$$$ | + | cd $WORKDIR |
- | $$$$$$$$$$$$ | + | |
- | #!/bin/sh$$$$$$$$$$$$ | + | echo "KeyMatchSingle $LIST $OUTDIR/$OUT $RATIO $I" |
- | # Tell the SGE that this is an array job, with "tasks" to be numbered 1 to 10000$$$$$$$$$$$$ | + | KeyMatchSingle $LIST $TMPDIR/$OUT $RATIO $I |
- | #$ -t 1-10000$$$$$$$$$$$$ | + | cp -f $TMPDIR/$OUT $OUTDIR/$OUT |
- | i=$(expr $SGE_TASK_ID - 1)$$$$$$$$$$$$ | + | </code> |
- | if [ ! -e ~/results/output.$i ]$$$$$$$$$$$$ | + | |
- | then$$$$$$$$$$$$ | + | In the example above, SGE_TASK_ID identifies the i-th task of the array job; subtracting 1 from it gives the zero-based index I. |
- | ~/programs/program -i ~/data/input.$i -o ~/results/output.$i$$$$$$$$$$$$ | + | |
- | fi $$$$$$$$$$$$ | + | ==== Preventing too many tasks from running simultaneously ==== |
- | </code>$$$$$$$$$$$$ | + | |
- | === Limiting the number of concurrent array jobs ===$$$$$$$$$$$$ | + | If we know that our tasks demand a lot of resources and might stall a node, we can limit the number of tasks of that specific job that run concurrently. Add the following parameter to the qsub call or to the script header: |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | <code> |
- | If we don't want to run all array jobs simultaneously, we can add the parameter -tc MAX_JOBS. This will allow only a maximum of MAX_JOBS to be running in the cluster at the same time.$$$$$$$$$$$$ | + | qsub -t 1-N -tc NMAX ... |
- | $$$$$$$$$$$$ | + | </code> |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | This will allow at most NMAX tasks to be executed simultaneously. |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | |
- | $$$$$$$$$$$$ | + | |
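
As a minimal sketch of how the pieces above fit together, the following hypothetical job script declares the task range and the concurrency limit in its header and converts the 1-based SGE_TASK_ID into a 0-based index. The file name inputs.txt, the helper names zero_based_index and nth_line, and the values 1-10 and 2 are assumptions made for illustration. SGE_TASK_ID defaults to 1 only so the script can be tried outside the scheduler.

```shell
#!/bin/bash
# Hypothetical header: 10 tasks, at most 2 running at the same time.
#$ -t 1-10
#$ -tc 2

# Convert a 1-based task ID into a 0-based index.
zero_based_index() {
    echo $(( $1 - 1 ))
}

# Print line number $1 (1-based) of the file $2.
nth_line() {
    sed -n "${1}p" "$2"
}

# SGE sets SGE_TASK_ID for each task; fall back to 1 for a manual test run.
TASK_ID=${SGE_TASK_ID:-1}
I=$(zero_based_index "$TASK_ID")

# Assumed input list with one item per line (hypothetical file name).
LIST=${1:-inputs.txt}
if [ -f "$LIST" ]; then
    INPUT=$(nth_line "$TASK_ID" "$LIST")
    echo "task $TASK_ID (index $I) would process: $INPUT"
fi
```

Outside the cluster, a single task can be simulated by setting the variable by hand, for example by running the script with SGE_TASK_ID=3 in the environment and the list file as its first argument.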