SLURM: How to submit (qsub) a task when another task has finished?

I am currently using a Linux-based HPC system that accepts job submissions only through SLURM. The HPC allows a job to run for at most 12 hours. However, to get good results, I may need to run 24 jobs back to back over a week.

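Is there a way to rerun the job automatically once it completes? When a job finishes, a .out file is created.

I am thinking of adding a check based on this: each finished job increases the number of .out files by one.

Can the job be requeued when the number of .out files increases?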

#!/bin/bash 
#! 
#! Example SLURM job script for Darwin (Sandy Bridge, ConnectX3) 
#! Last updated: Sat Apr 18 13:05:53 BST 2015 
#! 

#!############################################################# 
#!#### Modify the options in this section as appropriate ###### 
#!############################################################# 

#! sbatch directives begin here ############################### 
#! Name of the job: 
#SBATCH -J Validation 
#! Which project should be charged: 
#SBATCH -A SOGA 
#! How many whole nodes should be allocated? 
#SBATCH --nodes=1 
#! How many (MPI) tasks will there be in total? (<= nodes*16) 
#SBATCH --ntasks=1 

#!SBATCH --mem=200 

#! How much wallclock time will be required? 
#SBATCH --time=12:00:00 
#SBATCH --mail-user=zl352 
#SBATCH --mail-type=ALL 
#! Uncomment this to prevent the job from being requeued (e.g. if 
#! interrupted by node failure or system downtime): 
##SBATCH --no-requeue 


#! Do not change: 
#SBATCH -p sandybridge 

#! sbatch directives end here (put any additional directives above this line) 

#! Notes: 
#! Charging is determined by core number*walltime. 
#! The --ntasks value refers to the number of tasks to be launched by SLURM only. This 
#! usually equates to the number of MPI tasks launched. Reduce this from nodes*16 if 
#! demanded by memory requirements, or if OMP_NUM_THREADS>1. 
#! Each task is allocated 1 core by default, and each core is allocated 3994MB. If this 
#! is insufficient, also specify --cpus-per-task and/or --mem (the latter specifies 
#! MB per node). 

#! Number of nodes and tasks per node allocated by SLURM (do not change): 
numnodes=$SLURM_JOB_NUM_NODES 
numtasks=$SLURM_NTASKS 
mpi_tasks_per_node=$(echo "$SLURM_TASKS_PER_NODE" | sed -e 's/^\([0-9][0-9]*\).*$/\1/') 
#! ############################################################ 
#! Modify the settings below to specify the application's environment, location 
#! and launch method: 

#! Optionally modify the environment seen by the application 
#! (note that SLURM reproduces the environment at submission irrespective of ~/.bashrc): 
. /etc/profile.d/modules.sh    # Leave this line (enables the module command) 
module purge        # Removes all modules still loaded 
module load default-impi     # REQUIRED - loads the basic environment 

#! Insert additional module load commands after this line if needed: 

#! Full path to application executable: 
application="~/scratch/code7/viv" 

#! Run options for the application: 
options=" > test.e" 

#! Work directory (i.e. where the job will run): 
workdir="$SLURM_SUBMIT_DIR" # The value of SLURM_SUBMIT_DIR sets workdir to the directory 
          # in which sbatch is run. 

#! Are you using OpenMP (NB this is unrelated to OpenMPI)? If so increase this 
#! safe value to no more than 16: 
export OMP_NUM_THREADS=1 

#! Number of MPI tasks to be started by the application per node and in total (do not change): 
np=$(( numnodes * mpi_tasks_per_node ))

#! The following variables define a sensible pinning strategy for Intel MPI tasks - 
#! this should be suitable for both pure MPI and hybrid MPI/OpenMP jobs: 
export I_MPI_PIN_DOMAIN=omp:compact # Domains are $OMP_NUM_THREADS cores in size 
export I_MPI_PIN_ORDER=scatter # Adjacent domains have minimal sharing of caches/sockets 
#! Notes: 
#! 1. These variables influence Intel MPI only. 
#! 2. Domains are non-overlapping sets of cores which map 1-1 to MPI tasks. 
#! 3. I_MPI_PIN_PROCESSOR_LIST is ignored if I_MPI_PIN_DOMAIN is set. 
#! 4. If MPI tasks perform better when sharing caches/sockets, try I_MPI_PIN_ORDER=compact. 


#! Uncomment one choice for CMD below (add mpirun/mpiexec options if necessary): 

#! Choose this for a MPI code (possibly using OpenMP) using Intel MPI. 
#!CMD="mpirun -ppn $mpi_tasks_per_node -np $np $application $options" 

#! Choose this for a pure shared-memory OpenMP parallel program on a single node: 
#! (OMP_NUM_THREADS threads will be created): 
CMD="$application $options" 

#! Choose this for a MPI code (possibly using OpenMP) using OpenMPI: 
#!CMD="mpirun -npernode $mpi_tasks_per_node -np $np $application $options" 


############################################################### 
### You should not have to change anything below this line #### 
############################################################### 

cd $workdir 
echo -e "Changed directory to `pwd`.\n" 

JOBID=$SLURM_JOB_ID 

echo -e "JobID: $JOBID\n======" 
echo "Time: `date`" 
echo "Running on master node: `hostname`" 
echo "Current directory: `pwd`" 

if [ "$SLURM_JOB_NODELIST" ]; then 
     #! Create a machine file: 
     export NODEFILE=`generate_pbs_nodefile` 
     cat $NODEFILE | uniq > machine.file.$JOBID 
     echo -e "\nNodes allocated:\n================" 
     echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'` 
fi 

echo -e "\nnumtasks=$numtasks, numnodes=$numnodes, mpi_tasks_per_node=$mpi_tasks_per_node (OMP_NUM_THREADS=$OMP_NUM_THREADS)" 

echo -e "\nExecuting command:\n==================\n$CMD\n" 

eval $CMD

2016-10-12 zlin

If your job is inherently restartable, all you need to do is call sbatch at the end of your submission script, replacing job_is_done with a command that returns 0 once the job is done:

if ! job_is_done; 
then 
sbatch submit.sh 
fi

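This assumes the submission script is called submit.sh. The job_is_done command could, for example, grep the log file for a specific clue (the computation has converged, the process has finished, and so on).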
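As a minimal sketch of such a check: the function below treats the job as done once a marker string appears in the log. The file name test.e is taken from the script in the question (the application's output is redirected there); the marker text "converged" and the variable names LOG_FILE and DONE_MARKER are assumptions for illustration and must be adapted to what your program actually prints.

```shell
#!/bin/bash
# Hypothetical settings: adjust both to match your program's output.
LOG_FILE="${LOG_FILE:-test.e}"            # log written by the application
DONE_MARKER="${DONE_MARKER:-converged}"   # assumed text printed on completion

job_is_done() {
    # grep -q exits with 0 on a match and non-zero otherwise
    # (including when the log file does not exist yet).
    grep -q -- "$DONE_MARKER" "$LOG_FILE" 2>/dev/null
}

# Intended use at the end of submit.sh, after the application has run:
# if ! job_is_done; then
#     sbatch submit.sh
# fi
```

The resubmission itself is left commented out so the function can be tried on an existing log file before it is wired into the job script.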

Alternatively, you can requeue the job:

job_is_done || scontrol requeue $SLURM_JOB_ID
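A requeue loop like this never stops if job_is_done never succeeds, so it may be worth capping the number of restarts. The sketch below relies on SLURM_RESTART_COUNT, which SLURM sets in the job environment after a requeue; the helper name may_requeue and the limit MAX_RESTARTS (23 restarts plus the first run gives the 24 runs mentioned in the question) are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical limit: 23 restarts after the initial run, i.e. 24 runs in total.
MAX_RESTARTS="${MAX_RESTARTS:-23}"

may_requeue() {
    # SLURM_RESTART_COUNT is unset on the first run, so default it to 0.
    [ "${SLURM_RESTART_COUNT:-0}" -lt "$MAX_RESTARTS" ]
}

# Intended use at the end of submit.sh (job_is_done as described above):
# if ! job_is_done && may_requeue; then
#     scontrol requeue "$SLURM_JOB_ID"
# fi
```

Note that the script in the question has its --no-requeue directive commented out, so requeueing stays permitted as far as the job script is concerned; whether the cluster allows it is a site setting.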

If your program is not inherently restartable, you can use a wrapper such as DMTCP to make it restartable.

2016-10-13 12:18:11 damienfrancois

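Thank you. When my job finishes, a .out file is created, which means the number of .out files increases by one. Can I requeue the job when the number of .out files increases? I am quite new to this. Could you help me with it? – zlin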
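One way the .out count could drive resubmission is sketched below, under some assumptions: the output files use sbatch's default name slurm-<jobid>.out, they all land in the submission directory, and 24 runs are wanted in total. The names MAX_RUNS, OUT_PATTERN, count_out_files and should_resubmit are hypothetical.

```shell
#!/bin/bash
# Hypothetical settings: adjust the limit and pattern to your setup.
MAX_RUNS="${MAX_RUNS:-24}"
OUT_PATTERN="${OUT_PATTERN:-slurm-*.out}"

count_out_files() {
    # Count regular files directly inside directory $1 that match OUT_PATTERN.
    find "$1" -maxdepth 1 -type f -name "$OUT_PATTERN" | wc -l | tr -d ' '
}

should_resubmit() {
    # Succeed while fewer than MAX_RUNS output files exist in directory $1.
    [ "$(count_out_files "$1")" -lt "$MAX_RUNS" ]
}

# Intended use at the end of submit.sh:
# if should_resubmit "$SLURM_SUBMIT_DIR"; then
#     sbatch submit.sh
# fi
```

The output file of the running job normally already exists when this check runs, so the count includes the current job. Old .out files from earlier experiments in the same directory would also be counted and should be moved away first.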
