Introduction to SLURM
scitas.epfl.ch
October 9, 2014

Bellatrix
- Frontend at bellatrix.epfl.ch
- 16 x 2.2 GHz cores per node
- 424 nodes with 32GB
- Infiniband QDR network
- The batch system is SLURM

Castor
- Frontend at castor.epfl.ch
- 16 x 2.6 GHz cores per node
- 50 nodes with 64GB
- 2 nodes with 256GB
- For sequential jobs (Matlab etc.)
- The batch system is SLURM
- RedHat 6.5

Deneb (October 2014)
- Frontend at deneb.epfl.ch
- 16 x 2.6 GHz cores per node
- 376 nodes with 64GB
- 8 nodes with 256GB
- 2 nodes with 512GB and 32 cores
- 16 nodes with 4 Nvidia K40 GPUs
- Infiniband QDR network

Storage
- /home: filesystem with per-user quotas; backed up; for important things (source code, results and theses)
- /scratch: high-performance "temporary" space; not backed up; organised by laboratory

Connection
- Start the X server (automatic on a Mac)
- Open a terminal
- ssh -Y username@castor.epfl.ch
- Try the following commands:
    id
    pwd
    quota
    ls /scratch/<group>/<username>

The batch system
- Goal: take a list of jobs and execute them when appropriate resources become available
- SCITAS uses SLURM on its clusters: http://slurm.schedmd.com
- The configuration depends on the purpose of the cluster (serial vs parallel)

sbatch
- The fundamental command is sbatch
- sbatch submits jobs to the batch system
- Suggested workflow: create a short job script, then submit it to the batch system

sbatch - exercise
- Copy the first two examples to your home directory:
    cp /scratch/examples/ex1.run .
    cp /scratch/examples/ex2.run .
- Open the file ex1.run with your editor of choice

ex1.run
    #!/bin/bash
    #SBATCH --workdir /scratch/<group>/<username>
    #SBATCH --nodes 1
    #SBATCH --ntasks 1
    #SBATCH --cpus-per-task 1
    #SBATCH --mem 1024

    sleep 10
    echo "hello from $(hostname)"
    sleep 10

ex1.run (explained)
- #SBATCH is a directive to the batch system
- --nodes 1: the number of nodes to use (on Castor this is limited to 1)
- --ntasks 1: the number of tasks (in an MPI sense) to run per job
- --cpus-per-task 1: the number of cores per aforementioned task
- --mem 4096: the memory required per node, in MB
- --time: the time required, e.g. --time 12:00:00 (12 hours) or --time 2-6 (two days and six hours)

Running ex1.run
- The job is assigned a default runtime of 15 minutes
    $ sbatch ex1.run
    Submitted batch job 439
    $ cat /scratch/<group>/<username>/slurm-439.out
    hello from c03

What went on?
    sacct -j <JOB_ID>
    sacct -l -j <JOB_ID>
- Or, more usefully:
    Sjob <JOB_ID>

Cancelling jobs
- To cancel a specific job:
    scancel <JOB_ID>
- To cancel all your jobs:
    scancel -u <username>

ex2.run
    #!/bin/bash
    #SBATCH --workdir /scratch/<group>/<username>
    #SBATCH --nodes 1
    #SBATCH --ntasks 1
    #SBATCH --cpus-per-task 8
    #SBATCH --mem 122880
    #SBATCH --time 00:30:00

    /scratch/examples/linpack/runme_1_45k
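A point worth keeping in mind while experimenting with these scripts: sbatch options given on the command line override the matching #SBATCH directives inside the script. A minimal sketch (the values shown are only an illustration, not recommended settings):

    # Resubmit ex1.run with a longer time limit and more memory,
    # without editing the script itself:
    sbatch --time 01:00:00 --mem 2048 ex1.run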
What's going on?
    squeue
    squeue -u <username>
    Squeue
    Sjob <JOB_ID>
    scontrol -d show job <JOB_ID>
    sinfo

squeue and Squeue
- squeue shows the queue; job states include Pending (with reasons such as Resources or Priority) and Running
    squeue | grep <JOB_ID>
    squeue -j <JOB_ID>
    Squeue <JOB_ID>

Sjob
    $ Sjob <JOB_ID>

    JobID        JobName   Cluster   Account    Partition  Timelimit  User    Group
    -----------  --------  --------  ---------  ---------  ---------  ------  ---------
    31006        ex1.run   castor    scitas-ge  serial     00:15:00   jmenu   scitas-ge
    31006.batch  batch     castor    scitas-ge

    Submit               Eligible             Start                End
    -------------------  -------------------  -------------------  -------------------
    2014-05-12T15:55:48  2014-05-12T15:55:48  2014-05-12T15:55:48  2014-05-12T15:56:08
    2014-05-12T15:55:48  2014-05-12T15:55:48  2014-05-12T15:55:48  2014-05-12T15:56:08

    Elapsed    ExitCode  State
    ---------  --------  ---------
    00:00:20   0:0       COMPLETED
    00:00:20   0:0       COMPLETED

    NCPUS  NTasks  NodeList  UserCPU   SystemCPU  AveCPU    MaxVMSize
    -----  ------  --------  --------  ---------  --------  ---------
    1              c04       00:00:00  00:00.001
    1      1       c04       00:00:00  00:00.001  00:00:00  207016K

scontrol
    $ scontrol -d show job <JOB_ID>

    $ scontrol -d show job 400
    JobId=400 Name=s1.job
    UserId=user(123456) GroupId=group(654321)
    Priority=111 Account=scitas-ge QOS=normal
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
    DerivedExitCode=0:0
    RunTime=00:03:39 TimeLimit=00:15:00 TimeMin=N/A
    SubmitTime=2014-03-06T09:45:27 EligibleTime=2014-03-06T09:45:27
    StartTime=2014-03-06T09:45:27 EndTime=2014-03-06T10:00:27
    PreemptTime=None SuspendTime=None SecsPreSuspend=0
    Partition=serial AllocNode:Sid=castor:106310
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=c03 BatchHost=c03
    NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
    Nodes=c03 CPU_IDs=0 Mem=1024
    MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
    Features=(null) Gres=(null) Reservation=(null)
    Shared=OK Contiguous=0 Licenses=(null) Network=(null)
    Command=/home/<user>/jobs/s1.job
    WorkDir=/scratch/<group>/<user>

Modules
- Modules make your life easier:
    module avail
    module show <take your pick>
    module load <take your pick>
    module list
    module purge
    module list
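Because each cluster provides its software through modules, a job script normally sets up its environment explicitly rather than relying on whatever is loaded at the login prompt. A minimal sketch of that pattern, using a module name that appears later in this course (the versions actually installed may differ):

    module purge                 # start from a clean environment
    module load intelmpi/4.1.3   # load the toolchain the code was built with
    module list                  # record the loaded modules in the job output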
ex3.run - Mathematica
- Copy the following files to your chosen directory:
    cp /scratch/examples/ex3.run .
    cp /scratch/examples/mathematica.in .
- Submit ex3.run to the batch system and see what happens...

ex3.run
    #!/bin/bash
    #SBATCH --ntasks 1
    #SBATCH --cpus-per-task 1
    #SBATCH --nodes 1
    #SBATCH --mem 4096
    #SBATCH --time 00:05:00

    echo STARTING AT `date`

    module purge
    module load mathematica/9.0.1

    math < mathematica.in

    echo FINISHED at `date`

Compiling the ex4.* source files
- Copy the following files to your chosen directory:
    /scratch/examples/ex4_README.txt
    /scratch/examples/ex4.c
    /scratch/examples/ex4.cxx
    /scratch/examples/ex4.f90
    /scratch/examples/ex4.run
- Then compile them with:
    module load intelmpi/4.1.3
    mpiicc   -o ex4_c   ex4.c
    mpiicpc  -o ex4_cxx ex4.cxx
    mpiifort -o ex4_f90 ex4.f90

The 3 methods to get interactive access (1/3)
- To schedule an allocation, use salloc with exactly the same resource options as sbatch
- You then arrive at a new prompt, still on the submission node, but with srun you can access the allocated resources:
    eroche@castor:hello > salloc -N 1 -n 2
    salloc: Granted job allocation 1234
    bash-4.1$ hostname
    castor
    bash-4.1$ srun hostname
    c03
    c03

The 3 methods to get interactive access (2/3)
- To get a prompt on the allocated machine, use the "--pty" option with "srun" and then "bash -i" (or "tcsh -i") to get the shell:
    eroche@castor > salloc -N 1 -n 1
    salloc: Granted job allocation 1235
    eroche@castor > srun --pty bash -i
    bash-4.1$ hostname
    c03

The 3 methods to get interactive access (3/3)
- This is the least elegant method, but it is the one that allows running X11 applications:
    eroche@bellatrix > salloc -n 1 -c 16 -N 1
    salloc: Granted job allocation 1236
    bash-4.1$ srun hostname
    c04
    bash-4.1$ ssh -Y c04
    eroche@c04 >

Dynamic libraries used in an application
- "ldd" displays the libraries an executable file depends on:
    jmenu@castor:~/COURS > ldd ex4_f90
        linux-vdso.so.1 => (0x00007fff4b905000)
        libmpigf.so.4 => /opt/software/intel/14.0.1/intel64/lib/libmpigf.so.4 (0x00007f556cf88000)
        libmpi.so.4 => /opt/software/intel/14.0.1/intel64/lib/libmpi.so.4 (0x00007f556c91c000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003807e00000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003808a00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003808200000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003807600000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003807a00000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000380c600000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003807200000)

ex4.run
    #!/bin/bash
    ...
    module purge
    module load intelmpi/4.1.3
    module list
    echo

    LAUNCH_DIR=/scratch/scitas-ge/jmenu
    EXECUTABLE="./ex4_f90"

    echo "--> LAUNCH_DIR = ${LAUNCH_DIR}"
    echo "--> EXECUTABLE = ${EXECUTABLE}"
    echo
    echo "--> ${EXECUTABLE} depends on the following dynamic libraries:"
    ldd ${EXECUTABLE}
    echo

    cd ${LAUNCH_DIR}
    srun ${EXECUTABLE}
    ...

The debug QoS
- For priority access when debugging:
    sbatch --qos debug ex1.run
- Limits on Castor: 30 minutes walltime, 1 job per user, 16 cores between all users
- To display the available QoS's:
    sacctmgr show qos

cgroups (Castor)
- General: cgroups ("control groups") is a Linux kernel feature to limit, account for, and isolate the resource usage (CPU, memory, disk I/O, etc.) of process groups
- SLURM: Linux cgroups apply constraints to the CPUs and memory that a job can use
- They are generated automatically from the resource requests given to SLURM
- They are destroyed automatically at the end of the job, releasing all the resources used
- Even if physical memory is still available, a task will be killed if it tries to exceed the limits of its cgroup! (a sketch for checking a job's real memory use follows below)
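Since a job that exceeds its memory request is killed by its cgroup, it helps to check how much memory a finished job actually used before choosing --mem for the next run. A minimal sketch using standard sacct accounting fields (replace <JOB_ID> with a real job id):

    # MaxRSS / MaxVMSize report the peak memory recorded for each job step
    sacct -j <JOB_ID> --format=JobID,Elapsed,MaxRSS,MaxVMSize,State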
System process view
- Two tasks running on the same node, as seen with "ps auxf":
    root  177873  slurmstepd: [1072]
    user  177877   \_ /bin/bash /var/spool/slurmd/job01072/slurm_script
    user  177908       \_ sleep 10
    root  177890  slurmstepd: [1073]
    user  177894   \_ /bin/bash /var/spool/slurmd/job01073/slurm_script
    user  177970       \_ sleep 10
- Check memory, thread and core usage with "htop"

Fair share (1/3)
- The scheduler is configured to give all groups a share of the computing power
- Within each group the members have an equal share by default:
    jmenu@castor:~ > sacctmgr show association where account=lacal format=Account,Cluster,User,GrpNodes,QOS,DefaultQOS,Share tree
       Account    Cluster       User  GrpNodes           QOS   Def QOS   Share
    ----------  ---------  ---------  --------  ------------  --------  ------
         lacal     castor                              normal                1
         lacal     castor   aabecker            debug,normal    normal      1
         lacal     castor   kleinjun            debug,normal    normal      1
         lacal     castor   knikitin            debug,normal    normal      1
- Priority is based on recent usage, which is forgotten over time (half-life)
- Fair share comes into play when the resources are heavily used

Fair share (2/3)
- Job priority is a weighted sum of various factors:
    jmenu@castor:~ > sprio -w
       JOBID  PRIORITY     AGE  FAIRSHARE     QOS
     Weights             1000      10000   100000

    jmenu@bellatrix:~ > sprio -w
       JOBID  PRIORITY     AGE  FAIRSHARE  JOBSIZE     QOS
     Weights             1000      10000       100   100000
- To compare jobs' priorities:
    jmenu@castor:~ > sprio -j80833,77613
       JOBID  PRIORITY     AGE  FAIRSHARE     QOS
       77613       145     146          0       0
       80833      9204      93       9111       0

Fair share (3/3)
- FairShare values range from 0.0 to 1.0:
    Value   Meaning
    ≈ 0.0   you used much more resources than you were granted
    0.5     you got what you paid for
    ≈ 1.0   you used nearly no resources

    jmenu@bellatrix:~ > sshare -a -A lacal
    Accounts requested: lacal
         Account       User  Raw Shares  Norm Shares   Raw Usage  Effectv Usage  FairShare
    ------------  ---------  ----------  -----------  ----------  -------------  ---------
           lacal                    666     0.097869  1357691548       0.256328   0.162771
           lacal   boissaye           1     0.016312           0       0.042721   0.162771
           lacal     janson           1     0.016312           0       0.042721   0.162771
           lacal    jetchev           1     0.016312           0       0.042721   0.162771
           lacal   kleinjun           1     0.016312  1357691548       0.256328   0.000019
           lacal   pbottine           1     0.016312           0       0.042721   0.162771
           lacal    saltini           1     0.016312           0       0.042721   0.162771
- More information at: http://schedmd.com/slurmdocs/priority_multifactor.html

Helping yourself
- man pages are your friend!
    man sbatch
    man sacct
    man gcc
    module load intel/14.0.1
    man ifort

Getting help
- If you still have problems, send a message to: 1234@epfl.ch
- Please start the subject with HPC for automatic routing to the HPC team
- Please give as much information as possible, including:
    - the job id
    - the directory location and name of the submission script
    - where the "slurm-*.out" file is to be found
    - how the "sbatch" command was used to submit it
    - the output of the "env" and "module list" commands
  (a short sketch for collecting this information appears after the Appendix)

Appendix
- Change your shell at: https://dinfo.epfl.ch/cgi-bin/accountprefs
- Scitas web site: http://scitas.epfl.ch
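To make a support request easier to act on, the details listed on the "Getting help" slide can be gathered into a single file before writing to 1234@epfl.ch. This is only a hedged sketch; the job id, script name and output file are placeholders to adapt to your own job:

    JOBID=439                                   # the id of the problematic job
    {
      echo "Job id         : ${JOBID}"
      echo "Submission dir : $(pwd)"
      echo "Script         : $(pwd)/ex1.run"    # adjust to the real script
      echo "Submitted with : sbatch ex1.run"
      echo "Output file    : $(pwd)/slurm-${JOBID}.out"
      echo "--- env ---";         env
      echo "--- module list ---"; module list 2>&1
    } > hpc_support_info.txt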