
Specialized Computing Cluster
An Introduction
May 2012
Concepts and Terminology:
What is Cluster Computing?
• Traditionally, software has been written for serial computation.
• Cluster computing is the simultaneous use of
multiple compute resources (processors) to solve a
computational problem.
Concepts and Terminology:
Serial Computing
• Only one instruction and data
stream is acted on during any one
clock cycle.
[Figure: a single analysis pipeline running on one computer]
• Slow, “one at a time” processing.
Brown et al., NeuroImage 2010
Concepts and Terminology:
Batch Computing
[Figure: three computers, each running its own copy of the analysis pipeline on a different dataset]
• Run multiple datasets on multiple computers simultaneously.
• Each computer runs the same analysis pipeline independently on its own dataset.
• Can batch many jobs together to gain a roughly linear increase in throughput.
Concepts and Terminology:
Parallel Computing
[Figure: a single dataset split across three computers, each running part of the analysis pipeline]
• Break up a dataset across different nodes.
• Execute instructions on different data elements.
• Combine results of processors.
• Decrease computation time.
Costagli et al., NeuroImage 2009
Concepts and Terminology:
Parallel Computing
[Figure: an analysis requiring 60 hrs on one processor]
Costagli et al., NeuroImage 2009
Concepts and Terminology:
Parallel Computing
• Break up a computationally intensive task across nodes.
• Execute different instructions on different data elements.
• Combine results of processors.
• Decrease computation time.
Costagli et al., NeuroImage 2009
Concepts and Terminology:
Why Use Parallel Computing?
• Saves time
- Saves money
• Energy efficiency
- Power consumption grows much faster than linearly with processor frequency, so many slower cores are more energy-efficient than one very fast core
- Cost savings
• The current and future direction of computing
Specialized Computing Cluster (SCC)
New computing resource dedicated to CAMH researchers and collaborators
• 22 compute nodes, each with two 6-core 2.80 GHz Intel Xeon processors
– 12 cores per node, 264 cores total
• 18 GB RAM per node (1.5 GB per core)
– RAM is shared by the processors during job execution
• 2 x 146 GB hard drives
• Gigabit Ethernet network on all nodes
– Communication and data transfer between nodes
• CentOS 6 operating system
– Linux (derived from Red Hat)
– Linux tutorials: Software Carpentry
SCC Schematic
Dual Head Nodes:
• Login
• Data transfer
• Compilation
• Send jobs to cluster
Tape Drive for Backup:
3TB native capacity
Storage Space:
• 54 TB Usable
• RAID 6
SCC Schematic
Switch:
Coordinates communication
Compute Nodes:
• All batch computations
• 1 Gb/s (Gigabit) Ethernet connection
• Communicate with Head Nodes
Getting an Account
All CAMH research faculty and associated researchers are eligible for an account.
1) Agree to the Acceptable Usage Policy
- Established by the SCC Committee, and made available on the wiki
- All users are expected to be familiar with the terms
2) Complete Application Form
- Form available on wiki
- Associated researchers require confirmation from CAMH Research faculty
3) Email completed form to scc_support@camh.net
(Account setup should take no longer than 48 business hours)
Default User Account
Default users can:
• Run 10 jobs simultaneously
• Run each job for a maximum wall clock time of 48 hours
Disk Space:
Location    Quota    Time-Limit   Backup   Login   Compute
/home       10 GB    perpetual    yes      rw      ro
/project    40 GB    perpetual    no       rw      rw
/scratch    1 TB     3 months     no       rw      rw
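To see how much space a directory is using on any of these filesystems, a generic command such as du can be used (the path below is only an illustration):
scc$ du -sh /imaging/home/kimel/david/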
Data Management
Backup Schedule:
An incremental backup of /home is taken daily Mon-Thurs 10-12PM
A full backup is taken on Friday 10-12PM
Scratch Purge Policy:
Data that have not been touched within 3 months are deleted
It is your responsibility to move data from /scratch to /home.
Users are emailed a list of files that are set for deletion in the next 5 days.
Notifications:
Users receive notifications when nearing the quota for /home, or /project.
Group PIs are notified regarding /scratch quotas.
Using the SCC
1) Access the SCC
2) Transfer data to the SCC
3) Load modules
4) Compile programs
5) Prepare and submit jobs to the queue
6) Monitor and manage jobs
7) Transfer processed data back to local machines
Using the SCC
1) Access the SCC
First, ssh to a login node (not part of the compute cluster):
ssh <username>@mgmt2.scc.camh.net
The login nodes are gateways; they should not be used for computation
and are excluded by the scheduler. They are only to be used for data
transfer and compilation.
Using the SCC
2) Transfer data to the SCC
Move data via ‘scp’ (secure copy).
Securely copy file(s) from a remote host (e.g. your local machine) to a directory on the SCC:
scp <username>@<local_machine>:/<local_file> <SCC_directory>
scc$ scp david@kimsrv:/home/david/file.c /imaging/home/kimel/david/
Make certain that you have read permission on the data you are copying and write
permissions in the destination directory.
The speed of transfer will vary depending on daily traffic. Remember we are limited to
1 Gb/s (Gigabit) Ethernet, over which many users may be transferring substantial datasets.
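To transfer an entire directory of data, scp's recursive flag can be used; the directory name below is hypothetical:
scc$ scp -r david@kimsrv:/home/david/study_data /imaging/home/kimel/david/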
Software and Libraries
What software is already installed on login?
• Essential software is available from initial login (i.e. basic libraries)
• Additional software is accessed by loading modules
Modules
• Modules set environment variables (LD_LIBRARY_PATH, PATH, etc.)
• Allow multiple, potentially conflicting, versions of software to be available.
Module Command                 Description
module avail                   Lists available modules
module load <module_name>      Loads module <module_name>
module list                    Lists currently loaded modules
module unload <module_name>    Unloads module <module_name>
module help                    Displays the module help file
A comprehensive list of the software and libraries available on the SCC can be
found on the Software page of the wiki.
Using the SCC
Using Modules:
scc$ module avail
------- /quarantine/Modules/modulefiles -------
gcc/4.6.1
Xlibraries/X11-64(default)
…
scc$ module load gcc
scc$ module list
Currently Loaded Modulefiles:
1) gcc/4.6.1
Load frequently used modules in your ~/.bashrc file, i.e. include the line
‘module load gcc’. This way gcc will be loaded “automatically” on login (see the sketch below).
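A minimal sketch of the relevant ~/.bashrc lines, using modules that appear elsewhere in this guide:
# ~/.bashrc -- load frequently used modules at every login
module load gcc/4.6.1
module load python/2.6.2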
Dependencies and Conflicts
Dependencies:
Many modules require prerequisite modules to be loaded first.
scc$ module load python/2.6.2
python/2.6.2(11):ERROR:151: Module 'python/2.6.2' depends on one of the module(s) 'gcc/4.6.1'
python/2.6.2(11):ERROR:102: Tcl command execution failed: prereq gcc/4.6.1
scc$ module load gcc/4.6.1 python/2.6.2
Conflicts:
Conflicting versions cannot be loaded simultaneously.
scc$ module load python/2.6.2
python/2.6.2(11):ERROR:161: Module 'python/2.6.2' conflicts with module 'python/2.5.1'
scc$ module unload python/2.5.1
scc$ module load python/2.6.2
Using the SCC
4) Compile Program
Use gcc to compile C code:
scc$ module load gcc
scc$ gcc file.c -o job.exe
Use MPI/OpenMPI for parallel code (see the sketch below).
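A hedged sketch of compiling and running an MPI program; the 'openmpi' module name and the mpi_code.c source file are assumptions, so check 'module avail' for the exact names on the SCC:
scc$ module load gcc openmpi
scc$ mpicc mpi_code.c -o mpi_job.exe
Inside a batch script the program would then be launched with, e.g., mpirun -np 12 ./mpi_job.exe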
Job Queuing System
Submitting jobs to the scheduler.
• When you submit a job, it gets placed in a queue.
• Job priority is based on allocation.
• When sufficient nodes are free to execute a job, it is started
on the appropriate compute node(s).
• Jobs remain ‘idle’ until resources become available.
• Jobs can be temporarily ‘blocked’ if too many are submitted.
• Jobs must be submitted as a batch script
Batch Script
Batch scripts are necessary for job submission to the queue.
Create a shell script e.g. batch_script.sh:
#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l walltime=1:00:00
cd $PBS_O_WORKDIR
./job.exe data_1 > output
PBS directives tell the resource manager what the job's requirements are:
here, 1 node with 12 processors, for one hour.
Typically the script will point to and loop over multiple datasets, as in the sketch below.
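A minimal sketch of such a loop, using hypothetical datasets data_1 to data_3 and the job.exe program from above:
#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l walltime=1:00:00
cd $PBS_O_WORKDIR
# process each (hypothetical) dataset in turn, one output file per dataset
for dataset in data_1 data_2 data_3; do
    ./job.exe "$dataset" > "output_${dataset}"
done
Note that this loop processes the datasets one after another; to keep all 12 cores busy, the iterations need to run concurrently, e.g. with GNU Parallel as in the batch-job example later in this guide.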
Submitting Jobs
Batch scripts are submitted with the ‘qsub’ command:
qsub [options] <script>
scc$ qsub job_script.sh
109.scc_batch
PBS directives can also be specified on the qsub command line, such as the
number of nodes, processors, and walltime: -l nodes=1:ppn=12
A jobid is returned upon successful job submission, e.g. 109. This is
used as identification for the job in the queue and can be used to monitor
its status.
Queues
There are two queues available on the SCC
1) ‘batch’ (Default) Suitable for submitting batch jobs.
2) ‘debug’ Restricted queue designed for code testing
Queue    Max Jobs Running    Max Wall-Time (hr)
batch    10                  48
debug    1                   2
Queue selection can be achieved using the -q flag during job submission.
scc$ qsub -l nodes=1:ppn=12 -q debug job_script.sh
NOTE: There is no queue for serial jobs. It is your responsibility to
group together 12 processes to use the node's full power.
GNU Parallel can help group together processes in this manner (see the Example: Batch Job slide later in this guide).
Monitoring Jobs
qstat and checkjob:
Show queue status: qstat
scc$ qstat
Job id      Name      User      Time Use   S   Queue
----------- --------- --------- ---------- --- -------
2961983     JobName   drotenb   0          Q   batch
• Status: R (running), Q (queued), E (exiting)
• Show individual job status: checkjob jobid
• See more details of the job: checkjob -v jobid
(e.g., why is my job blocked?)
showq
• See all the jobs in the queue: showq
• See your jobs in the queue: showq -u user
showstart
• Estimate when a queued job may start: showstart jobid
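For example, using the jobid and username that appear elsewhere in this guide (illustrative values only):
scc$ checkjob 109          # status of a single job
scc$ checkjob -v 109       # verbose details, e.g. why a job is blocked
scc$ showq -u drotenb      # your jobs in the queue
scc$ showstart 109         # estimated start time of a queued job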
Output and Error Files
Output/Error files are generated for each job run on the SCC.
Error Logs (JobName.eJobID):
Can be very instructive during debugging.
Output Logs (JobName.oJobID):
Contain useful summaries pertaining to job execution.
These files are automatically written to the directory from which the
job was submitted.
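For example, a job submitted with #PBS -N JobName that receives jobid 109 would produce:
JobName.o109   (output log)
JobName.e109   (error log)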
canceljob
If jobs are failing, or you spot a mistake, call: canceljob jobid
Example: Single Job
job_script.sh:
#!/bin/bash
#PBS -l nodes=8:ppn=12,walltime=1:00:00
#PBS -N JobName
cd $PBS_O_WORKDIR
module load module_1
./mycode > output
scc$ module load gcc
scc$ gcc code.c -o mycode
scc$ mkdir scratch/wrkdir
scc$ cp mycode scratch/wrkdir/
scc$ cd scratch/wrkdir
scc$ qsub job_script.sh
210.scc
scc$ qstat
Job id      Name      User      Time Use   S   Queue
----------- --------- --------- ---------- --- -------
210.scc     JobName   drotenb   0          Q   batch
scc$ ls
JobName.e210  JobName.o210  mycode  job_script.sh  output
Example: Batch Job
scc$ module load gcc
scc$ gcc code.c -o mycode
scc$ mkdir scratch/example2
scc$ cp mycode scratch/example2
scc$ cd scratch/example2
scc$ cat > joblist.txt
mkdir run1; cd run1; ../mycode 1 > out
mkdir run2; cd run2; ../mycode 2 > out
mkdir run3; cd run3; ../mycode 3 > out
. . .
mkdir run30; cd run30; ../mycode 30 > out
scc$ cat > job_script.sh
#!/bin/bash
#PBS -l nodes=1:ppn=12,walltime=24:00:00
#PBS -N JobName_2
cd $PBS_O_WORKDIR
module load gnu-parallel
parallel -j 8 < joblist.txt
scc$ qsub job_script.sh
301.scc
scc$ ls
JobName_2.e301 JobName_2.o301 joblist.txt mycode code.c
job_script.sh run1/ run2/ run3/...
Debugging
1) Always run a test job first in the debug queue
• Ascertain whether jobs are feasible
• Check whether job runs to completion
2) Submit the job via qsub with the ‘-I’ (interactive) flag
• Extremely useful for debugging (see the sketch after this list)
• The standard output and error will be displayed in the terminal
3) Check Error Logs (JobID.e)
• Contains descriptions of errors
• Trace back to determine why the job has failed
4) Check permissions
• Are you running jobs from /home? This is RO on compute nodes.
• Always run jobs from /project or /scratch. These are RW on compute nodes.
• Is your script executable?
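A minimal sketch of an interactive debugging session; the resource request and program name are illustrative:
scc$ qsub -I -l nodes=1:ppn=12,walltime=1:00:00 -q debug
# ... once the interactive session starts on a compute node ...
cd $PBS_O_WORKDIR
./job.exe data_1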
Additional Tips
• Specify walltime precisely
• Test scaling behavior
– Start small, work up
• Avoid reading and writing lots of small amounts of data to
disk.
• Do not submit single serial jobs.
• Do not keep lots of files in your directory
– Use tar to compress files (see the example below)
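For example, a set of run directories (hypothetical names, matching the batch-job example) can be bundled, compressed, and checked like this:
scc$ tar -czvf runs.tar.gz run1/ run2/ run3/
scc$ tar -tzvf runs.tar.gz    # list the archive contents to verify before deleting the originals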
Scheduled Down-Time
Cluster Maintenance is inevitable:
OS backups, upgrades, hardware/software installs, etc.
Scheduled ‘Down-Time’: Thursday 3:00-6:00
In the event of down-time, scheduled or otherwise, users will be notified by email and on the wiki.
SCC Wiki
SCC Wiki: User’s Guide
SCC Wiki: Ticket Tracker
SciNet
SciNet “is a consortium for High-Performance Computing
consisting of researchers at U. of T. and its associated hospitals.”
Accounts are available to any researcher at a Canadian university.
General Purpose Cluster (GPC)
• 3864 nodes, each with 8 Intel x86-64 cores @ 2.53/2.66 GHz
• 30,912 cores in total
• InfiniBand network (faster than Ethernet)
• #1 in Canada
Massive computing resource!
Resources
SCC Wiki: https://info2.camh.net/scc/index.php/Main_Page
SciNet: http://wiki.scinethpc.ca/wiki/index.php/SciNet_User_Support_Library
Linux: http://www.linuxhelp.net/ (Many Resources)
Software Carpentry: http://software-carpentry.org/blog/
Contact: scc_support@camh.net or David_Rotenberg@camh.net