Specialized Computing Cluster
An Introduction
May 2012

Concepts and Terminology: What is Cluster Computing?
• Traditionally, software has been written for serial computation.
• Cluster computing is the simultaneous use of multiple compute resources (processors) to solve a computational problem.

Concepts and Terminology: Serial Computing
• Only one instruction and data stream is acted on during any one clock cycle.
• Slow, "one at a time" processing.
[Figure: a single computer working through the analysis pipeline one dataset at a time - Brown et al., NeuroImage 2010]

Concepts and Terminology: Batch Computing
• Run multiple datasets on multiple computers simultaneously.
• All processing units execute the same instruction at any given clock cycle.
• Many jobs can be batched together for a linear increase in speed.
[Figure: three computers, each running the analysis pipeline on a different dataset]

Concepts and Terminology: Parallel Computing
• Break up a dataset across different nodes.
• Execute instructions on different data elements.
• Combine the results from the processors.
• Decrease computation time.
[Figure: an analysis requiring 60 hours on a single processor is split across nodes - Costagli et al., NeuroImage 2009]

Concepts and Terminology: Parallel Computing
• Break up a computationally intensive task across nodes.
• Execute different instructions on different data elements.
• Combine the results from the processors.
• Decrease computation time.
(Costagli et al., NeuroImage 2009)

Concepts and Terminology: Why Use Parallel Computing?
• Saves time, and therefore money.
• Energy efficiency: power consumption rises steeply with processor frequency, so many modest cores cost less to run than a few very fast ones.
• Parallelism is the direction computing is headed.

Specialized Computing Cluster (SCC)
A new computing resource dedicated to CAMH researchers and collaborators.
• 22 compute nodes, each with two six-core 2.80 GHz Intel Xeon processors (264 cores total).
• 18 GB RAM per node (1.5 GB per core) - RAM is shared by the processors during job execution.
• 2 x 146 GB hard drives per node.
• Gigabit Ethernet network on all nodes - communication and data transfer between nodes.
• CentOS 6 operating system - Linux (derived from Red Hat). Linux tutorials: Software Carpentry.

SCC Schematic
Dual head nodes:
• Login
• Data transfer
• Compilation
• Send jobs to the cluster
Tape drive for backup: 3 TB native capacity.
Storage space: 54 TB usable, RAID 6.
Switch: coordinates communication between nodes.
Compute nodes:
• All batch computations
• 1 Gb/s Ethernet connection
• Communicate with the head nodes

Getting an Account
All CAMH research faculty and associated researchers are eligible for an account.
1) Agree to the Acceptable Usage Policy
   - Established by the SCC Committee and available on the wiki.
   - All users are expected to be familiar with its terms.
2) Complete the application form
   - The form is available on the wiki.
   - Associated researchers require confirmation from CAMH research faculty.
3) Email the completed form to scc_support@camh.net
   (Account setup should take no longer than 48 business hours.)

Default User Account
Default users may:
• Run 10 jobs simultaneously
• Use a maximum wall-clock time of 48 hours per job

Disk space:
Location    Quota    Time limit   Backed up   Login   Compute
/home       10 GB    perpetual    yes         rw      ro
/project    40 GB    perpetual    no          rw      rw
/scratch    1 TB     3 months     no          rw      rw

Data Management
Backup schedule: an incremental backup of /home is taken daily Mon-Thurs, 10-12 PM; a full backup is taken on Friday, 10-12 PM.
Scratch purge policy: data that have not been touched within 3 months are deleted. It is your responsibility to move data from /scratch to /home. Users are emailed a list of files that are set for deletion in the next 5 days.
Notifications: users receive notifications when nearing the quota for /home or /project. Group PIs are notified regarding /scratch quotas.

Using the SCC
1) Access the SCC
2) Transfer data to the SCC
3) Load modules
4) Compile programs
5) Prepare and submit jobs to the queue
6) Monitor and manage jobs
7) Transfer processed data back to local machines

Using the SCC
1) Access the SCC
First ssh to a login node (not part of the cluster):
ssh <username>@mgmt2.scc.camh.net
The login nodes are gateways: they should not be used for computation and are excluded by the scheduler. Use them only for data transfer and compilation.

Using the SCC
2) Transfer data to the SCC
Move data via 'scp' (secure copy).
Secure-copy a local file from a remote host (your local machine) to a directory on the SCC:
scp <username>@<local_machine>:/<local_file> <SCC_directory>
scc$ scp david@kimsrv:/home/david/file.c /imaging/home/kimel/david/
Make certain that you have read permission on the data you are copying and write permission in the destination directory.
The speed of transfer will vary with daily traffic: we are limited to gigabit Ethernet, over which many users may be transferring substantial datasets. (A directory-transfer sketch follows below.)
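As a concrete illustration, a whole dataset directory can be copied with 'scp -r', and processed results copied back the same way once a job has finished (step 7 above). This is a minimal sketch: the hostname 'myworkstation' and the directory paths are placeholders, not actual SCC paths.

# Copy an entire dataset directory from your local machine to the SCC
# (run on an SCC login node; 'myworkstation' and both paths are examples only).
scc$ scp -r david@myworkstation:/data/study01 /imaging/home/kimel/david/

# Later, pull processed results from the SCC back to your local machine
# (run on your local machine; mgmt2.scc.camh.net is the SCC login node).
local$ scp -r david@mgmt2.scc.camh.net:/imaging/home/kimel/david/study01/results /data/study01/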
Software and Libraries
What software is already installed at login?
• Essential software (e.g. basic libraries) is available from the initial login.
• Additional software is accessed by loading modules.
Modules
• Modules set environment variables (LD_LIBRARY_PATH, PATH, etc.).
• They allow multiple, potentially conflicting, versions to be available.

Module command                Description
module avail                  Lists available modules
module load <module_name>     Loads module <module_name>
module list                   Lists currently loaded modules
module unload <module_name>   Unloads module <module_name>
module help                   Displays the module help file

A comprehensive list of the software and libraries available on the SCC can be found on the Software page of the wiki.

Using the SCC
Using modules:
scc$ module avail
------- /quarantine/Modules/modulefiles -------
gcc/4.6.1    Xlibraries/X11-64(default)    ...
scc$ module load gcc
scc$ module list
Currently Loaded Modulefiles:
  1) gcc/4.6.1
Load frequently used modules in your ~/.bashrc file, i.e. include the line "module load gcc". This way gcc will be loaded "automatically" at login (see the sketch below).
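For example, a minimal ~/.bashrc sketch (the module versions shown are simply the ones used elsewhere in this guide; check 'module avail' for what is actually installed):

# ~/.bashrc (excerpt): load commonly used modules automatically at login.
module load gcc/4.6.1       # compiler; also a prerequisite for other modules
module load python/2.6.2    # example only; load whatever you use routinely

After editing ~/.bashrc, log out and back in (or run 'source ~/.bashrc') and confirm the result with 'module list'.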
Dependencies and Conflicts
Dependencies: many modules require prerequisite modules to be loaded first.
scc$ module load python/2.6.2
python/2.6.2(11):ERROR:151: Module 'python/2.6.2' depends on one of the module(s) 'gcc/4.6.1'
python/2.6.2(11):ERROR:102: Tcl command execution failed: prereq gcc/4.6.1
scc$ module load gcc/4.6.1 python/2.6.2

Conflicts: conflicting versions cannot be loaded simultaneously.
scc$ module load python/2.6.2
python/2.6.2(11):ERROR:161: Module 'python/2.6.2' conflicts with module 'python/2.5.1'
scc$ module unload python/2.5.1
scc$ module load python/2.6.2

Using the SCC
4) Compile programs
Use gcc to compile C code:
scc$ module load gcc
scc$ gcc file.c -o job.exe
Use MPI/OpenMPI for parallel code.

Job Queuing System
Submitting jobs to the scheduler:
• When you submit a job, it is placed in a queue.
• Job priority is based on allocation.
• When sufficient nodes are free to execute a job, it is started on the appropriate compute node(s).
• Jobs remain 'idle' until resources become available.
• Jobs can be temporarily 'blocked' if too many are submitted.
• Jobs must be submitted as a batch script.

Batch Script
Batch scripts are required for job submission to the queue.
Create a shell script, e.g. batch_script.sh:
#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l walltime=1:00:00
cd $PBS_O_WORKDIR
./job.exe data_1 > output
The PBS directives tell the resource manager what the job's requirements are: here, 1 node with 12 processors, for one hour. Typically the script will point to, and loop over, multiple datasets.

Submitting Jobs
Batch scripts are submitted with the 'qsub' command: qsub [options] <script>
scc$ qsub job_script.sh
109.scc_batch
PBS directives can also be given on the qsub command line, such as the number of nodes, processors and walltime: -l nodes=1:ppn=12
A job ID is returned upon successful submission, e.g. 109. It identifies the job in the queue and can be used to monitor its status.

Queues
There are two queues available on the SCC:
1) 'batch' (default) - suitable for submitting batch jobs.
2) 'debug' - a restricted queue designed for code testing.

Queue   Max jobs running   Max wall-time (hr)
batch   10                 48
debug   1                  2

Queue selection is made with the -q flag during job submission:
scc$ qsub -l nodes=1:ppn=12 -q debug job_script.sh
NOTE: there is no queue for serial jobs. It is your responsibility to group together 12 processes to use a node's full power. GNU Parallel can help group processes in this manner.

Monitoring Jobs
qstat and checkjob:
• Show queue status: qstat
scc$ qstat
Job id      Name      User      Time Use   S   Queue
----------- --------- --------- ---------- --- ------
2961983     JobName   drotenb   0          Q   batch
• Status: R (running), Q (queued), E (error).
• Show an individual job's status: checkjob jobid
• See more details of the job: checkjob -v jobid (e.g., why is my job blocked?)
showq
• See all the jobs in the queue: showq
• See your jobs in the queue: showq -u user
showstart
• Estimate when a queued job may start: showstart jobid

Output and Error Files
Output and error files are generated for each job run on the SCC.
Error logs (JobName.eJobID): can be very instructive during debugging.
Output logs (JobName.oJobID): contain useful summaries pertaining to job execution.
These files are automatically written to the directory from which the job was submitted.

canceljob
If a job is failing, or you spot a mistake, call: canceljob jobid

Example: Single Job
job_script.sh:
#!/bin/bash
#PBS -l nodes=8:ppn=12,walltime=1:00:00
#PBS -N JobName
cd $PBS_O_WORKDIR
module load module_1
./mycode > output

scc$ module load gcc
scc$ gcc code.c -o mycode
scc$ mkdir scratch/wrkdir
scc$ cp mycode scratch/wrkdir/
scc$ cd scratch/wrkdir
scc$ qsub job_script.sh
210.scc
scc$ qstat
JobID      Name      User      Time Use   S   Queue
---------- --------- --------- ---------- --- ------
210.scc    JobName   drotenb   0          Q   batch
scc$ ls
JobName.e210  JobName.o210  job_script.sh  mycode  output

Example: Batch Job
scc$ module load gcc
scc$ gcc code.c -o mycode
scc$ mkdir scratch/example2
scc$ cp mycode scratch/example2
scc$ cd scratch/example2
scc$ cat > joblist.txt
mkdir run1; cd run1; ../mycode 1 > out
mkdir run2; cd run2; ../mycode 2 > out
mkdir run3; cd run3; ../mycode 3 > out
...
mkdir run30; cd run30; ../mycode 30 > out
scc$ cat > job_script.sh
#!/bin/bash
#PBS -l nodes=1:ppn=12,walltime=24:00:00
#PBS -N JobName_2
cd $PBS_O_WORKDIR
module load gnu-parallel
parallel -j 8 < joblist.txt
scc$ qsub job_script.sh
301.scc
scc$ ls
JobName_2.e301  JobName_2.o301  job_script.sh  joblist.txt  mycode  run1/  run2/  run3/ ...

Debugging
1) Always run a test first in the debug queue
   • Ascertain whether the job is feasible.
   • Check whether the job runs to completion.
2) Submit the job via qsub with the '-I' (interactive) flag (see the sketch below)
   • Extremely useful for debugging.
   • Standard output and error are displayed in the terminal.
3) Check the error logs (JobName.eJobID)
   • They contain descriptions of errors.
   • Trace back to determine why the job failed.
4) Check permissions
   • Are you running jobs from /home? It is read-only on the compute nodes.
   • Always run jobs from /project or /scratch; these are read-write on the compute nodes.
   • Is your script executable?
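As a sketch of point 2 above, an interactive debug session can be requested with 'qsub -I'; the resource request below is only an example and should be adjusted to your own job:

# Request an interactive session in the debug queue: 1 node, 12 processors, 1 hour.
scc$ qsub -I -q debug -l nodes=1:ppn=12,walltime=1:00:00

# Once the session starts you are on a compute node. $PBS_O_WORKDIR is the
# directory you submitted from (use /project or /scratch, not the read-only /home).
scc$ cd $PBS_O_WORKDIR
scc$ ./mycode 1 > out    # run the job by hand and watch stdout/stderr directly
scc$ exit                # release the node when finished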
Additional Tips
• Specify the walltime as accurately as you can.
• Test scaling behaviour - start small and work up.
• Avoid reading and writing many small amounts of data to disk.
• Do not submit single serial jobs.
• Do not keep large numbers of files in your directory - use tar to bundle and compress them.

Scheduled Down-Time
Cluster maintenance is inevitable: OS backups, upgrades, hardware/software installs, etc.
Scheduled down-time: Thursday 3:00-6:00.
In the event of down-time, scheduled or otherwise, users will be notified by email and on the wiki.

SCC Wiki
SCC Wiki: User's Guide
SCC Wiki: Ticket Tracker

SciNet
SciNet "is a consortium for High-Performance Computing consisting of researchers at U. of T. and its associated hospitals."
Accounts are available to any researcher at a Canadian university.
General Purpose Cluster (GPC):
• 3864 nodes, each with 8 Intel x86-64 cores @ 2.53/2.66 GHz
• 30,912 cores in total
• InfiniBand network (faster than Ethernet)
• #1 in Canada - a massive computing resource!

Resources
SCC Wiki: https://info2.camh.net/scc/index.php/Main_Page
SciNet: http://wiki.scinethpc.ca/wiki/index.php/SciNet_User_Support_Library
Linux: http://www.linuxhelp.net/ (many resources)
Software Carpentry: http://software-carpentry.org/blog/
Contact: scc_support@camh.net or David_Rotenberg@camh.net