Postgraduate Course: Fundamentals of HPC System Administration (EPCD11021)
Course Outline
School | School of Informatics |
College | College of Science and Engineering |
Credit level (Normal year taken) | SCQF Level 11 (Postgraduate) |
Course type | Online Distance Learning |
Availability | Not available to visiting students |
SCQF Credits | 10 |
ECTS Credits | 5 |
Summary | High Performance Computing (HPC) is a multidisciplinary field combining complex computer architectures, system software, parallel programming languages, algorithms, tools and scientific applications. It focuses on solving computationally intensive problems in parallel by distributing them across a large number of processing units. State-of-the-art HPC systems are highly heterogeneous (including CPUs, GPUs and other compute components) and are used by a wide range of scientific communities.
This course covers the basics of HPC system administration, focusing on the elements that allow these massively parallel systems to be used effectively by many users with different requirements at the same time. Concepts covered include HPC networks and interconnects, parallel file systems, scheduling and queue management, and software and user environments. |
Course description |
This module will be taught by administrators with experience across multiple Tier-1 and Tier-2 National HPC resources, in a direct and practical manner, implementing concepts as they are taught and as they are used in live environments. Material will be made available during the course for self-paced learning, supplemented where possible by sessions with guest lectures from vendors and other specialists. Interactive sessions will focus on practical work and facilitate discussion of coursework with experts in their field.
The course covers:
- HPC system administration from a professional perspective.
- Implementing open-source technologies to deploy a functional HPC cluster using virtual resources.
- Sharing best practices and lessons learned from hands-on experience operating multiple national HPC services.
- Approaches for investigating technical issues.
- Understanding the functions of cluster management (incl. an overview of underpinning technologies).
- Deployment, configuration, and management of: authentication service; parallel filesystems; scheduler and resource manager; monitoring and logging solutions.
- Deployment and configuration of user access and environment customisation.
- Understanding of different performant network interconnect designs and technologies.
- Development of scheduling and queue management strategies for efficient workload management.
- Using Infrastructure as Code (IaC) practices for configuration management.
- Using automation to improve and maintain services and user experience.
- A high-level overview of service management processes, including change enablement and documenting standard operating procedures.
Students will demonstrate the learning outcomes by building a virtual HPC cluster which will require a parallel filesystem, centralised authentication, scheduler, and user access host.
|
Entry Requirements (not applicable to Visiting Students)
Pre-requisites |
|
Co-requisites | |
Prohibited Combinations | |
Other requirements | None |
Course Delivery Information
|
Academic year 2024/25, Not available to visiting students (SS1)
|
Quota: None |
Course Start |
Semester 2 |
Course Start Date |
13/01/2025 |
Timetable |
Timetable |
Learning and Teaching activities (Further Info) |
Total Hours: 100 (Online Activities 30, Programme Level Learning and Teaching Hours 2, Directed Learning and Independent Learning Hours 68)
|
Assessment (Further Info) |
Written Exam 0%, Coursework 100%, Practical Exam 0%
|
Additional Information (Assessment) |
Coursework (100%): The coursework will require the student to design, implement and evaluate a multi-node compute cluster with access to a shared file system, under the control of a queue management system, and to provide software environments for a set of identified user groups.
The student will be required to submit a design for their cluster, implement the design, submit the cluster for acceptance tests and evaluate the cluster performance and implementation. This will be supplemented by a reflection on how the student approached the scenario. |
Feedback |
Not entered |
No Exam Information |
Learning Outcomes
On completion of this course, the student will be able to:
- Understand and evaluate the technologies underpinning an HPC compute resource.
- Implement and evaluate parallel file systems in a compute resource.
- Design and analyse workload scheduling and queue management strategies and technologies.
- Create, execute, and analyse processes required to enable user environments and software management.
- Execute core system administration concepts in a professional HPC setting.
|
Additional Information
Graduate Attributes and Skills |
Not entered |
Keywords | systems,cluster,performance,file system,queue,software environment,network,HPC,administration |
Contacts
Course organiser | Mr Rui Apostolo
Tel:
Email: |
Course secretary | Mr James Richards
Tel: (0131 6)51 3578
Email: |
|
|