Introduction to HPC (High-Performance Computing)
RCD Terminology
High-Performance Computing (HPC) System Components:
Login Node:
The entry point to the HPC system.
Used for logging in, submitting jobs, and managing files.
It is unsuitable for computationally intensive tasks; instead, it serves as a gateway to access more specialized computing resources.
Compute Node:
The powerhouse of the HPC system.
These nodes are where your computational tasks are executed.
Equipped with high-performance processors (CPUs) and often specialized accelerators like GPUs.
Typically accessed through job scheduling systems.
Data Transfer Node:
Dedicated nodes optimized for fast data transfer within the HPC system and with external networks.
They are used for efficiently moving large datasets in and out of the system.
Essential for managing input and output data for computational tasks.
Job Submission:
Batch Jobs:
Most computations on HPC systems are performed as batch jobs.
Users submit scripts or commands describing their computational tasks to the job scheduler.
The scheduler queues and dispatches these jobs to available compute resources.
Job Scheduler:
Software responsible for managing and scheduling computational tasks across available compute resources.
Allocates resources based on user-defined requirements, system policies, and workload priorities.
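For example, a batch job might be composed and submitted as in the sketch below. This assumes a SLURM scheduler (a common choice, but not the only one); the resource values, module name, and script names are placeholders:

    # Minimal sketch: write a batch script, then hand it to the scheduler.
    # Assumes SLURM and its sbatch command; all values are illustrative.
    import subprocess
    import textwrap

    batch_script = textwrap.dedent("""\
        #!/bin/bash
        # Request one task with four cores and 8 GB of memory for one hour.
        #SBATCH --job-name=example
        #SBATCH --ntasks=1
        #SBATCH --cpus-per-task=4
        #SBATCH --mem=8G
        #SBATCH --time=01:00:00

        # Load the software environment, then run the computation
        # (module name and script are placeholders).
        module load python/3.11
        python analyze.py
        """)

    with open("example_job.sh", "w") as f:
        f.write(batch_script)

    # The scheduler queues the job and dispatches it to a compute node
    # once the requested resources become available.
    subprocess.run(["sbatch", "example_job.sh"], check=True)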
File Systems:
Home Directory:
Personal storage space for each user on the HPC system.
Typically small in capacity and intended for storing scripts, configuration files, and small datasets.
Scratch Space:
A temporary storage area optimized for high-speed I/O, ideal for storing intermediate computation results and temporary files.
Files in scratch space are typically purged periodically to free up storage resources.
Project Space:
A storage space shared among members of a PI's (Principal Investigator's) research group.
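As a sketch of how these spaces typically work together, the snippet below stages intermediate files in scratch and copies only the final result to project space. The SCRATCH and PROJECT environment variables and the fallback paths are assumptions; actual locations vary from system to system:

    # Typical workflow: write intermediates to scratch, keep only
    # final results in project space (paths here are assumptions).
    import os
    import shutil
    from pathlib import Path

    user = os.environ["USER"]
    scratch = Path(os.environ.get("SCRATCH", f"/scratch/{user}"))
    project = Path(os.environ.get("PROJECT", f"/project/{user}"))

    workdir = scratch / "run_001"
    workdir.mkdir(parents=True, exist_ok=True)

    # ... the computation writes temporary files into workdir ...
    (workdir / "result.txt").write_text("final result\n")

    # Copy only what must survive the periodic scratch purge.
    project.mkdir(parents=True, exist_ok=True)
    shutil.copy2(workdir / "result.txt", project / "result.txt")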
Software Environment:
Module System:
Allows users to load and unload software packages and libraries dynamically.
Ensures compatibility and provides access to a wide range of tools and applications.
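Module commands are normally typed at the shell prompt, e.g. module avail to list available packages and module load to activate one. Because module is a shell function rather than a standalone executable, a script that needs a module environment can go through a login shell, as in this sketch (the module name python/3.11 is a placeholder):

    # Run a command inside a module environment via a login shell,
    # since `module` is a shell function, not a binary.
    import subprocess

    result = subprocess.run(
        ["bash", "-lc", "module load python/3.11 && python --version"],
        capture_output=True, text=True,
    )
    print(result.stdout or result.stderr)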
SSH:
Acronym for Secure Shell.
Provides a secure channel over an unsecured network for safely logging into remote systems.
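From a terminal this is simply the ssh command, e.g. ssh your_username@login.hpc.example.edu (the hostname here is a placeholder). Logins can also be scripted, for instance with the third-party paramiko library in Python; a minimal sketch, assuming SSH keys or an agent are already set up:

    # Minimal SSH session using the paramiko library; the hostname and
    # username are placeholders, and key-based auth is assumed.
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("login.hpc.example.edu", username="your_username")

    # Run one command on the remote system and print its output.
    stdin, stdout, stderr = client.exec_command("hostname")
    print(stdout.read().decode())
    client.close()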
Globus:
A software platform for securely and efficiently transferring data.
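Most users drive Globus through its web interface, but transfers can also be scripted with its Python SDK (the globus-sdk package). A heavily abridged sketch is below; the access token and endpoint UUIDs are placeholders, and in practice the token comes out of a Globus Auth login flow:

    # Abridged sketch of submitting a transfer with globus-sdk.
    # Token and endpoint UUIDs are placeholders.
    import globus_sdk

    authorizer = globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN_HERE")
    tc = globus_sdk.TransferClient(authorizer=authorizer)

    task = globus_sdk.TransferData(
        source_endpoint="SOURCE-ENDPOINT-UUID",
        destination_endpoint="DEST-ENDPOINT-UUID",
        label="example transfer",
    )
    task.add_item("/path/on/source/data.tar", "/path/on/dest/data.tar")

    result = tc.submit_transfer(task)
    print("task id:", result["task_id"])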
HPC Architecture
Components:
Networking:
Networking is crucial in High-Performance Computing (HPC) because it's what allows all the computers (nodes) in a cluster to talk to each other effectively. Think of it like the internet for supercomputers! As clusters get bigger, the way they're connected becomes even more important. That's where technologies like InfiniBand come in: they provide the high-bandwidth, low-latency connections that larger clusters depend on.
Keeping the connections between nodes strong and managing any traffic jams (congestion) are key to keeping everything running smoothly. This ensures data moves quickly and delays are kept to a minimum, which is super important for getting results fast in HPC tasks. It's also important to be a good steward of the network because it's a finite resource: there's only so much bandwidth available. Every file operation and message passed between nodes consumes a small chunk of it, so using the network effectively helps ensure good performance for everyone on the system.
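To make that concrete, here is a minimal sketch of two processes, possibly on different nodes, passing one message over the interconnect using the mpi4py library (launched with something like mpirun -n 2 python demo.py). Every message like this consumes a slice of the shared network:

    # Two MPI ranks exchange a single message across the interconnect.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        # Rank 0 sends a (pickled) Python object to rank 1.
        comm.send({"payload": list(range(1000))}, dest=1, tag=0)
    elif rank == 1:
        data = comm.recv(source=0, tag=0)
        print(f"rank 1 received {len(data['payload'])} items")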
Compute Nodes:
A compute node is a fundamental unit in High-Performance Computing (HPC) clusters, tasked with heavy computational workloads. Commonly known as the 'workhorse,' these nodes handle most computing tasks, while specialized nodes may feature additional resources such as large memory or GPU accelerators. Compute nodes execute jobs dispatched by the scheduler, accessing shared filesystems for required software and data. The scheduler evaluates job requirements like computational intensity and memory usage, then allocates resources on the most suitable compute node to maximize performance and efficiency.
Data Transfer Nodes (DTN):
A data transfer node serves as a specialized component in High-Performance Computing (HPC) clusters, dedicated to facilitating fast and efficient data movement within the system and with external networks. Unlike compute nodes, which focus on heavy computation, data transfer nodes prioritize high-speed data transfer operations. Equipped with optimized network interfaces and storage systems, these nodes ensure swift and reliable data exchange between storage systems and compute nodes. Data transfer nodes play a crucial role in transferring large datasets between storage systems, such as moving input data to compute nodes and fetching output data. Additionally, they enable researchers to efficiently share data with collaborators and transfer results to external storage or analysis platforms.
GPFS File System:
GPFS (the General Parallel File System) is a high-performance clustered file system deployed on High-Performance Computing (HPC) systems. It serves as a robust and scalable storage solution for user data and system files, accessible from both compute and data transfer nodes. GPFS offers features such as high availability, data replication, and automatic tiering, optimizing storage performance and efficiency. Furthermore, it supports parallel I/O, allowing multiple compute nodes to access and manipulate files concurrently. Researchers use GPFS to store input data, intermediate results, and output files generated during computational tasks. Integrated with the job scheduling system, GPFS enables efficient data access and management, enhancing overall performance in the HPC environment.
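As an illustration of the parallel I/O GPFS supports, the sketch below has every MPI rank write its own non-overlapping block of a single shared file concurrently, using MPI-IO through mpi4py (the file name and sizes are arbitrary):

    # Each MPI rank writes its own block of one shared file.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    n = 1024                              # elements per rank
    data = np.full(n, rank, dtype="i4")   # payload: this rank's id

    fh = MPI.File.Open(comm, "shared_output.bin",
                       MPI.MODE_CREATE | MPI.MODE_WRONLY)
    # Non-overlapping offsets let all ranks write simultaneously.
    fh.Write_at_all(rank * data.nbytes, data)
    fh.Close()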