Principal HPC System Administrator
Company: Consortium for School Networking
Location: Chicago, IL
Posted on: November 11, 2024
Job Description:
- Design, configure, deploy, and maintain large computer
clusters, servers and software.
- Lead day-to-day operations, including systems administration,
monitoring, and storage performance, up to and including network
components. Manage the system's network switch, parallel file
system, and HPC software stack and tools.
- Monitor, maintain, and optimize HPC systems and software to
improve performance and resource utilization.
- Serve as the technical lead on complex projects and
system-related tasks, as needed.
- Configure, install, and maintain the job scheduler/workload
manager.
- Diagnose and resolve system operational problems promptly and
effectively, coordinating with vendors to address hardware and
software issues.
- Use scripting/programming to enable system-level automation,
monitoring, and problem detection.
- Build and deploy open-source software as well as software from
vendors/partners.
- Develop and implement strategies for HPC data management,
backup, disaster recovery, and security, ensuring reliable and
efficient backup and restores for all managed systems.
- Create standard operating procedures for routine and complex
system tasks.
- Maintain and monitor the security of HPC systems and servers,
implementing robust security measures, as applicable.
- Troubleshoot and identify failed hardware, implement parts
replacement, and resolve system failures.
- Stay updated with the latest developments in HPC technologies
and apply this knowledge to improve Research Computing Center (RCC)
systems.
- Solve complex problems to configure, install, upgrade, and
maintain server applications and hardware. Safeguard the
integrity of computer software. Implement operating system
enhancements to improve the reliability and performance of the
system.
- Provide expertise in planning and installing necessary patches
and upgrades for servers and their associated storage, network,
communications, and peripheral subsystems. Install and maintain
an appropriate level of intrusion detection, monitoring, and
auditing software as required.
- Perform other related work as needed.
Preferred Qualifications
Education:
- Bachelor's degree in Computer Science or closely related
field.
Experience:
- A minimum of seven years of full-time Linux system
administration experience in a large distributed computing
environment.
Technical Skills or Knowledge:
- Experience with Linux system administration (e.g., RHEL, Rocky,
CentOS).
- Proficiency in the installation, maintenance, operation, tuning
and troubleshooting of Linux and related systems and software.
- Experience in installing, configuring, and maintaining a job
scheduler/workload manager (such as SLURM, TORQUE, or PBS).
- Experience configuring, installing and troubleshooting MPI and
OpenMP.
- Experience with at least one HPC cluster management tool (e.g.,
xCAT, Confluent, Warewulf, or Bright).
- Experience in configuring, administering, and supporting
network storage subsystems.
- Hands-on experience with at least one parallel file system
(e.g., Spectrum Scale-GPFS, Lustre, BeeGFS, or Ceph).
- Direct experience working with InfiniBand, including working
knowledge of InfiniBand concepts, OFED layers, and subnet managers,
as well as Gigabit Ethernet.
- Experience with networking and security.
- Experience with systems automation tools such as Ansible or
Puppet.
- Experience with versioning tools such as Git or
Subversion.
- Experience configuring, installing, maintaining and using
monitoring and optimization tools.
- Strong knowledge of scripting languages such as Python or
Bash.
Preferred Competencies
- Ability to work well with faculty and researchers.
- Ability to identify and gain expertise in appropriate new
technologies and/or software tools.
- Ability to function as part of an interactive team while
demonstrating self-initiative to achieve the project's goals and
the Research Computing Center's mission.
- Strong analytical skills and problem-solving
ability.
Application Documents
- Cover letter (preferred)
- Resume (required)