Site Reliability Engineer (SRE) - Infrastructure

Santa Clara Valley (Cupertino), California, United States
Software and Services

Summary

Posted:
Weekly Hours: 40
Role Number: 200095113
Services and Infrastructure (S&I) is seeking a customer-service-oriented, self-driven, and motivated Infrastructure SRE to join our team. S&I is a diverse group of engineers who form the foundation of the build system responsible for assembling Apple’s software products. The candidate will be able to analyze and troubleshoot a broad spectrum of problems. As an Infrastructure SRE, you will help implement the infrastructure that supports the continued growth of the build system and reinvent the way we monitor our environment. You will join an existing team dedicated to supporting software engineering teams within Apple.

Key Qualifications

  • Minimum of 5-7 years of experience in a production data center with at least 1,000 servers
  • Experience troubleshooting complex issues by correlating data from multiple sources, e.g. environmental, server sensor, and OS data
  • Experience gathering server data from various vendor BMCs, e.g. HP iLO, Dell DRAC, IPMI
  • Broad experience supporting and maintaining common Linux/Unix applications and services, as well as a good understanding of DNS, DHCP, LDAP, NFS, Kerberos, PAM, PXE, SNMP, SSH, HTTP/S, and NTP
  • Experience with common version control software such as Git
  • Experience with monitoring using Prometheus, Grafana, and Splunk

Description

Specific responsibilities will include:

  • Work cross-functionally with vendors and a variety of other teams at Apple to identify infrastructure instabilities and help resolve them
  • Hands-on and remote troubleshooting of hardware and Linux systems
  • Document policies and procedures
  • Troubleshoot Layer 2 / Layer 3 networking, Arista / Cisco preferred
  • Support day-to-day operations of the environment, including monitoring, measuring, and troubleshooting infrastructure and services
  • Automate tasks and processes by identifying, owning, collaborating on, and driving new or further automation to enhance the consistent stability of the environment
  • Ability to self-manage large projects, including setting and meeting deadlines
  • Ability to participate in a regular on-call rotation

Education & Experience

Additional Requirements

Preferred Qualifications:
  • Experience with DCIM software, e.g. StruxureWare
  • Cisco/Arista networking experience
  • Experience using monitoring and metrics to gather statistical data for strategic planning
  • Understanding of the server deployment process using PXE
  • Understanding of rack elevations, power requirements, and cooling for capacity planning
  • Experience working with remote data center service teams