Software Delivery - Senior Site Reliability Engineer

Santa Clara Valley (Cupertino), California, United States
Software and Services

Summary

Role Number: 200520582
Apple’s Software Delivery team is looking for an innovative Senior SRE with experience managing physical infrastructure and cloud solutions to design, build, and maintain our core infrastructure. This infrastructure enables thousands of Apple software engineers to develop products that delight millions of Apple customers. As a Senior SRE, you will help lead and mentor other engineers and communicate with senior leadership.

Key Qualifications

  • 5+ years in an Infrastructure Ops, Site Reliability Engineering, or DevOps-focused role
  • Ability to autonomously manage cross-functional projects, communicate expectations, set timelines, and drive projects to completion
  • Experience supporting data center operations, distributed systems, and production services
  • In-depth knowledge of Linux and Unix
  • Experience deploying and managing infrastructure using configuration management tools such as Puppet and/or Ansible
  • Strong programming skills: Shell, Go, and/or Python

Description

Key Responsibilities:

  • Collaborate with cross-functional teams to understand requirements, then design and implement resilient and scalable infrastructure solutions
  • Play a vital role in incident response, diagnosing and resolving problems to minimize outages and downtime in a high-pressure, large-scale environment
  • Build monitoring and alerting systems for early issue detection
  • Evaluate and integrate new technologies to improve system reliability, security, and performance
  • Create runbooks for incidents and issues
  • Develop and implement automation to provision, configure, deploy, and monitor infrastructure components
  • Work across multiple time zones

Education & Experience

Bachelor's degree in Computer Science or a related technical background involving software/system engineering, or equivalent working experience.

Additional Requirements

Preferred Experience:

  • Experience gathering and analyzing system resource metrics and logs to triage issues
  • Experience with capacity planning, including calculating power and cooling requirements
  • Systematic problem-solving approach, coupled with a strong sense of ownership and drive
  • Ability to implement and coordinate telemetry using monitoring and observability tools such as Splunk, Grafana, and Prometheus
  • Familiarity with cloud infrastructure concepts (zones, regions, VPCs, etc.)
  • Working understanding of common authentication schemes, certificates, and secure secrets management
  • Leadership experience related to SRE and/or operationally focused teams
  • Operational experience running a production 24x7 infrastructure at scale
  • Hands-on, on-premises experience working in data centers with infrastructure at scale
  • Previous experience working with software development teams shipping software at scale

Pay & Benefits