System Deployment & Lifecycle Lead

Santa Clara Valley (Cupertino), California, United States


Role Number: 200064377
Do you love creating elegant solutions to highly complex challenges? Do you intrinsically see the importance in every detail? As part of our Silicon Technologies group, you’ll help develop the compute environment used to design and manufacture our next-generation, high-performance, power-efficient processors and systems-on-chip (SoCs). You’ll ensure Apple products and services can seamlessly and efficiently handle the tasks that make them beloved by millions! Joining this group means you’ll be responsible for crafting and building the technology that fuels Apple’s devices. Together, you and your team will enable our customers to do all the things they love with their devices.

Our SRE team provides compute services to the Silicon Engineering Group. We are looking for an experienced operations leader to develop and maintain systems for tracking, automating, and reporting on compute hardware as it moves through its useful lifecycle. The engineer in this role will ensure that Apple's world-class silicon designers have the compute capacity needed to engineer and design the world's most advanced silicon devices and products. They will use a deep understanding of tool design to automate and continually report on the machine lifecycle, from deployment through ongoing maintenance to decommissioning. Strong communication skills are vital for collaborating with complementary teams across Apple to fulfill our goals.

Key Qualifications

We are seeking someone to join our team with at least 5 years of experience in compute operations in a large Cloud, IT, or R&D environment. We require demonstrated skills in the following areas:
  • Automation of Linux OS installation via gPXE, iPXE, or other methods
  • Automation of firmware updates
  • Automation of machine maintenance and break/fix activities
  • Automation of phone-home / ticket creation on trend-based detection of machine defects
  • Design and implementation of Intel-based rack-mount servers: Dell, HP, Supermicro, Quanta, or specialized designs like Open Compute
  • Use of OoBM management APIs or other programmatic interfaces
  • Use of DCIM APIs (nLyte, DC Clarity, RAMP, or custom solutions)
  • Centralized configuration management like Puppet, Ansible, or Chef
  • Scripting in Shell, Perl, Python or Ruby
  • Incident management and reporting, particularly hardware break / fix trends
  • KPI reporting for server uptime, maintenance status, hardware failure trends, and other business-impacting metrics
  • Metrics gathering solutions like Ganglia or TICK
  • Monitoring solutions like Nagios or Prometheus
  • Log correlation systems like Splunk or ELK
  • Working knowledge of DHCP, TFTP, DNS, and other common network services
  • Working knowledge of LDAP (OpenLDAP, DSEE, OpenDirectory)
  • Familiarity with NAS appliance hardware (NetApp preferred)
  • Familiarity with Ethernet switch and router hardware (Arista preferred)
  • Familiarity with revision control systems like SVN, git, or Perforce


This role supports the SRE teams that deliver EDA compute services in datacenters across the world by developing and maintaining systems for automating, reporting on, and tracking compute hardware over the course of its useful life. You will drive the process for installing machines from the point they arrive on site through OS boot and delivery to the service owners. You will also work with our SRE teams to orchestrate and automate the removal of machines for break / fix and maintenance operations while maintaining business-defined capacity goals. You will accomplish this by developing internal tools, collaborating with other teams in Apple that provide complementary services, and procuring outside services as required.

Education & Experience

MS/BS Degree or equivalent experience

Additional Requirements