Site Reliability Engineer - Infrastructure
Santa Clara Valley (Cupertino), California, United States
At Apple we believe our products begin with our people. We have a diverse, smart team that drives creative thought. By giving the team the resources they need to be leaders in innovation, we are also successful at driving precision. Through our collaborative process we create memorable experiences for our customers. These elements come together to make Apple an amazing environment for motivated people to do the greatest work of their lives.
Apple is looking for a Senior SRE with bare metal experience to drive, and be part of, a small team building, monitoring and maintaining a large-scale, highly resilient server, OS, stack, platform and network infrastructure for a critical and unique customer-facing Apple service. This is a rare opportunity to design, build and control the entire at-scale, end-to-end infrastructure from the beginning, along with all supporting components such as logging, metrics, monitoring, deployment and the SW development platform, within a team with a no-ops culture.
- 3+ years of experience in a DevOps or SRE role
- 3+ years of experience with modern web-scale systems, especially on the server side, including experience architecting VIPs and load balancers
- Extensive experience spec'ing, building, installing and maintaining datacenter server-quality hardware
- You're comfortable determining the right OS and OS setup for a large-scale platform, for both the production servers and the backend platform. You're comfortable testing OS changes and building the systems to keep these servers at an appropriate OS level.
- You've implemented and utilized your own metrics (TSDB) and logging platform with a front end such as Grafana
- You've implemented a monitoring system such as Sensu, Zabbix or Nagios. Even better if you've written your own.
- Able to write the software needed to build and operate a large-scale platform 24x7, including the development and staging platforms
- You're comfortable building and operating infrastructure that employs a Chaos Monkey. Bonus: You've written that Chaos Monkey service
- You enjoy analyzing performance, end to end service experience and overall system health. You dig and dig at what initially seems like a small oddity until you determine the root cause, and drive it to resolution if it's a potential issue (regardless of whose problem it is)
- Flexibility and comfort working on a dynamic, fast-growing effort with minimal documentation and process in a small team environment. Quick learner. Aptitude to deal with ambiguity, and enthusiasm to help solve difficult issues.
- Bonus: Kubernetes, Docker, Chef, Puppet, Kafka experience
- Bonus: Experience with Cisco, Juniper or Arista routing and switching hardware (+OS)
- Bonus: Experience building or running a QA platform
You will build and run an Apple service that millions of customers use every day. You’ll also build and run the infrastructure powering that service. We’re looking for people who like to solve operational problems using software rather than shell prompts as we scale Apple’s services for customers around the world. Help us build the Apple experience on a global scale!
Education & Experience
- Networks, Computers, Electronics, and Software have been your hobby since you were in grade school
- Technical engineering BS would be nice though not required
- MS in CS, CE, or EE is even better