Senior Site Reliability Engineer (SRE)
Santa Clara Valley (Cupertino), California, United States
Apple is looking for a Senior SRE with automation tooling experience to drive, and be part of, a team that builds, monitors, and maintains large-scale, highly resilient systems. You'll contribute to the bare metal, OS stack, platform, and network infrastructure for a critical and unique customer-facing Apple service. This is a rare opportunity to design, build, and control the entire at-scale, end-to-end infrastructure, along with all supporting components such as provisioning, logging, metrics, monitoring, deployment, and the software development platform, from the beginning, within a team with a no-ops culture.
- Experience in a DevOps or SRE role
- Experience with modern web-scale services including servers, VIPs, load balancers, proxies
- Able to write the software needed to build and operate a large-scale platform 24x7, including the development and staging platforms
- Proficient in at least one of these languages: Python, Golang, Rust, C++
- Familiar with bare metal bootstrap, provisioning, configuration and orchestration
- Highly experienced with one of these: Puppet, Chef, SaltStack, Ansible
- You're comfortable testing and rolling out new kernels, drivers, libraries, OS changes, and config sync, and building the systems that keep servers at an appropriate level and functionality
- You're knowledgeable about running systems for containerized services (Docker) and managing how they interact with network and system resources
- You've implemented and utilized your own metrics (TSDB) and logging platform with a front end such as Grafana
- You've implemented a monitoring system such as Sensu, Zabbix, or Nagios. Even better if you've written your own
- You're comfortable building and operating infrastructure that employs a Chaos Monkey. Bonus: You've written that Chaos Monkey service
- You enjoy analyzing performance, end to end service experience and overall system health. You dig and dig at what initially seems like a small oddity until you determine the root cause, and drive it to resolution if it's a potential issue (regardless of whose problem it is)
- Flexibility and comfort working on a dynamic, fast-growing effort with minimal documentation and process in a small team environment. Quick learner. Aptitude to deal with ambiguity, and enthusiasm to help solve difficult issues
- Bonus: Experience with native Kubernetes implementations (including CNI), Kafka, and etcd
- Bonus: You're familiar with distributed orchestration systems
- Bonus: Experience with Cisco, Juniper, or Arista routing and switching hardware (+OS), including wireless
You will build and run an Apple service that millions of customers use every day. You’ll also build and run the infrastructure that powers that service. We’re looking for people who like to solve operational problems using software rather than shell prompts as we scale Apple’s services for customers around the world. Help us build the Apple experience on a global scale!
Education & Experience
- Networks, computers, electronics, and software have been your hobby since you were in grade school
- Technical engineering BS would be nice, though not required
- MS in CS, CE, or EE is even better