Distributed Systems - Site Reliability Engineer (SRE)
Santa Clara Valley (Cupertino), California, United States
Software and Services
The Software Engineering Operations team within Software Delivery is looking for Site Reliability Engineers to maintain and improve the services that enable thousands of Apple engineers to develop the software products that delight millions of Apple customers. In this position, you will work with a group of top-notch systems engineers from related but different backgrounds, in a team that fosters a culture of innovation and continuous improvement. To be successful in this role, you must be hands-on, proactive, and good at problem solving, with a strong desire to learn and work towards excellence.
This job will provide you with: a team of highly skilled coworkers ready to both mentor and learn from you; unique distributed computing problems, approached with an open mind about how they can be solved; the opportunity to collaborate with talented engineering teams across a wide range of technology disciplines; and the freedom to take ownership and drive meaningful improvements in the operational reliability of mission-critical services.
- Passion for continually learning and exploring new technologies.
- Well versed in Linux and macOS systems management.
- Familiar with application and service monitoring tools and techniques.
- Knowledge of cloud platforms and virtualization technologies.
- Development experience with Python, Ruby, Scala or Go.
- Involvement with incident management and response.
- Excellent collaborative skills, with strong written and verbal communication.
Responsibilities will include:
- Identify sources of instability in distributed systems and drive operational excellence.
- Monitor and stress test systems to collect metrics for tuning and capacity planning.
- Reduce the burden of toil with iterative development of tooling and automation.
- Collaborate with engineering teams to release new features and become an authority on our services.
- Participate in on-call rotation.
Education & Experience
B.S. or equivalent experience in a technical discipline
These are not hard requirements, but this position might be of interest if you have experience with or a desire to learn about:
- Cloud orchestration technologies such as Mesos or Kubernetes.
- Virtualization platforms such as KVM, Docker, and QEMU.
- Object and distributed block storage technologies such as S3 or Ceph.
- Splunk, Grafana, Graphite or other monitoring tools.
- Puppet, Ansible or other configuration management tools.
- Understanding of server hardware and tools such as HP iLO and IPMI to monitor for hardware failures.