Site Reliability Engineer, Distributed Systems
Santa Clara Valley (Cupertino), California, United States
Software and Services
Imagine what you could do here. At Apple, new ideas have a way of becoming extraordinary products very quickly. Bring passion and dedication to your job and there's no telling what we can accomplish together. Do you love crafting elegant solutions to highly complex challenges? Can you intrinsically see the importance of every detail?
At Apple, our Platform Architecture group is responsible for connecting our hardware and software into one unified system. Join this team, and you'll collaborate with engineers across Apple to build and deploy forward-looking prototype systems that contribute to the development of our world-renowned hardware and software architecture. You and your team will validate that every product we make performs exactly as intended. Together, our work will be the reason millions of customers feel that they can trust our devices every single day.
The Site Reliability Engineer within the Platform Architecture team will support a team of software engineers and help build software systems that automate the testing, deployment, management, and monitoring of Apple’s large-scale, internal engineering compute services. You should have both solid Linux/systems expertise and demonstrated software development abilities.
- Extensive experience in a Systems Engineering / DevOps role in a large-scale environment running production systems.
- Strong knowledge of Linux systems internals and administration.
- Comfortable analyzing and troubleshooting large-scale distributed systems.
- Strong systems scripting skills (Python, Go, Bash, Ruby, etc.).
- Experience with configuration management tools like Puppet, Chef, etc (SaltStack is
- Strong initiative and a passion for learning new technologies.
- Strong systematic problem-solving skills and the ability to work amid ambiguity.
- Excellent written and verbal communication and presentation skills.
- Passionate and inquisitive, solves everyday problems in innovative ways.
Proactively ensure the highest levels of systems and infrastructure availability. Troubleshoot issues across the entire stack - hardware, software and application. Work with the team to design, build, and maintain core systems and management tools. Write tools, and leverage open source, to automate tasks. Collaborate with other engineers on code reviews, internal infrastructure improvements and process enhancements.
Education & Experience
Bachelor’s degree in Computer Science or equivalent industry experience.
- Experience with virtualization, containerization, and system image management (KVM, LXC);
- Experience with building monitoring/alerting/logging infrastructure (Prometheus, Grafana, Splunk, etc.);
- Experience with distributed storage systems (HDFS, Amazon S3);
- Expertise using source code repositories (Git) and CI/CD tools (Jenkins).