Site Reliability Engineer (SRE) - Data Platform

London, England, United Kingdom
Software and Services

Summary

Weekly Hours: 35
Role Number: 200595038
At Apple, we believe that innovation flourishes in an environment where ideas are challenged, collaboration is encouraged, and technology is pushed to its limits. This environment is only possible when diverse minds come together, bringing unique perspectives and experiences. Our people and their ideas inspire innovation in everything we do. Imagine what you could accomplish here! Join Apple and help us make the world a better place. 
As an SRE on our team, you’ll be responsible for architecting, optimizing, and scaling distributed storage and analytics systems. You’ll collaborate closely with development teams to help them grasp the broader picture of distributed systems, beyond individual components. We firmly believe in ownership, with software engineers accountable for the code they write.

Description

The Apple Services Engineering (ASE) organization builds and provides systems and infrastructure that fuel Apple’s services (such as iCloud, iTunes, Siri, and Maps). At ASE, we are building and scaling high-performance, resilient, and efficient storage and analytics platforms that power critical insights across the company. Our team sits at the heart of distributed systems, big data, and large-scale infrastructure, ensuring that petabyte-scale workloads run smoothly, efficiently, and reliably. 
ASE runs the majority of its systems on Linux. We run a mix of open source, vendor-licensed, and internally developed tools to perform functions such as system configuration management, provisioning, software deployment, logging, and monitoring. You’ll be expected to learn these tools and to improve them.

Minimum Qualifications

  • Subject-matter expertise in leading large-scale migration and modernization initiatives in the data analytics domain, including providing expert guidance to customers as they transition to newer systems.
  • Hands-on experience running analytics storage solutions such as HDFS or S3-compatible systems.
  • Proficiency in designing, authoring, and releasing code in languages like Go or Python.
  • Good understanding of networking concepts, including the TCP/IP stack, DNS, DHCP, and other standard network protocols.
  • Knowledge of provisioning, data migration, disaster recovery, and capacity planning.
  • Experience in automating repetitive tasks and processes to enhance reliability and efficiency.
  • Experience in managing and scaling distributed systems in a public, private, or hybrid cloud environment.

Preferred Qualifications

  • Experience contributing to team and organizational strategy, including participating in architectural reviews and decision-making processes.
  • Hands-on experience managing large numbers of diverse systems with configuration management or software delivery platforms (such as Puppet or Ansible).
  • Experience participating in on-call rotations and incident management processes to ensure rapid resolution of critical issues.
  • Experience with monitoring tools like Splunk and Prometheus.
