Sr Cloud Site Reliability Engineer, IS&T AI & Data Platforms

Sunnyvale, California, United States
Software and Services


Role Number: 200515360
Apple’s Applied Machine Learning team has built systems for a number of large-scale data science applications. We work on many high-impact projects that serve various Apple lines of business. We use the latest in open source technology and, as committers on some of these projects, our team looks to push the envelope! Working with multiple lines of business, we handle many streams of Apple-scale data. We bring it all together and unleash business value. We do all this with an outstanding group of software engineers, data scientists, SRE/MLOps engineers, and managers. We are looking for talented and dedicated engineers to join our team and bring a passion for infrastructure and distributed systems to building world-class platforms and products at very large scale across cloud environments.


Join Apple's Applied Machine Learning team as a Senior Software Engineer to build and support innovative software applications. Candidates should have a strong background in setting up and supporting the infrastructure for large-scale big data applications in a public cloud such as AWS.

RESPONSIBILITIES:

  • Focus on automation and on providing insight into infrastructure service reliability and availability through extensible services and platforms.
  • Design, implement, and maintain software and tools for large-scale distributed systems, especially the big data stack of technologies such as Iceberg, S3, HDFS, Hive, and Ranger.
  • Operate and deploy container orchestration systems such as Kubernetes and/or YARN.
  • Apply core computer science data structures, algorithms, and software tools in one of the following languages: Python, Golang, Java, or another JVM language.
  • Manage data pipelines using Kafka, Flink, Spark, Airflow, and Jupyter.
  • Work with platform tools and automation systems, including deployment automation practices, especially across multi-AZ or DC infrastructure, using CM tools such as SaltStack, Ansible, and Terraform.
  • Plan, design, and implement business continuity, capacity management, and observability across all services and levels of the stack.
  • Build and support CI/CD tools to port and manage applications on AWS and Kubernetes.
  • Build automation to enable self-healing systems.
  • Track SLIs to meet the agreed-upon SLAs.
  • Ensure compliance with appropriate security standards.
  • Deploy and debug systems built for horizontally scalable multi-tenant deployments.
  • Resolve, or find workarounds for, issues in customer-impacting production systems.
  • Be self-motivated, proactive, and solution-oriented.

Minimum Qualifications

Key Qualifications

  • 8+ years of experience in SRE/MLOps.
  • Experience operating and maintaining production systems on Linux and on public cloud infrastructure providers such as AWS (EC2, EBS, S3, Elastic IP, Route 53, IAM).
  • Experience with cloud-native orchestration systems such as Kubernetes, and with enabling autoscaling for both VM and containerized workloads.
  • Strong proficiency with Helm and Kustomize for managing Kubernetes applications and configurations.
  • Good working knowledge of load balancers, firewalls, TCP/IP networking architecture, and core technologies (HTTP, DNS, routing, etc.).
  • Experience with configuration management tools: Ansible/Puppet/Chef/SaltStack.
  • Experience with GitOps or CI/CD tools: Spinnaker/Jenkins/Flux/ArgoCD.
  • Strong programming skills in a Unix environment using Python/Java.
  • Experience with capacity planning, utilization reviews, and performance tuning.
  • Strong critical-thinking, debugging, and problem-solving skills.
  • Experience in implementing, managing and refining business continuity solutions.

Preferred Qualifications

Education & Experience

BS in computer science with 7-10 years of experience, MS with 5-7 years of experience, or equivalent related experience.

Additional Requirements

  • Work closely with multiple cross-functional teams to effectively coordinate and manage business user expectations.
  • Leadership, critical thinking, and excellent verbal and written communication skills.
  • Create new utilities for operational efficiency.

Pay & Benefits

  • Apple is an equal opportunity employer that is committed to inclusion and diversity. We take affirmative action to ensure equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics. Learn more about your EEO rights as an applicant.