AIML - Infrastructure Services - Site Reliability Engineer, Machine Learning Platform and Infrastructure

Cupertino, California, United States
Machine Learning and AI

Summary

Posted:
Weekly Hours: 40
Role Number: 200547633
This is an exciting opportunity for a Senior Systems Engineer / SRE to join the AIML team at Apple. We are looking for an experienced SRE who understands and believes in the concept of infrastructure as code to join a new team! A successful candidate will focus on designing and developing solutions to highly complex issues in a large-scale distributed system environment!

Key Qualifications

  • 10+ years of work experience in system administration
  • Expert knowledge of the Linux operating system (OS, networking, process level)
  • Experience in managing, scaling, and troubleshooting applications on AWS
  • Ability to implement and coordinate telemetry using monitoring and observability tools such as Splunk, Grafana, and Prometheus
  • Fluent in at least one scripting language (Shell, Python, Ruby, etc.)
  • Experience with at least one configuration management tool (Puppet, Chef, Ansible, Salt)
  • Strong verbal and written communication skills
  • Passionate about being a part of a tight-knit Operations team
  • A strong sense of ownership while being a team player who communicates clearly and transparently
  • Self-motivated, inquisitive, and always looking to learn more

Description

The team will be responsible for the maintenance and delivery of infrastructure services. These services are key to the development and production process of the AIML team. This team works very closely with other teams across AIML as operational subject matter experts, offering guidance and advice that enables other teams to improve their services. A successful candidate will likely have experience as a Systems Administrator who has moved on to development and automation in their career.

In this role, you will get to:

  • Help operate Apple’s largest infrastructure supporting millions of AIML customers
  • Manage one of the largest deployments of a logging service on AWS
  • Migrate configs and users from a legacy service to a new platform on AWS
  • Actively participate in capacity planning, scale testing, and disaster recovery exercises
  • Interact with stakeholder teams, including engineering, QA, and program management
  • Cultivate and maintain relationships with internal and external third-party vendors
  • Make changes to our environment with the purpose of pushing AIML services to the next level

Education & Experience

BS in computer science with 10 years of related experience.

Additional Requirements

  • Experience with large-scale CI/CD environments
  • Experience in writing Infrastructure as Code (IaC) using Terraform
  • Experience in managing multi-region deployments of large-scale services on AWS
  • Working knowledge of supporting version control systems
  • A solid understanding of logging and time series services
  • Practical experience supporting multiple services

Pay & Benefits