Site Reliability Engineer

San Diego, California, United States
Machine Learning and AI

Summary

Posted:
Role Number: 200177550
The Video Computer Vision organization is working on exciting technologies for future Apple products. Our focus is on ML-based solutions for real-time image and video. We have contributed to the FaceID and FaceKit projects in the past and, more recently, to the new LIDAR sensor for iPad. We are looking for the right Site Reliability Engineer to help us take our efforts to the next level. In this role, you will be part of the core data infrastructure team for the Video Computer Vision organization. As a core contributor on our SRE team, you will develop and maintain a modern deployment system for cloud services and applications. You will be responsible for system bring-up, deployment, reliability, security, and service scalability. This role is highly cross-functional, and you will work closely with highly skilled software development and ML teams developing cutting-edge algorithms.

Key Qualifications

  • 3+ years in managing Site Reliability Engineering teams and supporting mission critical applications
  • 5+ years managing large fleets of *nix systems
  • 3+ years of hybrid cloud experience (data center, AWS, GCP, Azure)
  • 5+ years with configuration management tools such as Ansible or Terraform
  • 2+ years of programming experience (preferably Python)
  • 2+ years of managing relational and NoSQL databases
  • 2+ years of building fully automated CI/CD pipelines
  • You should also be self-directed, analytical, and able to work well in a team environment.

Description

Your core responsibility is to provide operational support for multiple cloud-based applications running on AWS and Apple infrastructure, with an emphasis on deployment, security, scalability, and reliability. Our operations tech stack is Ansible, Terraform, Go, Python, and Prometheus, with some bash scripting. Common technologies include Django, Docker, Kubernetes, Postgres, Redis, and Cassandra. We have a hybrid infrastructure and make extensive use of Amazon Web Services along with home-grown compute clouds.

What qualities will make you successful? We are looking for a driven and dedicated Site Reliability Engineer with hands-on experience in the following:

  • Core operations experience with Linux, Ansible (or similar), Docker, Kubernetes, and Postgres
  • Engaging with various software development teams to collaborate and build services from the ground up
  • Expertise in networking with an emphasis on security
  • Experience building systems both on-premises (data center) and on public cloud (AWS, GCP, or Azure welcome)
  • Working knowledge of deploying microservices (Django, Go, JVM)
  • Experience with schedulers such as Kubernetes, AWS ECS, or EKS
  • Ability to write code in a high-level language (Python preferred)
  • Extensive experience using Linux, with knowledge of kernel and system tuning
  • Last but not least, you are battle-tested and have a few interesting production tales

Education & Experience

BS/MS in Computer Science/Computer Engineering (or equivalent experience).

Additional Requirements