Site Reliability Engineer, Ad Platforms
Santa Clara Valley (Cupertino), California, United States
Software and Services
At Apple, we work every day to create products that enrich people’s lives. Our Advertising Platforms group makes it possible for people around the world to easily access informative and imaginative content on their devices while helping publishers and developers promote and monetize their work. Today, our technology and services power advertising in Search Ads in the App Store and Apple News. Our platforms are highly performant, deployed at scale, and setting new standards for enabling effective advertising while protecting user privacy.

The Ad Platforms team is seeking a Site Reliability Engineer for an extraordinary opportunity. Our mission is to enable Ad Platforms to deliver advertisements in a reliable and scalable way that results in awesome user experiences. We achieve this mission through automation, processes, and education for our partner teams.
- Excellent experience supporting internet-facing production services and distributed systems.
- Proficient in configuring, deploying, managing, and supporting services in AWS
- Good programming skills in Python, Go, Java, or C.
- Expertise in operating Linux-based systems, with a solid understanding of their internals
- Extraordinary problem-solving ability that draws on creative and innovative thinking, combined with a strong sense of ownership, customer service, and integrity
- Experience with Splunk, Grafana, or similar monitoring tools
- Experience with Terraform, Puppet, or other configuration management tools
- Experience collaborating closely with engineering and platform teams
- Excellent communication skills
- Self-motivated, with enthusiasm for learning new technologies
- Knowledge of container platforms such as Mesos, Kubernetes, or Nomad
IN THIS ROLE, YOU WILL:
- Work closely in partnership with Engineering to maximize availability, operability, scalability, performance, and reliability of services
- Serve as the Primary SME for application and infrastructure on call / site-up issues for production and non-production
- Build, deploy, and scale systems across private and public cloud platforms
- Support Operability and availability design, readiness, and review
- Design operations automation and tooling to improve the reliability of our services and infrastructure
- Implement best practices for Capacity planning and management for disaster recovery and resiliency
- Lead application and infrastructure monitoring, alerting, and dashboards
- Work on security best practices, patching, and hygiene
- Drive incident and problem management and analysis
Education & Experience
- Bachelor's degree in a Computer Science or Engineering discipline, or equivalent. Master's degree preferred.