Cloud Monitoring SRE

Posted: 18 Dec 2023

Role Number: 200529840

People at Apple don’t just build products; they craft the kind of experiences that have revolutionized entire industries. The diverse collection of our people and their ideas inspires innovation in everything we do. Imagine what you could do here! Join Apple, and help us leave the world better than we found it.

The Apple Services Engineering (ASE) team builds and provides the systems and infrastructure that fuel Apple’s services (such as iCloud, iTunes, Siri, and Maps). We are the foundation on which Apple’s software developers build the products that our customers love. We are looking for passionate and talented Site Reliability Engineers to continue our focus on providing our customers with the highest-quality Apple Services experience. Our services have to scale globally, stay highly available, and “just work.” If you love designing, engineering, and running systems and infrastructure that will help millions of customers, then this is the place for you!

The Cloud Monitoring SRE organization is specifically tasked with enabling other teams to better understand their infrastructure and services by providing world-class observability capabilities. Keeping Apple services up and running 100% of the time is a challenging job. Accurately monitoring the health of every application and piece of infrastructure that makes up the Apple ecosystem 100% of the time is an order of magnitude more challenging. As a Site Reliability Engineer on the Cloud Monitoring Team at Apple, you will work to improve the reliability and performance of the software systems that provide visibility into the services and infrastructure that run Apple. Our monitoring, alerting, and visualization platform analyzes billions of metrics per minute and forms the central nervous system of Apple’s architecture.

You will work shoulder-to-shoulder with our engineering teams to design and build the next generation of cloud and systems monitoring infrastructure, focusing on automation, availability, performance, and above all efficiency at 'reach every user on the planet' scale. You will dive deep into gnarly operational issues from the software, systems, automation, and process perspectives. You will understand the challenges around integrating disparate infrastructures into new facilities, processes, and procedures.

Key Qualifications

At least 5 years of experience handling services in a large-scale environment.
Strong sense of ownership and integrity demonstrated through clear communication and collaboration
Experience and confidence around incident response and incident management
Experience in managing and scaling distributed systems in a public, private, or hybrid cloud environment
Experience with the Prometheus ecosystem
Practical experience in Python and Bash scripting. Theoretical knowledge of Go, Java, and/or Scala.
Acute drive to automate manual operations and to improve them through repeated iteration
Comfortable with Open Source configuration management and orchestration tools (such as Helm, Puppet, and Spinnaker)
Experience with deploying, supporting and monitoring new and existing services, platforms, and application stacks
Familiarity with microservices architecture and container orchestration with Kubernetes
Expertise in software design and development

Responsibilities:

You will perform deep dives into both systemic and latent reliability issues, and partner with software and systems engineers across the organization to produce and roll out fixes.
You will drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization.
You will participate in code reviews for projects primarily written in Python, Java, and Scala, built on open source products such as FiloDB, and running on virtual and containerized platforms.
You will represent the SRE organization in design reviews and operational readiness exercises for new and existing services.
Use of configuration management and deployment tools
Monitoring of systems and services, optimization of performance, and resource utilization
Runbook implementation for everyday maintenance actions
Incident response, diagnosis, and follow-up on system outages or alerts
Collaborating with a global and asynchronously communicating team (don’t worry if you have never worked remotely; we’ll help you get used to it)

Description

Apple Services Engineering infrastructure is BIG. Operating at our scale, across multiple geographically dispersed data centers and servicing hundreds of millions of users, presents unique challenges. As an SRE at Apple, you'll need to solve these problems using data, teamwork, and your own expertise. SREs at Apple own the full infrastructure stack, from device driver performance debugging to content delivery network traffic management; our responsibilities are both broad and deep. ASE runs the majority of its systems on Linux. We run a mix of open source, vendor-licensed, and internally developed tools to perform functions such as system configuration management, provisioning, software deployment, logging, and monitoring. You'll learn these tools and have opportunities to improve them. Our team is collaborative; we work closely with the development teams we support to deliver the best results for Apple. We think critically and strive to balance the best solution with the need to get things done for each engineering challenge we face. Good ideas are heard and results are rewarded.

Education & Experience

B.S. in computer science or a similar field, or equivalent experience.

Pay & Benefits

At Apple, base pay is one part of our total compensation package and is determined within a range. This provides the opportunity to progress as you grow and develop within a role. The base pay range for this role is between $138,900.00 and $256,500.00, and your base pay will depend on your skills, qualifications, experience, and location.

Apple employees also have the opportunity to become an Apple shareholder through participation in Apple’s discretionary employee stock programs. Apple employees are eligible for discretionary restricted stock unit awards, and can purchase Apple stock at a discount if voluntarily participating in Apple’s Employee Stock Purchase Plan. You’ll also receive benefits including comprehensive medical and dental coverage, retirement benefits, a range of discounted products and free services, and, for formal education related to advancing your career at Apple, reimbursement for certain educational expenses, including tuition. Additionally, this role might be eligible for discretionary bonuses or commission payments as well as relocation. Learn more about Apple Benefits.

Note: Apple benefit, compensation and employee stock programs are subject to eligibility requirements and other terms of the applicable plan or program.

Apple is an equal opportunity employer that is committed to inclusion and diversity. We take affirmative action to ensure equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics.
