Site Reliability Engineer - Apple Media Products
Santa Clara Valley (Cupertino), California, United States
Software and Services
Imagine what you could do here. At Apple, new ideas have a way of becoming great products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish. Apple Media Products' SRE team is looking for a world-class Site Reliability Engineer with experience in developing processes, tools, and automation for managing distributed systems in production environments. Our SRE team combines software engineering, systems engineering, and system administration practices to build and run large-scale, massively distributed, fault-tolerant systems. Our software ensures that Apple's services are reliable, scalable and secure, and we leverage both open source and home-grown technologies to provide managed data infrastructure services. We balance our time across automating operations for our growing footprint of deployments, building self-service products to empower internal customers, and increasing the reliability and scalability of our services with application and systems-level improvements. Dynamic, smart people and inspiring, innovative technologies are the norm here. Will you join us in crafting solutions that do not yet exist?
- Deep understanding of the Linux operating system, including the kernel, memory, processes, threads, cgroups, static / shared libraries, IPC, and signals, as well as standard UNIX utilities, programs and packaging.
- Extensive experience in configuration management and fleet orchestration via Puppet, Chef, Ansible, or others.
- Understanding of basic Internet infrastructure services including DNS, DHCP, LDAP, server virtualization, server monitoring, cloud services (AWS S3/EC2/CloudFront/Steps... or equivalent).
- Demonstrated history of automating operations processes via services and tools.
- Fluency in one or more high-level programming languages like Java, Python, Go, Ruby or equivalent.
- Consistent track record of troubleshooting and resolving issues in live production environments and implementing strategies to eliminate them.
- Driven approach to continually improving service levels.
- Comfortable working with large-scale server deployments, both on premises and in public clouds.
- Knowledge of data platforms, including but not limited to: Apache, Kafka, Solr, Redis, MySQL, Cassandra, Hadoop.
- Knowledge of continuous integration, testing methodologies, TDD and agile development methodologies.
- Strong ability and enthusiasm to learn new technologies in a short time. We seek a self-starter with vision and strong leadership capabilities.
- Understanding of how applications operate across distributed resources in diverse geographies.
- Extraordinary communication skills for collaborating across many participating teams.
Architect, author and deliver software to improve the availability, scalability and security of Apple Media Products' internal data infrastructure. Build and manage systems, infrastructure and applications through automation. Deploy, support and monitor new and existing services, platforms, and application stacks. Engage in improving the whole lifecycle of services, from inception through deployment, operations, and refinement. Provide hands-on technical expertise during service-impacting events. Collaborate with other engineers on code reviews, internal infrastructure improvements and process enhancements. Use scalability testing to measure, tune and optimize system performance.
Education & Experience
BS degree in computer science or equivalent field with 5+ years of experience, or MS degree with 3+ years of experience, or equivalent.
- Participate in periodic 24x7 on-call duties.
- This role may require occasional international or transatlantic travel.