Camera & Photos - Site Reliability Infrastructure Engineer
Santa Clara Valley (Cupertino), California, United States
Software and Services
Imagine what you could do here. At Apple, new ideas have a way of becoming great products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish. The Camera & Photos Infrastructure team is looking for a highly motivated Infrastructure Engineer to join our team. You will be responsible for building and maintaining storage servers and GPU compute nodes, and for working with cross-functional and cross-organizational teams to understand, augment, and implement the systems, processes, and tools that are used for the quality software business. You will also be responsible for the provisioning, installation, configuration, operation, and maintenance of systems hardware, software, and related infrastructure across a multi-OS environment including Windows, BSD, and Linux systems. Ideally, you are a strong generalist with solid experience managing HPC (high performance computing) critical servers, storage services, and interactive Unix/Linux environments. Exposure to performance troubleshooting in Linux, macOS, and iOS is highly desirable.
- 5+ years of experience managing services in a distributed, high-demand critical Linux system and *nix environment (a plus)
- Demonstrated deep understanding of UNIX/Linux flavors (Linux KVM, BSD, Mac OS X, iOS, CentOS/Red Hat Enterprise Linux, Ubuntu, Debian Linux administration)
- Strong understanding of ZFS, APFS, distributed compute and experience building RAID arrays with encryption to handle GDPR / PII data requirements is a plus
- Experience building PC hardware (a plus)
- Strong knowledge of network protocols and network-based services, including routing, network load balancing, and web proxy appliances/applications/firewalls
- Reliability - drive fault detection and correction, performance and uptime at global scale
- Monitoring - instrument systems to track and understand how they are performing at any time
- Automation and orchestration frameworks (Ansible experience is a plus) to enable accelerated deployment of infrastructure, application, and software configuration, and infrastructure as code
- Expertise in both building and using log aggregation and distributed monitoring tools (Splunk, Elastic Stack, etc.)
- Experience building and supporting containerized application technologies including Docker, Kubernetes, Mesos
- Familiarity with CI/CD tools and deployment processes
- Proficient with various programming languages and tools such as Python/Ansible/Java for building automation or integrating with APIs
- Proven understanding and experience with centralized configuration management, coordination and provisioning technologies, such as Ansible, Chef, Puppet, etc.
- Excellent interpersonal skills; capable of working with cross-functional technical and business teams and varying levels of management
- Experience implementing and working with open source projects
- Strong project management skills, including excellent presentation development
- Passion for writing detailed solution specifications, diagrams, best practices/standards documentation, operating procedures, test plans/test reports, etc.
- Strong team player with a high degree of flexibility
- Excellent verbal and written communication skills and high attention to detail
- The ability to analyze problems and quickly develop creative solutions while adapting to a dynamic environment
The purpose of the role is to ensure that Apple’s Camera and Photos storage and GPU compute systems are proactively managed to maximize uptime for Apple’s mission-critical delivery systems. We build the automation and tooling required to orchestrate service deployment and eliminate manual, repetitive effort.
- 5+ years of system administration experience managing enterprise UNIX environments, with significant and recent experience on Linux platforms
- Fundamentals of data clustering and networking; experience with virtualization platforms
- Must be able to work in a fast and agile environment
- Desire to learn and contribute to our automation and to supervising labs
- Install, configure, and maintain server/client (Linux) and HPC (High Performance Computing) environments
- Perform scheduled system maintenance activities, including hardware and software upgrades and application upgrades/patching
- Install general software updates, security updates, and service packs for the OS on both server and desktop environments
- Troubleshoot, document, and resolve software, hardware, and network issues
- Interact with vendors for evaluation of new software/hardware, issue resolution, and licenses
- Work with corporate IT and Security teams to establish processes and compliance with policies
- Perform technical research and development for continued innovation in the infrastructure
- Experience in an engineering environment
- You should be comfortable working with Unix/Linux server environments attached to enterprise-class storage
- Experience with technologies such as LDAP, NFS, databases (CouchDB, SQL...), Apache, PHP, Kerberos, auto-mount, load balancers, configuration management (Ansible, Chef, Puppet), backup/restore, CIFS, AFP, DNS
Education & Experience
BS in Computer Science or equivalent experience related to Applied Information Technology or similar disciplines preferred
- Working knowledge of installation, configuration, and maintenance of Linux storage servers
- Experience in enterprise-level deployments, including web and data servers, virtualization platforms, HPC, desktops, databases, encrypted protocols with NFS, SMB solutions, etc., is a plus