AI/ML-Systems Engineer (Machine Learning), Machine Learning Platforms (Beijing / Shanghai)

Beijing, China
Software and Services

Summary

Posted:
Weekly Hours: 40
Role Number: 200353378
The AI/ML Production Engineering China team is looking for an exceptional Systems Engineer with experience in Machine Learning (ML) infrastructure services and applications to work on local and global projects spanning platforms for computation, data retention, data processing pipelines and result delivery. The Systems Engineer is expected to qualify as a technical leader, with the potential to design and build architecture through cross-organizational collaboration. This high-impact role is essential to creating the highest-quality user experience that Apple's internal and external customers expect and love.

Key Qualifications

  • 10+ years of Systems or AI/ML production-service experience, commensurate with running cutting-edge hybrid cloud services in China and the rest of the world
  • Self-motivated and proactive, with demonstrated creative and critical thinking capabilities
  • Ability to identify problems in depth and to distinguish clearly between purposes and measures
  • Solid understanding of system architecture and large-scale service or computational platform operations
  • Demonstrated understanding of system management, covering configuration, security, performance, troubleshooting and usage accounting
  • Proficiency in scripting and programming languages, including Bash, Python and Go, with the ability to select the right language as the tool for a given problem
  • Knowledge of large-scale data storage and processing using SQL and Cassandra, HDFS and S3, YARN and Spark
  • Knowledge of ML as well as experience in developing real-world ML jobs
  • Experience designing and implementing systems to support ML applications
  • Experience in large-scale service and job deployment, using an orchestration framework (Kubernetes) and cloud services for large-scale projects
  • Experience with observability of system behavior (e.g. Prometheus, Grafana)
  • Strong sense of thoroughness: attending to details, delivering running code and contributing to the organization's collective understanding
  • Sense of speed and prioritization, driving what matters with constrained resources while delivering high-quality results
  • Good communication skills with internal and external teams, in both English and Chinese

Description

The Systems Engineer will perform the following tasks in collaboration with team members in China and around the world:

  • Analyze the requirements, demands, constraints and challenges of machine learning in local or global environments; design or redesign platform architecture to improve its scalability and agility and to enable new, high-impact use cases
  • Develop and implement the resulting design, bringing it to an internal product, with observability to support efficient system management
  • Design and/or enhance automation of operations for infrastructure and platforms, including tools and processes for monitoring, logging and alerting, to improve scalability in both system construction and runtime operations
  • Support Dev and Eng efforts by provisioning operational solutions, co-design ML application architecture, and drive coordination among local and global, internal and cross-functional groups to achieve successful results
  • Create performance profiles for platforms and services, define service level objectives (SLOs), and drive measurement, monitoring and evaluation against these objectives
  • Lead continuous evaluation of system performance and reliability, discover potential faults, and drive root cause analysis (RCA) and fixes

Education & Experience

Master's or PhD degree in Computer Science, Electrical Engineering, or equivalent

Additional Requirements

The following qualifications are also helpful:

  • Knowledge of QA and A/B testing concepts
  • Knowledge of data governance and compliance
  • Experience in Kubernetes administration and development
  • Experience managing a group of people