Multimodal Generative Modeling Research Engineer - SIML, ISE

Cupertino, California, United States
Machine Learning and AI


Weekly Hours: 40
Role Number: 200536770
Are you excited about Generative AI? Are you interested in working on cutting-edge generative modeling technologies that enrich the lives of billions of people? We have multiple ongoing efforts involving generative models, and we are looking for technical leaders experienced in training, adapting and deploying large-scale ML models with a focus on multimodal understanding and generation.

We are the Intelligence System Experience (ISE) team within Apple’s software organization. The team works at the intersection of multimodal machine learning and system experiences. The experiences the team oversees include System Experience (Springboard, Settings), Keyboards, Pencil & Paper, Shortcuts and User Safety. These experiences are backed by production-scale ML workflows. Our multidisciplinary ML teams focus on visual understanding of people, text, handwriting and scenes; multilingual NLP for writing workflows, knowledge extraction, conversation understanding and text generation; behavioral modeling for proactive suggestions; and privacy-preserving learning.

We are looking for senior research engineers to architect and innovate multimodal ML technologies and to ensure these technologies can be safely deployed in the real world. An ideal candidate can lead diverse cross-functional efforts spanning ML modeling, prototyping, validation and private learning, has proven ML and Generative AI fundamentals, and can turn research contributions into products. Industry experience in vision-language multimodal modeling, reinforcement learning and human preference learning, and multimodal safety and alignment is important.


We are looking for a candidate with a proven track record in applied ML research. Responsibilities in the role include training large-scale multimodal (2D/3D vision-language) models on distributed backends, deploying compact neural architectures efficiently on device, and addressing a growing set of safety challenges to make the models robust and aligned with human values. Ensuring quality in the wild, with an emphasis on model safety, fairness and robustness, will constitute a meaningful part of the role. You will work closely and cross-functionally with ML researchers, software engineers, and hardware and design teams. The primary responsibilities center on enriching the multimodal capabilities of large language models. The user experience initiative focuses on aligning image and video content to the space of LMs for visual actions and multi-turn interactions.

Key Qualifications

  • 3+ years of experience in ML and Generative Modeling fundamentals
  • Experience adapting pre-trained Vision/Language models for downstream tasks & human alignment
  • Modeling experience at the intersection of NLP and vision
  • Familiarity with distributed training
  • Proficiency in using ML toolkits, e.g., PyTorch
  • Awareness of the challenges associated with transitioning a prototype into a final product
  • Proven record of research innovation and demonstrated leadership in both applied research and development

Education & Experience

M.S. or PhD in Electrical Engineering, Computer Science or a related field (mathematics, physics or computer engineering), with a focus on NLP, computer vision and/or machine learning; or comparable professional experience.

Pay & Benefits

  • Apple is an equal opportunity employer that is committed to inclusion and diversity. We take affirmative action to ensure equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics. Learn more about your EEO rights as an applicant.