Siri Search and Knowledge - Big Data Infrastructure Engineer
Santa Clara Valley (Cupertino), California, United States
Machine Learning and AI
Siri’s universal search engine powers search features across a variety of Apple products, including Siri, Spotlight, Safari, Messages, and News. The Search Data Platform team acts as the “source of truth” for our most fundamental data — such as search activity and content — as well as our core metrics across a range of products. We enable continuous improvement of the search system by building tools for data-driven decision-making and rapid iteration. As part of this group, you will work with one of the most exciting high performance computing environments, with petabytes of data and millions of queries per second. You will have an opportunity to build out the data processing and analysis platform that helps drive development of the products that delight our customers every day.
- 2+ years of experience as a Software Engineer
- Excellent data analysis and problem-solving skills
- Proficiency in one of the following languages: Python, Go, Java, Scala, C++
- Experience with Spark, Hive and/or Impala
- Desire to contribute to a nascent data ecosystem and to build a strong data toolset for the company
- Working with data at scale is a requirement
- Experience designing and managing large scale data pipelines is helpful
- Experience applying algorithms to understand real-world data (classification, anomaly detection, etc.) is a plus
- Excellent interpersonal skills required
You are someone with at least a few of the following traits:
- Is excited about digging into massive petabyte-scale semi-structured datasets
- Has experience developing data extraction and transformation pipelines
- Has experience in distributed systems, database internals, or performance analysis
- Has experience with dimensional analytics platforms (OLAP) and data visualization systems
- Has experience with MapReduce and other big data frameworks, such as Hadoop and/or Spark
- Deep understanding of polyglot data persistence (relational, key/value, document, column, graph, data warehousing)
- Strong dedication to code quality, automation, and operational perfection: unit and integration tests, linting, documentation, etc.
What you will do:
- Empower dozens of engineering teams, hundreds of co-workers, and hundreds of millions of users to dream of new possibilities for the product.
- Develop software to process, transform, and analyze data to identify signals from the billions of events we collect every day (batch, streaming, and low latency APIs).
- Design and build abstractions that hide the complexity of the underlying big data stack (HDFS, Hadoop, Hive, Impala, Spark, Kafka, Parquet, etc.) and allow partners to focus on their strengths: product, data modeling, data analysis, search, information retrieval, and machine learning.
- Build scalable backend services and tools to help partners implement, deploy and analyze data assets with a high level of autonomy and limited friction.
- Optimize end-to-end workflows of data users (crafting libraries, providing abstractions to define jobs, scheduling data pipelines, managing access to data assets, etc.).
- Surface datasets in near-real-time to mission critical products and business applications throughout the company, providing the signals that foster our machine learning algorithms as well as our daily product-defining decisions.
- Automate and handle lifecycle of datasets (schema evolution, metadata store, backfill management, deprecation, migration).
- Improve the quality and reliability of data pipelines (monitoring, retry, failure detection).
Education & Experience
BS in Computer Science, Mathematics, Statistics, or a related field, or equivalent industry experience.