Site Reliability Engineer (SRE)

Santa Clara Valley (Cupertino), California, United States
Not Available

Summary

Posted: Nov 15, 2018
Weekly Hours: 40
Role Number: 114078410
Join Maps SRE in the multi-billion-dollar-a-year Services organization of the Internet Services Operations team. We are seeking a number of extraordinary Site Reliability Engineers as we enhance the support model for our Maps business, which millions of customers use every day. We are hiring high-quality engineers whose diverse experiences and skill sets make them uniquely qualified to work in an environment that is fast-paced, complex, and extremely large. You will need to be a teammate who works effectively with other members of the global team. Our system has to scale globally, stay highly available, and “just work”. That's a tall order, and we're looking to add more talented and passionate engineers who love challenges. If you'd love to join this amazing team, we would love to hear from you.

Key Qualifications

  • Solid Linux background and an understanding of the internals of Hadoop cluster setup, configuration, tuning, and automation at the cluster level.
  • Troubleshoot and develop on Hadoop technologies such as HDFS, YARN, Hive, Pig, Sqoop, ZooKeeper, and Spark.
  • Work with cluster subscribers and users to ensure efficient resource usage in the cluster and alleviate multi-tenancy concerns.
  • Monitor cluster health and build proactive tools to look for anomalous behavior.
  • Supervise Hadoop jobs using a scheduler.
  • Design, develop, install, configure, and maintain Hadoop clusters.
  • Fine-tune applications and systems for high performance and higher-volume throughput.
  • Take care of the day-to-day running of Hadoop clusters.
  • Proactively monitor Hadoop/Spark applications.

Description

You will be responsible for the application and all aspects of it in production, including the user experience. You will:

  • Work closely with developers to support new features, services, and releases, and become an authority on our services.
  • Monitor site reliability and performance.
  • Scale infrastructure to meet demand.
  • Fix site-down issues.
  • Continuously monitor and improve the quality of our infrastructure.
  • Develop automation tools.
  • Document system design and procedures.
  • Participate in the on-call rotation.

Education & Experience

Bachelor’s degree in Computer Science or equivalent industry experience

Additional Requirements

  • Minimum Qualifications:
      • BS in Computer Science or a related technical discipline, or equivalent practical experience
      • 3 years of Linux experience
      • 3 years of distributed systems experience
      • Experience with Python, Java, and Bash
  • Preferred Qualifications:
      • Experience with large-scale Hadoop deployments (more than 100 nodes, more than 3 clusters)
      • 5 years of academic or industry experience