Site Reliability Engineer
Uptake is a Chicago-based predictive analytics SaaS platform provider that empowers major industry leaders to optimize performance, reduce asset failures and enhance safety. At Uptake, we combine our strengths in machine learning, analytics, data visualization and software development with the expertise of our industrial partners. The result is enormous savings in development time and resources for Uptake’s partners and a proven, industrial-grade software platform that delivers value to partners and their end customers.
What You'll Do:
As a Site Reliability Engineer, you will proactively monitor and improve end-to-end system performance and identify deficiencies and potential failures throughout our infrastructure. You will build deep, end-to-end knowledge of our platform and continuously create improvements and automation to enhance its durability, performance and maintainability. You are central to the automation of everything at Uptake.
- Support and perform maintenance across product and data environments/systems
- Proactively monitor events, investigate issues, analyze solutions, and drive problems through to resolution, using a wide variety of Ops tools and monitoring platforms to build understanding and enable persistent monitoring of system availability, performance, and capacity
- Develop and maintain scalable alerting, ticketing, and logging tools for debugging and monitoring
- Maintain our monitoring systems and develop new metrics and monitoring dashboards as additional coverage becomes necessary
- Be on call to troubleshoot downtime and diagnose root causes
- Support network management and maintain a high-availability environment
- Mentor engineers as a technology subject matter expert, stretching their knowledge and perspective
What You'll Need:
- Excellent understanding of Linux, Bash and shell scripting
- Knowledge of and experience with the network stack, protocols, and network management and monitoring tools
- Knowledge of AWS technologies - EC2, S3
- Experience with group services, including configuration, synchronization, and naming protocols, preferably using Apache ZooKeeper
- Experience with large-scale data processing, preferably using Apache Spark
- Experience with a distributed log tool such as Kafka
- Experience with automation tools: Puppet, Chef, Docker, Jenkins and/or Ansible
- Knowledge of Mesos/Marathon and Docker for container orchestration
- Experience with Big Data (NoSQL) and standard enterprise databases, including data modeling, testing and deployment support; proficiency in Cassandra, HBase, or PostgreSQL is strongly preferred
- Experience with JVM and Java stack: Tomcat, Jetty
- Ability to work collaboratively in a fast-paced, entrepreneurial environment
- Experience working with Agile methodologies
- Excited by Big Data technologies and interested in integrating statistics and analytics to make our systems perform even better