Join the Cloud Hosting Platform Team as a Senior Software Engineer, building the next-generation Observability Platform in the public cloud.
Are you passionate about system resiliency, performance, and detecting and recovering quickly from system failures? This is a great opportunity to make a huge impact by increasing resiliency and reducing Mean Time to Detect (MTTD) and Mean Time to Recovery (MTTR) for hundreds of critical services and applications across Intuit.
* Design, develop and run a highly scalable, high-performance platform for ingesting, streaming and reporting metrics time series with a high degree of dimensionality (the platform is expected to make metrics available for dashboards and alerts within a tight SLA of 15 seconds or less).
* Build web-based visualization tools that show aggregated traffic flows across hundreds of microservices and platform infrastructures, isolating symptoms from causes to reduce time to recover through faster identification of the point of failure.
* Build the new logging components and infrastructure to enable faster logging and log searches.
* Design, build and operate highly scalable, available and performant metrics and trace pipelines, storage and visualization.
* Provide guidance to other engineers on how best to use the observability components, such as the metrics and logging systems.
* Implement tooling to recommend the appropriate hosting model to optimize both cost and performance.
* Develop and operate advanced performance testing tools to enable developers to test performance of complex systems earlier in the SDLC (Software Development Life Cycle).
* Build automation for reporting operational insights and analytics on incidents, production logs and system usage.
* Automate creation of monitoring dashboards and alerts for web services and web/mobile apps.
* B.S. or M.S. in computer science or a related field.
* At least 5 years of experience with object-oriented languages.
* At least 3 years of experience developing complex, highly available and highly scalable systems including backend web services and front-end web applications.
* At least 3 years of experience in the field of reliability, performance, system monitoring, and operational excellence (DevOps experience is also preferred).
* At least 3 years of experience working in the public cloud (such as AWS or Google Cloud).
* Deep expertise in Java/JEE or Python is required.
* Experience with REST service development using Spring or JAX-RS.
* Experience building responsive Single Page Web Applications using modern front-end technologies like React, HTML5, CSS3, and other JS frameworks.
* Expertise with unit testing & Test Driven Development (TDD).
* Experience with operational excellence, especially automating manual processes, eliminating incidents, and reducing detection and recovery times for system failures.
* Experience running operational readiness, incident management and root cause analyses (RCAs) for complex and connected systems.
* Experience with system monitoring tools such as Wavefront, Splunk and AppDynamics.
* Experience with OpenTracing platforms such as Jaeger and Zipkin.