Lead NOC/Site Reliability Engineer
Job ID: Lea-ETP-Pun-980
Location: Pune
Responsibilities
- 6+ years of experience in SRE, DevOps, or infrastructure management.
- Lead the NOC/SRE team from the front, ensuring a culture of proactive monitoring, rapid response, and continuous improvement.
- Act as the primary escalation point for major incidents, providing technical guidance and decision-making.
- Collaborate with DevOps, Engineering, and Product teams to enhance system reliability.
- Define best practices, incident response protocols, and runbooks for the team.
- Lead log tracing and deep troubleshooting for infrastructure, network, and application issues.
- Reduce MTTR (Mean Time to Resolution) and improve incident management processes.
- Good knowledge of Terraform, Kubernetes, Docker, and cloud architectures.
- Proficiency in monitoring and observability tools (New Relic, Prometheus, Datadog, etc.).
- Understanding of CI/CD pipelines, automation, and infrastructure as code (IaC).
- Basic scripting skills in Python, Go, Shell, or similar.
- Strong troubleshooting skills for complex distributed systems.
- Ability to mentor junior engineers and drive SRE best practices.
- Willingness to work in a 24×7 shift rotation and participate in on-call responsibilities.
- Strong problem-solving skills and ability to work in a fast-paced environment.
- Strong incident management, troubleshooting, and root cause analysis (RCA) skills.
Qualifications
- 6+ years of experience in Site Reliability Engineering (SRE) / NOC / DevOps roles.
- Proven leadership experience managing or mentoring a team.
- Hands-on experience with Terraform for Infrastructure as Code (IaC).
- Experience in Python for automation and scripting.
- Expertise in troubleshooting complex infrastructure and application issues.
- Strong knowledge of log tracing, distributed tracing, and observability tools (e.g., ELK, Splunk, Grafana, Prometheus, OpenTelemetry).
- Deep understanding of SLAs, SLOs, and error budgets.
- Experience with cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes, Docker).
- Familiarity with CI/CD pipelines and GitOps practices.
- Strong problem-solving skills and the ability to make quick, data-driven decisions under pressure.