India Openings

Lead NOC/Site Reliability Engineer

Job ID: Lea-ETP-Pun-980

Location: Pune

Responsibilities

  • 6+ years of experience in SRE, DevOps, or infrastructure management.
  • Lead the NOC/SRE team from the front, fostering a culture of proactive monitoring, rapid response, and continuous improvement.
  • Act as the primary escalation point for major incidents, providing technical guidance and decision-making.
  • Collaborate with DevOps, Engineering, and Product teams to enhance system reliability.
  • Define best practices, incident response protocols, and runbooks for the team.
  • Lead log tracing and deep troubleshooting for infrastructure, network, and application issues.
  • Reduce MTTR (Mean Time to Resolution) and improve incident management processes.
  • Good knowledge of Terraform, Kubernetes, Docker, and cloud architectures.
  • Proficiency in monitoring and observability tools (New Relic, Prometheus, Datadog, etc.).
  • Understanding of CI/CD pipelines, automation, and infrastructure as code (IaC).
  • Basic scripting skills in Python, Go, Shell, or similar.
  • Strong troubleshooting skills for complex distributed systems.
  • Ability to mentor junior engineers and drive SRE best practices.
  • Willingness to work in a 24×7 shift rotation and participate in on-call responsibilities.
  • Strong problem-solving skills and the ability to work in a fast-paced environment.
  • Strong incident management, troubleshooting, and RCA skills.

Qualifications

  • 6+ years of experience in Site Reliability Engineering (SRE) / NOC / DevOps roles.
  • Proven leadership experience in managing or mentoring a team.
  • Hands-on experience with Terraform for Infrastructure as Code (IaC).
  • Experience in Python for automation and scripting.
  • Expertise in troubleshooting complex infrastructure and application issues.
  • Strong knowledge of log tracing, distributed tracing, and observability tools (e.g., ELK, Splunk, Grafana, Prometheus, OpenTelemetry).
  • Deep understanding of SLAs, SLOs, and error budgets.
  • Experience with cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes, Docker).
  • Familiarity with CI/CD pipelines and GitOps practices.
  • Strong problem-solving skills and the ability to make quick, data-driven decisions under pressure.