Join our team as a Lead Site Reliability Engineer dedicated to providing advanced support for critical Azure-based systems.

You will address complex cloud challenges, enhance system observability, and strengthen reliability using Kubernetes, monitoring platforms, and Infrastructure-as-Code. If cloud reliability excites you and collaboration across teams inspires you, apply now to contribute to our innovative projects.

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Responsibilities

Resolve complex incidents to ensure system availability
Maintain reliability and performance of Azure-based enterprise infrastructure
Deploy observability, monitoring, and logging tools
Automate infrastructure management with Terraform and scripting technologies
Improve system performance and uptime through centralized monitoring
Collaborate with multiple teams to enhance service reliability
Perform root cause analysis and oversee postmortems for incidents
Configure deployment pipelines in Azure DevOps for secure workflows
Write and maintain automation scripts for incident recovery and recurring tasks
Enhance monitoring frameworks with platforms like Prometheus and Grafana
Respond promptly to incidents to meet SLA expectations
Facilitate integration of monitoring data from Azure and AWS environments
Advance service reliability and observability practices continuously
Document processes and incident resolutions thoroughly
Take part in Agile team events and balance task priorities

Requirements

Minimum 5 years of experience in site reliability engineering or comparable DevOps roles
1+ years of demonstrated leadership experience
Knowledge of Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
Expertise in infrastructure automation using Azure DevOps and Terraform
Proficiency in scripting languages such as Bash, PowerShell, and Python
Skills in monitoring tools including Prometheus and Grafana
Background in incident management and ITSM processes with analytical capability for root cause investigations
Competency in resolving technical challenges promptly in high-pressure situations
Experience in Agile workflows and fast-paced operational environments
Ability to communicate effectively in writing and verbally for teamwork and documentation
Capability to configure alerts that prevent SLA breaches proactively
Understanding of cloud scaling techniques and security best practices
Knowledge of Kubernetes administration for orchestration tasks
Ability to collaborate with diverse functional teams seamlessly
English proficiency of B2 or higher

Nice to have

Background in AWS services, such as EKS, RDS, CloudWatch, and X-Ray
Familiarity with distributed logging systems and tools for incident automation
Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
Understanding of Kubernetes configurations for scaling and advanced networking setups
Proficiency in observability tools such as OpenSearch for AWS environments

We offer

Connectivity Bonus (15,000 ARS is paid with the salary receipt at the end of each month as a non-wage item).
Private health insurance, Medicina Prepaga (It covers the employee and their immediate family).
Paternity Leave (Two additional days are added to what is established by law, for a total of 4 days).
Discount card.
English Training (English lessons, twice per week).
Training Program (Access to multiple customized training plans according to the needs of each role within the company).
Marriage bonus (The company doubles the allowance established by law that ANSES offers).
Referral Program (A referral bonus is paid when a candidate referred by an employee joins the company).
External Agreements and Discounts.
Vacations: 14 calendar days a year

By applying to our role, you are agreeing that your personal data may be used as set out in EPAM's Privacy Notice and Policy.
