Join our team as a Lead Site Reliability Engineer dedicated to providing advanced support for critical Azure-based systems.
You will address complex cloud challenges, enhance system observability, and strengthen reliability using Kubernetes, monitoring platforms, and Infrastructure-as-Code. If cloud reliability excites you and collaboration across teams inspires you, apply now to contribute to our innovative projects.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
Responsibilities
- Resolve complex incidents to ensure system availability
- Maintain reliability and performance of Azure-based enterprise infrastructure
- Deploy observability, monitoring, and logging tools
- Automate infrastructure management with Terraform and scripting technologies
- Improve system performance and uptime through centralized monitoring
- Collaborate with multiple teams to enhance service reliability
- Perform root cause analysis and oversee postmortems for incidents
- Configure deployment pipelines in Azure DevOps for secure workflows
- Write and maintain automation scripts for incident recovery and recurring tasks
- Enhance monitoring frameworks with platforms like Prometheus and Grafana
- Respond promptly to incidents to meet SLA expectations
- Facilitate integration of monitoring data from Azure and AWS environments
- Advance service reliability and observability practices continuously
- Document processes and incident resolutions thoroughly
- Take part in Agile team events and balance task priorities
Requirements
- At least 5 years of experience in site reliability engineering or comparable DevOps roles
- 1+ years of demonstrated leadership experience
- Knowledge of Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
- Expertise in infrastructure automation using Azure DevOps and Terraform
- Proficiency in scripting languages such as Bash, PowerShell, and Python
- Skills in monitoring tools including Prometheus and Grafana
- Background in incident management and ITSM processes with analytical capability for root cause investigations
- Competency in resolving technical challenges promptly in high-pressure situations
- Experience in Agile workflows and fast-paced operational environments
- Ability to communicate effectively in writing and verbally for teamwork and documentation
- Ability to configure proactive alerts that help prevent SLA breaches
- Understanding of cloud scaling techniques and security best practices
- Knowledge of Kubernetes administration for orchestration tasks
- Ability to collaborate with diverse functional teams seamlessly
- English proficiency of B2 or higher
Nice to have
- Background in AWS services, such as EKS, RDS, CloudWatch, and X-Ray
- Familiarity with distributed logging systems and tools for incident automation
- Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
- Understanding of Kubernetes configurations for scaling and advanced networking setups
- Proficiency in observability tools such as OpenSearch for AWS environments
We offer
- Connectivity Bonus (15,000 ARS paid with the salary receipt at the end of each month as a non-wage item).
- Medicina Prepaga (Private health coverage for the employee and immediate family).
- Paternity Leave (Two days are added to what is established by law, for a total of 4 days).
- Discount card.
- English Training (English lessons, twice per week).
- Training Program (Access to multiple customized training plans according to the needs of each role within the company).
- Marriage bonus (The company doubles the allowance established by law that ANSES offers).
- Referral Program (A referral bonus is paid when a candidate referred by an employee joins the company).
- External Agreements and Discounts.
- Vacation (14 calendar days a year).
By applying to our role, you are agreeing that your personal data may be used as set out in EPAM's Privacy Notice and Policy.