Join our team as a Senior Site Reliability Engineer focused on delivering advanced support for critical Azure-based systems.
You will troubleshoot complex cloud environments, enhance observability, and implement reliability solutions using Kubernetes, monitoring tools, and Infrastructure-as-Code. If you are passionate about cloud reliability and enjoy collaborating across teams, apply now to contribute to our cutting-edge projects.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
Responsibilities
- Troubleshoot and resolve complex incidents to maintain system uptime
- Ensure reliability and performance of Azure-based enterprise infrastructure
- Implement observability, monitoring, and logging solutions
- Automate infrastructure provisioning and deployment using Terraform and scripting
- Optimize system performance and uptime through proactive monitoring and alerting
- Collaborate with cross-functional teams to improve service reliability
- Conduct root cause analysis and postmortems for incident management
- Manage deployment pipelines in Azure DevOps for secure and scalable workflows
- Develop and maintain automation scripts for routine tasks and incident recovery
- Enhance monitoring frameworks with tools like Prometheus and Grafana
- React quickly to incidents to avoid SLA degradation
- Integrate monitoring data from Azure and AWS environments
- Support continuous improvement of service reliability and observability practices
- Document technical processes and incident reports
- Participate in Agile team activities and prioritize competing tasks
Requirements
- Minimum 3 years of experience in site reliability engineering or related DevOps roles
- Hands-on experience with Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
- Strong expertise in Azure DevOps and Terraform for infrastructure automation
- Proficiency in scripting with Bash, PowerShell, and Python
- Experience with monitoring and observability tools such as Prometheus and Grafana
- Solid background in incident management and ITSM processes with root cause analysis capabilities
- Ability to troubleshoot and debug complex technical issues in real time
- Experience working in fast-paced Agile environments
- Strong verbal and written communication skills for collaboration and reporting
- Proactive approach to setting alerts and preventing SLA degradation
- Experience with cloud infrastructure scaling and security best practices
- Knowledge of Kubernetes administration and orchestration
- Ability to collaborate effectively with cross-functional teams
- English language proficiency at B2 level or above
Nice to have
- Hands-on experience with AWS services including EKS, RDS, CloudWatch, and X-Ray
- Familiarity with distributed logging pipelines and incident automation tools
- Knowledge of advanced Kubernetes use cases for scaling and network configurations
- Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
- Experience with observability tools like OpenSearch for AWS workloads
We offer
- Connectivity Bonus (15,000 ARS paid at the end of each month with the salary receipt, as a non-remunerative item).
- Medicina Prepaga (Private health plan covering the employee and their immediate family).
- Paternity Leave (Two additional days on top of the legal entitlement, for a total of 4 days).
- Discount card.
- English Training (English lessons, twice per week).
- Training Program (Access to multiple customized training plans according to the needs of each role within the company).
- Marriage Bonus (The company doubles the statutory allowance paid by ANSES).
- Referral Program (A referral bonus is paid when a candidate referred by an employee joins the company).
- External Agreements and Discounts.
- Vacation: 14 calendar days per year.
By applying to our role, you are agreeing that your personal data may be used as set out in EPAM's Privacy Notice and Policy.