We are seeking a Lead ML Infrastructure Engineer to strengthen our MLOps team, focusing on the design and management of our enterprise machine learning platform while advancing scalable ML infrastructure and deployment practices.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
Responsibilities
- Provide expert advice on ML technologies, tools, and MLOps best practices with an emphasis on model observability, tracking, and deployment
- Design and maintain robust batch processing and ML inference pipelines for efficient model execution
- Automate ML model deployment processes through CI/CD pipelines to enhance production workflows
- Monitor deployed models and infrastructure for health, performance, reliability, and scalability
- Ensure seamless integration of ML inference services with other applications or systems
- Enable deployments of ML models that scale efficiently and maintain high performance in production environments
- Collaborate with client stakeholders and team members to ensure requirements are understood and tasks are completed effectively
- Develop infrastructure solutions that support both data processing pipelines and batch inferencing capabilities
- Write comprehensive unit tests to ensure reliability for ML deployment, inference, and post-processing methods
- Maintain proactive and transparent communication with team members and stakeholders to ensure alignment
Requirements
- 5+ years of experience with AWS services and MLOps-focused infrastructure for scalable ML model deployment
- Expertise in infrastructure-as-code tools, enabling efficient and consistent infrastructure provisioning
- Strong background in setting up and monitoring infrastructure for data and ML inference pipelines
- Demonstrated ability to take ownership of tasks and work collaboratively with client stakeholders and teams
- Skills in writing effective unit tests for ML deployment, inference, and related methods
- Proficiency in clear communication with the ability to ask for clarification when necessary
Nice to have
- Knowledge of Google Cloud Platform (GCP) and its ML-specific services
- Proficiency in using Snowflake as a data platform for ML workflows
- Understanding of Feature Store platforms to enhance feature management processes
- Background in Spark and AWS Elastic MapReduce (EMR) for processing distributed datasets
- Familiarity with data curation best practices to support ML model training and high-quality dataset creation
- Availability to participate in on-call rotations to maintain system reliability in production environments
We offer
- Connectivity Bonus (15,000 ARS paid at the end of each month, itemized on the salary receipt as a non-wage concept).
- Medicina Prepaga (Private health plan covering the employee and their immediate family).
- Paternity Leave (Two days are added to the leave established by law, for a total of 4 days).
- Discount card.
- English Training (English lessons, twice per week).
- Training Program (Access to multiple customized training plans according to the needs of each role within the company).
- Marriage Bonus (The company doubles the statutory allowance paid by ANSES).
- Referral Program (A referral bonus is paid when a candidate referred by an employee joins the company).
- External Agreements and Discounts.
- Vacations: 14 calendar days a year.
By applying to our role, you are agreeing that your personal data may be used as set out in EPAM's Privacy Notice and Policy.