Senior Machine Learning Platform Engineer
- Remote/Hybrid
- Permanent
I am collaborating with a very exciting, incredibly well-backed biotech start-up with offices in Amsterdam and Zurich.
They are building a cutting-edge data platform and developing a multimodal foundation model that leverages complex datasets to improve cancer diagnosis, treatment, and patient outcomes. As part of their team, you'll be at the forefront of innovative technologies that help drive advancements in personalized medicine and cancer research.
They are seeking a highly skilled Senior Platform Engineer (Machine Learning) with a passion for building scalable ML platforms and ensuring a high-availability experience to empower their AI research team in their daily work. You'll play a vital role in making their ambitious AI healthcare solutions a practical reality. This exciting role will be based in either the Netherlands or Switzerland.
Your responsibilities:
Design and build my client's multi-tenant machine learning platform, including their large-scale distributed training systems;
Create robust distributed training and inference solutions for maximum computational efficiency;
Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for their large training runs;
Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments;
Ensure compliance with security best practices and industry standards.
Qualifications/requirements:
3+ years of experience building production ML platforms and systems;
Experience building machine learning systems and GPU workloads and optimizing their latency and throughput;
Knowledge of distributed training frameworks (e.g. Ray, Dask, PyTorch Lightning);
Experience with at least one cloud platform (e.g. AWS, Azure or Google Cloud);
Strong coding skills in at least one programming language (e.g. Python, Scala, Java, C++);
Excellent problem-solving and communication skills;
Self-motivated and able to work well in a fast-paced startup environment.
Nice to have:
Track record of successfully scaling ML platforms;
Fundamentals of modern Deep Learning;
Experience with CI/CD tools (e.g. GitLab CI/CD, GitHub Actions or CircleCI), containerization (e.g. Docker) and orchestration tools (e.g. Kubernetes, Helm, Kustomize);
Knowledge of monitoring, logging, alerting and observability tools (e.g. Prometheus, Grafana, ELK Stack or Datadog);
Familiarity with infrastructure-as-code tools (e.g. Terraform, CloudFormation or Pulumi);
Understanding of networking, security, and system administration concepts;
Experience with high-performance computing (HPC) systems and workload managers (e.g. Slurm).
Benefits:
Comprehensive salary
Long term incentives
25+ days holiday
Clear development pathway
Following your application, Joe Templeman, a specialist AI recruiter, will discuss the opportunity with you in detail.
He will be more than happy to answer any questions relating to the industry and the potential for your career growth. The conversation can also move on to other opportunities that are available now or will become available imminently.
This position has been highly popular, and it is likely to close early. We recommend applying as soon as possible to avoid disappointment.
Please click ‘apply’ or contact Joe Templeman for any further information.
Joe Templeman
Recruitment Manager – Barrington James
Email: jtempleman (at) barringtonjames.com