MLOps Tech Lead

Healwell AI Inc Toronto, Ontario, Canada

About this position

Job description 

More than 7,000 rare diseases have been identified, affecting over 300 million patients worldwide and 1 in 12 patients in Canada. Many of these patients remain undiagnosed and unaware of their condition, resulting in a poor quality of life and potentially serious consequences. Healwell AI (HWAI) (TSX:AIDX) is a leader in AI-enabled clinical intelligence for rare diseases and specialty conditions. Through our proprietary clinical intelligence platform and deep analytical tools, HWAI allows physicians to quickly understand complex, high-risk patients and place them on the right care pathways, leading to better outcomes for patients, their families, and the healthcare system.

HWAI is seeking an experienced MLOps Tech Lead to architect our next-generation AI infrastructure and lead a talented team of engineers. In this pivotal role, you will bridge the gap between Data Science, Cloud Engineering, and DevOps. You will be hands-on with our Azure/Databricks stack, and you will also set the technical vision, establish engineering standards, and ensure our AI platforms are secure, scalable, and cost-efficient. You will own the roadmap for our MLOps maturity, moving us from manual execution to fully automated, observable, and resilient AI systems. You will have the opportunity to enhance your technical leadership skills while contributing to impactful projects in the healthcare space.

Responsibilities 

The successful candidate will work in a multifaceted role encompassing Cloud Architect, Cloud Security, and DevOps/MLOps responsibilities:

Lead, mentor, and grow a team of MLOps and Cloud Engineers; conduct code reviews, facilitate technical design sessions, and foster a culture of engineering excellence.

Define the high-level architecture for our end-to-end ML platform on Azure, making critical decisions on "build vs. buy" for tooling and infrastructure.

Oversee the Terraform codebase; implement modular, reusable infrastructure patterns and enforce state management policies to prevent drift.

Own the reliability (SRE) of machine learning systems. Define SLAs/SLOs for model inference and data pipelines, and lead root cause analysis (RCA) for critical incidents.

Manage cloud budgets (FinOps) for compute/Databricks usage and enforce rigorous security postures (IAM, network isolation, private endpoints), ensuring compliance with industry standards.

Evolve our CI/CD pipelines from simple automation to advanced deployment strategies (Blue/Green, Canary releases, Shadow deployment) for ML models.

Deploy and maintain cloud-based ML models in production, ensuring performance and scalability.

Design, deploy, and manage scalable, secure, and highly available cloud infrastructure on Azure, utilizing infrastructure as code (IaC) principles.

Build monitoring systems for data quality, model performance, and pipeline health.

Collaborate with cross-functional teams to define problems and develop solutions.

Develop and maintain documentation for cloud architecture, processes, and systems.

Diagnose and resolve issues related to application and model performance, pipeline failures, and infrastructure problems.

Required Qualifications 

Bachelor’s degree in Computer Science, Engineering, or a related field.

7+ years of total experience in DevOps, Cloud Engineering, or Software Engineering.

3+ years specifically focused on MLOps or Data Engineering at production scale.

2+ years in a technical leadership or mentoring role (Team Lead, Principal Engineer, etc.).

Deep proficiency with Azure cloud and cloud-native services.

Proficiency in Python and shell scripting.

Hands-on experience with containerization technologies (Docker) and orchestration platforms (Kubernetes).

Advanced mastery of Terraform.

Deep hands-on experience with Databricks (MLflow, Spark, Unity Catalog).

Proven experience with orchestration tools (Dagster preferred).

Knowledge of Postgres or equivalent database management.

Experience with containerization, infrastructure as code, and DevOps/MLOps practices.

Strong problem-solving skills and the ability to work both independently and collaboratively.

Preferred Qualifications 

Certifications like Azure Solutions Architect Expert or DevOps Engineer Expert are desirable.

Relevant certifications in security domains.

What You'll Work With 

Data Platform: Databricks (Spark, Delta Lake) + Weaviate vector store 

Orchestration: Dagster for pipeline management and scheduling 

Cloud: Azure services for compute, storage, and ML services 

Languages: Python, shell 

Tools: Docker, Kubernetes, Terraform, Git, CI/CD pipelines 

Monitoring: Custom dashboards, alerting systems, and model performance tracking 

 

Culture & Work Environment 

Communication: We value open and honest communication. Regular check-ins and team meetings ensure everyone is aligned and informed.

Transparency: Our decision-making processes are transparent, encouraging input from all team members. Your ideas and feedback will be valued.

Promptness: We maintain a fast-paced work environment and expect team members to be prompt in delivering work and meeting deadlines.

Guidance: You will be supported and guided by our VP of Technology, who will provide mentorship and direction throughout your time in this role.

What We Offer 

Hands-on experience with real-world data challenges in the medical field.

Opportunities to expand your technical skill set and work with advanced AI tools.

A collaborative team environment that fosters learning and innovation.

We look forward to receiving your application and hope to welcome you to the HWAI team!  

HWAI is an equal opportunity employer that welcomes all applicants, including persons with disabilities, visible minorities, women, and Aboriginal peoples. HWAI will provide reasonable accommodation to qualified job applicants with a disability, on request, and will notify successful applicants of policies relating to the accommodation of employees with disabilities. We would like to thank all applicants for their interest in HWAI, but please note that only successful candidates will be contacted.

You can learn more about HWAI at https://healwell.ai