Kubernetes AI Jobs

Discover the latest remote and onsite Kubernetes AI roles across top active AI companies. Updated hourly.

Check out 301 new Kubernetes AI roles posted on AI Chopping Block

Analytics Engineer

New
Top rated
Loop
Full-time

Ship critical infrastructure managing real-world logistics and financial data for the largest enterprise in the world. Own the why by building deep context through customer calls and understanding the company's value to customers, pushing back on requirements if a better, faster solution is seen. Demonstrate full-stack proficiency by working across system boundaries, including frontend UX, LLM agents, database schema, and event infrastructures. Leverage AI tools to automate boilerplate so focus can be on quality, architecture, and product taste. Constantly raise the velocity bar by optimizing development loops, refactoring legacy patterns, automating workflows, and fixing broken processes.

$125,000 per year (USD)

Chicago or SF
Maybe global
Hybrid
Python
JavaScript
NLP
OpenAI API
Docker

Forward Deployed Engineer, Lead - AI Engineer

New
Top rated
Reflection
Full-time

As a Forward Deployed Engineer Lead, you will own the end-to-end technical strategy, execution, and delivery of complex agentic applications, from early pre-sales discovery through production deployment. You will partner with Deployment Strategists and Sales to understand enterprise customer needs, architect solutions, and develop transformative agentic applications. You will architect and build complex agentic systems using state-of-the-art models, orchestrate sophisticated LLM workflows, and integrate deeply with enterprise infrastructure. You will collaborate with research teams to adapt and fine-tune models for customer-specific needs and contribute to the internal codebase for inference, fine-tuning, and evaluation. You will own end-to-end deployments across hybrid environments including public cloud, VPC, and on-premises, ensuring production-grade scalability, performance, and reliability. Additionally, you will shape and scale the Forward Deployed Engineering organization by defining playbooks, best practices, and technical standards, and by mentoring engineers to support team growth.

Undisclosed

Seoul, South Korea
Maybe global
Onsite
Python
TypeScript
Docker
Kubernetes
CI/CD

Forward Deployed Engineer - AI Engineer

New
Top rated
Reflection
Full-time

As a Forward Deployed Engineer at Reflection, you will partner with Deployment Strategists and Sales to understand enterprise customer needs, architect solutions, and develop transformative agentic applications. You will build agentic systems using state-of-the-art models, orchestrate LLM workflows, integrate with enterprise infrastructure, and deploy reliable production systems. You will collaborate with research teams to adapt and fine-tune models for customer-specific needs. You will support end-to-end deployments across hybrid environments such as public cloud, VPC, and on-premises, ensuring scalability, performance, and reliability in production. Additionally, you will contribute to evolving playbooks, processes, and best practices as part of the growing Forward Deployed Engineering organization.

Undisclosed

Seoul, South Korea
Maybe global
Onsite
Python
TypeScript
Docker
Kubernetes
CI/CD

Software Engineer, Platform

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, support end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will own the production outcome, taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will ensure full-stack integrity by overseeing the health of the platform, ensuring seamless integration between the AI core and all full-stack components from APIs to UI. Additionally, you will build automated systems to monitor model performance and data drift across geographically dispersed environments, manage the technical lifecycle within diverse regulatory frameworks, lead the response for production issues in mission-critical environments, translate deep technical performance metrics into clear insights for senior international government officials, and partner with Engineering and ML teams to ensure field lessons influence future technical architecture and decisions.

Undisclosed

London, United Kingdom
Maybe global
Onsite
Python
Kubernetes
Docker
Vector Databases
CI/CD

Senior Product Engineer, Growth & Lifecycle Infrastructure - Music & Audio

New
Top rated
Stability AI
Full-time

Lead efforts to drive the design and development of customer-facing multi-modal machine learning inference systems. Work with the Platform and Inference teams on building inference systems for the next generation of models, focusing on optimization, model tuning, and deployment. Partner with leading cloud providers to deliver hosted Stability AI inference solutions. Serve as a strategic thought partner for leaders across the organization on driving business impact through machine learning. Contribute to bringing new Stability models and pipelines into existence. Prototype and productionize inference platform improvements and new features.

Undisclosed

Los Angeles, United States
Maybe global
Hybrid
Python
PyTorch
Docker
Kubernetes
AWS

AI Builder Intern

New
Top rated
Scale AI
Full-time

The Production AI Ops Lead is responsible for designing and developing the production lifecycle of full-stack AI applications, supporting system reliability, real-time inference observability, sovereign data orchestration, secure software integration, and resilient cloud infrastructure for international government partners. They own the production outcome, taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. They oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components from APIs to UI, maintaining a responsive and production-ready environment. The role involves building automated systems to monitor model performance and data drift across geographically dispersed environments to ensure reliability, managing the technical lifecycle within diverse regulatory frameworks, and leading incident response for production issues in mission-critical environments to ensure rapid resolution and prevent recurrence. The lead also translates technical performance metrics into clear insights for senior international government officials and partners with Engineering and ML teams to influence the technical architecture and decisions of future AI use cases.

Undisclosed

San Francisco or New York
Maybe global
Onsite
Python
Kubernetes
Vector Databases
MLOps

Safety Coordinator / Lab Lead

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components from APIs to UI, maintaining a responsive and production-ready environment. You will build automated systems to monitor model performance and data drift across geographically dispersed environments to ensure reliability. You will manage the technical lifecycle within diverse regulatory frameworks and lead the response for production issues in mission-critical environments, ensuring rapid resolution and building guardrails to prevent recurrence. You will translate deep technical performance metrics into clear insights for senior international government officials and partner with Engineering and ML teams to ensure lessons learned influence future technical architecture and decisions.

Undisclosed

San Francisco, United States
Maybe global
Onsite
Kubernetes
Vector Databases
Python
CI/CD
MLOps

Technical Program Manager, Platform

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and resilient cloud infrastructure for international government partners. You will own the production outcome by taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will ensure full-stack integrity by overseeing the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components from APIs to UI to maintain a responsive and production-ready environment. You will build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring reliability. You will manage the technical lifecycle within diverse regulatory frameworks and lead the response for production issues in mission-critical environments to ensure rapid resolution and build guardrails to prevent recurrence. You will translate deep technical performance metrics into clear insights for senior international government officials and partner with Engineering and ML teams to ensure lessons learned influence the technical architecture and decisions of future use cases.

Undisclosed

San Francisco or New York, United States
Maybe global
Onsite
Kubernetes
Vector Databases
Python
MLOps
CI/CD

Staff Engineer, Distributed Storage and HPC & AI Infrastructure

New
Top rated
Together AI
Full-time

As an AI Infrastructure Engineer, you will participate in an on-call rotation to respond to production incidents; build and run infrastructure using Ansible, Terraform, and Kubernetes to scale for many concurrent users; build monitoring systems to ensure high-quality service; design and implement operational processes such as deployments and upgrades; debug production issues across all services and levels of the stack; identify improvements to the product architecture for reliability, performance, and availability; and plan the growth of Together AI's infrastructure.

$190,000 – $270,000 per year (USD)

San Francisco
Maybe global
Onsite
Ansible
Terraform
Kubernetes
Python
CI/CD

Technical Program Manager, Enterprise

New
Top rated
Scale AI
Full-time

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and the resilient cloud infrastructure required for international government partners. You will own the production outcome by taking full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies. You will ensure full-stack integrity by overseeing the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment. You will scale the feedback loop by building automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability. You will manage the technical lifecycle within diverse regulatory frameworks to navigate global compliance. You will lead the response for production issues in mission-critical environments as incident command, ensuring rapid resolution and building guardrails to prevent recurrence. You will translate deep technical performance metrics into clear insights for senior international government officials, and drive product evolution by partnering with Engineering and ML teams to ensure lessons learned in the field influence the technical architecture and decisions of future use cases.

Undisclosed

New York or San Francisco, United States
Maybe global
Onsite
Python
Kubernetes
Vector Databases
MLOps

Want to see more AI Engineer jobs?

View all jobs

Access all 4,256 remote & onsite AI jobs.

Join our private AI community to unlock full job access, and connect with founders, hiring managers, and top AI professionals.
(Yes, it’s still free—your best contributions are the price of admission.)

Frequently Asked Questions

Need help with something? Here are our most frequently asked questions.

What are Kubernetes AI jobs?

Kubernetes AI jobs involve orchestrating containerized machine learning applications at scale. Professionals in these roles manage container deployment for AI workloads, distribute computational tasks across nodes for model training, allocate GPU resources efficiently, and automate ML pipelines. They typically work with frameworks like TensorFlow and PyTorch while ensuring high availability for production AI systems through automated scaling and self-healing capabilities.

What roles commonly require Kubernetes skills?

Roles requiring Kubernetes skills include Machine Learning Engineers who deploy models to production, MLOps Engineers working with platforms like Kubeflow, Data Engineers managing processing pipelines, Platform Engineers supporting agentic AI applications, DevOps/SRE professionals handling containerized deployments, and Cloud Architects designing scalable environments. These positions typically involve maintaining infrastructure that supports the complete machine learning lifecycle.

What skills are typically required alongside Kubernetes?

Alongside Kubernetes, employers typically look for container fundamentals (especially Docker), distributed systems knowledge, CI/CD pipeline experience, and cloud platform familiarity. Programming skills are essential for deployment scripts, while experience with ML frameworks like TensorFlow or PyTorch is valuable for AI-specific implementations. Understanding storage solutions, Kubernetes operators, and automated infrastructure management rounds out the typical skill requirements.

What experience level do Kubernetes AI jobs usually require?

Kubernetes AI jobs typically require mid to senior-level experience. Employers look for professionals who understand containerization concepts, have worked with distributed systems, and can manage complex ML workflows. Prior exposure to cloud environments where Kubernetes runs is important. Candidates should demonstrate practical experience with CI/CD pipelines and familiarity with at least one major ML framework.

What is the salary range for Kubernetes AI jobs?

Kubernetes AI jobs command competitive salaries due to the specialized intersection of container orchestration and machine learning skills. Compensation varies based on experience level, location, and specific industry. Roles requiring both strong AI expertise and Kubernetes infrastructure management typically offer premium compensation compared to general software engineering positions, reflecting the high market value of these combined skill sets.

Are Kubernetes AI jobs in demand?

Kubernetes AI jobs are in high demand as organizations increasingly adopt containerized applications for machine learning workloads. The growth is driven by enterprises scaling their AI operations, edge computing applications, and the need for platform-agnostic infrastructure. Companies seek professionals who can manage the complexity of distributed ML systems, particularly for high-availability production environments and automated ML pipelines.

What is the difference between Kubernetes and Docker in AI roles?

Docker creates containerized applications while Kubernetes orchestrates those containers at scale. In AI roles, Docker is used to package ML applications with their dependencies, while Kubernetes manages deployment across clusters, automates scaling during training, and handles resource allocation for GPUs. Docker provides consistency between environments, while Kubernetes adds critical production capabilities like load balancing, self-healing, and distributed computing for AI workloads.