Staff Engineer, Distributed Storage and HPC & AI Infrastructure
As an AI Infrastructure Engineer, you will participate in an on-call rotation to respond to production incidents; build and run infrastructure using Ansible, Terraform, and Kubernetes to scale to many concurrent users; build monitoring systems to ensure high-quality service; design and implement operational processes such as deployments and upgrades; debug production issues across all services and stack levels; identify improvements to the product architecture for reliability, performance, and availability; and plan the growth of Together AI's infrastructure.
Manager, Infrastructure Strategy & Operations
As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You participate in an on-call rotation (PagerDuty) to respond to production incidents. You build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users. You build monitoring systems to ensure the highest quality service for customers. You design and implement operational processes such as deployments and upgrades. You debug production issues across all services and levels of the stack. You identify improvements for the product architecture from the reliability, performance, and availability perspectives. You plan the growth of Together AI's infrastructure.
Lead/Manager Together Cloud Infrastructure Engineer
As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You participate in on-call rotation to respond to production incidents, build and run infrastructure using Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users, build monitoring systems to ensure the highest quality service for customers, design and implement operational processes such as deployments and upgrades, debug production issues across all services and levels of the stack, identify improvements for product architecture from reliability, performance, and availability perspectives, and plan the growth of Together AI's infrastructure.
Staff Platform Engineer, Voice AI
As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly by participating in on-call rotation to respond to production incidents, building and running infrastructure with Ansible, Terraform, and Kubernetes to enable scaling for a massive number of concurrent users, building monitoring systems to ensure the highest quality service, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and levels of the stack, identifying improvements for product architecture from reliability, performance, and availability perspectives, and planning the growth of Together AI's infrastructure.
Infrastructure Design Engineer
As an AI Infrastructure Engineer at Together, you are responsible for keeping all user-facing services and production systems running smoothly. Your tasks include participating in an on-call rotation to respond to production incidents, building and running infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users, building monitoring systems to ensure the highest quality service, designing and implementing operational processes such as deployments and upgrades, debugging production issues across all services and levels of the stack, identifying improvements for the product architecture from reliability, performance, and availability perspectives, and planning the growth of Together AI's infrastructure.
Business Development Intern
Lead the team responsible for the AI/ML infrastructure that connects machine learning research with large-scale production. Develop and execute the long-term vision and roadmap for the MLOps team to support ML development and deployment needs across business units, balancing short-term tactical deliveries and long-term architectural transformation. Manage and mentor a team of 6-7+ engineers, allocating resources strategically for existing service support and key initiatives. Collaborate cross-functionally with leaders in machine learning, data science, product engineering, and infrastructure to identify issues, address bottlenecks, and facilitate new solution deployment. Architect compute and storage pipelines for managing large datasets without data fragmentation or latency. Modernize the inference stack for AI product growth. Work with Site Reliability Engineering to establish comprehensive system metrics. Conduct build vs. buy assessments and audits to benchmark proprietary tools against commercial and open-source alternatives.
Forward Deployed Engineer (GPU Clusters)
As an AI Infrastructure Engineer, you will participate in an on-call rotation to respond to production incidents; build and run infrastructure with Ansible, Terraform, and Kubernetes to support scaling to many concurrent users; build monitoring systems to ensure high-quality customer service; design and implement operational processes such as deployments and upgrades; debug production issues across all services and stack levels; identify improvements to the product architecture focused on reliability, performance, and availability; and plan the growth of Together AI's infrastructure.
Technical Account Manager (TAM), AI Factory
Participate in on-call rotation to respond to production incidents, build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users, build monitoring systems to ensure the highest quality service for customers, design and implement operational processes such as deployments and upgrades, debug production issues across all services and levels of the stack, identify improvements for product architecture from reliability, performance, and availability perspectives, and plan the growth of Together AI's infrastructure.
Software Engineer, Compute Infrastructure
In this role, you will spin up and scale large Kubernetes clusters, including automating provisioning, bootstrapping, and cluster lifecycle management; build software abstractions that unify multiple clusters and provide a seamless interface to training workloads; own node bring-up from bare metal through firmware upgrades, ensuring fast and repeatable deployment at massive scale; improve operational metrics such as reducing cluster restart times and accelerating firmware or OS upgrade cycles; integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure; develop monitoring and observability systems to detect issues early and maintain cluster stability under extreme load; and solve real-time operational challenges, diagnose and fix issues quickly, and continuously improve automation, resilience, performance, and uptime across the systems powering frontier AI model training.
DevOps Engineer
Build and deploy AI agents including prompt design, workflow configuration, integrations, telephony setup, and evaluation frameworks. Act as the primary technical partner for customers by leading demos, communicating progress, gathering feedback, and guiding solutions from concept to production. Configure and connect systems via APIs, handling authentication, data mapping, error handling, and integrations with CRMs, knowledge bases, and other enterprise tools. Set up telephony integration including SIP/CCaaS/PSTN routing, metadata passing, fallback configurations, and troubleshooting call quality. Write and refine prompts for LLM-driven agents, monitor performance, conduct iterative testing, and ensure agents meet automation and containment targets. Translate customer requirements into actionable solutions and work consultatively to resolve challenges related to security, connectivity, or knowledge ingestion. Collaborate with product and engineering teams to address platform gaps, resolve technical issues, and lead client implementations independently.
