We're seeking an exceptional AI Systems Engineer to build and optimize the infrastructure that powers our large-scale AI models. You will work on model deployment, inference optimization, and reliability engineering for our mission-critical systems.
This role sits at the intersection of machine learning and infrastructure engineering. You'll be responsible for designing scalable systems that enable our research team to train state-of-the-art models, and for ensuring that production deployments are fast, reliable, and cost-effective.
Key Responsibilities
Infrastructure Design: Design and maintain scalable infrastructure for training and deploying large language models across cloud platforms (AWS, Azure, GCP).
Performance Optimization: Optimize inference latency and throughput using techniques like quantization, model distillation, and efficient serving architectures.
Container Orchestration: Manage Kubernetes clusters and orchestrate containerized workloads for distributed training and inference at scale.
MLOps Pipelines: Build robust CI/CD pipelines for machine learning workflows, including automated testing, model versioning, and deployment.
Monitoring & Reliability: Implement comprehensive monitoring, logging, and alerting systems to ensure 99.9%+ uptime for production AI services.
Cost Optimization: Architect solutions that balance performance with cost efficiency, leveraging spot instances, auto-scaling, and right-sized resource allocation.
Collaboration: Work closely with research scientists to understand model requirements and translate them into production-ready infrastructure.
Required Qualifications
B.S. or M.S. in Computer Science or a related engineering field, or equivalent practical experience.
Strong proficiency in Python for scripting, automation, and ML pipelines.
Extensive experience with containerization technologies (Docker, Kubernetes).
Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
Familiarity with ML frameworks (PyTorch, TensorFlow) and model serving tools (Triton Inference Server, TorchServe, or TensorFlow Serving).
Understanding of distributed systems, networking, and storage solutions.
Experience with Infrastructure as Code tools (Terraform, CloudFormation, or Pulumi).
Preferred Qualifications
Experience deploying and serving large language models (LLMs) in production environments.
Knowledge of GPU optimization techniques (CUDA, TensorRT, mixed-precision training).
Familiarity with distributed training frameworks (Ray, Horovod, DeepSpeed).
Experience with monitoring and observability tools (Prometheus, Grafana, ELK stack).
Understanding of model compression techniques (quantization, pruning, knowledge distillation).
Contributions to open-source ML infrastructure projects.
Experience with Bayesian deep learning or uncertainty quantification systems.
At TeraSystemsAI, we cultivate a culture of radical curiosity and psychological safety. We believe in deep work over shallow productivity, flexible schedules over rigid hours, and impact over optics.
We're a distributed team of world-class engineers and researchers committed to building AI that is safe, transparent, and transformative. Every voice matters, from interns to founders.
Ready to Build the Future?
Join us in creating the infrastructure that powers next-generation AI systems.