About LanceDB
LanceDB is a developer-friendly, open-source data lake for multimodal AI. From hyper-scalable vector search and advanced retrieval for RAG to streaming training data and interactive exploration of large-scale AI datasets, LanceDB is the best foundation for your AI application, powering some of today's most groundbreaking applications and most challenging workloads.
About the role
We are seeking an engineer who brings both hands-on expertise in model training, fine-tuning, and feature engineering and a strong background in data/AI/ML infrastructure to join our world-class team pushing the frontiers of multimodal data infrastructure.
Your responsibilities will include:
- Serve as the resident expert on AI engineering, bringing familiarity with frameworks such as PyTorch or JAX and experience with compute systems such as Kubernetes, Ray, or Spark to support distributed training and inference workloads.
- Champion a superior Developer Experience, maximizing productivity for AI engineers.
- Drive the end-to-end design and development of high-performance, large-scale feature engineering infrastructure for leading multimodal AI companies.
- Collaborate closely with customers, design partners, and the Lance/LanceDB community.
Requirements:
- You like working with a small, high-caliber team with a lot of autonomy and drive, and you can iterate fast.
- You have 3+ years of experience building and deploying ML/DL models in production environments, or supporting infrastructure for AI researchers and AI engineers performing these tasks, using Python and libraries such as PyTorch or TensorFlow.
- You have a proven ability to deliver projects end-to-end, from scoping and resourcing to implementation and delivery.
- You have experience with large-scale data processing systems such as Spark, Flink, Ray, Kubeflow, or Dataflow.
- You have a working knowledge of cloud platforms (AWS, GCP, Azure) including managed storage (S3, GCS) and compute (EC2, GKE, AKS).
- You have knowledge of monitoring/logging stacks (Prometheus, Grafana, ELK/EFK) for alerting on data pipeline failures, resource saturation, or model skew.
It would be even better if you are someone who:
- Has a deep understanding of training architecture, from PyTorch or JAX experience to CUDA kernel fusion or TPU programming.
- Has experience designing or operating a feature store (e.g., Feast, Tecton) or building a custom feature registry.
- Has an understanding of the Docker layered filesystem, Kubernetes scheduling algorithms, and orchestration services.