Data Machine Learning Engineer jobs in San Francisco – Page 4 | RoboApply Jobs

Data Machine Learning Engineer jobs in San Francisco · Page 4

Results 61–80 of 6,042 for “Data Machine Learning Engineer” in San Francisco.

Taskrabbit
Full-time|$148K/yr - $200K/yr|Hybrid|San Francisco, California, United States

About Taskrabbit: Taskrabbit is an innovative marketplace platform that seamlessly connects individuals with Taskers to manage everyday home tasks, including furniture assembly, handyman services, moving assistance, and much more. At Taskrabbit, we aim to transform lives one task at a time. We celebrate innovation, inclusion, and hard work, fostering a collabo…

Feb 17, 2026
VSCO
Full-time|$240K/yr - $260K/yr|On-site|San Francisco, CA

About VSCO

At VSCO, we empower photographers with an innovative platform that provides essential tools, a vibrant community, and the visibility needed for creative and professional growth. We cultivate an authentic creative environment that welcomes photographers of all skill levels, offering a space that inspires opportunity, collaboration, and connection. Our mission is to support photographers in their journeys, enabling them to thrive and connect with fellow creatives and businesses through our comprehensive suite of tools, available on both mobile and desktop. We seek individuals who are passionate and proactive in advancing our mission. Our team members have the opportunity to make a significant impact, and we believe that collaborative efforts yield stronger results. Our core values are essential to our team culture and guide our hiring process. Learn more about what you can expect when joining VSCO on our Careers Page.

About The Role

As a Senior Machine Learning Engineer, you will harness the power of AI and machine learning to create innovative, reliable user-facing product features. You will leverage your extensive technical background and hands-on experience in deploying machine learning models to deliver impactful solutions based on real-world feedback. Your focus on measurable outcomes and customer satisfaction drives your work, blending innovation with practical implementation. You will be highly skilled in Python and adept across the data and machine learning stack, enabling you to develop and launch models efficiently while ensuring scalability and maintainability. Whether working with traditional algorithms or cutting-edge deep learning and generative AI, you will expertly navigate the complexity of each problem, managing every phase from defining the challenge to deployment and iterative improvement.

Your dedication to software engineering excellence will inform your thoughtful approach to system design for machine learning, encompassing data quality, pipeline design, feature workflows, model serving, and ongoing monitoring and enhancement. By integrating machine learning deeply within our cohesive product experiences, you will collaborate effectively with cross-functional teams, aligning on objectives, defining success metrics, and driving meaningful outcomes. You will stay informed about the rapidly evolving AI landscape, maintaining a discerning perspective that allows your team to focus on significant advancements while avoiding distractions.

The Day to Day

Design and implement ML-powered features for search, discovery, personalization, and more.

Mar 23, 2026
Troveo
Full-time|$200K/yr - $400K/yr|On-site|San Francisco, CA

About Troveo

Troveo is pioneering a cutting-edge data platform dedicated to training AI video models. We provide the most extensive library of AI video training data globally, comprising millions of hours of licensed video content. Our comprehensive data pipeline links creators, rights holders, and AI research facilities, facilitating scalable, compliant, and innovative video applications for AI model development.

As a rapidly growing startup backed by visionary investors, we are looking for an innovative Senior Machine Learning Engineer to join our team and help us scale our operations.

Role Overview

In this pivotal role, the Senior Machine Learning Engineer will be responsible for designing, developing, and optimizing large-scale machine learning pipelines essential for AI video model training. You will engage in the complete ML lifecycle, from structuring enormous datasets to deploying, evaluating, and refining models in production environments.

This hands-on role demands an engineer who excels in a dynamic environment, values autonomy, and thrives on cross-functional collaboration. You will leverage your deep technical knowledge alongside excellent communication and business insight to translate models into quantifiable costs, performance metrics, and tangible outcomes.

Key Responsibilities

Data Curation & Indexing Pipelines:
- Design and implement large-scale pipelines for video ingestion, metadata extraction, and indexing using vector databases and embedding models to facilitate swift, semantic retrieval.
- Develop annotation workflows that utilize active learning, weak supervision, and human-in-the-loop systems to curate high-quality labeled datasets for video models.
- Optimize data partitioning, sharding, and caching strategies to manage petabyte-scale video corpora, ensuring efficient low-latency search and maintaining robust data lineage.

Model Training & Evaluation:
- Create and fine-tune multimodal models (e.g., CLIP variants, transformer-based encoders) for video embeddings, scene segmentation, and relevance ranking using PyTorch and Hugging Face frameworks.
- Establish evaluation frameworks incorporating metrics such as NDCG, mAP, and annotation consistency scores to iteratively enhance search accuracy and annotation efficiency.
- Deploy models through containerized services, implementing A/B testing and monitoring for drift detection in production.

Nov 8, 2025
Machine Learning Platform Engineer

tvScientific powered by Pinterest

Full-time|$123.7K/yr - $254.7K/yr|Remote|San Francisco, CA, US; Remote, US

tvScientific, powered by Pinterest, develops a connected TV (CTV) advertising platform designed for performance marketers. The platform combines media buying, optimization, measurement, and attribution to automate and improve TV advertising. Built by professionals in programmatic advertising, digital media, and ad verification, tvScientific aims to deliver measurable results for advertisers.

Role overview

As a Machine Learning Platform Engineer, you will join a team that operates where Site Reliability Engineering meets low-latency distributed systems. This team advances Pinterest’s real-time machine learning and measurement infrastructure, focusing on sub-millisecond decision-making and high-throughput data access. Seamless integration with Pinterest’s core stack is central to the work.

What you will do

- Design and build systems to keep queries and RPCs fast and reliable, even during periods of heavy demand.
- Develop and enhance the foundation of the machine learning training and serving stack.
- Address challenges in storage, indexing, streaming, fan-out, and managing backpressure and failures across services and regions.
- Collaborate with software engineering, data infrastructure, and SRE teams to ensure systems are observable, debuggable, and ready for production.

Key areas of focus

- I/O scheduling and batching
- Lock-free or low-contention data structures
- Connection pooling and query planning
- Kernel and network tuning
- On-disk layout and indexing strategies
- Circuit-breaking and autoscaling
- Incident response and failure management
- NixOS
- Defining and maintaining SLIs and SLOs

This position is a strong fit for engineers interested in building and operating large-scale infrastructure, particularly those who enjoy working on real-time systems, observability, and reliability.

Apr 23, 2026
Mach9
Full-time|On-site|San Francisco

Mach9’s Machine Learning Infrastructure Engineers create and maintain the backbone for production AI models used in civil engineering and surveying. The team manages a machine learning pipeline that processes over 10,000 miles of labeled survey data, supports image segmentation networks, and runs 3D prediction models. These systems deliver real-time inference capabilities directly to surveyors and engineers working in the field.

Role overview

This position is designed for mid-career engineers with a strong background in both training and inference aspects of machine learning infrastructure. The work involves handling large-scale data and ensuring reliable performance for demanding, real-world applications.

What you will do

- Build and improve training pipelines for deep transformer models using hundreds of terabytes of 3D point cloud and image data.
- Design and implement inference infrastructure to support both offline detection algorithms and responsive, real-time inference integrated with CAD software.

Location

Based in San Francisco.

Apr 25, 2026
SoFi
Full-time|$153.6K/yr - $240K/yr|On-site|CA - San Francisco

SoFi is a national bank and financial services company that creates mobile-first tools to help people manage their money and reach their financial goals. The team values direct impact and aims to make a positive difference for members.

Role overview

The Senior Marketing Data Scientist - Machine Learning joins the Marketing Data Science team in San Francisco, CA. This position supports SoFi’s Marketing organization through analytics, model building, experimentation, and performance measurement to help drive marketing and growth initiatives. Work centers on designing, building, and scaling machine learning models that improve customer acquisition, conversion, retention, and lifetime value across SoFi’s products. The role draws on behavioral, transactional, and credit data to create predictive models and actionable insights. Collaboration with cross-functional teams is key for identifying business needs, managing model development end-to-end, implementing models in production, and monitoring their ongoing performance. Regulatory compliance is a consistent focus.

Main responsibilities

- Design, develop, and deploy machine learning models to optimize customer acquisition, onboarding, and engagement for products such as loans, credit cards, investments, and cryptocurrency.
- Build predictive models for outcomes including customer lifetime value, conversion rates, cross-sell and upsell effectiveness, and retention across channels like email, direct mail, in-app, and Operations.
- Work with structured and unstructured data, such as behavioral signals, transaction data, and credit attributes, to enable audience segmentation and large-scale personalization.
- Maintain a feature store to streamline model development.
- Set up A/B testing frameworks to evaluate marketing strategies and measure their impact.

Apr 20, 2026
Anthropic
On-site|San Francisco, CA; New York City, NY; Seattle, WA

Join Anthropic as a Machine Learning Systems Engineer within our Encodings and Tokenization team, where you'll play a pivotal role in refining and optimizing our tokenization systems across Pretraining and Finetuning workflows. By bridging the gap between our Pretraining and Finetuning teams, you will help shape the essential infrastructure that enhances how our AI models learn from diverse data. Your contributions will be crucial in ensuring our AI systems remain reliable, interpretable, and steerable, driving forward our mission of developing beneficial AI technologies.

Jan 29, 2026
Scale AI
Full-time|$218.4K/yr - $273K/yr|On-site|San Francisco, CA; New York, NY

Artificial Intelligence is revolutionizing every aspect of our lives. At Scale AI, we are dedicated to accelerating the advancement of AI applications across industries. For nearly a decade, we have established ourselves as a premier AI data foundry, powering groundbreaking innovations in AI, including generative AI, defense systems, and autonomous technologies. With our recent investment from Meta, we are committed to enhancing our state-of-the-art post-training algorithms to achieve unparalleled performance for complex agents serving enterprises globally. The Enterprise ML Research Lab is at the forefront of this AI evolution. Our team develops a suite of proprietary research, tools, and resources tailored for our enterprise clients. As a Machine Learning Research Engineer on the Data Foundation team, you will engage in pioneering research to optimize the data flywheel that drives our entire machine learning ecosystem. Your work will involve exploring synthetic environments, defining tasks, building agents for trace analysis, and contributing to a cutting-edge framework that automates agent building through advanced evaluation techniques. You will create top-tier agents that deliver state-of-the-art results by leveraging sophisticated post-training and agent-building algorithms. If you are passionate about influencing the future of Generative AI, we encourage you to apply!

Mar 26, 2026
Plaid Inc.
Full-time|On-site|San Francisco

About Plaid

Plaid builds tools that help developers create new financial products and experiences. Since 2013, Plaid has connected millions of users to over 12,000 financial institutions across the US, Canada, the UK, and Europe. The company partners with organizations like Venmo, SoFi, Fortune 500 firms, and major banks to make linking financial accounts to apps and services easier. Headquarters are in San Francisco, with offices in New York, Washington D.C., London, and Amsterdam.

Team: Data Foundation & AI

The Data Foundation and AI team designs and maintains the machine learning and AI infrastructure that supports Plaid’s products. This group transforms Plaid’s financial network data into flexible formats used by teams across the company. Responsibilities span the entire system lifecycle: data curation for pretraining, model development, deployment, serving, and monitoring in production.

Role Overview: Senior Machine Learning Engineer (Research Scientist)

This position focuses on applied research for Plaid’s foundation model. The Senior Research Scientist leads efforts to design model architectures, set pretraining objectives, and implement fine-tuning strategies that work across a range of product needs. The role also involves building and maintaining production machine learning systems, including training pipelines, model serving, feature engineering, and performance monitoring.

Key Responsibilities

- Design model architectures and define pretraining objectives for Plaid’s foundation model
- Develop and apply fine-tuning methods for diverse product use cases
- Build and maintain end-to-end machine learning systems, from data pipelines to model serving
- Engineer features and monitor system performance in production
- Create evaluation frameworks to measure model quality across multiple tasks and metrics

Location

This role is based in San Francisco.

Apr 15, 2026
Lyft, Inc.
Full-time|Hybrid|San Francisco, CA

About the Role

Lyft Ads is looking for a Data Science Manager with a focus on Machine Learning. This position is based in San Francisco, CA.

What You Will Do

- Lead a team of data scientists working on advertising solutions
- Guide the development and optimization of machine learning algorithms
- Improve user targeting methods to increase advertising effectiveness
- Analyze large datasets to support strategic decisions across Lyft Ads
- Shape advertising performance and influence user engagement through data-driven insights

What We’re Looking For

- Experience managing data science teams
- Strong background in machine learning and algorithm development
- Ability to work with large-scale datasets
- Skilled at translating analysis into actionable business strategies

Apr 14, 2026
The Bot Company
Full-time|On-site|San Francisco

The Bot Company

We are on a mission to create a helpful robot for every household.

Our dynamic team of engineers, designers, and operators is headquartered in San Francisco, featuring talent from renowned companies such as Tesla, Cruise, OpenAI, Google, and Pixar. We have a proven track record of delivering exceptional products to hundreds of millions of users.

Our lean structure fosters swift decision-making and minimizes bureaucracy, empowering every team member with significant autonomy and responsibility. We embrace a culture of rapid iteration and execution across the tech stack.

What We Seek in Candidates

At The Bot Company, we value sharp minds capable of thriving in fast-paced, high-pressure environments. Candidates should exhibit:

- Exceptional Mental Acuity: The ability to think quickly, assimilate new information instantly, and make connections across various domains.
- Engineering Curiosity: A natural inclination to explore and understand how systems function, even beyond your specialized area.
- High Performance Mindset: Comfort with rapid movement, adeptness in handling ambiguity, and excellence under demanding conditions.

Role Overview: ML Compiler Engineer

As a specialist in developing ML compilers for edge devices (custom silicon and others), you will be pivotal in establishing a robust deployment framework to efficiently execute large neural networks on our robots with minimal latency.

Key Qualifications

- Proficient coding skills with extensive experience in C++ and/or Python.
- Familiarity with modern compiler infrastructure (MLIR/LLVM, XLA, TVM, Glow, etc.).
- Experience in deploying models on heterogeneous computing platforms (preferably edge devices).
- Proficiency in writing kernels (CUDA/OpenCL).
- Knowledge of quantization techniques is advantageous, though not mandatory.

Your Responsibilities

- Design, develop, and maintain compiler infrastructure tailored for our hardware.
- Collaborate across teams, including ML and Systems Software.
- Independently diagnose and resolve complex numerical issues (such as discrepancies between training and inference) while enhancing performance.

Nov 21, 2025
Nudge
Full-time|On-site|San Francisco

About Nudge

At Nudge, our goal is to create innovative technologies that connect with the brain, enhancing individuals’ lives. We’re pioneering a non-invasive, ultrasound-based device designed to stimulate and image the brain with high precision and depth. This initiative involves developing state-of-the-art hardware, software, and research capabilities to deliver products that can positively impact millions, and eventually billions, of people.

About the Role

As a Machine Learning Software Engineer at Nudge, you will:

- Engineer imaging algorithms utilizing proprietary ultrasound transducers and advanced computing resources to visualize the brain and skull.
- Create sophisticated acoustic simulations to model sound scattering in the skull, enabling precise dose predictions.
- Develop real-time computer vision systems to monitor brain target movements and dynamically adjust parameters to ensure accurate targeting.
- Collaborate closely with mechanical, electrical, and ultrasound engineers, as well as transducer designers and neuroscientists.

About You

We are searching for engineers of all experience levels, with a preference for those boasting a minimum of three years in the industry. Regardless of your experience, you should possess:

- A solid understanding of engineering principles, physics, and signal processing.
- Proficiency in writing production-level code, preferably in Python.
- A degree in Computer Science or a related engineering field.
- No prior experience in ultrasound or neuroscience is required.
- Experience in delivering real-world products that provide tangible value; ideally, you have dealt with complex real-world sensors.
- A high level of integrity.

Sep 9, 2025
Exa
Full-time|On-site|San Francisco, California

At Exa, we are pioneering the next generation of search engines designed for the era of artificial intelligence, starting from the foundational Silicon architecture. Our ambitious indexing operation is unparalleled, allowing us to crawl the vast open web at an extraordinary scale. We harness cutting-edge embedding models to comprehend this data and utilize our high-performance Rust-based vector database alongside a $5M H200 GPU cluster, which powers tens of thousands of machines simultaneously.

The Machine Learning (ML) division is central to this mission, focusing on the training of foundational models that enhance search capabilities. Our vision is to create systems capable of swiftly filtering the world’s knowledge to deliver precisely what you need, regardless of the complexity of your inquiry, effectively transforming the web into a robust, searchable database.

To achieve this ambitious goal, we must define what constitutes “effective search”. This is where your expertise will play a crucial role.

We are seeking a talented Machine Learning Evaluations Engineer to develop and implement our evaluation framework at Exa. This position entails exploring methodologies to assess search engines in a world dominated by large language models (LLMs) and crafting the most thorough, innovative, and impactful evaluation suite. Your decisions will influence the future of search optimization and directly affect the research team’s focus, shaping the company’s strategic direction.

Oct 15, 2025
Mercor
Full-time|On-site|San Francisco

About Mercor

At Mercor, we're revolutionizing the future of work. We collaborate with top AI labs and enterprises to deliver the human insights crucial for AI development.

Our extensive talent network trains cutting-edge AI models, much like educators nurture students: by imparting invaluable knowledge, experience, and context that transcends mere code. Currently, over 30,000 specialists in our network collectively generate more than $2 million daily.

Mercor is pioneering a new category of work where expertise fuels AI progress. Achieving this vision requires a dynamic, fast-paced, and deeply dedicated team. You’ll collaborate with researchers, operators, and AI companies at the forefront of transforming systems that redefine society.

As a profitable Series C company valued at $10 billion, we operate on-site five days a week in our offices located in San Francisco, NYC, or London.

About the Role

In your role as a Machine Learning Engineer on the growth team, you will develop the infrastructure that powers Mercor’s hiring engine: from indexing and global discovery to cross-platform sourcing and engagement, real-time scoring and personalization, and high-throughput conversion pipelines that transform interest into hires.

What You Will Build:

- Low-latency ranking and matching pipelines that process thousands of signals.
- Global off-platform people search, job distribution, and ad/acquisition infrastructure.
- Production ML and feature infrastructure for personalization and incentive modeling.
- Real-time event and data pipelines, high-throughput APIs, and observability for mission-critical services.

Who We Are Looking For: We seek engineers with a strong background in building distributed backends or ML infrastructure, demonstrated ownership of large-scale matching, indexing, recommender, or search systems; robust instincts for production, and experience with high-throughput services, monitoring, and reliability.

Why Join Us: If you are looking for backend work that combines ML, distributed systems, and real revenue impact, the Growth team is where you belong.

Tech Stack: Python, Go, embeddings, fine-tuning, RAG, Kafka, Postgres, Redis, Elasticsearch, Kubernetes, Terraform

Apr 10, 2026
Aarki Inc.
Full-time|On-site|San Francisco

Join our innovative team at Aarki Inc. as the Director of Machine Learning! In this pivotal role, you will lead our machine learning initiatives, driving the development of cutting-edge algorithms and models that enhance our data-driven solutions. You will collaborate with cross-functional teams to translate complex data into actionable insights, all while fostering a culture of learning and innovation.

Mar 18, 2026
Physical Intelligence
Full-time|On-site|San Francisco

As a Machine Learning Infrastructure Engineer at Physical Intelligence, you will play a vital role in enhancing and optimizing our training systems and core model code. You will take ownership of critical infrastructure for large-scale training, which includes managing GPU/TPU compute, orchestrating jobs, and developing reusable and efficient JAX training pipelines. Collaborating closely with researchers and model engineers, you will help transform innovative ideas into experiments and subsequently into production training runs.

This position is hands-on and offers significant leverage at the intersection of machine learning, software engineering, and scalable infrastructure.

The Team

Our ML Infrastructure team is dedicated to supporting and accelerating Physical Intelligence's core modeling initiatives by building systems that ensure large-scale training is reliable, reproducible, and efficient. The team collaborates with research, data, and platform engineers to guarantee that models can seamlessly transition from prototype to production-grade training runs.

Key Responsibilities

- Manage training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, which includes scheduling, job management, checkpointing, and performance metrics/logging.
- Expand distributed training: Collaborate with researchers to efficiently scale JAX-based training across TPU and GPU clusters.
- Enhance performance: Profile and optimize memory usage, device utilization, throughput, and distributed synchronization to maximize efficiency.
- Facilitate rapid iteration: Develop abstractions for launching, monitoring, debugging, and reproducing experiments.
- Oversee compute resources: Ensure optimal allocation and utilization of cloud-based GPU/TPU compute resources while managing costs effectively.
- Collaborate with researchers: Translate research requirements into infrastructure capabilities and promote best practices for large-scale training.
- Contribute to core training code: Evolve the JAX model and training code to accommodate new architectures, modalities, and evaluation metrics.

Aug 24, 2024
Mercury
Full-time|$200.7K/yr - $250.9K/yr|Remote|San Francisco, CA, New York, NY, Portland, OR, or Remote within Canada or United States

Since the advent of the Fast Fourier Transform in 1965, the analysis of complex signals such as radio waves and images has transformed dramatically. At Mercury, we leverage advanced technologies to streamline the review of customer applications without sacrificing quality. Our Risk Onboarding team serves as the first line of defense against money laundering and financial crimes, developing innovative systems to ensure our clients are who they claim to be and that we can conduct business with them securely.

We are dedicated to providing an unparalleled banking experience for startups, focusing on creating a safe and effective environment that caters to the needs of our customers, administrators, and regulators alike.

Note: Mercury is a fintech company and not an FDIC-insured bank. Banking services are provided through Choice Financial Group and Column N.A., both Members FDIC.

Jan 22, 2026
Two Dots
Full-time|On-site|San Francisco HQ

Become a part of Two Dots as we strive to create a more robust financial ecosystem.

In our fast-paced world, every time an individual seeks a mortgage, car loan, or apartment lease, they present financial documents that contribute to their financial profile. The accuracy of these profiles plays a crucial role in stabilizing the economy.

At Two Dots, we are innovating a system that evaluates consumers in a consistent and fair manner. Our mission is to detect fraud that often goes unnoticed and to identify value in unconventional applications that might be overlooked.

Please note that all full-time employees are required to work from our headquarters located in San Francisco, CA.

Role Overview:

We are seeking our second Machine Learning Engineer to collaborate closely with our CTO and Staff ML Engineer. In this position, you will be responsible for designing, developing, and deploying machine learning solutions, particularly focusing on fine-tuning multimodal large language models (LLMs) to address real-world challenges. The right candidate will possess a fervor for building and implementing advanced ML applications, aiming to enhance our automation rates for application approvals/denials and elevate our fraud detection capabilities, ultimately driving business impact and client satisfaction.

Key Responsibilities:

- Independently design, develop, and deploy machine learning models.
- Examine extensive datasets to reveal insights and patterns that guide product development and enhance personalized customer experiences.
- Continuously assess and refine the performance of deployed models to ensure they fulfill business objectives and scalability needs.
- Keep abreast of the latest developments in machine learning, AI, data science, and engineering, applying this knowledge to enhance our products and services.

Desirable Traits:

- 3+ years of experience in a Machine Learning or Data Engineering role, with a strong command of Python and ML frameworks like PyTorch.
- Demonstrated ability to enhance models for key information extraction, including named entity recognition and financial document classification.
- Experience with active learning and HITL-driven workflows; collaborating with large labeling and quality teams is advantageous.
- Exceptional problem-solving skills, with the ability to think critically and creatively.

Jan 3, 2025
Physical Intelligence
Full-time|On-site|San Francisco

At Physical Intelligence, we are pioneering general-purpose AI applications for the physical world. Our innovative approach involves orchestrating thousands of accelerators across a diverse ecosystem of GPU and TPU clusters, which encompass various hardware generations, cloud platforms, and cluster configurations.

Researchers frequently encounter challenges in identifying the optimal cluster for their tasks, understanding resource availability, and configuring their workloads efficiently. This process is not scalable. To enhance productivity, we require an intelligent scheduling and compute system that can automatically determine the best job placements based on availability, hardware compatibility, cost considerations, and priority levels, allowing researchers to concentrate on their scientific endeavors.

This position encompasses the complete ownership of this challenge: the development of scheduling systems, placement logic, cluster management frameworks, and operational tools essential for seamless operations.

This role is distinct from traditional cloud DevOps; it focuses on resource allocation intelligence, utilization efficiency, fault tolerance, and ensuring a smooth experience for large-scale distributed training.

About the Team

The ML Infrastructure team is dedicated to bolstering and accelerating Physical Intelligence’s fundamental modeling initiatives by creating systems that ensure large-scale training is reliable, reproducible, and efficient. You will collaborate closely with the ML Infrastructure, data platform, and research teams to eliminate compute scheduling as a bottleneck.

Key Responsibilities

- Lead Intelligent Job Scheduling and Placement: Design and implement multi-tenant scheduling systems that automatically allocate training jobs to the most suitable cluster based on hardware specifications, topology, availability, cost, and priority. Facilitate equitable resource sharing across teams and projects through quota management, priority tiers, and preemption policies. Simplify cluster discrepancies so researchers can submit jobs without needing detailed knowledge of cluster specifics.
- Enhance Multi-cluster Orchestration: Develop the control plane responsible for overseeing the job lifecycle across various clusters (including mixed GPU/TPU setups, multi-generational hardware, both on-premises and cloud-based) and enable effortless job migration, failover, and rescheduling.
- Optimize Accelerator Utilization and Performance: Continuously monitor and enhance GPU/TPU usage across the entire fleet. Apply priority, preemption, queuing, and fairness strategies that balance research momentum with cost efficiency.
- Guarantee Scalability and Stability: Implement fault detection, automatic recovery mechanisms, and resilience strategies for long-running multi-node training tasks. Oversee health checks, node management, and scaling strategies to ensure optimal performance.

Mar 7, 2026
Whatnot
Full-time|On-site|San Francisco, CA

Role overview

Whatnot seeks a Software Engineer specializing in Machine Learning Infrastructure to develop and maintain the systems powering its machine learning applications. This position is based in San Francisco, CA and centers on building the technical backbone that supports machine learning efforts across the company.

What you will do

- Develop and improve frameworks that enable machine learning throughout Whatnot’s platforms.
- Collaborate with teams from multiple disciplines to design infrastructure that can scale as needs grow.
- Support seamless integration of machine learning models into existing products.

Apr 23, 2026


