Staff Software Engineer, Compute ML Scheduling and Observability
Company: Interesting Engineering, Inc.
Location: Seattle
Posted on: March 29, 2025
Job Description:
About Anthropic
Anthropic's mission is to create reliable,
interpretable, and steerable AI systems. We want AI to be safe and
beneficial for our users and for society as a whole. Our team is a
quickly growing group of committed researchers, engineers, policy
experts, and business leaders working together to build beneficial
AI systems.

About The Role
The mission of the Capacity Engineering &
Efficiency team is to provide input into our company-wide cloud
infrastructure strategy and efficiency deliverables, with a
specialized focus on ML Scheduling and Observability for our
Compute infrastructure. You will develop and optimize scheduling
systems for our large-scale machine learning workloads,
particularly working with our Python-based scheduling architecture
and orchestrating workloads across jobs. Your work will contribute
to our path toward building RL-aware schedulers while supporting
and improving our model development through improved observability
and capacity efficiency. You will be expected to work with
engineering teams to ensure optimal operation and growth of our
infrastructure from both a cost and technology perspective,
collaborate with research engineering to scope and understand the
observability and capacity needs for model development, and partner
cross-functionally with finance and data science teams to analyze
and forecast growth.

You May Be a Good Fit If You
- Have experience instrumenting ML workloads for performance
monitoring/efficiency
- Have experience with high-performance, large-scale distributed
systems
- Have experience with LLM inference and reinforcement learning
- Have experience with observability tooling and best practices
(logging, metrics, tracing)
- Have 10+ years of experience in capacity efficiency or performance
engineering
- Have 10+ years of experience in a technical role
- Have experience in scripting and building automation tools
- Are self-disciplined and thrive in fast-paced environments
- Have excellent communication skills
- Pick up slack, even if it goes outside your job
description
- Have attention to detail and a passion for correctness

Strong Candidates May Also Have Experience With
- Reinforcement Learning
- Cross-platform accelerators
- PyTorch
- Python
- Kubernetes
- Performance optimization across multiple
platforms/environments

Representative Projects
- Develop self-service tools and dashboards to enable Anthropic
engineers to understand their capacity, efficiency, and costs,
leveraging observability best practices
- Investigate capacity requests and recommend right-sizing
strategies for performance optimization across multiple
platforms/environments
- Design and implement observability solutions that provide
insights into infrastructure efficiency for large-scale distributed
systems
- Collaborate with engineering teams to identify and resolve
performance bottlenecks in Kubernetes-based ML infrastructure
- Partner with research teams to quantify computational
requirements for new ML initiatives and develop appropriate
capacity plans

Deadline to apply: None. Applications will be reviewed on a rolling
basis.

Annual Salary
The expected salary range for this position is: $320,000 - $405,000 USD

Logistics
Education requirements: We require at least a Bachelor's degree in a
related field or equivalent experience.

Location-based hybrid policy:
Currently, we expect all staff to be in one of our offices at least
25% of the time. However, some roles may require more time in our
offices.

Visa sponsorship: We do sponsor visas! However, we aren't
able to successfully sponsor visas for every role and every
candidate. But if we make you an offer, we will make every
reasonable effort to get you a visa, and we retain an immigration
lawyer to help with this.

We encourage you to apply even if you do
not believe you meet every single qualification. Not all strong
candidates will meet every single qualification as listed. Research
shows that people who identify as being from underrepresented
groups are more prone to experiencing imposter syndrome and
doubting the strength of their candidacy, so we urge you not to
exclude yourself prematurely and to submit an application if you're
interested in this work. We think AI systems like the ones we're
building have enormous social and ethical implications. We think
this makes representation even more important, and we strive to
include a range of diverse perspectives on our team.

How We're Different
We believe that the highest-impact AI research will be big
science. At Anthropic we work as a single cohesive team on just a
few large-scale research efforts. And we value impact - advancing
our long-term goals of steerable, trustworthy AI - rather than work
on smaller and more specific puzzles. We view AI research as an
empirical science, which has as much in common with physics and
biology as with traditional efforts in computer science. We're an
extremely collaborative group, and we host frequent research
discussions to ensure that we are pursuing the highest-impact work
at any given time. As such, we greatly value communication
skills.

The easiest way to understand our research directions is to
read our recent research. This research continues many of the
directions our team worked on prior to Anthropic, including: GPT-3,
Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws,
AI & Compute, Concrete Problems in AI Safety, and Learning from
Human Preferences.

Come work with us!
Anthropic is a public benefit
corporation headquartered in San Francisco. We offer competitive
compensation and benefits, optional equity donation matching,
generous vacation and parental leave, flexible working hours, and a
lovely office space in which to collaborate with colleagues.