Sr. Machine Learning Engineer, Amazon General Intelligence (AGI)
Our Machine Learning training infrastructure (ML Infra) team is responsible for designing, implementing, and optimizing the large-scale computing infrastructure that powers our cutting-edge AI and machine learning initiatives. We leverage advanced hardware, innovative software architectures, and distributed computing techniques to enable breakthrough research and product development across the company.

We are seeking a Senior Machine Learning Engineer to join our team and lead the development of our next-generation ML training infrastructure. This is a high-impact, high-visibility role that will shape the future of our machine learning capabilities and contribute to the advancement of AI technology across the industry.

Key job responsibilities
- Lead the definition, design, architecture quality, implementation, and delivery of the most advanced, most difficult, most cross-cutting, and/or most ambiguous challenges spanning our ML infrastructure.
- Align the teams in ML Infrastructure and related organizations to a coherent technical vision and deliver systems that fit well together.
- Exert influence over multiple teams, increasing their productivity and effectiveness. You hold peers and teams to a high bar for performance and efficiency, and aid teams through your expert guidance and example.
- Considered an authority on technical issues by both the technical and research communities, you are responsible for guiding difficult trade-off decisions and driving awareness of the impact and consequences of technical decisions on AI research and product development.
- Demonstrate significant innovation, creativity, and judgement when solving challenging AI/ML infrastructure problems. Identify future skills needed across your organization and advocate for the development and/or acquisition of those skills to senior leaders.
You scout top talent and recruit them to the company.
- Actively mentor senior and Principal engineers, and scale yourself by developing and institutionalizing best practices in AI/ML infrastructure and distributed computing across the organization.

A day in the life
- 8+ years of professional software development experience in distributed systems with emphasis on ML infrastructure
- 8+ years of current programming experience building ML infrastructure using languages such as Python, C++ or Rust
- Hands-on experience with parallel computing platforms such as CUDA, OpenMP, etc.
- Deep understanding of AI frameworks such as PyTorch, TensorFlow, and JAX, and their demands on underlying compute infrastructure, memory bandwidth, network interconnect, and storage as scale goes up
- Knowledge of emerging AI hardware accelerators and architectures
- Experience with containerization and orchestration technologies (Docker, Kubernetes)
- Experience with cloud computing platforms (AWS, Azure, GCP) and their offerings

About the team
Join our AGI team and work at the forefront of AI. Collaborate with top minds pushing boundaries in deep learning, reinforcement learning, and more. Gain valuable experience and accelerate your career growth. This is a unique opportunity to create history and shape the future of artificial intelligence.

Mission of the team: We leverage our hyper-scalable, general-purpose large model training and inference systems to develop and deploy cutting-edge sensory AI foundational models that revolutionize machine perception, interpretation, and interaction with humans and with the physical world.

BASIC QUALIFICATIONS
- 5+ years of non-internship professional software development experience
- 5+ years of programming experience with at least one software programming language
- 5+ years of experience leading design or architecture (design patterns, reliability and scaling) of new and existing systems
- Experience as a mentor, tech lead or leading an engineering team ...