Resilient AI: Building Fault-Tolerant AI Systems

2025-05-10, 35 pages, PDF, 2224 KB
Abstract:

Dan Rabinovitsj, VP, Engineering, Meta

Resilient AI: Building Fault-Tolerant AI Systems

Artificial intelligence (AI) is having quite a moment
  • AI-enabled creation tools
  • Text-to-image generation
  • A hedgehog playing chess
  • Large Language Models (LLMs)
  • Llama 3.1
Source: Meta for Business. 'Culture Rising: 2023 Trends Report.' 2023.

... pushed our model training to new heights, leveraging a significantly optimized full training stack
  • 16K H100 GPUs used to train Llama 3.1 405B
  • >15T tokens
  • Trained at unprecedented scale

The Challenge of Scale: Llama 3's Infrastructure

AI jobs at scale: massive change in 2023
  • 2022: 6K clusters, job size: 128-512 GPUs
  • 2023: 16-24K clusters, job size: 16K GPUs

AI jobs at scale: Today
  • Software Infra
  • Physical Infrastructure
  • Llama
  • 2024

Training @ Scale is Not Linear!
[Figure: Throughput versus # of GPUs, labeled "Not scaling linearly"]

One Small Step

Interruptions
[Figure: Interruptions per Hour (0 to 30) versus Number of GPUs (25k to 1 million)]
More GPUs equals More Failures

Roadmap to Resilient AI: Metrics Driven Outcomes
  • Effective Training Time
  • E2
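The "More GPUs equals More Failures" point in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the slides: it assumes that GPUs fail independently at a constant rate, and the per-GPU mean time between failures (`ASSUMED_GPU_MTBF_HOURS`) is a hypothetical placeholder rather than a figure reported by Meta. Under those assumptions, the expected number of interruptions per hour grows in direct proportion to the number of GPUs in a job.

```python
# Minimal sketch: expected interruptions for a training job, ASSUMING
# independent GPU failures at a constant per-GPU rate. The MTBF value below
# is a hypothetical placeholder, not a number from the presentation.

ASSUMED_GPU_MTBF_HOURS = 50_000.0  # hypothetical per-GPU mean time between failures


def interruptions_per_hour(num_gpus: int, gpu_mtbf_hours: float) -> float:
    """Expected job-wide interruptions per hour.

    Each GPU is assumed to fail at rate 1 / gpu_mtbf_hours, and any single
    failure is assumed to interrupt the whole job, so the rates add up.
    """
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    if gpu_mtbf_hours <= 0:
        raise ValueError("gpu_mtbf_hours must be positive")
    return num_gpus / gpu_mtbf_hours


def hours_between_interruptions(num_gpus: int, gpu_mtbf_hours: float) -> float:
    """Expected time between interruptions (inverse of the job-wide rate)."""
    rate = interruptions_per_hour(num_gpus, gpu_mtbf_hours)
    return float("inf") if rate == 0 else 1.0 / rate


if __name__ == "__main__":
    for n in (16_000, 100_000, 1_000_000):
        rate = interruptions_per_hour(n, ASSUMED_GPU_MTBF_HOURS)
        gap = hours_between_interruptions(n, ASSUMED_GPU_MTBF_HOURS)
        print(f"{n:>9} GPUs: {rate:.2f} interruptions/hour, "
              f"one every {gap:.2f} hours")
```

The linear model is the simplest possible choice; real clusters also see correlated failures (network, power, software), so this is only meant to show why the interruption rate rises with job size.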
