
| status | maintainer | last_updated | tags |
| ------ | ---------- | ------------ | ---- |
| Active | pacoxu | 2025-10-29 | ai-infrastructure, kubernetes, learning-path, landscape |

AI-Infra Landscape & Learning Path πŸš€

δΈ­ζ–‡η‰ˆ | English

Welcome to the AI-Infra repository! This project provides a curated landscape and structured learning path for engineers building and operating modern AI infrastructure, especially in the Kubernetes and cloud-native ecosystem.

🌐 Overview

This landscape visualizes key components across the AI Infrastructure stack, mapped by:

  • Horizontal Axis (X):

    • Left: Prototype / Early-stage projects
    • Right: Kernel & Runtime maturity
  • Vertical Axis (Y):

    • Bottom: Infrastructure Layer (Kernel/Runtime)
    • Top: Application Layer (AI/Inference)

The goal is to demystify the evolving AI Infra stack and guide engineers on where to focus their learning.

πŸ“‘ Table of Contents

πŸ“‚ Documentation Files

Kubernetes

Inference

Training

Observability

πŸ“Š AI-Infra Landscape (June 2025; an update is pending)

Legend:

  • Dashed outlines = Early stage or under exploration
  • Labels on right = Functional categories

AI-Infra Landscape

🧭 Learning Path for AI Infra Engineers

πŸ“¦ 0. Kernel & Runtime (底层内核)

Core Kubernetes components and container runtime fundamentals. Skip this section if you are using a managed Kubernetes service.

  • Key Components:

    • Core: Kubernetes, CRI, containerd, KubeVirt
    • Networking: CNI (focus: RDMA, specialized devices)
    • Storage: CSI (focus: checkpointing, model caching, data management)
    • Tools: KWOK (GPU node mocking), Helm (package management)
  • Learning Topics:

    • Container lifecycle & runtime internals
    • Kubernetes scheduler architecture
    • Resource allocation & GPU management
    • For detailed guides, see Kubernetes Guide

πŸ“ 1. Scheduling & Workloads (θ°ƒεΊ¦δΈŽε·₯作负载)

Advanced scheduling, workload orchestration, and device management for AI workloads in Kubernetes clusters.

  • Key Areas:

    • Batch Scheduling: Kueue, Volcano, Koordinator, Godel, YuniKorn (Kubernetes WG Batch)
    • GPU Scheduling: HAMi, NVIDIA KAI Scheduler, NVIDIA Grove
    • GPU Management: NVIDIA GPU Operator, NVIDIA DRA Driver, Device Plugins
    • Workload Management: LWS (LeaderWorkerSet), Pod Groups, Gang Scheduling
    • Device Management: DRA, NRI (Kubernetes WG Device Management)
    • Checkpoint/Restore: GPU checkpoint/restore for fault tolerance and migration (NVIDIA cuda-checkpoint, AMD AMDGPU plugin via CRIU)
  • Learning Topics:

    • Job vs. pod scheduling strategies (binpack, spread, DRF)
    • Queue management & SLOs
    • Multi-model & multi-tenant scheduling

See Kubernetes Guide for comprehensive coverage of pod lifecycle, scheduling optimization, workload isolation, and resource management. Detailed guides: Kubernetes Learning Plan | Pod Lifecycle | Pod Startup Speed | Scheduling Optimization | Isolation | DRA | DRA Performance Testing | NVIDIA GPU Operator | NRI
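The DRF policy mentioned among the learning topics can be made concrete in a few lines: each tenant's dominant share is its largest per-resource usage fraction, and the scheduler serves the tenant with the smallest one next. A minimal Python sketch (the tenant names, resource capacities, and usage numbers are made up for illustration, not taken from any scheduler's implementation):

```python
# Dominant Resource Fairness (DRF), minimal sketch: allocate the next
# task to the tenant with the smallest dominant share.

CAPACITY = {"cpu": 18.0, "gpu": 4.0}  # assumed total cluster resources

def dominant_share(usage: dict, capacity: dict) -> float:
    """A tenant's dominant share is its largest per-resource fraction."""
    return max(usage[r] / capacity[r] for r in capacity)

def pick_next(tenants: dict) -> str:
    """DRF fairness rule: serve the tenant with the lowest dominant share."""
    return min(tenants, key=lambda t: dominant_share(tenants[t], CAPACITY))

usage = {
    "team-a": {"cpu": 6.0, "gpu": 1.0},  # dominant share: max(6/18, 1/4) = 1/3
    "team-b": {"cpu": 2.0, "gpu": 2.0},  # dominant share: max(2/18, 2/4) = 1/2
}
print(pick_next(usage))  # team-a, since 1/3 < 1/2
```

Note how GPU-heavy tenants are compared by their GPU fraction and CPU-heavy tenants by their CPU fraction, which is what makes DRF fair across heterogeneous workloads.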

  • RoadMap:
    • Gang Scheduling in Kubernetes #4671
    • LWS Gang Scheduling KEP-407
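The core rule behind the gang-scheduling items above is all-or-nothing admission: a pod group runs only if every member can be placed simultaneously, which avoids deadlocks where partial jobs hold GPUs while waiting for peers. A toy Python sketch of that admission check (node names, GPU counts, and the greedy placement strategy are invented for illustration):

```python
# Gang scheduling sketch: admit a pod group only if every member fits;
# otherwise place nothing, so no pod holds GPUs while peers are pending.

def gang_admit(pod_gpus: list, node_free: dict):
    """Return a pod->node placement covering the whole group, or None."""
    free = dict(node_free)          # work on a copy; commit only on success
    placement = {}
    for i, need in enumerate(pod_gpus):
        node = next((n for n, f in free.items() if f >= need), None)
        if node is None:
            return None             # one member doesn't fit -> reject the gang
        free[node] -= need
        placement[f"pod-{i}"] = node
    return placement

nodes = {"node-a": 4, "node-b": 2}
print(gang_admit([2, 2, 2], nodes))  # fits: two pods on node-a, one on node-b
print(gang_admit([4, 4], nodes))     # second pod cannot fit -> None
```

Real schedulers such as Volcano and Kueue implement this idea with pod groups and quota reservation rather than a simple greedy pass, but the admit-all-or-nothing contract is the same.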

🧠 2. Model Inference & Runtime Optimization (ζŽ¨η†δΌ˜εŒ–)

LLM inference engines, platforms, and optimization techniques for efficient model serving at scale.

  • Key Topics:

    • Model architectures (Llama 3/4, Qwen 3, DeepSeek-V3, Flux)
    • Efficient transformer inference (KV Cache, FlashAttention, CUDA Graphs)
    • LLM serving and orchestration platforms
    • Multi-accelerator optimization
    • MoE (Mixture of Experts) architectures
    • Model lifecycle management (cold-start, sleep mode, offloading)
    • AI agent memory and context management
    • Performance testing and benchmarking
See Inference Guide for comprehensive coverage of engines (vLLM, SGLang, Triton, TGI), platforms (Dynamo, AIBrix, OME, llmaz, Kthena, KServe), and deep-dive topics: Model Architectures | AIBrix | P/D Disaggregation | Caching | Memory/Context DB | MoE Models | Model Lifecycle | Performance Testing
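To make the KV Cache idea from the topics above concrete, here is a toy single-head attention decode loop in pure Python: keys and values from earlier tokens are cached, so each new token computes only its own projection and attends over the cache instead of reprocessing the whole sequence. Dimensions and values are arbitrary; this is a didactic sketch, not any engine's implementation:

```python
# KV cache sketch: past keys/values are stored so each decode step
# attends over all previous tokens without recomputing them.

import math

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        """Append this token's k/v, then attend q over the full cache."""
        self.keys.append(k)
        self.values.append(v)
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(len(q))
                  for key in self.keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]      # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(self.values[0])
        return [sum(w * val[d] for w, val in zip(weights, self.values))
                for d in range(dim)]

cache = KVCache()
out1 = cache.step(q=[1.0, 0.0], k=[1.0, 0.0], v=[1.0, 2.0])
out2 = cache.step(q=[0.0, 1.0], k=[0.0, 1.0], v=[3.0, 4.0])
print(len(cache.keys))  # 2 cached tokens after two decode steps
```

The memory cost of this cache grows linearly with sequence length, which is why paged and quantized KV-cache management is a central topic in engines like vLLM and SGLang.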


🧩 3. AI Gateway & Agentic Workflow


🎯 4. Training on Kubernetes

Distributed training of large AI models on Kubernetes with fault tolerance, gang scheduling, and efficient resource management.

  • Key Topics:
    • Transformers: Standardizing model definitions across the PyTorch ecosystem
    • PyTorch ecosystem and accelerator integration (DeepSpeed, vLLM, NPU/HPU/XPU)
    • Distributed training strategies (data/model/pipeline parallelism)
    • Gang scheduling and job queueing
    • Fault tolerance and checkpointing
    • GPU error detection and recovery
    • Training efficiency metrics (ETTR, MFU)
    • GitOps workflows for training management
    • Storage optimization for checkpoints
    • Pre-training large language models (MoE, DeepSeek-V3, Llama 4)
    • Scaling experiments and cluster setup (AMD MI325)

See Training Guide for comprehensive coverage of training operators (Kubeflow, Volcano, Kueue), ML platforms (Kubeflow Pipelines, Argo Workflows), GitOps (ArgoCD), fault tolerance strategies, ByteDance's training optimization framework, and industry best practices. Detailed guides: Transformers | PyTorch Ecosystem | Pre-Training | Parallelism Strategies | Kubeflow | ArgoCD
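Of the efficiency metrics listed above, MFU (Model FLOPs Utilization) compares achieved training throughput against hardware peak. A hedged sketch using the common ~6N FLOPs-per-token approximation for dense transformer training; the model size, throughput, GPU count, and peak-FLOPs numbers below are illustrative only, not measurements:

```python
# MFU sketch: achieved training FLOPs/s over aggregate peak FLOPs/s,
# using the common 6 * params FLOPs-per-token approximation for dense
# transformers (forward + backward). All figures below are assumed.

def mfu(params: float, tokens_per_sec: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved = 6.0 * params * tokens_per_sec   # approx. training FLOPs/s
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak

# e.g. a 7B-parameter model at 600k tokens/s across 64 GPUs,
# each with an assumed ~989 TFLOPs BF16 peak
u = mfu(params=7e9, tokens_per_sec=6.0e5,
        num_gpus=64, peak_flops_per_gpu=989e12)
print(f"MFU = {u:.1%}")
```

MFU is a useful cross-cluster comparison because it normalizes away batch size and parallelism layout; ETTR (effective training time ratio) complements it by capturing time lost to failures and restarts.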


πŸ” 5. Observability of AI Workloads

Comprehensive monitoring, metrics, and observability across the AI infrastructure stack for production operations.

  • Key Topics:
    • Infrastructure monitoring: GPU utilization, memory, temperature, power
    • Inference metrics: TTFT, TPOT, ITL, throughput, request latency
    • Scheduler observability: Queue depth, scheduling latency, resource allocation
    • LLM application tracing: Request traces, prompt performance, model quality
    • Cost optimization: Resource utilization analysis and right-sizing
    • Multi-tenant monitoring: Per-tenant metrics and fair-share enforcement

See Observability Guide for comprehensive coverage of GPU monitoring (DCGM, Prometheus), inference metrics (OpenLLMetry, Langfuse, OpenLit), scheduler observability (Kueue, Volcano), distributed tracing (DeepFlow), and LLM evaluation platforms (TruLens, Deepchecks).
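The inference latency metrics above (TTFT, TPOT, ITL) can all be derived from per-token arrival timestamps. A small illustrative Python helper (the request timeline below is made up): TTFT is the delay to the first token, ITL is the list of gaps between consecutive tokens, and TPOT is the mean of those gaps.

```python
# Deriving TTFT, ITL, and TPOT from per-token arrival timestamps
# for a single (hypothetical) streaming inference request.

def latency_metrics(request_start: float, token_times: list) -> dict:
    ttft = token_times[0] - request_start             # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(itl) / len(itl) if itl else 0.0        # mean time per output token
    return {"ttft": ttft, "itl": itl, "tpot": tpot}

# assumed timeline: first token at 0.25s, then one token every ~50ms
m = latency_metrics(0.0, [0.25, 0.30, 0.35, 0.40])
print(m["ttft"], m["tpot"])
```

In production these are typically reported as percentiles (p50/p95/p99) across requests rather than per-request values, since tail latency is what SLOs are written against.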


6. Ecosystem Initiatives


πŸ—ΊοΈ RoadMap

For planned features, upcoming topics, and discussion on what may or may not be included in this repository, please see the RoadMap.

🀝 Contributing

We welcome contributions to improve this landscape and path! Whether it's a new project, learning material, or a diagram update, please open a PR or issue.

πŸ“š References

If you have some resources about AI Infra, please share them in #8.

Here are some key conferences in the AI Infra space:

πŸ“œ License

Apache License 2.0.


This repo is inspired by the rapidly evolving AI Infra stack and aims to help engineers navigate and master it.
