Skill airunway-aks-setup

Name: airunway-aks-setup
Author: microsoft

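Low risk

From bare cluster to running model: a complete guide to deploying AI Runway on AKS

Deploying LLMs on Azure Kubernetes Service requires coordinating several components, including GPU nodes, inference providers, and model artifacts. This skill walks you through the full setup, from cluster verification to your first model deployment.

Supports: Claude, Codex, Code (CC)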

🥉 72 Bronze

Download the skill ZIP

Upload in Claude

Go to Settings → Capabilities → Skills → Upload skill

Turn it on and start using it

Test it

Using "airunway-aks-setup". Set up AI Runway on my AKS cluster

Expected result:

Step 1 - Cluster verification:
✓ kubectl found
✓ make found
✓ curl found
✓ Cluster context: my-cluster
✓ Nodes: 3 total, 1 GPU node detected
- GPU node: Standard_A100_80GB
- VRAM: 80 GB
- bfloat16: supported

Ready to continue to step 2?
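The tool checks at the top of that output can be approximated locally. The following is a minimal sketch, not the skill's actual implementation: the function names `check_tools` and `format_report` are hypothetical, and only the three tool names come from the output above.

```python
import shutil

# Tool names taken from the expected output; everything else is illustrative.
REQUIRED_TOOLS = ("kubectl", "make", "curl")


def check_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return a dict mapping each tool name to True if it is found on PATH."""
    return {tool: which(tool) is not None for tool in tools}


def format_report(status):
    """Render one line per tool, mirroring the check-mark style shown above."""
    lines = []
    for tool, found in status.items():
        mark = "✓" if found else "✗"
        state = "found" if found else "missing"
        lines.append(f"{mark} {tool} {state}")
    return "\n".join(lines)
```

Calling `print(format_report(check_tools()))` prints one line per tool. The `which` parameter exists only so the lookup can be replaced in tests.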

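Using "airunway-aks-setup". Deploy a model with 8B parameters on my cluster

Expected result:

Recommended: meta-llama/Llama-3.1-8B-Instruct
Provider: KAITO (vLLM)
Reason: Your A100-80GB can run an 8B model, with tensor parallelism available as an option.

This is a gated model that requires a HuggingFace access token. Continue to collect the token?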
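The sizing logic behind a recommendation like this can be approximated with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below is an illustration only; the function names and the 20% headroom reserved for KV cache and activations are assumptions, not values taken from the skill's model-sizing.md.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(params_billions, dtype="bfloat16"):
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    params_billions * 1e9 parameters * bytes per parameter / 1e9 bytes per GB.
    """
    return params_billions * BYTES_PER_PARAM[dtype]


def fits(params_billions, vram_gb, dtype="bfloat16", headroom=0.2):
    """True if the weights fit while keeping `headroom` of VRAM free.

    The 0.2 default is an assumed safety margin, not a documented value.
    """
    usable_gb = vram_gb * (1 - headroom)
    return weight_memory_gb(params_billions, dtype) <= usable_gb
```

Under these assumptions an 8B model in bfloat16 needs about 16 GB for weights, which fits comfortably on an 80 GB A100, while a 70B model at the same precision (about 140 GB) does not fit on a single such GPU.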

Security audit

Low risk

v1 • 4/24/2026

This is a legitimate Microsoft-published documentation skill for AI Runway AKS setup. Static scanner flagged documentation files containing bash/PowerShell code examples as potential security issues. After evaluation, all findings are false positives: the skill provides markdown documentation with command examples for human execution, not executable code. No actual command injection, path traversal vulnerabilities, or malicious patterns exist. The skill is safe for publication with low risk level.

Files scanned

619

Lines analyzed

Findings

Total audits

Low-risk issues (1)

SKILL.md:13 SKILL.md:19 references/steps/step-1-verify.md:9-52 references/steps/step-5-deploy.md:3-96

Documentation Code Examples Misidentified as Shell Execution

Static scanner flagged markdown files with bash/PowerShell code blocks as 'Ruby/shell backtick execution'. These are documentation files providing command examples for users to execute manually. No actual shell execution occurs. Pattern matchers cannot distinguish between executable code and human-readable documentation.

Audited by: claude

Quality score

Architecture

100

Maintainability

Content

Community

Security

Spec compliance

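What you can build

First AI Runway deployment

A beginner's guide to AI Runway on AKS, covering everything from cluster verification to a first GPU-accelerated model deployment.

GPU capability assessment

Find out which GPU hardware is available, check dtype support (bfloat16, float16), and get model recommendations based on the cluster's VRAM capacity.

Troubleshoot failed deployments

Resume from a specific step to continue a partially completed setup, or follow the rollback steps to undo a failed deployment and start over.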

Try these prompts

Basic AI Runway setup

Set up AI Runway on my AKS cluster. I have an existing cluster with GPU nodes.

Resume from a specific step

Skip to step 4 and set up the KAITO inference provider on my AKS cluster.

GPU assessment only

Check what GPUs are available in my AKS cluster and tell me which models I can run.

Deploy a specific model

Deploy the Llama-3.1-8B model to my AKS cluster using AI Runway. I have an A100-80GB node.

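Best practices

Always confirm GPU node availability and VRAM capacity before choosing a model size
Validate the setup with a non-gated model such as Phi-3 or Gemma before moving to gated models
Use the skip-step parameter to resume from a specific step after an interruption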
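Resuming from a given step amounts to slicing an ordered step list. The sketch below only illustrates the idea: the step names are inferred from the filenames under references/steps/, and the function `remaining_steps` is hypothetical rather than part of the skill.

```python
# Step names inferred from references/steps/step-N-*.md; order matters.
STEPS = ["verify", "controller", "gpu", "provider", "deploy", "summary"]


def remaining_steps(start_at=1):
    """Return (number, name) pairs for the steps still to run.

    `start_at` is 1-based, matching prompts such as "Skip to step 4".
    """
    if not 1 <= start_at <= len(STEPS):
        raise ValueError(f"start_at must be between 1 and {len(STEPS)}")
    return list(enumerate(STEPS[start_at - 1:], start=start_at))
```

For example, `remaining_steps(4)` yields the provider, deploy, and summary steps, numbered 4 through 6.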

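Avoid

Do not run this skill before confirming that you understand the cost of GPU compute on Azure
Do not skip cluster verification; knowing your GPU hardware is a prerequisite for choosing a model
Do not attempt gated models (such as Llama) before validating the setup with a non-gated model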

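Frequently asked questions

What is AI Runway?

AI Runway is a Kubernetes-native framework for deploying and managing LLM inference on Azure Kubernetes Service. It provides custom resource definitions for model deployments and integrates with inference providers such as KAITO, Dynamo, and KubeRay.

Do I need an existing AKS cluster?

Yes. This skill assumes you already have an AKS cluster. If you need to create one, first use the azure-kubernetes skill to provision a cluster with GPU nodes, then return to this skill.

Which GPUs are supported?

AI Runway supports NVIDIA GPUs, including the T4, V100, A10, A10G, L4, L40S, A100, H100, and H200. dtype support varies by GPU: older GPUs such as the T4 and V100 do not support bfloat16.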

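Why does my deployment fail with a bfloat16 error?

T4 and V100 GPUs do not support bfloat16 precision. Add --dtype float16 to the serving arguments, or switch to the xformers attention backend. See the gpu-profiles.md reference for the constraints of your specific GPU.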
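The dtype rule in this answer can be expressed as a small lookup. This is a sketch under stated assumptions: the set of GPUs without bfloat16 contains only the two models named above, the function names are hypothetical, and the real constraints live in gpu-profiles.md.

```python
# GPUs named in this FAQ as lacking bfloat16 support.
NO_BFLOAT16 = {"T4", "V100"}


def pick_dtype(gpu_model):
    """Return "float16" for GPUs without bfloat16 support, else "bfloat16"."""
    return "float16" if gpu_model.upper() in NO_BFLOAT16 else "bfloat16"


def serving_args(gpu_model, extra=()):
    """Append --dtype float16 to the serving arguments only when required."""
    args = list(extra)
    if pick_dtype(gpu_model) == "float16":
        args += ["--dtype", "float16"]
    return args
```

With this rule, `serving_args("T4")` adds the float16 override, while `serving_args("A100")` leaves the arguments untouched.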

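How long does a model deployment take?

Small models (1B-8B) typically deploy in 5-10 minutes. Large models (70B+) can take 20-40 minutes or longer, because the model weights must first be downloaded from HuggingFace. Check the pod logs to see download progress.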
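One way to watch download progress is to follow the pod's logs with kubectl. The helper below is a minimal sketch that only builds the command line: the function name is hypothetical, and the pod name and namespace are placeholders you would replace with your own values.

```python
def logs_command(pod, namespace="default", tail=50):
    """Build a `kubectl logs` command that streams a pod's recent output.

    --follow keeps the stream open; --tail limits how much history is shown.
    """
    return [
        "kubectl", "logs", pod,
        "--namespace", namespace,
        "--follow",
        f"--tail={tail}",
    ]
```

The returned list can be passed to `subprocess.run` on a machine where kubectl is configured for the cluster.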

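How do I roll back a failed deployment?

Use kubectl delete to remove the model deployment and the secret. For the provider and the controller, use the make commands in the root of the AI Runway repository. See troubleshooting.md for the full rollback sequence.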
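The kubectl part of that rollback can be sketched as two delete commands. This is an illustration only: the function name is hypothetical, the resource kind of the model deployment is deliberately left as a parameter because this page does not name it, and the make targets are not covered. The authoritative sequence is the one in troubleshooting.md.

```python
def rollback_commands(deployment_kind, deployment_name, secret_name,
                      namespace="default"):
    """Build the kubectl delete commands for a model deployment and its secret.

    `deployment_kind` must be supplied by the caller; this sketch does not
    assume what the AI Runway custom resource is called.
    """
    # --ignore-not-found makes the rollback safe to re-run.
    common = ["--namespace", namespace, "--ignore-not-found"]
    return [
        ["kubectl", "delete", deployment_kind, deployment_name, *common],
        ["kubectl", "delete", "secret", secret_name, *common],
    ]
```

Each command in the returned list can be reviewed before it is run, which is useful when undoing a partially completed setup.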

Developer details

Author

microsoft

License

MIT

Repository

https://github.com/microsoft/azure-skills/tree/main/.github/plugins/azure-skills/skills/airunway-aks-setup/

Ref

main

File structure

📁 references/
  📁 steps/
    📄 step-1-verify.md
    📄 step-2-controller.md
    📄 step-3-gpu.md
    📄 step-4-provider.md
    📄 step-5-deploy.md
    📄 step-6-summary.md
  📄 gpu-profiles.md
  📄 model-sizing.md
  📄 powershell-notes.md
  📄 troubleshooting.md
📄 SKILL.md