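GPU anomaly detection and automatic isolation - Container Service for Kubernetes (ACK)

ACK uses the ack-node-problem-detector (NPD) component to monitor the health of GPU resources. When a GPU node reports an anomaly such as an XID or SXID error, NPD detects it and automatically isolates the affected GPU. The healthy GPUs on the node keep serving workloads, which limits business impact and improves cluster reliability and O&M efficiency.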

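Prerequisites

ack-node-problem-detector (NPD) version 1.2.24 or later is installed.
When ack-nvidia-device-plugin 0.17.0 or later is used together with NPD 1.2.24 or later, a GPU is automatically isolated when NPD detects an anomaly on it, and the isolation is automatically removed when NPD detects that the GPU has recovered.
For how to check and upgrade the ack-nvidia-device-plugin version, see "View the NVIDIA Device Plugin version".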

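ack-node-problem-detector (NPD) is the ACK component for monitoring abnormal events on cluster nodes. It is adapted and enhanced from the open source community project node-problem-detector and adds a rich set of GPU checks to improve anomaly discovery in GPU scenarios. When it finds an anomaly, the component generates a Kubernetes Event or a Kubernetes Node Condition, depending on the anomaly type.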

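Usage notes

When a GPU anomaly is detected, ack-node-problem-detector generates an NVIDIA GPU isolation file according to the default isolation policy, and ack-nvidia-device-plugin isolates the abnormal GPU based on that file. This keeps new workloads from being scheduled to a GPU on which they cannot run, while the healthy GPUs continue to serve. After a GPU is isolated, if the remaining GPUs on the node cannot satisfy a task (for example, an 8-GPU task when only 7 GPUs are available), the task cannot be scheduled and GPU resources may sit idle. Automatic isolation is not automatic repair: the node instance continues to be billed while a GPU is isolated, and you still need to repair the node. We recommend that you configure GPU anomaly alerts so that you can respond promptly.
You can also disable automatic isolation of abnormal GPUs if your business requires it. For the procedure, see "How do I disable automatic isolation of abnormal GPUs in NPD?". Certain versions of the NVIDIA Device Plugin component also support automatic isolation of abnormal GPUs, but that capability is disabled in a different way. For details, see "How do I disable the native GPU isolation capability of NVIDIA Device Plugin?".
The GPU driver writes NVIDIA XIDs and SXIDs to /var/log/messages or /var/log/syslog through the NVRM event mechanism. NPD records whether each XID and SXID has been handled. If the node is restarted after an XID or SXID is found, NPD no longer generates an Event or Node Condition for that XID or SXID, whether or not the underlying problem has been fixed (for example, XID 79 indicates that the GPU device must be replaced). In other words, NPD treats the XID as resolved.
NPD detects NVIDIA XIDs and SXIDs by scanning /var/log/messages or /var/log/syslog on the node. If the dmesg log is redirected to another file, NPD cannot detect NVIDIA XIDs or SXIDs.
Starting from NPD 1.2.29, the GPU anomaly detection plug-in is deployed separately as a DaemonSet named ack-accel-health-monitor.
In some cases, a GPU anomaly prevents GPU containers from being created on the node. The GPU anomaly detection container may be affected in the same way: if it cannot be created, detection cannot run.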
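
The log-based detection described above can be pictured with a short Python sketch. It is not NPD's implementation: the regular expression assumes the commonly seen driver log shape `NVRM: Xid (PCI:<address>): <code>, ...`, and the sample lines and the `find_xids` helper are made up for illustration.

```python
import re

# Assumed shape of an NVIDIA driver Xid line in the kernel log, for example:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
# This pattern is an illustration, not the pattern NPD itself uses.
XID_LINE = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9A-Fa-f:.]+)\): (?P<xid>\d+)")


def find_xids(lines):
    """Return (pci_address, xid_code) pairs for every Xid line found."""
    hits = []
    for line in lines:
        match = XID_LINE.search(line)
        if match:
            hits.append((match.group("pci"), int(match.group("xid"))))
    return hits


# Hypothetical log lines, such as might appear in /var/log/messages.
sample = [
    "Jan  1 00:00:01 node kernel: NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.",
    "Jan  1 00:00:02 node kernel: eth0: link up",
]
print(find_xids(sample))
```

A scanner of this kind only sees what reaches the file it reads, which is why redirecting the dmesg log elsewhere hides XIDs from detection.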

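Because the NPD GPU detection plug-in Pod must check the state of GPU devices and GPU components, it requires elevated permissions such as privileged=true. The following table lists them.

Permission type	Permissions
Cluster RBAC permissions	`Node: get` `Node/Status: update` `Events: create`
Container permissions	`privileged: true`; read-only mounts of the host paths /dev/kmsg, /usr/lib, /etc, /usr/lib64, and /proc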

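Checks and suggested fixes

After a GPU anomaly is found, repair it by referring to NVIDIA Xid Errors. Depending on the node instance type (for example, ECS or Lingjun), you can also check in the console of the corresponding cloud product whether the node instance has O&M events, or use the self-service diagnostic tool to troubleshoot node hardware anomalies.

A suggested fix of None means that no hardware action is required. Check whether your application configuration is correct.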

Check name	Generates Node Condition	Generates Event	Description	Isolates GPU by default	Suggested fix
NvidiaXID13Error	No	Yes `Type: Warning` `Reason: NvidiaXID13Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 13 error has occurred.`	`Graphics Engine Exception.` Usually caused by an out-of-bounds array access or an illegal instruction; in rare cases a hardware problem.	No	None
NvidiaXID31Error	No	Yes `Type: Warning` `Reason: NvidiaXID31Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 31 error has occurred.`	`GPU memory page fault.` Usually caused by an illegal address access in the application; in rare cases a driver or hardware problem.	No	None
NvidiaXID43Error	No	Yes `Type: Warning` `Reason: NvidiaXID43Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 43 error has occurred.`	`GPU stopped processing.` Logged when your application hits a software-induced fault and must terminate. The GPU remains healthy. In most cases this indicates an error in your application, not a driver problem.	No	None
NvidiaXID44Error	Yes `Type: NvidiaXID44Error` `Reason: NodeHasNvidiaXID44Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 44 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID44Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 44 error has occurred.`	`Graphics Engine fault during context switch.`	Yes (NPD <= 1.2.28); No (NPD >= 1.2.30)	Restart the node.
NvidiaXID45Error	No	Yes `Type: Warning` `Reason: NvidiaXID45Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 45 error has occurred.`	`Preemptive cleanup, due to previous errors - Most likely to see when running multiple cuda applications and hitting a DBE.` Logged when your application is aborted and the kernel driver terminates the GPU application running on the GPU. Control-C, a GPU reset, and SIGKILL are examples of aborts that create this event. In many cases this does not indicate an error but results from an action by you or the system.	No	None
NvidiaXID48Error	Yes `Type: NvidiaXID48Error` `Reason: NodeHasNvidiaXID48Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 48 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID48Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 48 error has occurred.`	`Double Bit ECC Error(DBE).` Logged when the GPU detects an uncorrectable error. The error is also reported to the application. A GPU reset or node restart is required to clear it.	Yes	Restart the node.
NvidiaXID61Error	Yes `Type: NvidiaXID61Error` `Reason: NodeHasNvidiaXID61Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 61 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID61Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 61 error has occurred.`	`Internal micro-controller breakpoint/warning (newer drivers).`	Yes (NPD <= 1.2.28); No (NPD >= 1.2.30)	Restart the node.
NvidiaXID62Error	Yes `Type: NvidiaXID62Error` `Reason: NodeHasNvidiaXID62Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 62 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID62Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 62 error has occurred.`	`Internal micro-controller halt (newer drivers).`	Yes	Restart the node.
NvidiaXID63Error	No	Yes `Type: Warning` `Reason: NvidiaXID63Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 63 error has occurred.`	`ECC page retirement or row remapping recording event.` When an application hits a GPU memory hardware error, NVIDIA's self-correction mechanism retires or remaps the faulty memory region. The retirement or remapping information must be recorded in the infoROM to take effect permanently. Volta architecture: the ECC page retirement event was recorded to the infoROM successfully. Ampere architecture: the row remapping event was recorded to the infoROM successfully.	No	None
NvidiaXID64Error	No	Yes `Type: Warning` `Reason: NvidiaXID64Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 64 error has occurred.`	`ECC page retirement or row remapper recording failure.` Triggered in scenarios similar to Xid 63. Xid 63 means that the retirement or remapping information was recorded to the infoROM successfully; Xid 64 means that the recording failed.	No	None
NvidiaXID69Error	Yes `Type: NvidiaXID69Error` `Reason: NodeHasNvidiaXID69Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 69 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID69Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 69 error has occurred.`	`Graphics Engine class error.`	Yes (NPD <= 1.2.28); No (NPD >= 1.2.30)	Restart the node.
NvidiaXID74Error	Yes `Type: NvidiaXID74Error` `Reason: NodeHasNvidiaXID74Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 74 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID74Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 74 error has occurred.`	`Fatal NVLINK Error.` An Xid raised by an NVLink hardware error.	Yes	Hardware repair.
NvidiaXID79Error	Yes `Type: NvidiaXID79Error` `Reason: NodeHasNvidiaXID79Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 79 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID79Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 79 error has occurred.`	`GPU has fallen off the bus.` The GPU has dropped off the bus and can no longer be detected on it.	Yes	Hardware repair.
NvidiaXID94Error	No	Yes `Type: Warning` `Reason: NvidiaXID94Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 94 error has occurred.`	`Contained ECC error.` When an application hits an uncorrectable GPU memory ECC error, NVIDIA's error containment mechanism tries to confine the error to the affected application so that it does not affect every application on the GPU. Xid 94 is generated when containment succeeds; only the application that hit the uncorrectable ECC error is affected.	No	None
NvidiaXID95Error	Yes `Type: NvidiaXID95Error` `Reason: NodeHasNvidiaXID95Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 95 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID95Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 95 error has occurred.`	`Uncontained ECC error.` Xid 95 means that containment failed. All applications running on the GPU are affected, and the GPU must be reset before applications can be restarted.	Yes	Restart the node.
NvidiaXID109Error	Yes `Type: NvidiaXID109Error` `Reason: NodeHasNvidiaXID109Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 109 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID109Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 109 error has occurred.`	`Context Switch Timeout Error.`	Yes (NPD <= 1.2.28); No (NPD >= 1.2.30)	None
NvidiaXID119Error	Yes `Type: NvidiaXID119Error` `Reason: NodeHasNvidiaXID119Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 119 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID119Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 119 error has occurred.`	`GSP RPC Timeout.` A timeout occurred while waiting for the GSP core to respond to an RPC message.	Yes	Restart the node.
NvidiaXID120Error	Yes `Type: NvidiaXID120Error` `Reason: NodeHasNvidiaXID120Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 120 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID120Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 120 error has occurred.`	`GSP Error.` Code running on the GPU's GSP core encountered an error.	Yes	Restart the node.
NvidiaXID140Error	Yes `Type: NvidiaXID140Error` `Reason: NodeHasNvidiaXID140Error` `Message: TS=xxx;GpuIds=xxx;MSG=An NVIDIA XID 140 error has occurred.`	Yes `Type: Warning` `Reason: NvidiaXID140Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid 140 error has occurred.`	`Unrecovered ECC Error.` May occur when the GPU driver detects uncorrectable errors in GPU memory that affect the driver's ability to mark pages for dynamic page offlining or row remapping. A GPU reset is required.	Yes	Restart the node.
NvidiaXID[code]Error	No	Yes (only three events are generated) `Type: Warning` `Reason: NvidiaXID[code]Error` `Message: GpuIds=xxx;TS=xxx;Xid=xxx;MSG=An nvidia xid [code] error has occurred.`	Any other XID that is not listed in this table.	No	Submit a ticket.
NvidiaSXID[code]Error	No	Yes (only three events are generated) `Type: Warning` `Reason: NvidiaSXID[code]Error` `Message: TS=xxx;NVSwitchIds=xxx;MSG=An nvidia sxid [code] error has occurred.`	SXID errors fall into three categories. Correctable: the error has been corrected; system behavior is not affected and no further recovery is needed. Fatal: the error is fatal to the device and system behavior is affected; the only way to recover is to reset the device or restart the system. Non-fatal: the error is not fatal to the device; system behavior is affected, but a device reset or system restart may not be needed.	No	None
NvidiaEccModeNotEnabled	Yes `Type: NvidiaEccModeNotEnabled` `Reason: EccModeNotEnabled` `Message: GpuIds=xxx;EccModeCurrent=xxx;EccModePending=xxx;MSG=The ECC mode of the GPU is not enabled.`	Yes (events are generated continuously until the problem is fixed) `Type: Warning` `Reason: NvidiaEccModeNotEnabled` `Message: GpuIds=xxx;EccModeCurrent=xxx;EccModePending=xxx;MSG=The ECC mode of the GPU is not enabled.`	ECC mode is not enabled on the node.	No	Enable ECC mode and restart the node.
NvidiaPendingRetiredPages	Yes `Type: NvidiaPendingRetiredPages` `Reason: NodeHasNvidiaPendingRetiredPages` `Message: GpuIds=xxx;VolatileTotalUncorrected=xxx;AggregateTotalUncorrected=xxx;MSG=There are retired pages in a pending state on the GPU.`	Yes (events are generated continuously until the problem is fixed) `Type: Warning` `Reason: NvidiaPendingRetiredPages` `Message: GpuIds=xxx;VolatileTotalUncorrected=xxx;AggregateTotalUncorrected=xxx;MSG=There are retired pages in a pending state on the GPU.`	The GPU has retired pages in the pending state. A GPU reset is required for these retired pages to take effect.	Yes	Restart the node.
NvidiaRemappingRowsFailed	Yes `Type: NvidiaRemappedRowsFailed` `Reason: GPUMemoryRemappingRowsFailed` `Message: GpuIds=xxx;RemappedDueToUncorrectableErrors=xxx;MSG=The GPU has encountered an error with row mapping.`	Yes (events are generated continuously until the problem is fixed) `Type: Warning` `Reason: NvidiaRemappedRowsFailed` `Message: GpuIds=xxx;RemappedDueToUncorrectableErrors=xxx;MSG=The GPU has encountered an error with row mapping.`	Row remapping has failed on the GPU.	Yes	Hardware repair.
NvidiaRemappingRowsRequireReset	Yes `Type: NvidiaRemappingRowsRequireReset` `Reason: UncontainedEccError` `Message: GpuIds=xxx;MSG=Remapping rows requires GPU reset.`	Yes (events are generated continuously until the problem is fixed) `Type: Warning` `Reason: NvidiaRemappingRowsRequireReset` `Message: GpuIds=xxx;MSG=Remapping rows requires GPU reset.`	The GPU has encountered an uncorrectable, uncontained error and must be reset to recover. Reset the GPU as soon as possible.	Yes (NPD <= 1.2.28); No (NPD >= 1.2.30)	Restart the node.
NvidiaDeviceLost	Yes `Type: NvidiaDeviceLost` `Reason: NodeHasNvidiaDeviceLost` `Message: GpuIds=xxx;MSG=The GPU has fallen off the bus or has otherwise become inaccessible`	Yes (events are generated continuously until the problem is fixed) `Type: Warning` `Reason: NvidiaDeviceLost` `Message: GpuIds=xxx;MSG=The GPU has fallen off the bus or has otherwise become inaccessible.`	`The GPU has fallen off the bus or has otherwise become inaccessible.`	Yes	Hardware repair.
NvidiaInfoRomCorrupted	Yes `Type: NvidiaInfoRomCorrupted` `Reason: NodeHasNvidiaInfoRomCorrupted` `Message: GpuIds=xxx;MSG=GPU infoROM is corrupted`	Yes (events are generated continuously until the problem is fixed) `Type: Warning` `Reason: NvidiaInfoRomCorrupted` `Message: GpuIds=xxx;MSG=GPU infoROM is corrupted.`	`infoROM is corrupted.`	Yes	Hardware repair.
NvidiaPowerCableErr	Yes `Type: NvidiaPowerCableErr` `Reason: NodeHasNvidiaPowerCableErr` `Message: GpuIds=xxx;MSG=A device's external power cables are not properly attached`	Yes (events are generated continuously until the problem is fixed) `Type: Warning` `Reason: NvidiaPowerCableErr` `Message: GpuIds=xxx;MSG=A device's external power cables are not properly attached.`	`A device's external power cables are not properly attached.`	Yes	Hardware repair.
NvidiaPersistencedOffline	Yes `Type: NvidiaPersistencedOffline` `Reason: NodeHasNvidiaPersistencedOffline` `Message: TS=xxx;GpuIds=xxx;Nvidia Persistenced service is not running.`	Yes `Type: Warning` `Reason: NvidiaPersistencedOffline` `Message: TS=xxx;GpuIds=xxx;Nvidia Persistenced service is not running.`	The NVIDIA Persistenced service is not running.	No	Restart the nvidia-persistenced service.
NvidiaFabricManagerOffline	Yes `Type: NvidiaFabricManagerOffline` `Reason: NodeHasNvidiaFabricManagerOffline` `Message: TS=xxx;GpuIds=xxx;Nvidia Fabric Manager service is not running.`	Yes `Type: Warning` `Reason: NvidiaFabricManagerOffline` `Message: TS=xxx;GpuIds=xxx;Nvidia Fabric Manager service is not running.`	The NVIDIA Fabric Manager service is not running.	No	Restart the Fabric Manager service.
NvidiaTemperatureHigh	Yes `Type: NvidiaTemperatureHigh` `Reason: NodeHasNvidiaTemperatureHigh` `Message: TS=xxx;GpuIds=xxx;Nvidia gpu temperature exceeds threshold`	Yes `Type: Warning` `Reason: NvidiaTemperatureHigh` `Message: TS=xxx;GpuIds=xxx;Nvidia gpu temperature exceeds threshold`	The GPU temperature is too high (above 100 °C).	No	None
NvidiaNVLinkStateErr	Yes `Type: NvidiaNVLinkStateErr` `Reason: NodeHasNvlinkStateErr` `Message: TS=xxx;GpuIds=xxx;Nvidia nvlink state is down`	Yes `Type: Warning` `Reason: NvidiaNvlinkStateErr` `Message: TS=xxx;GpuIds=xxx;Nvidia nvlink state is down`	The NVIDIA NVLink state has changed to down.	No	Restart the machine.
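
The default-isolation column for the numbered XID checks can be restated as a small lookup, which may help when writing alert rules. The Python sketch below only encodes what the table states; the helper names are illustrative, and because the table does not say how NPD 1.2.29 behaves, the function returns `None` for that case rather than guessing.

```python
# XID codes whose default-isolation column is an unconditional "Yes" in the table.
ALWAYS_ISOLATED = {48, 62, 74, 79, 95, 119, 120, 140}
# XID codes isolated by default on NPD <= 1.2.28 but not on NPD >= 1.2.30.
ISOLATED_UNTIL_1_2_28 = {44, 61, 69, 109}


def parse_version(text):
    """Turn '1.2.30' or 'v1.2.30' into a comparable tuple such as (1, 2, 30)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def isolates_by_default(xid, npd_version):
    """Return True or False per the table, or None where the table is silent."""
    if xid in ALWAYS_ISOLATED:
        return True
    if xid in ISOLATED_UNTIL_1_2_28:
        version = parse_version(npd_version)
        if version <= (1, 2, 28):
            return True
        if version >= (1, 2, 30):
            return False
        return None  # 1.2.29 is not covered by the table
    return False  # every other XID row lists "No"


print(isolates_by_default(79, "1.2.30"))
print(isolates_by_default(44, "v1.2.30"))
```

The lookup covers XID codes only. Checks such as NvidiaPendingRetiredPages or NvidiaDeviceLost are keyed by name in the table and would need a separate mapping.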

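Other related Events

In exclusive-GPU scenarios, NPD by default isolates GPUs automatically based on the checks above. After isolation, new GPU application Pods are not assigned to that GPU. To see the effect, check the nvidia.com/gpu quantity in the resources reported by the Kubernetes Node. After the GPU recovers, ACK removes the isolation automatically.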
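
As a rough way to check the reported nvidia.com/gpu quantity, the Python sketch below reads it from a Node object such as the JSON printed by `kubectl get node <node-name> -o json`. The sample dictionary is made up, and it is an assumption (typical for device plugins, not stated here) that an isolated GPU shows up as an allocatable count lower than the capacity.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_counts(node):
    """Return (capacity, allocatable) of nvidia.com/gpu from a Node object dict."""
    status = node.get("status", {})
    capacity = int(status.get("capacity", {}).get(GPU_RESOURCE, "0"))
    allocatable = int(status.get("allocatable", {}).get(GPU_RESOURCE, "0"))
    return capacity, allocatable


# Hypothetical Node status for an 8-GPU node with one GPU isolated.
sample_node = {
    "status": {
        "capacity": {"nvidia.com/gpu": "8"},
        "allocatable": {"nvidia.com/gpu": "7"},
    }
}

capacity, allocatable = gpu_counts(sample_node)
print(f"capacity={capacity} allocatable={allocatable}")
```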

Trigger	Event content	Description
GPU isolated	Yes `Type: Warning` `Reason: NvidiaDeviceIsolated` `Message: GpuIds=xxx;MSG=nvidia device has been isolated due to detected issues.`	The GPU was isolated because an anomaly was detected.
GPU isolation removed	Yes `Type: Normal` `Reason: NvidiaDeviceRecovered` `Message: GpuIds=xxx;MSG=nvidia device has recovered from the fault.`	The GPU recovered from the anomaly and the isolation was removed.
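
The Event messages shown on this page use a `key=value;key=value` layout, so they can be split into fields for alerting or log processing. The Python sketch below assumes that values never contain `;`, and it stores trailing text that has no `key=` prefix (as in the NvidiaPersistencedOffline message) under `MSG`. Both choices are assumptions made for illustration, not documented NPD behavior.

```python
def parse_event_message(message):
    """Split 'GpuIds=xxx;TS=xxx;MSG=...' into a dict of fields."""
    fields = {}
    for part in message.split(";"):
        key, sep, value = part.partition("=")
        if sep:
            fields[key] = value
        elif part:
            # Free text without a key, for example a service-offline message.
            fields.setdefault("MSG", part)
    return fields


isolated = "GpuIds=xxx;MSG=nvidia device has been isolated due to detected issues."
print(parse_event_message(isolated))
```

The `GpuIds` value is left as an opaque string because the page shows only the `xxx` placeholder and does not say how multiple GPU IDs are delimited.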

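FAQ

How do I disable automatic isolation of abnormal GPUs in NPD?

Background

When a GPU on a node becomes abnormal, ACK uses NPD to isolate it automatically so that tasks are not scheduled to it. Automatic isolation does not perform any repair: the node instance continues to be billed while a GPU is isolated, and you still need to restart or repair the node manually. We recommend that you configure GPU anomaly alerts so that you can respond promptly.

After isolation, if the remaining GPUs on the node cannot satisfy a task (for example, an 8-GPU task when only 7 GPUs are available), the task cannot be scheduled and GPU resources may sit idle.
After the GPU returns to a normal state, the isolation of that GPU device is removed automatically.
To disable automatic isolation, so that an abnormal GPU is still reported as a resource and is not isolated, see the solution below.

Solution

Note

If the ack-node-problem-detector version is v1.2.30 or later, you can control whether abnormal GPUs are automatically isolated by using the generateNvidiaGpuIsolationFile setting in Add-ons.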

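Disable automatic GPU isolation in NPD.
- (Recommended) Method 1: Modify the component configuration in Add-ons.
  1. On the Clusters page, click the name of the target cluster, and then click Add-ons in the left-side navigation pane.
  2. On the Logs and Monitoring tab, find the ack-node-problem-detector component, and then follow the step that matches the current component version.
    - Versions 1.2.24 to 1.2.29: Check which version you can upgrade to. If you can upgrade to 1.2.30 or later, click Upgrade.
      Version 1.2.30 is in phased release. If v1.2.30 or a later version is not shown, submit a ticket to request it.
    - Version 1.2.30 or later: Click Configure.
  3. In the upgrade or configuration dialog, set generateNvidiaGpuIsolationFile (whether to generate the NVIDIA GPU isolation file) to false, and then click Confirm.
    Note
    If you previously disabled automatic GPU isolation temporarily by using Method 2, that setting is retained automatically when you upgrade the NPD component. To re-enable automatic GPU isolation later, set generateNvidiaGpuIsolationFile to true.
- Method 2: Modify the configuration manually in YAML.
  Note
  Disabling automatic GPU isolation this way is a temporary workaround. The configuration is lost when NPD is upgraded to a version earlier than 1.2.30, and you must repeat the following steps after the upgrade. We recommend that you upgrade to 1.2.30 or later so that the component configuration is persisted.
  1. Edit the NPD component YAML.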
```
kubectl edit ds -n kube-system ack-node-problem-detector-daemonset
```
  2. Change EnabledIsolateGPU to false.
    Before:
    --EnabledIsolateGPU=true
    After:
    --EnabledIsolateGPU=false
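
For readers who script this change instead of editing interactively, the flag rewrite in step 2 can be sketched in Python as a pure function over an argument list. Only the `--EnabledIsolateGPU` flag comes from this page. Where the flag sits inside the DaemonSet spec and the other flag in the sample are assumptions made for illustration.

```python
FLAG = "--EnabledIsolateGPU="


def disable_gpu_isolation(args):
    """Return a copy of the args with --EnabledIsolateGPU forced to false.

    Flags that do not start with the prefix are left unchanged, and the flag
    is not added when it is absent.
    """
    return [FLAG + "false" if arg.startswith(FLAG) else arg for arg in args]


# Hypothetical argument list; "--some-other-flag" is a placeholder.
before = ["--some-other-flag=1", "--EnabledIsolateGPU=true"]
print(disable_gpu_isolation(before))
```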
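Remove GPU isolation that has already been applied.
To remove isolation that has already been applied, log on to the node where the XID error occurred and delete the /etc/nvidia-device-plugin/unhealthyDevices.json file. This removes the GPU isolation on that node. To keep the GPU from being isolated again, disable automatic isolation as described in the previous step.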
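
The cleanup step can be sketched in Python as follows, assuming the script runs on the affected node with permission to modify /etc. The file path is the one given above; the helper name and its return value are illustrative.

```python
from pathlib import Path

# Isolation file path as given on this page.
ISOLATION_FILE = Path("/etc/nvidia-device-plugin/unhealthyDevices.json")


def clear_gpu_isolation(path=ISOLATION_FILE):
    """Delete the isolation file and return True, or False if it did not exist."""
    try:
        path.unlink()
    except FileNotFoundError:
        return False
    return True
```

Calling `clear_gpu_isolation()` on the node removes the file if it is present. If automatic isolation is still enabled and the anomaly persists, the GPU may be isolated again.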