Platform For AI: FAQ about EAS

Last Updated: Dec 27, 2024

This topic provides answers to some frequently asked questions about online prediction.

What do I do if a service remains in the Waiting state for a long period of time?

After you deploy a service in Elastic Algorithm Service (EAS), the service enters the Waiting state while resources are scheduled and the service instances start up. After all service instances start up, the status of the service changes to Running. If the service remains in the Waiting state for a long period of time, go to the Service Details page, click the Instances tab, and identify the cause based on the status of the service instances. In most cases, the issue occurs in one of the following situations:

  • Insufficient resources: On the Instances tab, all service instances or some service instances are in the Pending state.

Resource scheduling fails because the dedicated resource group does not have sufficient idle resources. The following figure shows an example. [Figure: service instances in the Pending state]

In this scenario, check whether the nodes in the dedicated resource group can provide sufficient idle resources, including CPU, memory, and GPU resources. For example, if a service instance requires 3 vCPUs and 4 GB of memory, the resource group must contain at least one node that can provide these idle resources on its own.

Important

A node must reserve at least 1 vCPU for system components to avoid system failures during peak hours. Therefore, subtract 1 vCPU from the total number of vCPUs of a node when you calculate its schedulable vCPUs.

For more information about how to view the details of a resource group, see Manage dedicated resource groups. The following figure shows the nodes in a dedicated resource group. [Figure: node list of a dedicated resource group]

  • Service instances fail the health check: The status of the service instances is Running but the status of the containers is [0/1] or [1/2].

The number to the left of the forward slash (/) indicates the number of containers that have started up. The number to the right of the forward slash (/) indicates the total number of containers. When you deploy a service from a custom image, a sidecar container is automatically injected into each service instance to control and monitor service traffic. You do not need to manage the sidecar container. In this case, the total number of containers displayed in the console is 2: the container created from the custom image and the sidecar container. A service instance is considered started, and begins to receive traffic, only after both containers are ready.
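The resource check described for the Insufficient resources case can be sketched in Python as follows. This is a minimal illustration, not the EAS scheduler: the Node class and its field names are hypothetical, and the sketch assumes that the 1 vCPU reserved for system components is subtracted from the total vCPUs of each node before used vCPUs are deducted.

```python
from dataclasses import dataclass
from typing import List

RESERVED_VCPUS = 1  # each node keeps 1 vCPU for system components (see the Important note)


@dataclass
class Node:
    """Hypothetical view of one node in a dedicated resource group."""
    total_vcpus: float
    used_vcpus: float
    idle_memory_gb: float
    idle_gpus: int = 0

    def schedulable_vcpus(self) -> float:
        # Assumption: the reserved vCPU comes out of the node total.
        return self.total_vcpus - RESERVED_VCPUS - self.used_vcpus


def fits_on_node(node: Node, vcpus: float, memory_gb: float, gpus: int = 0) -> bool:
    """Return True if this single node can host one service instance."""
    return (
        node.schedulable_vcpus() >= vcpus
        and node.idle_memory_gb >= memory_gb
        and node.idle_gpus >= gpus
    )


def can_schedule(nodes: List[Node], vcpus: float, memory_gb: float, gpus: int = 0) -> bool:
    """An instance must fit on one node; idle resources are not pooled across nodes."""
    return any(fits_on_node(node, vcpus, memory_gb, gpus) for node in nodes)


# An instance that needs 3 vCPUs and 4 GB of memory, as in the example above.
nodes = [
    Node(total_vcpus=4, used_vcpus=1, idle_memory_gb=16),  # 2 schedulable vCPUs
    Node(total_vcpus=8, used_vcpus=6, idle_memory_gb=32),  # 1 schedulable vCPU
]
print(can_schedule(nodes, vcpus=3, memory_gb=4))  # False

nodes.append(Node(total_vcpus=4, used_vcpus=0, idle_memory_gb=8))  # 3 schedulable vCPUs
print(can_schedule(nodes, vcpus=3, memory_gb=4))  # True
```

In the first call, the two nodes together have 3 schedulable vCPUs, but no single node can host the instance, so the instance would stay in the Pending state.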

What do I do if the status of a service is Failed?

A service enters the Failed state in the following situations:

  • Service deployment phase: If a resource that you specify when you deploy the service, such as the model path, does not exist, an error is displayed in the status information of the service. In most cases, you can identify the cause of failure based on the error.

  • Service startup phase: After the service is deployed and resources are scheduled to the service, the service fails to start up. In this scenario, the following status information is displayed:

Instance <network-test-5ff76448fd-h9dsn> not healthy: Instance crashed, please inspect instance log.

The information indicates that a service instance fails to start up. In this scenario, you need to check the instance status on the Instances tab of the service details page to identify the cause of failure. A service instance fails to start up in the following situations:

  • The service instance is terminated by the system during the startup process due to an out-of-memory (OOM) error. In most cases, you can increase the memory allocated to the service to resolve this issue. The following figure shows the status of the instance. [Figure: instance status after an OOM termination]

  • The service instance crashes because of a code error during the startup phase, and the Last Status column of the service instance displays Error(error code). In this scenario, click Logs in the Actions column of the service instance on the Instances tab to identify the cause of failure. The following figure shows the status of the instance. [Figure: instance status after a code crash]

  • The service instance fails to pull the image and the Instance Status is Pending. The Reason for Last Exit column of the service instance displays ImagePullBackOff.

This issue usually occurs when you use a custom image to deploy a service and the custom image fails to be pulled. To resolve this issue, check the following items:

  • Check whether the image address is valid. By default, EAS is not connected to the Internet and therefore cannot pull images over the Internet. If the image is stored on a Container Registry Personal Edition instance, use the virtual private cloud (VPC) address of the image on that instance. If you want to pull an image over the Internet, see How do services deployed in EAS access the Internet?

  • If your image is stored on a Container Registry Enterprise Edition instance, make sure that the VPC configuration of the Container Registry Enterprise Edition instance is the same as the network configuration of EAS. To configure networking for EAS and Container Registry, see the following topics:

  • VPC configuration for Container Registry Enterprise Edition instances: Configure access over VPCs

  • Network configuration for EAS dedicated resource groups: Configure networking for a resource group

  • Network configuration for the EAS public resource group: Configure networking for a resource group

  • Check whether the authentication information is valid. After you grant permissions to EAS in the console, EAS can pull images from Container Registry instances without passwords. To pull images from a Container Registry Personal Edition instance, enter the address of the image. To pull images from a Container Registry Enterprise Edition instance, specify the "cloud.docker_registry.instance_id": "cr_xxx" field in the service configuration. This field specifies the ID of the Container Registry Enterprise Edition instance from which you want to pull images. If you want to pull images from a third-party image repository, specify the dockerAuth field to set the username and password of the image repository.
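The two authentication fields mentioned above can be illustrated with a short Python sketch that assembles a service configuration fragment. The field names cloud.docker_registry.instance_id and dockerAuth come from the text. Everything else is an assumption for illustration: the service name is hypothetical, the fields are assumed to sit at the top level of the configuration, and dockerAuth is assumed to hold the Base64-encoded string username:password. Verify the exact format against the EAS parameter reference before you use it.

```python
import base64
import json


def enterprise_registry_fields(instance_id: str) -> dict:
    """Field for a Container Registry Enterprise Edition instance (name taken from the text)."""
    return {"cloud.docker_registry.instance_id": instance_id}


def third_party_registry_fields(username: str, password: str) -> dict:
    """Field for a third-party image repository.

    ASSUMPTION: dockerAuth is the Base64 encoding of "username:password".
    """
    raw = f"{username}:{password}".encode("utf-8")
    return {"dockerAuth": base64.b64encode(raw).decode("ascii")}


# Hypothetical service configuration; only the registry field comes from the text.
config = {"name": "demo_service"}
config.update(enterprise_registry_fields("cr_xxx"))
print(json.dumps(config, indent=2))
```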

How do services deployed in EAS access the Internet?

By default, services deployed in EAS cannot access the Internet. If you want to access the Internet, you need to connect the VPC in which your service is deployed to the Internet. Before you create a direct connection, make sure that a NAT gateway is deployed for the VPC. If no NAT gateway exists, create a NAT gateway for the VPC. To create a direct connection to connect EAS to the Internet, see the following topics:

What is the difference between calling a service over a VPC and over a VPC direct connection?

  • Call service over VPC: To call a service over a VPC, an internal-facing Server Load Balancer (SLB) instance and a NAT gateway are used. The SLB instance is used to forward requests at Layer 4 and the NAT gateway is used to forward requests at Layer 7. During peak hours, the performance of the service may be degraded due to traffic forwarding and the limited bandwidth of the NAT gateway.

  • VPC direct connection: To resolve the preceding performance and bandwidth limit issue, EAS allows you to use the VPC direct connection feature without incurring additional fees. After you activate the VPC direct connection feature, a direct connection is established between the VPC of your client and the VPC in which the EAS service is deployed. Then, you can use the service discovery feature provided by EAS to obtain the service endpoint and use a software load balancer on your client to initiate a request to the service. This solution requires you to use the EAS SDK to access the service and set endpoint_type to DIRECT.

    For example, in the scenario described in SDK for Python, you can add the following code block to the client code to change the connection mode from NAT gateway to VPC direct connection:

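    from eas_prediction import PredictClient, ENDPOINT_TYPE_DIRECT

    # Endpoint and service name from the SDK for Python example.
    client = PredictClient('http://pai-eas-vpc.cn-hangzhou.aliyuncs.com', 'mnist_saved_model_example')
    client.set_token('M2FhNjJlZDBmMzBmMzE4NjFiNzZhMmUxY2IxZjkyMDczNzAzYjFi****')
    # Switch the client to VPC direct connection before it is initialized.
    client.set_endpoint_type(ENDPOINT_TYPE_DIRECT)
    client.init()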

What do I do if the log of the service includes the "[WARN] connection is closed: End of file" or "Write an Invalid stream: End of file" error?

The connection between the client and server is interrupted. When the server answers the request from the client, the server identifies that the connection to the client is interrupted. Then, the server generates a warning event. A connection is interrupted in the following situations:

  • The server times out: If the service is deployed by using a processor, the default timeout period of the server is 5 seconds. You can set the metadata.rpc.keepalive parameter of the service to modify the server timeout period. The server automatically closes the connection when the timeout period ends. In this scenario, you can find a 408 status code in the monitoring data of the server.

  • The client times out: The client timeout period is specified in the code of your client. If the server does not return a response before the timeout period ends, the HTTP client automatically closes the connection. In this scenario, you can find a 499 status code in the monitoring data of the server.

For more information about request status codes, see Appendix: Status codes.
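The two timeout cases above can be modeled with a small Python helper that predicts which status code appears in the server monitoring data. This is a simplified sketch rather than EAS behavior in full: the function name and parameters are hypothetical, the 5-second default is the processor default stated above, and the model assumes that whichever side reaches its timeout first closes the connection.

```python
from typing import Optional

DEFAULT_SERVER_TIMEOUT_S = 5.0  # default for services deployed by using a processor


def timeout_status(
    processing_s: float,
    client_timeout_s: float,
    server_timeout_s: float = DEFAULT_SERVER_TIMEOUT_S,
) -> Optional[int]:
    """Return 408 (server timed out), 499 (client timed out), or None (no timeout)."""
    first_deadline = min(client_timeout_s, server_timeout_s)
    if processing_s <= first_deadline:
        return None  # the response is returned before either side gives up
    # Assumption: the side with the shorter timeout closes the connection first.
    return 408 if server_timeout_s <= client_timeout_s else 499


print(timeout_status(processing_s=10, client_timeout_s=3))   # client gives up first
print(timeout_status(processing_s=10, client_timeout_s=30))  # server gives up first
```

If the monitoring data shows 408, increasing the server timeout is the relevant fix. If it shows 499, the client timeout is the one to adjust.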

What do I do if I fail to call or debug a service that is deployed by using a TensorFlow or PyTorch processor?

To ensure service performance, services deployed by using the TensorFlow or PyTorch processor accept request bodies in the protobuf format, which is a binary format rather than plaintext. Online service debugging in the console supports only plaintext requests, so you cannot debug these services in the console. Instead, use the EAS SDK to send requests. For more information about the EAS SDK for different programming languages, see SDKs.

Why is the service-linked role of EAS not automatically created or deleted for RAM users?

The AliyunServiceRoleForPaiEas role can be automatically created or deleted only if the account that performs the operation has the required permissions. Resource Access Management (RAM) users do not have these permissions by default, so the role cannot be automatically created or deleted for RAM users. If you want the system to automatically create and delete the role, attach the following custom policy to the RAM user.

  1. Create a custom policy based on the following content in script mode. For more information, see Create a custom policy.

    {
      "Statement": [
        {
          "Action": "ram:CreateServiceLinkedRole",
          "Resource": "*",
          "Effect": "Allow",
          "Condition": {
            "StringEquals": {
              "ram:ServiceName": "eas.pai.aliyuncs.com"
            }
          }
        }
      ],
      "Version": "1"
    }
  2. Attach the custom policy to the RAM user. For more information, see Grant permissions to the RAM user.

How do I delete the nodes in a dedicated resource group that uses the subscription billing method?

Go to the Unsubscribe page to unsubscribe from nodes in a dedicated resource group that uses the subscription billing method. Configure the following search parameters:

  • Type: Select Partial Refund.

  • Name: Select EAS pre-payment for dedicated machine.

Click Search and find the node that you want to unsubscribe from. Then, click Unsubscribe Resource in the Actions column and follow the on-screen instructions to complete the unsubscription.

How do I use cURL commands to call EAS online services?

After you deploy an online service in EAS, you can run cURL commands to call the service through the public endpoint or the VPC endpoint. Perform the following steps:

  1. Obtain the service endpoint and token.

    1. On the EAS-Online Model Services page, click the service name to go to the Service Details page.

    2. In the Basic Information section, click View Endpoint Information.

    3. On the Public Endpoint or VPC Endpoint tab of the Invocation Method dialog box, obtain the endpoint and token of the service.

  2. Run the cURL command to call the service.

    Sample command:

    $ curl <service_url> -H 'Authorization: <service_token>' -d '[{"sex":0,"cp":0,"fbs":0,"restecg":0,"exang":0,"slop":0,"thal":0,"age":0,"trestbps":0,"chol":0,"thalach":0,"oldpeak":0,"ca":0}]'

    Parameters:

    • Replace <service_url> with the endpoint that you obtained.

    • Replace <service_token> with the token that you obtained.

    • -d: the service request data.
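The same call can be made from Python. The following sketch uses only the standard library and mirrors the sample cURL command: the token goes in the Authorization header and the JSON array is sent as the POST body. The build_request helper, the placeholder URL, and the shortened feature record are hypothetical. Replace them with the endpoint, token, and request data of your own service.

```python
import json
import urllib.request


def build_request(service_url: str, service_token: str, records: list) -> urllib.request.Request:
    """Build the POST request that corresponds to the sample cURL command."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=body,                                 # same role as curl -d
        headers={"Authorization": service_token},  # same role as curl -H
        method="POST",
    )


# Placeholder values, not a real endpoint or token.
request = build_request(
    "http://example.invalid/api/predict/demo_service",
    "<service_token>",
    [{"sex": 0, "cp": 0, "fbs": 0}],
)

# Uncomment to send the request to a real service:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.read().decode("utf-8"))
```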

What are the service statuses in EAS?

Currently, an EAS service can be in the following statuses. You can view the status of a service in the Service Status column on the Elastic Algorithm Service (EAS) page.

  • Creating

  • Waiting

  • Stopped

  • Failed

  • Updating

  • Stopping

  • HotUpdate

  • Starting

  • DeleteFailed

  • Running

  • Scaling

  • Pending

  • Deleting

  • Completed

  • Preparing

After I unsubscribe an instance in a resource group, is the service instance data retained?

No, the service instance data is not retained.

Why am I unable to select my OSS bucket when deploying an EAS service?

When you deploy an EAS service, you can configure the model and code by mounting an OSS bucket or a NAS file system. Make sure that the OSS bucket or the NAS file system resides in the same region as the EAS service. Otherwise, you cannot select the OSS bucket or the NAS file system.

Why am I unable to choose the instance type with 1 core and 2 GB memory when deploying an EAS service?

To avoid potential issues during your usage, the instance type with 1 core and 2 GB memory has been removed from sale. This is because EAS deploys certain system components on each node, which occupies some of the resources. If the specification of the node is too small, the resource usage ratio of the system components will be too high, resulting in a lower proportion of available resources for you.

How many services can I deploy in EAS at most?

The number of EAS services you can deploy is determined based on the remaining resources. You can view the remaining resources in the Nodes list on the Resource Group tab. For more information, see Work with dedicated resource groups.

If tasks are allocated based on the number of CPU cores, the upper limit for the number of deployed instances is (CPU cores - 1) / number of cores used per instance.
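As a worked example of the formula above, the following Python sketch computes the upper limit for one node. The function name is hypothetical, and two assumptions are added: the formula is applied per node, because the subtracted core matches the 1 vCPU that each node reserves for system components, and the result is rounded down because a partial instance cannot be deployed.

```python
def max_instances(node_cpu_cores: int, cores_per_instance: float) -> int:
    """Upper limit of instances on one node: (CPU cores - 1) / cores per instance."""
    if cores_per_instance <= 0:
        raise ValueError("cores_per_instance must be positive")
    schedulable_cores = node_cpu_cores - 1  # 1 core is kept for system components
    # Assumption: round down, and never report a negative count.
    return max(0, int(schedulable_cores // cores_per_instance))


print(max_instances(16, 3))  # (16 - 1) / 3 = 5
print(max_instances(8, 2))   # (8 - 1) / 2 = 3.5, rounded down to 3
```

Memory and GPU requirements can lower this limit further, so treat the result as a ceiling rather than a guarantee.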

Frequent errors

upstream connect error or disconnect/reset before headers. reset reason: connection termination

This error is typically caused by persistent-connection timeouts that lead to request failures, or by uneven load across instances. When the server processing time exceeds the HTTP timeout set by the client, the client abandons the request and closes the connection. In this case, the server monitoring data shows the 499 status code, which you can check to confirm the cause. If inference takes a long time, we recommend that you deploy an asynchronous inference service.