By JV Roig
There are three general ways you can deploy and use generative AI services in Alibaba Cloud:
• IaaS (Infrastructure as a Service)
• PaaS (Platform as a Service)
• MaaS (Model as a Service)
Here in Part 1, we'll deal with the first approach: IaaS.
In an IaaS deployment, the main advantage is that it gives you 100% control of the entire stack. This includes:
• Operating System
• Instance Type and Size
• LLM server software (e.g., Hugging Face TGI, vLLM, llama.cpp, etc.)
• Model choice (literally any model)
You also have full control over whatever other applications or software you want to run in the same VM as your genAI stack.
This may sound amazing (and it is), but as the saying goes: "With great power comes great responsibility!"
There is huge management overhead involved in an IaaS deployment. You are now responsible for a massive amount of things, including:
• OS security, patching and maintenance
• LLM server software security, patching and maintenance
• Performance scaling and load balancing infrastructure
If you are already invested in a particular approach to generative AI (for example, if you and your team are already experts in deploying generative AI features using llama.cpp), then an IaaS genAI deployment is perfect for your use case. You’ll get 100% of the flexibility that you need, at the expense of having to manage the entire stack.
Now, let's get started with deploying generative AI through an IaaS approach.
The key service here is of course the flagship IaaS offering of Alibaba Cloud – Elastic Compute Service (ECS).
Specifically, we're going to use a GPU-powered instance to host our preferred OS and generative AI platform.
In this tutorial, the key technology choices are:
• Ubuntu 22.04 LTS for the operating system (for easy NVIDIA CUDA compatibility)
• llama.cpp as the LLM server software (see the community GitHub page: https://github.com/ggerganov/llama.cpp)
• An ECS instance type with a T4 GPU (P100, V100 or A10 will also work; T4 is just cheap and available, while being reasonably performant for our needs in this tutorial)
Alright, let’s get started!
In the ECS console overview, create a new instance by clicking Create Instance. This is where we will create our GPU-powered instance with Ubuntu 22.04 LTS as the operating system.
Here are the general steps:
• Select Pay-as-you-go as the Billing Method
• Select your desired Region. In this example, I used Singapore.
• For network and Zone, choose the default VPC, any available Zone, and the default vSwitch in that zone.
• Under Instances & Images, click All Instance Types.
• For Architecture, click the GPU/FPGA/ASIC box to filter the list down to instances with the desired accelerators.
• You'll see a list of the available GPU instance types. Choose a cheap instance that has a single T4 (with 16GB of GPU RAM), or whatever is available in your chosen zone. To make sure our demo works, get a GPU that has at least 16GB of GPU RAM.
• CPU cores can be 8 or higher, and system RAM can be 16GB or higher.
• For Image, choose Public Images, and then Ubuntu. Choose Ubuntu 22.04 64-bit in the version list, and keep Free Security Hardening checked.
• Under Storage, you might want a System Disk that has 1TB or more – after all, LLMs are huge. Set the size appropriately. If you are unsure, just put 1024 GiB (1TB).
• For Bandwidth and Security Groups, make sure to check Assign Public IPv4 Address.
• Choose Pay-by-traffic, and choose 100 Mbps as the max bandwidth. We will be downloading some large files (LLMs are huge), so we'll make use of that 100 Mbps bandwidth.
• So that we can SSH into our instance later, we need to set up a Key Pair. Choose ecs-user as the logon username (to avoid logging in as root).
• If you don't have an available key pair yet, click Create Key Pair, and then refresh the list once you’ve created your key pair.
• Finally, accept the ECS terms of service on the right, and click Create Order.
Now that we've got an instance up and running, it's time to connect to it so we can set up the necessary software.
Here, we’ll be using the Workbench remote connection, a default feature in ECS that allows us to easily SSH into our instance using the Key Pair we supplied. Then, we’ll update the OS, verify Python is installed, and then install the Nvidia drivers and CUDA.
• To connect to our instance, go to the list of instances (ECS Console -> Instances & Images -> Instances).
• You should see your recently launched instance in the list. Under the Actions column, click the Connect option, and choose Sign in now.
• Choose Public IP for connection, SSH Key Authentication, and use ecs-user as the username.
• For private key, click the box to choose your saved private key. (You would have saved this when you created your key pair)
Congrats! You’ll find yourself in a nice SSH window in the browser, and you can begin configuring the software.
First, let’s update the OS with the following commands:
• sudo apt update
• sudo apt upgrade
Let’s verify Python is installed. Check the Python version with the following command:
• python3 --version
You should see Python 3.10.12 or similar
The most challenging part is up next: Installing Nvidia drivers and CUDA.
1. First, we need linux headers so we can build the drivers:
sudo apt-get install linux-headers-$(uname -r)
2. Then follow steps here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local
Start with Base Installer commands (one by one), then the Driver Installer command (use legacy kernel module flavor)
If you did not choose the max bandwidth option during the ECS instance launch configuration, this will take a lot longer than it should (several GB of NVIDIA downloads).
3. After Driver Installation, run:
echo 'export PATH=/usr/local/cuda-12.5/bin${PATH:+:${PATH}}' >> ~/.bashrc
4. Reboot your instance (cmd: sudo reboot) and reconnect again.
5. Check CUDA presence:
nvcc --version
You should see nvcc report the installed CUDA version (12.5 in this case).
6. Check NV driver presence:
nvidia-smi
You should see the nvidia-smi summary listing your GPU (a T4 in this setup) and the installed driver version.
If you’ve made it this far, congratulations! You’ve successfully installed the Nvidia drivers and CUDA – you’re now ready to use the full power of your instance for generative AI!
For this tutorial, let's use a very popular community project – llama.cpp.
1. Install llama.cpp
Installation commands:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j 8 GGML_CUDA=1
(make -j 8 means use 8 threads for compilation. If your instance has more than 8 vCPU cores, increase this number to match the number of cores your instance has.)
(This process takes a while, just wait)
2. Download an LLM to test our install:
cd models
wget https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GGUF/resolve/main/qwen1_5-7b-chat-q4_k_m.gguf
3. Test inference using our downloaded model:
cd ..
./main -m models/qwen1_5-7b-chat-q4_k_m.gguf -p "Question: Please explain Alibaba Cloud to me. I am a non-technical business executive. Answer:" -ngl 999 -c 1000 -n 2000
The model should load and then generate an answer to the prompt.
Here, we downloaded a quantized (i.e., compressed) version of a small LLM, Qwen1.5 7B-chat, in GGUF format.
If you are already using specific GGUF models, this is the time to download those into your instance in order to use them.
Of course, the real goal of a genAI deployment is to have an LLM server that provides API access so that our own applications can use generative AI features.
In Step 3, we installed llama.cpp, downloaded one or more models, and then did some local inferencing.
Now, we'll use llama.cpp in server mode:
./llama-server -m [model_path/model_name.gguf] -ngl 999 --host 0.0.0.0 --port 8080 --api-key [your-desired-api-key]
Note: --port 8080 specifies that port 8080 of the VM will be used by the server. To make sure your applications can access the llama.cpp server, you need to either allow inbound access to port 8080 in the security group of your VM, or specify a different port number that you’ve already whitelisted in your security group. A quick way to verify connectivity from your own machine is sketched below.
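To confirm the server is reachable from outside the VM (and that your security group rule works), you can query the llama.cpp server's health endpoint. Here is a minimal sketch in Python, assuming the server was started with --port 8080; the host below is a placeholder you must replace with your instance's public IP:
import urllib.request

# Placeholder: replace with your ECS instance's public IPv4 address.
SERVER_URL = "http://<your-instance-public-ip>:8080/health"

try:
    # Recent llama.cpp server builds expose a simple GET /health endpoint.
    with urllib.request.urlopen(SERVER_URL, timeout=5) as resp:
        print("Server reachable, HTTP status:", resp.status)
        print(resp.read().decode())
except Exception as err:
    print("Could not reach the server:", err)
If this fails, double-check the security group rule and the --host 0.0.0.0 and --port values you passed to llama-server.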
Other options:
• -cb = enable continuous batching (recommended)
• -fa = enable flash attention (needed for Qwen2, for example)
• -np N = enable N slots in parallel, splitting the max context size by N as a result (not recommended for low-context-size models); improves the experience when serving multiple requests.
Reminders:
For the llama.cpp server, apply the correct chat template; otherwise, the LLM will "see" the unknown chat delimiters and treat them as part of the content, resulting in it having a conversation with itself and wasting GPU cycles. Examples:
• For Llama 2, append --chat-template llama2 to the server invocation.
• For Llama 3, it is --chat-template llama3; for OpenAI-like models, it is --chat-template chatml.
• See: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
Here's an example of a proper server invocation using Llama 3 8B instruct:
llama.cpp/llama-server -m LLMs/gguf/Meta-Llama-3-8B-Instruct-v2.Q5_K_M.gguf -ngl 999 -cb -np 3 --host 0.0.0.0 --port 8080 --api-key myapikey --chat-template llama3
In the example above, we specified the correct chat template, and used -ngl 999 to tell the server to offload all layers to the GPU, for maximum performance.
Here’s an example using Qwen2:
llama.cpp/llama-server -m LLMs/gguf/qwen2-7b-instruct-q5_k_m.gguf -ngl 999 -fa -cb -np 3 --host 0.0.0.0 --port 8080 --api-key myapikey
In the example above using qwen2 on llama.cpp, we used Flash Attention through the -fa flag (required for qwen2). You don’t need to specify a chat template, because it uses the default one supported by llama.cpp.
Once your llama.cpp server is up and running (it can take a while to start up the first time it loads a model, so be patient), the last thing to do is to access the API server through code!
We’ll need the OpenAI client Python library, so we install it first using pip:
pip install openai
Here’s sample Python code to connect to your instance:
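The snippet below is a minimal sketch using the OpenAI Python client. The host, port, and API key are placeholders that must match your own server settings, and the pirate-themed system prompt is simply an example mirroring the output described next:
from openai import OpenAI

# Point the OpenAI client at your llama.cpp server instead of api.openai.com.
# Placeholders: replace the host with your instance's public IP, and the API key
# with the value you passed to llama-server via --api-key.
client = OpenAI(
    base_url="http://<your-instance-public-ip>:8080/v1",
    api_key="myapikey",
)

response = client.chat.completions.create(
    # llama.cpp serves the single model it loaded, so this name is informational.
    model="local-llm",
    messages=[
        {"role": "system", "content": "You are a helpful assistant who answers like a pirate."},
        {"role": "user", "content": "I'm on vacation and about to go sailing. What should I do?"},
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)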
The beauty of the llama.cpp server is that it provides an OpenAI-compatible API server. This means if you or your team already have lots of code that uses the OpenAI API style, using your llama.cpp-powered Alibaba Cloud instance is going to be easy. Other than configuring the correct host, port and API key, any existing code will largely be unchanged and just work.
When I run the code, the output follows the system prompt: the model answers like a pirate, giving suggestions for what I could do while I’m vacationing and sailing.
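Because the API is OpenAI-compatible, the usual client patterns carry over as well. For example, streaming a response token by token looks the same as it would against the official API (a sketch, reusing the client configured in the previous example):
# Streaming sketch, assuming `client` is configured as in the previous example.
stream = client.chat.completions.create(
    model="local-llm",
    messages=[{"role": "user", "content": "Explain Alibaba Cloud ECS in two sentences."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()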
Part 1 has been quite a journey! We created an ECS instance, we installed the Nvidia driver and CUDA so we can make use of the GPU in our instance, and then we installed and configured our very own llama.cpp server.
At the end, we had a dedicated, private instance that provides an OpenAI-compatible API server allowing us to easily integrate generative AI capabilities to our applications.
Up next in Part 2: An easier, faster way to deploy generative AI, with fewer steps and less management overhead!
ABOUT THE AUTHOR: JV is a Senior Solutions Architect in Alibaba Cloud PH, and leads the team's data and generative AI strategy. If you think anything in this article is relevant to some of your current business problems, please reach out to JV at jv.roig@alibaba-inc.com.