This topic describes the terms that are related to ApsaraVideo MediaBox SDK.
Overview
ApsaraVideo MediaBox SDK
ApsaraVideo MediaBox SDK integrates services such as Push SDK, ApsaraVideo Player SDK, Short video SDK, and Queen SDK to provide audio and video capabilities for clients in the low-code application solution based on AUI Kits. The capabilities include stream ingest, co-streaming, playback, and interactive messaging. You can obtain comprehensive audio and video capabilities with ApsaraVideo MediaBox SDK to achieve agile business innovation. For more information, see What is ApsaraVideo MediaBox SDK?
AUI Kits
AUI Kits is an Application Platform as a Service (aPaaS) service provided by Alibaba Cloud based on extensive audio and video practices. It modularizes ApsaraVideo MediaBox SDK and provides standardized open source UI components. You can use AUI Kits to easily integrate ApsaraVideo MediaBox SDK based on your business requirements. This reduces R&D costs, accelerates development, and improves business performance.
AppServer
AppServer provides a background service that can be quickly deployed and flexibly customized for AUI Kits based on services such as Function Compute. AppServer provides features such as room management, co-streaming management, user authentication, and signaling management for AUI Kits in interactive live streaming scenarios. It takes only 5 to 10 minutes to deploy AppServer. You can also deploy AppServer by using a container image or source code.
Differences between RTS 1.0 and RTS 2.0
Item | RTS 2.0 | RTS 1.0 |
Definition | Streams are ingested at the edge without passing through the live center. If you want to record and transcode data, you must configure a stream relay task to transfer the data to a Real-Time Messaging Protocol (RTMP) domain name, and then record data on the RTMP server. | Stream ingest is implemented in the live center. You can directly record and transcode data. |
Streaming protocol | Streams can be played by using the Alibaba Real-Time Communication (ARTC) protocol based on Web Real-Time Communication (WebRTC). | |
End-to-end latency | 200 to 400 milliseconds | 500 to 1,000 milliseconds |
Limits | RTS SDK must be integrated on the stream ingest and playback sides. | RTS SDK must be integrated on the playback side. |
Resistance to poor network conditions | Streams can be smoothly played at an end-to-end packet loss rate of 30%. | Streams can be smoothly played at a packet loss rate of 30% on the playback side. |
Compatibility | | |
Coverage | Global | |
Best practices | | |
Streaming media
Differences among VOD, live streaming, and stream ingest
Stream ingest: A streamer pushes local audio and video data to an ApsaraVideo server.
Live streaming: The audience can directly play the audio and video data that is pushed from a streamer client or live center in real time. In most cases, the latency is low.
Video-on-demand (VOD): The videos are stored in the ApsaraVideo media library in advance, and the audience can play the videos at any time.
Common VOD formats
Three VOD formats are commonly used: MP4, HTTP Live Streaming (HLS), and FLV.
MP4: a classic file format that is supported by a wide variety of mobile terminals and browsers, including system browsers on iOS devices and most Android devices and the Flash control on PCs. However, the MP4 format is relatively complex, and the costs of processing MP4 videos are high. In addition, long MP4 videos, such as those exceeding half an hour, are slow to load when they are played online due to the complex structure of the index table. MP4 is more suitable for short video on-demand scenarios.
HLS: a standard file format that is introduced by Apple and well supported by browsers on mobile devices. However, support in Internet Explorer depends on custom development based on the Flash control. We recommend that you use ApsaraVideo Player SDK for Web. HLS uses a simple M3U8 index structure, which avoids the slow loading caused by the complex MP4 index. HLS is suitable for medium and long video scenarios.
FLV: a standard file format that is introduced by Adobe and the most commonly used container format on live streaming platforms. FLV is supported by the Flash control on PCs. However, on mobile devices, FLV is supported only by apps that implement their own player; most mobile browsers do not support it.
Common live streaming protocols
Three live streaming protocols are commonly used: RTMP, FLV, and HLS.
RTMP: a powerful protocol that can be used for stream ingest and live streaming. RTMP splits video and audio frames into small packets and transmits the packets over the Internet. In addition, RTMP supports encryption. This helps ensure privacy protection. However, due to the complexity of splitting and assembling, unpredictable stability issues may occur in scenarios with a large number of concurrent requests.
FLV: a standard that is introduced by Adobe. The FLV format is simple: it only adds basic header information to audio and video frames. Because of this simple design, the FLV protocol provides low latency and can handle a large number of concurrent requests. Its only shortcoming is limited support in mobile browsers. However, FLV is still suitable for live streaming in mobile apps.
HLS: a standard that is introduced by Apple. HLS splits a video into small segments of 5 to 10 seconds and manages the segments by using an M3U8 index file. Because the client downloads the video in 5- to 10-second segments, playback is smooth. However, this method also causes a high latency of 10 to 30 seconds. Compared with FLV, HLS is supported by browsers on iPhones and most Android phones.
Common stream ingest protocols
RTMP is commonly used as a stream ingest protocol. ApsaraVideo also supports RTS that provides an ultra-low latency.
RTMP: In most cases, the RTMP protocol is used to push streams from the streamer to the live center server.
RTS: RTS is an important value-added feature of Alibaba Cloud ApsaraVideo Live. It provides easy-to-use ApsaraVideo Live services that support ultra-low latency, high concurrency, and high definition.
SDK integration and use
SDK license
ApsaraVideo MediaBox SDK is a terminal SDK that is released by ApsaraVideo. It provides scenario-based audio and video capabilities for terminals. You can apply for a free license, purchase a license, or obtain a license when your consumption on Alibaba Cloud reaches a specified amount.
A license for ApsaraVideo MediaBox SDK is bound to an application. This way, the application is authorized to use ApsaraVideo MediaBox SDK. For example, if a license for ApsaraVideo Player SDK is bound to Application A, Application A can use the features of ApsaraVideo Player SDK. Each license can be bound to at most one Android application and one iOS application. You can add and renew licenses in the ApsaraVideo Live or ApsaraVideo VOD console. For more information about billing, see Billable items.
After you create an application and bind a license to the application in the console, a license file and a license key are generated. When you integrate ApsaraVideo MediaBox SDK, you must configure the license file and license key in the corresponding application. ApsaraVideo MediaBox SDK uses the license file and license key to verify the authorization status of the current application. By default, a unique license key is generated for each Alibaba Cloud account, and a license file is generated for each application. The license file and license key are unique and do not change regardless of the content and type of the authorization.
Duplicate symbol
A process cannot contain two functions that have the same name. The compiler compiles functions into symbols. If symbols with the same name exist in a process, the linker cannot determine which function to use. As a result, a duplicate symbol error may occur when you integrate ApsaraVideo MediaBox SDK.
ApsaraVideo terminal SDKs may conflict with each other due to different designs of their media component architectures. If you want to use two or more business features, use the all-in-one package that integrates those features. For example, if you want to use both the short video and player features, use the AliVCSDK_UGC package, which provides the same features with a smaller total size.
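As a minimal sketch in Gradle Kotlin DSL, the following shows the idea of depending on a single integrated package instead of adding separate SDKs. The Maven coordinates and version are placeholders; check the official integration guide for the exact values.

```kotlin
// build.gradle.kts (module level) -- a minimal sketch.
// The artifact coordinates and version below are placeholders.
dependencies {
    // Use one integrated package instead of adding the short video SDK and
    // the player SDK separately, which can cause duplicate symbol conflicts.
    implementation("com.aliyun.aio:AliVCSDK_UGC:x.x.x")
}
```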
Push SDK
Bitrate control
Bitrate control uses an optimized coding algorithm to control the bitrate of video streams. In the same video coding format, video streams at a higher bitrate contain more information and provide clearer images.
Frame drop
When you send video frames, if frames accumulate because of poor network conditions, you can drop video frames to reduce the latency of stream ingest.
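Conceptually, a frame-drop policy works like the following sketch. The queue, threshold, and frame types here are hypothetical and are not Push SDK APIs; the SDK applies its own policy internally.

```kotlin
import java.util.ArrayDeque

// Simplified illustration of a frame-drop policy; names and thresholds are
// hypothetical and not part of Push SDK.
class VideoFrame(val isKeyFrame: Boolean, val timestampMs: Long)

class SendQueue(private val maxPendingFrames: Int = 60) {
    private val pending = ArrayDeque<VideoFrame>()

    fun enqueue(frame: VideoFrame) {
        pending.addLast(frame)
        // When frames accumulate because the uplink cannot keep up,
        // drop the oldest non-key frames to bring latency back down.
        while (pending.size > maxPendingFrames) {
            val oldest = pending.peekFirst()
            if (oldest != null && !oldest.isKeyFrame) {
                pending.pollFirst()
            } else {
                break // keep keyframes so the stream remains decodable
            }
        }
    }
}
```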
In-ear monitoring
In-ear monitoring allows streamers to hear their own voice in real time through the headset that they are wearing. For example, a streamer who sings while wearing a headset can enable in-ear monitoring to help tune their voice. This is because the audio that viewers hear over the network is quite different from the sound that the streamer hears through the air, so streamers need to monitor the audio that viewers actually hear on their clients.
Audio mixing
Audio mixing combines multiple audio sources into a stereo or mono audio track. Push SDK supports mixing of music and vocals.
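Conceptually, mixing adds the samples of the sources and clamps the result to the valid range, as in the following sketch. This is an illustration of the idea only, not a Push SDK API.

```kotlin
// Mix two mono 16-bit PCM buffers of equal length into one track by summing
// samples and clamping to the 16-bit range. Illustration only.
fun mixPcm(music: ShortArray, voice: ShortArray): ShortArray {
    require(music.size == voice.size) { "Buffers must have the same length" }
    return ShortArray(music.size) { i ->
        val sum = music[i].toInt() + voice[i].toInt()
        sum.coerceIn(Short.MIN_VALUE.toInt(), Short.MAX_VALUE.toInt()).toShort()
    }
}
```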
Stream merging
Stream merging overlays video frames with video image data from multiple sources based on the timeline. This feature is supported only by Push SDK for Android.
Dynamic library
A dynamic library is also known as a dynamic link library (DLL). Dynamic libraries are different from common static libraries in that dynamic libraries are not copied to an application during compilation. Only references to the dynamic libraries are stored in the application. Dynamic libraries are loaded only when the application is running.
When you load dynamic libraries in Xcode, you must add the dynamic libraries to the Embedded Binaries section instead of the Linked Frameworks and Libraries section.
Short video SDK
Video resolution and bitrate
The resolution of a video indicates the effective horizontal and vertical pixels of the video. A higher resolution indicates a clearer video. However, a higher resolution increases the file size and the amount of time consumed to process a video. Because the performance of mobile devices varies, we recommend that you do not directly use the screen resolution in pixels as the video resolution, and that you set the resolution to 720p or lower.
The bitrate of a video specifies the number of bits that are transmitted per second. Unit: bit per second (bit/s). When you compress a video, specify a bitrate for the compressed video. This enables the video encoder to compress the video to the expected size. In a specific range, a higher bitrate indicates a clearer video but a larger file size.
The following table describes the common video resolutions and recommended bitrates.
Definition | 1:1 | 3:4 | 9:16 | Recommended bitrate (unit: bit/s) |
480P | 480×480 | 480×640 | 480×853 | 1000000 to 2000000 |
540P | 540×540 | 540×720 | 540×960 | 2000000 to 3000000 |
720P | 720×720 | 720×960 | 720×1280 | 2000000 to 4000000 |
1080P | 1080×1080 | 1080×1440 | 1080×1920 | 2000000 to 6000000 |
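As a rough illustration of how bitrate relates to file size, the following sketch estimates the video data size of a clip. It ignores audio and container overhead, and the numbers are only an example.

```kotlin
// Rough video size estimate: bitrate (bit/s) x duration (s) / 8 = bytes.
// Audio and container overhead are ignored in this sketch.
fun estimateVideoSizeMB(bitrateBps: Long, durationSeconds: Long): Double =
    bitrateBps * durationSeconds / 8.0 / 1024.0 / 1024.0

fun main() {
    // A 60-second 720p clip encoded at 3,000,000 bit/s is roughly 21 MB.
    println(estimateVideoSizeMB(bitrateBps = 3_000_000, durationSeconds = 60))
}
```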
Frame rate
The frame rate of a video specifies the number of image frames that are displayed per second. Unit: frame per second (FPS). A higher frame rate indicates a smoother video but a larger file size. The recommended video frame rate is 25 to 30 fps.
Keyframe
A frame is the basic unit of a video. A video is composed of multiple consecutive frames. A keyframe, which is also called an I-frame, is important for interframe compression and encoding. During decoding, a complete image can be reconstructed by using only a keyframe, without referencing other frames. A keyframe can be used as a still image and as a reference point for seeking.
GOP
A group of pictures (GOP) is a collection of successive frames. A GOP starts with a keyframe, followed by a group of B-frames and P-frames. If the GOP size of a video is small, the number of keyframes increases and the compression ratio decreases. If the GOP size of a video is large, seeking takes more time, and reverse playback stutters because a GOP must be decoded for the frames to be played in reverse. The default GOP size in Short video SDK is 5. We recommend that you set the GOP size to a value in the range of 5 to 30.
If the GOP size of the imported video is large, you must transcode the video before you can use the video editing module.
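As a rough illustration of the trade-off, the following sketch relates GOP duration to keyframe spacing. It treats the GOP size as a duration in seconds, which is an assumption made only for this example; check the SDK reference for the exact unit and API.

```kotlin
// Illustration only: if a GOP lasts gopSeconds at a given frame rate, it
// contains gopSeconds * fps frames, of which exactly one is a keyframe.
// Seeking and reverse playback work at GOP granularity, so a larger GOP
// means fewer keyframes (better compression) but coarser, slower seeking.
fun framesPerGop(gopSeconds: Int, fps: Int): Int = gopSeconds * fps

fun main() {
    // A 5-second GOP at 30 FPS contains 150 frames per keyframe.
    println(framesPerGop(gopSeconds = 5, fps = 30)) // prints 150
}
```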
Padding mode
If the aspect ratio of an input image or video is different from that of the output video, you must select a padding mode. The following table describes the two padding modes that are supported by Short video SDK.
Padding mode | Description |
Cropping mode | Maintains the original aspect ratio and crops the image to display only the content in the middle area. |
Scaling mode | Maintains the original aspect ratio and displays the complete image by filling the blank area with a specific color. |
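To make the difference between the two modes concrete, the following sketch computes the rectangle that the source occupies in the output for each mode. It illustrates the geometry only and is not a Short video SDK API.

```kotlin
import kotlin.math.max
import kotlin.math.min

data class Rect(val x: Int, val y: Int, val width: Int, val height: Int)

// Cropping mode: scale the source so that it fully covers the output, then
// crop the overflow so that only the middle area remains visible.
fun cropRect(srcW: Int, srcH: Int, dstW: Int, dstH: Int): Rect {
    val scale = max(dstW.toDouble() / srcW, dstH.toDouble() / srcH)
    val w = (srcW * scale).toInt()
    val h = (srcH * scale).toInt()
    return Rect((dstW - w) / 2, (dstH - h) / 2, w, h)
}

// Scaling mode: scale the source so that it fits entirely inside the output;
// the remaining blank area is filled with a solid color.
fun fitRect(srcW: Int, srcH: Int, dstW: Int, dstH: Int): Rect {
    val scale = min(dstW.toDouble() / srcW, dstH.toDouble() / srcH)
    val w = (srcW * scale).toInt()
    val h = (srcH * scale).toInt()
    return Rect((dstW - w) / 2, (dstH - h) / 2, w, h)
}
```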
Encoding mode
The following table describes the two encoding modes that are supported by Short video SDK.
Encoding mode | Description |
Software encoding | Uses the CPU to encode a video. In software encoding mode, you can configure more parameters, and the video generated at the same bitrate is clearer. However, the encoding speed is relatively slow, the CPU load is high, and the mobile phone heats up more easily. |
Hardware encoding | Uses hardware other than the CPU to encode a video. In hardware encoding mode, the encoding speed is faster and the CPU load is low. However, the video definition is slightly lower than that of a video generated in software encoding mode, and some Android devices may have compatibility issues. |
Resource description
The resources of Short video SDK include facial recognition models, common filters, and animated filters. The resources can be stored on the network or packaged directly in the installation package. Because the resources are large, bundling them significantly increases the size of the installation package. Therefore, we recommend that you store the resources on the network and configure your application to download them when it starts.
Short video SDK cannot read resources from the assets folder on Android. If the resources are packaged into an APK, they must be copied to the SD card (external storage) after the application starts. You can obtain the resource files and instructions from the downloaded SDK package.
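A minimal sketch of copying a bundled resource file out of the APK's assets directory into app-specific storage at startup, so that the SDK can read it from a regular file path. The asset name and target directory are placeholders, not values defined by Short video SDK.

```kotlin
import android.content.Context
import java.io.File

// Copy a resource file packaged under assets/ to app-specific storage.
// The asset name is a placeholder for illustration.
fun copyAssetToFiles(context: Context, assetName: String): File {
    val target = File(context.getExternalFilesDir(null) ?: context.filesDir, assetName)
    if (!target.exists()) {
        context.assets.open(assetName).use { input ->
            target.outputStream().use { output -> input.copyTo(output) }
        }
    }
    return target
}
```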
Supported file formats
The following table describes the supported file formats.
Resource type | Format |
Video | MP4, MOV, and FLV |
Audio | MP3, AAC, and PCM |
Image | JPG, PNG, and GIF |
Duet recording
The duet recording feature allows you to synthesize two videos based on the specified layout, such as left-right split-screen, up-down split-screen, or picture-in-picture. One video is selected from the sample videos and the other one is captured by the camera. Each frame of the synthesized video contains the images of the two videos at the same time, while the audio of the synthesized video uses the audio of the sample video. The following figure shows sample layouts. Short video SDK allows you to customize the layout. For more information, see the "Track layout" section of this topic.
Multi-source recording
The multi-source recording feature allows you to combine videos that are collected from multiple sources, such as a video collected by screen recording and a video captured by the camera. The feature synthesizes the videos based on the specified layout, such as left-right split-screen, up-down split-screen, or picture-in-picture. Each frame of the synthesized video contains the images of the videos that are collected from these sources. The following figure shows sample layouts. Short video SDK allows you to customize the layout. For more information, see the "Track layout" section of this topic.
Track
The two videos described in the duet recording feature are abstracted into two tracks in Short video SDK: Track A and Track B. The video collected by the camera is played in Track A, and the sample video is played in Track B. This helps you understand track layout.
The multiple video sources described in the multi-source recording feature are abstracted into multiple tracks in Short video SDK. For example, the video collected by the camera is played in Track A, and the video collected by screen recording is played in Track B. This helps you understand track layout.
Track layout
Track layout is an attribute of a track, which specifies the position of the track in the produced video. This attribute uses a normalized coordinate system to describe the center point of the track and the size of the track. The size of a track indicates the width and height of the track.
The following figure shows the track layout for duet recording.
In the figure, Track A occupies the left half of the screen and Track B occupies the right half of the screen. Therefore, the width of both tracks is 0.5, and the height of both tracks is 1.0. The center point of Track A is (0.25, 0.5), and the center point of Track B is (0.75, 0.5).
The following figure shows the track layout for multi-source recording.
In the figure, Track A occupies the left half of the screen and Track B occupies the right half of the screen. Therefore, the width of both tracks is 0.5, and the height of both tracks is 1.0. The center point of Track A is (0.25, 0.5), and the center point of Track B is (0.75, 0.5).
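A minimal sketch of describing this layout with normalized coordinates follows. The data class and field names are illustrative and are not the Short video SDK API.

```kotlin
// Normalized track layout: the center point and size are expressed as
// fractions of the output video's width and height. Illustration only.
data class TrackLayout(
    val centerX: Float, // 0.0 = left edge, 1.0 = right edge
    val centerY: Float, // 0.0 = top edge, 1.0 = bottom edge
    val width: Float,   // fraction of the output width
    val height: Float   // fraction of the output height
)

// Left-right split screen as described above: each track takes half the width
// and the full height of the output video.
val trackA = TrackLayout(centerX = 0.25f, centerY = 0.5f, width = 0.5f, height = 1.0f)
val trackB = TrackLayout(centerX = 0.75f, centerY = 0.5f, width = 0.5f, height = 1.0f)
```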