This topic describes how to use the MADlib plug-in. MADlib is an open source library that runs machine learning and graph computing models in AliPG databases. In terms of machine learning, MADlib provides functions and stored procedures for mathematical operations. MADlib also provides a set of typical supervised and unsupervised algorithm libraries for machine learning.
Prerequisites
- Your ApsaraDB RDS for PostgreSQL instance meets the following requirements:
  - The major engine version of the RDS instance is PostgreSQL 11 or PostgreSQL 12.
  - The minor engine version of the RDS instance is 20230330 or later.
  Important This extension is also supported in minor engine versions earlier than 20230330. However, the set of extensions supported for ApsaraDB RDS for PostgreSQL instances has changed. Starting April 17, 2023, some extensions can no longer be created for RDS instances that run minor engine versions earlier than 20230330. For more information, see [Notice] Starting April 17, 2023, some extensions can no longer be created for ApsaraDB RDS for PostgreSQL instances that run earlier minor engine versions.
  - If you have already created this extension for an RDS instance that runs a minor engine version earlier than 20230330, the extension is not affected.
  - If you are creating this extension for the first time or re-creating it, you must first update the minor engine version of the RDS instance to 20230330 or later. For more information, see Update the minor engine version of an ApsaraDB RDS for PostgreSQL instance.
- A privileged account is used to connect to your RDS instance. You can check the type of the account that you use on the Accounts page in the ApsaraDB RDS console. If the account is a standard account, you must create a privileged account and use the privileged account to connect to your RDS instance. For more information, see Create an account on an ApsaraDB RDS for PostgreSQL instance.
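Part of these prerequisites can be checked from a SQL session. The following statements are a minimal sketch that uses only standard PostgreSQL facilities. They show the major engine version and whether the madlib and plpythonu extensions are offered by the instance. They do not show the minor engine version or the account type, which are checked in the ApsaraDB RDS console.

```sql
-- Major engine version of the server (11.x or 12.x is expected).
SHOW server_version;

-- Extensions offered by the instance. installed_version is NULL
-- if the extension has not been created in the current database.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN ('madlib', 'plpythonu');
```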
Background information
The machine learning module of MADlib solves the following issues:
- Classification and regression issues: MADlib provides a set of algorithms such as K-Nearest Neighbor (KNN), multilayer perceptron neural network, support vector machine (SVM), and decision tree to solve classification and regression issues. MADlib also provides a set of models such as least-squares regression, generalized linear model (GLM), logistic regression, and multinomial logistic regression.
- Clustering issues: MADlib provides the K-means algorithm for clustering analysis.
- Correlation analysis: MADlib provides the Apriori algorithm for correlation analysis. The feature can help find unexpected correlations between products such as the correlation between diapers and beer.
- Analysis of time series data: MADlib provides autoregressive integrated moving average (ARIMA) models to predict future trends of time series data.
- Others: MADlib provides principal component analysis (PCA) to extract the main factors for data dimension reduction. MADlib provides a Latent Dirichlet Allocation (LDA) model for document classification and topic modeling.
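As an illustration of the regression support described in this list, the following statements are a minimal sketch of least-squares regression with the linregr_train and linregr_predict functions of MADlib. The houses_demo table, its columns, and its rows are hypothetical sample data. The sketch also assumes that the plug-in is already enabled and that its functions are installed in the madlib schema.

```sql
-- Hypothetical sample data: house size in square feet and sale price.
CREATE TABLE houses_demo (
    id        serial PRIMARY KEY,
    size_sqft float8,
    price     float8
);
INSERT INTO houses_demo (size_sqft, price) VALUES
    (770, 50000), (1410, 85000), (1060, 22500),
    (1300, 90000), (1500, 133000), (820, 90500);

-- Train a least-squares model: price ~ intercept + size_sqft.
-- The output table houses_demo_model must not exist before the call.
SELECT madlib.linregr_train(
    'houses_demo',           -- source table
    'houses_demo_model',     -- output (model) table
    'price',                 -- dependent variable
    'ARRAY[1, size_sqft]'    -- independent variables; the constant 1 adds an intercept
);

-- Inspect the fitted coefficients and the coefficient of determination.
SELECT coef, r2 FROM houses_demo_model;

-- Apply the model to the training rows.
SELECT h.id,
       h.price,
       madlib.linregr_predict(m.coef, ARRAY[1, h.size_sqft]) AS predicted_price
FROM houses_demo h, houses_demo_model m
ORDER BY h.id;
```

The other algorithms in this list follow a similar pattern: a training function reads a source table and writes a model table, which is then queried or passed to a prediction function.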
MADlib also integrates graph computing models to solve issues such as finding the shortest path, calculating PageRank rankings, and querying the contacts of a specific user on social media. The following table describes the algorithms related to graph computing models.
Type | Model or feature | Description |
---|---|---|
Shortest path | Shortest path among all vertices | Calculates the shortest path among all vertices and saves the result to a specific result table. This model queries the shortest path from a start vertex to an end vertex based on the result table. |
Shortest path | Shortest path between a specific vertex and all other vertices | Calculates the shortest path between a specific vertex and all other vertices and saves the result to a specific result table. This model queries the shortest path from a specific vertex to any other vertex based on the result table. |
Breadth-first search (BFS) | BFS | Uses the BFS method to query vertices that are reachable from a specific source vertex. |
HITS | HITS score | Queries the HITS scores of all vertices in a directed graph. The HITS scores include hub scores and authority scores. |
Web page ranking | PageRank | Queries the PageRank values of all vertices in a directed graph. |
Weakly connected component | Weakly connected component | Queries all weakly connected components in a directed graph. |
Measure | Average path length | Calculates the average shortest path length of graphs. |
Measure | Closeness | Calculates the closeness centrality of all vertices in a graph. |
Measure | Graph diameter | Calculates the graph diameter. |
Measure | In-degree or out-degree | Calculates the in-degree and out-degree of all vertices. |
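The following statements are a minimal sketch of two of the graph models in the table: the shortest path between a specific vertex and all other vertices (graph_sssp) and PageRank (pagerank). The demo_vertex and demo_edge tables and their data are hypothetical. The sketch assumes that the plug-in is already enabled, that its functions are installed in the madlib schema, and that the installed MADlib version provides both functions with the arguments shown.

```sql
-- Hypothetical directed graph: vertex IDs and weighted edges.
CREATE TABLE demo_vertex (id integer);
CREATE TABLE demo_edge (src integer, dest integer, weight float8);

INSERT INTO demo_vertex VALUES (0), (1), (2), (3), (4);
INSERT INTO demo_edge VALUES
    (0, 1, 1.0), (0, 2, 4.0), (1, 2, 2.0),
    (2, 3, 1.0), (3, 4, 3.0), (4, 0, 2.0);

-- Shortest paths from vertex 0 to every other vertex.
-- The result table demo_sssp must not exist before the call.
SELECT madlib.graph_sssp(
    'demo_vertex',                        -- vertex table
    'id',                                 -- vertex ID column
    'demo_edge',                          -- edge table
    'src=src, dest=dest, weight=weight',  -- edge column mapping
    0,                                    -- source vertex
    'demo_sssp'                           -- result table
);

-- Each row holds a vertex, the total weight of its shortest path
-- from the source vertex, and its parent on that path.
SELECT * FROM demo_sssp ORDER BY id;

-- PageRank values of all vertices, written to demo_pagerank.
SELECT madlib.pagerank(
    'demo_vertex',          -- vertex table
    'id',                   -- vertex ID column
    'demo_edge',            -- edge table
    'src=src, dest=dest',   -- edge column mapping
    'demo_pagerank'         -- result table
);

SELECT * FROM demo_pagerank ORDER BY pagerank DESC;
```

Both functions save their results to a table instead of returning them directly, which matches the result-table behavior described in the table.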
Enable or disable the MADlib plug-in
- Execute the following statement to enable the MADlib plug-in:
CREATE EXTENSION madlib;
Note Before you execute the preceding statement, you must execute the CREATE EXTENSION plpythonu; statement to create the plpythonu plug-in.
- Execute the following statement to disable the MADlib plug-in:
DROP EXTENSION madlib;
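To confirm the result of either statement, you can query the system catalog. The following statements are a minimal sketch. The first statement uses the standard pg_extension catalog. The second statement assumes that the MADlib functions are installed in the madlib schema and works only while the plug-in is enabled.

```sql
-- Extensions currently created in this database.
-- No madlib row is returned after the plug-in is disabled.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('plpythonu', 'madlib');

-- Version string reported by the MADlib library itself.
SELECT madlib.version();
```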
References
For more information about the MADlib plug-in, see MADlib documentation.