This topic describes the components supported by Machine Learning Designer of Platform for AI (PAI).

| Component type | Description |
| --- | --- |
| Custom components | You can create custom components in AI computing asset management. After a custom component is created, you can use it together with official components in Machine Learning Designer for model training. |

| Component type | Description |
| --- | --- |
| Data source and destination | This component is used to read objects or directories from Object Storage Service (OSS) buckets. |
| | This component allows you to read CSV files from OSS, HTTP, and Hadoop Distributed File System (HDFS) data sources. |
| | This component reads data from MaxCompute tables. By default, the component reads table data from the current project. |
| | This component allows you to write upstream data to MaxCompute. |
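
To illustrate the kind of input these components consume, the following is a minimal pandas sketch (outside Designer). The inline CSV is a stand-in for a real file, and the OSS URL in the comment is a hypothetical placeholder.

```python
# A minimal sketch of reading CSV data with pandas. pd.read_csv accepts
# local paths as well as HTTP(S) URLs, which is how a CSV stored in OSS
# can be read through a public or signed URL.
import io

import pandas as pd

csv_text = "id,feature,label\n1,0.5,0\n2,1.5,1\n"  # inline stand-in for a file
df = pd.read_csv(io.StringIO(csv_text))
# df = pd.read_csv("https://<bucket>.oss-<region>.aliyuncs.com/train.csv")  # hypothetical OSS URL
print(df.head())
```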

| Component type | Description |
| --- | --- |
| Data preprocessing | This component implements random sampling based on a specific proportion or sample count. The samples are independent of each other. |
| | This component generates samples based on the values of a weight column. |
| | This component uses filter-condition expressions to filter data and allows you to rename the filtered columns. |
| | This component stratifies the input data based on the values of a stratification column and implements random sampling for each stratum. |
| | This component merges two tables by associating the columns in the tables and determines the output columns. This component works like the JOIN statement of SQL. |
| | This component merges two tables by column. The two tables must have the same number of rows. If one of the two tables has partitions, the partitioned table must connect to the second input port. |
| | This component merges two tables by row. If this component is used, the numbers and data types of the output fields selected from the left and right tables must be the same. This component integrates the features of UNION and UNION ALL. |
| | This component converts features of any data type to features of the STRING, DOUBLE, or INT data type. This component also allows you to replace missing values if exceptions occur during data type conversion. |
| | This component allows you to append an ID column to the first column of a data table. |
| | This component randomly splits data to generate datasets for training and testing. |
| | This component can be configured in a visualized manner or by running PAI commands. |
| | This component allows you to normalize dense or sparse data. |
| | This component allows you to generate standardized instances in a visualized manner or by running PAI commands. |
| | This component allows you to convert a table in the key-value format into a common table. |
| | This component allows you to convert a common table into a table in the key-value format in a visualized manner or by running PAI commands. |
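
As a rough open-source analog of the random and stratified sampling components above, here is a minimal pandas sketch. The column name `label` is an assumed example used as the stratum.

```python
# Random and stratified sampling on a small DataFrame.
import pandas as pd

df = pd.DataFrame({
    "feature": range(10),
    "label": ["a", "a", "a", "b", "b", "b", "b", "c", "c", "c"],
})

# Random sampling: draw 50% of the rows without replacement.
random_sample = df.sample(frac=0.5, random_state=42)

# Stratified sampling: draw 50% of the rows from each stratum of "label".
stratified_sample = df.groupby("label").sample(frac=0.5, random_state=42)
print(stratified_sample)
```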

| Component type | Description |
| --- | --- |
| Feature engineering | This component provides the filtering feature for components including Linear Model Feature Importance, GBDT Feature Importance, and Random Forest Feature Importance. This component can be used to filter the top N features. |
| | This component uses a multivariate statistical method to explore the internal structures of multiple variables and how they correlate with each other based on a few principal components. |
| | This component allows you to scale numeric data in the dense or sparse format by using common scaling functions. |
| | This component discretizes continuous features based on a specific rule. |
| | This component can smooth anomalous features in input data to a specific interval. Both sparse and dense data are supported. |
| | This component is used to decompose matrices in linear algebra. It is a generalization of the diagonalization of normal matrices in matrix analysis. |
| | This component is used to detect data with continuous or enumerated features. It helps you detect exceptions in data. |
| | This component calculates the feature importance for a linear model, such as linear regression and binary logistic regression. Both sparse and dense data are supported. |
| | This component is used to collect statistics on the distribution of discrete features. |
| | This component is used to calculate feature importance. |
| | This component selects the top N features from all feature data in the sparse or dense format by using a filter based on the feature selection method that you specify. |
| | This component can encode nonlinear features to linear features based on gradient boosting decision tree (GBDT) algorithms. |
| | This component converts data to key-value pairs in the sparse format. |
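
The PCA component above has a direct open-source counterpart. The following is a minimal scikit-learn sketch on a toy dense matrix, shown only to make the operation concrete.

```python
# Principal component analysis: project data onto its main direction
# of variance and report how much variance that direction explains.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

pca = PCA(n_components=1)             # keep the first principal component
X_reduced = pca.fit_transform(X)      # project the data onto that component
print(pca.explained_variance_ratio_)  # share of variance it explains
```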

| Component type | Description |
| --- | --- |
| Statistical analytics | This component allows you to view the distributions of feature values, feature columns, and label columns. This facilitates follow-up data analysis. |
| | This component is used to measure the joint variability of two variables. |
| | This component uses empirical distribution and kernel density estimation functions. |
| | This component collects statistics about all data in a table or about selected columns only. |
| | This component applies to categorical variables. It is used to determine whether a significant difference exists between the observed frequency and the expected frequency for each class of a single multiclass categorical variable. The null hypothesis is that the observed and expected frequencies are the same. |
| | A box plot chart shows the distribution of a set of data. It shows the distribution features of raw data. It can also be used to compare the distribution features between multiple sets of data. |
| | In regression analysis, a scatter chart shows the distribution of data points in a Cartesian coordinate system. |
| | A correlation coefficient indicates the correlation between columns in a matrix and is in the range [-1,1]. For each pair of columns, the count is the number of rows in which both columns contain non-zero elements. |
| | This component checks whether the means of two samples are significantly different based on statistical principles. |
| | This component is used to determine whether a significant difference exists between the overall mean of a variable and a specific value. The sample on which you want to perform a T test must follow a normal distribution. |
| | This component is a normality test that determines whether the population follows a normal distribution by using observations. A normality test is a special goodness-of-fit hypothesis test in statistical determination. |
| | This component can be used to show the income distribution of a country or region. |
| | This component is used to calculate the percentile of data in the columns of a data table. |
| | This component calculates the linear correlation coefficient that measures the linear correlation between two variables. |
| | This component generates a histogram, also known as a mass distribution profile. A histogram is a statistical report chart that consists of a series of vertical stripes or line segments with different heights to show data distribution. |
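
For example, the chi-square goodness-of-fit test and the Pearson correlation coefficient described above can be reproduced with SciPy. A minimal sketch with made-up counts follows.

```python
# Two classic tests: chi-square goodness of fit and Pearson correlation.
from scipy import stats

# Chi-square: do observed class counts match the expected counts?
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")  # a small p rejects the null

# Pearson correlation between two numeric variables, in [-1, 1].
r, p = stats.pearsonr([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"r={r:.3f}")
```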

| Component type | Description |
| --- | --- |
| Machine learning | This component uses a trained model and prediction data as input and generates prediction results as output. |
| | XGBoost is an extension of the gradient boosting algorithm. XGBoost provides better usability and robustness and has been widely used in machine learning production systems and machine learning competitions. XGBoost can be used for classification and regression. |
| | XGBoost is an extension of the gradient boosting algorithm. XGBoost provides better usability and robustness and has been widely used in machine learning production systems and machine learning competitions. XGBoost can be used for classification and regression. |
| | This component is a machine learning model based on statistical learning theory. It minimizes structural risk, which reduces both the empirical risk and the confidence interval, to improve the generalization capability of the learning machine. |
| | This component is a binary classification algorithm and supports sparse and dense data. |
| | This component is used to set a threshold. If a value is greater than the threshold, it is treated as a positive example. Otherwise, it is treated as a negative example. |
| | A parameter server (PS) is used to process a large number of offline and online training tasks. SMART stands for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT). |
| | This component is a classic binary classification algorithm and is widely used in advertising and search scenarios. |
| | A parameter server (PS) is used to process a large number of offline and online training tasks. SMART stands for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT). |
| | This component selects the K-nearest records from a row in the prediction table for classification. The most common class of the K-nearest records is used as the class of the row. |
| | This component is used for multiclass classification and supports both sparse and dense data formats. |
| | This component is a classifier that consists of multiple decision trees. The classification result is determined by the mode of the output classes of the individual trees. |
| | This component is a probabilistic classification algorithm based on Bayes' theorem with independence assumptions. |
| | This component randomly selects K objects as the initial centroids of the clusters, computes the distance between the remaining objects and the centroids, distributes the remaining objects to the nearest clusters, and then recalculates the centroid of each cluster. This process is repeated until the centroids no longer change significantly. |
| | This component is used to create clustering models. |
| | This component is used to train classification models. |
| | This component is used to predict the clusters to which new points may belong based on DBSCAN models. |
| | This component is used to perform clustering prediction based on trained Gaussian mixture models. |
| | This component is an iterative decision tree algorithm that is suitable for linear and nonlinear regression scenarios. |
| | This component is used to analyze the linear relationship between a dependent variable and multiple independent variables. |
| | This component is used to process a large number of offline and online training jobs. SMART is short for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT). |
| | This component is used to analyze the linear relationship between a dependent variable and multiple independent variables. A parameter server is used to process a large number of offline and online training jobs. |
| | This component is used to calculate AUC, KS, and F1 score metrics to generate Kolmogorov-Smirnov (KS) curves, precision-recall (P-R) curves, ROC curves, lift charts, and gain charts. |
| | This component is used to evaluate the quality of regression models based on prediction results and ground-truth results. Then, evaluation metrics and histograms of residuals are generated. |
| | This component is used to evaluate clustering models and generate evaluation metrics based on raw data and clustering results. |
| | This component is suitable for supervised learning and corresponds to the matching matrix in unsupervised learning. |
| | This component is used to evaluate the quality of multiclass classification models based on the prediction results and ground-truth results of classification models. Then, evaluation metrics such as accuracy, kappa, and F1 score are generated. |
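
To make the evaluation metrics above concrete, here is a minimal scikit-learn sketch that trains a binary classifier on a synthetic dataset and derives AUC and the KS statistic from the ROC curve.

```python
# Train a logistic regression model, then compute AUC and KS.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # positive-class probabilities

auc = roc_auc_score(y_test, scores)
fpr, tpr, _ = roc_curve(y_test, scores)
ks = max(tpr - fpr)  # KS statistic: the largest gap between TPR and FPR
print(f"AUC={auc:.3f}, KS={ks:.3f}")
```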

| Component type | Description |
| --- | --- |
| Deep learning | PAI supports deep learning frameworks and provides GPU-accelerated clusters. You can use deep learning algorithms based on these frameworks and hardware resources. |

| Component type | Description |
| --- | --- |
| Time series | This component is a seasonal autoregressive integrated moving average (ARIMA) algorithm based on the open source X-13ARIMA-SEATS algorithm. |
| | This component implements automatic model selection for the ARIMA model. It is developed based on the revised programs of Gómez and Maravall (1998), which are implemented in TRAMO (1996) and later versions. |
| | This component forecasts time series data for each row of MTable data by using the Prophet algorithm and provides the prediction result of the next time period. |
| | This component aggregates columns in a table into an MTable based on the value specified by groupCols. |
| | This component expands an MTable into a table. |
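
As an open-source analog of the ARIMA components above, the following statsmodels sketch fits a hand-picked ARIMA order and forecasts the next period; the auto-ARIMA component described above selects the order automatically.

```python
# Fit an ARIMA(1,1,1) model to a short series and forecast one step ahead.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], float)

model = ARIMA(series, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=1)  # predict the next time period
print(forecast)
```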

| Component type | Description |
| --- | --- |
| Recommendation | The Factorization Machine (FM) algorithm-based components are nonlinear models that incorporate interactions among features. The algorithm is suitable for product promotion scenarios such as e-commerce, advertising, and live video streaming. |
| | This component is a model-based recommendation algorithm. It performs sparse matrix factorization and predicts the values of missing entries to obtain a basic training model. |
| | This component is an item recall algorithm. You can use this component to measure the similarity of items based on user-item-user principles. |
| | This component is used to predict upstream batch data. You can use this component to perform offline prediction based on the model and prediction data generated by the Swing Train component. |
| | This component is a collaborative filtering algorithm based on items. It uses two input columns and provides the top N items with the highest similarity as the output. |
| | This component calculates the hit rate of recalls. A higher value indicates a higher precision of recalls that are performed by using the vectors generated during model training. |
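
To show the idea behind the item-based collaborative filtering component above, here is a minimal NumPy sketch that computes cosine similarity between the item columns of a toy user-item matrix; the production component performs the same kind of computation at scale.

```python
# Item-based CF: items are similar if similar sets of users interact with them.
import numpy as np

# Rows are users, columns are items (e.g., implicit feedback counts).
ratings = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

norms = np.linalg.norm(ratings, axis=0)
similarity = (ratings.T @ ratings) / np.outer(norms, norms)
np.fill_diagonal(similarity, 0)  # ignore self-similarity

top_for_item0 = np.argsort(similarity[0])[::-1]
print(top_for_item0)  # items most similar to item 0, best first
```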

| Component type | Description |
| --- | --- |
| Anomaly detection | This component identifies samples as outliers based on the Local Outlier Factor (LOF) algorithm. |
| | This component uses the sub-sampling algorithm to detect anomalies. The sub-sampling algorithm has low complexity and can be used to identify anomalous points in datasets. Sub-sampling is widely used in anomaly detection scenarios. |
| | This component is an unsupervised machine learning algorithm that is different from traditional SVM algorithms. You can use this component to detect outliers by learning a decision boundary. |
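
The LOF and one-class SVM detectors above have scikit-learn counterparts. The following minimal sketch flags the one obvious outlier in a toy dataset.

```python
# Two outlier detectors on the same data: LOF and a one-class SVM.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2], [8.0, 8.0]])

lof_labels = LocalOutlierFactor(n_neighbors=2).fit_predict(X)
svm_labels = OneClassSVM(nu=0.2, gamma="scale").fit_predict(X)

print(lof_labels)  # -1 marks an outlier, 1 marks an inlier
print(svm_labels)
```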

| Component type | Description |
| --- | --- |
| Natural Language Processing | This component is used to extract key information from lengthy and repetitive texts. For example, headlines are the results of text summarization. You can use this component to call a specified pre-trained model to generate headlines for news. |
| | This component allows you to make batch predictions by using the models trained by the machine reading comprehension training component. |
| | This component is used to extract key information from lengthy and repetitive text. For example, headlines are the results of text summarization. You can use this component to train models that generate headlines, which summarize the main points of news. |
| | This component allows you to train machine reading comprehension (MRC) models to read and comprehend given text passages and answer relevant questions. |
| | This component splits words in specific columns based on Alibaba Word Segmenter (AliWS). The words obtained after splitting are separated by spaces. |
| | This component converts a triple table (row, col, value) to a key-value table (row, [col_id:value]). |
| | This component performs a basic machine learning operation that is typically used in information retrieval, natural language processing, and bioinformatics. |
| | This component calculates the string similarity and obtains the top N data records that best match the mapping table. |
| | This component is a preprocessing method in text analysis. This component is used to filter noise, such as "of", "is", or "oops", in word tokenization results. |
| | This component is a step in language model training. N-grams are generated based on words. The number of N-grams in all corpora is counted. |
| | This component can automatically generate abstracts. An abstract is a simple and coherent short text that accurately reflects the main ideas of a document. |
| | This component uses one of the important technologies in natural language processing to extract keywords from a document. |
| | This component is used to split text in a document by punctuation. This component processes text before text summarization. It splits the text into rows. Each row contains only one sentence. |
| | Based on calculated semantic vectors, such as the word vectors generated by the Word2Vec component, you can calculate the extension words or sentences of specified words or sentences, which are the set of vectors closest to a given vector. For example, you can generate a list of the words that are most similar to a given word. |
| | You can use the Doc2Vec component to map articles to vectors. The input is a vocabulary. The output is a document vector table, a word vector table, or a vocabulary. |
| | A conditional random field (CRF) is a conditional probability distribution model of a group of output random variables based on a group of input random variables. This model presumes that the output random variables constitute a Markov random field (MRF). |
| | This component calculates the similarity between articles or between sentences based on string similarity. |
| | This component counts the co-occurrence of all words in several documents and calculates the pointwise mutual information (PMI). |
| | This component is an algorithm component provided by Machine Learning Designer based on the linear conditional random field (LinearCRF) online prediction model. It is used for sequence labeling tasks. |
| | This component is developed based on Alibaba Word Segmenter (AliWS). The component generates a word segmentation model based on parameters and custom dictionaries. |
| | This component counts the total number of words in strings and the number of times that each word appears in the strings. The strings can be entered manually or read from a specified file. |
| | Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique for information retrieval and text mining. TF-IDF is used by search engines as a tool in scoring and ranking the relevance of a document for a given search query. |
| | In PAI, you can set the Topics parameter for the PLDA component to extract different topics for each document. |
| | The Word2Vec component uses a neural network to map words to vectors in the K-dimensional space based on extensive training. The component supports operations on the vectors to show the semantics of the vectors. The input is a word column or a vocabulary, and the output is a vector table and a vocabulary. |
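
For example, the TF-IDF weighting described above can be reproduced with scikit-learn. A minimal sketch on three toy documents follows.

```python
# TF-IDF: weight terms by frequency in a document, discounted by how
# common they are across the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)     # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # vocabulary
print(tfidf.toarray().round(2))            # per-document weights
```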

| Component type | Description |
| --- | --- |
| Network analysis | This component generates the depth of each node in a tree and the tree ID. |
| | This component identifies the subgraph with the specified coreness. The largest coreness is considered to be the coreness of a graph. |
| | This component uses the Dijkstra algorithm to generate the shortest paths between a given node and all other nodes. |
| | This component is an algorithm that calculates and sorts the rankings of web pages based on their link sources. |
| | This component is a semi-supervised machine learning algorithm. The labels of a node (community) depend on those of the neighboring nodes. The degree of dependence is determined by the similarity between nodes. Data becomes stable through iterative propagation updates. |
| | This component is a semi-supervised classification algorithm. It uses the label information of labeled nodes to predict the label information for unlabeled nodes. |
| | This component is a metric that is used to evaluate the structure of communities in a network. It is designed to measure the strength of a network divided into communities. Values greater than 0.3 indicate a strong community structure. |
| | This component generates maximum connected subgraphs. In Undirected Graph G, Vertex A is connected to Vertex B if a path exists between the two vertices. Undirected Graph G contains several subgraphs. Each vertex is connected to other vertices in the same subgraph. Vertices in different subgraphs are not connected. The subgraphs in Undirected Graph G are called maximum connected subgraphs. |
| | This component calculates the peripheral density of a vertex in Undirected Graph G. The density of a star network is 0, and the density of a fully meshed network is 1. |
| | This component calculates the edge density in Undirected Graph G. |
| | This component generates all triangles in Undirected Graph G. |
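
Two of the graph algorithms above, Dijkstra shortest paths and PageRank, have networkx counterparts. Here is a minimal sketch on a toy weighted graph.

```python
# Single-source shortest paths (Dijkstra) and PageRank on a small graph.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 4.0), ("C", "D", 1.0),
])

print(nx.single_source_dijkstra_path_length(G, "A"))  # shortest distances from A
print(nx.pagerank(G))                                 # importance score per node
```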

| Component type | Description |
| --- | --- |
| Finance | This component performs normalization, discretization, indexation, or weight of evidence (WOE) conversion on data. |
| | This component is a common modeling tool that is used in the field of credit risk assessment. This component performs binning to implement the discretization of variables and uses linear models, such as linear and logistic regression models, to train a model. The model training process includes feature selection and score transformation. |
| | This component uses the model that is generated by the Scorecard Training component to predict scores. |
| | This component is used for feature discretization. Feature discretization is a process of converting continuous data into multiple discrete intervals. This component supports equal frequency binning, equal width binning, and automated binning. |
| | This component is used to identify a shift in two samples of a population. You can use this component to measure the stability of samples. |
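
The shift and stability measure described in the last row is commonly computed as the population stability index, PSI = sum((a_i - e_i) * ln(a_i / e_i)) over bins, where e_i and a_i are the expected and actual bin proportions. The following NumPy sketch compares a baseline sample with a shifted one.

```python
# Population stability index (PSI) over 10 equal-frequency bins.
import numpy as np

rng = np.random.default_rng(0)
expected = rng.normal(0.0, 1.0, 10_000)  # baseline sample
actual = rng.normal(0.2, 1.0, 10_000)    # shifted sample to compare

bins = np.quantile(expected, np.linspace(0, 1, 11))  # decile bin edges
bins[0], bins[-1] = -np.inf, np.inf                  # cover the full range

e = np.histogram(expected, bins)[0] / expected.size
a = np.histogram(actual, bins)[0] / actual.size
psi = np.sum((a - e) * np.log(a / e))
print(f"PSI={psi:.4f}")  # <0.1 stable, 0.1-0.25 moderate shift, >0.25 large
```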

| Component type | Description |
| --- | --- |
| Visual algorithms | If your business involves image classification, you can use the image classification (torch) component to build image classification models for inference. |
| | This component can be used to train models on raw video data and obtain a video classification model for inference. |
| | This component can be used to train object detection models that detect high-risk entities in images. |
| | This component can be used to train models on unlabeled images and obtain a model that extracts image features. |
| | This component can be used to build metric learning models for inference. |
| | This component can be used to build pose models for inference. This component is ideal for scenarios that involve human body detection. |
| | This component provides mainstream model quantization algorithms for you to compress and accelerate models. This way, high-performance inference can be implemented. |
| | This component provides model pruning based on first-order Taylor expansion (TaylorFO), a mainstream pruning algorithm. You can use the model pruning component to compress models for high training and inference performance. |
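
As a generic PyTorch illustration of image classification inference, not the PAI component itself, the following sketch assumes torchvision 0.13 or later for the pretrained-weights API and uses a random tensor as a stand-in for a decoded image.

```python
# Classify one image with a pretrained ResNet-18 from torchvision.
import torch
from torchvision import models

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()
preprocess = weights.transforms()  # the resize/crop/normalize pipeline

image = torch.rand(3, 224, 224)    # stand-in for a decoded RGB image
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))
print(weights.meta["categories"][logits.argmax().item()])  # predicted label
```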

| Component type | Description |
| --- | --- |
| Tools | OfflineModel is a data format used in MaxCompute. Models that are generated by traditional machine learning algorithms based on the PAICommand framework are stored in the OfflineModel format in MaxCompute projects. These components can be used to obtain offline models and use the offline models to run offline prediction jobs. |
| | This component can be used to export a model that is trained in MaxCompute to a specified OSS path. |

| Component type | Description |
| --- | --- |
| Custom scripts | This component allows you to write custom SQL statements in the SQL script editor. You can submit the statements to MaxCompute for execution. |
| | This component allows you to install custom dependencies and run custom Python functions. |
| | This component allows you to call all Alink algorithms, such as classification, regression, and recommendation algorithms. You can also use this component together with other algorithm components of Machine Learning Designer to create pipelines and verify their effects. |
| | This component allows you to use the multi-date loop execution feature to execute multiple day-level SQL tasks within a certain period. |
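
To show what submitting SQL to MaxCompute from code looks like, here is a minimal PyODPS sketch, analogous to what the SQL script component does inside a pipeline. The credentials, endpoint, project, and table name are hypothetical placeholders.

```python
# Submit a SQL statement to MaxCompute and read the result with PyODPS.
from odps import ODPS

o = ODPS("<access_key_id>", "<access_key_secret>",          # placeholders
         project="my_project", endpoint="<maxcompute_endpoint>")

with o.execute_sql("SELECT COUNT(*) AS n FROM my_table").open_reader() as reader:
    for record in reader:
        print(record["n"])
```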

| Component type | Description |
| --- | --- |
| Beta components | This component provides a compression estimation algorithm. |
| | This component supports both sparse and dense data. You can use this component to estimate values of numeric variables, such as loan limits and temperatures. |
| | This component can be used to estimate values of numeric variables, such as housing prices, sales volumes, and temperatures. |
| | This component provides the most common regularization method used to deal with ill-posed problems. |
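
The regularization method referred to in the last row is commonly known as ridge (Tikhonov) regression. A minimal scikit-learn sketch follows.

```python
# Ridge regression: least squares with an L2 penalty on the coefficients.
from sklearn.linear_model import Ridge

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 1.1, 1.9, 3.2]

model = Ridge(alpha=1.0).fit(X, y)  # alpha controls the L2 penalty strength
print(model.coef_, model.intercept_)
```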