This topic describes the components supported by Machine Learning Designer of Platform for AI (PAI).

| Component type | Description |
| --- | --- |
| Custom components | You can create custom components in AI computing asset management. After a custom component is created, you can use it together with official components in Machine Learning Designer for model training. |

| Component type | Description |
| --- | --- |
| Data source and destination | This component is used to read objects or directories from Object Storage Service (OSS) buckets. |
| | This component allows you to read CSV files from OSS, HTTP, and Hadoop Distributed File System (HDFS) data sources. |
| | This component reads data from MaxCompute tables. By default, the component reads table data from the current project. |
| | This component allows you to write upstream data to MaxCompute. |
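
To illustrate the kind of input these components consume, the following is a minimal pandas sketch (outside Designer). The inline CSV is a stand-in for a real file, and the OSS URL in the comment is a hypothetical placeholder.

```python
# A minimal sketch of reading CSV data with pandas. pd.read_csv accepts
# local paths as well as HTTP(S) URLs, which is how a CSV stored in OSS
# can be read through a public or signed URL.
import io

import pandas as pd

csv_text = "id,feature,label\n1,0.5,0\n2,1.5,1\n"  # inline stand-in for a file
df = pd.read_csv(io.StringIO(csv_text))
# df = pd.read_csv("https://<bucket>.oss-<region>.aliyuncs.com/train.csv")  # hypothetical OSS URL
print(df.head())
```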

| Component type | Description |
| --- | --- |
| Data preprocessing | This component implements random sampling based on a specific proportion or sample count. The samples are independent of each other. |
| | This component generates samples based on the values of a weight column. |
| | This component uses filter-condition expressions to filter data and allows you to rename the filtered columns. |
| | This component stratifies the input data based on the values of a stratification column and implements random sampling for each stratum. |
| | This component merges two tables by associating the columns in the tables and determines the output columns. This component works like the JOIN statement of SQL. |
| | This component merges two tables by column. The two tables must have the same number of rows. If one of the two tables has partitions, the partitioned table must connect to the second input port. |
| | This component merges two tables by row. If this component is used, the numbers and data types of the output fields selected from the left and right tables must be the same. This component integrates the features of UNION and UNION ALL. |
| | This component converts features of any data type to features of the STRING, DOUBLE, or INT data type. This component also allows you to replace missing values if exceptions occur during data type conversion. |
| | This component allows you to append an ID column to the first column of a data table. |
| | This component randomly splits data to generate datasets for training and testing. |
| | This component can be configured in a visualized manner or by running PAI commands. |
| | This component allows you to normalize dense or sparse data. |
| | This component allows you to generate standardized instances in a visualized manner or by running PAI commands. |
| | This component allows you to convert a table in the key-value format into a common table. |
| | This component allows you to convert a common table into a table in the key-value format in a visualized manner or by running PAI commands. |
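
As a rough open-source analog of the random and stratified sampling components above, here is a minimal pandas sketch. The column name `label` is an assumed example used as the stratum.

```python
# Random and stratified sampling on a small DataFrame.
import pandas as pd

df = pd.DataFrame({
    "feature": range(10),
    "label": ["a", "a", "a", "b", "b", "b", "b", "c", "c", "c"],
})

# Random sampling: draw 50% of the rows without replacement.
random_sample = df.sample(frac=0.5, random_state=42)

# Stratified sampling: draw 50% of the rows from each stratum of "label".
stratified_sample = df.groupby("label").sample(frac=0.5, random_state=42)
print(stratified_sample)
```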

| Component type | Description |
| --- | --- |
| Feature engineering | This component provides the filtering feature for components including Linear Model Feature Importance, GBDT Feature Importance, and Random Forest Feature Importance. This component can be used to filter the top N features. |
| | This component uses a multivariate statistical method to explore the internal structures of multiple variables and how they correlate with each other based on a few principal components. |
| | This component allows you to scale numeric data in the dense or sparse format by using common scaling functions. |
| | This component discretizes continuous features based on a specific rule. |
| | This component can smooth anomalous features in input data to a specific interval. Both sparse and dense data are supported. |
| | This component is used to decompose matrices in linear algebra. It is a generalization of the diagonalization of normal matrices in matrix analysis. |
| | This component is used to detect data with continuous or enumerated features. It helps you detect exceptions in data. |
| | This component calculates the feature importance for a linear model, such as linear regression and binary logistic regression. Both sparse and dense data are supported. |
| | This component is used to collect statistics on the distribution of discrete features. |
| | This component is used to calculate feature importance. |
| | This component selects the top N features from all feature data in the sparse or dense format by using a filter based on the feature selection method that you specify. |
| | This component can encode nonlinear features to linear features based on gradient boosting decision tree (GBDT) algorithms. |
| | This component converts data to key-value pairs in the sparse format. |
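
The PCA component above has a direct open-source counterpart. The following is a minimal scikit-learn sketch on a toy dense matrix, shown only to make the operation concrete.

```python
# Principal component analysis: project data onto its main direction
# of variance and report how much variance that direction explains.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

pca = PCA(n_components=1)             # keep the first principal component
X_reduced = pca.fit_transform(X)      # project the data onto that component
print(pca.explained_variance_ratio_)  # share of variance it explains
```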

| Component type | Description |
| --- | --- |
| Statistical analytics | This component allows you to view the distributions of feature values, feature columns, and label columns. This facilitates follow-up data analysis. |
| | This component is used to measure the joint variability of two variables. |
| | This component uses empirical distribution and kernel density estimation functions. |
| | This component collects statistics about all data in a table or about selected columns only. |
| | This component applies to categorical variables. It is used to determine whether a significant difference exists between the observed frequency and the expected frequency for each class of a single multiclass categorical variable. The null hypothesis is that the observed and expected frequencies are the same. |
| | A box plot chart shows the distribution of a set of data. It shows the distribution features of raw data. It can also be used to compare the distribution features between multiple sets of data. |
| | In regression analysis, a scatter chart shows the distribution of data points in a Cartesian coordinate system. |
| | A correlation coefficient indicates the correlation between columns in a matrix and is in the range [-1,1]. For each pair of columns, the count is the number of rows in which both columns contain non-zero elements. |
| | This component checks whether the means of two samples are significantly different based on statistical principles. |
| | This component is used to determine whether a significant difference exists between the overall mean of a variable and a specific value. The sample on which you want to perform a T test must follow a normal distribution. |
| | This component is a normality test that determines whether the population follows a normal distribution by using observations. A normality test is a special goodness-of-fit hypothesis test in statistical determination. |
| | This component can be used to show the income distribution of a country or region. |
| | This component is used to calculate the percentile of data in the columns of a data table. |
| | This component calculates the linear correlation coefficient that measures the linear correlation between two variables. |
| | This component generates a histogram, also known as a mass distribution profile. A histogram is a statistical report chart that consists of a series of vertical stripes or line segments with different heights to show data distribution. |
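
For example, the chi-square goodness-of-fit test and the Pearson correlation coefficient described above can be reproduced with SciPy. A minimal sketch with made-up counts follows.

```python
# Two classic tests: chi-square goodness of fit and Pearson correlation.
from scipy import stats

# Chi-square: do observed class counts match the expected counts?
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")  # a small p rejects the null

# Pearson correlation between two numeric variables, in [-1, 1].
r, p = stats.pearsonr([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"r={r:.3f}")
```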

| Component type | Description |
| --- | --- |
| Machine learning | This component uses a trained model and prediction data as input and generates prediction results as output. |
| | XGBoost is an extension of the gradient boosting algorithm. XGBoost provides better usability and robustness and has been widely used in machine learning production systems and machine learning competitions. XGBoost can be used for classification and regression. |
| | XGBoost is an extension of the gradient boosting algorithm. XGBoost provides better usability and robustness and has been widely used in machine learning production systems and machine learning competitions. XGBoost can be used for classification and regression. |
| | This component is a machine learning model based on statistical learning theory. It minimizes structural risk, which reduces both the empirical risk and the confidence interval, to improve the generalization capability of the learning machine. |
| | This component is a binary classification algorithm and supports sparse and dense data. |
| | This component is used to set a threshold. If a value is greater than the threshold, it is treated as a positive example. Otherwise, it is treated as a negative example. |
| | A parameter server (PS) is used to process a large number of offline and online training tasks. SMART stands for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT). |
| | This component is a classic binary classification algorithm and is widely used in advertising and search scenarios. |
| | A parameter server (PS) is used to process a large number of offline and online training tasks. SMART stands for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT). |
| | This component selects the K-nearest records from a row in the prediction table for classification. The most common class of the K-nearest records is used as the class of the row. |
| | This component is used for multiclass classification and supports both sparse and dense data formats. |
| | This component is a classifier that consists of multiple decision trees. The classification result is determined by the mode of the output classes of the individual trees. |
| | This component is a probabilistic classification algorithm based on Bayes' theorem with independence assumptions. |
| | This component randomly selects K objects as the initial centroids of the clusters, computes the distance between the remaining objects and the centroids, distributes the remaining objects to the nearest clusters, and then recalculates the centroid of each cluster. This process is repeated until the centroids no longer change significantly. |
| | This component is used to create clustering models. |
| | This component is used to train classification models. |
| | This component is used to predict the clusters to which new points may belong based on DBSCAN models. |
| | This component is used to perform clustering prediction based on trained Gaussian mixture models. |
| | This component is an iterative decision tree algorithm that is suitable for linear and nonlinear regression scenarios. |
| | This component is used to analyze the linear relationship between a dependent variable and multiple independent variables. |
| | This component is used to process a large number of offline and online training jobs. SMART is short for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT). |
| | This component is used to analyze the linear relationship between a dependent variable and multiple independent variables. A parameter server is used to process a large number of offline and online training jobs. |
| | This component is used to calculate AUC, KS, and F1 score metrics to generate Kolmogorov-Smirnov (KS) curves, precision-recall (P-R) curves, ROC curves, lift charts, and gain charts. |
| | This component is used to evaluate the quality of regression models based on prediction results and ground-truth results. Then, evaluation metrics and histograms of residuals are generated. |
| | This component is used to evaluate clustering models and generate evaluation metrics based on raw data and clustering results. |
| | This component is suitable for supervised learning and corresponds to the matching matrix in unsupervised learning. |
| | This component is used to evaluate the quality of multiclass classification models based on the prediction results and ground-truth results of classification models. Then, evaluation metrics such as accuracy, kappa, and F1 score are generated. |
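
To make the evaluation metrics above concrete, here is a minimal scikit-learn sketch that trains a binary classifier on a synthetic dataset and derives AUC and the KS statistic from the ROC curve.

```python
# Train a logistic regression model, then compute AUC and KS.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # positive-class probabilities

auc = roc_auc_score(y_test, scores)
fpr, tpr, _ = roc_curve(y_test, scores)
ks = max(tpr - fpr)  # KS statistic: the largest gap between TPR and FPR
print(f"AUC={auc:.3f}, KS={ks:.3f}")
```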

| Component type | Description |
| --- | --- |
| Deep learning | PAI supports deep learning frameworks and provides GPU-accelerated clusters. You can use deep learning algorithms based on these frameworks and hardware resources. |

| Component type | Description |
| --- | --- |
| Time series | This component is a seasonal autoregressive integrated moving average (ARIMA) algorithm based on the open source X-13ARIMA-SEATS algorithm. |
| | This component implements automatic model selection for the ARIMA model. It is developed based on the revised programs of Gómez and Maravall (1998), which are implemented in TRAMO (1996) and later versions. |
| | This component forecasts time series data for each row of MTable data by using the Prophet algorithm and provides the prediction result of the next time period. |
| | This component aggregates columns in a table into an MTable based on the value specified by groupCols. |
| | This component expands an MTable into a table. |
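
As an open-source analog of the ARIMA components above, the following statsmodels sketch fits a hand-picked ARIMA order and forecasts the next period; the auto-ARIMA component described above selects the order automatically.

```python
# Fit an ARIMA(1,1,1) model to a short series and forecast one step ahead.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], float)

model = ARIMA(series, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=1)  # predict the next time period
print(forecast)
```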

| Component type | Description |
| --- | --- |
| Recommendation | The Factorization Machine (FM) algorithm-based components are nonlinear models that incorporate interactions among features. The algorithm is suitable for product promotion scenarios such as e-commerce, advertising, and live video streaming. |
| | This component is a model-based recommendation algorithm. It performs sparse matrix factorization and predicts the values of missing entries to obtain a basic training model. |
| | This component is an item recall algorithm. You can use this component to measure the similarity of items based on user-item-user principles. |
| | This component is used to predict upstream batch data. You can use this component to perform offline prediction based on the model and prediction data generated by the Swing Train component. |
| | This component is a collaborative filtering algorithm based on items. It uses two input columns and provides the top N items with the highest similarity as the output. |
| | This component calculates the hit rate of recalls. A higher value indicates a higher precision of recalls that are performed by using the vectors generated during model training. |
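
To show the idea behind the item-based collaborative filtering component above, here is a minimal NumPy sketch that computes cosine similarity between the item columns of a toy user-item matrix; the production component performs the same kind of computation at scale.

```python
# Item-based CF: items are similar if similar sets of users interact with them.
import numpy as np

# Rows are users, columns are items (e.g., implicit feedback counts).
ratings = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

norms = np.linalg.norm(ratings, axis=0)
similarity = (ratings.T @ ratings) / np.outer(norms, norms)
np.fill_diagonal(similarity, 0)  # ignore self-similarity

top_for_item0 = np.argsort(similarity[0])[::-1]
print(top_for_item0)  # items most similar to item 0, best first
```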

| Component type | Description |
| --- | --- |
| Anomaly detection | This component identifies samples as outliers based on the Local Outlier Factor (LOF) algorithm. |
| | This component uses the sub-sampling algorithm to detect anomalies. The sub-sampling algorithm has low complexity and can be used to identify anomalous points in datasets. Sub-sampling is widely used in anomaly detection scenarios. |
| | This component is an unsupervised machine learning algorithm that is different from traditional SVM algorithms. You can use this component to detect outliers by learning a decision boundary. |
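
The LOF and one-class SVM detectors above have scikit-learn counterparts. The following minimal sketch flags the one obvious outlier in a toy dataset.

```python
# Two outlier detectors on the same data: LOF and a one-class SVM.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2], [8.0, 8.0]])

lof_labels = LocalOutlierFactor(n_neighbors=2).fit_predict(X)
svm_labels = OneClassSVM(nu=0.2, gamma="scale").fit_predict(X)

print(lof_labels)  # -1 marks an outlier, 1 marks an inlier
print(svm_labels)
```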

| Component type | Description |
| --- | --- |
| Natural Language Processing | This component is used to extract key information from lengthy and repetitive texts. For example, headlines are the results of text summarization. You can use this component to call a specified pre-trained model to generate headlines for news. |
| | This component allows you to make batch predictions by using the models trained by the machine reading comprehension training component. |
| | This component is used to extract key information from lengthy and repetitive text. For example, headlines are the results of text summarization. You can use this component to train models that generate headlines, which summarize the main points of news. |
| | This component allows you to train machine reading comprehension (MRC) models to read and comprehend given text passages and answer relevant questions. |
| | This component splits words in specific columns based on Alibaba Word Segmenter (AliWS). The words obtained after splitting are separated by spaces. |
| | This component converts a triple table (row, col, value) to a key-value table (row, [col_id:value]). |
| | This component performs a basic machine learning operation that is typically used in information retrieval, natural language processing, and bioinformatics. |
| | This component calculates the string similarity and obtains the top N data records that best match the mapping table. |
| | This component is a preprocessing method in text analysis. This component is used to filter noise, such as "of", "is", or "oops", in word tokenization results. |
| | This component is a step in language model training. N-grams are generated based on words. The number of N-grams in all corpora is counted. |
| | This component can automatically generate abstracts. An abstract is a simple and coherent short text that accurately reflects the main ideas of a document. |
| | This component uses one of the important technologies in natural language processing to extract keywords from a document. |
| | This component is used to split text in a document by punctuation. This component processes text before text summarization. It splits the text into rows. Each row contains only one sentence. |
| | Based on calculated semantic vectors, such as the word vectors generated by the Word2Vec component, you can calculate the extension words or sentences of specified words or sentences, which are the set of vectors closest to a given vector. For example, you can generate a list of the words that are most similar to a given word. |
| | You can use the Doc2Vec component to map articles to vectors. The input is a vocabulary. The output is a document vector table, a word vector table, or a vocabulary. |
| | A conditional random field (CRF) is a conditional probability distribution model of a group of output random variables based on a group of input random variables. This model presumes that the output random variables constitute a Markov random field (MRF). |
| | This component calculates the similarity between articles or between sentences based on string similarity. |
| | This component counts the co-occurrence of all words in several documents and calculates the pointwise mutual information (PMI). |
| | This component is an algorithm component provided by Machine Learning Designer based on the linear conditional random field (LinearCRF) online prediction model. It is used for sequence labeling tasks. |
| | This component is developed based on Alibaba Word Segmenter (AliWS). The component generates a word segmentation model based on parameters and custom dictionaries. |
| | This component counts the total number of words in strings and the number of times that each word appears in the strings. The strings can be entered manually or read from a specified file. |
| | Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique for information retrieval and text mining. TF-IDF is used by search engines as a tool in scoring and ranking the relevance of a document for a given search query. |
| | In PAI, you can set the Topics parameter for the PLDA component to extract different topics for each document. |
| | The Word2Vec component uses a neural network to map words to vectors in the K-dimensional space based on extensive training. The component supports operations on the vectors to show the semantics of the vectors. The input is a word column or a vocabulary, and the output is a vector table and a vocabulary. |
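
For example, the TF-IDF weighting described above can be reproduced with scikit-learn. A minimal sketch on three toy documents follows.

```python
# TF-IDF: weight terms by frequency in a document, discounted by how
# common they are across the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)     # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # vocabulary
print(tfidf.toarray().round(2))            # per-document weights
```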

| Component type | Description |
| --- | --- |
| Network analysis | This component generates the depth of each node in a tree and the tree ID. |
| | This component identifies the subgraph with the specified coreness. The largest coreness is considered to be the coreness of a graph. |
| | This component uses the Dijkstra algorithm to generate the shortest paths between a given node and all other nodes. |
| | This component is an algorithm that calculates and sorts the rankings of web pages based on their link sources. |
| | This component is a semi-supervised machine learning algorithm. The labels of a node (community) depend on those of the neighboring nodes. The degree of dependence is determined by the similarity between nodes. Data becomes stable through iterative propagation updates. |
| | This component is a semi-supervised classification algorithm. It uses the label information of labeled nodes to predict the label information for unlabeled nodes. |
| | This component is a metric that is used to evaluate the structure of communities in a network. It is designed to measure the strength of a network divided into communities. Values greater than 0.3 indicate a strong community structure. |
| | This component generates maximum connected subgraphs. In Undirected Graph G, Vertex A is connected to Vertex B if a path exists between the two vertices. Undirected Graph G contains several subgraphs. Each vertex is connected to other vertices in the same subgraph. Vertices in different subgraphs are not connected. The subgraphs in Undirected Graph G are called maximum connected subgraphs. |
| | This component calculates the peripheral density of a vertex in Undirected Graph G. The density of a star network is 0, and the density of a fully meshed network is 1. |
| | This component calculates the edge density in Undirected Graph G. |
| | This component generates all triangles in Undirected Graph G. |
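
Two of the graph algorithms above, Dijkstra shortest paths and PageRank, have networkx counterparts. Here is a minimal sketch on a toy weighted graph.

```python
# Single-source shortest paths (Dijkstra) and PageRank on a small graph.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 4.0), ("C", "D", 1.0),
])

print(nx.single_source_dijkstra_path_length(G, "A"))  # shortest distances from A
print(nx.pagerank(G))                                 # importance score per node
```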

| Component type | Description |
| --- | --- |
| Finance | This component performs normalization, discretization, indexation, or weight of evidence (WOE) conversion on data. |
| | This component is a common modeling tool that is used in the field of credit risk assessment. This component performs binning to implement the discretization of variables and uses linear models, such as linear and logistic regression models, to train a model. The model training process includes feature selection and score transformation. |
| | This component uses the model that is generated by the Scorecard Training component to predict scores. |
| | This component is used for feature discretization. Feature discretization is a process of converting continuous data into multiple discrete intervals. This component supports equal frequency binning, equal width binning, and automated binning. |
| | This component is used to identify a shift in two samples of a population. You can use this component to measure the stability of samples. |
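
The shift and stability measure described in the last row is commonly computed as the population stability index, PSI = sum((a_i - e_i) * ln(a_i / e_i)) over bins, where e_i and a_i are the expected and actual bin proportions. The following NumPy sketch compares a baseline sample with a shifted one.

```python
# Population stability index (PSI) over 10 equal-frequency bins.
import numpy as np

rng = np.random.default_rng(0)
expected = rng.normal(0.0, 1.0, 10_000)  # baseline sample
actual = rng.normal(0.2, 1.0, 10_000)    # shifted sample to compare

bins = np.quantile(expected, np.linspace(0, 1, 11))  # decile bin edges
bins[0], bins[-1] = -np.inf, np.inf                  # cover the full range

e = np.histogram(expected, bins)[0] / expected.size
a = np.histogram(actual, bins)[0] / actual.size
psi = np.sum((a - e) * np.log(a / e))
print(f"PSI={psi:.4f}")  # <0.1 stable, 0.1-0.25 moderate shift, >0.25 large
```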

| Component type | Description |
| --- | --- |
| Visual algorithms | If your business involves image classification, you can use the image classification (torch) component to build image classification models for inference. |
| | This component can be used to train models on raw video data and obtain a video classification model for inference. |
| | This component can be used to train object detection models that detect high-risk entities in images. |
| | This component can be used to train models on unlabeled images and obtain a model that extracts image features. |
| | This component can be used to build metric learning models for inference. |
| | This component can be used to build pose models for inference. This component is ideal for scenarios that involve human body detection. |
| | This component provides mainstream model quantization algorithms for you to compress and accelerate models. This way, high-performance inference can be implemented. |
| | This component provides model pruning based on first-order Taylor expansion (TaylorFO), a mainstream pruning algorithm. You can use the model pruning component to compress models for high training and inference performance. |
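
As a generic PyTorch illustration of image classification inference, not the PAI component itself, the following sketch assumes torchvision 0.13 or later for the pretrained-weights API and uses a random tensor as a stand-in for a decoded image.

```python
# Classify one image with a pretrained ResNet-18 from torchvision.
import torch
from torchvision import models

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()
preprocess = weights.transforms()  # the resize/crop/normalize pipeline

image = torch.rand(3, 224, 224)    # stand-in for a decoded RGB image
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))
print(weights.meta["categories"][logits.argmax().item()])  # predicted label
```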

| Component type | Description |
| --- | --- |
| Tools | OfflineModel is a data format used in MaxCompute. Models that are generated by traditional machine learning algorithms based on the PAICommand framework are stored in the OfflineModel format in MaxCompute projects. These components can be used to obtain offline models and use the offline models to run offline prediction jobs. |
| | This component can be used to export a model that is trained in MaxCompute to a specified OSS path. |

| Component type | Description |
| --- | --- |
| Custom scripts | This component allows you to write custom SQL statements in the SQL script editor. You can submit the statements to MaxCompute for execution. |
| | This component allows you to install custom dependencies and run custom Python functions. |
| | This component allows you to call all Alink algorithms, such as classification, regression, and recommendation algorithms. You can also use this component together with other algorithm components of Machine Learning Designer to create pipelines and verify their effects. |
| | This component allows you to use the multi-date loop execution feature to execute multiple day-level SQL tasks within a certain period. |
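
To show what submitting SQL to MaxCompute from code looks like, here is a minimal PyODPS sketch, analogous to what the SQL script component does inside a pipeline. The credentials, endpoint, project, and table name are hypothetical placeholders.

```python
# Submit a SQL statement to MaxCompute and read the result with PyODPS.
from odps import ODPS

o = ODPS("<access_key_id>", "<access_key_secret>",          # placeholders
         project="my_project", endpoint="<maxcompute_endpoint>")

with o.execute_sql("SELECT COUNT(*) AS n FROM my_table").open_reader() as reader:
    for record in reader:
        print(record["n"])
```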

| Component type | Description |
| --- | --- |
| Beta components | This component provides a compression estimation algorithm. |
| | This component supports both sparse and dense data. You can use this component to estimate values of numeric variables, such as loan limits and temperatures. |
| | This component can be used to estimate values of numeric variables, such as housing prices, sales volumes, and temperatures. |
| | This component provides the most common regularization method used to deal with ill-posed problems. |
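
The regularization method referred to in the last row is commonly known as ridge (Tikhonov) regression. A minimal scikit-learn sketch follows.

```python
# Ridge regression: least squares with an L2 penalty on the coefficients.
from sklearn.linear_model import Ridge

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 1.1, 1.9, 3.2]

model = Ridge(alpha=1.0).fit(X, y)  # alpha controls the L2 penalty strength
print(model.coef_, model.intercept_)
```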