Word Frequency Statistics

Word Frequency Statistics is a fundamental text analysis technique that quantifies text data by tallying the occurrences of each word within the text. These results are crucial for the feature extraction phase, laying the groundwork for further Natural Language Processing tasks, such as text classification, clustering, and information retrieval.

Algorithm description

Word frequency is the number of times a word appears in a given corpus and is commonly used as a rough indicator of the word's importance in the text. To determine word frequency, the text (docContent) is first segmented into individual words. Then, for each document, the component outputs the document ID (docId) together with each word, in the order in which the words were input. Finally, it counts how many times each word appears in that document. The results reveal the lexical makeup of each document and serve as input for later analysis tasks, such as topic modeling.
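The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the component's implementation: the function name word_frequency is hypothetical, the input is assumed to be already segmented with words separated by whitespace, and the shape of the two result lists (one row per word in input order, and one row per document, word, and count) is an assumption about how the two output tables are laid out.

```python
from collections import Counter


def word_frequency(rows):
    """Count words per document.

    rows: iterable of (doc_id, doc_content) pairs. doc_content is assumed
    to be segmented text whose words are separated by whitespace.
    Returns (multi, triple):
      multi  - (doc_id, word) pairs in the original input order
      triple - (doc_id, word, count) rows, one per distinct word per document
    """
    multi = []
    counts = {}  # doc_id -> Counter; dicts keep insertion order
    for doc_id, content in rows:
        for word in content.split():
            multi.append((doc_id, word))
            counts.setdefault(doc_id, Counter())[word] += 1
    triple = [
        (doc_id, word, n)
        for doc_id, counter in counts.items()
        for word, n in counter.items()
    ]
    return multi, triple


if __name__ == "__main__":
    sample = [("d1", "a b a"), ("d2", "b c")]
    multi, triple = word_frequency(sample)
    for row in triple:
        print(row)
```

Counting is done per document ID, so the same word in two documents produces two separate rows.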

Input and output

Input port

Split Word

Output port

Configure the component

Method 1: Visualized method

Add a Word Frequency Statistics component on the pipeline page and configure the following parameters:

Category	Parameter	Description
Fields Setting	Document ID Column	The column that contains the IDs of the specified documents (docId).
Fields Setting	Document Content Column	The column that contains the content of the specified documents (docContent). The text in this column is used for word frequency analysis, which includes segmentation and frequency calculation for each word.
Tuning	Cores	The number of cores to use.
Tuning	Memory Size per Core	The memory size of each core. Unit: MB.

Method 2: PAI command method

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

pai -name doc_word_stat
    -project algo_public
    -DinputTableName=tdl_doc_test_split_word
    -DdocId=docid
    -DdocContent=content
    -DoutputTableNameMulti=doc_test_stat_multi
    -DoutputTableNameTriple=doc_test_stat_triple
    -DinputTablePartitions="region=cctv_news"
    -Dlifecycle=7

Parameter	Required	Default value	Description
inputTableName	Yes	None	The name of the input table.
docId	Yes	None	The name of the document ID column. You can specify only one column.
docContent	Yes	None	The name of the document content column. You can specify only one column.
outputTableNameMulti	Yes	None	The name of the output table that lists the words in their original order after word segmentation, including the document ID column (docId) and the corresponding document content (docContent).
outputTableNameTriple	No	None	The name of the output table that lists the number of times that each word appears in the documents, including the document ID column (docId) and the corresponding document content (docContent).
inputTablePartitions	No	All partitions	The partitions of the input table to use as input. The following formats are supported: partition_name=value (a single partition) and name1=value1/name2=value2 (multi-level partitions). Note: If you specify multiple partitions, separate them with commas (,). For example, name1=value1,value2.
lifecycle	No	-1	The lifecycle of the output table. The value must be a positive integer.
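When the command is generated from a script, it can help to assemble the text from a parameter dictionary. The following is a minimal sketch under stated assumptions: build_pai_command is a hypothetical helper, not part of any PAI SDK. It only formats a string in the same layout as the sample command above, skips optional parameters that are set to None, and wraps values that contain an equal sign in double quotation marks, as the sample does for inputTablePartitions.

```python
def build_pai_command(name, project, params):
    """Format a PAI command string (hypothetical helper, string formatting only).

    params: dict of parameter name -> value. Entries whose value is None are
    treated as unset optional parameters and are skipped.
    """
    lines = [f"pai -name {name}", f"    -project {project}"]
    for key, value in params.items():
        if value is None:
            continue
        text = str(value)
        # Quote partition-style values such as region=cctv_news.
        if "=" in text:
            text = f'"{text}"'
        lines.append(f"    -D{key}={text}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_pai_command(
        "doc_word_stat",
        "algo_public",
        {
            "inputTableName": "tdl_doc_test_split_word",
            "docId": "docid",
            "docContent": "content",
            "outputTableNameMulti": "doc_test_stat_multi",
            "outputTableNameTriple": "doc_test_stat_triple",
            "inputTablePartitions": "region=cctv_news",
            "lifecycle": 7,
        },
    ))
```

The resulting text can be pasted into the SQL Script component. The helper does not validate parameter names or values against the table above.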