This topic describes the Deprecated Word Filter component provided by Machine Learning Designer (formerly known as Machine Learning Studio).
The Deprecated Word Filter component is a preprocessing step in text analysis. It removes noise words, such as "of", "is", or "oops", from word tokenization results.
The component takes two inputs: an input table and a deprecated word table. The input table contains the tokenized text from which you want to remove deprecated words. The deprecated word table has only one column, and each row contains one deprecated word.
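The following Python sketch illustrates the filtering idea described above. It is not the component's implementation. It assumes that each selected column holds tokens separated by spaces, and the function name `filter_deprecated_words` and the sample data are hypothetical.

```python
def filter_deprecated_words(rows, deprecated_words, selected_cols, sep=" "):
    """Return copies of the rows with deprecated words removed from the selected columns.

    Assumption: tokens in each selected column are separated by `sep`.
    """
    noise = set(deprecated_words)  # one deprecated word per row of the noise table
    filtered = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows unchanged
        for col in selected_cols:
            tokens = row[col].split(sep)
            new_row[col] = sep.join(t for t in tokens if t and t not in noise)
        filtered.append(new_row)
    return filtered


# Hypothetical input table with one tokenized column.
rows = [{"id": 1, "words_seg1": "the price of this phone is high"}]
# Hypothetical single-column deprecated word table.
noise_table = ["of", "is", "the"]

result = filter_deprecated_words(rows, noise_table, ["words_seg1"])
print(result[0]["words_seg1"])  # price this phone high
```

Columns that are not selected, such as `id` in this sample, are passed through unchanged.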
You can configure the component by using the Machine Learning Platform for AI (PAI) console or a PAI command.
Configure the component
You can use one of the following methods to configure the Deprecated Word Filter component.
Method 1: Configure the component on the pipeline page
Tab | Parameter | Description |
---|---|---|
Fields Setting | Columns to Filter | The columns to be filtered. Separate multiple columns with commas (,). |
Tuning | Cores | The number of cores. By default, the system determines the value. |
Tuning | Memory Size | The memory size of each core. By default, the system determines the value. |
Method 2: Use PAI commands
PAI -name FilterNoise -project algo_public \
-DinputTableName="test_input" -DnoiseTableName="noise_input" \
-DoutputTableName="test_output" \
-DselectedColNames="words_seg1,words_seg2" \
-Dlifecycle=30
Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input table. | No default value |
inputTablePartitions | No | The names of the partitions in the input table. | All partitions |
noiseTableName | Yes | The name of the deprecated word table. | No default value |
noiseTablePartitions | No | The names of the partitions in the deprecated word table. | All partitions |
outputTableName | Yes | The name of the output table. | No default value |
selectedColNames | Yes | The columns to be filtered. Separate multiple columns with commas (,). | No default value |
lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | No default value |
coreNum | No | The number of cores that are used in computing. | Determined by the system |
memSizePerCore | No | The memory size of each core. | Determined by the system |
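If you generate the command from a script, a small helper can check the required parameters before the command is submitted. The following Python sketch only assembles the command string from the parameters listed in the preceding table. The helper name `build_filter_noise_command` is hypothetical, and the quoting rule (string values quoted, numeric values unquoted) is an assumption based on the sample command.

```python
REQUIRED = ("inputTableName", "noiseTableName", "outputTableName", "selectedColNames")
OPTIONAL = ("inputTablePartitions", "noiseTablePartitions",
            "lifecycle", "coreNum", "memSizePerCore")
NUMERIC = ("lifecycle", "coreNum", "memSizePerCore")


def build_filter_noise_command(**params):
    """Assemble a FilterNoise PAI command string (hypothetical helper)."""
    missing = [key for key in REQUIRED if not params.get(key)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    unknown = sorted(set(params) - set(REQUIRED) - set(OPTIONAL))
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))

    # The lifecycle of the output table must be a positive integer.
    lifecycle = params.get("lifecycle")
    if lifecycle is not None:
        if isinstance(lifecycle, bool) or not isinstance(lifecycle, int) or lifecycle <= 0:
            raise ValueError("lifecycle must be a positive integer")

    parts = ["PAI -name FilterNoise -project algo_public"]
    for key in REQUIRED + OPTIONAL:
        value = params.get(key)
        if value is None:
            continue  # omitted optional parameters fall back to their defaults
        if key in NUMERIC:
            parts.append(f"-D{key}={value}")
        else:
            parts.append(f'-D{key}="{value}"')
    return " ".join(parts)


command = build_filter_noise_command(
    inputTableName="test_input",
    noiseTableName="noise_input",
    outputTableName="test_output",
    selectedColNames="words_seg1,words_seg2",
    lifecycle=30,
)
print(command)
```

With these arguments, the helper produces the same parameters as the sample command, on a single line.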