Node Description

Ungroup Words Node

The Ungroup Words node takes a user-selected column and Ungroups the Words found in each input String into separate rows.

Chinese Words are identified by referring to a Word Dictionary. The Word Dictionary consists of a large internal Chinese Matching Words list built into the Ungroup Words node. The user may optionally add a Supplemental Dictionary containing their own set of Matching Words. These supplemental words are typically things like Brand Names that would not normally be included in a standard language dictionary.

Words are split at any end-of-word marker (such as a space) and at any punctuation. Numbers are treated as if they were part of the English alphabet. The period (full stop) is treated as punctuation and not as a decimal point, which means the Ungroup Words node will split a decimal number into two separate number Words. For example, '1.5' becomes the two Words '1' and '5'.
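The parsing rules above can be illustrated with a minimal Python sketch. This is not the node's actual implementation; the function name split_words and the regular expression are assumptions chosen for illustration, and the sketch covers only English letters and digits (Chinese text is handled by dictionary matching instead).

```python
import re

# Hypothetical pattern: a run of ASCII letters and digits counts as one Word.
# Everything else (spaces and punctuation, including the period) separates Words.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def split_words(text: str) -> list[str]:
    """Split an input String into Words on spaces and punctuation."""
    return TOKEN_PATTERN.findall(text)


# The decimal number 1.5 is split into the two number Words '1' and '5'.
print(split_words("Acme cola 1.5 litre, 6-pack"))
# ['Acme', 'cola', '1', '5', 'litre', '6', 'pack']
```

Because digits sit in the same character class as letters, a mixed token such as 'SKU123' would stay together as one Word under this sketch.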

If a match is found then the Matched Word is added to the output collection. If two Chinese Words are matched from the same starting point in the string then the longer Word is kept and the shorter Word is discarded. An English-equivalent example would be matching ‘cat’ and ‘catch’ – in this case the Word ‘catch’ would be retained and ‘cat’ would be discarded.
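The longest-match rule can be sketched in Python as follows. This is an illustrative approximation only: the function name match_words, the use of a plain set as the Word Dictionary, and the choice to resume scanning immediately after each Matched Word are all assumptions, since the node's internal matching strategy is not documented here.

```python
def match_words(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match scan over text (hypothetical helper)."""
    max_len = max((len(word) for word in dictionary), default=0)
    matches = []
    i = 0
    while i < len(text):
        found = None
        # Try the longest candidate first so that 'catch' beats 'cat'.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                found = candidate
                break
        if found:
            matches.append(found)
            i += len(found)  # Assumption: continue after the Matched Word.
        else:
            i += 1  # No Word starts here; move on one character.
    return matches


print(match_words("catch", {"cat", "catch"}))
# ['catch']
print(match_words("我喜欢北京大学", {"北京", "北京大学", "喜欢"}))
# ['喜欢', '北京大学']
```

Under the same assumptions, a Supplemental Dictionary could be modelled as a set union with the internal list, for example match_words(text, internal_words | supplemental_words).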

This Premium Node is not available as part of the free Community Edition. Premium Nodes help clean and connect real-world data to Market Simulations, and provide advanced Market Science analysis. Note that these descriptions are often deliberately vague.

Downloads

Ungroup Words

The Ungroup Words node takes a user-selected column and Ungroups the Words found in each input String into separate rows. The results can be used to identify a Product Name, SKU Number, or Brand from a general Description of the Product. Both Chinese and English are currently supported.

Inputs

Input Product Array

The Input Product Array, or another table, containing the column of input Strings that will be Ungrouped into Words.

Supplemental Dictionary

The optional user-defined set of Matching Words to supplement the internal Chinese Matching Words list.

Node

Configuration

The user selects which column from the Input Product Array contains the input Strings to be Ungrouped into Words. The Ungroup Words node can also Regroup a number of adjacent Matched Words and add them to the output collection. The user can choose to Regroup Couplets (two adjoining Words), Triplets (three adjoining Words), Quadruplets (four adjoining Words), Quintuplets (five adjoining Words), or Singlets (don’t Regroup).
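The Regroup options can be pictured as a sliding window over the Matched Words. The Python sketch below is an assumption-laden illustration: the function name regroup, the size parameter, and the use of a space to join Words are hypothetical, and the node may join or order its output differently.

```python
def regroup(words: list[str], size: int, joiner: str = " ") -> list[str]:
    """Return every run of `size` adjacent Words (hypothetical helper).

    size=1 corresponds to Singlets, 2 to Couplets, 3 to Triplets,
    4 to Quadruplets, and 5 to Quintuplets.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [joiner.join(words[i:i + size]) for i in range(len(words) - size + 1)]


words = ["Acme", "diet", "cola"]
print(regroup(words, 1))  # ['Acme', 'diet', 'cola']
print(regroup(words, 2))  # ['Acme diet', 'diet cola']
print(regroup(words, 3))  # ['Acme diet cola']
```

When the input has fewer Words than the requested group size, this sketch returns an empty list.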

Outputs

Output Product Array

The Output Product Array corresponds to the Input Product Array but is extended with the Ungrouped Words. Each Matched Word found in the Input Product Array is output as a separate row.
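The shape of the output can be sketched in Python by repeating each input row once per Word. The function name ungroup_rows, the output column name "Word", and the sample data are assumptions made for illustration; the actual column naming used by the node is not specified here.

```python
def ungroup_rows(rows, column, splitter, output_column="Word"):
    """Copy each input row once per Word found in `column` (hypothetical helper)."""
    output = []
    for row in rows:
        for word in splitter(row[column]):
            # Keep the original columns and add the Ungrouped Word.
            output.append({**row, output_column: word})
    return output


products = [
    {"SKU": "A1", "Description": "Acme diet cola"},
    {"SKU": "B2", "Description": "Zest lemonade"},
]

# str.split stands in for the node's Word matching in this sketch.
for row in ungroup_rows(products, "Description", str.split):
    print(row)
```

Here the two input rows expand to five output rows, one for each Word in the Description column.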