Publication date: 07/08/2024

Text Processing Steps

In the Text Explorer platform, text is processed in three stages: tokenizing, phrasing, and terming.

Tokenizing Stage

The Tokenizing stage performs the following operations:

1. Convert text to lowercase.

2. Apply the specified tokenizing method (either Basic Words or Regex) to group characters into tokens.

3. Recode tokens based on specified recode definitions. Note that recoding occurs before stemming.

Note: Recode operations are processed internally in one pass, regardless of the order in which they are specified in the report window.
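The three tokenizing operations above can be sketched in Python as follows. This is a minimal illustration, not JMP's implementation: the function name tokenize, the default pattern, and the recode table are hypothetical, and the sketch assumes that "one pass" means each token is looked up once against the recode definitions, so a recoded value is not recoded again.

```python
import re

def tokenize(text, recodes=None, pattern=r"\w+"):
    """Lowercase the text, group characters into tokens, then recode."""
    recodes = recodes or {}
    # Step 1: convert to lowercase.
    lowered = text.lower()
    # Step 2: group characters into tokens (a simple regex stands in
    # for the Basic Words or Regex tokenizing method).
    tokens = re.findall(pattern, lowered)
    # Step 3: recode in a single pass. Each token is looked up once,
    # so the order of the recode definitions does not matter
    # (assumed interpretation of "one pass").
    return [recodes.get(tok, tok) for tok in tokens]

print(tokenize("The Car and the AUTO", {"auto": "car", "car": "vehicle"}))
# ['the', 'vehicle', 'and', 'the', 'car']
```

In this sketch, "auto" becomes "car" but is not then recoded to "vehicle", because every lookup uses the original token.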

Tokenizing Stage for Asian Languages

When Japanese, Chinese (Simplified), Chinese (Traditional), or Korean is specified as the Language option, JMP uses a language-specific dictionary of words to parse the text. The dictionary is downloaded from a public source and stored in a JMP data table the first time you specify any of the above languages. This JMP data table is stored in a language-specific subdirectory of a TextExplorer directory. The location of the TextExplorer directory is based on your computer’s operating system:

Windows: C:\Users\<username>\AppData\Roaming\JMP\JMP\TextExplorer\

macOS: /Users/<username>/Library/Application Support/JMP/TextExplorer/

You can also add words to or remove words from the language-specific dictionary by editing the dictionary-User.jmp data table, which is located in a language-specific subdirectory of the TextExplorer directory. The dictionary-User.jmp data table contains two columns: Data and action. To add a word to the language-specific dictionary, add a row with the word in the first column and the word add in the second column. To remove a word from the language-specific dictionary, add a row with the word in the first column and the word delete in the second column.
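The effect of the add and delete rows can be pictured with a short Python sketch. It is only a conceptual model: the function apply_user_edits and the sample words are hypothetical, and how JMP merges the user table with the downloaded dictionary internally is not described here.

```python
def apply_user_edits(base_words, user_rows):
    """Return the dictionary words after applying (word, action) rows."""
    words = set(base_words)
    for word, action in user_rows:
        if action == "add":
            words.add(word)
        elif action == "delete":
            # discard() ignores words that are not in the dictionary.
            words.discard(word)
    return words

# Hypothetical rows, in the order (first column, second column).
rows = [("東京大学", "add"), ("大学", "delete")]
print(sorted(apply_user_edits({"東京", "大学"}, rows)))
```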

Phrasing Stage

The Phrasing stage collects phrases that occur in the corpus (collection of documents) and enables you to specify that individual phrases be treated as terms. Phrases cannot start or end with a stop word, but they can contain a stop word.
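The stop word rule for phrases can be illustrated with a small Python sketch that counts candidate word sequences. The function collect_phrases, the max_words limit, and the sample data are assumptions made for illustration; the actual criteria that JMP uses to collect phrases might differ.

```python
from collections import Counter

def collect_phrases(docs, stop_words, max_words=3):
    """Count word sequences that do not start or end with a stop word."""
    counts = Counter()
    for tokens in docs:
        for n in range(2, max_words + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # A phrase cannot start or end with a stop word,
                # but a stop word is allowed in the middle.
                if gram[0] in stop_words or gram[-1] in stop_words:
                    continue
                counts[gram] += 1
    return counts

docs = [["cost", "of", "living", "is", "high"]]
print(collect_phrases(docs, stop_words={"of", "is"}))
```

Here "cost of living" is kept as a candidate because the stop word "of" is in the middle, whereas "cost of" and "of living" are rejected.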

Terming Stage

The Terming stage creates the term list from the tokens and phrases that result from the previous stages.

For each token, the Terming stage performs the following operations:

1. Check that the minimum and maximum length requirements specified in the launch window are met. Tokens that contain only numbers are excluded from this operation.

2. Check that the token is qualified to become a term. Tokens parsed by the Basic Words tokenization method must contain at least one alphabetical or Unicode character. Tokens that contain only numbers are excluded from this operation. For the Regex tokenization method, the regular expressions determine which characters are part of a token.

3. Check that the token is not a stop word.

4. Apply stemming and stem exceptions.
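The four per-token checks above can be sketched in Python as a single function. This is an illustrative model under stated assumptions, not JMP's code: the names to_term and strip_s are hypothetical, the stemmer is deliberately naive, and the sketch reads "excluded from this operation" as meaning that the length and qualification checks are skipped for tokens that contain only numbers.

```python
def strip_s(token):
    """Naive stand-in for a stemmer: drop one trailing 's' (hypothetical)."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def to_term(token, stop_words, min_len=1, max_len=50,
            stem=None, stem_exceptions=frozenset()):
    """Return the term for a token, or None if the token is rejected."""
    numeric = token.isdigit()
    if not numeric:
        # Step 1: length requirements (assumed skipped for numbers).
        if not (min_len <= len(token) <= max_len):
            return None
        # Step 2: Basic Words qualification, at least one letter.
        # str.isalpha() is also True for non-ASCII letters.
        if not any(ch.isalpha() for ch in token):
            return None
    # Step 3: stop words never become terms.
    if token in stop_words:
        return None
    # Step 4: stemming, unless the token is a stem exception.
    if stem is not None and token not in stem_exceptions:
        return stem(token)
    return token

print(to_term("cars", {"the"}, stem=strip_s))   # car
print(to_term("the", {"the"}))                  # None
print(to_term("news", set(), stem=strip_s, stem_exceptions={"news"}))  # news
```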

For each phrase that you add, the Terming stage performs the following operations:

1. Add the phrase to the term list. Stemming is applied to each word in the phrase that is stemmed in the term list. Phrases that have different raw tokens but the same stems are combined in the term list.

2. Remove the occurrences of individual token terms that appear in the phrase.
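The two phrase operations above can be pictured with the following Python sketch, which replaces the tokens covered by an added phrase with a single phrase term. The function apply_phrases is hypothetical, the longest-match-first strategy is an assumption, and the input tokens are assumed to be stemmed already so that phrases with the same stems coincide.

```python
def apply_phrases(tokens, phrases):
    """Replace token runs that match an added phrase with one phrase term."""
    # Try longer phrases first (assumed tie-breaking rule); skip empty ones.
    by_length = sorted((p for p in phrases if p), key=len, reverse=True)
    terms, i = [], 0
    while i < len(tokens):
        for phrase in by_length:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                # Add the phrase as a term and consume its tokens, so
                # they are not also counted as individual terms.
                terms.append(" ".join(phrase))
                i += n
                break
        else:
            terms.append(tokens[i])
            i += 1
    return terms

print(apply_phrases(["battery", "life", "is", "short"], {("battery", "life")}))
# ['battery life', 'is', 'short']
```

Occurrences of "battery" or "life" elsewhere in the document, outside the phrase, would still be kept as individual terms in this sketch.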
