Overview of the Text Explorer Platform

Unstructured text data are common. For example, unstructured text data could result from a free response field in a survey, product review comments, or incident reports. The Text Explorer platform enables you to explore unstructured text in order to better understand its meaning. Text analysis is often an iterative process, so you might alternate between curating and analyzing the list of terms.

Curating the List of Terms

Text analysis uses some unique terminology:

• A term or token is the smallest piece of text, similar to a word in a sentence. However, you can define terms in many ways, including through the use of regular expressions; the process of breaking the text into terms is called tokenization.

• A phrase is a short collection of terms; the platform has options to manage phrases that are specified as terms in and of themselves.

• A document refers to a collection of words; in a JMP data table, the unstructured text in each row of the text column corresponds to a document.

• A corpus refers to a collection of documents.
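The terminology above can be illustrated with a minimal Python sketch. This is not how the platform tokenizes internally. The sample corpus, the regular expression, and the name tokenize are assumptions made for illustration only.

```python
import re

# Hypothetical corpus: each string plays the role of one row (document)
# in the text column of a data table.
corpus = [
    "The pump failed after 3 hours.",
    "Pump seal leaked; replaced the seal.",
]

# Assumed tokenizing rule: a term is a run of letters, digits, or apostrophes.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")


def tokenize(document):
    """Split one document into lowercase terms."""
    return [match.lower() for match in TOKEN_PATTERN.findall(document)]


# One list of terms per document in the corpus.
tokens_per_document = [tokenize(document) for document in corpus]
```

Changing TOKEN_PATTERN changes what counts as a term, which is the sense in which terms can be defined through regular expressions.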

It is often desirable to exclude some common words from the analysis. These excluded words are called stop words. The platform has a default list of stop words, but you can also add specific words as stop words. Although stop words are not eligible to be terms, they can be used in phrases.
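As a rough sketch of this rule, the following Python function drops stand-alone stop words but keeps them when they occur inside a specified phrase. The stop word list, the phrase, and the name curate are hypothetical; the platform's default stop word list and its phrase handling are not reproduced here.

```python
# Hypothetical stop words and phrases (not the platform's default lists).
STOP_WORDS = {"the", "of", "a", "to", "was"}
PHRASES = {("out", "of", "stock")}


def curate(tokens, stop_words=STOP_WORDS, phrases=PHRASES):
    """Keep specified phrases whole and drop stand-alone stop words."""
    terms = []
    i = 0
    while i < len(tokens):
        # Look for a phrase starting at position i, longest phrase first.
        match = None
        for phrase in sorted(phrases, key=len, reverse=True):
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                match = phrase
                break
        if match is not None:
            # A stop word such as "of" survives because it is part of a phrase.
            terms.append(" ".join(match))
            i += len(match)
        elif tokens[i] in stop_words:
            i += 1  # a stand-alone stop word is not eligible to be a term
        else:
            terms.append(tokens[i])
            i += 1
    return terms
```

For example, curate(["the", "item", "was", "out", "of", "stock"]) keeps "item" and the phrase "out of stock", even though "of" is a stop word.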

You can also recode terms; this is useful for combining synonyms into one common term.
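Conceptually, recoding is a lookup from each original term to its replacement. A minimal Python sketch, with a hypothetical synonym mapping:

```python
# Hypothetical recode mapping: several synonyms collapse into one common term.
RECODES = {"automobile": "car", "auto": "car"}


def recode(terms, recodes=RECODES):
    """Replace each term that has a recode; leave all other terms unchanged."""
    return [recodes.get(term, term) for term in terms]
```

After recoding, counts for "auto" and "automobile" are added to the count for "car".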

Stemming is the process of combining words with identical beginnings (stems) by removing the endings that differ. This results in “jump”, “jumped”, and “jumping” all being treated as the term “jump·”. The stemming procedure is similar to the procedure used in the Snowball string processing language. When a phrase is stemmed, each word in the phrase is stemmed as it would be stemmed as a stand-alone term.
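The idea can be sketched in Python with a deliberately naive suffix stripper. This is not the Snowball procedure and not the platform's algorithm. The suffix list, the minimum stem length, and the choice to mark a stem only when two or more distinct words share it are all assumptions made for illustration.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "s")  # assumed, very small suffix list
STEM_MARK = "\u00b7"           # the middle dot used to mark a stemmed term


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_terms(terms):
    """Combine words that share a stem; leave words with a unique stem as they are."""
    groups = defaultdict(set)
    for term in terms:
        groups[strip_suffix(term)].add(term)
    stemmed = []
    for term in terms:
        stem = strip_suffix(term)
        # Only stems shared by two or more distinct words are combined.
        stemmed.append(stem + STEM_MARK if len(groups[stem]) > 1 else term)
    return stemmed
```

With this sketch, "jump", "jumped", and "jumping" all become "jump·", while a word such as "pump" that shares its stem with no other word is left unchanged.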

Analyzing the List of Terms

Text analysis in the Text Explorer platform uses a bag of words approach. Other than in the formation of phrases, the order of terms is ignored. The analysis is based on the term counts.

After you curate the list of terms through the use of regular expressions, stop words, recoding, and stemming, you can perform analyses on the curated list of terms. The analysis options in the platform are based on the document term matrix (DTM). Each row in the DTM corresponds to a document (a cell in a text column of a JMP data table). Each column in the DTM corresponds to a term from the curated term list. This structure reflects the bag of words approach because it records only which terms occur in each document, not their order. In its simplest form, each cell of the DTM contains the frequency (number of occurrences) of the column’s term in the row’s document. There are various other weighting schemes; these are described in Save Options.
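A frequency-weighted DTM can be sketched in a few lines of Python. The sample documents and the name document_term_matrix are hypothetical, and only the simplest weighting (raw counts) is shown.

```python
from collections import Counter


def document_term_matrix(docs_terms):
    """Return (vocabulary, rows): one row per document, one column per term."""
    vocabulary = sorted({term for terms in docs_terms for term in terms})
    rows = []
    for terms in docs_terms:
        counts = Counter(terms)  # term frequencies for this document
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical curated term lists, one per document.
docs_terms = [
    ["pump", "seal", "leak", "seal"],
    ["seal", "seal", "leak", "pump"],  # same terms as above, different order
    ["motor", "noise"],
]
vocabulary, dtm = document_term_matrix(docs_terms)
```

The first two documents contain the same terms in a different order, so they produce identical rows in the matrix.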

The analysis options that are available in the platform first perform a singular value decomposition (SVD) on the document term matrix. This can greatly reduce the number of columns needed to represent the term information in the data. For more details about singular value decomposition, see Statistical Details in the Multivariate Methods book. Hierarchical clustering options are available for clustering the terms and for clustering the documents. These options enable you to group similar terms or documents together.
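To give a feel for how an SVD compresses a DTM, the following Python sketch approximates only the leading singular value and its singular vectors, using power iteration. This is a simplified stand-in, not the platform's computation; in practice a numerical library's full SVD routine would be used. The matrix values and the name leading_singular_triplet are assumptions.

```python
import math


def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its left and right singular
    vectors by power iteration on (A transposed) times A."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols  # starting guess for the right singular vector
    for _ in range(iterations):
        # u = A v
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        # w = (A transposed) u, then normalize to get the next v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


# Hypothetical 3-document by 5-term DTM of raw counts.
dtm = [
    [1, 0, 0, 1, 2],
    [1, 0, 0, 1, 2],
    [0, 1, 1, 0, 0],
]
sigma, doc_vector, term_vector = leading_singular_triplet(dtm)
```

Here a single column, doc_vector, already separates the documents: the first two documents receive the same coordinate and the third receives a coordinate near zero, whereas the raw DTM needs five columns to describe them.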

Platform Workflow

The expected steps for using the Text Explorer platform are as follows:

1. Specify the method for tokenizing (either built-in or customized regular expression).

2. Use the report to specify additional stop words, add phrases to the term list, perform recodes of terms, and specify exceptions to stemming rules.

3. Specify the preference for stemming.

4. Use word and phrase counts, SVD, and clustering approaches to identify important terms and phrases.

Note: The SVD and clustering options are available only in JMP Pro.

5. Save results for use in further analysis: the term table, the DTM, the singular vectors, or other results.

Note: The option to save the singular vectors is available only in JMP Pro.

6. Save Phrase, Recode, and Stop Words properties for use in future analyses of similar text data.