Text analysis uses some specialized terminology. A term, or token, is the smallest unit of text that is analyzed, similar to a word in a sentence. However, you can define terms in many ways, including through the use of regular expressions. The process of breaking the text into terms is called tokenization.
• A phrase is a short collection of terms; the platform has options to manage phrases that are specified as terms in and of themselves.
• A document refers to a collection of words; in a JMP data table, the unstructured text in each row of the text column corresponds to a document.
• A corpus refers to a collection of documents.
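As a rough illustration of tokenization, the following sketch splits each document of a small corpus into lowercase terms by using a regular expression. The pattern, the `tokenize` function name, and the sample text are assumptions made for this example; they are not the platform's defaults.

```python
import re

# Hypothetical pattern: a term is a run of letters, optionally with
# internal apostrophes (so "don't" stays one term).
TERM_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def tokenize(document):
    """Break one document (a string) into a list of lowercase terms."""
    return [match.group(0).lower() for match in TERM_PATTERN.finditer(document)]

# A corpus is a collection of documents; each string here plays the
# role of the text in one row of a text column.
corpus = [
    "The dog jumped over the fence.",
    "Dogs don't always jump.",
]

tokens_per_document = [tokenize(doc) for doc in corpus]
```

Changing the regular expression changes what counts as a term, which is why the definition of a term is described above as flexible.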
It is often desirable to exclude some common words from the analysis. These excluded words are called stop words. The platform has a default list of stop words, but you can also add specific words as stop words. Although stop words are not eligible to be terms, they can be used in phrases.
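The distinction between terms and phrases can be sketched as follows. The stop word list, the function names, and the way candidate phrases are enumerated are all assumptions for illustration; the platform's default list and its phrase detection rules are not reproduced here.

```python
# Hypothetical stop word list, far shorter than any real default list.
STOP_WORDS = {"the", "of", "a", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are eligible to be terms."""
    return [token for token in tokens if token not in stop_words]

def candidate_phrases(tokens, length=2):
    """List runs of `length` consecutive tokens; stop words may appear inside them."""
    return [" ".join(tokens[i:i + length]) for i in range(len(tokens) - length + 1)]

tokens = ["the", "quality", "of", "service", "and", "the", "price"]

# Stop words are dropped from the list of terms...
terms = remove_stop_words(tokens)

# ...but phrases are built from the unfiltered tokens, so a phrase such as
# "quality of service" can still contain the stop word "of".
phrases = candidate_phrases(tokens, length=3)
```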
Stemming is the process of combining words with identical beginnings (stems) by removing the endings that differ. This results in “jump”, “jumped”, and “jumping” all being treated as the term “jump·”. The stemming procedure is similar to the procedure used in the Snowball string processing language. When a phrase is stemmed, each word in the phrase is stemmed as it would be stemmed as a stand-alone term.
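To make the idea concrete, here is a deliberately simple suffix-stripping sketch. It is not the Snowball procedure and not the platform's stemmer; the suffix list, the minimum stem length, and the function names are assumptions chosen only so that the example above works.

```python
# Toy suffix list (an assumption); real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "s")
STEM_MARK = "\u00b7"  # the middle dot shown after a stemmed term, as in "jump·"

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least `min_stem_length` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

def stem_phrase(phrase):
    """Stem each word of a phrase exactly as it would be stemmed on its own."""
    return " ".join(toy_stem(word) for word in phrase.split())

words = ["jump", "jumped", "jumping"]
stemmed_terms = [toy_stem(word) + STEM_MARK for word in words]
stemmed_phrase = stem_phrase("jumping dogs")
```

All three words map to the same stem, so their counts are combined under a single term.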
Text analysis in the Text Explorer platform uses a bag of words approach. Other than in the formation of phrases, the order of terms is ignored. The analysis is based on the term counts.
After you curate the list of terms through the use of regular expressions, stop words, recoding, and stemming, you can perform analyses on the curated list of terms. The analysis options in the platform are based on the document term matrix (DTM). Each row in the DTM corresponds to a document (a cell in a text column of a JMP data table). Each column in the DTM corresponds to a term from the curated term list. Because the DTM records how often each term occurs and ignores word order, it is a direct implementation of the bag of words approach. In its simplest form, each cell of the DTM contains the frequency (number of occurrences) of the column’s term in the row’s document. There are various other weighting schemes for the DTM; these are described in Save Options.
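The simplest (frequency) form of the DTM can be sketched in a few lines. The `document_term_matrix` function and the sample documents are hypothetical, and the input is assumed to be already tokenized and curated; other weighting schemes are not shown.

```python
from collections import Counter

def document_term_matrix(tokenized_documents):
    """Return (terms, matrix) where matrix[i][j] counts terms[j] in document i."""
    # Columns: every distinct term in the corpus, in alphabetical order.
    terms = sorted({term for doc in tokenized_documents for term in doc})
    matrix = []
    for doc in tokenized_documents:
        counts = Counter(doc)  # term order within the document is discarded here
        matrix.append([counts[term] for term in terms])
    return terms, matrix

docs = [
    ["dog", "jump", "fence"],
    ["jump", "dog", "jump"],
    ["fence", "paint"],
]
terms, dtm = document_term_matrix(docs)
```

Reordering the terms inside any document leaves its row unchanged, which is the bag of words property in practice.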
The analysis options that are available in the platform first perform a singular value decomposition (SVD) on the document term matrix. This can greatly reduce the number of columns needed to represent the term information in the data. For more details about singular value decomposition, see Statistical Details in the Multivariate Methods book. Hierarchical clustering options are available for clustering the terms and for clustering the documents. These options enable you to group similar terms or documents together.
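For reference, the standard truncated SVD can be written as shown below. The notation is chosen for this sketch: $X$ is the DTM with $n$ documents (rows) and $p$ terms (columns), and $k$ is the number of singular vectors that are retained. Any centering, scaling, or weighting that the platform applies before the decomposition is not specified here.

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{n \times k}, \quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
V_k \in \mathbb{R}^{p \times k}, \quad
\sigma_1 \ge \dots \ge \sigma_k \ge 0
```

In the usual convention, each document is then represented by its row of $U_k \Sigma_k$ and each term by its row of $V_k \Sigma_k$, so $k$ columns stand in for the original $p$ term columns when $k$ is much smaller than $p$.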