Term selection identifies which terms best explain different responses. The analysis uses the Generalized Regression platform to perform variable selection on the document term matrix (DTM) and to identify terms that most impact the response. Term selection can be used with binary responses, similar to sentiment analysis, as well as other types of responses. The fitted model uses an appropriate response distribution for the specified response column.
Tip: For an example of Term Selection, select Help > Sample Data Library, open Chips.jmp, and run the Text Explorer - Term Selection table script.
The Settings report enables you to select a response column, specify the target level of the response, and adjust the settings for the model. When you have specified the model settings, click the Run button to run the model. The fitted model then appears in the Summary report. See Term Selection Summary Report.
After you choose a response column, the Target Level outline appears.
• For nominal responses, choose one level of the response to be the target level in a logistic regression model; the response in the logistic regression model is the target level versus all of the other levels combined.
• For ordinal responses, all response levels are initially included in the model. Using the local data filter, you can select levels of the response to be excluded from the model; the underlying numeric values of the included levels are modeled with a normal response distribution.
Note: For ordinal responses, the term selection model can be fit only when the data type of the response column is numeric.
• For continuous responses, use the local data filter histogram to select values of the response to be excluded from the model; the included values are modeled with a normal response distribution.
• For response columns with the Multiple Response modeling type, choose one or more levels of the response to be the target level in a binary logistic regression model. If you choose more than one level, a document belongs to the target level if any of the levels are present in the response column for that document. Select the Combine with AND option to require that all selected levels are present in a document’s response column for that document to be included in the target level.
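The target-level rules for nominal and Multiple Response columns can be illustrated with a minimal Python sketch. This is not JMP code; the function names `nominal_target` and `multiple_response_target` and the `combine_with_and` parameter are hypothetical names chosen only to mirror the options described above.

```python
from typing import Iterable


def nominal_target(value: str, target_level: str) -> int:
    """One-vs-rest coding: 1 if the response equals the target level, else 0."""
    return int(value == target_level)


def multiple_response_target(
    doc_levels: Iterable[str],
    selected_levels: Iterable[str],
    combine_with_and: bool = False,
) -> int:
    """Binary target for one document's Multiple Response value.

    Default: 1 if ANY selected level is present in the document.
    combine_with_and=True: 1 only if ALL selected levels are present.
    """
    selected = set(selected_levels)
    if not selected:
        raise ValueError("choose at least one target level")
    present = set(doc_levels)
    if combine_with_and:
        return int(selected <= present)   # every selected level is present
    return int(bool(selected & present))  # at least one selected level is present


# Hypothetical documents, each with its list of response levels.
docs = [["salty", "crunchy"], ["salty"], ["sweet"]]
selected = {"salty", "crunchy"}
any_targets = [multiple_response_target(d, selected) for d in docs]
all_targets = [multiple_response_target(d, selected, True) for d in docs]
print(any_targets)  # [1, 1, 0]
print(all_targets)  # [1, 0, 0]
```

In this sketch, the second document counts toward the target level under the default rule but not when the AND rule is applied, because only one of the two selected levels is present.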
By default, the Generalized Regression model uses the Elastic Net estimation method with early stopping and the AICc validation method. You can change these settings in the Model Settings outline. See Generalized Regression Models in Fitting Linear Models.
Note: If a Validation column is specified in the Text Explorer launch window, the Generalized Regression platform in the Term Selection report uses the Validation column as the Validation method.
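As background for the default AICc validation method, the following Python sketch computes AICc using its commonly cited definition, -2 log-likelihood + 2k + 2k(k+1)/(n - k - 1), and picks the candidate model with the smallest value. How the platform counts the parameters k is not stated in this section, so the counts and likelihood values below are made-up illustrative inputs, and `aicc` is a hypothetical helper name.

```python
def aicc(neg2_loglik: float, k: int, n: int) -> float:
    """Corrected AIC: -2*logL + 2k + 2k(k+1)/(n - k - 1).

    neg2_loglik: -2 times the log-likelihood of the fitted model
    k: number of estimated parameters (assumed to be supplied by the caller)
    n: number of observations (documents)
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return neg2_loglik + 2 * k + (2 * k * (k + 1)) / (n - k - 1)


# Illustrative candidates: (label, -2 log-likelihood, parameter count).
n_docs = 50
candidates = [("3 terms", 100.0, 3), ("8 terms", 96.0, 8)]
best = min(candidates, key=lambda c: aicc(c[1], c[2], n_docs))
print(best[0])  # 3 terms
```

Here the larger model fits slightly better but is penalized more heavily for its extra parameters, so the smaller model has the lower AICc.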
The Term Settings define the document term matrix (DTM) that is used in the regression model. You can change the weighting technique as well as the maximum number of terms included in the DTM; each term corresponds to a column of the DTM. Terms that occur fewer than 10 times in the corpus are excluded from the DTM used by the model. For more information about the DTM options, see Document Term Matrix Specifications Window.
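The effect of the minimum-occurrence rule and the maximum number of terms can be sketched in Python as follows. This is an illustration under stated assumptions, not the platform's implementation: tokenization is a simple lowercase whitespace split, the weighting is binary (1 if the term appears in the document), and when the term limit applies the most frequent terms are kept. The name `build_dtm` and its parameters are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def build_dtm(
    docs: List[str],
    min_count: int = 10,
    max_terms: Optional[int] = None,
) -> Tuple[List[str], List[List[int]]]:
    """Return (terms, rows): one DTM column per term, one row per document."""
    # Assumption: lowercase whitespace tokenization stands in for real parsing.
    tokenized = [doc.lower().split() for doc in docs]
    corpus_counts = Counter(tok for toks in tokenized for tok in toks)

    # Drop terms that occur fewer than min_count times in the whole corpus.
    kept = [(t, c) for t, c in corpus_counts.items() if c >= min_count]
    # Assumption: keep the most frequent terms first; ties broken alphabetically.
    kept.sort(key=lambda tc: (-tc[1], tc[0]))
    if max_terms is not None:
        kept = kept[:max_terms]

    terms = [t for t, _ in kept]
    column = {t: j for j, t in enumerate(terms)}

    rows = []
    for toks in tokenized:
        row = [0] * len(terms)
        for tok in set(toks):
            j = column.get(tok)
            if j is not None:
                row[j] = 1  # binary weighting
        rows.append(row)
    return terms, rows


# "chips" occurs 13 times, "good" 10 times, "bad" only 3 times.
corpus = ["good chips"] * 10 + ["bad chips"] * 3
terms, rows = build_dtm(corpus)
print(terms)     # ['chips', 'good']
print(rows[10])  # [1, 0]
```

In this example, "bad" never becomes a DTM column because it falls below the 10-occurrence threshold, so it cannot be selected by the model.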
After you run an analysis, the Term Selection report consists of three sections. The Settings report contains controls for specifying an analysis. See Term Selection Settings. Below the Settings report is a Generalized Regression report, initially closed, for each analysis that you have run. See Generalized Regression Models in Fitting Linear Models. The last section of the report is the Summary report.
Figure 12.12 Term Selection Report
The Summary report contains a Model Comparison table, a Summary table and histogram, a Document Scores table, a Term Scores table, and a text box.
The Model Comparison table contains a row for each fitted model. The rest of the Summary report shows results from the currently selected model in this table.
The Summary table shows counts and mean scores for the documents, overall and by the predicted value of the response from the model. The Mean Contribution is the average of the contribution values in the Document Scores table. The Summary histogram shows the distribution of the overall contribution values of the documents. The histogram is interactive, so you can click on a bar to highlight the corresponding documents in the Document Scores table.
The Document Scores table shows the positive and negative contribution values for each document, as well as predicted and actual values for each document. For binomial response models, the predicted values are probabilities of the document being in the target level; for normal response models, the predicted values are the predictions from the fitted model for each document. If you select a row of the table, the text of the corresponding document appears in the text box below the table.
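This section does not define how the contribution values are calculated. One plausible reading, shown in the Python sketch below purely as an assumption, is that each selected term contributes its coefficient times its DTM weight, with positive and negative products summed separately, and that a binomial model converts the resulting linear predictor to a probability with the logistic function. The name `document_score` and the example coefficients are hypothetical.

```python
import math
from typing import Dict, Tuple


def document_score(
    term_weights: Dict[str, float],
    coefs: Dict[str, float],
    intercept: float = 0.0,
) -> Tuple[float, float, float]:
    """Return (positive contribution, negative contribution, probability).

    Assumption: contribution of a term = coefficient * DTM weight.
    Terms that the model did not select have no coefficient and add nothing.
    """
    positive = 0.0
    negative = 0.0
    for term, weight in term_weights.items():
        value = coefs.get(term, 0.0) * weight
        if value > 0:
            positive += value
        else:
            negative += value
    linear_predictor = intercept + positive + negative
    probability = 1.0 / (1.0 + math.exp(-linear_predictor))  # logistic link
    return positive, negative, probability


# Hypothetical coefficients for two selected terms.
coefs = {"crunchy": 2.0, "stale": -1.5}
doc = {"crunchy": 1.0, "stale": 1.0, "bag": 1.0}
pos, neg, prob = document_score(doc, coefs)
print(pos, neg)  # 2.0 -1.5
```

Under the same assumption, a mean contribution could be obtained by averaging `pos + neg` over the documents of interest.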
The Term Scores table lists each term that was selected by the fitted model, its coefficient from the model, its LogWorth, and the count of occurrences of the term in the corpus. If you select a row of the table, the context of the corresponding term appears in the text box below the table.
The text box shows the text of documents that are selected in the Document Scores table or the context of terms that are selected in the Term Scores table.
The Term Selection red triangle menu contains the following options:
Save Document Scores
(Available only when an analysis is selected in the Model Comparison table.) Saves the columns from the Document Scores table to new columns in the data table. The new columns contain the positive and negative contributions, as well as the predicted value for each document.
Save Term Score DTM
(Available only when an analysis is selected in the Model Comparison table.) Saves columns to the data table for each relevant term in the currently selected analysis. The columns contain the term scores for each document, using the Weighting specified in the Term Selection Term Settings.
Save Prediction Formulas
(Available only when an analysis is selected in the Model Comparison table.) Saves columns to the data table that contain the prediction formulas for the currently selected analysis.
Show Term Cloud
Shows or hides a word cloud in the Summary report. The word cloud shows the terms that were selected by the model in the currently selected analysis. The words are sized by the absolute value of their coefficients and colored by the sign of their coefficients.
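The sizing and coloring rule for the word cloud can be sketched in Python as follows. The linear scaling between a minimum and maximum font size and the color labels are assumptions made for illustration; the actual scaling used by the platform is not described here, and `cloud_attributes` is a hypothetical name.

```python
from typing import Dict, Tuple


def cloud_attributes(
    coefs: Dict[str, float],
    min_size: float = 10.0,
    max_size: float = 40.0,
) -> Dict[str, Tuple[float, str]]:
    """Map each term to (font size, color label).

    Size grows with the absolute value of the coefficient (assumed linear);
    the color label depends only on the sign of the coefficient.
    """
    largest = max((abs(c) for c in coefs.values()), default=0.0)
    attributes = {}
    for term, coef in coefs.items():
        scale = abs(coef) / largest if largest > 0 else 0.0
        size = min_size + scale * (max_size - min_size)
        if coef > 0:
            color = "positive"
        elif coef < 0:
            color = "negative"
        else:
            color = "zero"
        attributes[term] = (size, color)
    return attributes


# Hypothetical coefficients from a selected analysis.
cloud = cloud_attributes({"crunchy": 2.0, "stale": -1.0, "bag": 0.5})
print(cloud["stale"])  # (25.0, 'negative')
```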
Remove
Removes the Term Selection report from the Text Explorer report window.