The K Nearest Neighbors platform predicts a response value based on the responses of the k nearest rows. The k nearest rows to a given row are determined by identifying the k smallest Euclidean distances between the predictor values for that row and the predictor values for each of the other rows. For a continuous response, the predicted value is the average of the responses for the k nearest rows. For a categorical response, the predicted value is the most frequent response level for the k nearest neighbors. If two or more levels are tied as the most frequent levels, the predicted response is assigned by selecting one of these levels at random.
Note: Ties for the most frequent level of a categorical response are broken at random, so results from independent runs of the platform might differ. To obtain reproducible results in a script, add the JSL keyword Nonrandom to the function for a K Nearest Neighbors model.
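The prediction rule described above can be illustrated with a minimal Python sketch. It is not the platform's implementation. The function names (k_nearest, predict_continuous, predict_categorical), the optional rng argument, and the rule that equal distances are ordered by row index are all assumptions made for illustration. The sketch also applies no scaling or other preprocessing to the predictors.

```python
import math
import random
from collections import Counter


def k_nearest(rows, query_index, k):
    """Return the indices of the k rows with the smallest Euclidean
    distance to rows[query_index], excluding the query row itself.
    Assumption: equal distances are ordered by row index."""
    query = rows[query_index]
    distances = [
        (math.dist(query, other), i)
        for i, other in enumerate(rows)
        if i != query_index
    ]
    distances.sort()
    return [i for _, i in distances[:k]]


def predict_continuous(rows, responses, query_index, k):
    """Continuous response: average the responses of the k nearest rows."""
    neighbors = k_nearest(rows, query_index, k)
    return sum(responses[i] for i in neighbors) / len(neighbors)


def predict_categorical(rows, responses, query_index, k, rng=None):
    """Categorical response: most frequent level among the k nearest rows.
    Levels tied for most frequent are resolved by a random choice."""
    rng = rng or random.Random()
    neighbors = k_nearest(rows, query_index, k)
    counts = Counter(responses[i] for i in neighbors)
    top = max(counts.values())
    tied = sorted(level for level, n in counts.items() if n == top)
    return rng.choice(tied)


# Hypothetical data: five rows with two predictors each.
rows = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
values = [1.0, 2.0, 4.0, 10.0, 12.0]
levels = ["a", "a", "b", "b", "b"]

print(predict_continuous(rows, values, 0, 2))
print(predict_categorical(rows, levels, 0, 3))
# With k = 2 the two neighbors of row 0 have levels "a" and "b", so the
# result depends on the random choice; a seeded generator makes it repeatable.
print(predict_categorical(rows, levels, 0, 2, rng=random.Random(1)))
```

In this sketch, passing a seeded random.Random instance plays a role loosely similar to requesting reproducible results in a script: the same seed gives the same tie-breaking choice on every run.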
Note the following potential drawbacks of the k nearest neighbors method:
For more information about the k nearest neighbors method, see Hastie et al. (2009), Hand et al. (2001), and Shmueli et al. (2017).