WHY: Logistic model overpredicts probability in specific probability region


I have tried to find answers to my question by googling but haven't found anything relevant; if anyone could pass on some resources or relevant keywords, that would be brilliant, in the absence of an actual answer to my question.

My model predicts deforestation at pixel resolution using predictors measured both at the pixel level and at a coarser (community) level. The formula is as follows:

gam(formula = loss ~ ecozone + s(dist_river) + s(dist_prevdf) + s(allarea) +
      s(dist_road) + s(altitude) + s(slope) + s(aspect, bs = "cc") +
      s(pcARA, k = 3) + s(popdens) + s(pcpriorloss) + s(pcforestGFC),
    family = binomial, data = gfcpix,
    weights = weight/mean(weight),
    method = "REML", select = TRUE,
    knots = list(aspect = c(0, 360)))

Here, aspect is an angle and is therefore given a cyclic spline. The weights argument reflects the sampling design: loss pixels were oversampled by a factor of 5 because loss was very rare to begin with, and, because the maximum number of pixels sampled per unit of the coarser-resolution predictors was constrained, no-loss pixels were undersampled to varying degrees. Weights therefore differ between loss and no-loss pixels and across communities. Because our main question relates to a few predictors at the community level, I added those predictors directly rather than including community as a random effect. The total number of data rows in the model is 123288.
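For concreteness, here is one common way weights of this kind are constructed in R (a minimal sketch, not necessarily the construction used here): each sampled pixel is weighted by the inverse of its sampling fraction within its community-by-loss cell. The column names community and n_pixels_total (pixels available in that cell before sampling) are hypothetical; the question does not give the exact procedure.

# pixels actually drawn per community x loss cell in the sample
n_sampled <- ave(rep(1, nrow(gfcpix)), gfcpix$community, gfcpix$loss, FUN = sum)

# inverse sampling fraction per pixel; the gam() call above then rescales
# these via weights = weight/mean(weight) so they average to 1
gfcpix$weight <- gfcpix$n_pixels_total / n_sampled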

Following Faraway's instructions for evaluating logistic models, I grouped the predicted probabilities into intervals and compared each group with the corresponding proportion of 1s observed in the data:

[Figure: predicted vs. observed probabilities]

I multiplied the predicted probabilities by 5 so that they correspond to the sample-level rather than the population-level prevalence of 1s.
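As a rough illustration of this kind of binned calibration check (a minimal sketch, assuming a fitted model object fit and the data frame gfcpix with binary response loss; the 0.1 bin width is arbitrary and the additional scaling by 5 described above is omitted):

library(mgcv)

phat <- predict(fit, type = "response")            # predicted probabilities
bins <- cut(phat, breaks = seq(0, 1, by = 0.1),    # group predictions into intervals
            include.lowest = TRUE)

# observed proportion of loss (= 1) and mean predicted probability per bin
obs  <- tapply(gfcpix$loss, bins, mean)
pred <- tapply(phat, bins, mean)

plot(pred, obs, xlab = "Mean predicted probability",
     ylab = "Observed proportion of loss",
     xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1, lty = 2)                              # points below the line = overprediction

Points falling below the 1:1 line in the upper bins would correspond to the overprediction at high probabilities described below.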

I am unsure why the model overpredicts at high probabilities. Plotting binned residuals against individual predictors did not suggest any clear issues, though I assume this could have been one way of chasing down the problem. I tried adding some plausible interaction structures to the model, but they were neither significant nor did they change the predictions. Of course, I may be missing some important predictors that would help constrain the predictions. However, I am at a loss as to how to troubleshoot this issue. What is the most likely cause of the problem, and how could I test for it and for other possible causes? Simply saying "the model overpredicted loss at high probabilities" seems inadequate.
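For reference, the binned-residual check mentioned above could look roughly like this (a sketch assuming the same fit and gfcpix objects as above; dist_road is used purely as an example covariate and the 50 bins are arbitrary):

# response-scale residuals: observed 0/1 minus fitted probability
r <- residuals(fit, type = "response")

x    <- gfcpix$dist_road
brks <- unique(quantile(x, probs = seq(0, 1, length.out = 51), na.rm = TRUE))
bins <- cut(x, breaks = brks, include.lowest = TRUE)

# mean residual per bin of the predictor; systematic departures from zero
# in particular ranges of x would point to a misspecified smooth
plot(tapply(x, bins, mean), tapply(r, bins, mean),
     xlab = "dist_road (bin mean)", ylab = "Mean response residual")
abline(h = 0, lty = 2)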

Any help would be much appreciated.

I have not provided data or code in this case, as my question is more generic than my example; however, I can try to generate a similarly 'flawed' model with invented data if necessary.


