The number of violent crimes in each of the 58 most populous cities in the United States was recorded in 2015. It is suspected that the number of crimes is associated with temperature.

In this study, we will investigate the factors affecting the number of violent crimes. We have data on population, climate, unemployment rate, and median income. The climate is represented by the yearly temperature (the average of the daily averages) for each city in 2015.
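
Note: the code in this section assumes a few packages are already loaded. A suggested setup chunk (an assumption on my part, since the original analysis does not show its library calls; in particular, dist_chisq() is assumed to come from sjstats) is:

library(ggplot2)  # plots
library(dplyr)    # %>%, group_by(), summarise(), case_when()
library(tidyr)    # drop_na()
library(AER)      # dispersiontest()
library(MASS)     # glm.nb()
library(sjPlot)   # tab_model()
library(sjstats)  # dist_chisq() (assumed)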

# read data
crimes2015 = read.csv("https://pages.uwf.edu/acohen/teaching/datasets/crimes2015.csv")
summary(crimes2015)
##        X          Report.Year       City             Population     
##  Min.   : 1.00   Min.   :2015   Length:58          Min.   : 191992  
##  1st Qu.:15.25   1st Qu.:2015   Class :character   1st Qu.: 424103  
##  Median :29.50   Median :2015   Mode  :character   Median : 625936  
##  Mean   :29.50   Mean   :2015                      Mean   : 917607  
##  3rd Qu.:43.75   3rd Qu.:2015                      3rd Qu.: 866389  
##  Max.   :58.00   Max.   :2015                      Max.   :8550861  
##      Crimes       Temperature     Unemployment        Income      
##  Min.   :  626   Min.   :47.00   Min.   : 2.900   Min.   : 28099  
##  1st Qu.: 2875   1st Qu.:55.25   1st Qu.: 4.165   1st Qu.: 44756  
##  Median : 4536   Median :60.00   Median : 5.010   Median : 51175  
##  Mean   : 6981   Mean   :60.97   Mean   : 5.308   Mean   : 53548  
##  3rd Qu.: 7874   3rd Qu.:66.00   3rd Qu.: 5.965   3rd Qu.: 59314  
##  Max.   :50088   Max.   :78.00   Max.   :10.770   Max.   :103801

Let's first look at the relationship between Unemployment and Income, as I suspect they are dependent on each other.

ggplot(crimes2015, aes(y = Unemployment, x = Income/10000)) +
  geom_point() +
  geom_smooth(method = 'loess', formula = y ~ x, se = FALSE)

From the graph above we can infer that as income increases, unemployment decreases.
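
As a quick numeric check of this claim (a minimal sketch; if the trend in the plot holds, the correlation should come out negative):

cor(crimes2015$Income, crimes2015$Unemployment)  # expected to be negative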

Let's first compute the crime rate per 1,000 residents (CrimeRate); I'll use this as our response variable.

crimes2015$CrimeRate <- crimes2015$Crimes / crimes2015$Population * 1000 # rate per 1,000 residents

For the purpose of our analysis, I will convert Unemployment into a factor with three buckets (low, medium, and high), using its quartiles as break points; Income is bucketed the same way. I will also convert Temperature into a factor with four buckets.

df1 <- transform(crimes2015, 
                 TemperatureGroup=cut(Temperature,    # four temperature buckets
                                      breaks=c(-Inf, 50, 60, 70, Inf),
                                      labels=c('<50', '50', '60', '70')), 
                 IncomeGroup=cut(Income,              # breaks at the 1st and 3rd quartiles
                                 breaks=c(-Inf, 44756, 59314, Inf), 
                                 labels=c('Low', 'Medium', 'High')), 
                 UnemploymentGroup=cut(Unemployment,  # breaks at the 1st and 3rd quartiles
                                       breaks=c(-Inf, 4.165, 5.965, Inf), 
                                       labels=c('Low', 'Medium', 'High')))

First, we can view the distribution of the response variable CrimeRate by creating a histogram:

hist(df1$CrimeRate, main = "Histogram of CrimeRate", xlab = "CrimeRate")

The data is roughly bell-shaped but slightly left-skewed; it is not a perfectly normal distribution.

Let’s check out the mean() and var() of the dependent variable:

mean(df1$CrimeRate)
## [1] 8.328426
var(df1$CrimeRate)
## [1] 15.10534

The variance is much greater than the mean, which suggests that we will have over-dispersion in the model.
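
A quick way to quantify this, using the values above:

# ratio of variance to mean; equality (ratio = 1) is the Poisson assumption
var(df1$CrimeRate) / mean(df1$CrimeRate)  # ≈ 1.81, noticeably above 1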

Now, let's take a look at the interaction between Temperature and Unemployment.

(cleaned_data <- df1 %>% 
    group_by(TemperatureGroup, UnemploymentGroup) %>% 
    summarise(
      mean_CrimeRate = mean(CrimeRate, na.rm = TRUE)
    ) %>% 
    drop_na())

(labels <- cleaned_data %>% 
    filter(TemperatureGroup == "70") %>%  # right-hand end of each line
    mutate(
      label = case_when(
        UnemploymentGroup == "Low" ~ "Low",
        UnemploymentGroup == "Medium" ~ "Medium",
        UnemploymentGroup == "High" ~ "High"
      )
    ))  # presumably intended for direct line labels (e.g., with geom_text), though unused below

cleaned_data %>% 
  ggplot(aes(TemperatureGroup, mean_CrimeRate)) +
  geom_line(size = 1.2, aes(group = UnemploymentGroup, color = UnemploymentGroup)) +
  geom_point(size = 2.6, aes(color = UnemploymentGroup), shape = 15) +
  labs(
    title = "Interaction between Temperature and Unemployment",
    subtitle = paste0("Mean crime rate decreases as temperature increases.\n",
                      "This is true for all levels of unemployment except medium.\n",
                      "The medium-unemployment group shows a sudden increase in crime rate at temp 70."),
    caption = "",
    x = "Temperature",
    y = "Mean Crime Rate",
    colour = "Unemployment"
  ) 

Poisson Model

Let's consider a model with Temperature and Unemployment as predictors and log(Population) as an offset:

pois.m1=glm(Crimes ~ Temperature+Unemployment + offset(log(Population)),
           data=crimes2015,
           family = poisson(link = "log") )

And an alternative model that adds a Temperature × Unemployment interaction:

pois.m2=glm(Crimes ~ Temperature*Unemployment + offset(log(Population)),
            data=crimes2015,
            family = poisson(link = "log") )

Now let's check the effect of the independent variables when added sequentially. This is very useful and helps us remove unnecessary variables from the model.

anova(pois.m1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Crimes
## 
## Terms added sequentially (first to last)
## 
## 
##              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                            57      70096              
## Temperature   1   5603.4        56      64492 < 2.2e-16 ***
## Unemployment  1  10095.3        55      54397 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(pois.m2, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Crimes
## 
## Terms added sequentially (first to last)
## 
## 
##                          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                        57      70096              
## Temperature               1   5603.4        56      64492 < 2.2e-16 ***
## Unemployment              1  10095.3        55      54397 < 2.2e-16 ***
## Temperature:Unemployment  1    878.8        54      53518 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show that both variables (and, in pois.m2, their interaction) are significant, so we will keep them.

Now, let's compare these two models to see which one is preferred.

anova(pois.m1, pois.m2, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Crimes ~ Temperature + Unemployment + offset(log(Population))
## Model 2: Crimes ~ Temperature * Unemployment + offset(log(Population))
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        55      54397                          
## 2        54      53518  1   878.81 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA shows that the pois.m2 model is an improvement over pois.m1.

Since pois.m1 is nested within pois.m2, the deviance test above is valid; we can also compare the two models using the AIC:

c( "Model1"=AIC(pois.m1),"Model2"=AIC(pois.m2) )
##   Model1   Model2 
## 55003.01 54126.20

The AIC also suggests that pois.m2 produces better predictions, so we will use pois.m2 for our analysis.

Diagnostics:

par(mfrow=c(2,2))
plot(pois.m2)

(1) Checking the Systematic Component

Let's first check the systematic component of our model by looking at the top-left plot of the residuals against the predicted values. Two features of this plot are important: there should be no trend in the residuals as the fitted values change, and the spread of the residuals should be roughly constant. Also, if the link function is appropriate, the plot should be roughly linear; if a noticeable curvature is apparent, another choice of link function should be considered.

In this case, there is no specific trend in the residual chart, and the residuals appear almost evenly distributed with roughly constant variance, apart from a few exceptions towards the end.

(2) Checking the Random Component and Outliers

We will check whether the residuals are normally distributed using the normal Q-Q plot (top right). We see that the residuals are approximately normally distributed, with very few outliers.

Now, let's look at the spread of the deviance residuals.

boxplot(residuals(pois.m2, type="deviance"), horizontal = TRUE)

The median is slightly shifted towards the left, so the residuals are slightly right (or "positively") skewed. Deviance residuals are approximately normally distributed if the model is specified correctly. In our example, they show a little skewness, since the median is not quite zero.

Deviance goodness-of-fit test:

We can use the residual deviance to perform a goodness-of-fit test for the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal (saturated) model, where the predicted values are identical to the observed. If the residual deviance is small enough, the goodness-of-fit test will not be significant, indicating that the model fits the data. Under the null hypothesis, the residual deviance approximately follows a chi-squared distribution with the residual degrees of freedom.

Using the goodness-of-fit statistic, we will test the following hypotheses:

H0: the model fits
Ha: the model does not fit (or, some other model fits)

pchisq(pois.m2$deviance, pois.m2$df.residual,lower.tail = F)
## [1] 0

The p-value is essentially zero, so a residual deviance as large as or larger than what we observed is highly unlikely under the model pois.m2. We therefore reject the null hypothesis that the model fits: the goodness-of-fit chi-squared test is statistically significant, indicating lack of fit.
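
Before running a formal test, here is a quick estimate of the dispersion (a sketch; this Pearson-based estimator is the same one the quasi-Poisson fit below reports as its dispersion parameter):

# Pearson chi-squared statistic divided by the residual degrees of freedom;
# values far above 1 indicate overdispersion
sum(residuals(pois.m2, type = "pearson")^2) / pois.m2$df.residual  # ≈ 1045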

Now, let's run a dispersion test to check for overdispersion.

The standard Poisson GLM models the (conditional) mean E[y] = μ, which is assumed to equal the variance, Var[y] = μ. The dispersion test considers the alternative

Var[y] = μ + α⋅trafo(μ)

where overdispersion corresponds to α > 0 and underdispersion to α < 0. Setting trafo = 1 uses trafo(μ) = μ, i.e., a variance that grows linearly with the mean.

dispersiontest(pois.m2,trafo = 1,alternative = "greater")
## 
##  Overdispersion test
## 
## data:  pois.m2
## z = 4.8434, p-value = 6.382e-07
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 971.6663

Since the estimated α (≈ 972) is significantly greater than 0 (p ≈ 6.4e-07), there is overdispersion.

Since we have established that there is overdispersion, there are two common ways to model it:

- Quasi-Poisson (not an EDM; Var[y] = φµ)
- Negative binomial (an EDM; Var[y] = µ + ψµ^2)

Quasi-Poisson Model

The crime count model from above needs to be refit using the quasi-Poisson family, which will account for the greater variance in the data.

qpois = glm(Crimes ~ Temperature*Unemployment + offset(log(Population)),
            data=crimes2015,
            family = quasipoisson(link = "log"))

Diagnostics:

Let's first look at the summary to see how the coefficients and standard errors compare with the previous model.

summary(qpois)
## 
## Call:
## glm(formula = Crimes ~ Temperature * Unemployment + offset(log(Population)), 
##     family = quasipoisson(link = "log"), data = crimes2015)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -83.586  -15.891    0.171   19.008   68.771  
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -6.019609   1.431166  -4.206 9.85e-05 ***
## Temperature               0.009448   0.023916   0.395    0.694    
## Unemployment              0.345641   0.254693   1.357    0.180    
## Temperature:Unemployment -0.004011   0.004350  -0.922    0.361    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1044.818)
## 
##     Null deviance: 70096  on 57  degrees of freedom
## Residual deviance: 53518  on 54  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
s1<- summary(pois.m2)
coef1 <- s1$coefficients[ , 1]
se.coef1 <- s1$coefficients[ , 2]

s2<- summary(qpois)
coef2 <- s2$coefficients[ , 1]
se.coef2 <- s2$coefficients[ , 2]

cbind(coef1, se.coef1, coef2, se.coef2)
##                                 coef1     se.coef1        coef2   se.coef2
## (Intercept)              -6.019608644 0.0442761142 -6.019608644 1.43116557
## Temperature               0.009448342 0.0007398788  0.009448342 0.02391558
## Unemployment              0.345640994 0.0078794544  0.345640994 0.25469272
## Temperature:Unemployment -0.004010685 0.0001345752 -0.004010685 0.00434996

As we can see, the coefficients are unchanged. However, the standard errors increased substantially, and no terms are significant anymore.

The dispersion parameter, which was forced to be 1 in our last model (pois.m2), is allowed to be estimated here. In fact, it is estimated at 1044.818.

This parameter tells us how many times larger the variance is than the mean. Since our dispersion is greater than one, it turns out the conditional variance is actually larger than the conditional mean. We have over-dispersion.
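
A quick sanity check, reusing the se.coef1 and se.coef2 vectors computed above: the quasi-Poisson standard errors are simply the Poisson ones inflated by the square root of the estimated dispersion.

phi <- summary(qpois)$dispersion                          # 1044.818
cbind(inflated = se.coef1 * sqrt(phi), quasi = se.coef2)  # the two columns should match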

Before determining that the quasi-Poisson model “qpois” is appropriate, we will check to see if the variance of the residuals is proportional to the mean. We begin this check by creating a new dataframe which includes the residuals and fitted values.

p1Diag <- data.frame(crimes2015,
                     link = predict(pois.m2, type = "link"),
                     fit = predict(pois.m2, type = "response"),
                     pearson = residuals(pois.m2, type = "pearson"),
                     resid = residuals(pois.m2, type = "response"),
                     residSqr = residuals(pois.m2, type = "response")^2
                     )

We will plot the square of the residual against the predicted mean. We will add to this scatter plot a black line for the Poisson-assumed variance (the mean), a dashed green line for the quasi-Poisson assumed variance, and a blue curve for the smoothed mean of the square of the residual.

ggplot(p1Diag, aes(x = fit, y = residSqr)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, size = 1) +                       # Poisson: Var = mean
  geom_abline(intercept = 0, slope = summary(qpois)$dispersion,
              color = "darkgreen", linetype = 2, size = 1) +              # quasi-Poisson: Var = phi * mean
  geom_smooth(method = 'loess', formula = y ~ x, se = FALSE, size = 1) +  # smoothed mean of squared residuals
  theme_bw() 

Ideally the blue curve would be straight and it would be collinear with the green line for the quasi-Poisson variance. The greater the deviation from the green line the greater the concern is about the proportionality of the variance to the mean. Here we have some indication that the variance may not be proportional to the mean.
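
For comparison, a quadratic, negative-binomial-style variance function Var[y] = μ + μ²/θ can be overlaid as well (a sketch; θ ≈ 6.16 is taken from the glm.nb fit later in this analysis):

ggplot(p1Diag, aes(x = fit, y = residSqr)) +
  geom_point() +
  geom_abline(intercept = 0, slope = summary(qpois)$dispersion,
              color = "darkgreen", linetype = 2) +   # quasi-Poisson variance
  stat_function(fun = function(mu) mu + mu^2 / 6.16,
                color = "red") +                     # negative binomial variance
  theme_bw()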

Now let's take a look at the residual charts for this model.

par(mfrow=c(2,2))
plot(qpois)

Nothing has apparently changed here compared with the previous model; these plots don't give us any additional information.

The quasi-Poisson model has no AIC and no likelihood, since it does not correspond to a full probability distribution, so we cannot perform a deviance goodness-of-fit test here.
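
We can confirm this directly; both calls return NA for quasi-families:

AIC(qpois)     # NA: the quasi-likelihood defines no AIC
logLik(qpois)  # the log-likelihood is NA as well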

Negative Binomial Model

We will now look to see if a negative binomial model might be a better fit.

nb = glm.nb(Crimes ~ Temperature*Unemployment + offset(log(Population)), data=crimes2015)

Diagnostics:

Let's first look at the summary to see how the coefficients and standard errors compare with the previous models.

summary(nb)
## 
## Call:
## glm.nb(formula = Crimes ~ Temperature * Unemployment + offset(log(Population)), 
##     data = crimes2015, init.theta = 6.160278675, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2327  -0.7348  -0.1445   0.5340   2.3455  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)   
## (Intercept)              -4.3822910  1.4518065  -3.019  0.00254 **
## Temperature              -0.0172933  0.0244708  -0.707  0.47976   
## Unemployment              0.1344794  0.2698312   0.498  0.61821   
## Temperature:Unemployment -0.0002946  0.0045820  -0.064  0.94874   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(6.1603) family taken to be 1)
## 
##     Null deviance: 80.832  on 57  degrees of freedom
## Residual deviance: 59.581  on 54  degrees of freedom
## AIC: 1058.9
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  6.16 
##           Std. Err.:  1.12 
## 
##  2 x log-likelihood:  -1048.893
s1<- summary(pois.m2)
coef1 <- s1$coefficients[ , 1]
se.coef1 <- s1$coefficients[ , 2]

s2<- summary(qpois)
coef2 <- s2$coefficients[ , 1]
se.coef2 <- s2$coefficients[ , 2]

s3<- summary(nb)
coef3 <- s3$coefficients[ , 1]
se.coef3 <- s3$coefficients[ , 2]

cbind(coef1, se.coef1, coef2, se.coef2, coef3, se.coef3)
##                                 coef1     se.coef1        coef2   se.coef2
## (Intercept)              -6.019608644 0.0442761142 -6.019608644 1.43116557
## Temperature               0.009448342 0.0007398788  0.009448342 0.02391558
## Unemployment              0.345640994 0.0078794544  0.345640994 0.25469272
## Temperature:Unemployment -0.004010685 0.0001345752 -0.004010685 0.00434996
##                                  coef3    se.coef3
## (Intercept)              -4.3822910483 1.451806504
## Temperature              -0.0172932501 0.024470804
## Unemployment              0.1344794118 0.269831212
## Temperature:Unemployment -0.0002945832 0.004582017

We can see that the coefficients have changed and the standard errors increased compared with the Poisson model.

Also, the residual charts below do not improve much and look similar to those of the previous models.

par(mfrow=c(2,2))
plot(nb)

But now we can perform a goodness-of-fit test to check whether this model is adequate.

Deviance goodness-of-fit test:

pchisq(nb$deviance, nb$df.residual,lower.tail = F)
## [1] 0.2798634
dist_chisq(chi2 = nb$deviance, deg.f = nb$df.residual)

The p-value (≈ 0.28) is greater than 5%, so we fail to reject the null hypothesis and conclude that the model fits reasonably well and is adequate.
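
As a further check (the numbers come from the model summaries above), the AIC also strongly favors the negative binomial over the Poisson fit:

c(Poisson = AIC(pois.m2), NegBin = AIC(nb))  # 54126.20 vs. 1058.9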

We have chosen the negative binomial model (nb) as our final model. We can interpret a regression coefficient as follows: for a one-unit change in a predictor variable, the log of the expected count changes by the respective regression coefficient, holding the other predictors constant.
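
These incidence rate ratios can also be computed directly from the model (a minimal sketch; confint() profiles the likelihood, so the values should match the table below up to rounding):

# exponentiate the coefficients and their 95% CIs to get incidence rate ratios
exp(cbind(IRR = coef(nb), confint(nb)))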

tab_model(nb)

                             Crimes
Predictors                   Incidence Rate Ratios   CI            p
(Intercept)                  0.01                    0.00 – 0.23   0.003
Temperature                  0.98                    0.94 – 1.03   0.480
Unemployment                 1.14                    0.67 – 1.99   0.618
Temperature * Unemployment   1.00                    0.99 – 1.01   0.949
Observations                 58
R2 Nagelkerke                0.408
#percent decrease in expected crimes for a 1-degree increase in temperature
(1-exp(nb$coefficients[2]))*100
## Temperature 
##    1.714458
#percent decrease in expected crimes for a 1-unit increase in unemployment
#(the negative value means an increase of about 14.4%)
(1-exp(nb$coefficients[3]))*100
## Unemployment 
##    -14.39411

In other words, each additional degree of average temperature is associated with about 1.7% fewer expected crimes, and each additional point of unemployment with about 14.4% more, although neither term is statistically significant in this model.
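
Finally, as a usage illustration (hypothetical inputs, not a city from the dataset), the fitted model can produce an expected crime count through the population offset:

# hypothetical city: 500,000 residents, average temperature of 65, unemployment of 6%
newcity <- data.frame(Population = 500000, Temperature = 65, Unemployment = 6)
predict(nb, newdata = newcity, type = "response")  # expected number of crimes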