The number of violent crime in each of the 58 most populated cities in the United States of America is recorded in 2015. It is suspected that the number of crimes is associated with the temperature.
In this study, we will investigate the factors affecting the number of violent crimes. We have data on population, climate, unemployment rate, and median income. The climate is presented by the yearly temperature (average of the daily averages) for each city in 2015.
# read data
crimes2015 =read.csv("https://pages.uwf.edu/acohen/teaching/datasets/crimes2015.csv")
summary(crimes2015)
## X Report.Year City Population
## Min. : 1.00 Min. :2015 Length:58 Min. : 191992
## 1st Qu.:15.25 1st Qu.:2015 Class :character 1st Qu.: 424103
## Median :29.50 Median :2015 Mode :character Median : 625936
## Mean :29.50 Mean :2015 Mean : 917607
## 3rd Qu.:43.75 3rd Qu.:2015 3rd Qu.: 866389
## Max. :58.00 Max. :2015 Max. :8550861
## Crimes Temperature Unemployment Income
## Min. : 626 Min. :47.00 Min. : 2.900 Min. : 28099
## 1st Qu.: 2875 1st Qu.:55.25 1st Qu.: 4.165 1st Qu.: 44756
## Median : 4536 Median :60.00 Median : 5.010 Median : 51175
## Mean : 6981 Mean :60.97 Mean : 5.308 Mean : 53548
## 3rd Qu.: 7874 3rd Qu.:66.00 3rd Qu.: 5.965 3rd Qu.: 59314
## Max. :50088 Max. :78.00 Max. :10.770 Max. :103801
Lets first look at the relationship between Unemployment and Income as I suspect they must be dependent on each other.
ggplot(crimes2015, aes(y=Unemployment, x=Income/10000))+geom_point()+geom_smooth(method = 'loess', formula = "y~x", se=FALSE)
From the above graph we can infer that as income increases for the population, the unemployment decreases.
Lets first save the Crime Rate (per 1000). I’ll use this as our response variable.
crimes2015$CrimeRate <- crimes2015$Crime / crimes2015$Population * 1000 #per 1000
For the purpose of our analysis, I will be converting Unemployment into factor and three buckets will be created for low, medium and high unemployment. Also, I will convert temperature into factor and put them into five buckets.
df1 <- transform(crimes2015,
TemperatureGroup=cut(Temperature,
breaks=c(-Inf, 50, 60, 70, Inf),
labels=c('<50', '50', '60', '70')),
IncomeGroup=cut(Income, breaks=c(-Inf,44756, 59314, Inf),
labels=c('Low', 'Medium', 'High')),
UnemploymentGroup=cut(Unemployment,
breaks=c(-Inf,4.165, 5.965, Inf),
labels=c('Low', 'Medium', 'High')))
First, we can view the response variable CrimeRate data continuity by creating a histogram:
hist(df1$CrimeRate, main = paste("Histogram of CrimeRate"), xlab="CrimeRate")
Clearly, the data is almost in the form of a bell curve but slightly left skewed and not a perfect normal distribution.
Let’s check out the mean() and var() of the dependent variable:
mean(df1$CrimeRate)
## [1] 8.328426
var(df1$CrimeRate)
## [1] 15.10534
The variance is much greater than the mean, which suggests that we will have over-dispersion in the model.
Now, lets take a look at the interaction between Temperature and Unemployment
(cleaned_data <- df1 %>%
group_by(TemperatureGroup, UnemploymentGroup) %>%
summarise(
mean_CrimeRate = mean(CrimeRate, na.rm = TRUE)
) %>%
drop_na())
(labels <- cleaned_data %>%
filter(TemperatureGroup == 70) %>%
mutate(
label = case_when(
UnemploymentGroup == "Low" ~ "Low",
UnemploymentGroup == "Medium" ~ "Medium",
UnemploymentGroup == "High" ~ "High"
)
))
cleaned_data %>% ggplot(aes(TemperatureGroup, mean_CrimeRate)) + geom_line(size = 1.2, aes(group = UnemploymentGroup, color = UnemploymentGroup)) +
geom_point(size = 2.6, aes(color = UnemploymentGroup), shape = 15) +
labs(
title = "Interaction between Temperature and Unemployment",
subtitle = paste0("Mean crime rate decreases with increase in temperature.\n",
"This is true for all levels of unemployment except the medium.\n",
"Population with medium unemployment has a sudden increase in crime rate at temp 70. "),
caption = "",
x = "Temprature",
y = "mean Crime Rate",
colour="Unemployment"
)
Poisson model
Lets consider the model
pois.m1=glm(Crimes ~ Temperature+Unemployment + offset(log(Population)),
data=crimes2015,
family = poisson(link = "log") )
And an alternative model
pois.m2=glm(Crimes ~ Temperature*Unemployment + offset(log(Population)),
data=crimes2015,
family = poisson(link = "log") )
Now lets check the effect of independent variables added sequentially. This is very useful and help us remove unnecessary variables in the model.
anova(pois.m1, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Crimes
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 57 70096
## Temperature 1 5603.4 56 64492 < 2.2e-16 ***
## Unemployment 1 10095.3 55 54397 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(pois.m2, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Crimes
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 57 70096
## Temperature 1 5603.4 56 64492 < 2.2e-16 ***
## Unemployment 1 10095.3 55 54397 < 2.2e-16 ***
## Temperature:Unemployment 1 878.8 54 53518 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result shows that both variables are significant in both the models so we will keep them.
Now, lets compare these two models to see which one is preferred
anova(pois.m1, pois.m2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: Crimes ~ Temperature + Unemployment + offset(log(Population))
## Model 2: Crimes ~ Temperature * Unemployment + offset(log(Population))
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 55 54397
## 2 54 53518 1 878.81 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova shows that the pois.m2 model is an improvement over pois.m1
Since the models are not nested, we compare the two models using the AIC:
c( "Model1"=AIC(pois.m1),"Model2"=AIC(pois.m2) )
## Model1 Model2
## 55003.01 54126.20
The AIC suggests the model pois.m2 produces a better predictions so we will use pois.m2 for our analysis
Diagnostics:
par(mfrow=c(2,2))
plot(pois.m2)
(1) Checking the Systematic Component Let’s first check the systematic component of our model:
Look at the top-left plot of the residuals against the predicted values.
Looking for two features of the plots which are important:
Trends: Any trends appearing in these plots indicate that the systematic component can be improved. This could mean changing the link function, adding extra explanatory variables, or transforming the explanatory variables.
Constant variation: If the random component is correct (that is, the correct edm is used), the variance of the points is approximately constant.
Also, if the link function is appropriate, then the plot should be roughly linear. If a noticeable curvature is apparent in the plot, then another choice of link function should be considered.
In this case, there is no specific trend observed in the residual chart and residual appears to be almost evenly distributed with almost constant variance except few exceptions towards the end.
(2) Checking the Random Component and Outliers We will check if residuals are normally distributed.
We see that the residuals are almost normally distributed with very few outliers.
Now, lets look at the spread of the deviance.
boxplot(residuals(pois.m2, typr="deviance"), horizontal = TRUE)
The median is slightly shifted towards left so the residuals are slightly right (or “positively”) skewed. Deviance residuals are approximately normally distributed if the model is specified correctly. In our example, it shows a little bit of skeweness since median is not quite zero.
Deviance goodness-of-fit test:
We can use the residual deviance to perform a goodness of fit test for the overall model. The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model where the predicted values are identical to the observed. Therefore, if the residual difference is small enough, the goodness of fit test will not be significant, indicating that the model fits the data.
Using goodness-of-fit statistic we will test the following hypothesis:
H0: the model fits Ha: the model does not fir (or, some other model fits)
pchisq(pois.m2$deviance, pois.m2$df.residual,lower.tail = F)
## [1] 0
The above p-value suggests that a residual deviance as large or larger than what we observed under the model in pois.m2 is highly likely! So we reject the null hypothesis that the model fits. This suggests that the model is not adequate or lack of fit. We reject the null hypothesis and conclude that the model doesn’t fits reasonably well because the goodness-of-fit chi-squared test is statistically significant.
Now, lets do a dispersion test to check for overdispersion
The standard Poisson GLM models the (conditional) mean E[y]=μ which is assumed to be equal to the variance VAR[y]=μ.
VAR[y]=μ+α⋅trafo(μ)
Overdispersion corresponds to α > 0 and underdispersion to α < 0
dispersiontest(pois.m2,trafo = 1,alternative = "greater")
##
## Overdispersion test
##
## data: pois.m2
## z = 4.8434, p-value = 6.382e-07
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 971.6663
Since, α (971) is greater than 0, there is an overdispersion.
Since, we have established that there is an overdispersion, there are two ways to model the overdispersion - Quasi-Poisson (not an EDM where Var[y] = φµ) - Negative Binomial (EDM where Var[y] = µ + ψµ^2)
Quasi-Poisson Model The Crime count model from above needs to be fit using the quasi-Poisson family, which will account for the greater variance in the data.
qpois = glm(Crimes ~ Temperature*Unemployment + offset(log(Population)),
data=crimes2015,
family = quasipoisson(link = "log"))
Diagnostics:
Lets first look at the summary to see how the coefficients and std. error compares with the previous model.
summary(qpois)
##
## Call:
## glm(formula = Crimes ~ Temperature * Unemployment + offset(log(Population)),
## family = quasipoisson(link = "log"), data = crimes2015)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -83.586 -15.891 0.171 19.008 68.771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.019609 1.431166 -4.206 9.85e-05 ***
## Temperature 0.009448 0.023916 0.395 0.694
## Unemployment 0.345641 0.254693 1.357 0.180
## Temperature:Unemployment -0.004011 0.004350 -0.922 0.361
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 1044.818)
##
## Null deviance: 70096 on 57 degrees of freedom
## Residual deviance: 53518 on 54 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
s1<- summary(pois.m2)
coef1 <- s1$coefficients[ , 1]
se.coef1 <- s1$coefficients[ , 2]
s2<- summary(qpois)
coef2 <- s2$coefficients[ , 1]
se.coef2 <- s2$coefficients[ , 2]
cbind(coef1, se.coef1, coef2, se.coef2)
## coef1 se.coef1 coef2 se.coef2
## (Intercept) -6.019608644 0.0442761142 -6.019608644 1.43116557
## Temperature 0.009448342 0.0007398788 0.009448342 0.02391558
## Unemployment 0.345640994 0.0078794544 0.345640994 0.25469272
## Temperature:Unemployment -0.004010685 0.0001345752 -0.004010685 0.00434996
As we can see, the coefficients are almost unchanged. However, the std. error increased significantly. Also, no terms are significant any more.
The dispersion parameter, which was forced to be 1 in our last model (pois.m2), is allowed to be estimated here. In fact, it is estimated at 1044.818.
This parameter tells us how many times larger the variance is than the mean. Since our dispersion is greater than one, it turns out the conditional variance is actually larger than the conditional mean. We have over-dispersion.
Before determining that the quasi-Poisson model “qpois” is appropriate, we will check to see if the variance of the residuals is proportional to the mean. We begin this check by creating a new dataframe which includes the residuals and fitted values.
p1Diag <- data.frame(crimes2015,
link = predict(pois.m2, type = "link"),
fit = predict(pois.m2, type = "response"),
pearson = residuals(pois.m2, type = "pearson"),
resid = residuals(pois.m2, type = "response"),
residSqr = residuals(pois.m2, type = "response")^2
)
We will plot the square of the residual against the predicted mean. We will add to this scatter plot a black line for the Poisson-assumed variance (the mean), a dashed green line for the quasi-Poisson assumed variance, and a blue curve for the smoothed mean of the square of the residual.
ggplot(p1Diag, aes(x = fit, y = residSqr)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, size = 1) +
geom_abline(intercept = 0, slope = summary(qpois)$dispersion,
color = "darkgreen", linetype = 2, size = 1) +
geom_smooth(method = 'loess', formula='y~x',se = F, size = 1) +
theme_bw()
Ideally the blue curve would be straight and it would be collinear with the green line for the quasi-Poisson variance. The greater the deviation from the green line the greater the concern is about the proportionality of the variance to the mean. Here we have some indication that the variance may not be proportional to the mean.
Now lets, take a look at the autoplot (residual chart)
par(mfrow=c(2,2))
plot(qpois)
Nothing apparently changed here as compared to the previous model. Its not giving us any additional information about our model.
Quasi-poisson model has no AIC and no likelihood as it has no probability distribution. I think we can’t do goodness of fit here.
Negative Binomial Model
We will now look to see if a negative binomial model might be a better fit.
nb = glm.nb(Crimes ~ Temperature*Unemployment + offset(log(Population)), data=crimes2015)
Diagnostics:
Lets first look at the summary to see how the coefficients and std. error compares with the previous model.
summary(nb)
##
## Call:
## glm.nb(formula = Crimes ~ Temperature * Unemployment + offset(log(Population)),
## data = crimes2015, init.theta = 6.160278675, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2327 -0.7348 -0.1445 0.5340 2.3455
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.3822910 1.4518065 -3.019 0.00254 **
## Temperature -0.0172933 0.0244708 -0.707 0.47976
## Unemployment 0.1344794 0.2698312 0.498 0.61821
## Temperature:Unemployment -0.0002946 0.0045820 -0.064 0.94874
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(6.1603) family taken to be 1)
##
## Null deviance: 80.832 on 57 degrees of freedom
## Residual deviance: 59.581 on 54 degrees of freedom
## AIC: 1058.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 6.16
## Std. Err.: 1.12
##
## 2 x log-likelihood: -1048.893
s1<- summary(pois.m2)
coef1 <- s1$coefficients[ , 1]
se.coef1 <- s1$coefficients[ , 2]
s2<- summary(qpois)
coef2 <- s2$coefficients[ , 1]
se.coef2 <- s2$coefficients[ , 2]
s3<- summary(nb)
coef3 <- s3$coefficients[ , 1]
se.coef3 <- s3$coefficients[ , 2]
cbind(coef1, se.coef1, coef2, se.coef2, coef3, se.coef3)
## coef1 se.coef1 coef2 se.coef2
## (Intercept) -6.019608644 0.0442761142 -6.019608644 1.43116557
## Temperature 0.009448342 0.0007398788 0.009448342 0.02391558
## Unemployment 0.345640994 0.0078794544 0.345640994 0.25469272
## Temperature:Unemployment -0.004010685 0.0001345752 -0.004010685 0.00434996
## coef3 se.coef3
## (Intercept) -4.3822910483 1.451806504
## Temperature -0.0172932501 0.024470804
## Unemployment 0.1344794118 0.269831212
## Temperature:Unemployment -0.0002945832 0.004582017
We can see the coefficients have changed and std. errors increased as compared to poisson model.
Also, the residual chart below does not improve much and is almost similar to previous models.
par(mfrow=c(2,2))
plot(nb)
But now we can do a goodness of fit test to check if this model is adequate.
Deviance goodness-of-fit test:
pchisq(nb$deviance, nb$df.residual,lower.tail = F)
## [1] 0.2798634
dist_chisq(chi2 = nb$deviance, deg.f = nb$df.residual)
The p-value is greater than 5% so we cannot reject the null hypothesis and conclude that the model fits reasonably well and is adequate.
We have chosen our final best model as the negative binomial model (nb). We can interpret the regression coefficient as follows: for a one unit change in the predictor variable, the difference in the logs of expected counts is expected to change by the respective regression coefficient, given the other predictor variables in the model are held constant.
tab_model(nb)
Crimes | |||
---|---|---|---|
Predictors | Incidence Rate Ratios | CI | p |
(Intercept) | 0.01 | 0.00 – 0.23 | 0.003 |
Temperature | 0.98 | 0.94 – 1.03 | 0.480 |
Unemployment | 1.14 | 0.67 – 1.99 | 0.618 |
Temperature * Unemployment |
1.00 | 0.99 – 1.01 | 0.949 |
Observations | 58 | ||
R2 Nagelkerke | 0.408 |
#change in crimeRate for 1 unit increase in temp
(1-exp(nb$coefficients[2]))*100
## Temperature
## 1.714458
#change in crimeRate for 1 unit increase in unemployment
(1-exp(nb$coefficients[3]))*100
## Unemployment
## -14.39411
For every 1 unit increase in the yearly temperature, the crime rate decreases by 1.71% considering the unemployment is unchanged.
For every 1 unit increase in the unemployment, the crime rate increases by 14.4% considering the temperature remains unchanged.