The number of violent crime in each of the 58 most populated cities in the United States of America is recorded in 2015. It is suspected that the number of crimes is associated with the temperature.

In this study, we will investigate the factors affecting the number of violent crimes. We have data on population, climate, unemployment rate, and median income. The climate is presented by the yearly temperature (average of the daily averages) for each city in 2015.

```
# read data
crimes2015 =read.csv("https://pages.uwf.edu/acohen/teaching/datasets/crimes2015.csv")
summary(crimes2015)
```

```
## X Report.Year City Population
## Min. : 1.00 Min. :2015 Length:58 Min. : 191992
## 1st Qu.:15.25 1st Qu.:2015 Class :character 1st Qu.: 424103
## Median :29.50 Median :2015 Mode :character Median : 625936
## Mean :29.50 Mean :2015 Mean : 917607
## 3rd Qu.:43.75 3rd Qu.:2015 3rd Qu.: 866389
## Max. :58.00 Max. :2015 Max. :8550861
## Crimes Temperature Unemployment Income
## Min. : 626 Min. :47.00 Min. : 2.900 Min. : 28099
## 1st Qu.: 2875 1st Qu.:55.25 1st Qu.: 4.165 1st Qu.: 44756
## Median : 4536 Median :60.00 Median : 5.010 Median : 51175
## Mean : 6981 Mean :60.97 Mean : 5.308 Mean : 53548
## 3rd Qu.: 7874 3rd Qu.:66.00 3rd Qu.: 5.965 3rd Qu.: 59314
## Max. :50088 Max. :78.00 Max. :10.770 Max. :103801
```

Lets first look at the relationship between Unemployment and Income as I suspect they must be dependent on each other.

`ggplot(crimes2015, aes(y=Unemployment, x=Income/10000))+geom_point()+geom_smooth(method = 'loess', formula = "y~x", se=FALSE)`

From the above graph we can infer that as income increases for the population, the unemployment decreases.

Lets first save the Crime Rate (per 1000). Iâ€™ll use this as our response variable.

`crimes2015$CrimeRate <- crimes2015$Crime / crimes2015$Population * 1000 #per 1000`

For the purpose of our analysis, I will be converting Unemployment into factor and three buckets will be created for low, medium and high unemployment. Also, I will convert temperature into factor and put them into five buckets.

```
df1 <- transform(crimes2015,
TemperatureGroup=cut(Temperature,
breaks=c(-Inf, 50, 60, 70, Inf),
labels=c('<50', '50', '60', '70')),
IncomeGroup=cut(Income, breaks=c(-Inf,44756, 59314, Inf),
labels=c('Low', 'Medium', 'High')),
UnemploymentGroup=cut(Unemployment,
breaks=c(-Inf,4.165, 5.965, Inf),
labels=c('Low', 'Medium', 'High')))
```

First, we can view the response variable CrimeRate data continuity by creating a histogram:

`hist(df1$CrimeRate, main = paste("Histogram of CrimeRate"), xlab="CrimeRate")`

Clearly, the data is almost in the form of a bell curve but slightly left skewed and not a perfect normal distribution.

Letâ€™s check out the mean() and var() of the dependent variable:

`mean(df1$CrimeRate)`

`## [1] 8.328426`

`var(df1$CrimeRate)`

`## [1] 15.10534`

The variance is much greater than the mean, which suggests that we will have over-dispersion in the model.

Now, lets take a look at the interaction between Temperature and Unemployment

```
(cleaned_data <- df1 %>%
group_by(TemperatureGroup, UnemploymentGroup) %>%
summarise(
mean_CrimeRate = mean(CrimeRate, na.rm = TRUE)
) %>%
drop_na())
(labels <- cleaned_data %>%
filter(TemperatureGroup == 70) %>%
mutate(
label = case_when(
UnemploymentGroup == "Low" ~ "Low",
UnemploymentGroup == "Medium" ~ "Medium",
UnemploymentGroup == "High" ~ "High"
)
))
cleaned_data %>% ggplot(aes(TemperatureGroup, mean_CrimeRate)) + geom_line(size = 1.2, aes(group = UnemploymentGroup, color = UnemploymentGroup)) +
geom_point(size = 2.6, aes(color = UnemploymentGroup), shape = 15) +
labs(
title = "Interaction between Temperature and Unemployment",
subtitle = paste0("Mean crime rate decreases with increase in temperature.\n",
"This is true for all levels of unemployment except the medium.\n",
"Population with medium unemployment has a sudden increase in crime rate at temp 70. "),
caption = "",
x = "Temprature",
y = "mean Crime Rate",
colour="Unemployment"
)
```

**Poisson
model**

Lets consider the model

```
pois.m1=glm(Crimes ~ Temperature+Unemployment + offset(log(Population)),
data=crimes2015,
family = poisson(link = "log") )
```

And an alternative model

```
pois.m2=glm(Crimes ~ Temperature*Unemployment + offset(log(Population)),
data=crimes2015,
family = poisson(link = "log") )
```

Now lets check the effect of independent variables added sequentially. This is very useful and help us remove unnecessary variables in the model.

`anova(pois.m1, test="Chisq")`

```
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Crimes
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 57 70096
## Temperature 1 5603.4 56 64492 < 2.2e-16 ***
## Unemployment 1 10095.3 55 54397 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(pois.m2, test="Chisq")`

```
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Crimes
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 57 70096
## Temperature 1 5603.4 56 64492 < 2.2e-16 ***
## Unemployment 1 10095.3 55 54397 < 2.2e-16 ***
## Temperature:Unemployment 1 878.8 54 53518 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The result shows that both variables are significant in both the models so we will keep them.

Now, lets compare these two models to see which one is preferred

`anova(pois.m1, pois.m2, test="Chisq")`

```
## Analysis of Deviance Table
##
## Model 1: Crimes ~ Temperature + Unemployment + offset(log(Population))
## Model 2: Crimes ~ Temperature * Unemployment + offset(log(Population))
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 55 54397
## 2 54 53518 1 878.81 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Anova shows that the pois.m2 model is an improvement over pois.m1

Since the models are not nested, we compare the two models using the AIC:

`c( "Model1"=AIC(pois.m1),"Model2"=AIC(pois.m2) )`

```
## Model1 Model2
## 55003.01 54126.20
```

The AIC suggests the model pois.m2 produces a better predictions so we will use pois.m2 for our analysis

**Diagnostics:**

```
par(mfrow=c(2,2))
plot(pois.m2)
```