The number of violent crime in each of the 58 most populated cities in the United States of America is recorded in 2015. It is suspected that the number of crimes is associated with the temperature.

In this study, we will investigate the factors affecting the number of violent crimes. We have data on population, climate, unemployment rate, and median income. The climate is presented by the yearly temperature (average of the daily averages) for each city in 2015.

# read data
crimes2015 =read.csv("https://pages.uwf.edu/acohen/teaching/datasets/crimes2015.csv")
summary(crimes2015)
##        X          Report.Year       City             Population     
##  Min.   : 1.00   Min.   :2015   Length:58          Min.   : 191992  
##  1st Qu.:15.25   1st Qu.:2015   Class :character   1st Qu.: 424103  
##  Median :29.50   Median :2015   Mode  :character   Median : 625936  
##  Mean   :29.50   Mean   :2015                      Mean   : 917607  
##  3rd Qu.:43.75   3rd Qu.:2015                      3rd Qu.: 866389  
##  Max.   :58.00   Max.   :2015                      Max.   :8550861  
##      Crimes       Temperature     Unemployment        Income      
##  Min.   :  626   Min.   :47.00   Min.   : 2.900   Min.   : 28099  
##  1st Qu.: 2875   1st Qu.:55.25   1st Qu.: 4.165   1st Qu.: 44756  
##  Median : 4536   Median :60.00   Median : 5.010   Median : 51175  
##  Mean   : 6981   Mean   :60.97   Mean   : 5.308   Mean   : 53548  
##  3rd Qu.: 7874   3rd Qu.:66.00   3rd Qu.: 5.965   3rd Qu.: 59314  
##  Max.   :50088   Max.   :78.00   Max.   :10.770   Max.   :103801

Lets first look at the relationship between Unemployment and Income as I suspect they must be dependent on each other.

ggplot(crimes2015, aes(y=Unemployment, x=Income/10000))+geom_point()+geom_smooth(method = 'loess', formula = "y~x", se=FALSE)

From the above graph we can infer that as income increases for the population, the unemployment decreases.

Lets first save the Crime Rate (per 1000). I’ll use this as our response variable.

crimes2015$CrimeRate <- crimes2015$Crime / crimes2015$Population * 1000 #per 1000

For the purpose of our analysis, I will be converting Unemployment into factor and three buckets will be created for low, medium and high unemployment. Also, I will convert temperature into factor and put them into five buckets.

df1 <- transform(crimes2015, 
                 TemperatureGroup=cut(Temperature,
                                                  breaks=c(-Inf, 50, 60, 70, Inf),
                                                  labels=c('<50', '50', '60', '70')), 
                 IncomeGroup=cut(Income, breaks=c(-Inf,44756, 59314, Inf), 
                                 labels=c('Low', 'Medium', 'High')), 
                 UnemploymentGroup=cut(Unemployment, 
                                       breaks=c(-Inf,4.165, 5.965, Inf), 
                                       labels=c('Low', 'Medium', 'High')))

First, we can view the response variable CrimeRate data continuity by creating a histogram:

hist(df1$CrimeRate, main = paste("Histogram of CrimeRate"), xlab="CrimeRate")

Clearly, the data is almost in the form of a bell curve but slightly left skewed and not a perfect normal distribution.

Let’s check out the mean() and var() of the dependent variable:

mean(df1$CrimeRate)
## [1] 8.328426
var(df1$CrimeRate)
## [1] 15.10534

The variance is much greater than the mean, which suggests that we will have over-dispersion in the model.

Now, lets take a look at the interaction between Temperature and Unemployment

(cleaned_data <- df1 %>% 
    group_by(TemperatureGroup, UnemploymentGroup) %>% 
    summarise(
      mean_CrimeRate = mean(CrimeRate, na.rm = TRUE)
    ) %>% 
    drop_na())

(labels <- cleaned_data %>% 
    filter(TemperatureGroup == 70) %>% 
    mutate(
      label = case_when(
        UnemploymentGroup == "Low" ~ "Low",
        UnemploymentGroup == "Medium" ~ "Medium",
        UnemploymentGroup == "High" ~ "High"
      )
    ))

cleaned_data %>% ggplot(aes(TemperatureGroup, mean_CrimeRate)) +  geom_line(size = 1.2, aes(group = UnemploymentGroup, color = UnemploymentGroup)) +
  geom_point(size = 2.6, aes(color = UnemploymentGroup), shape = 15)  +
  labs(
    title = "Interaction between Temperature and Unemployment",
    subtitle = paste0("Mean crime rate decreases with increase in temperature.\n",
                      "This is true for all levels of unemployment except the medium.\n",
                      "Population with medium unemployment has a sudden increase in crime rate at temp 70. "),
    caption = "",
    x = "Temprature",
    y = "mean Crime Rate",
    colour="Unemployment"
  ) 

Poisson model

Lets consider the model

pois.m1=glm(Crimes ~ Temperature+Unemployment + offset(log(Population)),
           data=crimes2015,
           family = poisson(link = "log") )

And an alternative model

pois.m2=glm(Crimes ~ Temperature*Unemployment + offset(log(Population)),
            data=crimes2015,
            family = poisson(link = "log") )

Now lets check the effect of independent variables added sequentially. This is very useful and help us remove unnecessary variables in the model.

anova(pois.m1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Crimes
## 
## Terms added sequentially (first to last)
## 
## 
##              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                            57      70096              
## Temperature   1   5603.4        56      64492 < 2.2e-16 ***
## Unemployment  1  10095.3        55      54397 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(pois.m2, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Crimes
## 
## Terms added sequentially (first to last)
## 
## 
##                          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                        57      70096              
## Temperature               1   5603.4        56      64492 < 2.2e-16 ***
## Unemployment              1  10095.3        55      54397 < 2.2e-16 ***
## Temperature:Unemployment  1    878.8        54      53518 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result shows that both variables are significant in both the models so we will keep them.

Now, lets compare these two models to see which one is preferred

anova(pois.m1, pois.m2, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Crimes ~ Temperature + Unemployment + offset(log(Population))
## Model 2: Crimes ~ Temperature * Unemployment + offset(log(Population))
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        55      54397                          
## 2        54      53518  1   878.81 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova shows that the pois.m2 model is an improvement over pois.m1

Since the models are not nested, we compare the two models using the AIC:

c( "Model1"=AIC(pois.m1),"Model2"=AIC(pois.m2) )
##   Model1   Model2 
## 55003.01 54126.20

The AIC suggests the model pois.m2 produces a better predictions so we will use pois.m2 for our analysis

Diagnostics:

par(mfrow=c(2,2))
plot(pois.m2)