3  Results

Code
options(warn = -1)
library(ggplot2)
library(dplyr)
library(lubridate)
library(nycgeo)
library(ggridges)
library(vcd)
library(sf)
library(tidyverse,warn.conflicts = FALSE)
Code
service_data = read_csv("C:\\Users\\varun\\Downloads\\311_Service_Requests_from_20230901_to_20231129.csv",show_col_types = FALSE)

#service_data = read_csv('I:/My Drive/MSDS/Study/Fall23/EDAV/Homework/final_project/311_Service_Requests_from_20230901_to_20231129.csv', show_col_types = FALSE)

3.1 Clean data

Since “Facility Type”, “Vehicle Type”, “Due Date” columns are all having >90% missing values (missing value analysis), we will remove them.

From the dataset description, they refer to…

Facility Type: If available, this field describes the type of city facility associated to the SR

Vehicle Type: If the incident is a taxi, this field describes the type of TLC vehicle.

Due Date: Date when responding agency is expected to update the SR. This is based on the Complaint Type and internal Service Level Agreements (SLAs).

“Landmark” has >40% missing values. But since other address columns (eg. Longitude, Latitude, Incident Address) are proxies of it, we can remove “Landmark”.

Landmark: If the incident location is identified as a Landmark the name of the landmark will display here

Community Boards (each taking charge of 1 Community District, or CD), together with the City Council Districts, handle the “day-to-day processing of citizen complaints and requests for municipal services”. At the sub-borough level, we will look at how well each CD manages their municipal issues. “link”

Since we’re only looking at the New York City dataset, the City column is redundant, and can be removed too.

There are some rows with Borough and Status values being “Unspecified”. But they take up such a small % of the total number of rows (~0% from missing value analysis), so we will remove these rows too.

Code
service_data <- service_data[, !(colnames(service_data) %in% c("Vehicle Type", "Due Date", "Facility Type", "Landmark", "City"))]

# see count of number of values. in Borough, Status column
table(service_data[["Borough"]])

        BRONX      BROOKLYN     MANHATTAN        QUEENS STATEN ISLAND 
       157918        266927        183142        195532         30344 
  Unspecified 
         3362 
Code
table(service_data[["Status"]])

   Assigned      Closed In Progress        Open     Pending     Started 
       4203      734515       49275       47374         534         885 
Unspecified 
        439 
Code
# remove rows with Borough / Status = "Unspecified"
service_data <- service_data[!grepl("Unspecified", service_data$Borough),]
service_data <- service_data[!grepl("Unspecified", service_data$Status),]

# convert these columns to lower case, to commonalize similar texts that have different casing
service_data <- service_data %>%
  mutate_at(vars(c(`Street Name`, `Incident Address`, `Complaint Type`)), tolower)

3.2 (Q1) Which borough has the most complaints, and the least?

Code
# library(ggplot2)
library(forcats)
# library(dplyr)

# specify colors for boroughs
custom_colors <- c("QUEENS" = "yellow", "MANHATTAN" = "blue", "BRONX" = "brown", "STATEN ISLAND" = "green", "BROOKLYN" = "red")

# plot bar chart
ggplot(service_data %>%
         group_by(Borough) %>% # to include another dimension (but it tends to be too cluttered), add eg: , `Complaint Type`
         summarise(Total_Complaints = n()), 
       aes(x = reorder(Borough, -Total_Complaints), y = Total_Complaints/1000, fill = Borough)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Complaints by NYC Borough",
       x = "Borough",
       y = "Number of Complaints (000s)") +
  scale_fill_manual(values = custom_colors) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_minimal()

Comments:

Brooklyn appears to receive the most 311 complaints, followed by Queens, Manhattan, the Bronx. Staten Island seems to receive the least complaints.

3.3 (Q2) Which complaint types are the most numerous, and the least?

Code
ggplot(service_data %>%
         # mutate(`Complaint Type` = tolower(`Complaint Type`)) %>%
         group_by(`Complaint Type`) %>%
         summarise(Total_Complaints = n()), 
       aes(x = Total_Complaints/1000, y = reorder(`Complaint Type`, Total_Complaints), fill = `Complaint Type`)) +
  geom_col(position = position_dodge(width = 20)) +
  labs(title = "Total Complaints by Complaint Type",
       x = "Number of Complaints (000s)",
       y = "Complaint Type") +
  scale_fill_manual(values = custom_colors) +
  theme(axis.text.y = element_text(vjust = 0.5)) + # Adjust hjust and vjust as needed
  theme_minimal()

Comments:

On an aggregate level across all boroughs, the most common complaints range from illegal parking, noise in residential spaces, (presumably, the lack of) hot water, to also blocked driveways. In general, noise/cleanliness/water issues are the most frequent complaints, as seen from the top 10-20 bars.

Towards the bottom of the bar plot, we see that certain issues amount to fewer cases of complaints. These include highway/snow/oil spill issues. This is understandable, given that they are less “day-to-day” issues.

To drill in further, we can facet by boroughs. However, there are clearly too many complaint types. Focusing on just the top 5, for each borough…

Code
library(tidytext)

service_data_top <- service_data %>%
  # mutate(`Complaint Type` = tolower(`Complaint Type`)) %>%
  group_by(Borough, `Complaint Type`) %>%
  summarise(Total_Complaints = n()) %>%
  group_by(Borough) %>%
  top_n(5, wt = Total_Complaints) %>%
  arrange(Borough, desc(Total_Complaints)) %>%
  mutate(`Complaint Type` = factor(`Complaint Type`, levels = unique(`Complaint Type`)))

# Get a vector of unique Boroughs ordered by the total complaints
ordered_boroughs <- service_data_top %>%
  arrange(Borough) %>%
  group_by(Borough) %>%
  summarise(total_complaints = sum(Total_Complaints)) %>%
  arrange(desc(total_complaints)) %>%
  pull(Borough)

# Use reorder_within to sort bars within each facet
ggplot(service_data_top, aes(x = Total_Complaints, y = reorder_within(`Complaint Type`, Total_Complaints, Borough), fill = `Complaint Type`)) +
  geom_col() +
  labs(title = "Top 5 Complaint Types by Borough",
       x = "Number of Complaints",
       y = "Complaint Type") +
  scale_fill_manual(values = custom_colors) +
  theme_minimal() +
  facet_wrap(~Borough, scales = "free_y", ncol = 1, strip.position = "bottom") +
  theme(axis.text.y = element_text(vjust = 0.5)) +
  theme_minimal()

Comments:

We see that again, illegal parking, lack of hot water, and residential noise rank amongst the top 5 complaints across all 5 boroughs. This would certainly be something to look out for, for anyone who’s looking to stay in these boroughs. But can we go 1 level lower, to see which streets (in these boroughs) contain many of these complaints?

3.4 (Q3) What are the locations that most complaints are associated with?

We explore locations by their Street Names (eg. Seventh Avenue), not Incident Address (eg. 31 Seventh Avenue, 139 Seventh Avenue), as the latter would have too many unique categories. For the initial plots here, we do not use zip code or longitude/latitude as they are less interpretable (we will plot that on a chloropleth later).

Since the number of missing rows in Street Names is low, as seen from the missing vales analysis (~3.7% rows missing), we shall omit these missing rows from this section’s analysis.

Code
# remove rows with missing Street Name
service_data_temp <- service_data[complete.cases(service_data$`Street Name`), ]
Code
service_data_top <- service_data_temp %>%
  # mutate(`Street Name` = tolower(`Street Name`)) %>%
  group_by(Borough, `Street Name`) %>%
  summarise(Total_Complaints = n()) %>%
  group_by(Borough) %>%
  top_n(5, wt = Total_Complaints) %>%
  arrange(Borough, desc(Total_Complaints)) %>%
  mutate(`Complaint Type` = factor(`Street Name`, levels = unique(`Street Name`)))

# Get a vector of unique Boroughs ordered by the total complaints
ordered_boroughs <- service_data_top %>%
  arrange(Borough) %>%
  group_by(Borough) %>%
  summarise(total_complaints = sum(Total_Complaints)) %>%
  arrange(desc(total_complaints)) %>%
  pull(Borough)

# Use reorder_within to sort bars within each facet
ggplot(service_data_top, aes(x = Total_Complaints, y = reorder_within(`Street Name`, Total_Complaints, Borough), fill = `Street Name`)) +
  geom_col() +
  labs(title = "Top 5 Streets with the most complaints",
       x = "Number of Complaints",
       y = "Street Name") +
  scale_fill_manual(values = custom_colors) +
  theme_minimal() +
  facet_wrap(~Borough, scales = "free_y", ncol = 1, strip.position = "bottom") +
  theme(axis.text.y = element_text(vjust = 0.5)) +
  theme_minimal()

Comments:

We see from here some of the most complained streets in each borough, eg. Ocean Avenue in Brooklyn. But this can also be high, due to the fact that some of them are long, main roads in each Borough (eg. Broadway, Amsterdam Avenue in Manhattan). We can plot a chloropleth to get a better, geographical view of problem areas:

Code
service_data_top_districts <- service_data %>%
  group_by(`City Council Districts`) %>%
  summarise(Total_Complaints = n()) %>%
  arrange(desc(Total_Complaints)) %>%
  rename(value = Total_Complaints) %>%
  drop_na(value) %>%
  mutate(`City Council Districts` = as.character(`City Council Districts`))

  nyc_boundaries(geography = "council") %>%
  left_join(service_data_top_districts,
            by = c("council_dist_id" = "City Council Districts")) %>%
  ggplot() +
  geom_sf(aes(fill = value)) +
  scale_fill_viridis_c(name = "Total Complaints",
                       option = "brewer blues",
                       direction = -1) +
  theme_void() +
  labs(title = "Total Complaints for each District")

Comment:

From this, we can see that indeed the Amsterdam and Broadway Ave areas in Manhattan UWS have the most complaints. So is the Prospect Park area in Brooklyn, which also coincides with the Ocean Ave street. This narrows down the areas of the streets, which are more problematic.

3.5 (Q4) What is the proportion of complaints that were handled by different departments?

Code
ggplot(service_data %>%
         group_by(Agency) %>%
         summarise(Total_Complaints = n()), 
       aes(x = reorder(Agency, -Total_Complaints), y = Total_Complaints/1000, fill = Agency)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Complaints by NYC Agencies",
       x = "Agency",
       y = "Number of Complaints (000s)") +
  scale_fill_manual(values = custom_colors) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_minimal()

Code
library(knitr)

# Remove duplicates
service_data_agencies <- distinct(select(service_data, Agency, `Agency Name`))

# Display the table
kable(service_data_agencies, caption = "Agencies", format = "html")
Agencies
Agency Agency Name
NYPD New York City Police Department
DOHMH Department of Health and Mental Hygiene
DCWP Department of Consumer and Worker Protection
DPR Department of Parks and Recreation
HPD Department of Housing Preservation and Development
DEP Department of Environmental Protection
DOT Department of Transportation
TLC Taxi and Limousine Commission
DOB Department of Buildings
DSNY Department of Sanitation
EDC Economic Development Corporation
DHS Department of Homeless Services
DOE Department of Education

We wanted to also explore how efficient/overworked each Agency is, in managing these complaints. These could provide an indication as to the level of service a resident might expect, when living in the different parts of NYC. We chose to only factor up to 2 levels, because any more levels, and the mosaic plot would become too cluttered to read, given the huge number of categories that we have.

The final plot can be seen below:

Code
library(RColorBrewer)
status_order <- c("Open", "Assigned", "Pending", "Started", "In Progress", "Closed")

service_data_temp$Status <- factor(service_data_temp$Status, , levels = status_order)

contingency_table <- table(service_data_temp$Agency, service_data_temp$Status)

# Create a mosaic plot using base R
mosaicplot(contingency_table, main = "Distribution of Complaint Status by NYC Agencies", las=2)

Comments:

Note that:

  • The data dictionary provided did not explicitly state what each status means, but a sensible inference would be that they are ordinal categories, going from Open, to being Assigned to some agency. It would presumably stay in a Pending status, before work on it gets Started. It’ll remain In Progress, until it gets Closed by the responsible agency.

  • Lines that split into 3 bands, indicate 0 values for that Status tier.

We shall focus on the 2 most overworked agencies: NYPD and HPD, with the widest bars.

We can thus observe that the NYPD, even though flooded with the greatest volume of complaints, has been able to close almost all of them. This should be reassuring to residents who are concerned about safety and noise related issues, as the NYPD is the main body responsible for them.

The HPD (2nd highest workload) however, has many Open issues remaining. This is cold comfort to residents-to-be, who are seeking responsive solutions to their housing infrastructural problems. 1 consolation is that the HPD has also managed to close a large proportion of its complaints.

What about the distribution of complaints handled by Agencies, per Borough?

Code
contingency_table <- table(service_data_temp$Borough, service_data_temp$Agency)

# Create a mosaic plot using base R
mosaicplot(contingency_table, main = "Distribution of Agencies handling complaints, by NYC Boroughs", las=2,highlighting = 2)

Comments:

Putting this together with the previous plot, we see that once again, NYPD, then HPD, took up the lion’s share of the workload for complaints, across all boroughs. Residents-to-be can feel comforted that issues in their home area that are managed by the NYPD, would mostly be closed.

Queens however, ought to be slightly concerned, about the (relatively) higher proportion of police complaints that they receive.

Across all Boroughs (especially those living in the Bronx), one should also rightfully be concerned at the large amount of housing complaints that they receive in their borough. Queens’ residents-to-be can put their minds at ease, knowing that they get fewer housing complaints.

3.6 (Q5) Types of complaints that get closed the fastest/slowest, by using the mean resolution time to close a request.

Closing complaints is great. But what about the speed at which they get closed?

Code
service_data$`Closed Date` <- date(mdy_hms(service_data$`Closed Date`,tz=Sys.timezone()))
service_data$`Created Date` <- date(mdy_hms(service_data$`Created Date`,tz=Sys.timezone()))
service_data$Closing_Duration <- service_data$`Closed Date` - service_data$`Created Date`

Complaints that get resolved the fastest

Code
complaint_duration <- service_data %>%
  group_by(`Complaint Type`) %>%
  summarise(across(Closing_Duration,mean, na.rm=TRUE))

complaint_duration[order(complaint_duration$Closing_Duration, decreasing = FALSE),] %>%
  slice(1:40) %>%
  ggplot(aes(x = Closing_Duration, y = fct_rev(
    fct_reorder(`Complaint Type`, -Closing_Duration, .desc = TRUE)
  ))) +
  geom_point() +
  xlab("Closing Duration (Days)") +
  ylab("Complaint Type") +
  ggtitle("Mean Time for Complaint Resolution (Days)")+
  theme_bw()

Comments: We see that non-structural issues such as vehicular/sidewalk noise, drug activity, illegal fireworks/parking, generally get resolved the fastest (<1 day). This is to be expected, since the perpetrators can be fairly easily directed away.

Thankfully, many of these are day-to-day issues that can directly affect the wellbeing of residents, and the fact that they get resolved fast brings comfort to anyone who plans to stay in NYC.

Top 40 Complaints that get resolved the slowest

Code
complaint_duration[order(complaint_duration$Closing_Duration, decreasing = TRUE),] %>%
  slice(1:40) %>%
  ggplot(aes(x = Closing_Duration, y = fct_rev(
    fct_reorder(`Complaint Type`, Closing_Duration, .desc = TRUE)
  ))) +
  geom_point() +
  xlab("Closing Duration") +
  ylab("Complaint Type") +
  ggtitle("Mean Time for Complaint Resolution (Days)")+
  theme_bw()

Comments: Structural issues such as sidewalk/sewer/infrastructure condition, generally get resolved the slowest (>10 days). This is to be expected, since these require more involved efforts.

However, one might be more concerned about sewage/building maintenance/taxi complaints, when planning to stay in NYC. These can impact the living conditions and transportation satisfaction of said city-dweller, and one might want to be aware of the less-easily resolved issues plaguing certain areas in NYC. We will see in the D3 graph, how all the graphs that answer these questions here, come together, to help someone decide where might be problematic areas to live in.

Resolution time for complaint tickets across Boroughs

Code
library(ggridges)

service_data %>% na.omit(service_data) %>% 
ggplot(aes(x = Borough ,y = Closing_Duration)) +
  geom_boxplot() +
  ggtitle("Boxplot for Ticket Closing Duration for each Borough")+
  labs(
    x = "Borough",
    y = "Ticket Resolution Duration (Days)"
  ) +
  theme_bw() +
  coord_flip()

Code
# ridgeline?
service_data %>%
  ggplot(aes(x = Closing_Duration, y = as.factor(Borough), fill = as.factor(Borough))) +
  geom_density_ridges() +
  ggtitle("Ridgeline Plot for Ticket Closing Duration for each Borough") +
  labs(
    x = "Ticket Resolution Duration (Days)",
    y = "Borough"
  ) +
  scale_fill_manual(values = custom_colors) +  # Set specific colors
  theme_bw()

Comments:

From the boxplots we can see that the median resolution time across different boroughs is nearly the same (Between 1-2 days). The outliers are mainly due to issues visualized in the previous Cleveland plot.

Ridgeline plots were considered too, instead of the boxplots, and we similarly see a long-tail effect, where most complaints get resolved in <5days, and very few take >10 days to close. But since most of the resolution times are close to 1-2 days, there was not a lot of additional useful information that could be derived from such a distribution.

Resolution time for tickets across Districts

Code
service_district <- service_data %>%
  group_by(`City Council Districts`) %>%
  summarise(mn_cd = mean(Closing_Duration, na.rm = TRUE)) %>%
  arrange(desc(mn_cd)) %>%
  rename(value = mn_cd) %>%
  drop_na(value) %>%
  mutate(`City Council Districts` = as.character(`City Council Districts`))

nyc_boundaries(geography = "council") %>%
  left_join(service_district,
            by = c("council_dist_id" = "City Council Districts")) %>%
  ggplot() +
  geom_sf(aes(fill = value)) +
  scale_fill_viridis_c(name = "Mean Resolution Time (Days)",
                       option = "inferno",
                       direction = -1) +
  theme_void() +
  labs(title = "How fast is the ticket resolution for each District?")

Comments:

The above analysis for q5, culminates in this chloropleth map, that shows how long the various districts of NYC’s boroughs take to resolve 311 service requests.

We see that the mean complaint resolution time corroborates with the above boxplot/ridgeline plots, with the mean duration hovering below 5 days. However, the darkest colored regions, eg. north Staten Island, west Brooklyn, southwest Queens, southern Manhattan and Bronx, resolve 311 issues the slowest on average. Residents-to-be may want to keep this (and knowledge from Q3, boroughs and their types of high complaint counts, and their respective streets with high complaint counts) in mind, when deciding where to stay.