Which route would the three amigos choose?
Background and Motivation:
Our project motivation was to analyze data that was easy to relate to.
So begins the journey of the three amigos. Staying at the St. Regis in downtown San Francisco, each of them had a favourite destination in mind for a quick jog, but would only go if the other two were interested in the same spot. That created a conflict of interest: how would one convince the others to agree with their choice? They started flipping coins and rolling dice, but neither gave them the answer they sought. Here comes data analytics to the rescue.
With San Francisco crime data readily available, we wanted to see whether we could identify relationships among likely variables and come up with practical recommendations as to which route would be the safest.
Project Objectives:
Analyze the San Francisco Crime data set provided by Kaggle.com to predict the likelihood of certain crime occurrences and suggest the better path among the 3 chosen routes, which share a common starting point at the St. Regis and end at the Aquarium of the Bay, the Painted Ladies, or Cupid's Span.
Approach:
Examine the San Francisco crime dataset and wrangle it down to what is relevant to our end objective, i.e., tidy the data and keep only the observations that affect our mode of transportation, which is WALKING/JOGGING. We then analyze and visualize the data to determine relationships between variables (if any), and conclude by defining 1-3 walking paths, overlaying the statistical likelihood of crime occurrences, and suggesting the better path. We intend to use the ‘training’ data set, as it is the only full-featured data set.
Analysis:
- Data Wrangling :
We started with the readily available San Francisco crime training dataset. The initial challenge was to identify the categories related to our end objective. The dataset labels each crime with a "Category", and each category is further subdivided into sub-categories, labeled "Descript", based on the act committed. A few "Category" values and their "Descript" values were outright irrelevant to our cause and were filtered out, for example "BAD CHECKS" and "GAMBLING".
Our second phase of wrangling was to closely examine the "Descript" values and filter out non-relevant descriptions within the remaining categories. For example, although the "ASSAULT" category is relevant to our case, acts such as "inflict injury on cohabitee", "threatening phone calls", and similar others are not relevant to a person whose mode of transportation is WALKING/JOGGING from point A to point B.
Our third and final phase of wrangling was to filter out unnecessary observations based on the "Descript" values identified in the second phase, making sure we arrived at a final dataset relevant to our end objective of WALKING/JOGGING.
We started with around 875K observations and systematically wrangled the data down to around 195K crime observations that have an impact on a person WALKING/JOGGING in the areas discussed.
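The filtering phases described above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: the exclusion sets contain only the examples quoted in the text, the sample rows are invented, and the helper names `is_relevant` and `wrangle` are hypothetical.

```python
# Hypothetical exclusion lists. Only the examples quoted in the text are
# included; the real lists were longer.
IRRELEVANT_CATEGORIES = {"BAD CHECKS", "GAMBLING"}
IRRELEVANT_DESCRIPTS = {
    "INFLICT INJURY ON COHABITEE",
    "THREATENING PHONE CALLS",
}


def is_relevant(row):
    """Keep a row only if neither its Category nor its Descript is excluded."""
    category = row["Category"].strip().upper()
    descript = row["Descript"].strip().upper()
    return (category not in IRRELEVANT_CATEGORIES
            and descript not in IRRELEVANT_DESCRIPTS)


def wrangle(rows):
    """Return the subset of rows relevant to a pedestrian."""
    return [row for row in rows if is_relevant(row)]


# Invented sample rows, for illustration only.
sample = [
    {"Category": "ASSAULT", "Descript": "BATTERY"},
    {"Category": "ASSAULT", "Descript": "THREATENING PHONE CALLS"},
    {"Category": "GAMBLING", "Descript": "GAMBLING"},
    {"Category": "LARCENY/THEFT", "Descript": "GRAND THEFT FROM PERSON"},
]
kept = wrangle(sample)
print(len(kept), "of", len(sample), "rows kept")
```

Category-level and description-level exclusions are applied in a single pass here; the three phases in the text differ in how the lists were built, not in how the filter runs.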
- Exploratory Analysis :
The map below, built from the finalized wrangled data, shows all categories of crime, with a clear concentration near the northeastern section of the city. This map guided us toward the areas we needed to concentrate on for further analysis.
For reference, we’ve provided zip code boundaries as an overlay.
We further grouped the crimes into 3 classes based on severity and threat level to life:
Property : Theft, Robbery, etc.
Life : Kidnapping, Sex Offenses, etc.
Nuisance : Loitering, Drunkenness, etc.
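One simple way to express this grouping is a lookup table from category to class. The sketch below is illustrative only: the exact category strings are our assumption about how the dataset spells them, the table covers just the categories named in this write-up, and `classify` is a hypothetical helper.

```python
# Category -> class lookup. The category spellings are assumed, and only
# categories mentioned in this write-up are listed.
CRIME_CLASS = {
    "LARCENY/THEFT": "Property",
    "ROBBERY": "Property",
    "ASSAULT": "Life",
    "KIDNAPPING": "Life",
    "MISSING PERSON": "Life",
    "SEX OFFENSES FORCIBLE": "Life",
    "LOITERING": "Nuisance",
    "DRUNKENNESS": "Nuisance",
    "DISORDERLY CONDUCT": "Nuisance",
    "DRUG/NARCOTIC": "Nuisance",
    "NON-CRIMINAL": "Nuisance",
}


def classify(category):
    """Return 'Property', 'Life', or 'Nuisance', or None if unmapped."""
    return CRIME_CLASS.get(category.strip().upper())


print(classify("Assault"))    # Life
print(classify("GAMBLING"))   # None (filtered out during wrangling)
```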
Life-related crimes such as assault, kidnapping, missing persons, and sex offenses have a much wider footprint, mainly due to the number and prevalence of assault-related crimes, whereas nuisance-related crimes such as disorderly conduct, drug/narcotic, drunkenness, and non-criminal offenses are very much concentrated in one region. The map below provides crime density views of the same.
Per the histogram below, larceny and theft crimes top the chart at more than 70,000 occurrences, followed by assault and other suspicious activities.
Below we point out a few interesting trends of crime by class, district, and day. Not only does the Southern District have the largest number of crimes, it also has significantly more property-related crimes. Since the Southern District shows some interesting trends, we decided to delve into it a bit further.
On Fridays, the daily average crime by class for the Southern District is a little under 6,000 crimes: approximately 60% are related to property, 10% to nuisance, and 30% to life-related crimes. The second-highest daily crime rate falls on Saturday, followed, surprisingly, by Wednesday!
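The percentage split by class for a given district and day can be computed along these lines. This is a sketch under assumptions: `PdDistrict` and `DayOfWeek` are assumed column names, `crime_class` is a hypothetical column holding our Property/Life/Nuisance label, and the sample rows are invented to show the arithmetic rather than taken from the data.

```python
from collections import Counter


def class_shares(rows, district, day):
    """Percentage of crimes per class for one district on one weekday."""
    counts = Counter(
        row["crime_class"]
        for row in rows
        if row["PdDistrict"] == district and row["DayOfWeek"] == day
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: 100.0 * n / total for cls, n in counts.items()}


def make_rows(crime_class, n, district="SOUTHERN", day="Friday"):
    """Build n invented rows of one class (illustration only)."""
    return [
        {"crime_class": crime_class, "PdDistrict": district, "DayOfWeek": day}
        for _ in range(n)
    ]


sample = (make_rows("Property", 6) + make_rows("Life", 3)
          + make_rows("Nuisance", 1)
          + make_rows("Property", 4, district="MISSION"))
shares = class_shares(sample, "SOUTHERN", "Friday")
for cls, pct in sorted(shares.items()):
    print(f"{cls}: {pct:.0f}%")
```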
The chart below shows the frequencies of crime within a given hour for the Southern District. We found it interesting that the highest occurrences fall in the afternoon between noon and 5 pm with the peak between 3-4 pm. The lowest occurrence is in the early morning hours between 4-5 am.
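A frequency-by-hour tally like the one charted can be sketched as below. The column names `Dates` and `PdDistrict`, the timestamp format, and the district spelling "SOUTHERN" are assumptions about the dataset; the sample rows are invented.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(rows, district="SOUTHERN"):
    """Count crimes per hour of day (0-23) for one police district."""
    counts = Counter()
    for row in rows:
        if row["PdDistrict"].upper() != district:
            continue
        # Assumed timestamp format, e.g. "2015-05-13 15:30:00".
        ts = datetime.strptime(row["Dates"], "%Y-%m-%d %H:%M:%S")
        counts[ts.hour] += 1
    return counts


# Invented sample rows, for illustration only.
sample = [
    {"Dates": "2015-05-13 15:30:00", "PdDistrict": "SOUTHERN"},
    {"Dates": "2015-05-14 15:05:00", "PdDistrict": "SOUTHERN"},
    {"Dates": "2015-05-14 04:45:00", "PdDistrict": "SOUTHERN"},
    {"Dates": "2015-05-14 15:10:00", "PdDistrict": "MISSION"},
]
counts = hourly_counts(sample)
peak_hour, peak_count = counts.most_common(1)[0]
print(f"Peak hour: {peak_hour}:00 with {peak_count} crimes")
```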
- Proximity Analysis :
As seen above in the Data Wrangling and Exploratory Analysis sections, we arrived at a crime data set of offenses that negatively impact pedestrians in San Francisco. The data set also provides the coordinates where each crime was reported, so we wanted to start with a ‘bird's eye view’.
Even though we can see all this crime, the Three Amigos decided that we still want to venture out from our hotel, the St. Regis in the Southern District, for a two-mile walk. We have three options: 1) walk 2 miles to the Aquarium of the Bay, 2) walk 2 miles to the Painted Ladies, or 3) walk a 2-mile round trip to Cupid’s Span and back.
In the first image, on the left, we have outlined the three paths heading north, south-west, or east. To the right, the same three paths are shown with their respective neighborhood crime, that is, crime close enough in proximity to pose a threat to us as we walk.
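As a rough illustration of what "close enough in proximity" could mean in code, the sketch below keeps crimes within a fixed distance of any waypoint on a path, using the haversine great-circle distance. Everything specific here is an assumption: the 200 m radius, the waypoint coordinates, the treatment of a path as a list of points rather than line segments, and the reading of the `X` and `Y` columns as longitude and latitude. It is not necessarily how the maps above were produced.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def crimes_near_path(crimes, waypoints, radius_m=200.0):
    """Return crimes within radius_m of at least one path waypoint.

    Assumes crime["Y"] is latitude and crime["X"] is longitude.
    """
    return [
        crime for crime in crimes
        if any(haversine_m(crime["Y"], crime["X"], lat, lon) <= radius_m
               for lat, lon in waypoints)
    ]


# Invented coordinates, for illustration only.
waypoints = [(37.7860, -122.4010), (37.7900, -122.4010)]
crimes = [
    {"X": -122.4011, "Y": 37.7861},  # a few metres from the first waypoint
    {"X": -122.4500, "Y": 37.8000},  # several kilometres away
]
near = crimes_near_path(crimes, waypoints)
print(len(near), "crime(s) near the path")
```

Measuring distance to waypoints only is a simplification; with sparse waypoints, a crime next to the middle of a long segment would be missed, so waypoints should be spaced more closely than the radius.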
So the goal is to pick the route with the least risk as we attempt to improve our health with a brisk walk.
To figure out which path posed the least risk, we pulled out our calculators. We weighted the three classes of crime in descending order of severity: life, property, and nuisance received weights of 6, 3, and 1 respectively. We then combined each path’s neighborhood crime counts into a single weighted score on this scale. The walk to the Aquarium of the Bay proved to be the safest route.
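The weighted score is a sum of per-class counts multiplied by the 6/3/1 weights. The sketch below shows the arithmetic only: the route names and counts are placeholders, not our actual figures, and `weighted_score` is a hypothetical helper.

```python
# Severity weights from the text: life = 6, property = 3, nuisance = 1.
WEIGHTS = {"Life": 6, "Property": 3, "Nuisance": 1}


def weighted_score(counts):
    """Sum of class counts multiplied by their severity weights."""
    return sum(WEIGHTS[cls] * counts.get(cls, 0) for cls in WEIGHTS)


# Placeholder counts, NOT the project's real numbers.
hypothetical_counts = {
    "route_a": {"Life": 10, "Property": 40, "Nuisance": 5},
    "route_b": {"Life": 20, "Property": 30, "Nuisance": 10},
    "route_c": {"Life": 15, "Property": 50, "Nuisance": 0},
}
scores = {route: weighted_score(c) for route, c in hypothetical_counts.items()}
safest = min(scores, key=scores.get)
print(scores)
print("Lowest weighted score:", safest)
```

With these weights, one life-related crime counts as much as two property crimes or six nuisance crimes, so a route with fewer total crimes can still score worse if more of them are life-related.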
To show the relative crime density points, we overlaid a heat map of the respective neighborhood crime. Based upon this, we’ll be walking a little bit faster in a few sections of our walk to the Aquarium. Even though we’ve determined the lower risk path, it still has risk.