Background

San Francisco is a beautiful city in California. It is the most densely populated city in the state and the second-most densely populated major city in the United States after New York City (source). With its dense population, it also sees many types of crime, and the crime rate has been a concern for the local government. For this project, we chose the San Francisco crime dataset because we wanted to apply data cleaning, visualization, and analysis to a real-world issue that matters. We created several visualizations that help the general public understand the distribution of crime instances. We also incorporated historical weather data and applied machine learning and regression to build a model that predicts the number of major (i.e., life-threatening) and minor (i.e., material-loss) crimes that may take place in a district at a given date and time. We hope that our website helps San Francisco residents better understand the security of their surroundings.

Historical Crime Instances in San Francisco
Yellow and orange mark minor and major crime instances, respectively.

Datasets Used

San Francisco crime data: This is an open dataset released by the San Francisco Police Department under the SF OpenData initiative. We first came across a cleaned version of the dataset in a Kaggle contest on crime type categorization, but we decided to use the raw data from SF OpenData instead, as we believe it is more reliable. The dataset includes all crime instances from 1/1/2013 up to two weeks before the data download (April 2016), totaling 1,902,850 rows. Each row includes the date, time, category, description, address, and geo-coordinates of an instance.

San Francisco historical weather data: This dataset contains ASOS (Automated Surface Observing System) data from San Francisco International Airport, downloaded from the Iowa Environmental Mesonet. It includes hourly measurements from 1/1/2013 up to the day before the data download (April 2016). Each measurement contains the temperature, dew point, and wind speed, along with a raw METAR string that encodes further weather information.

San Francisco weather forecast: This data is accessed in real time from a weather API by forecast.io. The service receives and sanitizes weather data from various sources and returns the weather forecast for San Francisco for the following two days.
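Since crime instances carry their own timestamps while the weather observations are hourly, the two datasets have to be aligned before they can be analyzed together. The following is a minimal sketch of one way to do that with pandas, matching each crime row to the nearest weather observation. The column names (timestamp, category, valid, temp_f), the one-hour tolerance, and the toy rows are assumptions made for illustration; they are not taken from the actual files or from our processing pipeline.

```python
import pandas as pd

# Toy rows with hypothetical column names; the real schemas may differ.
crimes = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2013-01-01 00:10", "2013-01-01 01:50", "2013-01-01 03:05"]
    ),
    "category": ["LARCENY/THEFT", "ASSAULT", "VANDALISM"],
})
weather = pd.DataFrame({
    "valid": pd.to_datetime(
        ["2013-01-01 00:56", "2013-01-01 01:56", "2013-01-01 02:56"]
    ),
    "temp_f": [48.0, 47.0, 46.0],
})


def attach_weather(crimes, weather, tolerance="1h"):
    """Attach the nearest weather observation (within `tolerance`) to each crime row."""
    # merge_asof requires both frames to be sorted by their join keys.
    crimes = crimes.sort_values("timestamp")
    weather = weather.sort_values("valid")
    return pd.merge_asof(
        crimes,
        weather,
        left_on="timestamp",
        right_on="valid",
        direction="nearest",
        tolerance=pd.Timedelta(tolerance),
    )


joined = attach_weather(crimes, weather)
print(joined[["timestamp", "category", "temp_f"]])
```

Crime rows with no observation inside the tolerance window would get a missing temperature rather than a distant, misleading one, which is why the tolerance is set explicitly here.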

Data Quality

Completeness: Our main datasets for visualization and analysis are the crime data and the historical weather data. While the crime data is complete, the weather dataset contains around 100 data points with missing temperatures. To fill these gaps, we manually cross-referenced other data sources (e.g. Weather Underground) and entered the missing values.

Correctness & Coherency: A manual inspection of the weather dataset shows that it matches other sources (e.g. weather.com), and the temperatures fluctuate across seasons as expected. As for the crime dataset, it is impossible to verify the correctness of each individual instance, but we verified the counts of the most serious crimes (e.g. homicide) against news reports, and the numbers match. We also noticed that theft and assault are the most common types of crime, which matches our expectation.

Accountability: The weather data comes from the ASOS network, which is a highly reliable source, as it also provides data to commercial pilots. As for the crime dataset, it is published by the San Francisco Police Department, which is also considered reliable and accountable.
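The completeness and correctness checks described above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the column names (valid, temp_f, category), the toy rows, and the reference temperature are hypothetical, and the reference dictionary stands in for values looked up by hand from another source.

```python
import pandas as pd

# Toy weather rows with one missing temperature (hypothetical column names).
weather = pd.DataFrame({
    "valid": pd.to_datetime(
        ["2013-01-01 00:56", "2013-01-01 01:56", "2013-01-01 02:56"]
    ),
    "temp_f": [48.0, None, 46.0],
})


def missing_temperature_times(weather):
    """Return the observation times whose temperature is missing."""
    return weather.loc[weather["temp_f"].isna(), "valid"].tolist()


def fill_from_reference(weather, reference):
    """Fill missing temperatures from a timestamp -> temperature lookup."""
    filled = weather.copy()
    lookup = filled["valid"].map(reference)  # NaN where no reference value exists
    filled["temp_f"] = filled["temp_f"].fillna(lookup)
    return filled


# Made-up reference value standing in for a manually cross-referenced reading.
reference = {pd.Timestamp("2013-01-01 01:56"): 47.5}
missing = missing_temperature_times(weather)
filled = fill_from_reference(weather, reference)

# Per-category counts, useful for comparing against external reports.
crimes = pd.DataFrame({
    "category": ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "HOMICIDE"],
})
counts = crimes["category"].value_counts()
print(missing, filled, counts, sep="\n")
```

Listing the missing timestamps first keeps the manual lookup small, and existing temperatures are never overwritten because only missing values are filled.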

Click here for more details on data processing