The data for this project was gathered from multiple web sources and compiled into three final spreadsheets using Python. The scripts are located at the bottom; the raw datasets used for this project are listed immediately below:
- NYC Open Data DOHMH New York City Restaurant Inspection Results
- Zillow Rent Index Time Series Data: Multifamily, SFR, Condo/Co-op
- 2010 US Population Density by ZIP Code
- NYC Dept. of Health ZIP Code Definitions of New York City Neighborhoods
The restaurant density metric needed to be normalized by population size and geographic area, because areas with more people or more land would likely contain more restaurants. To compare restaurant densities across neighborhoods, two separate statistics were generated using the 2010 US Population Density dataset: restaurants per 100 people and restaurants per square mile. The same normalization was not applied to restaurant diversity, because it was unclear whether a larger population or land area would increase the number of restaurant categories, or what factors contribute to an area having more categories of restaurants.
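The two normalized statistics can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name, the dictionary layout, and the neighborhood figures are all hypothetical, and it assumes the restaurant count, population, and land area have already been totaled per neighborhood.

```python
def restaurant_density(restaurant_count, population, land_area_sq_mi):
    """Return (restaurants per 100 people, restaurants per square mile).

    A value is None when its denominator is zero or missing, so that
    unpopulated or zero-area ZIP groupings do not raise an error.
    """
    per_100_people = (
        restaurant_count / population * 100 if population else None
    )
    per_sq_mile = (
        restaurant_count / land_area_sq_mi if land_area_sq_mi else None
    )
    return per_100_people, per_sq_mile


# Hypothetical neighborhood totals, made up purely for illustration.
neighborhoods = {
    "Neighborhood A": {"restaurants": 250, "population": 50000, "area_sq_mi": 2.0},
    "Neighborhood B": {"restaurants": 120, "population": 80000, "area_sq_mi": 4.0},
}

# Map each neighborhood to its pair of normalized densities.
densities = {
    name: restaurant_density(
        totals["restaurants"], totals["population"], totals["area_sq_mi"]
    )
    for name, totals in neighborhoods.items()
}

for name, (per_100, per_sq_mi) in densities.items():
    print(name, per_100, per_sq_mi)
```

Keeping the two statistics separate makes it possible to see whether a neighborhood ranks highly because it is crowded, because it is compact, or both.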
All of this data was paired with rent data from the Zillow Rent Index, which contains the monthly median rent for ZIP codes across the US from November 2010 to March 2017. Median rent from the most recent month (March 2017) was used in this study as the current rent of a given area. The ZIP codes were grouped by neighborhood in the same manner as the restaurant ZIP codes, and the average of the rents of all the ZIP codes in a neighborhood was taken as the average rent for that neighborhood. A copy of the Zillow Rent Index time series was also created containing only the ZIP codes from New York, grouped by neighborhood in the same manner.
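The grouping and averaging step described above might look like the sketch below. It is an assumption-laden illustration rather than the original script: the function name, the ZIP-to-neighborhood mapping, and the rent figures are hypothetical, and it takes a plain unweighted mean of the ZIP-level median rents.

```python
from collections import defaultdict
from statistics import mean


def average_rent_by_neighborhood(rent_by_zip, zip_to_neighborhood):
    """Average ZIP-level median rents within each neighborhood.

    ZIP codes with no neighborhood mapping or no rent value are skipped.
    """
    grouped = defaultdict(list)
    for zip_code, rent in rent_by_zip.items():
        neighborhood = zip_to_neighborhood.get(zip_code)
        if neighborhood is None or rent is None:
            continue
        grouped[neighborhood].append(rent)
    return {name: mean(rents) for name, rents in grouped.items()}


# Hypothetical mapping and rent values, made up purely for illustration.
zip_to_neighborhood = {
    "10001": "Neighborhood A",
    "10011": "Neighborhood A",
    "10002": "Neighborhood B",
}
rent_by_zip = {
    "10001": 3000,
    "10011": 3400,
    "10002": 2500,
    "99999": 1000,  # not in the mapping, so it is ignored
}

average_rents = average_rent_by_neighborhood(rent_by_zip, zip_to_neighborhood)
print(average_rents)
```

Running the same function once per month of the time series would produce a neighborhood-level version of the full rent history.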
Following the presentation of the rent and diversity/density data, Yelp data on prices and ratings was gathered for two case studies. This data was collected by scraping Yelp.com with a Python Beautiful Soup script, which can be found under downloads. The web scraper searched for restaurants by address and returned their price ranking and ratings when found. Addresses were chosen as the search key because restaurant names varied widely between the DOHMH dataset and the Yelp website, whereas addresses were expected to be more concrete and uniform across the two sources. However, the web scraper still encountered discrepancies when searching by address, and it returned data for only 60% of the addresses included in the case study. This still left over 800 data points for the case study analysis.