Data Collection – Projects In Data

The data for this project was gathered from multiple web sources and compiled into three final spreadsheets using Python. The scripts are located at the bottom of the page; the raw datasets used for the project are described immediately below.

The DOHMH New York City Restaurant Inspection dataset provided a nearly comprehensive list of restaurants operating in New York City, along with their addresses and cuisine categories. Two primary statistics were derived from this data: restaurant density and restaurant diversity. Restaurant density is a measure of how many restaurants are located within a given area, while restaurant diversity is a measure of how many different categories of restaurants are located in that area. Restaurants were initially grouped by ZIP code to calculate these metrics. To make the data more digestible, however, the ZIP codes were first mapped to New York City neighborhoods using a dataset from the NYC Dept. of Health, and the restaurants were then grouped by neighborhood to generate the statistics.
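The grouping described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual script: the lookup table, the sample records, and the function name `density_and_diversity` are all hypothetical, and the real datasets contain many more fields.

```python
from collections import defaultdict

# Hypothetical ZIP-to-neighborhood lookup (illustrative values only).
ZIP_TO_NEIGHBORHOOD = {
    "10001": "Chelsea and Clinton",
    "10011": "Chelsea and Clinton",
    "11201": "Northwest Brooklyn",
}

# Hypothetical restaurant records: (name, ZIP code, cuisine category).
RESTAURANTS = [
    ("A", "10001", "Pizza"),
    ("B", "10011", "Pizza"),
    ("C", "10011", "Thai"),
    ("D", "11201", "Mexican"),
    ("E", "99999", "Pizza"),  # ZIP not in the lookup, so it is skipped
]


def density_and_diversity(restaurants, zip_to_neighborhood):
    """Count restaurants and distinct cuisine categories per neighborhood."""
    counts = defaultdict(int)
    cuisines = defaultdict(set)
    for _name, zip_code, cuisine in restaurants:
        neighborhood = zip_to_neighborhood.get(zip_code)
        if neighborhood is None:
            continue  # no neighborhood known for this ZIP code
        counts[neighborhood] += 1
        cuisines[neighborhood].add(cuisine)
    return {
        n: {"density": counts[n], "diversity": len(cuisines[n])}
        for n in counts
    }


stats = density_and_diversity(RESTAURANTS, ZIP_TO_NEIGHBORHOOD)
```

Here density is a raw count of restaurants and diversity is the number of distinct cuisine categories, both computed per neighborhood rather than per ZIP code.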

Because areas with more people or more land would likely have more restaurants, the restaurant density metric needed to be normalized by population size and geographic area. To compare restaurant densities across neighborhoods, two separate statistics were generated using the 2010 US Population Density dataset: restaurant density per 100 people and restaurant density per square mile. The same was not done for restaurant diversity, because it was unclear whether a larger population or land area would increase the number of restaurant categories, or what factors contribute to an area having more categories of restaurants.
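As a rough sketch of the two normalized statistics, assuming each neighborhood has a restaurant count, a population, and a land area in square miles (the function name and the sample numbers below are made up for illustration):

```python
def normalize_density(restaurant_count, population, area_sq_miles):
    """Return restaurant density per 100 people and per square mile."""
    if population <= 0 or area_sq_miles <= 0:
        raise ValueError("population and area must be positive")
    return {
        "per_100_people": 100 * restaurant_count / population,
        "per_square_mile": restaurant_count / area_sq_miles,
    }


# Hypothetical neighborhood: 250 restaurants, 50,000 residents, 2 square miles.
result = normalize_density(restaurant_count=250, population=50_000, area_sq_miles=2.0)
```

With these made-up inputs the neighborhood has 0.5 restaurants per 100 people and 125 restaurants per square mile, so two neighborhoods of very different sizes can be compared on the same scale.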

All of this data was paired with rent data from the Zillow Rent Index, which contains the monthly median rent for ZIP codes across the US from November 2010 to March 2017. Median rent from the most recent month (March 2017) was used in this study when referring to the current rent of an area. The ZIP codes were grouped by neighborhood in the same manner as the restaurant ZIP codes, and the average rent of all the ZIP codes in a neighborhood was taken as the average rent for that neighborhood. A replica of the Zillow Rent Index time series was also created using only the New York ZIP codes, grouped by neighborhood in the same way.
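The neighborhood rent figure is an unweighted average of ZIP-level median rents. A minimal sketch of that step, assuming the rents for one month are held in a dictionary keyed by ZIP code (the names and rent values here are hypothetical, not taken from the Zillow data):

```python
from collections import defaultdict
from statistics import mean


def neighborhood_rent(zip_rents, zip_to_neighborhood):
    """Average the ZIP-level median rents within each neighborhood."""
    grouped = defaultdict(list)
    for zip_code, rent in zip_rents.items():
        neighborhood = zip_to_neighborhood.get(zip_code)
        if neighborhood is None or rent is None:
            continue  # skip ZIPs outside the lookup or with missing rent
        grouped[neighborhood].append(rent)
    return {n: mean(rents) for n, rents in grouped.items()}


# Hypothetical lookup and single-month median rents (illustrative values only).
zip_to_neighborhood = {
    "10001": "Chelsea and Clinton",
    "10011": "Chelsea and Clinton",
    "11201": "Northwest Brooklyn",
}
march_2017 = {"10001": 3000, "10011": 3400, "11201": 2900, "60601": 2100}

rents = neighborhood_rent(march_2017, zip_to_neighborhood)
```

Running the same function once per month of the index would produce a neighborhood-level time series along the lines of the replica described above.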

Following the presentation of the rent and density/diversity data, Yelp data on prices and ratings was gathered for two case studies. This data was collected by scraping Yelp.com with a Python Beautiful Soup script, which can be found under downloads. The web scraper searched for restaurants by address and returned their price ranking and rating when found. Searching by address was chosen because restaurant names varied widely between the DOHMH dataset and the Yelp website; addresses were thought to be more concrete and uniform when translating from one to the other. However, the web scraper still encountered discrepancies when searching by address, and was therefore only 60% successful in returning data for the addresses included in the case study. In the end this still left over 800 data points for the case study analysis.
