Answer the following questions:
- What was the price of a cup of coffee in the state of New York between 1900 and 1909?
- How much does data cleaning on the source dataset change the answer to the first question?
What’s on the Menu? is a project to transcribe The New York Public Library’s restaurant menu collection. The collection contains menus from around the world, stretching from the 1850s to the 2000s.
The dataset contains four tables with the following relationships.
Credit: @monsieur-le-git
The first target question is tuned to the data available in the dataset.
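As a rough sketch of how the first question could be evaluated once the four tables are loaded as lists of dictionaries: the join below assumes a fourth table of menu pages and the key names `id`, `menu_id`, `menu_page_id`, and `dish_id`, none of which are stated in this README. The `in_scope` filter is likewise an assumed reading of the target values (year 1900 to 1909, place containing "New York", currency "Dollars").

```python
from statistics import median
from typing import Dict, List, Sequence

Row = Dict[str, object]


def in_scope(menu: Row) -> bool:
    """Assumed target values: year 1900-1909, place contains 'New York', US dollars."""
    year = str(menu.get("date") or "")[:4]
    return (
        year.isdecimal()
        and 1900 <= int(year) <= 1909
        and "New York" in str(menu.get("place") or "")
        and menu.get("currency") == "Dollars"
    )


def coffee_prices(
    menus: Sequence[Row],
    menu_pages: Sequence[Row],
    menu_items: Sequence[Row],
    dishes: Sequence[Row],
) -> List[float]:
    """Collect coffee prices from in-scope menus (key names are hypothetical)."""
    menu_ids = {m["id"] for m in menus if in_scope(m)}
    page_ids = {p["id"] for p in menu_pages if p["menu_id"] in menu_ids}
    coffee_ids = {
        d["id"] for d in dishes if str(d.get("name") or "").lower() == "coffee"
    }
    return [
        float(str(item["price"]))
        for item in menu_items
        if item["menu_page_id"] in page_ids
        and item["dish_id"] in coffee_ids
        and str(item.get("price") or "").strip()  # skip items with no price
    ]
```

A summary such as `median(coffee_prices(menus, menu_pages, menu_items, dishes))` could then be compared before and after cleaning; the choice of the median is an assumption here, not something this README specifies.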
| Table | Attribute | Change |
|---|---|---|
| Menu | date | Trim leading and trailing whitespace |
| Menu | call_number | Trim leading and trailing whitespace |
| Menu | place | Trim leading and trailing whitespace |
| Menu | currency | Trim leading and trailing whitespace |
| MenuItem | price | Trim leading and trailing whitespace |
| Dish | name | Trim leading and trailing whitespace |
| Menu | date | Set empty dates to year from call_number |
| Menu | date | Repair typos observed in manual data exploration |
| Menu | place | Repair misspellings of "New York" |
| Menu | currency | Repair misspellings of "Dollars" |
| Menu | currency | Change instances of "Cents" to "Dollars" (standardize US currency) |
| MenuItem | price | Divide by 100 (Menu currency changed from "Cents" to "Dollars") |
| Dish | name | Repair misspellings of "Coffee" |
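A few of the changes in the table can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names are hypothetical, records are assumed to be plain dictionaries, and `call_number` is assumed to begin with a four-digit year when it carries one. Note the ordering constraint: the price must be divided by 100 while the menu currency still reads "Cents".

```python
import re
from typing import Dict, Optional, Tuple


def trim(value: Optional[str]) -> str:
    """Strip leading and trailing whitespace; treat None as an empty string."""
    return (value or "").strip()


def year_from_call_number(call_number: str) -> Optional[str]:
    """Return a leading four-digit year, assuming values shaped like '1905-0123'."""
    match = re.match(r"(\d{4})\b", call_number)
    return match.group(1) if match else None


def clean_menu(menu: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Trim the Menu attributes and fill an empty date from call_number."""
    cleaned = {key: trim(menu.get(key))
               for key in ("date", "call_number", "place", "currency")}
    if not cleaned["date"]:
        cleaned["date"] = year_from_call_number(cleaned["call_number"]) or ""
    return cleaned


def clean_price(price: str, currency: str) -> Tuple[float, str]:
    """Convert a price in cents to dollars; leave other currencies unchanged."""
    value = float(trim(price))
    unit = trim(currency)
    if unit.lower() == "cents":
        return value / 100, "Dollars"
    return value, unit
```

For example, `clean_price(" 5 ", "Cents")` yields a price of 0.05 in "Dollars". Repairs of typos and misspellings are not shown, since they depend on values found during manual exploration.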
The goal of data cleaning is to increase the number of records from the menu and dish tables that meet the target value for every applicable attribute.
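One way to measure progress toward this goal is to count matching records before and after cleaning. The sketch below is illustrative only: the predicates encode an assumed reading of the target values (year 1900 to 1909, place containing "New York", currency exactly "Dollars", dish name containing "coffee"), and all function names are hypothetical.

```python
from typing import Callable, Dict, Iterable

Row = Dict[str, str]


def menu_meets_targets(menu: Row) -> bool:
    """True when date, place, and currency all match the assumed target values."""
    year = menu.get("date", "")[:4]
    return (
        year.isdecimal()
        and 1900 <= int(year) <= 1909
        and "New York" in menu.get("place", "")
        and menu.get("currency", "") == "Dollars"
    )


def dish_meets_targets(dish: Row) -> bool:
    """True when the dish name mentions coffee (case-insensitive)."""
    return "coffee" in dish.get("name", "").lower()


def count_matching(records: Iterable[Row], predicate: Callable[[Row], bool]) -> int:
    """Count the records that satisfy the predicate."""
    return sum(1 for record in records if predicate(record))
```

Calling `count_matching` on the raw and on the cleaned tables with the same predicate gives two counts whose difference shows how many records a cleaning step recovered.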
conda env create -f environment.yml
conda activate coffee
python src/explore.py /path/to/dataset
python src/main.py /path/to/dataset
mypy src/*.py
python -m unittest discover -s src





