This repository contains tutorials on manipulating data frames and querying relational databases. Tools covered include:
- Traditional SQL tools such as PostgreSQL and MySQL;
- Python `Pandas` data frames;
- R `data.frame` and `data.table`; and
- Apache Spark, which ships with Spark DataFrame and Spark SQL.
In order to provide meaningful context and motivation, tutorials are structured as dataset-centric cases, with similar examples illustrated in multiple ways using different tools.
Note: the main SQL dialect we use here is PostgreSQL, which usually provides a very good starting point for learning SQL. Other SQL dialects such as MySQL, HiveQL, etc. have some syntactic differences compared with PostgreSQL, especially regarding advanced functionalities.
SQL – Basic:
- `SELECT` all columns / certain columns of a table
- `SELECT ... INTO TEMP TABLE ...` for creating a temporary / intermediate table from query results
- `DROP TABLE IF EXISTS ...` for deleting a temp table
- `LIMIT` the number of rows returned
- `AS` for aliasing / renaming
- `NULL` values for empty cells
- `DISTINCT` values of a column, or tuples of values across columns
- `WHERE` & logical comparisons:
  - `=` for equality
  - `!=`, `>`, `>=`, `<`, `<=` for inequalities
  - `IN` for checking membership in a list of values
  - `LIKE` & `ILIKE` for string matching
- `ORDER BY` for sorting
- Aggregating functions:
  - `COUNT`, `SUM` & `AVG`
  - `MAX` & `MIN`
- `GROUP BY` for grouped aggregation
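To make these concrete, here is a minimal sketch in Python using the `psycopg2` driver (installed in the setup section below). The connection parameters, the `trips` table, and its columns are hypothetical, purely for illustration:

```python
# A hypothetical example combining the basic clauses above; the database,
# table ("trips") and columns ("city", "fare") are made up for illustration.
import psycopg2

conn = psycopg2.connect(dbname='tutorial', user='analyst', host='localhost')
cur = conn.cursor()

cur.execute("""
    SELECT city,                      -- certain columns only
           COUNT(*)  AS num_trips,    -- aggregation with AS aliasing
           AVG(fare) AS avg_fare
      FROM trips
     WHERE fare IS NOT NULL           -- skip NULL (empty) cells
       AND city IN ('Chicago', 'Boston')
  GROUP BY city                       -- grouped aggregation
  ORDER BY avg_fare DESC              -- sorting
     LIMIT 10;                        -- cap the number of rows returned
""")
rows = cur.fetchall()
```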
SQL – Intermediate:
- `COALESCE(...)` for replacing `NULL` values
- `CAST(... AS ...)` for conversion among common data types:
  - `INT` for integers
  - `DOUBLE PRECISION` for double-precision floating-point numbers
  - `VARCHAR` / `TEXT` for strings
  - `DATE` for dates
  - `TIMESTAMP` for date-times
- `CASE WHEN ... THEN ... WHEN ... THEN ... ELSE ... END` conditional logic
- `JOIN` & `LEFT JOIN` among 2 or more tables
  - equality joins & non-equality joins
- `CROSS JOIN` to create a Cartesian product
- `VALUES` clause to create a table on the fly
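As a hedged illustration of these intermediate constructs, the sketch below reuses the hypothetical connection and `trips` table from the previous example and joins to an equally hypothetical `cities` table:

```python
# Hypothetical example of COALESCE, CAST, CASE WHEN and LEFT JOIN;
# "trips" and "cities" are illustrative table names, not part of this repo.
# Reuses the psycopg2 cursor `cur` from the previous sketch.
cur.execute("""
    SELECT t.city,
           COALESCE(t.tip, 0)               AS tip,          -- replace NULLs with 0
           CAST(t.fare AS DOUBLE PRECISION) AS fare,         -- data-type conversion
           CASE WHEN t.fare >= 50 THEN 'long'
                WHEN t.fare >= 10 THEN 'medium'
                ELSE 'short'
           END                              AS trip_length,  -- conditional logic
           c.population
      FROM trips AS t
      LEFT JOIN cities AS c
        ON t.city = c.city;                                  -- equality join
""")
rows = cur.fetchall()
```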
SQL – Advanced:
- String-manipulating functions:
  - `CONCAT` for concatenating strings
  - `SUBSTR` for getting part of a string
- Date/time-manipulating functions:
  - `EXTRACT(... FROM ...)` for getting a date/time component
  - `TO_CHAR` for converting a date/time component to a string
- Sub-queries & Common Table Expressions (CTEs)
- Windowing (`PARTITION BY` & `ORDER BY`) & windowed analytic functions:
  - `ROW_NUMBER()`, `RANK()` & `DENSE_RANK()`
  - `LAG` & `LEAD`
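The following hypothetical sketch, again against the made-up `trips` table and cursor from the earlier examples, shows a CTE combined with window functions and `TO_CHAR`:

```python
# Hypothetical example of a CTE plus window functions; the "trips" table and
# its "trip_time" / "fare" columns are illustrative only.
cur.execute("""
    WITH daily AS (                                     -- common table expression
        SELECT city,
               CAST(trip_time AS DATE) AS day,
               SUM(fare)               AS revenue
          FROM trips
      GROUP BY city, CAST(trip_time AS DATE)
    )
    SELECT city,
           day,
           TO_CHAR(day, 'Dy')                  AS weekday,          -- date -> string
           revenue,
           RANK() OVER (PARTITION BY city
                        ORDER BY revenue DESC) AS revenue_rank,     -- windowed ranking
           LAG(revenue) OVER (PARTITION BY city
                              ORDER BY day)    AS prev_day_revenue  -- previous row's value
      FROM daily;
""")
rows = cur.fetchall()
```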
Python Pandas `DataFrame` – Basic:
- `len(<Pandas DataFrame>)`: count number of rows
- `.columns`: list of columns
- `.info(...)`: summary, including column data types
- `.loc[..., ...]`: slicing by string-label indices and column names, plus boolean logical conditions
- `.iloc[..., ...]`: slicing by integer indices and integer column numbers
- `.ix[..., ...]`: versatile slicing by a mixture of integer and string indices and column names, plus boolean logical conditions
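A minimal sketch of these basic operations on a small made-up `DataFrame` (note that `.ix`, listed above, has since been removed from recent Pandas versions; `.loc` / `.iloc` cover its use cases):

```python
# Small made-up DataFrame to illustrate the basic operations above.
import pandas as pd

df = pd.DataFrame(
    {'city': ['Chicago', 'Boston', 'Chicago'],
     'fare': [12.5, 30.0, 7.25]},
    index=['a', 'b', 'c'])

len(df)                     # number of rows: 3
df.columns                  # Index(['city', 'fare'], dtype='object')
df.info()                   # summary, including column data types
df.loc['a':'b', 'fare']     # label-based slicing
df.loc[df.fare > 10, :]     # boolean-condition slicing
df.iloc[0:2, 1]             # integer-position slicing
```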
Intermediate:
- `.unique()`: list unique / distinct values from a `Pandas` series / `DataFrame` column
- `.drop_duplicates(...)`: return unique / distinct rows of a `Pandas DataFrame`
- `.isnull()`: detect `None` & `numpy.nan` values
- `.rename(...)`: rename columns
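A short sketch of the same, reusing the small `df` from the previous example:

```python
# Reusing the small made-up df from the previous sketch.
df['city'].unique()                      # distinct values of one column
df.drop_duplicates(subset=['city'])      # distinct rows, judged by selected columns
df.isnull()                              # boolean mask of None / numpy.nan cells
df.rename(columns={'fare': 'fare_usd'})  # rename columns (returns a new DataFrame)
```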
Advanced:
- `.groupby(...)`: similar to `GROUP BY` in SQL
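For example, grouped aggregation on the same small made-up `df`:

```python
# Grouped aggregation, analogous to SQL's GROUP BY ... COUNT / SUM / AVG.
df.groupby('city')['fare'].agg(['count', 'sum', 'mean'])
```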
R `data.frame` – Basic:
- `nrow(<data.frame>)`: count number of rows
- `colnames(<data.frame>)` / `names(<data.frame>)`: vector of column names
- `summary(<data.frame>)`: summarize the `data.frame`
- `<data.frame>[..., ]`: select rows of a `data.frame` by row numbers or by logical conditions
- `<data.frame>[..., c(<selected column names>)]`: select specific columns of a `data.frame`, and select certain rows only by either row numbers or logical conditions
Intermediate:
- `unique(<data.frame>)`: obtain unique / distinct rows of a `data.frame`
Advanced:
- `aggregate(...)`: aggregate by group, similar to `GROUP BY` in SQL
R `data.table` – Basic:
- `nrow(<data.table>)`: count number of rows
- `colnames(<data.table>)` / `names(<data.table>)`: vector of column names
- `summary(<data.table>)`: summarize the `data.table`
- `<data.table>[...]`: select rows of a `data.table` by row numbers or by logical conditions
- `<data.table>[..., .(<selected column names>)]`: select specific columns of a `data.table`, and select certain rows only by either row numbers or logical conditions
Intermediate:
- use of `get(...)` to get variables by name inside the `data.table` namespace
- use of `:=` for assignment within the `data.table` namespace
- use of `with=FALSE` to force literal interpretation of inputs passed into `[..., ...]`
- `setnames(...)` for renaming `data.table` column names
- `unique(<data.table>)`: obtain unique / distinct rows of a `data.table`
Advanced:
- `<data.table>[..., ..., by=...]`: aggregate by group, similar to `GROUP BY` in SQL
Spark DataFrame (Python) – Basic:
- `.count()`: count number of rows
- `.columns`: list of columns
- `.printSchema()`: summarize column data types
- `.show(...)`: show a certain number of first rows
- `.select(...)`: select certain columns
- `.toPandas()`: convert to a `Pandas DataFrame`
Intermediate:
- `.distinct()`: select distinct rows
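A minimal sketch of these Spark DataFrame operations, assuming a local PySpark session with Spark 2+ (where `SparkSession` is the entry point); on the AWS EMR clusters described in the setup section below, Spark comes pre-installed and an entry point is typically provided for you:

```python
# Hypothetical local PySpark session and a tiny made-up DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('tutorial-sketch').getOrCreate()

sdf = spark.createDataFrame(
    [('Chicago', 12.5), ('Boston', 30.0), ('Chicago', 12.5)],
    ['city', 'fare'])

sdf.count()                 # number of rows
sdf.columns                 # ['city', 'fare']
sdf.printSchema()           # column data types
sdf.show(2)                 # first 2 rows
sdf.select('city').show()   # certain columns only
sdf.distinct().count()      # distinct rows
pdf = sdf.toPandas()        # convert to a Pandas DataFrame
```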
SparkR DataFrame – Basic:
- `count(<SparkR SQL DataFrame>)`: count number of rows
- `columns(<SparkR SQL DataFrame>)`: vector of column names
- `printSchema(<SparkR SQL DataFrame>)`: summarize column data types
- `showDF(<SparkR SQL DataFrame>, <numRows>)`: show a certain number of first rows
- `select(...)`: select certain columns
- `as.data.frame`: convert to an R `data.frame`
Intermediate:
- `distinct(<SparkR SQL DataFrame>)`: select distinct rows
Below is a simple checklist of things you may want to examine to effectively explore a set of relational data:
- For each table, examine which columns are numeric and which are categorical in nature, and which represent a body of text or an array of values;
- For each categorical column, list out the distinct categories, count their frequencies, and, if there are too many, consider how you may group them into fewer super-categories;
- Define a number of metrics you believe are relevant; for each metric, examine its:
  - growth trends over time;
  - cyclical changes across hours of a day, days of a week, months of a year, etc.;
  - cross-sectional variations & rankings across categories / segments / tiers of data.
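As one hedged illustration of this checklist in Pandas (the columns and data below are made up, not from any tutorial dataset):

```python
# Made-up data to illustrate the exploration checklist above.
import pandas as pd

df = pd.DataFrame({
    'city': ['Chicago', 'Boston', 'Chicago', 'Chicago'],
    'fare': [12.5, 30.0, 7.25, 9.0],
    'trip_date': pd.to_datetime(
        ['2016-01-01', '2016-01-01', '2016-01-02', '2016-01-02'])})

df.dtypes                   # which columns are numeric vs. categorical / text
df['city'].value_counts()   # distinct categories and their frequencies
df['city'].nunique()        # number of distinct categories
# one example metric: daily total fares, for trend / cyclical exploration
df.groupby(df['trip_date'].dt.date)['fare'].sum()
```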
There are two possible setups for running these tutorials:
- Running on an Amazon Web Services Elastic MapReduce (AWS EMR) server cluster (recommended)
- Running on a single Mac or Windows machine that is sufficiently high-spec:
- RAM >= 12GB
- No. of Processor Cores >= 8
If you wish to use the recommended AWS EMR setup:
- Follow the instructions on the Chicago Booth Analytics wiki to set up the necessary software for working with AWS EMR;
- Test your setup:
  - Launch a shell command-line terminal window:
    - Mac: the default terminal;
    - Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
  - Navigate to `<path to your cloned Chicago Booth Analytics Software folder>/AWS/EMR` in the command-line terminal;
  - Bid for a basic EMR cluster of 1 Master + 2 Workers of the M3.xLarge server type, trying a price of around $0.050/server/hour, by running a command like:
    `sh create.sh -b <my-s3-bucket-in-cali> -m m3.xlarge -p 0.050 -n 2 -t m3.xlarge -q 0.050 -r "a normal cluster with M3.xLarge servers"`
  - Check the Northern California AWS EMR management console to verify that:
    - the cluster enters the "Bootstrapping" stage after about 15 minutes
      - if it does not, that means your bid price is too low; terminate the cluster and try again
    - the cluster enters the "Running" stage after about 30–45 minutes
  - Once the cluster is in the "Running" stage, connect to the cluster by running the command `sh connect -d <Your-Cluster-Master-Public-DNS>` and typing "yes" to accept any questions in the command-line terminal
    - the command should open a new tab in your internet browser, with address `localhost:8133`; if that does not happen, manually go to `localhost:8133` in a browser window
    - check that you see the Jupyter environment in the browser window
  - Terminate your cluster through the AWS EMR management console.
If your computer is sufficiently high-spec (RAM >= 12GB, no. of processor cores >= 8), you may try running the tutorials locally.
You will need the following software setup:
Git is the go-to software solution for version control / change-tracking of programming code and related materials.
Follow instructions on the Chicago Booth Analytics wiki to download and install:
- Git; and
- SourceTree.
Once you have installed Git and SourceTree, use SourceTree to clone the following GitHub repos into folders on your computer:
- `Software`, which contains scripts for installing some difficult software; and
- `RelationalData`, i.e. this tutorial repo.
JetBrains, a developer of some of the best integrated development environments (IDEs), has a nice IDE named DataGrip for working with relational databases.
Follow instructions on the Chicago Booth Analytics wiki to download and install DataGrip.
For Python, we highly recommend Continuum Analytics's Anaconda distribution, which helpfully pre-packages hundreds of useful packages for scientific computing and saves you the frustration of installing those on your own.
Follow instructions on the Chicago Booth Analytics wiki to download and install Anaconda Python v2.7.
Then, enter a shell command-line terminal – the default terminal on Mac / Git Bash terminal on Windows (see the Git installation instructions) – and:
- navigate to folder `<your local Chicago Booth Analytics Software repo folder>/Python` and run the following commands:
  - `sh Install-SQL-Related-Packages.sh`;
    - note that for Windows, in order to get the `psycopg2` package (essential for interacting with PostgreSQL databases):
      - go here;
      - download a `.whl` file appropriate for your Windows machine's processor (32-bit / 64-bit);
      - enter a command-line terminal and navigate to the download folder; and
      - run the command `pip install <the-downloaded-file-name.whl>`;
  - `sh Install-ApacheSpark-Related-Packages.sh`; and
  - `sh Install-Visualization-Packages.sh`.
Follow instructions on the Chicago Booth Analytics wiki to download and install R (version 3.2.3 or later).
Install certain R-related software:
- Launch a shell command-line terminal window:
  - Mac: the default terminal;
  - Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
- Navigate to `<path to your cloned Chicago Booth Analytics Software folder>/R` in the command-line terminal;
- Install the IRkernel for Jupyter, to allow us to run R in the Jupyter Notebook environment: run command `Rscript Install-JupyterIRKernel.R`
- Install basic R packages: run command `Rscript Install-Basic-Packages.R`
- Install SQL and data frame-related packages: run command `Rscript Install-SQL-and-DataFrame-Packages.R`
- Install visualization packages: run command `Rscript Install-Visualization-Packages.R`
- Launch a shell command-line terminal window:
  - Mac: the default terminal;
  - Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
- Navigate to `<path to your cloned Chicago Booth Analytics RelationalData folder>` (this repo!) in the command-line terminal;
- Launch the Jupyter environment: run command `jupyter notebook`
  - this should launch a browser tab with address `localhost:8888`
- In the Jupyter environment in the browser tab:
  - enter the `Test-Jupyter-Workbooks-for-Software-Setup` folder;
  - test the setup for Python: open the `Test-Python-Software-SetUps.ipynb` workbook, run all cells, and verify that there are no errors;
  - test the setup for R: open the `Test-R-Software-SetUps.ipynb` workbook, run all cells, and verify that there are no errors.