Tutorials on Data Frames and Relational Databases

This repository contains tutorials on manipulating data frames and querying relational databases. Tools covered include:

  • Traditional SQL tools such as PostgreSQL and MySQL;

  • Python Pandas data frames;

  • R data.frame and data.table; and

  • Apache Spark, which ships with Spark DataFrame and Spark SQL.

To provide meaningful context and motivation, the tutorials are structured as dataset-centric cases, with similar examples illustrated in multiple ways using different tools.

1. Outline of Topics

1.1. SQL

Note: the main SQL dialect used here is PostgreSQL, which provides a very good starting point for learning SQL. Other SQL dialects such as MySQL and HiveQL have some syntactic differences from PostgreSQL, especially in their more advanced functionality.

Basic:

  • SELECT all columns / certain columns of table
  • SELECT ... INTO TEMP TABLE ... for creating temporary / intermediate table from query results
    • DROP TABLE IF EXISTS ... for deleting temp table
  • LIMIT number of rows returned
  • AS for aliasing / renaming
  • NULL values for empty cells
  • DISTINCT values of a column, or tuples of values across columns
  • WHERE & logical comparisons:
    • = for equality
    • !=, >, >=, <, <= for inequalities
    • IN for checking membership in values list
    • LIKE & ILIKE for string matching
  • ORDER BY for sorting
  • Aggregating functions:
    • COUNT, SUM & AVG
    • MAX & MIN
  • GROUP BY for grouped aggregation
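
As a quick illustration of the basic keywords above, here is a minimal PostgreSQL sketch against a hypothetical orders table (columns order_id, customer_id, status, amount), which is not one of the tutorial datasets:

    -- top 10 customers by total spend on completed orders over $100
    SELECT customer_id,
           COUNT(*)    AS num_orders,
           SUM(amount) AS total_amount
    FROM orders
    WHERE status = 'completed'
      AND amount > 100
    GROUP BY customer_id
    ORDER BY total_amount DESC
    LIMIT 10;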

Intermediate:

  • COALESCE(...) for replacing NULL values
  • CAST(... AS ...) for conversion among common data types:
    • INT for integers
    • DOUBLE PRECISION for double-precision floating-point numbers
    • TEXT / VARCHAR for strings
    • DATE for dates
    • TIMESTAMP for date-times
  • CASE WHEN ... THEN ... WHEN ... THEN ... ELSE ... END for conditional logic
  • JOIN & LEFT JOIN among 2 or more tables
    • Equality Join & Non-Equality Join
  • CROSS JOIN to create Cartesian product
  • VALUES clause to create table on the fly
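
A minimal sketch of the intermediate keywords, again assuming hypothetical customers and orders tables rather than any of the tutorial datasets:

    -- keep every customer (even those with no orders), default missing
    -- amounts to 0, and label each order's size with CASE WHEN
    SELECT c.customer_id,
           c.name,
           CAST(COALESCE(o.amount, 0) AS INT) AS amount_int,
           CASE
             WHEN o.amount >= 1000 THEN 'large'
             WHEN o.amount >= 100  THEN 'medium'
             ELSE 'small'
           END                                AS order_size
    FROM customers c
    LEFT JOIN orders o
      ON o.customer_id = c.customer_id;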

Advanced:

  • String-manipulating functions
    • CONCAT for concatenating strings
    • SUBSTR for getting part of string
  • Date/Time-manipulating functions
    • EXTRACT(... FROM ...) for getting date/time component
    • TO_CHAR for converting date/time values to strings
  • Sub-queries & Common Table Expressions (CTEs)
  • Windowing (PARTITION BY & ORDER BY) & Windowed Analytics Functions:
    • ROW_NUMBER(), RANK() & DENSE_RANK()
    • LAG & LEAD
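
A minimal sketch combining a CTE with window functions, assuming the same hypothetical orders table plus an order_date column of type DATE:

    -- each customer's 2 most recent orders, with the gap since the previous order
    WITH ranked_orders AS (
      SELECT customer_id,
             order_id,
             order_date,
             ROW_NUMBER() OVER (PARTITION BY customer_id
                                ORDER BY order_date DESC) AS rn,
             LAG(order_date) OVER (PARTITION BY customer_id
                                   ORDER BY order_date)   AS prev_order_date
      FROM orders
    )
    SELECT customer_id,
           order_id,
           EXTRACT(YEAR FROM order_date) AS order_year,
           order_date - prev_order_date  AS days_since_prev_order
    FROM ranked_orders
    WHERE rn <= 2;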

1.2. Python Pandas DataFrame

Basic:

  • len(<Pandas DataFrame>): count number of rows
  • .columns: list of columns
  • .info(...): summary, including column data types
  • .loc[..., ...]: slicing by string-label indices and column names, plus boolean logical conditions
  • .iloc[..., ...]: slicing by integer indices and integer column numbers
  • .ix[..., ...]: versatile slicing by a mixture of integer and string indices and column names, plus boolean logical conditions (note: .ix is deprecated in newer versions of Pandas in favor of .loc and .iloc)
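
A minimal sketch of the basic Pandas operations above, using a small made-up DataFrame rather than any of the tutorial datasets:

    import pandas

    # a small hypothetical DataFrame, just to illustrate the slicing syntax
    df = pandas.DataFrame(
        {'customer': ['alice', 'bob', 'carol'],
         'amount': [120.0, 80.0, 250.0]},
        index=['a', 'b', 'c'])

    len(df)                                # number of rows: 3
    df.columns                             # Index(['customer', 'amount'], ...)
    df.info()                              # per-column data types & null counts
    df.loc[df.amount > 100, ['customer']]  # label- & boolean-based slicing
    df.iloc[0:2, 0:1]                      # integer position-based slicing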

Intermediate:

  • .unique(): list unique / distinct values from a Pandas series / DataFrame column
  • .drop_duplicates(...): return unique / distinct rows of a Pandas DataFrame
  • .isnull(): detect None & numpy.nan values
  • .rename(...): rename columns
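
A minimal sketch of these methods on a small made-up DataFrame containing a duplicate row and a missing value:

    import numpy
    import pandas

    df = pandas.DataFrame(
        {'cust': ['alice', 'alice', 'bob'],
         'amt': [120.0, 120.0, numpy.nan]})

    df['cust'].unique()                        # array(['alice', 'bob'], ...)
    df.drop_duplicates()                       # drops the repeated ('alice', 120.0) row
    df['amt'].isnull()                         # True where the value is NaN / None
    df = df.rename(columns={'amt': 'amount'})  # rename a column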

Advanced:

  • .groupby(...): similar to GROUP BY in SQL
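
For instance, a grouped aggregation on a small made-up orders DataFrame, analogous to SQL's GROUP BY with COUNT / SUM / AVG:

    import pandas

    orders = pandas.DataFrame(
        {'customer': ['alice', 'alice', 'bob'],
         'amount': [120.0, 30.0, 80.0]})

    # per-customer order count, total and average amount
    orders.groupby('customer')['amount'].agg(['count', 'sum', 'mean'])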

1.3. R data.frame

Basic:

  • nrow(<data.frame>): count number of rows
  • colnames(<data.frame>) / names(<data.frame>): vector of column names
  • summary(<data.frame>): summarize data.frame
  • <data.frame>[..., ]: select rows of data.frame by row numbers or by logical conditions
  • <data.frame>[..., c(<selected column names>)]: select specific columns of data.frame, and select certain rows only by either row numbers or logical conditions
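
A minimal sketch of the basic data.frame operations above, using a small made-up data.frame rather than any of the tutorial datasets:

    # a small hypothetical data.frame, just to illustrate the indexing syntax
    df <- data.frame(
      customer = c('alice', 'bob', 'carol'),
      amount   = c(120, 80, 250))

    nrow(df)                # number of rows: 3
    colnames(df)            # "customer" "amount"
    summary(df)             # per-column summary statistics
    df[df$amount > 100, ]   # rows selected by a logical condition
    df[1:2, c('customer')]  # first 2 rows of a selected column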

Intermediate:

  • unique(<data.frame>): obtain unique / distinct rows of data.frame

Advanced:

  • aggregate(...): aggregate by group, similar to GROUP BY in SQL
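
A minimal sketch of unique(...) and aggregate(...) on a small made-up orders data.frame:

    orders <- data.frame(
      customer = c('alice', 'alice', 'alice', 'bob'),
      amount   = c(120, 120, 30, 80))

    unique(orders)                                          # distinct rows
    aggregate(amount ~ customer, data = orders, FUN = sum)  # like SQL GROUP BY ... SUM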

1.4. R data.table

Basic:

  • nrow(<data.table>): count number of rows
  • colnames(<data.table>) / names(<data.table>): vector of column names
  • summary(<data.table>): summarize data.table
  • <data.table>[...]: select rows of data.table by row numbers or by logical conditions
  • <data.table>[..., .(<selected column names>)]: select specific columns of data.table, and select certain rows only by either row numbers or logical conditions
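
A minimal sketch of the basic data.table operations above, using a small made-up data.table:

    library(data.table)

    dt <- data.table(
      customer = c('alice', 'bob', 'carol'),
      amount   = c(120, 80, 250))

    nrow(dt)              # number of rows: 3
    names(dt)             # "customer" "amount"
    summary(dt)           # per-column summary statistics
    dt[amount > 100]      # rows selected by a logical condition
    dt[1:2, .(customer)]  # first 2 rows, selected column(s)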

Intermediate:

  • use of get(...) to get variables by name inside the data.table namespace
  • use of := for assignment within the data.table namespace
  • use of with=FALSE to force literal interpretation of inputs passed into [..., ...]
  • setnames(...) for renaming data.table column names
  • unique(<data.table>): obtain unique / distinct rows of data.table

Advanced:

  • <data.table>[..., ..., by=...]: aggregate by group, similar to GROUP BY in SQL
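
A minimal sketch of these data.table idioms on a small made-up orders data.table:

    library(data.table)

    orders <- data.table(
      customer = c('alice', 'alice', 'bob'),
      amt      = c(120, 30, 80))

    setnames(orders, 'amt', 'amount')                # rename a column in place
    orders[, amount_eur := amount * 0.9]             # := assigns a new column by reference
    col <- 'amount'
    orders[, get(col)]                               # look up a column by name
    orders[, c('customer'), with = FALSE]            # treat the column selection literally
    unique(orders)                                   # distinct rows
    orders[, .(total = sum(amount)), by = customer]  # like SQL GROUP BY ... SUM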

1.5a. PySpark SQL DataFrame

Basic:

  • .count(): count number of rows
  • .columns: list of columns
  • .printSchema(): summarize column data types
  • .show(...): show certain number of first rows
  • .select(...): select certain columns
  • .toPandas(): convert to a Pandas DataFrame

Intermediate:

  • .distinct(): select distinct rows
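
A minimal sketch of these PySpark DataFrame operations, assuming a Spark 2.x session is already available as spark (e.g. inside the cluster's Jupyter environment); on Spark 1.x, substitute your SQLContext for spark:

    # a small made-up Spark DataFrame, just to illustrate the method calls
    sdf = spark.createDataFrame(
        [('alice', 120.0), ('alice', 120.0), ('bob', 80.0)],
        ['customer', 'amount'])

    sdf.count()        # number of rows: 3
    sdf.columns        # ['customer', 'amount']
    sdf.printSchema()  # column names & data types
    sdf.show(2)        # display the first 2 rows
    sdf.select('customer').distinct().show()  # distinct values of a column
    sdf.toPandas()     # convert to a Pandas DataFrame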

1.5b. SparkR SQL DataFrame

Basic:

  • count(<SparkR SQL DataFrame>): count number of rows
  • columns(<SparkR SQL DataFrame>): vector of column names
  • printSchema(<SparkR SQL DataFrame>): summarize column data types
  • showDF(<SparkR SQL DataFrame>, <numRows>): show certain number of first rows
  • select(...): select certain columns
  • as.data.frame(<SparkR SQL DataFrame>): convert to an R data.frame

Intermediate:

  • distinct(<SparkR SQL DataFrame>): select distinct rows
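
A minimal sketch of these SparkR operations, assuming Spark 2.x with a SparkR session already started (older SparkR versions pass an sqlContext argument to createDataFrame instead):

    library(SparkR)

    # a small made-up SparkR DataFrame, just to illustrate the function calls
    sdf <- createDataFrame(
      data.frame(customer = c('alice', 'alice', 'bob'),
                 amount   = c(120, 120, 80)))

    count(sdf)             # number of rows: 3
    columns(sdf)           # "customer" "amount"
    printSchema(sdf)       # column names & data types
    showDF(sdf, 2)         # display the first 2 rows
    showDF(distinct(sdf))  # distinct rows
    as.data.frame(sdf)     # convert to an R data.frame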

2. Tips for Exploratory Analyses of Relational Data

Below is a simple checklist of things you may want to examine to effectively explore a set of relational data:

  • For each table, examine which columns are numeric and which are categorical in nature, and which represent a body of text or an array of values;

  • For each categorical column, list out the distinct categories, count their frequencies, and, if there are too many, consider how you may group them into fewer super-categories;

  • Define a number of metrics you believe are relevant; for each metric, examine its:

    • Growth trends over time;
    • Cyclical changes across hours of a day, days of a week, months of a year, etc.;
    • Cross-sectional variations & rankings across categories / segments / tiers of data.
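
For instance, in Pandas, the growth-trend and cross-sectional checks for a revenue metric might be sketched as follows, using a made-up orders DataFrame with order_date, category and amount columns:

    import pandas

    orders = pandas.DataFrame(
        {'order_date': pandas.to_datetime(
             ['2015-01-03', '2015-01-17', '2015-02-05', '2015-02-20']),
         'category': ['books', 'toys', 'books', 'books'],
         'amount': [20.0, 35.0, 15.0, 50.0]})

    orders['category'].value_counts()  # frequencies of each category

    # growth trend over time: monthly total of the metric
    orders.groupby(orders['order_date'].dt.to_period('M'))['amount'].sum()

    # cross-sectional variation: metric totals ranked across categories
    orders.groupby('category')['amount'].sum().sort_values(ascending=False)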

3. Software Installation Requirements & Guides

There are two possible setups for running these tutorials:

  1. Running on an Amazon Web Services Elastic MapReduce (AWS EMR) server cluster (recommended)
  2. Running on a single Mac or Windows machine that is sufficiently high-spec:
    • RAM >= 12GB
    • No. of Processor Cores >= 8

3.1. AWS EMR

If you wish to use the recommended AWS EMR setup:

  • Follow the instructions on the Chicago Booth Analytics wiki to set up the necessary software for working with AWS EMR;

  • Test your setup:

    • Launch a shell command-line terminal window:

      • Mac: the default terminal;
      • Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
    • Navigate to <path to your cloned Chicago Booth Analytics Software folder>/AWS/EMR in the command-line terminal;

    • Bid for a basic EMR cluster of 1 master + 2 workers of the m3.xlarge instance type, with a bid price of around $0.050 per server per hour, by running a command like: sh create.sh -b <my-s3-bucket-in-cali> -m m3.xlarge -p 0.050 -n 2 -t m3.xlarge -q 0.050 -r "a normal cluster with M3.xLarge servers"

    • Check the Northern California AWS EMR management console to verify that:

      • the cluster enters the "Bootstrapping" stage after about 15 minutes
        • if it does not, your bid price is probably too low; terminate the cluster and try again with a higher bid
      • the cluster enters the "Running" stage after about 30-45 minutes
    • Once the cluster is in the "Running" stage, connect to it by running the command sh connect -d <Your-Cluster-Master-Public-DNS> and typing "yes" in response to any prompts in the command-line terminal

      • the command should open a new tab in your internet browser at the address localhost:8133; if that does not happen, manually go to localhost:8133 in a browser window
      • check that you see the Jupyter environment in the browser window
    • Terminate your cluster through the AWS EMR management console.

3.2. Single Mac or Windows machine

If your computer is sufficiently high-spec (RAM >= 12GB, No. of Processor Cores >= 8), you may try running the tutorials locally.

You will need the following software setup:

3.2.1. Git & Related Version-Control Software

Git is the go-to software solution for version control / change-tracking of programming code and related materials.

Follow instructions on the Chicago Booth Analytics wiki to download and install:

  • Git; and
  • SourceTree.

3.2.2. Clone Chicago Booth Analytics's Software and RelationalData Repos onto Your Computer

Once you have installed Git and SourceTree, use SourceTree to clone the following GitHub repos into folders on your computer:

  • Software, which contains scripts for installing some difficult-to-install software; and
  • RelationalData, i.e. this tutorial repo.

3.2.3. JetBrains DataGrip SQL IDE

JetBrains, a developer of some of the best integrated development environments (IDEs), has a nice IDE named DataGrip for working with relational databases.

Follow instructions on the Chicago Booth Analytics wiki to download and install DataGrip.

3.2.4. Anaconda Python v2.7 & Python Packages

For Python, we highly recommend Continuum Analytics's Anaconda distribution, which helpfully pre-packages hundreds of useful packages for scientific computing and saves you the frustration of installing those on your own.

Follow instructions on the Chicago Booth Analytics wiki to download and install Anaconda Python v2.7.

Then, enter a shell command-line terminal – the default terminal on Mac / Git Bash terminal on Windows (see the Git installation instructions) – and:

  • navigate to folder <your local Chicago Booth Analytics Software repo folder>/Python and run the following commands:
    • sh Install-SQL-Related-Packages.sh;
      • note that on Windows, in order to get the Psycopg2 package (essential for interacting with PostgreSQL databases):
        • go here;
        • download a .whl file appropriate for your Windows machine's processor (32-bit / 64-bit);
        • enter a command-line terminal, navigate to the download folder; and
        • run command: pip install <the-downloaded-file-name.whl>;
    • sh Install-ApacheSpark-Related-Packages.sh; and
    • sh Install-Visualization-Packages.sh.

3.2.5. R & R Packages

Follow instructions on the Chicago Booth Analytics wiki to download and install R version 3.2.3 or later.

Install certain R-related software:

  • Launch a shell command-line terminal window:

    • Mac: the default terminal;
    • Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
  • Navigate to <path to your cloned Chicago Booth Analytics Software folder>/R in the command-line terminal;

  • Install the IRkernel for Jupyter, which allows R to run in the Jupyter Notebook environment: run command Rscript Install-JupyterIRKernel.R

  • Install basic R packages: run command Rscript Install-Basic-Packages.R

  • Install SQL and Data Frame-related packages: run command Rscript Install-SQL-and-DataFrame-Packages.R

  • Install Visualization packages: run command Rscript Install-Visualization-Packages.R

3.2.6. Test Your Local Mac / Windows Setup

  • Launch a shell command-line terminal window:

    • Mac: the default terminal;
    • Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
  • Navigate to <path to your cloned Chicago Booth Analytics RelationalData folder> (this repo!) in the command-line terminal;

  • Launch the Jupyter environment: run command jupyter notebook

    • this should launch a browser tab at the address localhost:8888
  • In the Jupyter environment in the browser tab:

    • Enter the Test-Jupyter-Workbooks-for-Software-Setup folder;
    • Test the setup for Python: open the Test-Python-Software-SetUps.ipynb workbook, run all cells, and verify that there are no errors;
    • Test the setup for R: open the Test-R-Software-SetUps.ipynb workbook, run all cells, and verify that there are no errors.
