This repository contains tutorials on manipulating data frames and querying relational databases. Tools covered include:
- Traditional SQL tools such as PostgreSQL and MySQL;
- Python `Pandas` data frames;
- R `data.frame` and `data.table`; and
- Apache Spark, which ships with Spark DataFrame and Spark SQL.
In order to provide meaningful context and motivation, tutorials are structured as dataset-centric cases, with similar examples illustrated in multiple ways using different tools.
Note: the main SQL dialect we use here is PostgreSQL, which usually provides a very good starting point for learning SQL. Other SQL dialects such as MySQL, HiveQL, etc. have some syntactic differences compared with PostgreSQL, especially regarding advanced functionalities.
SQL – Basic:
- `SELECT` all columns / certain columns of a table
- `SELECT ... INTO TEMP TABLE ...` for creating a temporary / intermediate table from query results
- `DROP TABLE IF EXISTS ...` for deleting a temp table
- `LIMIT` the number of rows returned
- `AS` for aliasing / renaming
- `NULL` values for empty cells
- `DISTINCT` values of a column, or tuples of values across columns
- `WHERE` & logical comparisons:
  - `=` for equality
  - `!=`, `>`, `>=`, `<`, `<=` for inequalities
  - `IN` for checking membership in a list of values
  - `LIKE` & `ILIKE` for string matching
- `ORDER BY` for sorting
- Aggregating functions:
  - `COUNT`, `SUM` & `AVG`
  - `MAX` & `MIN`
- `GROUP BY` for grouped aggregation
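To make these concrete, here is a minimal sketch in Python using the `psycopg2` driver (installed in the setup section below). The connection parameters, the `trips` table, and its columns are hypothetical, purely for illustration:

```python
# A hypothetical example combining the basic clauses above; the database,
# table ("trips") and columns ("city", "fare") are made up for illustration.
import psycopg2

conn = psycopg2.connect(dbname='tutorial', user='analyst', host='localhost')
cur = conn.cursor()

cur.execute("""
    SELECT city,                      -- certain columns only
           COUNT(*)  AS num_trips,    -- aggregation with AS aliasing
           AVG(fare) AS avg_fare
      FROM trips
     WHERE fare IS NOT NULL           -- skip NULL (empty) cells
       AND city IN ('Chicago', 'Boston')
  GROUP BY city                       -- grouped aggregation
  ORDER BY avg_fare DESC              -- sorting
     LIMIT 10;                        -- cap the number of rows returned
""")
rows = cur.fetchall()
```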
SQL – Intermediate:
- `COALESCE(...)` for replacing `NULL` values
- `CAST(... AS ...)` for conversion among common data types:
  - `INT` for integers
  - `DOUBLE PRECISION` for double-precision floating-point numbers
  - `VARCHAR` / `TEXT` for strings
  - `DATE` for dates
  - `TIMESTAMP` for date-times
- `CASE WHEN ... THEN ... WHEN ... THEN ... ELSE ... END` conditional logic
- `JOIN` & `LEFT JOIN` among 2 or more tables
  - equality joins & non-equality joins
- `CROSS JOIN` to create a Cartesian product
- `VALUES` clause to create a table on the fly
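As a hedged illustration of these intermediate constructs, the sketch below reuses the hypothetical connection and `trips` table from the previous example and joins to an equally hypothetical `cities` table:

```python
# Hypothetical example of COALESCE, CAST, CASE WHEN and LEFT JOIN;
# "trips" and "cities" are illustrative table names, not part of this repo.
# Reuses the psycopg2 cursor `cur` from the previous sketch.
cur.execute("""
    SELECT t.city,
           COALESCE(t.tip, 0)               AS tip,          -- replace NULLs with 0
           CAST(t.fare AS DOUBLE PRECISION) AS fare,         -- data-type conversion
           CASE WHEN t.fare >= 50 THEN 'long'
                WHEN t.fare >= 10 THEN 'medium'
                ELSE 'short'
           END                              AS trip_length,  -- conditional logic
           c.population
      FROM trips AS t
      LEFT JOIN cities AS c
        ON t.city = c.city;                                  -- equality join
""")
rows = cur.fetchall()
```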
SQL – Advanced:
- String-manipulating functions:
  - `CONCAT` for concatenating strings
  - `SUBSTR` for getting part of a string
- Date/time-manipulating functions:
  - `EXTRACT(... FROM ...)` for getting a date/time component
  - `TO_CHAR` for converting a date/time component to a string
- Sub-queries & Common Table Expressions (CTEs)
- Windowing (`PARTITION BY` & `ORDER BY`) & windowed analytic functions:
  - `ROW_NUMBER()`, `RANK()` & `DENSE_RANK()`
  - `LAG` & `LEAD`
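The following hypothetical sketch, again against the made-up `trips` table and cursor from the earlier examples, shows a CTE combined with window functions and `TO_CHAR`:

```python
# Hypothetical example of a CTE plus window functions; the "trips" table and
# its "trip_time" / "fare" columns are illustrative only.
cur.execute("""
    WITH daily AS (                                     -- common table expression
        SELECT city,
               CAST(trip_time AS DATE) AS day,
               SUM(fare)               AS revenue
          FROM trips
      GROUP BY city, CAST(trip_time AS DATE)
    )
    SELECT city,
           day,
           TO_CHAR(day, 'Dy')                  AS weekday,          -- date -> string
           revenue,
           RANK() OVER (PARTITION BY city
                        ORDER BY revenue DESC) AS revenue_rank,     -- windowed ranking
           LAG(revenue) OVER (PARTITION BY city
                              ORDER BY day)    AS prev_day_revenue  -- previous row's value
      FROM daily;
""")
rows = cur.fetchall()
```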
Python Pandas `DataFrame` – Basic:
- `len(<Pandas DataFrame>)`: count number of rows
- `.columns`: list of columns
- `.info(...)`: summary, including column data types
- `.loc[..., ...]`: slicing by string-label indices and column names, plus boolean logical conditions
- `.iloc[..., ...]`: slicing by integer indices and integer column numbers
- `.ix[..., ...]`: versatile slicing by a mixture of integer and string indices and column names, plus boolean logical conditions
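A minimal sketch of these basic operations on a small made-up `DataFrame` (note that `.ix`, listed above, has since been removed from recent Pandas versions; `.loc` / `.iloc` cover its use cases):

```python
# Small made-up DataFrame to illustrate the basic operations above.
import pandas as pd

df = pd.DataFrame(
    {'city': ['Chicago', 'Boston', 'Chicago'],
     'fare': [12.5, 30.0, 7.25]},
    index=['a', 'b', 'c'])

len(df)                     # number of rows: 3
df.columns                  # Index(['city', 'fare'], dtype='object')
df.info()                   # summary, including column data types
df.loc['a':'b', 'fare']     # label-based slicing
df.loc[df.fare > 10, :]     # boolean-condition slicing
df.iloc[0:2, 1]             # integer-position slicing
```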
Intermediate:
- `.unique()`: list unique / distinct values from a `Pandas` series / `DataFrame` column
- `.drop_duplicates(...)`: return unique / distinct rows of a `Pandas DataFrame`
- `.isnull()`: detect `None` & `numpy.nan` values
- `.rename(...)`: rename columns
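A short sketch of the same, reusing the small `df` from the previous example:

```python
# Reusing the small made-up df from the previous sketch.
df['city'].unique()                      # distinct values of one column
df.drop_duplicates(subset=['city'])      # distinct rows, judged by selected columns
df.isnull()                              # boolean mask of None / numpy.nan cells
df.rename(columns={'fare': 'fare_usd'})  # rename columns (returns a new DataFrame)
```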
Advanced:
- `.groupby(...)`: similar to `GROUP BY` in SQL
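For example, grouped aggregation on the same small made-up `df`:

```python
# Grouped aggregation, analogous to SQL's GROUP BY ... COUNT / SUM / AVG.
df.groupby('city')['fare'].agg(['count', 'sum', 'mean'])
```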
R `data.frame` – Basic:
- `nrow(<data.frame>)`: count number of rows
- `colnames(<data.frame>)` / `names(<data.frame>)`: vector of column names
- `summary(<data.frame>)`: summarize the `data.frame`
- `<data.frame>[..., ]`: select rows of a `data.frame` by row numbers or by logical conditions
- `<data.frame>[..., c(<selected column names>)]`: select specific columns of a `data.frame`, and select certain rows only by either row numbers or logical conditions
Intermediate:
- `unique(<data.frame>)`: obtain unique / distinct rows of a `data.frame`
Advanced:
- `aggregate(...)`: aggregate by group, similar to `GROUP BY` in SQL
R `data.table` – Basic:
- `nrow(<data.table>)`: count number of rows
- `colnames(<data.table>)` / `names(<data.table>)`: vector of column names
- `summary(<data.table>)`: summarize the `data.table`
- `<data.table>[...]`: select rows of a `data.table` by row numbers or by logical conditions
- `<data.table>[..., .(<selected column names>)]`: select specific columns of a `data.table`, and select certain rows only by either row numbers or logical conditions
Intermediate:
- use of `get(...)` to get variables by name inside the `data.table` namespace
- use of `:=` for assignment within the `data.table` namespace
- use of `with=FALSE` to force literal interpretation of inputs passed into `[..., ...]`
- `setnames(...)` for renaming `data.table` column names
- `unique(<data.table>)`: obtain unique / distinct rows of a `data.table`
Advanced:
- `<data.table>[..., ..., by=...]`: aggregate by group, similar to `GROUP BY` in SQL
Spark DataFrame (Python) – Basic:
- `.count()`: count number of rows
- `.columns`: list of columns
- `.printSchema()`: summarize column data types
- `.show(...)`: show a certain number of first rows
- `.select(...)`: select certain columns
- `.toPandas()`: convert to a `Pandas DataFrame`
Intermediate:
- `.distinct()`: select distinct rows
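A minimal sketch of these Spark DataFrame operations, assuming a local PySpark session with Spark 2+ (where `SparkSession` is the entry point); on the AWS EMR clusters described in the setup section below, Spark comes pre-installed and an entry point is typically provided for you:

```python
# Hypothetical local PySpark session and a tiny made-up DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('tutorial-sketch').getOrCreate()

sdf = spark.createDataFrame(
    [('Chicago', 12.5), ('Boston', 30.0), ('Chicago', 12.5)],
    ['city', 'fare'])

sdf.count()                 # number of rows
sdf.columns                 # ['city', 'fare']
sdf.printSchema()           # column data types
sdf.show(2)                 # first 2 rows
sdf.select('city').show()   # certain columns only
sdf.distinct().count()      # distinct rows
pdf = sdf.toPandas()        # convert to a Pandas DataFrame
```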
SparkR DataFrame – Basic:
- `count(<SparkR SQL DataFrame>)`: count number of rows
- `columns(<SparkR SQL DataFrame>)`: vector of column names
- `printSchema(<SparkR SQL DataFrame>)`: summarize column data types
- `showDF(<SparkR SQL DataFrame>, <numRows>)`: show a certain number of first rows
- `select(...)`: select certain columns
- `as.data.frame`: convert to an R `data.frame`
Intermediate:
- `distinct(<SparkR SQL DataFrame>)`: select distinct rows
Below is a simple checklist of things you may want to examine to effectively explore a set of relational data:
- For each table, examine which columns are numeric and which are categorical in nature, and which represent a body of text or an array of values;
- For each categorical column, list out the distinct categories, count their frequencies, and, if there are too many, consider how you may group them into fewer super-categories;
- Define a number of metrics you believe are relevant; for each metric, examine its:
  - growth trends over time;
  - cyclical changes across hours of a day, days of a week, months of a year, etc.;
  - cross-sectional variations & rankings across categories / segments / tiers of data.
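As one hedged illustration of this checklist in Pandas (the columns and data below are made up, not from any tutorial dataset):

```python
# Made-up data to illustrate the exploration checklist above.
import pandas as pd

df = pd.DataFrame({
    'city': ['Chicago', 'Boston', 'Chicago', 'Chicago'],
    'fare': [12.5, 30.0, 7.25, 9.0],
    'trip_date': pd.to_datetime(
        ['2016-01-01', '2016-01-01', '2016-01-02', '2016-01-02'])})

df.dtypes                   # which columns are numeric vs. categorical / text
df['city'].value_counts()   # distinct categories and their frequencies
df['city'].nunique()        # number of distinct categories
# one example metric: daily total fares, for trend / cyclical exploration
df.groupby(df['trip_date'].dt.date)['fare'].sum()
```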
There are two possible setups for running these tutorials:
- Running on an Amazon Web Services Elastic MapReduce (AWS EMR) server cluster (recommended)
- Running on a single Mac or Windows machine that is sufficiently high-spec:
- RAM >= 12GB
- No. of Processor Cores >= 8
If you wish to use the recommended AWS EMR setup:
- Follow the instructions on the Chicago Booth Analytics wiki to set up the necessary software for working with AWS EMR;
- Test your setup:
  - Launch a shell command-line terminal window:
    - Mac: the default terminal;
    - Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
  - Navigate to `<path to your cloned Chicago Booth Analytics Software folder>/AWS/EMR` in the command-line terminal;
  - Bid for a basic EMR cluster of 1 Master + 2 Workers of the M3.xLarge server type, trying a price of around $0.050/server/hour, by running a command like:
    `sh create.sh -b <my-s3-bucket-in-cali> -m m3.xlarge -p 0.050 -n 2 -t m3.xlarge -q 0.050 -r "a normal cluster with M3.xLarge servers"`
  - Check the Northern California AWS EMR management console to verify that:
    - the cluster enters the "Bootstrapping" stage after about 15 minutes
      - if it does not, that means your bid price is too low; terminate the cluster and try again
    - the cluster enters the "Running" stage after about 30–45 minutes
  - Once the cluster is in the "Running" stage, connect to the cluster by running the command `sh connect -d <Your-Cluster-Master-Public-DNS>` and typing "yes" to accept any questions in the command-line terminal
    - the command should open a new tab in your internet browser, with address `localhost:8133`; if that does not happen, manually go to `localhost:8133` in a browser window
    - check that you see the Jupyter environment in the browser window
  - Terminate your cluster through the AWS EMR management console.
If your computer is sufficiently high-spec (RAM >= 12GB, no. of processor cores >= 8), you may try running the tutorials locally.
You will need the following software setup:
Git is the go-to software solution for version control / change-tracking of programming code and related materials.
Follow instructions on the Chicago Booth Analytics wiki to download and install:
- Git; and
- SourceTree.
Once you have installed Git and SourceTree, use SourceTree to clone the following GitHub repos into folders on your computer:
- `Software`, which contains scripts for installing some difficult software; and
- `RelationalData`, i.e. this tutorial repo.
JetBrains, a developer of some of the best integrated development environments (IDEs), has a nice IDE named DataGrip for working with relational databases.
Follow instructions on the Chicago Booth Analytics wiki to download and install DataGrip.
For Python, we highly recommend Continuum Analytics's Anaconda distribution, which helpfully pre-packages hundreds of useful packages for scientific computing and saves you the frustration of installing those on your own.
Follow instructions on the Chicago Booth Analytics wiki to download and install Anaconda Python v2.7.
Then, enter a shell command-line terminal – the default terminal on Mac / Git Bash terminal on Windows (see the Git installation instructions) – and:
- navigate to folder `<your local Chicago Booth Analytics Software repo folder>/Python` and run the following commands:
  - `sh Install-SQL-Related-Packages.sh`;
    - note that for Windows, in order to get the `psycopg2` package (essential for interacting with PostgreSQL databases):
      - go here;
      - download a `.whl` file appropriate for your Windows machine's processor (32-bit / 64-bit);
      - enter a command-line terminal and navigate to the download folder; and
      - run the command `pip install <the-downloaded-file-name.whl>`;
  - `sh Install-ApacheSpark-Related-Packages.sh`; and
  - `sh Install-Visualization-Packages.sh`.
Follow instructions on the Chicago Booth Analytics wiki to download and install R (version 3.2.3 or later).
Install certain R-related software:
- Launch a shell command-line terminal window:
  - Mac: the default terminal;
  - Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
- Navigate to `<path to your cloned Chicago Booth Analytics Software folder>/R` in the command-line terminal;
- Install the IRkernel for Jupyter, to allow us to run R in the Jupyter Notebook environment: run command `Rscript Install-JupyterIRKernel.R`
- Install basic R packages: run command `Rscript Install-Basic-Packages.R`
- Install SQL and data frame-related packages: run command `Rscript Install-SQL-and-DataFrame-Packages.R`
- Install visualization packages: run command `Rscript Install-Visualization-Packages.R`
- Launch a shell command-line terminal window:
  - Mac: the default terminal;
  - Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
- Navigate to `<path to your cloned Chicago Booth Analytics RelationalData folder>` (this repo!) in the command-line terminal;
- Launch the Jupyter environment: run command `jupyter notebook`
  - this should launch a browser tab with address `localhost:8888`
- In the Jupyter environment in the browser tab:
  - enter the `Test-Jupyter-Workbooks-for-Software-Setup` folder;
  - test the setup for Python: open the `Test-Python-Software-SetUps.ipynb` workbook, run all cells, and verify that there are no errors;
  - test the setup for R: open the `Test-R-Software-SetUps.ipynb` workbook, run all cells, and verify that there are no errors.