Exploring the TCGA data in BigQuery

The ISB-CGC (isb-cgc.org) project has aggregated and curated all of the TCGA open-access clinical, biospecimen, and Level-3 molecular data and uploaded it into BigQuery tables that are open to the public. Here we will show you how you can begin to work with these tables from the familiar R environment.

Helpful BigQuery links

For this example, we'll also be working with Google BigQuery. It's often helpful to have a link to the docs handy, and especially the query reference.

Run a query from R

We will start by loading four R packages:

the bigrquery package written by Hadley Wickham implements an R interface to Google BigQuery,
the dplyr package provides a set of tools for efficiently manipulating datasets in R, and
the ggplot2 package for elegant graphics, and
the scales package for visualization-oriented scale functions.

require(dplyr) || install.packages("dplyr")

## [1] TRUE

require(bigrquery) || install.packages("bigrquery")

## [1] TRUE

require(scales) || install.packages("scales")

## [1] TRUE

require(ggplot2) || install.packages("ggplot2")

## [1] TRUE

library(ISBCGCExamples)

######################[ TIP ]########################################
## Set the Google Cloud Platform project id under which these queries will run.
##
## If you are using the Google Bioconductor workshop docker image, this is already
## set for you in your .Rprofile and you can skip this step.

# project <- "YOUR-PROJECT-ID"
#####################################################################

Let's start by working with one of the simplest tables, the Clinical_data table. The format of a table name in BigQuery is <project_name>:<dataset_name>.<table_name>

theTable <- "isb-cgc:tcga_201510_alpha.Clinical_data"

Note that when you send the first query, you will need to go through the authentication flow with BigQuery. You will be provided with a url to cut and paste into your browser, and then you will get an authorization code to cut and paste back here.

bigrquery uses the package httr to perform OAuth.

Let's start by just counting the number of records in the table. First we'll just create the query string and echo it:

querySql <- paste("SELECT COUNT(1) FROM [", theTable, "]", sep="")
querySql

## [1] "SELECT COUNT(1) FROM [isb-cgc:tcga_201510_alpha.Clinical_data]"

And then we'll send the query to the cloud for execution:

result <- query_exec(querySql, project=project)
result

##     f0_
## 1 11152

And we see that the table has 1 row - this is the number of unique patients or participants across all of the various TCGA studies.

######################[ TIP ]########################################
## If you have any trouble with OAuth and need to redo/reset OAuth,
## run the following code.

#if (FALSE != getOption("httr_oauth_cache")) {
#  file.remove(getOption("httr_oauth_cache"))
#}

## or maybe it should look like this?

#if (!is.null(getOption("httr_oauth_cache"))) {
#  file.remove(getOption("httr_oauth_cache"))
#}

#message("Restart R to redo/reset OAuth.")
#####################################################################

Run a query using the BigQuery Web User Interface

So what is actually in this table? Click on this link to view the schema in the BigQuery web user interface.

We can also run the exact same query using the BigQuery web user interface. In the BigQuery web user interface:

(1) click on the "Compose Query" button (2) paste in the SQL for the query we just ran via R (3) click on "Run Query".

Run a query stored in a file from R

Instead of typing SQL directly into our R code, we can use a convenience function to read SQL from a file.

library(ISBCGCExamples)
DisplayAndDispatchQuery

## function (queryUri, project, replacements = list()) 
## {
##     if (missing(queryUri)) {
##         stop("Pass the file path or url to the file containing the query.")
##     }
##     if (missing(project)) {
##         stop("Pass the project id of your Google Cloud Platform project.")
##     }
##     if (grepl("^https.*", queryUri)) {
##         querySql <- RCurl::getURL(queryUri, ssl.verifypeer = FALSE)
##     }
##     else {
##         querySql <- readChar(queryUri, nchars = 1e+06)
##     }
##     for (replacement in names(replacements)) {
##         querySql <- gsub(replacement, replacements[[replacement]], 
##             querySql, fixed = TRUE)
##     }
##     cat(querySql)
##     bigrquery::query_exec(querySql, project)
## }
## <environment: namespace:ISBCGCExamples>

This allows queries to be more easily shared among analyses and also reused for different tables. For example, in the following file we have a query that will count the number of patients, grouped by disease type in any of our TCGA data tables.

file.show(file.path(system.file(package = "ISBCGCExamples"),
                    "sql",
                    "count-patients-by-study.sql"))

Now let's run the query to see what these counts are in the Clinical_data table:

result <- DisplayAndDispatchQuery(file.path(system.file(package = "ISBCGCExamples"),
                                            "sql",
                                            "count-patients-by-study.sql"),
                                  project=project,
                                  replacements=list("_THE_TABLE_"=theTable))

# all of the TCGA molecular data tables contain the fields ParticipantBarcode and Study

SELECT Study, COUNT(*) AS n
FROM (
    SELECT ParticipantBarcode, Study
    FROM [isb-cgc:tcga_201510_alpha.Clinical_data]
    GROUP BY ParticipantBarcode, Study
) GROUP BY Study
ORDER BY n DESC

cat("Number of rows returned by this query: ", nrow(result), "\n")

Number of rows returned by this query:  33

Results from bigrquery are returned as R dataframes, meaning that we can make use of all of the regular dataframe functions as well as all sorts of other great R packages to do our downstream work.

mode(result)

## [1] "list"

class(result)

## [1] "data.frame"

summary(result)

##     Study                 n         
##  Length:33          Min.   :  36.0  
##  Class :character   1st Qu.: 134.0  
##  Mode  :character   Median : 308.0  
##                     Mean   : 337.9  
##                     3rd Qu.: 507.0  
##                     Max.   :1097.0

head(result)

##   Study    n
## 1  BRCA 1097
## 2   GBM  596
## 3    OV  587
## 4  UCEC  548
## 5  KIRC  537
## 6  HNSC  528

Visualize Query Results

Since there are over 30 distinct tumor types within the TCGA project, we may want to filter our results before visualizing. For example let's look only at tumor types with at least 500 patients in the study:

subsetResults <- filter(result, n>=500)
subsetResults <- arrange(subsetResults,desc(n))

and then create a barchart of the patient counts:

ggplot(subsetResults, aes(x=Study, y=n, fill=Study)) +
  geom_bar(stat="identity") +
  ylab("Number of Patients") +
  ggtitle("Study Size")

Provenance

sessionInfo()

R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ISBCGCExamples_0.1.1 ggplot2_2.1.0        scales_0.4.0        
[4] bigrquery_0.2.0      dplyr_0.4.3          knitr_1.12.3        

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      magrittr_1.5     munsell_0.4.3    colorspace_1.2-6
 [5] R6_2.1.2         stringr_1.0.0    httr_1.1.0       plyr_1.8.3      
 [9] tools_3.2.4      parallel_3.2.4   grid_3.2.4       gtable_0.2.0    
[13] DBI_0.3.1        openssl_0.9.2    lazyeval_0.1.10  assertthat_0.1  
[17] digest_0.6.9     formatR_1.3      curl_0.9.6       evaluate_0.8.3  
[21] labeling_0.3     stringi_1.0-1    jsonlite_0.9.19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exploring the TCGA data in BigQuery

Helpful BigQuery links

Run a query from R

Run a query using the BigQuery Web User Interface

Run a query stored in a file from R

Visualize Query Results

Provenance

FilesExpand file tree

BigQueryIntroduction.md

Latest commit

History

BigQueryIntroduction.md

File metadata and controls

Exploring the TCGA data in BigQuery

Helpful BigQuery links

Run a query from R

Run a query using the BigQuery Web User Interface

Run a query stored in a file from R

Visualize Query Results

Provenance