The ISB-CGC (isb-cgc.org) project has aggregated and curated all of the TCGA open-access clinical, biospecimen, and Level-3 molecular data and uploaded it into BigQuery tables that are open to the public. Here we will show you how you can begin to work with these tables from the familiar R environment.
For this example, we'll also be working with Google BigQuery. It's often helpful to have a link to the docs handy, and especially the query reference.
We will start by loading four R packages:
- the bigrquery package written by Hadley Wickham implements an R interface to Google BigQuery,
- the dplyr package provides a set of tools for efficiently manipulating datasets in R, and
- the ggplot2 package for elegant graphics, and
- the scales package for visualization-oriented scale functions.
require(dplyr) || install.packages("dplyr")## [1] TRUE
require(bigrquery) || install.packages("bigrquery")## [1] TRUE
require(scales) || install.packages("scales")## [1] TRUE
require(ggplot2) || install.packages("ggplot2")## [1] TRUE
library(ISBCGCExamples)######################[ TIP ]########################################
## Set the Google Cloud Platform project id under which these queries will run.
##
## If you are using the Google Bioconductor workshop docker image, this is already
## set for you in your .Rprofile and you can skip this step.
# project <- "YOUR-PROJECT-ID"
#####################################################################Let's start by working with one of the simplest tables, the Clinical_data table. The format of a table name in BigQuery is <project_name>:<dataset_name>.<table_name>
theTable <- "isb-cgc:tcga_201510_alpha.Clinical_data"Note that when you send the first query, you will need to go through the authentication flow with BigQuery. You will be provided with a url to cut and paste into your browser, and then you will get an authorization code to cut and paste back here.
bigrquery uses the package httr to perform OAuth.
Let's start by just counting the number of records in the table. First we'll just create the query string and echo it:
querySql <- paste("SELECT COUNT(1) FROM [", theTable, "]", sep="")
querySql## [1] "SELECT COUNT(1) FROM [isb-cgc:tcga_201510_alpha.Clinical_data]"
And then we'll send the query to the cloud for execution:
result <- query_exec(querySql, project=project)
result## f0_
## 1 11152
And we see that the table has 1 row - this is the number of unique patients or participants across all of the various TCGA studies.
######################[ TIP ]########################################
## If you have any trouble with OAuth and need to redo/reset OAuth,
## run the following code.
#if (FALSE != getOption("httr_oauth_cache")) {
# file.remove(getOption("httr_oauth_cache"))
#}
## or maybe it should look like this?
#if (!is.null(getOption("httr_oauth_cache"))) {
# file.remove(getOption("httr_oauth_cache"))
#}
#message("Restart R to redo/reset OAuth.")
#####################################################################So what is actually in this table? Click on this link to view the schema in the BigQuery web user interface.
We can also run the exact same query using the BigQuery web user interface. In the BigQuery web user interface:
(1) click on the "Compose Query" button (2) paste in the SQL for the query we just ran via R (3) click on "Run Query".
Instead of typing SQL directly into our R code, we can use a convenience function to read SQL from a file.
library(ISBCGCExamples)
DisplayAndDispatchQuery## function (queryUri, project, replacements = list())
## {
## if (missing(queryUri)) {
## stop("Pass the file path or url to the file containing the query.")
## }
## if (missing(project)) {
## stop("Pass the project id of your Google Cloud Platform project.")
## }
## if (grepl("^https.*", queryUri)) {
## querySql <- RCurl::getURL(queryUri, ssl.verifypeer = FALSE)
## }
## else {
## querySql <- readChar(queryUri, nchars = 1e+06)
## }
## for (replacement in names(replacements)) {
## querySql <- gsub(replacement, replacements[[replacement]],
## querySql, fixed = TRUE)
## }
## cat(querySql)
## bigrquery::query_exec(querySql, project)
## }
## <environment: namespace:ISBCGCExamples>
This allows queries to be more easily shared among analyses and also reused for different tables. For example, in the following file we have a query that will count the number of patients, grouped by disease type in any of our TCGA data tables.
file.show(file.path(system.file(package = "ISBCGCExamples"),
"sql",
"count-patients-by-study.sql"))Now let's run the query to see what these counts are in the Clinical_data table:
result <- DisplayAndDispatchQuery(file.path(system.file(package = "ISBCGCExamples"),
"sql",
"count-patients-by-study.sql"),
project=project,
replacements=list("_THE_TABLE_"=theTable))# all of the TCGA molecular data tables contain the fields ParticipantBarcode and Study
SELECT Study, COUNT(*) AS n
FROM (
SELECT ParticipantBarcode, Study
FROM [isb-cgc:tcga_201510_alpha.Clinical_data]
GROUP BY ParticipantBarcode, Study
) GROUP BY Study
ORDER BY n DESC
cat("Number of rows returned by this query: ", nrow(result), "\n")Number of rows returned by this query: 33
Results from bigrquery are returned as R dataframes, meaning that we can make use of all of the regular dataframe functions as well as all sorts of other great R packages to do our downstream work.
mode(result)## [1] "list"
class(result)## [1] "data.frame"
summary(result)## Study n
## Length:33 Min. : 36.0
## Class :character 1st Qu.: 134.0
## Mode :character Median : 308.0
## Mean : 337.9
## 3rd Qu.: 507.0
## Max. :1097.0
head(result)## Study n
## 1 BRCA 1097
## 2 GBM 596
## 3 OV 587
## 4 UCEC 548
## 5 KIRC 537
## 6 HNSC 528
Since there are over 30 distinct tumor types within the TCGA project, we may want to filter our results before visualizing. For example let's look only at tumor types with at least 500 patients in the study:
subsetResults <- filter(result, n>=500)
subsetResults <- arrange(subsetResults,desc(n))and then create a barchart of the patient counts:
ggplot(subsetResults, aes(x=Study, y=n, fill=Study)) +
geom_bar(stat="identity") +
ylab("Number of Patients") +
ggtitle("Study Size")sessionInfo()R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ISBCGCExamples_0.1.1 ggplot2_2.1.0 scales_0.4.0
[4] bigrquery_0.2.0 dplyr_0.4.3 knitr_1.12.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 magrittr_1.5 munsell_0.4.3 colorspace_1.2-6
[5] R6_2.1.2 stringr_1.0.0 httr_1.1.0 plyr_1.8.3
[9] tools_3.2.4 parallel_3.2.4 grid_3.2.4 gtable_0.2.0
[13] DBI_0.3.1 openssl_0.9.2 lazyeval_0.1.10 assertthat_0.1
[17] digest_0.6.9 formatR_1.3 curl_0.9.6 evaluate_0.8.3
[21] labeling_0.3 stringi_1.0-1 jsonlite_0.9.19
