
NBLGraduateSpecialistProgram/WebscrapeTechniques

 
 


WebscrapeTechniques

These materials form an introductory workshop covering some techniques for collecting content from public websites, otherwise known as webscraping.

This workshop was originally offered during the Spring 2019 semester and again in Spring 2020 at Rutgers University-New Brunswick through the New Brunswick Libraries Graduate Specialists program and the Rutgers DH Initiative.

Two appendices were added in the interim: one on scraping password-protected sites and another on webscraping ethics.

File Guide

WebscrapeTechniques.Rmd: Master .Rmd file for participating in the workshop; run or edit the code as desired. It is also the source used to generate the .pdf file.

WebscrapeTechniques.pdf: Best for viewing the workshop or following along outside of an R or RStudio environment. It contains all code as well as sample outputs and figures.

/images: Individual files for the screenshots used in the .pdf and .Rmd files.

Setup for Beginners

If you’re new to R, you’ll first need to get access to it in one of two ways. The standard method is to download R itself for your operating system and then download RStudio, an integrated development environment (IDE) that makes working in R clearer by adding a text editor for writing or loading R code and a workspace for viewing data in memory.

If you have a stable internet connection and want a fast, easy way to try R, you can use RStudio Cloud. This browser version of RStudio looks and functions just like the desktop version, and it saves data between sessions. All you need to log on is a Google account. You can even clone this workshop repository directly into your RStudio Cloud workspace by clicking the arrow next to the “New Project” button, selecting “New Project from Git Repo,” and then pasting the URL of this page.
