An ETL pipeline that automates the extraction, transformation, and loading of Facebook comments into a MySQL data warehouse. The project uses Airflow for pipeline orchestration, Selenium for web scraping, and custom Python scripts for data preprocessing and loading, and is containerized with Docker for easy deployment and scaling.
Note: If you just want to crawl Facebook comments, use the `crawler` module in the `src` directory.
- Data Extraction: Crawls comments from Facebook posts using Selenium, authenticating with saved Facebook cookies (see the sketch after this list).
- Data Transformation: Cleans raw data to remove noise and standardize text.
- Data Loading: Stores structured data in a MySQL table for analysis.
- Pipeline Orchestration: Automated scheduling and monitoring using Airflow (a minimal DAG sketch follows the setup commands below).
- Containerized Deployment: Fully containerized with Docker for portability and scalability.
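
The crawler authenticates by replaying previously exported Facebook cookies into a Selenium session instead of logging in on every run. A minimal sketch of that idea, assuming a `cookies.json` export and an illustrative CSS selector (neither is the repo's actual configuration):

```python
# Minimal sketch of cookie-based extraction; cookies.json and the CSS
# selector are assumptions, not the repo's actual configuration.
import json

from selenium import webdriver
from selenium.webdriver.common.by import By

POST_URL = "https://www.facebook.com/example_post"  # hypothetical post URL

driver = webdriver.Chrome()
driver.get("https://www.facebook.com")  # must visit the domain before adding cookies

# Replay previously exported cookies so Selenium skips the login flow.
with open("cookies.json") as f:
    for c in json.load(f):
        driver.add_cookie({"name": c["name"], "value": c["value"]})

driver.get(POST_URL)  # reload the post as an authenticated session

# Illustrative selector only; Facebook's DOM changes frequently.
comments = [el.text for el in
            driver.find_elements(By.CSS_SELECTOR, "div[aria-label*='Comment']")]
driver.quit()
print(f"Scraped {len(comments)} comments")
```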
Workflow Overview:
- Extraction: Crawl comments from a specified Facebook post using Selenium.
- Transformation: Clean and preprocess comments using a custom data cleaner (illustrated below).
- Loading: Store cleaned data in a MySQL table for downstream analysis (see the loading sketch below).
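
The exact cleaning rules live in the custom data cleaner in `src`; the sketch below only illustrates the kind of normalization involved (assumed rules, not the project's exact ones):

```python
# Illustrative cleaning step; the real rules are in the project's cleaner.
import re

def clean_comment(text: str) -> str:
    """Normalize a raw comment: strip URLs, collapse whitespace, lowercase."""
    text = re.sub(r"https?://\S+", "", text)  # drop links
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip().lower()

raw = ["Great post!!  https://spam.example", "  So   true  "]
cleaned = [clean_comment(c) for c in raw if c.strip()]
print(cleaned)  # ['great post!!', 'so true']
```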

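Loading can be as simple as a parameterized bulk INSERT through a MySQL driver. A sketch assuming a `facebook_comments` table with a single `comment` column and placeholder credentials (the project's real schema and connection settings may differ):

```python
# Sketch of the load step; connection settings and the table schema are
# assumptions -- the project's real schema comes from its own SQL setup.
import mysql.connector  # pip install mysql-connector-python

cleaned = ["great post!!", "so true"]  # e.g. output of the cleaning step above

conn = mysql.connector.connect(
    host="localhost", user="etl", password="etl_password", database="warehouse"
)
cur = conn.cursor()
cur.execute(
    """CREATE TABLE IF NOT EXISTS facebook_comments (
           id INT AUTO_INCREMENT PRIMARY KEY,
           comment TEXT NOT NULL
       )"""
)
# Parameterized bulk insert keeps the load step safe and fast.
cur.executemany(
    "INSERT INTO facebook_comments (comment) VALUES (%s)",
    [(c,) for c in cleaned],
)
conn.commit()
cur.close()
conn.close()
```
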
Setup:

```bash
git clone https://github.com/johnPa02/crawler_facebook_comment.git
cd crawler_facebook_comment
cp .env.example .env
docker-compose up --build
```
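
Once the containers are up, Airflow schedules and monitors the pipeline. A minimal sketch of what such a DAG could look like, assuming hypothetical callables `crawl_comments`, `clean_comments`, and `load_to_mysql` exposed by the `src` package (the repo's actual DAG definition may differ):

```python
# Hypothetical DAG wiring extraction -> transformation -> loading; the
# import paths, task ids, and schedule are assumptions about this repo.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from src.crawler import crawl_comments   # hypothetical callables
from src.cleaner import clean_comments
from src.loader import load_to_mysql

with DAG(
    dag_id="facebook_comments_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # run the full pipeline once a day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=crawl_comments)
    transform = PythonOperator(task_id="transform", python_callable=clean_comments)
    load = PythonOperator(task_id="load", python_callable=load_to_mysql)

    extract >> transform >> load  # enforce strict E -> T -> L ordering
```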
Planned Improvements:
- Integrate Google BigQuery for handling larger datasets.
- Implement PySpark for distributed data processing.
- Add dashboards for data visualization using Tableau or Google Data Studio.