
🚀 Multi-Source Content Scraper

Python Version · License: MIT · Code style: black · PRs Welcome

A powerful, extensible content scraping system for collecting authentic content from public figures across multiple platforms.

Features · Quick Start · Installation · Usage · Documentation · Contributing


🌟 Overview

Scrape, validate, and analyze content from your favorite thought leaders across Twitter, YouTube, Blogs, Podcasts, and Books. Built with authenticity validation, AI-powered processing, and vector embeddings for semantic search.

Currently supports:

  • 🎯 Balaji Srinivasan (@balajis)
  • 📚 Tim Ferriss (@tferriss)

Easily extensible to any public figure!

✨ Features

🔍 Multi-Platform Scraping

  • Twitter/X: Full tweet history + automatic thread reconstruction
  • YouTube: Video metadata + automatic transcript extraction
  • Blogs: Full article text from personal blogs (tim.blog, balajis.com)
  • Podcasts: RSS feed parsing + episode metadata
  • Books: Online books & blog excerpts

✅ Authenticity Validation

  • Domain Verification: Ensures content is from official sources
  • Platform-Specific Checks: Twitter handles, YouTube channels, etc.
  • Authenticity Scoring: 0-100 score for each piece of content (a rough sketch follows this list)
  • Configurable Filters: Only save high-quality, authentic content
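
A rough sketch of the scoring idea (illustrative only; the real logic lives in validators/authenticity_validator.py, and the parameter names here are hypothetical):

from urllib.parse import urlparse

def score_content(url, author_handle, official_domains, known_handles):
    # Toy authenticity score in the 0-100 range
    score = 0
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    if domain in official_domains:
        score += 60   # domain verification: content is served from an official source
    if author_handle in known_handles:
        score += 40   # platform-specific identity check, e.g. a known Twitter handle
    return score      # compare against MIN_AUTHENTICITY_SCORE in config/settings.py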

🧠 AI-Powered Processing

  • Text Cleaning: Automatic normalization and cleaning
  • Keyword Extraction: Identify main topics and themes
  • Content Chunking: Smart chunking with configurable overlap (sketched below)
  • OpenAI Embeddings: Generate vector embeddings for semantic search
  • Structured Data Extraction: Extract goals, strategies, principles
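
The chunking step can be pictured roughly like this (a simplified sketch; processing/text_processor.py and the CHUNK_SIZE/CHUNK_OVERLAP settings define the actual behavior):

def chunk_text(text, chunk_size=1000, overlap=200):
    # Split text into overlapping windows so context carries across chunk boundaries
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks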

💾 Flexible Storage

  • SQL Database: SQLAlchemy with SQLite/PostgreSQL support
  • Vector Stores: Pinecone, ChromaDB, or Weaviate integration
  • JSON Export: Export data in standard formats
  • Incremental Updates: Only scrape new content

🛡️ Production-Ready

  • Rate Limiting: Respects API limits with a token bucket algorithm (sketched below)
  • Robots.txt Compliance: Ethical web scraping
  • Retry Logic: Exponential backoff for failed requests
  • Comprehensive Logging: Debug and monitor with loguru
  • Error Handling: Graceful degradation and error recovery
  • Progress Tracking: Real-time progress bars with tqdm
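
The rate limiter follows the token bucket idea, roughly like this simplified sketch (utils/rate_limiter.py contains the actual implementation, which may differ in detail):

import time

class TokenBucket:
    # Allow roughly `calls` requests per `period` seconds
    def __init__(self, calls=10, period=60):
        self.capacity = calls
        self.tokens = calls
        self.rate = calls / period          # tokens refilled per second
        self.last = time.monotonic()

    def acquire(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) / self.rate)   # wait until a token is available
            self.tokens = 1
        self.tokens -= 1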

🚀 Quick Start

1️⃣ Installation

# Clone the repository
git clone https://github.com/REDFOX1899/content-scraper.git
cd content-scraper

# Run automated setup
./setup.sh

Or manually:

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template
cp .env.example .env

2️⃣ Configuration

Edit .env and add your API keys:

TWITTER_BEARER_TOKEN=your_token_here
YOUTUBE_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here  # Optional, for embeddings

3️⃣ Start Scraping!

# Scrape Tim Ferriss blog posts
python main.py scrape --author tim_ferriss --platform blog --max-items 20

# Scrape Balaji's tweets
python main.py scrape --author balaji_srinivasan --platform twitter --max-items 50

# Scrape with embeddings for AI applications
python main.py scrape --author tim_ferriss --platform blog --embed --max-items 100

📋 Usage

Basic Commands

# Scrape specific platform
python main.py scrape --author tim_ferriss --platform blog --max-items 50

# Scrape multiple platforms
python main.py scrape --author balaji_srinivasan \
  --platform twitter \
  --platform youtube \
  --max-items 100

# Scrape with date filter
python main.py scrape --author tim_ferriss \
  --date-from 2023-01-01 \
  --date-to 2024-01-01

# Only save authentic content
python main.py scrape --author balaji_srinivasan --authentic-only

# Process existing data
python main.py process --limit 100 --embed

# View statistics
python main.py stats

# Export to JSON
python main.py export --author tim_ferriss --output data.json

Python API

from scrapers.blog_scraper import BlogScraper
from validators.authenticity_validator import AuthenticityValidator
from storage.database import ContentDatabase

# Initialize scraper
scraper = BlogScraper('tim_ferriss', author_config)

# Scrape content
content = scraper.scrape(max_pages=10)

# Validate authenticity
validator = AuthenticityValidator()
validated = validator.validate_batch(content)

# Store in database
db = ContentDatabase()
db.save_batch(validated)

See example_usage.py for more examples.

🏗️ Architecture

┌─────────────────┐
│   User Input    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  CLI Interface  │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────┐
│      Orchestrator               │
│  ┌─────────────────────────┐   │
│  │  Platform Scrapers      │   │
│  │  • Blog                 │   │
│  │  • Twitter              │   │
│  │  • YouTube              │   │
│  │  • Podcast              │   │
│  │  • Book                 │   │
│  └─────────────────────────┘   │
└───────────┬─────────────────────┘
            │
            ▼
    ┌───────────────┐
    │  Validator    │
    │  (Score 0-100)│
    └───────┬───────┘
            │
            ▼
    ┌───────────────┐
    │   Processor   │
    │  • Clean      │
    │  • Extract    │
    │  • Chunk      │
    └───────┬───────┘
            │
            ▼
    ┌───────────────┐
    │   Embeddings  │
    │   (OpenAI)    │
    └───────┬───────┘
            │
            ▼
┌───────────────────────┐
│      Storage          │
│  ┌────────────────┐   │
│  │  SQL Database  │   │
│  │  Vector Store  │   │
│  │  JSON Export   │   │
│  └────────────────┘   │
└───────────────────────┘

📁 Project Structure

content-scraper/
├── config/                     # Configuration
│   ├── settings.py            # Main settings
│   └── authors.json           # Author profiles
├── scrapers/                   # Platform scrapers
│   ├── base_scraper.py        # Base class
│   ├── blog_scraper.py
│   ├── twitter_scraper.py
│   ├── youtube_scraper.py
│   ├── podcast_scraper.py
│   └── book_scraper.py
├── validators/                 # Content validation
│   └── authenticity_validator.py
├── storage/                    # Data storage
│   ├── database.py            # SQL database
│   └── vector_store.py        # Vector stores
├── processing/                 # Content processing
│   ├── text_processor.py
│   └── content_extractor.py
├── utils/                      # Utilities
│   └── rate_limiter.py
├── main.py                     # CLI interface
├── example_usage.py            # Examples
└── README.md                   # This file

🎯 Use Cases

1. AI-Powered Knowledge Base

Build a semantic search engine over your favorite thought leader's content:

# Scrape with embeddings
python main.py scrape --author tim_ferriss --embed

# Use the vector store for semantic search
from storage.vector_store import create_vector_store
store = create_vector_store("chroma")
results = store.query(question_embedding, top_k=5)  # question_embedding: vector for your query text

2. Research & Analysis

Analyze trends, topics, and insights:

# Export data
python main.py export --output data.json

# Analyze with pandas
import pandas as pd
df = pd.read_json('data.json')
df['keywords'].value_counts()

3. Content Curation

Curate the best content automatically:

# Get only high-quality, authentic content
python main.py scrape --author balaji_srinivasan \
  --authentic-only \
  --date-from 2024-01-01

4. Chatbot Training

Train AI chatbots on authentic content:

  • Scrape content with embeddings
  • Store in vector database
  • Build a RAG (Retrieval-Augmented Generation) system (see the sketch below)
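
A minimal retrieval sketch, assuming the openai package (v1+), an OPENAI_API_KEY in your environment, and a populated vector store; the generation step is left to your LLM of choice:

from openai import OpenAI
from storage.vector_store import create_vector_store

client = OpenAI()   # reads OPENAI_API_KEY from the environment
question = "What does Tim Ferriss say about morning routines?"
response = client.embeddings.create(model="text-embedding-ada-002", input=question)
question_embedding = response.data[0].embedding

store = create_vector_store("chroma")
results = store.query(question_embedding, top_k=5)   # top matching chunks to use as context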

🔧 Configuration

Adding New Authors

Edit config/authors.json:

{
  "new_author": {
    "name": "Author Name",
    "twitter": {"handle": "username"},
    "youtube_channels": [{
      "name": "Channel Name",
      "channel_id": "UCxxxxx"
    }],
    "blogs": [{
      "name": "Blog Name",
      "url": "https://blog.com"
    }],
    "official_domains": ["blog.com", "website.com"]
  }
}

Customizing Settings

Edit config/settings.py:

# Rate limiting
RATE_LIMIT_CALLS = 10
RATE_LIMIT_PERIOD = 60  # seconds

# Content filtering
MIN_AUTHENTICITY_SCORE = 75
MIN_CONTENT_LENGTH = 100

# Text processing
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

# Embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"

📊 Database Schema

CREATE TABLE content (
    id VARCHAR(64) PRIMARY KEY,
    author VARCHAR(100) NOT NULL,
    platform VARCHAR(50) NOT NULL,
    content_type VARCHAR(50),
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    url TEXT NOT NULL,
    date_published DATETIME,
    date_scraped DATETIME NOT NULL,
    authenticity_score INTEGER,
    processed BOOLEAN DEFAULT FALSE,
    embedded BOOLEAN DEFAULT FALSE,
    metadata JSON,
    word_count INTEGER
);
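
For example, pulling recent high-authenticity items for one author with the standard library (the database path below is an assumption; see config/settings.py for the configured location):

import sqlite3

conn = sqlite3.connect("content.db")   # hypothetical path
rows = conn.execute(
    "SELECT title, url, authenticity_score FROM content "
    "WHERE author = ? AND authenticity_score >= ? "
    "ORDER BY date_published DESC LIMIT 10",
    ("tim_ferriss", 75),
).fetchall()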

🔑 API Keys

Twitter API

  1. Go to Twitter Developer Portal
  2. Create a new app
  3. Copy the Bearer Token

YouTube Data API

  1. Go to Google Cloud Console
  2. Create project → Enable YouTube Data API v3
  3. Create credentials → Copy API Key

OpenAI API (Optional)

  1. Go to OpenAI Platform
  2. Create API key
  3. Used for embeddings and content analysis

🚦 Rate Limits & Best Practices

  • Twitter: ~300 requests per 15 minutes (managed automatically)
  • YouTube: 10,000 quota units per day
  • Blogs: Respectful 2-second delays between requests
  • Robots.txt: Always respected (see the check sketched below)
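
Robots.txt checks can be reproduced with the standard library, roughly like this (the scrapers handle this internally; shown for illustration):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://tim.blog/robots.txt")
rp.read()
allowed = rp.can_fetch("*", "https://tim.blog/some-post/")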

Best Practices:

  • Start with --max-items 10 to test
  • Use --date-from for incremental updates
  • Use --authentic-only for quality data
  • Monitor logs/scraper.log
  • Export data regularly

🤝 Contributing

We welcome contributions! Here's how you can help:

Adding New Platforms

  1. Create a new scraper inheriting from BaseScraper
  2. Implement the scrape() method
  3. Add platform validation
  4. Submit a PR!

from scrapers.base_scraper import BaseScraper

class NewPlatformScraper(BaseScraper):
    def scrape(self, **kwargs):
        # Fetch and parse items from the new platform here
        content_list = []
        return content_list

Adding New Authors

  1. Add configuration to config/authors.json
  2. Add official domains for validation
  3. Test thoroughly
  4. Submit a PR!

See CONTRIBUTING.md for detailed guidelines.

📖 Documentation

🐛 Troubleshooting

Common Issues

"Twitter API key not found"

# Add to .env
TWITTER_BEARER_TOKEN=your_token_here

"Rate limit exceeded"

# Wait and retry, or reduce max-items
python main.py scrape --author tim_ferriss --max-items 10

"No module named 'tweepy'"

pip install -r requirements.txt

Database locked

# Only one process can write at a time
# Wait for current operation to complete

See QUICKSTART.md for more troubleshooting tips.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚖️ Legal & Ethics

  • ✅ Only scrapes publicly available content
  • ✅ Respects robots.txt files
  • ✅ Implements rate limiting
  • ✅ Does NOT scrape private or paywalled content
  • ✅ For personal use, research, and education

Important: Always respect the terms of service of the platforms you're scraping. This tool is designed for ethical, legal use only.

🙏 Acknowledgments

Built for learning from:

  • Balaji Srinivasan (@balajis) - Entrepreneur, investor, thought leader
  • Tim Ferriss (@tferriss) - Author, podcaster, entrepreneur

This tool helps fans and researchers analyze and learn from their public content.

⭐ Star History

If you find this project useful, please consider giving it a star! ⭐

🗺️ Roadmap

  • Add more authors (Paul Graham, Naval Ravikant, etc.)
  • Web dashboard for browsing scraped content
  • REST API endpoints
  • Docker support
  • Incremental update scheduler
  • Content deduplication
  • Advanced ML-based topic modeling
  • Notion/Obsidian export
  • Browser extension

💬 Community


Built with ❤️ for the learning community

⬆ back to top
