A powerful, extensible content scraping system for collecting authentic content from public figures across multiple platforms.
Features • Quick Start • Installation • Usage • Documentation • Contributing
Scrape, validate, and analyze content from your favorite thought leaders across Twitter, YouTube, Blogs, Podcasts, and Books. Built with authenticity validation, AI-powered processing, and vector embeddings for semantic search.
Currently supports:
- 🎯 Balaji Srinivasan (@balajis)
- 📚 Tim Ferriss (@tferriss)
Easily extensible to any public figure!
- Twitter/X: Full tweet history + automatic thread reconstruction
- YouTube: Video metadata + automatic transcript extraction
- Blogs: Full article text from personal blogs (tim.blog, balajis.com)
- Podcasts: RSS feed parsing + episode metadata
- Books: Online books & blog excerpts
- Domain Verification: Ensures content is from official sources
- Platform-Specific Checks: Twitter handles, YouTube channels, etc.
- Authenticity Scoring: 0-100 score for each piece of content
- Configurable Filters: Only save high-quality, authentic content
- Text Cleaning: Automatic normalization and cleaning
- Keyword Extraction: Identify main topics and themes
- Content Chunking: Smart chunking with configurable overlap (see the sketch after this feature list)
- OpenAI Embeddings: Generate vector embeddings for semantic search
- Structured Data Extraction: Extract goals, strategies, principles
- SQL Database: SQLAlchemy with SQLite/PostgreSQL support
- Vector Stores: Pinecone, ChromaDB, or Weaviate integration
- JSON Export: Export data in standard formats
- Incremental Updates: Only scrape new content
- Rate Limiting: Respects API limits with token bucket algorithm
- Robots.txt Compliance: Ethical web scraping
- Retry Logic: Exponential backoff for failed requests
- Comprehensive Logging: Debug and monitor with loguru
- Error Handling: Graceful degradation and error recovery
- Progress Tracking: Real-time progress bars with tqdm
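The chunking mentioned above splits long articles or transcripts into overlapping windows so each piece fits an embedding model's context while keeping continuity at the boundaries. Below is a minimal, illustrative sketch using the `CHUNK_SIZE` and `CHUNK_OVERLAP` values from config/settings.py; the repository's actual logic lives in processing/text_processor.py and may differ in detail.

```python
# Illustrative fixed-size chunking with overlap (not the exact logic
# in processing/text_processor.py).
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    step = chunk_size - overlap          # how far the window advances each time
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():                # skip empty/whitespace-only tails
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# Example: a 2,500-character article yields 3 overlapping chunks
# chunk_text("a" * 2500) -> chunk lengths [1000, 1000, 900]
```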
# Clone the repository
git clone https://github.com/REDFOX1899/content-scraper.git
cd content-scraper
# Run automated setup
./setup.sh

Or manually:
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Copy environment template
cp .env.example .env

Edit .env and add your API keys:
TWITTER_BEARER_TOKEN=your_token_here
YOUTUBE_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here  # Optional, for embeddings

# Scrape Tim Ferriss blog posts
python main.py scrape --author tim_ferriss --platform blog --max-items 20
# Scrape Balaji's tweets
python main.py scrape --author balaji_srinivasan --platform twitter --max-items 50
# Scrape with embeddings for AI applications
python main.py scrape --author tim_ferriss --platform blog --embed --max-items 100

# Scrape specific platform
python main.py scrape --author tim_ferriss --platform blog --max-items 50
# Scrape multiple platforms
python main.py scrape --author balaji_srinivasan \
--platform twitter \
--platform youtube \
--max-items 100
# Scrape with date filter
python main.py scrape --author tim_ferriss \
--date-from 2023-01-01 \
--date-to 2024-01-01
# Only save authentic content
python main.py scrape --author balaji_srinivasan --authentic-only
# Process existing data
python main.py process --limit 100 --embed
# View statistics
python main.py stats
# Export to JSON
python main.py export --author tim_ferriss --output data.json

from scrapers.blog_scraper import BlogScraper
from validators.authenticity_validator import AuthenticityValidator
from storage.database import ContentDatabase
# Initialize scraper (author_config is the author's entry from config/authors.json)
scraper = BlogScraper('tim_ferriss', author_config)
# Scrape content
content = scraper.scrape(max_pages=10)
# Validate authenticity
validator = AuthenticityValidator()
validated = validator.validate_batch(content)
# Store in database
db = ContentDatabase()
db.save_batch(validated)

See example_usage.py for more examples.
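Continuing the example above, you can keep only high-scoring items before saving (the same idea as the CLI's --authentic-only flag). This sketch assumes each validated item exposes its score as a dict key named authenticity_score; adjust it to the actual shape returned by validate_batch().

```python
# Sketch: keep only items at or above the configured threshold.
# Assumes validated items are dicts with an "authenticity_score" key;
# adapt if validate_batch() returns objects instead.
MIN_AUTHENTICITY_SCORE = 75  # same default as config/settings.py

authentic = [
    item for item in validated
    if item.get("authenticity_score", 0) >= MIN_AUTHENTICITY_SCORE
]
db.save_batch(authentic)
```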
┌─────────────────┐
│ User Input │
└────────┬────────┘
│
▼
┌─────────────────┐
│ CLI Interface │
└────────┬────────┘
│
▼
┌─────────────────────────────────┐
│ Orchestrator │
│ ┌─────────────────────────┐ │
│ │ Platform Scrapers │ │
│ │ • Blog │ │
│ │ • Twitter │ │
│ │ • YouTube │ │
│ │ • Podcast │ │
│ │ • Book │ │
│ └─────────────────────────┘ │
└───────────┬─────────────────────┘
│
▼
┌───────────────┐
│ Validator │
│ (Score 0-100)│
└───────┬───────┘
│
▼
┌───────────────┐
│ Processor │
│ • Clean │
│ • Extract │
│ • Chunk │
└───────┬───────┘
│
▼
┌───────────────┐
│ Embeddings │
│ (OpenAI) │
└───────┬───────┘
│
▼
┌───────────────────────┐
│ Storage │
│ ┌────────────────┐ │
│ │ SQL Database │ │
│ │ Vector Store │ │
│ │ JSON Export │ │
│ └────────────────┘ │
└───────────────────────┘
content-scraper/
├── config/ # Configuration
│ ├── settings.py # Main settings
│ └── authors.json # Author profiles
├── scrapers/ # Platform scrapers
│ ├── base_scraper.py # Base class
│ ├── blog_scraper.py
│ ├── twitter_scraper.py
│ ├── youtube_scraper.py
│ ├── podcast_scraper.py
│ └── book_scraper.py
├── validators/ # Content validation
│ └── authenticity_validator.py
├── storage/ # Data storage
│ ├── database.py # SQL database
│ └── vector_store.py # Vector stores
├── processing/ # Content processing
│ ├── text_processor.py
│ └── content_extractor.py
├── utils/ # Utilities
│ └── rate_limiter.py
├── main.py # CLI interface
├── example_usage.py # Examples
└── README.md # This file
Build a semantic search engine over your favorite thought leader's content:
# Scrape with embeddings
python main.py scrape --author tim_ferriss --embed
# Use vector store for semantic search
from storage.vector_store import create_vector_store
store = create_vector_store("chroma")
# question_embedding is the embedding vector for your query text
results = store.query(question_embedding, top_k=5)

Analyze trends, topics, and insights:
# Export data
python main.py export --output data.json
# Analyze with pandas
import pandas as pd
df = pd.read_json('data.json')
df['keywords'].explode().value_counts()  # explode() in case keywords are stored as lists

Curate the best content automatically:
# Get only high-quality, authentic content
python main.py scrape --author balaji_srinivasan \
--authentic-only \
--date-from 2024-01-01

Train AI chatbots on authentic content:
- Scrape content with embeddings
- Store in vector database
- Build a RAG (Retrieval-Augmented Generation) system (see the sketch below)
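A bare-bones RAG loop over the scraped content could look like the sketch below. It assumes the OpenAI Python client (openai>=1.0), that the vector store was populated with `--embed`, and that it uses the create_vector_store()/query() interface shown earlier. The field name `text` on the returned matches and the chat model choice are assumptions; adapt them to your setup.

```python
# Sketch of a retrieval-augmented answer over scraped content.
# Assumes OPENAI_API_KEY is set and the store was filled via
# `python main.py scrape ... --embed`.
from openai import OpenAI
from storage.vector_store import create_vector_store

client = OpenAI()
store = create_vector_store("chroma")

question = "What does Tim Ferriss say about morning routines?"

# 1. Embed the question with the same model used for the content chunks.
q_emb = client.embeddings.create(
    model="text-embedding-ada-002", input=[question]
).data[0].embedding

# 2. Retrieve the most similar chunks (the `text` field is an assumption).
matches = store.query(q_emb, top_k=5)
context = "\n\n".join(m["text"] for m in matches)

# 3. Ask a chat model to answer using only the retrieved context.
answer = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
).choices[0].message.content
print(answer)
```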
Edit config/authors.json:
{
"new_author": {
"name": "Author Name",
"twitter": {"handle": "username"},
"youtube_channels": [{
"name": "Channel Name",
"channel_id": "UCxxxxx"
}],
"blogs": [{
"name": "Blog Name",
"url": "https://blog.com"
}],
"official_domains": ["blog.com", "website.com"]
}
}

Edit config/settings.py:
# Rate limiting
RATE_LIMIT_CALLS = 10
RATE_LIMIT_PERIOD = 60 # seconds
# Content filtering
MIN_AUTHENTICITY_SCORE = 75
MIN_CONTENT_LENGTH = 100
# Text processing
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
# Embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"

Database schema:

CREATE TABLE content (
id VARCHAR(64) PRIMARY KEY,
author VARCHAR(100) NOT NULL,
platform VARCHAR(50) NOT NULL,
content_type VARCHAR(50),
title TEXT NOT NULL,
content TEXT NOT NULL,
url TEXT NOT NULL,
date_published DATETIME,
date_scraped DATETIME NOT NULL,
authenticity_score INTEGER,
processed BOOLEAN DEFAULT FALSE,
embedded BOOLEAN DEFAULT FALSE,
metadata JSON,
word_count INTEGER
);

Getting API keys:

- Go to Twitter Developer Portal
- Create a new app
- Copy the Bearer Token
- Go to Google Cloud Console
- Create project → Enable YouTube Data API v3
- Create credentials → Copy API Key
- Go to OpenAI Platform
- Create API key
- Used for embeddings and content analysis
- Twitter: ~300 requests per 15 minutes (managed automatically; see the token bucket sketch below)
- YouTube: 10,000 quota units per day
- Blogs: Respectful 2-second delays between requests
- Robots.txt: Always respected
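"Managed automatically" refers to the token bucket approach mentioned in the features list: each request consumes a token, tokens refill at a fixed rate, and requests wait when the bucket is empty. The sketch below illustrates the idea with the RATE_LIMIT_CALLS / RATE_LIMIT_PERIOD defaults from config/settings.py; the repository's actual implementation is in utils/rate_limiter.py and may differ in detail.

```python
import time

class TokenBucket:
    """Illustrative token bucket: `calls` requests allowed per `period` seconds."""

    def __init__(self, calls: int = 10, period: float = 60.0):
        self.capacity = calls
        self.tokens = float(calls)
        self.refill_rate = calls / period      # tokens added per second
        self.last_refill = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.refill_rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.refill_rate)  # wait for the next token

# Usage sketch: mirrors RATE_LIMIT_CALLS / RATE_LIMIT_PERIOD in config/settings.py.
bucket = TokenBucket(calls=10, period=60)
# bucket.acquire()  # call before each request
```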
Best Practices:
- Start with --max-items 10 to test
- Use --date-from for incremental updates
- Use --authentic-only for quality data
- Monitor logs/scraper.log
- Export data regularly
We welcome contributions! Here's how you can help:
- Create a new scraper inheriting from BaseScraper
- Implement the scrape() method
- Add platform validation
- Submit a PR!
from scrapers.base_scraper import BaseScraper

class NewPlatformScraper(BaseScraper):
    def scrape(self, **kwargs):
        content_list = []
        # Your scraping logic: append content dicts to content_list
        return content_list

To add a new author:
- Add configuration to config/authors.json
- Add official domains for validation
- Test thoroughly
- Submit a PR!
See CONTRIBUTING.md for detailed guidelines.
- Quick Start Guide - Get started in 5 minutes
- Example Usage - Code examples
- API Documentation - Detailed API docs (coming soon)
"Twitter API key not found"
# Add to .env
TWITTER_BEARER_TOKEN=your_token_here

"Rate limit exceeded"
# Wait and retry, or reduce --max-items
python main.py scrape --author tim_ferriss --max-items 10

"No module named 'tweepy'"
pip install -r requirements.txt

"Database locked"
# Only one process can write to the database at a time
# Wait for the current operation to complete

See QUICKSTART.md for more troubleshooting tips.
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Only scrapes publicly available content
- ✅ Respects robots.txt files
- ✅ Implements rate limiting
- ✅ Does NOT scrape private or paywalled content
- ✅ For personal use, research, and education
Important: Always respect the terms of service of the platforms you're scraping. This tool is designed for ethical, legal use only.
Built for learning from:
- Balaji Srinivasan (@balajis) - Entrepreneur, investor, thought leader
- Tim Ferriss (@tferriss) - Author, podcaster, entrepreneur
This tool helps fans and researchers analyze and learn from their public content.
If you find this project useful, please consider giving it a star! ⭐
- Add more authors (Paul Graham, Naval Ravikant, etc.)
- Web dashboard for browsing scraped content
- REST API endpoints
- Docker support
- Incremental update scheduler
- Content deduplication
- Advanced ML-based topic modeling
- Notion/Obsidian export
- Browser extension
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Pull Requests: Contributing Guide
Built with ❤️ for the learning community