A comprehensive media aggregation and analysis platform for scraping, indexing, and analyzing news articles and social media posts.
This project provides tools and documentation for:
- Fetching news articles from various sources (APIs and web scraping)
- Indexing media content in OpenSearch
- Analyzing content using AI/NLP for topics, sentiment, bias, entities, and events
See MEDIA_AGGREGATION_GUIDE.md for detailed information on:

- **News Aggregation APIs** - NewsAPI, The Guardian, New York Times, and more
  - Features comparison
  - Sign-up and API key processes
  - Sample code snippets
- **Web Scraping Alternatives** - Reddit, Twitter/X, RSS feeds, and custom scrapers
  - Tools and libraries (BeautifulSoup, Newspaper3k, Playwright)
  - High-profile aggregator sources
- **APIs vs Web Scraping** - Comparison and recommendations
- **OpenSearch Integration** - Indexing and searching media content
- **Python Libraries for AI/NLP** - spaCy, Transformers, NLTK, OpenAI GPT (see the sketch after this list)
  - Entity recognition
  - Sentiment analysis
  - Topic classification
  - Bias detection
  - Text summarization
- **Example Workflows** - Complete pipelines and monitoring systems
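To give a concrete sense of that AI/NLP tooling, here is a minimal, self-contained sketch (not the project's own pipeline) that runs entity recognition with spaCy and sentiment analysis with a Transformers pipeline:

```python
# Illustrative only: entity recognition + sentiment on one headline.
# Assumes: pip install spacy transformers torch
#          python -m spacy download en_core_web_sm
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
sentiment = pipeline("sentiment-analysis")

text = "The new climate bill passed the Senate on Tuesday."

# Named entities as (text, label) pairs, e.g. ("Senate", "ORG")
print([(ent.text, ent.label_) for ent in nlp(text).ents])

# Sentiment as [{"label": ..., "score": ...}]
print(sentiment(text))
```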
Installation:

- **Clone the repository:**

  ```bash
  git clone https://github.com/medium-tech/media-aggregator.git
  cd media-aggregator
  ```

- **Create and activate a Python virtual environment:**

  ```bash
  # Using venv (Python 3.9+)
  python3 -m venv venv

  # Activate on Linux/macOS
  source venv/bin/activate

  # Activate on Windows
  venv\Scripts\activate
  ```

- **Install the package:**

  ```bash
  pip install -e .
  ```

- **Install system dependencies for web scraping (optional):**

  The scraping module requires additional system dependencies.

  On Ubuntu/Debian:

  ```bash
  # Install Tesseract OCR and Chromium for html2image
  sudo apt-get update
  sudo apt-get install -y tesseract-ocr chromium-browser
  ```

  On macOS:

  ```bash
  # Install Tesseract OCR and Chrome
  brew install tesseract
  brew install --cask google-chrome
  ```

  On Windows:

  - Download and install Tesseract OCR
  - Download and install Google Chrome
  - Add Tesseract to your PATH environment variable

- **Set up environment variables:**

  ```bash
  cp .env.example .env
  # Edit .env and add your API keys
  ```
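If you need these keys from your own Python code, the usual pattern is to load `.env` with python-dotenv; a minimal sketch (the package itself may load configuration differently):

```python
# Minimal sketch: read API keys from .env using python-dotenv.
# Assumes: pip install python-dotenv
import os

from dotenv import load_dotenv

load_dotenv()  # loads variables from ./.env into the process environment

nytimes_key = os.environ["NYTIMES_API_KEY"]  # raises KeyError if unset
```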
The media aggregator uses several APIs for fetching news articles and social media posts. You'll need to create accounts and obtain API keys for each service you want to use.
New York Times:

Sign-up Process:

- Visit https://developer.nytimes.com/accounts/create
- Create an account with your email
- Verify your email address
- Create an app in the developer portal
- Enable the Article Search API
- Copy your API key and add it to `.env` as `NYTIMES_API_KEY`
API Documentation: https://developer.nytimes.com/docs/articlesearch-product/1/overview
Rate Limits: 4,000 requests/day, 500 requests/minute
Licensing: Free for non-commercial use. Review Terms of Service for commercial use.
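For reference, a direct call to the Article Search endpoint looks roughly like this (a sketch using requests; the `mediaagg-articles nytimes` command shown later wraps this kind of call):

```python
# Query the NYT Article Search API directly (illustrative sketch).
import os

import requests

resp = requests.get(
    "https://api.nytimes.com/svc/search/v2/articlesearch.json",
    params={
        "q": "artificial intelligence",
        "api-key": os.environ["NYTIMES_API_KEY"],
    },
    timeout=30,
)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["headline"]["main"], doc["web_url"])
```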
Mediastack:

Sign-up Process:

- Visit https://mediastack.com/product
- Sign up for a free account
- Receive your API key immediately
- Add it to `.env` as `MEDIASTACK_API_KEY`
API Documentation: https://mediastack.com/documentation
Rate Limits:
- Free: 500 requests/month
- Basic: 10,000 requests/month ($9.99/month)
- Professional: 100,000 requests/month ($49.99/month)
Licensing: Review Terms of Use for usage guidelines.
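A direct request is similar (a sketch; note that Mediastack's free tier has historically been HTTP-only, with HTTPS reserved for paid plans, so check your plan before switching schemes):

```python
# Query the Mediastack news endpoint directly (illustrative sketch).
import os

import requests

resp = requests.get(
    "http://api.mediastack.com/v1/news",  # free tier: HTTP only
    params={
        "access_key": os.environ["MEDIASTACK_API_KEY"],
        "keywords": "technology",
        "countries": "us",
    },
    timeout=30,
)
resp.raise_for_status()
for article in resp.json().get("data", []):
    print(article["title"], article["url"])
```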
GNews:

Sign-up Process:

- Visit https://gnews.io/
- Register with your email
- Receive your API key instantly
- Add it to `.env` as `GNEWS_API_KEY`
API Documentation: https://gnews.io/docs/v4
Rate Limits:
- Free: 100 requests/day
- Basic: 10,000 requests/month ($9/month)
- Pro: 50,000 requests/month ($29/month)
Licensing: Review Terms of Service for usage restrictions.
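And the equivalent direct call (a sketch; the `apikey` parameter name follows the v4 docs linked above, so verify it against the current documentation):

```python
# Query the GNews v4 search endpoint directly (illustrative sketch).
import os

import requests

resp = requests.get(
    "https://gnews.io/api/v4/search",
    params={
        "q": "machine learning",
        "lang": "en",
        "apikey": os.environ["GNEWS_API_KEY"],
    },
    timeout=30,
)
resp.raise_for_status()
for article in resp.json()["articles"]:
    print(article["title"], article["url"])
```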
Twitter/X:

Sign-up Process:

- Visit https://developer.twitter.com/
- Apply for a developer account
- Create a new app in the developer portal
- Generate a Bearer Token
- Add it to `.env` as `TWITTER_BEARER_TOKEN`
API Documentation: https://developer.twitter.com/en/docs/twitter-api
Rate Limits:
- Free tier: 1,500 tweets/month (Essential access)
- Basic: $100/month for 10,000 tweets/month
- Pro: Custom pricing
Licensing: Review Twitter Developer Agreement for usage terms.
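A bare-bones recent-search request against the v2 API looks like this (a sketch; availability of this endpoint depends on your access tier):

```python
# Search recent tweets via the Twitter API v2 (illustrative sketch).
import os

import requests

resp = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    params={"query": "from:nytimes", "max_results": 10},
    headers={"Authorization": f"Bearer {os.environ['TWITTER_BEARER_TOKEN']}"},
    timeout=30,
)
resp.raise_for_status()
for tweet in resp.json().get("data", []):
    print(tweet["id"], tweet["text"])
```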
For local development:
- **Using Docker Compose (recommended):**

  The project includes a `docker-compose.yml` file that sets up both OpenSearch and OpenSearch Dashboards:

  ```bash
  # Start OpenSearch and OpenSearch Dashboards
  docker-compose up -d

  # Check if services are running
  docker-compose ps

  # View logs
  docker-compose logs -f

  # Stop services
  docker-compose down

  # Stop and remove data volumes
  docker-compose down -v
  ```

  Once started, you can access:

  - OpenSearch: http://localhost:9200
  - OpenSearch Dashboards: http://localhost:5601

  Default credentials:

  - Username: `admin`
  - Password: `Admin123!`

- **Using Docker directly (alternative):**

  If you prefer to run only OpenSearch, without Docker Compose:

  ```bash
  docker run -d -p 9200:9200 -p 9600:9600 \
    -e "discovery.type=single-node" \
    -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Admin123!" \
    opensearchproject/opensearch:latest
  ```

- **Configure in `.env`:**

  ```
  OPENSEARCH_HOST=localhost
  OPENSEARCH_PORT=9200
  OPENSEARCH_USERNAME=admin
  OPENSEARCH_PASSWORD=Admin123!
  OPENSEARCH_USE_SSL=false
  ```
OpenSearch Documentation: https://opensearch.org/docs/latest/
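To sanity-check the connection from Python using the settings above, a short sketch with the official opensearch-py client:

```python
# Connect to the local OpenSearch instance started above (sketch).
# Assumes: pip install opensearch-py
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "Admin123!"),
    use_ssl=False,  # matches OPENSEARCH_USE_SSL=false above
)

print(client.info())  # prints cluster name/version if the connection works
```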
The media aggregator uses a two-step workflow:
- Fetch: Download raw data from APIs and save to disk as JSON files
- Index: Read data from disk and index into OpenSearch
This workflow allows you to:
- Rebuild OpenSearch indices without re-fetching data
- Preserve raw data for future processing
- Separate data collection from indexing
Fetch articles from various sources and save them to disk:
NY Times:

```bash
# Fetch articles by query
mediaagg-articles nytimes --query "artificial intelligence"

# With date filtering (YYYYMMDD format)
mediaagg-articles nytimes --query "climate change" --begin-date 20240101 --end-date 20241231
```

Mediastack:

```bash
# Fetch by keywords
mediaagg-articles mediastack --keywords "technology"

# With country and category filters
mediaagg-articles mediastack --keywords "election" --countries "us" --categories "politics"

# With date range (YYYY-MM-DD format)
mediaagg-articles mediastack --keywords "AI" --date-from 2024-01-01 --date-to 2024-12-31
```

Google News (GNews):

```bash
# Fetch by query
mediaagg-articles gnews --query "machine learning"

# Fetch by category
mediaagg-articles gnews --category "technology" --max-results 50

# With language and country
mediaagg-articles gnews --query "sports" --lang "en" --country "us"
```

Articles are saved to `./data/<source_name>/` by default (configurable via the `DATA_ROOT` environment variable).
Once articles are fetched, index them into OpenSearch:

```bash
# Index NY Times articles
mediaagg-articles index nytimes

# Index Mediastack articles
mediaagg-articles index mediastack

# Index Google News articles
mediaagg-articles index gnews
```

Twitter/X:

```bash
# Fetch tweets from a user and save to disk
mediaagg-socials tweets elonmusk --max-results 50

# With date filtering (ISO 8601 format)
mediaagg-socials tweets nytimes --start-time "2024-01-01T00:00:00Z" --end-time "2024-12-31T23:59:59Z"
```

Tweets are saved to `./data/tweets/` by default.

```bash
# Index all tweets from disk
mediaagg-socials index
```

The scraping module allows you to download web pages, render them as images, and extract text via OCR.
Scrape a web page:

```bash
# Scrape a URL and save all artifacts (HTML, image, extracted text)
mediaagg-scraping https://example.com

# Specify a custom source name for organization
mediaagg-scraping https://news.ycombinator.com --source hackernews
```

The scraping tool will:

- Download the raw HTML and save it as `raw.html`
- Render the HTML as an image and save it as `rendered.png`
- Extract text from the image using OCR and save it as `extracted_text.txt`
- Store all artifacts in `./data/<source_name>/<article_id>/`

The article ID is generated as the SHA-256 hash of the URL, ensuring each unique URL gets its own folder.
Output example:

```
Scraping URL: https://example.com
Downloading HTML from https://example.com...
Saved raw HTML to ./data/scraped/5d41402a.../raw.html
Rendering HTML to image...
Saved rendered image to ./data/scraped/5d41402a.../rendered.png
Extracting text from image via OCR...
Saved extracted text to ./data/scraped/5d41402a.../extracted_text.txt
Article folder: ./data/scraped/5d41402abc4fd2403c9...
```
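Under the hood, the steps in that transcript map onto standard libraries roughly like this (a sketch: requests, html2image, and pytesseract here are assumptions about the implementation, not confirmed internals):

```python
# Sketch of the scrape -> render -> OCR flow. Requires Chrome/Chromium and
# the Tesseract binary (see the installation notes above).
import hashlib
from pathlib import Path

import pytesseract
import requests
from html2image import Html2Image
from PIL import Image

url = "https://example.com"
article_id = hashlib.sha256(url.encode("utf-8")).hexdigest()
out_dir = Path("data/scraped") / article_id
out_dir.mkdir(parents=True, exist_ok=True)

# 1. Download the raw HTML
html = requests.get(url, timeout=30).text
(out_dir / "raw.html").write_text(html, encoding="utf-8")

# 2. Render the page to an image
hti = Html2Image(output_path=str(out_dir))
hti.screenshot(url=url, save_as="rendered.png")

# 3. Extract text from the rendered image via OCR
text = pytesseract.image_to_string(Image.open(out_dir / "rendered.png"))
(out_dir / "extracted_text.txt").write_text(text, encoding="utf-8")
```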
You can also use the package programmatically:
```python
from mediaagg.articles import fetch_nytimes, fetch_mediastack, fetch_gnews, index_articles
from mediaagg.socials import fetch_tweets, index_tweets
from mediaagg.storage import load_all_data

# Fetch articles (saves to disk by default)
fetch_nytimes(query="technology", begin_date="20240101", save_to_disk=True)

# Load articles from disk and index them
articles = load_all_data("nytimes")
index_articles(articles, source_name="nytimes")

# Fetch tweets (saves to disk by default)
fetch_tweets(username="elonmusk", max_results=100, save_to_disk=True)

# Load tweets from disk and index them
tweets = load_all_data("tweets")
index_tweets(tweets)
```

Raw data is stored in the directory specified by the `DATA_ROOT` environment variable (default: `./data`).
Directory structure:

```
data/
├── nytimes/           # NY Times articles
│   ├── abc123.json
│   └── def456.json
├── mediastack/        # Mediastack articles
│   ├── ghi789.json
│   └── jkl012.json
├── gnews/             # Google News articles
│   ├── mno345.json
│   └── pqr678.json
└── tweets/            # Twitter/X posts
    ├── 1234567890.json
    └── 9876543210.json
```
Each article is stored as a separate JSON file with a unique identifier:
- **Articles**: the filename is the SHA-256 hash of the article URL
- **Tweets**: the filename is the tweet ID
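Recomputing an article's on-disk filename, or reading a source's stored files back, takes only a few lines (a sketch; the packaged `load_all_data` presumably does something similar):

```python
# Sketch: derive an article's filename and read a source's stored JSON back.
import hashlib
import json
from pathlib import Path

url = "https://example.com/some-article"
path = Path("data/nytimes") / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json")

# Read every stored article for a source:
articles = [
    json.loads(p.read_text(encoding="utf-8"))
    for p in Path("data/nytimes").glob("*.json")
]
```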
Articles are automatically indexed into source-specific indices:
- `articles-nytimes` - NY Times articles
- `articles-mediastack` - Mediastack articles
- `articles-gnews` - Google News articles
- `tweets` - Twitter/X posts
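Indexing into these indices can be reproduced with opensearch-py's bulk helper; a hedged sketch (the document fields and ID scheme here are assumptions, not the package's actual schema):

```python
# Sketch: bulk-index loaded articles into a source-specific index.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "Admin123!"),
    use_ssl=False,
)

articles = [{"url": "https://example.com/a", "title": "Example"}]

# Use the article URL as a stable document ID so re-indexing doesn't duplicate.
helpers.bulk(
    client,
    (
        {"_index": "articles-nytimes", "_id": a["url"], "_source": a}
        for a in articles
    ),
)
```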
For more detailed information about news aggregation, web scraping, and AI/NLP processing:
- See MEDIA_AGGREGATION_GUIDE.md for comprehensive documentation
- Includes API comparisons, sample code, and complete pipeline examples
See LICENSE for details.