A comprehensive media aggregation and analysis platform for scraping, indexing, and analyzing news articles and social media posts.
This project provides tools and documentation for:
- Fetching news articles from various sources (APIs and web scraping)
- Indexing media content in OpenSearch
- Analyzing content using AI/NLP for topics, sentiment, bias, entities, and events
See MEDIA_AGGREGATION_GUIDE.md for detailed information on:

- **News Aggregation APIs** - NewsAPI, The Guardian, New York Times, and more
  - Features comparison
  - Sign-up and API key processes
  - Sample code snippets
- **Web Scraping Alternatives** - Reddit, Twitter/X, RSS feeds, and custom scrapers
  - Tools and libraries (BeautifulSoup, Newspaper3k, Playwright)
  - High-profile aggregator sources
- **APIs vs Web Scraping** - Comparison and recommendations
- **OpenSearch Integration** - Indexing and searching media content
- **Python Libraries for AI/NLP** - spaCy, Transformers, NLTK, OpenAI GPT (see the sketch after this list)
  - Entity recognition
  - Sentiment analysis
  - Topic classification
  - Bias detection
  - Text summarization
- **Example Workflows** - Complete pipelines and monitoring systems
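To give a concrete sense of that AI/NLP tooling, here is a minimal, self-contained sketch (not the project's own pipeline) that runs entity recognition with spaCy and sentiment analysis with a Transformers pipeline:

```python
# Illustrative only: entity recognition + sentiment on one headline.
# Assumes: pip install spacy transformers torch
#          python -m spacy download en_core_web_sm
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
sentiment = pipeline("sentiment-analysis")

text = "The new climate bill passed the Senate on Tuesday."

# Named entities as (text, label) pairs, e.g. ("Senate", "ORG")
print([(ent.text, ent.label_) for ent in nlp(text).ents])

# Sentiment as [{"label": ..., "score": ...}]
print(sentiment(text))
```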
Installation:

- **Clone the repository:**

  ```bash
  git clone https://github.com/medium-tech/media-aggregator.git
  cd media-aggregator
  ```

- **Create and activate a Python virtual environment:**

  ```bash
  # Using venv (Python 3.9+)
  python3 -m venv venv

  # Activate on Linux/macOS
  source venv/bin/activate

  # Activate on Windows
  venv\Scripts\activate
  ```

- **Install the package:**

  ```bash
  pip install -e .
  ```

- **Install system dependencies for web scraping (optional):**

  The scraping module requires additional system dependencies.

  On Ubuntu/Debian:

  ```bash
  # Install Tesseract OCR and Chromium for html2image
  sudo apt-get update
  sudo apt-get install -y tesseract-ocr chromium-browser
  ```

  On macOS:

  ```bash
  # Install Tesseract OCR and Chrome
  brew install tesseract
  brew install --cask google-chrome
  ```

  On Windows:

  - Download and install Tesseract OCR
  - Download and install Google Chrome
  - Add Tesseract to your PATH environment variable

- **Set up environment variables:**

  ```bash
  cp .env.example .env
  # Edit .env and add your API keys
  ```
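If you need these keys from your own Python code, the usual pattern is to load `.env` with python-dotenv; a minimal sketch (the package itself may load configuration differently):

```python
# Minimal sketch: read API keys from .env using python-dotenv.
# Assumes: pip install python-dotenv
import os

from dotenv import load_dotenv

load_dotenv()  # loads variables from ./.env into the process environment

nytimes_key = os.environ["NYTIMES_API_KEY"]  # raises KeyError if unset
```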
The media aggregator uses several APIs for fetching news articles and social media posts. You'll need to create accounts and obtain API keys for each service you want to use.
New York Times:

Sign-up Process:

- Visit https://developer.nytimes.com/accounts/create
- Create an account with your email
- Verify your email address
- Create an app in the developer portal
- Enable the Article Search API
- Copy your API key and add it to `.env` as `NYTIMES_API_KEY`
API Documentation: https://developer.nytimes.com/docs/articlesearch-product/1/overview
Rate Limits: 4,000 requests/day, 500 requests/minute
Licensing: Free for non-commercial use. Review Terms of Service for commercial use.
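For reference, a direct call to the Article Search endpoint looks roughly like this (a sketch using requests; the `mediaagg-articles nytimes` command shown later wraps this kind of call):

```python
# Query the NYT Article Search API directly (illustrative sketch).
import os

import requests

resp = requests.get(
    "https://api.nytimes.com/svc/search/v2/articlesearch.json",
    params={
        "q": "artificial intelligence",
        "api-key": os.environ["NYTIMES_API_KEY"],
    },
    timeout=30,
)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["headline"]["main"], doc["web_url"])
```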
Mediastack:

Sign-up Process:

- Visit https://mediastack.com/product
- Sign up for a free account
- Receive your API key immediately
- Add it to `.env` as `MEDIASTACK_API_KEY`
API Documentation: https://mediastack.com/documentation
Rate Limits:
- Free: 500 requests/month
- Basic: 10,000 requests/month ($9.99/month)
- Professional: 100,000 requests/month ($49.99/month)
Licensing: Review Terms of Use for usage guidelines.
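A direct request is similar (a sketch; note that Mediastack's free tier has historically been HTTP-only, with HTTPS reserved for paid plans, so check your plan before switching schemes):

```python
# Query the Mediastack news endpoint directly (illustrative sketch).
import os

import requests

resp = requests.get(
    "http://api.mediastack.com/v1/news",  # free tier: HTTP only
    params={
        "access_key": os.environ["MEDIASTACK_API_KEY"],
        "keywords": "technology",
        "countries": "us",
    },
    timeout=30,
)
resp.raise_for_status()
for article in resp.json().get("data", []):
    print(article["title"], article["url"])
```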
GNews:

Sign-up Process:

- Visit https://gnews.io/
- Register with your email
- Receive your API key instantly
- Add it to `.env` as `GNEWS_API_KEY`
API Documentation: https://gnews.io/docs/v4
Rate Limits:
- Free: 100 requests/day
- Basic: 10,000 requests/month ($9/month)
- Pro: 50,000 requests/month ($29/month)
Licensing: Review Terms of Service for usage restrictions.
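And the equivalent direct call (a sketch; the `apikey` parameter name follows the v4 docs linked above, so verify it against the current documentation):

```python
# Query the GNews v4 search endpoint directly (illustrative sketch).
import os

import requests

resp = requests.get(
    "https://gnews.io/api/v4/search",
    params={
        "q": "machine learning",
        "lang": "en",
        "apikey": os.environ["GNEWS_API_KEY"],
    },
    timeout=30,
)
resp.raise_for_status()
for article in resp.json()["articles"]:
    print(article["title"], article["url"])
```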
Twitter/X:

Sign-up Process:

- Visit https://developer.twitter.com/
- Apply for a developer account
- Create a new app in the developer portal
- Generate a Bearer Token
- Add it to `.env` as `TWITTER_BEARER_TOKEN`
API Documentation: https://developer.twitter.com/en/docs/twitter-api
Rate Limits:
- Free tier: 1,500 tweets/month (Essential access)
- Basic: $100/month for 10,000 tweets/month
- Pro: Custom pricing
Licensing: Review Twitter Developer Agreement for usage terms.
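A bare-bones recent-search request against the v2 API looks like this (a sketch; availability of this endpoint depends on your access tier):

```python
# Search recent tweets via the Twitter API v2 (illustrative sketch).
import os

import requests

resp = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    params={"query": "from:nytimes", "max_results": 10},
    headers={"Authorization": f"Bearer {os.environ['TWITTER_BEARER_TOKEN']}"},
    timeout=30,
)
resp.raise_for_status()
for tweet in resp.json().get("data", []):
    print(tweet["id"], tweet["text"])
```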
For local development:
- **Using Docker Compose (recommended):**

  The project includes a `docker-compose.yml` file that sets up both OpenSearch and OpenSearch Dashboards:

  ```bash
  # Start OpenSearch and OpenSearch Dashboards
  docker-compose up -d

  # Check if services are running
  docker-compose ps

  # View logs
  docker-compose logs -f

  # Stop services
  docker-compose down

  # Stop and remove data volumes
  docker-compose down -v
  ```

  Once started, you can access:

  - OpenSearch: http://localhost:9200
  - OpenSearch Dashboards: http://localhost:5601

  Default credentials:

  - Username: `admin`
  - Password: `Admin123!`

- **Using Docker directly (alternative):**

  If you prefer to run only OpenSearch, without Docker Compose:

  ```bash
  docker run -d -p 9200:9200 -p 9600:9600 \
    -e "discovery.type=single-node" \
    -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Admin123!" \
    opensearchproject/opensearch:latest
  ```

- **Configure in `.env`:**

  ```
  OPENSEARCH_HOST=localhost
  OPENSEARCH_PORT=9200
  OPENSEARCH_USERNAME=admin
  OPENSEARCH_PASSWORD=Admin123!
  OPENSEARCH_USE_SSL=false
  ```
OpenSearch Documentation: https://opensearch.org/docs/latest/
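To sanity-check the connection from Python using the settings above, a short sketch with the official opensearch-py client:

```python
# Connect to the local OpenSearch instance started above (sketch).
# Assumes: pip install opensearch-py
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "Admin123!"),
    use_ssl=False,  # matches OPENSEARCH_USE_SSL=false above
)

print(client.info())  # prints cluster name/version if the connection works
```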
The media aggregator uses a two-step workflow:
- Fetch: Download raw data from APIs and save to disk as JSON files
- Index: Read data from disk and index into OpenSearch
This workflow allows you to:
- Rebuild OpenSearch indices without re-fetching data
- Preserve raw data for future processing
- Separate data collection from indexing
Fetch articles from various sources and save them to disk:
NY Times:

```bash
# Fetch articles by query
mediaagg-articles nytimes --query "artificial intelligence"

# With date filtering (YYYYMMDD format)
mediaagg-articles nytimes --query "climate change" --begin-date 20240101 --end-date 20241231
```

Mediastack:

```bash
# Fetch by keywords
mediaagg-articles mediastack --keywords "technology"

# With country and category filters
mediaagg-articles mediastack --keywords "election" --countries "us" --categories "politics"

# With date range (YYYY-MM-DD format)
mediaagg-articles mediastack --keywords "AI" --date-from 2024-01-01 --date-to 2024-12-31
```

Google News (GNews):

```bash
# Fetch by query
mediaagg-articles gnews --query "machine learning"

# Fetch by category
mediaagg-articles gnews --category "technology" --max-results 50

# With language and country
mediaagg-articles gnews --query "sports" --lang "en" --country "us"
```

Articles are saved to `./data/<source_name>/` by default (configurable via the `DATA_ROOT` environment variable).
Once articles are fetched, index them into OpenSearch:

```bash
# Index NY Times articles
mediaagg-articles index nytimes

# Index Mediastack articles
mediaagg-articles index mediastack

# Index Google News articles
mediaagg-articles index gnews
```

Twitter/X:

```bash
# Fetch tweets from a user and save to disk
mediaagg-socials tweets elonmusk --max-results 50

# With date filtering (ISO 8601 format)
mediaagg-socials tweets nytimes --start-time "2024-01-01T00:00:00Z" --end-time "2024-12-31T23:59:59Z"
```

Tweets are saved to `./data/tweets/` by default.

```bash
# Index all tweets from disk
mediaagg-socials index
```

The scraping module allows you to download web pages, render them as images, and extract text via OCR.
Scrape a web page:

```bash
# Scrape a URL and save all artifacts (HTML, image, extracted text)
mediaagg-scraping https://example.com

# Specify a custom source name for organization
mediaagg-scraping https://news.ycombinator.com --source hackernews
```

The scraping tool will:

- Download the raw HTML and save it as `raw.html`
- Render the HTML as an image and save it as `rendered.png`
- Extract text from the image using OCR and save it as `extracted_text.txt`
- Store all artifacts in `./data/<source_name>/<article_id>/`

The article ID is generated as the SHA-256 hash of the URL, ensuring each unique URL gets its own folder.
Output example:

```
Scraping URL: https://example.com
Downloading HTML from https://example.com...
Saved raw HTML to ./data/scraped/5d41402a.../raw.html
Rendering HTML to image...
Saved rendered image to ./data/scraped/5d41402a.../rendered.png
Extracting text from image via OCR...
Saved extracted text to ./data/scraped/5d41402a.../extracted_text.txt
Article folder: ./data/scraped/5d41402abc4fd2403c9...
```
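Under the hood, the steps in that transcript map onto standard libraries roughly like this (a sketch: requests, html2image, and pytesseract here are assumptions about the implementation, not confirmed internals):

```python
# Sketch of the scrape -> render -> OCR flow. Requires Chrome/Chromium and
# the Tesseract binary (see the installation notes above).
import hashlib
from pathlib import Path

import pytesseract
import requests
from html2image import Html2Image
from PIL import Image

url = "https://example.com"
article_id = hashlib.sha256(url.encode("utf-8")).hexdigest()
out_dir = Path("data/scraped") / article_id
out_dir.mkdir(parents=True, exist_ok=True)

# 1. Download the raw HTML
html = requests.get(url, timeout=30).text
(out_dir / "raw.html").write_text(html, encoding="utf-8")

# 2. Render the page to an image
hti = Html2Image(output_path=str(out_dir))
hti.screenshot(url=url, save_as="rendered.png")

# 3. Extract text from the rendered image via OCR
text = pytesseract.image_to_string(Image.open(out_dir / "rendered.png"))
(out_dir / "extracted_text.txt").write_text(text, encoding="utf-8")
```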
You can also use the package programmatically:
```python
from mediaagg.articles import fetch_nytimes, fetch_mediastack, fetch_gnews, index_articles
from mediaagg.socials import fetch_tweets, index_tweets
from mediaagg.storage import load_all_data

# Fetch articles (saves to disk by default)
fetch_nytimes(query="technology", begin_date="20240101", save_to_disk=True)

# Load articles from disk and index them
articles = load_all_data("nytimes")
index_articles(articles, source_name="nytimes")

# Fetch tweets (saves to disk by default)
fetch_tweets(username="elonmusk", max_results=100, save_to_disk=True)

# Load tweets from disk and index them
tweets = load_all_data("tweets")
index_tweets(tweets)
```

Raw data is stored in the directory specified by the `DATA_ROOT` environment variable (default: `./data`).
Directory structure:

```
data/
├── nytimes/           # NY Times articles
│   ├── abc123.json
│   └── def456.json
├── mediastack/        # Mediastack articles
│   ├── ghi789.json
│   └── jkl012.json
├── gnews/             # Google News articles
│   ├── mno345.json
│   └── pqr678.json
└── tweets/            # Twitter/X posts
    ├── 1234567890.json
    └── 9876543210.json
```
Each article is stored as a separate JSON file with a unique identifier:
- **Articles**: the filename is the SHA-256 hash of the article URL
- **Tweets**: the filename is the tweet ID
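Recomputing an article's on-disk filename, or reading a source's stored files back, takes only a few lines (a sketch; the packaged `load_all_data` presumably does something similar):

```python
# Sketch: derive an article's filename and read a source's stored JSON back.
import hashlib
import json
from pathlib import Path

url = "https://example.com/some-article"
path = Path("data/nytimes") / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json")

# Read every stored article for a source:
articles = [
    json.loads(p.read_text(encoding="utf-8"))
    for p in Path("data/nytimes").glob("*.json")
]
```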
Articles are automatically indexed into source-specific indices:
- `articles-nytimes` - NY Times articles
- `articles-mediastack` - Mediastack articles
- `articles-gnews` - Google News articles
- `tweets` - Twitter/X posts
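Indexing into these indices can be reproduced with opensearch-py's bulk helper; a hedged sketch (the document fields and ID scheme here are assumptions, not the package's actual schema):

```python
# Sketch: bulk-index loaded articles into a source-specific index.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "Admin123!"),
    use_ssl=False,
)

articles = [{"url": "https://example.com/a", "title": "Example"}]

# Use the article URL as a stable document ID so re-indexing doesn't duplicate.
helpers.bulk(
    client,
    (
        {"_index": "articles-nytimes", "_id": a["url"], "_source": a}
        for a in articles
    ),
)
```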
For more detailed information about news aggregation, web scraping, and AI/NLP processing:
- See MEDIA_AGGREGATION_GUIDE.md for comprehensive documentation
- Includes API comparisons, sample code, and complete pipeline examples
See LICENSE for details.