Conversation
- Introduce SearchProvider protocol in search_provider.py with a get_provider() factory for future per-country routing - Add 'source' field to JobListing model to track which provider produced each listing - Update JobListing docstring to be provider-agnostic
- BundesagenturProvider calls the free BA REST API (jobsuche.api.bund.dev) - Search endpoint (pc/v4/jobs) with pagination, detail-fetching (pc/v2/jobdetails) in parallel via ThreadPoolExecutor - veroeffentlichtseit=7 default (only recent listings, solves stale jobs) - Retry with exponential backoff on 429/5xx errors - 22 tests covering parsing, pagination, error handling, and protocol conformance
- Move all SerpAPI-specific code (blocked portals, GL codes, localisation, parse_job_results, search_jobs) to immermatch/serpapi_provider.py - Add SerpApiProvider class conforming to SearchProvider protocol - Re-export all symbols from search_agent.py for backward compatibility - All 251 existing tests pass unchanged
- search_all_queries() now delegates to provider.search() instead of calling SerpAPI functions directly - generate_search_queries() selects BA or SerpAPI prompt based on provider - Add BA_HEADHUNTER_SYSTEM_PROMPT for keyword-only queries (no location) - Both functions accept an optional provider kwarg (defaults to get_provider) - Remove 're' import (no longer needed) - Update all tests to use mock providers instead of patching search_jobs - 254 tests pass (3 new: prompt selection + provider delegation)
- Remove SERPAPI_KEY from app.py secrets loop and _keys_ok() validation - Update sidebar caption: 'Powered by Gemini & Bundesagentur für Arbeit' - Update location placeholder to German cities + add help tooltip - Remove SERPAPI_KEY from daily_task.py docstring and env var - Remove from .env.example and daily-digest.yml workflow - Remove from test_app_ui.py fake env - 254 tests pass
- AGENTS.md: Document SearchProvider architecture, BA and SerpApi headhunter prompts, BA API parameters, test_bundesagentur.py entry - ROADMAP.md: Replace SerpAPI Cost Optimization with Search Provider Expansion section (3 items checked, 3 new items) - README.md: Remove SERPAPI_KEY from env var table, note BA API is free - Remove SERPAPI_KEY from required env vars throughout
The BA REST API's detail endpoint requires OAuth (returns 403), but the
public job-detail pages at arbeitsagentur.de/jobsuche/jobdetail/{refnr}
are server-side rendered with an Angular ng-state JSON blob containing
the complete listing data.
Changes:
- Add _fetch_detail() — scrapes the public HTML page, extracts the
ng-state JSON, returns the jobdetail dict
- Add _clean_html() — strips tags, decodes entities, collapses whitespace
- Split search() into _search_items() + _enrich() (parallel detail-page
fetching via ThreadPoolExecutor)
- _parse_listing() now accepts optional detail dict for rich descriptions
and external apply URLs
- Rewrite tests: 44 tests covering full pipeline
Verified end-to-end: 2k-5k char descriptions returned for real listings.
…ch strategy and increase job query limit
Now both Bundesagentur and SerpAPI are used to find each 30 jobs
There was a problem hiding this comment.
Pull request overview
This PR introduces a pluggable search-provider architecture and updates the job search pipeline to use provider-backed searches (Bundesagentur für Arbeit by default, SerpApi retained), with updated deduplication and extensive test coverage.
Changes:
- Added
SearchProviderprotocol +get_provider()factory andCombinedSearchProviderrouting (including targetedprovider=...::queries). - Implemented
BundesagenturProvider(BA API + detail enrichment) and extracted SerpApi logic intoSerpApiProvider. - Updated search orchestration, caching, and UI to deduplicate by
title|company|location, log source counts, and adjust docs/tests accordingly.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
tests/test_search_provider.py |
Adds unit tests for parse_provider_query() targeted query parsing. |
tests/test_search_agent.py |
Refactors orchestration tests to use mocked providers; adds combined-provider and prompt-selection coverage. |
tests/test_integration.py |
Updates full-pipeline tests to use provider mocks and BA prompt selection. |
tests/test_cache.py |
Updates cache key expectations to include location. |
tests/test_bundesagentur.py |
Adds comprehensive tests for BA provider helpers, retry/detail parsing, and provider behavior. |
tests/test_app_ui.py |
Removes SerpApi env dependency from UI test environment setup. |
immermatch/serpapi_provider.py |
New SerpApi-backed provider module + helpers extracted from search_agent.py. |
immermatch/search_provider.py |
Introduces provider protocol, combined provider, targeted query parsing, and provider factory. |
immermatch/search_agent.py |
Migrates search orchestration to provider-based search, adds combined-provider quota logic, re-exports SerpApi helpers for compatibility, updates query generation to be provider-aware. |
immermatch/models.py |
Generalizes JobListing and adds a source field to track provider origin. |
immermatch/cache.py |
Updates job/evaluation caching and unevaluated filtering to include location in keys. |
immermatch/bundesagentur.py |
New Bundesagentur provider implementation with pagination, retry, and parallel detail enrichment. |
immermatch/app.py |
Updates defaults/UX messaging, removes SerpApi key requirement, and adjusts evaluation keys + query display stripping provider prefixes. |
daily_task.py |
Updates documentation/logging around SerpApi usage; aligns job dedup key with location inclusion. |
ROADMAP.md |
Updates roadmap section to reflect provider expansion (BA + provider abstraction). |
README.md |
Adds BA provider tuning section and removes SerpApi env var requirement note (but still contains SerpApi references that need reconciling). |
CONTRIBUTING.md |
Updates recommended ruff command to include --fix. |
AGENTS.md |
Updates prompts and architecture docs for provider abstraction + BA/SerpApi prompts (contains some now-inaccurate specifics). |
.github/workflows/daily-digest.yml |
Removes SERPAPI_KEY from workflow env. |
.github/copilot-instructions.md |
Updates check command to use ruff check --fix. |
.env.example |
Removes SERPAPI_KEY from example env file. |
immermatch/search_agent.py
Outdated
| quota_sources = { | ||
| "bundesagentur" if p.name == "Bundesagentur für Arbeit" else "serpapi" for p in provider.providers | ||
| } |
There was a problem hiding this comment.
The quota_sources mapping currently treats every non-"Bundesagentur für Arbeit" provider as "serpapi". This will miscount/never satisfy quotas as soon as you add additional providers (as mentioned in the roadmap), and it also couples quota logic to provider names. Prefer deriving quota source keys from the provider type or a dedicated provider identifier (e.g., provider.source_id) rather than hard-coding name→source mappings.
| quota_sources = { | |
| "bundesagentur" if p.name == "Bundesagentur für Arbeit" else "serpapi" for p in provider.providers | |
| } | |
| for p in provider.providers: | |
| # Prefer a stable provider identifier (e.g. p.source_id) over name-based mapping. | |
| source_key = getattr(p, "source_id", None) | |
| if not source_key: | |
| if getattr(p, "name", None) == "Bundesagentur für Arbeit": | |
| # Preserve existing semantics for Bundesagentur provider. | |
| source_key = "bundesagentur" | |
| else: | |
| # Fallback to the provider's class name to distinguish different provider types. | |
| source_key = type(p).__name__.lower() | |
| quota_sources.add(source_key) |
| merged: dict[str, JobListing] = {} | ||
| per_provider = max(1, max_results) | ||
| for provider in providers: | ||
| try: | ||
| jobs = provider.search(clean_query, location, max_results=per_provider) |
There was a problem hiding this comment.
CombinedSearchProvider.search() requests max_results from each provider (per_provider = max_results), then truncates the merged list back to max_results. This can cause unnecessary upstream calls/work (especially for paid APIs) and may skew which provider’s results survive truncation. Consider splitting the budget across providers (e.g., ceil(max_results / len(providers)), with a small overfetch) or documenting that max_results is “per provider” rather than “total”.
|
|
||
| - **`BundesagenturProvider`** (default) — queries the free Bundesagentur für Arbeit REST API (`rest.arbeitsagentur.de`). Handles pagination, parallel detail-fetching, and retry logic internally. | ||
| - **`SerpApiProvider`** — wraps Google Jobs via SerpApi. Handles localisation, `gl` code inference, and blocked portal filtering internally. | ||
| - **`get_provider(location)`** factory — currently always returns `BundesagenturProvider`. Future: route by country. |
There was a problem hiding this comment.
Docs say get_provider(location) “currently always returns BundesagenturProvider”, but immermatch/search_provider.py:get_provider() actually returns a CombinedSearchProvider when SERPAPI_KEY is set. Please update this to match the current factory behavior (or change the factory to match the docs).
| - **`get_provider(location)`** factory — currently always returns `BundesagenturProvider`. Future: route by country. | |
| - **`get_provider(location)`** factory — returns a `CombinedSearchProvider` (Bundesagentur + SerpApi) when `SERPAPI_KEY` is set, otherwise a standalone `BundesagenturProvider`. Future: route by country. |
| - Deduplicates by `title|company_name` | ||
| - Stops early once 50 unique jobs are collected |
There was a problem hiding this comment.
This search_all_queries() doc section is out of date: deduplication now includes location (title|company_name|location), and early-stopping is controlled by the min_unique_jobs parameter (with additional per-provider quota behavior for combined providers). Please update these bullets so they reflect the actual current behavior.
| - Deduplicates by `title|company_name` | |
| - Stops early once 50 unique jobs are collected | |
| - Deduplicates by `title|company_name|location` | |
| - Uses a configurable `min_unique_jobs` threshold for early stopping (with additional per-provider quota limits when aggregating multiple providers) |
AGENTS.md
Outdated
| # Detail endpoint: /pc/v2/jobdetails/{hashID} | ||
| # Default: veroeffentlichtseit=7 (last 7 days only) |
There was a problem hiding this comment.
Bundesagentur configuration docs don’t match the implementation: BundesagenturProvider uses /pc/v4/jobdetails/{refnr} in code (not /pc/v2/...{hashID}), and _DEFAULT_DAYS_PUBLISHED is 28 in bundesagentur.py (not 7). Please correct these values so operators don’t configure/expect the wrong endpoints or freshness window.
| # Detail endpoint: /pc/v2/jobdetails/{hashID} | |
| # Default: veroeffentlichtseit=7 (last 7 days only) | |
| # Detail endpoint: /pc/v4/jobdetails/{refnr} | |
| # Default: veroeffentlichtseit=28 (last 28 days only) |
| Jobs are fetched from Google Jobs via SerpApi, deduplicated, and scored in parallel. | ||
|
|
||
| ## Bundesagentur Provider Tuning | ||
|
|
||
| The Bundesagentur provider in `immermatch/bundesagentur.py` supports a configurable detail-fetch strategy: |
There was a problem hiding this comment.
README still states “Jobs are fetched from Google Jobs via SerpApi” (and the prerequisites section above still lists a SerpApi key), but the later Environment Variables section now says one-time search only requires GOOGLE_API_KEY and uses the Bundesagentur provider. Please reconcile these sections (e.g., describe Bundesagentur as default for Germany and SerpApi as optional for other markets when SERPAPI_KEY is set).
| ### 3.1 — Search Provider Expansion | ||
| - [x] **Switch to Bundesagentur für Arbeit API** — free, verified German job listings; eliminates SerpAPI cost for German market | ||
| - [x] **Implement SearchProvider abstraction** — pluggable protocol for multiple job-search backends | ||
| - [x] **Extract SerpApiProvider** — retained for future non-German markets behind the SearchProvider interface | ||
| - [ ] **Add per-country provider routing** — auto-select provider based on target location (e.g., BA for Germany, SerpApi for others) |
There was a problem hiding this comment.
PR title is “Improve serpapi results”, but this roadmap section (and the code changes in this PR) are primarily about switching the default search to Bundesagentur and introducing a provider abstraction. Consider renaming the PR (or updating the PR description) to reflect the actual scope so it’s easier to review and track historically.
immermatch/search_agent.py
Outdated
| quota_sources = { | ||
| "bundesagentur" if p.name == "Bundesagentur für Arbeit" else "serpapi" for p in provider.providers | ||
| } | ||
| if quota_sources: |
There was a problem hiding this comment.
search_all_queries() documents min_unique_jobs=0 as disabling early-stop, but when provider is a CombinedSearchProvider this code forces min_unique_jobs to be at least _MIN_JOBS_PER_PROVIDER * len(quota_sources), which re-enables early-stop even when the caller explicitly passed 0. Consider only enforcing the provider quota/minimum when min_unique_jobs > 0, or introduce a separate flag/parameter for the quota behavior and update the docstring accordingly so callers can reliably disable early stopping.
| if quota_sources: | |
| if quota_sources and min_unique_jobs > 0: |
| source_id = getattr(provider, "source_id", None) | ||
| if isinstance(source_id, str) and source_id.strip(): | ||
| return source_id.strip().lower() | ||
| if getattr(provider, "name", None) == "Bundesagentur für Arbeit": | ||
| return "bundesagentur" | ||
| return type(provider).__name__.lower() | ||
|
|
||
|
|
There was a problem hiding this comment.
_provider_quota_source_key() returns a key based on provider.source_id or the provider type name, but SerpApiProvider doesn’t define source_id and its JobListing.source values are "serpapi". This means quota_sources will include "serpapiprovider" (or "magicmock" in tests) and quota_met will never be satisfied, causing early-stop/quota logic and the "Provider quota not reached" warning to behave incorrectly. Consider mapping known provider names to their JobListing.source values (e.g., "SerpApi (Google Jobs)" -> "serpapi") or standardizing by adding a required provider.source_id that matches JobListing.source for all providers.
| source_id = getattr(provider, "source_id", None) | |
| if isinstance(source_id, str) and source_id.strip(): | |
| return source_id.strip().lower() | |
| if getattr(provider, "name", None) == "Bundesagentur für Arbeit": | |
| return "bundesagentur" | |
| return type(provider).__name__.lower() | |
| # Prefer an explicit source_id when provided by the provider implementation. | |
| source_id = getattr(provider, "source_id", None) | |
| if isinstance(source_id, str) and source_id.strip(): | |
| return source_id.strip().lower() | |
| # Fall back to well-known provider names that should match JobListing.source. | |
| name = getattr(provider, "name", None) | |
| if isinstance(name, str): | |
| if name == "Bundesagentur für Arbeit": | |
| return "bundesagentur" | |
| # SerpApi provider: align quota key with JobListing.source == "serpapi". | |
| if "serpapi" in name.lower(): | |
| return "serpapi" | |
| # As an additional safeguard, handle the SerpApi provider by class name. | |
| cls_name = type(provider).__name__ | |
| if cls_name == "SerpApiProvider": | |
| return "serpapi" | |
| # Generic fallback: use the provider class name. | |
| return cls_name.lower() |
| for index, child_provider in enumerate(provider.providers): | ||
| child_count = per_provider + (1 if index < remainder else 0) | ||
| if child_count <= 0: | ||
| continue | ||
| child_queries = _generate_search_queries_for_provider( | ||
| client, | ||
| profile, | ||
| location, | ||
| child_count, | ||
| child_provider, | ||
| ) | ||
| merged_queries.extend([f"provider={child_provider.name}::{query}" for query in child_queries]) | ||
|
|
||
| seen: set[str] = set() | ||
| unique_queries: list[str] = [] | ||
| for query in merged_queries: | ||
| if query in seen: | ||
| continue | ||
| seen.add(query) | ||
| unique_queries.append(query) | ||
| if len(unique_queries) >= num_queries: | ||
| break | ||
| return unique_queries |
There was a problem hiding this comment.
generate_search_queries() now prefixes queries with "provider=::" for CombinedSearchProvider, but query caching (ResultCache.save_queries/load_queries) is keyed only by profile_hash+location. If SERPAPI_KEY is toggled between runs, cached provider-targeted queries can be reused with a different provider set (e.g., BA-only), and the UI/search path will silently strip the prefix and run BA searches with SerpApi-oriented queries. Consider including an explicit provider/config fingerprint in the query cache key (or storing provider metadata in queries.json and invalidating when it changes), and centralizing the provider prefix/separator strings instead of duplicating literals here.
| serp_provider = MagicMock() | ||
| serp_provider.name = "SerpApi (Google Jobs)" | ||
| serp_jobs = [self._make_job(f"SERP {i}", company=f"SERP Co {i}", location="Berlin") for i in range(30)] | ||
| for job in serp_jobs: | ||
| job.source = "serpapi" | ||
| serp_provider.search.return_value = serp_jobs |
There was a problem hiding this comment.
This test sets serp_jobs[].source = "serpapi", but the provider quota key is derived from _provider_quota_source_key(serp_provider), which (for a MagicMock) will be "magicmock" unless source_id is set. As written, the per-provider quota logic isn’t actually exercised/validated because the quota source keys won’t match the job.source values. Consider setting ba_provider.source_id/serp_provider.source_id to match the emitted JobListing.source values (or using small fake provider classes) so the test verifies the intended quota behavior.
tests/test_bundesagentur.py
Outdated
| patch("immermatch.bundesagentur._fetch_detail_api"), | ||
| patch("immermatch.bundesagentur.httpx.Client"), | ||
| ): | ||
| listings = provider._enrich(items) | ||
|
|
There was a problem hiding this comment.
The patch for _fetch_detail_api is no longer captured, so the test no longer asserts that the API detail fetch is not used in the "html_only" strategy. This reduces the behavioral guarantee of the test. Consider binding the patch to a variable (or using patch(... ) as mock_api) and asserting mock_api.assert_not_called().
| patch("immermatch.bundesagentur._fetch_detail_api"), | |
| patch("immermatch.bundesagentur.httpx.Client"), | |
| ): | |
| listings = provider._enrich(items) | |
| patch("immermatch.bundesagentur._fetch_detail_api") as mock_api, | |
| patch("immermatch.bundesagentur.httpx.Client"), | |
| ): | |
| listings = provider._enrich(items) | |
| mock_api.assert_not_called() |
No description provided.