Conversation
- Introduce SearchProvider protocol in search_provider.py with a get_provider() factory for future per-country routing - Add 'source' field to JobListing model to track which provider produced each listing - Update JobListing docstring to be provider-agnostic
- BundesagenturProvider calls the free BA REST API (jobsuche.api.bund.dev) - Search endpoint (pc/v4/jobs) with pagination, detail-fetching (pc/v2/jobdetails) in parallel via ThreadPoolExecutor - veroeffentlichtseit=7 default (only recent listings, solves stale jobs) - Retry with exponential backoff on 429/5xx errors - 22 tests covering parsing, pagination, error handling, and protocol conformance
- Move all SerpAPI-specific code (blocked portals, GL codes, localisation, parse_job_results, search_jobs) to immermatch/serpapi_provider.py - Add SerpApiProvider class conforming to SearchProvider protocol - Re-export all symbols from search_agent.py for backward compatibility - All 251 existing tests pass unchanged
- search_all_queries() now delegates to provider.search() instead of calling SerpAPI functions directly - generate_search_queries() selects BA or SerpAPI prompt based on provider - Add BA_HEADHUNTER_SYSTEM_PROMPT for keyword-only queries (no location) - Both functions accept an optional provider kwarg (defaults to get_provider) - Remove 're' import (no longer needed) - Update all tests to use mock providers instead of patching search_jobs - 254 tests pass (3 new: prompt selection + provider delegation)
- Remove SERPAPI_KEY from app.py secrets loop and _keys_ok() validation - Update sidebar caption: 'Powered by Gemini & Bundesagentur für Arbeit' - Update location placeholder to German cities + add help tooltip - Remove SERPAPI_KEY from daily_task.py docstring and env var - Remove from .env.example and daily-digest.yml workflow - Remove from test_app_ui.py fake env - 254 tests pass
- AGENTS.md: Document SearchProvider architecture, BA and SerpApi headhunter prompts, BA API parameters, test_bundesagentur.py entry - ROADMAP.md: Replace SerpAPI Cost Optimization with Search Provider Expansion section (3 items checked, 3 new items) - README.md: Remove SERPAPI_KEY from env var table, note BA API is free - Remove SERPAPI_KEY from required env vars throughout
The BA REST API's detail endpoint requires OAuth (returns 403), but the
public job-detail pages at arbeitsagentur.de/jobsuche/jobdetail/{refnr}
are server-side rendered with an Angular ng-state JSON blob containing
the complete listing data.
Changes:
- Add _fetch_detail() — scrapes the public HTML page, extracts the
ng-state JSON, returns the jobdetail dict
- Add _clean_html() — strips tags, decodes entities, collapses whitespace
- Split search() into _search_items() + _enrich() (parallel detail-page
fetching via ThreadPoolExecutor)
- _parse_listing() now accepts optional detail dict for rich descriptions
and external apply URLs
- Rewrite tests: 44 tests covering full pipeline
Verified end-to-end: 2k-5k char descriptions returned for real listings.
…ch strategy and increase job query limit
There was a problem hiding this comment.
Pull request overview
This PR switches the default job-search backend from SerpApi (Google Jobs) to the free Bundesagentur für Arbeit (German Federal Employment Agency) REST API. It introduces a SearchProvider protocol layer so providers are pluggable, extracts SerpApi logic into SerpApiProvider (retained for future non-German markets), and adapts all tests, pipelines, and documentation accordingly.
Changes:
- Introduces
immermatch/bundesagentur.py(BundesagenturProvider) andimmermatch/search_provider.py(SearchProviderprotocol +get_provider()factory), moving search logic behind a pluggable interface. - Moves SerpApi-specific helpers to
immermatch/serpapi_provider.py, re-exporting them fromsearch_agent.pyfor backward compatibility;generate_search_queriesnow selects the correct LLM prompt based on the active provider. - Updates all tests (
test_search_agent.py,test_integration.py), adds a newtests/test_bundesagentur.py, and removesSERPAPI_KEYfrom all configuration, CI, and documentation.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
immermatch/bundesagentur.py |
New BA provider: pagination, parallel detail-fetching, retry logic |
immermatch/search_provider.py |
New SearchProvider protocol and get_provider() factory |
immermatch/serpapi_provider.py |
SerpApi logic extracted from search_agent.py; SerpApiProvider class added |
immermatch/search_agent.py |
Delegates to SearchProvider; adds BA prompt; re-exports SerpApi helpers |
immermatch/models.py |
Adds source field to JobListing for provider tracking |
immermatch/app.py |
Removes SerpApi key dependency; updates UI placeholder/caption |
daily_task.py |
Removes SERPAPI_KEY from docs/logging |
tests/test_bundesagentur.py |
New comprehensive test file for the BA provider |
tests/test_search_agent.py |
Tests updated to use provider-based API |
tests/test_integration.py |
Integration tests updated to inject mock provider |
tests/test_app_ui.py |
Removes SERPAPI_KEY from fake env |
AGENTS.md |
Architecture docs updated for new provider model |
README.md |
Adds BA provider tuning section; removes SerpApi env var entry |
ROADMAP.md |
Updates Phase 3 milestones |
CONTRIBUTING.md / .github/copilot-instructions.md |
Lint command updated to --fix |
.github/workflows/daily-digest.yml |
Removes SERPAPI_KEY secret |
.env.example |
Removes SERPAPI_KEY entry |
AGENTS.md
Outdated
| BA_BASE_URL = "https://rest.arbeitsagentur.de/jobboerse/jobsuche-service" | ||
| BA_API_KEY = "jobboerse-jobsuche" # pragma: allowlist secret # public API key, not a secret | ||
| # Search endpoint: /pc/v4/jobs (params: was, wo, veroeffentlichtseit, size, page, angebotsart) | ||
| # Detail endpoint: /pc/v2/jobdetails/{hashID} |
There was a problem hiding this comment.
The AGENTS.md documents the BA detail endpoint as /pc/v2/jobdetails/{hashID}, but the actual implementation in immermatch/bundesagentur.py uses /pc/v4/jobdetails/{refnr} (line 130). The version and the parameter name are both wrong in the documentation. This should be updated to /pc/v4/jobdetails/{refnr} to match the code.
| # Detail endpoint: /pc/v2/jobdetails/{hashID} | |
| # Detail endpoint: /pc/v4/jobdetails/{refnr} |
AGENTS.md
Outdated
| BA_API_KEY = "jobboerse-jobsuche" # pragma: allowlist secret # public API key, not a secret | ||
| # Search endpoint: /pc/v4/jobs (params: was, wo, veroeffentlichtseit, size, page, angebotsart) | ||
| # Detail endpoint: /pc/v2/jobdetails/{hashID} | ||
| # Default: veroeffentlichtseit=7 (last 7 days only) |
There was a problem hiding this comment.
The AGENTS.md documents veroeffentlichtseit=7 (last 7 days only) as the default, but _DEFAULT_DAYS_PUBLISHED in bundesagentur.py is set to 28. The same discrepancy causes the test test_veroeffentlichtseit_default to fail (it asserts == 7). Either the comment in AGENTS.md should say 28 to reflect the code, or the constant and test need to align on 7. Whichever value is correct, all three locations (AGENTS.md line 214, bundesagentur.py:_DEFAULT_DAYS_PUBLISHED, and the test assertion) must agree.
| # Default: veroeffentlichtseit=7 (last 7 days only) | |
| # Default: veroeffentlichtseit=28 (last 28 days only) |
| def test_enriches_items_with_details(self) -> None: | ||
| items = [ | ||
| _make_stellenangebot(refnr="r1", titel="Dev A"), | ||
| _make_stellenangebot(refnr="r2", titel="Dev B"), | ||
| ] | ||
| details = { | ||
| "r1": _make_detail(description="<b>Desc A</b>"), | ||
| "r2": _make_detail(description="<p>Desc B</p>"), | ||
| } | ||
|
|
||
| provider = BundesagenturProvider() | ||
| with patch("immermatch.bundesagentur._fetch_detail", side_effect=lambda _c, refnr: details.get(refnr, {})): | ||
| with patch("immermatch.bundesagentur.httpx.Client"): | ||
| listings = provider._enrich(items) | ||
|
|
||
| assert len(listings) == 2 | ||
| assert listings[0].description == "Desc A" | ||
| assert listings[1].description == "Desc B" | ||
|
|
||
| def test_enrich_falls_back_on_failed_detail(self) -> None: | ||
| items = [_make_stellenangebot(refnr="r1", titel="Dev", arbeitgeber="Corp")] | ||
|
|
||
| provider = BundesagenturProvider() | ||
| with patch("immermatch.bundesagentur._fetch_detail", return_value={}): | ||
| with patch("immermatch.bundesagentur.httpx.Client"): | ||
| listings = provider._enrich(items) | ||
|
|
||
| assert len(listings) == 1 | ||
| # Uses fallback description from search fields | ||
| assert "Arbeitgeber: Corp" in listings[0].description | ||
|
|
||
| def test_enrich_with_external_apply_url(self) -> None: | ||
| items = [_make_stellenangebot(refnr="r1")] | ||
| detail = _make_detail(partner_url="https://jobs.example.com", partner_name="Example") | ||
|
|
||
| provider = BundesagenturProvider() | ||
| with patch("immermatch.bundesagentur._fetch_detail", return_value=detail): | ||
| with patch("immermatch.bundesagentur.httpx.Client"): | ||
| listings = provider._enrich(items) | ||
|
|
||
| assert len(listings[0].apply_options) == 2 | ||
| assert listings[0].apply_options[1].source == "Example" | ||
| assert listings[0].apply_options[1].url == "https://jobs.example.com" |
There was a problem hiding this comment.
The tests test_enriches_items_with_details, test_enrich_falls_back_on_failed_detail, and test_enrich_with_external_apply_url only mock _fetch_detail (HTML path) but not _fetch_detail_api. With the default strategy being api_then_html, _get_detail will first call _fetch_detail_api, and only when it returns empty, fall back to _fetch_detail. These tests currently work only by coincidence: patching httpx.Client causes _fetch_detail_api to receive a MagicMock whose .get().status_code is neither 200 nor in the retry set, so it returns {}, triggering the fallback. The tests should explicitly mock _fetch_detail_api to return {} to make the api_then_html fallback path clear and intentional.
|
|
||
|
|
There was a problem hiding this comment.
The TestEnrich class tests api_then_html (via test_api_then_html_strategy_falls_back_to_html) and api_only (via test_api_only_strategy_uses_api_detail) strategies, but there is no test for the html_only strategy (which bypasses _fetch_detail_api entirely). Adding a test for the html_only strategy would complete the coverage for all three paths of _get_detail().
| def test_html_only_strategy_uses_html_detail(self) -> None: | |
| items = [_make_stellenangebot(refnr="r1", titel="Dev", arbeitgeber="Corp")] | |
| html_detail = {"stellenangebotsBeschreibung": "HTML only detail"} | |
| provider = BundesagenturProvider(detail_strategy="html_only") | |
| with ( | |
| patch("immermatch.bundesagentur._fetch_detail", return_value=html_detail), | |
| patch("immermatch.bundesagentur._fetch_detail_api") as mock_api, | |
| patch("immermatch.bundesagentur.httpx.Client"), | |
| ): | |
| listings = provider._enrich(items) | |
| assert len(listings) == 1 | |
| assert listings[0].description == "HTML only detail" | |
| mock_api.assert_not_called() |
| return [] | ||
| return self._enrich(items) | ||
|
|
||
| # ------------------------------------------------------------------ |
There was a problem hiding this comment.
When search_all_queries runs with up to 5 parallel workers, each calls provider.search() which, for BundesagenturProvider, can itself spin up a nested ThreadPoolExecutor(max_workers=5) in _enrich(). This means up to 25 concurrent detail-fetch threads can be alive simultaneously (in addition to the 5 search threads), all making HTTP connections to arbeitsagentur.de. At the default of 20 queries per session, this could create significant concurrent load against the BA API, potentially triggering rate-limiting (the 403/429 responses the retry logic handles). Consider documenting this or capping the outer pool to reduce peak thread count, or passing max_workers as a configurable parameter.
tests/test_bundesagentur.py
Outdated
|
|
||
| def test_veroeffentlichtseit_default(self) -> None: | ||
| provider = BundesagenturProvider() | ||
| assert provider._days_published == 7 |
There was a problem hiding this comment.
The test test_veroeffentlichtseit_default asserts that provider._days_published == 7, but the BundesagenturProvider.__init__ defaults days_published to _DEFAULT_DAYS_PUBLISHED, which is defined as 28 in bundesagentur.py line 39. This test will always fail, since BundesagenturProvider() sets _days_published to 28 and the assertion checks for 7.
The fix should make the assertion consistent with the actual default: change assert provider._days_published == 7 to assert provider._days_published == 28, or update the constant to 7 if 7 was the intended default (in which case bundesagentur.py would also need updating).
| assert provider._days_published == 7 | |
| assert provider._days_published == 28 |
AGENTS.md
Outdated
| | `test_evaluator_agent.py` (8 tests) | `evaluator_agent.py` | `evaluate_job()` (4 cases: happy path, API error fallback, parse error fallback, non-dict fallback) + `evaluate_all_jobs()` (3 cases: sorted output, progress callback, empty list) + `generate_summary()` (2 cases: score distribution in prompt, missing skills in prompt) | | ||
| | `test_search_agent.py` (32 tests) | `search_agent.py` | `_is_remote_only()` (remote tokens, non-remote) + `_infer_gl()` (known locations, unknown default, remote returns None, case insensitive) + `_localise_query()` (city names, country names, case insensitive, multiple cities) + `_parse_job_results()` (valid, blocked portals, mixed, empty, no-apply-links) + `search_all_queries()` (location auto-append with localisation, no double-append, early stopping) + `TestLlmJsonRecovery` (profile_candidate and generate_search_queries retry/recovery) | | ||
| | `test_search_agent.py` (35 tests) | `search_agent.py` | `_is_remote_only()` (remote tokens, non-remote) + `_infer_gl()` (known locations, unknown default, remote returns None, case insensitive) + `_localise_query()` (city names, country names, case insensitive, multiple cities) + `_parse_job_results()` (valid, blocked portals, mixed, empty, no-apply-links) + `search_all_queries()` (provider delegation, dedup, early stopping, callbacks, default provider) + `generate_search_queries()` prompt selection (BA vs SerpApi) + `TestLlmJsonRecovery` (profile_candidate and generate_search_queries retry/recovery) | | ||
| | `test_bundesagentur.py` (22 tests) | `bundesagentur.py` | `_build_ba_link()`, `_parse_location()`, `_parse_search_results()`, `_stub_to_listing()`, `BundesagenturProvider.search()` (basic merge, pagination, HTTP errors, empty results, detail fetch failures), `SearchProvider` protocol conformance | |
There was a problem hiding this comment.
The AGENTS.md test coverage table references a function _stub_to_listing() in the description of test_bundesagentur.py, but no such function exists in immermatch/bundesagentur.py or anywhere else in the codebase. The function that converts a raw search result dict to a JobListing is _parse_listing(). This inaccuracy in the documentation is misleading.
| | `test_bundesagentur.py` (22 tests) | `bundesagentur.py` | `_build_ba_link()`, `_parse_location()`, `_parse_search_results()`, `_stub_to_listing()`, `BundesagenturProvider.search()` (basic merge, pagination, HTTP errors, empty results, detail fetch failures), `SearchProvider` protocol conformance | | |
| | `test_bundesagentur.py` (22 tests) | `bundesagentur.py` | `_build_ba_link()`, `_parse_location()`, `_parse_search_results()`, `_parse_listing()`, `BundesagenturProvider.search()` (basic merge, pagination, HTTP errors, empty results, detail fetch failures), `SearchProvider` protocol conformance | |
No description provided.