Skip to content

Go agentur fuer arbeit#56

Merged
TheTrueAI merged 10 commits intomainfrom
go-agentur-fuer-arbeit
Mar 1, 2026
Merged

Go agentur fuer arbeit#56
TheTrueAI merged 10 commits intomainfrom
go-agentur-fuer-arbeit

Conversation

@TheTrueAI
Copy link
Owner

No description provided.

- Introduce SearchProvider protocol in search_provider.py with a
  get_provider() factory for future per-country routing
- Add 'source' field to JobListing model to track which provider
  produced each listing
- Update JobListing docstring to be provider-agnostic
- BundesagenturProvider calls the free BA REST API (jobsuche.api.bund.dev)
- Search endpoint (pc/v4/jobs) with pagination, detail-fetching
  (pc/v2/jobdetails) in parallel via ThreadPoolExecutor
- veroeffentlichtseit=7 default (only recent listings, solves stale jobs)
- Retry with exponential backoff on 429/5xx errors
- 22 tests covering parsing, pagination, error handling, and protocol
  conformance
- Move all SerpAPI-specific code (blocked portals, GL codes, localisation,
  parse_job_results, search_jobs) to immermatch/serpapi_provider.py
- Add SerpApiProvider class conforming to SearchProvider protocol
- Re-export all symbols from search_agent.py for backward compatibility
- All 251 existing tests pass unchanged
- search_all_queries() now delegates to provider.search() instead of
  calling SerpAPI functions directly
- generate_search_queries() selects BA or SerpAPI prompt based on provider
- Add BA_HEADHUNTER_SYSTEM_PROMPT for keyword-only queries (no location)
- Both functions accept an optional provider kwarg (defaults to get_provider)
- Remove 're' import (no longer needed)
- Update all tests to use mock providers instead of patching search_jobs
- 254 tests pass (3 new: prompt selection + provider delegation)
- Remove SERPAPI_KEY from app.py secrets loop and _keys_ok() validation
- Update sidebar caption: 'Powered by Gemini & Bundesagentur für Arbeit'
- Update location placeholder to German cities + add help tooltip
- Remove SERPAPI_KEY from daily_task.py docstring and env var
- Remove from .env.example and daily-digest.yml workflow
- Remove from test_app_ui.py fake env
- 254 tests pass
- AGENTS.md: Document SearchProvider architecture, BA and SerpApi
  headhunter prompts, BA API parameters, test_bundesagentur.py entry
- ROADMAP.md: Replace SerpAPI Cost Optimization with Search Provider
  Expansion section (3 items checked, 3 new items)
- README.md: Remove SERPAPI_KEY from env var table, note BA API is free
- Remove SERPAPI_KEY from required env vars throughout
The BA REST API's detail endpoint requires OAuth (returns 403), but the
public job-detail pages at arbeitsagentur.de/jobsuche/jobdetail/{refnr}
are server-side rendered with an Angular ng-state JSON blob containing
the complete listing data.

Changes:
- Add _fetch_detail() — scrapes the public HTML page, extracts the
  ng-state JSON, returns the jobdetail dict
- Add _clean_html() — strips tags, decodes entities, collapses whitespace
- Split search() into _search_items() + _enrich() (parallel detail-page
  fetching via ThreadPoolExecutor)
- _parse_listing() now accepts optional detail dict for rich descriptions
  and external apply URLs
- Rewrite tests: 44 tests covering full pipeline

Verified end-to-end: 2k-5k char descriptions returned for real listings.
Copilot AI review requested due to automatic review settings March 1, 2026 22:42
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR switches the default job-search backend from SerpApi (Google Jobs) to the free Bundesagentur für Arbeit (German Federal Employment Agency) REST API. It introduces a SearchProvider protocol layer so providers are pluggable, extracts SerpApi logic into SerpApiProvider (retained for future non-German markets), and adapts all tests, pipelines, and documentation accordingly.

Changes:

  • Introduces immermatch/bundesagentur.py (BundesagenturProvider) and immermatch/search_provider.py (SearchProvider protocol + get_provider() factory), moving search logic behind a pluggable interface.
  • Moves SerpApi-specific helpers to immermatch/serpapi_provider.py, re-exporting them from search_agent.py for backward compatibility; generate_search_queries now selects the correct LLM prompt based on the active provider.
  • Updates all tests (test_search_agent.py, test_integration.py), adds a new tests/test_bundesagentur.py, and removes SERPAPI_KEY from all configuration, CI, and documentation.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
immermatch/bundesagentur.py New BA provider: pagination, parallel detail-fetching, retry logic
immermatch/search_provider.py New SearchProvider protocol and get_provider() factory
immermatch/serpapi_provider.py SerpApi logic extracted from search_agent.py; SerpApiProvider class added
immermatch/search_agent.py Delegates to SearchProvider; adds BA prompt; re-exports SerpApi helpers
immermatch/models.py Adds source field to JobListing for provider tracking
immermatch/app.py Removes SerpApi key dependency; updates UI placeholder/caption
daily_task.py Removes SERPAPI_KEY from docs/logging
tests/test_bundesagentur.py New comprehensive test file for the BA provider
tests/test_search_agent.py Tests updated to use provider-based API
tests/test_integration.py Integration tests updated to inject mock provider
tests/test_app_ui.py Removes SERPAPI_KEY from fake env
AGENTS.md Architecture docs updated for new provider model
README.md Adds BA provider tuning section; removes SerpApi env var entry
ROADMAP.md Updates Phase 3 milestones
CONTRIBUTING.md / .github/copilot-instructions.md Lint command updated to --fix
.github/workflows/daily-digest.yml Removes SERPAPI_KEY secret
.env.example Removes SERPAPI_KEY entry

AGENTS.md Outdated
BA_BASE_URL = "https://rest.arbeitsagentur.de/jobboerse/jobsuche-service"
BA_API_KEY = "jobboerse-jobsuche" # pragma: allowlist secret # public API key, not a secret
# Search endpoint: /pc/v4/jobs (params: was, wo, veroeffentlichtseit, size, page, angebotsart)
# Detail endpoint: /pc/v2/jobdetails/{hashID}
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AGENTS.md documents the BA detail endpoint as /pc/v2/jobdetails/{hashID}, but the actual implementation in immermatch/bundesagentur.py uses /pc/v4/jobdetails/{refnr} (line 130). The version and the parameter name are both wrong in the documentation. This should be updated to /pc/v4/jobdetails/{refnr} to match the code.

Suggested change
# Detail endpoint: /pc/v2/jobdetails/{hashID}
# Detail endpoint: /pc/v4/jobdetails/{refnr}

Copilot uses AI. Check for mistakes.
AGENTS.md Outdated
BA_API_KEY = "jobboerse-jobsuche" # pragma: allowlist secret # public API key, not a secret
# Search endpoint: /pc/v4/jobs (params: was, wo, veroeffentlichtseit, size, page, angebotsart)
# Detail endpoint: /pc/v2/jobdetails/{hashID}
# Default: veroeffentlichtseit=7 (last 7 days only)
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AGENTS.md documents veroeffentlichtseit=7 (last 7 days only) as the default, but _DEFAULT_DAYS_PUBLISHED in bundesagentur.py is set to 28. The same discrepancy causes the test test_veroeffentlichtseit_default to fail (it asserts == 7). Either the comment in AGENTS.md should say 28 to reflect the code, or the constant and test need to align on 7. Whichever value is correct, all three locations (AGENTS.md line 214, bundesagentur.py:_DEFAULT_DAYS_PUBLISHED, and the test assertion) must agree.

Suggested change
# Default: veroeffentlichtseit=7 (last 7 days only)
# Default: veroeffentlichtseit=28 (last 28 days only)

Copilot uses AI. Check for mistakes.
Comment on lines 503 to 545
def test_enriches_items_with_details(self) -> None:
items = [
_make_stellenangebot(refnr="r1", titel="Dev A"),
_make_stellenangebot(refnr="r2", titel="Dev B"),
]
details = {
"r1": _make_detail(description="<b>Desc A</b>"),
"r2": _make_detail(description="<p>Desc B</p>"),
}

provider = BundesagenturProvider()
with patch("immermatch.bundesagentur._fetch_detail", side_effect=lambda _c, refnr: details.get(refnr, {})):
with patch("immermatch.bundesagentur.httpx.Client"):
listings = provider._enrich(items)

assert len(listings) == 2
assert listings[0].description == "Desc A"
assert listings[1].description == "Desc B"

def test_enrich_falls_back_on_failed_detail(self) -> None:
items = [_make_stellenangebot(refnr="r1", titel="Dev", arbeitgeber="Corp")]

provider = BundesagenturProvider()
with patch("immermatch.bundesagentur._fetch_detail", return_value={}):
with patch("immermatch.bundesagentur.httpx.Client"):
listings = provider._enrich(items)

assert len(listings) == 1
# Uses fallback description from search fields
assert "Arbeitgeber: Corp" in listings[0].description

def test_enrich_with_external_apply_url(self) -> None:
items = [_make_stellenangebot(refnr="r1")]
detail = _make_detail(partner_url="https://jobs.example.com", partner_name="Example")

provider = BundesagenturProvider()
with patch("immermatch.bundesagentur._fetch_detail", return_value=detail):
with patch("immermatch.bundesagentur.httpx.Client"):
listings = provider._enrich(items)

assert len(listings[0].apply_options) == 2
assert listings[0].apply_options[1].source == "Example"
assert listings[0].apply_options[1].url == "https://jobs.example.com"
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests test_enriches_items_with_details, test_enrich_falls_back_on_failed_detail, and test_enrich_with_external_apply_url only mock _fetch_detail (HTML path) but not _fetch_detail_api. With the default strategy being api_then_html, _get_detail will first call _fetch_detail_api, and only when it returns empty, fall back to _fetch_detail. These tests currently work only by coincidence: patching httpx.Client causes _fetch_detail_api to receive a MagicMock whose .get().status_code is neither 200 nor in the retry set, so it returns {}, triggering the fallback. The tests should explicitly mock _fetch_detail_api to return {} to make the api_then_html fallback path clear and intentional.

Copilot uses AI. Check for mistakes.
Comment on lines +576 to +577


Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TestEnrich class tests api_then_html (via test_api_then_html_strategy_falls_back_to_html) and api_only (via test_api_only_strategy_uses_api_detail) strategies, but there is no test for the html_only strategy (which bypasses _fetch_detail_api entirely). Adding a test for the html_only strategy would complete the coverage for all three paths of _get_detail().

Suggested change
def test_html_only_strategy_uses_html_detail(self) -> None:
items = [_make_stellenangebot(refnr="r1", titel="Dev", arbeitgeber="Corp")]
html_detail = {"stellenangebotsBeschreibung": "HTML only detail"}
provider = BundesagenturProvider(detail_strategy="html_only")
with (
patch("immermatch.bundesagentur._fetch_detail", return_value=html_detail),
patch("immermatch.bundesagentur._fetch_detail_api") as mock_api,
patch("immermatch.bundesagentur.httpx.Client"),
):
listings = provider._enrich(items)
assert len(listings) == 1
assert listings[0].description == "HTML only detail"
mock_api.assert_not_called()

Copilot uses AI. Check for mistakes.
Comment on lines +276 to +279
return []
return self._enrich(items)

# ------------------------------------------------------------------
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When search_all_queries runs with up to 5 parallel workers, each calls provider.search() which, for BundesagenturProvider, can itself spin up a nested ThreadPoolExecutor(max_workers=5) in _enrich(). This means up to 25 concurrent detail-fetch threads can be alive simultaneously (in addition to the 5 search threads), all making HTTP connections to arbeitsagentur.de. At the default of 20 queries per session, this could create significant concurrent load against the BA API, potentially triggering rate-limiting (the 403/429 responses the retry logic handles). Consider documenting this or capping the outer pool to reduce peak thread count, or passing max_workers as a configurable parameter.

Copilot uses AI. Check for mistakes.

def test_veroeffentlichtseit_default(self) -> None:
provider = BundesagenturProvider()
assert provider._days_published == 7
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test test_veroeffentlichtseit_default asserts that provider._days_published == 7, but the BundesagenturProvider.__init__ defaults days_published to _DEFAULT_DAYS_PUBLISHED, which is defined as 28 in bundesagentur.py line 39. This test will always fail, since BundesagenturProvider() sets _days_published to 28 and the assertion checks for 7.

The fix should make the assertion consistent with the actual default: change assert provider._days_published == 7 to assert provider._days_published == 28, or update the constant to 7 if 7 was the intended default (in which case bundesagentur.py would also need updating).

Suggested change
assert provider._days_published == 7
assert provider._days_published == 28

Copilot uses AI. Check for mistakes.
AGENTS.md Outdated
| `test_evaluator_agent.py` (8 tests) | `evaluator_agent.py` | `evaluate_job()` (4 cases: happy path, API error fallback, parse error fallback, non-dict fallback) + `evaluate_all_jobs()` (3 cases: sorted output, progress callback, empty list) + `generate_summary()` (2 cases: score distribution in prompt, missing skills in prompt) |
| `test_search_agent.py` (32 tests) | `search_agent.py` | `_is_remote_only()` (remote tokens, non-remote) + `_infer_gl()` (known locations, unknown default, remote returns None, case insensitive) + `_localise_query()` (city names, country names, case insensitive, multiple cities) + `_parse_job_results()` (valid, blocked portals, mixed, empty, no-apply-links) + `search_all_queries()` (location auto-append with localisation, no double-append, early stopping) + `TestLlmJsonRecovery` (profile_candidate and generate_search_queries retry/recovery) |
| `test_search_agent.py` (35 tests) | `search_agent.py` | `_is_remote_only()` (remote tokens, non-remote) + `_infer_gl()` (known locations, unknown default, remote returns None, case insensitive) + `_localise_query()` (city names, country names, case insensitive, multiple cities) + `_parse_job_results()` (valid, blocked portals, mixed, empty, no-apply-links) + `search_all_queries()` (provider delegation, dedup, early stopping, callbacks, default provider) + `generate_search_queries()` prompt selection (BA vs SerpApi) + `TestLlmJsonRecovery` (profile_candidate and generate_search_queries retry/recovery) |
| `test_bundesagentur.py` (22 tests) | `bundesagentur.py` | `_build_ba_link()`, `_parse_location()`, `_parse_search_results()`, `_stub_to_listing()`, `BundesagenturProvider.search()` (basic merge, pagination, HTTP errors, empty results, detail fetch failures), `SearchProvider` protocol conformance |
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AGENTS.md test coverage table references a function _stub_to_listing() in the description of test_bundesagentur.py, but no such function exists in immermatch/bundesagentur.py or anywhere else in the codebase. The function that converts a raw search result dict to a JobListing is _parse_listing(). This inaccuracy in the documentation is misleading.

Suggested change
| `test_bundesagentur.py` (22 tests) | `bundesagentur.py` | `_build_ba_link()`, `_parse_location()`, `_parse_search_results()`, `_stub_to_listing()`, `BundesagenturProvider.search()` (basic merge, pagination, HTTP errors, empty results, detail fetch failures), `SearchProvider` protocol conformance |
| `test_bundesagentur.py` (22 tests) | `bundesagentur.py` | `_build_ba_link()`, `_parse_location()`, `_parse_search_results()`, `_parse_listing()`, `BundesagenturProvider.search()` (basic merge, pagination, HTTP errors, empty results, detail fetch failures), `SearchProvider` protocol conformance |

Copilot uses AI. Check for mistakes.
@TheTrueAI TheTrueAI merged commit 13e0761 into main Mar 1, 2026
4 checks passed
@TheTrueAI TheTrueAI deleted the go-agentur-fuer-arbeit branch March 1, 2026 23:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants