Improve serpapi results by TheTrueAI · Pull Request #57 · TheTrueAI/immermatch

TheTrueAI · 2026-03-01T23:06:45Z

No description provided.

- Introduce SearchProvider protocol in search_provider.py with a get_provider() factory for future per-country routing - Add 'source' field to JobListing model to track which provider produced each listing - Update JobListing docstring to be provider-agnostic

- BundesagenturProvider calls the free BA REST API (jobsuche.api.bund.dev) - Search endpoint (pc/v4/jobs) with pagination, detail-fetching (pc/v2/jobdetails) in parallel via ThreadPoolExecutor - veroeffentlichtseit=7 default (only recent listings, solves stale jobs) - Retry with exponential backoff on 429/5xx errors - 22 tests covering parsing, pagination, error handling, and protocol conformance

- Move all SerpAPI-specific code (blocked portals, GL codes, localisation, parse_job_results, search_jobs) to immermatch/serpapi_provider.py - Add SerpApiProvider class conforming to SearchProvider protocol - Re-export all symbols from search_agent.py for backward compatibility - All 251 existing tests pass unchanged

- search_all_queries() now delegates to provider.search() instead of calling SerpAPI functions directly - generate_search_queries() selects BA or SerpAPI prompt based on provider - Add BA_HEADHUNTER_SYSTEM_PROMPT for keyword-only queries (no location) - Both functions accept an optional provider kwarg (defaults to get_provider) - Remove 're' import (no longer needed) - Update all tests to use mock providers instead of patching search_jobs - 254 tests pass (3 new: prompt selection + provider delegation)

- Remove SERPAPI_KEY from app.py secrets loop and _keys_ok() validation - Update sidebar caption: 'Powered by Gemini & Bundesagentur für Arbeit' - Update location placeholder to German cities + add help tooltip - Remove SERPAPI_KEY from daily_task.py docstring and env var - Remove from .env.example and daily-digest.yml workflow - Remove from test_app_ui.py fake env - 254 tests pass

- AGENTS.md: Document SearchProvider architecture, BA and SerpApi headhunter prompts, BA API parameters, test_bundesagentur.py entry - ROADMAP.md: Replace SerpAPI Cost Optimization with Search Provider Expansion section (3 items checked, 3 new items) - README.md: Remove SERPAPI_KEY from env var table, note BA API is free - Remove SERPAPI_KEY from required env vars throughout

The BA REST API's detail endpoint requires OAuth (returns 403), but the public job-detail pages at arbeitsagentur.de/jobsuche/jobdetail/{refnr} are server-side rendered with an Angular ng-state JSON blob containing the complete listing data. Changes: - Add _fetch_detail() — scrapes the public HTML page, extracts the ng-state JSON, returns the jobdetail dict - Add _clean_html() — strips tags, decodes entities, collapses whitespace - Split search() into _search_items() + _enrich() (parallel detail-page fetching via ThreadPoolExecutor) - _parse_listing() now accepts optional detail dict for rich descriptions and external apply URLs - Rewrite tests: 44 tests covering full pipeline Verified end-to-end: 2k-5k char descriptions returned for real listings.

…ch strategy and increase job query limit

Now both Bundesagentur and SerpAPI are used to find each 30 jobs

Copilot

Pull request overview

This PR introduces a pluggable search-provider architecture and updates the job search pipeline to use provider-backed searches (Bundesagentur für Arbeit by default, SerpApi retained), with updated deduplication and extensive test coverage.

Changes:

Added SearchProvider protocol + get_provider() factory and CombinedSearchProvider routing (including targeted provider=...:: queries).
Implemented BundesagenturProvider (BA API + detail enrichment) and extracted SerpApi logic into SerpApiProvider.
Updated search orchestration, caching, and UI to deduplicate by title|company|location, log source counts, and adjust docs/tests accordingly.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`tests/test_search_provider.py`	Adds unit tests for `parse_provider_query()` targeted query parsing.
`tests/test_search_agent.py`	Refactors orchestration tests to use mocked providers; adds combined-provider and prompt-selection coverage.
`tests/test_integration.py`	Updates full-pipeline tests to use provider mocks and BA prompt selection.
`tests/test_cache.py`	Updates cache key expectations to include location.
`tests/test_bundesagentur.py`	Adds comprehensive tests for BA provider helpers, retry/detail parsing, and provider behavior.
`tests/test_app_ui.py`	Removes SerpApi env dependency from UI test environment setup.
`immermatch/serpapi_provider.py`	New SerpApi-backed provider module + helpers extracted from `search_agent.py`.
`immermatch/search_provider.py`	Introduces provider protocol, combined provider, targeted query parsing, and provider factory.
`immermatch/search_agent.py`	Migrates search orchestration to provider-based search, adds combined-provider quota logic, re-exports SerpApi helpers for compatibility, updates query generation to be provider-aware.
`immermatch/models.py`	Generalizes `JobListing` and adds a `source` field to track provider origin.
`immermatch/cache.py`	Updates job/evaluation caching and unevaluated filtering to include location in keys.
`immermatch/bundesagentur.py`	New Bundesagentur provider implementation with pagination, retry, and parallel detail enrichment.
`immermatch/app.py`	Updates defaults/UX messaging, removes SerpApi key requirement, and adjusts evaluation keys + query display stripping provider prefixes.
`daily_task.py`	Updates documentation/logging around SerpApi usage; aligns job dedup key with location inclusion.
`ROADMAP.md`	Updates roadmap section to reflect provider expansion (BA + provider abstraction).
`README.md`	Adds BA provider tuning section and removes SerpApi env var requirement note (but still contains SerpApi references that need reconciling).
`CONTRIBUTING.md`	Updates recommended ruff command to include `--fix`.
`AGENTS.md`	Updates prompts and architecture docs for provider abstraction + BA/SerpApi prompts (contains some now-inaccurate specifics).
`.github/workflows/daily-digest.yml`	Removes `SERPAPI_KEY` from workflow env.
`.github/copilot-instructions.md`	Updates check command to use `ruff check --fix`.
`.env.example`	Removes `SERPAPI_KEY` from example env file.

Copilot · 2026-03-01T23:12:26Z

immermatch/search_agent.py

+        quota_sources = {
+            "bundesagentur" if p.name == "Bundesagentur für Arbeit" else "serpapi" for p in provider.providers
+        }


The quota_sources mapping currently treats every non-"Bundesagentur für Arbeit" provider as "serpapi". This will miscount/never satisfy quotas as soon as you add additional providers (as mentioned in the roadmap), and it also couples quota logic to provider names. Prefer deriving quota source keys from the provider type or a dedicated provider identifier (e.g., provider.source_id) rather than hard-coding name→source mappings.

Suggested change

quota_sources = {

"bundesagentur" if p.name == "Bundesagentur für Arbeit" else "serpapi" for p in provider.providers

}

for p in provider.providers:

# Prefer a stable provider identifier (e.g. p.source_id) over name-based mapping.

source_key = getattr(p, "source_id", None)

if not source_key:

if getattr(p, "name", None) == "Bundesagentur für Arbeit":

# Preserve existing semantics for Bundesagentur provider.

source_key = "bundesagentur"

else:

# Fallback to the provider's class name to distinguish different provider types.

source_key = type(p).__name__.lower()

quota_sources.add(source_key)

Copilot · 2026-03-01T23:12:26Z

immermatch/search_provider.py

+        merged: dict[str, JobListing] = {}
+        per_provider = max(1, max_results)
+        for provider in providers:
+            try:
+                jobs = provider.search(clean_query, location, max_results=per_provider)


CombinedSearchProvider.search() requests max_results from each provider (per_provider = max_results), then truncates the merged list back to max_results. This can cause unnecessary upstream calls/work (especially for paid APIs) and may skew which provider’s results survive truncation. Consider splitting the budget across providers (e.g., ceil(max_results / len(providers)), with a small overfetch) or documenting that max_results is “per provider” rather than “total”.

Copilot · 2026-03-01T23:12:27Z

AGENTS.md

+
+- **`BundesagenturProvider`** (default) — queries the free Bundesagentur für Arbeit REST API (`rest.arbeitsagentur.de`). Handles pagination, parallel detail-fetching, and retry logic internally.
+- **`SerpApiProvider`** — wraps Google Jobs via SerpApi. Handles localisation, `gl` code inference, and blocked portal filtering internally.
+- **`get_provider(location)`** factory — currently always returns `BundesagenturProvider`. Future: route by country.


Docs say get_provider(location) “currently always returns BundesagenturProvider”, but immermatch/search_provider.py:get_provider() actually returns a CombinedSearchProvider when SERPAPI_KEY is set. Please update this to match the current factory behavior (or change the factory to match the docs).

Suggested change

- **`get_provider(location)`** factory — currently always returns `BundesagenturProvider`. Future: route by country.

- **`get_provider(location)`** factory — returns a `CombinedSearchProvider` (Bundesagentur + SerpApi) when `SERPAPI_KEY` is set, otherwise a standalone `BundesagenturProvider`. Future: route by country.

Copilot · 2026-03-01T23:12:27Z

AGENTS.md

+- Deduplicates by `title|company_name`
+- Stops early once 50 unique jobs are collected


This search_all_queries() doc section is out of date: deduplication now includes location (title|company_name|location), and early-stopping is controlled by the min_unique_jobs parameter (with additional per-provider quota behavior for combined providers). Please update these bullets so they reflect the actual current behavior.

Suggested change

- Deduplicates by `title|company_name`

- Stops early once 50 unique jobs are collected

- Deduplicates by `title|company_name|location`

- Uses a configurable `min_unique_jobs` threshold for early stopping (with additional per-provider quota limits when aggregating multiple providers)

Copilot · 2026-03-01T23:12:27Z

AGENTS.md

+# Detail endpoint: /pc/v2/jobdetails/{hashID}
+# Default: veroeffentlichtseit=7 (last 7 days only)


Bundesagentur configuration docs don’t match the implementation: BundesagenturProvider uses /pc/v4/jobdetails/{refnr} in code (not /pc/v2/...{hashID}), and _DEFAULT_DAYS_PUBLISHED is 28 in bundesagentur.py (not 7). Please correct these values so operators don’t configure/expect the wrong endpoints or freshness window.

Suggested change

# Detail endpoint: /pc/v2/jobdetails/{hashID}

# Default: veroeffentlichtseit=7 (last 7 days only)

# Detail endpoint: /pc/v4/jobdetails/{refnr}

# Default: veroeffentlichtseit=28 (last 28 days only)

Copilot · 2026-03-01T23:12:28Z

README.md

 Jobs are fetched from Google Jobs via SerpApi, deduplicated, and scored in parallel.

+## Bundesagentur Provider Tuning
+
+The Bundesagentur provider in `immermatch/bundesagentur.py` supports a configurable detail-fetch strategy:


README still states “Jobs are fetched from Google Jobs via SerpApi” (and the prerequisites section above still lists a SerpApi key), but the later Environment Variables section now says one-time search only requires GOOGLE_API_KEY and uses the Bundesagentur provider. Please reconcile these sections (e.g., describe Bundesagentur as default for Germany and SerpApi as optional for other markets when SERPAPI_KEY is set).

Copilot · 2026-03-01T23:12:28Z

ROADMAP.md

+### 3.1 — Search Provider Expansion
+- [x] **Switch to Bundesagentur für Arbeit API** — free, verified German job listings; eliminates SerpAPI cost for German market
+- [x] **Implement SearchProvider abstraction** — pluggable protocol for multiple job-search backends
+- [x] **Extract SerpApiProvider** — retained for future non-German markets behind the SearchProvider interface
+- [ ] **Add per-country provider routing** — auto-select provider based on target location (e.g., BA for Germany, SerpApi for others)


PR title is “Improve serpapi results”, but this roadmap section (and the code changes in this PR) are primarily about switching the default search to Bundesagentur and introducing a provider abstraction. Consider renaming the PR (or updating the PR description) to reflect the actual scope so it’s easier to review and track historically.

Copilot · 2026-03-01T23:12:28Z

immermatch/search_agent.py

+        quota_sources = {
+            "bundesagentur" if p.name == "Bundesagentur für Arbeit" else "serpapi" for p in provider.providers
+        }
+        if quota_sources:


search_all_queries() documents min_unique_jobs=0 as disabling early-stop, but when provider is a CombinedSearchProvider this code forces min_unique_jobs to be at least _MIN_JOBS_PER_PROVIDER * len(quota_sources), which re-enables early-stop even when the caller explicitly passed 0. Consider only enforcing the provider quota/minimum when min_unique_jobs > 0, or introduce a separate flag/parameter for the quota behavior and update the docstring accordingly so callers can reliably disable early stopping.

Suggested change

if quota_sources:

if quota_sources and min_unique_jobs > 0:

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Copilot · 2026-03-01T23:41:41Z

immermatch/search_agent.py

+    source_id = getattr(provider, "source_id", None)
+    if isinstance(source_id, str) and source_id.strip():
+        return source_id.strip().lower()
+    if getattr(provider, "name", None) == "Bundesagentur für Arbeit":
+        return "bundesagentur"
+    return type(provider).__name__.lower()
+
+


_provider_quota_source_key() returns a key based on provider.source_id or the provider type name, but SerpApiProvider doesn’t define source_id and its JobListing.source values are "serpapi". This means quota_sources will include "serpapiprovider" (or "magicmock" in tests) and quota_met will never be satisfied, causing early-stop/quota logic and the "Provider quota not reached" warning to behave incorrectly. Consider mapping known provider names to their JobListing.source values (e.g., "SerpApi (Google Jobs)" -> "serpapi") or standardizing by adding a required provider.source_id that matches JobListing.source for all providers.

Suggested change

source_id = getattr(provider, "source_id", None)

if isinstance(source_id, str) and source_id.strip():

return source_id.strip().lower()

if getattr(provider, "name", None) == "Bundesagentur für Arbeit":

return "bundesagentur"

return type(provider).__name__.lower()

# Prefer an explicit source_id when provided by the provider implementation.

source_id = getattr(provider, "source_id", None)

if isinstance(source_id, str) and source_id.strip():

return source_id.strip().lower()

# Fall back to well-known provider names that should match JobListing.source.

name = getattr(provider, "name", None)

if isinstance(name, str):

if name == "Bundesagentur für Arbeit":

return "bundesagentur"

# SerpApi provider: align quota key with JobListing.source == "serpapi".

if "serpapi" in name.lower():

return "serpapi"

# As an additional safeguard, handle the SerpApi provider by class name.

cls_name = type(provider).__name__

if cls_name == "SerpApiProvider":

return "serpapi"

# Generic fallback: use the provider class name.

return cls_name.lower()

Copilot · 2026-03-01T23:41:42Z

immermatch/search_agent.py

+        for index, child_provider in enumerate(provider.providers):
+            child_count = per_provider + (1 if index < remainder else 0)
+            if child_count <= 0:
+                continue
+            child_queries = _generate_search_queries_for_provider(
+                client,
+                profile,
+                location,
+                child_count,
+                child_provider,
+            )
+            merged_queries.extend([f"provider={child_provider.name}::{query}" for query in child_queries])
+
+        seen: set[str] = set()
+        unique_queries: list[str] = []
+        for query in merged_queries:
+            if query in seen:
+                continue
+            seen.add(query)
+            unique_queries.append(query)
+            if len(unique_queries) >= num_queries:
+                break
+        return unique_queries


generate_search_queries() now prefixes queries with "provider=::" for CombinedSearchProvider, but query caching (ResultCache.save_queries/load_queries) is keyed only by profile_hash+location. If SERPAPI_KEY is toggled between runs, cached provider-targeted queries can be reused with a different provider set (e.g., BA-only), and the UI/search path will silently strip the prefix and run BA searches with SerpApi-oriented queries. Consider including an explicit provider/config fingerprint in the query cache key (or storing provider metadata in queries.json and invalidating when it changes), and centralizing the provider prefix/separator strings instead of duplicating literals here.

Copilot · 2026-03-01T23:41:42Z

tests/test_search_agent.py

+        serp_provider = MagicMock()
+        serp_provider.name = "SerpApi (Google Jobs)"
+        serp_jobs = [self._make_job(f"SERP {i}", company=f"SERP Co {i}", location="Berlin") for i in range(30)]
+        for job in serp_jobs:
+            job.source = "serpapi"
+        serp_provider.search.return_value = serp_jobs


This test sets serp_jobs[].source = "serpapi", but the provider quota key is derived from _provider_quota_source_key(serp_provider), which (for a MagicMock) will be "magicmock" unless source_id is set. As written, the per-provider quota logic isn’t actually exercised/validated because the quota source keys won’t match the job.source values. Consider setting ba_provider.source_id/serp_provider.source_id to match the emitted JobListing.source values (or using small fake provider classes) so the test verifies the intended quota behavior.

Copilot · 2026-03-01T23:41:42Z

tests/test_bundesagentur.py

+            patch("immermatch.bundesagentur._fetch_detail_api"),
            patch("immermatch.bundesagentur.httpx.Client"),
        ):
            listings = provider._enrich(items)



The patch for _fetch_detail_api is no longer captured, so the test no longer asserts that the API detail fetch is not used in the "html_only" strategy. This reduces the behavioral guarantee of the test. Consider binding the patch to a variable (or using patch(... ) as mock_api) and asserting mock_api.assert_not_called().

Suggested change

patch("immermatch.bundesagentur._fetch_detail_api"),

patch("immermatch.bundesagentur.httpx.Client"),

):

listings = provider._enrich(items)

patch("immermatch.bundesagentur._fetch_detail_api") as mock_api,

patch("immermatch.bundesagentur.httpx.Client"),

):

listings = provider._enrich(items)

mock_api.assert_not_called()

TheTrueAI added 9 commits March 1, 2026 04:28

feat(ba): enhance Bundesagentur provider with configurable detail-fet…

07c8fad

…ch strategy and increase job query limit

feat: combined search provider

bc4368b

Now both Bundesagentur and SerpAPI are used to find each 30 jobs

Copilot AI review requested due to automatic review settings March 1, 2026 23:06

Copilot started reviewing on behalf of TheTrueAI March 1, 2026 23:07 View session

Copilot AI reviewed Mar 1, 2026

View reviewed changes

TheTrueAI added 2 commits March 2, 2026 00:23

Merge main into improve-serpapi-results and resolve conflicts

fc5ee99

Address valid PR review comments for provider quotas and budgeting

3b46333

TheTrueAI requested a review from Copilot March 1, 2026 23:36

Copilot started reviewing on behalf of TheTrueAI March 1, 2026 23:37 View session

Copilot AI reviewed Mar 1, 2026

View reviewed changes

Address latest PR review comments and cache/provider safety

49bb462

TheTrueAI merged commit 610ca7c into main Mar 2, 2026
4 checks passed

TheTrueAI deleted the improve-serpapi-results branch March 2, 2026 00:03

-        quota_sources = {
-            "bundesagentur" if p.name == "Bundesagentur für Arbeit" else "serpapi" for p in provider.providers
-        }
+        for p in provider.providers:
+            # Prefer a stable provider identifier (e.g. p.source_id) over name-based mapping.
+            source_key = getattr(p, "source_id", None)
+            if not source_key:
+                if getattr(p, "name", None) == "Bundesagentur für Arbeit":
+                    # Preserve existing semantics for Bundesagentur provider.
+                    source_key = "bundesagentur"
+                else:
+                    # Fallback to the provider's class name to distinguish different provider types.
+                    source_key = type(p).__name__.lower()
+            quota_sources.add(source_key)

	- `get_provider(location)` factory — currently always returns `BundesagenturProvider`. Future: route by country.
	- `get_provider(location)` factory — returns a `CombinedSearchProvider` (Bundesagentur + SerpApi) when `SERPAPI_KEY` is set, otherwise a standalone `BundesagenturProvider`. Future: route by country.

		- Deduplicates by `title\|company_name`
		- Stops early once 50 unique jobs are collected

		# Detail endpoint: /pc/v2/jobdetails/{hashID}
		# Default: veroeffentlichtseit=7 (last 7 days only)

-    source_id = getattr(provider, "source_id", None)
-    if isinstance(source_id, str) and source_id.strip():
-        return source_id.strip().lower()
-    if getattr(provider, "name", None) == "Bundesagentur für Arbeit":
-        return "bundesagentur"
-    return type(provider).__name__.lower()
+    # Prefer an explicit source_id when provided by the provider implementation.
+    source_id = getattr(provider, "source_id", None)
+    if isinstance(source_id, str) and source_id.strip():
+        return source_id.strip().lower()
+    # Fall back to well-known provider names that should match JobListing.source.
+    name = getattr(provider, "name", None)
+    if isinstance(name, str):
+        if name == "Bundesagentur für Arbeit":
+            return "bundesagentur"
+        # SerpApi provider: align quota key with JobListing.source == "serpapi".
+        if "serpapi" in name.lower():
+            return "serpapi"
+    # As an additional safeguard, handle the SerpApi provider by class name.
+    cls_name = type(provider).__name__
+    if cls_name == "SerpApiProvider":
+        return "serpapi"
+    # Generic fallback: use the provider class name.
+    return cls_name.lower()

Conversation

TheTrueAI commented Mar 1, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants