LLMExtractionStrategy returns empty results despite successful crawl & scrape #1458
Hi, I'm testing Crawl4AI 0.7.4 with LLMExtractionStrategy. Here are two examples I tried; in both, the crawl, scrape, and extract steps report success, but print(result.extracted_content) outputs an empty list.

Example 1 – OpenAI (gpt-5-mini)

--- Extracting Structured Data with openai/gpt-5-mini ---
[INIT].... → Crawl4AI 0.7.4
[FETCH]... ↓ https://openai.com/api/pricing/ | ✓ | ⏱: 1.24s
[SCRAPE].. ◆ https://openai.com/api/pricing/ | ✓ | ⏱: 0.03s
[EXTRACT]. ■ https://openai.com/api/pricing/ | ✓ | ⏱: 27.78s
[COMPLETE] ● https://openai.com/api/pricing/ | ✓ | ⏱: 29.06s
[]

Example 2 – OpenAI (gpt-5-mini) with another site

--- Extracting Structured Data with openai/gpt-5-mini ---
[INIT].... → Crawl4AI 0.7.4
[FETCH]... ↓ https://www.migros.com.tr/ | ✓ | ⏱: 0.96s
[SCRAPE].. ◆ https://www.migros.com.tr/ | ✓ | ⏱: 0.03s
[EXTRACT]. ■ https://www.migros.com.tr/ | ✓ | ⏱: 12.54s
[COMPLETE] ● https://www.migros.com.tr/ | ✓ | ⏱: 13.53s
[]

Code Snippet

import os
import json
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai import LLMExtractionStrategy, CacheMode, BrowserConfig
from typing import Dict, List


class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )


async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")
    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        # return
    browser_config = BrowserConfig(headless=True)
    extra_args = {"temperature": 1, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider=provider, api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/", config=crawler_config
        )
        print(result.extracted_content)


if __name__ == "__main__":
    asyncio.run(
        extract_structured_data_using_llm(
            provider="openai/gpt-5-mini", api_token=os.getenv("OPENAI_API_KEY")
        )
    )
Expected behavior

result.extracted_content should contain the extracted model names and fees, but it is always an empty list ([]).

Environment

Crawl4AI 0.7.4

Question

Why does the LLM extraction return an empty result even though the crawl and scrape succeed? Thanks!
GPT-5 models are not officially supported (see https://docs.crawl4ai.com/core/browser-crawler-config/#3-llmconfig-essentials), so I suspect that is causing your issue. I'm personally using gpt-4o-mini without any problems; try that model and see if it works.
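For reference, here is a minimal sketch of that swap. It mirrors the snippet from your question; the only real changes are the provider string and reading the key from an OPENAI_API_KEY environment variable, which is an assumption about your setup.

import os
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai import LLMExtractionStrategy, CacheMode, BrowserConfig


class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens.")
    output_fee: str = Field(..., description="Fee for output tokens.")


async def main():
    strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o-mini",          # supported model instead of gpt-5-mini
            api_token=os.getenv("OPENAI_API_KEY"),  # assumes the key is in the environment
        ),
        schema=OpenAIModelFee.model_json_schema(),
        extraction_type="schema",
        instruction="Extract every model name with its input and output token fees.",
        extra_args={"temperature": 1, "max_tokens": 2000},  # same extra_args as in the question
    )
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        extraction_strategy=strategy,
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/", config=config
        )
        # With a supported model this should print a JSON list of fee objects
        # rather than [].
        print(result.extracted_content)


if __name__ == "__main__":
    asyncio.run(main())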