@nyo16 (Contributor) commented Jan 25, 2026

Summary

This PR adds timing metrics and opt-in OpenAI-compatible response formats to Bumblebee's text generation serving, making it easier to monitor generation performance and to integrate with existing OpenAI-compatible tooling without writing custom middleware.

New Options

Three new options for Bumblebee.Text.generation/4:

  • :include_timing - When true, includes performance metrics in results
  • :output_format - Supports :bumblebee (default), :openai, and :openai_chat
  • :model_name - Model identifier for OpenAI format responses
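
For illustration, the options above could be passed like this (a sketch; the model loading follows the standard Bumblebee serving flow, and "gpt2" is just a placeholder repository):

```elixir
# Load a model, tokenizer, and generation config as usual.
{:ok, model_info} = Bumblebee.load_model({:hf, "gpt2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "gpt2"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "gpt2"})

# Opt in to the new options from this PR.
serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    include_timing: true,
    output_format: :openai,
    model_name: "gpt2"
  )

Nx.Serving.run(serving, "Hello")
```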

Features

Timing Metrics

When include_timing: true:

  • generation_time_us - Total generation time in microseconds
  • tokens_per_second - Generation throughput
  • time_to_first_token_us - Time to first token (streaming only)
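
With `include_timing: true`, a non-streaming result might carry the metrics alongside the generated text, roughly like this (the exact field placement within the result map is an assumption based on the option descriptions above):

```elixir
# Hypothetical result shape with timing metrics enabled.
%{
  results: [
    %{
      text: " world!",
      generation_time_us: 152_340,
      tokens_per_second: 131.2
    }
  ]
} = Nx.Serving.run(serving, "Hello")
```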

Finish Reason Tracking

Tracks why generation stopped:

  • "stop" - EOS token reached
  • "length" - Max length reached

OpenAI-Compatible Formats

Text Completions (`:openai`):

```elixir
%{
  id: "cmpl-...",
  object: "text_completion",
  model: "gpt2",
  choices: [%{index: 0, text: "...", finish_reason: "stop"}],
  usage: %{prompt_tokens: 5, completion_tokens: 20, total_tokens: 25}
}
```

Chat Completions (`:openai_chat`):

```elixir
%{
  id: "chatcmpl-...",
  object: "chat.completion",
  model: "gpt2",
  choices: [%{index: 0, message: %{role: "assistant", content: "..."}, finish_reason: "stop"}],
  usage: %{prompt_tokens: 5, completion_tokens: 20, total_tokens: 25}
}
```

Both formats also support streaming with proper chunk formatting.
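
For reference, OpenAI-style chat streaming chunks conventionally look like the following. This is the upstream OpenAI wire shape (`chat.completion.chunk` with a `delta` per choice); the exact field values here are illustrative, not copied from this PR's output:

```elixir
%{
  id: "chatcmpl-...",
  object: "chat.completion.chunk",
  model: "gpt2",
  choices: [%{index: 0, delta: %{content: "..."}, finish_reason: nil}]
}
```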

Backward Compatibility

All changes are opt-in with sensible defaults:
- include_timing: false
- output_format: :bumblebee

Existing code works unchanged.

Test Plan

- All existing tests pass
- Added tests for timing metrics (streaming and non-streaming)
- Added tests for OpenAI format output
- Added tests for OpenAI chat format output
- Added tests for OpenAI streaming formats
- Manually tested with Qwen3-0.6B model

Add timing metrics and opt-in OpenAI-compatible response formats to
text generation serving.

New options for `Bumblebee.Text.generation/4`:

- `:include_timing` - when true, includes `generation_time_us` and
  `tokens_per_second` in results. For streaming with `:stream_done`,
  also adds `time_to_first_token_us`

- `:output_format` - supports `:bumblebee` (default), `:openai`
  (text completions), and `:openai_chat` (chat completions)

- `:model_name` - model identifier for OpenAI format responses

Also tracks finish reason (EOS token vs max length) in generation,
exposed as `finish_reason` in streaming done events and as
"stop"/"length" in OpenAI format responses.