@nyo16 (Contributor) commented Jan 25, 2026

Summary

This PR adds timing metrics and opt-in OpenAI-compatible response formats to Bumblebee's text generation serving, making it easier to monitor generation performance and to integrate with existing OpenAI-compatible tooling without writing custom middleware.

New Options

Three new options for Bumblebee.Text.generation/4:

  • :include_timing - When true, includes performance metrics in results
  • :output_format - Supports :bumblebee (default), :openai, and :openai_chat
  • :model_name - Model identifier for OpenAI format responses
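
For illustration, the options above could be passed like this (a sketch; the model loading follows the standard Bumblebee serving flow, and "gpt2" is just a placeholder repository):

```elixir
# Load a model, tokenizer, and generation config as usual.
{:ok, model_info} = Bumblebee.load_model({:hf, "gpt2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "gpt2"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "gpt2"})

# Opt in to the new options from this PR.
serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    include_timing: true,
    output_format: :openai,
    model_name: "gpt2"
  )

Nx.Serving.run(serving, "Hello")
```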

Features

Timing Metrics

When include_timing: true:

  • generation_time_us - Total generation time in microseconds
  • tokens_per_second - Generation throughput
  • time_to_first_token_us - Time to first token (streaming only)
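
With `include_timing: true`, a non-streaming result might carry the metrics alongside the generated text, roughly like this (the exact field placement within the result map is an assumption based on the option descriptions above):

```elixir
# Hypothetical result shape with timing metrics enabled.
%{
  results: [
    %{
      text: " world!",
      generation_time_us: 152_340,
      tokens_per_second: 131.2
    }
  ]
} = Nx.Serving.run(serving, "Hello")
```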

Finish Reason Tracking

Tracks why generation stopped:

  • "stop" - EOS token reached
  • "length" - Max length reached

OpenAI-Compatible Formats

Text Completions (`:openai`):

```elixir
%{
  id: "cmpl-...",
  object: "text_completion",
  model: "gpt2",
  choices: [%{index: 0, text: "...", finish_reason: "stop"}],
  usage: %{prompt_tokens: 5, completion_tokens: 20, total_tokens: 25}
}
```

Chat Completions (`:openai_chat`):

```elixir
%{
  id: "chatcmpl-...",
  object: "chat.completion",
  model: "gpt2",
  choices: [%{index: 0, message: %{role: "assistant", content: "..."}, finish_reason: "stop"}],
  usage: %{prompt_tokens: 5, completion_tokens: 20, total_tokens: 25}
}
```

Both formats also support streaming with proper chunk formatting.
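
For reference, OpenAI-style chat streaming chunks conventionally look like the following. This is the upstream OpenAI wire shape (`chat.completion.chunk` with a `delta` per choice); the exact field values here are illustrative, not copied from this PR's output:

```elixir
%{
  id: "chatcmpl-...",
  object: "chat.completion.chunk",
  model: "gpt2",
  choices: [%{index: 0, delta: %{content: "..."}, finish_reason: nil}]
}
```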

Backward Compatibility

All changes are opt-in with sensible defaults:
- include_timing: false
- output_format: :bumblebee

Existing code works unchanged.

Test Plan

- All existing tests pass
- Added tests for timing metrics (streaming and non-streaming)
- Added tests for OpenAI format output
- Added tests for OpenAI chat format output
- Added tests for OpenAI streaming formats
- Manually tested with Qwen3-0.6B model

Add timing metrics and opt-in OpenAI-compatible response formats to
text generation serving.

New options for `Bumblebee.Text.generation/4`:

- `:include_timing` - when true, includes `generation_time_us` and
  `tokens_per_second` in results. For streaming with `:stream_done`,
  also adds `time_to_first_token_us`

- `:output_format` - supports `:bumblebee` (default), `:openai`
  (text completions), and `:openai_chat` (chat completions)

- `:model_name` - model identifier for OpenAI format responses

Also tracks finish reason (EOS token vs max length) in generation,
exposed as `finish_reason` in streaming done events and as
"stop"/"length" in OpenAI format responses.