@cboulay (Collaborator) commented Jan 19, 2026

This PR adds a new transp_sync_blocking transport flag that enables synchronous, zero-copy data transfer for stream outlets. Instead of copying sample data into an internal buffer for async delivery, sync mode writes directly from the user's buffer to connected sockets, eliminating memory allocation and copy overhead.

It is intended as a replacement for #170, which has not been updated in some time.

Motivation

For high-channel-count, high-sample-rate applications (e.g., 1000+ channels at 30 kHz), the async outlet's per-sample memory allocation and copying become a significant CPU bottleneck. Sync mode addresses this by:

  • Zero-copy transfer: Sample data is written directly from the user's buffer
  • No per-sample allocation: Timestamps are stored in a small metadata buffer, but sample data isn't copied
  • Blocking semantics: push_sample/push_chunk block until data is sent to all consumers

This makes all the difference for me on a lower-power embedded system.

Usage

lsl::stream_info info("MyStream", "EEG", 1000, 30000, lsl::cf_float32);
lsl::stream_outlet outlet(info, 0, 360, transp_sync_blocking);

std::vector<float> sample(1000);
outlet.push_sample(sample);  // Blocks until sent to all consumers

Limitations

  • String format not supported: Sync mode requires fixed-size samples
  • Blocking: Push operations block until all connected consumers receive data
  • Latency scales with consumers: More consumers = longer push latency
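
The fixed-size requirement follows from simple arithmetic: with a numeric format every sample occupies channels × bytes-per-value, so each buffer length is known in advance, whereas string samples vary in length. A minimal sketch, using hypothetical helper names that are not part of liblsl and assuming a 4-byte float, applied to the 1000-channel, 30 kHz float32 stream from the Usage section:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration; not part of liblsl.

// Payload bytes in one sample for a fixed-width numeric channel format.
constexpr std::size_t sample_bytes(std::size_t channels, std::size_t bytes_per_value) {
    return channels * bytes_per_value;
}

// Payload bytes per second sent to each consumer (timestamps excluded).
constexpr std::uint64_t payload_bytes_per_second(std::size_t channels,
                                                 std::size_t bytes_per_value,
                                                 std::uint64_t sample_rate_hz) {
    return static_cast<std::uint64_t>(sample_bytes(channels, bytes_per_value)) * sample_rate_hz;
}

// 1000 four-byte channels give 4000 bytes per sample.
static_assert(sample_bytes(1000, 4) == 4000, "per-sample payload");
// At 30 kHz that is 120,000,000 bytes per second per consumer.
static_assert(payload_bytes_per_second(1000, 4, 30000) == 120000000ULL, "payload rate");
```

Because every consumer receives the full payload, this per-consumer rate is also a rough way to see why push latency grows with the number of consumers.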

Benchmark Results

Test configuration: 1000 channels, 30kHz sample rate, macOS (Apple Silicon)

CPU Usage by Chunk Size (1 consumer)

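| Chunk Size | Async (µs/sample) | Sync (µs/sample) | CPU Savings |
|------------|-------------------|------------------|-------------|
| 1          | 31.18             | 15.65            | 50%         |
| 4          | 9.05              | 5.06             | 44%         |
| 8          | 7.76              | 6.37             | 18%         |
| 16         | 9.76              | 6.88             | 30%         |
| 30         | 9.82              | 8.34             | 15%         |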

Scaling with Multiple Consumers (chunk=4)

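| Consumers | Async (µs/sample) | Sync (µs/sample) | CPU Savings |
|-----------|-------------------|------------------|-------------|
| 1         | 10.15             | 4.95             | 51%         |
| 2         | 21.95             | 9.84             | 55%         |
| 3         | 33.82             | 15.91            | 53%         |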

CPU savings remain significant (~50%) across consumer counts. However, push latency increases linearly with consumers in sync mode (async latency stays constant).
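
For readers who want to check the tables, the "CPU Savings" column appears to be the relative reduction in per-sample cost, 1 − sync/async. The sketch below uses hypothetical helper names and assumes that definition; it also divides the per-sample cost by the consumer count, which stays roughly constant when cost grows linearly with consumers.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only.

// Percentage of per-sample CPU time saved by sync mode relative to async mode.
double cpu_savings_percent(double async_us, double sync_us) {
    return 100.0 * (1.0 - sync_us / async_us);
}

// Per-sample cost divided by consumer count. A roughly constant value
// across rows indicates cost that grows linearly with consumers.
double cost_per_consumer_us(double us_per_sample, std::size_t consumers) {
    return us_per_sample / static_cast<double>(consumers);
}
```

With the multi-consumer rows, `cost_per_consumer_us` gives about 4.95, 4.92, and 5.30 µs for sync mode at 1, 2, and 3 consumers.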

Implementation Details

  • New sync_write_handler class manages connected sockets grouped by endianness
  • tcp_server hands off sockets to sync handler after protocol negotiation
  • Optimized enqueue_chunk_sync() batches timestamp+data buffers for efficient gather-write
  • sync_timestamps_ uses std::deque to ensure pointer stability during buffer accumulation
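
The pointer-stability point can be illustrated in isolation. The sketch below is not code from the PR; it is a hypothetical standalone check showing that appending to a std::deque leaves the address of an existing element unchanged, which is what lets a queued buffer view keep pointing at a stored timestamp while more samples accumulate.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical check, not part of the PR: returns true if the address of the
// first element is unchanged after `extra` further push_back calls.
// Addresses are compared as integers so no invalidated pointer is dereferenced.
template <class Container>
bool first_element_address_stable(std::size_t extra) {
    Container c;
    c.push_back(0.0);
    const std::uintptr_t before = reinterpret_cast<std::uintptr_t>(&c.front());
    for (std::size_t i = 0; i < extra; ++i) {
        c.push_back(static_cast<double>(i));
    }
    return before == reinterpret_cast<std::uintptr_t>(&c.front());
}
```

`first_element_address_stable<std::deque<double>>(n)` holds for any `n`, because push_back on a deque does not invalidate references to existing elements. The same call with `std::vector<double>` typically returns false once the vector reallocates, which is the failure mode the deque avoids.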

@cboulay force-pushed the pugixml_fetch_only branch from a33e2a2 to 0f7bdc2 on January 19, 2026 17:16
@cboulay force-pushed the cboulay/sync_outlet branch from f7e32c2 to ce58eb6 on January 19, 2026 17:17
Base automatically changed from pugixml_fetch_only to dev on January 24, 2026 17:50
@zeyus (Contributor) commented Jan 27, 2026

This is awesome, I will have to try it out! I have a relatively low sample rate, but I have the additional burden of thread communication in my app, so the pointer to the data struct can be sent directly instead of going through a locked, reusable buffer per stream. This should reduce the overhead considerably (e.g. 1/2 to 1/4 the number of ops). In my case the savings would actually scale with the number of consumers, despite the slight increase in latency, because the data no longer needs to be copied for each push_*.

For reference

This is a general performance test I run on my Dart API wrapper.
The results vary per run because the latency is calculated as received_timestamp - sent_timestamp and the inlet polling rate is approximately the stream sample rate, so the measured latency can shift by 0-1 sample periods depending on the offset between sending and polling (not including any overhead).
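
To put numbers on that, here is a minimal sketch of the calculation with hypothetical helper names (timestamps are assumed to be in seconds). It shows the measured latency and the length of one sample period, which bounds the run-to-run variation caused by the polling offset alone.

```cpp
// Hypothetical helpers for illustration; timestamps are assumed to be in seconds.

// Measured latency in milliseconds: received_timestamp - sent_timestamp.
double latency_ms(double sent_timestamp_s, double received_timestamp_s) {
    return (received_timestamp_s - sent_timestamp_s) * 1000.0;
}

// Length of one sample period in milliseconds. When the inlet polls at about
// the stream rate, the polling offset can add anywhere from 0 to this value.
double sample_period_ms(double rate_hz) {
    return 1000.0 / rate_hz;
}
```

At 50 Hz one sample period is 20 ms, while at 10000 Hz it is only 0.1 ms, so the polling offset matters far more in the low-rate rows.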

Although the numbers are slightly higher in the new version, I wouldn't take that as an indication of a performance regression. It is still a useful point of reference to make sure the performance won't be worse when I test the zero-copy mode.

Liblsl.dart performance test results, liblsl v1.16.2
| Streams | Channels | Freq (Hz) | Throughput (MB/s) | μ Latency (ms) | 𝛔 Latency (ms) | Min Latency (ms) | Max Latency (ms) | Packet Loss (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 50 | 0.0003 | 13.7728 | 0.3049 | 13.3674 | 14.8893 | 0.0000 |
| 1 | 1 | 500 | 0.0033 | 1.0647 | 0.0413 | 0.9515 | 2.0121 | 0.0000 |
| 1 | 1 | 1000 | 0.0065 | 3.4329 | 3.2389 | 0.2475 | 9.1617 | 0.0000 |
| 1 | 1 | 10000 | 0.0654 | 6.7327 | 2.2249 | 0.0765 | 8.5703 | 0.0000 |
| 1 | 2 | 50 | 0.0003 | 8.9015 | 0.1193 | 8.6882 | 9.9025 | 0.0000 |
| 1 | 2 | 500 | 0.0033 | 1.7708 | 0.0537 | 0.2096 | 1.9008 | 0.0000 |
| 1 | 2 | 1000 | 0.0065 | 1.5907 | 0.6224 | 0.1819 | 2.7605 | 0.0000 |
| 1 | 2 | 10000 | 0.0654 | 2.3131 | 1.0372 | 0.0388 | 4.9256 | 0.0000 |
| 1 | 16 | 50 | 0.0003 | 9.2737 | 0.0962 | 9.1588 | 10.2556 | 0.0000 |
| 1 | 16 | 500 | 0.0033 | 1.3924 | 0.0400 | 1.2787 | 2.3428 | 0.0000 |
| 1 | 16 | 1000 | 0.0065 | 1.0722 | 0.4895 | 0.2950 | 2.0104 | 0.0000 |
| 1 | 16 | 10000 | 0.0654 | 2.2451 | 1.8190 | 0.0798 | 8.8329 | 0.0000 |
| 1 | 64 | 50 | 0.0003 | 9.4274 | 0.1498 | 8.6082 | 10.4592 | 0.0000 |
| 1 | 64 | 500 | 0.0033 | 1.6945 | 0.0576 | 0.2677 | 1.7633 | 0.0000 |
| 1 | 64 | 1000 | 0.0065 | 1.0967 | 0.6200 | 0.3300 | 2.9064 | 0.0000 |
| 1 | 64 | 10000 | 0.0654 | 3.3471 | 1.3914 | 0.0672 | 5.0093 | 0.0000 |
| 8 | 1 | 50 | 0.0026 | 1.4986 | 0.1033 | 1.3745 | 2.5464 | 0.0000 |
| 8 | 1 | 500 | 0.0261 | 1.9765 | 0.0658 | 0.4732 | 2.1359 | 0.0000 |
| 8 | 1 | 1000 | 0.0523 | 1.5774 | 0.6861 | 0.1944 | 3.3715 | 0.0000 |
| 8 | 1 | 10000 | 0.5226 | 4.0570 | 0.9709 | 0.0567 | 9.0689 | 0.0000 |
| 8 | 2 | 50 | 0.0026 | 1.5608 | 0.1068 | 1.4153 | 2.6143 | 0.0000 |
| 8 | 2 | 500 | 0.0262 | 2.1204 | 0.8388 | 0.4696 | 2.6880 | 0.0000 |
| 8 | 2 | 1000 | 0.0523 | 2.0864 | 1.0778 | 0.2072 | 4.8532 | 0.0000 |
| 8 | 2 | 10000 | 0.5221 | 6.8337 | 0.5595 | 0.0957 | 11.2415 | 0.0000 |
| 8 | 16 | 50 | 0.0026 | 1.0841 | 0.1028 | 0.8845 | 2.1213 | 0.0000 |
| 8 | 16 | 500 | 0.0261 | 1.9742 | 0.0921 | 0.2769 | 3.1904 | 0.0000 |
| 8 | 16 | 1000 | 0.0523 | 2.1878 | 0.5527 | 0.6861 | 3.1433 | 0.0000 |
| 8 | 16 | 10000 | 0.5226 | 5.9658 | 0.5568 | 0.0892 | 9.2876 | 0.0000 |
| 8 | 64 | 50 | 0.0026 | 2.0540 | 0.1060 | 1.8235 | 3.1125 | 0.0000 |
| 8 | 64 | 500 | 0.0262 | 2.6237 | 1.7181 | 0.2288 | 6.8779 | 0.0000 |
| 8 | 64 | 1000 | 0.0523 | 2.2410 | 0.8767 | 0.5427 | 4.3022 | 0.0000 |
| 8 | 64 | 10000 | 0.5226 | 3.7804 | 1.6545 | 0.0825 | 10.3524 | 0.0000 |

Liblsl.dart performance test results, liblsl v1.17.5
| Streams | Channels | Freq (Hz) | Throughput (MB/s) | μ Latency (ms) | 𝛔 Latency (ms) | Min Latency (ms) | Max Latency (ms) | Packet Loss (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 50 | 0.0003 | 14.3537 | 0.3527 | 14.0259 | 15.1913 | 0.0000 |
| 1 | 1 | 500 | 0.0033 | 1.6701 | 0.0432 | 1.4182 | 2.6015 | 0.0000 |
| 1 | 1 | 1000 | 0.0065 | 3.0676 | 2.7901 | 0.2872 | 8.2148 | 0.0000 |
| 1 | 1 | 10000 | 0.0654 | 6.5976 | 2.1562 | 0.0856 | 8.3490 | 0.0000 |
| 1 | 2 | 50 | 0.0003 | 8.8558 | 0.1096 | 8.7354 | 9.9173 | 0.0000 |
| 1 | 2 | 500 | 0.0033 | 1.4980 | 0.0455 | 1.1321 | 2.4871 | 0.0000 |
| 1 | 2 | 1000 | 0.0065 | 0.9279 | 0.4905 | 0.1694 | 1.8394 | 0.0000 |
| 1 | 2 | 10000 | 0.0654 | 1.7390 | 0.5597 | 0.0978 | 2.8445 | 0.0000 |
| 1 | 16 | 50 | 0.0003 | 9.4010 | 0.1484 | 9.0308 | 10.5920 | 0.0000 |
| 1 | 16 | 500 | 0.0033 | 1.3423 | 0.0492 | 1.0757 | 2.3382 | 0.0000 |
| 1 | 16 | 1000 | 0.0065 | 1.0901 | 0.4898 | 0.1112 | 2.1962 | 0.0000 |
| 1 | 16 | 10000 | 0.0654 | 2.0513 | 0.7104 | 0.0523 | 3.3309 | 0.0000 |
| 1 | 64 | 50 | 0.0003 | 9.3309 | 0.1013 | 9.2038 | 10.3376 | 0.0000 |
| 1 | 64 | 500 | 0.0033 | 1.9906 | 0.0675 | 0.5221 | 2.1827 | 0.0000 |
| 1 | 64 | 1000 | 0.0065 | 1.0845 | 0.4875 | 0.2972 | 1.9136 | 0.0000 |
| 1 | 64 | 10000 | 0.0654 | 1.8844 | 0.6142 | 0.0582 | 3.0086 | 0.0000 |
| 8 | 1 | 50 | 0.0026 | 1.5178 | 0.1019 | 1.3846 | 2.5342 | 0.0000 |
| 8 | 1 | 500 | 0.0261 | 1.7542 | 0.0888 | 1.4299 | 3.4051 | 0.0000 |
| 8 | 1 | 1000 | 0.0523 | 1.0636 | 0.6615 | 0.1173 | 2.2970 | 0.0000 |
| 8 | 1 | 10000 | 0.5225 | 4.1972 | 0.9736 | 0.0809 | 9.6145 | 0.0000 |
| 8 | 2 | 50 | 0.0026 | 2.0271 | 0.1031 | 1.8579 | 3.0566 | 0.0000 |
| 8 | 2 | 500 | 0.0262 | 1.3934 | 0.0479 | 1.2255 | 2.3799 | 0.0000 |
| 8 | 2 | 1000 | 0.0523 | 1.7672 | 0.6552 | 0.4705 | 2.6111 | 0.0000 |
| 8 | 2 | 10000 | 0.5225 | 4.9241 | 0.5277 | 0.1365 | 9.3788 | 0.0000 |
| 8 | 16 | 50 | 0.0026 | 7.3037 | 8.7284 | 1.1469 | 23.1762 | 0.0000 |
| 8 | 16 | 500 | 0.0262 | 2.0217 | 0.9410 | 0.5183 | 2.7986 | 0.0000 |
| 8 | 16 | 1000 | 0.0523 | 1.2594 | 0.5820 | 0.2685 | 3.1768 | 0.0000 |
| 8 | 16 | 10000 | 0.5226 | 4.3666 | 0.4936 | 0.0799 | 7.6156 | 0.0000 |
| 8 | 64 | 50 | 0.0026 | 3.1733 | 2.9382 | 2.0793 | 23.7131 | 0.0000 |
| 8 | 64 | 500 | 0.0262 | 1.5404 | 0.0625 | 0.9172 | 2.8961 | 0.0000 |
| 8 | 64 | 1000 | 0.0523 | 2.1069 | 0.6712 | 0.2980 | 4.0179 | 0.0000 |
| 8 | 64 | 10000 | 0.5224 | 4.4022 | 0.9821 | 0.0740 | 8.8697 | 0.0000 |

  1. User creates outlet with transp_sync_blocking flag:
     lsl::stream_outlet outlet(info, 0, 360, transp_sync_blocking);
  2. When a consumer connects, the socket is handed off from client_session to sync_write_handler after the feed header handshake (no transfer thread is spawned).
  3. When push_sample() is called:
     - Timestamp is encoded and stored in sync_timestamps_
     - User's data buffer pointer is wrapped in asio::const_buffer (zero copy)
     - If pushthrough=true, all buffers are written to all consumers via blocking gather-write
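
The push path in step 3 can be modelled without any networking. The sketch below is a simplified, hypothetical model and not the PR's implementation: a std::string stands in for each consumer socket, a plain (pointer, length) struct stands in for asio::const_buffer, and the timestamp is stored as a raw double rather than in the wire encoding.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical model of the sync push path; all names here are assumptions.
class SyncOutletModel {
public:
    explicit SyncOutletModel(std::size_t consumers) : sinks_(consumers) {}

    // Store the timestamp, reference (not copy) the caller's data,
    // and flush to every consumer when pushthrough is set.
    void push_sample(const float *data, std::size_t channels, double timestamp,
                     bool pushthrough) {
        timestamps_.push_back(timestamp);
        pending_.push_back({&timestamps_.back(), sizeof(double)});
        pending_.push_back({data, channels * sizeof(float)});
        if (pushthrough) flush();
    }

    const std::string &received(std::size_t consumer) const { return sinks_.at(consumer); }
    std::size_t pending_buffers() const { return pending_.size(); }

private:
    // Non-owning view, standing in for asio::const_buffer.
    struct View {
        const void *ptr;
        std::size_t len;
    };

    // Stand-in for the blocking gather-write: every consumer receives
    // every pending buffer, in order, before the call returns.
    void flush() {
        for (std::string &sink : sinks_) {
            for (const View &v : pending_) {
                sink.append(static_cast<const char *>(v.ptr), v.len);
            }
        }
        pending_.clear();
        timestamps_.clear();
    }

    std::deque<double> timestamps_; // element addresses stay valid under push_back
    std::vector<View> pending_;     // views only; safe to reallocate
    std::vector<std::string> sinks_;
};
```

One consequence visible in this model: with pushthrough=false only a pointer is queued, so the caller's buffer has to stay alive and unmodified until a later push flushes it.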
…per consistent with stream_info and properly throws on construction failure.
  - Fix have_consumers()/wait_for_consumers() to detect sync consumers
  - Handle DEDUCED_TIMESTAMP in sync mode for proper chunk timing
  - Change sync_timestamps_ to deque to prevent pointer invalidation
  - Add optimized enqueue_chunk_sync() for batched chunk transfers
  - Add have_sync_consumers() to tcp_server
  - Add sync outlet tests and benchmark tool
… in the namespace)

  2. Added #include <algorithm> for std::sort
…ync<std::string>, which resolves the Windows linker error
2. Replaced C++17 structured bindings with .first/.second pair access for C++11/14 compatibility
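
For context on the structured-bindings change, here is a small illustration of the pattern. The container and function names are hypothetical (a map of socket ids grouped by an endianness flag) and are not taken from the PR; only the .first/.second form is the point.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical example: count sockets across endianness groups.
// C++17 form (not usable in C++11/14):
//   for (const auto &[swap_bytes, sockets] : groups) total += sockets.size();
// C++11/14-compatible form using pair access:
std::size_t total_sockets(const std::map<bool, std::vector<int>> &groups) {
    std::size_t total = 0;
    for (const auto &entry : groups) {
        const std::vector<int> &sockets = entry.second; // entry.first is the key
        total += sockets.size();
    }
    return total;
}
```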
@cboulay force-pushed the cboulay/sync_outlet branch from ce58eb6 to d5ae90c on January 28, 2026 15:31