fix/json unicode offsets by jurgenvinju · Pull Request #2659 · usethesource/rascal

jurgenvinju · 2026-02-18T11:04:40Z

This is a work in progress and not urgent. It tests a first streaming solution to see whether it works and how fast it still is. We keep a shadow buffer of int[1024] offsets that maps each character offset (an index into the array) to its code point offset. From that we can derive the start-of-buffer offset and the position of every character in the current buffer.

Note that Unicode characters in comments shift the offsets just as much as Unicode characters in string constants and names.
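To illustrate the shadow buffer idea, here is a minimal sketch in Java. It is not the code from this PR: the class name `ShadowOffsets`, the method `codePointOffsets`, and the `startCodePoint` parameter are hypothetical names chosen for the example. It assumes the buffer holds UTF-16 chars and that both halves of a surrogate pair should map to the same code point offset.

```java
public class ShadowOffsets {
    /**
     * Hypothetical helper: for every char index i in buf[0..len), compute the
     * code point offset of the character at that index. startCodePoint is the
     * code point offset of the first char in the buffer.
     */
    static int[] codePointOffsets(char[] buf, int len, int startCodePoint) {
        int[] offsets = new int[len];
        int count = startCodePoint;
        for (int i = 0; i < len; i++) {
            offsets[i] = count;
            // The high half of a surrogate pair shares its code point with the
            // low half that follows, so the count only advances after the pair.
            boolean firstHalfOfPair = Character.isHighSurrogate(buf[i])
                    && i + 1 < len
                    && Character.isLowSurrogate(buf[i + 1]);
            if (!firstHalfOfPair) {
                count++;
            }
        }
        // Limitation of this sketch: a pair split across two buffer refills
        // is counted twice; a streaming reader would need to carry that state.
        return offsets;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (one surrogate pair, two chars), 'b'
        char[] buf = "a\uD83D\uDE00b".toCharArray();
        System.out.println(java.util.Arrays.toString(codePointOffsets(buf, buf.length, 0)));
    }
}
```

In this sketch the four chars of the example input map to the code point offsets 0, 1, 1, 2, so the char after the surrogate pair sits at char index 3 but code point offset 2.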

start of buffer offset compensated for presence of surrogate pairs
pos index into buffer compensated for surrogate pairs
current experiment throws NPEs; needs fix
column position (same)
line position (same)

When the above is done, the origin locations and the error positions will be
correct in the presence of surrogate pairs.
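As a rough picture of what "correct in the presence of surrogate pairs" could mean for line, column, and offset, the following sketch recomputes all three in code points from a UTF-16 char index. The class `CodePointPosition` and the method `positionAt` are hypothetical, and the conventions (1-based lines, 0-based columns and offsets, all counted in code points) are assumptions made for the example, not taken from this PR.

```java
public class CodePointPosition {
    /**
     * Hypothetical helper: returns {line, column, offset} for the position
     * just before the given UTF-16 char index, counting in code points.
     * Assumed conventions: line is 1-based, column and offset are 0-based.
     */
    static int[] positionAt(String text, int charIndex) {
        int line = 1;
        int column = 0;
        int offset = 0;
        int i = 0;
        while (i < charIndex) {
            int cp = text.codePointAt(i);
            // A supplementary code point occupies two chars but counts once.
            i += Character.charCount(cp);
            offset++;
            if (cp == '\n') {
                line++;
                column = 0;
            } else {
                column++;
            }
        }
        return new int[] { line, column, offset };
    }

    public static void main(String[] args) {
        // A quoted U+1F600, a newline, a space, then 'x' at char index 6.
        String text = "\"\uD83D\uDE00\"\n x";
        int[] p = positionAt(text, 6);
        System.out.println("line " + p[0] + ", column " + p[1] + ", offset " + p[2]);
    }
}
```

For the example input, the `x` is at char index 6 but at code point offset 5, which is the kind of one-off shift per surrogate pair that the compensation is meant to remove.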

codecov · 2026-02-18T11:10:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46%. Comparing base (57d8bcd) to head (505e233).

Additional details and impacted files

@@           Coverage Diff           @@
##              main   #2659   +/-   ##
=======================================
- Coverage       46%     46%   -1%     
+ Complexity    6677    6670    -7     
=======================================
  Files          795     795           
  Lines        65899   65902    +3     
  Branches      9878    9880    +2     
=======================================
- Hits         30709   30706    -3     
- Misses       32806   32818   +12     
+ Partials      2384    2378    -6


sonarqubecloud · 2026-02-19T11:34:39Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code


buffer offset is now compensated for surrogate pairs

3aa527d

jurgenvinju added 4 commits February 18, 2026 12:58

Merge branch 'main' into fix/json-unicode-offsets

505e233

added test with unicode surrogate pairs

977759a

initial throw at unicode resilient positions during JSON parsing

5432aea

minor improvements

fd57b97

jurgenvinju self-assigned this Feb 19, 2026

jurgenvinju wants to merge 5 commits into main from fix/json-unicode-offsets
