Exclude star-pagination text when extracting citations by quevon24 · Pull Request #294 · freelawproject/eyecite

quevon24 · 2025-10-03T00:19:18Z

This PR updates the XPath used for text extraction so that it ignores text nodes whose parent element has the star-pagination class. These page markers, which appear in Harvard's XML, should not be included in the cleaned content.

Example:

Input: 135 <span class="star-pagination">*355</span> Mass. 147

Before:
Extracted text included pagination markers:

Output: 355 Mass. 147

After:
Pagination markers are excluded:

Output: 135 Mass. 147
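
The filtering idea described above can be sketched with lxml. This is an illustrative sketch only, not the XPath that eyecite actually ships: the helper name visible_text, the skip_star_pagination flag, and the exact predicate are assumptions made for the example.

```python
import lxml.html

# XPath test that is true when an element's class list contains the
# star-pagination token (token match, so "star-pagination-x" does not count).
# Assumption: this predicate is illustrative, not copied from eyecite.
HAS_STAR_CLASS = (
    "contains(concat(' ', normalize-space(@class), ' '), ' star-pagination ')"
)


def visible_text(markup: str, skip_star_pagination: bool = True) -> str:
    """Return whitespace-normalized text, optionally dropping page markers."""
    tree = lxml.html.fromstring(markup)
    # Keep only text nodes that are not pure whitespace.
    predicate = "normalize-space()"
    if skip_star_pagination:
        # Drop text nodes whose direct parent carries the marker class.
        predicate += f" and not(parent::*[{HAS_STAR_CLASS}])"
    nodes = tree.xpath(f".//text()[{predicate}]")
    # Join the surviving nodes and collapse runs of whitespace.
    return " ".join(" ".join(nodes).split())


sample = '135 <span class="star-pagination">*355</span> Mass. 147'
print(visible_text(sample, skip_star_pagination=False))  # 135 *355 Mass. 147
print(visible_text(sample))                              # 135 Mass. 147
```

With the marker left in, the text that reaches citation extraction has "*355" sitting directly before "Mass. 147", which is consistent with the wrong volume shown in the Before output.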

add test case

github-actions · 2025-10-03T00:25:58Z

The Eyecite Report 👁️

Gains and Losses

There were 8 gains and 9 losses.


There were 53 changes so we are only displaying the first 50. You can review the
entire list by downloading the output.csv file linked above.

id	Gain	Loss
3542084	Page 78
1771039	361 Dallas, 124
1771039		124 Tex. 1
1546016	360 App. 158
1546016		99 Conn. App. 158
2361623	538 C.F.R. §§ 404.1520
2427861		594 S.W.2d 795
1270074	992 C.F.R. § 423.440
2183357		80 L.Ed. 2d 674
901384		197 Mich.App. 349
1352961	434 P.2d at 593
2208164	384 N.W.2d at 244
2208164		251 N.W.2d at 244
1334683		236 SE2d 829
1485354		Tex. Gov't Code Ann. § 2001.033
1485354	351 S.W.3d 833
1485354		62 S.W.3d 833

Time Chart

Generated Files

Branch 1 Output
Branch 2 Output
Full Output CSV

flooie · 2025-10-03T15:04:32Z

Thanks @quevon24

grossir

These are used in Harvard's XML.

I also see them in html_lawbox for some of the examples. I am also seeing class="page-label" in Harvard data, so it may be worth excluding that class as well.
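
One way to picture skipping more than one marker class is the dependency-free sketch below, built on the standard library's html.parser rather than XPath. Everything in it is hypothetical (SKIPPED_CLASSES, _VisibleText, strip_page_markers), and unlike the parent-only XPath check it drops all text nested anywhere inside a marked element.

```python
from html.parser import HTMLParser

# Classes whose contents are page markers rather than opinion text.
# Assumption: "page-label" is included only because the review suggests it.
SKIPPED_CLASSES = frozenset({"star-pagination", "page-label"})


class _VisibleText(HTMLParser):
    """Collect text, ignoring anything inside an element with a skipped class."""

    def __init__(self, skipped=SKIPPED_CLASSES):
        super().__init__(convert_charrefs=True)
        self.skipped = skipped
        self.parts = []
        self._skip_tag = None  # tag name of the element currently being skipped
        self._depth = 0        # nesting depth of that tag name while skipping

    def handle_starttag(self, tag, attrs):
        if self._skip_tag is None:
            classes = (dict(attrs).get("class") or "").split()
            if self.skipped.intersection(classes):
                self._skip_tag, self._depth = tag, 1
        elif tag == self._skip_tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self._skip_tag:
            self._depth -= 1
            if self._depth == 0:
                self._skip_tag = None

    def handle_data(self, data):
        if self._skip_tag is None and data.strip():
            self.parts.append(data)


def strip_page_markers(markup, skipped=SKIPPED_CLASSES):
    """Return whitespace-normalized text with page-marker elements removed."""
    parser = _VisibleText(skipped)
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.parts).split())


print(strip_page_markers('99 Conn. <span class="star-pagination">*360</span> App. 158'))
print(strip_page_markers('99 Conn. <span class="page-label">*360</span> App. 158'))
```

Both calls print "99 Conn. App. 158". The second input is made-up markup used only to exercise the page-label case. The sketch assumes markers are ordinary elements with a closing tag, such as span.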

The rest of this comment covers some odd behavior I found while experimenting with the branch. I think the changes are OK; my only suggestion is to add the class I mention above.

Documenting some weird behavior from the benchmark #212 and html_with_citations

See this one that appears as a status change in the benchmark for opinion and cluster 1485354

"351 S.W.3d 833" vs "62 S.W.3d 833"

Weirdly:

The source (xml_harvard) does not have the same value as the benchmark: it has a "5" instead of an "S", so no citation should be found there in any case.
Also, the class in the live html_with_citations is class="page-label" instead of class="star-pagination".

Checking the source, it seems html_lawbox was actually used (because it has an "S" instead of a "5")?

In [14]: op2.xml_harvard[18000:18300]
Out[14]: ' <em>\n   Reliant Energy, Inc. v. Public Util. Comm’n,\n  </em>\n  62\n  <span citation-index="1" class="star-pagination" label="351"> \n   *351\n   </span>\n  5.W.3d 833, 841 (Tex.App.2001);\n  <em>\n   McCarty,\n  </em>\n  919 S.W.2d at 854.\n </p>\n<p id="b371-4">\n  Appellants correctly note the factors the C'

In [20]: op2.html_lawbox[18000:18200]
Out[20]: 'ergy, Inc. v. Public Util, Comm\'n,</i> 62 <span class="star-pagination">*351</span> S.W.3d 833, 841 (Tex.App.2001); <i>McCarty,</i> 919 S.W.2d at 854.</p>\n<p>Appellants correctly note the factors the '
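
The effect of the marker on the two sources above can be reproduced with a toy pattern. This is only a sketch: TOY_CITATION and find_toy_citation are hypothetical and far simpler than eyecite's tokenizer, and the three strings are hand-cleaned versions of the snippets shown above.

```python
import re

# Toy pattern for "<volume> S.W.3d <page>". Assumption: this stands in for
# real citation matching purely to show how the marker shifts the volume.
TOY_CITATION = re.compile(r"(\d+)\s+S\.W\.3d\s+(\d+)")


def find_toy_citation(text):
    """Return the first toy citation found in text, or None."""
    match = TOY_CITATION.search(text)
    return match.group(0) if match else None


# html_lawbox text with the star-pagination marker left in place
lawbox_with_marker = "62 *351 S.W.3d 833, 841 (Tex.App.2001)"
# html_lawbox text once the marker is excluded
lawbox_marker_removed = "62 S.W.3d 833, 841 (Tex.App.2001)"
# xml_harvard text once the marker is excluded (note "5.W.3d", not "S.W.3d")
harvard_marker_removed = "62 5.W.3d 833, 841 (Tex.App.2001)"

print(find_toy_citation(lawbox_with_marker))      # 351 S.W.3d 833
print(find_toy_citation(lawbox_marker_removed))   # 62 S.W.3d 833
print(find_toy_citation(harvard_marker_removed))  # None
```

Under these assumptions, the "351 S.W.3d 833" versus "62 S.W.3d 833" pair in the benchmark can only come from text with an "S", which matches the html_lawbox snippet and not the xml_harvard one.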

I was checking the benchmark output (as always, it's confusing that the Loss column actually holds the gains, and vice versa).

In the Loss column

"99 Conn. App. 158" comes from this cluster with 1 opinion that has both html_lawbox and xml_harvard

But if you check the opinion, we already have that citation; it is also found by the current eyecite release installed in CourtListener.

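# Run inside a CourtListener shell so that the Opinion model is importable
from eyecite.tokenizers import HyperscanTokenizer
from eyecite import get_citations
from cl.search.models import Opinion

# Opinion for benchmark row 1546016 ("99 Conn. App. 158")
op = Opinion.objects.get(id=1546016)
# Extract citations from the Harvard XML using the same cleaning steps
get_citations(markup_text=op.xml_harvard, clean_steps=['xml', 'html', 'inline_whitespace'], tokenizer=HyperscanTokenizer())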

It makes sense that we already have it, since it's not actually affected by the star pagination

In [9]: op.xml_harvard[16500:16700]
Out[9]: 'ntirely speculative.” (Internal quotation marks omitted.)\n  <em>\n   Jezierny\n  </em>\n  v. Jezierny, 99 Conn. App. 158, 160-61, 912 A.2d 1127 (2007). Nowhere in the court’s memorandum of decision is th'

But, again, in html_lawbox it was actually being affected

In [26]: op.html_lawbox[15450:15650]
Out[26]: ' by us ... would be entirely speculative." (Internal quotation marks omitted.) <i>Jezierny v. Jezierny,</i> 99 Conn. <span class="star-pagination">*360</span> App. 158, 160-61, 912 A.2d 1127 (2007). N'

So it's very strange that the benchmark is marking it as a status change. Is the benchmark data corrupted? It looks as if xml_harvard was overridden by html_lawbox.

flooie

Looks good to me.

fix(clean): exclude text inside elements with class "star-pagination"

b30f463

add test case

quevon24 linked an issue Oct 3, 2025 that may be closed by this pull request

Listed in authorities but not actually cited #293

Closed

fix(clean): update CHANGES.md

10c42da

grossir added this to Sprint (Case Law) Oct 3, 2025

grossir moved this to PRs to Review in Sprint (Case Law) Oct 3, 2025

grossir self-assigned this Oct 3, 2025

grossir requested review from flooie and grossir and removed request for flooie October 3, 2025 14:56

grossir reviewed Oct 3, 2025


grossir assigned quevon24 and unassigned grossir Oct 3, 2025

flooie approved these changes Oct 3, 2025


flooie merged commit 04d82c0 into main Oct 3, 2025
8 checks passed

flooie deleted the 293-listed-in-authorities-but-not-actually-cited branch October 3, 2025 17:23

github-project-automation bot moved this from PRs to Review to Done in Sprint (Case Law) Oct 3, 2025

flooie merged 2 commits into main from 293-listed-in-authorities-but-not-actually-cited
