Exclude star-pagination text when extracting citations#294

Merged
flooie merged 2 commits into main from 293-listed-in-authorities-but-not-actually-cited
Oct 3, 2025

Conversation

@quevon24
Member

@quevon24 quevon24 commented Oct 3, 2025

This PR updates the XPath used for text extraction so that it ignores text nodes whose parent has the star-pagination class. These markers, which appear in Harvard's XML, should not be included in the cleaned content.

Example:

Input: 135 <span class="star-pagination">*355</span> Mass. 147

Before:
Extracted text included pagination markers:

Output: 355 Mass. 147

After:
Pagination markers are excluded:

Output: 135 Mass. 147
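
For illustration, a minimal sketch of the same idea. The PR itself changes an XPath; this version swaps in the standard library's `html.parser` so it runs without extra dependencies. The names `SKIP_CLASSES`, `TextWithoutMarkers`, and `extract_text` are hypothetical and are not part of eyecite.

```python
from html.parser import HTMLParser

# Assumption: the set of classes whose text should be dropped.
SKIP_CLASSES = {"star-pagination"}

# Void elements never get a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextWithoutMarkers(HTMLParser):
    """Collect text, skipping text whose direct parent carries a skipped class."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # one bool per open element: does it carry a skipped class?
        self.chunks = []  # text pieces that are kept

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append(bool(SKIP_CLASSES.intersection(classes)))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Only the direct parent is checked, mirroring "parent has the class".
        if not (self.stack and self.stack[-1]):
            self.chunks.append(data)


def extract_text(markup):
    parser = TextWithoutMarkers()
    parser.feed(markup)
    parser.close()
    # Collapse the double space left behind where the marker was removed.
    return " ".join("".join(parser.chunks).split())


print(extract_text('135 <span class="star-pagination">*355</span> Mass. 147'))
```

With the marker text gone, the volume number 135 sits directly next to the reporter again, which is what lets the full citation be found.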

@quevon24 quevon24 linked an issue Oct 3, 2025 that may be closed by this pull request
@github-actions
Contributor

github-actions bot commented Oct 3, 2025

The Eyecite Report 👁️

Gains and Losses

There were 8 gains and 9 losses.


There were 53 changes so we are only displaying the first 50. You can review the
entire list by downloading the output.csv file linked above.

id Gain Loss
3542084 Page 78
1771039 361 Dallas, 124
1771039 124 Tex. 1
1546016 360 App. 158
1546016 99 Conn. App. 158
2361623 538 C.F.R. §§ 404.1520
2427861 594 S.W.2d 795
1270074 992 C.F.R. § 423.440
2183357 80 L.Ed. 2d 674
901384 197 Mich.App. 349
1352961 434 P.2d at 593
2208164 384 N.W.2d at 244
2208164 251 N.W.2d at 244
1334683 236 SE2d 829
1485354 Tex. Gov't Code Ann. § 2001.033
1485354 351 S.W.3d 833
1485354 62 S.W.3d 833

Time Chart

image

Generated Files

Branch 1 Output
Branch 2 Output
Full Output CSV

@grossir grossir moved this to PRs to Review in Sprint (Case Law) Oct 3, 2025
@grossir grossir self-assigned this Oct 3, 2025
@grossir grossir requested review from flooie and grossir and removed request for flooie October 3, 2025 14:56
@flooie
Contributor

flooie commented Oct 3, 2025

Thanks @quevon24

Contributor

@grossir grossir left a comment

> These are used in Harvard's XML.

I also see them in html_lawbox for some of the examples. I am also seeing class="page-label" in Harvard data; it may be a good addition.

The rest of this comment is some weird stuff I found while experimenting with the branch. I think the changes are OK; I would only suggest adding the class I mention above.
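
As a rough sketch of what covering both classes could look like, the helper below builds an XPath 1.0 expression that skips text whose parent carries any class from a list. `build_text_xpath` and the exact shape of the expression are assumptions for illustration; the XPath actually used in this PR may differ.

```python
def build_text_xpath(skip_classes):
    """Return an XPath selecting text nodes whose parent has none of skip_classes.

    Hypothetical helper: it uses the common XPath 1.0 idiom for matching one
    whole class token inside a space-separated class attribute.
    """
    if not skip_classes:
        return "//text()"
    tests = " or ".join(
        f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"
        for name in skip_classes
    )
    return f"//text()[not(parent::*[{tests}])]"


print(build_text_xpath(["star-pagination", "page-label"]))
```

Matching on whole class tokens rather than a plain substring avoids accidentally skipping elements whose class merely contains one of the names.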


Documenting some weird behavior from the benchmark #212 and html_with_citations

See this one that appears as a status change in the benchmark for opinion and cluster 1485354

"351 S.W.3d 833" vs "62 S.W.3d 833"

Weirdly

  • the source does not have the same value as the benchmark: it has a "5" instead of an "S", so no citation should be found anyway
  • Also, the class on the live html_with_citations is class="page-label" instead of class="star-pagination"
image

Checking the source, it seems lawbox was actually used? (because of the "S" instead of the "5")

In [14]: op2.xml_harvard[18000:18300]
Out[14]: ' <em>\n   Reliant Energy, Inc. v. Public Util. Comm’n,\n  </em>\n  62\n  <span citation-index="1" class="star-pagination" label="351"> \n   *351\n   </span>\n  5.W.3d 833, 841 (Tex.App.2001);\n  <em>\n   McCarty,\n  </em>\n  919 S.W.2d at 854.\n </p>\n<p id="b371-4">\n  Appellants correctly note the factors the C'

In [20]: op2.html_lawbox[18000:18200]
Out[20]: 'ergy, Inc. v. Public Util, Comm\'n,</i> 62 <span class="star-pagination">*351</span> S.W.3d 833, 841 (Tex.App.2001); <i>McCarty,</i> 919 S.W.2d at 854.</p>\n<p>Appellants correctly note the factors the '

I was checking the benchmark output (as always, it's confusing that Loss is actually the gains, and vice versa)

In the Loss column

  • "99 Conn. App. 158" comes from this cluster with 1 opinion that has both html_lawbox and xml_harvard

But if you check the opinion, we already have that citation; it is also found by the current eyecite release installed in CourtListener.

from eyecite.tokenizers import HyperscanTokenizer
from eyecite import get_citations
from cl.search.models import Opinion

op = Opinion.objects.get(id=1546016)
get_citations(markup_text=op.xml_harvard, clean_steps=['xml', 'html', 'inline_whitespace'], tokenizer=HyperscanTokenizer())

It makes sense that we already have it, since it's not actually affected by the star pagination

In [9]: op.xml_harvard[16500:16700]
Out[9]: 'ntirely speculative.” (Internal quotation marks omitted.)\n  <em>\n   Jezierny\n  </em>\n  v. Jezierny, 99 Conn. App. 158, 160-61, 912 A.2d 1127 (2007). Nowhere in the court’s memorandum of decision is th'

But, again, in html_lawbox it was actually being affected

In [26]: op.html_lawbox[15450:15650]
Out[26]: ' by us ... would be entirely speculative." (Internal quotation marks omitted.) <i>Jezierny v. Jezierny,</i> 99 Conn. <span class="star-pagination">*360</span> App. 158, 160-61, 912 A.2d 1127 (2007). N'

So, it's very strange that the benchmark is marking it as a status change. Is the benchmark data corrupted? Seems like xml_harvard was overloaded by html_lawbox?
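
For reference, the effect on the html_lawbox snippet can be reproduced with a toy example. This is only a sketch: `TOY_CITE` is a deliberately simplified pattern, not eyecite's tokenizer, and the helper names are made up for illustration.

```python
import re

# Toy pattern (assumption): volume, optional "Conn. ", "App.", page.
TOY_CITE = re.compile(r"\b\d+ (?:Conn\. )?App\. \d+\b")
MARKER = re.compile(r'<span class="star-pagination">.*?</span>')
TAG = re.compile(r"<[^>]+>")


def collapse(text):
    """Normalize runs of whitespace to single spaces."""
    return " ".join(text.split())


def text_keeping_markers(markup):
    # Old behavior: tags are stripped but the "*360" marker text stays.
    return collapse(TAG.sub("", markup))


def text_dropping_markers(markup):
    # New behavior: the marker element and its text are removed first.
    return collapse(TAG.sub("", MARKER.sub("", markup)))


snippet = ('<i>Jezierny v. Jezierny,</i> 99 Conn. '
           '<span class="star-pagination">*360</span> App. 158, 160-61')

before = TOY_CITE.search(text_keeping_markers(snippet)).group(0)
after = TOY_CITE.search(text_dropping_markers(snippet)).group(0)
print(before)  # 360 App. 158
print(after)   # 99 Conn. App. 158
```

The two results line up with the pair of benchmark rows for 1546016, which is consistent with html_lawbox (not xml_harvard) being the text that was compared.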

@grossir grossir assigned quevon24 and unassigned grossir Oct 3, 2025
Contributor

@flooie flooie left a comment

Looks good to me.

@flooie flooie merged commit 04d82c0 into main Oct 3, 2025
8 checks passed
@flooie flooie deleted the 293-listed-in-authorities-but-not-actually-cited branch October 3, 2025 17:23
@github-project-automation github-project-automation bot moved this from PRs to Review to Done in Sprint (Case Law) Oct 3, 2025

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Listed in authorities but not actually cited

3 participants