Exclude star-pagination text when extracting citations#294
Conversation
The Eyecite Report 👁️Gains and LossesThere were 8 gains and 9 losses. Click here to see details.There were 53 changes so we are only displaying the first 50. You can review the
Time ChartGenerated Files |
|
Thanks @quevon24 |
grossir
left a comment
There was a problem hiding this comment.
These are used in Harvard's XML.
I also see them in html_lawbox for some of the examples. I am also seeing class="page-label" in Harvard data, maybe it's a good addition
The rest of this comment is some weird stuff I found while experimenting with the branch; but I think the changes are OK, I would only suggest adding the class I mention above
Documenting some weird behavior from the benchmark #212 and html_with_citations
See this one that appears as a status change in the benchmark for opinion and cluster 1485354
"351 S.W.3d 833" vs "62 S.W.3d 833"
Weirdly
- on the source it has not got the same value as in the benchmark, having a "5" instead of an "S", so no citation should be found, anyway
- Also, the class on the live
html_with_citationsisclass="page-label"instead ofclass="start-pagination"
Checking the source it seems lawbox was actually used? (because of "S" instead of "5"
In [14]: op2.xml_harvard[18000:18300]
Out[14]: ' <em>\n Reliant Energy, Inc. v. Public Util. Comm’n,\n </em>\n 62\n <span citation-index="1" class="star-pagination" label="351"> \n *351\n </span>\n 5.W.3d 833, 841 (Tex.App.2001);\n <em>\n McCarty,\n </em>\n 919 S.W.2d at 854.\n </p>\n<p id="b371-4">\n Appellants correctly note the factors the C'
In [20]: op2.html_lawbox[18000:18200]
Out[20]: 'ergy, Inc. v. Public Util, Comm\'n,</i> 62 <span class="star-pagination">*351</span> S.W.3d 833, 841 (Tex.App.2001); <i>McCarty,</i> 919 S.W.2d at 854.</p>\n<p>Appellants correctly note the factors the 'I was checking the benchmark output (as always, it's confusing that Loss is actually the gains; and viceversa)
In the Loss column
- "99 Conn. App. 158" comes from this cluster with 1 opinion that has both html_lawbox and xml_harvard
But, if you check the opinion, we already have such citation...; it is also found on the current eyecite release installed in Courtlistener.
from eyecite.tokenizers import HyperscanTokenizer
from eyecite import get_citations
from cl.search.models import Opinion
op = Opinion.objects.get(id=1546016)
get_citations(markup_text=op.xml_harvard, clean_steps=['xml', 'html', 'inline_whitespace'], tokenizer=HyperscanTokenizer())It makes sense that we already have it, since it's not actually affected by the star pagination
In [9]: op.xml_harvard[16500:16700]
Out[9]: 'ntirely speculative.” (Internal quotation marks omitted.)\n <em>\n Jezierny\n </em>\n v. Jezierny, 99 Conn. App. 158, 160-61, 912 A.2d 1127 (2007). Nowhere in the court’s memorandum of decision is th'But, again, in html_lawbox it was actually being affected
In [26]: op.html_lawbox[15450:15650]
Out[26]: ' by us ... would be entirely speculative." (Internal quotation marks omitted.) <i>Jezierny v. Jezierny,</i> 99 Conn. <span class="star-pagination">*360</span> App. 158, 160-61, 912 A.2d 1127 (2007). N'
So, it's very strange that the benchmark is marking it as a status change. Is the benchmark data corrupted? Seems like xml_harvard was overloaded by html_lawbox?

This PR updates the XPath used for text extraction to ignore text nodes whose parent has the star-pagination class. These markers should not be included in the cleaned content. These are used in Harvard's XML.
Example:
Input:
135 <span class="star-pagination">*355</span> Mass. 147Before:
Extracted text included pagination markers:
Output:
355 Mass. 147After:
Pagination markers are excluded:
Output:
135 Mass. 147