@EFord36 EFord36 commented Jun 4, 2024

Closes #274

Currently, a term is only preferred if its term norm is a single word contained (as a word) in the ent_match_norm. This change also prefers it if the term norm is a sequence of words that is a substring of the ent_match_norm.
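To make the difference concrete, here is a minimal sketch of the two checks. It is not the code in this PR: the function names are hypothetical, and it assumes both norms are strings whose words are separated by whitespace. Only `term_norm` and `ent_match_norm` are names taken from the description above.

```python
import re


def is_single_word_in(term_norm: str, ent_match_norm: str) -> bool:
    """Old behaviour as described: term_norm must be one whole word of ent_match_norm.

    Hypothetical helper, assuming whitespace-separated words.
    """
    return term_norm in ent_match_norm.split()


def is_word_sequence_in(term_norm: str, ent_match_norm: str) -> bool:
    """New behaviour as described: term_norm may be several words, matched as whole words.

    Hypothetical helper. The lookarounds require that the match is not
    preceded or followed by a non-whitespace character, so "EAST CANCER"
    does not match inside "BREAST CANCER".
    """
    pattern = r"(?<!\S)" + re.escape(term_norm) + r"(?!\S)"
    return re.search(pattern, ent_match_norm) is not None
```

With `ent_match_norm = "METASTATIC BREAST CANCER"`, the single-word check accepts `"CANCER"` but rejects `"BREAST CANCER"`, while the sequence check accepts both and still rejects the partial-word `"EAST CANCER"`.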

In draft because I still need to:

  • Check impact on performance (we're using a regex match now; is this too slow? We could compare it with finding the first word via list.index and then iterating from there, although that implementation is longer. We could also try mypyc on that, out of interest).
  • Check impact on behaviour (does this change anything in the test documents we have for the different use cases? It should help, but does it?)
  • Write tests
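For the performance comparison in the first bullet, the `list.index` alternative could look roughly like the sketch below. This is an illustration under assumptions, not the implementation in this PR: the function name is hypothetical, and both norms are assumed to be already split into lists of words.

```python
from __future__ import annotations


def contains_word_sequence(term_words: list[str], ent_words: list[str]) -> bool:
    """Return True if term_words occurs as a contiguous run inside ent_words.

    Hypothetical helper: find each occurrence of the first word with
    list.index, then compare the following slice against term_words.
    """
    if not term_words:
        return False
    first = term_words[0]
    length = len(term_words)
    start = 0
    while True:
        try:
            # list.index raises ValueError when no further occurrence exists.
            position = ent_words.index(first, start)
        except ValueError:
            return False
        if ent_words[position:position + length] == term_words:
            return True
        # Resume the search just after this candidate position.
        start = position + 1
```

Timing this against the regex version with `timeit` on representative norms would show whether the extra code is worth it; the same function is also a plausible candidate for the mypyc experiment, since it only uses plain lists and strings.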

That said, the actual code is ready to look at, to assess whether this is a good idea in a broad sense.

@EFord36 EFord36 requested a review from RichJackson June 4, 2024 15:00
@EFord36 EFord36 force-pushed the multi-word-substring-checking branch from 119f242 to 0d706b3 Compare June 4, 2024 16:07