Added information about the use of slow tokenizers #2517
Conversation
Added information about the use of slow tokenizers to generate vocab files in ML.
benironside
left a comment
I am not familiar with the content, but the writing LGTM
FYI, I started work on switching to the fast tokenizers for Eland in elastic/eland#803. This change is required to support more of the models found on HuggingFace; the Jina AI Reranker is an example. However, some tests failed after the switch, so it is not a simple change, and we must first understand why those failures are occurring.
Co-authored-by: David Kyle <david.kyle@elastic.co>
davidkyle
left a comment
We've lost the part about the slow tokenizers now
@davidkyle Oh, I thought it was intentional because we are about to support fast tokenizers 😄 Do you think we should hold off on this PR until fast tokenizer support is available, and make the statement about slow/fast tokenizers then? WDYT?
@ppf2 I disabled the auto-merge as I saw your question on holding off on this PR, just to be sure this doesn’t get merged until you want it to.
Good point. Let's merge as is and I will concentrate on the fast tokenizer work. If I don't make any progress next week, I will create another PR here to document the use of slow tokenizers.
Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
@ppf2 please feel free to merge at your convenience as the auto-merge is not enabled.