How does a Hugging Face tokenizer get executed through a maze of code?
I was very intrigued by Hugging Face's core concept of loading models and running them. It does everything under the hood, and you don't have to do anything besides calling a few functions. In this post I will try to capture the essence of the code path that identifies your model's tokenizer.
So the skeleton looks like this (honestly, don't call yourself an AI enthusiast if this is all you do, but here we are!).
In this blog we will dive into what happens inside these tokenizers. We will do a simple code walkthrough and then visualise what tokenizers actually do.
There are two ways we can work with tokenizers. One is to call the tokenizer classes from the transformers library, and the other is the tokenizers library, both coming from Hugging Face.
# First way is -
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
# Second way is
from tokenizers import Tokenizer
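Before we dig into the internals, here is a minimal sketch of what a loaded tokenizer does end to end, using the standard transformers API (the example sentence is just an illustration):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
# Text -> token ids (plus the attention mask the model expects)
batch = tokenizer("Hello tokenizers!")
print(batch["input_ids"])
print(tokenizer.convert_ids_to_tokens(batch["input_ids"]))
# And back to text again
print(tokenizer.decode(batch["input_ids"], skip_special_tokens=True))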
Calling this function
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
relies on the following mapping defined alongside the AutoTokenizer class, which contains the utilities to load the specific tokenizer class we want to associate with the model for tokenization.
TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        (
            "albert",
            (
                "AlbertTokenizer" if is_sentencepiece_available() else None,
                "AlbertTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("align", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("bark", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("bart", ("BartTokenizer", "BartTokenizerFast")),
        ...
Cruising through the code, control eventually reaches this function call —
config = AutoConfig.from_pretrained(
    pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs
)
The model's configuration is loaded into this config object, and from it we can see that the recognised model is BART. Our tokenizer now has some context on the direction it has to head in.
After loading the initial config, the code searches for a local model in the given directory and also checks the local cache where Hugging Face saves datasets and models. Then, da da da, the code reaches the point where we actually learn which class loads the tokenizer in question —
print(config)        # BartConfig { ... }
print(type(config))  # the config class, BartConfig

# from TOKENIZER_MAPPING we get the slow and fast tokenizer classes for this config
tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
There you go, you can see the BartTokenizer! In the next code snippet the chosen tokenizer class is loaded via its from_pretrained function.
if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
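You can check which branch you ended up in from user land: use_fast defaults to True, and passing use_fast=False forces the pure-Python tokenizer instead (a small sketch):
from transformers import AutoTokenizer
fast = AutoTokenizer.from_pretrained("facebook/bart-base")
print(type(fast).__name__)   # BartTokenizerFast
slow = AutoTokenizer.from_pretrained("facebook/bart-base", use_fast=False)
print(type(slow).__name__)   # BartTokenizer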
Then, moving forward, BartTokenizerFast is the class that gets used (the BART tokenizer code lives under https://github.com/huggingface/transformers/blob/main/src/transformers/models/bart/tokenization_bart.py), and its from_pretrained function is called.
BartTokenizerFast.from_pretrained("facebook/bart-base", trust_remote_code=True) ends up calling the utility functions in tokenization_utils_base.py, which are methods of PreTrainedTokenizerBase; BartTokenizerFast inherits from PreTrainedTokenizerFast, which in turn inherits from PreTrainedTokenizerBase. So the inheritance looks like this
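If you want to verify that chain yourself, printing the method resolution order shows where from_pretrained actually lives (the exact list of ancestors can vary a little between versions):
from transformers import BartTokenizerFast
# Walk the class hierarchy that from_pretrained is resolved against.
for cls in BartTokenizerFast.__mro__:
    print(cls.__name__)
# BartTokenizerFast -> PreTrainedTokenizerFast -> PreTrainedTokenizerBase -> ...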
Now let's focus on this piece of code —
from tokenizers import Tokenizer
This Tokenizer is quite an interesting class. It is a wrapper on top of the base tokenizer models and provides custom functionality for implementing and experimenting with your own tokenizers. The link given below points to the Python bindings, since the original implementation is in Rust; it is released for Node.js as well — https://github.com/huggingface/tokenizers/tree/main/bindings/python.
Think of them as classes given out of the box that you can use to tokenize your favourite sentences.
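As a quick taste, a recent tokenizers release can load a published tokenizer.json straight from the Hub (a minimal sketch; "bert-base-uncased" is just an example repo id):
from tokenizers import Tokenizer
# Downloads the repo's tokenizer.json and rebuilds the full pipeline.
tok = Tokenizer.from_pretrained("bert-base-uncased")
encoding = tok.encode("Tokenizers are fun!")
print(encoding.tokens)  # subword tokens, including the [CLS]/[SEP] added by the post-processor
print(encoding.ids)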
Let’s check out this diagram —
The details will be covered in the next blog, but this diagram highlights a few pointers —
- BaseTokenizer is the abstract class behind the five tokenizers shown in the diagram, which can be used with any model for tokenization. This is the core concept: when you start research on a new model, train a new tokenizer on your own data for a quick implementation using this library, save the merges and vocab files, and contribute the tokenizer to transformers if you build something custom on top of these tokenizers.
- Models are the classes that implement the actual tokenization algorithms, such as BPE, WordPiece, Unigram and WordLevel.
- Normalizers — this will help: https://huggingface.co/docs/tokenizers/en/api/normalizers. Normalizers basically work on the Unicode representation to remove any unwanted behaviour at the character level (lowercasing, stripping accents, and so on).
- Pre Tokenizers — some preprocessing utilities that split the input before tokenization. You can check out this link: https://huggingface.co/docs/tokenizers/en/api/pre-tokenizers.
- Post Processors — some additional logic handled at the token level after encoding, such as adding the special tokens the model expects (e.g. [CLS] and [SEP]), so you get a model-ready output.
- Trainers — this is an important class. What happens is you initialise a Tokenizer as in the code below, use the trainer class to train the tokenizer according to the algorithm/model you initialised it with, and save the result. A Tokenizer is really a pipeline that can be customised in a very logical way, snapped together like Lego bricks. The following example has been taken from the official repo — https://github.com/huggingface/tokenizers/blob/main/bindings/python/examples/train_with_datasets.py. So next time you find yourself thinking "okay, I need to create a tokenizer", do this!
import datasets
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# Build a tokenizer
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
bpe_tokenizer.normalizer = normalizers.Lowercase()

# Initialize a dataset
dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Build an iterator over this dataset
def batch_iterator():
    batch_size = 1000
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]

# And finally train
bpe_tokenizer.train_from_iterator(batch_iterator(), length=len(dataset))
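Once trained, the tokenizer can be used and persisted right away. A minimal sketch; the file name below is just a placeholder:
# Sanity-check the freshly trained tokenizer
encoding = bpe_tokenizer.encode("Tokenizers are trained from plain text!")
print(encoding.tokens)

# Persist the whole pipeline (vocab, merges, normalizer, pre-tokenizer) as one JSON file...
bpe_tokenizer.save("bpe-wikitext.json")

# ...and reload it later in a single call.
reloaded = Tokenizer.from_file("bpe-wikitext.json")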
I have tried to explain, in a few words ( :P ), some of the coding brainstorming that has gone on at HF HQ. Time to jump to the next blog. Happy reading!