Fine-tuning a Hugging Face model


I've decided to write this post to make it easier to understand (and to have a future reference on) how to fine-tune a Hugging Face pre-trained transformer model with my own data.

The inspiration for this surfaced from the fact that I could not use the Simple Transformers package in my work environment.

Before starting

The instructions here are for the question answering pipeline. Most of them are the same or very similar for other pipelines; the data source is probably where you'll see the biggest differences. Check the official documentation for more instructions.

Organize your data

This is a key step. To successfully train a model on your own data, the data has to be structured in the appropriate format.

The format is basically a list of Python dictionaries. You can store it however you think is best; I'm using a JSON file.

The structure is:

  • list of dictionaries 1 with "context" : "<context string>" and "qas" : <list of dictionaries 2>.
  • list of dictionaries 2 with "id" : "<id as string>", "is_impossible" : <true|false>, "question" : "<question string>", "answers" : <list of dictionaries 3>.
  • list of dictionaries 3 with "text" : "<answer string>", "answer_start" : <integer position>.

Important observations:

  • "context" is the passage of text that contains the information needed to answer the questions.
  • "id" must be unique, one for each question.
  • "is_impossible" is a boolean value, self-explanatory (can't answer the question = true, can answer the question = false).
  • the answer to the question goes in "text", inside the answers dictionary. It must be an exact copy of the excerpt from the context.
  • the "answer_start" value is an integer corresponding to the position of the answer in the context string (how many characters counting from the beginning of the context string). This value MUST be accurate; the snippet below shows one way to compute it.

Here's a template:

  [
      {
          "context" : "string with the relevant context",
          "qas" : [
              {
                  "id" : "strID_1",
                  "is_impossible" : false,
                  "question" : "What is in this string?",
                  "answers" : [
                      {
                          "text" : "relevant context",
                          "answer_start" : 16
                      }
                  ]
              }
          ]
      },
      {
          "context" : "string with another relevant context",
          "qas" : [
              {
                  "id" : "strID_2",
                  "is_impossible" : false,
                  "question" : "What else is there to know?",
                  "answers" : [
                      {
                          "text" : "another relevant context",
                          "answer_start" : 12
                      }
                  ]
              },
              {
                  "id" : "strID_3",
                  "is_impossible" : true,
                  "question" : "What is not in the context?",
                  "answers" : []
              }
          ]
      }
  ]

Note that I'm using multiple lines in the template above because it makes it easier to read. Usually you'll see everything in a single line.

Prepare the structured data for training

If for any reason you are running your training program on an offline computer, remember to set Hugging Face accordingly:

  import os
  os.environ['TRANSFORMERS_OFFLINE'] = "1"
  os.environ['HF_DATASETS_OFFLINE'] = "1"
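
Keep in mind that with those flags set, from_pretrained() can't download anything from the Hub, so the model and tokenizer must already be in the local cache or saved to disk. In that case you'd point from_pretrained() at a local directory instead of the model name (the path below is just a placeholder):

  from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering
  # load from a local copy of the model instead of downloading it
  tokenizer = DistilBertTokenizerFast.from_pretrained("path/to/local/distilbert-base-uncased")
  model = DistilBertForQuestionAnswering.from_pretrained("path/to/local/distilbert-base-uncased")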

First, we need to load our structured data into memory.

  import json #can also use pandas
  with open("path_to_json_train_data", "rb") as file:
      train_data = json.load(file)
  with open("path_to_json_eval_data", "rb") as file:
      eval_data = json.load(file)

Next, we will create a function that reads the data and returns it as separate lists of contexts, questions, and answers (you can incorporate the file loading above into this function if you want; I've kept it separate to make it clearer what each component is doing).

  def read_squad(data):
      ### optional, include to add the reading of the file here
      ### if so, also change the parameter name in the function to <file_name>
      #with open("<file_name>", "rb") as file:
      #    data = json.load(file)
      contexts = []
      questions = []
      answers = []
      for group in data:
          context = group["context"]
          for qa in group["qas"]:
              question = qa["question"]
              # note: questions with an empty "answers" list (is_impossible = true) never enter this loop
              for answer in qa["answers"]:
                  contexts.append(context)
                  questions.append(question)
                  answers.append(answer)
      return contexts, questions, answers

Create variables to store contexts, questions, and answers using the function above.

  train_contexts, train_questions, train_answers = read_squad(train_data)
  val_contexts, val_questions, val_answers = read_squad(eval_data)

Now, the data also needs an "answer_end" value (the character position where the answer ends in the context). Let's create a function to add that value to our data (it also corrects an "answer_start" that is off by one or two characters in the raw data source):

  def add_end_idx(answers, contexts):
      for answer, context in zip(answers, contexts):
          gold_text = answer["text"]
          start_idx = answer["answer_start"]
          end_idx = start_idx + len(gold_text)
          # fix start/end miscalculations (answers sometimes off by one or two characters)
          if context[start_idx:end_idx] == gold_text:
              answer["answer_end"] = end_idx
          elif context[start_idx-1:end_idx-1] == gold_text:
              answer["answer_start"] = start_idx - 1
              answer["answer_end"] = end_idx - 1
          elif context[start_idx-2:end_idx-2] == gold_text:
              answer["answer_start"] = start_idx - 2
              answer["answer_end"] = end_idx - 2

  add_end_idx(train_answers, train_contexts)
  add_end_idx(val_answers, val_contexts)

If everything worked so far, you're ready for the next step. If you're getting errors, make sure you are using the correct variables (e.g. context vs. contexts), review your loops' indentation (especially if you're copy/pasting), and check your loaded data for missing values (e.g. an inaccurate "answer_start", or a "text" that isn't an exact copy of the context, will likely result in missing "answer_end" values).
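
A minimal sanity check for that last case (assuming the variable names used above) is to count answers that still have no "answer_end":

  # indexes of training answers where add_end_idx() couldn't find the answer in the context
  missing = [i for i, a in enumerate(train_answers) if "answer_end" not in a]
  print(f"{len(missing)} training answers are missing 'answer_end' (first indexes: {missing[:10]})")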

Tokenize the data

Before starting the actual training, there is one more step: tokenizing the data. Let's load the model and tokenizer we want to use with our data.

  from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering
  tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
  model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Tokenize context and questions:

  train_encodings = tokenizer(train_contexts,
                              train_questions,
                              truncation=True,
                              padding=True)
  val_encodings = tokenizer(val_contexts,
                            val_questions,
                            truncation=True,
                            padding=True)
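
If you want to sanity-check what the tokenizer produced, you can peek at the encoding keys and decode one example (an optional, quick check; DistilBERT's fast tokenizer returns input_ids and attention_mask):

  print(train_encodings.keys())
  # the decoded example shows [CLS] <context tokens> [SEP] <question tokens> [SEP], plus any [PAD] tokens
  print(tokenizer.decode(train_encodings["input_ids"][0]))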

Convert character start and end positions to token positions. Following the original example, we will use the char_to_token() method from the Fast Tokenizer.

  def add_token_positions(encodings, answers):
      start_positions = []
      end_positions = []
      for i in range(len(answers)):
          start_positions.append(encodings.char_to_token(i, answers[i]["answer_start"]))
          end_positions.append(encodings.char_to_token(i, answers[i]["answer_end"] -1))
          # char_to_token() returns None if the answer was truncated out of the encoded passage
          if start_positions[-1] is None:
              start_positions[-1] = tokenizer.model_max_length
          if end_positions[-1] is None:
              end_positions[-1] = tokenizer.model_max_length
  
      encodings.update({"start_positions" : start_positions,
                        "end_positions" : end_positions})
  
  add_token_positions(train_encodings, train_answers)
  add_token_positions(val_encodings, val_answers)
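
To verify the token positions line up, you can decode the tokens between a sample's start and end positions and compare them with the original answer (assuming that sample wasn't truncated):

  i = 0
  start = train_encodings["start_positions"][i]
  end = train_encodings["end_positions"][i]
  print(tokenizer.decode(train_encodings["input_ids"][i][start:end + 1]))
  print(train_answers[i]["text"])  # the two outputs should roughly match (lowercased, possibly split into subwords)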

Training

Everything should be ready to start training now. We have already loaded the model, so let's put our data in a PyTorch dataset.

  import torch
  
  class SquadDataset(torch.utils.data.Dataset):
      def __init__(self, encodings):
          self.encodings = encodings
  
      def __getitem__(self, idx):
          return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
  
      def __len__(self):
          return len(self.encodings.input_ids)
  
  
  train_dataset = SquadDataset(train_encodings)
  val_dataset = SquadDataset(val_encodings)

Finally, we'll do the actual training. Let's also use a ternary expression to select between GPU (if available) and CPU processing. (If your version of transformers warns that its AdamW is deprecated, torch.optim.AdamW is a drop-in replacement here.)

  from torch.utils.data import DataLoader
  from transformers import AdamW
  
  device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
  
  model.to(device)
  model.train()
  
  train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
  
  optim = AdamW(model.parameters(), lr=5e-5)
  
  for epoch in range(3):
      for batch in train_loader:
          optim.zero_grad()
          input_ids = batch["input_ids"].to(device)
          attention_mask = batch["attention_mask"].to(device)
          start_positions = batch["start_positions"].to(device)
          end_positions = batch["end_positions"].to(device)
          outputs = model(input_ids,
                          attention_mask=attention_mask,
                          start_positions=start_positions,
                          end_positions=end_positions)
          loss = outputs[0]
          loss.backward()
          optim.step()
  
  model.eval()
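
The loop above only trains. If you also want a rough quality signal from the validation set, a minimal sketch of an average-loss pass looks like this (same batch handling as the training loop):

  val_loader = DataLoader(val_dataset, batch_size=16)
  total_loss = 0.0
  with torch.no_grad():
      for batch in val_loader:
          input_ids = batch["input_ids"].to(device)
          attention_mask = batch["attention_mask"].to(device)
          start_positions = batch["start_positions"].to(device)
          end_positions = batch["end_positions"].to(device)
          outputs = model(input_ids,
                          attention_mask=attention_mask,
                          start_positions=start_positions,
                          end_positions=end_positions)
          total_loss += outputs[0].item()
  print(f"average validation loss: {total_loss / len(val_loader):.4f}")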

Note that it is also possible to use Hugging Face's Trainer class with custom TrainingArguments instead of writing the loop yourself. Check the official website; a minimal sketch follows.
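
For reference, that route looks roughly like this (the argument values here are placeholders, not tuned recommendations):

  from transformers import Trainer, TrainingArguments

  training_args = TrainingArguments(
      output_dir="./results",            # where checkpoints are written
      num_train_epochs=3,
      per_device_train_batch_size=16,
      per_device_eval_batch_size=16,
      learning_rate=5e-5,
  )

  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      eval_dataset=val_dataset,
  )
  trainer.train()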

Save your new model

Done training? Save your new model (and its tokenizer) for future use; save_pretrained() takes the target directory as a single argument.

  model.save_pretrained("path/to/directory-name-to-save")
  tokenizer.save_pretrained("path/to/directory-name-to-save")
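
To use the model later, load it back from that directory; the question-answering pipeline is a convenient way to smoke-test it (a quick sketch, reusing the placeholder path above):

  from transformers import pipeline

  qa = pipeline("question-answering",
                model="path/to/directory-name-to-save",
                tokenizer="path/to/directory-name-to-save")
  print(qa(question="What is in this string?",
           context="string with the relevant context"))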

All done!