BERTino: an Italian DistilBERT model

The recent introduction of Transformer language representation models has allowed great improvements in many natural language processing (NLP) tasks. However, while the performance achieved by these architectures is impressive, their usability is limited by the large number of parameters in their networks, which results in high computational and memory demands. In this work we present BERTino, a DistilBERT model that aims to be the first lightweight alternative to the BERT architecture specific to the Italian language. We evaluated BERTino on the Italian ISDT, Italian ParTUT and Italian WikiNER tasks and on a multi-class sentence classification task, obtaining F1 scores comparable to those of a BERT-base model with a remarkable improvement in training and inference speed.


Introduction
In recent years the introduction of Transformer language models has allowed great improvements in many natural language processing (NLP) tasks. Among Transformer language models, BERT (Devlin et al., 2018) established itself as a high-performing and flexible alternative, able to transfer knowledge from general tasks to downstream ones thanks to the pre-training/fine-tuning approach.
The context-dependent text representations provided by this model proved to be a richer source of information than static textual embeddings such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), FastText (Bojanowski et al., 2016) or Sent2Vec (Pagliardini et al., 2018). However, despite the substantial improvements brought by BERT to the NLP field, the large number of parameters in its network makes its use prohibitive on resource-limited devices, both at training and at inference time, and comes with a non-negligible environmental impact. To address this problem, recent research has proposed several approaches to reduce the size of the BERT network, such as DistilBERT (Sanh et al., 2019), MobileBERT (Sun et al., 2020) or pruning (Gordon et al., 2020; McCarley et al., 2019).
The experiments conducted in Virtanen et al. (2019), de Vries et al. (2019) and Martin et al. (2020) demonstrate that monolingual BERT models outperform the equivalent multilingual BERT architecture (Devlin et al., 2018), justifying the effort required to pre-train Transformer models for specific languages.
In this work we present BERTino, a DistilBERT model pre-trained on a large Italian corpus. This model aims to be the first general-domain, lightweight alternative to BERT specific to the Italian language.
We evaluate BERTino on two Part-of-Speech tagging tasks, Italian ISDT (Bosco et al., 2000) and Italian ParTUT (Sanguinetti and Bosco, 2015), on the Italian WikiNER (Nothman et al., 2012) Named Entity Recognition task and on a multi-class sentence classification task. Comparing the scores obtained by BERTino, its teacher model and GilBERTo, BERTino performs comparably to the other two architectures while markedly decreasing fine-tuning and evaluation time. In Section 2 we discuss related work with a focus on DistilBERT, in Section 3 we describe the corpus and the pre-training, and in Section 4 we report the results.

Related work
In this section we give a brief outline of the inner workings of Transformers, then we overview some lightweight alternatives to BERT.
The introduction of Transformer blocks (Vaswani et al., 2017) in language representation models is a keystone of recent NLP. The attention mechanism adopted by the Transformer encoder provides contextualized representations of words, which proved to be a richer source of information than static word embeddings. The attention mechanism processes all words in an input sentence simultaneously, allowing computations to be parallelized. This is a non-negligible improvement over models like ELMo (Peters et al., 2018), which provide contextualized text representations using a bidirectional LSTM network that processes each word sequentially.
Among language models that adopt the Transformer architecture, BERT (Devlin et al., 2018) established itself as a flexible and powerful alternative, setting a new state of the art on 11 NLP tasks at the time of publication. In its base version, this model adopts a hidden size of 768 and is composed of 12 layers (Transformer blocks), each with 12 attention heads, for a total of 110 million parameters. As outlined in Section 1, the large number of parameters in BERT's network can make deployment on resource-limited devices prohibitive, and the computational effort is not negligible. For this reason, researchers have devoted great effort to proposing smaller but valid alternatives to the base version of BERT. Gordon et al. (2020) study how weight pruning affects the performance of BERT, concluding that a low level of pruning (30-40% of weights) only marginally affects the natural language understanding capabilities of the network.
McCarley et al. (2019) conduct a similar study on BERT weight pruning, applied specifically to the Question Answering downstream task. Sanh et al. (2019) propose DistilBERT, a smaller BERT architecture trained using the knowledge distillation technique (Hinton et al., 2015). Since the model that we propose relies on this training technique, we give a brief description of knowledge distillation in Section 2.1. DistilBERT leverages the inductive biases learned by larger models during pre-training through a triple loss combining language modeling, distillation and cosine-distance losses. The DistilBERT architecture has 40% fewer parameters than the teacher model but retains 97% of its natural language understanding performance while being 60% faster. Sun et al. (2020) propose MobileBERT, a compressed BERT model that reduces the hidden size instead of the depth of the network. Like DistilBERT, MobileBERT uses knowledge distillation during pre-training but adopts a BERT-large model with an inverted bottleneck as teacher.

Knowledge distillation
Knowledge distillation (Hinton et al., 2015) is a training technique that leverages the outputs of a large network (called the teacher) to train a smaller network (the student). In general, in the context of supervised learning, a classifier is trained so that the output probability distribution it provides is as similar as possible to the one-hot vector representing the gold label, by minimizing the cross-entropy loss between the two. Receiving a one-hot vector as learning signal, a model evaluated on the training set will produce an output distribution with a near-one value for the correct class and near-zero values for all the other classes. Some of these near-zero probabilities, however, are larger than the others and reflect the generalization capabilities of the model. The idea of knowledge distillation is to substitute the usual one-hot vector representing the gold label with the output distribution of the teacher model in the computation of the cross-entropy loss, in order to leverage the information contained in the near-zero values of the teacher's output distribution. Formally, the knowledge distillation loss is computed as:

L_KD = - Σ_i t_i · log(s_i)    (1)

with t_i being the output distribution of the teacher model for the i-th observation, and s_i being the output distribution of the student model for the i-th observation.
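To make the formula concrete, here is a minimal PyTorch sketch of this loss (not the authors' code; the temperature argument is an optional generalization from Hinton et al. (2015), not something stated above):

```python
import torch
import torch.nn.functional as F

def knowledge_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between the teacher's and the student's output distributions.

    student_logits, teacher_logits: tensors of shape (batch, num_classes).
    With temperature=1 this is exactly L_KD = -sum_i t_i * log(s_i); a higher
    temperature further softens both distributions.
    """
    t = F.softmax(teacher_logits / temperature, dim=-1)          # teacher distribution t_i
    log_s = F.log_softmax(student_logits / temperature, dim=-1)  # log of student distribution s_i
    return -(t * log_s).sum(dim=-1).mean()

# Example: a batch of 2 observations over 5 classes.
student_logits = torch.randn(2, 5)
teacher_logits = torch.randn(2, 5)
print(knowledge_distillation_loss(student_logits, teacher_logits))
```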

BERTino
As outlined in Section 1, in this work we propose BERTino, a DistilBERT model pre-trained on a general-domain Italian corpus. As with BERT-like architectures, BERTino is task-agnostic and can be fine-tuned for any downstream task. In this section we report the details of the pre-training we conducted.

Corpus
The corpus that we used to pre-train BERTino is the union of PAISÀ (Lyding et al., 2014) and ItWaC (Baroni et al., 2009), two general-domain Italian corpora scraped from the web. While the former is made up of short sentences, the latter includes a considerable amount of long sentences. Since our model, like other BERT architectures, can receive input sequences of at most 512 tokens, we applied a pre-processing scheme to the ItWaC corpus: we split sentences with more than 400 words into sub-sentences, breaking at full stops so that each chunk keeps the semantic sense of a sentence. In this way, most of the long sentences contained in ItWaC are split into sub-sentences containing fewer than 512 tokens. A small number of the resulting sentences still contain more than 512 tokens; these are useful for training the parameters relative to the last positions of the network. The PAISÀ corpus counts 7.5 million sentences and 223.5 million words. The ItWaC corpus counts 6.5 million sentences and 1.6 billion words after pre-processing. Our final corpus counts 14 million sentences and 1.9 billion words, for a total of 12GB of text.
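As an illustration of the splitting step, a rough sketch follows (our own hypothetical helper, not the authors' actual pre-processing script) that breaks long texts at full stops:

```python
def split_at_full_stops(text, max_words=400):
    """Split a text with more than max_words words into sub-sentences,
    breaking only at full stops so that each chunk remains a meaningful unit.
    A single overly long sentence may still exceed the limit, as noted above."""
    if len(text.split()) <= max_words:
        return [text]
    pieces = [p.strip() for p in text.split(".") if p.strip()]
    chunks, current, count = [], [], 0
    for piece in pieces:
        n = len(piece.split())
        if current and count + n > max_words:
            chunks.append(". ".join(current) + ".")
            current, count = [], 0
        current.append(piece)
        count += n
    if current:
        chunks.append(". ".join(current) + ".")
    return chunks

# Example: a 1200-word text is split into three 400-word chunks.
long_text = ". ".join("parola " * 100 for _ in range(12)) + "."
print([len(c.split()) for c in split_at_full_stops(long_text)])
```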

Pre-training
Teacher model The teacher model that we selected to perform knowledge distillation during the pre-training of BERTino is dbmdz/bert-base-italian-xxl-uncased, released by the Bavarian State Library. We chose this model because, to our knowledge, it is the Italian BERT-base model trained on the largest corpus (81 GB of text). Following Sanh et al. (2019), we initialized the weights of our student model by taking one layer out of two from the teacher model.
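A sketch of this initialization in Hugging Face Transformers terms follows (our own reconstruction, not the authors' script; the parameter-name mapping between BertLayer and DistilBERT's TransformerBlock reflects the library's naming and is an assumption here):

```python
from transformers import AutoModel, DistilBertConfig, DistilBertModel

teacher = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-uncased")
student = DistilBertModel(DistilBertConfig(
    vocab_size=teacher.config.vocab_size,
    max_position_embeddings=teacher.config.max_position_embeddings,
    dim=768, n_layers=6, n_heads=12,
    hidden_dim=teacher.config.intermediate_size,
))

# BertLayer -> DistilBERT TransformerBlock parameter-name mapping (assumed).
NAME_MAP = {
    "attention.self.query": "attention.q_lin",
    "attention.self.key": "attention.k_lin",
    "attention.self.value": "attention.v_lin",
    "attention.output.dense": "attention.out_lin",
    "attention.output.LayerNorm": "sa_layer_norm",
    "intermediate.dense": "ffn.lin1",
    "output.dense": "ffn.lin2",
    "output.LayerNorm": "output_layer_norm",
}

teacher_state = teacher.state_dict()
state = student.state_dict()

# Embeddings are copied directly from the teacher.
for key in ("embeddings.word_embeddings.weight",
            "embeddings.position_embeddings.weight",
            "embeddings.LayerNorm.weight",
            "embeddings.LayerNorm.bias"):
    state[key] = teacher_state[key]

# Take one teacher layer out of two (0, 2, 4, ..., 10) to fill the 6 student layers.
for s_idx, t_idx in enumerate(range(0, teacher.config.num_hidden_layers, 2)):
    for bert_name, distil_name in NAME_MAP.items():
        for suffix in ("weight", "bias"):
            src = f"encoder.layer.{t_idx}.{bert_name}.{suffix}"
            dst = f"transformer.layer.{s_idx}.{distil_name}.{suffix}"
            state[dst] = teacher_state[src]

student.load_state_dict(state)
```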
Loss function We report the loss function used to pre-train BERTino:

L = α · L_KD + β · L_MLM + γ · L_COS    (2)

with L_KD being the knowledge distillation loss described in Equation 1, L_MLM being the masked language modeling loss and L_COS being the cosine embedding loss. Sanh et al. (2019) describe the cosine embedding loss as useful to "align the directions of the student and teacher hidden states vectors". When choosing the weights of the three loss terms, we wanted our model to learn from the teacher and by itself in equal measure, so we set the same weight for both L_KD and L_MLM. Moreover, we considered the alignment of student and teacher hidden state vectors marginal for our objective, so we set L_COS to 10% of the total loss.
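A minimal sketch of how the three terms combine, assuming the weights implied by the text (equal weights for L_KD and L_MLM and 10% of the total for L_COS, i.e. 0.45 / 0.45 / 0.10; the exact coefficients are our inference, not stated explicitly):

```python
import torch
import torch.nn.functional as F

def bertino_loss(student_logits, teacher_logits, mlm_labels,
                 student_hidden, teacher_hidden,
                 w_kd=0.45, w_mlm=0.45, w_cos=0.10):
    """Weighted sum of distillation, masked-LM and cosine-embedding losses.

    student_logits, teacher_logits: (num_tokens, vocab_size) output scores.
    mlm_labels: (num_tokens,) gold token ids, -100 for unmasked positions.
    student_hidden, teacher_hidden: (num_tokens, dim) last hidden states.
    """
    # L_KD: soft cross-entropy against the teacher's output distribution.
    t = F.softmax(teacher_logits, dim=-1)
    l_kd = -(t * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()
    # L_MLM: standard masked language modeling cross-entropy.
    l_mlm = F.cross_entropy(student_logits, mlm_labels, ignore_index=-100)
    # L_COS: align the directions of student and teacher hidden-state vectors.
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    l_cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)
    return w_kd * l_kd + w_mlm * l_mlm + w_cos * l_cos

# Example shapes: 8 tokens, vocabulary of 20, hidden size 768.
logits_s, logits_t = torch.randn(8, 20), torch.randn(8, 20)
labels = torch.randint(0, 20, (8,))
hid_s, hid_t = torch.randn(8, 768), torch.randn(8, 768)
print(bertino_loss(logits_s, logits_t, labels, hid_s, hid_t))
```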
Architecture The architecture of BERTino is the same as in DistilBERT. Our model adopts a hidden size of 768 and is composed of 6 layers (Transformer blocks), each with 12 attention heads. BERTino's network thus has half the layers of the BERT-base architecture.
Training details To pre-train BERTino we used a batch size of 6 and an initial learning rate of 5 × 10^-4, adopting Adam (Kingma and Ba, 2014) as optimization algorithm. We chose a batch size of 6 due to the limited computational resources available. The results described in Section 4 demonstrate that this small batch size is sufficient to obtain a valid pre-trained model. We trained our model on 4 Tesla K80 GPUs for 3 epochs, requiring 45 days of computation in total. For some aspects of the training, we relied on the Huggingface Transformers library (Wolf et al., 2019).

Results
We tested the performance of BERTino on benchmark datasets: the Italian ISDT (Bosco et al., 2000) and Italian ParTUT (Sanguinetti and Bosco, 2015) Part-of-Speech tagging tasks, and the Italian WikiNER (Nothman et al., 2012) Named Entity Recognition task. To complete the evaluation of the model, we also tested it on a multi-class sentence classification task. In particular, we focused on intent detection, a task specific to the context of Dialogue Systems, creating a novel Italian dataset which is freely available in our repository (https://github.com/indigo-ai/BERTino). The dataset that we propose collects 2786 real-world questions (2228 for training and 558 for testing) submitted to a digital conversational agent. The total number of classes in the dataset is 139.
For the first two tasks mentioned, we fine-tuned our model on the training set for 4 epochs with a batch size of 32 and a learning rate of 5 × 10^-5; for the NER task we performed a 5-fold split of the dataset and fine-tuned BERTino for 2 epochs per fold with a batch size of 32 and a learning rate of 5 × 10^-5; for the multi-class classification task we fine-tuned our model for 14 epochs on the training set with a batch size of 32 and a learning rate of 5 × 10^-5. To compare the results obtained, we fine-tuned the teacher model and a GilBERTo model (available at https://github.com/idb-ita/GilBERTo) on the same tasks with the same hyperparameters. Tables 1, 2, 3 and 4 collect the F1 scores obtained in these experiments together with fine-tuning and evaluation times. All the scores reported represent the average over three different runs. The results show that the teacher model slightly outperforms BERTino, with an increase in F1 score of 0.29%, 5.15%, 1.37% and 1.88% on the tasks analysed. However, BERTino is a considerably faster network than both the teacher model and GilBERTo, taking almost half the time to perform both fine-tuning and evaluation. We can conclude that BERTino retains most of the natural language understanding capabilities of the teacher model, even with a much smaller architecture.
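For reference, a sketch of the fine-tuning setup for the POS-tagging experiments with the hyperparameters above (4 epochs, batch size 32, learning rate 5 × 10^-5). The checkpoint name indigo-ai/BERTino and the label count are assumptions for illustration; dataset loading and tokenization are omitted:

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

# Checkpoint name assumed from the paper's repository; adjust if it differs.
model_name = "indigo-ai/BERTino"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=17)  # e.g. a UPOS tag set

# Hyperparameters reported for the Italian ISDT / ParTUT experiments.
args = TrainingArguments(
    output_dir="bertino-pos",
    num_train_epochs=4,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)
# trainer = Trainer(model=model, args=args, train_dataset=..., tokenizer=tokenizer)
# trainer.train()
```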

Conclusions
In this work we presented BERTino, a DistilBERT model which aims to be the first lightweight alternative to BERT specific to the Italian language. Our model has been trained on a general-domain corpus and, like its larger counterparts, can be fine-tuned with good performance on a wide range of tasks. BERTino showed performance comparable to both the teacher model and GilBERTo on the Italian ISDT, Italian ParTUT, Italian WikiNER and multi-class sentence classification tasks while taking almost half the time to fine-tune, demonstrating that it is a valid lightweight alternative to BERT-base models for the Italian language.