RoBERTa Sets Upd: WALS
For languages not well represented by an English-centric model like roberta-base, you can use XLM-RoBERTa. This model is pretrained on text from 100 different languages, making it much more suitable for working with the diverse set of languages found in WALS. The setup code is almost identical; you would just replace model_name = "roberta-base" with model_name = "xlm-roberta-base".
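As a rough sketch of that swap, the snippet below picks a checkpoint name per language and loads it through the `Auto*` classes from `transformers`. The helper names `choose_checkpoint` and `load_encoder`, and the rule of sending only English to `roberta-base`, are assumptions made for illustration, not something prescribed by WALS or by the model authors.

```python
def choose_checkpoint(language_code: str) -> str:
    """Return a Hugging Face checkpoint name for a language (illustrative rule)."""
    # Assumption: only English is routed to the English-centric checkpoint.
    if language_code.lower() in {"en", "eng"}:
        return "roberta-base"
    return "xlm-roberta-base"


def load_encoder(model_name: str):
    """Load the tokenizer and encoder for a checkpoint name.

    Requires the `transformers` package and downloads weights on first use.
    """
    from transformers import AutoModel, AutoTokenizer  # imported lazily

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    return tokenizer, model
```

Calling `load_encoder(choose_checkpoint("sw"))` would fetch the multilingual checkpoint, while the rest of the pipeline stays unchanged because both checkpoints are loaded through the same two calls.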
RoBERTa is an iteration of the BERT model that removed the "Next Sentence Prediction" objective and trained on much larger datasets with longer sequences. While powerful, its "sets" of weights are initially optimized for the languages present in its training data (for roberta-base, predominantly English).
3. Developing the "WALS-Updated" Article Set
from pycldf import Dataset
import pandas as pd
def __len__(self):
    return len(self.texts)
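The `__len__` method above reads like a fragment of a dataset wrapper that pairs texts with labels. A minimal sketch of such a class follows; the class name `TextLabelDataset`, its constructor arguments, and the idea that labels are integer-encoded WALS feature values are all assumptions, since the original class is not shown. In practice it would usually subclass `torch.utils.data.Dataset`; it is kept dependency-free here.

```python
class TextLabelDataset:
    """Pairs texts with integer labels and tokenizes one item on access."""

    def __init__(self, texts, labels, tokenizer, max_length=128):
        if len(texts) != len(labels):
            raise ValueError("texts and labels must have the same length")
        self.texts = list(texts)
        self.labels = list(labels)
        self.tokenizer = tokenizer  # any callable returning a mapping of features
        self.max_length = max_length

    def __len__(self):
        # Number of (text, label) examples in the dataset.
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize lazily so the raw texts stay small in memory.
        features = dict(
            self.tokenizer(
                self.texts[idx],
                truncation=True,
                max_length=self.max_length,
            )
        )
        features["labels"] = self.labels[idx]
        return features
```

Any callable with a compatible signature works as `tokenizer`, including a Hugging Face tokenizer object, which makes the class easy to exercise without downloading a model.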
from transformers import TrainingArguments, Trainer
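from transformers import RobertaTokenizer, RobertaModel

# Load the pretrained tokenizer and encoder weights for roberta-base
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')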
While classification is the most common approach, the combination of WALS and RoBERTa isn't limited to it. The keyword "sets upd" could also refer to other configurations.
trainer.train()
If you have no GPU, you can use Google Colab’s free GPU or a cloud provider (AWS, GCP, Azure) to accelerate training.
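Before moving to Colab or a cloud instance, it can help to check what the local machine offers. The helper below is a small sketch; the name `pick_device` is hypothetical, and it falls back to the CPU when PyTorch is not installed. `Trainer` normally handles device placement itself, so this check is only informational.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU can be used from here.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```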