Transfer NLP Documentation¶
Transfer NLP is a framework built on top of PyTorch whose goal is to achieve two kinds of transfer:
easy transfer of code: the framework should be modular enough that you don't have to rewrite everything each time you experiment with a new architecture or a new kind of task
easy transfer learning: the framework should easily interact with pre-trained models and manipulate them in order to fine-tune some of their parts.
You can try the library in this Colab Notebook, which shows how to use the framework on several examples. All examples in the notebook embed in-cell Tensorboard training monitoring!
Installation¶
From source:
You can clone the source from GitHub and run:
python setup.py install
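From PyPI: assuming a release of the package is published there (the package name transfer-nlp is an assumption here, check the repository), you can run
pip install transfer-nlp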
Concepts¶
Experiment¶
The essence of the framework is the ExperimentConfig class, which lets you define an experiment from a JSON file. An experiment contains all the components that you might need:
- Data loader
- Model
- Optimizer
- Trainer
- …
Launching experiments from json config files has two main advantages:
reproducibility: when you are happy with the outcome of an experiment, the JSON file you used defines it entirely, so the experiment is easy to reproduce
ablation studies: when experimenting with new architectures, it has become standard practice to assess how much individual model components contribute to the outcome. JSON files make this process easy: you just remove some components from the file and run the experiment again.
from transfer_nlp.plugins.config import ExperimentConfig
# Defining an experiment and starting the training pipeline
experiment_config = {...} # Config dictionary with components defining your experiment
experiment = ExperimentConfig(experiment_config)
experiment['trainer'].train()
# Using the trained model to make predictions on some inputs
predictor = experiment['predictor']
input_json = {'inputs': []}
results = predictor.json_to_json(input_json=input_json)
JSON file¶
The ExperimentConfig class has been designed so that an experiment can be instantiated with any kind of object you might need.
The experiment instantiator handles three kinds of inputs from the JSON file:
Simple parameters: these are simple user-defined values, such as:
experiment_config = {"lr": 0.01,
"seed": 1,
"num_epochs": 1}
Simple lists: the same as simple parameters, but using lists, e.g.:
experiment_config = {"layer_sizes": [10, 50, 10]}
Complex configuration: here you can instantiate an object of any class. The framework requires the JSON file to contain the name of the class to use, e.g.:
experiment_config = {"lr": 0.01,
"model": {"_name": "MyClassifier"}}
When creating an instance of the class, ExperimentConfig will look for the hyperparameters. If it does not find them and the class defines default values, those are used; otherwise, an exception is thrown. So in this example, if the MyClassifier class takes input_dim and output_dim as hyperparameters, you would define the experiment as:
experiment_config = {"input_dim": 10000,
"output_dim": 5,
"model": {"_name": "MyClassifier"}}
or:
experiment_config = {"model": {"_name": "MyClassifier",
"input_dim": 10000,
"output_dim": 5}}
If one of your objects takes another complex object as an initialization parameter, ExperimentConfig can build it recursively, e.g.:
experiment_config = {
    "my_dataset_splits": {
        "_name": "SurnamesDatasetMLP",
        "data_file": "$HOME/surnames/surnames_with_splits.csv",
        "batch_size": 128,
        "vectorizer": {
            "_name": "SurnamesVectorizerMLP",
            "data_file": "$HOME/surnames/surnames_with_splits.csv"
        }
    }
}
The framework encourages this nested style of definition for clarity. However, if in this example the vectorizer object were also needed to initialize another object in your experiment, you should pull this multi-use object out to the top level. Objects that need it then reference it with the common $ notation. This avoids building several copies of an object when one is enough:
experiment_config = {
    "common_object": {
        "_name": "MyCommonObject",
        "some_parameter": "foo/bar"
    },
    "complex_object_A": {
        "_name": "ComplexObjectA",
        "common_object": "$common_object"
    },
    "complex_object_B": {
        "_name": "ComplexObjectB",
        "common_object": "$common_object"
    }
}
To let Transfer NLP know about your custom classes, you add them to a registry. The framework does not use separate registries for a fixed set of component types such as models, optimizers, etc.: there is a single registry of classes, to which you add any custom class you want the framework to use.
Let's say you have a fancy model class that extends the PyTorch neural network module class. The only thing you need to do is add the class to the registry using the @register_plugin decorator:
import torch

from transfer_nlp.plugins.config import register_plugin

@register_plugin
class MyClassifier(torch.nn.Module):

    def __init__(self, input_dim: int, output_dim: int):
        super(MyClassifier, self).__init__()
        self.fc = torch.nn.Linear(input_dim, output_dim)

    def forward(self, input_tensor):
        # Do complex transformations
        return self.fc(input_tensor)
Finally, to make experiment configuration files easy to share, you can use environment variables for path parameters, and the framework will substitute them automatically:
from pathlib import Path

experiment_config = {
    "my_dataset_splits": {
        "_name": "SurnamesDatasetMLP",
        "data_file": "$HOME/surnames/surnames_with_splits.csv",
        "batch_size": 128,
        "vectorizer": {
            "_name": "SurnamesVectorizerMLP",
            "data_file": "$HOME/surnames/surnames_with_splits.csv"
        }
    }
}
experiment = ExperimentConfig(experiment_config, HOME=str(Path.home() / 'data'))  # Substitutes $HOME with a custom folder
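Since ExperimentConfig also accepts a path to a JSON file (see its signature in the API reference below), you can load the same experiment from disk; the file name here is a hypothetical example:
from pathlib import Path
from transfer_nlp.plugins.config import ExperimentConfig

# 'my_experiment.json' is a hypothetical file containing the config dictionary above
experiment = ExperimentConfig('my_experiment.json', HOME=str(Path.home() / 'data'))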
Final thoughts¶
In its core design, Transfer NLP allows any kind of experiment to be instantiated, run, checkpointed, monitored, etc. The core is not PyTorch-specific, which makes it easy to extend to objects using other machine learning backends such as TensorFlow. Although the framework allows this flexibility, we will focus on PyTorch for the next steps on our end. You are very welcome to contribute TensorFlow building blocks for easily customizable experiments! In the long run we hope that Transfer NLP becomes backend-agnostic and can power any kind of ML experiment.
Data Management Components¶
Vocabularies¶
We provide classes to build vocabularies over datasets. These classes are agnostic to the nature of the symbols with which you fill a dictionary: whether you use vocabularies for tokens, characters, BPE units, etc., you can use the vocabulary classes coupled with a vectorizer of your choice.
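For instance, a character-level vocabulary can be built and queried as follows; this is a minimal sketch using the Vocabulary API that also appears in the Surnames example below:
from transfer_nlp.loaders.vocabulary import Vocabulary

# A character-level vocabulary; the same class works for word tokens, BPE units, labels, etc.
vocab = Vocabulary(unk_token='@')         # '@' stands in for unseen symbols
vocab.add_many(tokens="smith")            # add each character of a surname
index = vocab.lookup_token(token='s')     # symbol -> integer index
symbol = vocab.lookup_index(index=index)  # integer index -> symbol
print(len(vocab), index, symbol)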
Vectorizers¶
Vectorizers take string inputs and convert them to lists of symbols. When implementing your vectorizer, you build the vocabularies needed for your experiment and set them as vectorizer attributes. You also implement the vectorize method, which turns a string input into a list of numbers representing the symbols you chose to represent the text. (See the Surnames Classification example below for a full implementation.)
Loaders¶
Data loaders split the dataset into train, validation and test sets, and create the appropriate PyTorch DataLoaders.
Modeling Components¶
While the framework is flexible enough to deal with any kind of objects, here are some baseline components that you can use:
Models¶
A model extends the PyTorch torch.nn.Module class. You only have to implement the __init__ and forward methods. Your model class has hyperparameters (used at object creation) and parameters for the forward method (used when __call__ is invoked). The parameters that the forward method expects should match the parameters yielded by the PyTorch batch iterator. For example:
import torch

from transfer_nlp.plugins.config import register_plugin

@register_plugin
class MyClassifier(torch.nn.Module):

    def __init__(self, input_dim: int, output_dim: int):
        super(MyClassifier, self).__init__()
        self.fc = torch.nn.Linear(input_dim, output_dim)

    def forward(self, input_tensor: torch.Tensor):
        # Do complex transformations
        return self.fc(input_tensor)
In this example, you need to set up your data loader to yield batches with the key "input_tensor". If the forward method has default parameters that do not appear in the batch, the defaults are used; otherwise they are replaced by the values from the batch.
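As a sketch, a batch yielded by a data loader for the model above might look like this (key names and shapes are illustrative assumptions):
import torch

# Hypothetical batch: the "input_tensor" key matches the name of the forward()
# parameter, so the framework can route it to the model; "y_target" would be
# consumed by the loss function.
batch = {
    "input_tensor": torch.randn(32, 100),    # 32 examples, 100 features
    "y_target": torch.randint(0, 5, (32,))   # 32 class labels among 5 classes
}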
Optimizers¶
Optimizers update the model parameters based on their gradients, following the strategy of the particular optimizer. The framework registry comes with all PyTorch optimizers, so you should be good to go for most cases, e.g.:
experiment_config = {
    "optimizer": {
        "_name": "Adam",
        "params": "model_params"
    }
}
However, if you want to use a custom optimizer, you need to extend the torch.optim.Optimizer class and add it to the registry. For example, if we want to use the optimizer from BERT, we can take this implementation and register it like this:
from torch.optim import Optimizer
from torch.optim.optimizer import required

from transfer_nlp.plugins.config import register_plugin

@register_plugin
class BertAdam(Optimizer):
    def __init__(self, params, lr=required, warmup=-1, t_total=-1, schedule='warmup_linear',
                 b1=0.9, b2=0.999, e=1e-6, weight_decay=0.01, max_grad_norm=1.0):
        defaults = dict(lr=lr, schedule=schedule, warmup=warmup, t_total=t_total,
                        b1=b1, b2=b2, e=e, weight_decay=weight_decay, max_grad_norm=max_grad_norm)
        super(BertAdam, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = closure() if closure is not None else None
        # ... apply the BertAdam update rule to each parameter group ...
        return loss
experiment_config = {
    "optimizer": {
        "_name": "BertAdam",
        "params": "model_params"
    }
}
Trainer Components¶
While the framework is flexible enough to work with any kind of trainer, we encourage using a dedicated library to manage your training loops. We found that Ignite provides everything we could expect from a training management system.
Ignite defines six event types that structure a training loop:
STARTED: start the training loop
EPOCH_STARTED: start an epoch
ITERATION_STARTED: start processing of one batch
ITERATION_COMPLETED: complete processing of one batch
EPOCH_COMPLETED: complete a full epoch
COMPLETED: complete the training loop
Ignite lets you run actions at each of these events by attaching event handlers. Here are some examples of handlers you can add (a sketch of attaching one follows the list):
Track metrics and log them on the terminal
Log metrics, parameters norms, histograms, distributions, etc.. to Tensorboard (via TensorboardX)
Learning rate schedulers: adapt the learning rate at different stages of training. A good example is cyclical learning rate scheduling, which has proven successful in models like ULMFiT
Model checkpointing: save your model periodically if it improves
Early stopping: stop training when the model stops improving
Terminate on NaNs: terminate training when NaN or infinite values are encountered
Timers
…
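As an illustration, here is a minimal sketch of attaching a handler to an Ignite engine; the update function is left abstract, and BasicTrainer wires this machinery up for you:
from ignite.engine import Engine, Events

def update(engine, batch):
    # run one training step on the batch and return the loss value
    ...

trainer = Engine(update)

@trainer.on(Events.EPOCH_COMPLETED)
def log_epoch(engine):
    print(f"Epoch {engine.state.epoch} completed")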
We provide a BasicTrainer class which should cover most cases in the supervised single-task setting. For more complex settings such as multi-task learning, you might want to change the _update and _inference methods to fit several task objectives / loss functions, as sketched below.
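For example, a multi-task variant could be sketched like this; the _update signature follows Ignite's (engine, batch) convention, and the attribute names are assumptions based on the BasicTrainer constructor, not the framework's documented API:
from transfer_nlp.plugins.config import register_plugin
from transfer_nlp.plugins.trainers import BasicTrainer

@register_plugin
class MultiTaskTrainer(BasicTrainer):

    def _update(self, engine, batch):
        # Hypothetical update step combining two task losses;
        # self.model / self.optimizer / self.loss are assumed attributes.
        self.model.train()
        self.optimizer.zero_grad()
        loss_a = self.loss(self.model(batch['x_task_a']), batch['y_task_a'])
        loss_b = self.loss(self.model(batch['x_task_b']), batch['y_task_b'])
        loss = loss_a + loss_b
        loss.backward()
        self.optimizer.step()
        return loss.item()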
API Reference¶
vocabulary¶
loader¶
class transfer_nlp.loaders.loaders.DatasetSplits(train_set: torch.utils.data.Dataset, train_batch_size: int, val_set: torch.utils.data.Dataset, val_batch_size: int, test_set: torch.utils.data.Dataset = None, test_batch_size: int = None)
This module contains an abstract CustomDataset class, on which you can build custom dataset classes.
In your project, you will have to customize your data loader class. To let the framework interact with your class, use the @register_dataset decorator, as in the examples in this module.
config¶
class transfer_nlp.plugins.config.ExperimentConfig(experiment: Union[str, pathlib.Path, Dict], **env)
This module contains all the plugin classes that the framework uses to let a user interact with custom models, data loaders, etc.
The registry pattern used here is inspired by this post: https://realpython.com/primer-on-python-decorators/
trainers¶
class transfer_nlp.plugins.trainers.BasicTrainer(model: torch.nn.Module, dataset_splits: transfer_nlp.loaders.loaders.DatasetSplits, loss: torch.nn.Module, optimizer: torch.optim.Optimizer, metrics: Dict[str, ignite.metrics.Metric], experiment_config: transfer_nlp.plugins.config.ExperimentConfig, device: str = None, num_epochs: int = 1, seed: int = None, cuda: bool = None, loss_accumulation_steps: int = 4, scheduler: Any = None, regularizer: transfer_nlp.plugins.regularizers.RegularizerABC = None, gradient_clipping: float = 1.0, output_transform=None, tensorboard_logs: str = None, embeddings_name: str = None, finetune: bool = False)
This class contains the abstraction interface to customize runners. For the training loop, we use the engine logic from pytorch-ignite.
Check the experiments folder for examples of experiment JSON files.
predictors¶
class transfer_nlp.plugins.predictors.PredictorABC(vectorizer: transfer_nlp.loaders.vectorizers.Vectorizer, model: torch.nn.Module)
decode(*args, **kwargs) → List[Dict]
Return an output dictionary for every example in the batch.
json_to_data(input_json: Dict) → Dict
Transform a JSON entry into a data example, the same as what the __getitem__ method of the data loader produces, except that it does not output the expected label as in the supervised setting.
json_to_json(input_json: Dict) → Dict[str, Any]
Full prediction: input JSON –> data example –> predictions –> JSON output.
Surnames Classification¶
A use case that arises frequently in the book NLP with PyTorch is surname classification: given a dataset of surnames from different countries, the task is to predict the country of origin.
Vectorizer¶
The most straightforward way to represent a surname is its one-hot character encoding:
import numpy as np
import pandas as pd

from transfer_nlp.loaders.vectorizers import Vectorizer
from transfer_nlp.loaders.vocabulary import Vocabulary
from transfer_nlp.plugins.config import register_plugin

@register_plugin
class MyVectorizer(Vectorizer):

    def __init__(self, data_file: str):
        super().__init__(data_file=data_file)
        df = pd.read_csv(data_file)

        data_vocab = Vocabulary(unk_token='@')
        target_vocab = Vocabulary(add_unk=False)

        # Add surname characters and nationalities to the vocabularies
        for index, row in df.iterrows():
            surname = row.surname
            nationality = row.nationality
            data_vocab.add_many(tokens=surname)
            target_vocab.add_token(token=nationality)

        self.data_vocab = data_vocab
        self.target_vocab = target_vocab

    def vectorize(self, input_string: str) -> np.array:
        encoding = np.zeros(shape=len(self.data_vocab), dtype=np.float32)
        for character in input_string:
            encoding[self.data_vocab.lookup_token(token=character)] = 1

        return encoding
Data loader¶
Let's create a data loader and set up the PyTorch loaders for the train, validation and test splits.
import pandas as pd

from transfer_nlp.loaders.loaders import DatasetSplits, DataFrameDataset, DatasetHyperParams
from transfer_nlp.loaders.vectorizers import Vectorizer
from transfer_nlp.plugins.config import register_plugin

@register_plugin
class MyDataLoader(DatasetSplits):

    def __init__(self, data_file: str, batch_size: int, dataset_hyper_params: DatasetHyperParams):
        self.df = pd.read_csv(data_file)
        self.vectorizer: Vectorizer = dataset_hyper_params.vectorizer

        # Vectorize the inputs and look up the target indices
        self.df['x_in'] = self.df.apply(lambda row: self.vectorizer.vectorize(row.surname), axis=1)
        self.df['y_target'] = self.df.apply(lambda row: self.vectorizer.target_vocab.lookup_token(row.nationality), axis=1)

        train_df = self.df[self.df.split == 'train'][['x_in', 'y_target']]
        val_df = self.df[self.df.split == 'val'][['x_in', 'y_target']]
        test_df = self.df[self.df.split == 'test'][['x_in', 'y_target']]

        super().__init__(train_set=DataFrameDataset(train_df), train_batch_size=batch_size,
                         val_set=DataFrameDataset(val_df), val_batch_size=batch_size,
                         test_set=DataFrameDataset(test_df), test_batch_size=batch_size)
Model¶
A simple modeling approach is to feed the character one-hot encoding to a multi-layer perceptron:
import torch

from transfer_nlp.loaders.loaders import DatasetSplits
from transfer_nlp.plugins.config import register_plugin
# Assumption: ObjectHyperParams is the framework's hyper-parameter helper class
from transfer_nlp.plugins.helpers import ObjectHyperParams

@register_plugin
class ModelHyperParams(ObjectHyperParams):

    def __init__(self, dataset_splits: DatasetSplits):
        super().__init__()
        self.input_dim = len(dataset_splits.vectorizer.data_vocab)
        self.output_dim = len(dataset_splits.vectorizer.target_vocab)

@register_plugin
class MultiLayerPerceptron(torch.nn.Module):

    def __init__(self, model_hyper_params: ObjectHyperParams, hidden_dim: int):
        super(MultiLayerPerceptron, self).__init__()
        self.input_dim = model_hyper_params.input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = model_hyper_params.output_dim

        self.fc1 = torch.nn.Linear(in_features=self.input_dim, out_features=hidden_dim)
        self.fc2 = torch.nn.Linear(in_features=hidden_dim, out_features=self.output_dim)

    def forward(self, x_in: torch.Tensor) -> torch.Tensor:
        """
        Linear -> ReLU -> Linear (+ softmax if probabilities are needed)
        :param x_in: size (batch, input_dim)
        :return: logits of size (batch, output_dim)
        """
        intermediate = torch.nn.functional.relu(self.fc1(x_in))
        output = self.fc2(intermediate)

        if self.output_dim == 1:
            output = output.squeeze()

        return output
Predictor¶
To use the model in inference mode, we create a specific predictor object:
from typing import Any, Dict, List

import torch

from transfer_nlp.plugins.config import register_plugin
from transfer_nlp.plugins.predictors import PredictorABC, PredictorHyperParams

@register_plugin
class MyPredictor(PredictorABC):

    def __init__(self, predictor_hyper_params: PredictorHyperParams):
        super().__init__(predictor_hyper_params=predictor_hyper_params)

    def json_to_data(self, input_json: Dict) -> Dict:
        return {
            'x_in': torch.tensor([self.vectorizer.vectorize(input_string=input_string) for input_string in input_json['inputs']])}

    def output_to_json(self, outputs: List) -> Dict[str, Any]:
        return {
            "outputs": outputs}

    def decode(self, output: torch.Tensor) -> List[Dict[str, Any]]:
        probabilities = torch.nn.functional.softmax(output, dim=1)
        probability_values, indices = probabilities.max(dim=1)

        return [{
            "class": self.vectorizer.target_vocab.lookup_index(index=int(index)),
            "probability": float(probability)} for probability, index in zip(probability_values, indices)]
Experiment¶
Now that all the classes are properly defined, we can specify an experiment in a config file and train it:
from transfer_nlp.plugins.config import ExperimentConfig
experiment_config = {
    "predictor": {
        "_name": "MyPredictor",
        "data": "$my_dataset_splits",
        "model": "$model"
    },
    "my_dataset_splits": {
        "_name": "MyDataLoader",
        "data_file": "$HOME/surnames/surnames_with_splits.csv",
        "batch_size": 128,
        "vectorizer": {
            "_name": "MyVectorizer",
            "data_file": "$HOME/surnames/surnames_with_splits.csv"
        }
    },
    "model": {
        "_name": "MultiLayerPerceptron",
        "hidden_dim": 100,
        "data": "$my_dataset_splits"
    },
    "optimizer": {
        "_name": "Adam",
        "lr": 0.01,
        "params": {
            "_name": "TrainableParameters"
        }
    },
    "scheduler": {
        "_name": "ReduceLROnPlateau",
        "patience": 1,
        "mode": "min",
        "factor": 0.5
    },
    "trainer": {
        "_name": "BasicTrainer",
        "model": "$model",
        "dataset_splits": "$my_dataset_splits",
        "loss": {
            "_name": "CrossEntropyLoss"
        },
        "optimizer": "$optimizer",
        "gradient_clipping": 0.25,
        "num_epochs": 5,
        "seed": 1337,
        "regularizer": {
            "_name": "L1"
        },
        "tensorboard_logs": "$HOME/surnames/tensorboard/mlp",
        "metrics": {
            "accuracy": {
                "_name": "Accuracy"
            },
            "loss": {
                "_name": "LossMetric",
                "loss_fn": {
                    "_name": "CrossEntropyLoss"
                }
            }
        }
    }
}
# Configure the experiment
experiment = ExperimentConfig(experiment_config)
# Launch the training loop
experiment['trainer'].train()
# Use the predictor for inference
input_json = {"inputs": ["Zhang", "Mueller", "Rastapopoulos"]}
output_json = experiment['predictor'].json_to_json(input_json=input_json)
License¶
MIT License
Copyright (c) 2019 Feedly
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Contact¶
Contact peter.martigny@gmail.com