Model reimplementation is both a pain and a great source of learning.
It’s a pain because it takes time, time that is not well spent on research. It takes time to communicate with the author(s), to read the paper again and again, and to do training and testing. Even if there is a public repo, you might not reproduce the reported F1 scores.
But it’s also a good learning process, because you need to think through the problem from scratch: how to preprocess, batch, regularize, and tune. I will focus on my limited experience of reproducing end-to-end neural models, because they seem to be the low-hanging fruit in the world of reproduction.
A list of models I have successfully reproduced in PyTorch (I somehow missed the TensorFlow train…):
- Decomposable Attention for Natural Language Inference
- Bidirectional Attention Flow for Machine Comprehension
- Deep Contextualized Word Representations (the MC model and the NLI model)
And a list in progress:
- Reasoning with Sarcasm by Reading In-between
- Get To The Point: Summarization with Pointer-Generator Networks
And I will hide the list of models I failed to reproduce…
Yeah, they are all attention models. Once you have done one, it’s easy to make more:)
Here is a list of things I found important:
- Preprocessing
- Evaluation
- Positions of dropout
- Initialization
- Learning algorithm
The empire building is easy to spot, but the fuzziness on the street is more important here. Formulations in the paper are easy to implement. The reason you can’t reach the reported F1 score is mostly something else.
Let’s elaborate on each one of them
Generally speaking, end-to-end models don’t require complicated preprocessing, since the only things you feed the model are embeddings. Preprocessing covers parsing and batching.
For parsing, the first thing I do is tokenization. Tokenization can be handled by NLTK, spaCy, Stanford CoreNLP, and many other packages. What matters is that none of them is perfect. On open-domain text, the output can be bad.
For instance, there can be plenty of cases where “sense-making” gets tokenized into “sense-” and “making”. Such cases make the vocabulary larger than necessary, and you end up with a bunch of out-of-vocabulary tokens. So some extra manual tokenization is needed. To do that, just look into the data first, and then come up with some simple regexes to split the tokens again.
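A minimal sketch of such a second tokenization pass, assuming the first-pass tokenizer has already produced a list of tokens. The splitting rule (hyphens and slashes) and the name `retokenize` are illustrative choices, not the rules used for any of the models above; the right rules depend on what you see in your own data.

```python
import re

# Hypothetical rule: split on hyphens and slashes that the first-pass
# tokenizer left inside or at the edge of a token.
_SPLIT = re.compile(r"([-/])")

def retokenize(tokens):
    """Split each token on hyphens/slashes, keeping the separators as tokens."""
    out = []
    for tok in tokens:
        # re.split with a capturing group keeps the delimiters; drop empty pieces.
        out.extend(piece for piece in _SPLIT.split(tok) if piece)
    return out

print(retokenize(["sense-", "making", "is", "hard"]))
# ['sense', '-', 'making', 'is', 'hard']
```

With a rule like this, “sense” and “making” both map to ordinary vocabulary entries instead of “sense-” becoming an out-of-vocabulary token.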
Beyond tokenization, I rarely do any parsing, as parsers tend to be brittle. Particularly with large training data, these non-tunable preprocessing steps have the potential to hurt performance.
Sentence chunking, I found, works well on plain English text. On open-domain text (e.g. SQuAD) it can be bad. Consider “…(((n+2)*2)^3)…”: I have seen spaCy chop an innocent math expression like this into pieces plenty of times.
Batching is another important thing, as it speeds up training significantly. But in NLP tasks, examples have variable sequence lengths, so padding comes into the picture. In attention models, padding can mess up the alignment, so you have to mask off the padding-related alignment scores. That does not mean you can’t batch without padding, though. Consider the following toy dataset:
A cat is sitting in a tree .
A dog is barking by the tree .
A girl is drinking coffee by the street .
Oh my , I love coffee , medium roast the best .
We can batch by sentence length, so that batches have variable sizes. For instance, sentences 1 and 2 (same length) form one batch, sentence 3 is a batch of its own, and sentence 4 is another. So we end up with 3 batches of different sizes. During training, we can set a second batch size, the minimal batch size for a gradient update, to, say, 2, and delay the gradient update whenever the current batch is smaller than that. In my experiments, this works well.
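The scheme above can be sketched in plain Python. This is only an illustration under one reading of “delay the gradient update”: gradients are assumed to accumulate across small batches until at least `min_update_size` examples have been seen. The function names are made up for this sketch, and the comments mark where `backward()` and the optimizer step would go in a real training loop.

```python
from collections import defaultdict

def bucket_by_length(sentences):
    """Group tokenized sentences so every batch holds sentences of one length."""
    buckets = defaultdict(list)
    for sent in sentences:
        buckets[len(sent)].append(sent)
    # One batch per distinct length, shortest first; no padding needed.
    return [buckets[n] for n in sorted(buckets)]

def update_points(batches, min_update_size=2):
    """Return indices of the batches after which a gradient update would fire."""
    fired, seen = [], 0
    for i, batch in enumerate(batches):
        seen += len(batch)       # a real loop would call loss.backward() here
        if seen >= min_update_size:
            fired.append(i)      # ... and optimizer.step() / zero_grad() here
            seen = 0
    return fired

data = [s.split() for s in [
    "A cat is sitting in a tree .",
    "A dog is barking by the tree .",
    "A girl is drinking coffee by the street .",
    "Oh my , I love coffee , medium roast the best .",
]]
batches = bucket_by_length(data)
print([len(b) for b in batches])  # [2, 1, 1]
print(update_points(batches))     # [0, 2]
```

Any gradients still accumulated when the epoch ends are not flushed in this sketch; whether to apply or discard them is a separate design choice.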
Of course, there are cases where you have to use padding. In TensorFlow’s recurrent cell module, padding is handled implicitly. But in PyTorch, it has to be handled explicitly.
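To make “handled explicitly” concrete for attention, here is a small framework-free sketch of masking padded positions before the softmax. It assumes a 0/1 mask with at least one real (non-padded) position per row; `masked_softmax` is an illustrative name. In PyTorch the same idea is commonly written by filling padded scores with a large negative value (for example with `masked_fill`) before the softmax.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over `scores` that gives zero weight where `mask` is 0."""
    # Padded positions become -inf, so exp() maps them to exactly 0.
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# The last position is padding and must not receive attention.
weights = masked_softmax([2.0, 1.0, 0.5, 0.0], mask=[1, 1, 1, 0])
```

The real positions end up with the same weights they would get if the padded position were not in the sequence at all.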
Evaluation is important because you should always use the “official” evaluation, even though you might train with a customized loss function.
For instance, in the Interpretable Semantic Textual Similarity task, the accuracy evaluation is a complicated script that is not differentiable. For another instance, in SQuAD, the metric is span-overlap F1. The official evaluation function was hidden somewhere that I had not noticed before, so I used my own F1 of token span overlap, which turned out to be consistently 10% lower than the official metric.
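As an illustration of how much can hide inside an “official” metric, below is a sketch of a SQuAD-style answer F1. It is written from memory of the SQuAD v1.1 evaluation script and should be treated as an approximation: the normalization rules and the max over reference answers are the two things a hand-rolled span-overlap F1 can easily miss. For reported numbers, run the official script itself.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, ground_truth):
    """Bag-of-tokens F1 between two normalized answer strings."""
    pred = normalize_answer(prediction).split()
    gold = normalize_answer(ground_truth).split()
    common = Counter(pred) & Counter(gold)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(gold)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction, ground_truths):
    """Score against every reference answer and keep the best match."""
    return max(token_f1(prediction, gt) for gt in ground_truths)
```

For example, `token_f1("the cat", "cat")` is 1.0 under this normalization, while a comparison on raw token positions would count the article as an error.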
3. Positions of Dropout
Without a public repo or contact with the author(s), this is hard to get right. Many papers simply say something like “…we added dropout to every linear layer…”. Then you follow exactly what they said, and your model overfits like hell…
The actual positions of dropout can be interesting. For instance, with ELMo concatenated with GloVe embeddings, the ELMo embeddings first get 0.5 dropout, and after concatenation, the whole embedding receives 0.2 dropout.
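The two-stage dropout described above can be sketched without any framework. This is a toy version on plain Python lists, assuming standard inverted dropout (surviving values are rescaled by 1/(1-p)); the function names are illustrative, and in PyTorch each `dropout` call would typically be an `nn.Dropout` module.

```python
import random

def dropout(vec, p, rng, training=True):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(vec)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in vec]

def embed(elmo_vec, glove_vec, rng, training=True):
    """Apply 0.5 dropout to ELMo alone, concatenate, then 0.2 dropout overall."""
    elmo_vec = dropout(elmo_vec, 0.5, rng, training)  # stage 1: ELMo only
    joint = elmo_vec + list(glove_vec)                # concatenation
    return dropout(joint, 0.2, rng, training)         # stage 2: whole embedding

rng = random.Random(0)
train_vec = embed([1.0, 2.0], [3.0], rng, training=True)
eval_vec = embed([1.0, 2.0], [3.0], rng, training=False)  # unchanged at test time
```

Note that the ELMo features pass through both stages, so they are dropped more aggressively than the GloVe features.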
When projecting down from a high dimension to a lower one, I found it’s better not to have dropout.
During alignment/attention, the similarity calculation typically prefers to use all features, without dropout. But it still depends on whether you are using ELMo and whether the alignment phase is close to the end of the pipeline.
At the very end of the model pipeline, e.g. the final linear layer of the classifier, I found that adding dropout always helps. The intuition is that the rear part of the pipeline has a stronger tendency to overfit.
Lastly, dropout position is something that needs extra tuning. The version that actually works in your reimplementation might look different from the original paper.
Initialization covers input embeddings and parameter initialization.
When I was struggling with my first attention model, no matter what I tried, I got an F1 that was 3% lower than reported. It later turned out that I was using GloVe 100d word embeddings while the author used GloVe 300d. This is another reminder that when reproducing a model, keep in touch with the author(s) :)
When possible, use Xavier (aka Glorot) initialization. For models with only feedforward layers (e.g. the decomposable attention model), you can live with manually scaled random initialization. For models with LSTM/GRU, Xavier initialization is a life saver. In particular, good parameter initialization makes the model less sensitive to randomness.
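For reference, the uniform variant of Xavier/Glorot initialization draws each weight from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)). The sketch below writes that out in plain Python; the layer sizes are arbitrary example values. In PyTorch the same scheme is available as `torch.nn.init.xavier_uniform_`.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix from U(-a, a)."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)
W = xavier_uniform(fan_in=300, fan_out=200, rng=rng)  # example sizes only
```

The bound shrinks as the layer gets wider, which keeps the scale of activations and gradients roughly constant from layer to layer.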
For all the models on the reproduced list, I ended up using a different learning algorithm and a different learning rate from the paper. Using the same algorithm and rate simply did not work. I found this is becoming a common issue when rewriting a TensorFlow model in PyTorch. For instance:
- The decomposable attention paper reported AdaGrad with learning rate 0.05. While I found that works well, Adam with rate 0.0001 works a bit better.
- The BiDAF paper reported AdaDelta with learning rate 0.5. The same setting failed in my case but I found Adam with 0.001 works well.
- The ELMo-enhanced BiDAF uses AdaDelta with learning rate 1.0. The same setting failed in my case but I found Adam with 0.0005 works well.
So be sure to try more algorithms and redo the hyperparameter search.
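Redoing the search can be as simple as a small grid over optimizers and learning rates. The sketch below assumes a hypothetical `train_and_eval(optimizer_name, lr)` callable that trains a model with the given setting and returns a dev-set score where higher is better; that function is not defined here and would wrap your own training loop.

```python
from itertools import product

def grid_search(train_and_eval, optimizers, learning_rates):
    """Try every (optimizer, lr) pair; return the best pair with its dev score."""
    best = None
    for opt, lr in product(optimizers, learning_rates):
        score = train_and_eval(opt, lr)  # hypothetical: train, then score on dev
        if best is None or score > best[2]:
            best = (opt, lr, score)
    return best

# Example call, using the settings mentioned above as candidates:
# best = grid_search(train_and_eval,
#                    ["adagrad", "adadelta", "adam"],
#                    [0.5, 0.05, 0.001, 0.0005, 0.0001])
```

A full grid gets expensive quickly, so in practice it helps to start from the paper's setting and a few Adam learning rates, then widen the search only if none of them works.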