Title Generator

The point of this post is pretty straightforward: apply an encoder-decoder Transformer model to yet another text generation task. The problem at hand is to generate a title given the abstract of a scientific paper.

I’m trying to figure out how to have Jupyter notebooks on WordPress; the export to HTML seems fine, but it’s hard to run. This notebook can be downloaded from here.

Load and process data

We will use the Arxiv dataset which consists of 1.7 million articles with parsed information. The dataset is quite large at over 3GB, and we’ll be only interested in using a small subset. Thus, we first process this by only parsing out a set number of articles’ title and abstract which are from a specific scientific field.

We use read_json from pandas to process the data file, remove all other columns and create a single DataFrame after removing leading/trailing white space and newline characters.

In [1]:
# Pytorch
import torch
from torch.utils.data import Dataset, DataLoader, random_split
from torch.nn.utils.rnn import pad_sequence
from torch import nn
import torch.optim as optim

import pandas as pd
from tqdm.notebook import tqdm
from matplotlib import pyplot as plt
import math
import numpy as np
In [2]:
# Let's set some variables for the data processing part
FIELD = 'astro'
In [3]:
# We read a few lines at a time; note that if `chunksize` is none, the whole thing is read to memory
chunks = pd.read_json('arxiv-metadata-oai-snapshot.json', lines=True, chunksize=2048)

astro_dfs = []
num_articles = 0
for entries in tqdm(chunks):
    # We find all articles of a certain field, and then discard the rest of the columns
            ['id', 'submitter', 'authors', 'comments', 'journal-ref', 'doi',
            'report-no', 'categories', 'license', 'versions',
            'update_date', 'authors_parsed'], axis='columns'
    num_articles += len(astro_dfs[-1])

    # We stop once we have enough of a dataset
    if num_articles > TOTAL_ARTICLES:
In [4]:
# Make one dataframe, and we remove the \n in the text
data_df = pd.concat(astro_dfs, ignore_index=True)
data_df = data_df.replace(r'\n',' ', regex=True)
data_df['title'] = data_df['title'].str.strip()
data_df['abstract'] = data_df['abstract'].str.strip()
In [5]:
title abstract
0 The Spitzer c2d Survey of Large, Nearby, Inste… We discuss the results from the combined IRAC …
1 Spectroscopic Observations of the Intermediate… Results from spectroscopic observations of the…
2 ALMA as the ideal probe of the solar chromosphere The very nature of the solar chromosphere, its…
3 Astrophysical gyrokinetics: kinetic and fluid … We present a theoretical framework for plasma …
4 Inference on white dwarf binary systems using … We report on the analysis of selected single s…
301116 The imprint of massive black hole formation mo… The formation, merging, and accretion history …
301117 Resolving the Stellar Populations in the Circu… We investigate the stellar populations in the …
301118 UVES – VLT High Resolution Spectroscopy of GRB… We present early time, high resolution spectro…
301119 Far-Infrared detection of H2D+ toward Sgr B2 We report on the first far-IR detection of H2D…
301120 DE CVn: A bright, eclipsing red dwarf – white … Close white dwarf – red dwarf binaries must ha…

301121 rows × 2 columns

Tokenize Data

With the text in an easy to manipulate DataFrame, we have to convert the text to a numerical representation. Here, we use pre-trained tokenizers from Huggingface rather than build/train our own. Roughly speaking, a tokenizer takes a word and maps it to an integer. However, often times, modern tokenizers “understand” parts of words and would also parse out the prefix/suffix of a word too.

For simplicity, we’re only going to use 384 tokens for the abstract and 64 tokens for the title, and discard the rest. This is usually not a problem for the title (as it is quite short), but will generally truncate the abstract. However, judging from the box plot, the number of tokens seem fine. Ideally, our model would incorporate the full abstract, however the amount of compute power needed will be a lot larger due to the quadratic scaling of the transformers.

In [9]:
from transformers import AutoTokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# This is fairly slow,
data_tokenized = pd.DataFrame()
data_tokenized['abstract'] = data_df['abstract'].apply(
    lambda w: torch.Tensor(tokenizer(w, truncation=True, max_length=256+128)['input_ids']).int()
data_tokenized['title'] = data_df['title'].apply(
    lambda w: torch.Tensor(tokenizer(w, truncation=True, max_length=64)['input_ids']).int()
In [13]:
plt.boxplot(data_tokenized['abstract'].str.len(), vert=False)
plt.boxplot(data_tokenized['title'].str.len(), vert=False)

Data Loader

With the data in a numerical form, we can start loading them into PyTorch Dataset/DataLoader objects. This will help facilitate easier training.

Here, we note that we want to generate datasets which are applicable for “teacher forcing” learning. In our case, we want to deduce a sequence $(x_1, x_2, \ldots, x_{n+1})$ from $(x_0, \ldots, x_n)$. See figure (source) below. In our case the French corresponds to the abstract while the English is the title. This means that we would need to shift the titles by 1 in the collate_fn function for the DataLoader. We also take the opportunity here to perform the train/validation split. drawing

In [14]:
class ArxivDataset(Dataset):
    def __init__(self, dataframe):
        self.df = dataframe

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        Returns abstract, then title
        return self.df.iloc[idx, 0], self.df.iloc[idx, 1]

def pad_collate(batch):
    my_abstracts, my_titles = zip(*batch)

    # Pad sequence together, with 0 added
    my_abstracts = pad_sequence(my_abstracts, batch_first=True)
    my_titles = pad_sequence(my_titles, batch_first=True)

    # For teacher training, need to shift
    titles_inp = my_titles[:, 0:-1]
    titles_lab = my_titles[:, 1:]

    return (my_abstracts, titles_inp), titles_lab
In [15]:
dataset = ArxivDataset(data_tokenized)

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

train_dataloader = DataLoader(train_dataset, batch_size=256+64, shuffle=True, collate_fn=pad_collate)
test_dataloader = DataLoader(test_dataset, batch_size=256+64, shuffle=True, collate_fn=pad_collate)
print(len(train_dataloader), len(test_dataloader))
753 189

Transformer model

Note that this is not intended as a tutorial for how to build a transformer, and we will gloss over most details. Roughly speaking, a transformer is a sequence-to-sequence model which supplies the probability distribution of the next element. First the abstract is embedded in some latent space by the encoder, which is then use by the decoder to condition for the prediction.

It is similar to RNNs, however, the main technique for a transformer is the attention mechanism, which is implemented in PyTorch. This allows transformers, in theory, to couple far ranging words with each other rather easily as a quadratic 1-to-1 comparison is done between every element in the sequence. With this extra $n^2$ expense, comes at a benefit of it seemingly able to learn languages very well.

PyTorch in theory has a Transformer module but it is not well documented. We will perform a quick and dirty implementation here. For more details, we refer to a great tutorial here. The schematics below should serve as enough of a guideline for the code below.

In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Positional Embedding

In [4]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
            d_model - Hidden dimensionality of the input.
            max_len - Maximum length of a sequence to expect.

        # Create matrix of [SeqLen, HiddenDim] representing the positional encoding for max_len inputs
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)

        # register_buffer => Tensor which is not a parameter, but should be part of the modules state.
        # Used for tensors that need to be on the same device as the module.
        # persistent=False tells PyTorch to not add the buffer to the state dict (e.g. when we save the model)
        self.register_buffer('pe', pe, persistent=False)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return x

Encoder Block

This just consists of a multihead attention (essentially a few attentions concat-ed together), and a feed-forward NN.

In [5]:
class EncoderBlock(nn.Module):
    def __init__(self, input_dim, num_heads, dim_feedforward, dropout=0.1):

        # Attention layer
        self.self_attn = nn.MultiheadAttention(
            num_heads=num_heads, batch_first=True

        # Two-layer MLP
        self.linear_net = nn.Sequential(
            nn.Linear(input_dim, dim_feedforward),
            nn.Linear(dim_feedforward, input_dim)

        # Layers to apply in between the main layers
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention part
        attn_out, _ = self.self_attn(x, x, x)
        x = x + attn_out
        x = self.norm1(x)

        # MLP part
        linear_out = self.linear_net(x)
        x = x + self.dropout(linear_out)
        x = self.norm2(x)

        return x

class EncoderLayers(nn.Module):
    def __init__(self, num_layers, **encoder_args):
        self.layers = nn.ModuleList([EncoderBlock(**encoder_args) for _ in range(num_layers)])

    def forward(self, x):
        for l in self.layers:
            x = l(x)
        return x


A decoder layer consists of a causal attention, cross-attention from the decoder, and a final MLP.

In [6]:
# From PyTorch source
def _generate_square_subsequent_mask(
        sz: int,
        dtype: torch.dtype = torch.get_default_dtype(),
    r"""Generate a square causal mask for the sequence. The masked positions are filled with float('-inf').
        Unmasked positions are filled with float(0.0).
    return torch.triu(
        torch.full((sz, sz), float('-inf'), dtype=dtype, device=device),

class DecoderBlock(nn.Module):
    def __init__(self, input_dim, num_heads, dim_feedforward, dropout=0.1):

        # Attention layers
        self.causal_self_attn = nn.MultiheadAttention(
            num_heads=num_heads, dropout=dropout, batch_first=True
        self.cross_attn = nn.MultiheadAttention(
            num_heads=num_heads, dropout=dropout, batch_first=True,

        # Two-layer MLP
        self.linear_net = nn.Sequential(
            nn.Linear(input_dim, dim_feedforward),
            nn.Linear(dim_feedforward, input_dim)

        # Layers to apply in between the main layers
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.norm3 = nn.LayerNorm(input_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, context):
        # Causal; create mask
        mask = _generate_square_subsequent_mask(x.shape[1])
        attn_out, _ = self.causal_self_attn(
            x, x, x, attn_mask=mask, is_causal=True
        x = x + attn_out
        x = self.norm1(x)

        # Cross attention
        attn_out, _ = self.cross_attn(
            x, context, context,
        x = x + attn_out
        x = self.norm2(x)

        # MLP
        linear_out = self.linear_net(x)
        x = x + self.dropout(linear_out)
        x = self.norm3(x)

        return x

class DecoderLayers(nn.Module):
    def __init__(self, num_layers, **decoder_args):
        self.layers = nn.ModuleList([DecoderBlock(**decoder_args) for _ in range(num_layers)])
    def forward(self, title, abstract):
        for l in self.layers:
            title = l(x=title, context=abstract)
        return title

Full Model

The transformer full model is simply the above two blocks, alongside some embeddings and a final classifier layer.

In [7]:
class Transformer(nn.Module):
    def __init__(self, num_layers, model_dim, num_heads, dff,
               vocab_size, dropout=0.2):

        # Needed for scaling
        self.model_dim = model_dim

        # Embedding
        self.embedding = nn.Embedding(
            num_embeddings=vocab_size, embedding_dim=model_dim

        # Positional
        self.positional_encoding = PositionalEncoding(d_model=model_dim)

        self.encoder = EncoderLayers(
            num_layers=num_layers, input_dim=model_dim, num_heads=num_heads, dim_feedforward=dff, dropout=dropout
        self.decoder = DecoderLayers(
            num_layers=num_layers, input_dim=model_dim, num_heads=num_heads, dim_feedforward=dff, dropout=dropout

        # Final layer which maps to "probabilities" in the token space
        self.ff = nn.Linear(model_dim, vocab_size)

    def forward(self, title, abstract):
        # Apply embedding and position
        title       = self.embedding(title) * math.sqrt(self.model_dim)
        abstract    = self.embedding(abstract) * math.sqrt(self.model_dim)

        # Add positional
        title       = self.positional_encoding(title)
        abstract    = self.positional_encoding(abstract)

        # Apply encoder than decoder
        abstract    = self.encoder(abstract)
        title       = self.decoder(title, abstract)

        return self.ff(title)


There are heuristics for training transformers; one of them is that a warmup should be used. We use the cosine warmup as detailed in the tutorial linked above. Another small caveat is that for the cross entropy loss, to ignore the index 0 (where padding occurred), as we don’t want the optimizer to optimize over dummy data.

Besides the two notes above, the below should be fairly similar to the usual training procedure.

In [22]:
class CosineWarmupScheduler(optim.lr_scheduler._LRScheduler):

    def __init__(self, optimizer, warmup, max_iters):
        self.warmup = warmup
        self.max_num_iters = max_iters

    def get_lr(self):
        lr_factor = self.get_lr_factor(epoch=self.last_epoch)
        return [base_lr * lr_factor for base_lr in self.base_lrs]

    def get_lr_factor(self, epoch):
        lr_factor = 0.5 * (1 + np.cos(np.pi * epoch / self.max_num_iters))
        if epoch <= self.warmup:
            lr_factor *= epoch * 1.0 / self.warmup
        return lr_factor
In [33]:
def train(dataloader, my_model, my_criterion, my_optimizer, my_scheduler, my_epoch, interval=500):
    Training loop for one epoch
    losses = []
    for batch_idx, ((abstract, title_inp), title_lab) in enumerate(dataloader):
        # Move to GPU
        abstract = abstract.to(device)
        title_inp = title_inp.to(device)
        title_lab = title_lab.type(torch.long)
        title_lab = title_lab.to(device)

        pred = my_model(title_inp, abstract)
        loss = my_criterion(
            pred.reshape(-1, MAX_VOCAB), title_lab.reshape(-1)

        torch.nn.utils.clip_grad_norm_(my_model.parameters(), 5)


        if batch_idx print(f'Epoch {my_epoch} |  {batch_idx / len(dataloader) * 100:.1f}return losses

def test(dataloader, my_model, my_criterion, my_epoch):
    Training loop for one epoch
    val_loss = 0
    val_acc = 0
    val_total = 0
    with torch.no_grad():
        for _, ((abstract, title_inp), title_lab) in enumerate(dataloader):
            # Move to GPU
            abstract = abstract.to(device)
            title_inp = title_inp.to(device)
            title_lab = title_lab.type(torch.long)
            title_lab = title_lab.to(device)
            title_lab = title_lab.reshape(-1)

            pred = my_model(title_inp, abstract)
            pred = pred.reshape(-1, MAX_VOCAB)
            loss = my_criterion(
                pred, title_lab

            val_loss += loss.item()

            # Count number of correct predictions
            pred = torch.argmax(pred, dim=1, keepdim=True)
            title_lab = title_lab.reshape_as(pred)
            val_acc += torch.count_nonzero(pred.eq(title_lab) * (title_lab != 0)).item()
            val_total += torch.count_nonzero(title_lab).item()

    val_loss /= len(dataloader)
    val_acc /= val_total

    print(f'Epoch {my_epoch} | Validation Loss: {val_loss:.2f} | Validation Accuracy: {val_acc * 100:.1f}return val_loss, val_acc
In [34]:
epochs = 40
lr = 5e-5
MAX_VOCAB = len(tokenizer)

model = Transformer(num_layers=6, model_dim=256, num_heads=8, dff=2056, vocab_size=MAX_VOCAB).to(device)

print(f'Using {device}')

optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = CosineWarmupScheduler(optimizer, 5000, MAX_VOCAB * epochs)
criterion = nn.CrossEntropyLoss(ignore_index=0)

train_loss = []
val_loss = []
val_acc = []
for epoch in range(1, epochs + 1):
    e_loss = train(train_dataloader, model, criterion, optimizer, scheduler, epoch, interval=100)
    train_loss += e_loss
    v_loss, v_acc = test(test_dataloader, model, criterion, epoch)

Using cuda
Epoch 1 |  0.
Epoch 1 |  13.
Epoch 1 |  26.
Epoch 1 |  39.
Epoch 1 |  53.
Epoch 1 |  66.
Epoch 40 |  39.
Epoch 40 |  53.
Epoch 40 |  66.
Epoch 40 |  79.
Epoch 40 |  93.
Epoch 40 | Validation Loss: 2.61 | Validation Accuracy: 48.
In [35]:
# Save model
state_dict = model.state_dict()
torch.save(state_dict, "large_model_v2.pt")

# Plot loss and stuff
fig, ax1 = plt.subplots()

ax1.set_xlabel('Batch num')
ax1.set_ylabel('X-entropy Loss')
ax1.plot(np.arange(1, len(val_loss) + 1) * len(train_dataloader) ,val_loss)
ax1.plot([], [], 'r', label = 'temp') # Just for legends; hack

ax2 = ax1.twinx()
ax2.set_ylabel('Validation Accuracy')
ax2.plot(np.arange(1, len(val_loss) + 1) * len(train_dataloader),
         np.array(val_acc) * 100, 'r')
ax1.legend(['Train loss', 'Test loss', 'Test acc'])



The above training does take awhile on an A100, and it’s not even the best results. With additional tweaking and modifications, we can probably do a lot better. In addition, we should probably look at a more specific field rather than “astro” as a whole. However, let’s move on to writing the inference.

To build a generator is easy: we start with $(x_0)$ where $x_0$ is the start token and repeatedly query the model conditioned on the abstract input. For example, the first token will be given by $x_0 \to x_1$, which we then pass in $(x_0, x_1) \to (x_1, x_2)$ and iterate. We stop once we encounter a stop token or if it reaches the number of max tokens.

In [15]:
# Reimport so I can load the model in a future date
from transformers import AutoTokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

MAX_VOCAB = len(tokenizer)
model = Transformer(num_layers=6, model_dim=256, num_heads=8, dff=2056, vocab_size=MAX_VOCAB).to(device)
model.load_state_dict(torch.load( "large_model_v2.pt"))
print('Loaded model')
Loaded model
In [16]:
class TitleGenerator(nn.Module):
    def __init__(self, transformer_model: nn.Module, my_tokenizer):

        self.model = transformer_model
        self.tokenizer = my_tokenizer

    def forward(self, x: str):
        Given an abstract, it should query the transformer

        :param x: Should be the abstract
        # Get tokenized
        tokenized = self.tokenizer(x, return_tensors='pt')['input_ids']
        tokenized = tokenized.to(device)
        output = torch.tensor([[101]]).to(device)
        with torch.no_grad():
            for i in range(128):
                probs = self.model(output, tokenized)
                pred = probs.argmax(dim=2)
                output = torch.cat((output, pred[0, -1].view(1, 1)), dim=1)
                if output[0, -1] == 102:

        return self.tokenizer.decode(output[0])
In [17]:
generate = TitleGenerator(model, tokenizer)
generate('We discuss the combined IRAC/MIPS c2d Spitzer Legacy observations of the Serpens star-forming region. We describe criteria for isolating bona fide YSOs from the extensive background of extragalactic objects.')
'[CLS] extragalactic star - forming regions from spitzer irac observations of the irac - irac region [SEP]'
In [18]:
generate('We explore the reach of analytical models at one-loop in Perturbation Theory (PT) to accurately describe measurements of the galaxy power spectrum from numerical simulations in redshift space. We consider the validity range in terms of three different diagnostics: 1) the goodness of fit; 2) a figure-of-bias quantifying the error in recovering the fiducial value of a cosmological parameter; 3) an internal consistency check of the theoretical model quantifying the running of the model parameters with the scale cut.')
'[CLS] the effect of loop corrections on the power spectrum of galaxy power spectra in the large - scale structure of the universe [SEP]'
In [20]:
generate('There has been a discussion for many years on whether the disc in the Milky Way extends down to low metallicity. We aim to address the question by employing a large sample of giant stars with radial velocities and homogeneous metallicities based on the Gaia DR3 XP spectra. We study the 3D velocity distribution of stars in various metallicity ranges, including the very-metal poor regime (VMP, [M/H] <−2.0). We find that a clear disc population starts to emerge only around [M/H] ∼−1.3, and is not visible for [M/H] <−1.6. Using Gaussian Mixture Modeling (GMM), we show that there are two halo populations in the VMP regime: one stationary and one with a net prograde rotation of ∼80km/s. In this low-metallicity range, we are able to place constraints on the contribution of a rotation-supported disc sub-population to a maximum of ∼3%. We compare our results to previous claims of discy VMP stars in both observations and simulations and find that having a prograde halo component could explain most of these. ')
'[CLS] the metallicity distribution of the milky way disc with gaia dr3 [SEP]'