Train your own language model with nanoGPT

Let’s build a songwriter

Sophia Yang, Ph.D.
Mar 20, 2023

This morning, I watched Andrej Karpathy’s Build ChatGPT from Scratch video. I was so impressed. Only a true legend can make such a complex model look so effortless. In his video, he builds a GPT language model from scratch in only a few hundred lines of code and organizes everything in the nanoGPT GitHub repo. I couldn’t wait to give it a try. So in this blog post, I’m going to try out nanoGPT and see if I can use it to train a songwriter.

Train a Shakespeare writer (following repo instructions)

Before we build our songwriter, let’s first follow the instructions on the nanoGPT repo to build a Shakespeare writer.

Step 1: Download Anaconda

We will first need to install Python. Downloading Anaconda is the easiest and recommended way to get Python and the Conda environment manager set up.

Step 2: Set up Conda environment

Let’s create a new Conda environment called “nanoGPT”:

conda create -n nanoGPT

Then we activate this environment and install the needed packages:

conda activate nanoGPT
conda install pytorch numpy transformers datasets tiktoken wandb tqdm pandas -c conda-forge

Step 3: Prepare training data

The prepare.py script downloads the Shakespeare text to an input.txt file, uses the first 90% of the text as training data and the remaining 10% as validation data, and saves a train.bin and a val.bin in the data directory for later use.

python data/shakespeare_char/prepare.py
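
For reference, here is a condensed sketch of what data/shakespeare_char/prepare.py does (simplified; the real script also writes a meta.pkl file with the character vocabulary so sample.py can decode the generated ids later):

import os
import requests
import numpy as np

# download the tiny Shakespeare text if we don't have it yet
input_file = os.path.join(os.path.dirname(__file__), 'input.txt')
if not os.path.exists(input_file):
    url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file, 'w') as f:
        f.write(requests.get(url).text)

with open(input_file, 'r') as f:
    data = f.read()

# build a character-level vocabulary and encode the text as integer ids
chars = sorted(list(set(data)))
stoi = {ch: i for i, ch in enumerate(chars)}
ids = [stoi[c] for c in data]

# 90/10 train/validation split, saved as flat uint16 binary files
n = len(ids)
train_ids = np.array(ids[:int(n * 0.9)], dtype=np.uint16)
val_ids = np.array(ids[int(n * 0.9):], dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))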

Step 4: Train your model

On my Apple M1 computer, I tried a simplified version of the model. Note that I changed the device from cpu to mps to use the M1’s Metal GPU.
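
If you are not sure whether your PyTorch build supports MPS, you can check with a quick one-liner (this is standard PyTorch, not part of nanoGPT):

python -c "import torch; print(torch.backends.mps.is_available())"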

python train.py config/train_shakespeare_char.py --device=mps --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0

On a separate GPU machine, I trained the full model with the default configurations:

python train.py config/train_shakespeare_char.py

One interesting thing to note is that the authors organized all the parameters in the config folder, so we can easily see the default values of those parameters and don’t need to spell out every parameter value when we call train.py. At the same time, any of these parameters can be overridden on the command line.
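
The mechanism behind this is refreshingly simple: train.py defines its defaults as plain module-level variables and then runs a small configurator that first exec()s any config file you pass in and then applies --key=value overrides from the command line. Here is a rough sketch of the idea (not the exact code from the repo, which also validates keys and types):

import sys
from ast import literal_eval

# defaults, defined as plain globals (as train.py does)
batch_size = 64
block_size = 256
device = 'cuda'

# roughly the configurator pattern: config files first, then --key=value overrides
for arg in sys.argv[1:]:
    if not arg.startswith('--'):
        # a config file like config/train_lyrics.py: run it to overwrite the defaults
        exec(open(arg).read())
    else:
        # a command-line override like --device=mps or --batch_size=12
        key, val = arg[2:].split('=')
        try:
            val = literal_eval(val)   # turn "12" into 12, "False" into False, etc.
        except (ValueError, SyntaxError):
            pass                      # keep plain strings like mps as-is
        globals()[key] = val

print(batch_size, block_size, device)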

Step 5: Generate text

After training for a few minutes or hours, depending on your patience and your computing power, the trained model ckpt.pt will be saved to the output directory out-shakespeare-char, which allows us to use it to generate text:

python sample.py --out_dir=out-shakespeare-char

Here is the generated text after running 200 iterations on my laptop. Looks like the model has not learned much yet. This is expected since we haven’t trained much.

Here is the result after running 3000 iterations on a GPU. You can see the result is significantly better. Of course, it’s still not Shakespeare, but you can keep training it a lot more if you have time : )

Train with your own data and build a songwriter


Step 1: Find your dataset

I’m interested in building a songwriter (okay maybe more accurately a lyrics writer lol), so I need to find a dataset with lyrics. To find the dataset you are interested in, Google is your best friend. After some Google searches, I decided to use the Spotify Million Song dataset from Kaggle. I then created a new folder called lyrics in the data folder and saved the Spotify dataset in this folder.
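
Before changing any code, it’s worth a quick peek at the CSV to confirm which column actually holds the lyrics (in this dataset it’s called text; the path below assumes you unzipped the Kaggle download into data/lyrics):

import pandas as pd

# peek at the Spotify Million Song dataset to find the lyrics column
df = pd.read_csv('data/lyrics/spotify_millsongdata.csv')
print(df.columns.tolist())        # expect something like ['artist', 'song', 'link', 'text']
print(df.shape)                   # number of songs and columns
print(df['text'].iloc[0][:200])   # first 200 characters of the first song's lyrics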

Step 2: Prepare training data (revise prepare.py)

We then need to revise prepare.py based on the format of our data, a local CSV file. All it takes is three extra lines of code to import pandas, read the CSV file, and combine all the lyrics into one text string:

import os
import tiktoken
import numpy as np
import pandas as pd

# read the lyrics CSV and join all songs into one long string
df = pd.read_csv('data/lyrics/spotify_millsongdata.csv')
data = df['text'].str.cat(sep='\n')

# 90/10 train/validation split
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))

# train.bin has 301,966 tokens
# val.bin has 36,059 tokens
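
Nothing else needs to change: train.py simply memory-maps these .bin files and samples random block_size-long windows from them as training batches. Roughly, the batch loading looks like this (simplified from get_batch in train.py):

import numpy as np
import torch

block_size, batch_size = 64, 12

# the .bin file is just a flat array of uint16 token ids
data = np.memmap('data/lyrics/train.bin', dtype=np.uint16, mode='r')

# pick random starting positions and cut out (input, target) windows
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
y = torch.stack([torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64)) for i in ix])
# x and y are (batch_size, block_size); y is x shifted one token to the right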

Step 3: Train your model

I’d like to use the same parameter values as the Shakespeare example, so in the config folder I created a new file called train_lyrics.py, copied all the default parameter values from train_shakespeare_char.py, and changed the names from Shakespeare to lyrics:

# train a miniature GPT model on song lyrics
# good for debugging and playing on macbooks and such

out_dir = 'out-lyrics'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'lyrics'
wandb_run_name = 'mini-gpt'

dataset = 'lyrics'
batch_size = 64
block_size = 256 # context of up to 256 previous tokens

# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
# device = 'cpu' # run on cpu only
# compile = False # do not torch compile the model

Then we can train the simplified model on a laptop:

python train.py config/train_lyrics.py --device=mps --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0

Step 4: Generate songs

After training for some time, we can again generate text using sample.py:

python sample.py --out_dir=out-lyrics
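
sample.py takes command-line overrides the same way train.py does, so you can also seed the generation with a prompt and ask for a few samples (flag names as defined in the repo’s sample.py; on a laptop you may again need --device=mps --compile=False):

python sample.py --out_dir=out-lyrics --start="I love you" --num_samples=3 --max_new_tokens=200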

I trained for 1000 iterations and here are the results. As you can see, the generated text does look like lyrics, and you can train longer to get better results!

Overall, in this blog post, we trained our own language models on Shakespeare’s text and on song lyrics. nanoGPT is surprisingly easy to use and easy to adapt to your own data. With nanoGPT and some computing power, everyone should be able to train their own language models on their own domains of interest. Happy exploring with nanoGPT!

. . .

By Sophia Yang on March 19, 2023

Sophia Yang is a Senior Data Scientist at Anaconda. Connect with me on LinkedIn, Twitter, and YouTube and join the DS/ML Book Club ❤️
