Transformers, the backbone of state-of-the-art NLP models such as BERT and GPT, have revolutionized the way we approach natural language understanding tasks. One key innovation in transformers is their ability to process entire sequences of tokens simultaneously, without relying on recurrence or convolutions. However, to achieve this, they require mechanisms to represent both the semantic meaning of tokens and their positional information in a sequence. These mechanisms are known as token embeddings and positional embeddings.
This blog will take a deep dive into these embeddings, explaining how they are constructed, their mathematical foundations, and how they empower transformer models to achieve incredible performance.
Token Embeddings: Giving Words Meaning
What Are Token Embeddings?
Token embeddings represent the semantic meaning of words in a vector space. In simple terms, they map words (or subwords) into numerical vectors that capture their meanings. Words with similar meanings are mapped to nearby points in this vector space.
For example:
The word "king" might be represented as [0.5,0.2,0.8,...][0.5, 0.2, 0.8, ...][0.5,0.2,0.8,...].
The word "queen" might be represented as [0.6,0.3,0.7,...][0.6, 0.3, 0.7, ...][0.6,0.3,0.7,...].
Their closeness in the vector space reflects their semantic similarity.
Tokenization and Vocabulary
Transformers begin by breaking down text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the tokenizer used (e.g., WordPiece, BPE, or SentencePiece). The tokenizer then assigns each token a unique token ID based on a pre-defined vocabulary.
Example
Consider the sentence:
- "I love NLP."
The vocabulary might look like this:
"I" → Token ID 1
"love" → Token ID 2
"NLP" → Token ID 3
The sentence is tokenized as: [1, 2, 3]
Each token ID is mapped to an embedding vector using an embedding matrix.
Mathematical Representation of Token Embeddings
The embedding matrix E is a learnable matrix of size [V×d], where:
V is the size of the vocabulary.
d is the embedding dimension.
When a token ID is encountered, its embedding is obtained by indexing the corresponding row in E:
Token Embedding=E[Token ID]
For example:
For Token ID 1 (I): E[1]=[0.1,0.3,0.4]
For Token ID 2 (love): E[2]=[0.2,0.6,0.5]
For Token ID 3 (NLP): E[3]=[0.7,0.9,0.8]
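The following is a minimal PyTorch sketch of this lookup, assuming a toy vocabulary and a tiny embedding dimension purely for illustration; the actual values of the embedding matrix are learned during training, not chosen by hand.

```python
import torch
import torch.nn as nn

# Toy vocabulary from the example above (IDs are illustrative).
vocab = {"I": 1, "love": 2, "NLP": 3}
vocab_size = 4      # V: includes an unused ID 0 (e.g., for padding)
embedding_dim = 3   # d: tiny for readability; real models use hundreds of dimensions

# The embedding matrix E of shape [V, d]; its rows are learned during training.
token_embedding = nn.Embedding(vocab_size, embedding_dim)

# Tokenize "I love NLP" into IDs and look up the corresponding rows of E.
token_ids = torch.tensor([vocab[w] for w in ["I", "love", "NLP"]])  # tensor([1, 2, 3])
embeddings = token_embedding(token_ids)  # shape [3, 3], one row per token

print(token_ids)
print(embeddings)
```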
Positional Embeddings: Encoding Sequence Order
Why Are Positional Embeddings Needed?
Transformers process tokens in parallel, unlike RNNs, which process them sequentially. This parallelism means transformers lack an inherent understanding of the order of tokens. For example, the sentences:
- "I love NLP" and "NLP love I" would appear identical without positional information.
To address this, positional embeddings encode the position of each token in the sequence, allowing the model to differentiate between tokens based on their order.
Mathematical Construction of Fixed Positional Embeddings
The fixed sinusoidal positional embeddings, introduced in the original transformer paper, provide a deterministic way to represent positions. These embeddings are computed using sine and cosine functions, ensuring smooth variations across dimensions and positions.
The Formula
For a sequence length L and embedding size d, positional embeddings are computed as follows:
- For even dimensions (2i):
PE(pos, 2i) = sin(pos / 10000^(2i/d))
- For odd dimensions (2i+1):
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Where:
pos is the position of the token in the sequence.
i is the dimension index.
Example
Consider a sequence of length L=4 and embedding size d=6. For position pos=2:
- Dimension 0 (2i=0):
PE(2,0) = sin(2 / 10000^(0/6)) = sin(2) ≈ 0.909
- Dimension 1 (2i+1 = 1):
PE(2,1) = cos(2 / 10000^(0/6)) = cos(2) ≈ −0.416
- Higher dimensions follow the same principle, with increasing divisors.
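The NumPy sketch below implements the sinusoidal formula above and reproduces the values for position 2; the function name and vectorized layout are just one convenient way to write it.

```python
import numpy as np

def sinusoidal_positional_embeddings(seq_len: int, d: int) -> np.ndarray:
    """Compute the fixed sinusoidal positional embedding table of shape [seq_len, d]."""
    positions = np.arange(seq_len)[:, np.newaxis]   # [seq_len, 1]
    dims = np.arange(d)[np.newaxis, :]              # [1, d]
    # Each dimension pair (2i, 2i+1) shares the divisor 10000^(2i/d).
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d)
    angles = positions * angle_rates                # [seq_len, d]
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles[:, 0::2])           # even dimensions -> sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])           # odd dimensions  -> cosine
    return pe

pe = sinusoidal_positional_embeddings(seq_len=4, d=6)
print(pe[2, 0])  # ≈ 0.909, i.e. sin(2)
print(pe[2, 1])  # ≈ -0.416, i.e. cos(2)
```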
Key Properties of Sinusoidal Embeddings
Periodic Nature: The periodicity of sine and cosine functions allows the model to generalize to sequences longer than those seen during training.
Smooth Variations: Neighboring positions have embeddings with small, smooth differences.
Unique Representations: The combination of sine and cosine ensures each position is uniquely encoded.
Combining Token and Positional Embeddings
The final input to a transformer model is obtained by adding the token embeddings and the positional embeddings element-wise. This addition provides the model with both:
What: The semantic meaning of tokens (via token embeddings).
Where: The position of tokens in the sequence (via positional embeddings).
Example
Consider the sentence "I love NLP":
Token Embeddings: [[0.1,0.3,0.4],[0.2,0.6,0.5],[0.7,0.9,0.8]]
Positional Embeddings: [[0.01,0.02,0.03],[0.04,0.05,0.06],[0.07,0.08,0.09]]
Final Input Embeddings:
[[0.11,0.32,0.43],[0.24,0.65,0.56],[0.77,0.98,0.89]]
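A short NumPy sketch of this element-wise addition, reusing the illustrative numbers from the example above (not values from a trained model):

```python
import numpy as np

# Illustrative values from the example above.
token_embeddings = np.array([[0.1, 0.3, 0.4],
                             [0.2, 0.6, 0.5],
                             [0.7, 0.9, 0.8]])
positional_embeddings = np.array([[0.01, 0.02, 0.03],
                                  [0.04, 0.05, 0.06],
                                  [0.07, 0.08, 0.09]])

# The transformer input is the element-wise sum of the two.
input_embeddings = token_embeddings + positional_embeddings
print(input_embeddings)
# [[0.11 0.32 0.43]
#  [0.24 0.65 0.56]
#  [0.77 0.98 0.89]]
```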
Trainable vs. Fixed Positional Embeddings
Fixed Positional Embeddings:
Sinusoidal embeddings are fixed and not updated during training.
Advantage: Simpler and generalizable.
Example: Used in the original transformer model.
Learnable Positional Embeddings:
Each position is assigned a unique embedding, which is learned during training.
Advantage: More flexible for specific tasks.
Example: Used in models like GPT and BERT.
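A minimal PyTorch sketch of learnable positional embeddings, assuming a BERT-like maximum length of 512 and embedding size of 768 (both illustrative): each position index simply gets its own trainable row in an embedding table.

```python
import torch
import torch.nn as nn

max_len, d = 512, 768  # assumed maximum sequence length and embedding size

# One trainable embedding vector per position, updated by backpropagation.
position_embedding = nn.Embedding(max_len, d)

seq_len = 3  # e.g., "I love NLP"
position_ids = torch.arange(seq_len)            # tensor([0, 1, 2])
pos_vectors = position_embedding(position_ids)  # shape [3, 768]
print(pos_vectors.shape)
```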
Libraries and Implementation
TensorFlow/Keras: You can create sinusoidal positional embeddings by subclassing tf.keras.layers.Layer. Alternatively, KerasNLP provides high-level APIs for transformer models.
PyTorch: PyTorch lets you construct embeddings manually with torch functions or use pre-built modules from torch.nn.
Hugging Face Transformers: Hugging Face provides ready-to-use transformer models, including pre-trained positional embeddings.
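As an illustrative sketch (assuming the bert-base-uncased checkpoint and the attribute layout BERT uses in Hugging Face Transformers), you can inspect both embedding tables of a pre-trained model directly:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Token embedding matrix [vocab_size, hidden] and learned positional embeddings [max_len, hidden].
print(model.embeddings.word_embeddings)
print(model.embeddings.position_embeddings)

inputs = tokenizer("I love NLP.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, num_tokens, hidden_size]
```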
Conclusion
Token embeddings and positional embeddings are the foundation of transformer models. Token embeddings capture meaning, while positional embeddings encode order, together enabling models to process sequences effectively. By understanding their mathematical and practical implementations, we can better appreciate the inner workings of transformers and build more effective NLP solutions. Whether using fixed sinusoidal embeddings or trainable embeddings, these components are indispensable in modern natural language processing.
Reference Research Paper: Attention Is All You Need