SI 630: Homework 2 – Word Embeddings and Attention
Published: 2026-03-17
The simple counting model we described actually works pretty well as a baseline! However, it has two major drawbacks. First, if we have a lot of text and a big vocabulary, our word vector representations become very expensive to compute and store. A vocabulary of 1,000 words that all co-occur with some frequency would take a matrix of size $|V|^2$, which has a million elements! Even though not all words will co-occur in practice, when we have hundreds of thousands of words, the matrix can become infeasible to compute. Second, this count-based representation has a lot of redundancy in it. If “ocean” and “sea” appear in similar contexts, we probably don’t need the co-occurrence counts for all |V| words to tell us they are synonyms. In mathematical terms, we’re trying to find a lower-rank matrix that doesn’t need all |V| dimensions.
Word embeddings solve both of these problems by trying to encode the kinds of contexts a word appears in as a low-dimensional vector. There are many (many) solutions for how to find lower-dimensional representations, with some of the earliest and most successful ones being based on the Singular Value Decomposition (SVD); one you may have heard of is Latent Semantic Analysis. In Homework 2, you’ll learn about a relatively recent technique, word2vec, that outperforms prior approaches for a wide variety of NLP tasks and is very widely used. This homework will build on your experience with stochastic gradient descent (SGD) and log-likelihood (LL) from Homework 1. You’ll (1) implement a basic version of word2vec that will learn word representations and then (2) try using those representations in intrinsic tasks that measure word similarity and an extrinsic task for sentiment analysis.
For this homework, we’ve provided skeleton code in Python 3 that you can use to finish the implementation of word2vec, with comments that hint at how to turn some of the math into Python code. You’ll want to start early on this homework so you can familiarize yourself with the code and implement each part.
This homework has the following learning goals:
• Develop your PyTorch programming skills through working with more of the library
• Learn how word2vec works in practice
• Improve your advanced data science debugging skills
• Gain experience working with large corpora
• Learn how to use Weights & Biases
• Evaluate one form of model explanation
This homework is a mix of conceptual and skills-based learning. As you get the hang of programming neural networks, you’ll be able to teach them to do many more advanced tasks. This homework will hopefully help prepare you by again having you advance your skills while also getting you thinking about what training word embeddings can do for us (as practitioners).
We’ve made the implementation easy to follow and avoided some of the useful but opaque optimizations that can make the code much faster. As a result, training your model may take some time. We estimate that on a regular laptop, it might take 30-45 minutes to finish training a single epoch of your model. That said, you can still quickly run the model for ~10K steps in a few minutes and check whether it’s working. A good way to check is to see what words are most similar to some high-frequency words, e.g., “january” or “good.” If the model is working, similar-meaning words should have similar vector representations, which will be reflected in the most similar word lists. We have included this as an automated test which will print out the most similar words.
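The "most similar words" check can be sketched in a few lines. This is a minimal illustration that assumes cosine similarity as the measure; the function names and the toy 3-dimensional vectors are made up and are not part of the skeleton code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy, hand-made vectors (not real embeddings).
toy_vectors = {
    "january": [0.9, 0.1, 0.0],
    "february": [0.8, 0.2, 0.1],
    "good": [0.0, 0.9, 0.3],
    "great": [0.1, 0.8, 0.4],
}
neighbors = most_similar("january", toy_vectors, topn=2)
```

With trained embeddings, the top of such a list for “january” would ideally be other month names; a list of unrelated words is a hint that something in training is off.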
The skeleton code also includes methods for writing word2vec data in a common format readable by the Gensim library. This means you can save your model and load the data with any other common libraries that work with word2vec. Once you’re able to run your model for ~100K iterations (or more), we recommend saving a copy of its vectors and loading them in a notebook to test.
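To make the file format concrete, here is a rough sketch of the plain-text word2vec layout: a header line with the vocabulary size and dimension, then one line per word. The helper names are hypothetical, and the skeleton code's own save function may differ in its details.

```python
import io

def write_word2vec_text(vectors, handle):
    """Write {word: [floats]} in the plain-text word2vec layout."""
    dim = len(next(iter(vectors.values())))
    handle.write(f"{len(vectors)} {dim}\n")      # header: vocab size, dimension
    for word, vec in vectors.items():
        handle.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")

def read_word2vec_text(handle):
    """Read the same layout back into a dict of word -> list of floats."""
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = handle.readline().rstrip("\n").split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        assert len(vectors[parts[0]]) == dim
    return vectors

# Round-trip two toy vectors through an in-memory file.
buffer = io.StringIO()
write_word2vec_text({"sea": [0.25, -0.5], "ocean": [0.5, -0.25]}, buffer)
buffer.seek(0)
round_trip = read_word2vec_text(buffer)
```

Writing to a real file instead of `io.StringIO` only requires passing an open file handle.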
We’ve included an exploratory notebook.
On a final note, this is the most challenging homework in the class. Much of your time will be spent on Task 1, which is just implementing word2vec. It’s a hard but incredibly rewarding homework and the process of doing the homework will help turn you into a world-class information and data scientist!
In Task 1, you’ll implement parts of word2vec in various stages. Word2vec itself is a complex piece of software and you won’t be implementing all the features in this homework. In particular, you will implement:
You’ll spend the majority of your time on Part 1 of that list, which involves writing the gradient descent part. You’ll start by getting the core of the algorithm up and running without parts 2 and 3, using gradient descent and negative sampling to generate output data that is incorrect (the negative examples). Then, you’ll work on ways to improve efficiency and quality by removing overly common words and removing rare words.
The vocabulary size is $V$, and the hidden layer size is $k$. The hidden layer size $k$ is a hyperparameter that will determine the size of our embeddings. The units on these adjacent layers are fully connected. The input is a one-hot encoded vector $x$, which means for a given input word, only one out of $V$ units, $\{x_1, \ldots, x_V\}$, will be 1, and all other units are 0. The output layer consists of a number of context words, which are also $V$-dimensional one-hot encodings of a number of words before and after the input word in the sequence. So if your input word was word $w$ in a sequence of text and you have a context window of ±2, this means you will have four $V$-dimensional one-hot outputs in your output layer, encoding words $w_{-2}, w_{-1}, w_{+1}, w_{+2}$ respectively. Unlike the input-hidden layer weights, the hidden-output layer weights are shared: the weight matrix that connects the hidden layer to output word $w_j$ will be the same one that connects to output word $w_k$ for all context words.
$$h = W^{T} x = v_I \qquad (1)$$
Similarly, $v_I$ acts as an input to the second weight matrix $C$ to produce the output neurons, which will be the same for all context words in the context window. That is, each output word vector is:
and for a specific word $w_j$, we have the corresponding embedding in $C$ as $v'_j$, and the corresponding neuron in the output layer gets $u_j$ as its input where:
Note that in both of these cases, multiplying the one-hot vector for a word $w_i$ by the corresponding matrix is the same thing as simply selecting the row of the matrix corresponding to the embedding for $w_i$. If it helps to think about this visually, think about the case for the inputs to the network: the one-hot encoding represents which word is the center word, with all other words not being present. As a result, their inputs are zero and never contribute to the activation of the hidden layer (only the center word does!), so we don’t even need to do the multiplication. In practice, we typically never represent these one-hot vectors for word2vec, as it’s much more efficient to simply select the appropriate row.
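The equivalence between the one-hot multiplication and a row lookup is easy to check on a toy matrix. The sketch below uses plain Python lists and a made-up 4-by-2 matrix; the helper names are for illustration only.

```python
def one_hot(index, size):
    """A length-`size` vector that is 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def transpose_times_vector(W, x):
    """Compute W^T x for a V-by-k matrix W (list of rows) and a length-V vector x."""
    k = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(W))) for j in range(k)]

# Toy input embedding matrix with V = 4 words and k = 2 dimensions.
W = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]
word_id = 2
h_matmul = transpose_times_vector(W, one_hot(word_id, len(W)))  # full multiplication
h_lookup = W[word_id]                                           # plain row selection
```

Both routes give the same hidden vector, but the lookup does a constant amount of work while the multiplication touches every row of the matrix.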
An unoptimized, naive version of word2vec would predict which context word $w_c$ was present given an input word $w_I$ by estimating the probabilities across the whole vocabulary using the softmax function:
The original log-likelihood objective is then to maximize the probability that the context words (in this case, $w_{-2}, \ldots, w_{+2}$) were all guessed correctly given the input word $w_I$. Note that you are not implementing this function!
Showing this function raises two important questions: (1) why is it still being described, and (2) why aren’t you implementing it? First, the equation represents an ideal case of what the model should be doing: given some positive value to predict for one of the outputs ($w_c$), everything else should be close to zero. This objective is similar to the likelihood you implemented for Logistic Regression: given some input, the weights need to be moved to push the predictions closer to 0 or closer to 1. However, think about how many weights you’d need to update to optimize this particular log-likelihood. For each positive prediction, you’d need to update |V| − 1 other vectors to make their predictions closer to 0. That strategy, which uses the softmax, results in a huge computational overhead, despite being the most conceptually sound. The success of word2vec is, in part, due to coming up with a smart way to achieve nearly the same result without having to apply the softmax. Therefore, to answer the second question, now that you know what the goal is, you’ll be implementing a far more efficient method known as negative sampling that will approximate a model that optimizes this objective!
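To see where the cost comes from, the sketch below computes a softmax over a toy vocabulary and compares how many output vectors each strategy touches per positive example. All values are made up, and the choice of 2 negatives is only an example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy output embeddings, one per vocabulary word (values are made up).
output_vectors = [[0.2, 0.1], [0.4, -0.3], [-0.5, 0.6], [0.0, 0.9], [1.0, 0.2]]
hidden = [0.5, 1.0]                          # hidden vector for the input word

scores = [dot(v, hidden) for v in output_vectors]
probs = softmax(scores)                      # needs a score for every word in V

vectors_touched_softmax = len(output_vectors)    # grows with the vocabulary
num_negatives = 2
vectors_touched_sampling = 1 + num_negatives     # one positive plus its negatives
```

With five words the difference is trivial, but the softmax count grows with |V| while the sampling count stays fixed.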
If you read the original word2vec paper, you might find some of the notation hard to follow. Thankfully, several papers have tried to unpack the paper in a more accessible format. If you want another description of how the algorithm works, try reading Goldberg and Levy [2014] or Rong [2014] for more explanation. There are also plenty of good blog tutorials for how word2vec works, and you’re welcome to consult those as well as some online demos that show how things work. There’s also a very nice illustrated guide to word2vec (https://jalammar.github.io/illustrated-word2vec/) that can provide more intuition too.
Before we can even start training, we’ll need to determine the vocabulary of the input text and then convert the text into a sequence of IDs that reflect which input neuron corresponds to which word. Word2vec typically treats all text as one long sequence, which ignores sentence boundaries, document boundaries, or otherwise-useful markers of discourse. We too will follow suit. In the code, you’ll see general instructions on which steps are needed to (1) create a mapping of word to ID and (2) process the input sequence of tokens and convert it to a sequence of IDs that we can use for training. This sequence of IDs is what we’ll use to create our training data. As a part of this process, we’ll also keep track of all the token frequencies in our vocabulary.
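The two steps above (building the word-to-ID mapping and converting tokens to IDs while tracking frequencies) might look roughly like this. It is a minimal sketch with hypothetical names, not the skeleton code's actual interface.

```python
from collections import Counter

def build_vocab(tokens):
    """Map each distinct token to an integer ID and count token frequencies."""
    counts = Counter(tokens)
    word_to_id = {}
    for token in tokens:                     # IDs follow order of first appearance
        if token not in word_to_id:
            word_to_id[token] = len(word_to_id)
    id_to_word = {i: w for w, i in word_to_id.items()}
    return word_to_id, id_to_word, counts

# The whole corpus is treated as one long token sequence.
tokens = "the quick brown fox jumps over the lazy dog".split()
word_to_id, id_to_word, counts = build_vocab(tokens)
token_ids = [word_to_id[t] for t in tokens]
```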
For a target word, the nearby words in the context form the positive examples for training our prediction model. Rather than train word2vec like a regular multiclass classification model (which uses the softmax function to predict outputs), word2vec uses a small number of randomly-selected words as negative examples. These negative examples are referred to as the negative samples.
The negative samples are chosen using a unigram distribution raised to the 3/4 power: each word is given a weight equal to its frequency (word count) raised to the 3/4 power. The probability of selecting a word is just its weight divided by the sum of weights for all words. The decision to
Once you have the tokens in place, the next step is to get the training data in place to actually train the model. Say we have the input word “fox” and observed context word “quick”. When training the network on the word pair (“fox”, “quick”), we want the model to predict an output of 1, signalling this word (“quick”) was present in the context.
With negative sampling, we will randomly select a small number of negative examples (let’s say 2) for each positive example to update the weights for. (In this context, a negative example is one for which we want the network to output a 0.) When updating the model (later), our parameters will be updated based on our current ability to predict 1 for the positive examples and 0 for the negative examples.
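The sampling distribution for negatives (word count raised to the 3/4 power, then normalized) can be sketched as follows. The function names and the two-word toy counts are assumptions for illustration; the skeleton code may build its sampling table differently.

```python
import random

def negative_sampling_probs(counts, power=0.75):
    """Turn raw word counts into sampling probabilities: count ** power, normalized."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

def sample_negatives(probs, how_many, rng):
    """Draw `how_many` words (with replacement) according to `probs`."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=how_many)

counts = {"the": 16, "fox": 1}               # toy counts
probs = negative_sampling_probs(counts)
negatives = sample_negatives(probs, how_many=2, rng=random.Random(0))
```

Here “the” ends up with probability 8/9 rather than the 16/17 that its raw frequency would give, so the exponent shifts some probability mass toward rarer words.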
To generate the training data, you’ll iterate through all token IDs in the sequence. At each time step, the current token ID will become the target word. You’ll use the window size parameter to decide how many nearby tokens should be included as positive training examples.
The original word2vec paper says that selecting 5-20 words works well for smaller datasets, and you can get away with only 2-5 words for large datasets. In this assignment, you will update with 2 negative words per context word. This means that if your context window selects four words, you will randomly sample 8 words as negative examples of context words. We recommend keeping the negative sampling rate at 2, but you’re welcome to try changing this and seeing its effect (we recommend doing this after you’ve completed the main assignment).
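Putting the window and the negative sampling rate together, generating training instances might look like the sketch below. The function name and return layout are hypothetical, and negatives are drawn uniformly here only to keep the example short (the assignment uses the frequency-weighted distribution). Targets near the ends of the sequence get fewer candidates in this sketch.

```python
import random

def make_training_instances(token_ids, window, negatives_per_context, vocab_size, rng):
    """Return one (target_id, candidate_ids, labels) triple per position."""
    instances = []
    for position, target in enumerate(token_ids):
        start = max(0, position - window)
        end = min(len(token_ids), position + window + 1)
        positives = [token_ids[i] for i in range(start, end) if i != position]
        # Uniform sampling is a simplification made for this sketch only.
        negatives = []
        while len(negatives) < negatives_per_context * len(positives):
            candidate = rng.randrange(vocab_size)
            if candidate != target and candidate not in positives:
                negatives.append(candidate)
        labels = [1] * len(positives) + [0] * len(negatives)
        instances.append((target, positives + negatives, labels))
    return instances

token_ids = [0, 1, 2, 3, 4, 5, 0, 6, 7]      # toy sequence with 8 distinct words
instances = make_training_instances(
    token_ids, window=2, negatives_per_context=2, vocab_size=8, rng=random.Random(0)
)
target, candidates, labels = instances[3]
```

For the target at position 3, the four context words are followed by eight sampled negatives, matching the 2-per-context-word rate described above.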
Note: There is one important PyTorch-related wrinkle that you will need to account for, which is described in detail in the code.
Now that the data is ready, we can define our PyTorch neural network for word2vec. Here, we will not use layers but instead use PyTorch’s Embedding class to keep track of our target word and context word embeddings.
■ Problem 4. Modify the init_weights function to initialize the values in the two Embedding objects based on the size of the vocabulary |V| and the size of the embeddings. Unlike in logistic regression where we initialized our β vector to be zeros, here, we’ll initialize the weights to have small non-zero values centered on zero and sampled from (-init_range, init_range).
The next step is to update the forward function, which takes as input some target word and context words and predicts 0 or 1 for whether each context word was present. Formally, for some target word vector $v_t$ and context word vector $v_c$, word2vec makes its predictions as
$$\sigma(v_t \cdot v_c) \qquad (5)$$
where σ is the sigmoid function (like in Homework 1). Word2vec aims to learn parameters (its two embedding matrices) such that this function is maximized for positive examples and minimized for negative examples.
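As a rough illustration of how two Embedding objects, the weight initialization, and the sigmoid of a dot product fit together, consider the sketch below. The class name, attribute names, tensor shapes, and the default `init_range=0.1` are all assumptions made for this example; the skeleton code defines its own structure.

```python
import torch
import torch.nn as nn

class Word2VecSketch(nn.Module):
    """Hypothetical minimal model: one embedding table per role."""

    def __init__(self, vocab_size, embedding_size, init_range=0.1):
        super().__init__()
        self.target_embeddings = nn.Embedding(vocab_size, embedding_size)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_size)
        self.init_weights(init_range)

    def init_weights(self, init_range):
        # Small values drawn uniformly from (-init_range, init_range).
        nn.init.uniform_(self.target_embeddings.weight, -init_range, init_range)
        nn.init.uniform_(self.context_embeddings.weight, -init_range, init_range)

    def forward(self, target_ids, context_ids):
        # target_ids: (batch,)   context_ids: (batch, num_candidates)
        v_t = self.target_embeddings(target_ids)       # (batch, k)
        v_c = self.context_embeddings(context_ids)     # (batch, num_candidates, k)
        # Batched dot product of each candidate vector with its target vector.
        scores = torch.bmm(v_c, v_t.unsqueeze(2)).squeeze(2)
        return torch.sigmoid(scores)                   # (batch, num_candidates)

torch.manual_seed(0)
model = Word2VecSketch(vocab_size=10, embedding_size=4)
targets = torch.tensor([3, 7])
candidates = torch.tensor([[1, 2, 5], [0, 4, 9]])
predictions = model(targets, candidates)
```

Each entry of `predictions` is the sigmoid of one target-candidate dot product, so it can be compared directly against 0/1 labels with a binary cross-entropy loss.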
One key difference this time is that you will use batching. In Homework 1 we had a stark contrast between (1) full gradient descent, where a single step required us to compute the gradient with respect to all the data, and (2) stochastic gradient descent, where we take a step based on the prediction error for a single instance. However, there is a middle ground! Often we can improve the gradient by computing it with respect to a few instances instead of just one. Analogously, if you wanted to know whether you were on the right track, it can help to ask a few folks, but you don’t need to ask everyone (and asking just one person could be risky and send you on the wrong track). Batched gradient descent works the same way.
Conveniently, PyTorch works nearly seamlessly with batching. We can tell the DataLoader class our batch size and it will return a random sample of instances of that size. The code you write for the forward function will also work with a batch with no modifications (most of the time). This behavior is even better for us because computers, especially GPUs, are often much faster at larger computations, so doing the forward/backward passes for an entire batch is often just as fast as doing them for a single instance.
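A small sketch of how batching with `DataLoader` might look is given below. The tensors are made-up stand-ins for real training instances, and wrapping them in a `TensorDataset` is an assumption for this example rather than what the skeleton code necessarily does.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up training instances: 10 targets, each with 6 candidate IDs and 0/1 labels.
targets = torch.arange(10)
candidates = torch.randint(0, 10, (10, 6))
labels = torch.tensor([[1, 1, 0, 0, 0, 0]] * 10, dtype=torch.float32)

dataset = TensorDataset(targets, candidates, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=True)   # shuffled mini-batches

batch_shapes = []
for batch_targets, batch_candidates, batch_labels in loader:
    # Each tensor gains a leading batch dimension of size <= batch_size.
    batch_shapes.append((tuple(batch_targets.shape), tuple(batch_candidates.shape)))
```

With 10 instances and a batch size of 4, the loader yields two batches of 4 and a final batch of 2, which is why forward code should not hard-code the batch size.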
In your implementation we recommend starting with these default parameter values:
Once you’re finished here, you’re not yet ready to run everything but you’re close!
Using all the unique words in your source corpus is often not necessary, especially when considering words that convey very little semantic meaning like “the”, “of”, “we”. As a preprocessing step, it can be helpful to remove any instance of these so-called “stop words”.
Note that when you remove stop words, you should keep track of their position so that the context doesn’t include words outside of the window. For example, in a sentence containing “my big cats of the kind that...” with a context window of ±2, the target word “cats” would have only “my” and “big” as context words (since “of” and “the” get removed), and “kind” would not be included.
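One way to keep positions intact, sketched here under the assumption that removed words are replaced by a placeholder rather than deleted, is shown below. The helper names and the two-word stop list are hypothetical.

```python
def mask_stop_words(tokens, stop_words):
    """Replace stop words with None so every other token keeps its position."""
    return [None if token in stop_words else token for token in tokens]

def context_words(masked_tokens, position, window):
    """Collect non-removed tokens within `window` positions of `position`."""
    start = max(0, position - window)
    end = min(len(masked_tokens), position + window + 1)
    return [masked_tokens[i] for i in range(start, end)
            if i != position and masked_tokens[i] is not None]

tokens = "my big cats of the kind that".split()
masked = mask_stop_words(tokens, stop_words={"of", "the"})
cats_context = context_words(masked, position=tokens.index("cats"), window=2)
```

Because “of” and “the” still occupy their slots, the window around “cats” never reaches “kind”; deleting the stop words outright would have pulled it in.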
In addition to removing words that are so frequent that they have little semantic value for comparison purposes, it is also often a good idea to remove words that are so infrequent that they are likely very unusual words or words that don’t occur often enough to get sufficient training during SGD. While the minimum frequency can vary depending on your source corpus and requirements, we will set min_count = 5 as the default in this assignment.
Instead of just removing words that have fewer than min_count occurrences, we will replace these all with a unique token <UNK>. In the training phase, you will skip over any input word that is <UNK>, but you will still keep these as possible context words.
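The rare-word replacement and the rule about skipping <UNK> as an input word can be sketched like this. The function names are hypothetical and the toy corpus is made up.

```python
from collections import Counter

UNK = "<UNK>"

def replace_rare_words(tokens, min_count=5):
    """Replace every token seen fewer than `min_count` times with <UNK>."""
    counts = Counter(tokens)
    return [token if counts[token] >= min_count else UNK for token in tokens]

def target_positions(tokens):
    """Positions usable as input (target) words: everything except <UNK>."""
    return [i for i, token in enumerate(tokens) if token != UNK]

tokens = ["cat"] * 5 + ["axolotl"] + ["dog"] * 6
cleaned = replace_rare_words(tokens, min_count=5)
usable = target_positions(cleaned)
```

The <UNK> token stays in the sequence, so it can still show up as a context word for its neighbors even though it is never used as a target.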
Words appear with varying frequencies: some words like “the” are very common, whereas others are quite rare. In the current setup, most of our positive training examples will be for predicting very common words as context words. These examples don’t add much to learning since they appear in many contexts. The word2vec library offers an alternative to ensure that contexts are more likely to have meaningful words. When creating the sequence of words for training (i.e., what goes in full_token_sequence_as_ids), the software will randomly drop words based on their frequency so that more common words are less likely to be included in the sequence. This subsampling effectively increases the context window too: because the context window is defined with respect to full_token_sequence_as_ids (not the original text), dropping a nearby common word means the context gets expanded to include the next-nearest word that was not dropped.
where $p(w_i)$ is the probability of the word appearing in the corpus initially. Using this probability, each occurrence of $w_i$ in the sequence is randomly decided to be kept or removed based on $p_k(w_i)$.
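As an illustration of frequency-based subsampling, the sketch below assumes the keep-probability used in the original word2vec C implementation, $(\sqrt{p/t} + 1) \cdot t/p$ with threshold $t = 0.001$. That formula and threshold are assumptions here and should be checked against the one given in the assignment; the function names are hypothetical.

```python
import math
import random

def keep_probability(word_prob, threshold=0.001):
    """Assumed form: (sqrt(p / t) + 1) * t / p, capped at 1."""
    return min(1.0, (math.sqrt(word_prob / threshold) + 1) * threshold / word_prob)

def subsample(tokens, word_probs, rng):
    """Keep each occurrence independently with its word's keep-probability."""
    return [t for t in tokens if rng.random() < keep_probability(word_probs[t])]

common = keep_probability(0.05)      # a word making up 5% of the corpus
rare = keep_probability(0.0001)      # a word making up 0.01% of the corpus

tokens = ["the"] * 1000 + ["fox"] * 2
kept = subsample(tokens, {"the": 0.05, "fox": 0.0001}, random.Random(0))
```

Under this assumed formula, very common words are dropped most of the time while rare words are always kept.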
Hovering over any point shows the loss at that time. As you can see, after one epoch the model has learned something but has probably not fully converged!
As you might guess, training word2vec on a lot of data can take some time. This waiting process will be increasingly true as you train larger and larger models (not just word2vec). However, the larger PyTorch ecosystem provides some fantastic tools for you, the practitioner, to monitor the progress. In this subtask, you’ll be using one of those tools, Weights & Biases (wandb), which allows you to log how your model is doing and then connect to the wandb website to see the plot. Figure 1 shows an example of the wandb plot for our reference implementation after one epoch of training. Here, we’ve just recorded a running sum of the loss every 100 steps.
You will want to do the same. This will help you see how quickly your model is converging.
If you train multiple models, wandb will show all of their training plots so you can see how your choice of hyperparameters affects training speed and which model has learned the most (has the lowest loss). In practice, many people use wandb to determine when to stop training after seeing that their model has effectively converged.
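The "running sum of the loss every 100 steps" pattern can be sketched independently of any logging backend. In the example below the logger is a plain list so it runs offline; with wandb, `log_fn` would be `wandb.log` called after `wandb.init(...)`. The function name is hypothetical, and resetting the sum after each report is a choice made for this sketch.

```python
def train_with_logging(step_losses, log_fn, log_every=100):
    """Accumulate a running sum of the loss and report it every `log_every` steps."""
    running_loss = 0.0
    for step, loss in enumerate(step_losses, start=1):
        running_loss += loss
        if step % log_every == 0:
            log_fn({"loss": running_loss, "step": step})
            running_loss = 0.0               # start a fresh sum for the next window

# Stand-in for a real logger: collect the records in a list.
records = []
fake_losses = [1.0] * 250                    # made-up per-step losses
train_with_logging(fake_losses, records.append, log_every=100)
```

Summing over a window smooths out the step-to-step noise, which makes the downward trend easier to read on a plot.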
■ Problem 12. Try batch sizes of 2, 8, 32, 64, 128, 256, 512 to see how fast each step (one batch worth of updates) is and the total estimated time. For this, you’ll set the parameter and then run the training long enough to get an estimate for both with tqdm wrapped around your batch iterator.
You do not need to finish training for the full epoch. Make a plot where batch size is on the x-axis and the tqdm-estimated time to finish one epoch is on the y-axis. (You may want to log-scale one or both of the axes). You can try other batch sizes too in this plot if you’re curious. In your write up, describe what you see. What batch size would you choose to maximize speed? Side note: You might also want to watch your memory usage, as larger batches can sometimes dramatically
We need to see the plot. It will probably look something like Figure 1.
Once you’ve finished training the model for at least one epoch, save your vector outputs. The rest of the homework will use these vectors so you don’t even have to re-run the learning code (until the very last part, but ignore that for now). Task 2 is here just so that you have an explicit reminder to save your vectors. We’ve provided a function to do this for you.
Once you’ve learned the word2vec embeddings from how a word is used in context, now we can use them! How can we tell whether what the model has learned is useful? As a part of training, we put in place code that shows the nearest neighbors, which is often a good indication of whether words that we think are similar end up getting similar representations. However, it’s often better to get a more quantitative estimate of similarity. In Task 3, we’ll begin evaluating the model by hand by looking at which words are most similar to another word based on their vectors.
■ Problem 14. Load the model (vectors) you saved in Task 2 by using the Jupyter notebook provided (or code that does something similar) that uses the Gensim package to read the vectors. Gensim has a number of useful utilities for working with pretrained vectors.
Once you have completed all other steps, only then start on Task 4!
Hopefully Task 3 has shown you that your word vectors have learned something. But what exactly do we do with the vectors? In Task 4, you’ll try using your vectors in a downstream task: classifying documents.
What might we do if we have word vectors instead of word counts? Well, one way to think of a simple BoW vector is as a sum of the one-hot vectors of the words in the document (e.g., if a word appears seven times, we’d sum its one-hot vector to get a value of 7 in that word’s index in the BoW vector). We might take an analogous approach to working with word vectors. To represent a document, we could take the sum of the word vectors. This would give us a fixed-length vector! More words in the document means we just add them to the sum, but the vector length stays the same! In practice, most approaches take the average of the word vectors to get a sense of “what kind of content is in this document?” This can work well in practice (as you might see later).
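Averaging word vectors into a document vector takes only a few lines. The sketch below uses made-up 2-dimensional vectors and a hypothetical function name; skipping out-of-vocabulary tokens is one reasonable choice among several.

```python
def average_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; skip unknown tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None                          # no known words in the document
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

toy_vectors = {"amazing": [1.0, 0.0], "movie": [0.0, 1.0], "the": [0.5, 0.5]}
short_doc = average_vector(["amazing", "movie"], toy_vectors)
long_doc = average_vector(["the", "movie", "the", "amazing", "unseen"], toy_vectors)
```

Both documents map to a vector of the same length regardless of how many tokens they contain, which is what lets a standard classifier consume them.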
Using the average word vector to represent a document is promising but also seems a bit flawed when we think of which kinds of words are contributing to the vector. Why should the vector for “the” contribute just as much as the vector for “amazing”? In our bag-of-words representation, we tried to mitigate this by re-weighting the BoW vector with techniques like TF-IDF (note: you didn’t do this in Homework 1 but we briefly talked about it in class). We could try doing something similar with our word vectors but this raises a question: which kind of re-weighting should we use for our classification task? How do we know which words to weight more or less? The answer is at the heart of new approaches to deep learning: let’s learn the weighting!
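One common way to realize a learned weighting is to score each word vector against a parameter vector and turn the scores into weights with a softmax. The sketch below illustrates that idea only; the `query` vector is a hypothetical stand-in for a parameter that would be learned by gradient descent, and the assignment's actual design may differ.

```python
import math

def attention_average(word_vectors, query):
    """Weight each word vector by softmax(query . vector), then sum."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in word_vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(word_vectors[0])
    doc = [sum(w * vec[d] for w, vec in zip(weights, word_vectors)) for d in range(dim)]
    return doc, weights

doc_vectors = [[0.1, 0.1], [2.0, 0.0]]       # toy vectors for "the" and "amazing"
query = [1.0, 0.0]                           # hypothetical learned parameter
doc, weights = attention_average(doc_vectors, query)
```

A plain average is the special case where every weight equals 1/n; here the second vector receives most of the weight because it aligns with the query.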
