SI 630: Homework 2 – Word Embeddings and Attention

Published: 2026-03-17


Due: See Canvas

1 Introduction

How do we represent word meaning so that we can analyze it, compare different words’ meanings, and use these representations in NLP tasks? One way to learn word meaning is to find regularities in how a word is used. Two words that appear in very similar contexts probably mean similar things. One way you could capture these contexts is to simply count which words appeared nearby. If we had a vocabulary of $|V|$ words, we would end up with each word being represented as a vector of length $|V|$ where, for a word $w_i$, each dimension $j$ in $w_i$’s vector, $w_{i,j}$, refers to how many times $w_j$ appeared in a context where $w_i$ was used.
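As a rough illustration of the counting idea above, here is a minimal sketch in plain Python. The function name `cooccurrence_counts`, the toy sentence, and the window of 2 are assumptions made for this example; they are not part of the provided skeleton code.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    """For each word, count which words appear within `window` tokens of it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[word][tokens[j]] += 1
    return counts

# Toy corpus (hypothetical), whitespace-tokenized.
tokens = "the quick brown fox jumps over the lazy dog".split()
counts = cooccurrence_counts(tokens, window=2)
print(dict(counts["fox"]))
```

Each `counts[word]` plays the role of one sparse row of the $|V| \times |V|$ count matrix; only the non-zero entries are stored.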

The simple counting model we described actually works pretty well as a baseline! However, it has two major drawbacks. First, if we have a lot of text and a big vocabulary, our word vector representations become very expensive to compute and store. A vocabulary of 1,000 words that all co-occur with some frequency would take a matrix of size $|V|^2$, which has a million elements! Even though not all words will co-occur in practice, when we have hundreds of thousands of words, the matrix can become infeasible to compute. Second, this count-based representation has a lot of redundancy in it. If “ocean” and “sea” appear in similar contexts, we probably don’t need the co-occurrence counts for all $|V|$ words to tell us they are synonyms. In mathematical terms, we’re trying to find a lower-rank matrix that doesn’t need all $|V|$ dimensions.

Word embeddings solve both of these problems by trying to encode the kinds of contexts a word appears in as a low-dimensional vector. There are many (many) solutions for how to find lower-dimensional representations, with some of the earliest and most successful ones being based on the Singular Value Decomposition (SVD); one you may have heard of is Latent Semantic Analysis. In Homework 2, you’ll learn about a relatively recent technique, word2vec, that outperforms prior approaches for a wide variety of NLP tasks and is very widely used. This homework will build on your experience with stochastic gradient descent (SGD) and log-likelihood (LL) from Homework 1. You’ll (1) implement a basic version of word2vec that will learn word representations and then (2) try using those representations in intrinsic tasks that measure word similarity and an extrinsic task for sentiment analysis.

For this homework, we’ve provided skeleton code in Python 3 that you can use to finish the implementation of word2vec, with comments that hint at how to turn some of the math into Python code. You’ll want to start early on this homework so you can familiarize yourself with the code and implement each part.

This homework has the following learning goals:

• Develop your PyTorch programming skills through working with more of the library

• Learn how word2vec works in practice

• Learn how one form of attention works

• Improve your advanced data science debugging skills

• Gain experience working with large corpora

• Learn how to use Weights & Biases

• Evaluate one form of model explanation

This homework is a mix of conceptual and skills-based learning. As you get the hang of programming neural networks, you’ll be able to teach them to do many more advanced tasks. This homework will hopefully help prepare you by again having you advance your skills while also getting you thinking about what training word embeddings can do for us (as practitioners).

2 Notes

We’ve made the implementation easy to follow and avoided some of the useful-to-opaque optimizations that can make the code much faster. As a result, training your model may take some time. We estimate that on a regular laptop, it might take 30-45 minutes to finish training a single epoch of your model. That said, you can still quickly run the model for ∼10K steps in a few minutes and check whether it’s working. A good way to check is to see what words are most similar to some high frequency words, e.g., “january” or “good.” If the model is working, similar-meaning words should have similar vector representations, which will be reflected in the most similar word lists. We have included this as an automated test which will print out the most similar words.

The skeleton code also includes methods for writing word2vec data in a common format readable by the Gensim library. This means you can save your model and load the data with any other common libraries that work with word2vec. Once you’re able to run your model for ∼100K iterations (or more), we recommend saving a copy of its vectors and loading them in a notebook to test.

We’ve included an exploratory notebook.

On a final note, this is the most challenging homework in the class. Much of your time will be spent on Task 1, which is just implementing word2vec. It’s a hard but incredibly rewarding homework and the process of doing the homework will help turn you into a world-class information and data scientist!

3 Data

For data, we’ll be using a sample of cleaned Amazon book reviews that’s been shrunk down to make it manageable. This is pretty fun data to use since it lets us use word vectors to probe for knowledge about how people describe products. If you’re very ambitious, we’ve included a lot of extra data you can use to train. Feel free to see how the model works and whether you can get through a single epoch! We’ve provided several files for you to use in both the word2vec part and in the downstream classification part:

1. reviews-word2vec.med.txt – Eventually train your word2vec model on this data

2. reviews-word2vec.tiny.txt – A very small sample of data. Your model won’t learn much from this but you can use the file to quickly test and debug your code without having to wait for the tokenization to finish.

3. reviews-word2vec.large.txt – If you have an efficient implementation, try training your word2vec model on this data

4. reviews-word2vec.huge.txt – Lots of data! Running on this data will require some careful performance optimization and starting early

5. sentiment.train.csv – This is the training data for your attention-based classifier in Part 4. You do not need this data for word2vec (nor should you use it)

6. sentiment.dev.csv – Labeled data for evaluating the attention-based classifier in Part 4

7. sentiment.test.csv – Unlabeled test data for evaluating the attention-based classifier in Part 4. You will upload your predictions for this test to Kaggle

4 Task 1: Word2vec

In Task 1, you’ll implement parts of word2vec in various stages. Word2vec itself is a complex piece of software and you won’t be implementing all the features in this homework. In particular, you will implement:

1. Skip-gram negative sampling (you might see this as SGNS)

2. Rare word removal

3. Frequent word subsampling

You’ll spend the majority of your time on item 1 of that list, which involves writing the gradient descent part. You’ll start by getting the core part of the algorithm up and running without items 2 and 3, using gradient descent and using negative sampling to generate output data that is incorrect. Then, you’ll work on ways to improve efficiency and quality by removing overly common words and removing rare words.

Parameters and notation

The vocabulary size is $V$, and the hidden layer size is $k$. The hidden layer size $k$ is a hyperparameter that will determine the size of our embeddings. The units on these adjacent layers are fully connected. The input is a one-hot encoded vector $x$, which means that for a given input word, only one out of $V$ units, $\{x_1, \ldots, x_V\}$, will be 1, and all other units are 0. The output layer consists of a number of context words, which are also $V$-dimensional one-hot encodings of a number of words before and after the input word in the sequence. So if your input word was word $w$ in a sequence of text and you have a context window of ±2, this means you will have four $V$-dimensional one-hot outputs in your output layer, each encoding words $w_{-2}$, $w_{-1}$, $w_{+1}$, $w_{+2}$ respectively. Unlike the input-hidden layer weights, the hidden-output layer weights are shared: the weight matrix that connects the hidden layer to output word $w_j$ will be the same one that connects to output word $w_k$ for all context words.

The weights between the input layer and the hidden layer can be represented by a $V \times k$ matrix $W$, and the weights between the hidden layer and each of the output contexts are similarly represented as $C$ with the same dimensions. Each row of $W$ is the $k$-dimensional embedded representation $v_I$ of the associated word $w_I$ of the input layer; these rows are effectively the word embeddings we want to produce with word2vec. Let input word $w_I$ have one-hot encoding $x$ and $h$ be the output produced at the hidden layer. Then, we have:

$$h = W^{T} x = v_I \tag{1}$$

Similarly, $v_I$ acts as an input to the second weight matrix $C$ to produce the output neurons, which will be the same for all context words in the context window. That is, each output word vector is:

$$u = C h \tag{2}$$

and for a specific word $w_j$, we have the corresponding embedding in $C$ as $v'_j$, and the corresponding neuron in the output layer gets $u_j$ as its input where:

$$u_j = {v'_j}^{T} h \tag{3}$$

Note that in both of these cases, multiplying the one-hot vector for a word $w_i$ by the corresponding matrix is the same thing as simply selecting the row of the matrix corresponding to the embedding for $w_i$. If it helps to think about this visually, think about the case for the inputs to the network: the one-hot embedding represents which word is the center word, with all other words not being present. As a result, their inputs are zero and never contribute to the activation of the hidden layer (only the center word does!), so we don’t need to even do the multiplication. In practice, we typically never represent these one-hot vectors for word2vec as it’s much more efficient to simply select the appropriate row.
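The equivalence between the one-hot multiplication and row selection can be checked on a tiny example. This is only a sketch: the 4 × 3 matrix and the helper `matvec_transpose` are made up for illustration and use plain Python lists rather than PyTorch tensors.

```python
def matvec_transpose(W, x):
    """Compute W^T x for a V x k matrix W (a list of rows) and a length-V vector x."""
    k = len(W[0])
    return [sum(W[i][d] * x[i] for i in range(len(W))) for d in range(k)]

# Hypothetical embedding matrix with V = 4 words and k = 3 dimensions.
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]

word_id = 2
x = [0.0] * len(W)
x[word_id] = 1.0          # one-hot encoding of the input word

h = matvec_transpose(W, x)
print(h == W[word_id])    # the product is exactly row `word_id` of W
```

The multiplication touches every entry of the matrix only to multiply most of them by zero, which is why a direct row lookup is preferred in practice.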

An unoptimized, naive version of word2vec would predict which context word $w_c$ was present given an input word $w_I$ by estimating the probabilities across the whole vocabulary using the softmax function:

$$P(w_c = w_c^{*} \mid w_I) = y_c = \frac{\exp(u_c)}{\sum_{i=1}^{V} \exp(u_i)} \tag{4}$$
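To make the cost of the softmax concrete, here is a small sketch of Equation 4 in plain Python. The scores in `u` are invented, and subtracting the maximum score before exponentiating is a standard numerical-stability trick that is not part of the equation itself.

```python
import math

def softmax(scores):
    """Turn a list of scores u_1..u_V into probabilities that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)                    # this sum runs over the whole vocabulary
    return [e / total for e in exps]

u = [2.0, 0.5, -1.0, 0.0]                # hypothetical scores for a 4-word vocabulary
y = softmax(u)
print(y)
```

Note that the denominator needs one term per vocabulary word, so every prediction involves all $V$ output vectors.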

The original objective is then to maximize the log-likelihood that the context words (in this case, $w_{-2}, \ldots, w_{+2}$) were all guessed correctly given the input word $w_I$. Note that you are not implementing this function!

Showing this function raises two important questions: (1) why is it still being described and (2) why aren’t you implementing it? First, the equation represents an ideal case of what the model should be doing: given some positive value to predict for one of the outputs ($w_c$), everything else should be close to zero. This objective is similar to the likelihood you implemented for Logistic Regression: given some input, the weights need to be moved to push the predictions closer to 0 or closer to 1. However, think about how many weights you’d need to update to optimize this particular log-likelihood. For each positive prediction, you’d need to update $|V| - 1$ other vectors to make their predictions closer to 0. That strategy, which uses the softmax, results in a huge computational overhead despite being the most conceptually sound. The success of word2vec is, in part, due to coming up with a smart way to achieve nearly the same result without having to apply the softmax. Therefore, to answer the second question, now that you know what the goal is, you’ll be implementing a far more efficient method known as negative sampling that approximates a model optimizing this objective!

If you read the original word2vec paper, you might find some of the notation hard to follow. Thankfully, several papers have tried to unpack the paper in a more accessible format. If you want another description of how the algorithm works, try reading Goldberg and Levy [2014] or Rong [2014] for more explanation. There are also plenty of good blog tutorials for how word2vec works and you’re welcome to consult those as well as some online demos that show how things work. There’s also a very nice illustrated guide to word2vec, https://jalammar.github.io/illustrated-word2vec/, that can provide more intuition too.

4.1 Getting Started: Preparing the Corpus

Before we can even start training, we’ll need to determine the vocabulary of the input text and then convert the text into a sequence of IDs that reflect which input neuron corresponds to which word. Word2vec typically treats all text as one long sequence, which ignores sentence boundaries, document boundaries, or otherwise-useful markers of discourse. We too will follow suit. In the code, you’ll see general instructions on which steps are needed to (1) create a mapping of word to ID and (2) process the input sequence of tokens and convert it to a sequence of IDs that we can use for training. This sequence of IDs is what we’ll use to create our training data. As a part of this process, we’ll also keep track of all the token frequencies in our vocabulary.
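The two steps above might be sketched as follows. This is not the skeleton's implementation: `build_vocab` is a hypothetical helper, and whitespace splitting stands in for whatever tokenizer the provided code uses.

```python
from collections import Counter

def build_vocab(tokens):
    """Map each distinct word to an integer ID and convert the token sequence to IDs."""
    word_counts = Counter(tokens)        # token frequencies, needed later
    word_to_id = {}
    id_to_word = {}
    for word in word_counts:             # iterates in order of first occurrence
        idx = len(word_to_id)
        word_to_id[word] = idx
        id_to_word[idx] = word
    ids = [word_to_id[w] for w in tokens]
    return word_to_id, id_to_word, ids, word_counts

tokens = "the cat sat on the mat".split()
word_to_id, id_to_word, ids, word_counts = build_vocab(tokens)
print(ids)  # [0, 1, 2, 3, 0, 4]
```

Both occurrences of “the” map to the same ID, and the two dictionaries let you move between words and IDs in either direction.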

■ Problem 1. Modify function load_data in the Corpus class to read in the text data and fill in the id_to_word, word_to_id, and full_token_sequence_as_ids fields. You can safely skip the rare word removal and subsampling for now.

4.2 Negative sampling

For a target word, the nearby words in the context form the positive examples for training our prediction model. Rather than train word2vec like a regular multiclass classification model (which uses the softmax function to predict outputs), word2vec uses a small number of randomly-selected words as negative examples. These negative examples are referred to as the negative samples.

The negative samples are chosen using a unigram distribution raised to the 3/4 power: each word is given a weight equal to its frequency (word count) raised to the 3/4 power. The probability of selecting a word is just its weight divided by the sum of weights for all words. The decision to raise the frequency to the 3/4 power is fairly empirical, and this function was reported in their paper to outperform other ways of biasing the negative sampling towards infrequent words. Computing this function each time we sample a negative example is expensive, so one important implementation efficiency is to create a table so that we can quickly sample words. We’ve provided some notes in the code and your job will be to fill in a table that can be efficiently sampled.
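One way to picture the table is as a list in which each word occupies a number of slots proportional to its count raised to the 3/4 power, so that a uniform random draw from the list follows the desired distribution. The sketch below is an assumption-laden illustration: the function name, the word counts, and the tiny table size are invented, and the skeleton code may store IDs rather than words and size its table differently.

```python
import random

def build_negative_sampling_table(word_counts, table_size, power=0.75):
    """Fill a table so each word's share of slots is proportional to count**power."""
    words = list(word_counts)
    weights = [word_counts[w] ** power for w in words]
    total = sum(weights)
    table = []
    for word, weight in zip(words, weights):
        n_slots = max(1, round(table_size * weight / total))  # at least one slot each
        table.extend([word] * n_slots)
    return table

# Hypothetical counts: 81**0.75 = 27, 16**0.75 = 8, 1**0.75 = 1.
word_counts = {"the": 81, "book": 16, "plot": 1}
table = build_negative_sampling_table(word_counts, table_size=36)

rng = random.Random(0)
negative = rng.choice(table)   # sampling is now a single uniform draw
print(table.count("the"), table.count("book"), table.count("plot"))
```

In this toy case “the” makes up about 83% of the raw counts (81 of 98) but only 75% of the table (27 of 36), which shows how the 3/4 power shifts probability towards less frequent words.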

■ Problem 2. Modify function generate_negative_sampling_table to create the negative sampling table.

4.3 Generating the Training Data

Once you have the tokens in place, the next step is to get the training data in place to actually train the model. Say we have the input word “fox” and observed context word “quick”. When training the network on the word pair (“fox”, “quick”), we want the model to predict an output of 1, signalling that this word (“quick”) was present in the context.

With negative sampling, we will randomly select a small number of negative examples (let’s say 2) for each positive example to update the weights for. (In this context, a negative example is one for which we want the network to output a 0.) When updating the model (later), our parameters will be updated based on our current ability to predict 1 for the positive examples and 0 for the negative examples.

To generate the training data, you’ll iterate through all token IDs in the sequence. At each time step, the current token ID will become the target word. You’ll use the window_size parameter to decide how many nearby tokens should be included as positive training examples.

The original word2vec paper says that selecting 5-20 words works well for smaller datasets, and you can get away with only 2-5 words for large datasets. In this assignment, you will update with 2 negative words per context word. This means that if your context window selects four words, you will randomly sample 8 words as negative examples of context words. We recommend keeping the negative sampling rate at 2, but you’re welcome to try changing this and seeing its effect (we recommend doing this after you’ve completed the main assignment).
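The iteration described above could look roughly like the sketch below. It is a simplification under stated assumptions: the flat `(target, context, label)` triples are an invented format (the skeleton specifies its own instance layout), negatives are drawn uniformly from a stand-in table, and no check is made that a sampled negative differs from the true context words.

```python
import random

def generate_instances(token_ids, window_size, num_negatives, negative_table, rng):
    """Build (target, context, label) triples: label 1 for observed pairs, 0 for sampled ones."""
    instances = []
    for i, target in enumerate(token_ids):
        lo = max(0, i - window_size)
        hi = min(len(token_ids), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            instances.append((target, token_ids[j], 1))       # positive example
            for _ in range(num_negatives):                     # negatives per positive
                instances.append((target, rng.choice(negative_table), 0))
    return instances

token_ids = [0, 1, 2, 3, 4]                 # hypothetical 5-token corpus
negative_table = [0, 0, 1, 2, 3, 4]         # stand-in for the real sampling table
rng = random.Random(0)
instances = generate_instances(token_ids, window_size=2, num_negatives=2,
                               negative_table=negative_table, rng=rng)
positives = [inst for inst in instances if inst[2] == 1]
print(len(positives), len(instances))
```

Tokens near the edges of the sequence have fewer than four context words, so the number of instances per target is not constant in this sketch.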

Note: There is one important PyTorch-related wrinkle that you will need to account for, which is described in detail in the code.

■ Problem 3. Generate the list of training instances according to the specifications in the code.

4.4 Define Your word2vec Network

Now that the data is ready, we can define our PyTorch neural network for word2vec. Here, we will not use layers but instead use PyTorch’s Embedding class to keep track of our target word and context word embeddings.

■ Problem 4. Modify the init_weights function to initialize the values in the two Embedding objects based on the size of the vocabulary $|V|$ and the size of the embeddings. Unlike in logistic regression, where we initialized our β vector to be zeros, here we’ll initialize the weights to have small non-zero values centered on zero and sampled from (-init_range, init_range).

The next step is to update the forward function, which takes as input some target word and context words and predicts 0 or 1 for whether each context word was present. Formally, for some target word vector $v_t$ and context word vector $v_c$, word2vec makes its predictions as

$$\sigma(v_t \cdot v_c) \tag{5}$$

where σ is the sigmoid function (like in Homework 1). Word2vec aims to learn parameters (its two embedding matrices) such that this function is maximized for positive examples and minimized for negative examples.
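For intuition, the prediction rule can be evaluated by hand on tiny vectors. This sketch uses plain Python lists and invented 2-dimensional vectors; it is not the batched PyTorch forward function the assignment asks for.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(v_t, v_c):
    """Probability that the context word was present: sigmoid of the dot product."""
    dot = sum(a * b for a, b in zip(v_t, v_c))
    return sigmoid(dot)

v_target = [1.0, 2.0]
print(predict(v_target, [1.0, 2.0]))    # vectors point the same way: close to 1
print(predict(v_target, [-1.0, -2.0]))  # vectors point opposite ways: close to 0
print(predict(v_target, [0.0, 0.0]))    # zero dot product: exactly 0.5
```

Training therefore pushes the vectors of observed (target, context) pairs towards a large positive dot product and those of negative samples towards a large negative one.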

■ Problem 5. Modify the forward function

4.5 Train Your Model

Once you have the data in the right format, you’re ready to train your model! You will need to implement the core training loop like you did in Homework 1, where you iterate over all the instances in a single epoch and potentially train for multiple epochs.

One key difference this time is that you will use batching. In Homework 1 we had a stark contrast between (1) full gradient descent, where a single step required us to compute the gradient with respect to all the data, and (2) stochastic gradient descent, where we take a step based on the prediction error for a single instance. However, there is a middle ground! Often we can improve the gradient by computing it with respect to a few instances instead of just one. Analogously, if you wanted to know whether you were on the right track, it can help to ask a few folks, but you don’t need to ask everyone (and asking just one person could be risky and send you on the wrong track). Batched gradient descent is the same way.

Conveniently, PyTorch works nearly seamlessly with batching. We can tell the DataLoader class our batch size and it will return a random sample of instances of that size. The code you write for the forward function will also work with a batch with no modifications (most of the time). This behavior is even better for us because computers (especially GPUs) are often much faster at larger computations, so doing the forward/backward passes for an entire batch is often just as fast as doing them for a single instance.
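Conceptually, batching amounts to shuffling the instances and handing them out in fixed-size groups. The plain-Python generator below is only a stand-in to show the idea; `iterate_batches` is a hypothetical name, and in the assignment you would rely on PyTorch’s DataLoader rather than write this yourself.

```python
import random

def iterate_batches(instances, batch_size, rng):
    """Yield the instances in random order, `batch_size` at a time."""
    order = list(range(len(instances)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [instances[i] for i in order[start:start + batch_size]]

instances = list(range(10))              # stand-in for real training instances
batches = list(iterate_batches(instances, batch_size=4, rng=random.Random(0)))
print([len(b) for b in batches])         # [4, 4, 2]
```

The last batch can be smaller than the others when the number of instances is not a multiple of the batch size, which is one thing forward-pass code should tolerate.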

Note: One caveat to things just working is that sometimes your forward-pass code will be set up so that it can’t work with batching. The code hints and description in the notebook will hopefully help you avoid these, but we’re also here to support you on Piazza.

In your implementation we recommend starting with these default parameter values:

• batch size = 16 (you can go higher too if your computer supports it, which will speed things up!)

• k = 50 (embedding size)

• η = 5e-5 (learning rate)

• window ±2

• min_token_freq = 5

• epochs = 1

• optimizer = AdamW

You can experiment with other values to see how they affect your results. Your final submission should use a batch size > 1. For more details on the equations and details of word2vec, consult Rong’s paper [Rong, 2014], especially Equations 59 and 61.

■ Problem 6. Modify the cell containing the training loop to complete the required PyTorch training process. The notebook describes all the steps in more detail.

■ Problem 7. Check that your model actually works. We recommend running your code on the reviews-word2vec.med.txt file for one epoch. After this much data, your model should know enough for common words that the nearest neighbors (words with the most similar vectors) to words like “january” will be month-related words. We’ve provided code at the end of the notebook to explore. Try a few examples and convince yourself that your model/code is working.

Once you’re finished here, you’re not yet ready to run everything but you’re close!

4.6 Implement stop-word and rare-word removal

Using all the unique words in your source corpus is often not necessary, especially when considering words that convey very little semantic meaning like “the”, “of”, “we”. As a preprocessing step, it can be helpful to remove any instance of these so-called “stop words”.

Note that when you remove stop words, you should keep track of their position so that the context doesn’t include words outside of the window. For example, in a sentence with “my big cats of the kind that...”, if the target word is “cats” and you have a context window of ±2, then you would only have “my” and “big” as context words (since “of” and “the” get removed) and would not include “kind.”
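The position-keeping rule can be sketched by taking the window over the original token positions first and only then dropping the stop words. The helper name and the two-word stop list below are assumptions for this example.

```python
def context_without_stop_words(tokens, target_index, window, stop_words):
    """Take the window over original positions, then drop stop words inside it."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    return [tokens[j] for j in range(lo, hi)
            if j != target_index and tokens[j] not in stop_words]

tokens = ["my", "big", "cats", "of", "the", "kind", "that"]
stop_words = {"of", "the"}               # hypothetical stop list
context = context_without_stop_words(tokens, target_index=2, window=2,
                                     stop_words=stop_words)
print(context)  # ['my', 'big']
```

Had the stop words been deleted from the sequence before windowing, “kind” and “that” would have slid into the window for “cats”.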

4.6.1 Minimum frequency threshold.

In addition to removing words that are so frequent that they have little semantic value for comparison purposes, it is also often a good idea to remove words that are so infrequent that they are likely very unusual words or words that don’t occur often enough to get sufficient training during SGD. While the minimum frequency can vary depending on your source corpus and requirements, we will set min_count = 5 as the default in this assignment.

Instead of just removing words that had fewer than min_count occurrences, we will replace these all with a unique token <UNK>. In the training phase, you will skip over any input word that is <UNK> but you will still keep these as possible context words.
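A minimal sketch of the replacement step, assuming a plain list of string tokens (the function name and the toy tokens are invented; the skeleton’s load_data has its own structure):

```python
from collections import Counter

def replace_rare_words(tokens, min_count=5, unk="<UNK>"):
    """Replace every word seen fewer than `min_count` times with the <UNK> token."""
    counts = Counter(tokens)
    return [w if counts[w] >= min_count else unk for w in tokens]

tokens = ["good"] * 5 + ["zyzzyva"] + ["book"] * 6   # hypothetical token sequence
cleaned = replace_rare_words(tokens, min_count=5)

# Positions that may serve as target words: everything except <UNK>.
target_positions = [i for i, w in enumerate(cleaned) if w != "<UNK>"]
print(cleaned.count("<UNK>"), len(target_positions))
```

The <UNK> token stays in the sequence, so it can still appear as a context word for its neighbours even though it is never used as a target.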

■ Problem 8. Modify function load_data to convert all words with fewer than min_count occurrences into <UNK> tokens. Modify your dataset generation code to avoid creating a training instance when the target word is <UNK>.

4.6.2 Frequent word subsampling

Words appear with varying frequencies: some words like “the” are very common, whereas others are quite rare. In the current setup, most of our positive training examples will be for predicting very common words as context words. These examples don’t add much to learning since they appear in many contexts. The word2vec library offers an alternative to ensure that contexts are more likely to have meaningful words. When creating the sequence of words for training (i.e., what goes in full_token_sequence_as_ids), the software will randomly drop words based on their frequency so that more common words are less likely to be included in the sequence. This subsampling effectively increases the context window too: because the context window is defined with respect to full_token_sequence_as_ids (not the original text), dropping a nearby common word means the context gets expanded to include the next-nearest word that was not dropped.

To determine whether a token in full_token_sequence_as_ids should be subsampled, the word2vec software uses this equation to compute the probability $p_k(w_i)$ of a token for word $w_i$ being kept for training:

$$p_k(w_i) = \left(\sqrt{\frac{p(w_i)}{0.001}} + 1\right) \cdot \frac{0.001}{p(w_i)} \tag{6}$$

where $p(w_i)$ is the probability of the word appearing in the corpus initially. Using this probability, each occurrence of $w_i$ in the sequence is randomly decided to be kept or removed based on $p_k(w_i)$.
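Plugging a few values into Equation 6 shows how it behaves. The sketch below is illustrative only: the function names are made up, and it assumes $p(w_i)$ is the word’s count divided by the total number of tokens.

```python
import math
import random

def keep_probability(p_word, threshold=0.001):
    """Equation 6: probability of keeping one occurrence of a word with corpus probability p_word."""
    return (math.sqrt(p_word / threshold) + 1.0) * (threshold / p_word)

def subsample(token_ids, keep_prob, rng):
    """Keep each token independently with its word's keep probability."""
    return [t for t in token_ids if rng.random() < keep_prob[t]]

for p in (0.0001, 0.001, 0.01, 0.05):
    print(p, round(keep_probability(p), 4))
```

For rarer words the formula returns values above 1 (for example, exactly 2 when $p(w_i) = 0.001$), which in the `subsample` sketch simply means the word is always kept, since a random draw in [0, 1) is always smaller. A word making up 1% of the corpus is kept about 42% of the time, and one making up 5% only about 16% of the time.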

■ Problem 9. Modify function load_data to compute the probability $p_k(w_i)$ of being kept during subsampling for each word $w_i$.

■ Problem 10. Modify function load_data so that after the initial full_token_sequence_as_ids is constructed, tokens are subsampled (i.e., removed) according to their probability of being kept $p_k(w_i)$.

Figure 1: An example wandb run from the reference solution where the running sum of loss is reported every 100 steps (i.e., the sum of those steps’ loss) across one epoch on the training data. Hovering over any point shows the loss at that time. As you can see, after one epoch the model has learned something but has probably not fully converged!

4.7 Using Weights & Biases

As you might guess, training word2vec on a lot of data can take some time. This waiting process will be increasingly true as you train larger and larger models (not just word2vec). However, the larger PyTorch ecosystem provides some fantastic tools for you, the practitioner, to monitor the progress. In this subtask, you’ll be using one of those tools, Weights & Biases (wandb), that allows you to log how your model is doing and then you can connect to the wandb website and see the plot. Figure 1 shows an example of the wandb plot for our reference implementation after one epoch of training. Here, we’ve just recorded a running sum of the loss every 100 steps. You will want to do the same. This will help you see how quickly your model is converging.

If you train multiple models, wandb will show all of their training plots so you can see how your choice of hyperparameters affects training speed and which model has learned the most (has the lowest loss). In practice, many people use wandb to determine when to stop training after seeing that their model has effectively converged.
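The running-sum bookkeeping is independent of wandb itself, so it can be sketched with a pluggable logging function. Everything here is an assumption for illustration: `log_running_loss`, the metric name `loss_sum`, and the constant fake losses. With wandb installed and a run started via `wandb.init(...)`, one would pass `wandb.log` as `log_fn`.

```python
def log_running_loss(step_losses, log_fn, log_every=100):
    """Sum the per-step losses and report the sum every `log_every` steps."""
    running = 0.0
    for step, loss in enumerate(step_losses, start=1):
        running += loss
        if step % log_every == 0:
            log_fn({"loss_sum": running}, step=step)
            running = 0.0          # start a fresh sum for the next 100 steps

# Stand-in logger that records calls instead of sending them to wandb.
logged = []
def fake_log(data, step=None):
    logged.append((step, data["loss_sum"]))

log_running_loss([0.5] * 250, fake_log, log_every=100)
print(logged)  # [(100, 50.0), (200, 50.0)]
```

In a real training loop the losses would arrive one batch at a time, so the same counter and reset logic would sit inside the loop rather than in a separate function.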

■ Problem 11. Add wandb logging to your training loop so that you keep track of the sum of the losses for the past 100 steps and record the value with wandb. You will need to register for a free wandb account and then log into that account on your computer (e.g., on the command line) so that the wandb library can know how and where to post the results.

4.8 Train Your Final Model

All the pieces are now in place and you can verify the model has learned something. For your final vectors, we’ll have you train on at least one epoch. Before you do that, we’ll have you do one quick exploration to see how batch size impacts training speed.

■ Problem 12. Try batch sizes of 2, 8, 32, 64, 128, 256, 512 to see how fast each step (one batch worth of updates) is and the total estimated time. For this, you’ll set the parameter and then run the training long enough to get an estimate for both, with tqdm wrapped around your batch iterator. You do not need to finish training for the full epoch. Make a plot where batch size is on the x-axis and the tqdm-estimated time to finish one epoch is on the y-axis. (You may want to log-scale one or both of the axes.) You can try other batch sizes too in this plot if you’re curious. In your write up, describe what you see. What batch size would you choose to maximize speed? Side note: You might also want to watch your memory usage, as larger batches can sometimes dramatically increase memory.

■ Problem 13. Train your model on at least one epoch worth of data. You are welcome to change the hyperparameters as you see fit for your final model (although batch size must be > 1). Record the full training process and save a picture of the wandb plot from your training run in your report. We need to see the plot. It will probably look something like Figure 1.

4.9 Optional Exercises

Once you get word2vec working, if you are really curious or excited by word2vec, we’ve included a few optional exercises or extensions you could try out at the very end of the assignment in the notebook. There is no extra credit for these tasks but they will help provide a lot of insight into model building.

5 Task 2: Save Your Outputs

Once you’ve finished training the model for at least one epoch, save your vector outputs. The rest of the homework will use these vectors so you don’t even have to re-run the learning code (until the very last part, but ignore that for now). Task 2 is here just so that you have an explicit reminder to save your vectors. We’ve provided a function to do this for you.

6 Task 3: Qualitative Evaluation of Word Similarities

Once you’ve learned the word2vec embeddings from how a word is used in context, we can use them! How can we tell whether what it’s learned is useful? As a part of training, we put in place code that shows the nearest neighbors, which is often a good indication of whether words that we think are similar end up getting similar representations. However, it’s often better to get a more quantitative estimate of similarity. In Task 3, we’ll begin evaluating the model by hand by looking at which words are most similar to another word based on their vectors.

Here, we’ll compare words using the cosine similarity between their vectors. Cosine similarity is the cosine of the angle between two vectors (1 when they point in the same direction, 0 when they are orthogonal), and in our case, words that have similar vectors end up having similar (or at least related) meanings.
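Gensim computes this for you, but the measure itself is short enough to write out. The following is a plain-Python sketch on made-up 2-dimensional vectors, not the notebook’s code.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction: about 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1
```

Because the lengths are divided out, a vector and a scaled copy of it are treated as identical; only direction matters.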

■ Problem 14. Load the model (vectors) you saved in Task 2 by using the Jupyter notebook provided (or code that does something similar) that uses the Gensim package to read the vectors. Gensim has a number of useful utilities for working with pretrained vectors.

■ Problem 15. Pick 10 target words and compute the most similar words for each using Gensim’s function. Qualitatively looking at the most similar words for each target word, do these predicted words seem to be semantically similar to the target word? Describe what you see in 2-3 sentences. Hint: For maximum effect, try picking words across a range of frequencies (common, occasional, rare words).

■ Problem 16. Given the analogy function, find five interesting word analogies with your word2vec model. For example, when representing each word by its word vector, we can write the following equation: king - man + woman = queen. In other words, you can understand the equation as queen - woman = king - man, which means the vector difference between queen and woman is (approximately) equal to the difference between king and man. What kinds of other analogies can you find? (NOTE: Any analogies shown in the class recording cannot be used for this problem.) What approaches worked and what approaches didn’t? Write 2-3 sentences in a cell in the notebook.

7 Task 4: Using Word Vectors with Attention for Classification

Once you have completed all other steps, only then start on Task 4!

Hopefully Task 3 has shown you that your word vectors have learned something. But what exactly do we do with the vectors? In Task 4, you’ll try using your vectors in a downstream task: classifying documents.

Before we get to the details, let’s think about some of the logistics for how a person might use word vectors in classification by comparing with what you did in Homework 1 with a bag of words (BoW) representation. In the BoW representation, you have a fixed-length vector that represents which words are in the document. Even if you extended these features to include bigrams or other kinds of features like who is the author, the vector length would still stay the same if we added more text to the document: adding more words to a document only increases the counts in the document’s BoW vector.

What might we do if we have word vectors instead of word counts? Well, one way to think of a simple BoW vector is as a sum of the one-hot vectors of the words in the document (e.g., if a word appears seven times, we’d sum its one-hot vector to get a value of 7 in that word’s index in the BoW vector). We might take an analogous approach to working with word vectors. To represent a document, we could take the sum of the word vectors. This would give us a fixed-length vector! More words in the document means we just add them to the sum, but the vector length stays the same! In practice, most approaches take the average of the word vectors to get a sense of “what kind of content is in this document?” This can work well in practice (as you might see later).
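Averaging can be sketched in a few lines. The 2-dimensional vectors and the choice to skip out-of-vocabulary words are assumptions made for this example; your trained vectors will have k dimensions and the notebook may handle unknown words differently.

```python
def average_vector(doc_tokens, vectors):
    """Average the vectors of the in-vocabulary words in a document."""
    known = [vectors[w] for w in doc_tokens if w in vectors]
    if not known:
        return None                      # no known words, so no representation
    k = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(k)]

# Hypothetical 2-dimensional word vectors.
vectors = {"great": [1.0, 0.0], "book": [0.0, 1.0]}

short_doc = average_vector(["great", "book"], vectors)
long_doc = average_vector(["great", "great", "book", "book", "unseenword"], vectors)
print(short_doc, long_doc)
```

Both documents end up with a vector of the same length regardless of how many words they contain, which is what a downstream classifier needs.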

Using the average word vector to represent a document is promising but also seems a bit flawed when we think of which kinds of words are contributing to the vector. Why should the vector for “the” contribute just as much as the vector for “amazing”? In our bag-of-words representation, we tried to mitigate this by re-weighting the BoW vector with techniques like TF-IDF (note: you didn’t do this in Homework 1 but we briefly talked about it in class). We could try doing something similar with our word vectors but this raises a question: which kind of re-weighting should we use for our classification task? How do we know which words to weight more or less? The answer is at the heart of new approaches to deep learning: let’s learn the weighting!