Word Embedding: GloVe

Sahil · Feb 19, 2021


Hi guys! In this blog, I would like to share my notes on the research paper ‘GloVe: Global Vectors for Word Representation’.

I went through the whole paper and found that it really helps to have some prior knowledge of maths (a bit of number theory shows up in the complexity analysis) and of deep learning concepts.

Some of the equations may look confusing at first, since they come from number theory. If you have a strong intuition for number theory, you will feel at ease; otherwise, I have added a link wherever such a part appears, so just grasp what it does and proceed.

In short, you need some grasp of maths and deep learning before proceeding.

1. Introduction

  • Semantic vector space models of language represent each word with a real-valued vector.
  • These vectors can be used as features in a variety of applications (such as information retrieval, document classification, question answering, named entity recognition and parsing).
  • Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of such a set of word representations.
  • As I explained in Part 1 (under the Goal section, 2nd point), the quality of the resulting vector representations is evaluated with the expectation that similar words are not only close to each other but also exhibit multiple degrees of similarity [1]. For example, the analogy “king is to queen as man is to woman” should be encoded in the vector space by the vector equation king − queen = man − woman.

1.1 Drawbacks

The two main model families for learning word vectors are global matrix factorization methods (like LSA) and local context window methods (like the skip-gram model). Both families suffer significant drawbacks.

  • While methods like LSA (Latent Semantic Analysis) efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure.
  • Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

1.2 Objective

  • Analyze the model properties necessary to produce linear directions of meaning and argue that global log-bilinear regression models are appropriate for doing so.
  • Propose a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics.
  • Demonstrate that this paper’s methods outperform other current methods on several word similarity tasks, and also on a common named entity recognition (NER) benchmark.

2. Related Work

(Don’t focus too much on this section; just get a rough idea of it, or skip it if you like. This blog is mainly meant to help you understand the state-of-the-art model.)

2.1 Matrix Factorization Methods

  • Why was matrix factorization used in the first place? To decompose large matrices that capture statistical information about a corpus.
  • In LSA, the matrices are of “term-document” type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus.
  • The Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996), for example, utilizes matrices of “term-term” type (known as co-occurrence matrix), i.e., the rows and columns correspond to words and the entries correspond to the number of times a given word occurs in the context of another given word.
  • A main problem with HAL and related methods is that the most frequent words contribute a disproportionate amount to the similarity measure: for example, the number of times two words co-occur with the or and will have a large effect on their similarity, even though it says relatively little about their semantic relatedness.
  • A number of techniques were proposed to address this shortcoming of HAL, such as COALS, in which the co-occurrence matrix is first transformed by an entropy- or correlation-based normalization. Later on, a variety of other methods were proposed as well, such as positive pointwise mutual information (PPMI) and Hellinger PCA (HPCA).

2.2 Shallow Window-Based Methods

  • The idea here is to learn word representations that aid in making predictions within local context windows.
  • For example, Bengio et al. (2003) introduced a model that learns word vector representations as part of a simple neural network architecture for language modeling. (which I already summarized in Part 1, under Model Architecture, section 3.1)
  • The skip-gram and continuous bag-of-words (CBOW) models of Mikolov et al. (2013a)[2] propose a simple single-layer architecture based on the inner product between two word vectors.
  • Mnih and Kavukcuoglu (2013) also proposed closely related vector log-bilinear models, vLBL and ivLBL.
  • Levy et al. (2014) proposed explicit word embeddings based on a PPMI metric.
  • In the skip-gram and ivLBL models, the objective is to predict a word’s context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context. Through evaluation on a word analogy task, these models demonstrated the capacity to learn linguistic patterns as linear relationships between the word vectors.
  • Unlike the matrix factorization methods, the shallow window-based methods suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus. Instead, these models scan context windows across the entire corpus, which fails to take advantage of the vast amount of repetition in the data.

3. The GloVe Model

Before proceeding, you need to know how a co-occurrence matrix is calculated in NLP; the notation below (and the small sketch that follows it) covers the idea.

  • Where does the name GloVe come from? The authors construct a new model for word representation which they call GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model.

3.1 Notation

Let the matrix of word-word co-occurrence counts be denoted by X.

X_ij denotes the number of times word j occurs in the context of word i.
X_i = Σ_k X_ik denotes the number of times any word appears in the context of word i.
P_ij = P(j|i) = X_ij / X_i denotes the probability that word j appears in the context of word i.
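To make the notation concrete, here is a minimal toy sketch (not the paper's code; the helper name, the symmetric three-word window, and the example sentence are my own choices) that builds X and then computes X_i and P_ij from it.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=3):
    """Build the word-word co-occurrence counts X_ij from a list of tokens."""
    X = defaultdict(float)
    for center, word_i in enumerate(tokens):
        lo, hi = max(0, center - window), min(len(tokens), center + window + 1)
        for pos in range(lo, hi):
            if pos != center:
                X[(word_i, tokens[pos])] += 1.0   # tokens[pos] occurs in the context of word_i
    return X

tokens = "ice is a solid and steam is a gas".split()
X = cooccurrence_counts(tokens)

# X_i = sum_k X_ik : total count of context words around word i
X_i = defaultdict(float)
for (i, j), count in X.items():
    X_i[i] += count

# P_ij = X_ij / X_i : probability that word j appears in the context of word i
P = {(i, j): count / X_i[i] for (i, j), count in X.items()}
print(P[("ice", "solid")], P[("steam", "solid")])
```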

3.2 Example showing how meaning can be extracted from co-occurrence probabilities

  • Suppose we are interested in the concept of thermodynamic phase, for which we might take i = ice and j = steam.
  • The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various probe words, k.
  • For words k related to ice but not steam, say k = solid, we expect the ratio P_ik /P_jk will be large. (Observe the Table 1 below)

P_ik is the probability that the word k = solid appears in the context of i = ice, which should be high because ice is a solid state. It can also be written as P(k=solid | i=ice).

P_jk is the probability that the word k = solid appears in the context of j = steam, which should be low because steam is not a solid state. It can also be written as P(k=solid | j=steam).

So overall, the ratio P_ik/P_jk = P(k=solid | i=ice) / P(k=solid | j=steam) should be large.

Table 1 shows these probabilities and their ratios for a large corpus.
  • The experiments show that, compared to the raw probabilities, the ratio is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion), since water is related to both ice and steam while fashion is related to neither.

3.3 Conclusion

  • The above argument suggests that the appropriate starting point for word vector learning should be with ratios of co-occurrence probabilities rather than the probabilities themselves.
  • Noting that the ratio P_ik /P_jk depends on three words i, j, and k, the most general model takes the form,
Eq. 1: F(w_i, w_j, ˜w_k) = P_ik / P_jk

where w ∈ R^(d) are word vectors and ˜w ∈ R^(d) are separate context word vectors. F may depend on some as-of-yet unspecified parameters (think of it simply as a function). The number of possibilities for F is vast, but by enforcing a few desiderata (i.e., a few required properties) we can select a unique choice.

4. Math formulation

  • First, we would like F to encode the information present in the ratio P_ik / P_jk in the word vector space.
  • Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences. With this aim, we can restrict consideration to those functions F that depend only on the difference of the two target word vectors, modifying Eqn. (1) to,
Eq. 2: F(w_i − w_j, ˜w_k) = P_ik / P_jk
  • Next, note that the arguments of F in Eqn. (2) are vectors while the right-hand side is a scalar.
  • While F could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate (obscure) the linear structure we are trying to capture, which was our objective in the first place. To avoid this issue, we can first take the dot product of the arguments (since, as mentioned in the 2nd point above, vector spaces are inherently linear structures, the dot product is the natural operation),
Eq. 3: F((w_i − w_j)^T ˜w_k) = P_ik / P_jk

which prevents F from mixing the vector dimensions in undesirable ways.

  • Next, note that for word-word co-occurrence matrices, the distinction between a word and a context word is arbitrary and we are free to exchange the two roles. The final model should be invariant under this relabeling, but Eqn. (3) is not; the symmetry can be restored in two steps. First, we require that F be a homomorphism between the groups (R, +) and (R>0, ×), i.e.,
Eq. 4: F((w_i − w_j)^T ˜w_k) = F(w_i^T ˜w_k) / F(w_j^T ˜w_k)

which, by Eqn. (3), is solved by,

Eq. 5: F(w_i^T ˜w_k) = P_ik = X_ik / X_i
  • The solution to Eqn. (4) is F = exp, or,
Eq. 6: w_i^T ˜w_k = log(P_ik) = log(X_ik) − log(X_i)
  • We note that Eqn. (6) would exhibit the exchange symmetry if not for the log(X_i) on the right-hand side. However, this term is independent of k so it can be absorbed into a bias b_i for w_i . Finally, adding an additional bias ˜b_k for ˜w_k restores the symmetry,
Eq. 7: w_i^T ˜w_k + b_i + ˜b_k = log(X_ik)
  • Eqn. (7) is a drastic simplification over Eqn. (1), but it is actually ill-defined, since the logarithm diverges whenever its argument is zero (log(0) goes to −∞), and X has many zero entries.
  • One resolution to this issue is to include an additive shift in the logarithm, log(X_ik) → log(1 + X_ik), which maintains the sparsity of X while avoiding the divergences.
  • A main drawback to this model is that it weighs all co-occurrences equally, even those that happen rarely or never. Such rare co-occurrences are noisy and carry less information than the more frequent ones; yet even just the zero entries account for 75–95% of the data in X, depending on the vocabulary size and corpus.
  • They propose a new weighted least squares regression model that addresses these problems. Casting Eqn. (7) as a least squares problem and introducing a weighting function f(X_ij) into the cost function gives us the model
Eq. 8 (cost function of the model): J = Σ_{i,j=1..V} f(X_ij) (w_i^T ˜w_j + b_i + ˜b_j − log X_ij)^2

where V is the size of the vocabulary.

  • The weighting function f(X_ij) should obey the following properties: f(0) = 0 (and, if f is viewed as a continuous function, it should vanish as x → 0 fast enough that lim_{x→0} f(x) log^2 x is finite); f(x) should be non-decreasing, so that rare co-occurrences are not overweighted; and f(x) should be relatively small for large values of x, so that frequent co-occurrences are not overweighted.
  • There is one class of functions satisfying the above properties that they found to work well, which can be parameterized as,
Eq. 9: f(x) = (x / x_max)^α if x < x_max; f(x) = 1 otherwise
  • The performance of the model depends weakly on the cutoff, which they fix to x_max = 100 for all their experiments. They found that α = 3/4 gives a modest improvement over a linear version with α = 1.
Fig. 1 Weighting function f with α = 3/4.
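To make Eqns. (8) and (9) concrete, here is a minimal NumPy sketch of the weighting function and the cost. It is only an illustration under simplifying assumptions (a small dense X, random vectors, and function names like glove_cost are mine, not the paper's released implementation).

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function of Eqn. (9): (x / x_max)^alpha below the cutoff, 1 above it."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """Cost J of Eqn. (8), summed over the nonzero entries of X."""
    i_idx, j_idx = np.nonzero(X)
    x = X[i_idx, j_idx]
    inner = np.sum(W[i_idx] * W_tilde[j_idx], axis=1)        # w_i^T w~_j for each pair
    residual = inner + b[i_idx] + b_tilde[j_idx] - np.log(x)
    return float(np.sum(f(x, x_max, alpha) * residual ** 2))

# Toy usage: V words, d-dimensional vectors, random co-occurrence counts.
rng = np.random.default_rng(0)
V, d = 10, 5
X = rng.integers(0, 50, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_cost(W, W_tilde, b, b_tilde, X))
```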

5. Relationship to Other Models

  • The starting point for the skip-gram or ivLBL methods is a model Q_ij for the probability that word j appears in the context of word i. For concreteness, let us assume that Q_ij is a softmax,
Eq. 10: Q_ij = exp(w_i^T ˜w_j) / Σ_{k=1..V} exp(w_i^T ˜w_k)
  • They attempt to maximize the log probability as a context window scans over the corpus. Training proceeds in an on-line, stochastic fashion, but the implied global objective function can be written as,
Eq. 11: J = − Σ_{i ∈ corpus} Σ_{j ∈ context(i)} log Q_ij
  • Evaluating the normalization factor of the softmax for each term in this sum is costly. However, the sum in Eqn. (11) can be evaluated much more efficiently if we first group together those terms that have the same values for i and j,
Eq. 12: J = − Σ_{i=1..V} Σ_{j=1..V} X_ij log Q_ij

where they have used the fact that the number of like terms is given by the co-occurrence matrix X.
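The grouping step from Eqn. (11) to Eqn. (12) can be checked numerically. The sketch below (a toy corpus of word indices, random vectors, and a one-word symmetric window; all names are mine, for illustration only) scans the corpus position by position and then re-computes the same objective by weighting each distinct (word, context word) pair with its count X_ij.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 4
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))

def log_Q(i, j):
    """Log of the softmax Q_ij of Eqn. (10)."""
    scores = W[i] @ W_tilde.T
    return scores[j] - np.log(np.sum(np.exp(scores)))

# Toy corpus of word indices and a one-word symmetric context window.
corpus = rng.integers(0, V, size=200)
pairs = [(corpus[t], corpus[t + s]) for t in range(1, len(corpus) - 1) for s in (-1, 1)]

# Eqn. (11): sum over every (word, context word) position as the window scans the corpus.
J_online = -sum(log_Q(i, j) for i, j in pairs)

# Eqn. (12): group identical pairs first, weighting each term by its count X_ij.
X = np.zeros((V, V))
for i, j in pairs:
    X[i, j] += 1
J_grouped = -sum(X[i, j] * log_Q(i, j) for i in range(V) for j in range(V) if X[i, j] > 0)

print(np.isclose(J_online, J_grouped))  # True: the two forms of the objective agree
```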

  • Recalling the notation X_i = Σ_k X_ik and P_ij = X_ij / X_i, we can write X_ij = X_i P_ij.
  • Substituting this into J, we can rewrite it as,
Eq. 13: J = − Σ_{i=1..V} X_i Σ_{j=1..V} P_ij log Q_ij = Σ_{i=1..V} X_i H(P_i, Q_i)

where H(P_i, Q_i) is the cross entropy of the distributions P_i and Q_i, which we define in analogy to X_i.

  • As a weighted sum of cross-entropy error, this objective bears some formal resemblance to “the weighted least squares objective” of Eqn. (8). In fact, it is possible to optimize Eqn. (13) directly as opposed to the on-line training methods used in the skip-gram and ivLBL models.
  • Cross-entropy error is just one among many possible distance measures between probability distributions, and it has the unfortunate property that distributions with long tails are often modeled poorly, with too much weight given to the unlikely events.
  • Furthermore, for the measure to be bounded, the model distribution Q must be properly normalized. This presents a computational bottleneck owing to the sum over the whole vocabulary in Eqn. (10), and it would be desirable to consider a different distance measure that does not require this normalization of Q.
  • A natural choice would be a “least squares objective” in which normalization factors in Q and P are discarded.
Eq. 14: Ĵ = Σ_{i,j} X_i (P̂_ij − Q̂_ij)^2

where P̂_ij = X_ij and Q̂_ij = exp(w_i^T ˜w_j) are the unnormalized distributions.

  • At this stage another problem emerges, namely that X_ij often takes very large values, which can complicate the optimization.
  • An effective remedy is to minimize the squared error of the logarithms of P̂ and Q̂ instead,
Eq. 15: Ĵ = Σ_{i,j} X_i (log P̂_ij − log Q̂_ij)^2 = Σ_{i,j} X_i (w_i^T ˜w_j − log X_ij)^2
  • They observe that while the weighting factor X_i is preordained by the on-line training method inherent to the skip-gram and ivLBL models, it is by no means guaranteed to be optimal. In fact, Mikolov et al. (2013a) observe that performance can be increased by filtering the data so as to reduce the “effective value of the weighting factor for frequent words”.
  • They introduce a more general weighting function, which we are free to take to depend on the context word as well,
Eq. 16: Ĵ = Σ_{i,j} f(X_ij) (w_i^T ˜w_j − log X_ij)^2

which is similar to Eqn. (8).

6. Complexity of the model

  • The computational complexity of the model depends on the number of nonzero elements in the matrix X.
  • As this number is always less than the total number of entries of the matrix, the model scales no worse than O(|V|^2).
  • At first glance this might seem like a substantial improvement over the shallow window-based approaches (like skip-gram), which scale with the corpus size, |C|. However, typical vocabularies have hundreds of thousands of words, so that |V|^2 can be in the hundreds of billions, which is actually much larger than most corpora.
  • In order to make any concrete statements about the number of nonzero elements in X, it is necessary to make some assumptions about the distribution of word co-occurrences.
  • In particular, they assumed that the number of co-occurrences of word i with word j, X_ij, can be modeled as a “power-law” function of the frequency rank of that “word pair”, r_ij.
Eq. 17 (power-law function of the frequency rank of the word pair): X_ij = k / (r_ij)^α
  • The total number of words in the corpus is proportional to the sum over all elements of the cooccurrence matrix X,
Eq. 18: |C| ∝ Σ_{ij} X_ij = Σ_{r=1..|X|} k / r^α = k H_{|X|,α}

where they have rewritten the last sum in terms of the generalized harmonic number H_{n,m}. The upper limit of the sum, |X|, is the maximum frequency rank, which coincides with the number of nonzero elements in the matrix X. This number, |X|, is also equal to the maximum value of r in Eqn. (17) such that X_ij ≥ 1, i.e., |X| = k^(1/α). Eqn. (18) can therefore be rewritten as,

Eq. 19: |C| ∝ |X|^α H_{|X|,α}
  • We are interested in how |X| is related to |C| when both numbers are large; therefore we are free to expand the right hand side of the equation for large |X|. We use the expansion of generalized harmonic numbers
Eq. 20 (Apostol, 1976): H_{x,s} = x^(1−s) / (1−s) + ζ(s) + O(x^(−s)), for s > 0, s ≠ 1

giving,

Eq. 21: |C| ∝ |X| / (1−α) + ζ(α) |X|^α + O(1)

where ζ (s) is the Riemann zeta function.
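As a quick sanity check on Eqn. (20) (my own addition, not something from the paper), the expansion can be compared against a brute-force sum; scipy's zeta gives ζ(s), and the test values of x and s below are arbitrary.

```python
from scipy.special import zeta

def harmonic(x, s):
    """Generalized harmonic number H_{x,s} = sum over r = 1..x of r^(-s)."""
    return sum(r ** (-s) for r in range(1, x + 1))

x, s = 10_000, 1.25
approx = x ** (1 - s) / (1 - s) + zeta(s)   # leading terms of Eqn. (20)
print(harmonic(x, s), approx)               # the two agree to within O(x^(-s))
```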

  • In the limit that X is large, only one of the two terms on the right hand side of Eqn. (21) will be relevant, and which term that is depends on whether α > 1,
Eq. 22: |X| = O(|C|) if α < 1; |X| = O(|C|^(1/α)) if α > 1
  • For the corpora studied in this article, X_ij is well-modeled by Eqn. (17) with α = 1.25. In this case we have that |X| = O(|C|^0.8). Therefore the complexity of the model is much better than the worst case O(|V|^2), and in fact it does somewhat better than the on-line window-based methods, which scale like O(|C|).

7. Corpora and training details

7.1 The model is trained on five corpora of varying sizes:

  • A 2010 Wikipedia dump with 1 billion tokens
  • A 2014 Wikipedia dump with 1.6 billion tokens
  • Gigaword 5 which has 4.3 billion tokens
  • The combination Gigaword5 + Wikipedia2014, which has 6 billion tokens
  • 42 billion tokens of web data, from Common Crawl (For the model trained on Common Crawl data, we use a larger vocabulary of about 2 million words.)

7.2 Pre-step taken

  • Tokenize and lowercase each corpus with the Stanford tokenizer.
  • Build a vocabulary of the 400,000 most frequent words.
  • Construct a matrix of cooccurrence counts X. In constructing X, we must choose how large the context window should be and whether to distinguish left context from right context.
  • In all cases we use a decreasing weighting function, so that word pairs that are “d” words apart contribute “1/d” to the total count. This is one way to account for the fact that very distant word pairs are expected to contain less relevant information about the words’ relationship to one another.
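Here is a small sketch of that 1/d decreasing weighting (a symmetric window, with the function name and the toy sentence being mine, purely for illustration).

```python
from collections import defaultdict

def weighted_cooccurrence(tokens, window=10):
    """Co-occurrence counts where a pair of words d positions apart contributes 1/d."""
    X = defaultdict(float)
    for center, word in enumerate(tokens):
        for d in range(1, window + 1):
            if center + d < len(tokens):
                X[(word, tokens[center + d])] += 1.0 / d   # right context
                X[(tokens[center + d], word)] += 1.0 / d   # symmetric left context
    return X

X = weighted_cooccurrence("the ice melted into water while the steam rose".split())
print(X[("ice", "melted")], X[("ice", "water")])  # 1.0 vs 1/3: closer pairs count more
```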

7.3 Experiments

  • Set x_max = 100 and α = 3/4
  • Train the model using AdaGrad
  • Stochastically sampling nonzero elements from X (i.e. co-occurrence matrix)
  • Initial learning rate of 0.05
  • We run 50 iterations for vectors smaller than 300 dimensions, and 100 iterations otherwise
  • Use a context of ten words to the left and ten words to the right.
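Putting Section 7.3 together, here is a compact and deliberately simplified training-loop sketch: AdaGrad updates over stochastically sampled nonzero entries of X, with x_max = 100, α = 3/4 and an initial learning rate of 0.05. This is my own re-implementation for illustration, not the released GloVe code, and the constant factor of 2 in the gradient is folded into the learning rate.

```python
import numpy as np

def train_glove(X, d=50, iters=50, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Fit GloVe vectors on a dense co-occurrence matrix X with AdaGrad updates."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    W, Wt = rng.normal(scale=0.1, size=(V, d)), rng.normal(scale=0.1, size=(V, d))
    b, bt = np.zeros(V), np.zeros(V)
    # One accumulated squared-gradient buffer per parameter block (AdaGrad).
    gW, gWt = np.full_like(W, 1e-8), np.full_like(Wt, 1e-8)
    gb, gbt = np.full_like(b, 1e-8), np.full_like(bt, 1e-8)
    nonzero = list(zip(*np.nonzero(X)))
    for _ in range(iters):
        rng.shuffle(nonzero)                       # stochastic pass over nonzero entries of X
        for i, j in nonzero:
            x = X[i, j]
            fx = min((x / x_max) ** alpha, 1.0)    # weighting function of Eqn. (9)
            err = W[i] @ Wt[j] + b[i] + bt[j] - np.log(x)
            grad_wi, grad_wj = fx * err * Wt[j], fx * err * W[i]
            gW[i] += grad_wi ** 2; gWt[j] += grad_wj ** 2
            gb[i] += (fx * err) ** 2; gbt[j] += (fx * err) ** 2
            W[i] -= lr * grad_wi / np.sqrt(gW[i])
            Wt[j] -= lr * grad_wj / np.sqrt(gWt[j])
            b[i] -= lr * fx * err / np.sqrt(gb[i])
            bt[j] -= lr * fx * err / np.sqrt(gbt[j])
    return W + Wt                                  # Section 7.4: use the sum of the two sets

# Toy usage on a random co-occurrence matrix.
X_toy = np.random.default_rng(2).integers(0, 20, size=(30, 30)).astype(float)
vectors = train_glove(X_toy, d=10, iters=5)
print(vectors.shape)
```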

7.4 What does the model give?

  • The model generates two sets of word vectors, W and ˜W .
  • When X is symmetric, W and ˜W are equivalent and differ only as a result of their random initializations; the two sets of vectors should perform equivalently.
  • There is evidence that for certain types of neural networks, training multiple instances of the network and then combining the results can help reduce overfitting and noise and generally improve results.
  • With this in mind, we choose to use the sum W +˜W as our word vectors. Doing so typically gives a small boost in performance, with the biggest increase in the semantic analogy task.

8. Result

  • The GloVe model performs significantly better than the other baselines, often with smaller vector sizes and smaller corpora.
  • Our results using the word2vec tool are somewhat better than most of the previously published results. This is due to a number of factors, including our choice to use “negative sampling” (which typically works better than the hierarchical softmax), the number of negative samples, and the choice of the corpus.
  • We note that increasing the corpus size does not guarantee improved results for other models, as can be seen by the decreased performance of the SVD-L model on this larger corpus.

For the SVD baselines (matrix-factorization-based methods), we generate a truncated matrix X_trunc which retains the information of how frequently each word occurs with only the top 10,000 most frequent words. The singular vectors of this matrix constitute the baseline “SVD”.

We also evaluate two related baselines: “SVD-S”, in which we take the SVD of sqrt(X_trunc), and “SVD-L”, in which we take the SVD of log(1 + X_trunc).

(SG†) denotes skip-gram and (CBOW†) denotes continuous bag-of-words.

(CBOW∗ ), last row in Table 3, denotes the vectors available on the word2vec website that are trained with word and phrase vectors on 100B words of news data.

  • Table 3 shows results on five different word similarity datasets.
  • A similarity score is obtained from the word vectors by first normalizing each feature across the vocabulary and then calculating the cosine similarity. We compute Spearman’s rank correlation coefficient between this score and the human judgments (a small sketch of this scoring follows this list).
  • Table 4 shows results on the NER task with the CRF-based model.
  • The GloVe model outperforms all other methods on all evaluation metrics, except for the CoNLL test set, on which the HPCA method does slightly better.
  • We conclude that the GloVe vectors are useful in downstream NLP tasks.
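For the similarity evaluation described above, a sketch like the following captures the idea. My reading of "normalizing each feature across the vocabulary" as a per-column standardization is an assumption, and the vocabulary, vectors and human scores below are invented purely to make the snippet runnable.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_scores(vectors, vocab, pairs):
    """Cosine similarity between word pairs after normalizing each feature across the vocabulary."""
    V = (vectors - vectors.mean(axis=0)) / vectors.std(axis=0)   # normalize each feature column
    V = V / np.linalg.norm(V, axis=1, keepdims=True)             # unit-length rows for cosine
    index = {w: k for k, w in enumerate(vocab)}
    return [float(V[index[a]] @ V[index[b]]) for a, b in pairs]

# Invented data: tiny vocabulary, random vectors, made-up human judgments.
vocab = ["ice", "steam", "solid", "gas", "fashion"]
vectors = np.random.default_rng(3).normal(size=(len(vocab), 8))
pairs = [("ice", "solid"), ("steam", "gas"), ("ice", "fashion")]
human_scores = [7.9, 7.5, 1.2]
rho, _ = spearmanr(similarity_scores(vectors, vocab, pairs), human_scores)
print(rho)   # Spearman's rank correlation with the human judgments
```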

References

[1] Linguistic Regularities in Continuous Space Word Representations

[2] Efficient Estimation of Word Representations in Vector Space
