Text Feature Extraction (3/3): Word Embeddings Model

Shachi Kaul · Published in Geek Culture · Jan 3, 2022 · 6 min read


In the previous parts of the Text Feature Extraction series, we saw how text data can be turned into numerical features for NLP models, and we discussed the feature extraction techniques BagOfWords and TF-IDF. Here we will learn about another technique called Word Embeddings, which has become a crucial and interesting approach with a variety of uses in building text models. Let’s find out more!

Word Embeddings

Aim

Word Embeddings aim to capture the semantics and contextual information of text, which wasn’t possible with BOW and TF-IDF.

Detailed Study

  • Intuition: Words that appear in similar contexts tend to have similar embeddings and are located closer to each other in the vector space.
  • What?
    - A technique to convert text into real-valued vectors in a low-dimensional space. These low-dimensional vectors are called embeddings.
    - Word Embeddings, aka Word Vectors, are used to represent words.
    - Each word is represented as a numerical vector of a fixed dimension. These numbers preserve the similarity to other word vectors and thus indirectly store semantic information.
    Eg, Girl = [ ..,..,..] and Woman = [..,..,..] are closely related.
  • Mapping of words into vector space
  • Since words are now vectors, arithmetic operations become possible on text.
    Eg, King − Boy + Girl = Queen
    The King vector (royal + male) minus Boy (ordinary + male) plus Girl (ordinary + female) results in Queen (royal + female).
  • Let’s see what such a vector looks like.
    It is a vector of a particular dimension, say 300 dimensions when using the large English model (en_core_web_lg). These 300 numbers are possible features of a word in English; each feature is a trait of that word that is not human-interpretable but learned by the ML model, so we can’t say exactly what each one describes.
    Eg, for Queen, a 300-D vector where the numbers represent the magnitude of each feature, as shown in Fig-1 (see also the sketch after this list).
Fig-1: 300-dimensional word vector for “Queen”
  • Why did Word Embeddings come about?
    Let’s bring up a scenario:
    Vocabulary size = 5000 words
    The word Queen sits at the 5th index of the vocabulary. In BOW and TF-IDF, text is converted into vectors using word counts, so:
    Queen = [0,0,0,0,1,0..00..00..0] with 5000-D
    This results in a sparse vector with a non-zero value at just one position. Thus came Word Embeddings.
    So the exact reasons:
    1. Sparsity: Word Embeddings use real-valued vectors of a specific feature dimension instead of mostly 0s, and the embedding vector is much smaller.
    2. Semantics & context: The embedding is learned by considering the context words around a target word, which wasn't possible in TF-IDF and the earlier models.
    3. Relationships and similarity: Vectors allow arithmetic operations, which makes it easier to find similar words.
    4. Each word is treated not as a feature but in fact as a vector of fixed dimension. Thus, algorithms such as LSTM/GRU are less complex and easier to train.
  • Applications: Sentiment Analysis, Speech Recognition, Information Retrieval
  • Utilization of Word Embeddings:
    — Predicting the target/context words
    — Reducing dimensionality
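
To make this concrete, here is a minimal sketch, assuming spaCy and its large English model (en_core_web_lg) are installed, that inspects a 300-dimensional word vector, checks similarity between related words, and tries the King − Boy + Girl arithmetic:

# A minimal sketch, assuming spaCy and en_core_web_lg are installed.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")   # ships with 300-dimensional word vectors

queen = nlp.vocab["queen"].vector
print(queen.shape)                            # (300,) -- dense, unlike a sparse one-hot vector
print(nlp("girl").similarity(nlp("woman")))   # close to 1: related words sit nearby

# King - Boy + Girl should land near "Queen"
target = nlp.vocab["king"].vector - nlp.vocab["boy"].vector + nlp.vocab["girl"].vector
cosine = np.dot(target, queen) / (np.linalg.norm(target) * np.linalg.norm(queen))
print(cosine)                                 # a high cosine similarity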

Word Embeddings Umbrella

Word Embedding is an umbrella term covering pre-trained models (Word2Vec, GloVe) as well as a trainable embedding layer.

Fig-2: The Word Embeddings umbrella

1. Pretrained Model

- Uses pre-trained word vectors to build a model
- These word vectors come from embeddings trained on a huge corpus
- The Gensim library provides pre-trained models and APIs such as Word2Vec, GloVe, FastText, etc. (see the sketch below)
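
As a quick illustration, and assuming the gensim package is installed, pre-trained vectors can be pulled in through Gensim’s downloader API:

# A minimal sketch, assuming gensim is installed; the model downloads on first use.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # 100-d GloVe vectors (Wikipedia + Gigaword)
print(wv["queen"].shape)                   # (100,)
print(wv.most_similar("queen", topn=3))    # nearest neighbours in the vector space
print(wv.similarity("girl", "woman"))      # cosine similarity between two word vectors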

1.1 Word2Vec

  • Google’s neural-network embedding model, which trains a network to learn relations among words from a large corpus of text
  • It was trained on a Google News dataset of about 100 billion words, with a vocabulary of 3 million words
  • It embeds the relations among words into a low-dimensional vector space
  • Produces word embeddings
  • A shallow, 2-layer neural network with an input, a hidden, and an output layer
  • Crux: unlike a typical NN, the aim here is not the predicted probability itself but the trained weights (vectors) of the hidden layer
  • The weights of the hidden layer learned by the model are used as word embedding
Fig-3: Word2Vec neural network architecture
  • It works by predicting the probability of context words given a target word (or vice versa), depending on the Word2Vec variant architecture
Fig-4: Word2Vec variants architecture (CBOW and Skip-gram)
  • After training, we obtain a trained vector for each word in the vocabulary, and similar words end up close to each other in the vector space.
  • The two variants of the W2V model are CBOW (predict the target word from its context) and Skip-gram (predict the context words from the target); a training sketch follows below.
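
A minimal sketch of training your own Word2Vec model with Gensim on a toy corpus (a real model needs a far larger corpus):

# A minimal sketch, assuming gensim is installed; the corpus here is purely illustrative.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "boy", "plays", "in", "the", "garden"],
    ["the", "girl", "plays", "in", "the", "garden"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimension of the hidden layer = dimension of each word vector
    window=2,         # context window around the target word
    min_count=1,      # keep every word in this toy vocabulary
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    epochs=100,
)

print(model.wv["queen"].shape)        # (50,) -- the learned hidden-layer weights for "queen"
print(model.wv.most_similar("king"))  # neighbours ranked by cosine similarity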

1.2 GloVe (Global Vectors)

  • My research turned up a fitting definition, which says —

“GloVe is a count-based, unsupervised learning model that uses co-occurrence statistics at a Global level to model the vector representations of words.”

GloVe is based on ratios of probabilities from the word-word co-occurrence matrix, combining the intuitions of count-based models with the linear substructures captured by prediction-based methods like word2vec.

  • Why named so?
    Because the matrix mathematics considers global-level properties of the whole dataset, whereas Word2Vec models rely on local properties. In other words, Word2Vec derives its statistics from words within a context window, while GloVe builds a co-occurrence matrix over the entire dataset.
  • Main idea
    This word-embedding algorithm, developed by Stanford University researchers, works on the concept of depicting relations between words based on statistics (probabilities and counts) from a co-occurrence matrix.
  • Mathematics in brief
    GloVe uses a co-occurrence matrix where each value represents how often two words appear together. It considers ratios of co-occurrence probabilities, rather than the probabilities themselves, to distinguish between two relevant words.
  • It is trained on the co-occurrence matrix, which signifies how frequently a pair of words occurs together in the corpus.
  • Let's understand some notation and the co-occurrence probabilities.

Probabilities and ratios from a large corpus

The table shows that the probability P of solid in the context of ice is high, but in the context of steam it is low. Since the numerator is larger, the ratio P(solid|ice)/P(solid|steam) is greater than 1; the opposite holds for the word gas, whose ratio is well below 1. A toy sketch of this ratio test follows.
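
To ground the ratio idea, here is a toy sketch with made-up co-occurrence counts (the numbers are illustrative only, not taken from a real corpus):

# A toy sketch of GloVe's co-occurrence ratio idea; all counts below are made up.
cooccurrence = {                      # how often each probe word appears near "ice" / "steam"
    ("ice", "solid"): 190, ("steam", "solid"): 22,
    ("ice", "gas"): 66,    ("steam", "gas"): 780,
    ("ice", "water"): 3000, ("steam", "water"): 2200,
}
totals = {"ice": 100_000, "steam": 100_000}   # total context counts for each word

def p(word, probe):
    """P(probe | word): co-occurrence count normalised by the word's total count."""
    return cooccurrence[(word, probe)] / totals[word]

for probe in ("solid", "gas", "water"):
    ratio = p("ice", probe) / p("steam", probe)
    print(probe, round(ratio, 2))   # >> 1 for solid, << 1 for gas, close to 1 for water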

Check out my GitHub repo implementing GloVe.

2. Embedding Layer

  • Instead of using a pretrained model, you can train your own word embeddings using the Keras Embedding layer.
  • Keras provides an Embedding layer for training neural networks on textual datasets. It is a type of hidden layer, specified right after the input layer, that takes integer-encoded data.
  • Usually, textual data is converted into numbers using some encoding technique (say, one-hot encoding). This creates dummy features for each categorical value, which is not feasible for big text data. Hence, the Embedding layer converts each word into a fixed-length vector, mapping higher-dimensional data into a lower-dimensional vector space.
  • Uses of the Embedding layer:
    1. It can be trained on its own and the learned embeddings reused in another model
    2. It contributes to transfer learning, being used to load a pre-trained embedding model
    3. It can be learned as part of a deep learning model itself
  • Refer to the blog for a brief implementation; a minimal sketch also follows below.
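
Here is a minimal sketch, assuming TensorFlow/Keras is installed and using a tiny, hypothetical integer-encoded dataset, of training an Embedding layer inside a small model and then reading out the learned embedding matrix:

# A minimal sketch, assuming TensorFlow/Keras is installed; the data below is hypothetical.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

vocab_size = 50      # hypothetical vocabulary size
embedding_dim = 8    # each word becomes an 8-dimensional vector

# Toy integer-encoded documents (0 acts as the padding index) and binary labels
X = np.array([[4, 12, 7, 0],
              [9, 3, 0, 0],
              [4, 21, 33, 2],
              [15, 3, 7, 0]])
y = np.array([1, 0, 1, 0])

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),  # learns the word vectors
    GlobalAveragePooling1D(),                                   # average the word vectors per document
    Dense(1, activation="sigmoid"),                             # binary output, e.g. sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, verbose=0)

# The learned embeddings are simply the layer's weight matrix: (vocab_size, embedding_dim)
embeddings = model.layers[0].get_weights()[0]
print(embeddings.shape)   # (50, 8)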

References

https://nlp.stanford.edu/projects/glove/
https://jalammar.github.io/illustrated-word2vec/
https://www.youtube.com/watch?v=Fn_U2OG1uqI

Feel free to follow this author if you liked the blog; I assure you I’ll be back with more interesting ML/AI-related stuff.
Thanks,
Happy Learning! 😄

You can get in touch via LinkedIn.
