Deep Learning Text Representation
Created on: February 4, 2025
Hey, I want you to give me very detailed study notes on the content present below in triple quotes. Also, give me problems with answers for each of the topics present. Make sure you cover everything, because I'm going to use it for exam preparation.
''' 18CSC412 – Deep Learning for Natural Language Processing
Text Representation Techniques and Embeddings
Text and Reference Books TEXT BOOKS
Jurafsky D and Martin J H, “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition”, Prentice Hall, 2009.
REFERENCES
Manning C D and Schutze H, “Foundations of Statistical Natural Language Processing”, First Edition, MIT Press, 1999.
Allen J, “Natural Language Understanding”, Second Edition, Pearson Education, 2003.
The Problem with Text Data
Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers, specifically vectors of numbers.
This is called feature extraction or feature encoding
Word embedding is a means of turning texts into numbers
Vectors for representing words are called embeddings
A Machine Learning Pipeline for Text Classification
Text Representation Techniques vs Embeddings
Text Representation Techniques transform text into numerical features using statistical or count-based approaches. They often lack contextual understanding of words:
One-Hot Encoding
Bag-of-Words
TF-IDF (Term Frequency (TF) – Inverse Document Frequency (IDF))
N-Grams
Embeddings are dense, low-dimensional vector representations of words or text that capture semantic and syntactic information. Unlike count-based methods, embeddings are learned from large corpora using neural networks:
Word2Vec
FastText
GloVe (Global Vectors for Word Representation)
BERT (Bidirectional Encoder Representations from Transformers)
Sentence Embeddings (e.g., Sentence-BERT)
1. One-Hot Encoding
Example vocabulary: [I, ate, an, apple, and, played, the, piano]. The word "I" is at position 1, so its one-hot vector representation would be [1, 0, 0, 0, 0, 0, 0, 0].
The number of words in the source vocabulary signifies the number of dimensions; here we have eight.
Problems with One-Hot Encoding
(i) Not computationally efficient: imagine we have a vocabulary of 50,000 words (there are roughly a million words in the English language). Each word is represented with 49,999 zeros and a single one, and we need 50,000 squared = 2.5 billion units of memory space.
(ii) It is hard to extract meanings: each word is embedded in isolation, and every word contains a single one and N-1 zeros, where N is the number of dimensions. The resulting set of vectors does not say much about one another. If our vocabulary had "orange," "banana," and "watermelon," we can see the similarity between those words, such as the fact that they are types of fruit. But with one-hot vectors, all words are an equal distance apart.
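A minimal Python sketch of one-hot encoding, assuming the eight-word example vocabulary used above; this is a toy illustration, not an efficient implementation:

```python
# A toy one-hot encoder over the eight-word example vocabulary (assumed from the example above).
vocab = ["I", "ate", "an", "apple", "and", "played", "the", "piano"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("I"))      # [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("apple"))  # [0, 0, 0, 1, 0, 0, 0, 0]
```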
2. Bag-of-Words (BoW)
BoW is an unordered set of words with their position ignored, keeping only their frequency in the document.
BoW is a technique that helps us convert text sentences into numeric vectors.
Example: Three movie reviews
Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good
First build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: 'This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'. Take each of these words and mark their occurrence in the three movie reviews above with 1's and 0's.
Vector-of-Review 1: [1 1 1 1 1 1 1 0 0 0 0]
Vector-of-Review 2: [1 1 2 0 1 1 0 1 1 0 0]
Vector-of-Review 3: [1 1 1 0 0 1 0 0 0 1 1]
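A short Python sketch that reproduces these BoW count vectors from the three reviews; tokenization here is a simple lowercase split, which is an assumption (the slides do not specify a tokenizer):

```python
# Rebuild the BoW count vectors for the three movie reviews.
vocab = ["this", "movie", "is", "very", "scary", "and",
         "long", "not", "slow", "spooky", "good"]

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

def bow_vector(text):
    """Count how many times each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]

for review in reviews:
    print(bow_vector(review))
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0]
# [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
```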
Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good
Advantages of the Bag-of-Words (BoW) Representation
Simplicity: The BoW model is easy to understand and implement, making it a great starting point for text processing tasks.
Good for Keyword Detection: Since BoW focuses on the frequency of words, it's effective for tasks where the presence or absence of specific words matters, such as spam detection or topic classification. Example: In a spam detection system, BoW can identify frequent words like "win" or "prize," which are common in spam emails.
Disadvantages of the Bag-of-Words (BoW) Representation
Insensitivity to word order: BoW treats all occurrences of a word as equivalent, regardless of the order in which they appear in a sentence. It cannot capture the relationships between the words in a sentence and the meaning they convey. For example, the BoW representation is the same for two sentences with completely different meanings, such as "The dog bit the man" and "The man bit the dog".
High Dimensionality: If a corpus contains a large number of unique words, the bag-of-words representation will have a high dimensionality. Example: For a dataset with 10,000 unique words, each document is represented by a 10,000-dimensional vector, which leads to sparse matrices and increased computational cost.
Ignoring New Words: Since the BoW vector length is the same as the number of unique words in the corpus, it is difficult to introduce new words, as this requires recomputing the vectors for all documents.
When to Avoid BoW?
Large corpora with diverse vocabularies: When dealing with large text datasets, the dimensionality increases drastically. Consider using more efficient methods like TF-IDF or Word2Vec.
Tasks requiring context: For machine translation, question answering, or chatbot creation, Bag-of-Words fails to capture word order and context. Use embeddings or sequence-based models instead.
The Information Retrieval (IR) Perspective – 3. Term-Document Matrix
The Term-Document Matrix
In IR, we search a collection of N documents. We can represent each word in the vocabulary V as an N-dimensional vector indicating which documents it appears in. Conversely, we can represent each document as a V-dimensional vector indicating which words appear in it.
Finding the most relevant document for a query: queries are also (short) documents. Use the similarity of a query's vector and the documents' vectors to compute which document is most relevant to the query.
Intuition: Documents are similar to each other if they contain the same words.
3. Term-Document Matrix
In a term-document matrix, each row represents a word in the vocabulary and each column represents a document from some collection of documents.
A Term-Document Matrix is a 2D table:
Each cell contains the frequency (count) of the term (word) t in document d: tf_{t,d}
Each column is a vector of counts over words, representing a document
Each row is a vector of counts over documents, representing a word
Vector Space
Informally, a vector is just a list or array of numbers. As You Like It is represented as the list [1,114,36,20] (the first column vector in the figure) and Julius Caesar is represented as the list [7,62,1,2] (the third column vector).
A vector space is a collection of vectors, characterized by their dimension. In the above example, the document vectors are of dimension 4. In real term-document matrices, the vectors representing each document would have dimensionality |V|, the vocabulary size.
We can think of the vector for a document as a point in |V|-dimensional space; the documents above are points in 4-dimensional space. Since 4-dimensional spaces are hard to visualize, the slide's figure shows a visualization in two dimensions (arbitrarily choosing the dimensions corresponding to the words battle and fool).
Term-document matrices were originally defined as a means of finding similar documents for the task of document information retrieval.
Information retrieval (IR) is the task of finding the document d from the D documents in some collection that best matches a query q
For IR, we’ll therefore also represent a query by a vector, also of length |V|, and we’ll need a way to compare two vectors to find how similar they are
Words can be vectors too!
The four dimensions of the vector for fool, [36,58,1,4], correspond to the four Shakespeare plays. Word counts in the same four dimensions are used to form the vectors for the other 3 words: wit, [20,15,2,3]; battle, [1,0,7,13]; and good, [114,80,62,89]. (In the slide's figure, red boxes show that each word is represented as a row vector of length four.)
4. Word-Word Matrix (or Term-Term Matrix)
An alternative to using the term-document matrix to represent words as vectors of document counts is to use the term-term matrix, also called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents. This matrix is thus of dimensionality |V| x |V|.
Each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus
We can create a window around the word, for example of 4 words to the left and 4 words to the right. Each cell then records the number of times the column (context) word occurs in that ±4-word window around the row (target) word.
4. Word-Word Matrix
Row = target word, Column = context word.
Two words are similar in meaning if their context vectors are similar.
Cosine for measuring similarity
To measure the similarity between two target words v and w, we need a metric that takes two vectors and gives a measure of their similarity.
The most common similarity metric is the cosine of the angle between the vectors
The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:
dot product(v, w) = v · w = v_1 w_1 + v_2 w_2 + ... + v_N w_N
The dot product of two vectors provides a scalar value that encapsulates the relationship between the two vectors.
Geometric Interpretation: The dot product measures the degree of alignment between two vectors.
Cosine for measuring similarity
Dot Product as a Similarity Metric: The dot product is high when two vectors have large values in the same dimensions. This means the vectors are aligned in those dimensions, indicating similarity.
Orthogonal Vectors: When vectors are orthogonal (perpendicular), their dot product is 0. This happens because they do not share any alignment in their dimensions. A dot product of 0 represents strong dissimilarity.
The raw dot product, however, has a problem as a similarity metric: it favors long vectors.
The dot product is higher if a vector is longer, with higher values in each dimension
More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them
The raw dot product thus will be higher for frequent words. But we'd like a similarity metric that tells us how similar two words are regardless of their frequency.
Cosine Similarity
Modify the dot product to normalize for the vector length by dividing the dot product by the lengths of each of the two vectors. This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors a and b:
a · b = |a| |b| cos(theta)
Vector length: |a| = sqrt(a_1^2 + a_2^2 + ... + a_N^2)
The cosine similarity metric between two vectors v and w thus can be computed as:
cosine(v, w) = (v · w) / (|v| |w|)
For some applications we pre-normalize each vector, by dividing it by its length, creating a unit vector of length 1
Cosine Similarity
For unit vectors, the dot product between two vectors equals the cosine of the angle between them.
The unit vectors eliminate the effect of magnitude, leaving only the directional relationship. The cosine similarity value ranges between -1 and 1.
Cosine Similarity Range Explanation
The cosine similarity value ranges between -1 and 1:
Cosine Similarity = 1 => The vectors are identical and point in the same direction. Example: two documents containing exactly the same content (D1: "The cat sat on the mat", D2: "The cat sat on the mat").
Cosine Similarity = 0 => The vectors are orthogonal, meaning there is no correlation or similarity between them. Example: two documents about completely unrelated subjects (D1: "I love eating apples", D2: "The Earth orbits the Sun").
Cosine Similarity = -1 => The vectors are opposite in direction, indicating they are completely dissimilar or contradictory. Example: two documents expressing opposing views (D1: "I love summer", D2: "I hate summer").
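A small NumPy sketch of cosine similarity. The three word vectors use illustrative co-occurrence counts over hypothetical context dimensions (pie, computer, data); they are not the exact numbers from the corpus figure referenced on the slides:

```python
import numpy as np

def cosine_similarity(v, w):
    """Dot product normalized by the product of the two vector lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Illustrative co-occurrence counts over three context dimensions (pie, computer, data).
cherry      = np.array([450, 10, 5])
digital     = np.array([5, 1700, 1600])
information = np.array([5, 4000, 3300])

print(cosine_similarity(cherry, information))   # ~0.03: dissimilar
print(cosine_similarity(digital, information))  # ~1.0: very similar
```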
Cosine Similarity - Example
The model decides that "information" is way closer to "digital" than it is to "cherry".
Other possible similarity measures
5. TF-IDF: Weighing Terms in the Vector
Raw frequency (counting how often a word appears) isn't always helpful for understanding connections between words
For Example
Words like "cherry" and "strawberry" might appear together often and tell us something interesting (they are related to fruits)
Words like "the," "it," or "they" also appear frequently but don’t help us understand anything specific because they are common across all kinds of sentences
5. TF-IDF: Weighing Terms in the Vector
Words that occur nearby frequently (maybe "pie" nearby "cherry") are more important than words that only appear once or twice. Words that are too frequent and ubiquitous, like "the" or "is", are unimportant.
How can we balance these two conflicting constraints?
There are two common solutions to this problem:
The tf-idf algorithm, usually used when the dimensions are documents
The PPMI algorithm, usually used when the dimensions are words
The tf-idf algorithm is the product of two terms, each capturing one of these two intuitions:
(i) The first is the term frequency (Luhn, 1957): the frequency of the word t in the document d. We can just use the raw count as the term frequency.
More commonly we squash the raw frequency a bit, by using the log10 of the frequency
Because we can’t take the log of 0, we normally add 1 to the count
5. TF-IDF
With log weighting, tf_{t,d} = log10(count(t,d) + 1), so terms which occur:
0 times in a document have tf = log10(1) = 0
10 times in a document, tf = log10(11) = 1.04
100 times, tf = log10(101) = 2.004
1000 times, tf = log10(1001) = 3.0004
and so on.
(ii) The second factor in tf-idf is used to give a higher weight to words that occur only in a few documents. Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection; terms that occur frequently across the entire collection aren't as helpful.
The document frequency df_t of a term t is the number of documents it occurs in; each document is counted only once, even if the term appears multiple times within the same document. Document frequency is not the same as the collection frequency of a term, which is the total number of times the word appears in the whole collection in any document.
Consider, in the collection of Shakespeare's 37 plays, the two words Romeo and action. The words have identical collection frequencies (they both occur 113 times in all the plays) but very different document frequencies, since Romeo only occurs in a single play.
Inverse document frequency (idf) is defined using the fraction N / df_t, where N is the total number of documents in the collection and df_t is the number of documents in which term t occurs. The fewer documents in which a term occurs, the higher this weight.
The lowest weight of 1 is assigned to terms that occur in all the documents.
Because of the large number of documents in many collections, this measure is usually squashed with a log function. The resulting definition for inverse document frequency (idf) is thus:
idf_t = log10(N / df_t)
idf values for some words in the Shakespeare corpus
5. TF-IDF
The tf-idf weighted value w_{t,d} for word t in document d thus combines term frequency tf_{t,d} with idf_t:
w_{t,d} = tf_{t,d} × idf_t
tf: Measures how popular a word is in a single document.
idf: Ensures that common words like "the" or "and" don't dominate, since they appear in all documents.
5. TF-IDF - Example
Imagine we have three short documents:
Doc 1: I love programming
Doc 2: Programming is fun
Doc 3: I love fun activities
Step 1: Calculate Term Frequency (TF). Tokenize the words; TF is the number of times a word appears in a document.
(The slide shows the per-document TF values in a table.)
Step 2: Calculate Inverse Document Frequency (IDF). IDF measures how important a term is across all documents.
(The slide shows the IDF value for each word in a table.)
Step 3: Calculate TF-IDF.
TF-IDF is calculated by multiplying the TF and IDF for each word.
Final TF-IDF Vectors:
Doc 1, Doc 2, Doc 3: (the numeric vectors are shown in a table on the slide).
These TF-IDF vectors represent the documents in a numerical form where each element corresponds to the importance of a term in a document relative to the whole corpus.
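A manual Python sketch that recomputes TF-IDF for these three documents, assuming the log-scaled conventions defined earlier (tf = log10(count + 1), idf = log10(N / df)); the slide's own table may use raw counts, so its exact numbers can differ:

```python
import math

docs = {
    "Doc 1": "I love programming".lower().split(),
    "Doc 2": "Programming is fun".lower().split(),
    "Doc 3": "I love fun activities".lower().split(),
}
vocab = sorted({w for tokens in docs.values() for w in tokens})
N = len(docs)

# Document frequency: number of documents containing each term.
df = {w: sum(w in tokens for tokens in docs.values()) for w in vocab}
idf = {w: math.log10(N / df[w]) for w in vocab}

for name, tokens in docs.items():
    tfidf = {w: round(math.log10(tokens.count(w) + 1) * idf[w], 3) for w in vocab}
    print(name, tfidf)
```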
6. PPMI (Positive Pointwise Mutual Information)
PPMI is an alternative weighting function to tf-idf, used for term-term matrices, when the vector dimensions correspond to words rather than documents. PPMI draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in our corpus than we would have a priori expected them to appear by chance.
Pointwise mutual information is one of the most important concepts in NLP. It is a measure of how often two events x and y occur, compared with what we would expect if they were independent:
PMI(x, y) = log2( P(x, y) / (P(x) P(y)) )
Joint vs Marginal Probabilities
Joint Probability: The probability of two (or more) events happening simultaneously. Notation: P(A∩B) or P(A,B), where A and B are two events or random variables.
Properties: If A and B are independent events, P(A∩B) = P(A)·P(B). Joint probabilities are often represented in tables for discrete random variables or through joint probability density functions (PDFs) for continuous random variables.
Example: Suppose you toss two fair coins. Each coin can land as Heads (H) or Tails (T). The sample space for tossing two coins is: {(H,H),(H,T),(T,H),(T,T)}. Events: let A = the first coin is Heads, and B = the second coin is Tails.
The joint probability P(A∩B) is the probability that the first coin is Heads AND the second coin is Tails. From the sample space, only one outcome satisfies this: (H,T).
Since the coins are fair, the probability of each outcome is 1/4, so P(A∩B) = 1/4.
Marginal Probability: The probability of a single event occurring, irrespective of other events. It is derived by "summing out" or "integrating out" the other variables from the joint probability. Notation: P(A) or P(B).
Example: Tossing Two Coins. Suppose you toss two fair coins. The sample space for the outcomes is: {(H,H),(H,T),(T,H),(T,T)}. Events: A = the first coin shows Heads, B = the second coin shows Heads. To find the marginal probability P(A), we calculate the probability of the first coin showing Heads, regardless of the outcome of the second coin.
There are 4 total possible outcomes: {(H,H),(H,T),(T,H),(T,T)}. Two of them, (H,H) and (H,T), have the first coin showing Heads, so P(A) = 2/4 = 1/2.
The marginal probabilities are obtained by summing the joint probabilities across rows or columns.
(The slide shows a table of joint probabilities and their corresponding marginal probabilities, obtained by summing across rows and columns.)
6. PPMI
PMI between two words (Church & Hanks, 1989): do words word1 and word2 co-occur more than if they were independent?
PMI(word1, word2) = log2( P(word1, word2) / (P(word1) P(word2)) )
The numerator tells us how often we observed the two words together (assuming we compute probability by using the MLE). The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently; recall that the probability of two independent events both occurring is just the product of the probabilities of the two events. Thus, the ratio gives us an estimate of how much more the two words co-occur than we expect by chance.
PMI ranges from −∞ to +∞
Negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous. Imagine w1 and w2 whose probability is each 10^-6; it is hard to be sure p(w1,w2) is significantly different from 10^-12. Furthermore, it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgments. So we just replace negative PMI values by 0.
Positive PMI (PPMI) between word1 and word2:
PPMI(word1, word2) = max( PMI(word1, word2), 0 )
Assume that we have a co-occurrence matrix F with W rows (words) and C columns (contexts), where f_ij is the number of times word w_i occurs in context c_j. Then:
p_ij = f_ij / (sum over all i,j of f_ij)   (the co-occurrence probability of word i with context j)
p_i* = (sum over j of f_ij) / (sum over all i,j of f_ij)   (total co-occurrences of word i across all contexts)
p_*j = (sum over i of f_ij) / (sum over all i,j of f_ij)   (total occurrences of context j across all words)
PPMI_ij = max( log2( p_ij / (p_i* · p_*j) ), 0 )
Numerator (p_ij): the actual co-occurrence probability.
Denominator (p_i* · p_*j): the expected co-occurrence probability if i and j were independent.
PPMI helps capture meaningful associations between words and contexts:
Higher PPMI values indicate that two items co-occur more often than expected by chance; low PPMI values are filtered out to focus only on significant relationships.
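A NumPy sketch of the PPMI computation just described, applied to a small hypothetical word-by-context count matrix (the counts are made up for illustration):

```python
import numpy as np

def ppmi(F):
    """PPMI matrix from a word-by-context count matrix F."""
    total = F.sum()
    p_ij = F / total                              # joint probabilities p_ij
    p_i = F.sum(axis=1, keepdims=True) / total    # word marginals p_i*
    p_j = F.sum(axis=0, keepdims=True) / total    # context marginals p_*j
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0                  # zero counts contribute nothing
    return np.maximum(pmi, 0)                     # clip negative PMI to 0

# Hypothetical counts: rows = target words, columns = context words.
F = np.array([[2.0, 8.0],
              [6.0, 1.0]])
print(ppmi(F))
```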
PMI is biased toward infrequent events: very rare words tend to have very high PMI values. Why is PMI biased toward rare events? The PMI for a word w and a context c is calculated as:
PMI(w, c) = log2( P(w, c) / (P(w) P(c)) )
Example: Consider a streaming platform where we track the movies users watch. You're analyzing two niche movies, "Fantasy Movie A" and "Fantasy Movie B". (The slide lists hypothetical probabilities for each movie and for both being watched together.) PMI measures how surprising it is that these two movies are watched together.
Why do rare events lead to high PMI values?
Small Marginal Probabilities in the Denominator: Rare words w and rare contexts c have small probabilities P(w) and P(c), making the denominator P(w)P(c) very small. Even a single co-occurrence (small P(w,c)) can make the fraction P(w,c) / (P(w)P(c)) disproportionately large.
Two solutions:
(1) Give rare context words slightly higher probabilities (weighted PMI, sketched below)
(2) Use Laplace (add-1) smoothing, which has a similar effect
Weighting PMI: give rare context words slightly higher probability by raising the context counts to a power α before normalizing:
P_α(c) = count(c)^α / (sum over c' of count(c')^α), commonly with α = 0.75
PPMI_α(w, c) = max( log2( P(w, c) / (P(w) P_α(c)) ), 0 )
The slides also compare plain PPMI against add-2 smoothed PPMI. Smoothing helps to handle cases where word pairs never occurred together by adding small values. But if two words still don't seem related (even after smoothing), their relationship score (PPMI) will still be 0.
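A brief sketch of the α-weighted context probability mentioned above; α = 0.75 is the commonly cited value and is assumed here rather than taken from the slide:

```python
import numpy as np

def context_probs(counts, alpha=0.75):
    """P_alpha(c): raise context counts to alpha before normalizing,
    so rare contexts get a slightly larger share than under raw MLE."""
    counts = np.asarray(counts, dtype=float)
    weighted = counts ** alpha
    return weighted / weighted.sum()

counts = np.array([900, 90, 10])      # one frequent context, two rarer ones
print(counts / counts.sum())          # raw MLE:  [0.9  0.09 0.01]
print(context_probs(counts))          # smoothed: roughly [0.82, 0.15, 0.03]
```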
Applications of tf-idf & PPMI vector models
Information Retrieval (IR): TF-IDF is widely used in search engines to rank documents based on their relevance to a query. Documents are represented as vectors in a high-dimensional space, and the similarity between the query and documents is computed using cosine similarity. TF-IDF helps weigh terms that are important for identifying relevance while reducing the impact of common words (e.g., "the", "and").
Word Similarity and Lexical Semantics: PPMI vectors are useful for measuring semantic similarity between words. Applications include synonym detection (e.g., finding synonyms in thesaurus construction) and word sense disambiguation, where word meanings are inferred from their contextual co-occurrence patterns.
Distributional Semantics and Word Embeddings: PPMI captures word meanings through co-occurrence statistics, forming the foundation for distributional semantics. It is a precursor to modern embeddings like Word2Vec and GloVe, as these methods often incorporate PMI-based ideas.
Plagiarism Detection: For two TF-IDF document vectors d1 and d2, estimate the similarity between the two documents by cos(d1, d2). Document similarity is also useful for all sorts of applications: information retrieval, plagiarism detection, news recommender systems, and even digital humanities tasks like comparing different versions of a text to see which are similar to each other.
'''
Below is a comprehensive set of study notes derived from the provided content. Each section is followed by a set of practice problems (with solutions) to help you solidify your understanding. The notes are organized in a logical flow, covering all key concepts: why text must be converted to numerical form, traditional text representation techniques, and more sophisticated weighting schemes like TF-IDF and PPMI. Finally, we tie everything back to how these methods are used in applications such as search engines (IR), semantic similarity, and plagiarism detection.
Text Representation Techniques (One-Hot, BoW, TF-IDF, N-Grams) generally use frequency or count-based strategies. They are sparse and high-dimensional and often lack contextual understanding of words.
Embeddings (Word2Vec, FastText, GloVe, BERT, etc.) are dense, low-dimensional vector representations learned from large corpora that capture semantic and syntactic information.
One-Hot Encoding example: with the vocabulary [I, ate, an, apple, and, played, the, piano], the word "I" (position 1) is represented as [1, 0, 0, 0, 0, 0, 0, 0].
Bag-of-Words example: with the vocabulary ['This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'], Review 1 ("This movie is very scary and long") becomes [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0].
Term-Document Matrix: each cell holds tf_{t,d}, the count of term t in document d. Each document is a column vector (e.g., As You Like It is [1, 114, 36, 20]).
Term Frequency (TF) for term t in document d:
tf_{t,d} = log10(count(t, d) + 1)
(Sometimes the raw count or a slightly different log scale is used. The idea is to diminish the effect of repeated word occurrences.)
Inverse Document Frequency (IDF) for term t in a corpus of N documents:
idf_t = log10(N / df_t)
where df_t = the number of documents containing t.
TF-IDF weight:
w_{t,d} = tf_{t,d} × idf_t
Below are practice problems for each major topic. Work through them step-by-step; then review the provided solutions.
Suppose you have a small corpus with 5 unique words: cat, dog, apple, banana, runs. Assign an index to each word and write the one-hot vectors for "dog", "cat", and "runs".
Solution: assign indices as follows:
cat -> index 0
dog -> index 1
apple -> index 2
banana -> index 3
runs -> index 4
dog -> [0, 1, 0, 0, 0]
cat -> [1, 0, 0, 0, 0]
runs -> [0, 0, 0, 0, 1]
You have the following three sentences:
S1: "Enjoy your free pizza now"
S2: "Coupons for free pizza"
S3: "Enjoy your day with fun"
(a) Create a vocabulary from these sentences.
(b) Form the Bag-of-Words vectors for each sentence based on the vocabulary (use simple frequency counts).
["Enjoy", "your", "free", "pizza", "now", "coupons", "for", "day", "with", "fun"]
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
[0, 0, 1, 1, 0, 1, 1, 0, 0, 0]
[1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
Given a collection of 3 documents:
D1: "Chocolate is sweet"
D2: "Vanilla is sweet"
D3: "Chocolate and vanilla"
Construct the term-document matrix (rows = words, columns = documents), using raw frequency counts.
[“chocolate”, “is”, “sweet”, “vanilla”, “and”]
Word | D1 Count | D2 Count | D3 Count |
---|---|---|---|
chocolate | 1 | 0 | 1 |
is | 1 | 1 | 0 |
sweet | 1 | 1 | 0 |
vanilla | 0 | 1 | 1 |
and | 0 | 0 | 1 |
You have the sentence “The cat sat on the mat.” Assume a context window of ±1 word. Construct a simplified word-word co-occurrence matrix for the words [the, cat, sat, on, mat]
Tokenized sentence (lowercased): [the, cat, sat, on, the, mat]. With a ±1 window, the neighbors are:
the (first occurrence) appears next to [cat]
cat appears next to [the, sat]
sat appears next to [cat, on]
on appears next to [sat, the]
the (second occurrence) appears next to [on, mat]
mat appears next to [the]
Row (Target) | the | cat | sat | on | mat |
---|---|---|---|---|---|
the | 0 | 1 | 0 | 1 | 1 |
cat | 1 | 0 | 1 | 0 | 0 |
sat | 0 | 1 | 0 | 1 | 0 |
on | 1 | 0 | 1 | 0 | 0 |
mat | 1 | 0 | 0 | 0 | 0 |
(Counts can vary slightly depending on how you unify “the” occurrences, but this is a typical approach.)
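A Python sketch that builds this ±1-window co-occurrence matrix, merging both occurrences of "the" into a single row and column (matching the solution table above):

```python
from collections import defaultdict

tokens = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat"]
window = 1

counts = defaultdict(int)
for i, target in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            counts[(target, tokens[j])] += 1

for row in vocab:
    print(row, [counts[(row, col)] for col in vocab])
# the [0, 1, 0, 1, 1]
# cat [1, 0, 1, 0, 0]
# sat [0, 1, 0, 1, 0]
# on  [1, 0, 1, 0, 0]
# mat [1, 0, 0, 0, 0]
```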
You have two document vectors, d1 and d2.
Calculate their dot product and their cosine similarity.
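A worked NumPy example with two hypothetical document vectors, chosen here only for illustration since the problem does not pin down specific numbers:

```python
import numpy as np

# Hypothetical document vectors, chosen only for illustration.
d1 = np.array([2, 1, 0, 3])
d2 = np.array([1, 0, 0, 2])

dot = np.dot(d1, d2)                                    # 2*1 + 1*0 + 0*0 + 3*2 = 8
cosine = dot / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(dot, round(float(cosine), 3))                     # 8 0.956
```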
You have 3 documents:
D1: “apple banana apple”
D2: “banana fruit apple”
D3: “apple fruit”
Compute the term frequency (TF) for each word in each document (use log-scaled TF if possible).
Compute the inverse document frequency (IDF) for each word.
Generate the final TF-IDF values.
(Assume logs are base 10, and if a word doesn’t appear in a document, TF = 0.)
(Note that “apple” ended up with IDF=0 because it appears in all documents. This drastically lowers its TF-IDF weight — exactly what TF-IDF is designed to do for common words.)
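A sketch that works this problem with the stated conventions (base-10 logs, tf = log10(count + 1), idf = log10(N / df)); it confirms that "apple" gets idf = 0:

```python
import math

docs = {
    "D1": "apple banana apple".split(),
    "D2": "banana fruit apple".split(),
    "D3": "apple fruit".split(),
}
vocab = ["apple", "banana", "fruit"]
N = len(docs)

df = {w: sum(w in toks for toks in docs.values()) for w in vocab}
idf = {w: math.log10(N / df[w]) for w in vocab}
print({w: round(v, 3) for w, v in idf.items()})
# {'apple': 0.0, 'banana': 0.176, 'fruit': 0.176}  -- apple appears in every document

for name, toks in docs.items():
    tfidf = {w: round(math.log10(toks.count(w) + 1) * idf[w], 3) for w in vocab}
    print(name, tfidf)
# D1: banana 0.053, everything else 0; D2: banana 0.053, fruit 0.053; D3: fruit 0.053
```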
You have a mini-corpus of two words “sun” and “moon” and two contexts “day” and “night.” The co-occurrence counts are:
 | day | night |
---|---|---|
sun | 10 | 1 |
moon | 1 | 8 |
Compute the PPMI value for each (word, context) pair. (Assume base 2 for the log.)
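A sketch that computes the PPMI value for each (word, context) pair in this table, using base-2 logs as stated:

```python
import math

counts = {("sun", "day"): 10, ("sun", "night"): 1,
          ("moon", "day"): 1, ("moon", "night"): 8}
total = sum(counts.values())  # 20

def prob(word=None, context=None):
    """Joint or marginal probability estimated from the count table."""
    return sum(c for (w, ctx), c in counts.items()
               if (word is None or w == word) and (context is None or ctx == context)) / total

for (w, ctx) in counts:
    pmi = math.log2(prob(w, ctx) / (prob(word=w) * prob(context=ctx)))
    print(w, ctx, round(max(pmi, 0.0), 3))
# sun day 0.725, sun night 0 (negative PMI clipped),
# moon day 0, moon night 0.982
```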
This compilation covers: why raw text must be converted into numerical vectors, One-Hot Encoding, Bag-of-Words, Term-Document and Word-Word matrices, cosine similarity, TF-IDF weighting, PPMI, and their applications in information retrieval, word similarity, and plagiarism detection.
With these structured notes and the practice problems, you should be well-prepared to tackle exam questions on Deep Learning for NLP text representation basics. Good luck with your studies!