Deep Learning Text Representation
Created on: February 4, 2025
Hey, I want you to give me very detailed study notes on the content present below in triple quotes.
Also, give me problems with answers for each of the topics present.
Make sure you cover everything, because I'm going to use it for exam preparation.
'''
18CSC412 – Deep Learning for Natural Language Processing
Text Representation Techniques and Embeddings
Text and Reference Books
TEXT BOOKS
Jurafsky D and Martin J H, “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition”, Prentice Hall, 2009.
REFERENCES
Manning C D and Schutze H, “Foundations of Statistical Natural Language Processing”, First Edition, MIT Press, 1999.
Allen J, “Natural Language Understanding”, Second Edition, Pearson Education, 2003.
The Problem with Text Data
Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers. Specifically, vectors of numbers
This is called feature extraction or feature encoding
Word embedding is a means of turning texts into numbers
Vectors for representing words are called embeddings
A Machine Learning Pipeline for Text Classification
Text Representation Techniques Vs Embeddings
Text Representation Techniques transform text into numerical features using statistical or count-based approaches. They often lack contextual understanding of words
One-Hot Encoding
Bag-of-Words
TF-IDF (Term Frequency (TF) – Inverse Document Frequency (IDF))
N-Grams
Embeddings are dense, low-dimensional vector representations of words or text that capture semantic and syntactic information. Unlike count-based methods, embeddings are learned from large corpora using neural networks
Word2Vec
FastText
GloVe (Global Vectors for Word Representation)
BERT (Bidirectional Encoder Representations from Transformers)
Sentence Embeddings (e.g., Sentence-BERT)
1. One-Hot Encoding
Example: for the vocabulary [I, ate, an, apple, and, played, the, piano], the word “I” is at position 1, so its one-hot vector representation would be [1, 0, 0, 0, 0, 0, 0, 0]
The number of words in the source vocabulary signifies the number of dimensions - here we have eight
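A minimal Python sketch of one-hot encoding, using the illustrative eight-word vocabulary above:

```python
# Minimal one-hot encoding sketch for the example vocabulary above.
vocab = ["I", "ate", "an", "apple", "and", "played", "the", "piano"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector of length len(vocab) for a known word."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("I"))      # [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("apple"))  # [0, 0, 0, 1, 0, 0, 0, 0]
```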
Problems with One-Hot Encoding
(i) Imagine we have a vocabulary of 50,000 words. (There are roughly a million words in the English language.)
Each word is represented with 49,999 zeros and a single one, and we need 50,000 squared = 2.5 billion units of memory space
Not computationally efficient
(ii) It is hard to extract meanings
Each word is embedded in isolation, and every word's vector contains a single one and N-1 zeros, where N is the number of dimensions. The resulting set of vectors does not say much about how the words relate to one another
If our vocabulary had “orange,” “banana,” and “watermelon,” we can see the similarity between those words, such as the fact that they are all types of fruit. But with one-hot vectors, all words are equally far apart
2. Bag-of-Words (BoW)
BoW: an unordered set of words with their position ignored, keeping only their frequency in the document
2. Bag-of-Words (BoW)
BoW is a technique that helps us convert text sentences into numeric vectors
Example : Three movie reviews
Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good
First build a vocabulary from all the unique words in the above three reviews
The vocabulary consists of these 11 words:
‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’
Take each of these words and mark their occurrence in the three movie reviews above with 1’s and 0’s
Bag-of-Words (BoW)
Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]
Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]
Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good
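A small Python sketch that reproduces these vectors (tokenization here is a plain whitespace split, which is an assumption; the slide may lowercase or strip punctuation):

```python
# Bag-of-Words sketch for the three movie reviews above.
reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

# Build the vocabulary in order of first appearance.
vocab = []
for review in reviews:
    for word in review.split():
        if word not in vocab:
            vocab.append(word)

def bow_vector(text):
    """Count how many times each vocabulary word occurs in the text."""
    tokens = text.split()
    return [tokens.count(word) for word in vocab]

print(vocab)
for review in reviews:
    print(bow_vector(review))
# Review 2 -> [1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0]
```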
Advantages of the Bag-of-Words (BoW) Representation
Simplicity
The BoW model is easy to understand and implement, making it a great starting point for text processing tasks
Good for Keyword Detection
Since BoW focuses on the frequency of words, it’s effective for tasks where the presence or absence of specific words matters, such as spam detection or topic classification.
Example: In a spam detection system, BoW can identify frequent words like “win” or “prize,” which are common in spam emails
Disadvantages of the Bag-of-Words (BoW) Representation
Insensitivity to word order
It treats all occurrences of a word as equivalent, regardless of the order in which they appear in a sentence
It cannot capture the relationships between the words in a sentence and the meaning they convey
For example, the BoW representation is identical for two sentences with completely different meanings, such as “the dog chased the cat” and “the cat chased the dog”
Disadvantages of the Bag-of-Words (BoW) Representation
High Dimensionality
If a corpus contains large number of unique words, the bag of words representation will have a high dimensionality
Example: For a dataset with 10,000 unique words, each document is represented by a 10,000-dimensional vector, which leads to sparse matrices and increased computational cost
Ignoring New Words
Since the length of a BoW vector equals the number of unique words in the corpus, introducing new words is difficult: it requires recomputing the vectors for all documents
When to Avoid BoW?
Large corpora with diverse vocabularies
When dealing with large text datasets, the dimensionality increases drastically. Consider using more efficient methods like TF-IDF or Word2Vec
Tasks requiring context
For machine translation, question answering, or chatbot creation, Bag of Words fails to capture word order and context. Use embeddings or sequence-based models instead
The Information Retrieval (IR) Perspective – 3. Term-Document Matrix
The Term-Document Matrix
In IR, we search a collection of N documents — We can represent each word in the vocabulary V as an N-dimensional vector indicating which documents it appears in
Conversely, we can represent each document as a V-dimensional vector indicating which words appear in it
Finding the most relevant document for a query
Queries are also (short) documents
Use the similarity of a query’s vector and the documents’ vectors to compute which document is most relevant to the query
Intuition: Documents are similar to each other if they contain the same words
3. Term-Document Matrix
In a term-document matrix
Each row represents a word in the vocabulary and
Each column represents a document from some collection of documents
A Term-Document Matrix is a 2D table:
Each cell contains the frequency (count) of the term (word) t in document d: tf_{t,d}
Each column is a vector of counts over words, representing a document
Each row is a vector of counts over documents, representing a word
3. Term-Document Matrix
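The matrix shown on this slide is not reproduced in the text; it can be reconstructed from the row and column vectors quoted on the later slides (in the textbook's example the four plays are As You Like It, Twelfth Night, Julius Caesar, and Henry V):

| word | As You Like It | Twelfth Night | Julius Caesar | Henry V |
|---|---|---|---|---|
| battle | 1 | 0 | 7 | 13 |
| good | 114 | 80 | 62 | 89 |
| fool | 36 | 58 | 1 | 4 |
| wit | 20 | 15 | 2 | 3 |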
Vector Space
Informally, Vector is just a list or array of numbers
As You Like It is represented as the list [1,114,36,20] (the first column vector in the Figure) and Julius Caesar is represented as the list [7,62,1,2] (the third column vector)
A vector space is a collection of vectors, characterized by their dimension
In the above example, the document vectors are of dimension 4
In real term-document matrices, the vectors representing each document would have dimensionality |V|, the vocabulary size
Vector Space
We can think of the vector for a document as a point in |V|-dimensional space
The documents in the previous slide are points in 4-dimensional space
Since 4-dimensional spaces are hard to visualize, the following figure shows a visualization in two dimensions (arbitrarily choosing the dimensions corresponding to the words battle and fool)
Vector Space
Term-document matrices were originally defined as a means of finding similar documents for the task of document information retrieval
Information retrieval (IR) is the task of finding the document d from the D documents in some collection that best matches a query q
For IR, we’ll therefore also represent a query by a vector, also of length |V|, and we’ll need a way to compare two vectors to find how similar they are
Words can be vectors too!!!
The four dimensions of the vector for fool, [36,58,1,4], correspond to the four Shakespeare plays
Word counts in the same four dimensions are used to form the vectors for the other 3 words: wit, [20,15,2,3]; battle, [1,0,7,13]; and good [114,80,62,89].
The red boxes show that each word is represented as a row vector of length four
4. Word-Word Matrix (or) Term-Term Matrix
An alternative to using the term-document matrix to represent words as vectors of document counts, is to use the term-term matrix, also called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents
This matrix is thus of dimensionality |V| x |V|
Each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus
We can create a window around the word, for example of 4 words to the left and 4 words to the right
The cell represents the number of times the column word occurs in a 4-word window around the row word
4. Word-Word Matrix
In this matrix, each row corresponds to a target word and each column to a context word
Two words are similar in meaning if their context vectors are similar
Cosine for measuring similarity
To measure similarity between two target words v and w, we need a metric that takes two vectors and gives a measure of their similarity
The most common similarity metric is the cosine of the angle between the vectors
The cosine, like most measures of vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:
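The inner product referred to here is the standard definition; for two N-dimensional vectors v and w:

\[
\mathbf{v} \cdot \mathbf{w} \;=\; \sum_{i=1}^{N} v_i w_i \;=\; v_1 w_1 + v_2 w_2 + \cdots + v_N w_N
\]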
Cosine for measuring similarity
The dot product of two vectors provides a scalar value that encapsulates the relationship between the two vectors
Geometric Interpretation
The dot product measures the degree of alignment between two vectors
Cosine for measuring similarity
Dot Product as a Similarity Metric
The dot product is high when two vectors have large values in the same dimensions
This means the vectors are aligned in those dimensions, indicating similarity
Orthogonal Vectors
When vectors are orthogonal (perpendicular), their dot product is 0
This happens because they do not share any alignment in their dimensions
A dot product of 0 represents strong dissimilarity
Cosine for measuring similarity
Raw dot product, however, has a problem as a similarity metric: it favors long vectors
The dot product is higher if a vector is longer, with higher values in each dimension
More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them
The raw dot product thus will be higher for frequent words
But we’d like a similarity metric that tells us how similar two words are regardless of their frequency
Cosine Similarity
Modify the dot product to normalize for the vector length by dividing the dot product by the lengths of each of the two vectors
This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors a and b: a · b = |a| |b| cos θ
Vector length: |v| = √( Σ_{i=1..N} v_i² )
Cosine Similarity
The cosine similarity metric between two vectors v and w thus can be computed as
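The standard formula (normalizing the dot product by the two vector lengths):

\[
\cos(\mathbf{v}, \mathbf{w}) \;=\; \frac{\mathbf{v} \cdot \mathbf{w}}{|\mathbf{v}|\,|\mathbf{w}|}
\;=\; \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}}
\]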
For some applications we pre-normalize each vector, by dividing it by its length, creating a unit vector of length 1
Cosine Similarity
For unit vectors, the dot product between two vectors equals the cosine of the angle between them
The unit vectors eliminate the effect of magnitude, leaving only the directional relationship
The cosine similarity value ranges between -1 and 1
Cosine Similarity - Range Explanation
The cosine similarity value ranges between -1 and 1
Cosine Similarity = 1 => The vectors are identical and point in the same direction
Two documents containing exactly the same content
D1: The cat sat on the mat
D2: The cat sat on the mat
Cosine Similarity = 0 => The vectors are orthogonal, meaning there is no correlation or similarity between them
Two documents about completely unrelated subjects
D1: I love eating apples
D2: The Earth orbits the Sun
Cosine Similarity = -1 => The vectors are opposite in direction, indicating they are completely dissimilar or contradictory
Two documents expressing opposing views
D1: I love summer
D2: I hate summer
Cosine Similarity - Example
The model decides that information is way closer to digital than it is to cherry
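A small Python sketch of this comparison; the co-occurrence counts (along the dimensions pie, data, computer) are illustrative stand-ins, not the exact figures from the slide:

```python
import math

def cosine(v, w):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

# Illustrative co-occurrence counts along the dimensions (pie, data, computer).
cherry      = [450, 10, 5]
digital     = [5, 1700, 1700]
information = [5, 4000, 3300]

print(cosine(cherry, information))   # small value, close to 0
print(cosine(digital, information))  # large value, close to 1
```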
Other possible similarity measures
5. TF-IDF : Weighing terms in the Vector
Raw frequency (counting how often a word appears) isn't always helpful for understanding connections between words
For Example
Words like "cherry" and "strawberry" might appear together often and tell us something interesting (they are related to fruits)
Words like "the," "it," or "they" also appear frequently but don’t help us understand anything specific because they are common across all kinds of sentences
5. TF-IDF : Weighing terms in the Vector
Words that occur together frequently in a context window (for example, pie near cherry) are more important than words that co-occur only once or twice
Words that are too frequent and ubiquitous, like the or is, are unimportant
How can we balance these two conflicting constraints?
There are two common solutions to this problem
The tf-idf algorithm, usually used when the dimensions are documents
The PPMI algorithm, usually used when the dimensions are words
5. TF-IDF : Weighing terms in the Vector
The tf-idf algorithm is the product of two terms, each term capturing one of these two intuitions:
(i) The first is the term frequency (Luhn, 1957): the frequency of the word t in the document d. We can just use the raw count as the term frequency
More commonly we squash the raw frequency a bit, by using the log10 of the frequency
Because we can’t take the log of 0, we normally add 1 to the count: tf_{t,d} = log10(count(t, d) + 1)
5. TF-IDF
If we use log weighting, terms which occur 0 times in a document would have tf = log10(1) = 0
10 times in a document: tf = log10(11) = 1.04
100 times: tf = log10(101) = 2.004
1000 times: tf = log10(1001) = 3.00044
and so on.
5. TF-IDF
(ii) The second factor in tf-idf is used to give a higher weight to words that occur only in a few documents
Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection
Terms that occur frequently across the entire collection aren’t as helpful
The document frequency df_t of a term t is the number of documents it occurs in - each document is counted only once, even if the term appears multiple times within the same document
Document frequency is not the same as the collection frequency of a term, which is the total number of times the word appears in the whole collection in any document
5. TF-IDF
Consider in the collection of Shakespeare’s 37 plays the two words Romeo and action
The words have identical collection frequencies (they both occur 113 times in all the plays) but very different document frequencies, since Romeo only occurs in a single play
Inverse document frequency, or idf, is defined using the fraction N / df_t
where N is the total number of documents in the collection
df_t is the number of documents in which term t occurs
The fewer documents in which a term occurs, the higher this weight
The lowest weight of 1 is assigned to terms that occur in all the documents
5. TF-IDF
Because of the large number of documents in many collections, this measure too is usually squashed with a log function
The resulting definition for inverse document frequency (idf) is thus: idf_t = log10(N / df_t)
idf values for some words in the Shakespeare corpus
5. TF-IDF
The tf-idf weighted value w_{t,d} for word t in document d thus combines term frequency tf_{t,d} with idf: w_{t,d} = tf_{t,d} × idf_t
tf: Measures how popular a word is in a single document
idf: Ensures that common words like "the" or "and" don't dominate, since they appear in all documents
5. TF-IDF - Example
Imagine we have three short documents:
Doc 1: I love programming
Doc 2: Programming is fun
Doc 3: I love fun activities
Step 1: Calculate Term Frequency (TF) - Tokenize the words
TF is the number of times a word appears in a document
5. TF-IDF - Example
Step 2: Calculate Inverse Document Frequency (IDF)
IDF measures how important a term is across all documents
IDF for each word
5. TF-IDF - Example
Step 3: Calculate TF-IDF
TF-IDF is calculated by multiplying the TF and IDF for each word
Final TF-IDF Vectors
These TF-IDF vectors represent the documents in a numerical form where each element corresponds to the importance of a term in a document relative to the whole corpus
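The step-by-step tables from these slides are not reproduced in the text; the following sketch recomputes the example, assuming raw-count TF (as defined in Step 1) and idf = log10(N / df_t). The slide tables may round or normalize slightly differently.

```python
import math

docs = {
    "Doc 1": "I love programming".lower().split(),
    "Doc 2": "Programming is fun".lower().split(),
    "Doc 3": "I love fun activities".lower().split(),
}

# Vocabulary in order of first appearance across the corpus.
vocab = []
for tokens in docs.values():
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)

n_docs = len(docs)
df = {t: sum(1 for tokens in docs.values() if t in tokens) for t in vocab}
idf = {t: math.log10(n_docs / df[t]) for t in vocab}

for name, tokens in docs.items():
    tf = {t: tokens.count(t) for t in vocab}           # raw-count term frequency
    tfidf = [round(tf[t] * idf[t], 3) for t in vocab]  # one weight per vocabulary word
    print(name, tfidf)
```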
6. PPMI (Positive Pointwise Mutual Information)
An alternative weighting function to tf-idf, PPMI
PPMI - is used for term-term-matrices, when the vector dimensions correspond to words rather than documents
PPMI draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in our corpus than we would have a priori expected them to appear by chance
Pointwise mutual information - one of the most important concepts in NLP
It is a measure of how often two events x and y occur, compared with what we would expect if they were independent
Joint Vs Marginal probabilities
Joint Probability
The probability of two (or more) events happening simultaneously
Notation: P(A∩B) or P(A,B), where A and B are two events or random variables
Properties
If A and B are independent events, P(A∩B) = P(A)⋅P(B)
Joint probabilities are often represented in tables for discrete random variables or through joint probability density functions (PDFs) for continuous random variables
Example
Suppose you toss two fair coins. Each coin can land as Heads (H) or Tails (T)
The sample space for tossing two coins is: {(H,H),(H,T),(T,H),(T,T)}
Events: Let A: The first coin is Heads, Let B: The second coin is Tails
Joint Vs Marginal probabilities
The joint probability P(A∩B) is the probability that the first coin is Heads AND the second coin is Tails
From the sample space, only one outcome satisfies this: (H,T)
Since the coins are fair, the probability of each outcome is 1/4, so P(A∩B) = 1/4
Joint Vs Marginal probabilities
Marginal Probability
The probability of a single event occurring, irrespective of other events
It is derived by "summing out" or "integrating out" the other variables from the joint probability
Notation: P(A) or P(B)
Example : Tossing Two Coins
Suppose you toss two fair coins. The sample space for the outcomes is: {(H,H),(H,T),(T,H),(T,T)}
Events: A: The first coin shows Heads, B: The second coin shows Heads
To find the marginal probability P(A), we calculate the probability of the first coin showing Heads, regardless of the outcome of the second coin: P(A) = P(H,H) + P(H,T) = 1/4 + 1/4 = 1/2
Joint Vs Marginal probabilities
There are 4 total possible outcomes: {(H,H),(H,T),(T,H),(T,T)}
The marginal probabilities are obtained by summing the joint probabilities across rows or columns
This table showcases joint probabilities and their corresponding marginal probabilities by summing across rows and columns
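For the two-coin example above, the table can be reconstructed directly (each of the four outcomes has probability 1/4):

| | 2nd coin = H | 2nd coin = T | Marginal P(1st) |
|---|---|---|---|
| 1st coin = H | 1/4 | 1/4 | 1/2 |
| 1st coin = T | 1/4 | 1/4 | 1/2 |
| Marginal P(2nd) | 1/2 | 1/2 | 1 |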
6. PPMI
PMI between two words: (Church & Hanks 1989)
Do words word1 and word2 co-occur more than if they were independent?
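In formula form (the standard definition, with probabilities estimated by MLE from the corpus):

\[
\mathrm{PMI}(w_1, w_2) \;=\; \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}
\]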
The numerator tells us how often we observed the two words together (assuming we compute probability by using the MLE)
The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently; recall that the probability of two independent events both occurring is just the product of the probabilities of the two events
Thus, the ratio gives us an estimate of how much more the two words co-occur than we expect by chance
PMI ranges from −∞ to +∞
6. PPMI
Negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous
Imagine w1 and w2 whose probabilities are each 10^-6
Hard to be sure p(w1,w2) is significantly different from 10^-12
Furthermore, it’s not clear whether it’s even possible to evaluate such scores of ‘unrelatedness’ with human judgments
So we just replace negative PMI values by 0
Positive PMI (PPMI) between word1 and word2
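In formula form (negative values are replaced by zero):

\[
\mathrm{PPMI}(w_1, w_2) \;=\; \max\!\left(\log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)},\; 0\right)
\]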
Assume that we have a co-occurrence matrix F with W rows (words) and C columns (contexts)
f_{ij} is the number of times word w_i occurs in context c_j
6. PPMI
The probabilities are estimated from the counts as follows:
p_{ij} = f_{ij} / (Σ_{i=1..W} Σ_{j=1..C} f_{ij})  (the co-occurrence count of word i and context j, divided by the total co-occurrences across all words and contexts)
p_{i*} = (Σ_{j=1..C} f_{ij}) / (Σ_{i,j} f_{ij})  (the total co-occurrences of word i across all contexts)
p_{*j} = (Σ_{i=1..W} f_{ij}) / (Σ_{i,j} f_{ij})  (the total occurrences of context j across all words)
6. PPMI
PMI_{ij} = log2( p_{ij} / (p_{i*} · p_{*j}) ), and PPMI_{ij} = max(PMI_{ij}, 0)
Numerator (p_{ij}): the actual co-occurrence probability
Denominator (p_{i*} · p_{*j}): the expected co-occurrence probability if i and j were independent
PPMI helps capture meaningful associations between words and contexts:
Higher PPMI values indicate that two items co-occur more often than expected by chance
Low PPMI values are filtered out to focus only on significant relationships
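A compact sketch of this computation in Python on a tiny, made-up co-occurrence matrix (the words, contexts, and counts are illustrative):

```python
import math

# Rows = target words, columns = context words (illustrative counts).
words    = ["cherry", "digital"]
contexts = ["pie", "data"]
F = [
    [20, 1],   # cherry
    [1, 30],   # digital
]

total = sum(sum(row) for row in F)
p_row = [sum(row) / total for row in F]                  # p_{i*}
p_col = [sum(F[i][j] for i in range(len(F))) / total     # p_{*j}
         for j in range(len(F[0]))]

for i, w in enumerate(words):
    for j, c in enumerate(contexts):
        p_ij = F[i][j] / total
        pmi = math.log2(p_ij / (p_row[i] * p_col[j])) if p_ij > 0 else float("-inf")
        ppmi = max(pmi, 0.0)
        print(f"PPMI({w}, {c}) = {ppmi:.3f}")
```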
6. PPMI
PMI is biased toward infrequent events
Very rare words tend to have very high PMI values
Why is PMI biased toward rare events?
The PMI for a word w and a context c is calculated as PMI(w, c) = log2( P(w, c) / (P(w) P(c)) )
6. PPMI
Consider a streaming platform where we track the movies users watch. You’re analyzing two niche movies:
"Fantasy Movie A"
"Fantasy Movie B“
Probabilities (Hypothetical Data)
66
PMI measures how surprising it is that these two movies are watched together:
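The probability table from the slide is not reproduced here; with assumed numbers of the same kind, suppose P(A) = P(B) = 0.001 (each niche movie is watched by 0.1% of users) and P(A, B) = 0.0005 (half of the users who watch A also watch B). Then:
PMI(A, B) = log2( 0.0005 / (0.001 × 0.001) ) = log2(500) ≈ 8.97
a very high value, driven largely by how rare both movies are.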
6. PPMI
Why do rare events lead to high PMI values?
Small Marginal Probabilities in the Denominator
Rare words w and rare contexts c have small probabilities P(w) and P(c), making the denominator P(w)P(c) very small
Even a single co-occurrence (small P(w,c)) can make the fraction P(w,c)/P(w)P(c) disproportionately large
Two solutions
Give rare words slightly higher probabilities
Use add-one smoothing (which has a similar effect)
Weighting PMI: Giving rare context words slightly higher probability
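A common formulation (the one used in the Jurafsky & Martin textbook, following Levy et al. 2015) raises context counts to a power α, typically α = 0.75, which slightly boosts the probability of rare contexts:

\[
\mathrm{PPMI}_{\alpha}(w, c) = \max\!\left(\log_2 \frac{P(w, c)}{P(w)\,P_{\alpha}(c)},\; 0\right),
\qquad
P_{\alpha}(c) = \frac{\mathrm{count}(c)^{\alpha}}{\sum_{c'} \mathrm{count}(c')^{\alpha}}
\]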
Use Laplace (add-1) smoothing
PPMI versus add-2 smoothed PPMI
Smoothing helps to handle cases where word pairs never occurred together by adding small values. But if two words still don’t seem related (even after smoothing), their relationship score (PPMI) will still be 0.
Applications of tf-idf & PPMI vector models
Information Retrieval (IR)
TF-IDF is widely used in search engines to rank documents based on their relevance to a query.
Documents are represented as vectors in a high-dimensional space, and the similarity between the query and documents is computed using cosine similarity
TF-IDF helps weigh terms that are important for identifying relevance while reducing the impact of common words (e.g., "the", "and")
Word Similarity and Lexical Semantics
PPMI vectors are useful for measuring semantic similarity between words
Applications include:
Synonym detection (e.g., finding synonyms in thesaurus construction).
Word sense disambiguation, where word meanings are inferred from their contextual co-occurrence patterns
Applications of the tf-idf or PPMI vector models
Distributional Semantics and Word Embeddings
PPMI captures word meanings through co-occurrence statistics, forming the foundation for distributional semantics.
It's a precursor to modern embeddings like Word2Vec and GloVe, as these methods often incorporate PMI-based ideas.
Plagiarism Detection
For two document vectors d1 and d2, estimate the similarity between the two documents by cos(d1, d2), computed over their TF-IDF weighted vectors
Document similarity is also useful for all sorts of applications: information retrieval, plagiarism detection, news recommender systems, and even digital humanities tasks like comparing different versions of a text to see which are similar to each other
'''
Below is a comprehensive set of study notes derived from the provided content. Each section is followed by a set of practice problems (with solutions) to help you solidify your understanding. The notes are organized in a logical flow, covering all key concepts: why text must be converted to numerical form, traditional text representation techniques, and more sophisticated weighting schemes like TF-IDF and PPMI. Finally, we tie everything back to how these methods are used in applications such as search engines (IR), semantic similarity, and plagiarism detection.
Text Representation Techniques (One-Hot, BoW, TF-IDF, N-Grams) generally use frequency or count-based strategies. They are simple and interpretable, but they produce sparse, high-dimensional vectors and often lack contextual understanding of words.
Embeddings (Word2Vec, FastText, GloVe, BERT, etc.) are dense, low-dimensional vector representations learned from large corpora with neural networks; they capture semantic and syntactic information that count-based methods miss.
Key facts and formulas to remember:
One-Hot Encoding: for the example vocabulary [I, ate, an, apple, and, played, the, piano], the word "I" is represented as [1, 0, 0, 0, 0, 0, 0, 0].
Bag-of-Words: for the vocabulary ['This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'], Review 1 ("This movie is very scary and long") becomes [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0].
Term-Document Matrix: each cell holds the count tf_{t,d} of term t in document d; each column is a document vector (e.g., As You Like It is [1, 114, 36, 20]).
Term Frequency (TF) for term t in document d: TF(t, d) = log10(count(t, d) + 1)
(Sometimes the raw count or a slightly different log scale is used. The idea is to diminish the effect of repeated word occurrences.)
Inverse Document Frequency (IDF) for term t in a corpus of N documents: IDF(t) = log10(N / df_t),
where df_t = number of documents containing t.
TF-IDF weight: w_{t,d} = TF(t, d) × IDF(t)
Below are practice problems for each major topic. Work through them step-by-step; then review the provided solutions.
Suppose you have a small corpus with 5 unique words: [cat, dog, apple, banana, runs]. Assign an index to each word and write the one-hot vectors for "dog", "cat", and "runs".
Solution: cat -> index 0, dog -> index 1, apple -> index 2, banana -> index 3, runs -> index 4
dog: [0, 1, 0, 0, 0]
cat: [1, 0, 0, 0, 0]
runs: [0, 0, 0, 0, 1]
You have the following three sentences:
S1: "Enjoy your free pizza now"
S2: "Free coupons for pizza"
S3: "Enjoy your day with fun"
(a) Create a vocabulary from these sentences.
(b) Form the Bag-of-Words vectors for each sentence based on the vocabulary (use simple frequency counts).
Solution:
Vocabulary: ["Enjoy", "your", "free", "pizza", "now", "coupons", "for", "day", "with", "fun"]
S1: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
S2: [0, 0, 1, 1, 0, 1, 1, 0, 0, 0]
S3: [1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
Given a collection of 3 documents:
D1: "Chocolate is sweet"
D2: "Vanilla is sweet"
D3: "Chocolate and vanilla"
Construct the term-document matrix (rows = words, columns = documents), using raw frequency counts.
Solution:
Vocabulary: ["chocolate", "is", "sweet", "vanilla", "and"]

| Word | D1 | D2 | D3 |
|---|---|---|---|
| chocolate | 1 | 0 | 1 |
| is | 1 | 1 | 0 |
| sweet | 1 | 1 | 0 |
| vanilla | 0 | 1 | 1 |
| and | 0 | 0 | 1 |
You have the sentence "The cat sat on the mat." Assume a context window of ±1 word. Construct a simplified word-word co-occurrence matrix for the words [the, cat, sat, on, mat].
Solution: the token sequence is [the, cat, sat, on, the, mat]. Looking one word to each side of every token (and merging the two occurrences of "the"):
the co-occurs with cat, on, and mat (once each)
cat co-occurs with the and sat
sat co-occurs with cat and on
on co-occurs with sat and the
mat co-occurs with the

| Row (Target) | the | cat | sat | on | mat |
|---|---|---|---|---|---|
| the | 0 | 1 | 0 | 1 | 1 |
| cat | 1 | 0 | 1 | 0 | 0 |
| sat | 0 | 1 | 0 | 1 | 0 |
| on | 1 | 0 | 1 | 0 | 0 |
| mat | 1 | 0 | 0 | 0 | 0 |

(Counts can vary slightly depending on how the two occurrences of "the" are handled and on the exact window definition, but this is a typical approach.)
You have two document vectors, d1 and d2.
Calculate: (a) the dot product d1 · d2, and (b) the cosine similarity between d1 and d2.
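The specific vectors for this problem are not reproduced in the text; as an illustration, assume d1 = [2, 1, 0] and d2 = [1, 1, 1] (hypothetical values). Then:
d1 · d2 = (2)(1) + (1)(1) + (0)(1) = 3
|d1| = √(4 + 1 + 0) = √5 ≈ 2.236, |d2| = √(1 + 1 + 1) = √3 ≈ 1.732
cos(d1, d2) = 3 / (2.236 × 1.732) ≈ 0.77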
You have 3 documents:
D1: “apple banana apple”
D2: “banana fruit apple”
D3: “apple fruit”
Compute the term frequency (TF) for each word in each document (use log-scaled TF if possible).
Compute the inverse document frequency (IDF) for each word.
Generate the final TF-IDF values.
(Assume logs are base 10, and if a word doesn’t appear in a document, TF = 0.)
(Note that “apple” ended up with IDF=0 because it appears in all documents. This drastically lowers its TF-IDF weight — exactly what TF-IDF is designed to do for common words.)
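The intermediate tables for this solution are not shown in the text; a sketch that computes them under the stated convention (log-scaled TF = log10(count + 1), IDF = log10(N / df)):

```python
import math

docs = {
    "D1": "apple banana apple".split(),
    "D2": "banana fruit apple".split(),
    "D3": "apple fruit".split(),
}
vocab = sorted({t for tokens in docs.values() for t in tokens})
N = len(docs)
df = {t: sum(1 for tokens in docs.values() if t in tokens) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}
print("IDF:", {t: round(v, 3) for t, v in idf.items()})

for name, tokens in docs.items():
    tf = {t: math.log10(tokens.count(t) + 1) for t in vocab}  # log-scaled TF
    print(name, {t: round(tf[t] * idf[t], 3) for t in vocab})
# IDF(apple) = log10(3/3) = 0, so apple's TF-IDF weight is 0 in every document.
```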
You have a mini-corpus of two words “sun” and “moon” and two contexts “day” and “night.” The co-occurrence counts are:
| | day | night |
|---|---|---|
| sun | 10 | 1 |
| moon | 1 | 8 |

Compute the PPMI value for each cell of the matrix. (Assume base 2 for the log if needed.)
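Assuming the task is to compute the PPMI value of each cell, a worked solution (all four counts sum to 20):
P(sun, day) = 10/20 = 0.5, P(sun) = 11/20 = 0.55, P(day) = 11/20 = 0.55
PMI(sun, day) = log2(0.5 / (0.55 × 0.55)) = log2(1.65) ≈ 0.73, so PPMI(sun, day) ≈ 0.73
P(moon, night) = 8/20 = 0.4, P(moon) = 9/20 = 0.45, P(night) = 9/20 = 0.45
PMI(moon, night) = log2(0.4 / (0.45 × 0.45)) = log2(1.98) ≈ 0.98, so PPMI(moon, night) ≈ 0.98
PMI(sun, night) = log2(0.05 / (0.55 × 0.45)) ≈ -2.31, so PPMI(sun, night) = 0
PMI(moon, day) = log2(0.05 / (0.45 × 0.55)) ≈ -2.31, so PPMI(moon, day) = 0
The high PPMI cells (sun-day, moon-night) capture the intuitive associations; the cross pairs are clipped to 0.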
This compilation covers one-hot encoding, Bag-of-Words, term-document and word-word matrices, cosine similarity, TF-IDF weighting, PPMI (including smoothing), and the main applications of these vector models in information retrieval, word similarity, and plagiarism detection.
With these structured notes and the practice problems, you should be well-prepared to tackle exam questions on Deep Learning for NLP text representation basics. Good luck with your studies!