Deep Learning Text Representation
Created on: February 4, 2025
Hey, I want you to give me very detailed study notes on the content present below in triple quotes. Also, give me problems with answers for each of the topics present. Make sure you cover everything, because I'm going to use it for exam preparation.
''' 18CSC412 – Deep Learning for Natural Language Processing
Text Representation Techniques and Embeddings
Text and Reference Books TEXT BOOKS
Jurafsky D and Martin J H, “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition”, Prentice Hall, 2009.
REFERENCES
Manning C D and Schutze H, “Foundations of Statistical Natural Language Processing”, First Edition, MIT Press, 1999.
Allen J, “Natural Language Understanding”, Second Edition, Pearson Education, 2003.
The Problem with Text Data
Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers, specifically vectors of numbers.
This is called feature extraction or feature encoding
Word embedding is a means of turning texts into numbers
Vectors for representing words are called embeddings
A Machine Learning Pipeline for Text Classification
Text Representation Techniques vs Embeddings
Text Representation Techniques transform text into numerical features using statistical or count-based approaches. They often lack contextual understanding of words:
One-Hot Encoding
Bag-of-Words
TF-IDF (Term Frequency (TF) – Inverse Document Frequency (IDF))
N-Grams
Embeddings are dense, low-dimensional vector representations of words or text that capture semantic and syntactic information. Unlike count-based methods, embeddings are learned from large corpora using neural networks:
Word2Vec
FastText
GloVe (Global Vectors for Word Representation)
BERT (Bidirectional Encoder Representations from Transformers)
Sentence Embeddings (e.g., Sentence-BERT)
1. One-Hot Encoding
Example vocabulary: [I, ate, an, apple, and, played, the, piano]. The word "I" is at position 1, so its one-hot vector representation would be [1, 0, 0, 0, 0, 0, 0, 0].
The number of words in the source vocabulary signifies the number of dimensions; here we have eight.
Problems with One-Hot Encoding
(i) Not computationally efficient: imagine we have a vocabulary of 50,000 words (there are roughly a million words in the English language). Each word is represented with 49,999 zeros and a single one, and we need 50,000 squared = 2.5 billion units of memory space.
(ii) It is hard to extract meanings: each word is embedded in isolation, and every word contains a single one and N-1 zeros, where N is the number of dimensions. The resulting set of vectors does not say much about one another. If our vocabulary had "orange," "banana," and "watermelon," we can see the similarity between those words, such as the fact that they are types of fruit. But with one-hot vectors, all words are an equal distance apart.
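A minimal Python sketch of one-hot encoding, assuming the eight-word example vocabulary used above; this is a toy illustration, not an efficient implementation:

```python
# A toy one-hot encoder over the eight-word example vocabulary (assumed from the example above).
vocab = ["I", "ate", "an", "apple", "and", "played", "the", "piano"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("I"))      # [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("apple"))  # [0, 0, 0, 1, 0, 0, 0, 0]
```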
2. Bag-of-Words (BoW)
BoW is an unordered set of words with their position ignored, keeping only their frequency in the document.
BoW is a technique that helps us convert text sentences into numeric vectors.
Example: Three movie reviews
Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good
First build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: 'This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'. Take each of these words and mark their occurrence in the three movie reviews above with 1's and 0's.
Vector-of-Review 1: [1 1 1 1 1 1 1 0 0 0 0]
Vector-of-Review 2: [1 1 2 0 1 1 0 1 1 0 0]
Vector-of-Review 3: [1 1 1 0 0 1 0 0 0 1 1]
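A short Python sketch that reproduces these BoW count vectors from the three reviews; tokenization here is a simple lowercase split, which is an assumption (the slides do not specify a tokenizer):

```python
# Rebuild the BoW count vectors for the three movie reviews.
vocab = ["this", "movie", "is", "very", "scary", "and",
         "long", "not", "slow", "spooky", "good"]

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

def bow_vector(text):
    """Count how many times each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]

for review in reviews:
    print(bow_vector(review))
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0]
# [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
```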
Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good
Advantages of the Bag-of-Words (BoW) Representation
Simplicity: The BoW model is easy to understand and implement, making it a great starting point for text processing tasks.
Good for Keyword Detection: Since BoW focuses on the frequency of words, it's effective for tasks where the presence or absence of specific words matters, such as spam detection or topic classification. Example: In a spam detection system, BoW can identify frequent words like "win" or "prize," which are common in spam emails.
Disadvantages of the Bag-of-Words (BoW) Representation
Insensitivity to word order: BoW treats all occurrences of a word as equivalent, regardless of the order in which they appear in a sentence. It cannot capture the relationships between the words in a sentence and the meaning they convey. For example, the BoW representation is the same for two sentences with completely different meanings, such as "The dog bit the man" and "The man bit the dog".
High Dimensionality: If a corpus contains a large number of unique words, the bag-of-words representation will have a high dimensionality. Example: For a dataset with 10,000 unique words, each document is represented by a 10,000-dimensional vector, which leads to sparse matrices and increased computational cost.
Ignoring New Words: Since the BoW vector length is the same as the number of unique words in the corpus, it is difficult to introduce new words, as this requires recomputing the vectors for all documents.
When to Avoid BoW?
Large corpora with diverse vocabularies: When dealing with large text datasets, the dimensionality increases drastically. Consider using more efficient methods like TF-IDF or Word2Vec.
Tasks requiring context: For machine translation, question answering, or chatbot creation, Bag-of-Words fails to capture word order and context. Use embeddings or sequence-based models instead.
The Information Retrieval (IR) Perspective – 3. Term-Document Matrix
The Term-Document Matrix
In IR, we search a collection of N documents. We can represent each word in the vocabulary V as an N-dimensional vector indicating which documents it appears in. Conversely, we can represent each document as a V-dimensional vector indicating which words appear in it.
Finding the most relevant document for a query: queries are also (short) documents. Use the similarity of a query's vector and the documents' vectors to compute which document is most relevant to the query.
Intuition: Documents are similar to each other if they contain the same words.
3. Term-Document Matrix
In a term-document matrix, each row represents a word in the vocabulary and each column represents a document from some collection of documents.
A Term-Document Matrix is a 2D table:
Each cell contains the frequency (count) of the term (word) t in document d: tf_{t,d}
Each column is a vector of counts over words, representing a document
Each row is a vector of counts over documents, representing a word
Vector Space
Informally, a vector is just a list or array of numbers. As You Like It is represented as the list [1,114,36,20] (the first column vector in the figure) and Julius Caesar is represented as the list [7,62,1,2] (the third column vector).
A vector space is a collection of vectors, characterized by their dimension. In the above example, the document vectors are of dimension 4. In real term-document matrices, the vectors representing each document would have dimensionality |V|, the vocabulary size.
We can think of the vector for a document as a point in |V|-dimensional space; the documents above are points in 4-dimensional space. Since 4-dimensional spaces are hard to visualize, the slide's figure shows a visualization in two dimensions (arbitrarily choosing the dimensions corresponding to the words battle and fool).
Term-document matrices were originally defined as a means of finding similar documents for the task of document information retrieval.
Information retrieval (IR) is the task of finding the document d from the D documents in some collection that best matches a query q
For IR, we’ll therefore also represent a query by a vector, also of length |V|, and we’ll need a way to compare two vectors to find how similar they are
Words can be vectors too!
The four dimensions of the vector for fool, [36,58,1,4], correspond to the four Shakespeare plays. Word counts in the same four dimensions are used to form the vectors for the other 3 words: wit, [20,15,2,3]; battle, [1,0,7,13]; and good, [114,80,62,89]. (In the slide's figure, red boxes show that each word is represented as a row vector of length four.)
4. Word-Word Matrix (or Term-Term Matrix)
An alternative to using the term-document matrix to represent words as vectors of document counts is to use the term-term matrix, also called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents. This matrix is thus of dimensionality |V| x |V|.
Each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus
We can create a window around the word, for example of 4 words to the left and 4 words to the right. Each cell then records the number of times the column (context) word occurs in that ±4-word window around the row (target) word.
4. Word-Word Matrix
Row = target word, Column = context word.
Two words are similar in meaning if their context vectors are similar.
Cosine for measuring similarity
To measure the similarity between two target words v and w, we need a metric that takes two vectors and gives a measure of their similarity.
The most common similarity metric is the cosine of the angle between the vectors
The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:
dot product(v, w) = v · w = v_1 w_1 + v_2 w_2 + ... + v_N w_N
The dot product of two vectors provides a scalar value that encapsulates the relationship between the two vectors.
Geometric Interpretation: The dot product measures the degree of alignment between two vectors.
Cosine for measuring similarity
Dot Product as a Similarity Metric: The dot product is high when two vectors have large values in the same dimensions. This means the vectors are aligned in those dimensions, indicating similarity.
Orthogonal Vectors: When vectors are orthogonal (perpendicular), their dot product is 0. This happens because they do not share any alignment in their dimensions. A dot product of 0 represents strong dissimilarity.
The raw dot product, however, has a problem as a similarity metric: it favors long vectors.
The dot product is higher if a vector is longer, with higher values in each dimension
More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them
The raw dot product thus will be higher for frequent words. But we'd like a similarity metric that tells us how similar two words are regardless of their frequency.
Cosine Similarity
Modify the dot product to normalize for the vector length by dividing the dot product by the lengths of each of the two vectors. This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors a and b:
a · b = |a| |b| cos(theta)
Vector length: |a| = sqrt(a_1^2 + a_2^2 + ... + a_N^2)
The cosine similarity metric between two vectors v and w thus can be computed as:
cosine(v, w) = (v · w) / (|v| |w|)
For some applications we pre-normalize each vector, by dividing it by its length, creating a unit vector of length 1
Cosine Similarity
For unit vectors, the dot product between two vectors equals the cosine of the angle between them.
The unit vectors eliminate the effect of magnitude, leaving only the directional relationship. The cosine similarity value ranges between -1 and 1.
Cosine Similarity Range Explanation
The cosine similarity value ranges between -1 and 1:
Cosine Similarity = 1 => The vectors are identical and point in the same direction. Example: two documents containing exactly the same content (D1: "The cat sat on the mat", D2: "The cat sat on the mat").
Cosine Similarity = 0 => The vectors are orthogonal, meaning there is no correlation or similarity between them. Example: two documents about completely unrelated subjects (D1: "I love eating apples", D2: "The Earth orbits the Sun").
Cosine Similarity = -1 => The vectors are opposite in direction, indicating they are completely dissimilar or contradictory. Example: two documents expressing opposing views (D1: "I love summer", D2: "I hate summer").
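A small NumPy sketch of cosine similarity. The three word vectors use illustrative co-occurrence counts over hypothetical context dimensions (pie, computer, data); they are not the exact numbers from the corpus figure referenced on the slides:

```python
import numpy as np

def cosine_similarity(v, w):
    """Dot product normalized by the product of the two vector lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Illustrative co-occurrence counts over three context dimensions (pie, computer, data).
cherry      = np.array([450, 10, 5])
digital     = np.array([5, 1700, 1600])
information = np.array([5, 4000, 3300])

print(cosine_similarity(cherry, information))   # ~0.03: dissimilar
print(cosine_similarity(digital, information))  # ~1.0: very similar
```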
Cosine Similarity - Example
The model decides that "information" is way closer to "digital" than it is to "cherry".
Other possible similarity measures
5. TF-IDF: Weighing Terms in the Vector
Raw frequency (counting how often a word appears) isn't always helpful for understanding connections between words
For Example
Words like "cherry" and "strawberry" might appear together often and tell us something interesting (they are related to fruits)
Words like "the," "it," or "they" also appear frequently but don’t help us understand anything specific because they are common across all kinds of sentences
5. TF-IDF: Weighing Terms in the Vector
Words that occur nearby frequently (maybe "pie" nearby "cherry") are more important than words that only appear once or twice. Words that are too frequent and ubiquitous, like "the" or "is", are unimportant.
How can we balance these two conflicting constraints?
There are two common solutions to this problem:
The tf-idf algorithm, usually used when the dimensions are documents
The PPMI algorithm, usually used when the dimensions are words
The tf-idf algorithm is the product of two terms, each capturing one of these two intuitions:
(i) The first is the term frequency (Luhn, 1957): the frequency of the word t in the document d. We can just use the raw count as the term frequency.
More commonly we squash the raw frequency a bit, by using the log10 of the frequency
Because we can’t take the log of 0, we normally add 1 to the count
5. TF-IDF
With log weighting, tf_{t,d} = log10(count(t,d) + 1), so terms which occur:
0 times in a document have tf = log10(1) = 0
10 times in a document, tf = log10(11) = 1.04
100 times, tf = log10(101) = 2.004
1000 times, tf = log10(1001) = 3.0004
and so on.
(ii) The second factor in tf-idf is used to give a higher weight to words that occur only in a few documents. Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection; terms that occur frequently across the entire collection aren't as helpful.
The document frequency df_t of a term t is the number of documents it occurs in; each document is counted only once, even if the term appears multiple times within the same document. Document frequency is not the same as the collection frequency of a term, which is the total number of times the word appears in the whole collection in any document.
Consider, in the collection of Shakespeare's 37 plays, the two words Romeo and action. The words have identical collection frequencies (they both occur 113 times in all the plays) but very different document frequencies, since Romeo only occurs in a single play.
Inverse document frequency (idf) is defined using the fraction N / df_t, where N is the total number of documents in the collection and df_t is the number of documents in which term t occurs. The fewer documents in which a term occurs, the higher this weight.
The lowest weight of 1 is assigned to terms that occur in all the documents.
Because of the large number of documents in many collections, this measure is usually squashed with a log function. The resulting definition for inverse document frequency (idf) is thus:
idf_t = log10(N / df_t)
idf values for some words in the Shakespeare corpus
5. TF-IDF
The tf-idf weighted value w_{t,d} for word t in document d thus combines term frequency tf_{t,d} with idf_t:
w_{t,d} = tf_{t,d} × idf_t
tf: Measures how popular a word is in a single document.
idf: Ensures that common words like "the" or "and" don't dominate, since they appear in all documents.
5. TF-IDF - Example
Imagine we have three short documents:
Doc 1: I love programming
Doc 2: Programming is fun
Doc 3: I love fun activities
Step 1: Calculate Term Frequency (TF). Tokenize the words; TF is the number of times a word appears in a document.
(The slide shows the per-document TF values in a table.)
Step 2: Calculate Inverse Document Frequency (IDF). IDF measures how important a term is across all documents.
(The slide shows the IDF value for each word in a table.)
Step 3: Calculate TF-IDF.
TF-IDF is calculated by multiplying the TF and IDF for each word.
Final TF-IDF Vectors:
Doc 1, Doc 2, Doc 3: (the numeric vectors are shown in a table on the slide).
These TF-IDF vectors represent the documents in a numerical form where each element corresponds to the importance of a term in a document relative to the whole corpus.
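A manual Python sketch that recomputes TF-IDF for these three documents, assuming the log-scaled conventions defined earlier (tf = log10(count + 1), idf = log10(N / df)); the slide's own table may use raw counts, so its exact numbers can differ:

```python
import math

docs = {
    "Doc 1": "I love programming".lower().split(),
    "Doc 2": "Programming is fun".lower().split(),
    "Doc 3": "I love fun activities".lower().split(),
}
vocab = sorted({w for tokens in docs.values() for w in tokens})
N = len(docs)

# Document frequency: number of documents containing each term.
df = {w: sum(w in tokens for tokens in docs.values()) for w in vocab}
idf = {w: math.log10(N / df[w]) for w in vocab}

for name, tokens in docs.items():
    tfidf = {w: round(math.log10(tokens.count(w) + 1) * idf[w], 3) for w in vocab}
    print(name, tfidf)
```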
6. PPMI (Positive Pointwise Mutual Information)
PPMI is an alternative weighting function to tf-idf, used for term-term matrices, when the vector dimensions correspond to words rather than documents. PPMI draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in our corpus than we would have a priori expected them to appear by chance.
Pointwise mutual information is one of the most important concepts in NLP. It is a measure of how often two events x and y occur, compared with what we would expect if they were independent:
PMI(x, y) = log2( P(x, y) / (P(x) P(y)) )
Joint vs Marginal Probabilities
Joint Probability: The probability of two (or more) events happening simultaneously. Notation: P(A∩B) or P(A,B), where A and B are two events or random variables.
Properties: If A and B are independent events, P(A∩B) = P(A)·P(B). Joint probabilities are often represented in tables for discrete random variables or through joint probability density functions (PDFs) for continuous random variables.
Example: Suppose you toss two fair coins. Each coin can land as Heads (H) or Tails (T). The sample space for tossing two coins is: {(H,H),(H,T),(T,H),(T,T)}. Events: let A = the first coin is Heads, and B = the second coin is Tails.
The joint probability P(A∩B) is the probability that the first coin is Heads AND the second coin is Tails. From the sample space, only one outcome satisfies this: (H,T).
Since the coins are fair, the probability of each outcome is 1/4, so P(A∩B) = 1/4.
Marginal Probability: The probability of a single event occurring, irrespective of other events. It is derived by "summing out" or "integrating out" the other variables from the joint probability. Notation: P(A) or P(B).
Example: Tossing Two Coins. Suppose you toss two fair coins. The sample space for the outcomes is: {(H,H),(H,T),(T,H),(T,T)}. Events: A = the first coin shows Heads, B = the second coin shows Heads. To find the marginal probability P(A), we calculate the probability of the first coin showing Heads, regardless of the outcome of the second coin.
There are 4 total possible outcomes: {(H,H),(H,T),(T,H),(T,T)}. Two of them, (H,H) and (H,T), have the first coin showing Heads, so P(A) = 2/4 = 1/2.
The marginal probabilities are obtained by summing the joint probabilities across rows or columns.
(The slide shows a table of joint probabilities and their corresponding marginal probabilities, obtained by summing across rows and columns.)
6. PPMI
PMI between two words (Church & Hanks, 1989): do words word1 and word2 co-occur more than if they were independent?
PMI(word1, word2) = log2( P(word1, word2) / (P(word1) P(word2)) )
The numerator tells us how often we observed the two words together (assuming we compute probability by using the MLE). The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently; recall that the probability of two independent events both occurring is just the product of the probabilities of the two events. Thus, the ratio gives us an estimate of how much more the two words co-occur than we expect by chance.
PMI ranges from −∞ to +∞
Negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous. Imagine w1 and w2 whose probability is each 10^-6; it is hard to be sure p(w1,w2) is significantly different from 10^-12. Furthermore, it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgments. So we just replace negative PMI values by 0.
Positive PMI (PPMI) between word1 and word2:
PPMI(word1, word2) = max( PMI(word1, word2), 0 )
Assume that we have a co-occurrence matrix F with W rows (words) and C columns (contexts), where f_ij is the number of times word w_i occurs in context c_j. Then:
p_ij = f_ij / (sum over all i,j of f_ij)   (the co-occurrence probability of word i with context j)
p_i* = (sum over j of f_ij) / (sum over all i,j of f_ij)   (total co-occurrences of word i across all contexts)
p_*j = (sum over i of f_ij) / (sum over all i,j of f_ij)   (total occurrences of context j across all words)
PPMI_ij = max( log2( p_ij / (p_i* · p_*j) ), 0 )
Numerator (p_ij): the actual co-occurrence probability.
Denominator (p_i* · p_*j): the expected co-occurrence probability if i and j were independent.
PPMI helps capture meaningful associations between words and contexts:
Higher PPMI values indicate that two items co-occur more often than expected by chance; low PPMI values are filtered out to focus only on significant relationships.
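A NumPy sketch of the PPMI computation just described, applied to a small hypothetical word-by-context count matrix (the counts are made up for illustration):

```python
import numpy as np

def ppmi(F):
    """PPMI matrix from a word-by-context count matrix F."""
    total = F.sum()
    p_ij = F / total                              # joint probabilities p_ij
    p_i = F.sum(axis=1, keepdims=True) / total    # word marginals p_i*
    p_j = F.sum(axis=0, keepdims=True) / total    # context marginals p_*j
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0                  # zero counts contribute nothing
    return np.maximum(pmi, 0)                     # clip negative PMI to 0

# Hypothetical counts: rows = target words, columns = context words.
F = np.array([[2.0, 8.0],
              [6.0, 1.0]])
print(ppmi(F))
```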
PMI is biased toward infrequent events: very rare words tend to have very high PMI values. Why is PMI biased toward rare events? The PMI for a word w and a context c is calculated as:
PMI(w, c) = log2( P(w, c) / (P(w) P(c)) )
Example: Consider a streaming platform where we track the movies users watch. You're analyzing two niche movies, "Fantasy Movie A" and "Fantasy Movie B". (The slide lists hypothetical probabilities for each movie and for both being watched together.) PMI measures how surprising it is that these two movies are watched together.
Why do rare events lead to high PMI values?
Small Marginal Probabilities in the Denominator: Rare words w and rare contexts c have small probabilities P(w) and P(c), making the denominator P(w)P(c) very small. Even a single co-occurrence (small P(w,c)) can make the fraction P(w,c) / (P(w)P(c)) disproportionately large.
Two solutions:
(1) Give rare context words slightly higher probabilities (weighted PMI, sketched below)
(2) Use Laplace (add-1) smoothing, which has a similar effect
Weighting PMI: give rare context words slightly higher probability by raising the context counts to a power α before normalizing:
P_α(c) = count(c)^α / (sum over c' of count(c')^α), commonly with α = 0.75
PPMI_α(w, c) = max( log2( P(w, c) / (P(w) P_α(c)) ), 0 )
The slides also compare plain PPMI against add-2 smoothed PPMI. Smoothing helps to handle cases where word pairs never occurred together by adding small values. But if two words still don't seem related (even after smoothing), their relationship score (PPMI) will still be 0.
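A brief sketch of the α-weighted context probability mentioned above; α = 0.75 is the commonly cited value and is assumed here rather than taken from the slide:

```python
import numpy as np

def context_probs(counts, alpha=0.75):
    """P_alpha(c): raise context counts to alpha before normalizing,
    so rare contexts get a slightly larger share than under raw MLE."""
    counts = np.asarray(counts, dtype=float)
    weighted = counts ** alpha
    return weighted / weighted.sum()

counts = np.array([900, 90, 10])      # one frequent context, two rarer ones
print(counts / counts.sum())          # raw MLE:  [0.9  0.09 0.01]
print(context_probs(counts))          # smoothed: roughly [0.82, 0.15, 0.03]
```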
Applications of tf-idf & PPMI vector models
Information Retrieval (IR): TF-IDF is widely used in search engines to rank documents based on their relevance to a query. Documents are represented as vectors in a high-dimensional space, and the similarity between the query and documents is computed using cosine similarity. TF-IDF helps weigh terms that are important for identifying relevance while reducing the impact of common words (e.g., "the", "and").
Word Similarity and Lexical Semantics: PPMI vectors are useful for measuring semantic similarity between words. Applications include synonym detection (e.g., finding synonyms in thesaurus construction) and word sense disambiguation, where word meanings are inferred from their contextual co-occurrence patterns.
Distributional Semantics and Word Embeddings: PPMI captures word meanings through co-occurrence statistics, forming the foundation for distributional semantics. It is a precursor to modern embeddings like Word2Vec and GloVe, as these methods often incorporate PMI-based ideas.
Plagiarism Detection: For two TF-IDF document vectors d1 and d2, estimate the similarity between the two documents by cos(d1, d2). Document similarity is also useful for all sorts of applications: information retrieval, plagiarism detection, news recommender systems, and even digital humanities tasks like comparing different versions of a text to see which are similar to each other.
'''
Below is a comprehensive set of study notes derived from the provided content. Each section is followed by a set of practice problems (with solutions) to help you solidify your understanding. The notes are organized in a logical flow, covering all key concepts: why text must be converted to numerical form, traditional text representation techniques, and more sophisticated weighting schemes like TF-IDF and PPMI. Finally, we tie everything back to how these methods are used in applications such as search engines (IR), semantic similarity, and plagiarism detection.
Text Representation Techniques (One-Hot, BoW, TF-IDF, N-Grams) generally use frequency or count-based strategies. They are sparse and high-dimensional and often lack contextual understanding of words.
Embeddings (Word2Vec, FastText, GloVe, BERT, etc.) are dense, low-dimensional vector representations learned from large corpora that capture semantic and syntactic information.
One-Hot Encoding example: with the vocabulary [I, ate, an, apple, and, played, the, piano], the word "I" (position 1) is represented as [1, 0, 0, 0, 0, 0, 0, 0].
Bag-of-Words example: with the vocabulary ['This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'], Review 1 ("This movie is very scary and long") becomes [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0].
Term-Document Matrix: each cell holds tf_{t,d}, the count of term t in document d. Each document is a column vector (e.g., As You Like It is [1, 114, 36, 20]).
Term Frequency (TF) for term t in document d:
tf_{t,d} = log10(count(t, d) + 1)
(Sometimes the raw count or a slightly different log scale is used. The idea is to diminish the effect of repeated word occurrences.)
Inverse Document Frequency (IDF) for term t in a corpus of N documents:
idf_t = log10(N / df_t)
where df_t = the number of documents containing t.
TF-IDF weight:
w_{t,d} = tf_{t,d} × idf_t
Below are practice problems for each major topic. Work through them step-by-step; then review the provided solutions.
Suppose you have a small corpus with 5 unique words: cat, dog, apple, banana, runs. Assign an index to each word and write the one-hot vectors for "dog", "cat", and "runs".
Solution: assign indices as follows:
cat -> index 0
dog -> index 1
apple -> index 2
banana -> index 3
runs -> index 4
dog -> [0, 1, 0, 0, 0]
cat -> [1, 0, 0, 0, 0]
runs -> [0, 0, 0, 0, 1]
You have the following three sentences:
S1: "Enjoy your free pizza now"
S2: "Coupons for free pizza"
S3: "Enjoy your day with fun"
(a) Create a vocabulary from these sentences.
(b) Form the Bag-of-Words vectors for each sentence based on the vocabulary (use simple frequency counts).
["Enjoy", "your", "free", "pizza", "now", "coupons", "for", "day", "with", "fun"]
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
[0, 0, 1, 1, 0, 1, 1, 0, 0, 0]
[1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
Given a collection of 3 documents:
D1: "Chocolate is sweet"
D2: "Vanilla is sweet"
D3: "Chocolate and vanilla"
Construct the term-document matrix (rows = words, columns = documents), using raw frequency counts.
[“chocolate”, “is”, “sweet”, “vanilla”, “and”]
Word | D1 Count | D2 Count | D3 Count |
---|---|---|---|
chocolate | 1 | 0 | 1 |
is | 1 | 1 | 0 |
sweet | 1 | 1 | 0 |
vanilla | 0 | 1 | 1 |
and | 0 | 0 | 1 |
You have the sentence “The cat sat on the mat.” Assume a context window of ±1 word. Construct a simplified word-word co-occurrence matrix for the words [the, cat, sat, on, mat]
Tokenized sentence (lowercased): [the, cat, sat, on, the, mat]. With a ±1 window, the neighbors are:
the (first occurrence) appears next to [cat]
cat appears next to [the, sat]
sat appears next to [cat, on]
on appears next to [sat, the]
the (second occurrence) appears next to [on, mat]
mat appears next to [the]
Row (Target) | the | cat | sat | on | mat |
---|---|---|---|---|---|
the | 0 | 1 | 0 | 1 | 1 |
cat | 1 | 0 | 1 | 0 | 0 |
sat | 0 | 1 | 0 | 1 | 0 |
on | 1 | 0 | 1 | 0 | 0 |
mat | 1 | 0 | 0 | 0 | 0 |
(Counts can vary slightly depending on how you unify “the” occurrences, but this is a typical approach.)
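A Python sketch that builds this ±1-window co-occurrence matrix, merging both occurrences of "the" into a single row and column (matching the solution table above):

```python
from collections import defaultdict

tokens = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat"]
window = 1

counts = defaultdict(int)
for i, target in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            counts[(target, tokens[j])] += 1

for row in vocab:
    print(row, [counts[(row, col)] for col in vocab])
# the [0, 1, 0, 1, 1]
# cat [1, 0, 1, 0, 0]
# sat [0, 1, 0, 1, 0]
# on  [1, 0, 1, 0, 0]
# mat [1, 0, 0, 0, 0]
```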
You have two document vectors, d1 and d2.
Calculate their dot product and their cosine similarity.
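A worked NumPy example with two hypothetical document vectors, chosen here only for illustration since the problem does not pin down specific numbers:

```python
import numpy as np

# Hypothetical document vectors, chosen only for illustration.
d1 = np.array([2, 1, 0, 3])
d2 = np.array([1, 0, 0, 2])

dot = np.dot(d1, d2)                                    # 2*1 + 1*0 + 0*0 + 3*2 = 8
cosine = dot / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(dot, round(float(cosine), 3))                     # 8 0.956
```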
You have 3 documents:
D1: “apple banana apple”
D2: “banana fruit apple”
D3: “apple fruit”
Compute the term frequency (TF) for each word in each document (use log-scaled TF if possible).
Compute the inverse document frequency (IDF) for each word.
Generate the final TF-IDF values.
(Assume logs are base 10, and if a word doesn’t appear in a document, TF = 0.)
(Note that “apple” ended up with IDF=0 because it appears in all documents. This drastically lowers its TF-IDF weight — exactly what TF-IDF is designed to do for common words.)
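A sketch that works this problem with the stated conventions (base-10 logs, tf = log10(count + 1), idf = log10(N / df)); it confirms that "apple" gets idf = 0:

```python
import math

docs = {
    "D1": "apple banana apple".split(),
    "D2": "banana fruit apple".split(),
    "D3": "apple fruit".split(),
}
vocab = ["apple", "banana", "fruit"]
N = len(docs)

df = {w: sum(w in toks for toks in docs.values()) for w in vocab}
idf = {w: math.log10(N / df[w]) for w in vocab}
print({w: round(v, 3) for w, v in idf.items()})
# {'apple': 0.0, 'banana': 0.176, 'fruit': 0.176}  -- apple appears in every document

for name, toks in docs.items():
    tfidf = {w: round(math.log10(toks.count(w) + 1) * idf[w], 3) for w in vocab}
    print(name, tfidf)
# D1: banana 0.053, everything else 0; D2: banana 0.053, fruit 0.053; D3: fruit 0.053
```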
You have a mini-corpus of two words “sun” and “moon” and two contexts “day” and “night.” The co-occurrence counts are:
 | day | night |
---|---|---|
sun | 10 | 1 |
moon | 1 | 8 |
Compute the PPMI value for each (word, context) pair. (Assume base 2 for the log.)
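A sketch that computes the PPMI value for each (word, context) pair in this table, using base-2 logs as stated:

```python
import math

counts = {("sun", "day"): 10, ("sun", "night"): 1,
          ("moon", "day"): 1, ("moon", "night"): 8}
total = sum(counts.values())  # 20

def prob(word=None, context=None):
    """Joint or marginal probability estimated from the count table."""
    return sum(c for (w, ctx), c in counts.items()
               if (word is None or w == word) and (context is None or ctx == context)) / total

for (w, ctx) in counts:
    pmi = math.log2(prob(w, ctx) / (prob(word=w) * prob(context=ctx)))
    print(w, ctx, round(max(pmi, 0.0), 3))
# sun day 0.725, sun night 0 (negative PMI clipped),
# moon day 0, moon night 0.982
```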
This compilation covers: why raw text must be converted into numerical vectors, One-Hot Encoding, Bag-of-Words, Term-Document and Word-Word matrices, cosine similarity, TF-IDF weighting, PPMI, and their applications in information retrieval, word similarity, and plagiarism detection.
With these structured notes and the practice problems, you should be well-prepared to tackle exam questions on Deep Learning for NLP text representation basics. Good luck with your studies!