Deep Learning Text Representation

Created on: February 4, 2025

Answered using O1 by Chat01

Question

Hey, I want you to give me very detailed study notes on the content present below in triple quotes. Also, give me problems with answers for each of the topics present. Make sure you cover everything, because I'm going to use it for exam preparation.

''' 18CSC412 – Deep Learning for Natural Language Processing

Text Representation Techniques and Embeddings

Text and Reference Books TEXT BOOKS

Jurafsky D and Martin J H, “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition”, Prentice Hall, 2009.

REFERENCES Manning C D and Schutze H, “Foundations of Statistical Natural Language processing“, First Edition, MIT Press, 1999. Allen J, “Natural Language Understanding”, Second Edition, Pearson Education, 2003.

2 The Problem with Text Data

Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers. Specifically, vectors of numbers

This is called feature extraction or feature encoding

Word embedding is a means of turning texts into numbers

Vectors for representing words are called embeddings

3 A Machine Learning Pipeline for Text Classification 4 Text Representation Techniques Vs Embeddings Text Representation Techniques transform text into numerical features using statistical or count-based approaches. They often lack contextual understanding of words One-Hot Encoding Bag-of-Words TF-IDF (Term Frequency (TF) – Inverse Document Frequency (IDF)) N-Grams

Embeddings are dense, low-dimensional vector representations of words or text that capture semantic and syntactic information. Unlike count-based methods, embeddings are learned from large corpora using neural networks Word2Vec FastText GloVe (Global Vectors for Word Representation) BERT (Bidirectional Encoder Representations from Transformers) Sentence Embeddings (e.g., Sentence-BERT) 5

  1. One-hot Encoding One-hot Encoding is a technique that translates categorical data, including textual data, into numerical vectors In the context of NLP, it is primarily used to convert textual information into a format that machine learning models can understand and process Steps Vocabulary creation: The first step in one-hot encoding is to create a vocabulary consisting of all unique words in the entire text corpus Each word is indexed with a unique integer value Vector representation: Once the vocabulary is created, each word is represented as a vector of 0s and 1s The length of the vector is equal to the size of the vocabulary, and each position in the vector corresponds to a specific word in the vocabulary

6

  1. One-Hot Encoding If the word is present in a particular text sample, its corresponding position in the vector is marked as 1, and all other positions are 0 This implies that each word is uniquely represented by a binary vector, with only one element being 1 (indicating its presence) and all others being 0 Example Consider the sentence: “I ate an apple and played the piano” Indexing each word’s position in the given vocabulary set

The word “I” is at position 1, so its one-hot vector representation would be [1, 0, 0, 0, 0, 0, 0, 0]

7

  1. One-Hot Encoding The one-hot embedding matrix for the example text I ate an apple and played the piano

8 The number of words in the source vocabulary signifies the number of dimensions - here we have eight Problems with One-Hot Encoding (i) Imagine we have a vocabulary of 50,000. (There are roughly a million words in English language.) Each word is represented with 49,999 zeros and a single one, and we need 50,000 squared = 2.5 billion units of memory space Not computationally efficient

(ii) It is hard to extract meanings Each word is embedded in isolation, and every word contains a single one and N-1 zeros where N is the number of dimensions. The resulting set of vectors do not say much about one another If our vocabulary had “orange,” “banana,” and “watermelon,” we can see the similarity between those words, such as the fact that they are types of fruit. But with one-hot vectors, all words are equal distance apart

9 2. Bag-of-Words (BoW) BoW an unordered set of words with their position ignored, keeping only their frequency in the document

10 2. Bag-of-Words (BoW) BoW is a technique that helps us convert text sentences into numeric vectors

Example : Three movie reviews Review 1: This movie is very scary and long Review 2: This movie is not scary and is slow Review 3: This movie is spooky and good

First build a vocabulary from all the unique words in the above three reviews The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,  ‘slow’, ‘spooky’,  ‘good’ Take each of these words and mark their occurrence in the three movie reviews above with 1’s and 0’s 11 Bag-of-Words (BoW)

Vector-of-Review 1: [1 1 1 1 1 1 1 0 0 0 0] Vector-of-Review 2: [1 1 2 0 1 1 0 1 1 0 0] Vector-of-Review 3: [1 1 1 0 0 1 0 0 0 1 1]

12 Review 1: This movie is very scary and long Review 2: This movie is not scary and is slow Review 3: This movie is spooky and good Advantages Bag-of-Words (BoW) Representation

Simplicity The BoW model is easy to understand and implement, making it a great starting point for text processing tasks

Good for Keyword Detection Since BoW focuses on the frequency of words, it’s effective for tasks where the presence or absence of specific words matters, such as spam detection or topic classification. Example: In a spam detection system, BoW can identify frequent words like “win” or “prize,” which are common in spam emails

13 Disadvantages Bag-of-Words (BoW) Representation Insensitivity to word order It treats all occurrences of a word as equivalent, regardless of the order in which they appear in a sentence It cannot capture the relationships between the words in a sentence and the meaning they convey For Example : BoW representation is same for the following two sentences with completely different meanings 14 Disadvantages Bag-of-Words (BoW) Representation

High Dimensionality If a corpus contains large number of unique words, the bag of words representation will have a high dimensionality Example: For a dataset with 10,000 unique words, each document is represented by a 10,000-dimensional vector, which leads to sparse matrices and increased computational cost

Ignoring New Words Since the BoW vector length is same as the number of unique words in the corpus, It makes it difficult to introduce new words as it requires recomputing vectors for all documents 15 When to Avoid BoW? Large corpora with diverse vocabularies When dealing with large text datasets, the dimensionality increases drastically. Consider using more efficient methods like TF-IDF or Word2Vec

Tasks requiring context For machine translation, question answering, or chatbot creation, Bag of Words fails to capture word order and context. Use embeddings or sequence-based models instead

16 The Information Retrieval(IR) Perspective – 3. Term-Document Matrix

The Term-Document Matrix In IR, we search a collection of N documents — We can represent each word in the vocabulary V as an N-dimensional vector indicating which documents it appears in Conversely, we can represent each document as a V-dimensional vector indicating which words appear in it Finding the most relevant document for a query Queries are also (short) documents Use the similarity of a query’s vector and the documents’ vectors to compute which document is most relevant to the query

Intuition: Documents are similar to each other if they contain the same words 17 3. Term-Document Matrix In a term-document matrix Each row represents a word in the vocabulary and Each column represents a document from some collection of documents

A Term-Document Matrix is a 2D table: Each cell contains the frequency (count) of the term (word) t in document d: tft,d Each column is a vector of counts over words, representing a document Each row is a vector of counts over documents, representing a word

18 3. Term-Document Matrix 19 Vector Space Informally, Vector is just a list or array of numbers As You Like It is represented as the list [1,114,36,20] (the first column vector in the Figure) and Julius Caesar is represented as the list [7,62,1,2] (the third column vector)

A vector space is a collection of vectors, characterized by their dimension In the above example, the document vectors are of dimension 4 In real term-document matrices, the vectors representing each document would have dimensionality |V|, the vocabulary size

20 First dimension Vector Space We can think of the vector for a document as a point in |V|-dimensional space The documents in the previous slide are points in 4-dimensional space Since 4-dimensional spaces are hard to visualize, the following figure shows a visualization in two dimensions (arbitrarily chosen the dimensions corresponding to the words battle and fool) 21 Vector Space Term-document matrices were originally defined as a means of finding similar documents for the task of document information retrieval

Information retrieval (IR) is the task of finding the document d from the D documents in some collection that best matches a query q

For IR, we’ll therefore also represent a query by a vector, also of length |V|, and we’ll need a way to compare two vectors to find how similar they are

22 Words can be vectors too!!! 23 The four dimensions of the vector for fool, [36,58,1,4], correspond to the four Shakespeare plays Word counts in the same four dimensions are used to form the vectors for the other 3 words: wit, [20,15,2,3]; battle, [1,0,7,13]; and good [114,80,62,89]. The red boxes show that each word is represented as a row vector of length four 4. Word-Word Matrix (or) Term-Term Matrix An alternative to using the term-document matrix to represent words as vectors of document counts, is to use the term-term matrix, also called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents This matrix is thus of dimensionality |V| x |V|

Each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus

We can create a window around the word, for example of 4 words to the left  and 4 words to the  right The cell represents the number of times the column word occurs in a 4 word window around the row

24 4. Word-Word Matrix 25 Row - target word and Column - context word 4. Word-Word Matrix 26 Two words are similar in meaning if their context vectors are similar Cosine for measuring similarity To measure similarity between two target words v and w, we need a metric that takes two vectors and gives a measure of their similarity

The most common similarity metric is the cosine of the angle between the vectors

The cosine-like most measures for vector similarity used in NLP is based on the dot product operator from linear algebra, also called the inner product: 27 Cosine for measuring similarity The dot product of two vectors provides a scalar value that encapsulates the relationship between the two vectors

Geometric Interpretation The dot product measures the degree of alignment between two vectors

28 Cosine for measuring similarity

  1. Dot Product as a Similarity Metric The dot product is high when two vectors have large values in the same dimensions This means the vectors are aligned in those dimensions, indicating similarity

  2. Orthogonal Vectors When vectors are orthogonal (perpendicular), their dot product is 0 This happens because they do not share any alignment in their dimensions A dot product of 0 represents strong dissimilarity 29 Cosine for measuring similarity Raw dot product, however, has a problem as a similarity metric: it favors long vectors

The dot product is higher if a vector is longer, with higher values in each dimension

More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them

The raw dot product thus will be higher for frequent words But we’d like a similarity metric that tells us how similar two words are regardless of their frequency

30 Cosine Similarity Modify the dot product to normalize for the vector length by dividing the dot product by the lengths of each of the two vectors This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors a and b 31 Vector length Cosine Similarity The cosine similarity metric between two vectors v and w thus can be computed as

For some applications we pre-normalize each vector, by dividing it by its length, creating a unit vector of length 1

32 Cosine Similarity 33 Cosine Similarity For unit vectors, the dot product between two vectors equals the cosine of the angle between them

The unit vectors eliminate the effect of magnitude, leaving only the directional relationship The cosine similarity value ranges between -1 and 1

34 Cosine Similarity Range Explanation

35 Cosine Similarity The cosine similarity value ranges between -1 and 1 Cosine Similarity = 1 => The vectors are identical and point in the same direction Two documents containing exactly the same content D1: The cat sat on the mat D2: The cat sat on the mat Cosine Similarity = 0 => The vectors are orthogonal, meaning there is no correlation or similarity between them Two documents about completely unrelated subjects D1: I love eating apples D2: The Earth orbits the Sun Cosine Similarity = -1 =>The vectors are opposite in direction, indicating they are completely dissimilar or contradictory Two documents expressing opposing views D1: I love summer D2: I hate summer

36 Cosine Similarity - Example 37 The model decides that information is way closer to digital than it is to cherry Other possible similarity measures 38 5.TF-IDF : Weighing terms in the Vector

Raw frequency (counting how often a word appears) isn't always helpful for understanding connections between words

For Example

Words like "cherry" and "strawberry" might appear together often and tell us something interesting (they are related to fruits)

Words like "the," "it," or "they" also appear frequently but don’t help us understand anything specific because they are common across all kinds of sentences

39 5. TF-IDF : Weighing terms in the Vector

Words that occur nearby frequently (maybe pie nearby cherry) are more important than words that only appear once or twice Words that are too frequent and ubiquitous, like the or is, are unimportant

How can we balance these two conflicting constraints?

There are two common solutions to this problem The tf-idf algorithm, usually used when the dimensions are documents The PPMI algorithm, usually used when the dimensions are words 40 5. TF-IDF : Weighing terms in the Vector The tf-idf algorithm is the product of two terms, each term capturing one of these two intuitions: (i) The first is the term frequency (Luhn, 1957): the frequency of the word t in the document d. We can just use the raw count as the term frequency

More commonly we squash the raw frequency a bit, by using the log10 of the frequency

Because we can’t take the log of 0, we normally add 1 to the count

41 5. TF-IDF

If we use log weighting, terms which occur 0 times in a document would have tf = log10(1) = 0; 10 times in a document, tf = log10(11) = 1.04; 100 times, tf = log10(101) = 2.004; 1000 times, tf = log10(1001) = 3.00044; and so on. 42 5. TF-IDF (ii) The second factor in tf-idf is used to give a higher weight to words that occur only in a few documents Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection Terms that occur frequently across the entire collection aren’t as helpful

The document frequency dft of a term ‘t’ is the number of documents it occurs in - Each document is counted only once, even if the term appears multiple times within the same document Document frequency is not the same as the collection frequency of a term, which is the total number of times the word appears in the whole collection in any document

43 5. TF-IDF Consider in the collection of Shakespeare’s 37 plays the two words Romeo and action The words have identical collection frequencies (they both occur 113 times in all the plays) but very different document frequencies, since Romeo only occurs in a single play

Inverse document frequency or idf - is defined using the fraction N / dft where N is the total number of documents in the collection dft is the number of documents in which term t occurs The fewer documents in which a term occurs, the higher this weight

The lowest weight of 1 is assigned to terms that occur in all the documents 44 5. TF-IDF Because of the large number of documents in many collections, this measure too is usually squashed with a log function The resulting definition for inverse document frequency (idf) is thus

idf values for some words in the Shakespeare corpus

45 5. TF-IDF The tf-idf weighted value wt,d for word t in document d thus combines term frequency tft,d with idf

tf: Measures how popular a word is in a single document idf: Ensures that common words like "the" or "and" don't dominate, since they appear in all documents 46 5. TF-IDF - Example Imagine we have three short documents: Doc 1: I love programming Doc 2: Programming is fun Doc 3: I love fun activities Step 1: Calculate Term Frequency (TF) - Tokenize the words TF - is the number of times a word appears in a document

47 5. TF-IDF - Example

48 5. TF-IDF - Example Step 2: Calculate Inverse Document Frequency (IDF) IDF measures how important a term is across all documents

IDF for each word

49 5. TF-IDF - Example Step 3: Calculate TF-IDF

TF-IDF is calculated by multiplying the TF and IDF for each word 50 5. TF-IDF - Example 51 Final TF-IDF Vectors

Doc 1:

Doc 2:

Doc 3: These TF-IDF vectors represent the documents in a numerical form where each element corresponds to the importance of a term in a document relative to the whole corpus 6. PPMI (Positive Pointwise Mutual Information)

An alternative weighting function to tf-idf, PPMI PPMI - is used for term-term-matrices, when the vector dimensions correspond to words rather than documents PPMI draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in our corpus than we would have a priori expected them to appear by chance

Pointwise mutual information - one of the most important concepts in NLP It is a measure of how often two events x and y occur, compared with what we would expect if they were independent

52 Joint Vs Marginal probabilities Joint Probability The probability of two (or more) events happening simultaneously Notation: P(A∩B) or P(A,B), where A and B are two events or random variables Properties If A and B are independent events, P(A∩B) = P(A)⋅P(B) Joint probabilities are often represented in tables for discrete random variables or through joint probability density functions (PDFs) for continuous random variables Example Suppose you toss two fair coins. Each coin can land as Heads (H) or Tails (T) The sample space for tossing two coins is: {(H,H),(H,T),(T,H),(T,T)} Events: Let A: The first coin is Heads, Let B: The second coin is Tails

53 Joint Vs Marginal probabilities The joint probability P(A∩B) is the probability that the first coin is Heads AND the second coin is Tails From the sample space, only one outcome satisfies this: (H,T)

Since the coins are fair, the probability of each outcome is 1/4

54 Joint Vs Marginal probabilities Marginal Probability The probability of a single event occurring, irrespective of other events It is derived by "summing out" or "integrating out" the other variables from the joint probability Notation: P(A) or P(B)

Example : Tossing Two Coins Suppose you toss two fair coins. The sample space for the outcomes is: {(H,H),(H,T),(T,H),(T,T)} Events: A: The first coin shows Heads, B: The second coin shows Heads To find the marginal probability P(A), we calculate the probability of the first coin showing Heads, regardless of the outcome of the second coin

55 Joint Vs Marginal probabilities There are 4 total possible outcomes: {(H,H),(H,T),(T,H),(T,T)}

The marginal probabilities are obtained by summing the joint probabilities across rows or columns

56 Joint Vs Marginal probabilities 57 This table showcases joint probabilities and their corresponding marginal probabilities by summing across rows and columns 6. PPMI PMI between two words: (Church & Hanks 1989) Do words word1 and word2 co-occur more than if they were independent?

The numerator tells us how often we observed the two words together (assuming we compute probability by using the MLE) The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently; recall that the probability of two independent events both occurring is just the product of the probabilities of the two events Thus, the ratio gives us an estimate of how much more the two words co-occur than we expect by chance

PMI ranges from −∞ to +∞

58 6. PPMI Negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous Imagine w1 and w2 whose probability is each 10-6 Hard to be sure p(w1,w2) is significantly different than 10-12 Furthermore, it’s not clear whether it’s even possible to evaluate such scores of ‘unrelatedness’ with human judgments So we just replace negative PMI values by 0 Positive PMI (PPMI) between word1 and word2 59 Assume that we have a co-occurrence matrix F with W rows (words) and C columns (contexts) fij is # of times wi occurs in context cj 6. PPMI 60 Denominator : Total co-occurrences across all words and contexts Total co-occurrences of word i across all contexts Co-occurrence count of word i and context j Total occurrences of context j across all words 6. PPMI 61 Numerator (pij​): The actual co-occurrence probability

Denominator (Pi∗⋅P∗j​): The expected co-occurrence probability if i and j were independent

PPMI helps capture meaningful associations between words and contexts:

Higher PPMI values indicate that two items co-occur more often than expected by chance Low PPMI values are filtered out to focus only on significant relationships 6. PPMI 62 6. PPMI 63 6. PPMI 64 6. PPMI PMI is biased toward infrequent events Very rare words tends to have very high PMI values

Why PMI is biased toward rare events? The PMI for a word ‘w’ and a context ‘c’ is calculated as:

65 6. PPMI Consider a streaming platform where we track the movies users watch. You’re analyzing two niche movies: "Fantasy Movie A" "Fantasy Movie B“ Probabilities (Hypothetical Data)

66 PMI measures how surprising it is that these two movies are watched together: 6. PPMI Why rare events lead to high PMI values?

Small Marginal Probabilities in the Denominator Rare words w and rare contexts c have small probabilities P(w) and P(c), making the denominator P(w)P(c) very small Even a single co-occurrence (small P(w,c)) can make the fraction P(w,c)/P(w)P(c) disproportionately large

Two solutions Give rare words slightly higher probabilities Use add-one smoothing (which has a similar effect)

67 Weighting PMI: Giving rare context words slightly higher probability 68 Use Laplace (add-1) smoothing 69 PPMI versus add-2 smoothed PPMI 70 Smoothing helps to handle cases where word pairs never occurred together by adding small values. But if two words still don’t seem related (even after smoothing), their relationship score (PPMI) will still be 0. Applications of tf-idf & PPMI vector models

  1. Information Retrieval (IR) TF-IDF is widely used in search engines to rank documents based on their relevance to a query. Documents are represented as vectors in a high-dimensional space, and the similarity between the query and documents is computed using cosine similarity TF-IDF helps weigh terms that are important for identifying relevance while reducing the impact of common words (e.g., "the", "and")

  2. Word Similarity and Lexical Semantics PPMI vectors are useful for measuring semantic similarity between words Applications include: Synonym detection (e.g., finding synonyms in thesaurus construction). Word sense disambiguation, where word meanings are inferred from their contextual co-occurrence patterns 71 Applications of the tf-idf or PPMI vector models

  3. Distributional Semantics and Word Embeddings PPMI captures word meanings through co-occurrence statistics, forming the foundation for distributional semantics. It's a precursor to modern embeddings like Word2Vec and GloVe, as these methods often incorporate PMI-based ideas.

  4. Plagiarism Detection For the document vectors d1 and d2, estimate the similarity between the two documents by cos(d1 d2) – TF-IDF Document similarity is also useful for all sorts of applications; information retrieval, plagiarism detection, news recommender systems, and even for digital humanities tasks like comparing different versions of a text to see which are similar to each other 72

'''


Answer

Below is a comprehensive set of study notes derived from the provided content. Each section is followed by a set of practice problems (with solutions) to help you solidify your understanding. The notes are organized in a logical flow, covering all key concepts: why text must be converted to numerical form, traditional text representation techniques, and more sophisticated weighting schemes like TF-IDF and PPMI. Finally, we tie everything back to how these methods are used in applications such as search engines (IR), semantic similarity, and plagiarism detection.


1. Introduction

1.1 Why Convert Text into Numbers?

  • Machine Learning algorithms cannot work directly with raw text. We need to convert text into vectors of numbers, a process called feature extraction or feature encoding.
  • Simple approaches use count-based methods like:
    • One-Hot Encoding
    • Bag-of-Words (BoW)
    • TF-IDF
    • N-grams
  • More advanced approaches generate dense embeddings using neural networks:
    • Word2Vec, FastText, GloVe, BERT, Sentence-BERT, etc.

1.2 A Machine Learning Pipeline for Text Classification (General Overview)

  1. Collect Text Data: E.g., movie reviews, documents, tweets.
  2. Text Preprocessing: Cleaning, tokenization, removing stopwords, etc.
  3. Feature Extraction: Converting text to numerical vectors (e.g., one-hot vectors, TF-IDF, or embeddings).
  4. Model Training: Training a classification model (SVM, neural network, etc.) on the numerical vectors.
  5. Evaluation: Measuring model performance on unseen data.
  6. Deployment/Inference: Using the model to classify new text inputs in real-world scenarios.
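To make the pipeline concrete, here is a minimal end-to-end sketch. It assumes scikit-learn (a library choice not named in the slides), and the toy reviews and their labels are invented purely for illustration.

```python
# Minimal text-classification pipeline: feature extraction + model + inference.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data (1 = positive review, 0 = negative review).
texts = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
    "I loved this movie, it was great fun",
]
labels = [0, 0, 1, 1]

# Steps 3-4: convert text to TF-IDF vectors, then train a classifier on them.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Step 6: inference on unseen text.
print(model.predict(["a spooky but good movie"]))
```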

2. Text Representation Techniques vs. Embeddings

Text Representation Techniques (One-Hot, BoW, TF-IDF, N-Grams) generally use frequency or count-based strategies. They are:

  • Often high-dimensional (depending on vocabulary size).
  • Primarily based on surface statistics: word frequencies, occurrences, and co-occurrences.
  • Typically do not capture context well (order of words, deeper semantics).

Embeddings (Word2Vec, FastText, GloVe, BERT, etc.) are:

  • Dense, low-dimensional vector representations.
  • Learned from large corpora (neural networks often involved).
  • Capture both semantic and syntactic properties (context matters).
  • More powerful for downstream tasks that need deeper language understanding.

3. One-Hot Encoding

3.1 Definition

  • Represents each word by a vector of all zeros except for a single 1 at the index corresponding to that word.
  • Vocabulary Creation: Collect all unique words from the corpus, assign an index to each.
  • Vector Representation: A word w is represented by a binary vector of length |V| (the size of the vocabulary). Only the position corresponding to w is 1, others are 0.

3.2 Example

  • Sentence: “I ate an apple and played the piano”
  • Suppose the vocabulary (8 words) is: [I, ate, an, apple, and, played, the, piano].
  • One-hot vector for “I” = [1, 0, 0, 0, 0, 0, 0, 0].
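A minimal sketch of the two steps (vocabulary creation, then vector representation) in plain Python. The whitespace tokenization and the index order are assumptions; the slides fix the vocabulary by hand.

```python
# One-hot encoding: build a vocabulary, then map each word to a binary vector.
sentence = "I ate an apple and played the piano"
tokens = sentence.split()  # naive whitespace tokenization

# Step 1: vocabulary creation -- each unique word gets an integer index.
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(tokens))}

# Step 2: vector representation -- a |V|-length vector with a single 1.
def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

print(one_hot("I"))      # [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("apple"))  # [0, 0, 0, 1, 0, 0, 0, 0]
```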

3.3 Advantages & Disadvantages

  • Advantages:
    • Conceptually simple.
    • Easy to implement.
  • Disadvantages:
    1. High Dimensionality: If the vocabulary is 50,000 words, each vector also has 50,000 dimensions—very large and sparse.
    2. No Semantic Information: All words are equidistant, meaning “banana” and “orange” appear no more similar to each other than they do to “computer.”

4. Bag-of-Words (BoW)

4.1 Definition

  • Ignores order and context; retains only word frequency in a document.
  • A text document is turned into a vector representing the counts (or presence) of each word in the vocabulary.

4.2 Example

  • Three movie reviews:
    1. Review 1: “This movie is very scary and long”
    2. Review 2: “This movie is not scary and is slow”
    3. Review 3: “This movie is spooky and good”
  • Vocabulary (11 words): ['This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'].
  • Vector for Review 1 (count of each vocabulary word): [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0].
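A small sketch that reproduces these count vectors. Vocabulary order follows first appearance (which happens to match the order listed above) and tokenization is a plain whitespace split; both choices are assumptions.

```python
# Bag-of-Words: count how often each vocabulary word appears in each review.
from collections import Counter

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

# Vocabulary: unique words in order of first appearance.
vocab = list(dict.fromkeys(w for r in reviews for w in r.split()))

def bow_vector(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]   # Counter returns 0 for missing words

for r in reviews:
    print(bow_vector(r))
# Review 1 -> [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```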

4.3 Advantages

  1. Simplicity: Easy to build and interpret.
  2. Keyword Detection: Great for tasks like spam detection where frequent use of certain words matters.

4.4 Disadvantages

  1. Ignores Word Order: Loses context—“Apple eats a boy” vs. “A boy eats an apple” look the same to BoW.
  2. High Dimensionality: Large vocabularies → large, sparse vectors.
  3. Cannot Handle New Words: Adding new words changes vector size for all documents.

4.5 When to Avoid BoW?

  • Large corpora with many unique words (dimensionality explosion).
  • Tasks needing context (machine translation, question answering).

5. Term-Document Matrix (Information Retrieval Perspective)

5.1 Definition

  • A 2D matrix with:
    • Rows = terms (words in the vocabulary).
    • Columns = documents in the collection.
    • Cell (row, column) = frequency of term t in document d (often denoted $\text{tf}_{t,d}$).

5.2 Using Term-Document Matrix for IR

  • Represent each document as a column vector (of dimension |V|).
  • Represent a query (a short document) similarly.
  • Compute Similarity (like cosine similarity) between the query vector and each document vector to find the most relevant documents.

5.3 Example (Matrix as a “Vector Space”)

  • If you had 4 words in your vocabulary and 4 documents, each document is a 4-dimensional vector, e.g., [1, 114, 36, 20].
  • Documents (and queries) become points in that 4-dimensional space.
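A small sketch of the IR idea: build count vectors for a toy document collection, treat the query as a short document in the same space, and rank by cosine similarity. The documents, the query, and the sorted-vocabulary ordering are all assumptions made for illustration.

```python
# Rank documents by cosine similarity between the query vector and document vectors.
import math
from collections import Counter

docs = {
    "D1": "chocolate is sweet",
    "D2": "vanilla is sweet",
    "D3": "chocolate and vanilla",
}
vocab = sorted({w for text in docs.values() for w in text.split()})

def count_vector(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]          # |V|-dimensional count vector

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

query = count_vector("sweet chocolate")        # the query is a (short) document
ranking = sorted(docs, key=lambda d: cosine(query, count_vector(docs[d])), reverse=True)
print(ranking)   # ['D1', 'D2', 'D3'] -- D1 shares both query words
```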

6. Word-Word Matrix (Term-Term Matrix)

6.1 Definition

  • Instead of words vs. documents, we have words vs. words.
  • Rows = target words, Columns = context words.
  • Cells = co-occurrence counts (how often the row word appears next to the column word within a defined window, e.g., ±4 words).

6.2 Similarity in Word-Word Space

  • Two words are considered similar if their context vectors are similar.
  • Common measure: Cosine similarity.
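A sketch of building the co-occurrence counts behind a word-word matrix. The toy corpus and the ±2 window are arbitrary illustrative choices (the slides mention a ±4-word window).

```python
# Word-word (term-term) co-occurrence counts with a symmetric context window.
from collections import defaultdict

corpus = "I like deep learning and I like NLP".split()
window = 2   # count context words up to 2 positions to the left and right

cooc = defaultdict(lambda: defaultdict(int))   # cooc[target][context] = count
for i, target in enumerate(corpus):
    lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[target][corpus[j]] += 1

print(dict(cooc["like"]))
# {'I': 2, 'deep': 1, 'learning': 1, 'and': 1, 'NLP': 1}
```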

7. Cosine Similarity

7.1 Dot Product

  • The dot product of two vectors $\vec{v}$ and $\vec{w}$ measures how much they align dimension-wise.
  • Larger dot product → more alignment.

7.2 Why Normalize the Dot Product?

  • High-frequency words have higher magnitude vectors, leading to inflated dot products.
  • Solution: Normalize by the product of vector lengths → Cosine similarity.

7.3 Cosine Similarity Formula

$$\text{cosine\_similarity}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{\|\vec{v}\|\,\|\vec{w}\|}$$
  • Ranges from -1 to +1.
  • +1 means identical direction, 0 means orthogonal (no overlap), -1 means opposite directions.
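A direct NumPy translation of the formula (NumPy itself is an assumption; the slides give only the math). The example vectors are made up to show the +1 and 0 cases.

```python
import numpy as np

def cosine_similarity(v, w):
    """Cosine of the angle between vectors v and w."""
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # ~1.0: same direction
print(cosine_similarity([1, 0], [0, 1]))         # 0.0: orthogonal vectors
```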

8. TF-IDF (Term Frequency - Inverse Document Frequency)

8.1 Motivation

  • Raw frequency often overemphasizes common words like “the”, “and”.
  • Need a way to downweight words that appear in many documents and upweight words that are characteristic of fewer documents.

8.2 Formula Components

  1. Term Frequency (TF) for term $t$ in document $d$:

    $$\text{tf}_{t,d} = \log_{10}\bigl(\text{freq}(t,d) + 1\bigr)$$

    (Sometimes the raw count or a slightly different log scale is used. The idea is to diminish the effect of repeated word occurrences.)

  2. Inverse Document Frequency (IDF) for term $t$ in a corpus of $N$ documents:

    $$\text{idf}_t = \log_{10}\left(\frac{N}{\text{df}_t}\right)$$

    where $\text{df}_t$ = number of documents containing $t$.

  3. TF-IDF weight:

    $$w_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$$

8.3 Example

  • Docs:
    1. D1: “I love programming”
    2. D2: “Programming is fun”
    3. D3: “I love fun activities”
  • Step 1: Calculate TF (counts or log-scaled).
  • Step 2: Calculate IDF of each word.
  • Step 3: Multiply TF × IDF = TF-IDF for each (term, document) pair.
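A sketch that runs the three steps on the same toy documents, using the log-scaled tf and $\log_{10}(N/\text{df})$ idf defined above. Lowercasing the tokens is an added assumption so that "I"/"Programming" match across documents.

```python
# TF-IDF by hand: tf = log10(count + 1), idf = log10(N / df), weight = tf * idf.
import math

docs = ["I love programming", "Programming is fun", "I love fun activities"]
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
N = len(docs)

# Document frequency: in how many documents does each term occur?
df = {w: sum(w in toks for toks in tokenized) for w in vocab}

def tf(term, toks):
    return math.log10(toks.count(term) + 1)

def idf(term):
    return math.log10(N / df[term])

for i, toks in enumerate(tokenized, start=1):
    weights = {w: round(tf(w, toks) * idf(w), 4) for w in vocab if w in toks}
    print(f"Doc {i}:", weights)
```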

8.4 Applications of TF-IDF

  • Information Retrieval: Ranking documents by relevance to a query.
  • Document Similarity: Check how similar two documents are via cosine similarity on their TF-IDF vectors.
  • Plagiarism Detection, News Recommendation, etc.

9. PPMI (Positive Pointwise Mutual Information)

9.1 Rationale

  • TF-IDF is common for document-based features. For word-word contexts, PPMI is often better.
  • PMI measures how often two events co-occur, compared to what we’d expect if they were independent.

9.2 PMI Formula

$$\text{PMI}(w_1, w_2) = \log_{2}\left(\frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}\right)$$

  • $p(w_1, w_2)$ is the joint probability of $w_1$ and $w_2$ co-occurring.
  • $p(w_1)$ and $p(w_2)$ are the marginal probabilities of each word.

9.3 PPMI

  • Positive PMI sets negative PMI values to 0: $\text{PPMI}(w_1, w_2) = \max\bigl(\text{PMI}(w_1, w_2),\, 0\bigr)$.
  • Negative PMI often indicates that two words co-occur less than chance; these values can be unreliable, so they are set to 0.
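A NumPy sketch of turning raw co-occurrence counts into PPMI weights. The 2x2 count matrix is invented for illustration; it happens to match the sun/moon practice problem later in these notes.

```python
# PPMI from a co-occurrence matrix: PMI = log2(P(w,c) / (P(w)P(c))), negatives -> 0.
import numpy as np

# Rows = target words ("sun", "moon"); columns = contexts ("day", "night").
counts = np.array([[10.0, 1.0],
                   [ 1.0, 8.0]])

total = counts.sum()
p_wc = counts / total                        # joint probabilities P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)        # marginal P(w), shape (2, 1)
p_c = p_wc.sum(axis=0, keepdims=True)        # marginal P(c), shape (1, 2)

with np.errstate(divide="ignore"):           # guard against log2(0) for zero counts
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                    # replace negative PMI with 0

print(np.round(ppmi, 2))
# [[0.73 0.  ]
#  [0.   0.98]]
```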

9.4 Bias Towards Rare Events

  • Rare words with even a single co-occurrence can produce very high PMI values because the denominator (product of marginal probabilities) is small.
  • Solution: Smoothing techniques (like add-1 or add-2) to avoid extreme values.

9.5 Applications of PPMI

  • Word Similarity & Lexical Semantics: Identify synonyms, measure semantic closeness.
  • Distributional Semantics: Foundation for embeddings like Word2Vec and GloVe.

10. Applications Summary

  1. Information Retrieval (IR):
    • Represent documents as TF-IDF vectors or queries in a term-document space.
    • Rank documents based on cosine similarity to the query.
  2. Word Similarity & Lexical Semantics:
    • PPMI vectors used to find synonyms, clusters of similar words.
  3. Distributional Semantics:
    • PPMI influences methods like Word2Vec, GloVe for capturing deeper meaning.
  4. Plagiarism Detection & Document Similarity:
    • Compare documents using cosine similarity on TF-IDF vectors.

Practice Problems and Solutions

Below are practice problems for each major topic. Work through them step-by-step; then review the provided solutions.


1. One-Hot Encoding

Problem 1.1

Suppose you have a small corpus with 5 unique words: {cat, dog, apple, banana, runs}.

  1. Create the one-hot vector for the word “dog”.
  2. If the sentence is “cat runs”, which two one-hot vectors represent this sentence?

Solution 1.1

  1. Vocabulary Indexing (example indexing, any consistent indexing is fine):
    cat -> index 0
    dog -> index 1
    apple -> index 2
    banana -> index 3
    runs -> index 4
    
    • One-hot for “dog” (length 5): [0, 1, 0, 0, 0].
  2. “cat runs” → two vectors:
    • “cat” = [1, 0, 0, 0, 0]
    • “runs” = [0, 0, 0, 0, 1]

2. Bag-of-Words (BoW)

Problem 2.1

You have the following three sentences:

  1. “Enjoy your free pizza now”
  2. “Free coupons for pizza”
  3. “Enjoy your day with fun”

(a) Create a vocabulary from these sentences.
(b) Form the Bag-of-Words vectors for each sentence based on the vocabulary (use simple frequency counts).

Solution 2.1

  1. Extract Vocabulary
    • Unique words: ["Enjoy", "your", "free", "pizza", "now", "coupons", "for", "day", "with", "fun"]
      • (Exact list/order can vary if you choose to include all-lowercase, etc. The main idea is to identify the unique tokens.)
  2. BoW Vectors
    • Sentence 1: “Enjoy your free pizza now”
      • [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    • Sentence 2: “Free coupons for pizza”
      • [0, 0, 1, 1, 0, 1, 1, 0, 0, 0]
    • Sentence 3: “Enjoy your day with fun”
      • [1, 1, 0, 0, 0, 0, 0, 1, 1, 1]

3. Term-Document Matrix

Problem 3.1

Given a collection of 3 documents:

  1. D1: “chocolate is sweet”
  2. D2: “vanilla is sweet”
  3. D3: “chocolate and vanilla”

Construct the term-document matrix (rows = words, columns = documents), using raw frequency counts.

Solution 3.1

  1. Vocabulary: [“chocolate”, “is”, “sweet”, “vanilla”, “and”]
  2. Matrix:
| Word      | D1 | D2 | D3 |
|-----------|----|----|----|
| chocolate | 1  | 0  | 1  |
| is        | 1  | 1  | 0  |
| sweet     | 1  | 1  | 0  |
| vanilla   | 0  | 1  | 1  |
| and       | 0  | 0  | 1  |
  • Columns are D1, D2, D3; rows are words.

4. Word-Word (Term-Term) Matrix

Problem 4.1

You have the sentence “The cat sat on the mat.” Assume a context window of ±1 word. Construct a simplified word-word co-occurrence matrix for the words [the, cat, sat, on, mat].

  1. List the co-occurrences for each word (within ±1 window).
  2. Fill out the matrix.

Solution 4.1

  1. Tokens (lowercase for simplicity, ignoring punctuation): [the, cat, sat, on, the, mat]
  2. ±1 Window co-occurrences:
    • The first the appears next to [cat].
    • cat appears next to [the, sat].
    • sat appears next to [cat, on].
    • on appears next to [sat, the].
    • The second the appears next to [on, mat].
    • mat appears next to [the].
  3. Counting (combine both “the” tokens into a single row since it’s the same word):
| Row (Target) | the | cat | sat | on | mat |
|--------------|-----|-----|-----|----|-----|
| the          | 0   | 1   | 0   | 1  | 1   |
| cat          | 1   | 0   | 1   | 0  | 0   |
| sat          | 0   | 1   | 0   | 1  | 0   |
| on           | 1   | 0   | 1   | 0  | 0   |
| mat          | 1   | 0   | 0   | 0  | 0   |

(Counts can vary slightly depending on how you unify “the” occurrences, but this is a typical approach.)
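A quick plain-Python check of the matrix above, under the same assumptions (lowercased tokens, ±1 window, both the tokens merged into a single row):

```python
from collections import Counter

tokens = "the cat sat on the mat".lower().split()
vocab = ["the", "cat", "sat", "on", "mat"]

cooc = {w: Counter() for w in vocab}
for i, target in enumerate(tokens):
    for j in (i - 1, i + 1):                 # ±1 context window
        if 0 <= j < len(tokens):
            cooc[target][tokens[j]] += 1

for w in vocab:
    print(w, [cooc[w][c] for c in vocab])
# the [0, 1, 0, 1, 1]
# cat [1, 0, 1, 0, 0]
# sat [0, 1, 0, 1, 0]
# on [1, 0, 1, 0, 0]
# mat [1, 0, 0, 0, 0]
```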


5. Cosine Similarity

Problem 5.1

You have two document vectors:

  • $\vec{d_1} = [3, 0, 2]$
  • $\vec{d_2} = [1, 2, 2]$

Calculate:

  1. The dot product $\vec{d_1} \cdot \vec{d_2}$.
  2. The cosine similarity between $\vec{d_1}$ and $\vec{d_2}$.

Solution 5.1

  1. Dot Product: $\vec{d_1} \cdot \vec{d_2} = (3 \times 1) + (0 \times 2) + (2 \times 2) = 3 + 0 + 4 = 7$.
  2. Cosine Similarity: $\|\vec{d_1}\| = \sqrt{3^2 + 0^2 + 2^2} = \sqrt{13}$ and $\|\vec{d_2}\| = \sqrt{1^2 + 2^2 + 2^2} = \sqrt{9} = 3$, so $\text{cosine\_similarity}(\vec{d_1}, \vec{d_2}) = \frac{7}{3\sqrt{13}} \approx \frac{7}{10.817} \approx 0.647$.
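A two-line NumPy check of the arithmetic (NumPy assumed available):

```python
import numpy as np

d1, d2 = np.array([3.0, 0.0, 2.0]), np.array([1.0, 2.0, 2.0])
print(np.dot(d1, d2))                                              # 7.0
print(np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))  # ~0.647
```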

6. TF-IDF

Problem 6.1

You have 3 documents:

  1. D1: “apple banana apple”
  2. D2: “banana fruit apple”
  3. D3: “apple fruit”

Tasks:

  (a) Compute the term frequency (TF) for each word in each document (use log-scaled TF).
  (b) Compute the inverse document frequency (IDF) for each word.
  (c) Generate the final TF-IDF values.

(Assume logs are base 10, and if a word doesn’t appear in a document, TF = 0.)

Solution 6.1 (sketch)

  1. Vocabulary: {apple, banana, fruit}
  2. Raw Counts:
    • D1: apple=2, banana=1, fruit=0
    • D2: apple=1, banana=1, fruit=1
    • D3: apple=1, banana=0, fruit=1
  3. Log-Scaled TF ($\log_{10}(\text{count} + 1)$):
    • D1:
      • apple: $\log_{10}(2+1) = \log_{10}(3) \approx 0.4771$
      • banana: $\log_{10}(1+1) = \log_{10}(2) \approx 0.3010$
      • fruit: 0 (doesn’t appear)
    • D2:
      • apple: $\log_{10}(1+1) \approx 0.3010$
      • banana: $\log_{10}(1+1) \approx 0.3010$
      • fruit: $\log_{10}(1+1) \approx 0.3010$
    • D3:
      • apple: $\log_{10}(1+1) \approx 0.3010$
      • banana: 0
      • fruit: $\log_{10}(1+1) \approx 0.3010$
  4. Document Frequency:
    • apple: appears in D1, D2, D3 → $\text{df} = 3$
    • banana: appears in D1, D2 → $\text{df} = 2$
    • fruit: appears in D2, D3 → $\text{df} = 2$
    • $N = 3$ (total docs)
  5. IDF ($\log_{10}(N/\text{df})$):
    • apple: $\log_{10}(3/3) = \log_{10}(1) = 0$
    • banana: $\log_{10}(3/2) \approx \log_{10}(1.5) \approx 0.1761$
    • fruit: $\log_{10}(3/2) \approx 0.1761$
  6. TF-IDF = (TF) × (IDF):
    • D1:
      • apple: $0.4771 \times 0 = 0$
      • banana: $0.3010 \times 0.1761 \approx 0.0530$
      • fruit: $0 \times 0.1761 = 0$
    • D2:
      • apple: $0.3010 \times 0 = 0$
      • banana: $0.3010 \times 0.1761 \approx 0.0530$
      • fruit: $0.3010 \times 0.1761 \approx 0.0530$
    • D3:
      • apple: $0.3010 \times 0 = 0$
      • banana: $0 \times 0.1761 = 0$
      • fruit: $0.3010 \times 0.1761 \approx 0.0530$

(Note that “apple” ended up with IDF=0 because it appears in all documents. This drastically lowers its TF-IDF weight — exactly what TF-IDF is designed to do for common words.)


7. PPMI (Positive Pointwise Mutual Information)

Problem 7.1

You have a mini-corpus of two words “sun” and “moon” and two contexts “day” and “night.” The co-occurrence counts are:

|      | day | night |
|------|-----|-------|
| sun  | 10  | 1     |
| moon | 1   | 8     |

  1. Compute the joint probabilities $p(\text{word}, \text{context})$ by dividing each count by the total.
  2. Calculate PMI(sun, day) and PMI(moon, night).
  3. Convert to PPMI by setting negative values to 0 if there are any.

(Assume base 2 for the log if needed.)

Solution 7.1 (sketch)

  1. Total Co-occurrences: $10 + 1 + 1 + 8 = 20$.
  2. Joint Probabilities:
    • $p(\text{sun}, \text{day}) = 10/20 = 0.5$
    • $p(\text{sun}, \text{night}) = 1/20 = 0.05$
    • $p(\text{moon}, \text{day}) = 1/20 = 0.05$
    • $p(\text{moon}, \text{night}) = 8/20 = 0.4$
  3. Marginal Probabilities:
    • $p(\text{sun}) = 10/20 + 1/20 = 11/20 = 0.55$
    • $p(\text{moon}) = 1/20 + 8/20 = 9/20 = 0.45$
    • $p(\text{day}) = (10+1)/20 = 11/20 = 0.55$
    • $p(\text{night}) = (1+8)/20 = 9/20 = 0.45$
  4. PMI(sun, day): $\log_{2}\left(\frac{0.5}{0.55 \times 0.55}\right) = \log_{2}\left(\frac{0.5}{0.3025}\right) = \log_{2}(1.6529) \approx 0.73$.
  5. PMI(moon, night): $\log_{2}\left(\frac{0.4}{0.45 \times 0.45}\right) = \log_{2}\left(\frac{0.4}{0.2025}\right) = \log_{2}(1.9753) \approx 0.98$.
  6. PPMI: Since both values are positive, PPMI = PMI for these pairs.
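A short check of the PMI arithmetic in plain Python (base-2 log), using the probabilities computed above:

```python
from math import log2

p_joint = {("sun", "day"): 0.5, ("sun", "night"): 0.05,
           ("moon", "day"): 0.05, ("moon", "night"): 0.4}
p_word = {"sun": 0.55, "moon": 0.45}
p_context = {"day": 0.55, "night": 0.45}

def pmi(w, c):
    return log2(p_joint[(w, c)] / (p_word[w] * p_context[c]))

print(round(pmi("sun", "day"), 2))      # 0.73
print(round(pmi("moon", "night"), 2))   # 0.98
```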

Additional Tips for Exam Preparation

  1. Concept Mastery: Understand the key differences between count-based methods and embeddings.
  2. Formulas: Be comfortable with the formulas for TF, IDF, TF-IDF, PMI, Cosine Similarity.
  3. Interpretation: Practice explaining why one method might be chosen over another (e.g., TF-IDF vs. PPMI).
  4. Examples: Work through small examples manually. This cements the concepts better than purely theoretical study.
  5. Edge Cases: Think about extreme or edge cases (e.g., words appearing in all documents, or extremely rare words).

Final Note

This compilation covers:

  • Why raw text must be converted to numerical features (one-hot, BoW, TF-IDF, PPMI).
  • The basic mathematics behind cosine similarity.
  • The nature of term-document and word-word matrices.
  • The interplay between frequent and rare terms (and how methods like TF-IDF and PPMI balance them).

With these structured notes and the practice problems, you should be well-prepared to tackle exam questions on Deep Learning for NLP text representation basics. Good luck with your studies!
