LLM-Based Classification Strategies
Created on: January 10, 2025
Addressing imbalanced data
I’m developing a strategy to classify research papers as publishable or non-publishable, given a 2:1 imbalance ratio and 15 labeled papers. This involves examining text cleaning, representation, and various model approaches like generative, pipeline, and embedding models.
Handling imbalanced data
I’m thinking through how to handle data imbalance and section segmentation, possibly using bag-of-words. Discussing multi-class pipeline approaches, data augmentation, self-supervised methods, and classification strategies.
Thought about research paper classification strategies for 18 seconds
Below is a systematic breakdown of possible approaches and considerations for building a highly accurate and robust framework to classify research papers as “Publishable” or “Non-Publishable,” given the constraints and context you described:
Synthetic Document Generation / Data Augmentation
Document Splitting + Label Inheritance
Active Learning or Semi-Supervised Learning
Oversampling / Undersampling
Class Weighting / Focal Loss (a minimal sketch follows below)
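As a minimal sketch of the class-weighting idea (assuming a scikit-learn classifier; the feature matrix X and labels y below are placeholders for whatever document features you end up using):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Placeholder labels reflecting the ~2:1 imbalance (1 = publishable, 0 = non-publishable)
y = np.array([1] * 10 + [0] * 5)
X = np.random.rand(len(y), 16)  # stand-in features; replace with your real document representations

# Option 1: let scikit-learn balance the classes automatically
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)

# Option 2: compute explicit weights and pass them in
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
clf = LogisticRegression(class_weight={0: weights[0], 1: weights[1]}, max_iter=1000)
clf.fit(X, y)

Focal loss is the analogous idea for neural classifiers: it down-weights easy, well-classified examples so the minority class contributes more to the gradient.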
Splitting the documents by major sections (Abstract, Introduction, Methodology, etc.) can help isolate where certain “red flags” for non-publishability might appear: for instance, an unusually short Methodology section or a missing References section.
Issue: Not all papers have a uniform section structure.
Given the text-length and complexity, large pre-trained language models (PLMs) such as BERT, RoBERTa, DeBERTa, or Longformer (for longer context) can be used:
Longformer / BigBird
Chunk-Level Approach with BERT
Instruction-Finetuned or GPT-based Models Locally
For smaller data scenarios, a simpler pipeline might be surprisingly effective (a short sketch follows this list):
Feature Extraction
Classification Algorithms
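A minimal sketch of that simpler pipeline, assuming TF-IDF features feeding a classical classifier (the toy texts and labels below are placeholders for your preprocessed papers and 15 known labels):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Placeholder corpus; replace with your preprocessed paper texts and known labels
texts = [
    "we evaluate our method on three benchmark datasets and report f1 scores",
    "we achieve 100 percent accuracy with no baseline comparison or dataset description",
]
labels = [1, 0]  # 1 = publishable, 0 = non-publishable

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words="english")),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["a new paper's preprocessed text goes here"]))

With only 15 labels, a heavily regularized linear model over sparse features is often harder to overfit than a fine-tuned transformer.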
Below is an illustration of a pipeline that incorporates many of the above ideas:
Preprocessing & Document Splitting
Embeddings
Aggregation
Address Imbalance
Train Classifier
Active Learning for Additional Labels
While you cannot upload all data to GPT-4 or other LLMs directly for classification, you can still use them to aid certain subtasks:
Section Extraction / Title Identification
Topic Tagging
Keyword Generation
Following these steps in a well-structured pipeline can yield a high-performing system, even with very limited labeled data and heavily imbalanced classes. By integrating data-centric approaches (like augmentation and active learning) with advanced NLP architectures (like Longformer/BERT for large text), you stand a strong chance of building a solution that meets the accuracy and F1 requirements for the hackathon.
Taking a closer look
I’m gathering info on unsupervised classification for 150 papers, considering heuristic-based section detection, hierarchical models, rule-based systems, semantic similarity, fine-tuning, and LLM-generated procedures.
Crafting the plan
Thinking through a Python-based approach with Kaggle libraries, using GPU/TPU. It involves PDF extraction, heuristic detection, unsupervised classification, hierarchical models, and rule-based or LLM-generated pattern generation.
Mapping the process
I’m planning to provide end-to-end Python code for each approach, focusing on unsupervised methods like clustering to spot suspicious patterns. Code blocks will be concise and illustrative.
Assessing documents
I’m pulling together code to chunk and store PDFs, apply unsupervised techniques like clustering with Sentence Transformers and KMeans for outlier detection, and develop a rule-based approach for spotting missing elements.
Assessing similarity
I’m crafting a semantic similarity snippet to match document embeddings, suggesting non-publishable examples if distances fall within a set threshold, and noting the need for GPU specification in Kaggle.
Thought about alternative approaches for classification for 9 seconds
Below is a detailed, end-to-end outline of how you might implement several of the ideas you highlighted—ranging from heuristic-based section detection, a hierarchical model design, rule-based “red flag” detection, unsupervised clustering, and leveraging semantic similarity. The goal is to provide a variety of methods you can try in your Kaggle environment. Where relevant, I’ve included code snippets (in Python) that you can adapt. I’ll also point out potential pitfalls or ways to improve your approach.
Preprocessing & Section Extraction
1.1 Basic Preprocessing vs. Context-Aware Preprocessing
1.2 Heuristic-Based Section Detection
Unsupervised/Clustering Approach
2.1 Embedding Generation (Sentence Transformers)
2.2 K-Means / Hierarchical Clustering
2.3 Interpreting Clusters & Using Minimal Labels
Hierarchical Model for Long Documents
3.1 Chunking into Sections (or Subsections)
3.2 Encoding & Combining Section Embeddings
3.3 Fine-Tuning a Small Classifier Head (Optional)
Rule-Based “Red Flag” Detection
4.1 Designing Heuristics
4.2 Scoring & Combining with Other Methods
Semantic Similarity to Known Non-Publishable Papers
5.1 Building a Similarity Index
5.2 Threshold-Based Classification
Putting It All Together: Example Pipeline
Improving & Checking for Pitfalls
Complete Code Examples
Throughout, I’ll highlight ways to utilize your small set of 15 labeled papers effectively—even if it’s only at the end as a final “label assignment” step.
What you’re currently doing:
def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower()
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)
Suggestions:
Keep characters that can carry meaning in research papers, such as %, =, and possibly parentheses ( ) if relevant.
Many research papers follow a rough structure with headings like Abstract, Introduction, Background/Related Work, Methodology, Results, Discussion, Conclusion, and References.
Heuristic approach:
Below is a conceptual snippet for Kaggle (Python) to detect sections:
import re

def split_into_sections(pdf_text):
    # Possible common headings to look for
    headings = [
        r"(abstract)\b",
        r"(introduction)\b",
        r"(background|related\s?work)\b",
        r"(methodology|methods|experimental\s?setup)\b",
        r"(results|experiments)\b",
        r"(discussion)\b",
        r"(conclusion|conclusions)\b",
        r"(references)\b"
    ]
    # We'll store tuples of (section_name, section_text)
    sections = []
    current_section_name = "Unlabeled"
    current_section_text = []
    lines = pdf_text.split('\n')
    for line in lines:
        line_lower = line.lower().strip()
        matched_heading = None
        for hd in headings:
            if re.match(hd, line_lower):
                matched_heading = line.strip()
                break
        if matched_heading:
            # Save previous section if we have any text
            if current_section_text:
                sections.append((current_section_name, "\n".join(current_section_text)))
            # Start a new section
            current_section_name = matched_heading
            current_section_text = []
        else:
            current_section_text.append(line)
    # Add the last collected section if any
    if current_section_text:
        sections.append((current_section_name, "\n".join(current_section_text)))
    return sections

# Example usage
# pdf_text = "Your extracted text from pdfplumber"
# sections = split_into_sections(pdf_text)
# for sec_name, sec_content in sections:
#     print(sec_name, len(sec_content.split()))
What to do next:
Run each sec_content through your preprocess_text function or a more domain-aware pipeline.
A Sentence Transformers model (e.g., all-mpnet-base-v2, all-distilroberta-v1, or a domain-specific model like allenai/scibert_scivocab_uncased) can create embeddings for text segments. By splitting papers into sections or ~1,000-word chunks:
def chunk_text(text, chunk_size=1000):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield ' '.join(words[i:i+chunk_size])
Then process each chunk with a transformer (like SciBERT) to obtain a vector.
One approach (a “hierarchical” approach) is:
Stack each paper's chunk embeddings into a matrix of shape [num_chunks, embedding_dim], then aggregate (e.g., average) them into a single paper_embedding. Use that paper_embedding for either clustering (unsupervised) or as input to a small neural classifier if you do want to use your labeled data (a minimal sketch of this chunk-and-pool step appears below).
Even if you have only 15 labeled documents, you can still use them, for example to interpret clusters or to fine-tune a small classifier head on top of the aggregated embeddings.
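Picking up the chunk-and-pool idea above, here is a minimal sketch (it assumes the chunk_text helper shown earlier and a Sentence Transformers model; all-mpnet-base-v2 is just one possible choice):

import numpy as np
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("all-mpnet-base-v2")

def get_paper_embedding(text, chunk_size=1000):
    """Chunk a paper, embed each chunk, and mean-pool into a single paper_embedding."""
    chunks = list(chunk_text(text, chunk_size=chunk_size))  # generator defined above
    if not chunks:
        return np.zeros(embed_model.get_sentence_embedding_dimension())
    chunk_embeddings = embed_model.encode(chunks, convert_to_numpy=True)  # [num_chunks, embedding_dim]
    return chunk_embeddings.mean(axis=0)                                  # [embedding_dim]

Mean pooling is the simplest aggregation; weighting chunks by length or by section type is a possible refinement.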
Some potential signals for “Non-Publishable”:
You could create a simple “red flag” score for each paper:
This approach can run in parallel with your unsupervised model. You might ultimately label a paper “non-publishable” if its red-flag score is high, its similarity to the known non-publishable papers is high, or both.
For the similarity side, average the embeddings of the known non-publishable papers into a nonpub_centroid, then compute each document's cosine similarity to that nonpub_centroid. A simple approach: flag any document whose similarity exceeds a threshold chosen by inspecting the scores of your labeled papers.
Overfitting to the 15 labeled docs:
Inconsistent Section Headings:
Large Variation in Paper Length:
GPU/TPU Usage:
Below are some consolidated code snippets (in Python) that you can adapt in a Kaggle notebook. I'll assume you've already installed pdfplumber and sentence-transformers. (If not, you can run !pip install pdfplumber sentence-transformers at the top of your Kaggle notebook.)
import re
import pdfplumber
import torch
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Make sure to download NLTK data if not already present
# nltk.download('punkt')
# nltk.download('stopwords')

def extract_text_from_pdf(pdf_path):
    """Extract all text from a PDF using pdfplumber."""
    pdf_text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                pdf_text += page_text + "\n"
    return pdf_text

def custom_preprocess(text):
    """
    A slightly more cautious preprocessing that:
    - Keeps numbers, percentage signs,
    - Lowercases,
    - Removes some punctuation but keeps essential symbols.
    """
    # Lowercase
    text = text.lower()
    # Keep letters, numbers, whitespace, and symbols like '%', '=', '(', ')', '.', ','
    text = re.sub(r'[^a-z0-9%\=\(\)\.\,\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords (optional)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)

def split_into_sections(pdf_text):
    headings = [
        r"(abstract)\b",
        r"(introduction)\b",
        r"(background|related\s?work)\b",
        r"(methodology|methods|experimental\s?setup)\b",
        r"(results|experiments)\b",
        r"(discussion)\b",
        r"(conclusion|conclusions)\b",
        r"(references)\b"
    ]
    sections = []
    current_section_name = "Unlabeled"
    current_section_text = []
    lines = pdf_text.split('\n')
    for line in lines:
        line_lower = line.strip().lower()
        matched_heading = None
        for hd in headings:
            if re.match(hd, line_lower):
                matched_heading = line.strip()
                break
        if matched_heading:
            if current_section_text:
                sections.append((current_section_name, "\n".join(current_section_text)))
            current_section_name = matched_heading
            current_section_text = []
        else:
            current_section_text.append(line)
    if current_section_text:
        sections.append((current_section_name, "\n".join(current_section_text)))
    return sections
Usage in a Kaggle environment:
# Suppose you have a folder "papers" with PDF files
import os

pdf_dir = "./papers"
documents = []

for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        path = os.path.join(pdf_dir, filename)
        raw_text = extract_text_from_pdf(path)
        sections = split_into_sections(raw_text)

        # Preprocess each section
        processed_sections = []
        for sec_name, sec_text in sections:
            processed_text = custom_preprocess(sec_text)
            processed_sections.append((sec_name, processed_text))

        # Store the document data
        documents.append({
            'filename': filename,
            'sections': processed_sections
        })
Below is a simple example of clustering the average embedding of each paper. We’ll use Sentence Transformers to get embeddings.
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

# Load a pre-trained or domain-specific model (SciBERT, etc.)
model_name = 'all-mpnet-base-v2'  # or 'allenai/scibert_scivocab_uncased'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embed_model = SentenceTransformer(model_name, device=device)

def get_document_embedding(processed_sections):
    """
    Generate embeddings for each section, then average them.
    processed_sections: list of (sec_name, text) after preprocessing.
    """
    section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()]
    if not section_texts:
        # fallback if there's no text
        return np.zeros(768)
    # Encode the entire batch
    embeddings = embed_model.encode(section_texts, convert_to_numpy=True)
    doc_embedding = np.mean(embeddings, axis=0)
    return doc_embedding

# Compute embeddings for all documents
doc_embeddings = []
for doc in documents:
    embedding = get_document_embedding(doc['sections'])
    doc_embeddings.append(embedding)
doc_embeddings = np.array(doc_embeddings)

# K-Means Clustering
num_clusters = 5  # arbitrary choice; you can experiment
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(doc_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = labels[i]

# Inspect the cluster distribution
from collections import Counter
cluster_counts = Counter(labels)
print(cluster_counts)
Next Steps:
Create a few simple functions to detect “red flags,” e.g.:
def count_red_flags(processed_sections):
    flags = 0
    # 1) Check for unrealistic claims: e.g., "100% accuracy"
    text_joined = " ".join(sec_text for _, sec_text in processed_sections)
    if '100% accuracy' in text_joined or '100 percent accuracy' in text_joined:
        flags += 1
    # 2) Check if Methodology section is too short
    for sec_name, sec_text in processed_sections:
        if 'method' in sec_name.lower():
            # Adjust threshold to your taste
            if len(sec_text.split()) < 100:
                flags += 1
    # 3) Check if there's a references section at all
    has_references = any('references' in sec_name.lower() for sec_name, _ in processed_sections)
    if not has_references:
        flags += 1
    return flags

# Example usage
for doc in documents:
    doc['red_flag_score'] = count_red_flags(doc['sections'])

# Then you could set a threshold
threshold = 1
for doc in documents:
    if doc['red_flag_score'] >= threshold:
        doc['suspicious'] = True
    else:
        doc['suspicious'] = False
If you have 5 known non-publishable papers:
known_nonpub_embeddings = []
for doc in documents:
    if doc['filename'] in known_nonpub_filenames:  # e.g. the 5 known non-publishable
        emb = get_document_embedding(doc['sections'])
        known_nonpub_embeddings.append(emb)

nonpub_centroid = np.mean(known_nonpub_embeddings, axis=0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Score each doc
for doc in documents:
    emb = get_document_embedding(doc['sections'])
    sim = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = sim
Then set a threshold:
# For instance, see the distribution of similarity_to_nonpub for known publishable vs. known non-publishable
# This helps you pick a threshold
threshold = 0.7
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > threshold)
Below is a condensed version of how this might look in an actual Kaggle Notebook with minimal commentary:
# Step 0: Install dependencies (if needed) !pip install pdfplumber sentence-transformers import os import pdfplumber import re import torch import nltk import numpy as np from nltk.corpus import stopwords from nltk.tokenize import word_tokenize from sentence_transformers import SentenceTransformer from sklearn.cluster import KMeans from collections import Counter device = 'cuda' if torch.cuda.is_available() else 'cpu' model_name = 'all-mpnet-base-v2' embed_model = SentenceTransformer(model_name, device=device) # Download NLTK data nltk.download('punkt') nltk.download('stopwords') def extract_text_from_pdf(pdf_path): pdf_text = "" with pdfplumber.open(pdf_path) as pdf: for page in pdf.pages: page_text = page.extract_text() if page_text: pdf_text += page_text + "\n" return pdf_text def custom_preprocess(text): text = text.lower() # Keep certain punctuation text = re.sub(r'[^a-z0-9%\=$$\.\,\s]', '', text) tokens = word_tokenize(text) stop_words = set(stopwords.words('english')) tokens = [t for t in tokens if t not in stop_words] return ' '.join(tokens) def split_into_sections(pdf_text): headings = [ r"(abstract)\b", r"(introduction)\b", r"(background|related\s?work)\b", r"(methodology|methods|experimental\s?setup)\b", r"(results|experiments)\b", r"(discussion)\b", r"(conclusion|conclusions)\b", r"(references)\b" ] sections = [] current_section_name = "Unlabeled" current_section_text = [] lines = pdf_text.split('\n') for line in lines: line_lower = line.strip().lower() matched_heading = None for hd in headings: if re.match(hd, line_lower): matched_heading = line.strip() break if matched_heading: if current_section_text: sections.append((current_section_name, "\n".join(current_section_text))) current_section_name = matched_heading current_section_text = [] else: current_section_text.append(line) if current_section_text: sections.append((current_section_name, "\n".join(current_section_text))) return sections def get_document_embedding(processed_sections): section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()] if not section_texts: return np.zeros(768) embeddings = embed_model.encode(section_texts, convert_to_numpy=True) return np.mean(embeddings, axis=0) def count_red_flags(processed_sections): flags = 0 text_joined = " ".join(sec_text for _, sec_text in processed_sections) # Example rule: 100% accuracy if '100% accuracy' in text_joined or '100 percent accuracy' in text_joined: flags += 1 # Example rule: short methodology section for sec_name, sec_text in processed_sections: if 'method' in sec_name.lower(): if len(sec_text.split()) < 100: flags += 1 # Example rule: no references has_references = any('references' in sec_name.lower() for sec_name, _ in processed_sections) if not has_references: flags += 1 return flags # ===== MAIN PIPELINE ===== pdf_dir = "./papers" documents = [] for filename in os.listdir(pdf_dir): if filename.endswith(".pdf"): path = os.path.join(pdf_dir, filename) raw_text = extract_text_from_pdf(path) sections = split_into_sections(raw_text) # Preprocess each section processed_sections = [] for sec_name, sec_text in sections: processed_text = custom_preprocess(sec_text) processed_sections.append((sec_name, processed_text)) documents.append({ 'filename': filename, 'sections': processed_sections }) # Compute embeddings for each doc doc_embeddings = [] for doc in documents: emb = get_document_embedding(doc['sections']) doc_embeddings.append(emb) doc_embeddings = np.array(doc_embeddings) # KMeans clustering num_clusters = 5 kmeans = 
KMeans(n_clusters=num_clusters, random_state=42) labels = kmeans.fit_predict(doc_embeddings) for i, doc in enumerate(documents): doc['cluster_label'] = int(labels[i]) doc['red_flag_score'] = count_red_flags(doc['sections']) cluster_counts = Counter(labels) print("Cluster distribution:", cluster_counts) # Suppose we know the filenames of the 5 known non-publishable known_nonpub_filenames = ["nonpub1.pdf", "nonpub2.pdf", "nonpub3.pdf", "nonpub4.pdf", "nonpub5.pdf"] known_nonpub_embeddings = [] for doc in documents: if doc['filename'] in known_nonpub_filenames: known_nonpub_embeddings.append(get_document_embedding(doc['sections'])) nonpub_centroid = np.mean(known_nonpub_embeddings, axis=0) def cosine_similarity(a, b): return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) for doc in documents: emb = get_document_embedding(doc['sections']) sim = cosine_similarity(emb, nonpub_centroid) doc['similarity_to_nonpub'] = float(sim) # Example threshold for suspicious similarity_threshold = 0.7 red_flag_threshold = 1 for doc in documents: doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold) \ or (doc['red_flag_score'] >= red_flag_threshold) # Inspect final results for doc in documents: print(doc['filename'], doc['cluster_label'], doc['red_flag_score'], doc['similarity_to_nonpub'], doc['maybe_nonpub'])
Review the documents flagged as suspicious (maybe_nonpub).
I hope this gives you a more comprehensive roadmap—and enough code to try out each approach in Kaggle. Feel free to tweak thresholds, the choice of embedding model, clustering method, and rule-based patterns to adapt to your specific corpus. Above all, remember that your domain knowledge about “what makes a paper unpublishable” can be translated into powerful heuristics that complement the unsupervised features.
Framing the approach
I’m setting up for a KMeans visualization by reducing dimensionality with PCA, then creating a scatter plot to color or label clusters, highlighting specific examples.
Visualizing similarities
I’m preparing to create a histogram or density curve to show cosine similarity distributions, using colors or shapes to highlight 15 specific documents and their unique characteristics.
Thought for 4 seconds
Below is a structured walk-through of how to visualize the clustering results, the cosine similarity distribution, and the red flag scores for your documents. I’ve also included code snippets that highlight your 15 labeled PDFs (R001.pdf to R015.pdf), with R001–R005 labeled as class 0 and R006–R015 labeled as class 1. Where possible, I’ve included commentary for improvements or alternative approaches.
Since doc_embeddings is likely high-dimensional (e.g., 768 dimensions from a transformer), we’ll reduce the dimensionality to 2D for plotting. You can use PCA, t-SNE, or UMAP. Below, I’ll show PCA for simplicity, but feel free to replace it with t-SNE or UMAP.
import matplotlib.pyplot as plt import numpy as np from sklearn.decomposition import PCA from sklearn.manifold import TSNE # We have: # doc_embeddings (shape: [num_documents, embedding_dim]) # documents (a list of dicts with 'filename' and 'cluster_label') # Let's define our labeled sets labeled_files_class0 = set([f"R00{i}.pdf" for i in range(1,6)]) # R001 to R005 labeled_files_class1 = set([f"R00{i}.pdf" for i in range(6,16)]) # R006 to R015 # (1) PCA for 2D visualization pca = PCA(n_components=2, random_state=42) reduced_pca = pca.fit_transform(doc_embeddings) # (2) Alternatively, you can try t-SNE (comment out PCA if you want this) # tsne = TSNE(n_components=2, random_state=42) # reduced_tsne = tsne.fit_transform(doc_embeddings) # Then use reduced_tsne instead of reduced_pca below. # We'll use reduced_pca here for demonstration. x_coords = reduced_pca[:, 0] y_coords = reduced_pca[:, 1] # Create a figure for the cluster visualization plt.figure(figsize=(10, 7)) # Plot each document as a point, colored by cluster_label clusters = [doc['cluster_label'] for doc in documents] scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6) plt.title("K-Means Clusters (PCA 2D Projection)") plt.xlabel("Principal Component 1") plt.ylabel("Principal Component 2") # Now let's highlight or label the 15 known PDFs for i, doc in enumerate(documents): filename = doc['filename'] # If this doc is one of the 15 labeled if filename in labeled_files_class0 or filename in labeled_files_class1: # We'll put a black edge or a special marker plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5) # Optionally, annotate text # e.g., display 'C0' or 'C1' for class 0 or class 1 if filename in labeled_files_class0: label_text = "C0" else: label_text = "C1" plt.text(x_coords[i]+0.02, y_coords[i]+0.02, label_text, fontsize=9) # Add a legend for the cluster colors # We'll add it by mapping each cluster label to a color patch unique_clusters = list(sorted(set(clusters))) handles = scatter.legend_elements()[0] labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters] plt.legend(handles, labels_for_legend, title="Clusters", loc="best") plt.show()
Each document is colored by its cluster_label. The 15 labeled PDFs (R001-R005 are class 0, R006-R015 are class 1) are drawn with a black edge around the marker, and we label them in text as “C0” or “C1.” If you want a more advanced or non-linear reduction (like t-SNE or UMAP), just uncomment the relevant lines, and refer to reduced_tsne[:, 0] and reduced_tsne[:, 1] for x and y coordinates.
You have code that computes doc['similarity_to_nonpub'] for each document. Let’s plot a histogram (or KDE plot) of these similarity scores and highlight the 15 labeled PDFs.
import seaborn as sns # Extract similarity scores for all docs similarities = [doc['similarity_to_nonpub'] for doc in documents] # Make a list of labeled similarities, plus the label for each similarities_labeled_c0 = [] similarities_labeled_c1 = [] similarities_unlabeled = [] for i, doc in enumerate(documents): sim_val = doc['similarity_to_nonpub'] filename = doc['filename'] if filename in labeled_files_class0: similarities_labeled_c0.append(sim_val) elif filename in labeled_files_class1: similarities_labeled_c1.append(sim_val) else: similarities_unlabeled.append(sim_val) plt.figure(figsize=(10, 6)) # Plot distribution with seaborn's histplot or kdeplot sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents') # Plot labeled class 0 sns.histplot(similarities_labeled_c0, color='red', alpha=0.8, kde=True, label='Class 0 (R001-R005)') # Plot labeled class 1 sns.histplot(similarities_labeled_c1, color='blue', alpha=0.6, kde=True, label='Class 1 (R006-R015)') plt.title("Cosine Similarity to Non-Publishable Centroid") plt.xlabel("Similarity Score") plt.ylabel("Count") plt.legend(loc='best') plt.show()
Ideally, the known non-publishable papers should show higher similarity to the nonpub_centroid than the “Publishable” ones. For a smoother view of the distributions, you can use sns.kdeplot(...) instead of sns.histplot(kde=True, ...).
You mentioned that having or not having a references section wasn’t always accurate. Some alternative heuristics: flag very short papers, papers that never mention a baseline or comparison, papers with no mention of data or datasets, and papers with no mention of common evaluation metrics.
Below is an updated example of a “red flag” function using some of these additional heuristics:
def count_red_flags(processed_sections):
    flags = 0
    full_text = " ".join(sec_text for _, sec_text in processed_sections)
    word_count = len(full_text.split())
    # 1) Very short paper
    if word_count < 1000:
        flags += 1
    # 2) Check if baseline or comparison keywords are absent
    if ("baseline" not in full_text) and ("compare" not in full_text) and ("benchmark" not in full_text):
        flags += 1
    # 3) Check if there's any mention of data or dataset
    if ("data" not in full_text) and ("dataset" not in full_text) and ("corpus" not in full_text):
        flags += 1
    # 4) No mention of common evaluation metrics
    common_metrics = ["accuracy", "precision", "recall", "f1", "auc", "mse", "rmse"]
    if not any(metric in full_text for metric in common_metrics):
        flags += 1
    return flags
Similar to the cosine similarity distribution, you might want to plot a histogram of red_flag_score across all documents. We’ll again highlight the labeled ones.
import seaborn as sns red_flag_scores = [doc['red_flag_score'] for doc in documents] scores_labeled_c0 = [] scores_labeled_c1 = [] scores_unlabeled = [] for doc in documents: score_val = doc['red_flag_score'] filename = doc['filename'] if filename in labeled_files_class0: scores_labeled_c0.append(score_val) elif filename in labeled_files_class1: scores_labeled_c1.append(score_val) else: scores_unlabeled.append(score_val) plt.figure(figsize=(10, 6)) sns.histplot(red_flag_scores, color='gray', alpha=0.4, binwidth=1, label='All Documents') sns.histplot(scores_labeled_c0, color='red', alpha=0.8, binwidth=1, label='Class 0 (R001-R005)') sns.histplot(scores_labeled_c1, color='blue', alpha=0.6, binwidth=1, label='Class 1 (R006-R015)') plt.title("Red Flag Score Distribution") plt.xlabel("Red Flag Score (count of triggered heuristics)") plt.ylabel("Number of Papers") plt.legend(loc='best') plt.show()
This will give you a sense of how many rules each document is triggering, and whether your known class 0 or class 1 docs align with high or low red-flag counts.
To recap: reduce doc_embeddings to 2D, color by cluster, and highlight the labeled docs. By combining these visualizations, you’ll get a much better understanding of how your dataset is structured, how your known “non-publishable” vs. “publishable” examples fit in, and which heuristics or similarity thresholds might be most effective.
The issue you're describing likely stems from how the labeled files (R001 to R015) are matched against the filenames in your documents list. If there's any mismatch in how the filenames are stored (e.g., missing .pdf extensions, inconsistent capitalization, or whitespace issues), some of the labeled files might not be correctly identified, resulting in only a subset being visualized.
Here’s a step-by-step rectification and improved code to ensure all 15 labeled files are visualized properly:
Ensure that the filenames for your labeled files are correctly matched against the filenames in your documents list. Normalize both sets of filenames to avoid mismatches.
# Define labeled files with consistent, zero-padded formatting
labeled_files_class0 = {f"R{i:03d}.pdf" for i in range(1, 6)}   # R001-R005
labeled_files_class1 = {f"R{i:03d}.pdf" for i in range(6, 16)}  # R006-R015
# (Note: f"R00{i}.pdf" would produce "R0010.pdf" for i >= 10, which never matches R010.pdf-R015.pdf.)

# Normalize filenames in the documents list for comparison
for doc in documents:
    doc['normalized_filename'] = doc['filename'].strip().lower()  # Normalize filenames

# Ensure the labeled sets are also normalized
labeled_files_class0 = {f.strip().lower() for f in labeled_files_class0}
labeled_files_class1 = {f.strip().lower() for f in labeled_files_class1}
Update the clustering visualization code to ensure all labeled files are highlighted properly:
# PCA for dimensionality reduction pca = PCA(n_components=2, random_state=42) reduced_pca = pca.fit_transform(doc_embeddings) # Extract coordinates x_coords = reduced_pca[:, 0] y_coords = reduced_pca[:, 1] # Create figure plt.figure(figsize=(10, 7)) clusters = [doc['cluster_label'] for doc in documents] scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6) plt.title("K-Means Clusters (PCA 2D Projection)") plt.xlabel("Principal Component 1") plt.ylabel("Principal Component 2") # Highlight labeled files for i, doc in enumerate(documents): filename = doc['normalized_filename'] if filename in labeled_files_class0 or filename in labeled_files_class1: plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5) # Label the points if filename in labeled_files_class0: label_text = "C0" elif filename in labeled_files_class1: label_text = "C1" plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9) # Add legend for clusters unique_clusters = list(sorted(set(clusters))) handles = scatter.legend_elements()[0] labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters] plt.legend(handles, labels_for_legend, title="Clusters", loc="best") plt.show()
Update the cosine similarity plot to ensure all labeled files are included:
# Extract cosine similarity values similarities = [doc['similarity_to_nonpub'] for doc in documents] # Categorize similarities similarities_labeled_c0 = [] similarities_labeled_c1 = [] for doc in documents: filename = doc['normalized_filename'] sim_val = doc['similarity_to_nonpub'] if filename in labeled_files_class0: similarities_labeled_c0.append(sim_val) elif filename in labeled_files_class1: similarities_labeled_c1.append(sim_val) plt.figure(figsize=(10, 6)) # Plot overall distribution sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents') # Highlight class 0 sns.histplot(similarities_labeled_c0, kde=True, color='red', alpha=0.8, label='Class 0 (R001-R005)') # Highlight class 1 sns.histplot(similarities_labeled_c1, kde=True, color='blue', alpha=0.6, label='Class 1 (R006-R015)') plt.title("Cosine Similarity to Non-Publishable Centroid") plt.xlabel("Similarity Score") plt.ylabel("Count") plt.legend(loc='best') plt.show()
If any of the 15 labeled files still fail to appear in the plots, print the filenames for debugging:
# Check which labeled files are not being matched
missing_files_c0 = labeled_files_class0 - {doc['normalized_filename'] for doc in documents}
missing_files_c1 = labeled_files_class1 - {doc['normalized_filename'] for doc in documents}
print("Missing Class 0 Files:", missing_files_c0)
print("Missing Class 1 Files:", missing_files_c1)
Ensure the filenames in the labeled lists (labeled_files_class0, labeled_files_class1) exactly match the filenames in documents.
Let me know if the updated code resolves the issue or if further adjustments are needed!
Here's how you can adjust the previous visualization code to highlight the documents where maybe_nonpub == True. This involves updating both the clustering visualization and the cosine similarity distribution visualization.
Clustering Visualization with maybe_nonpub Highlighting
We’ll add a special highlight for documents marked as maybe_nonpub == True in the clustering plot. These will be shown with a distinct edge color (e.g., orange).
# PCA for dimensionality reduction pca = PCA(n_components=2, random_state=42) reduced_pca = pca.fit_transform(doc_embeddings) # Extract coordinates x_coords = reduced_pca[:, 0] y_coords = reduced_pca[:, 1] # Create figure plt.figure(figsize=(10, 7)) clusters = [doc['cluster_label'] for doc in documents] scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6) plt.title("K-Means Clusters (PCA 2D Projection)") plt.xlabel("Principal Component 1") plt.ylabel("Principal Component 2") # Highlight labeled files for i, doc in enumerate(documents): filename = doc['normalized_filename'] # Highlight labeled documents (Class 0 and Class 1) if filename in labeled_files_class0 or filename in labeled_files_class1: plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5) if filename in labeled_files_class0: label_text = "C0" elif filename in labeled_files_class1: label_text = "C1" plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9) # Highlight maybe_nonpub documents with a distinct edge color if doc.get('maybe_nonpub', False): plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange', marker='s', s=120, linewidths=1.5) # Add legend for clusters unique_clusters = list(sorted(set(clusters))) handles = scatter.legend_elements()[0] labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters] plt.legend(handles, labels_for_legend, title="Clusters", loc="best") plt.show()
This plot highlights the labeled documents (R001 to R015) with class annotations (C0 and C1), and marks documents with maybe_nonpub == True with orange square edges.
Cosine Similarity Distribution with maybe_nonpub Highlighting
We’ll adjust the cosine similarity plot to highlight the distribution of documents where maybe_nonpub == True in addition to the labeled documents.
# Extract cosine similarity values similarities = [doc['similarity_to_nonpub'] for doc in documents] # Categorize similarities similarities_labeled_c0 = [] similarities_labeled_c1 = [] similarities_maybe_nonpub = [] for doc in documents: filename = doc['normalized_filename'] sim_val = doc['similarity_to_nonpub'] if filename in labeled_files_class0: similarities_labeled_c0.append(sim_val) elif filename in labeled_files_class1: similarities_labeled_c1.append(sim_val) if doc.get('maybe_nonpub', False): similarities_maybe_nonpub.append(sim_val) plt.figure(figsize=(10, 6)) # Plot overall distribution sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents') # Highlight class 0 sns.histplot(similarities_labeled_c0, kde=True, color='red', alpha=0.8, label='Class 0 (R001-R005)') # Highlight class 1 sns.histplot(similarities_labeled_c1, kde=True, color='blue', alpha=0.6, label='Class 1 (R006-R015)') # Highlight maybe_nonpub sns.histplot(similarities_maybe_nonpub, kde=True, color='orange', alpha=0.5, label='Maybe Non-Publishable') plt.title("Cosine Similarity to Non-Publishable Centroid") plt.xlabel("Similarity Score") plt.ylabel("Count") plt.legend(loc='best') plt.show()
The red and blue histograms show Class 0 (R001-R005) and Class 1 (R006-R015), respectively; the orange histogram shows documents with maybe_nonpub == True.
To ensure the plots work without errors, verify that every entry in documents has a normalized filename (normalized_filename) and the maybe_nonpub attribute (defaulting to False if not already present):

for doc in documents:
    if 'normalized_filename' not in doc or 'maybe_nonpub' not in doc:
        print("Missing attributes in:", doc)
Let me know if the updated code highlights the maybe_nonpub documents correctly and if further refinements are needed!
Thought for 7 seconds
Below is a detailed answer addressing your specific questions:
Finally, I provide two complete code examples (so you can simply paste them into your workflow):
Practical Tip: Start with a higher threshold (e.g., > 0.60) to keep false positives low. Then you can slowly lower the threshold if you want to catch borderline cases, but watch for error drift.
Below is a consolidated code block illustrating:
Iterative centroid expansion, K-Means clustering, and a small MLP classifier that uses the document embeddings plus similarity_to_nonpub as input features.
Where to paste: Ideally, after you’ve computed doc_embeddings, similarity_to_nonpub, and maybe_nonpub for each document. This can replace or supplement the final classification steps.
############################################################################### # SET 1: Iterative Centroid Expansion + K-Means + MLP ############################################################################### import numpy as np from copy import deepcopy from sklearn.neural_network import MLPClassifier from sklearn.model_selection import cross_val_score import warnings warnings.filterwarnings('ignore') ############################################################################### # 1. Iterative Centroid Expansion ############################################################################### def iterative_nonpub_expansion(documents, known_nonpub_filenames, similarity_threshold=0.60, max_iterations=5): """ Iteratively expands the known non-publishable list by adding any documents with similarity > similarity_threshold. Then recalculates centroid. Arguments: - documents: list of dicts containing doc['filename'], doc['similarity_to_nonpub'], etc. - known_nonpub_filenames: initial set/list of known non-publishable filenames (strings). - similarity_threshold: similarity above which a doc is considered "non-publishable." - max_iterations: maximum number of expansions to run. Returns: - updated_nonpub_filenames: final set of non-publishable filenames after expansion """ updated_nonpub_filenames = set(known_nonpub_filenames) iteration = 0 while iteration < max_iterations: iteration += 1 # Calculate new centroid from the current set nonpub_embeddings = [] for doc in documents: if doc['filename'] in updated_nonpub_filenames: emb = doc['doc_embedding'] # we store doc_embedding for convenience nonpub_embeddings.append(emb) new_centroid = np.mean(nonpub_embeddings, axis=0) # Add docs that exceed the threshold newly_added = False for doc in documents: if doc['filename'] not in updated_nonpub_filenames: # compute similarity to new centroid emb = doc['doc_embedding'] sim = cosine_similarity(emb, new_centroid) if sim > similarity_threshold: updated_nonpub_filenames.add(doc['filename']) newly_added = True if not newly_added: print(f"No new docs added in iteration {iteration}. Stopping.") break else: print(f"Iteration {iteration} added new docs. Current size of non-pub set: {len(updated_nonpub_filenames)}.") return updated_nonpub_filenames # We'll assume you already have doc['doc_embedding'] = get_document_embedding(...) # To store doc_embedding in your existing pipeline, do something like: # for i, doc in enumerate(documents): # doc['doc_embedding'] = doc_embeddings[i] # 1.1: Store embeddings in each doc for convenience for i, doc in enumerate(documents): doc['doc_embedding'] = doc_embeddings[i] # 1.2: Perform initial similarity calculation (already done in your code, typically) # for doc in documents: # sim = cosine_similarity(doc['doc_embedding'], nonpub_centroid) # doc['similarity_to_nonpub'] = float(sim) # 1.3: Iteratively expand known_nonpub_initial = ["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"] expanded_nonpub_filenames = iterative_nonpub_expansion(documents, known_nonpub_initial, similarity_threshold=0.60, max_iterations=5) print("Final set of non-publishable (expanded) = ", expanded_nonpub_filenames) ############################################################################### # 2. K-Means (Already in your code) ############################################################################### # We'll keep your existing KMeans step. No changes needed unless you want to integrate # the newly expanded non-publishable set in some way (like removing them from K-Means, etc.). 
# # For reference: # labels = kmeans.fit_predict(doc_embeddings) # for i, doc in enumerate(documents): # doc['cluster_label'] = int(labels[i]) ############################################################################### # 3. Small MLP for classification ############################################################################### # We'll build a tiny dataset from the 15 labeled documents you have # (including your newly expanded nonpub set if you like). # Then we do cross-validation to evaluate how well it classifies # publishable (class 1) vs non-publishable (class 0). from sklearn.model_selection import StratifiedKFold # Let's define your 15 labeled docs (or any subset you trust): nonpub_set_c0 = set(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"]) pub_set_c1 = set(["R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf", "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"]) X = [] y = [] for doc in documents: fname = doc['filename'] if fname in nonpub_set_c0: X.append(doc['doc_embedding']) y.append(0) elif fname in pub_set_c1: X.append(doc['doc_embedding']) y.append(1) X = np.array(X) y = np.array(y) # If we want to include similarity as an extra feature: # We'll combine doc_embedding with similarity_to_nonpub # Just do a quick horizontal stack: X_aug = [] for doc, label in zip(documents, range(len(documents))): # but we only want those that are in our labeled set fname = doc['filename'] if fname in nonpub_set_c0 or fname in pub_set_c1: embed_768 = doc['doc_embedding'] sim_val = doc['similarity_to_nonpub'] # a single float # create a feature vector of size 769 new_vec = np.concatenate([embed_768, [sim_val]]) X_aug.append(new_vec) # We must track y in the same order y_aug = [] for doc in documents: fname = doc['filename'] if fname in nonpub_set_c0: y_aug.append(0) elif fname in pub_set_c1: y_aug.append(1) X_aug = np.array(X_aug) y_aug = np.array(y_aug) # We'll train an MLP on X_aug, y_aug mlp = MLPClassifier(hidden_layer_sizes=(64,), random_state=42, max_iter=1000) skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) scores = cross_val_score(mlp, X_aug, y_aug, cv=skf, scoring='accuracy') print(f"5-Fold CV Accuracy for MLP: {scores.mean():.2f} +/- {scores.std():.2f}") mlp.fit(X_aug, y_aug) # Now we can apply this MLP to all 150 documents # They each get a predicted label: 0 = non-publishable, 1 = publishable for doc in documents: embed_768 = doc['doc_embedding'] sim_val = doc['similarity_to_nonpub'] test_vec = np.concatenate([embed_768, [sim_val]]) pred_label = mlp.predict(test_vec.reshape(1, -1))[0] doc['mlp_pred_label'] = int(pred_label) # doc['mlp_pred_label'] == 0 => non-publishable, 1 => publishable
The snippet assumes you have already computed doc_embeddings and the initial nonpub_centroid, and the MLP features are built from doc_embeddings and doc['similarity_to_nonpub'].
This second code snippet replaces K-Means with HDBSCAN or Hierarchical Clustering. HDBSCAN requires installing the library:
!pip install hdbscan
We’ll also show how to do the MLP in the same manner. (You can remove the MLP part if you only want to test clustering.)
Where to paste: Replace your K-Means block with the HDBSCAN block. Keep the rest of your pipeline the same.
############################################################################### # SET 2: Using HDBSCAN (Outlier Detection) + Optional MLP ############################################################################### !pip install hdbscan import hdbscan # 1) HDBSCAN to cluster doc_embeddings hdb = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3, metric='euclidean') hdb_labels = hdb.fit_predict(doc_embeddings) # We can store these cluster labels in documents for i, doc in enumerate(documents): doc['hdb_cluster_label'] = int(hdb_labels[i]) # -1 means outlier/no cluster # Let's see how many are labeled as outliers outlier_count = sum(1 for lbl in hdb_labels if lbl == -1) print(f"Number of outliers (no cluster assigned): {outlier_count}") # 2) Inspect cluster distribution label_counts = Counter(hdb_labels) print("HDBSCAN Cluster Distribution:", label_counts) # 3) If you want a quick visualization: import matplotlib.pyplot as plt import numpy as np from sklearn.decomposition import PCA pca = PCA(n_components=2, random_state=42) reduced_pca = pca.fit_transform(doc_embeddings) x_coords = reduced_pca[:,0] y_coords = reduced_pca[:,1] plt.figure(figsize=(10,7)) scatter = plt.scatter(x_coords, y_coords, c=hdb_labels, cmap='tab10', alpha=0.6) plt.title("HDBSCAN Clusters (PCA 2D Projection)") plt.xlabel("PC1") plt.ylabel("PC2") # highlight your known labeled docs labeled_files_class0 = set(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"]) labeled_files_class1 = set(["R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf", "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"]) for i, doc in enumerate(documents): if doc['filename'] in labeled_files_class0 or doc['filename'] in labeled_files_class1: plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5) if doc['filename'] in labeled_files_class0: label_text = "C0" else: label_text = "C1" plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9) plt.show() # 4) Optional: Train an MLP or logistic regression using same approach as above # Build labeled X, y from the 15 known docs from sklearn.neural_network import MLPClassifier from sklearn.model_selection import StratifiedKFold X_aug = [] y_aug = [] for doc in documents: fname = doc['filename'] if fname in labeled_files_class0: # Create feature vector from doc['doc_embedding'] + doc['similarity_to_nonpub'] embed_768 = doc_embeddings[[i for i, d in enumerate(documents) if d['filename'] == fname][0]] sim_val = doc['similarity_to_nonpub'] new_vec = np.concatenate([embed_768, [sim_val]]) X_aug.append(new_vec) y_aug.append(0) elif fname in labeled_files_class1: embed_768 = doc_embeddings[[i for i, d in enumerate(documents) if d['filename'] == fname][0]] sim_val = doc['similarity_to_nonpub'] new_vec = np.concatenate([embed_768, [sim_val]]) X_aug.append(new_vec) y_aug.append(1) X_aug = np.array(X_aug) y_aug = np.array(y_aug) mlp = MLPClassifier(hidden_layer_sizes=(64,), random_state=42, max_iter=1000) skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) scores = cross_val_score(mlp, X_aug, y_aug, cv=skf, scoring='accuracy') print(f"[HDBSCAN] 5-Fold CV Accuracy for MLP: {scores.mean():.2f} +/- {scores.std():.2f}") mlp.fit(X_aug, y_aug) for doc in documents: # For each doc, build the same feature vector and predict embed_768 = doc['doc_embedding'] sim_val = doc['similarity_to_nonpub'] test_vec = np.concatenate([embed_768, [sim_val]]) pred_label = mlp.predict(test_vec.reshape(1, -1))[0] doc['mlp_pred_label'] = 
int(pred_label) print("Done: HDBSCAN + MLP classification.")
Note:
In HDBSCAN, a cluster label of -1 indicates an outlier (i.e., no cluster). You can treat outliers as “suspect” if you suspect unusual papers might be “non-publishable,” or you can do a separate check on them.
Ultimately, it’s best to compare the approaches on your small labeled set (the 15 papers). Whichever approach consistently classifies them correctly (via cross-validation) is typically the safest bet to scale to the 150-paper dataset.
Feel free to integrate the above code snippets wherever they fit best in your existing pipeline. Good luck!
Thought for 6 seconds
Below is a deeper discussion of the issues you’re encountering, how to interpret your similarity-based approach, and how you might reduce false negatives (while keeping false positives in check). I’ll also address how you can evaluate different transformers to pick one that best separates publishable and non-publishable papers.
When you calculate the cosine similarity between your paper embeddings and the “non-publishable centroid,” you typically end up with a distribution of similarity scores (e.g., from 0.0 to 1.0). You may also separate out the scores for:
You mentioned that “75% of data is under the threshold” and are thus classified as publishable. This is typical if your threshold is somewhat high. The question is: Are you comfortable with that many being labeled publishable if it means you avoid false positives?
To gauge that, you can do a small precision-recall style analysis on your 15 labeled docs:
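For example, here is a small threshold sweep over the 15 labeled papers (a sketch that assumes each entry in documents carries 'similarity_to_nonpub' and that labeled_files_class0 / labeled_files_class1 are defined as in the earlier snippets; it treats "non-publishable" as the positive class):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true, scores = [], []
for doc in documents:
    fname = doc['filename']
    if fname in labeled_files_class0:        # known non-publishable
        y_true.append(1)
        scores.append(doc['similarity_to_nonpub'])
    elif fname in labeled_files_class1:      # known publishable
        y_true.append(0)
        scores.append(doc['similarity_to_nonpub'])

y_true, scores = np.array(y_true), np.array(scores)

for threshold in np.arange(0.40, 0.81, 0.05):
    y_pred = (scores > threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    f = f1_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")

With only 15 points the numbers will be noisy, but the sweep makes the precision/recall trade-off at each candidate threshold explicit.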
If your main problem is that the model is “missing” some truly non-publishable papers that have similarity below your chosen threshold, here are ways to tackle that:
Dual-Centroid Approach: Instead of a single “non-publishable” centroid, also compute a “publishable” centroid using your known Class 1 docs. Then, for each unlabeled paper, compare its distance (similarity) to both centroids. Whichever it’s closest to (or whichever similarity is higher) is the predicted class. This can sometimes reduce false negatives if those borderline papers are “less similar” to the publishable centroid.
Manual Spot-Check Borderlines: If you have a band around your threshold (say 0.50–0.60) where many papers lie, you could manually examine some of these borderline papers to either confirm they’re publishable or non-publishable. This manual check (even if for a handful of papers) can refine your threshold or expand your labeled set.
Active Learning: Similar to the manual spot-check, you pick the unlabeled docs that are near the threshold. Label them carefully. Recompute the centroid with these newly labeled docs. This way, you’re less likely to incorporate incorrectly labeled examples (which can happen with naive iterative expansion). A small selection sketch follows after this list.
Add a Secondary Feature: If the similarity approach alone leads to false negatives, consider adding a single, carefully chosen rule-based feature. For instance, if a borderline doc is suspiciously short or lacks any mention of “data/method,” maybe that alone can nudge its final classification to non-publishable. You don’t have to rely on an entire rule-based system—just one or two very telling features might help.
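To make the manual spot-check / active-learning idea concrete, here is a small selection sketch (the band limits 0.50 and 0.60 are illustrative, not tuned; it assumes doc['similarity_to_nonpub'] has already been computed):

# Pull out borderline papers near the working threshold for manual review
lower, upper = 0.50, 0.60  # illustrative band; adjust to your score distribution

borderline_docs = [doc for doc in documents
                   if lower <= doc['similarity_to_nonpub'] <= upper]
borderline_docs.sort(key=lambda d: d['similarity_to_nonpub'], reverse=True)

for doc in borderline_docs:
    print(f"{doc['filename']}: similarity={doc['similarity_to_nonpub']:.3f}")

# After manually labeling a few of these, add them to the appropriate labeled set
# and recompute the non-publishable centroid.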
Different transformers (e.g., all-mpnet-base-v2, scibert_scivocab_uncased, all-distilroberta-v1) can produce different embedding spaces. Some points to remember:
all-distilroberta-v1 might produce embeddings whose cosine similarity has a narrower band overall compared to mpnet or scibert.
scibert_scivocab_uncased might do better if the text is very academic. Alternatively, mpnet or a huggingface “sentence-transformers” model might be more robust across general text.
Key: The “best” model will produce the largest separation between known Class 0 and Class 1, or at least a separation that is easy to threshold.
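One hedged way to compare candidate models is to measure, for each model, how far apart the known Class 0 and Class 1 papers fall relative to a non-publishable centroid built from the Class 0 papers. A sketch (the model names are examples; texts_by_filename is an assumed dict mapping each labeled filename to its preprocessed text, and the centroid is built from the same Class 0 docs, so read the gap as a rough comparative signal only):

import numpy as np
from sentence_transformers import SentenceTransformer

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

candidate_models = ["all-mpnet-base-v2", "all-distilroberta-v1", "allenai/scibert_scivocab_uncased"]

for model_name in candidate_models:
    model = SentenceTransformer(model_name)
    emb_c0 = model.encode([texts_by_filename[f] for f in sorted(labeled_files_class0)], convert_to_numpy=True)
    emb_c1 = model.encode([texts_by_filename[f] for f in sorted(labeled_files_class1)], convert_to_numpy=True)
    centroid_c0 = emb_c0.mean(axis=0)
    sim_c0 = np.mean([cos_sim(e, centroid_c0) for e in emb_c0])  # within-class similarity
    sim_c1 = np.mean([cos_sim(e, centroid_c0) for e in emb_c1])  # cross-class similarity
    print(f"{model_name}: class0 sim={sim_c0:.3f}, class1 sim={sim_c1:.3f}, gap={sim_c0 - sim_c1:.3f}")

A larger gap suggests that model's embedding space separates the two classes more cleanly, which usually makes a threshold easier to pick.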
Stay with the Similarity Approach: if it’s giving you 0 false positives but you want to reduce false negatives, consider:
Validate Thresholds with Known Data
Light Touch Rules
Refine the Embedding Model
Minimal Classifier: a small classifier on [768-dim embeddings + 1 similarity feature] might be enough to glean some improvement without overfitting too easily.
Interpretation of Similarity Plot:
Improving False Negatives:
Selecting the Best Transformer:
Why Some Methods Fail:
Recommendation:
Your strong initial results show that cosine similarity (with a well-chosen threshold) can already separate non-publishable from publishable decently. Adding more data—or carefully adjusting your threshold—will likely do more to address the false negatives than fancy expansions or clustering. If you only have 15 labeled docs, applying active learning or a dual-centroid approach can be more robust than heavy iterative expansions that risk picking up accidental false positives.
Focus on incremental improvements around your best baseline. Sometimes the simplest method (similarity + threshold) is both the most interpretable and the most robust, especially with very limited labels.
Thought for 8 seconds
Below is a comprehensive answer that addresses all your requests:
Your labeled data: R001–R005 (Class 0) and R006–R015 (Class 1).
For each document, we assume you already have:
doc_embeddings[i]: a vector (e.g. 768-dimensional) for the i-th document
documents[i]['similarity_to_nonpub']: a float for the similarity to the “non-publishable centroid”
documents[i]['filename']: the PDF filename
Where to store the labeled classes:
labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
We’ll combine the raw embedding (768-dim) + the single similarity feature = 769-dim total.
import numpy as np from sklearn.linear_model import LogisticRegression from sklearn.model_selection import StratifiedKFold, cross_val_score # 1) Build X, y from your 15 labeled docs X_list = [] y_list = [] for i, doc in enumerate(documents): fname = doc['filename'] if fname in labeled_files_class0: # Non-publishable x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]]) X_list.append(x_vec) y_list.append(0) elif fname in labeled_files_class1: # Publishable x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]]) X_list.append(x_vec) y_list.append(1) X = np.array(X_list) y = np.array(y_list) # 2) Logistic Regression with strong regularization # Using 'saga' or 'liblinear' solver can handle small data well. # We also set a high C value for L2 or try a smaller C for stronger reg. logreg = LogisticRegression(solver='liblinear', C=1.0, penalty='l2', random_state=42) # 3) Cross-validation to check for overfitting skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) scores = cross_val_score(logreg, X, y, cv=skf, scoring='accuracy') print(f"[LogReg] 5-Fold CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}") # 4) Fit final model on entire labeled set logreg.fit(X, y) # 5) Predict for all 150 documents for i, doc in enumerate(documents): x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]]) pred_label = logreg.predict(x_vec.reshape(1, -1))[0] doc['logreg_pred'] = int(pred_label) # 0=non-publishable, 1=publishable # If you want predicted probabilities: # doc['logreg_prob'] = logreg.predict_proba(x_vec.reshape(1, -1))[0][1]
You have three 2D projections (PCA, t-SNE, etc.). The logistic regression actually lives in a 769-dimensional space (embedding + similarity). Plotting that boundary in 2D is non-trivial. Typically, one of two strategies is used:
Approach (1): train a separate logistic regression directly on the 2D coordinates [pca_x, pca_y] for the labeled docs. Then we can easily plot the boundary in PCA space.
For simplicity, let’s do approach (1)—train a separate logistic regression purely for the 2D PCA coordinates of your labeled docs, and then we can plot the boundary. Note: This is only for illustration, not the actual classifier you use in practice.
############ Visualization of Decision Boundary in PCA ############ import matplotlib.pyplot as plt import numpy as np from sklearn.decomposition import PCA from sklearn.linear_model import LogisticRegression # 1) Prepare PCA (2D) for all docs pca = PCA(n_components=2, random_state=42) reduced_pca = pca.fit_transform(doc_embeddings) x_coords = reduced_pca[:, 0] y_coords = reduced_pca[:, 1] # 2) Build the training set in 2D PCA for the 15 labeled docs X_2d = [] y_2d = [] for i, doc in enumerate(documents): fname = doc['filename'] if fname in labeled_files_class0: # Non-publishable X_2d.append([x_coords[i], y_coords[i]]) y_2d.append(0) elif fname in labeled_files_class1: # Publishable X_2d.append([x_coords[i], y_coords[i]]) y_2d.append(1) X_2d = np.array(X_2d) y_2d = np.array(y_2d) # 3) Train logistic regression on the 2D PCA data logreg_2d = LogisticRegression(solver='liblinear', random_state=42) logreg_2d.fit(X_2d, y_2d) # 4) Plot PCA scatter with the decision boundary plt.figure(figsize=(10,7)) # Color by cluster_label (if using K-Means) clusters = [doc['cluster_label'] for doc in documents] scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6) plt.title("K-Means Clusters (PCA 2D Projection) + LogReg Boundary") plt.xlabel("Principal Component 1") plt.ylabel("Principal Component 2") # Mark labeled docs for i, doc in enumerate(documents): fname = doc['filename'] if fname in labeled_files_class0 or fname in labeled_files_class1: plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5) if fname in labeled_files_class0: label_text = "C0" else: label_text = "C1" plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9) if doc.get('maybe_nonpub', False): plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange', marker='s', s=120, linewidths=1.5) # Plot the boundary in 2D # We do this by creating a grid of (x,y), predict with logreg_2d, and draw contour. xx, yy = np.meshgrid( np.linspace(x_coords.min()-0.5, x_coords.max()+0.5, 200), np.linspace(y_coords.min()-0.5, y_coords.max()+0.5, 200) ) Z = logreg_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape) plt.contour(xx, yy, Z, levels=[0.5], cmap="gray", linewidths=2) # Add cluster legend unique_clusters = list(sorted(set(clusters))) handles = scatter.legend_elements()[0] labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters] plt.legend(handles, labels_for_legend, title="Clusters", loc="best") plt.show()
Note: The boundary you see here is only for the 2D PCA logistic regression, not the main (embedding+similarity) logistic regression. We do this because otherwise we can’t easily show a 769-dimensional boundary in 2D.
Yes, you can combine embeddings from multiple transformer models (e.g., SciBERT, MPNet, DistilRoBERTa). The simplest approach is concatenation:
# Suppose you computed embeddings from 2 different models:
#   doc_embeddings_1 (shape: [n_docs, d1])
#   doc_embeddings_2 (shape: [n_docs, d2])
# for each doc in the same order

combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
# shape: [n_docs, d1 + d2]
Then you can apply the same pipeline (compute similarity, do logistic regression, etc.) on the combined vector. Or you can do more sophisticated merging, like averaging the two embeddings if they’re from the same dimensional space. But typically, concatenation is fine if you can handle the increased dimensionality.
Pros of concatenation: you keep all the information from both models.
Cons of concatenation: the dimensionality grows, which makes clustering and similarity computations slower and can add redundancy.
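If the two models happen to produce vectors of the same dimensionality and you prefer averaging over concatenation, here is a minimal sketch of that alternative (it assumes doc_embeddings_1 and doc_embeddings_2 are both [n_docs, d] and row-aligned; L2-normalizing first keeps one model from dominating the average):

import numpy as np

def l2_normalize(mat):
    # Normalize each row to unit length (avoid division by zero)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return mat / np.clip(norms, 1e-12, None)

# Assumption: both arrays have identical shape [n_docs, d]
averaged_embeddings = (l2_normalize(doc_embeddings_1) +
                       l2_normalize(doc_embeddings_2)) / 2.0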
You can use sklearn.decomposition.LatentDirichletAllocation or gensim to extract topics. Assume you have a CSV, papers_fulltext.csv, with columns filename and full_text.
###############################################################################
# TOPIC MODELING EXAMPLE
###############################################################################
!pip install gensim

import re
import pandas as pd
import nltk
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (tokenizer + stopword list)
nltk.download('punkt')
nltk.download('stopwords')

df_topics = pd.read_csv("papers_fulltext.csv")  # columns: ['filename', 'full_text']

def clean_for_lda(text):
    # Basic cleaning for LDA
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    return tokens

# Preprocess each doc
docs_tokens = [clean_for_lda(t) for t in df_topics['full_text']]

# Gensim requires a dictionary and a bag-of-words corpus
dictionary = Dictionary(docs_tokens)
dictionary.filter_extremes(no_below=3, no_above=0.6)  # filter out rare & frequent tokens
bow_corpus = [dictionary.doc2bow(tokens) for tokens in docs_tokens]

# Train LDA with, say, 5 or 10 topics
num_topics = 5
lda_model = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary,
                     passes=5, random_state=42)

# Inspect top words in each topic
for idx in range(num_topics):
    print(f"Topic #{idx}: {lda_model.print_topic(idx, 10)}")

# Now you can infer the topic distribution of each doc
doc_topics = []
for i, bow in enumerate(bow_corpus):
    topics_prob = lda_model.get_document_topics(bow, minimum_probability=0.0)
    # 'topics_prob' is a list of (topic_id, prob)
    doc_topics.append(topics_prob)

df_topics['topic_distribution'] = doc_topics

# DETECT "BULLSHIT" PAPERS?
# Example: if doc's top topic = 'cooking' but the paper claims to be about 'CV' or 'ML'
# you can do some heuristic check. E.g., if the paper states "NLP" but the top topic is "cooking"
for i, row in df_topics.iterrows():
    # row['topic_distribution'] => list of (topic_id, prob)
    top_topic_id, top_topic_prob = max(row['topic_distribution'], key=lambda x: x[1])
    # If top_topic_id is not in your expected domain topics, mark it suspicious
    # Or do more advanced matching
You could also do a similarity approach on the doc’s topic distribution vs. a known distribution for “real ML topics.” Or if your LDA is picking up nonsense topics (like “oven, pan, salt, recipe”), you can flag those docs as suspicious.
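As a sketch of that idea, assuming the LDA variables from the snippet above plus a reference set of papers you trust to be on-topic (here I simply reuse the labeled publishable files as that hypothetical reference set), you can build a reference topic distribution and measure how far each doc drifts from it:

import numpy as np

def topic_vector(topics_prob, num_topics):
    # Convert gensim's list of (topic_id, prob) into a dense vector
    vec = np.zeros(num_topics)
    for topic_id, prob in topics_prob:
        vec[topic_id] = prob
    return vec

# Assumption: the labeled publishable papers are a reasonable "real ML topics" reference
reference_filenames = set(labeled_files_class1)

topic_matrix = np.array([topic_vector(tp, num_topics)
                         for tp in df_topics['topic_distribution']])
ref_mask = df_topics['filename'].isin(reference_filenames).values
reference_distribution = topic_matrix[ref_mask].mean(axis=0)

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Low similarity to the reference distribution = possible off-topic/nonsense paper
df_topics['topic_sim_to_reference'] = [cos_sim(v, reference_distribution)
                                       for v in topic_matrix]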
Instead of a single "non-publishable" centroid, create two centroids: one from the embeddings of the known non-publishable papers and one from the embeddings of the known publishable papers.
For each unlabeled doc, compute similarity to both centroids. Predict whichever similarity is higher (or whichever distance is smaller).
###############################################################################
# DUAL-CENTROID APPROACH
###############################################################################
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 1) Collect embeddings for known classes
nonpub_embeddings = []
pub_embeddings = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_embeddings.append(doc_embeddings[i])
    elif fname in labeled_files_class1:
        pub_embeddings.append(doc_embeddings[i])

# 2) Compute two centroids
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 3) For each doc, compute similarity to each centroid
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)
    # Option 1: Classify whichever is higher
    if sim_nonpub > sim_pub:
        doc['dual_centroid_pred'] = 0  # non-publishable
    else:
        doc['dual_centroid_pred'] = 1  # publishable

# 4) Visualization in 2D PCA
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10, 7))

# color docs by 'dual_centroid_pred'
dual_labels = [doc['dual_centroid_pred'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=dual_labels, cmap='bwr', alpha=0.6)
plt.title("Dual Centroid Classification (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# Mark the known labeled docs
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0 or fname in labeled_files_class1:
        # black edge circle
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if fname in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

plt.show()
Notes:
- Logistic Regression: use regularization (the C parameter) and cross-validation to mitigate overfitting on just 15 labeled docs.
- Combining Multiple Embeddings: concatenation is the simplest way to merge models; just watch the increased dimensionality.
- Topic Modeling: LDA topics are an extra "red flag" signal for off-topic papers, not ground truth.
- Dual-Centroid: compare each doc's similarity to a non-publishable centroid and a publishable centroid, and pick the closer one.
- Interpreting the Plots: the 2D PCA boundary is only an illustration of the higher-dimensional classifier.
With these pieces in place, you can experiment with each of these methods in separate notebooks.
You’ll then see which method yields the fewest false positives while also reducing false negatives, hopefully improving on your baseline. Good luck!
Crafting interpretation guidelines
I'm piecing together specific steps for interpreting LDA topics, pinpointing irrelevant claims, merging embeddings from various models, and then providing the unified code to combine these embeddings and visualize the results.
Crafting a pipeline
I’m working on a method to interpret LDA topics, identify irrelevant ones using a rule-based strategy, merge embeddings from various models, and visualize the results with a final consistent code snippet.
Thought for 9 seconds
Below is a structured explanation and code covering how to interpret your LDA topics, how to flag irrelevant ones, and how to combine embeddings from multiple transformer models in one pipeline.
From your example:
Topic #0: 0.006*"image" + 0.005*"images" + 0.004*"features" + 0.004*"car" ...
Topic #1: 0.008*"entropy" + 0.005*"food" + 0.005*"patterns" + 0.004*"system" ...
Topic #2: 0.008*"led" + 0.008*"patterns" + 0.007*"human" + 0.005*"properties" ...
Topic #3: 0.005*"datasets" + 0.004*"tasks" + 0.004*"input" + 0.004*"features" ...
Topic #4: 0.008*"engine" + 0.007*"system" + 0.006*"llm" + 0.005*"patterns" ...
Each topic is basically a distribution over words (with probabilities). The top words give you a clue about what that topic is about. For instance, Topic #0 above looks like an image/vision topic, while the "food" term in Topic #1 already hints at something off-domain.
If you see a topic with words like “recipe,” “oven,” “salt,” or “cooking,” that’s likely nonsense in an ML context. In your example, some references to “food” or “caffeine” might be suspicious if your domain is purely machine learning.
After training LDA, you can inspect the top words for each topic, decide which topic IDs look irrelevant to your domain, and flag documents whose dominant topic is one of those.
Caution: LDA is just unsupervised clustering of words. Sometimes an “irrelevant” topic might appear if your corpus has unusual domain text, or it might partially overlap with real ML concepts in unexpected ways. So treat this as an additional signal (like a “red flag” score), not an absolute truth.
You can certainly use two (or more) transformer models to embed each paper. Suppose doc_embeddings_1 has shape [n_docs, d_1] and doc_embeddings_2 has shape [n_docs, d_2]. Concatenate them:
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
# Now shape: [n_docs, d_1 + d_2]
You can then do exactly what you've been doing (K-Means, similarity measure, etc.) on combined_embeddings. The biggest difference: wherever you used doc_embeddings, you swap in combined_embeddings.
Potential Pitfalls: the concatenated vectors are higher-dimensional; if that becomes unwieldy, you can apply a dimensionality reduction such as PCA to combined_embeddings to bring it down to 512 or 256 dimensions before continuing.
Below is a simplified pipeline illustrating how to concatenate the two sets of embeddings (doc_embeddings_1, doc_embeddings_2), run K-Means and the non-publishable-centroid similarity check on combined_embeddings, and visualize the result. Feel free to adjust the details to match your environment.
###############################################################################
# Step 0: Setup & Basic Definitions
###############################################################################
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from collections import Counter

# Labeled sets (example)
labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}  # non-pub
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}  # pub

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

###############################################################################
# Step 1: Suppose you have two sets of embeddings for each doc
#   doc_embeddings_1, doc_embeddings_2
#   shape = [n_docs, d1], [n_docs, d2]
###############################################################################
# For demonstration, let's pretend doc_embeddings_1 and doc_embeddings_2 exist.
# You should replace them with your actual arrays from your environment.
# e.g., doc_embeddings_1 = ...
# e.g., doc_embeddings_2 = ...

# We'll assume you also have the 'documents' list of dicts as usual.
# => documents[i]['filename'], etc.

###############################################################################
# Step 2: Concatenate them
###############################################################################
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
print("Shape of combined embeddings:", combined_embeddings.shape)

###############################################################################
# Step 3: K-Means on combined embeddings
###############################################################################
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(combined_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = int(labels[i])

cluster_counts = Counter(labels)
print("Cluster distribution:", cluster_counts)

###############################################################################
# Step 4: Compute Non-Publishable Centroid & Similarity (optional)
###############################################################################
# We'll collect embeddings for the known non-publishable docs
known_nonpub_inds = []
for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0:
        known_nonpub_inds.append(i)

nonpub_embeddings = combined_embeddings[known_nonpub_inds]
nonpub_centroid = np.mean(nonpub_embeddings, axis=0)

# For each doc, compute similarity
for i, doc in enumerate(documents):
    emb = combined_embeddings[i]
    sim_val = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = float(sim_val)

# Example threshold
similarity_threshold = 0.5
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold)

###############################################################################
# Step 5: Visualization (PCA 2D of the combined embeddings)
###############################################################################
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(combined_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters on Combined Embeddings (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['filename']
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
    # Highlight maybe_nonpub docs
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange',
                    marker='s', s=120, linewidths=1.5)

unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")
plt.show()
To recap what this pipeline does:
- Concatenates doc_embeddings_1 and doc_embeddings_2 to get combined_embeddings.
- Runs K-Means on combined_embeddings and stores cluster labels in doc['cluster_label'].
- Computes similarity_to_nonpub for each doc against the non-publishable centroid.
- Sets doc['maybe_nonpub'] = True if similarity > 0.5 (or whichever threshold you prefer).

If you suspect a doc is about "food/cooking/recipes" while it claims to be "machine learning," you can do something like:
# For each doc in df_topics (the LDA pipeline):
#   row['topic_distribution'] => list of (topic_id, prob)
irrelevant_topic_ids = [1]   # Suppose we decided topic #1 is "food" or nonsense
threshold = 0.6              # if doc's top topic is topic #1 with prob >= 0.6 => suspicious

for i, row in df_topics.iterrows():
    topics_prob = row['topic_distribution']
    top_topic_id, top_topic_prob = max(topics_prob, key=lambda x: x[1])
    # Write back with .at[...]: assigning to 'row' would only modify a copy
    if (top_topic_id in irrelevant_topic_ids) and (top_topic_prob >= threshold):
        df_topics.at[i, 'suspect_topic'] = True
    else:
        df_topics.at[i, 'suspect_topic'] = False
You can incorporate this as an additional dimension in your main pipeline, e.g., a boolean feature has_suspect_topic that you feed into your logistic regression or use as a simple rule.
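Here is a minimal sketch of that idea, assuming the logistic regression features from earlier and the suspect_topic column produced above (the suspect_lookup dict is just an illustrative helper, not something defined elsewhere in this pipeline):

import numpy as np

# Hypothetical lookup: filename -> True/False from the LDA 'suspect_topic' step
suspect_lookup = dict(zip(df_topics['filename'], df_topics['suspect_topic']))

X_aug_list = []
for i, doc in enumerate(documents):
    has_suspect_topic = 1.0 if suspect_lookup.get(doc['filename'], False) else 0.0
    # embedding + similarity + suspect-topic flag
    x_vec = np.concatenate([doc_embeddings[i],
                            [doc['similarity_to_nonpub']],
                            [has_suspect_topic]])
    X_aug_list.append(x_vec)

X_aug = np.array(X_aug_list)
# Train the same LogisticRegression on X_aug for the 15 labeled docs as before.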
This should give you a complete sense of how to interpret LDA outputs for “nonsense” detection and how to combine multiple embeddings in your existing pipeline, with an end-to-end example showing K-Means clustering, similarity thresholding, and a PCA visualization. Good luck!
Building upon dual centroid methods
I’m looking to establish four distinct code blocks identifying invariant and iterative dual centroid classifier methods, using similarity and distance measures on 'nonpub_embeddings' and 'pub_embeddings'. Progress would involve iterative inclusion and centroid recomputations.
Defining the framework
I’m detailing four code blocks to produce 'original_similarity', 'original_distance', 'iterative_similarity', 'iterative_distance' columns. I’ll define an iterative approach with thresholds and maximum iterations, ensuring comprehensive labels for each document.
Deciding on column names
OK, let's finalize 'original_similarity', 'original_distance', 'iterative_similarity', and 'iterative_distance' to classify documents into 'publishable' and 'nonpublishable'. The strategy involves iterative inclusion and centroid recomputations.
Thought for 9 seconds
Below is a comprehensive guide that addresses your requests:
At the end, each code snippet stores its final 0/1 labels into specific columns in documents, namely:
doc['original_similarity']
doc['original_distance']
doc['iterative_similarity']
doc['iterative_distance']
This way, you end up with a single DataFrame that has the results of all four approaches.
Currently, your dual-centroid approach uses 5 known non-publishable and 10 known publishable examples to compute one non-publishable centroid and one publishable centroid. This works well if those examples are representative and you don’t suspect labeling errors.
The risk with an iterative (self-training) variant is that one wrongly added document shifts a centroid, which then pulls in more wrong documents. Hence, any iterative approach should have a high confidence threshold (high similarity or low distance) to minimize false positives creeping in.
If you want a distance-based approach, you can compute the Euclidean distance dist(x, c) = ||x - c|| from each document embedding x to each centroid c, and pick the centroid that yields the lowest distance.
Once you have your final centroids (either after one-shot or iterative updates):
- Similarity Approach: classify each document by whichever centroid gives the higher cosine similarity (or the lower distance).
- k-Nearest Neighbors: instead of collapsing each class into a centroid, classify each document by the majority label among its k closest labeled documents.
- When to choose KNN: it can help when a class is spread over several distinct sub-groups that a single centroid would blur together, but with only 15 labeled docs it is also easier to overfit; a sketch follows below.
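As a minimal sketch of the kNN alternative, using scikit-learn's KNeighborsClassifier on the same doc_embeddings and labeled sets (k=3 and the cosine metric are just illustrative choices to tune):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Build the small labeled training set
X_knn, y_knn = [], []
for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0:
        X_knn.append(doc_embeddings[i])
        y_knn.append(0)
    elif doc['filename'] in labeled_files_class1:
        X_knn.append(doc_embeddings[i])
        y_knn.append(1)

knn = KNeighborsClassifier(n_neighbors=3, metric='cosine')
knn.fit(np.array(X_knn), np.array(y_knn))

# Predict for all docs and store alongside the centroid-based labels
knn_preds = knn.predict(doc_embeddings)
for i, doc in enumerate(documents):
    doc['knn_pred'] = int(knn_preds[i])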
We'll assume you already have:
- documents, where each doc has at least doc['filename'].
- doc_embeddings, a numpy array of shape [n_docs, d].
- The labeled sets:

labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
The four snippets below each produce one label column:
- original_similarity
- original_distance
- iterative_similarity
- iterative_distance

We will define each snippet in turn. At the end, you can run them all in sequence (or in separate cells) and end up with all 4 label columns in documents.
import numpy as np
import math

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_dist(a, b):
    return np.sqrt(np.sum((a - b)**2))
One-shot approach: pick whichever centroid yields a higher similarity. Store the predicted label in doc['original_similarity'].
############################
# A) ORIGINAL SIMILARITY
############################

# 1) Gather embeddings for known labeled docs
nonpub_indices = []
pub_indices = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_indices.append(i)
    elif fname in labeled_files_class1:
        pub_indices.append(i)

nonpub_embeddings = doc_embeddings[nonpub_indices]
pub_embeddings = doc_embeddings[pub_indices]

# 2) Compute centroids
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

# 3) Classify each doc by comparing cos_sim to each centroid
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)
    # 0 => non-publishable, 1 => publishable
    if sim_nonpub > sim_pub:
        doc['original_similarity'] = 0
    else:
        doc['original_similarity'] = 1
Same idea, but we pick whichever centroid yields a lower distance. Store the predicted label in doc['original_distance'].
############################
# B) ORIGINAL DISTANCE
############################

# We assume the same centroid_nonpub, centroid_pub from above
# If you are running this code separately, recompute them:
nonpub_indices = []
pub_indices = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_indices.append(i)
    elif fname in labeled_files_class1:
        pub_indices.append(i)

nonpub_embeddings = doc_embeddings[nonpub_indices]
pub_embeddings = doc_embeddings[pub_indices]

centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

# Classify using distance
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    dist_nonpub = euclidean_dist(emb, centroid_nonpub)
    dist_pub = euclidean_dist(emb, centroid_pub)
    # 0 => non-publishable, 1 => publishable
    if dist_nonpub < dist_pub:
        doc['original_distance'] = 0
    else:
        doc['original_distance'] = 1
We iteratively include new docs whose similarity to a particular centroid surpasses a threshold. For example:
- If sim_to_nonpub > threshold_nonpub and sim_to_nonpub > sim_to_pub (i.e., it is really closer to non-pub), we add that doc to the nonpub set.
- If sim_to_pub > threshold_pub and sim_to_pub > sim_to_nonpub, we add that doc to the pub set.
- We then recompute the centroids and repeat, for at most max_iterations.

Parameters you might want to tune:
- threshold_nonpub, threshold_pub: e.g. 0.60 or 0.65.
- max_iterations: e.g. 5.

The code below lumps them into one threshold if you prefer. We store the final label in doc['iterative_similarity'].
############################
# C) ITERATIVE SIMILARITY
############################

def iterative_dual_centroid_similarity(doc_embeddings, documents,
                                       labeled_files_class0, labeled_files_class1,
                                       threshold=0.65, max_iterations=5):
    """
    Iteratively expand the sets of non-publishable and publishable docs
    based on which centroid they're more similar to, provided that
    similarity > threshold. Recompute centroids each iteration.
    """
    # 1) Initialize sets from your labeled data
    nonpub_set = set(labeled_files_class0)
    pub_set = set(labeled_files_class1)

    iteration = 0
    changed = True

    while iteration < max_iterations and changed:
        iteration += 1
        changed = False

        # Compute centroids
        nonpub_embs = []
        pub_embs = []
        for i, doc in enumerate(documents):
            if doc['filename'] in nonpub_set:
                nonpub_embs.append(doc_embeddings[i])
            elif doc['filename'] in pub_set:
                pub_embs.append(doc_embeddings[i])

        # If one class is empty, skip to avoid error
        if len(nonpub_embs) == 0 or len(pub_embs) == 0:
            print("One set is empty! Stopping iterative approach.")
            break

        centroid_nonpub = np.mean(nonpub_embs, axis=0)
        centroid_pub = np.mean(pub_embs, axis=0)

        # 2) For each unlabeled doc, see if we can add it
        for i, doc in enumerate(documents):
            fname = doc['filename']
            # Skip if it's already in either set
            if fname in nonpub_set or fname in pub_set:
                continue

            emb = doc_embeddings[i]
            sim_nonpub = cos_sim(emb, centroid_nonpub)
            sim_pub = cos_sim(emb, centroid_pub)

            # If it's clearly more similar to nonpub (and above threshold)
            if sim_nonpub > sim_pub and sim_nonpub >= threshold:
                nonpub_set.add(fname)
                changed = True
            # If it's clearly more similar to pub (and above threshold)
            elif sim_pub > sim_nonpub and sim_pub >= threshold:
                pub_set.add(fname)
                changed = True

        print(f"Iteration {iteration} done. Non-pub size: {len(nonpub_set)}, Pub size: {len(pub_set)}")
        if not changed:
            print("No changes in this iteration, stopping early.")

    # Final classification
    # If a doc is in nonpub_set => 0, if in pub_set => 1, else break tie
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in nonpub_set and fname not in pub_set:
            doc['iterative_similarity'] = 0
        elif fname in pub_set and fname not in nonpub_set:
            doc['iterative_similarity'] = 1
        else:
            # doc not classified or possibly in both sets somehow
            # you can break ties by whichever centroid is more similar
            emb = doc_embeddings[i]
            sim_nonpub = cos_sim(emb, centroid_nonpub)
            sim_pub = cos_sim(emb, centroid_pub)
            doc['iterative_similarity'] = 0 if sim_nonpub >= sim_pub else 1

# Call the function:
iterative_dual_centroid_similarity(doc_embeddings, documents,
                                   labeled_files_class0, labeled_files_class1,
                                   threshold=0.65, max_iterations=5)
Feel free to tune the threshold. If you set threshold=0.80, you'll be very conservative about adding new docs. If you set threshold=0.50, you'll likely add more docs (risking false positives).
Same iterative logic, but now we measure Euclidean distance to each centroid. If a doc is closer to the non-pub centroid by some margin, we add it to that set, provided it’s under some threshold. The threshold means “distance must be less than X” for us to add it confidently.
############################
# D) ITERATIVE DISTANCE
############################

def iterative_dual_centroid_distance(doc_embeddings, documents,
                                     labeled_files_class0, labeled_files_class1,
                                     distance_threshold=1.0, max_iterations=5):
    """
    Iteratively expand the sets of non-publishable and publishable docs
    based on which centroid they are closer to, provided the distance
    < distance_threshold. Recompute centroids each iteration.
    """
    nonpub_set = set(labeled_files_class0)
    pub_set = set(labeled_files_class1)

    iteration = 0
    changed = True

    while iteration < max_iterations and changed:
        iteration += 1
        changed = False

        # Compute centroids
        nonpub_embs = []
        pub_embs = []
        for i, doc in enumerate(documents):
            if doc['filename'] in nonpub_set:
                nonpub_embs.append(doc_embeddings[i])
            elif doc['filename'] in pub_set:
                pub_embs.append(doc_embeddings[i])

        if len(nonpub_embs) == 0 or len(pub_embs) == 0:
            print("One set is empty! Stopping iterative approach.")
            break

        centroid_nonpub = np.mean(nonpub_embs, axis=0)
        centroid_pub = np.mean(pub_embs, axis=0)

        for i, doc in enumerate(documents):
            fname = doc['filename']
            if fname in nonpub_set or fname in pub_set:
                continue

            emb = doc_embeddings[i]
            dist_nonpub = euclidean_dist(emb, centroid_nonpub)
            dist_pub = euclidean_dist(emb, centroid_pub)

            # If it's strictly closer to nonpub and below threshold
            if dist_nonpub < dist_pub and dist_nonpub <= distance_threshold:
                nonpub_set.add(fname)
                changed = True
            # If it's strictly closer to pub and below threshold
            elif dist_pub < dist_nonpub and dist_pub <= distance_threshold:
                pub_set.add(fname)
                changed = True

        print(f"Iteration {iteration} done. Non-pub size: {len(nonpub_set)}, Pub size: {len(pub_set)}")
        if not changed:
            print("No changes in this iteration, stopping early.")

    # Final classification
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in nonpub_set and fname not in pub_set:
            doc['iterative_distance'] = 0
        elif fname in pub_set and fname not in nonpub_set:
            doc['iterative_distance'] = 1
        else:
            # break tie by whichever centroid is closer
            emb = doc_embeddings[i]
            dist_nonpub = euclidean_dist(emb, centroid_nonpub)
            dist_pub = euclidean_dist(emb, centroid_pub)
            doc['iterative_distance'] = 0 if dist_nonpub <= dist_pub else 1

# Call the function:
iterative_dual_centroid_distance(doc_embeddings, documents,
                                 labeled_files_class0, labeled_files_class1,
                                 distance_threshold=1.0, max_iterations=5)
Note: You need to choose a “distance threshold” that makes sense in your embedding space. You might try 1.0, 1.5, 2.0, etc., then see if new docs get added. If your embeddings are large in magnitude, you might need a bigger threshold. If your embeddings are normalized (like in some BERT variants), typical Euclidean distances might be < 2.0.
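One way to pick a sensible distance_threshold, rather than guessing, is to look at how far your 15 labeled docs actually sit from their own class centroid. This is only a small sketch under that assumption (it reuses centroid_nonpub, centroid_pub, and euclidean_dist from the snippets above):

import numpy as np

# Distances of each labeled doc to its own class centroid
own_class_distances = []
for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0:
        own_class_distances.append(euclidean_dist(doc_embeddings[i], centroid_nonpub))
    elif doc['filename'] in labeled_files_class1:
        own_class_distances.append(euclidean_dist(doc_embeddings[i], centroid_pub))

own_class_distances = np.array(own_class_distances)
print("median:", np.median(own_class_distances),
      "90th percentile:", np.percentile(own_class_distances, 90))

# e.g., start with the 90th percentile as a conservative distance_threshold
candidate_threshold = float(np.percentile(own_class_distances, 90))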
After running all four methods (A, B, C, D) in sequence, each doc in documents will have:
doc['original_similarity']
doc['original_distance']
doc['iterative_similarity']
doc['iterative_distance']
You can then do:
import pandas as pd

final_df = pd.DataFrame(documents)
print(final_df[['filename', 'original_similarity', 'original_distance',
                'iterative_similarity', 'iterative_distance']])
to see the final classification side by side.
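If you want a single consensus label from the four columns, one simple option (illustrative, not required) is a majority vote plus a flag for docs where the methods disagree:

label_cols = ['original_similarity', 'original_distance',
              'iterative_similarity', 'iterative_distance']

# Majority vote: publishable (1) if at least 3 of the 4 methods say so;
# on a 2-2 tie we fall back to the more cautious label (0 = non-publishable)
final_df['votes_publishable'] = final_df[label_cols].sum(axis=1)
final_df['consensus'] = (final_df['votes_publishable'] >= 3).astype(int)
final_df['methods_disagree'] = final_df[label_cols].nunique(axis=1) > 1

print(final_df[['filename', 'consensus', 'methods_disagree']].head(20))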
A few final tips:
- You can require a margin before trusting a prediction, e.g. (sim_nonpub - sim_pub) > 0.1 or (dist_pub - dist_nonpub) > 0.5. This helps filter out borderline cases (a short sketch follows after this list).
- Check the four new columns in documents and see which approach yields the best results on your 15 labeled examples (lowest false positives/negatives).
- Tune the thresholds and max_iterations rather than treating the defaults as final.
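Here is the margin idea as a short sketch (it assumes centroid_nonpub / centroid_pub from snippet A; the 0.1 margin is only an example value to tune):

margin = 0.1

for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)
    gap = abs(sim_nonpub - sim_pub)
    # Only commit to a hard label when the two similarities are clearly separated
    if gap < margin:
        doc['margin_label'] = 'uncertain'
    else:
        doc['margin_label'] = 0 if sim_nonpub > sim_pub else 1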
This should give you ample flexibility to compare all four dual-centroid methods (similarity vs. distance, one-shot vs. iterative) and store results in your DataFrame. Good luck!
Applying clustering
I’m piecing together a method to cluster data using KMeans, then project it to 2D with PCA, t-SNE, and Isomap. Notably, no supervised classification is employed, and 15 labeled points are highlighted.
Outlining the approach
I’m detailing the plan to visualize 2D data plots using PCA, t-SNE, and Isomap. Each subplot will showcase clusters and highlight 15 labeled data points for clarity.
Thought for 6 seconds
Below is a detailed explanation of your question about whether centroid-based classification might be "overfitting," followed by a K-Means approach with 2 clusters that ignores the labels entirely when clustering. Then we visualize the K-Means results with three different 2D projections (PCA, t-SNE, and ISOMAP) in a single code snippet, highlighting your 15 labeled examples.
Is it Overfitting?
With only 5 non-publishable and 10 publishable examples, the centroids inevitably reflect the quirks of those specific papers, so strong agreement with the labeled set by itself proves little. If your manual checks on unlabeled data suggest it's doing "decently," that's a good sign. But keep in mind, your final performance on the private data is what the judges will rely on to see if it truly generalizes or not.
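One cheap sanity check, sketched below, is leave-one-out validation of the centroid rule on the 15 labeled docs: hold one labeled doc out, rebuild the centroids from the other 14, and see whether the held-out doc is still classified correctly. It only uses the doc_embeddings, labeled sets, and cos_sim already defined in this conversation:

import numpy as np

labeled = [(i, 0) for i, d in enumerate(documents) if d['filename'] in labeled_files_class0]
labeled += [(i, 1) for i, d in enumerate(documents) if d['filename'] in labeled_files_class1]

correct = 0
for held_idx, held_label in labeled:
    nonpub, pub = [], []
    for idx, lab in labeled:
        if idx == held_idx:
            continue
        (nonpub if lab == 0 else pub).append(doc_embeddings[idx])
    centroid_np = np.mean(nonpub, axis=0)
    centroid_p = np.mean(pub, axis=0)
    emb = doc_embeddings[held_idx]
    pred = 0 if cos_sim(emb, centroid_np) > cos_sim(emb, centroid_p) else 1
    correct += int(pred == held_label)

print(f"Leave-one-out accuracy on the 15 labeled docs: {correct}/{len(labeled)}")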
You asked for a purely unsupervised K-Means approach that does not use the labeled data to define the cluster centers. Instead, we let K-Means discover 2 clusters from the entire dataset (150 documents). Then we can see if cluster 0 aligns well with non-publishable or cluster 1 aligns well with publishable, etc.
Below is a single code snippet that does all three visualizations in subplots. Adjust as you see fit.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

# Suppose we have:
#   doc_embeddings: shape [n_docs, d]
#   documents: list of dicts, each with doc['filename']
#   labeled_files_class0 = { "R001.pdf", ..., "R005.pdf" }
#   labeled_files_class1 = { "R006.pdf", ..., "R015.pdf" }

# 1) Perform K-Means with 2 clusters, ignoring labels
kmeans_2 = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans_2.fit_predict(doc_embeddings)

# Store cluster labels in documents
for i, doc in enumerate(documents):
    doc['kmeans_label'] = int(kmeans_labels[i])

print("Finished K-Means with 2 clusters. Label distribution:")
unique, counts = np.unique(kmeans_labels, return_counts=True)
for lab, count in zip(unique, counts):
    print(f"Cluster {lab}: {count} documents")

# 2) We'll define a helper function to plot 2D projections
def plot_2d_projection(proj_matrix, title):
    """
    proj_matrix: shape [n_docs, 2]
    title: string for the plot
    """
    x_coords = proj_matrix[:, 0]
    y_coords = proj_matrix[:, 1]

    # color by kmeans_label
    clusters = [doc['kmeans_label'] for doc in documents]
    scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
    plt.title(title)
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")

    # highlight the 15 labeled docs
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in labeled_files_class0 or fname in labeled_files_class1:
            # black edge circle
            plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                        marker='o', s=100, linewidths=1.5)
            if fname in labeled_files_class0:
                label_text = "C0"
            else:
                label_text = "C1"
            plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

    # Optionally add a legend for cluster coloring
    unique_clusters = list(set(clusters))
    handles = scatter.legend_elements()[0]
    legend_labels = [f"Cluster {uc}" for uc in unique_clusters]
    plt.legend(handles, legend_labels, title="K-Means Clusters", loc="best")

# 3) Prepare 2D projections: PCA, t-SNE, and ISOMAP
pca = PCA(n_components=2, random_state=42)
pca_2d = pca.fit_transform(doc_embeddings)

tsne = TSNE(n_components=2, random_state=42)
tsne_2d = tsne.fit_transform(doc_embeddings)

iso = Isomap(n_components=2)
iso_2d = iso.fit_transform(doc_embeddings)

# 4) Plot them in subplots
fig = plt.figure(figsize=(18, 5))

plt.subplot(1, 3, 1)
plot_2d_projection(pca_2d, "PCA (2D) of doc_embeddings")

plt.subplot(1, 3, 2)
plot_2d_projection(tsne_2d, "t-SNE (2D) of doc_embeddings")

plt.subplot(1, 3, 3)
plot_2d_projection(iso_2d, "ISOMAP (2D) of doc_embeddings")

plt.tight_layout()
plt.show()
Notes: K-Means cluster IDs are arbitrary, so check which cluster actually lines up with your labeled non-publishable vs. publishable docs before reading anything into "cluster 0" or "cluster 1." The t-SNE and Isomap projections are used purely for visualization, not for the clustering itself.
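A quick way to do that check, sketched here on top of the snippet above, is to see how the 15 labeled docs fall into the two clusters and map each cluster to the majority class among them:

from collections import Counter

# Count labeled docs per (cluster, true class)
per_cluster = {0: Counter(), 1: Counter()}
for doc in documents:
    if doc['filename'] in labeled_files_class0:
        per_cluster[doc['kmeans_label']]['nonpub'] += 1
    elif doc['filename'] in labeled_files_class1:
        per_cluster[doc['kmeans_label']]['pub'] += 1

print("Labeled docs per cluster:", per_cluster)

# Map each cluster to the majority class of its labeled members (if any)
cluster_to_class = {}
for cluster_id, counts in per_cluster.items():
    if counts:
        cluster_to_class[cluster_id] = 0 if counts['nonpub'] >= counts['pub'] else 1
print("Cluster -> class mapping:", cluster_to_class)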
With this code, you can show a simpler unsupervised approach to the judges, demonstrating you didn’t rely solely on your labeled docs to define the entire classification logic. Good luck!
Organizing embeddings
I’m mapping out a way to store embeddings for each document section and an overall document embedding. This involves modularizing the visualization for easier access to individual section KMeans results.
Formulating the plan
I’m outlining a modular function to build embeddings for each document section and an overall document embedding. Then, I’ll apply KMeans clustering and visualize 2D embeddings using PCA, t-SNE, or Isomap.
Pulling together final code
I'm composing a comprehensive code snippet. It builds section embeddings and doc-level embeddings, applies KMeans clustering, and visualizes using a modularized approach, maintaining the user's preferred coding style.
Thought for 10 seconds
Below is a complete solution that builds per-section embeddings plus a document-level embedding, runs K-Means on each, and visualizes the clusters in 2D (PCA, t-SNE, ISOMAP).
We'll assume you have:
- documents, where doc['sections'] is a list of (section_name, text) tuples.
- An embed_model (like SciBERT) that provides an .encode(...) method returning a 768-dim vector (or a different dimension; just adjust accordingly).
- The labeled sets:

labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
SECTIONS = [
    "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
    "Related work", "Experiments", "Conclusion", "Discussion",
    "Background", "References"
]
Feel free to adjust naming or add your own custom logic. This code is meant to be modular so you can run a loop over all sections and produce separate K-Means + 2D visualizations.
import numpy as np

def build_section_and_doc_embeddings(documents, embed_model):
    """
    For each document in 'documents', build:
      doc['section_embeddings'] = {
          "Abstract": <768-dim embedding or zeros>,
          "Introduction": <768-dim embedding or zeros>,
          ...
      }
      doc['doc_embedding'] = <768-dim average across all available sections>

    'documents' is a list of dicts, each with:
      doc['sections'] = list of (sec_name, sec_text)

    embed_model.encode() returns a numpy array embedding for given text input.
    """
    SECTIONS = [
        "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
        "Related work", "Experiments", "Conclusion", "Discussion",
        "Background", "References"
    ]

    for doc in documents:
        # Initialize a dict that maps section_name -> embedding
        doc['section_embeddings'] = {}
        # We'll store per-section embeddings in a dictionary
        # e.g. doc['section_embeddings']['Abstract'] = np.array([...])

        # We might have multiple sections in doc['sections'].
        # Gather them in a dict for easy access,
        # e.g. actual_sections_found = { "Abstract": text_for_abstract, ... }
        actual_sections_found = {}
        for (sec_name, sec_text) in doc['sections']:
            actual_sections_found[sec_name] = sec_text

        # For each possible section in SECTIONS, build an embedding
        for section_name in SECTIONS:
            if section_name in actual_sections_found:
                raw_text = actual_sections_found[section_name].strip()
                if len(raw_text) == 0:
                    # If empty text, store zeros
                    doc['section_embeddings'][section_name] = np.zeros(
                        embed_model.get_sentence_embedding_dimension())
                else:
                    # Encode this section's text
                    embedding = embed_model.encode(raw_text, convert_to_numpy=True)
                    # embed_model.encode(...) returns a 768-dim array if you're using SciBERT or similar
                    # If you have large text, you might want to do chunking or similar,
                    # but here's the simple approach
                    doc['section_embeddings'][section_name] = embedding
            else:
                # If doc doesn't have this section, store zeros
                doc['section_embeddings'][section_name] = np.zeros(
                    embed_model.get_sentence_embedding_dimension())

        # Now, compute doc['doc_embedding'] by averaging all non-zero embeddings
        nonzero_vectors = []
        for sec_name, emb in doc['section_embeddings'].items():
            # If emb is all zeros, skip
            if np.any(emb):  # i.e. not all zeros
                nonzero_vectors.append(emb)

        if len(nonzero_vectors) == 0:
            # If no valid sections, store a zeros vector
            doc['doc_embedding'] = np.zeros(embed_model.get_sentence_embedding_dimension())
        else:
            doc['doc_embedding'] = np.mean(nonzero_vectors, axis=0)
Notes: we call embed_model.encode(raw_text, convert_to_numpy=True) to get a single embedding for that entire section.

We'll define a function that:
- gathers the embedding for a given section from every doc,
- runs K-Means with a chosen number of clusters k (for instance, k=5), and
- stores the cluster label in doc['kmeans_{section_name}'].

A second function does the same thing on the document-level doc_embedding.

from sklearn.cluster import KMeans

def run_kmeans_for_section(documents, section_name, n_clusters=5, random_state=42):
    """
    Gathers the embedding for 'section_name' from each doc and runs K-Means.
    Then stores the cluster label in doc['kmeans_{section_name}'].
    """
    # 1) Build an array of shape [n_docs, embedding_dim] for this section
    all_embeddings = []
    for doc in documents:
        # E.g. doc['section_embeddings']['Abstract']
        emb = doc['section_embeddings'][section_name]
        all_embeddings.append(emb)
    all_embeddings = np.array(all_embeddings)  # shape: [n_docs, d]

    # 2) K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
    labels = kmeans.fit_predict(all_embeddings)

    # 3) Store the label in the doc
    for i, doc in enumerate(documents):
        doc[f'kmeans_{section_name}'] = int(labels[i])


def run_kmeans_for_doc_embedding(documents, n_clusters=5, random_state=42):
    """
    Gathers doc['doc_embedding'] for each doc, runs K-Means,
    and stores cluster label in doc['kmeans_doc'].
    """
    # 1) Build array of shape [n_docs, d] from doc['doc_embedding']
    all_doc_emb = [doc['doc_embedding'] for doc in documents]
    all_doc_emb = np.array(all_doc_emb)

    # 2) K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
    labels = kmeans.fit_predict(all_doc_emb)

    # 3) Store label
    for i, doc in enumerate(documents):
        doc['kmeans_doc'] = int(labels[i])
We want to pass in a numpy array of shape [n_docs, d] and a list of cluster labels. Then we compute three 2D projections (PCA, t-SNE, ISOMAP) and create subplots for each. We also highlight the labeled docs. This is very similar to your previous code, but more modular so we can re-use it for each section or for doc-level embeddings.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

def visualize_2d_projections(embeddings, cluster_labels, documents,
                             labeled_files_class0, labeled_files_class1,
                             title_prefix=""):
    """
    embeddings: shape [n_docs, d]
    cluster_labels: shape [n_docs], e.g. from KMeans
    documents: the list of doc dicts
    labeled_files_class0, labeled_files_class1: sets for highlight
    title_prefix: string to prefix in the title, e.g. "Abstract" or "doc_embedding"
    """
    # 1) Prepare the 3 transformations
    pca = PCA(n_components=2, random_state=42)
    pca_2d = pca.fit_transform(embeddings)

    tsne = TSNE(n_components=2, random_state=42)
    tsne_2d = tsne.fit_transform(embeddings)

    iso = Isomap(n_components=2)
    iso_2d = iso.fit_transform(embeddings)

    fig = plt.figure(figsize=(18, 5))

    # We'll define a helper closure to plot each sub-figure
    def plot_projection(ax, proj_matrix, method_name):
        x_coords = proj_matrix[:, 0]
        y_coords = proj_matrix[:, 1]

        scatter = ax.scatter(x_coords, y_coords, c=cluster_labels, cmap='tab10', alpha=0.6)
        ax.set_title(f"{title_prefix} - {method_name}")
        ax.set_xlabel("Component 1")
        ax.set_ylabel("Component 2")

        for i, doc in enumerate(documents):
            fname = doc['filename']
            if fname in labeled_files_class0 or fname in labeled_files_class1:
                ax.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                           marker='o', s=100, linewidths=1.5)
                label_text = "C0" if fname in labeled_files_class0 else "C1"
                ax.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

        unique_clusters = list(sorted(set(cluster_labels)))
        handles = scatter.legend_elements()[0]
        legend_labels = [f"Cluster {uc}" for uc in unique_clusters]
        ax.legend(handles, legend_labels, title="K-Means Clusters", loc="best")

    ax1 = fig.add_subplot(1, 3, 1)
    plot_projection(ax1, pca_2d, "PCA")

    ax2 = fig.add_subplot(1, 3, 2)
    plot_projection(ax2, tsne_2d, "t-SNE")

    ax3 = fig.add_subplot(1, 3, 3)
    plot_projection(ax3, iso_2d, "ISOMAP")

    plt.tight_layout()
    plt.show()
Below is an example workflow that you can adapt to your code:
###############################################################################
# EXAMPLE WORKFLOW
###############################################################################

# 1) Build section embeddings and doc-level embeddings
build_section_and_doc_embeddings(documents, embed_model)

# 2) For each section in our known set, run K-Means + visualize
SECTIONS = [
    "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
    "Related work", "Experiments", "Conclusion", "Discussion",
    "Background", "References"
]

for section_name in SECTIONS:
    # 2a) Run K-Means
    run_kmeans_for_section(documents, section_name, n_clusters=5, random_state=42)

    # 2b) Gather the embeddings array for that section
    section_embs = []
    section_labels = []
    for i, doc in enumerate(documents):
        emb = doc['section_embeddings'][section_name]
        lbl = doc[f'kmeans_{section_name}']
        section_embs.append(emb)
        section_labels.append(lbl)

    section_embs = np.array(section_embs)
    section_labels = np.array(section_labels)

    # 2c) Visualize
    visualize_2d_projections(
        embeddings=section_embs,
        cluster_labels=section_labels,
        documents=documents,
        labeled_files_class0=labeled_files_class0,
        labeled_files_class1=labeled_files_class1,
        title_prefix=f"{section_name} Section"
    )

# 3) Do doc-level K-Means
run_kmeans_for_doc_embedding(documents, n_clusters=5, random_state=42)

# 4) Gather doc-level embeddings + labels
doc_embs = []
doc_labels = []
for i, doc in enumerate(documents):
    emb = doc['doc_embedding']
    lbl = doc['kmeans_doc']
    doc_embs.append(emb)
    doc_labels.append(lbl)

doc_embs = np.array(doc_embs)
doc_labels = np.array(doc_labels)

visualize_2d_projections(
    embeddings=doc_embs,
    cluster_labels=doc_labels,
    documents=documents,
    labeled_files_class0=labeled_files_class0,
    labeled_files_class1=labeled_files_class1,
    title_prefix="Document-Level"
)
What this does:
- Builds per-section embeddings in doc['section_embeddings'][section_name].
- Builds a document-level embedding in doc['doc_embedding'].
- Runs K-Means on each (with n_clusters=5 here; use k=2 if you want pure "non-pub vs. pub", that's up to you) and produces the three 2D visualizations for every section and for the document level.

If you want exactly 2 clusters for "publishable vs. non-publishable," just replace n_clusters=5 with n_clusters=2. Or do both and see how they differ.
This modular approach should integrate cleanly with your existing data structures (documents, embed_model). You can tweak n_clusters, or do further analysis on each section's clusters to see if, for instance, "Abstract" embeddings separate out certain flawed papers more clearly.
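One quick way to do that further analysis, sketched here on top of the workflow above, is to cross-tabulate each section's cluster labels against your 15 known labels and see which section separates the two classes most cleanly:

from collections import Counter

for section_name in SECTIONS:
    counts = Counter()
    for doc in documents:
        cluster = doc[f'kmeans_{section_name}']
        if doc['filename'] in labeled_files_class0:
            counts[(cluster, 'nonpub')] += 1
        elif doc['filename'] in labeled_files_class1:
            counts[(cluster, 'pub')] += 1
    # Sections where labeled non-pub and pub docs land in different clusters
    # are the most informative ones to inspect further.
    print(section_name, dict(counts))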
Good luck, and feel free to adapt the code as needed!
Thought for a second
Below are some reflections on why you might still be seeing identical final classifications despite introducing section-level embeddings, multiple clustering approaches, etc.:
Homogeneity in the Data
If most of your papers (both publishable and non-publishable) share similar language and vocabulary in every section—i.e., the “Results” and “Discussion” sections look more or less the same across many papers—then averaging or section-level embeddings might not introduce enough variability to shift the clusters. In other words, the model “sees” roughly the same signals no matter how you segment the text.
Strong Dominant Signal
Some domain-specific terms or writing style might dominate across all sections. For example, if every paper includes standard ML terminology or references, it might overwhelm the differences that typically differentiate publishable vs. non-publishable content. Thus, whether you look at the “Introduction” or “Methods” or combine them, the embeddings land in roughly the same space.
Limited Label Diversity
You only have 15 labeled examples (5 non-publishable, 10 publishable). If they are not diverse enough or if the unlabeled set is large and somewhat homogeneous, the clustering and centroid-based methods often yield the same outcome. The data itself may not be providing strong opposing signals to re-separate the clusters.
Robustness of the Embedding Model
If your chosen embedding model (e.g., SciBERT) consistently encodes the papers in a similar fashion, the different partitionings (section-level vs. entire doc) might just lead to near-identical vectors. For instance, if the “Methods” section heavily overlaps with the “Introduction” content, the average or sum might not differ enough to rearrange the cluster memberships.
The Data is Not Contradictory
The fact that multiple approaches lead to the same classification implies that, from a textual perspective, the boundary between “publishable” and “non-publishable” (as your data is capturing it) is consistent. The data is not easily teased apart differently by slicing or chunking.
Consistency in the Embedding Space
You might actually have a solid embedding approach. If all these transformations (section-level vs. doc-level) converge to the same outcome, it might mean the representation is stable and not very sensitive to how you slice the document text.
Possibility of Domain Constraints
Some tasks simply do not vary enough across sections for them to produce differing clusters. If every paper has the same general structure, length, or domain keywords, the embeddings can look very alike regardless of how you chunk or average them.
Refine or Add Additional Features
Examine “Borderline” Cases
Increase Labeled Diversity
Try Another Embedding Model
Validate with External Data
Getting the same final classification from different chunking/averaging strategies often means the signal your embeddings capture is consistent across sections.
It does not necessarily mean you are doing something wrong; rather, it suggests your data and chosen embeddings converge on the same representation regardless of how you slice the paper text. This can be interpreted as stability in your approach.
The best way to be sure is to test on truly unseen data or carefully label more documents to see if there are any subtle differences the current pipeline is missing.