LLM-Based Classification Strategies
Created: January 10, 2025
Below is the problem statement for a national level data science hackathon that I'm participating in.
The task involves developing a framework that can classify research papers as
either "Publishable" or "Non-Publishable" based on the evaluation of their
content. The goal is to create a robust system capable of identifying critical
issues, such as inappropriate methodologies, incoherent arguments, or
unsubstantiated claims, that affect the suitability of a paper for publication. For
instance, a research paper that applies methodologies or techniques that are not
well-suited to the problem being addressed, without adequate justification or
adaptation to the context, would be considered unsuitable. Similarly, a paper that
presents arguments that are unclear, disorganized, or lack logical coherence, or
one that claims results that appear unusually high or unrealistic without sufficient
evidence or proper validation, would also fall into the "Non-Publishable" category.
A dataset of 150 research papers is provided for classification, with 15 labeled
papers available for reference to guide the development of the framework. The
framework should be designed to accurately classify papers into the appropriate
category, ensuring that it can handle a wide range of research topics and
maintain consistency across different types of content. The proposed framework
must be capable of systematically analyzing these and other aspects of research
papers to ensure a reliable and objective evaluation process. The solution should
demonstrate high accuracy in detecting such issues, ensuring its applicability
across a range of research domains and scalability for future use with larger
datasets.
Think deeply about this and guide me. For context, the 15 labeled papers are imbalanced, with 10 in class 'Publishable' and only 5 in class 'Non-Publishable', so the ratio is 2:1. The amount of labeled data is also very small.
Every paper in the 150 given papers has thousands of words spanning several pages. Think about whether splitting these documents into major sections like abstract, introduction, methodology, related works, experiments and conclusion could benefit us in any way. But there is also the problem that not every paper has the same sections; the structure varies across papers, and many papers have subheadings and so on. The papers also span different topics, but the publishable papers, even in the unlabeled dataset, would be papers that could be placed into one of the following 5 conferences - CVPR, NeurIPS, KDD, EMNLP, TMLR - so the topics the papers cover should fall within the topics that dominate these 5 conferences. Equipped with this knowledge, think about what sort of approach could be taken, what sort of model could be used, and any other approaches. Think deeply about this.
We cannot directly upload the data to a GPT4 and ask it to classify it as we are judged in this hackathon based on
With that said, we can still use LLMs in other ways if we want; this is just for your clarification. We have no preferences regarding the method we use, we only want results.
Think deeply about this problem and provide different ways to approach it, different techniques or chain of thoughts to address this problem given the heavily imbalanced dataset and all other conditions mentioned. But try to suggest the ones that will work the best given the context in this query. Make sure your answer is extremely detailed and covers and clears all my questions, doubts and ideas.
Below is a systematic breakdown of possible approaches and considerations for building a highly accurate and robust framework to classify research papers as “Publishable” or “Non-Publishable,” given the constraints and context you described:
Synthetic Document Generation / Data Augmentation
Document Splitting + Label Inheritance
Active Learning or Semi-Supervised Learning
Oversampling / Undersampling
Class Weighting / Focal Loss
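For the class-weighting/focal-loss option, here is a minimal PyTorch sketch of a standard binary focal loss. The alpha and gamma values are tunable starting points, not values taken from this project.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Binary focal loss: down-weights easy examples so that the minority
    class and hard cases dominate the gradient."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = torch.exp(-bce)                                  # model confidence in the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Usage (illustrative tensors): focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())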
Splitting the documents by major sections (Abstract, Introduction, Methodology, etc.) can help isolate where certain “red flags” for non-publishability might appear. For instance:
Issue: Not all papers have a uniform section structure.
Given the text-length and complexity, large pre-trained language models (PLMs) such as BERT, RoBERTa, DeBERTa, or Longformer (for longer context) can be used:
Longformer / BigBird
Chunk-Level Approach with BERT
Instruction-Finetuned or GPT-based Models Locally
For smaller data scenarios, a simpler pipeline might be surprisingly effective:
Feature Extraction
Classification Algorithms
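For instance, a minimal TF-IDF plus linear-classifier sketch of this simpler pipeline. Here texts_15 and labels_15 are placeholder names for the texts and 0/1 labels of the 15 labeled papers; they are not variables defined elsewhere.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# texts_15: list of 15 preprocessed paper texts; labels_15: list of 0/1 labels (placeholders)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2), sublinear_tf=True)),
    ("clf", LogisticRegression(class_weight='balanced', max_iter=1000)),
])
f1 = cross_val_score(pipeline, texts_15, labels_15, cv=5, scoring='f1').mean()
print("TF-IDF + LogisticRegression cross-validated F1:", f1)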
Below is an illustration of a pipeline that incorporates many of the above ideas:
Preprocessing & Document Splitting
Embeddings
Aggregation
Address Imbalance
Train Classifier
Active Learning for Additional Labels
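As an illustration of the active-learning step, a minimal uncertainty-sampling sketch. The index and label arrays below are assumptions about how the 15 labeled papers are ordered; adjust them to your own data layout.

import numpy as np
from sklearn.linear_model import LogisticRegression

labeled_idx = np.arange(15)              # assumption: the first 15 rows of doc_embeddings are labeled
y_labeled = np.array([0]*5 + [1]*10)     # assumption: R001-R005 first, then R006-R015

clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(doc_embeddings[labeled_idx], y_labeled)

unlabeled_idx = np.setdiff1d(np.arange(len(doc_embeddings)), labeled_idx)
proba = clf.predict_proba(doc_embeddings[unlabeled_idx])[:, 1]
uncertainty = np.abs(proba - 0.5)                       # closest to 0.5 = least certain
review_order = unlabeled_idx[np.argsort(uncertainty)]    # most uncertain first
print("Papers to hand-review first:", review_order[:10])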
While you cannot upload all data to GPT-4 or other LLMs directly for classification, you can still use them to aid certain subtasks:
Section Extraction / Title Identification
Topic Tagging
Keyword Generation
Following these steps in a well-structured pipeline can yield a high-performing system, even with very limited labeled data and heavily imbalanced classes. By integrating data-centric approaches (like augmentation and active learning) with advanced NLP architectures (like Longformer/BERT for large text), you stand a strong chance of building a solution that meets the accuracy and F1 requirements for the hackathon.
For context, we have already tried semi-supervised and self-training approaches with the available limited labeled dataset, combining transformers like SciBERT, RoBERTa, BERT, sentence-transformers, zero-shot classification, etc., and we also tried SMOTE oversampling of the embeddings. But the end result was always overfitting, or even when a model seemed to work we couldn't gauge its performance because the test set typically contained only one or two examples per class. So we are not able to measure the model's performance and improve it.
So we would like to try different approaches, especially unsupervised learning: train on the entire corpus of 150 documents without using the labels, and somehow classify all 150 papers based on extracted features or some other method that distinguishes publishable from non-publishable papers. Our thinking is that we cannot use the 15 labeled papers for training, as that overfits or is hard to draw conclusions from.
We extracted the text from the documents using the pdfplumber library with the following code:

pdf_text = ""
try:
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                pdf_text += page_text
except Exception as e:
    print(f"Could not read {pdf_path}: {e}")
and we only did the basic preprocessing that is typically done for NLP tasks, using the code below.
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower()
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

df['Processed_Content'] = df['Content'].apply(preprocess_text)
We are not sure if this preprocessing is enough for this task and for papers spanning such huge volumes of text. So think about this and tell us if there is a better suited way for our application.
-Hierarchical Model: A hierarchical approach first encodes each section and then combines them with an attention mechanism. This can help the model learn that certain sections might hold more weight in publishability decisions (a minimal sketch of this idea follows this list).
-A rule-based or heuristic-based system (in parallel with an ML model) could flag suspicious patterns in these sections. Combined signals from these heuristics can augment your final classifier’s confidence.
-Semantic Similarity to Known “Non-Publishable” examples: Create an embedding for the entire text and measure the distance to embeddings from known non-publishable documents.
-Fine-tune a robust model (e.g., a small MLP on top of embeddings, or fine-tune a transformer with a classification head).
-You can prompt an LLM to generate regular expressions or a procedure for detecting sections from a typical research paper. You implement that code locally and run it on your documents.
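A minimal sketch of the attention-based hierarchical idea from the first bullet above. This is illustrative PyTorch, not code from the earlier suggestions; the embedding dimension and hidden size are assumptions.

import torch
import torch.nn as nn

class AttentionPoolClassifier(nn.Module):
    def __init__(self, dim=768, hidden=128):
        super().__init__()
        # Scores how much each section should contribute to the document vector
        self.att = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.head = nn.Linear(dim, 1)   # binary logit: publishable vs. non-publishable

    def forward(self, section_embs):                              # [num_sections, dim]
        weights = torch.softmax(self.att(section_embs), dim=0)    # [num_sections, 1]
        doc_emb = (weights * section_embs).sum(dim=0)             # attention-weighted pooling -> [dim]
        return self.head(doc_emb)                                 # [1]

# Usage with dummy section embeddings: logit = AttentionPoolClassifier()(torch.randn(7, 768))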
These are the ideas that we like from what you suggested. Go into more detail and think through these ideas for our specific application with the context of past attempts at this task and tell us how do we go about it step by step and then give us end to end code for each of them.
Keep in mind we are running in a kaggle environment and give code that utilizes the GPU or TPU accordingly.
Finally, all the thoughts I entered in this prompt are my own; they are neither objective nor irrefutable, so you can still find flaws in my thought process and suggest ways of improving it if there are any.
Below is a detailed, end-to-end outline of how you might implement several of the ideas you highlighted—ranging from heuristic-based section detection, a hierarchical model design, rule-based “red flag” detection, unsupervised clustering, and leveraging semantic similarity. The goal is to provide a variety of methods you can try in your Kaggle environment. Where relevant, I’ve included code snippets (in Python) that you can adapt. I’ll also point out potential pitfalls or ways to improve your approach.
Preprocessing & Section Extraction
1.1 Basic Preprocessing vs. Context-Aware Preprocessing
1.2 Heuristic-Based Section Detection
Unsupervised/Clustering Approach
2.1 Embedding Generation (Sentence Transformers)
2.2 K-Means / Hierarchical Clustering
2.3 Interpreting Clusters & Using Minimal Labels
Hierarchical Model for Long Documents
3.1 Chunking into Sections (or Subsections)
3.2 Encoding & Combining Section Embeddings
3.3 Fine-Tuning a Small Classifier Head (Optional)
Rule-Based “Red Flag” Detection
4.1 Designing Heuristics
4.2 Scoring & Combining with Other Methods
Semantic Similarity to Known Non-Publishable Papers
5.1 Building a Similarity Index
5.2 Threshold-Based Classification
Putting It All Together: Example Pipeline
Improving & Checking for Pitfalls
Complete Code Examples
Throughout, I’ll highlight ways to utilize your small set of 15 labeled papers effectively—even if it’s only at the end as a final “label assignment” step.
What you’re currently doing:
def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower()
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)
Suggestions:
Keep numbers and symbols that can carry meaning, such as %, =, and possibly "( )" if relevant, instead of stripping all punctuation.

Many research papers follow a rough structure with headings like Abstract, Introduction, Background/Related Work, Methodology, Results/Experiments, Discussion, Conclusion, and References.
Heuristic approach:
Below is a conceptual snippet for Kaggle (Python) to detect sections:
import re

def split_into_sections(pdf_text):
    # Possible common headings to look for
    headings = [
        r"(abstract)\b",
        r"(introduction)\b",
        r"(background|related\s?work)\b",
        r"(methodology|methods|experimental\s?setup)\b",
        r"(results|experiments)\b",
        r"(discussion)\b",
        r"(conclusion|conclusions)\b",
        r"(references)\b"
    ]
    # We'll store tuples of (section_name, section_text)
    sections = []
    current_section_name = "Unlabeled"
    current_section_text = []
    lines = pdf_text.split('\n')
    for line in lines:
        line_lower = line.lower().strip()
        matched_heading = None
        for hd in headings:
            if re.match(hd, line_lower):
                matched_heading = line.strip()
                break
        if matched_heading:
            # Save previous section if we have any text
            if current_section_text:
                sections.append((current_section_name, "\n".join(current_section_text)))
            # Start a new section
            current_section_name = matched_heading
            current_section_text = []
        else:
            current_section_text.append(line)
    # Add the last collected section if any
    if current_section_text:
        sections.append((current_section_name, "\n".join(current_section_text)))
    return sections

# Example usage
# pdf_text = "Your extracted text from pdfplumber"
# sections = split_into_sections(pdf_text)
# for sec_name, sec_content in sections:
#     print(sec_name, len(sec_content.split()))
What to do next:
Run each sec_content through your preprocess_text or a more domain-aware pipeline.

A sentence-transformer model (e.g., all-mpnet-base-v2, all-distilroberta-v1, or a domain-specific model like allenai/scibert_scivocab_uncased) can create embeddings for text segments.

By splitting papers into sections or ~1,000-word chunks:
def chunk_text(text, chunk_size=1000):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield ' '.join(words[i:i+chunk_size])
Then process each chunk with a transformer (like SciBERT) to obtain a vector.
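For example, a small sketch of that chunk-encoding step. The model choice is illustrative, chunk_text is the generator defined above, and paper_text stands in for the full text of one paper.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-mpnet-base-v2')   # or a SciBERT-style model

chunks = list(chunk_text(paper_text, chunk_size=1000))            # paper_text: one paper's full text
chunk_embeddings = model.encode(chunks, convert_to_numpy=True)    # shape: [num_chunks, embedding_dim]
paper_embedding = chunk_embeddings.mean(axis=0)                   # simple mean pooling over chunks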
One approach (a “hierarchical” approach) is:
Encode each chunk (or section), giving a matrix of shape [num_chunks, embedding_dim], and then pool these vectors (e.g., by mean or attention) into a single paper_embedding.

Use paper_embedding for either clustering (unsupervised) or as input to a small neural classifier if you do want to use your labeled data.

Even if you have only 15 labeled documents, you can still use them to interpret the unsupervised structure.
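For instance, a minimal sketch that uses the 15 labels only to name clusters by majority vote. Here label_map is an assumed dict mapping the 15 labeled filenames to 0/1, and cluster_label comes from the clustering code shown later.

from collections import Counter, defaultdict

# label_map: {"R001.pdf": 0, ..., "R015.pdf": 1} -- built by hand from the provided labels
cluster_votes = defaultdict(list)
for doc in documents:
    if doc['filename'] in label_map:
        cluster_votes[doc['cluster_label']].append(label_map[doc['filename']])

# Majority vote per cluster; clusters with no labeled member stay undecided (None)
cluster_to_class = {c: Counter(v).most_common(1)[0][0] for c, v in cluster_votes.items()}
for doc in documents:
    doc['predicted_class'] = cluster_to_class.get(doc['cluster_label'], None)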
Some potential signals for “Non-Publishable”:
You could create a simple “red flag” score for each paper:
This approach can run in parallel with your unsupervised model. You might ultimately label a paper "non-publishable" if its red-flag score is high or its embedding lies close to the known non-publishable papers.

Average the embeddings of the known non-publishable papers into a nonpub_centroid, then score every paper by its cosine similarity to nonpub_centroid.

A simple approach is to threshold that similarity directly; an alternative is to score each paper against its nearest known non-publishable example.
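A minimal sketch of the nearest-example variant. Here doc_embeddings and nonpub_embeddings are assumed NumPy arrays (all document embeddings and the 5 known non-publishable embeddings), and the threshold is illustrative.

import numpy as np

# Normalize so that a dot product equals cosine similarity
docs_n = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
nonpub_n = nonpub_embeddings / np.linalg.norm(nonpub_embeddings, axis=1, keepdims=True)

# For each paper, take the best match among the 5 known non-publishable examples
max_sim = (docs_n @ nonpub_n.T).max(axis=1)

threshold = 0.7                      # illustrative; tune by inspecting the labeled papers
maybe_nonpub = max_sim > threshold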
Overfitting to the 15 labeled docs:
Inconsistent Section Headings:
Large Variation in Paper Length:
GPU/TPU Usage:
Below are some consolidated code snippets (in Python) that you can adapt in a Kaggle notebook. I'll assume you've already installed pdfplumber and sentence-transformers. (If not, you can run !pip install pdfplumber sentence-transformers at the top of your Kaggle notebook.)
import re
import pdfplumber
import torch
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Make sure to download NLTK data if not already present
# nltk.download('punkt')
# nltk.download('stopwords')

def extract_text_from_pdf(pdf_path):
    """Extract all text from a PDF using pdfplumber."""
    pdf_text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                pdf_text += page_text + "\n"
    return pdf_text

def custom_preprocess(text):
    """
    A slightly more cautious preprocessing that:
    - Keeps numbers, percentage signs,
    - Lowercases,
    - Removes some punctuation but keeps essential symbols.
    """
    # Lowercase
    text = text.lower()
    # Keep only certain punctuation: letters, numbers, whitespace, punctuation like '%', '=', etc.
    text = re.sub(r'[^a-z0-9%\=\(\)\.\,\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords (optional)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)

def split_into_sections(pdf_text):
    headings = [
        r"(abstract)\b",
        r"(introduction)\b",
        r"(background|related\s?work)\b",
        r"(methodology|methods|experimental\s?setup)\b",
        r"(results|experiments)\b",
        r"(discussion)\b",
        r"(conclusion|conclusions)\b",
        r"(references)\b"
    ]
    sections = []
    current_section_name = "Unlabeled"
    current_section_text = []
    lines = pdf_text.split('\n')
    for line in lines:
        line_lower = line.strip().lower()
        matched_heading = None
        for hd in headings:
            if re.match(hd, line_lower):
                matched_heading = line.strip()
                break
        if matched_heading:
            if current_section_text:
                sections.append((current_section_name, "\n".join(current_section_text)))
            current_section_name = matched_heading
            current_section_text = []
        else:
            current_section_text.append(line)
    if current_section_text:
        sections.append((current_section_name, "\n".join(current_section_text)))
    return sections
Usage in a Kaggle environment:
# Suppose you have a folder "papers" with PDF files
import os

pdf_dir = "./papers"
documents = []

for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        path = os.path.join(pdf_dir, filename)
        raw_text = extract_text_from_pdf(path)
        sections = split_into_sections(raw_text)
        # Preprocess each section
        processed_sections = []
        for sec_name, sec_text in sections:
            processed_text = custom_preprocess(sec_text)
            processed_sections.append((sec_name, processed_text))
        # Store the document data
        documents.append({
            'filename': filename,
            'sections': processed_sections
        })
Below is a simple example of clustering the average embedding of each paper. We’ll use Sentence Transformers to get embeddings.
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

# Load a pre-trained or domain-specific model (SciBERT, etc.)
model_name = 'all-mpnet-base-v2'  # or 'allenai/scibert_scivocab_uncased'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embed_model = SentenceTransformer(model_name, device=device)

def get_document_embedding(processed_sections):
    """
    Generate embeddings for each section, then average them.
    processed_sections: list of (sec_name, text) after preprocessing.
    """
    section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()]
    if not section_texts:
        # fallback if there's no text
        return np.zeros(768)
    # Encode the entire batch
    embeddings = embed_model.encode(section_texts, convert_to_numpy=True)
    doc_embedding = np.mean(embeddings, axis=0)
    return doc_embedding

# Compute embeddings for all documents
doc_embeddings = []
for doc in documents:
    embedding = get_document_embedding(doc['sections'])
    doc_embeddings.append(embedding)
doc_embeddings = np.array(doc_embeddings)

# K-Means Clustering
num_clusters = 5  # arbitrary choice; you can experiment
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(doc_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = labels[i]

# Inspect the cluster distribution
from collections import Counter
cluster_counts = Counter(labels)
print(cluster_counts)
Next Steps:
Create a few simple functions to detect “red flags,” e.g.:
def count_red_flags(processed_sections):
    flags = 0
    # 1) Check for unrealistic claims: e.g., "100% accuracy"
    text_joined = " ".join(sec_text for _, sec_text in processed_sections)
    if '100% accuracy' in text_joined or '100 percent accuracy' in text_joined:
        flags += 1
    # 2) Check if Methodology section is too short
    for sec_name, sec_text in processed_sections:
        if 'method' in sec_name.lower():
            # Adjust threshold to your taste
            if len(sec_text.split()) < 100:
                flags += 1
    # 3) Check if there's a references section at all
    has_references = any('references' in sec_name.lower() for sec_name, _ in processed_sections)
    if not has_references:
        flags += 1
    return flags

# Example usage
for doc in documents:
    doc['red_flag_score'] = count_red_flags(doc['sections'])

# Then you could set a threshold
threshold = 1
for doc in documents:
    if doc['red_flag_score'] >= threshold:
        doc['suspicious'] = True
    else:
        doc['suspicious'] = False
If you have 5 known non-publishable papers:
known_nonpub_embeddings = []
for doc in documents:
    if doc['filename'] in known_nonpub_filenames:  # e.g. the 5 known non-publishable
        emb = get_document_embedding(doc['sections'])
        known_nonpub_embeddings.append(emb)

nonpub_centroid = np.mean(known_nonpub_embeddings, axis=0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Score each doc
for doc in documents:
    emb = get_document_embedding(doc['sections'])
    sim = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = sim
Then set a threshold:
# For instance, see the distribution of similarity_to_nonpub for known publishable vs. known non-publishable
# This helps you pick a threshold
threshold = 0.7
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > threshold)
Below is a condensed version of how this might look in an actual Kaggle Notebook with minimal commentary:
# Step 0: Install dependencies (if needed)
!pip install pdfplumber sentence-transformers

import os
import pdfplumber
import re
import torch
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from collections import Counter

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'all-mpnet-base-v2'
embed_model = SentenceTransformer(model_name, device=device)

# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')

def extract_text_from_pdf(pdf_path):
    pdf_text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                pdf_text += page_text + "\n"
    return pdf_text

def custom_preprocess(text):
    text = text.lower()
    # Keep certain punctuation
    text = re.sub(r'[^a-z0-9%\=\(\)\.\,\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)

def split_into_sections(pdf_text):
    headings = [
        r"(abstract)\b",
        r"(introduction)\b",
        r"(background|related\s?work)\b",
        r"(methodology|methods|experimental\s?setup)\b",
        r"(results|experiments)\b",
        r"(discussion)\b",
        r"(conclusion|conclusions)\b",
        r"(references)\b"
    ]
    sections = []
    current_section_name = "Unlabeled"
    current_section_text = []
    lines = pdf_text.split('\n')
    for line in lines:
        line_lower = line.strip().lower()
        matched_heading = None
        for hd in headings:
            if re.match(hd, line_lower):
                matched_heading = line.strip()
                break
        if matched_heading:
            if current_section_text:
                sections.append((current_section_name, "\n".join(current_section_text)))
            current_section_name = matched_heading
            current_section_text = []
        else:
            current_section_text.append(line)
    if current_section_text:
        sections.append((current_section_name, "\n".join(current_section_text)))
    return sections

def get_document_embedding(processed_sections):
    section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()]
    if not section_texts:
        return np.zeros(768)
    embeddings = embed_model.encode(section_texts, convert_to_numpy=True)
    return np.mean(embeddings, axis=0)

def count_red_flags(processed_sections):
    flags = 0
    text_joined = " ".join(sec_text for _, sec_text in processed_sections)
    # Example rule: 100% accuracy
    if '100% accuracy' in text_joined or '100 percent accuracy' in text_joined:
        flags += 1
    # Example rule: short methodology section
    for sec_name, sec_text in processed_sections:
        if 'method' in sec_name.lower():
            if len(sec_text.split()) < 100:
                flags += 1
    # Example rule: no references
    has_references = any('references' in sec_name.lower() for sec_name, _ in processed_sections)
    if not has_references:
        flags += 1
    return flags

# ===== MAIN PIPELINE =====
pdf_dir = "./papers"
documents = []

for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        path = os.path.join(pdf_dir, filename)
        raw_text = extract_text_from_pdf(path)
        sections = split_into_sections(raw_text)
        # Preprocess each section
        processed_sections = []
        for sec_name, sec_text in sections:
            processed_text = custom_preprocess(sec_text)
            processed_sections.append((sec_name, processed_text))
        documents.append({
            'filename': filename,
            'sections': processed_sections
        })

# Compute embeddings for each doc
doc_embeddings = []
for doc in documents:
    emb = get_document_embedding(doc['sections'])
    doc_embeddings.append(emb)
doc_embeddings = np.array(doc_embeddings)

# KMeans clustering
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(doc_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = int(labels[i])
    doc['red_flag_score'] = count_red_flags(doc['sections'])

cluster_counts = Counter(labels)
print("Cluster distribution:", cluster_counts)

# Suppose we know the filenames of the 5 known non-publishable
known_nonpub_filenames = ["nonpub1.pdf", "nonpub2.pdf", "nonpub3.pdf", "nonpub4.pdf", "nonpub5.pdf"]

known_nonpub_embeddings = []
for doc in documents:
    if doc['filename'] in known_nonpub_filenames:
        known_nonpub_embeddings.append(get_document_embedding(doc['sections']))
nonpub_centroid = np.mean(known_nonpub_embeddings, axis=0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for doc in documents:
    emb = get_document_embedding(doc['sections'])
    sim = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = float(sim)

# Example threshold for suspicious
similarity_threshold = 0.7
red_flag_threshold = 1
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold) \
        or (doc['red_flag_score'] >= red_flag_threshold)

# Inspect final results
for doc in documents:
    print(doc['filename'], doc['cluster_label'], doc['red_flag_score'],
          doc['similarity_to_nonpub'], doc['maybe_nonpub'])
Each document now carries a cluster label, a red-flag score, a similarity score, and a final suspicion flag (maybe_nonpub).

I hope this gives you a more comprehensive roadmap—and enough code to try out each approach in Kaggle. Feel free to tweak thresholds, the choice of embedding model, clustering method, and rule-based patterns to adapt to your specific corpus. Above all, remember that your domain knowledge about “what makes a paper unpublishable” can be translated into powerful heuristics that complement the unsupervised features.
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(doc_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = int(labels[i])
    doc['red_flag_score'] = count_red_flags(doc['sections'])

cluster_counts = Counter(labels)
print("Cluster distribution:", cluster_counts)
This was the clustering code that you gave in the previous response. Can you now give me code to visualize the clusters, so I can understand where the different classes are and how they are distributed? I want you to label or highlight the following 15 filenames when visualizing the clusters, as they are my labeled examples; I want to see where they end up.
The pdfs are named as R001.pdf, R002.pdf, R003.pdf....until R014.pdf, R015.pdf. R001 - R005 are of class 0, so label them as such and the remaining 10 are class 1, so label them as such too.
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for doc in documents:
    emb = get_document_embedding(doc['sections'])
    sim = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = float(sim)

similarity_threshold = 0.6
red_flag_threshold = 1

for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold)
This was the code you gave for finding the cosine similarity between documents. Can you give me the code to visualize the distribution of cosine similarities and also label those 15 filenames I mentioned, or at least highlight them in some form? I would additionally like some form of histogram/density curve for this.
Can you give some more red flag rules? We found out that not all papers have a reference section, even if they are publishable, so that was not a good metric that you gave. I would also like to see the red flag distribution as some form of a plot. Give code for that as well.
Can you suggest me any improvements?
Below is a structured walk-through of how to visualize the clustering results, the cosine similarity distribution, and the red flag scores for your documents. I’ve also included code snippets that:
Highlight the 15 labeled PDFs (R001.pdf to R015.pdf), with R001–R005 labeled as class 0 and R006–R015 labeled as class 1.

Where possible, I've included commentary for improvements or alternative approaches.
Since doc_embeddings is likely high-dimensional (e.g., 768 dimensions from a transformer), we'll reduce the dimensionality to 2D for plotting. You can use PCA, t-SNE, or UMAP. Below, I'll show PCA for simplicity, but feel free to replace it with t-SNE or UMAP.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# We have:
# doc_embeddings (shape: [num_documents, embedding_dim])
# documents (a list of dicts with 'filename' and 'cluster_label')

# Let's define our labeled sets
labeled_files_class0 = set([f"R00{i}.pdf" for i in range(1, 6)])    # R001 to R005
labeled_files_class1 = set([f"R00{i}.pdf" for i in range(6, 16)])   # R006 to R015

# (1) PCA for 2D visualization
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)

# (2) Alternatively, you can try t-SNE (comment out PCA if you want this)
# tsne = TSNE(n_components=2, random_state=42)
# reduced_tsne = tsne.fit_transform(doc_embeddings)
# Then use reduced_tsne instead of reduced_pca below.

# We'll use reduced_pca here for demonstration.
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

# Create a figure for the cluster visualization
plt.figure(figsize=(10, 7))

# Plot each document as a point, colored by cluster_label
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Now let's highlight or label the 15 known PDFs
for i, doc in enumerate(documents):
    filename = doc['filename']
    # If this doc is one of the 15 labeled
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        # We'll put a black edge or a special marker
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        # Optionally, annotate text
        # e.g., display 'C0' or 'C1' for class 0 or class 1
        if filename in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i]+0.02, y_coords[i]+0.02, label_text, fontsize=9)

# Add a legend for the cluster colors
# We'll add it by mapping each cluster label to a color patch
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")
plt.show()
Each document is plotted as a point colored by its cluster_label. The 15 labeled PDFs (R001-R005 are class 0, R006-R015 are class 1) are drawn with a black edge around the marker, and we label them in text as "C0" or "C1."

If you want a more advanced or non-linear reduction (like t-SNE or UMAP), just uncomment the relevant lines and refer to reduced_tsne[:, 0] and reduced_tsne[:, 1] for the x and y coordinates.
You have code that computes doc['similarity_to_nonpub'] for each document. Let's plot a histogram (or KDE plot) of these similarity scores and highlight the 15 labeled PDFs.
import seaborn as sns

# Extract similarity scores for all docs
similarities = [doc['similarity_to_nonpub'] for doc in documents]

# Make a list of labeled similarities, plus the label for each
similarities_labeled_c0 = []
similarities_labeled_c1 = []
similarities_unlabeled = []

for i, doc in enumerate(documents):
    sim_val = doc['similarity_to_nonpub']
    filename = doc['filename']
    if filename in labeled_files_class0:
        similarities_labeled_c0.append(sim_val)
    elif filename in labeled_files_class1:
        similarities_labeled_c1.append(sim_val)
    else:
        similarities_unlabeled.append(sim_val)

plt.figure(figsize=(10, 6))

# Plot distribution with seaborn's histplot or kdeplot
sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents')
# Plot labeled class 0
sns.histplot(similarities_labeled_c0, color='red', alpha=0.8, kde=True, label='Class 0 (R001-R005)')
# Plot labeled class 1
sns.histplot(similarities_labeled_c1, color='blue', alpha=0.6, kde=True, label='Class 1 (R006-R015)')

plt.title("Cosine Similarity to Non-Publishable Centroid")
plt.xlabel("Similarity Score")
plt.ylabel("Count")
plt.legend(loc='best')
plt.show()
Ideally, the known non-publishable papers should sit closer to the nonpub_centroid than the "Publishable" ones. If you prefer a pure density curve, use sns.kdeplot(...) instead of sns.histplot(kde=True, ...).

You mentioned that having or not having a references section wasn't always accurate. Below are some alternative heuristics:
Below is an updated example of a “red flag” function using some of these additional heuristics:
def count_red_flags(processed_sections):
    flags = 0
    full_text = " ".join(sec_text for _, sec_text in processed_sections)
    word_count = len(full_text.split())
    # 1) Very short paper
    if word_count < 1000:
        flags += 1
    # 2) Check if baseline or comparison keywords are absent
    if ("baseline" not in full_text) and ("compare" not in full_text) and ("benchmark" not in full_text):
        flags += 1
    # 3) Check if there's any mention of data or dataset
    if ("data" not in full_text) and ("dataset" not in full_text) and ("corpus" not in full_text):
        flags += 1
    # 4) No mention of common evaluation metrics
    common_metrics = ["accuracy", "precision", "recall", "f1", "auc", "mse", "rmse"]
    if not any(metric in full_text for metric in common_metrics):
        flags += 1
    return flags
Similar to the cosine similarity distribution, you might want to plot a histogram of red_flag_score across all documents. We'll again highlight the labeled ones.
import seaborn as sns

red_flag_scores = [doc['red_flag_score'] for doc in documents]

scores_labeled_c0 = []
scores_labeled_c1 = []
scores_unlabeled = []

for doc in documents:
    score_val = doc['red_flag_score']
    filename = doc['filename']
    if filename in labeled_files_class0:
        scores_labeled_c0.append(score_val)
    elif filename in labeled_files_class1:
        scores_labeled_c1.append(score_val)
    else:
        scores_unlabeled.append(score_val)

plt.figure(figsize=(10, 6))
sns.histplot(red_flag_scores, color='gray', alpha=0.4, binwidth=1, label='All Documents')
sns.histplot(scores_labeled_c0, color='red', alpha=0.8, binwidth=1, label='Class 0 (R001-R005)')
sns.histplot(scores_labeled_c1, color='blue', alpha=0.6, binwidth=1, label='Class 1 (R006-R015)')

plt.title("Red Flag Score Distribution")
plt.xlabel("Red Flag Score (count of triggered heuristics)")
plt.ylabel("Number of Papers")
plt.legend(loc='best')
plt.show()
This will give you a sense of how many rules each document is triggering, and whether your known class 0 or class 1 docs align with high or low red-flag counts.
Cluster plot: reduce doc_embeddings to 2D, color by cluster, highlight labeled docs.

By combining these visualizations, you'll get a much better understanding of how your dataset is structured, how your known "non-publishable" vs. "publishable" examples fit in, and which heuristics or similarity thresholds might be most effective.
Using the code you gave for both visualizing the clusters with PCA and visualizing the cosine similarity: the plots seem to show all the non-publishable papers but only a subset of the 10 publishable papers. You can see in the image that only 4 out of 10 publishable papers are visualized in clusters.png, and I think the same is the case in the histplot. Why is that? Can you rectify this and give me the correct code?
The issue you're describing stems from how the labeled files (R001.pdf to R015.pdf) are matched against the filenames in your documents list. The most likely culprit is the f-string pattern f"R00{i}.pdf", which produces names like R0010.pdf instead of R010.pdf for i >= 10, so only R006–R009 of the publishable papers are matched. Other mismatches (missing .pdf extensions, inconsistent capitalization, or whitespace issues) can cause the same symptom, resulting in only a subset of the labeled files being visualized.
Here’s a step-by-step rectification and improved code to ensure all 15 labeled files are visualized properly:
Ensure that the filenames for your labeled files are correctly generated (zero-padded) and matched against the filenames in your documents list. Normalize both sets of filenames to avoid mismatches.
# Define labeled files with consistent, zero-padded formatting so R010-R015 match
labeled_files_class0 = {f"R{i:03d}.pdf" for i in range(1, 6)}    # R001-R005
labeled_files_class1 = {f"R{i:03d}.pdf" for i in range(6, 16)}   # R006-R015

# Normalize filenames in the documents list for comparison
for doc in documents:
    doc['normalized_filename'] = doc['filename'].strip().lower()  # Normalize filenames

# Ensure the labeled sets are also normalized
labeled_files_class0 = {f.strip().lower() for f in labeled_files_class0}
labeled_files_class1 = {f.strip().lower() for f in labeled_files_class1}
Update the clustering visualization code to ensure all labeled files are highlighted properly:
# PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)

# Extract coordinates
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

# Create figure
plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['normalized_filename']
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        # Label the points
        if filename in labeled_files_class0:
            label_text = "C0"
        elif filename in labeled_files_class1:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

# Add legend for clusters
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")
plt.show()
Update the cosine similarity plot to ensure all labeled files are included:
# Extract cosine similarity values
similarities = [doc['similarity_to_nonpub'] for doc in documents]

# Categorize similarities
similarities_labeled_c0 = []
similarities_labeled_c1 = []

for doc in documents:
    filename = doc['normalized_filename']
    sim_val = doc['similarity_to_nonpub']
    if filename in labeled_files_class0:
        similarities_labeled_c0.append(sim_val)
    elif filename in labeled_files_class1:
        similarities_labeled_c1.append(sim_val)

plt.figure(figsize=(10, 6))

# Plot overall distribution
sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents')
# Highlight class 0
sns.histplot(similarities_labeled_c0, kde=True, color='red', alpha=0.8, label='Class 0 (R001-R005)')
# Highlight class 1
sns.histplot(similarities_labeled_c1, kde=True, color='blue', alpha=0.6, label='Class 1 (R006-R015)')

plt.title("Cosine Similarity to Non-Publishable Centroid")
plt.xlabel("Similarity Score")
plt.ylabel("Count")
plt.legend(loc='best')
plt.show()
If any of the 15 labeled files still fail to appear in the plots, print the filenames for debugging:
# Check which labeled files are not being matched
missing_files_c0 = labeled_files_class0 - {doc['normalized_filename'] for doc in documents}
missing_files_c1 = labeled_files_class1 - {doc['normalized_filename'] for doc in documents}
print("Missing Class 0 Files:", missing_files_c0)
print("Missing Class 1 Files:", missing_files_c1)
Ensure the filenames in the labeled lists (labeled_files_class0, labeled_files_class1) exactly match the filenames in documents.
Let me know if the updated code resolves the issue or if further adjustments are needed!
can you adjust the previous visualization codes to highlight and visualize the documents that are maybe_nonpub == True. So I can visualize how they are distributed in the clusters and cosine similarity distribution. Make sure it works without errors.
Here's how you can adjust the previous visualization codes to highlight the documents where maybe_nonpub == True. This involves updating both the clustering visualization and the cosine similarity distribution visualization.

Clustering Visualization with maybe_nonpub Highlighting

We'll add a special highlight for documents marked as maybe_nonpub == True in the clustering plot. These will be shown with a distinct edge color (e.g., orange).
# PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)

# Extract coordinates
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

# Create figure
plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['normalized_filename']
    # Highlight labeled documents (Class 0 and Class 1)
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        elif filename in labeled_files_class1:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
    # Highlight maybe_nonpub documents with a distinct edge color
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange',
                    marker='s', s=120, linewidths=1.5)

# Add legend for clusters
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")
plt.show()
The plot still highlights the labeled documents (R001 to R015) with class annotations (C0 and C1), and additionally marks documents with maybe_nonpub == True using orange square edges.

Cosine Similarity Distribution with maybe_nonpub Highlighting

We'll adjust the cosine similarity plot to highlight the distribution of documents where maybe_nonpub == True, in addition to the labeled documents.
# Extract cosine similarity values
similarities = [doc['similarity_to_nonpub'] for doc in documents]

# Categorize similarities
similarities_labeled_c0 = []
similarities_labeled_c1 = []
similarities_maybe_nonpub = []

for doc in documents:
    filename = doc['normalized_filename']
    sim_val = doc['similarity_to_nonpub']
    if filename in labeled_files_class0:
        similarities_labeled_c0.append(sim_val)
    elif filename in labeled_files_class1:
        similarities_labeled_c1.append(sim_val)
    if doc.get('maybe_nonpub', False):
        similarities_maybe_nonpub.append(sim_val)

plt.figure(figsize=(10, 6))

# Plot overall distribution
sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents')
# Highlight class 0
sns.histplot(similarities_labeled_c0, kde=True, color='red', alpha=0.8, label='Class 0 (R001-R005)')
# Highlight class 1
sns.histplot(similarities_labeled_c1, kde=True, color='blue', alpha=0.6, label='Class 1 (R006-R015)')
# Highlight maybe_nonpub
sns.histplot(similarities_maybe_nonpub, kde=True, color='orange', alpha=0.5, label='Maybe Non-Publishable')

plt.title("Cosine Similarity to Non-Publishable Centroid")
plt.xlabel("Similarity Score")
plt.ylabel("Count")
plt.legend(loc='best')
plt.show()
Red and blue histograms show Class 0 (R001-R005) and Class 1 (R006-R015), respectively, while the orange histogram shows the documents flagged as maybe_nonpub == True.

To ensure the plots work without errors, make sure every entry in documents has a normalized filename (normalized_filename) and a maybe_nonpub attribute (defaulting to False if not already present):

for doc in documents:
    if 'normalized_filename' not in doc or 'maybe_nonpub' not in doc:
        print("Missing attributes in:", doc)
Let me know if the updated code highlights the maybe_nonpub documents correctly and if further refinements are needed!
!pip install pymupdf sentence-transformers
import time
import warnings
warnings.filterwarnings('ignore')
import os
import fitz
import re
import torch
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from collections import Counter
import pickle
import json
import pandas as pd
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'all-mpnet-base-v2' #'all-mpnet-base-v2' #all-distilroberta-v1 #allenai/scibert_scivocab_uncased #t5-base #sentence-transformers/all-roberta-large-v1
embed_model = SentenceTransformer(model_name, device=device)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
############################################
############################################
def extract_text_pymupdf(pdf_path):
    """
    Use PyMuPDF to extract text from each page,
    returning a single string for the entire PDF.
    """
    doc_text = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            page_text = page.get_text("text")
            doc_text.append(page_text)
    return "\n".join(doc_text)
############################################
############################################
heading_pattern = re.compile(
    r"^(?:\d+\.?\s*)?"
    r"(abstract|introduction|background|related\swork|methods?|methodology|experiments|results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE
)
def unify_heading(heading_str):
    """
    Convert certain headings to a unified label:
    - methods, method, methodology -> Methods
    - conclusions, conclusion -> Conclusion
    - all else -> capitalized heading (e.g. 'Introduction', 'Results', etc.)
    """
    heading_lower = heading_str.lower()

    if heading_lower in {"methods", "method", "methodology"}:
        return "Methods"
    elif heading_lower in {"conclusion", "conclusions"}:
        return "Conclusion"
    else:
        # Capitalize first letter, rest lower
        # e.g. "introduction" -> "Introduction", "results" -> "Results"
        return heading_lower.capitalize()
def split_into_strict_sections(full_text):
    """
    Splits text by newlines. Then checks if the entire line
    matches typical headings. If it does, we start a new section.
    Otherwise, accumulate lines in the current section.
    """
    sections = []
    current_section_name = "Unlabeled"
    current_lines = []

    # Split on newline
    lines = full_text.split('\n')
    for line in lines:
        line_stripped = line.strip()
        match = heading_pattern.match(line_stripped.lower())
        if match:
            # Save previous lines under the old section
            if current_lines:
                sections.append((current_section_name, "\n".join(current_lines)))
            # The matched text (e.g. 'methods', 'introduction', etc.)
            matched_text = match.group(1)
            # Normalize it
            section_name = unify_heading(matched_text)
            current_section_name = section_name
            current_lines = []
        else:
            current_lines.append(line_stripped)
    # Append leftover lines to the final section
    if current_lines:
        sections.append((current_section_name, "\n".join(current_lines)))
    return sections
############################################
############################################
def custom_preprocess(text):
    """
    Light text cleaning:
    - Lowercase
    - Remove extraneous punctuation (but keep basic . , % etc.)
    - Remove stopwords
    """
    text = text.lower()
    # Adjust regex to keep or remove punctuation as desired
    text = re.sub(r'[^a-z0-9%=\(\).,\s]', '', text)

    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)
############################################
############################################
def find_all_pdfs(root_dir):
    pdf_paths = []
    for root, _, files in os.walk(root_dir):
        for file in files:
            if file.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(root, file))
    return pdf_paths
############################################
############################################
def process_pdfs_strict_unified(root_folder):
    """
    1) Recursively find all PDFs
    2) Extract text with PyMuPDF
    3) Strictly split into sections
    4) Normalize headings (e.g., 'Methodology' -> 'Methods')
    5) Preprocess each section
    6) Return a list of dicts
    """
    pdf_paths = find_all_pdfs(root_folder)
    documents = []

    start_time = time.time()
    for pdf_path in pdf_paths:
        raw_text = extract_text_pymupdf(pdf_path)
        sections = split_into_strict_sections(raw_text)
        processed_sections = []
        for (sec_name, sec_text) in sections:
            clean_text = custom_preprocess(sec_text)
            processed_sections.append((sec_name, clean_text))
        doc_info = {
            'filename': os.path.basename(pdf_path),
            'sections': processed_sections
        }
        documents.append(doc_info)
    end_time = time.time()
    print(f"Processed {len(documents)} PDFs in {end_time - start_time:.2f} seconds.")
    return documents
############################################
############################################
root_folder = "/kaggle/input/kdsh-2025-dataset/KDSH_2025_Dataset" # Modify as needed
documents = process_pdfs_strict_unified(root_folder)
df = pd.DataFrame([
{
'filename': doc['filename'],
'section_name': section[0], # e.g. "Methods", "Introduction", "Conclusion"
'section_text': section[1]
}
for doc in documents
for section in doc['sections']
])
df.to_csv("documents_strict_unified.csv", index=False)
print("Saved dataset as documents_strict_unified.csv")
df.section_name.unique()
def get_document_embedding(processed_sections):
    section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()]
    if not section_texts:
        return np.zeros(768)
    embeddings = embed_model.encode(section_texts, convert_to_numpy=True)
    return np.mean(embeddings, axis=0)
def count_red_flags(processed_sections):
    flags = 0
    full_text = " ".join(sec_text for _, sec_text in processed_sections)
    word_count = len(full_text.split())

    # Example rule: 100% accuracy
    if '100% accuracy' in full_text or '100 percent accuracy' in full_text:
        flags += 2
    # Example rule: short methodology section
    for sec_name, sec_text in processed_sections:
        if 'method' in sec_name.lower():
            if len(sec_text.split()) < 50:
                flags += 1
    # Example rule: very short paper (recipe-like documents are a red flag)
    if word_count < 500:
        flags += 1
    # 2) Check if baseline or comparison keywords are absent
    if ("baseline" not in full_text) and ("compare" not in full_text) and ("benchmark" not in full_text):
        flags += 1
    # 3) Check if there's any mention of data or dataset
    if ("data" not in full_text) and ("dataset" not in full_text) and ("corpus" not in full_text):
        flags += 1
    # 4) No mention of common evaluation metrics
    common_metrics = ["accuracy", "precision", "recall", "f1", "auc", "mse", "rmse"]
    if not any(metric in full_text for metric in common_metrics):
        flags += 1
    return flags
doc_embeddings = []
for doc in documents:
    emb = get_document_embedding(doc['sections'])
    doc_embeddings.append(emb)
doc_embeddings = np.array(doc_embeddings)
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(doc_embeddings)
for i, doc in enumerate(documents):
    doc['cluster_label'] = int(labels[i])
    doc['red_flag_score'] = count_red_flags(doc['sections'])
cluster_counts = Counter(labels)
print("Cluster distribution:", cluster_counts)
known_nonpub_filenames = ["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"]
known_nonpub_embeddings = []
for doc in documents:
    if doc['filename'] in known_nonpub_filenames:
        known_nonpub_embeddings.append(get_document_embedding(doc['sections']))
nonpub_centroid = np.mean(known_nonpub_embeddings, axis=0)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
for doc in documents:
    emb = get_document_embedding(doc['sections'])
    sim = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = float(sim)
similarity_threshold = 0.5
red_flag_threshold = 1
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold)
import pandas as pd
expanded_df = pd.DataFrame(documents)
expanded_df[expanded_df.filename.isin(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"])]
import pandas as pd
labeled_df = expanded_df[expanded_df['filename'].isin([f'R{str(i).zfill(3)}.pdf' for i in range(1, 16)])]
labeled_df.sort_values(by='filename')
Based on the cosine similarity values, we can see that the lowest value for Class 0 is 0.598972 and the highest value for Class 1 is 0.2980.
There is a gap of about 0.3, so we can set the decision boundary at 0.45; if we want to be even more conservative, we can raise it to 0.50, which gives fewer false positives.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
labeled_files_class0 = set(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"]) # R001 to R005
labeled_files_class1 = set(["R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf","R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"]) # R006 to R015
tsne = TSNE(n_components=2, random_state=42)
reduced_tsne = tsne.fit_transform(doc_embeddings)
x_coords = reduced_tsne[:, 0]
y_coords = reduced_tsne[:, 1]
plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
for i, doc in enumerate(documents):
    filename = doc['filename']
    # Highlight labeled documents (Class 0 and Class 1)
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        elif filename in labeled_files_class1:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
    # Highlight maybe_nonpub documents with a distinct edge color
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange',
                    marker='s', s=120, linewidths=1.5)
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")
plt.show()
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]
plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
for i, doc in enumerate(documents):
    filename = doc['filename']
    # Highlight labeled documents (Class 0 and Class 1)
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        elif filename in labeled_files_class1:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
    # Highlight maybe_nonpub documents with a distinct edge color
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange',
                    marker='s', s=120, linewidths=1.5)
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")
plt.show()
import seaborn as sns
similarities = [doc['similarity_to_nonpub'] for doc in documents]
similarities_labeled_c0 = []
similarities_labeled_c1 = []
similarities_maybe_nonpub = []
for doc in documents:
    filename = doc['filename']
    sim_val = doc['similarity_to_nonpub']
    if filename in labeled_files_class0:
        similarities_labeled_c0.append(sim_val)
    elif filename in labeled_files_class1:
        similarities_labeled_c1.append(sim_val)
    if doc.get('maybe_nonpub', False):
        similarities_maybe_nonpub.append(sim_val)
plt.figure(figsize=(10, 6))
sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents')
sns.histplot(similarities_labeled_c0, kde=True, color='red', alpha=0.8, label='Class 0 (R001-R005)')
sns.histplot(similarities_labeled_c1, kde=True, color='blue', alpha=0.6, label='Class 1 (R006-R015)')
sns.histplot(similarities_maybe_nonpub, kde=True, color='orange', alpha=0.5, label='Maybe Non-Publishable')
plt.title("Cosine Similarity to Non-Publishable Centroid")
plt.xlabel("Similarity Score")
plt.ylabel("Count")
plt.legend(loc='best')
plt.show()
Using this code that you gave earlier, after some modifications and some better preprocessing of my own, I'm now able to get good results. The model can tell with high confidence that 37 of the 150 papers are non-publishable. But this is purely using the cosine similarity; I have not included the rule-based methods in maybe_nonpub, because I felt the rules weren't representative or good enough yet. I will think further, analyse the papers, and add better rules down the line, but for now let's ignore them.
The model already does a good job using the similarity measure. It currently takes the centroid of the known non-publishable papers and then calculates the similarity measure based on that centroid. We get a lot of papers with high similarity, so I was thinking of including those with high similarity in the known_nonpub_filenames list, recalculating the nonpub_centroid, and then recomputing the similarity measure against this new centroid. I could repeat this process until no more papers can be classified with high similarity. Is this a viable approach? Think deeply and give your thoughts on it, then give a code implementation for it and tell me where I should paste it.
Then you also suggested this below approach:
"Even though you have a small labeled set, you can do a simple logistic regression or MLP with strong regularization or cross-validation. You might combine features like “paper_embedding,” and “similarity_to_nonpub.” Then use cross-validation on your 15 labeled docs to pick hyperparameters.
Small Classifier: If time permits, you can still attempt a small MLP on top of embeddings, but be wary of overfitting with only 15 labeled examples."
I like this idea and would love to hear more about this and why this might be a good idea, potential upsides and downsides and then the code implementation. I would like for it work in conjunction with the existing code and complement or aid in finding the non-publishable papers.
Till now, I have been using k-means for clustering and it has worked quite well. But you also said that hierarchical clustering or HDBSCAN can discover an arbitrary number of clusters and potentially identify outliers.
Can you think through this and guide me? Should I change the k-means to something else, and would that improve performance?
I want you to take all this context and think deeply and address my queries and respond correctly.
Give two sets of end-to-end code. The first should keep the existing code with additional improvements, maybe using an MLP as a secondary classifier to aid and complement it, plus an iterative k-means implementation based on the similarity. The second should try different approaches, like using HDBSCAN (or whatever is best) instead of k-means, or hierarchical clustering, or an MLP-based classifier, along with other improvements as well.
Below is a detailed answer addressing your specific questions:
Finally, I provide two complete code examples (so you can simply paste them into your workflow):
Practical Tip: Start with a higher threshold (e.g., > 0.60) to keep false positives low. Then you can slowly lower the threshold if you want to catch borderline cases, but watch for error drift.
Below is a consolidated code block illustrating iterative centroid expansion, the existing K-Means step, and a small MLP that uses the document embeddings and similarity_to_nonpub as input features.
Where to paste: Ideally, after you've computed doc_embeddings, similarity_to_nonpub, and maybe_nonpub for each document. This can replace or supplement the final classification steps.
```python
###############################################################################
# SET 1: Iterative Centroid Expansion + K-Means + MLP
###############################################################################
import numpy as np
from copy import deepcopy
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

###############################################################################
# 1. Iterative Centroid Expansion
###############################################################################
def iterative_nonpub_expansion(documents, known_nonpub_filenames, similarity_threshold=0.60, max_iterations=5):
    """
    Iteratively expands the known non-publishable list by adding any documents
    with similarity > similarity_threshold. Then recalculates centroid.

    Arguments:
    - documents: list of dicts containing doc['filename'], doc['similarity_to_nonpub'], etc.
    - known_nonpub_filenames: initial set/list of known non-publishable filenames (strings).
    - similarity_threshold: similarity above which a doc is considered "non-publishable."
    - max_iterations: maximum number of expansions to run.

    Returns:
    - updated_nonpub_filenames: final set of non-publishable filenames after expansion
    """
    updated_nonpub_filenames = set(known_nonpub_filenames)
    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        # Calculate new centroid from the current set
        nonpub_embeddings = []
        for doc in documents:
            if doc['filename'] in updated_nonpub_filenames:
                emb = doc['doc_embedding']  # we store doc_embedding for convenience
                nonpub_embeddings.append(emb)
        new_centroid = np.mean(nonpub_embeddings, axis=0)

        # Add docs that exceed the threshold
        newly_added = False
        for doc in documents:
            if doc['filename'] not in updated_nonpub_filenames:
                # compute similarity to new centroid
                emb = doc['doc_embedding']
                sim = cosine_similarity(emb, new_centroid)
                if sim > similarity_threshold:
                    updated_nonpub_filenames.add(doc['filename'])
                    newly_added = True

        if not newly_added:
            print(f"No new docs added in iteration {iteration}. Stopping.")
            break
        else:
            print(f"Iteration {iteration} added new docs. Current size of non-pub set: {len(updated_nonpub_filenames)}.")

    return updated_nonpub_filenames

# We'll assume you already have doc['doc_embedding'] = get_document_embedding(...)
# To store doc_embedding in your existing pipeline, do something like:
# for i, doc in enumerate(documents):
#     doc['doc_embedding'] = doc_embeddings[i]

# 1.1: Store embeddings in each doc for convenience
for i, doc in enumerate(documents):
    doc['doc_embedding'] = doc_embeddings[i]

# 1.2: Perform initial similarity calculation (already done in your code, typically)
# for doc in documents:
#     sim = cosine_similarity(doc['doc_embedding'], nonpub_centroid)
#     doc['similarity_to_nonpub'] = float(sim)

# 1.3: Iteratively expand
known_nonpub_initial = ["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"]
expanded_nonpub_filenames = iterative_nonpub_expansion(documents,
                                                       known_nonpub_initial,
                                                       similarity_threshold=0.60,
                                                       max_iterations=5)
print("Final set of non-publishable (expanded) = ", expanded_nonpub_filenames)

###############################################################################
# 2. K-Means (Already in your code)
###############################################################################
# We'll keep your existing KMeans step. No changes needed unless you want to integrate
# the newly expanded non-publishable set in some way (like removing them from K-Means, etc.).
#
# For reference:
# labels = kmeans.fit_predict(doc_embeddings)
# for i, doc in enumerate(documents):
#     doc['cluster_label'] = int(labels[i])

###############################################################################
# 3. Small MLP for classification
###############################################################################
# We'll build a tiny dataset from the 15 labeled documents you have
# (including your newly expanded nonpub set if you like).
# Then we do cross-validation to evaluate how well it classifies
# publishable (class 1) vs non-publishable (class 0).
from sklearn.model_selection import StratifiedKFold

# Let's define your 15 labeled docs (or any subset you trust):
nonpub_set_c0 = set(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"])
pub_set_c1 = set(["R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                  "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"])

X = []
y = []
for doc in documents:
    fname = doc['filename']
    if fname in nonpub_set_c0:
        X.append(doc['doc_embedding'])
        y.append(0)
    elif fname in pub_set_c1:
        X.append(doc['doc_embedding'])
        y.append(1)

X = np.array(X)
y = np.array(y)

# If we want to include similarity as an extra feature:
# We'll combine doc_embedding with similarity_to_nonpub
# Just do a quick horizontal stack:
X_aug = []
for doc, label in zip(documents, range(len(documents))):
    # but we only want those that are in our labeled set
    fname = doc['filename']
    if fname in nonpub_set_c0 or fname in pub_set_c1:
        embed_768 = doc['doc_embedding']
        sim_val = doc['similarity_to_nonpub']  # a single float
        # create a feature vector of size 769
        new_vec = np.concatenate([embed_768, [sim_val]])
        X_aug.append(new_vec)

# We must track y in the same order
y_aug = []
for doc in documents:
    fname = doc['filename']
    if fname in nonpub_set_c0:
        y_aug.append(0)
    elif fname in pub_set_c1:
        y_aug.append(1)

X_aug = np.array(X_aug)
y_aug = np.array(y_aug)

# We'll train an MLP on X_aug, y_aug
mlp = MLPClassifier(hidden_layer_sizes=(64,), random_state=42, max_iter=1000)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(mlp, X_aug, y_aug, cv=skf, scoring='accuracy')
print(f"5-Fold CV Accuracy for MLP: {scores.mean():.2f} +/- {scores.std():.2f}")

mlp.fit(X_aug, y_aug)

# Now we can apply this MLP to all 150 documents
# They each get a predicted label: 0 = non-publishable, 1 = publishable
for doc in documents:
    embed_768 = doc['doc_embedding']
    sim_val = doc['similarity_to_nonpub']
    test_vec = np.concatenate([embed_768, [sim_val]])
    pred_label = mlp.predict(test_vec.reshape(1, -1))[0]
    doc['mlp_pred_label'] = int(pred_label)
    # doc['mlp_pred_label'] == 0 => non-publishable, 1 => publishable
```
Where to paste: after you've computed doc_embeddings, the initial nonpub_centroid, and doc['similarity_to_nonpub'] for each document.
This second code snippet replaces K-Means with HDBSCAN or Hierarchical Clustering. HDBSCAN requires installing the library:
```bash
!pip install hdbscan
```
We’ll also show how to do the MLP in the same manner. (You can remove the MLP part if you only want to test clustering.)
Where to paste: Replace your K-Means block with the HDBSCAN block. Keep the rest of your pipeline the same.
```python
###############################################################################
# SET 2: Using HDBSCAN (Outlier Detection) + Optional MLP
###############################################################################
!pip install hdbscan
import hdbscan
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 1) HDBSCAN to cluster doc_embeddings
hdb = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3, metric='euclidean')
hdb_labels = hdb.fit_predict(doc_embeddings)

# We can store these cluster labels in documents
for i, doc in enumerate(documents):
    doc['hdb_cluster_label'] = int(hdb_labels[i])  # -1 means outlier/no cluster

# Let's see how many are labeled as outliers
outlier_count = sum(1 for lbl in hdb_labels if lbl == -1)
print(f"Number of outliers (no cluster assigned): {outlier_count}")

# 2) Inspect cluster distribution
label_counts = Counter(hdb_labels)
print("HDBSCAN Cluster Distribution:", label_counts)

# 3) If you want a quick visualization:
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10, 7))
scatter = plt.scatter(x_coords, y_coords, c=hdb_labels, cmap='tab10', alpha=0.6)
plt.title("HDBSCAN Clusters (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# highlight your known labeled docs
labeled_files_class0 = set(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"])
labeled_files_class1 = set(["R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                            "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"])
for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0 or doc['filename'] in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5)
        if doc['filename'] in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
plt.show()

# 4) Optional: Train an MLP or logistic regression using same approach as above
#    Build labeled X, y from the 15 known docs
X_aug = []
y_aug = []
for doc in documents:
    fname = doc['filename']
    if fname in labeled_files_class0:
        # Create feature vector from doc's embedding + doc['similarity_to_nonpub']
        embed_768 = doc_embeddings[[i for i, d in enumerate(documents) if d['filename'] == fname][0]]
        sim_val = doc['similarity_to_nonpub']
        new_vec = np.concatenate([embed_768, [sim_val]])
        X_aug.append(new_vec)
        y_aug.append(0)
    elif fname in labeled_files_class1:
        embed_768 = doc_embeddings[[i for i, d in enumerate(documents) if d['filename'] == fname][0]]
        sim_val = doc['similarity_to_nonpub']
        new_vec = np.concatenate([embed_768, [sim_val]])
        X_aug.append(new_vec)
        y_aug.append(1)

X_aug = np.array(X_aug)
y_aug = np.array(y_aug)

mlp = MLPClassifier(hidden_layer_sizes=(64,), random_state=42, max_iter=1000)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(mlp, X_aug, y_aug, cv=skf, scoring='accuracy')
print(f"[HDBSCAN] 5-Fold CV Accuracy for MLP: {scores.mean():.2f} +/- {scores.std():.2f}")

mlp.fit(X_aug, y_aug)

for doc in documents:
    # For each doc, build the same feature vector and predict
    # (assumes doc['doc_embedding'] was stored earlier, as in Set 1)
    embed_768 = doc['doc_embedding']
    sim_val = doc['similarity_to_nonpub']
    test_vec = np.concatenate([embed_768, [sim_val]])
    pred_label = mlp.predict(test_vec.reshape(1, -1))[0]
    doc['mlp_pred_label'] = int(pred_label)

print("Done: HDBSCAN + MLP classification.")
```
Note: a cluster label of -1 indicates an outlier (i.e., no cluster). You can treat outliers as "suspect" if you suspect unusual papers might be "non-publishable," or you can do a separate check.
Ultimately, it's best to compare the approaches on your small labeled set (the 15 papers). Whichever approach consistently classifies them correctly (via cross-validation) is typically the safest bet to scale to the 150-paper dataset.
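Building on that note, here is a small sketch (assuming the hdb_cluster_label values and the labeled filename sets from the block above) that turns the outlier flag into a "suspect" marker and checks how it lines up with your 15 labeled papers:

```python
# Flag HDBSCAN outliers as "suspect" and sanity-check the flag against the labeled docs
for doc in documents:
    doc['hdb_suspect'] = (doc.get('hdb_cluster_label', 0) == -1)

flagged_c0 = sum(1 for doc in documents
                 if doc['filename'] in labeled_files_class0 and doc['hdb_suspect'])
flagged_c1 = sum(1 for doc in documents
                 if doc['filename'] in labeled_files_class1 and doc['hdb_suspect'])
print(f"Outlier flag catches {flagged_c0}/5 known non-publishable and {flagged_c1}/10 known publishable docs")
```

If the flag catches several known non-publishable papers while leaving the publishable ones alone, keep it as an extra signal; otherwise it is probably safe to drop.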
Feel free to integrate the above code snippets wherever they fit best in your existing pipeline. Good luck!
So I ran your above code. The set 2 failed miserably, HDBSCAN performed badly for this task. The MLP classifier overfit as well. So I dropped that.
Then I tried set 1. I thought the iterative approach was a good idea, but the final nonpub_centroid ended up producing a false negative and a bunch of false positives, making the boundaries between the classes even worse, with laughable performance. It also classified only 34 papers as non-publishable, and that count includes mistakes.
It seems like my initial model, before these code improvements and suggestions, was way better: it predicted no false positives and perfectly captured the true positives and true negatives. It classified 38/150 documents as non-publishable, but there are even more documents that are non-publishable which it hasn't learnt to catch yet. Basically it has a lot more false negatives to work on, but that's the only issue with it.
I feel like the similarity measure was a good baseline as it had correctly classified based on my chosen threshold. But it doesn't classify all of them as non-publishable.
How do I interpret the similarity measure graph? Does a larger gap between the publishable and non-publishable classes in terms of similarity indicate that it is good? But what if 75 percent of the data is under the threshold? Though we get no false positives, there are many more false negatives. Explain how I should go about interpreting and understanding the similarity measure plot, not only the difference between the classes but also the concentration of papers at different values.
When I try a 2D projection of the k-means using PCA and t-SNE, I can see that all the non-publishable papers are grouped together in a cluster, and the surrounding points have been selected as maybe_nonpub. But some of the surrounding points are still left out. How do I improve the false negative rate?
When I try different transformers, I get different similarity measures and thresholds. How do I decide which transformer is the best in this regard?
Think deeply about everything and answer.
Below is a deeper discussion of the issues you’re encountering, how to interpret your similarity-based approach, and how you might reduce false negatives (while keeping false positives in check). I’ll also address how you can evaluate different transformers to pick one that best separates publishable and non-publishable papers.
When you calculate the cosine similarity between your paper embeddings and the “non-publishable centroid,” you typically end up with a distribution of similarity scores (e.g., from 0.0 to 1.0). You may also separate out the scores for the known Class 0 (non-publishable) docs, the known Class 1 (publishable) docs, and the remaining unlabeled documents, as in your histogram plot.
You mentioned that “75% of data is under the threshold” and are thus classified as publishable. This is typical if your threshold is somewhat high. The question is: Are you comfortable with that many being labeled publishable if it means you avoid false positives?
To gauge that, you can do a small precision-recall style analysis on your 15 labeled docs:
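For example, a minimal sketch (assuming doc['similarity_to_nonpub'] and the labeled_files_class0 / labeled_files_class1 sets from your pipeline) that sweeps the threshold and reports precision and recall for the non-publishable class:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Similarity scores and true labels for the 15 labeled docs (1 = non-publishable here)
sims, truth = [], []
for doc in documents:
    if doc['filename'] in labeled_files_class0:
        sims.append(doc['similarity_to_nonpub']); truth.append(1)
    elif doc['filename'] in labeled_files_class1:
        sims.append(doc['similarity_to_nonpub']); truth.append(0)
sims, truth = np.array(sims), np.array(truth)

# Sweep candidate thresholds to see how precision and recall trade off
for thr in np.arange(0.40, 0.80, 0.05):
    preds = (sims > thr).astype(int)
    p = precision_score(truth, preds, zero_division=0)
    r = recall_score(truth, preds, zero_division=0)
    print(f"threshold={thr:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With only 15 labels the numbers are coarse, but they show roughly where recall starts to collapse as the threshold rises.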
If your main problem is that the model is “missing” some truly non-publishable papers that have similarity below your chosen threshold, here are ways to tackle that:
Dual-Centroid Approach: Instead of a single “non-publishable” centroid, also compute a “publishable” centroid using your known Class 1 docs. Then, for each unlabeled paper, compare its distance (similarity) to both centroids. Whichever it’s closest to (or whichever similarity is higher) is the predicted class. This can sometimes reduce false negatives if those borderline papers are “less similar” to the publishable centroid.
Manual Spot-Check Borderlines: If you have a band around your threshold (say 0.50–0.60) where many papers lie, you could manually examine some of these borderline papers to either confirm they’re publishable or non-publishable. This manual check (even if for a handful of papers) can refine your threshold or expand your labeled set.
Active Learning: Similar to the manual spot-check, you pick the unlabeled docs that are near the threshold. Label them carefully. Recompute the centroid with these newly labeled docs. This way, you’re less likely to incorporate incorrectly labeled examples (which can happen with naive iterative expansion).
Add a Secondary Feature: If the similarity approach alone leads to false negatives, consider adding a single, carefully chosen rule-based feature. For instance, if a borderline doc is suspiciously short or lacks any mention of “data/method,” maybe that alone can nudge its final classification to non-publishable. You don’t have to rely on an entire rule-based system—just one or two very telling features might help.
Different transformers (e.g., all-mpnet-base-v2, scibert_scivocab_uncased, all-distilroberta-v1) can produce different embedding spaces. Some points to remember:
all-distilroberta-v1 might produce embeddings whose cosine similarity has a narrower band overall compared to mpnet or scibert. scibert_scivocab_uncased might do better if the text is very academic. Alternatively, mpnet or a huggingface "sentence-transformers" model might be more robust across general text.
Key: The "best" model will produce the largest separation between known Class 0 and Class 1, or at least a separation that is easy to threshold.
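One concrete way to compare transformers on this criterion is to recompute similarity_to_nonpub with each candidate model and then measure how far apart the two labeled groups sit. A minimal sketch (rerun once per model, assuming your usual documents structure and labeled filename sets):

```python
import numpy as np

def separation_report(documents, labeled_files_class0, labeled_files_class1):
    """Report how cleanly similarity_to_nonpub separates the 15 labeled docs."""
    c0 = [d['similarity_to_nonpub'] for d in documents if d['filename'] in labeled_files_class0]
    c1 = [d['similarity_to_nonpub'] for d in documents if d['filename'] in labeled_files_class1]
    margin = min(c0) - max(c1)          # > 0 means a clean gap between the classes
    mean_gap = np.mean(c0) - np.mean(c1)
    print(f"min(C0)={min(c0):.3f}  max(C1)={max(c1):.3f}  margin={margin:.3f}  mean gap={mean_gap:.3f}")
    return margin

separation_report(documents, labeled_files_class0, labeled_files_class1)
```

The transformer with the largest (ideally positive) margin is the easiest to threshold; a negative margin means the labeled classes already overlap in that embedding space.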
Stay with the Similarity Approach: if it's giving you 0 false positives but you want to reduce false negatives, consider:
- Validate thresholds with known data
- Light-touch rules
- Refine the embedding model
- Minimal classifier: [768-dim embeddings + 1 similarity feature] might be enough to glean some improvement without overfitting too easily.
In short, the points above cover interpretation of the similarity plot, improving false negatives, selecting the best transformer, and why some methods fail.
Recommendation:
Your strong initial results show that cosine similarity (with a well-chosen threshold) can already separate non-publishable from publishable decently. Adding more data—or carefully adjusting your threshold—will likely do more to address the false negatives than fancy expansions or clustering. If you only have 15 labeled docs, applying active learning or a dual-centroid approach can be more robust than heavy iterative expansions that risk picking up accidental false positives.
Focus on incremental improvements around your best baseline. Sometimes the simplest method (similarity + threshold) is both the most interpretable and the most robust, especially with very limited labels.
ok then give me code for the logistic regression, using the similarity measure and embeddings. Make sure it doesn't overfit. Keep in mind I have only 5 non-publishable examples and 10 publishable examples in the labeled data. Keep in mind that this should work well with already existing codes and my data structures. Then give a plot that visualizes the boundary in all the 3 different 2d projection visualizations im using. Incorporate the decision boundary in those plots.
Below is the code for those plots:
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]
plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
for i, doc in enumerate(documents):
filename = doc['filename']
# Highlight labeled documents (Class 0 and Class 1)
if filename in labeled_files_class0 or filename in labeled_files_class1:
plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5)
if filename in labeled_files_class0:
label_text = "C0"
elif filename in labeled_files_class1:
label_text = "C1"
plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
# Highlight maybe_nonpub documents with a distinct edge color
if doc.get('maybe_nonpub', False):
plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange', marker='s', s=120, linewidths=1.5)
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")
plt.show()
Next: can we combine the embeddings of two or three different models and make a better prediction by somehow combining them, and see if it improves the performance? Think deeply about this and its implications. Guide me on this.
Next: what about doing some sort of topic modeling on the papers? If a paper covers nonsense topics that aren't relevant, maybe we can classify it as non-publishable based on some measure, or even a similarity measure. This is just a thought; guide me on this. If this is the right approach, then give me a full end-to-end code for it. I have a separate dataset in the form of a CSV that has filenames and the extracted texts as columns. This extracted text is not split into sections, so it spans the entire document; maybe this will be good for topic modeling, comparing documents, and finding non-publishable ones. Think deeply about this.
You also suggested that Instead of a single “non-publishable” centroid, also compute a “publishable” centroid using your known Class 1 docs. Then, for each unlabeled paper, compare its distance (similarity) to both centroids. Whichever it’s closest to (or whichever similarity is higher) is the predicted class. This can sometimes reduce false negatives if those borderline papers are “less similar” to the publishable centroid. Can you give me a full end to end implementation of this, with also the necessary plots. This I can run in a separate notebook, so I can different methods concurrently.
think about all of this using all the context that you have and provide me a solid answer meeting all my requirements.
Below is a comprehensive answer that addresses all your requests:
We assume you already have: your 5 known non-publishable docs (Class 0) and 10 known publishable docs (Class 1); doc_embeddings[i], a vector (e.g. 768-dimensional) for the i-th document; documents[i]['similarity_to_nonpub'], a float for the similarity to the "non-publishable centroid"; and documents[i]['filename'], the PDF filename.
Where to store the labeled classes:

```python
labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
```
We'll combine the raw embedding (768-dim) + the single similarity feature = 769-dim total.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 1) Build X, y from your 15 labeled docs
X_list = []
y_list = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        # Non-publishable
        x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]])
        X_list.append(x_vec)
        y_list.append(0)
    elif fname in labeled_files_class1:
        # Publishable
        x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]])
        X_list.append(x_vec)
        y_list.append(1)

X = np.array(X_list)
y = np.array(y_list)

# 2) Logistic Regression with strong regularization
#    Using 'saga' or 'liblinear' solver can handle small data well.
#    We also set a high C value for L2 or try a smaller C for stronger reg.
logreg = LogisticRegression(solver='liblinear', C=1.0, penalty='l2', random_state=42)

# 3) Cross-validation to check for overfitting
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(logreg, X, y, cv=skf, scoring='accuracy')
print(f"[LogReg] 5-Fold CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

# 4) Fit final model on entire labeled set
logreg.fit(X, y)

# 5) Predict for all 150 documents
for i, doc in enumerate(documents):
    x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]])
    pred_label = logreg.predict(x_vec.reshape(1, -1))[0]
    doc['logreg_pred'] = int(pred_label)  # 0=non-publishable, 1=publishable
    # If you want predicted probabilities:
    # doc['logreg_prob'] = logreg.predict_proba(x_vec.reshape(1, -1))[0][1]
```
You have three 2D projections (PCA, t-SNE, etc.). The logistic regression actually lives in a 769-dimensional space (embedding + similarity). Plotting that boundary in 2D is non-trivial. Typically, one of two strategies is used: (1) train a separate logistic regression on just the 2D coordinates [pca_x, pca_y] for the labeled docs, so the boundary can be drawn directly in PCA space, or (2) try to project the high-dimensional boundary into 2D, which is harder and less faithful.
For simplicity, let's do approach (1): train a separate logistic regression purely on the 2D PCA coordinates of your labeled docs, and then plot the boundary. Note: This is only for illustration, not the actual classifier you use in practice.
```python
############ Visualization of Decision Boundary in PCA ############
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# 1) Prepare PCA (2D) for all docs
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

# 2) Build the training set in 2D PCA for the 15 labeled docs
X_2d = []
y_2d = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        # Non-publishable
        X_2d.append([x_coords[i], y_coords[i]])
        y_2d.append(0)
    elif fname in labeled_files_class1:
        # Publishable
        X_2d.append([x_coords[i], y_coords[i]])
        y_2d.append(1)

X_2d = np.array(X_2d)
y_2d = np.array(y_2d)

# 3) Train logistic regression on the 2D PCA data
logreg_2d = LogisticRegression(solver='liblinear', random_state=42)
logreg_2d.fit(X_2d, y_2d)

# 4) Plot PCA scatter with the decision boundary
plt.figure(figsize=(10, 7))
# Color by cluster_label (if using K-Means)
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters (PCA 2D Projection) + LogReg Boundary")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Mark labeled docs
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0 or fname in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5)
        if fname in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange', marker='s', s=120, linewidths=1.5)

# Plot the boundary in 2D
# We do this by creating a grid of (x, y), predict with logreg_2d, and draw a contour.
xx, yy = np.meshgrid(
    np.linspace(x_coords.min() - 0.5, x_coords.max() + 0.5, 200),
    np.linspace(y_coords.min() - 0.5, y_coords.max() + 0.5, 200)
)
Z = logreg_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[0.5], cmap="gray", linewidths=2)

# Add cluster legend
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")
plt.show()
```
Note: The boundary you see here is only for the 2D PCA logistic regression, not the main (embedding+similarity) logistic regression. We do this because otherwise we can’t easily show a 769-dimensional boundary in 2D.
Yes, you can combine embeddings from multiple transformer models (e.g., SciBERT, MPNet, DistilRoBERTa). The simplest approach is concatenation:
```python
# Suppose you computed embeddings from 2 different models:
#   doc_embeddings_1 (shape: [n_docs, d1])
#   doc_embeddings_2 (shape: [n_docs, d2])
# for each doc in the same order
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
# shape: [n_docs, d1 + d2]
```
Then you can apply the same pipeline (compute similarity, do logistic regression, etc.) on the combined vector. Or you can do more sophisticated merging, like averaging the two embeddings if they’re from the same dimensional space. But typically, concatenation is fine if you can handle the increased dimensionality.
Pros: you keep complementary signal from both models, which can sharpen the separation.
Cons: the dimensionality roughly doubles, which increases compute and the risk of overfitting with only 15 labels.
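One practical precaution before concatenating (a small sketch, not part of your current pipeline): the two models can produce embeddings with very different norms, so L2-normalizing each set first keeps one model from dominating the combined cosine similarities:

```python
import numpy as np

def l2_normalize(emb: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so both embedding sets contribute comparably."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.clip(norms, 1e-12, None)

# Normalize each model's embeddings, then concatenate along the feature axis
combined_embeddings = np.concatenate(
    [l2_normalize(doc_embeddings_1), l2_normalize(doc_embeddings_2)], axis=1
)
```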
You can use sklearn.decomposition.LatentDirichletAllocation or gensim to extract topics. Assume you have a CSV, papers_fulltext.csv, with columns: filename, full_text.
```python
###############################################################################
# TOPIC MODELING EXAMPLE
###############################################################################
!pip install gensim
import re
import pandas as pd
import nltk
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# NLTK resources needed by word_tokenize / stopwords
nltk.download('punkt')
nltk.download('stopwords')

df_topics = pd.read_csv("papers_fulltext.csv")  # columns: ['filename', 'full_text']

def clean_for_lda(text):
    # Basic cleaning for LDA
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    return tokens

# Preprocess each doc
docs_tokens = [clean_for_lda(t) for t in df_topics['full_text']]

# Gensim requires a dictionary and a bag-of-words corpus
dictionary = Dictionary(docs_tokens)
dictionary.filter_extremes(no_below=3, no_above=0.6)  # filter out rare & frequent tokens
bow_corpus = [dictionary.doc2bow(tokens) for tokens in docs_tokens]

# Train LDA with, say, 5 or 10 topics
num_topics = 5
lda_model = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=5, random_state=42)

# Inspect top words in each topic
for idx in range(num_topics):
    print(f"Topic #{idx}: {lda_model.print_topic(idx, 10)}")

# Now you can infer the topic distribution of each doc
doc_topics = []
for i, bow in enumerate(bow_corpus):
    topics_prob = lda_model.get_document_topics(bow, minimum_probability=0.0)
    # 'topics_prob' is a list of (topic_id, prob)
    doc_topics.append(topics_prob)

df_topics['topic_distribution'] = doc_topics

# DETECT "BULLSHIT" PAPERS?
# Example: if a doc's top topic = 'cooking' but the paper claims to be about 'CV' or 'ML',
# you can do some heuristic check, e.g., if the paper states "NLP" but the top topic is "cooking".
for i, row in df_topics.iterrows():
    # row['topic_distribution'] => list of (topic_id, prob)
    top_topic_id, top_topic_prob = max(row['topic_distribution'], key=lambda x: x[1])
    # If top_topic_id is not in your expected domain topics, mark it suspicious
    # Or do more advanced matching
```
You could also do a similarity approach on the doc’s topic distribution vs. a known distribution for “real ML topics.” Or if your LDA is picking up nonsense topics (like “oven, pan, salt, recipe”), you can flag those docs as suspicious.
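As a rough sketch of that idea (assuming the df_topics, num_topics, and topic_distribution objects from the LDA block above, that the CSV filenames match your R0xx.pdf names, and that labeled_files_class1 holds your known publishable filenames), you can turn each topic distribution into a dense vector and compare it to the average distribution of the known publishable papers:

```python
import numpy as np

def dense_topic_vector(topics_prob, num_topics):
    """Convert gensim's list of (topic_id, prob) pairs into a fixed-length vector."""
    vec = np.zeros(num_topics)
    for topic_id, prob in topics_prob:
        vec[topic_id] = prob
    return vec

topic_vecs = np.array([dense_topic_vector(tp, num_topics) for tp in df_topics['topic_distribution']])

# Reference distribution: mean topic vector of the known publishable papers
pub_mask = df_topics['filename'].isin(labeled_files_class1).values
pub_topic_centroid = topic_vecs[pub_mask].mean(axis=0)

# Cosine similarity of every paper's topic mix to that reference
def topic_cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

df_topics['topic_sim_to_pub'] = [topic_cos_sim(v, pub_topic_centroid) for v in topic_vecs]
print(df_topics.sort_values('topic_sim_to_pub').head(10)[['filename', 'topic_sim_to_pub']])
```

Papers whose topic mix is far from the publishable reference are candidates for the "off-topic" red flag; treat this as a signal to combine with your embedding similarity, not a classifier on its own.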
Instead of a single “non-publishable” centroid, create two centroids:
For each unlabeled doc, compute similarity to both centroids. Predict whichever similarity is higher (or whichever distance is smaller).
```python
###############################################################################
# DUAL-CENTROID APPROACH
###############################################################################
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 1) Collect embeddings for known classes
nonpub_embeddings = []
pub_embeddings = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_embeddings.append(doc_embeddings[i])
    elif fname in labeled_files_class1:
        pub_embeddings.append(doc_embeddings[i])

# 2) Compute two centroids
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 3) For each doc, compute similarity to each centroid
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)
    # Option 1: Classify whichever is higher
    if sim_nonpub > sim_pub:
        doc['dual_centroid_pred'] = 0  # non-publishable
    else:
        doc['dual_centroid_pred'] = 1  # publishable

# 4) Visualization in 2D PCA
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10, 7))
# color docs by 'dual_centroid_pred'
dual_labels = [doc['dual_centroid_pred'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=dual_labels, cmap='bwr', alpha=0.6)
plt.title("Dual Centroid Classification (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# Mark the known labeled docs
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0 or fname in labeled_files_class1:
        # black edge circle
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5)
        if fname in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
plt.show()
```
Notes:
- Logistic Regression: strong regularization (the C parameter) and cross-validation help mitigate overfitting.
- Combining Multiple Embeddings
- Topic Modeling
- Dual-Centroid
- Interpreting the Plots
With these pieces in place, you can experiment in separate notebooks:
You’ll then see which method yields the fewest false positives while also reducing false negatives, hopefully improving on your baseline. Good luck!
Topic #0: 0.006*"image" + 0.005*"images" + 0.004*"features" + 0.004*"car" + 0.003*"tags" + 0.003*"feature" + 0.003*"classification" + 0.003*"word" + 0.003*"visual" + 0.003*"label"
Topic #1: 0.008*"entropy" + 0.005*"food" + 0.005*"patterns" + 0.004*"system" + 0.004*"llm" + 0.003*"team" + 0.003*"concept" + 0.003*"caffeine" + 0.003*"diffusion" + 0.003*"behavior"
Topic #2: 0.008*"led" + 0.008*"patterns" + 0.007*"human" + 0.005*"properties" + 0.005*"field" + 0.005*"implications" + 0.005*"particularly" + 0.005*"behavior" + 0.005*"profound" + 0.004*"soil"
Topic #3: 0.005*"datasets" + 0.004*"tasks" + 0.004*"input" + 0.004*"features" + 0.004*"task" + 0.004*"feature" + 0.003*"neural" + 0.003*"image" + 0.003*"001" + 0.003*"language"
Topic #4: 0.008*"engine" + 0.007*"system" + 0.006*"llm" + 0.005*"patterns" + 0.004*"design" + 0.004*"plasticity" + 0.004*"dna" + 0.004*"human" + 0.003*"particularly" + 0.003*"items"
I used your topic modeling code in a completely new and different notebook and got this output, what does this mean? How can we make the model recognize irrelevant claims and other topics that do not belong in a machine learning setting. Then classify based on that?
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1) # shape: [n_docs, d1 + d2]
I also want to try this embedding concatenation and see how it performs. You know my existing code and how I already use one embedding for the similarity index and k-means clustering. Based on that, can you give me code to make it work with my current code? Then give end-to-end code for concatenated_predictions using whatever method is best, and also a plot to visualize and analyse it. Make it consistent with the context of my original code.
Below is a structured explanation and code on:
From your example:
```
Topic #0: 0.006*"image" + 0.005*"images" + 0.004*"features" + 0.004*"car" ...
Topic #1: 0.008*"entropy" + 0.005*"food" + 0.005*"patterns" + 0.004*"system" ...
Topic #2: 0.008*"led" + 0.008*"patterns" + 0.007*"human" + 0.005*"properties" ...
Topic #3: 0.005*"datasets" + 0.004*"tasks" + 0.004*"input" + 0.004*"features" ...
Topic #4: 0.008*"engine" + 0.007*"system" + 0.006*"llm" + 0.005*"patterns" ...
```
Each topic is basically a distribution over words (with probabilities). The top words give you a clue about what that topic is about. For instance:
If you see a topic with words like “recipe,” “oven,” “salt,” or “cooking,” that’s likely nonsense in an ML context. In your example, some references to “food” or “caffeine” might be suspicious if your domain is purely machine learning.
After training LDA, you can:
Caution: LDA is just unsupervised clustering of words. Sometimes an “irrelevant” topic might appear if your corpus has unusual domain text, or it might partially overlap with real ML concepts in unexpected ways. So treat this as an additional signal (like a “red flag” score), not an absolute truth.
You can certainly use two (or more) transformer models to embed each paper. Suppose doc_embeddings_1 has shape [n_docs, d1] and doc_embeddings_2 has shape [n_docs, d2]. Concatenate them:
```python
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
# Now shape: [n_docs, d_1 + d_2]
```
You can then do exactly what you've been doing (K-Means, similarity measure, etc.) on combined_embeddings. The biggest difference: instead of doc_embeddings everywhere, you swap in combined_embeddings.
Potential Pitfalls: the dimensionality grows, so if it becomes unwieldy you can apply dimensionality reduction (e.g., PCA) on combined_embeddings to bring it down to 512 or 256 dimensions before continuing.
Below is a simplified pipeline illustrating how to take two embedding sets (doc_embeddings_1, doc_embeddings_2), build combined_embeddings, and run the usual K-Means, similarity, and visualization steps. Feel free to adjust the details to match your environment.
```python
###############################################################################
# Step 0: Setup & Basic Definitions
###############################################################################
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from collections import Counter

# Labeled sets (example)
labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}  # non-pub
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}  # pub

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

###############################################################################
# Step 1: Suppose you have two sets of embeddings for each doc
#   doc_embeddings_1, doc_embeddings_2
#   shape = [n_docs, d1], [n_docs, d2]
###############################################################################
# For demonstration, let's pretend doc_embeddings_1 and doc_embeddings_2 exist.
# You should replace them with your actual arrays from your environment.
# e.g., doc_embeddings_1 = ...
# e.g., doc_embeddings_2 = ...
# We'll assume you also have the 'documents' list of dicts as usual.
# => documents[i]['filename'], etc.

###############################################################################
# Step 2: Concatenate them
###############################################################################
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
print("Shape of combined embeddings:", combined_embeddings.shape)

###############################################################################
# Step 3: K-Means on combined embeddings
###############################################################################
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(combined_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = int(labels[i])

cluster_counts = Counter(labels)
print("Cluster distribution:", cluster_counts)

###############################################################################
# Step 4: Compute Non-Publishable Centroid & Similarity (optional)
###############################################################################
# We'll collect embeddings for the known non-publishable docs
known_nonpub_inds = []
for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0:
        known_nonpub_inds.append(i)

nonpub_embeddings = combined_embeddings[known_nonpub_inds]
nonpub_centroid = np.mean(nonpub_embeddings, axis=0)

# For each doc, compute similarity
for i, doc in enumerate(documents):
    emb = combined_embeddings[i]
    sim_val = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = float(sim_val)

# Example threshold
similarity_threshold = 0.5
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold)

###############################################################################
# Step 5: Visualization (PCA 2D of the combined embeddings)
###############################################################################
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(combined_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters on Combined Embeddings (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['filename']
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
    # Highlight maybe_nonpub docs
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange', marker='s', s=120, linewidths=1.5)

unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")
plt.show()
```
In summary, the pipeline concatenates doc_embeddings_1 and doc_embeddings_2 to get combined_embeddings, runs K-Means on combined_embeddings and stores cluster labels in doc['cluster_label'], computes similarity_to_nonpub against the non-publishable centroid, and sets doc['maybe_nonpub'] = True if similarity > 0.5 (or whichever threshold you prefer).
If you suspect a doc is about "food/cooking/recipes" while it claims to be "machine learning," you can do something like:
```python
# For each doc in df_topics (the LDA pipeline):
#   row['topic_distribution'] => list of (topic_id, prob)
irrelevant_topic_ids = [1]   # Suppose we decided topic #1 is "food" or nonsense
threshold = 0.6              # if a doc's top topic is irrelevant with prob >= 0.6 => suspicious

df_topics['suspect_topic'] = False
for i, row in df_topics.iterrows():
    topics_prob = row['topic_distribution']
    top_topic_id, top_topic_prob = max(topics_prob, key=lambda x: x[1])
    if (top_topic_id in irrelevant_topic_ids) and (top_topic_prob >= threshold):
        # Write via .at so the flag actually persists in the DataFrame
        # (assigning to `row` inside iterrows would be lost)
        df_topics.at[i, 'suspect_topic'] = True
```
You can incorporate this as an additional dimension in your main pipeline, e.g., a boolean feature has_suspect_topic that you feed into your logistic regression or use as a simple rule.
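A minimal sketch of that wiring (assuming df_topics now has a suspect_topic column and that the CSV filenames match the ones in your documents list):

```python
import numpy as np

# Map the LDA-based flag back onto the documents list by filename
suspect_lookup = dict(zip(df_topics['filename'], df_topics['suspect_topic']))
for doc in documents:
    doc['has_suspect_topic'] = bool(suspect_lookup.get(doc['filename'], False))

# When building feature vectors, append the flag as one extra dimension
feature_vectors = []
for i, doc in enumerate(documents):
    vec = np.concatenate([doc_embeddings[i],
                          [doc['similarity_to_nonpub']],
                          [1.0 if doc['has_suspect_topic'] else 0.0]])
    feature_vectors.append(vec)
feature_vectors = np.array(feature_vectors)  # shape: [n_docs, embedding_dim + 2]
```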
This should give you a complete sense of how to interpret LDA outputs for “nonsense” detection and how to combine multiple embeddings in your existing pipeline, with an end-to-end example showing K-Means clustering, similarity thresholding, and a PCA visualization. Good luck!
###############################################################################
###############################################################################
import numpy as np
nonpub_embeddings = []
pub_embeddings = []
for i, doc in enumerate(documents):
fname = doc['filename']
if fname in labeled_files_class0:
nonpub_embeddings.append(doc_embeddings[i])
elif fname in labeled_files_class1:
pub_embeddings.append(doc_embeddings[i])
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)
def cos_sim(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
for i, doc in enumerate(documents):
emb = doc_embeddings[i]
sim_nonpub = cos_sim(emb, centroid_nonpub)
sim_pub = cos_sim(emb, centroid_pub)
# Option 1: Classify whichever is higher
if sim_nonpub > sim_pub:
doc['dual_centroid_pred'] = 0 # non-publishable
else:
doc['dual_centroid_pred'] = 1 # publishable
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]
plt.figure(figsize=(10,7))
dual_labels = [doc['dual_centroid_pred'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=dual_labels, cmap='bwr', alpha=0.6)
plt.title("Dual Centroid Classification (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
for i, doc in enumerate(documents):
fname = doc['filename']
if fname in labeled_files_class0 or fname in labeled_files_class1:
# black edge circle
plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5)
if fname in labeled_files_class0:
label_text = "C0"
else:
label_text = "C1"
plt.text(x_coords[i]+0.02, y_coords[i]+0.02, label_text, fontsize=9)
plt.show()
final_df = pd.DataFrame(documents)
final_df[(final_df.dual_centroid_pred != final_df.maybe_nonpub)]
non_publishable_list = final_df[(final_df['dual_centroid_pred'] == 0)].filename.to_list()
This is the code I'm using that you gave; the model I used is SciBERT. It seems to be giving the uncontested best predictions and performance, so I want to try to improve it. From this code you can see that we only compute the centroids once, from the initial labeled examples. I want to extend it so that the centroids get updated over time using the newly added points and become better centroids. What do you think of this idea? Yes, I know that if we include false positives, it could make the centroids go wrong and quickly snowball.
Why are we using similarity for assigning to a centroid instead of traditional distance measures? I'm already using the code based on the similarity measure, so now I want to try a distance-based approach as well. Can I get two sets of code so I can compare their performance?
Also give me the ability to adjust the settings, so that I can determine how similar documents have to be to be assigned to a class, how many iterations to run, etc. Give me a guide for this.
Also, for inference, how do I go about that when I want to add new papers and find out which class they belong to? Do I use the similarity-based method or k-nearest neighbors, and why should I choose one over the other? Think it through and give me guidance.
Finally give me four pieces of code:
For all of these methods, after running them, the labels should be produced accordingly and stored in the dataframe, with the respective column names as specified:
If there are any other approaches to improve, please guide me on that as well
Below is a comprehensive guide that addresses your requests:
At the end, each code snippet stores its final 0/1 labels into specific columns in documents, namely: doc['original_similarity'], doc['original_distance'], doc['iterative_similarity'], and doc['iterative_distance']. This way, you end up with a single DataFrame that has the results of all four approaches.
Currently, your dual-centroid approach uses 5 known non-publishable and 10 known publishable examples to compute one non-publishable centroid and one publishable centroid. This works well if those examples are representative and you don’t suspect labeling errors.
Hence, any iterative approach should have a high confidence threshold (on similarity or low distance) to minimize false positives creeping in.
If you want a distance-based approach, compute the Euclidean distance dist(x, c) = sqrt(sum_i (x_i - c_i)^2) between each document embedding x and each centroid c, and pick the centroid that yields the lowest distance.
Once you have your final centroids (either after one-shot or iterative updates):
Similarity Approach:
k-Nearest Neighbors:
When to choose KNN:
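Since the notes on when to prefer KNN are brief here, a minimal, hedged sketch of what KNN-based inference could look like on your data structures (doc_embeddings plus the 15 labeled filenames); the parameter choices are illustrative, not tuned:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Fit KNN on the 15 labeled embeddings (0 = non-publishable, 1 = publishable)
X_lab, y_lab = [], []
for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0:
        X_lab.append(doc_embeddings[i]); y_lab.append(0)
    elif doc['filename'] in labeled_files_class1:
        X_lab.append(doc_embeddings[i]); y_lab.append(1)

knn = KNeighborsClassifier(n_neighbors=3, metric='cosine')
knn.fit(np.array(X_lab), np.array(y_lab))

# Inference for a new paper: embed it with the same transformer, then predict.
# new_embedding = get_document_embedding(new_paper_text)   # your embedding helper (assumed)
# print(knn.predict(new_embedding.reshape(1, -1)),
#       knn.predict_proba(new_embedding.reshape(1, -1)))
```

With only 15 labels, the centroid/similarity route is usually more stable; KNN becomes more attractive once you trust a larger pool of pseudo-labeled papers.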
We'll assume you already have: documents, where each doc has at least doc['filename']; and doc_embeddings, a NumPy array of shape [n_docs, embedding_dim]. The labeled sets are:

```python
labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
```

The four label columns produced are original_similarity, original_distance, iterative_similarity, and iterative_distance. We will define each snippet in turn. At the end, you can run them all in sequence (or in separate cells) and end up with all 4 label columns in documents.
```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_dist(a, b):
    return np.sqrt(np.sum((a - b)**2))
```
One-shot approach: pick whichever centroid yields a higher similarity. Store the predicted label in doc['original_similarity'].
```python
############################
# A) ORIGINAL SIMILARITY
############################
# 1) Gather embeddings for known labeled docs
nonpub_indices = []
pub_indices = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_indices.append(i)
    elif fname in labeled_files_class1:
        pub_indices.append(i)

nonpub_embeddings = doc_embeddings[nonpub_indices]
pub_embeddings = doc_embeddings[pub_indices]

# 2) Compute centroids
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

# 3) Classify each doc by comparing cos_sim to each centroid
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)
    # 0 => non-publishable, 1 => publishable
    if sim_nonpub > sim_pub:
        doc['original_similarity'] = 0
    else:
        doc['original_similarity'] = 1
```
Same idea, but we pick whichever centroid yields a lower distance. Store the predicted label in doc['original_distance'].
```python
############################
# B) ORIGINAL DISTANCE
############################
# We assume the same centroid_nonpub, centroid_pub from above.
# If you are running this code separately, recompute them:
nonpub_indices = []
pub_indices = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_indices.append(i)
    elif fname in labeled_files_class1:
        pub_indices.append(i)

nonpub_embeddings = doc_embeddings[nonpub_indices]
pub_embeddings = doc_embeddings[pub_indices]
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

# Classify using distance
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    dist_nonpub = euclidean_dist(emb, centroid_nonpub)
    dist_pub = euclidean_dist(emb, centroid_pub)
    # 0 => non-publishable, 1 => publishable
    if dist_nonpub < dist_pub:
        doc['original_distance'] = 0
    else:
        doc['original_distance'] = 1
```
We iteratively include new docs whose similarity to a particular centroid surpasses a threshold. For example:
- If sim_to_nonpub > threshold_nonpub and sim_to_nonpub > sim_to_pub (i.e., it's really closer to non-pub), we add that doc to the nonpub set.
- If sim_to_pub > threshold_pub and sim_to_pub > sim_to_nonpub, we add that doc to the pub set.
- We stop after max_iterations.
Parameters you might want to tune: threshold_nonpub and threshold_pub (e.g. 0.60 or 0.65), and max_iterations (e.g. 5). The code below lumps them into one threshold if you prefer. We store the final label in doc['iterative_similarity'].
```python
############################
# C) ITERATIVE SIMILARITY
############################
def iterative_dual_centroid_similarity(doc_embeddings, documents,
                                       labeled_files_class0, labeled_files_class1,
                                       threshold=0.65, max_iterations=5):
    """
    Iteratively expand the sets of non-publishable and publishable docs based on
    which centroid they're more similar to, provided that similarity > threshold.
    Recompute centroids each iteration.
    """
    # 1) Initialize sets from your labeled data
    nonpub_set = set(labeled_files_class0)
    pub_set = set(labeled_files_class1)

    iteration = 0
    changed = True
    while iteration < max_iterations and changed:
        iteration += 1
        changed = False

        # Compute centroids
        nonpub_embs = []
        pub_embs = []
        for i, doc in enumerate(documents):
            if doc['filename'] in nonpub_set:
                nonpub_embs.append(doc_embeddings[i])
            elif doc['filename'] in pub_set:
                pub_embs.append(doc_embeddings[i])

        # If one class is empty, skip to avoid error
        if len(nonpub_embs) == 0 or len(pub_embs) == 0:
            print("One set is empty! Stopping iterative approach.")
            break

        centroid_nonpub = np.mean(nonpub_embs, axis=0)
        centroid_pub = np.mean(pub_embs, axis=0)

        # 2) For each unlabeled doc, see if we can add it
        for i, doc in enumerate(documents):
            fname = doc['filename']
            # Skip if it's already in either set
            if fname in nonpub_set or fname in pub_set:
                continue
            emb = doc_embeddings[i]
            sim_nonpub = cos_sim(emb, centroid_nonpub)
            sim_pub = cos_sim(emb, centroid_pub)
            # If it's clearly more similar to nonpub (and above threshold)
            if sim_nonpub > sim_pub and sim_nonpub >= threshold:
                nonpub_set.add(fname)
                changed = True
            # If it's clearly more similar to pub (and above threshold)
            elif sim_pub > sim_nonpub and sim_pub >= threshold:
                pub_set.add(fname)
                changed = True

        print(f"Iteration {iteration} done. Non-pub size: {len(nonpub_set)}, Pub size: {len(pub_set)}")
        if not changed:
            print("No changes in this iteration, stopping early.")

    # Final classification
    # If a doc is in nonpub_set => 0, if in pub_set => 1, else break tie
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in nonpub_set and fname not in pub_set:
            doc['iterative_similarity'] = 0
        elif fname in pub_set and fname not in nonpub_set:
            doc['iterative_similarity'] = 1
        else:
            # doc not classified or possibly in both sets somehow;
            # break ties by whichever centroid is more similar
            emb = doc_embeddings[i]
            sim_nonpub = cos_sim(emb, centroid_nonpub)
            sim_pub = cos_sim(emb, centroid_pub)
            doc['iterative_similarity'] = 0 if sim_nonpub >= sim_pub else 1

# Call the function:
iterative_dual_centroid_similarity(doc_embeddings, documents,
                                   labeled_files_class0, labeled_files_class1,
                                   threshold=0.65, max_iterations=5)
```
Feel free to tune the threshold. If you set threshold=0.80, you'll be very conservative about adding new docs. If you set threshold=0.50, you'll likely add more docs (risking false positives).
Same iterative logic, but now we measure Euclidean distance to each centroid. If a doc is closer to the non-pub centroid by some margin, we add it to that set, provided it’s under some threshold. The threshold means “distance must be less than X” for us to add it confidently.
```python
############################
# D) ITERATIVE DISTANCE
############################
def iterative_dual_centroid_distance(doc_embeddings, documents,
                                     labeled_files_class0, labeled_files_class1,
                                     distance_threshold=1.0, max_iterations=5):
    """
    Iteratively expand the sets of non-publishable and publishable docs based on
    which centroid they are closer to, provided the distance < distance_threshold.
    Recompute centroids each iteration.
    """
    nonpub_set = set(labeled_files_class0)
    pub_set = set(labeled_files_class1)

    iteration = 0
    changed = True
    while iteration < max_iterations and changed:
        iteration += 1
        changed = False

        # Compute centroids
        nonpub_embs = []
        pub_embs = []
        for i, doc in enumerate(documents):
            if doc['filename'] in nonpub_set:
                nonpub_embs.append(doc_embeddings[i])
            elif doc['filename'] in pub_set:
                pub_embs.append(doc_embeddings[i])

        if len(nonpub_embs) == 0 or len(pub_embs) == 0:
            print("One set is empty! Stopping iterative approach.")
            break

        centroid_nonpub = np.mean(nonpub_embs, axis=0)
        centroid_pub = np.mean(pub_embs, axis=0)

        for i, doc in enumerate(documents):
            fname = doc['filename']
            if fname in nonpub_set or fname in pub_set:
                continue
            emb = doc_embeddings[i]
            dist_nonpub = euclidean_dist(emb, centroid_nonpub)
            dist_pub = euclidean_dist(emb, centroid_pub)
            # If it's strictly closer to nonpub and below threshold
            if dist_nonpub < dist_pub and dist_nonpub <= distance_threshold:
                nonpub_set.add(fname)
                changed = True
            # If it's strictly closer to pub and below threshold
            elif dist_pub < dist_nonpub and dist_pub <= distance_threshold:
                pub_set.add(fname)
                changed = True

        print(f"Iteration {iteration} done. Non-pub size: {len(nonpub_set)}, Pub size: {len(pub_set)}")
        if not changed:
            print("No changes in this iteration, stopping early.")

    # Final classification
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in nonpub_set and fname not in pub_set:
            doc['iterative_distance'] = 0
        elif fname in pub_set and fname not in nonpub_set:
            doc['iterative_distance'] = 1
        else:
            # break tie by whichever centroid is closer
            emb = doc_embeddings[i]
            dist_nonpub = euclidean_dist(emb, centroid_nonpub)
            dist_pub = euclidean_dist(emb, centroid_pub)
            doc['iterative_distance'] = 0 if dist_nonpub <= dist_pub else 1

# Call the function:
iterative_dual_centroid_distance(doc_embeddings, documents,
                                 labeled_files_class0, labeled_files_class1,
                                 distance_threshold=1.0, max_iterations=5)
```
Note: You need to choose a “distance threshold” that makes sense in your embedding space. You might try 1.0, 1.5, 2.0, etc., then see if new docs get added. If your embeddings are large in magnitude, you might need a bigger threshold. If your embeddings are normalized (like in some BERT variants), typical Euclidean distances might be < 2.0.
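One way to pick a sensible starting value is to look at how far your 15 labeled documents sit from their own class centroid versus the opposite one; a threshold somewhere between those two ranges is a reasonable first guess. This is a rough sketch, assuming the `euclidean_dist` helper used in the code above and the existing `doc_embeddings`, `documents`, and labeled-file sets:

```python
# Rough calibration sketch: inspect distances of the labeled docs to both centroids.
nonpub_embs = [doc_embeddings[i] for i, d in enumerate(documents)
               if d['filename'] in labeled_files_class0]
pub_embs = [doc_embeddings[i] for i, d in enumerate(documents)
            if d['filename'] in labeled_files_class1]
centroid_nonpub = np.mean(nonpub_embs, axis=0)
centroid_pub = np.mean(pub_embs, axis=0)

own, other = [], []
for i, d in enumerate(documents):
    if d['filename'] in labeled_files_class0:
        own.append(euclidean_dist(doc_embeddings[i], centroid_nonpub))
        other.append(euclidean_dist(doc_embeddings[i], centroid_pub))
    elif d['filename'] in labeled_files_class1:
        own.append(euclidean_dist(doc_embeddings[i], centroid_pub))
        other.append(euclidean_dist(doc_embeddings[i], centroid_nonpub))

print(f"own-centroid distances:   min={min(own):.3f}  mean={np.mean(own):.3f}  max={max(own):.3f}")
print(f"other-centroid distances: min={min(other):.3f}  mean={np.mean(other):.3f}  max={max(other):.3f}")
```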
After running all four methods (A, B, C, D) in sequence, each doc in `documents` will have:

- `doc['original_similarity']`
- `doc['original_distance']`
- `doc['iterative_similarity']`
- `doc['iterative_distance']`
You can then do:
```python
final_df = pd.DataFrame(documents)
print(final_df[['filename', 'original_similarity', 'original_distance',
                'iterative_similarity', 'iterative_distance']])
```
to see the final classification side by side.
A few additional tips:

- You can require a margin before accepting a doc, e.g. `sim_nonpub - sim_pub > 0.1` or `(dist_pub - dist_nonpub) > 0.5`. This helps filter out borderline cases.
- Compare the four result columns in `documents` and see which approach yields the best results on your 15 labeled examples (lowest false positives/negatives); see the sketch below.
- You can also tune the thresholds and `max_iterations`.

This should give you ample flexibility to compare all four dual-centroid methods (similarity vs. distance, one-shot vs. iterative) and store results in your DataFrame. Good luck!
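For that comparison, a small helper like the following works. This is a sketch that assumes the four result columns above have been filled in; keep in mind it is an in-sample check, since these 15 papers also seeded the centroids:

```python
# Sketch: compare each method's predictions against the 15 labeled papers.
from sklearn.metrics import classification_report

true_labels = {fname: 0 for fname in labeled_files_class0}
true_labels.update({fname: 1 for fname in labeled_files_class1})

methods = ['original_similarity', 'original_distance',
           'iterative_similarity', 'iterative_distance']
for method in methods:
    y_true, y_pred = [], []
    for doc in documents:
        if doc['filename'] in true_labels:
            y_true.append(true_labels[doc['filename']])
            y_pred.append(doc[method])
    print(f"\n=== {method} ===")
    print(classification_report(y_true, y_pred, digits=3))
```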
All 4 methods give the exact same classification report and confusion matrix, so I guess I will stick with the first one. But I have a question. Our labeled dataset is only of size 15, with 5 being class 0 and 10 being class 1. Although the embeddings were generated on their own, we calculated nonpub_centroid and pub_centroid from the existing labeled data's embeddings.
Since we only do a similarity measure against these centroids, could that mean we are overfitting?
When I ran a classification report using the test data labels and the labels predicted via maybe_nonpub, I got all perfect scores in the report: F1, precision, accuracy, and recall are all 1.
But we didn't do supervised learning, so the model didn't learn from the documents. Does having centroids based on the existing labeled values still count as overfitting, given this classification report?
I also manually labeled some of the unlabeled data for my own inference. This manually labeled data cannot be used to compute F1, accuracy, precision, and recall, because it is originally unlabeled and doing so could go against the rules of the competition I am participating in, so I can only use it for my own inference. From this manually labeled data, the model seems to perform quite well and generalise decently enough. But on the test data, I get perfect classification metrics.
For more context, the data given to us is only public data; the organisers have more private data and will test on that. Guide me through this and think deeply. I have to write a report on this process when I make a submission. Will the judges consider this an overfit?
Just to be safe, can you give me a K-Means clustering algorithm that does not base its cluster centroids on the labeled data: just let K-Means do its own thing and find 2 clusters, and from that I can split the data into classes. Give me this code, and make sure to include 2D projections with PCA, t-SNE, and Isomap for this method. The visualization should highlight the given labeled data, i.e., those 15 originally labeled examples. Make sure the code works with my existing data structures and code and how it's structured.
Below is a detailed explanation of your question about whether the centroid-based classification is "overfitting," followed by a K-Means approach with 2 clusters that completely ignores the labels when clustering. Then we visualize the K-Means results with three different 2D projections (PCA, t-SNE, and ISOMAP) in a single code snippet, highlighting your 15 labeled examples.
Is it Overfitting?
Strictly speaking, you did not fit any model parameters, so this is not overfitting in the classic sense. But evaluating on the same 15 labeled papers that were used to build the centroids is an in-sample evaluation: those documents pull the centroids toward themselves, so perfect F1/precision/recall on them is expected and does not, by itself, demonstrate generalization. In your report, present those perfect scores as a sanity check rather than as the headline result, and be explicit that the labeled papers also defined the centroids. If your manual checks on unlabeled data suggest it's doing "decently," that's a good sign. But keep in mind, your final performance on the private data is what the judges will rely on to see if it truly generalizes or not.
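If you want a less circular estimate from only the 15 labeled papers, one option (a sanity check, not something the competition requires) is a leave-one-out loop: hold one labeled paper out, rebuild the centroids from the other 14, and check whether the held-out paper is still classified correctly. A minimal sketch, assuming the `cos_sim` helper from the earlier dual-centroid code, `doc_embeddings`, and the labeled sets:

```python
# Leave-one-out sanity check on the 15 labeled papers (sketch).
labeled = [(i, 0) for i, d in enumerate(documents) if d['filename'] in labeled_files_class0] + \
          [(i, 1) for i, d in enumerate(documents) if d['filename'] in labeled_files_class1]

correct = 0
for held_idx, held_label in labeled:
    # Rebuild centroids without the held-out paper
    nonpub = [doc_embeddings[i] for i, lab in labeled if lab == 0 and i != held_idx]
    pub = [doc_embeddings[i] for i, lab in labeled if lab == 1 and i != held_idx]
    c0, c1 = np.mean(nonpub, axis=0), np.mean(pub, axis=0)
    pred = 0 if cos_sim(doc_embeddings[held_idx], c0) >= cos_sim(doc_embeddings[held_idx], c1) else 1
    correct += int(pred == held_label)

print(f"Leave-one-out accuracy on the 15 labeled papers: {correct}/{len(labeled)}")
```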
You asked for a purely unsupervised K-Means approach that does not use the labeled data to define the cluster centers. Instead, we let K-Means discover 2 clusters from the entire dataset (150 documents). Then we can see if cluster 0 aligns well with non-publishable or cluster 1 aligns well with publishable, etc.
Below is a single code snippet that does all three visualizations in subplots. Adjust as you see fit.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

# Suppose we have:
# doc_embeddings: shape [n_docs, d]
# documents: list of dicts, each with doc['filename']
# labeled_files_class0 = { "R001.pdf", ..., "R005.pdf" }
# labeled_files_class1 = { "R006.pdf", ..., "R015.pdf" }

# 1) Perform K-Means with 2 clusters, ignoring labels
kmeans_2 = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans_2.fit_predict(doc_embeddings)

# Store cluster labels in documents
for i, doc in enumerate(documents):
    doc['kmeans_label'] = int(kmeans_labels[i])

print("Finished K-Means with 2 clusters. Label distribution:")
unique, counts = np.unique(kmeans_labels, return_counts=True)
for lab, count in zip(unique, counts):
    print(f"Cluster {lab}: {count} documents")

# 2) We'll define a helper function to plot 2D projections
def plot_2d_projection(proj_matrix, title):
    """
    proj_matrix: shape [n_docs, 2]
    title: string for the plot
    """
    x_coords = proj_matrix[:, 0]
    y_coords = proj_matrix[:, 1]

    # color by kmeans_label
    clusters = [doc['kmeans_label'] for doc in documents]
    scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
    plt.title(title)
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")

    # highlight the 15 labeled docs
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in labeled_files_class0 or fname in labeled_files_class1:
            # black edge circle
            plt.scatter(x_coords[i], y_coords[i], c='none',
                        edgecolors='black', marker='o', s=100, linewidths=1.5)
            if fname in labeled_files_class0:
                label_text = "C0"
            else:
                label_text = "C1"
            plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

    # Optionally add a legend for cluster coloring
    unique_clusters = list(set(clusters))
    handles = scatter.legend_elements()[0]
    legend_labels = [f"Cluster {uc}" for uc in unique_clusters]
    plt.legend(handles, legend_labels, title="K-Means Clusters", loc="best")

# 3) Prepare 2D projections: PCA, t-SNE, and ISOMAP
pca = PCA(n_components=2, random_state=42)
pca_2d = pca.fit_transform(doc_embeddings)

tsne = TSNE(n_components=2, random_state=42)
tsne_2d = tsne.fit_transform(doc_embeddings)

iso = Isomap(n_components=2)
iso_2d = iso.fit_transform(doc_embeddings)

# 4) Plot them in subplots
fig = plt.figure(figsize=(18, 5))

plt.subplot(1, 3, 1)
plot_2d_projection(pca_2d, "PCA (2D) of doc_embeddings")

plt.subplot(1, 3, 2)
plot_2d_projection(tsne_2d, "t-SNE (2D) of doc_embeddings")

plt.subplot(1, 3, 3)
plot_2d_projection(iso_2d, "ISOMAP (2D) of doc_embeddings")

plt.tight_layout()
plt.show()
```
Note: K-Means cluster IDs are arbitrary; cluster 0 is not automatically "non-publishable." After clustering, you still need to map each cluster to a class, for example by checking which cluster contains the majority of your labeled class-0 and class-1 papers.
With this code, you can show a simpler unsupervised approach to the judges, demonstrating you didn't rely solely on your labeled docs to define the entire classification logic. Good luck!
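Here is a small sketch of that mapping step, assuming the `kmeans_label` field set above; `kmeans_class` is a new field name introduced here, and falling back to class 1 (the majority class) for a cluster with no labeled papers is just one possible choice:

```python
# Sketch: map arbitrary K-Means cluster IDs to classes via the labeled papers.
from collections import Counter

# Count, for each cluster, how many labeled class-0 / class-1 papers it contains.
votes = {}
for doc in documents:
    cl = doc['kmeans_label']
    votes.setdefault(cl, Counter())
    if doc['filename'] in labeled_files_class0:
        votes[cl][0] += 1
    elif doc['filename'] in labeled_files_class1:
        votes[cl][1] += 1

# Majority vote per cluster; default to 1 ("publishable") if a cluster has no labeled docs.
cluster_to_class = {cl: (c.most_common(1)[0][0] if c else 1) for cl, c in votes.items()}
print("Cluster -> class mapping:", cluster_to_class)

for doc in documents:
    doc['kmeans_class'] = cluster_to_class[doc['kmeans_label']]
```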
It is crazy: now, with pure K-Means, I am getting the exact same result. Why is that?
Until now we have been calculating embeddings using the function below:

```python
def get_document_embedding(processed_sections):
    section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()]
    if not section_texts:
        return np.zeros(768)
    embeddings = embed_model.encode(section_texts, convert_to_numpy=True)
    return np.mean(embeddings, axis=0)
```
For this, we calculate embeddings for all sections and then just average them, which makes me think we lose a lot of information. So I want you to modify this code so that each section's embedding is also stored under that section's name. If some documents don't have a given section, it's fine to store null/np.zeros as you already did. We can also keep the document-level embedding. Now change the K-Means code to run K-Means on every single section embedding and plot them as well using the code you gave earlier. Then also give the code for the overall document-level embedding visualization. Make sure you modularise the visualization function so we can loop it over section-level embeddings.
Below are the only sections in the data:
['Unlabeled', 'Abstract', 'Introduction', 'Methods', 'Results',
'Related work', 'Experiments', 'Conclusion', 'Discussion',
'Background', 'References']
There are no other sections occurring in the data.
Below is a complete solution that (1) builds per-section embeddings plus a document-level embedding, (2) runs K-Means on each section's embeddings and on the doc-level embeddings, and (3) visualizes each clustering in 2D (PCA, t-SNE, ISOMAP) with the 15 labeled papers highlighted.
We'll assume you have:

- `documents`, where `doc['sections']` is a list of `(section_name, text)` tuples.
- `embed_model` (like SciBERT) that provides an `.encode(...)` method returning a 768-dim vector (or a different dimension; just adjust accordingly).

```python
labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
```
```python
SECTIONS = [
    "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
    "Related work", "Experiments", "Conclusion", "Discussion",
    "Background", "References"
]
```
Feel free to adjust naming or add your own custom logic. This code is meant to be modular so you can run a loop over all sections and produce separate K-Means + 2D visualizations.
```python
import numpy as np

def build_section_and_doc_embeddings(documents, embed_model):
    """
    For each document in 'documents', build:
      doc['section_embeddings'] = {
          "Abstract": <768-dim embedding or zeros>,
          "Introduction": <768-dim embedding or zeros>,
          ...
      }
      doc['doc_embedding'] = <768-dim average across all available sections>

    'documents' is a list of dicts, each with:
      doc['sections'] = list of (sec_name, sec_text)
    embed_model.encode() returns a numpy array embedding for given text input.
    """
    SECTIONS = [
        "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
        "Related work", "Experiments", "Conclusion", "Discussion",
        "Background", "References"
    ]

    for doc in documents:
        # Initialize a dict that maps section_name -> embedding,
        # e.g. doc['section_embeddings']['Abstract'] = np.array([...])
        doc['section_embeddings'] = {}

        # We might have multiple sections in doc['sections']. Gather them in a dict for easy access,
        # e.g. actual_sections_found = { "Abstract": text_for_abstract, ... }
        actual_sections_found = {}
        for (sec_name, sec_text) in doc['sections']:
            actual_sections_found[sec_name] = sec_text

        # For each possible section in SECTIONS, build an embedding
        for section_name in SECTIONS:
            if section_name in actual_sections_found:
                raw_text = actual_sections_found[section_name].strip()
                if len(raw_text) == 0:
                    # If empty text, store zeros
                    doc['section_embeddings'][section_name] = np.zeros(
                        embed_model.get_sentence_embedding_dimension())
                else:
                    # Encode this section's text.
                    # embed_model.encode(...) returns a 768-dim array if you're using SciBERT or similar.
                    # If you have large text, you might want to do chunking, but here's the simple approach.
                    embedding = embed_model.encode(raw_text, convert_to_numpy=True)
                    doc['section_embeddings'][section_name] = embedding
            else:
                # If doc doesn't have this section, store zeros
                doc['section_embeddings'][section_name] = np.zeros(
                    embed_model.get_sentence_embedding_dimension())

        # Now, compute doc['doc_embedding'] by averaging all non-zero embeddings
        nonzero_vectors = []
        for sec_name, emb in doc['section_embeddings'].items():
            # If emb is all zeros, skip
            if np.any(emb):  # i.e. not all zeros
                nonzero_vectors.append(emb)

        if len(nonzero_vectors) == 0:
            # If no valid sections, store a zeros vector
            doc['doc_embedding'] = np.zeros(embed_model.get_sentence_embedding_dimension())
        else:
            doc['doc_embedding'] = np.mean(nonzero_vectors, axis=0)
```
Notes:

- For each section that exists in a document, we call `embed_model.encode(raw_text, convert_to_numpy=True)` to get a single embedding for that entire section.
- Missing or empty sections get a zero vector, so every document ends up with an entry for every section name.

Next, we'll define functions that:

- gather the embeddings for a given section across all documents and run K-Means with a chosen `k` (for instance, `k=5`),
- store each document's cluster label in `doc['kmeans_{section_name}']`,
- and do the same at the document level using `doc_embedding`.
```python
from sklearn.cluster import KMeans

def run_kmeans_for_section(documents, section_name, n_clusters=5, random_state=42):
    """
    Gathers the embedding for 'section_name' from each doc and runs K-Means.
    Then stores the cluster label in doc['kmeans_{section_name}'].
    """
    # 1) Build an array of shape [n_docs, embedding_dim] for this section
    all_embeddings = []
    for doc in documents:
        # E.g. doc['section_embeddings']['Abstract']
        emb = doc['section_embeddings'][section_name]
        all_embeddings.append(emb)
    all_embeddings = np.array(all_embeddings)  # shape: [n_docs, d]

    # 2) K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
    labels = kmeans.fit_predict(all_embeddings)

    # 3) Store the label in the doc
    for i, doc in enumerate(documents):
        doc[f'kmeans_{section_name}'] = int(labels[i])

def run_kmeans_for_doc_embedding(documents, n_clusters=5, random_state=42):
    """
    Gathers doc['doc_embedding'] for each doc, runs K-Means,
    and stores cluster label in doc['kmeans_doc'].
    """
    # 1) Build array of shape [n_docs, d] from doc['doc_embedding']
    all_doc_emb = [doc['doc_embedding'] for doc in documents]
    all_doc_emb = np.array(all_doc_emb)

    # 2) K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
    labels = kmeans.fit_predict(all_doc_emb)

    # 3) Store label
    for i, doc in enumerate(documents):
        doc['kmeans_doc'] = int(labels[i])
```
The visualization function takes a numpy array of shape `[n_docs, d]` plus a list of cluster labels. It computes three 2D projections (PCA, t-SNE, and ISOMAP), creates a subplot for each, and highlights the labeled docs. This is very similar to your previous code, but more modular so we can re-use it for each section or for doc-level embeddings.
```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

def visualize_2d_projections(embeddings, cluster_labels, documents,
                             labeled_files_class0, labeled_files_class1,
                             title_prefix=""):
    """
    embeddings: shape [n_docs, d]
    cluster_labels: shape [n_docs], e.g. from KMeans
    documents: the list of doc dicts
    labeled_files_class0, labeled_files_class1: sets for highlight
    title_prefix: string to prefix in the title, e.g. "Abstract" or "doc_embedding"
    """
    # 1) Prepare the 3 transformations
    pca = PCA(n_components=2, random_state=42)
    pca_2d = pca.fit_transform(embeddings)

    tsne = TSNE(n_components=2, random_state=42)
    tsne_2d = tsne.fit_transform(embeddings)

    iso = Isomap(n_components=2)
    iso_2d = iso.fit_transform(embeddings)

    fig = plt.figure(figsize=(18, 5))

    # We'll define a helper closure to plot each sub-figure
    def plot_projection(ax, proj_matrix, method_name):
        x_coords = proj_matrix[:, 0]
        y_coords = proj_matrix[:, 1]

        scatter = ax.scatter(x_coords, y_coords, c=cluster_labels, cmap='tab10', alpha=0.6)
        ax.set_title(f"{title_prefix} - {method_name}")
        ax.set_xlabel("Component 1")
        ax.set_ylabel("Component 2")

        for i, doc in enumerate(documents):
            fname = doc['filename']
            if fname in labeled_files_class0 or fname in labeled_files_class1:
                ax.scatter(x_coords[i], y_coords[i], c='none',
                           edgecolors='black', marker='o', s=100, linewidths=1.5)
                label_text = "C0" if fname in labeled_files_class0 else "C1"
                ax.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

        unique_clusters = list(sorted(set(cluster_labels)))
        handles = scatter.legend_elements()[0]
        legend_labels = [f"Cluster {uc}" for uc in unique_clusters]
        ax.legend(handles, legend_labels, title="K-Means Clusters", loc="best")

    ax1 = fig.add_subplot(1, 3, 1)
    plot_projection(ax1, pca_2d, "PCA")

    ax2 = fig.add_subplot(1, 3, 2)
    plot_projection(ax2, tsne_2d, "t-SNE")

    ax3 = fig.add_subplot(1, 3, 3)
    plot_projection(ax3, iso_2d, "ISOMAP")

    plt.tight_layout()
    plt.show()
```
Below is an example workflow that you can adapt to your code:
```python
###############################################################################
# EXAMPLE WORKFLOW
###############################################################################

# 1) Build section embeddings and doc-level embeddings
build_section_and_doc_embeddings(documents, embed_model)

# 2) For each section in our known set, run K-Means + visualize
SECTIONS = [
    "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
    "Related work", "Experiments", "Conclusion", "Discussion",
    "Background", "References"
]

for section_name in SECTIONS:
    # 2a) Run K-Means
    run_kmeans_for_section(documents, section_name, n_clusters=5, random_state=42)

    # 2b) Gather the embeddings array for that section
    section_embs = []
    section_labels = []
    for i, doc in enumerate(documents):
        emb = doc['section_embeddings'][section_name]
        lbl = doc[f'kmeans_{section_name}']
        section_embs.append(emb)
        section_labels.append(lbl)
    section_embs = np.array(section_embs)
    section_labels = np.array(section_labels)

    # 2c) Visualize
    visualize_2d_projections(
        embeddings=section_embs,
        cluster_labels=section_labels,
        documents=documents,
        labeled_files_class0=labeled_files_class0,
        labeled_files_class1=labeled_files_class1,
        title_prefix=f"{section_name} Section"
    )

# 3) Do doc-level K-Means
run_kmeans_for_doc_embedding(documents, n_clusters=5, random_state=42)

# 4) Gather doc-level embeddings + labels
doc_embs = []
doc_labels = []
for i, doc in enumerate(documents):
    emb = doc['doc_embedding']
    lbl = doc['kmeans_doc']
    doc_embs.append(emb)
    doc_labels.append(lbl)
doc_embs = np.array(doc_embs)
doc_labels = np.array(doc_labels)

visualize_2d_projections(
    embeddings=doc_embs,
    cluster_labels=doc_labels,
    documents=documents,
    labeled_files_class0=labeled_files_class0,
    labeled_files_class1=labeled_files_class1,
    title_prefix="Document-Level"
)
```
What this does:

- Builds per-section embeddings in `doc['section_embeddings'][section_name]`.
- Builds a document-level embedding in `doc['doc_embedding']`.
- Runs K-Means for each section and for the document level (with `k=5` here, or `k=2` if you want a pure non-pub vs. pub split; that's up to you) and visualizes each clustering in 2D.

If you want exactly 2 clusters for "publishable vs. non-publishable," just replace `n_clusters=5` with `n_clusters=2`. Or do both and see how they differ.
This modular approach should integrate cleanly with your existing data structures (`documents`, `embed_model`). You can tweak `n_clusters`, or do further analysis on each section's clusters to see if, for instance, "Abstract" embeddings separate out certain flawed papers more clearly.
Good luck, and feel free to adapt the code as needed!
Even with these, I get the same results for the final classification. So what can we conclude?
Below are some reflections on why you might still be seeing identical final classifications despite introducing section-level embeddings, multiple clustering approaches, etc.:
Homogeneity in the Data
If most of your papers (both publishable and non-publishable) share similar language and vocabulary in every section—i.e., the “Results” and “Discussion” sections look more or less the same across many papers—then averaging or section-level embeddings might not introduce enough variability to shift the clusters. In other words, the model “sees” roughly the same signals no matter how you segment the text.
Strong Dominant Signal
Some domain-specific terms or writing style might dominate across all sections. For example, if every paper includes standard ML terminology or references, it might overwhelm the differences that typically differentiate publishable vs. non-publishable content. Thus, whether you look at the “Introduction” or “Methods” or combine them, the embeddings land in roughly the same space.
Limited Label Diversity
You only have 15 labeled examples (5 non-publishable, 10 publishable). If they are not diverse enough or if the unlabeled set is large and somewhat homogeneous, the clustering and centroid-based methods often yield the same outcome. The data itself may not be providing strong opposing signals to re-separate the clusters.
Robustness of the Embedding Model
If your chosen embedding model (e.g., SciBERT) consistently encodes the papers in a similar fashion, the different partitionings (section-level vs. entire doc) might just lead to near-identical vectors. For instance, if the “Methods” section heavily overlaps with the “Introduction” content, the average or sum might not differ enough to rearrange the cluster memberships.
The Data is Not Contradictory
The fact that multiple approaches lead to the same classification implies that, from a textual perspective, the boundary between “publishable” and “non-publishable” (as your data is capturing it) is consistent. The data is not easily teased apart differently by slicing or chunking.
Consistency in the Embedding Space
You might actually have a solid embedding approach. If all these transformations (section-level vs. doc-level) converge to the same outcome, it might mean the representation is stable and not very sensitive to how you slice the document text.
Possibility of Domain Constraints
Some tasks simply do not vary enough across sections for them to produce differing clusters. If every paper has the same general structure, length, or domain keywords, the embeddings can look very alike regardless of how you chunk or average them.
What you can still try:

- Refine or add additional features beyond the embeddings.
- Examine "borderline" cases whose similarities or distances to the two centroids are nearly tied (see the sketch after this list).
- Increase labeled diversity by carefully hand-labeling a few more papers for your own validation.
- Try another embedding model.
- Validate with external data.
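For the borderline-case point, a quick way to surface candidates is to rank documents by how close their two centroid similarities are. This is a sketch assuming the `cos_sim` helper from the earlier dual-centroid code returns a plain float, plus the existing `doc_embeddings`, `documents`, and labeled sets:

```python
# Sketch: list the documents whose similarity to the two centroids is nearly equal.
nonpub_embs = [doc_embeddings[i] for i, d in enumerate(documents)
               if d['filename'] in labeled_files_class0]
pub_embs = [doc_embeddings[i] for i, d in enumerate(documents)
            if d['filename'] in labeled_files_class1]
centroid_nonpub = np.mean(nonpub_embs, axis=0)
centroid_pub = np.mean(pub_embs, axis=0)

margins = []
for i, doc in enumerate(documents):
    sim_nonpub = cos_sim(doc_embeddings[i], centroid_nonpub)
    sim_pub = cos_sim(doc_embeddings[i], centroid_pub)
    margins.append((abs(sim_pub - sim_nonpub), doc['filename'], sim_nonpub, sim_pub))

margins.sort()  # smallest margin first = most borderline
print("Most borderline documents:")
for margin, fname, s0, s1 in margins[:10]:
    print(f"{fname}: sim_nonpub={s0:.3f}, sim_pub={s1:.3f}, margin={margin:.3f}")
```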
Getting the same final classification from different chunking/averaging strategies often means that the signal separating the two groups is present throughout each paper rather than concentrated in one section, and that your pipeline is stable with respect to how you slice the text.
It does not necessarily mean you are doing something wrong; rather, it suggests your data and chosen embeddings converge on the same representation regardless of how you slice the paper text. This can be interpreted as stability in your approach.
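If you want to document that stability in your report, you can quantify the pairwise agreement between the label columns you have produced. A small sketch using scikit-learn's `adjusted_rand_score`, assuming all of the earlier methods (including the 2-cluster K-Means that set `kmeans_label`) have been run on the same `documents` list:

```python
# Sketch: quantify how strongly the different methods agree with each other.
from sklearn.metrics import adjusted_rand_score

columns = ['original_similarity', 'original_distance',
           'iterative_similarity', 'iterative_distance', 'kmeans_label']
for a_idx in range(len(columns)):
    for b_idx in range(a_idx + 1, len(columns)):
        a, b = columns[a_idx], columns[b_idx]
        labels_a = [doc[a] for doc in documents]
        labels_b = [doc[b] for doc in documents]
        print(f"{a} vs {b}: ARI = {adjusted_rand_score(labels_a, labels_b):.3f}")
```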
The best way to be sure is to test on truly unseen data or carefully label more documents to see if there are any subtle differences the current pipeline is missing.