LLM-Based Classification Strategies

Created: January 10, 2025

Answered using O1 from Chat01

Question

Below is the problem statement for a national level data science hackathon that I'm participating in.

The task involves developing a framework that can classify research papers as
either "Publishable" or "Non-Publishable" based on the evaluation of their
content. The goal is to create a robust system capable of identifying critical
issues, such as inappropriate methodologies, incoherent arguments, or
unsubstantiated claims, that affect the suitability of a paper for publication. For
instance, a research paper that applies methodologies or techniques that are not
well-suited to the problem being addressed, without adequate justification or
adaptation to the context, would be considered unsuitable. Similarly, a paper that
presents arguments that are unclear, disorganized, or lack logical coherence, or
one that claims results that appear unusually high or unrealistic without sufficient
evidence or proper validation, would also fall into the "Non-Publishable" category.

A dataset of 150 research papers is provided for classification, with 15 labeled
papers available for reference to guide the development of the framework. The
framework should be designed to accurately classify papers into the appropriate
category, ensuring that it can handle a wide range of research topics and
maintain consistency across different types of content. The proposed framework
must be capable of systematically analyzing these and other aspects of research
papers to ensure a reliable and objective evaluation process. The solution should
demonstrate high accuracy in detecting such issues, ensuring its applicability
across a range of research domains and scalability for future use with larger
datasets.

Think deeply about this and guide me. For context, the 15 labeled papers are imbalanced, with 10 being class 'Publishable' and only 5 being class 'Non-Publishable', so it's a 2:1 ratio. The amount of labeled data is also very, very low.

Every paper in the 150 given papers has thousands of words spanning several pages. Think about whether splitting these documents into major sections like abstract, introduction, methodology, related works, experiments and conclusion could benefit in any way. But there is also the problem that not every paper has the same sections; the structure can vary across papers, and many papers have subheadings and so on. The papers also span different topics, but the publishable papers, even in the unlabeled dataset, would be papers that could be placed into one of the following 5 conferences - CVPR, NeurIPS, KDD, EMNLP, TMLR - so the topics the papers cover should fall within the areas that dominate these 5 conferences. Equipped with this knowledge, think about what sort of approach could be taken, what sort of model could be used, and all other options. Think deeply about this.

We cannot directly upload the data to GPT-4 and ask it to classify, as we are judged in this hackathon based on
- Approach towards identification of publishability
- Accuracy and F1 Score of determining publishability.

With that said, we can still use llms in other ways if we want, this is just for your clarification. We have no preferences regarding the method we use, we only want results.

Think deeply about this problem and provide different ways to approach it, different techniques or chains of thought to address it given the heavily imbalanced dataset and all the other conditions mentioned. But try to suggest the ones that will work best given the context in this query. Make sure your answer is extremely detailed and addresses all my questions, doubts and ideas.

Thought

Addressing imbalanced data

I’m developing a strategy to classify research papers as publishable or non-publishable, given a 2:1 imbalance ratio and 15 labeled papers. This involves examining text cleaning, representation, and various model approaches like generative, pipeline, and embedding models.

Handling imbalanced data

I’m thinking through how to handle data imbalance and section segmentation, possibly using bag-of-words. Discussing multi-class pipeline approaches, data augmentation, self-supervised methods, and classification strategies.

Thought about research paper classification strategies for 18 seconds

Answer

Below is a systematic breakdown of possible approaches and considerations for building a highly accurate and robust framework to classify research papers as “Publishable” or “Non-Publishable,” given the constraints and context you described:


1. Handling Limited and Imbalanced Labeled Data

1.1 Data Augmentation Strategies

  1. Synthetic Document Generation / Data Augmentation

    • Paraphrasing: One could generate variants of the 5 “Non-Publishable” papers to increase the minority class. Although it’s tricky to ensure that paraphrased versions remain logically consistent with the original “Non-Publishable” label, it can help expand coverage of phrases and styles.
    • Section Swaps or Partial Text Replacement: For instance, you might keep the introduction and methodology the same but replace the conclusion with a rephrased or newly generated variant, thereby creating multiple versions of the same paper. This approach is more controlled than straightforward paraphrasing because you know exactly which sections you are modifying.
  2. Document Splitting + Label Inheritance

    • You might split each labeled paper into coherent sections or even paragraphs. Each segment “inherits” the global label. Then you could train on this expanded set of labeled text segments. While you lose some context, you gain more training samples.
    • Careful: the label is at the “paper” level, not necessarily at the chunk level. Some smaller chunks might not by themselves indicate whether a paper is publishable. But as a data augmentation strategy, it can help models learn textual cues from smaller segments.
  3. Active Learning or Semi-Supervised Learning

    • With 15 labeled papers, you might use an active learning loop:
      • Train an initial classifier on the small labeled set.
      • Use it to make predictions on the unlabeled 135 papers.
      • Select examples where the model is most uncertain (or most likely to be mislabeled) for human annotation.
    • This could help you quickly surface the most informative examples to be labeled, improving model performance with minimal labeling effort.
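
As a rough sketch of that active-learning loop (names like doc_embeddings, labeled_idx, labeled_y, and unlabeled_idx are placeholders, assumed to be NumPy arrays/lists you build yourself):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed inputs: doc_embeddings is a NumPy array [n_docs, dim];
# labeled_idx/labeled_y describe the 15 labeled papers, unlabeled_idx the other 135.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(doc_embeddings[labeled_idx], labeled_y)

# Rank unlabeled papers by uncertainty (predicted probability closest to 0.5)
proba = clf.predict_proba(doc_embeddings[unlabeled_idx])[:, 1]
uncertainty = np.abs(proba - 0.5)
most_uncertain = [unlabeled_idx[i] for i in np.argsort(uncertainty)[:10]]  # candidates to inspect/label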

1.2 Handling Class Imbalance in Training

  1. Oversampling / Undersampling

    • SMOTE for Text: Synthetic Minority Over-Sampling Technique (SMOTE) can be applied in the feature space (embedding vectors). If you create vector embeddings for each document (e.g., with BERT or Sentence Transformers), you can apply SMOTE on these embeddings to generate synthetic minority samples for the “Non-Publishable” class.
    • Random Oversampling or Undersampling: Simple approaches to either oversample the “Non-Publishable” class or undersample “Publishable.” Given the extremely small number of “Non-Publishable” papers, oversampling (or advanced synthetic generation) may be more suitable than undersampling (which would remove potentially valuable “Publishable” data).
  2. Class Weighting / Focal Loss

    • Adjust the loss function by introducing higher weights for the minority class. In many deep learning frameworks, you can pass a parameter that sets the loss weighting for classes.
    • Alternatively, Focal Loss is useful in cases where the model might easily classify the majority class but struggles with the minority class.
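
As a minimal illustration (not tied to any specific model above), class weighting could look like this in scikit-learn and PyTorch; the 2:1 weights mirror the 10:5 label ratio, and the class indices are an assumption:

import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

# scikit-learn: derive weights automatically from the class frequencies
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# PyTorch: weight the minority "Non-Publishable" class (assumed index 0)
# twice as heavily as "Publishable" (assumed index 1)
class_weights = torch.tensor([2.0, 1.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)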

2. Dealing with Large Documents of Varying Structure

2.1 Segmenting by Sections or Sub-Sections

  • Splitting the documents by major sections (Abstract, Introduction, Methodology, etc.) can help isolate where certain “red flags” for non-publishability might appear. For instance:

    • Methodology section might reveal flawed or inappropriate methods without justification.
    • Results / Experiments section might show unrealistic claims or unsubstantiated results.
    • Introduction / Related Work might show a lack of clarity or organization.
  • Issue: Not all papers have a uniform section structure.

    • Heuristic-based Section Detection: Use a simple set of regex patterns or markers (e.g., “1. Introduction”, “2. Background”, “Conclusion”, etc.) to segment text. Even if some papers don’t follow a standard template, you can still capture partial boundaries.
    • Dynamic Sub-Section Chunking: If some papers do not have clear headings or the headings vary significantly, you can chunk the paper into equal size segments (e.g., 1,000 tokens each). Then you can label or store the approximate position (beginning, middle, end), which correlates somewhat with Introduction, Methodology, Conclusion, etc.

2.2 Representation of Sections

  • Section-level Embeddings: You could represent each section (or chunk) as an embedding using a large language model (e.g., BERT, RoBERTa, Sentence Transformers). This yields multiple embeddings per paper. You then aggregate these embeddings (e.g., by averaging them) into a single representation of the paper.
  • Hierarchical Model: A hierarchical approach first encodes each section and then combines them with an attention mechanism. This can help the model learn that certain sections might hold more weight in publishability decisions.
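
A minimal sketch of such attention pooling over section embeddings (dimensions and names are illustrative assumptions, not part of any specific framework here):

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learn a weighted average over the section embeddings of one paper."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, section_embs):                              # [num_sections, dim]
        weights = torch.softmax(self.score(section_embs), dim=0)  # [num_sections, 1]
        return (weights * section_embs).sum(dim=0)                # [dim] paper embedding

# Usage sketch: pool = AttentionPooling(); paper_vec = pool(torch.randn(6, 768))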

2.3 Checking for Coherence and Methodological Soundness

  • Some signals that a paper is “Non-Publishable”:
    1. Very short or incomplete methodology for a problem that clearly needs more detail.
    2. Results that are suspiciously high without rigorous justification.
    3. Inconsistencies or abrupt shifts in writing style that indicate disorganization.
  • A rule-based or heuristic-based system (in parallel with an ML model) could flag suspicious patterns in these sections. Combined signals from these heuristics can augment your final classifier’s confidence.

3. Potential NLP Approaches

3.1 Pre-trained Transformers (Fine-Tuned)

Given the text-length and complexity, large pre-trained language models (PLMs) such as BERT, RoBERTa, DeBERTa, or Longformer (for longer context) can be used:

  1. Longformer / BigBird

    • These architectures handle longer sequences than base BERT, which typically maxes out around 512 tokens. For entire research papers, you can feed more context without aggressively truncating.
    • You can do a [CLS] token classification or pool the final embeddings for classification.
  2. Chunk-Level Approach with BERT

    • If you cannot use a long-context model, split the document into chunks of 512 tokens (or ~1,000 words, etc.) and pass each chunk through a BERT-based model.
    • Aggregate chunk-level embeddings or predictions (e.g., using mean or attention-based pooling) to form a final label for the entire paper.
  3. Instruction-Finetuned or GPT-based Models Locally

    • While you cannot simply upload the entire dataset to GPT-4, you could leverage open-source instruction-finetuned models (like Llama 2, or Dolly, or other smaller GPT-like models) locally.
    • Fine-tune them on your 15 labeled papers (plus augmented data). This might be more feasible for capturing the notion of “publishability” because these models are fairly strong in domain reasoning if you shape your prompts or fine-tuning properly.
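
For the long-context route in point 1 above, a hedged sketch of a Longformer classification head with Hugging Face transformers (the head is randomly initialized until you fine-tune it on your labeled/augmented data; paper_text is a placeholder string):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Encode up to 4096 tokens of one paper and get class logits
inputs = tokenizer(paper_text, truncation=True, max_length=4096, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits        # shape [1, 2]
probs = torch.softmax(logits, dim=-1)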

3.2 Classical ML + Feature Engineering

For smaller data scenarios, a simpler pipeline might be surprisingly effective:

  1. Feature Extraction

    • TF-IDF: Standard textual features. Possibly weighted more heavily in the methodology and results sections.
    • Readability Scores: (e.g., Flesch-Kincaid) to capture clarity.
    • Semantic Similarity to Known “Non-Publishable” examples: Create an embedding for the entire text and measure the distance to embeddings from known non-publishable documents.
    • Keyword / Keyphrase extraction: Searching for “unsubstantiated,” “unproven,” or “lack of” in context might be a signal.
  2. Classification Algorithms

    • Logistic Regression or Random Forest with class weighting.
    • Gradient Boosted Trees (XGBoost / LightGBM).
    • Even though these classical methods are overshadowed by large transformers on large data, they can perform competitively if your data is small and features are carefully engineered.
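
A compact sketch of such a pipeline with class weighting (train_texts, train_labels, and unlabeled_texts are placeholder names for data you would prepare yourself):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tfidf_clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2), sublinear_tf=True)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

tfidf_clf.fit(train_texts, train_labels)          # the 15 labeled papers (plus any augmentations)
predictions = tfidf_clf.predict(unlabeled_texts)  # labels for the remaining papers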

3.3 Multi-Task or Transfer Learning from “Topic Classification”

  • You mentioned that the “Publishable” papers would fit into CVPR, NeurIPS, KDD, EMNLP, TMLR topics. One approach is:
    • Train a classifier to identify if a paper is about computer vision, NLP, data mining, etc. (You can do this with unlabeled data, because you can approximate the conference domain of the unlabeled paper by some heuristic or by an LLM prompt for domain classification. This step is somewhat easier to verify).
    • Then you use that knowledge as an auxiliary task, so the model “knows” what a typical CVPR or NeurIPS paper looks like. This can help it detect anomalies or major flaws in an “alleged” domain paper.

4. Practical Workflow Example

Below is an illustration of a pipeline that incorporates many of the above ideas:

  1. Preprocessing & Document Splitting

    • Attempt to segment papers into major sections if possible (using heading detection or simple chunk-based segmentation otherwise).
  2. Embeddings

    • Use a sentence-transformer or a Longformer variant to produce embeddings for each section (or chunk).
  3. Aggregation

    • Combine embeddings for each chunk (using mean/attention).
    • Optionally, append a few engineered features (readability scores, average citations in references, etc.) to the final embedding vector.
  4. Address Imbalance

    • Perform some oversampling (e.g., random oversampling for the 5 “Non-Publishable” papers) or synthetic minority data generation in the embedding space (SMOTE).
    • Additionally, use class weighting or Focal Loss during training.
  5. Train Classifier

    • Fine-tune a robust model (e.g., a small MLP on top of embeddings, or fine-tune a transformer with a classification head).
    • Validate on a small hold-out from the 15 labeled samples or via cross-validation if you can’t spare a hold-out set.
  6. Active Learning for Additional Labels

    • Run inference on the unlabeled 135 papers.
    • Identify 10–15 borderline examples that the model is least confident about.
    • Manually label them (if hackathon rules / time permits).
    • Retrain the model to further improve performance.

5. Potential Use of Large Language Models (LLMs) as Tools (Not Direct Classification)

While you cannot upload all data to GPT-4 or other LLMs directly for classification, you can still use them to aid certain subtasks:

  1. Section Extraction / Title Identification

    • You can prompt an LLM to generate regular expressions or a procedure for detecting sections from a typical research paper. You implement that code locally and run it on your documents.
    • Or you can “ask” an LLM to propose heuristics for identifying a flawed methodology section. You can embed those heuristics into your code.
  2. Topic Tagging

    • You can feed small abstracts or summarizations to an LLM to help determine if the paper belongs to CV, NLP, data mining, etc. This is short text, so it’s feasible to do partial queries. Then feed that domain label as an additional feature to your classifier.
  3. Keyword Generation

    • Use an LLM to generate domain-specific “red flag” phrases for non-publishable content. Then you can build a dictionary-based detection system or incorporate it as a feature in your main model.

6. Ensuring Scalability and Consistency

  • Scalability: By establishing a robust pipeline (section extraction → embeddings → classifier), you can easily add more labeled examples over time. If, in the future, you get thousands more labeled papers, you can seamlessly retrain the same pipeline at scale.
  • Consistency Across Domains:
    • Because you’re dealing with multiple potential domains (CV, NLP, data mining, etc.), your model needs generalizable “red flags.”
    • Transfer learning from a generic language model (Longformer, BERT, etc.) already helps in capturing domain-invariant signals.
    • The additional domain classification step can also increase consistency by letting the model see typical domain structures.

7. Summary of Key Recommendations

  1. Work with chunks or sections of the paper, using a hierarchical or chunk-based modeling approach (Longformer is ideal for large contexts).
  2. Imbalance is a major issue; tackle it with oversampling (SMOTE in embedding space), class weighting, or focal loss.
  3. Data augmentation: Explore paraphrasing or partial text modifications to increase the “Non-Publishable” set.
  4. Active / Semi-Supervised Learning: Leverage the unlabeled 135 papers to refine your model iteratively.
  5. Feature engineering: Combine advanced embeddings with simpler signals (keyword checks, readability, suspicious results).
  6. LLMs can still be used as supportive tools (e.g., for section extraction heuristics, domain classification, or generating “red flag” keywords) without directly uploading the full dataset for classification.
  7. Keep the pipeline modular so that you can easily experiment with different embedding models or classification heads.

Following these steps in a well-structured pipeline can yield a high-performing system, even with very limited labeled data and heavily imbalanced classes. By integrating data-centric approaches (like augmentation and active learning) with advanced NLP architectures (like Longformer/BERT for large text), you stand a strong chance of building a solution that meets the accuracy and F1 requirements for the hackathon.

Question

For context, we have already tried semi-supervised and self-training approaches with the available limited labeled dataset, using a combination of transformers like SciBERT, RoBERTa, BERT, sentence transformers, zero-shot classification, etc., and we also tried SMOTE oversampling of the embeddings. But the end result was always overfitting, or even when a model worked we couldn't infer its performance because the test set was even smaller, typically consisting of one or two examples per class. So we are not able to gauge the model's performance and improve it.

So we would like to try different approaches, especially unsupervised learning: train on the entire corpus of 150 documents without using the labels, and somehow classify all 150 papers based on the extracted features or some other method that distinguishes publishable from non-publishable papers. Our thought process is that we cannot train on the 15 labeled papers, as the model overfits or its performance is hard to infer.

We extracted the text from the documents using the pdfplumber library with the following code:
pdf_text = ""
try:
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            pdf_text += page.extract_text()

We only did the basic preprocessing that is typically done for NLP tasks, using the code below.

def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower()
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

df['Processed_Content'] = df['Content'].apply(preprocess_text)

We are not sure if this preprocessing is enough for this task and for papers spanning such huge volumes of text. So think about this and tell us if there is a better suited way for our application.

- Heuristic-based Section Detection: Use a simple set of regex patterns or markers (e.g., “1. Introduction”, “2. Background”, “Conclusion”, etc.) to segment text. Even if some papers don’t follow a standard template, you can still capture partial boundaries.

-Hierarchical Model: A hierarchical approach first encodes each section and then combines them with an attention mechanism. This can help the model learn that certain sections might hold more weight in publishability decisions.

-A rule-based or heuristic-based system (in parallel with an ML model) could flag suspicious patterns in these sections. Combined signals from these heuristics can augment your final classifier’s confidence.

-Semantic Similarity to Known “Non-Publishable” examples: Create an embedding for the entire text and measure the distance to embeddings from known non-publishable documents.

-Fine-tune a robust model (e.g., a small MLP on top of embeddings, or fine-tune a transformer with a classification head).

-You can prompt an LLM to generate regular expressions or a procedure for detecting sections from a typical research paper. You implement that code locally and run it on your documents.

These are the ideas that we like from what you suggested. Go into more detail and think through these ideas for our specific application with the context of past attempts at this task and tell us how do we go about it step by step and then give us end to end code for each of them.

Keep in mind we are running in a kaggle environment and give code that utilizes the GPU or TPU accordingly.

Finally, all the thoughts entered in this prompt are my own; they are neither objective nor irrefutable, so you can still find flaws in my thought process and suggest ways of improving it if there are any.

Thought

Taking a closer look

I’m gathering info on unsupervised classification for 150 papers, considering heuristic-based section detection, hierarchical models, rule-based systems, semantic similarity, fine-tuning, and LLM-generated procedures.

Crafting the plan

Thinking through a Python-based approach with Kaggle libraries, using GPU/TPU. It involves PDF extraction, heuristic detection, unsupervised classification, hierarchical models, and rule-based or LLM-generated pattern generation.

Mapping the process

I’m planning to provide end-to-end Python code for each approach, focusing on unsupervised methods like clustering to spot suspicious patterns. Code blocks will be concise and illustrative.

Assessing documents

I’m pulling together code to chunk and store PDFs, apply unsupervised techniques like clustering with Sentence Transformers and KMeans for outlier detection, and develop a rule-based approach for spotting missing elements.

Assessing similarity

I’m crafting a semantic similarity snippet to match document embeddings, suggesting non-publishable examples if distances fall within a set threshold, and noting the need for GPU specification in Kaggle.

Thought about alternative approaches for classification for 9 seconds

Answer

Below is a detailed, end-to-end outline of how you might implement several of the ideas you highlighted—ranging from heuristic-based section detection, a hierarchical model design, rule-based “red flag” detection, unsupervised clustering, and leveraging semantic similarity. The goal is to provide a variety of methods you can try in your Kaggle environment. Where relevant, I’ve included code snippets (in Python) that you can adapt. I’ll also point out potential pitfalls or ways to improve your approach.


Table of Contents

  1. Preprocessing & Section Extraction
    1.1 Basic Preprocessing vs. Context-Aware Preprocessing
    1.2 Heuristic-Based Section Detection

  2. Unsupervised/Clustering Approach
    2.1 Embedding Generation (Sentence Transformers)
    2.2 K-Means / Hierarchical Clustering
    2.3 Interpreting Clusters & Using Minimal Labels

  3. Hierarchical Model for Long Documents
    3.1 Chunking into Sections (or Subsections)
    3.2 Encoding & Combining Section Embeddings
    3.3 Fine-Tuning a Small Classifier Head (Optional)

  4. Rule-Based “Red Flag” Detection
    4.1 Designing Heuristics
    4.2 Scoring & Combining with Other Methods

  5. Semantic Similarity to Known Non-Publishable Papers
    5.1 Building a Similarity Index
    5.2 Threshold-Based Classification

  6. Putting It All Together: Example Pipeline

  7. Improving & Checking for Pitfalls

  8. Complete Code Examples

    • 8.1 Section Extraction & Preprocessing
    • 8.2 Unsupervised Clustering (KMeans)
    • 8.3 Rule-Based Signals
    • 8.4 Semantic Similarity Scoring

Throughout, I’ll highlight ways to utilize your small set of 15 labeled papers effectively—even if it’s only at the end as a final “label assignment” step.


1. Preprocessing & Section Extraction

1.1 Basic Preprocessing vs. Context-Aware Preprocessing

What you’re currently doing:

def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower()
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)
  • Pros: Removes punctuation, lowercases, eliminates stopwords. This is a decent baseline for classical NLP.
  • Cons:
    • You lose some valuable tokens like parentheses, percentages (useful for results sections: e.g., “accuracy = 99.9%!”).
    • You might break references to figures/tables (e.g., “Fig. 2.1” becomes “fig 21”).
    • Removing certain special tokens could hamper the detection of domain-specific notation.

Suggestions:

  1. Keep punctuation in some form or at least keep certain symbols like %, =, maybe “( )” if relevant.
  2. Consider lemmatization rather than just removing stopwords—sometimes lemmatization or a domain-specific tokenizer (like SciSpaCy for scientific text) can help.
  3. Retain numeric tokens with some minimal cleaning. For example, if a paper claims “Accuracy of 150%,” that’s obviously suspicious and might be relevant to detect.

1.2 Heuristic-Based Section Detection

Many research papers follow a rough structure with headings like:

  • Abstract
  • Introduction
  • Related Work / Background
  • Methodology / Experiments
  • Results
  • Conclusion / Future Work

Heuristic approach:

  • Use regex patterns to capture lines that look like headings (e.g., “^([0-9]+\.)+\s+[A-Z].+$” or just “^(abstract|introduction|conclusion|references)[^a-zA-Z]+”, ignoring case).
  • If you can’t find a heading, split by large chunks (e.g., 1,000–2,000 words) as a fallback.

Below is a conceptual snippet for Kaggle (Python) to detect sections:

import re

def split_into_sections(pdf_text):
    # Possible common headings to look for
    headings = [
        r"(abstract)\b",
        r"(introduction)\b",
        r"(background|related\s?work)\b",
        r"(methodology|methods|experimental\s?setup)\b",
        r"(results|experiments)\b",
        r"(discussion)\b",
        r"(conclusion|conclusions)\b",
        r"(references)\b"
    ]
    # We'll store tuples of (section_name, section_text)
    sections = []
    current_section_name = "Unlabeled"
    current_section_text = []
    lines = pdf_text.split('\n')
    for line in lines:
        line_lower = line.lower().strip()
        matched_heading = None
        for hd in headings:
            if re.match(hd, line_lower):
                matched_heading = line.strip()
                break
        if matched_heading:
            # Save previous section if we have any text
            if current_section_text:
                sections.append((current_section_name, "\n".join(current_section_text)))
            # Start a new section
            current_section_name = matched_heading
            current_section_text = []
        else:
            current_section_text.append(line)
    # Add the last collected section if any
    if current_section_text:
        sections.append((current_section_name, "\n".join(current_section_text)))
    return sections

# Example usage
# pdf_text = "Your extracted text from pdfplumber"
# sections = split_into_sections(pdf_text)
# for sec_name, sec_content in sections:
#     print(sec_name, len(sec_content.split()))

What to do next:

  • You can pass each sec_content through your preprocess_text or a more domain-aware pipeline.
  • This helps your unsupervised approaches or hierarchical model treat each major section somewhat separately.

2. Unsupervised/Clustering Approach

2.1 Embedding Generation (Sentence Transformers)

  1. Sentence Transformers (e.g., all-mpnet-base-v2, all-distilroberta-v1, or a domain-specific model like allenai/scibert_scivocab_uncased) can create embeddings for text segments.
  2. For each of your 150 documents, produce one embedding for each major section (or chunk). Then either average them to get a single “document embedding,” or keep them separate for hierarchical modeling.

2.2 K-Means / Hierarchical Clustering

  • K-Means is a straightforward choice if you want a fixed number of clusters. However, deciding the number of clusters is tricky (especially if you have 2 classes—“Publishable” vs. “Non-Publishable”—but the actual text may form more than 2 clusters).
  • Hierarchical clustering or HDBSCAN can discover an arbitrary number of clusters and potentially identify outliers.

2.3 Interpreting Clusters & Using Minimal Labels

  1. Cluster the 150 documents (using their embeddings) into, say, k = 5 or k = 6 clusters.
  2. Look at which cluster your 5 known “Non-Publishable” papers end up in. If they cluster together or in smaller distinct subgroups, that is a strong signal.
  3. Optionally, label entire clusters as “likely publishable” or “likely non-publishable” if they strongly align with your known labeled docs. If a cluster has no known non-publishable examples, treat it carefully or look for rule-based signals to see if it might be “suspect.”
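
A short sketch of this check, assuming the documents list and KMeans labels produced in Section 8 below, plus a hypothetical known_labels dict mapping each labeled filename to 0 (non-publishable) or 1 (publishable):

from collections import defaultdict

cluster_to_known = defaultdict(list)
for doc, cluster in zip(documents, labels):
    if doc["filename"] in known_labels:
        cluster_to_known[int(cluster)].append(known_labels[doc["filename"]])

for cluster_id, known in sorted(cluster_to_known.items()):
    print(f"Cluster {cluster_id}: {known.count(0)} known non-publishable, "
          f"{known.count(1)} known publishable")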

3. Hierarchical Model for Long Documents

3.1 Chunking into Sections

By splitting papers into sections or ~1,000-word chunks:

def chunk_text(text, chunk_size=1000):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield ' '.join(words[i:i+chunk_size])

Then process each chunk with a transformer (like SciBERT) to obtain a vector.

3.2 Encoding & Combining Section Embeddings

One approach (a “hierarchical” approach) is:

  1. Encode each chunk with a transformer → you get chunk_embeddings of shape [num_chunks, embedding_dim].
  2. Combine chunk_embeddings with an attention mechanism or a simple average, for instance: paper_embedding = mean(chunk_embeddings)
  3. Use the final paper_embedding for either clustering (unsupervised) or as input to a small neural classifier if you do want to use your labeled data.

3.3 Fine-Tuning a Small Classifier Head (Optional)

Even if you have only 15 labeled documents, you could:

  1. Generate chunk-level embeddings for all 150 documents.
  2. Use the 15 labeled docs to train a tiny classifier (like an MLP or logistic regression) that takes the aggregated embedding as input.
  3. Apply that classifier to all 150 docs.
  4. Because your labeled data is tiny, you’ll risk overfitting—so you might rely on cross-validation or the cluster-based signals to calibrate or confirm.
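
With only 15 labels, leave-one-out cross-validation at least gives a rough (if noisy) estimate of such a classifier; a sketch with placeholder names labeled_embeddings (a [15, dim] NumPy array) and labeled_y (a NumPy array of 0/1 labels):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import f1_score

preds = np.zeros(len(labeled_y))
for train_idx, test_idx in LeaveOneOut().split(labeled_embeddings):
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(labeled_embeddings[train_idx], labeled_y[train_idx])
    preds[test_idx] = clf.predict(labeled_embeddings[test_idx])

# Treat "Non-Publishable" (assumed label 0) as the positive class for F1
print("Leave-one-out F1:", f1_score(labeled_y, preds, pos_label=0))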

4. Rule-Based “Red Flag” Detection

4.1 Designing Heuristics

Some potential signals for “Non-Publishable”:

  1. Suspicious High-Level Claims: Searching for tokens like “100% accuracy,” “state-of-the-art with no baseline,” etc.
  2. Weak or Missing Methodology: If the “Methodology” section has fewer than X words, or if certain keywords (“experiment,” “implementation,” “dataset,” “train,” “test”) are missing.
  3. Lack of references or citations: If the “References” or “Bibliography” section is absent or very short.

4.2 Scoring & Combining with Other Methods

You could create a simple “red flag” score for each paper:

  • For each rule triggered, add +1 (or +some weighting).
  • At the end, if total “red flag” score exceeds a threshold, you flag the paper as suspicious.

This approach can run in parallel with your unsupervised model. You might ultimately label a paper “non-publishable” if:

  • It’s in the same cluster as known non-publishable docs, and
  • It exhibits at least 1–2 rule-based red flags.

5. Semantic Similarity to Known Non-Publishable Papers

5.1 Building a Similarity Index

  1. Compute the average embedding of your 5 known non-publishable papers, call it nonpub_centroid.
  2. For each unlabeled paper, compute its embedding and compute cosine similarity to nonpub_centroid.
  3. If the similarity is above a certain threshold, the paper is considered suspicious. Alternatively, keep the top N “closest” papers as candidates for “non-publishable.”

5.2 Threshold-Based Classification

A simple approach:

  • You might pick a similarity threshold by checking your known 5 non-publishable vs. 10 publishable papers.
  • If your 5 known non-publishable have an average similarity of 0.8 to the centroid, while the 10 publishable ones have 0.5, then you can choose ~0.65 as a cutoff.
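
A small sketch of picking that cutoff from the labeled papers, reusing documents and similarity_to_nonpub from the code below; known_pub_filenames (the 10 labeled publishable files) is a hypothetical list you would define alongside known_nonpub_filenames:

import numpy as np

nonpub_sims = [d["similarity_to_nonpub"] for d in documents
               if d["filename"] in known_nonpub_filenames]
pub_sims = [d["similarity_to_nonpub"] for d in documents
            if d["filename"] in known_pub_filenames]

# Midpoint between the two class means; sanity-check the distributions before trusting it
threshold = (np.mean(nonpub_sims) + np.mean(pub_sims)) / 2
print("Publishable mean:", np.mean(pub_sims),
      "Non-publishable mean:", np.mean(nonpub_sims),
      "Chosen threshold:", threshold)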

6. Putting It All Together: Example Pipeline

  1. Extract PDF text using pdfplumber.
  2. Heuristic-based section detection (if possible).
  3. Preprocess each section (decide on minimal vs. domain-aware cleaning).
  4. Generate embeddings (e.g., using a SentenceTransformer model).
  5. Combine section embeddings → one doc embedding (mean or attention).
  6. Unsupervised clustering or semantic similarity approach.
  7. Rule-based checks for suspicious patterns.
  8. (Optional) Small classifier using your 15 labeled docs.

7. Improving & Checking for Pitfalls

  • Overfitting to the 15 labeled docs:

    • If you do any supervised training, consider cross-validation or a simple “leave-one-out” approach to get a rough measure.
    • Or rely primarily on unsupervised methods, then do a final check with the 15 labeled docs to see if it aligns with your intuition.
  • Inconsistent Section Headings:

    • Many papers might not follow standard headings. You might need fallback chunking.
  • Large Variation in Paper Length:

    • Some papers might be 1–2 pages short workshop papers, others 20+ pages. Make sure your chunking logic can handle extremes.
  • GPU/TPU Usage:

    • In Kaggle, you can enable a GPU by going to Settings > Accelerator. Using a transformer model for embeddings typically works well on GPUs.

8. Complete Code Examples

Below are some consolidated code snippets (in Python) that you can adapt in a Kaggle notebook. I’ll assume you’ve already installed pdfplumber and sentence-transformers. (If not, you can do !pip install pdfplumber sentence-transformers at the top of your Kaggle notebook.)

8.1 Section Extraction & Preprocessing

import re
import pdfplumber
import torch
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Make sure to download NLTK data if not already present
# nltk.download('punkt')
# nltk.download('stopwords')

def extract_text_from_pdf(pdf_path):
    """Extract all text from a PDF using pdfplumber."""
    pdf_text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                pdf_text += page_text + "\n"
    return pdf_text

def custom_preprocess(text):
    """
    A slightly more cautious preprocessing that:
    - Keeps numbers and percentage signs,
    - Lowercases,
    - Removes some punctuation but keeps essential symbols.
    """
    # Lowercase
    text = text.lower()
    # Keep letters, numbers, whitespace, and punctuation like '%', '=', '(', ')', '.', ','
    text = re.sub(r'[^a-z0-9%\=\(\)\.\,\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords (optional)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)

def split_into_sections(pdf_text):
    headings = [
        r"(abstract)\b",
        r"(introduction)\b",
        r"(background|related\s?work)\b",
        r"(methodology|methods|experimental\s?setup)\b",
        r"(results|experiments)\b",
        r"(discussion)\b",
        r"(conclusion|conclusions)\b",
        r"(references)\b"
    ]
    sections = []
    current_section_name = "Unlabeled"
    current_section_text = []
    lines = pdf_text.split('\n')
    for line in lines:
        line_lower = line.strip().lower()
        matched_heading = None
        for hd in headings:
            if re.match(hd, line_lower):
                matched_heading = line.strip()
                break
        if matched_heading:
            if current_section_text:
                sections.append((current_section_name, "\n".join(current_section_text)))
            current_section_name = matched_heading
            current_section_text = []
        else:
            current_section_text.append(line)
    if current_section_text:
        sections.append((current_section_name, "\n".join(current_section_text)))
    return sections

Usage in a Kaggle environment:

# Suppose you have a folder "papers" with PDF files
import os

pdf_dir = "./papers"
documents = []

for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        path = os.path.join(pdf_dir, filename)
        raw_text = extract_text_from_pdf(path)
        sections = split_into_sections(raw_text)
        # Preprocess each section
        processed_sections = []
        for sec_name, sec_text in sections:
            processed_text = custom_preprocess(sec_text)
            processed_sections.append((sec_name, processed_text))
        # Store the document data
        documents.append({
            'filename': filename,
            'sections': processed_sections
        })

8.2 Unsupervised Clustering (KMeans)

Below is a simple example of clustering the average embedding of each paper. We’ll use Sentence Transformers to get embeddings.

!pip install sentence-transformers

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

# Load a pre-trained or domain-specific model (SciBERT, etc.)
model_name = 'all-mpnet-base-v2'  # or 'allenai/scibert_scivocab_uncased'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embed_model = SentenceTransformer(model_name, device=device)

def get_document_embedding(processed_sections):
    """
    Generate embeddings for each section, then average them.
    processed_sections: list of (sec_name, text) after preprocessing.
    """
    section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()]
    if not section_texts:
        # fallback if there's no text
        return np.zeros(768)
    # Encode the entire batch
    embeddings = embed_model.encode(section_texts, convert_to_numpy=True)
    doc_embedding = np.mean(embeddings, axis=0)
    return doc_embedding

# Compute embeddings for all documents
doc_embeddings = []
for doc in documents:
    embedding = get_document_embedding(doc['sections'])
    doc_embeddings.append(embedding)
doc_embeddings = np.array(doc_embeddings)

# K-Means Clustering
num_clusters = 5  # arbitrary choice; you can experiment
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(doc_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = labels[i]

# Inspect the cluster distribution
from collections import Counter
cluster_counts = Counter(labels)
print(cluster_counts)

Next Steps:

  • Check which clusters your known 5 non-publishable (if you do know their filenames) fall into.
  • Potentially label entire clusters as suspicious or safe, or use rule-based signals to refine.

8.3 Rule-Based Signals

Create a few simple functions to detect “red flags,” e.g.:

def count_red_flags(processed_sections):
    flags = 0
    # 1) Check for unrealistic claims: e.g., "100% accuracy"
    text_joined = " ".join(sec_text for _, sec_text in processed_sections)
    if '100% accuracy' in text_joined or '100 percent accuracy' in text_joined:
        flags += 1
    # 2) Check if Methodology section is too short
    for sec_name, sec_text in processed_sections:
        if 'method' in sec_name.lower():
            # Adjust threshold to your taste
            if len(sec_text.split()) < 100:
                flags += 1
    # 3) Check if there's a references section at all
    has_references = any('references' in sec_name.lower() for sec_name, _ in processed_sections)
    if not has_references:
        flags += 1
    return flags

# Example usage
for doc in documents:
    doc['red_flag_score'] = count_red_flags(doc['sections'])

# Then you could set a threshold
threshold = 1
for doc in documents:
    if doc['red_flag_score'] >= threshold:
        doc['suspicious'] = True
    else:
        doc['suspicious'] = False

8.4 Semantic Similarity Scoring

If you have 5 known non-publishable papers:

known_nonpub_embeddings = []
for doc in documents:
    if doc['filename'] in known_nonpub_filenames:  # e.g. the 5 known non-publishable
        emb = get_document_embedding(doc['sections'])
        known_nonpub_embeddings.append(emb)

nonpub_centroid = np.mean(known_nonpub_embeddings, axis=0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Score each doc
for doc in documents:
    emb = get_document_embedding(doc['sections'])
    sim = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = sim

Then set a threshold:

# For instance, see the distribution of similarity_to_nonpub for known publishable vs. known non-publishable
# This helps you pick a threshold
threshold = 0.7
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > threshold)

Putting It All Together in a Step-by-Step Manner

Below is a condensed version of how this might look in an actual Kaggle Notebook with minimal commentary:

# Step 0: Install dependencies (if needed)
!pip install pdfplumber sentence-transformers

import os
import pdfplumber
import re
import torch
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from collections import Counter

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'all-mpnet-base-v2'
embed_model = SentenceTransformer(model_name, device=device)

# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')

def extract_text_from_pdf(pdf_path):
    pdf_text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                pdf_text += page_text + "\n"
    return pdf_text

def custom_preprocess(text):
    text = text.lower()
    # Keep certain punctuation
    text = re.sub(r'[^a-z0-9%\=\(\)\.\,\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)

def split_into_sections(pdf_text):
    headings = [
        r"(abstract)\b",
        r"(introduction)\b",
        r"(background|related\s?work)\b",
        r"(methodology|methods|experimental\s?setup)\b",
        r"(results|experiments)\b",
        r"(discussion)\b",
        r"(conclusion|conclusions)\b",
        r"(references)\b"
    ]
    sections = []
    current_section_name = "Unlabeled"
    current_section_text = []
    lines = pdf_text.split('\n')
    for line in lines:
        line_lower = line.strip().lower()
        matched_heading = None
        for hd in headings:
            if re.match(hd, line_lower):
                matched_heading = line.strip()
                break
        if matched_heading:
            if current_section_text:
                sections.append((current_section_name, "\n".join(current_section_text)))
            current_section_name = matched_heading
            current_section_text = []
        else:
            current_section_text.append(line)
    if current_section_text:
        sections.append((current_section_name, "\n".join(current_section_text)))
    return sections

def get_document_embedding(processed_sections):
    section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()]
    if not section_texts:
        return np.zeros(768)
    embeddings = embed_model.encode(section_texts, convert_to_numpy=True)
    return np.mean(embeddings, axis=0)

def count_red_flags(processed_sections):
    flags = 0
    text_joined = " ".join(sec_text for _, sec_text in processed_sections)
    # Example rule: 100% accuracy
    if '100% accuracy' in text_joined or '100 percent accuracy' in text_joined:
        flags += 1
    # Example rule: short methodology section
    for sec_name, sec_text in processed_sections:
        if 'method' in sec_name.lower():
            if len(sec_text.split()) < 100:
                flags += 1
    # Example rule: no references
    has_references = any('references' in sec_name.lower() for sec_name, _ in processed_sections)
    if not has_references:
        flags += 1
    return flags

# ===== MAIN PIPELINE =====
pdf_dir = "./papers"
documents = []

for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        path = os.path.join(pdf_dir, filename)
        raw_text = extract_text_from_pdf(path)
        sections = split_into_sections(raw_text)
        # Preprocess each section
        processed_sections = []
        for sec_name, sec_text in sections:
            processed_text = custom_preprocess(sec_text)
            processed_sections.append((sec_name, processed_text))
        documents.append({
            'filename': filename,
            'sections': processed_sections
        })

# Compute embeddings for each doc
doc_embeddings = []
for doc in documents:
    emb = get_document_embedding(doc['sections'])
    doc_embeddings.append(emb)
doc_embeddings = np.array(doc_embeddings)

# KMeans clustering
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(doc_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = int(labels[i])
    doc['red_flag_score'] = count_red_flags(doc['sections'])

cluster_counts = Counter(labels)
print("Cluster distribution:", cluster_counts)

# Suppose we know the filenames of the 5 known non-publishable
known_nonpub_filenames = ["nonpub1.pdf", "nonpub2.pdf", "nonpub3.pdf", "nonpub4.pdf", "nonpub5.pdf"]

known_nonpub_embeddings = []
for doc in documents:
    if doc['filename'] in known_nonpub_filenames:
        known_nonpub_embeddings.append(get_document_embedding(doc['sections']))
nonpub_centroid = np.mean(known_nonpub_embeddings, axis=0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for doc in documents:
    emb = get_document_embedding(doc['sections'])
    sim = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = float(sim)

# Example threshold for suspicious
similarity_threshold = 0.7
red_flag_threshold = 1
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold) \
        or (doc['red_flag_score'] >= red_flag_threshold)

# Inspect final results
for doc in documents:
    print(doc['filename'], doc['cluster_label'], doc['red_flag_score'],
          doc['similarity_to_nonpub'], doc['maybe_nonpub'])

In summary, this end-to-end script does the following:
  1. Load PDFs → Extract text
  2. Heuristics to split into sections
  3. Preprocessing that’s a bit more selective
  4. Compute embeddings for each doc (averaging section embeddings)
  5. Clustering (K-Means) → see if certain clusters align with known non-publishable examples
  6. Rule-based flags → short methodology, missing references, suspicious claims
  7. Semantic similarity to known non-publishable papers (centroid)
  8. Combine signals to produce a final guess (maybe_nonpub)

Final Notes & Key Takeaways

  1. Unsupervised + Heuristics: Since you have such limited labeled data, leaning on unsupervised clustering + rule-based detection is a solid approach.
  2. Section Extraction: Imperfect headings are still better than none. For papers that don’t match headings, fallback chunking is a must.
  3. Semantic Similarity: Quick way to see how close new papers are to known “bad” examples.
  4. Small Classifier: If time permits, you can still attempt a small MLP on top of embeddings, but be wary of overfitting with only 15 labeled examples.
  5. Spot-Check: Even with unsupervised methods, you still want to do some manual inspection of the clusters and flagged papers.
  6. Continual Improvement: If you can incrementally label more papers (even if it’s partial or just verifying cluster assignments), you’ll see big gains in performance.

I hope this gives you a more comprehensive roadmap—and enough code to try out each approach in Kaggle. Feel free to tweak thresholds, the choice of embedding model, clustering method, and rule-based patterns to adapt to your specific corpus. Above all, remember that your domain knowledge about “what makes a paper unpublishable” can be translated into powerful heuristics that complement the unsupervised features.

Question

# KMeans clustering
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(doc_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = int(labels[i])
    doc['red_flag_score'] = count_red_flags(doc['sections'])

cluster_counts = Counter(labels)
print("Cluster distribution:", cluster_counts)

This was the clustering code that you have given in the previous response. Can you now give me code to visualize them. So i can understand where the different classes are and how they are distributed. I want you to give labels or highlight for the following 15 filenames when visualizing the clusters as they are my labeled examples, I wanna see where they end up in.
The pdfs are named as R001.pdf, R002.pdf, R003.pdf....until R014.pdf, R015.pdf. R001 - R005 are of class 0, so label them as such and the remaining 10 are class 1, so label them as such too.

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for doc in documents:
    emb = get_document_embedding(doc['sections'])
    sim = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = float(sim)

# Example threshold for suspicious
similarity_threshold = 0.6
red_flag_threshold = 1

for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold)


This was the code you gave for finding the cosine similarity between documents. Can you give me the code to visualize the distribution of cosine similarities, and also label those 15 filenames I mentioned or at least highlight them in some form? I would also like some form of histogram/density curve for this.

Can you give some more red flag rules? We found out that not all papers have a references section, even the publishable ones, so that was not a good metric. I would also like to see the red flag distribution as some form of plot. Give code for that as well.

Can you suggest me any improvements?

Thought

Framing the approach

I’m setting up for a KMeans visualization by reducing dimensionality with PCA, then creating a scatter plot to color or label clusters, highlighting specific examples.

Visualizing similarities

I’m preparing to create a histogram or density curve to show cosine similarity distributions, using colors or shapes to highlight 15 specific documents and their unique characteristics.

Thought for 4 seconds

Answer

Below is a structured walk-through of how to visualize the clustering results, the cosine similarity distribution, and the red flag scores for your documents. I’ve also included code snippets that:

  • Highlight the 15 labeled PDFs (R001.pdf to R015.pdf), with R001–R005 labeled as class 0 and R006–R015 labeled as class 1.
  • Plot the clustering in 2D using either PCA or t-SNE.
  • Draw a histogram (or KDE) of cosine similarity values, highlighting the labeled ones.
  • Provide additional “red flag” heuristics and illustrate how to visualize their distribution.

Where possible, I’ve included commentary for improvements or alternative approaches.


1. Visualizing the K-Means Clusters

Since doc_embeddings is likely high-dimensional (e.g., 768 dimensions from a transformer), we’ll reduce the dimensionality to 2D for plotting. You can use PCA, t-SNE, or UMAP. Below, I’ll show PCA for simplicity, but feel free to replace it with t-SNE or UMAP.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# We have:
# doc_embeddings (shape: [num_documents, embedding_dim])
# documents (a list of dicts with 'filename' and 'cluster_label')

# Let's define our labeled sets (zero-padded filenames: R001.pdf ... R015.pdf)
labeled_files_class0 = set(f"R{i:03d}.pdf" for i in range(1, 6))    # R001 to R005
labeled_files_class1 = set(f"R{i:03d}.pdf" for i in range(6, 16))   # R006 to R015

# (1) PCA for 2D visualization
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)

# (2) Alternatively, you can try t-SNE (comment out PCA if you want this)
# tsne = TSNE(n_components=2, random_state=42)
# reduced_tsne = tsne.fit_transform(doc_embeddings)
# Then use reduced_tsne instead of reduced_pca below.

# We'll use reduced_pca here for demonstration.
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

# Create a figure for the cluster visualization
plt.figure(figsize=(10, 7))

# Plot each document as a point, colored by cluster_label
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Now let's highlight or label the 15 known PDFs
for i, doc in enumerate(documents):
    filename = doc['filename']
    # If this doc is one of the 15 labeled
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        # We'll put a black edge or a special marker
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        # Optionally, annotate text: display 'C0' or 'C1' for class 0 or class 1
        if filename in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

# Add a legend for the cluster colors
# We'll add it by mapping each cluster label to a color patch
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")

plt.show()

Explanation / Notes

  • We used PCA to reduce to 2D, then plotted each document as a scatter point colored by its cluster_label.
  • The 15 labeled PDFs (R001-R005 are class 0, R006-R015 are class 1) are drawn with a black edge around the marker, and we label them in text as “C0” or “C1.”
  • This way, you can visually see how your labeled examples are distributed across clusters.

If you want a more advanced or non-linear reduction (like t-SNE or UMAP), just uncomment the relevant lines, and refer to reduced_tsne[:, 0] and reduced_tsne[:, 1] for x and y coordinates.


2. Visualizing Cosine Similarities

You have code that computes doc['similarity_to_nonpub'] for each document. Let’s plot a histogram (or KDE plot) of these similarity scores and highlight the 15 labeled PDFs.

import seaborn as sns

# Extract similarity scores for all docs
similarities = [doc['similarity_to_nonpub'] for doc in documents]

# Make a list of labeled similarities, plus the label for each
similarities_labeled_c0 = []
similarities_labeled_c1 = []
similarities_unlabeled = []

for i, doc in enumerate(documents):
    sim_val = doc['similarity_to_nonpub']
    filename = doc['filename']
    if filename in labeled_files_class0:
        similarities_labeled_c0.append(sim_val)
    elif filename in labeled_files_class1:
        similarities_labeled_c1.append(sim_val)
    else:
        similarities_unlabeled.append(sim_val)

plt.figure(figsize=(10, 6))

# Plot distribution with seaborn's histplot or kdeplot
sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents')
# Plot labeled class 0
sns.histplot(similarities_labeled_c0, color='red', alpha=0.8, kde=True, label='Class 0 (R001-R005)')
# Plot labeled class 1
sns.histplot(similarities_labeled_c1, color='blue', alpha=0.6, kde=True, label='Class 1 (R006-R015)')

plt.title("Cosine Similarity to Non-Publishable Centroid")
plt.xlabel("Similarity Score")
plt.ylabel("Count")
plt.legend(loc='best')
plt.show()

Explanation / Notes

  • We separate out the distributions for:
    • All documents (gray)
    • Class 0 labeled docs (red)
    • Class 1 labeled docs (blue)
  • This helps you see whether your “Non-Publishable” examples (class 0) indeed have higher similarity to the nonpub_centroid than the “Publishable” ones.

Additional Variation

  • If you prefer a KDE-only approach (smooth density curve), you can use sns.kdeplot(...) instead of sns.histplot(kde=True, ...).
  • If you want to overlay them differently, you can create separate subplots or adjust transparency.

3. More Red Flag Rules

You mentioned that having or not having a references section wasn’t always accurate. Below are some alternative heuristics:

  1. Extremely Short Paper: If the overall word count is below some threshold (e.g., < 1000 words), it might be suspicious for a full conference paper.
  2. No Mention of Baseline / Comparison: For example, check if words like “baseline,” “comparison,” or “benchmark” appear in the method/results sections. A lack of any mention might be suspicious.
  3. No Mention of Datasets / Data: Many (but not all) publishable ML papers talk about dataset(s). If the text never mentions “dataset,” “corpus,” or “data used,” that might be suspicious.
  4. No Mention of Metrics: E.g., no mention of “accuracy,” “precision,” “recall,” “auc,” “f1,” “mse,” “rmse,” etc. in the results section.
  5. Excessive Self-Citations or “I” / “we” usage in the methodology (less formal). This is more subjective.
  6. Inconsistent or Contradictory numeric results: If the results section claims two different accuracies for the same experiment. (This is more complex to implement and requires deeper NLP to parse the content.)

Below is an updated example of a “red flag” function using some of these additional heuristics:

def count_red_flags(processed_sections):
    flags = 0
    full_text = " ".join(sec_text for _, sec_text in processed_sections)
    word_count = len(full_text.split())

    # 1) Very short paper
    if word_count < 1000:
        flags += 1

    # 2) Check if baseline or comparison keywords are absent
    if ("baseline" not in full_text) and ("compare" not in full_text) and ("benchmark" not in full_text):
        flags += 1

    # 3) Check if there's any mention of data or dataset
    if ("data" not in full_text) and ("dataset" not in full_text) and ("corpus" not in full_text):
        flags += 1

    # 4) No mention of common evaluation metrics
    common_metrics = ["accuracy", "precision", "recall", "f1", "auc", "mse", "rmse"]
    if not any(metric in full_text for metric in common_metrics):
        flags += 1

    return flags

Things to Keep in Mind

  • The above rules are heuristics and can yield false positives. For instance, a theoretical paper might not mention “dataset,” or a short paper might still be legitimate (like a short workshop paper).
  • Always tailor these rules to your actual data. Adjust thresholds and keyword sets as you learn more about your corpus.

4. Visualizing Red Flag Scores

Similar to the cosine similarity distribution, you might want to plot a histogram of red_flag_score across all documents. We’ll again highlight the labeled ones.

import seaborn as sns

red_flag_scores = [doc['red_flag_score'] for doc in documents]

scores_labeled_c0 = []
scores_labeled_c1 = []
scores_unlabeled = []

for doc in documents:
    score_val = doc['red_flag_score']
    filename = doc['filename']
    if filename in labeled_files_class0:
        scores_labeled_c0.append(score_val)
    elif filename in labeled_files_class1:
        scores_labeled_c1.append(score_val)
    else:
        scores_unlabeled.append(score_val)

plt.figure(figsize=(10, 6))

sns.histplot(red_flag_scores, color='gray', alpha=0.4, binwidth=1, label='All Documents')
sns.histplot(scores_labeled_c0, color='red', alpha=0.8, binwidth=1, label='Class 0 (R001-R005)')
sns.histplot(scores_labeled_c1, color='blue', alpha=0.6, binwidth=1, label='Class 1 (R006-R015)')

plt.title("Red Flag Score Distribution")
plt.xlabel("Red Flag Score (count of triggered heuristics)")
plt.ylabel("Number of Papers")
plt.legend(loc='best')
plt.show()

This will give you a sense of how many rules each document is triggering, and whether your known class 0 or class 1 docs align with high or low red-flag counts.


5. Suggested Improvements

  1. Refine & Validate Heuristics
    • Inspect actual “publishable” vs. “non-publishable” papers to confirm that your rules reflect reality. For example, if some well-written short paper triggers 3 red flags, maybe you need to adjust thresholds.
  2. Combine Scores
    • You can combine red flag scores with clustering or similarity. For instance, mark a paper as “non-publishable” if it triggers >2 red flags and has a high similarity to a known non-publishable centroid. (A minimal sketch appears after this list.)
  3. Add Some Manual Spot-Checking
    • Because you have only 15 labeled papers, additional manual checks or partial labeling might help refine your thresholds.
  4. Consider More Advanced Dimensionality Reduction
    • t-SNE or UMAP might cluster your embeddings more intuitively than PCA for non-linear separations.
  5. Use a Small Classifier as a Final Step
    • Even though you have a small labeled set, you can do a simple logistic regression or MLP with strong regularization or cross-validation. You might combine features like “paper_embedding,” “red_flag_score,” and “similarity_to_nonpub.” Then use cross-validation on your 15 labeled docs to pick hyperparameters.
  6. Look for Contradictory Results
    • Potentially parse numeric results to see if there are contradictions (like claiming “accuracy of 90%” in one paragraph and “accuracy of 95%” in another for the same experiment). This is advanced but can be a strong heuristic.
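As referenced in item 2, here is a minimal sketch of one way to fuse the two signals into a single decision. It assumes each doc dict already carries 'red_flag_score' and 'similarity_to_nonpub' from the earlier steps; the thresholds (2 flags, 0.5 similarity) are placeholders you would tune on your 15 labeled papers:

def combined_decision(doc, flag_threshold=2, sim_threshold=0.5):
    """Flag a paper as non-publishable only when both signals agree."""
    too_many_flags = doc['red_flag_score'] > flag_threshold
    too_similar = doc['similarity_to_nonpub'] > sim_threshold
    return too_many_flags and too_similar

for doc in documents:
    doc['combined_nonpub'] = combined_decision(doc)

print(sum(d['combined_nonpub'] for d in documents), "papers flagged by the combined rule")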

Summary

  • Clustering Visualization: Use PCA/t-SNE on doc_embeddings, color by cluster, highlight labeled docs.
  • Cosine Similarity Distribution: Plot a histogram or KDE, highlight labeled docs.
  • Red Flag Heuristics: Expand your set of rules to cover more typical “publishability” signals (baselines, data usage, metrics, etc.).
  • Red Flag Distribution: Similarly, plot a histogram of how many heuristics each paper triggers and see if your labeled examples match your intuition.
  • Refinement: Iteratively refine thresholds and rules after examining outliers.

By combining these visualizations, you’ll get a much better understanding of how your dataset is structured, how your known “non-publishable” vs. “publishable” examples fit in, and which heuristics or similarity thresholds might be most effective.

Question

[Attached images: clusters.png, image.png]
Using the code you gave for visualizing the clusters with PCA and for visualizing the cosine similarity, the graph seems to show all the non-publishable papers but only a subset of the 10 publishable papers. You can see in the image that only 4 out of 10 publishable papers are visualized in clusters.png, and I think the same is true of the histplot. Why is that? Can you rectify this and give me the correct code?

Answer

The issue you're describing most likely stems from how the labeled files (R001 to R015) are matched against the filenames in your documents list. If there's any mismatch in how the filenames are constructed or stored (e.g., an f-string that zero-pads incorrectly so that R010-R015 become "R0010.pdf"-"R0015.pdf", missing .pdf extensions, inconsistent capitalization, or whitespace issues), some of the labeled files will not be identified and only a subset gets highlighted. The fact that exactly 4 publishable papers appear (consistent with R006-R009 matching and R010-R015 not) points to a padding bug of this kind.

Here’s a step-by-step rectification and improved code to ensure all 15 labeled files are visualized properly:


Step 1: Verify File Matching Logic

Ensure that the filenames for your labeled files are correctly matched against the filenames in your documents list. Normalize both sets of filenames to avoid mismatches.

# Define labeled files with consistent, zero-padded formatting
# (note: f"R00{i}.pdf" would produce "R0010.pdf" for i >= 10, which never matches)
labeled_files_class0 = {f"R{i:03d}.pdf" for i in range(1, 6)}    # R001-R005
labeled_files_class1 = {f"R{i:03d}.pdf" for i in range(6, 16)}   # R006-R015

# Normalize filenames in the documents list for comparison
for doc in documents:
    doc['normalized_filename'] = doc['filename'].strip().lower()  # Normalize filenames

# Ensure the labeled sets are also normalized
labeled_files_class0 = {f.strip().lower() for f in labeled_files_class0}
labeled_files_class1 = {f.strip().lower() for f in labeled_files_class1}

Step 2: Correct Clustering Visualization Code

Update the clustering visualization code to ensure all labeled files are highlighted properly:

# PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)

# Extract coordinates
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

# Create figure
plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)

plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['normalized_filename']
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        # Label the points
        if filename in labeled_files_class0:
            label_text = "C0"
        elif filename in labeled_files_class1:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

# Add legend for clusters
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")

plt.show()

Step 3: Correct Cosine Similarity Visualization Code

Update the cosine similarity plot to ensure all labeled files are included:

# Extract cosine similarity values
similarities = [doc['similarity_to_nonpub'] for doc in documents]

# Categorize similarities
similarities_labeled_c0 = []
similarities_labeled_c1 = []

for doc in documents:
    filename = doc['normalized_filename']
    sim_val = doc['similarity_to_nonpub']
    if filename in labeled_files_class0:
        similarities_labeled_c0.append(sim_val)
    elif filename in labeled_files_class1:
        similarities_labeled_c1.append(sim_val)

plt.figure(figsize=(10, 6))

# Plot overall distribution
sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents')

# Highlight class 0
sns.histplot(similarities_labeled_c0, kde=True, color='red', alpha=0.8, label='Class 0 (R001-R005)')

# Highlight class 1
sns.histplot(similarities_labeled_c1, kde=True, color='blue', alpha=0.6, label='Class 1 (R006-R015)')

plt.title("Cosine Similarity to Non-Publishable Centroid")
plt.xlabel("Similarity Score")
plt.ylabel("Count")
plt.legend(loc='best')
plt.show()

Step 4: Debugging Mismatches

If any of the 15 labeled files still fail to appear in the plots, print the filenames for debugging:

# Check which labeled files are not being matched
missing_files_c0 = labeled_files_class0 - {doc['normalized_filename'] for doc in documents}
missing_files_c1 = labeled_files_class1 - {doc['normalized_filename'] for doc in documents}

print("Missing Class 0 Files:", missing_files_c0)
print("Missing Class 1 Files:", missing_files_c1)

Ensure the filenames in the labeled lists (labeled_files_class0, labeled_files_class1) exactly match the filenames in documents.


Explanation of Changes

  1. Normalized Filenames: Ensured consistency by normalizing all filenames to lowercase and stripping whitespace before matching.
  2. Debugging Missing Matches: Added code to detect and print any labeled files that aren’t matched.
  3. Updated Visualization: Plots now handle all 15 labeled files and highlight them appropriately.

Let me know if the updated code resolves the issue or if further adjustments are needed!

Question

Can you adjust the previous visualization code to highlight and visualize the documents where maybe_nonpub == True, so I can see how they are distributed in the clusters and in the cosine similarity distribution? Make sure it works without errors.

Answer

Here's how you can adjust the previous visualization codes to highlight the documents where maybe_nonpub == True. This involves updating both the clustering visualization and the cosine similarity distribution visualization.


1. Clustering Visualization with maybe_nonpub Highlighting

We’ll add a special highlight for documents marked as maybe_nonpub == True in the clustering plot. These will be shown with a distinct edge color (e.g., orange).

# PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)

# Extract coordinates
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

# Create figure
plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)

plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['normalized_filename']

    # Highlight labeled documents (Class 0 and Class 1)
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        elif filename in labeled_files_class1:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

    # Highlight maybe_nonpub documents with a distinct edge color
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange',
                    marker='s', s=120, linewidths=1.5)

# Add legend for clusters
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")

plt.show()

Explanation:

  1. Black Circles: Highlight the labeled documents (R001 to R015) with class annotations (C0 and C1).
  2. Orange Squares: Highlight documents where maybe_nonpub == True with orange square edges.

2. Cosine Similarity Distribution with maybe_nonpub Highlighting

We’ll adjust the cosine similarity plot to highlight the distribution of documents where maybe_nonpub == True in addition to the labeled documents.

# Extract cosine similarity values
similarities = [doc['similarity_to_nonpub'] for doc in documents]

# Categorize similarities
similarities_labeled_c0 = []
similarities_labeled_c1 = []
similarities_maybe_nonpub = []

for doc in documents:
    filename = doc['normalized_filename']
    sim_val = doc['similarity_to_nonpub']
    if filename in labeled_files_class0:
        similarities_labeled_c0.append(sim_val)
    elif filename in labeled_files_class1:
        similarities_labeled_c1.append(sim_val)
    if doc.get('maybe_nonpub', False):
        similarities_maybe_nonpub.append(sim_val)

plt.figure(figsize=(10, 6))

# Plot overall distribution
sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents')

# Highlight class 0
sns.histplot(similarities_labeled_c0, kde=True, color='red', alpha=0.8, label='Class 0 (R001-R005)')

# Highlight class 1
sns.histplot(similarities_labeled_c1, kde=True, color='blue', alpha=0.6, label='Class 1 (R006-R015)')

# Highlight maybe_nonpub
sns.histplot(similarities_maybe_nonpub, kde=True, color='orange', alpha=0.5, label='Maybe Non-Publishable')

plt.title("Cosine Similarity to Non-Publishable Centroid")
plt.xlabel("Similarity Score")
plt.ylabel("Count")
plt.legend(loc='best')
plt.show()

Explanation:

  1. Gray Distribution: All documents.
  2. Red and Blue Distributions: Labeled Class 0 (R001-R005) and Class 1 (R006-R015), respectively.
  3. Orange Distribution: Documents marked as maybe_nonpub == True.

Testing and Debugging

To ensure the plots work without errors:

  1. Check that every document in documents has a normalized filename (normalized_filename) and the maybe_nonpub attribute (default to False if not already present).
  2. If errors persist, print out examples of documents missing expected attributes:
    for doc in documents:
        if 'normalized_filename' not in doc or 'maybe_nonpub' not in doc:
            print("Missing attributes in:", doc)

Let me know if the updated code highlights the maybe_nonpub documents correctly and if further refinements are needed!

Question

# Step 0: Install dependencies (if needed)
!pip install pymupdf sentence-transformers

import time
import warnings
warnings.filterwarnings('ignore')
import os
import fitz
import re
import torch
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from collections import Counter
import pickle
import json
import pandas as pd
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'all-mpnet-base-v2' #'all-mpnet-base-v2' #all-distilroberta-v1 #allenai/scibert_scivocab_uncased #t5-base #sentence-transformers/all-roberta-large-v1
embed_model = SentenceTransformer(model_name, device=device)

# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')


# Preprocessing

############################################
# Step 1: PDF Extraction with PyMuPDF
############################################
def extract_text_pymupdf(pdf_path):
    """
    Use PyMuPDF to extract text from each page,
    returning a single string for the entire PDF.
    """
    doc_text = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            page_text = page.get_text("text")
            doc_text.append(page_text)
    return "\n".join(doc_text)

############################################
# Step 2: Strict Section Splitting with Unified Headings
############################################
# Regex for section headings that appear on a line by themselves (case-insensitive):
# Optional leading numbering, e.g. "1. Introduction"
# Then one of the words: abstract, introduction, background, related work, method(s), methodology,
# experiments, results, discussion, conclusion(s), references
# The line must end right after the heading text.
heading_pattern = re.compile(
r"^(?:\d+\.\s*)?"
r"(abstract|introduction|background|related\s*work|methods?|methodology|experiments|results|discussion|conclusions?|references)\s*$",
re.IGNORECASE
)

def unify_heading(heading_str):
    """
    Convert certain headings to a unified label:
    - methods, method, methodology -> Methods
    - conclusions, conclusion -> Conclusion
    - all else -> capitalized heading (e.g. 'Introduction', 'Results', etc.)
    """
    heading_lower = heading_str.lower()

    if heading_lower in {"methods", "method", "methodology"}:
        return "Methods"
    elif heading_lower in {"conclusion", "conclusions"}:
        return "Conclusion"
    else:
        # Capitalize first letter, rest lower
        # e.g. "introduction" -> "Introduction", "results" -> "Results"
        return heading_lower.capitalize()

def split_into_strict_sections(full_text):
    """
    Splits text by newlines. Then checks if the *entire* line
    matches typical headings. If it does, we start a new section.
    Otherwise, accumulate lines in the current section.
    """
    sections = []
    current_section_name = "Unlabeled"
    current_lines = []

    # Split on newline
    lines = full_text.split('\n')
    for line in lines:
        line_stripped = line.strip()
        match = heading_pattern.match(line_stripped.lower())
        if match:
            # Save previous lines under the old section
            if current_lines:
                sections.append((current_section_name, "\n".join(current_lines)))

            # The matched text (e.g. 'methods', 'introduction', etc.)
            matched_text = match.group(1)
            # Normalize it
            section_name = unify_heading(matched_text)

            current_section_name = section_name
            current_lines = []
        else:
            current_lines.append(line_stripped)

    # Append leftover lines to the final section
    if current_lines:
        sections.append((current_section_name, "\n".join(current_lines)))

    return sections

############################################
# Step 3: Text Preprocessing
############################################
def custom_preprocess(text):
    """
    Light text cleaning:
    - Lowercase
    - Remove extraneous punctuation (but keep basic . , % etc.)
    - Remove stopwords
    """
    text = text.lower()
    # Adjust regex to keep or remove punctuation as desired
    text = re.sub(r'[^a-z0-9%\=$$\.\,\s]', '', text)

    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]

    return ' '.join(tokens)

############################################
# Step 4: Recursive PDF Finder
############################################
def find_all_pdfs(root_dir):
    pdf_paths = []
    for root, _, files in os.walk(root_dir):
        for file in files:
            if file.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(root, file))
    return pdf_paths

############################################
# Step 5: Main Pipeline
############################################
def process_pdfs_strict_unified(root_folder):
    """
    1) Recursively find all PDFs
    2) Extract text with PyMuPDF
    3) Strictly split into sections
    4) Normalize headings (e.g., 'Methodology' -> 'Methods')
    5) Preprocess each section
    6) Return a list of dicts
    """
    pdf_paths = find_all_pdfs(root_folder)
    documents = []

    start_time = time.time()
    for pdf_path in pdf_paths:
        raw_text = extract_text_pymupdf(pdf_path)
        sections = split_into_strict_sections(raw_text)

        processed_sections = []
        for (sec_name, sec_text) in sections:
            clean_text = custom_preprocess(sec_text)
            processed_sections.append((sec_name, clean_text))

        doc_info = {
            'filename': os.path.basename(pdf_path),
            'sections': processed_sections
        }
        documents.append(doc_info)

    end_time = time.time()
    print(f"Processed {len(documents)} PDFs in {end_time - start_time:.2f} seconds.")
    return documents

############################################
# Step 6: Example Usage (Convert to DataFrame + Save as CSV)
############################################
root_folder = "/kaggle/input/kdsh-2025-dataset/KDSH_2025_Dataset" # Modify as needed
documents = process_pdfs_strict_unified(root_folder)

# Convert the documents list into a DataFrame
df = pd.DataFrame([
    {
        'filename': doc['filename'],
        'section_name': section[0],  # e.g. "Methods", "Introduction", "Conclusion"
        'section_text': section[1]
    }
    for doc in documents
    for section in doc['sections']
])

# Save the DataFrame as a CSV file
df.to_csv("documents_strict_unified.csv", index=False)
print("Saved dataset as documents_strict_unified.csv")

df.section_name.unique()

# Embedding

def get_document_embedding(processed_sections):
    section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()]
    if not section_texts:
        return np.zeros(768)
    embeddings = embed_model.encode(section_texts, convert_to_numpy=True)
    return np.mean(embeddings, axis=0)

def count_red_flags(processed_sections):
    flags = 0
    full_text = " ".join(sec_text for _, sec_text in processed_sections)
    word_count = len(full_text.split())

    # Example rule: 100% accuracy
    if '100% accuracy' in full_text or '100 percent accuracy' in full_text:
        flags += 2

    # Example rule: short methodology section
    for sec_name, sec_text in processed_sections:
        if 'method' in sec_name.lower():
            if len(sec_text.split()) < 50:
                flags += 1

    if word_count < 500:
        flags += 1

    # recipe is a red flag.

    # 2) Check if baseline or comparison keywords are absent
    if ("baseline" not in full_text) and ("compare" not in full_text) and ("benchmark" not in full_text):
        flags += 1

    # 3) Check if there's any mention of data or dataset
    if ("data" not in full_text) and ("dataset" not in full_text) and ("corpus" not in full_text):
        flags += 1

    # 4) No mention of common evaluation metrics
    common_metrics = ["accuracy", "precision", "recall", "f1", "auc", "mse", "rmse"]
    if not any(metric in full_text for metric in common_metrics):
        flags += 1

    return flags


# Compute embeddings for each doc
doc_embeddings = []
for doc in documents:
    emb = get_document_embedding(doc['sections'])
    doc_embeddings.append(emb)
doc_embeddings = np.array(doc_embeddings)

# KMeans clustering
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(doc_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = int(labels[i])
    doc['red_flag_score'] = count_red_flags(doc['sections'])

cluster_counts = Counter(labels)
print("Cluster distribution:", cluster_counts)

# Suppose we know the filenames of the 5 known non-publishable
known_nonpub_filenames = ["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"]
known_nonpub_embeddings = []

for doc in documents:
    if doc['filename'] in known_nonpub_filenames:
        known_nonpub_embeddings.append(get_document_embedding(doc['sections']))

nonpub_centroid = np.mean(known_nonpub_embeddings, axis=0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for doc in documents:
    emb = get_document_embedding(doc['sections'])
    sim = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = float(sim)

# Example threshold for suspicious
similarity_threshold = 0.5
red_flag_threshold = 1

for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold)




import pandas as pd

# Convert the list of dictionaries into a DataFrame without splitting sections
expanded_df = pd.DataFrame(documents)



expanded_df[expanded_df.filename.isin(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"])]

import pandas as pd

# Sample data for demonstration (expanded_df should be your actual DataFrame)
# expanded_df = pd.DataFrame(...)

# Filter rows where 'filename' column equals values from 'R001.pdf' to 'R015.pdf'
labeled_df = expanded_df[expanded_df['filename'].isin([f'R{str(i).zfill(3)}.pdf' for i in range(1, 16)])]

# Display the filtered rows
labeled_df.sort_values(by='filename')

Based on the cosine similarity values, the lowest value for Class 0 is 0.598972 and the highest value for Class 1 is 0.2980.
That is a gap of about 0.3, so we can set the decision boundary at 0.45; if we want to be even more conservative, we can raise it to 0.50, which yields fewer false positives.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# We have:
# doc_embeddings (shape: [num_documents, embedding_dim])
# documents (a list of dicts with 'filename' and 'cluster_label')

# Let's define our labeled sets
labeled_files_class0 = set(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"]) # R001 to R005
labeled_files_class1 = set(["R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf","R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"]) # R006 to R015

# (2) Alternatively, you can try t-SNE (comment out PCA if you want this)
tsne = TSNE(n_components=2, random_state=42)
reduced_tsne = tsne.fit_transform(doc_embeddings)
#
x_coords = reduced_tsne[:, 0]
y_coords = reduced_tsne[:, 1]


# Create figure
plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)

plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['filename']

    # Highlight labeled documents (Class 0 and Class 1)
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        elif filename in labeled_files_class1:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

    # Highlight maybe_nonpub documents with a distinct edge color
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange', marker='s', s=120, linewidths=1.5)

# Add legend for clusters
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")

plt.show()

# PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)

# Extract coordinates
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

# Create figure
plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)

plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['filename']

    # Highlight labeled documents (Class 0 and Class 1)
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        elif filename in labeled_files_class1:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

    # Highlight maybe_nonpub documents with a distinct edge color
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange', marker='s', s=120, linewidths=1.5)

# Add legend for clusters
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")

plt.show()

# Extract cosine similarity values
import seaborn as sns
similarities = [doc['similarity_to_nonpub'] for doc in documents]

# Categorize similarities
similarities_labeled_c0 = []
similarities_labeled_c1 = []
similarities_maybe_nonpub = []

for doc in documents:
    filename = doc['filename']
    sim_val = doc['similarity_to_nonpub']
    if filename in labeled_files_class0:
        similarities_labeled_c0.append(sim_val)
    elif filename in labeled_files_class1:
        similarities_labeled_c1.append(sim_val)
    if doc.get('maybe_nonpub', False):
        similarities_maybe_nonpub.append(sim_val)

plt.figure(figsize=(10, 6))

# Plot overall distribution
sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents')

# Highlight class 0
sns.histplot(similarities_labeled_c0, kde=True, color='red', alpha=0.8, label='Class 0 (R001-R005)')

# Highlight class 1
sns.histplot(similarities_labeled_c1, kde=True, color='blue', alpha=0.6, label='Class 1 (R006-R015)')

# Highlight maybe_nonpub
sns.histplot(similarities_maybe_nonpub, kde=True, color='orange', alpha=0.5, label='Maybe Non-Publishable')

plt.title("Cosine Similarity to Non-Publishable Centroid")
plt.xlabel("Similarity Score")
plt.ylabel("Count")
plt.legend(loc='best')

plt.show()

Using the code you gave earlier, after some modifications and better preprocessing of my own, I'm now able to get good results. The model can tell with high confidence that there are 37 non-publishable papers among the 150 papers we have. But this is purely using the cosine similarity; I have not included the rule-based methods for maybe_nonpub. This is because I felt the rules weren't representative or good enough yet, so I will think further, analyse the papers, and add better rules down the line. For now, let's ignore it.

The model already does a good job using the similarity measure. It currently takes the centroid of the known non-publishable papers and then calculates the similarity against that centroid. We get a lot of papers with high similarity, so I was wondering whether I should add those high-similarity papers to the known_nonpub_filenames list, recalculate the nonpub_centroid, and then recompute the similarity against this new centroid. I could repeat this process until I can no longer classify anything with high similarity. Is this a viable approach? Think deeply and give your thoughts on it. Then give a code implementation for it and tell me where I should paste it.

Then you also suggested this below approach:
"Even though you have a small labeled set, you can do a simple logistic regression or MLP with strong regularization or cross-validation. You might combine features like “paper_embedding,” and “similarity_to_nonpub.” Then use cross-validation on your 15 labeled docs to pick hyperparameters.
**Small Classifier**: If time permits, you can still attempt a small MLP on top of embeddings, but be wary of overfitting with only 15 labeled examples."
I like this idea and would love to hear more about it: why it might be a good idea, its potential upsides and downsides, and then the code implementation. I would like it to work in conjunction with the existing code and complement or aid in finding the non-publishable papers.

Until now, I was using K-Means for clustering and it has worked quite well. But you also said that hierarchical clustering or HDBSCAN can discover an arbitrary number of clusters and potentially identify outliers.
Can you think through this and guide me on it? Should I change K-Means to something else, and would that improve performance?

I want you to take all this context and think deeply and address my queries and respond correctly.

Give two sets of end-to-end code. The first should keep the existing code with additional improvements, perhaps using an MLP as a secondary classifier to aid and complement it, plus an iterative K-Means implementation based on the similarity. The second should try out different approaches, such as HDBSCAN (or whatever is best) instead of K-Means, or hierarchical clustering, or an MLP-based classifier, along with other improvements as well.

Thought

Thought for 7 seconds

Answer

Below is a detailed answer addressing your specific questions:

  1. Is it viable to iteratively update the non-publishable centroid by adding newly identified “high-similarity” papers to the known non-publishable list?
  2. Why might a small MLP (or logistic regression) classifier be useful, and how do we integrate it with the existing code?
  3. Should we replace K-Means with something like Hierarchical Clustering or HDBSCAN, and will that likely improve performance?

Finally, I provide two complete code examples (so you can simply paste them into your workflow):

  • Set 1: The first code keeps your K-Means + centroid-based similarity approach, adds an iterative centroid expansion (semi-supervised “self-labeling”) step, and then demonstrates a small MLP for classification, complementing the similarity approach.
  • Set 2: The second code tries HDBSCAN for outlier/cluster discovery instead of K-Means, and also shows an alternative logistic regression or MLP approach—just enough to illustrate how you might integrate another clustering approach and a classifier.

1. Iterative Centroid Expansion: Pros and Cons

Pros

  1. Simple Semi-Supervised Method: By adding unlabeled documents that have high similarity to the known “non-publishable” centroid, you effectively enlarge your labeled “non-publishable” pool.
  2. Refined Centroid: Each iteration recalculates the centroid with the newly added documents, potentially capturing more nuanced aspects of “non-publishable” text.
  3. Fewer False Negatives: If your initial set is correct, you are less likely to miss documents that are truly “non-publishable.”

Cons

  1. Risk of Error Propagation: If a paper is mistakenly added to the “non-publishable” pool, it can shift the centroid incorrectly and cause more incorrect classifications.
  2. Subjectivity of Similarity Threshold: You must set a similarity threshold. If it’s too low, you’ll add documents too eagerly; if it’s too high, you might ignore borderline papers.

Practical Tip: Start with a higher threshold (e.g., > 0.60) to keep false positives low. Then you can slowly lower the threshold if you want to catch borderline cases, but watch for error drift.


2. Small MLP or Logistic Regression

Why Might This Help?

  1. Combining Multiple Features: A simple centroid-based similarity is only one dimension. If you also incorporate features such as “embedding vectors,” “similarity_to_nonpub,” or “red_flag_score,” your classifier can learn a more robust boundary.
  2. Cross-Validation: Even with 15 labeled examples, you can do cross-validation (like 5-fold or leave-one-out) to help prevent overfitting and select hyperparameters.
  3. Capture Non-Linear Boundaries: An MLP can model more complex decision surfaces than a linear threshold on similarity.

Downsides

  1. Risk of Overfitting: 15 labeled examples is very few.
  2. You Must Carefully Validate: Rely on cross-validation or you won’t know if you’re just memorizing the 15 examples.

3. Alternatives to K-Means: HDBSCAN or Hierarchical

  • HDBSCAN: A density-based clustering algorithm that can identify outliers as their own cluster or leave them as unassigned. This might catch unusual documents that do not fit into your main “publishable” clusters.
  • Hierarchical Clustering: Can give you a dendrogram and you can decide the cut. But it can be memory-intensive for large datasets.
  • Will It Improve Performance? Possibly, if your data truly forms irregular clusters or if you care about identifying “noise/outlier” documents specifically. However, K-Means is simpler and often works well for dense embeddings. It’s worth an experiment if you have time.

Set 1: End-to-End Code (Iterative Centroid + K-Means + MLP)

Below is a consolidated code block illustrating:

  1. Iterative centroid expansion for “non-publishable” detection.
  2. An MLP that uses the paper embeddings plus similarity_to_nonpub as input features.

Where to paste: Ideally, after you’ve computed doc_embeddings, similarity_to_nonpub, and maybe_nonpub for each document. This can replace or supplement the final classification steps.

###############################################################################
# SET 1: Iterative Centroid Expansion + K-Means + MLP
###############################################################################
import numpy as np
from copy import deepcopy
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

###############################################################################
# 1. Iterative Centroid Expansion
###############################################################################
def iterative_nonpub_expansion(documents, known_nonpub_filenames,
                               similarity_threshold=0.60, max_iterations=5):
    """
    Iteratively expands the known non-publishable list by adding any documents
    with similarity > similarity_threshold. Then recalculates centroid.

    Arguments:
    - documents: list of dicts containing doc['filename'], doc['similarity_to_nonpub'], etc.
    - known_nonpub_filenames: initial set/list of known non-publishable filenames (strings).
    - similarity_threshold: similarity above which a doc is considered "non-publishable."
    - max_iterations: maximum number of expansions to run.

    Returns:
    - updated_nonpub_filenames: final set of non-publishable filenames after expansion
    """
    updated_nonpub_filenames = set(known_nonpub_filenames)
    iteration = 0

    while iteration < max_iterations:
        iteration += 1

        # Calculate new centroid from the current set
        nonpub_embeddings = []
        for doc in documents:
            if doc['filename'] in updated_nonpub_filenames:
                emb = doc['doc_embedding']  # we store doc_embedding for convenience
                nonpub_embeddings.append(emb)
        new_centroid = np.mean(nonpub_embeddings, axis=0)

        # Add docs that exceed the threshold
        newly_added = False
        for doc in documents:
            if doc['filename'] not in updated_nonpub_filenames:
                # compute similarity to new centroid
                emb = doc['doc_embedding']
                sim = cosine_similarity(emb, new_centroid)
                if sim > similarity_threshold:
                    updated_nonpub_filenames.add(doc['filename'])
                    newly_added = True

        if not newly_added:
            print(f"No new docs added in iteration {iteration}. Stopping.")
            break
        else:
            print(f"Iteration {iteration} added new docs. "
                  f"Current size of non-pub set: {len(updated_nonpub_filenames)}.")

    return updated_nonpub_filenames

# We'll assume you already have doc['doc_embedding'] = get_document_embedding(...)
# To store doc_embedding in your existing pipeline, do something like:
# for i, doc in enumerate(documents):
#     doc['doc_embedding'] = doc_embeddings[i]

# 1.1: Store embeddings in each doc for convenience
for i, doc in enumerate(documents):
    doc['doc_embedding'] = doc_embeddings[i]

# 1.2: Perform initial similarity calculation (already done in your code, typically)
# for doc in documents:
#     sim = cosine_similarity(doc['doc_embedding'], nonpub_centroid)
#     doc['similarity_to_nonpub'] = float(sim)

# 1.3: Iteratively expand
known_nonpub_initial = ["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"]
expanded_nonpub_filenames = iterative_nonpub_expansion(documents,
                                                       known_nonpub_initial,
                                                       similarity_threshold=0.60,
                                                       max_iterations=5)
print("Final set of non-publishable (expanded) = ", expanded_nonpub_filenames)

###############################################################################
# 2. K-Means (Already in your code)
###############################################################################
# We'll keep your existing KMeans step. No changes needed unless you want to integrate
# the newly expanded non-publishable set in some way (like removing them from K-Means, etc.).
#
# For reference:
# labels = kmeans.fit_predict(doc_embeddings)
# for i, doc in enumerate(documents):
#     doc['cluster_label'] = int(labels[i])

###############################################################################
# 3. Small MLP for classification
###############################################################################
# We'll build a tiny dataset from the 15 labeled documents you have
# (including your newly expanded nonpub set if you like).
# Then we do cross-validation to evaluate how well it classifies
# publishable (class 1) vs non-publishable (class 0).

from sklearn.model_selection import StratifiedKFold

# Let's define your 15 labeled docs (or any subset you trust):
nonpub_set_c0 = set(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"])
pub_set_c1 = set(["R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                  "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"])

X = []
y = []
for doc in documents:
    fname = doc['filename']
    if fname in nonpub_set_c0:
        X.append(doc['doc_embedding'])
        y.append(0)
    elif fname in pub_set_c1:
        X.append(doc['doc_embedding'])
        y.append(1)

X = np.array(X)
y = np.array(y)

# If we want to include similarity as an extra feature:
# We'll combine doc_embedding with similarity_to_nonpub
# Just do a quick horizontal stack:
X_aug = []
for doc in documents:
    # but we only want those that are in our labeled set
    fname = doc['filename']
    if fname in nonpub_set_c0 or fname in pub_set_c1:
        embed_768 = doc['doc_embedding']
        sim_val = doc['similarity_to_nonpub']  # a single float
        # create a feature vector of size 769
        new_vec = np.concatenate([embed_768, [sim_val]])
        X_aug.append(new_vec)

# We must track y in the same order
y_aug = []
for doc in documents:
    fname = doc['filename']
    if fname in nonpub_set_c0:
        y_aug.append(0)
    elif fname in pub_set_c1:
        y_aug.append(1)

X_aug = np.array(X_aug)
y_aug = np.array(y_aug)

# We'll train an MLP on X_aug, y_aug
mlp = MLPClassifier(hidden_layer_sizes=(64,), random_state=42, max_iter=1000)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(mlp, X_aug, y_aug, cv=skf, scoring='accuracy')
print(f"5-Fold CV Accuracy for MLP: {scores.mean():.2f} +/- {scores.std():.2f}")

mlp.fit(X_aug, y_aug)

# Now we can apply this MLP to all 150 documents
# They each get a predicted label: 0 = non-publishable, 1 = publishable
for doc in documents:
    embed_768 = doc['doc_embedding']
    sim_val = doc['similarity_to_nonpub']
    test_vec = np.concatenate([embed_768, [sim_val]])
    pred_label = mlp.predict(test_vec.reshape(1, -1))[0]
    doc['mlp_pred_label'] = int(pred_label)

# doc['mlp_pred_label'] == 0 => non-publishable, 1 => publishable

Where to Paste This in Your Code

  • Step 1 (Iterative expansion) should come after you’ve computed doc_embeddings and the initial nonpub_centroid.
  • Step 2 is just your existing K-Means code—no big change required.
  • Step 3 (MLP) can go at the very end, after you have doc_embeddings and doc['similarity_to_nonpub'].

Set 2: Alternative Clustering (HDBSCAN) + (Optionally) MLP

This second code snippet replaces K-Means with HDBSCAN or Hierarchical Clustering. HDBSCAN requires installing the library:

!pip install hdbscan

We’ll also show how to do the MLP in the same manner. (You can remove the MLP part if you only want to test clustering.)

Where to paste: Replace your K-Means block with the HDBSCAN block. Keep the rest of your pipeline the same.

###############################################################################
# SET 2: Using HDBSCAN (Outlier Detection) + Optional MLP
###############################################################################
!pip install hdbscan

import hdbscan

# 1) HDBSCAN to cluster doc_embeddings
hdb = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3, metric='euclidean')
hdb_labels = hdb.fit_predict(doc_embeddings)

# We can store these cluster labels in documents
for i, doc in enumerate(documents):
    doc['hdb_cluster_label'] = int(hdb_labels[i])  # -1 means outlier/no cluster

# Let's see how many are labeled as outliers
outlier_count = sum(1 for lbl in hdb_labels if lbl == -1)
print(f"Number of outliers (no cluster assigned): {outlier_count}")

# 2) Inspect cluster distribution
label_counts = Counter(hdb_labels)
print("HDBSCAN Cluster Distribution:", label_counts)

# 3) If you want a quick visualization:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10, 7))
scatter = plt.scatter(x_coords, y_coords, c=hdb_labels, cmap='tab10', alpha=0.6)
plt.title("HDBSCAN Clusters (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# highlight your known labeled docs
labeled_files_class0 = set(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"])
labeled_files_class1 = set(["R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                            "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"])

for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0 or doc['filename'] in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if doc['filename'] in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

plt.show()

# 4) Optional: Train an MLP or logistic regression using same approach as above
# Build labeled X, y from the 15 known docs
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_aug = []
y_aug = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0 or fname in labeled_files_class1:
        # Create feature vector from the doc embedding + doc['similarity_to_nonpub']
        embed_768 = doc_embeddings[i]
        sim_val = doc['similarity_to_nonpub']
        new_vec = np.concatenate([embed_768, [sim_val]])
        X_aug.append(new_vec)
        y_aug.append(0 if fname in labeled_files_class0 else 1)

X_aug = np.array(X_aug)
y_aug = np.array(y_aug)

mlp = MLPClassifier(hidden_layer_sizes=(64,), random_state=42, max_iter=1000)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(mlp, X_aug, y_aug, cv=skf, scoring='accuracy')
print(f"[HDBSCAN] 5-Fold CV Accuracy for MLP: {scores.mean():.2f} +/- {scores.std():.2f}")

mlp.fit(X_aug, y_aug)

for i, doc in enumerate(documents):
    # For each doc, build the same feature vector and predict
    embed_768 = doc_embeddings[i]
    sim_val = doc['similarity_to_nonpub']
    test_vec = np.concatenate([embed_768, [sim_val]])
    pred_label = mlp.predict(test_vec.reshape(1, -1))[0]
    doc['mlp_pred_label'] = int(pred_label)

print("Done: HDBSCAN + MLP classification.")

Note:

  • With HDBSCAN, -1 indicates an outlier (i.e., no cluster). You can treat outliers as “suspect” if you suspect unusual papers might be “non-publishable,” or you can do a separate check.
  • You can still incorporate the iterative centroid approach if you like, or skip it in this second snippet.

Which Approach Is “Best”?

  1. Iterative centroid + K-Means is simpler to conceptualize, especially with a known set of non-publishable examples.
  2. HDBSCAN can automatically detect outliers (possible “non-publishable?”) and form an arbitrary number of clusters.
  3. MLP or logistic regression classification is valuable if you want a single “publishable vs. non-publishable” output that can blend multiple features (embedding, similarity, red flags).

Ultimately, it’s best to compare them on your small labeled set (the 15 papers). Whichever approach consistently classifies them correctly (via cross-validation) is typically the safest bet to scale to the 150-paper dataset.


Final Takeaways

  1. Iterative centroid refinement is a valid semi-supervised approach but be cautious about error drift.
  2. MLP or logistic regression can fuse multiple signals (embedding, similarity, later on red flags) into a single decision. Use cross-validation carefully to avoid overfitting.
  3. Switching from K-Means to HDBSCAN might help discover outliers or irregular cluster shapes, but it’s not guaranteed to be a silver bullet. It’s often worth an experiment.
  4. Rule-based signals can still be added later once you refine them. They can even be additional features for your classifier.

Feel free to integrate the above code snippets wherever they fit best in your existing pipeline. Good luck!

Question

So I ran your code above. Set 2 failed miserably: HDBSCAN performed badly for this task, and the MLP classifier overfit as well, so I dropped that.

Then I tried Set 1. I thought the iterative K-Means idea was good, but the final nonpub_centroid ended up producing a false negative and a bunch of false positives, making the boundaries between the classes even worse and the performance laughable. It also classified only 34 papers as non-publishable, and that count includes mistakes.

It seems my initial model, before these code improvements and suggestions, was much better: it predicted no false positives and perfectly captured the true positives and true negatives, classifying 38/150 documents as non-publishable. But there are even more non-publishable documents that the model hasn't learnt to predict yet. Basically it still has a lot of false negatives to work on, but that's its only issue.

I feel the similarity measure is a good baseline, as it classified correctly at my chosen threshold. But it doesn't catch all of the non-publishable papers.

How do I interpret the similarity measure graph? Does a bigger gap between the publishable and non-publishable classes in terms of similarity indicate that the model is good? But what if 75 percent of the data is under the threshold? Though we get no false positives, there are many more false negatives. Explain how I should go about interpreting and understanding the similarity measure plot: not only the difference between the classes but also the concentration of papers at different similarity values.

When I try a 2D projection of the K-Means clusters using PCA and t-SNE, I can see that all the non-publishable papers are grouped together in one cluster and the surrounding points have been selected as maybe_nonpub. But some of the surrounding points are still left out. How do I improve the false negative rate?

When I try different transformers, I get different similarity measures and thresholds. How do I decide which transformer is best in this regard?

Think deeply about everything and answer.

Thought

Thought for 6 seconds

Answer

Below is a deeper discussion of the issues you’re encountering, how to interpret your similarity-based approach, and how you might reduce false negatives (while keeping false positives in check). I’ll also address how you can evaluate different transformers to pick one that best separates publishable and non-publishable papers.


1. Interpreting the Similarity Measure Plot

When you calculate the cosine similarity between your paper embeddings and the “non-publishable centroid,” you typically end up with a distribution of similarity scores (e.g., from 0.0 to 1.0). You may also separate out the scores for:

  • Known Non-Publishable examples (Class 0)
  • Known Publishable examples (Class 1)
  • Unlabeled examples

1.1 Gap or Separation Between Classes

  • Large Gap is Usually Good: A big difference in similarity between known non-publishable and known publishable means your embedding space is effectively separating these classes. If your known Class 0 docs all have similarity around 0.6–0.8, while known Class 1 docs are around 0.2–0.3, that’s a strong signal that your chosen embedding model is capturing relevant distinctions.
  • Trade-Off: You still have to choose a threshold that splits the unlabeled documents into “likely non-publishable” vs. “likely publishable.”

1.2 Balancing False Positives vs. False Negatives

  • High Threshold:
    • Fewer false positives (i.e., you rarely label a truly publishable paper as “non-publishable”).
    • Potentially more false negatives (you miss some actual non-publishable papers because the similarity was 0.58 while your threshold is 0.60).
  • Lower Threshold:
    • You’ll catch more real non-publishable papers (fewer false negatives).
    • But you may start to label some publishable papers as “non-publishable” (increase in false positives).

1.3 Concentration of Papers at Different Similarities

You mentioned that “75% of data is under the threshold” and are thus classified as publishable. This is typical if your threshold is somewhat high. The question is: Are you comfortable with that many being labeled publishable if it means you avoid false positives?

To gauge that, you can do a small precision-recall-style analysis on your 15 labeled docs (a sketch follows the list below):

  1. Vary the threshold from (e.g.) 0.2 to 0.8.
  2. Compute false positives and false negatives among your known labeled docs at each threshold.
  3. Pick a threshold that yields an acceptable balance (maybe you accept a few false positives if it significantly lowers your false negatives).
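A minimal sketch of that sweep, assuming each doc dict carries 'filename' and 'similarity_to_nonpub' as in your pipeline (the candidate threshold grid is arbitrary):

import numpy as np

labeled_c0 = {f"R{i:03d}.pdf" for i in range(1, 6)}    # known non-publishable
labeled_c1 = {f"R{i:03d}.pdf" for i in range(6, 16)}   # known publishable

for thr in np.arange(0.20, 0.85, 0.05):
    fp = fn = 0
    for doc in documents:
        name = doc['filename']
        pred_nonpub = doc['similarity_to_nonpub'] > thr
        if name in labeled_c1 and pred_nonpub:
            fp += 1  # publishable paper flagged as non-publishable
        elif name in labeled_c0 and not pred_nonpub:
            fn += 1  # non-publishable paper missed
    print(f"threshold={thr:.2f}  false positives={fp}  false negatives={fn}")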

2. Reducing False Negatives (Keeping the Similarity Baseline)

If your main problem is that the model is “missing” some truly non-publishable papers that have similarity below your chosen threshold, here are ways to tackle that:

  1. Dual-Centroid Approach: Instead of a single “non-publishable” centroid, also compute a “publishable” centroid using your known Class 1 docs. Then, for each unlabeled paper, compare its distance (similarity) to both centroids. Whichever it’s closest to (or whichever similarity is higher) is the predicted class. This can sometimes reduce false negatives if those borderline papers are “less similar” to the publishable centroid. (A minimal sketch of this rule follows this list.)

  2. Manual Spot-Check Borderlines: If you have a band around your threshold (say 0.50–0.60) where many papers lie, you could manually examine some of these borderline papers to either confirm they’re publishable or non-publishable. This manual check (even if for a handful of papers) can refine your threshold or expand your labeled set.

  3. Active Learning: Similar to the manual spot-check, you pick the unlabeled docs that are near the threshold. Label them carefully. Recompute the centroid with these newly labeled docs. This way, you’re less likely to incorporate incorrectly labeled examples (which can happen with naive iterative expansion).

  4. Add a Secondary Feature: If the similarity approach alone leads to false negatives, consider adding a single, carefully chosen rule-based feature. For instance, if a borderline doc is suspiciously short or lacks any mention of “data/method,” maybe that alone can nudge its final classification to non-publishable. You don’t have to rely on an entire rule-based system—just one or two very telling features might help.
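A minimal sketch of the dual-centroid rule from item 1, reusing doc['doc_embedding'], nonpub_centroid, cosine_similarity, and labeled_files_class1 from your existing pipeline; pub_centroid is a new name introduced here:

# Build a "publishable" centroid from the 10 known Class 1 papers
pub_embeddings = [doc['doc_embedding'] for doc in documents
                  if doc['filename'] in labeled_files_class1]
pub_centroid = np.mean(pub_embeddings, axis=0)

for doc in documents:
    sim_nonpub = cosine_similarity(doc['doc_embedding'], nonpub_centroid)
    sim_pub = cosine_similarity(doc['doc_embedding'], pub_centroid)
    # Assign whichever centroid the paper is more similar to (0 = non-publishable)
    doc['dual_centroid_pred'] = 0 if sim_nonpub > sim_pub else 1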


3. Evaluating Different Transformers

Different transformers (e.g., all-mpnet-base-v2, scibert_scivocab_uncased, all-distilroberta-v1) can produce different embedding spaces. Some points to remember:

  1. They May Have Different Ranges of Similarity: For instance, all-distilroberta-v1 might produce embeddings whose cosine similarity has a narrower band overall compared to mpnet or scibert.
  2. Choosing the “Best”: Typically, you do this by checking how well each embedding model separates your known labeled examples. For your 15 labeled docs, measure something like AUROC, F1, or a simple accuracy at an optimal threshold.
    • Example: For each model, compute embeddings, compute centroid, then do a threshold sweep on those 15 known docs. Whichever gets the best average of (precision, recall, or F1) is presumably “best” for your data. (A sketch of this comparison appears after the “Key” note below.)
  3. Domain-Specific: If your research papers come from domains like computer vision, NLP, data mining, etc., scibert_scivocab_uncased might do better if the text is very academic. Alternatively, mpnet or a huggingface “sentence-transformers” model might be more robust across general text.

Key: The “best” model will produce the largest separation between known Class 0 and Class 1, or at least a separation that is easy to threshold.
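A minimal sketch of that per-model comparison, reusing device, documents, cosine_similarity, and the labeled filename sets from your pipeline; candidate_models is a placeholder list, and the "gap" here is simply min(Class 0 similarity) minus max(Class 1 similarity):

from sentence_transformers import SentenceTransformer
import numpy as np

candidate_models = ['all-mpnet-base-v2', 'all-distilroberta-v1']  # placeholder list

for name in candidate_models:
    model = SentenceTransformer(name, device=device)

    # Re-embed each paper as the mean of its section embeddings with this model
    embs = {}
    for doc in documents:
        texts = [t for _, t in doc['sections'] if t.strip()] or [""]
        embs[doc['filename']] = model.encode(texts, convert_to_numpy=True).mean(axis=0)

    centroid = np.mean([embs[f] for f in labeled_files_class0], axis=0)
    sims_c0 = [cosine_similarity(embs[f], centroid) for f in labeled_files_class0]
    sims_c1 = [cosine_similarity(embs[f], centroid) for f in labeled_files_class1]

    # A bigger gap between the two labeled groups means cleaner separation
    gap = min(sims_c0) - max(sims_c1)
    print(f"{name}: min C0 sim={min(sims_c0):.3f}, max C1 sim={max(sims_c1):.3f}, gap={gap:.3f}")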


4. Why the Other Approaches Might Have Failed

4.1 Iterative Centroid Expansion Problems

  • If a single misclassified doc gets added to the non-publishable set, the centroid can shift incorrectly and pull in more false positives. This can degrade performance quickly.
  • If the threshold is set too low in the iterative step, you may pick up borderline or truly publishable docs.

4.2 HDBSCAN or Other Clustering

  • HDBSCAN can be very helpful if your data has irregular clusters and outliers. But if your data truly forms two main “blobs” (publishable vs. non-publishable) and you only have 15 labeled examples, HDBSCAN might create many small clusters or label many items as outliers.
  • When you tested HDBSCAN, it probably ended up dividing the data in ways that didn’t align well with your known labels. That can happen if the embedded distribution is not strongly “density-based.”

4.3 Overfitting with MLP

  • With only 15 examples in total, an MLP can easily memorize your training set. Because your dataset is so small, the model sees the same few samples over and over. This typically leads to near-perfect training accuracy but poor real-world generalization.

5. Key Suggestions Going Forward

  1. Stay with the Similarity Approach If it’s giving you 0 false positives but you want to reduce false negatives, consider:

    • Lowering the threshold slightly.
    • Using a second “publishable” centroid.
    • Doing a small manual check for borderline papers around your threshold.
  2. Validate Thresholds with Known Data

    • Even though you have only 15 labeled docs, do a threshold sweep on that small set to see how many more non-publishable docs you catch if you shift from 0.60 to 0.55, and how many new false positives that introduces.
  3. Light Touch Rules

    • Add 1–2 carefully chosen heuristics—particularly ones that catch certain forms of “nonsense” or glaring omissions. If you can rescue even a handful of false negatives by checking for a “tell-tale sign of nonsense,” that might help.
  4. Refine the Embedding Model

    • Try multiple transformers on your 15-labeled sample. For each model, see if the gap between Class 0 and Class 1 is bigger or smaller.
    • Possibly test domain-specific ones (e.g., SciBERT if the text is scientific).
    • Pick whichever produces the best separation (lowest overlap in similarity distributions).
  5. Minimal Classifier

    • If you want to do a classifier, consider Logistic Regression or a small model with strong regularization rather than a deeper MLP. For instance, a linear boundary on [768-dim embeddings + 1 similarity feature] might be enough to glean some improvement without overfitting too easily.

6. Putting It All Together

  1. Interpretation of Similarity Plot:

    • If your known classes have a large gap, that’s typically good.
    • A large chunk of unlabeled docs might remain below threshold → potential false negatives. You can shift the threshold or do a second pass with manual checks or additional features.
  2. Improving False Negatives:

    • Lower threshold or do manual checking around borderline.
    • Possibly incorporate a small domain-specific rule to catch certain obvious mistakes.
    • Consider a dual-centroid approach (distance to non-pub vs. distance to pub).
  3. Selecting the Best Transformer:

    • Evaluate each transformer’s separation on your labeled data.
    • Whichever yields the “cleanest” gap or highest F1 (with threshold tuning) is likely best.
  4. Why Some Methods Fail:

    • Iterative centroid expansion is very sensitive to even a single mislabel.
    • HDBSCAN might not match your data structure.
    • MLP can overfit on 15 examples.
  5. Recommendation:

    • Stick with your current similarity approach if it reliably yields 0 false positives, and find ways to reduce false negatives: either a slightly lower threshold, an additional centroid for publishable, or minimal extra signals.
    • Expand your labeled set if possible (even 5–10 more doc labels) to better tune the threshold or simple logistic regression.

Final Thoughts

Your strong initial results show that cosine similarity (with a well-chosen threshold) can already separate non-publishable from publishable decently. Adding more data—or carefully adjusting your threshold—will likely do more to address the false negatives than fancy expansions or clustering. If you only have 15 labeled docs, applying active learning or a dual-centroid approach can be more robust than heavy iterative expansions that risk picking up accidental false positives.

Focus on incremental improvements around your best baseline. Sometimes the simplest method (similarity + threshold) is both the most interpretable and the most robust, especially with very limited labels.

Question

OK, then give me code for the logistic regression, using the similarity measure and embeddings. Make sure it doesn't overfit. Keep in mind I have only 5 non-publishable examples and 10 publishable examples in the labeled data. Also keep in mind that this should work well with my already existing code and data structures. Then give a plot that visualizes the boundary in all 3 of the different 2D projection visualizations I'm using. Incorporate the decision boundary in those plots.
Below is the code for those plots:

# PCA for dimensionality reduction
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)

# Extract coordinates
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

# Create figure
plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)

plt.title("K-Means Clusters (PCA 2D Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['filename']
    # Highlight labeled documents (Class 0 and Class 1)
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        elif filename in labeled_files_class1:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

    # Highlight maybe_nonpub documents with a distinct edge color
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange', marker='s', s=120, linewidths=1.5)

# Add legend for clusters
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")

plt.show()

Next: can we combine the embeddings of two or three different models and make a better prediction by somehow combining them, and see if it improves the performance? Think deeply about this and its implications. Guide me on this.

Next: what about doing some sort of topic modeling on the papers? If the papers have bullshit topics that aren't relevant to the papers, maybe we can classify them as non-publishable based on some measure, or even a similarity measure. This is just a thought, guide me on this. If this is the right approach, then give me full end-to-end code for this. I have a separate dataset in the form of a CSV that has filenames and the extracted texts as columns. This extracted text is not split into sections, so it spans the entire document; maybe this will be good for topic modeling and comparing documents and finding non-publishable papers. Think deeply about this.

You also suggested that instead of a single "non-publishable" centroid, I also compute a "publishable" centroid using the known Class 1 docs; then, for each unlabeled paper, compare its distance (similarity) to both centroids, and whichever it's closest to (or whichever similarity is higher) is the predicted class. This can sometimes reduce false negatives if those borderline papers are "less similar" to the publishable centroid. Can you give me a full end-to-end implementation of this, with the necessary plots as well? This I can run in a separate notebook, so I can run different methods concurrently.

think about all of this using all the context that you have and provide me a solid answer meeting all my requirements.

Thought

Thought for 8 seconds

Answer

Below is a comprehensive answer that addresses all your requests:

  1. A logistic regression classifier that uses (a) document embeddings plus (b) the similarity measure as features, while trying to avoid overfitting given your tiny labeled set (5 “non-publishable” + 10 “publishable”).
  2. Plotting the decision boundary on your 2D projections (PCA/t-SNE/etc.). We’ll discuss the best practical way to visualize a decision boundary in 2D, given that logistic regression will have more than 2 input features if we include the similarity measure.
  3. Combining multiple embeddings (from different models) to improve performance.
  4. Topic modeling for detecting “bullshit” or irrelevant content, with a code example (e.g., using LDA or a simpler approach) on your CSV dataset.
  5. Dual-centroid approach (one centroid for publishable, one for non-publishable) and code to classify based on whichever centroid is closer, plus a simple visualization.

Part A: Logistic Regression on Embeddings + Similarity

1. Overview

  • We have 15 labeled PDFs:
    • 5 known non-publishable (Class 0)
    • 10 known publishable (Class 1)
  • We assume we already have:
    • doc_embeddings[i]: a vector (e.g. 768-dimensional) for the i-th document
    • documents[i]['similarity_to_nonpub']: a float for the similarity to the “non-publishable centroid”
    • documents[i]['filename']: the PDF filename

Where to store the labeled classes:

  • Suppose we define two sets:
    labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
    labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf", "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}

2. Creating the Feature Matrix (X) and Labels (y)

We’ll combine the raw embedding (768-dim) with the single similarity feature, for 769 dimensions total.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 1) Build X, y from your 15 labeled docs
X_list = []
y_list = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:    # Non-publishable
        x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]])
        X_list.append(x_vec)
        y_list.append(0)
    elif fname in labeled_files_class1:  # Publishable
        x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]])
        X_list.append(x_vec)
        y_list.append(1)

X = np.array(X_list)
y = np.array(y_list)

# 2) Logistic Regression with strong regularization
# The 'saga' or 'liblinear' solvers handle small datasets well.
# Lower C means stronger L2 regularization; C=1.0 is a reasonable starting point.
logreg = LogisticRegression(solver='liblinear', C=1.0, penalty='l2', random_state=42)

# 3) Cross-validation to check for overfitting
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(logreg, X, y, cv=skf, scoring='accuracy')
print(f"[LogReg] 5-Fold CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

# 4) Fit final model on entire labeled set
logreg.fit(X, y)

# 5) Predict for all 150 documents
for i, doc in enumerate(documents):
    x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]])
    pred_label = logreg.predict(x_vec.reshape(1, -1))[0]
    doc['logreg_pred'] = int(pred_label)  # 0 = non-publishable, 1 = publishable
    # If you want predicted probabilities:
    # doc['logreg_prob'] = logreg.predict_proba(x_vec.reshape(1, -1))[0][1]

3. Visualizing the Decision Boundary in 2D Plots

You have three 2D projections (PCA, t-SNE, etc.). The logistic regression actually lives in a 769-dimensional space (embedding + similarity). Plotting that boundary in 2D is non-trivial. Typically, one of two strategies is used:

  1. Train a 2D logistic regression just for visualization – i.e., train it only on [pca_x, pca_y] for the labeled docs. Then we can easily plot the boundary in PCA space.
  2. Project the logistic regression’s 769D boundary into the 2D PCA space. This is more complex and not always straightforward.

For simplicity, let’s do approach (1)—train a separate logistic regression purely for the 2D PCA coordinates of your labeled docs, and then we can plot the boundary. Note: This is only for illustration, not the actual classifier you use in practice.

############ Visualization of Decision Boundary in PCA ############
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# 1) Prepare PCA (2D) for all docs
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

# 2) Build the training set in 2D PCA for the 15 labeled docs
X_2d = []
y_2d = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:    # Non-publishable
        X_2d.append([x_coords[i], y_coords[i]])
        y_2d.append(0)
    elif fname in labeled_files_class1:  # Publishable
        X_2d.append([x_coords[i], y_coords[i]])
        y_2d.append(1)

X_2d = np.array(X_2d)
y_2d = np.array(y_2d)

# 3) Train logistic regression on the 2D PCA data
logreg_2d = LogisticRegression(solver='liblinear', random_state=42)
logreg_2d.fit(X_2d, y_2d)

# 4) Plot PCA scatter with the decision boundary
plt.figure(figsize=(10, 7))

# Color by cluster_label (if using K-Means)
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters (PCA 2D Projection) + LogReg Boundary")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Mark labeled docs
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0 or fname in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if fname in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange',
                    marker='s', s=120, linewidths=1.5)

# Plot the boundary in 2D:
# create a grid of (x, y), predict with logreg_2d, and draw the 0.5 contour.
xx, yy = np.meshgrid(
    np.linspace(x_coords.min() - 0.5, x_coords.max() + 0.5, 200),
    np.linspace(y_coords.min() - 0.5, y_coords.max() + 0.5, 200)
)
Z = logreg_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[0.5], cmap="gray", linewidths=2)

# Add cluster legend
unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")

plt.show()

Note: The boundary you see here is only for the 2D PCA logistic regression, not the main (embedding+similarity) logistic regression. We do this because otherwise we can’t easily show a 769-dimensional boundary in 2D.


Part B: Combining Multiple Embeddings

Yes, you can combine embeddings from multiple transformer models (e.g., SciBERT, MPNet, DistilRoBERTa). The simplest approach is concatenation:

# Suppose you computed embeddings from 2 different models:
#   doc_embeddings_1 (shape: [n_docs, d1])
#   doc_embeddings_2 (shape: [n_docs, d2])
# for each doc in the same order.
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
# shape: [n_docs, d1 + d2]

Then you can apply the same pipeline (compute similarity, do logistic regression, etc.) on the combined vector. Or you can do more sophisticated merging, like averaging the two embeddings if they’re from the same dimensional space. But typically, concatenation is fine if you can handle the increased dimensionality.

Pros:

  • Potentially captures different aspects of the text from different models.
  • If one model is better at capturing domain-specific semantics, while another is better at general language, combining can help.

Cons:

  • Risk of overfitting with very high-dimensional vectors, especially with only 15 labeled docs.
  • Typically requires strong regularization or dimensionality reduction (like PCA on the combined embeddings to reduce them back down to a manageable dimension).
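A minimal sketch of that reduction step, assuming combined_embeddings from the concatenation above; with roughly 150 docs, PCA can return at most n_docs components, so the target dimension here is an assumption to tune.

from sklearn.decomposition import PCA

# n_components must be <= min(n_docs, d1 + d2); with ~150 docs, 128 is a safe, arbitrary choice
pca_reduce = PCA(n_components=128, random_state=42)
reduced_combined = pca_reduce.fit_transform(combined_embeddings)
print(reduced_combined.shape)   # (n_docs, 128)

# Downstream steps (centroids, K-Means, logistic regression) can then use
# reduced_combined in place of combined_embeddings.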

Part C: Topic Modeling for “Bullshit” Topics

1. Idea

  • If the paper’s text claims it’s about “machine learning,” but your topic model picks up random words about “cooking recipes,” that’s suspicious.
  • You can do LDA (Latent Dirichlet Allocation) or an embeddings-based approach. For a simple LDA example, you just need the full text in a CSV, then run sklearn.decomposition.LatentDirichletAllocation or gensim to extract topics.

2. Example End-to-End Code (Using Gensim)

Assume you have a CSV: papers_fulltext.csv with columns: filename, full_text.

###############################################################################
# TOPIC MODELING EXAMPLE
###############################################################################
!pip install gensim

import re
import pandas as pd
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

df_topics = pd.read_csv("papers_fulltext.csv")  # columns: ['filename', 'full_text']

def clean_for_lda(text):
    # Basic cleaning for LDA
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    return tokens

# Preprocess each doc
docs_tokens = [clean_for_lda(t) for t in df_topics['full_text']]

# Gensim requires a dictionary and a bag-of-words corpus
dictionary = Dictionary(docs_tokens)
dictionary.filter_extremes(no_below=3, no_above=0.6)  # filter out rare & overly frequent tokens
bow_corpus = [dictionary.doc2bow(tokens) for tokens in docs_tokens]

# Train LDA with, say, 5 or 10 topics
num_topics = 5
lda_model = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary,
                     passes=5, random_state=42)

# Inspect top words in each topic
for idx in range(num_topics):
    print(f"Topic #{idx}: {lda_model.print_topic(idx, 10)}")

# Now you can infer the topic distribution of each doc
doc_topics = []
for i, bow in enumerate(bow_corpus):
    topics_prob = lda_model.get_document_topics(bow, minimum_probability=0.0)
    # 'topics_prob' is a list of (topic_id, prob)
    doc_topics.append(topics_prob)

df_topics['topic_distribution'] = doc_topics

# DETECT "BULLSHIT" PAPERS?
# Example: if a doc's top topic looks like 'cooking' but the paper claims to be about CV/ML,
# you can do a heuristic check.
for i, row in df_topics.iterrows():
    # row['topic_distribution'] => list of (topic_id, prob)
    top_topic_id, top_topic_prob = max(row['topic_distribution'], key=lambda x: x[1])
    # If top_topic_id is not in your expected domain topics, mark it suspicious,
    # or do more advanced matching.

You could also do a similarity approach on the doc’s topic distribution vs. a known distribution for “real ML topics.” Or if your LDA is picking up nonsense topics (like “oven, pan, salt, recipe”), you can flag those docs as suspicious.
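One hedged way to operationalize that comparison: build a reference topic mixture from the known publishable docs and score every paper against it. This assumes df_topics['topic_distribution'] holds the (topic_id, prob) lists produced above and that labeled_files_class1 contains the publishable filenames matching df_topics['filename'].

import numpy as np

def to_vector(topics_prob, num_topics):
    # Turn a list of (topic_id, prob) into a dense topic-mixture vector
    v = np.zeros(num_topics)
    for tid, p in topics_prob:
        v[tid] = p
    return v

topic_vecs = np.array([to_vector(tp, num_topics) for tp in df_topics['topic_distribution']])
pub_mask = df_topics['filename'].isin(labeled_files_class1).values
reference = topic_vecs[pub_mask].mean(axis=0)   # "typical publishable paper" topic mixture

norms = np.linalg.norm(topic_vecs, axis=1) * np.linalg.norm(reference) + 1e-12
df_topics['topic_sim_to_pub'] = topic_vecs @ reference / norms
# Low values => the paper's topic mixture is unusual; treat this as a red flag, not a verdict.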


Part D: Dual-Centroid Approach

Instead of a single “non-publishable” centroid, create two centroids:

  • Centroid_NonPub: average embedding of the 5 known non-publishable docs
  • Centroid_Pub: average embedding of the 10 known publishable docs

For each unlabeled doc, compute similarity to both centroids. Predict whichever similarity is higher (or whichever distance is smaller).

Example Code

###############################################################################
# DUAL-CENTROID APPROACH
###############################################################################
import numpy as np

# 1) Collect embeddings for known classes
nonpub_embeddings = []
pub_embeddings = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_embeddings.append(doc_embeddings[i])
    elif fname in labeled_files_class1:
        pub_embeddings.append(doc_embeddings[i])

# 2) Compute two centroids
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 3) For each doc, compute similarity to each centroid
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)
    # Option 1: Classify whichever is higher
    if sim_nonpub > sim_pub:
        doc['dual_centroid_pred'] = 0  # non-publishable
    else:
        doc['dual_centroid_pred'] = 1  # publishable

# 4) Visualization in 2D PCA
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10, 7))

# Color docs by 'dual_centroid_pred'
dual_labels = [doc['dual_centroid_pred'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=dual_labels, cmap='bwr', alpha=0.6)
plt.title("Dual Centroid Classification (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# Mark the known labeled docs
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0 or fname in labeled_files_class1:
        # black edge circle
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if fname in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

plt.show()

Notes:

  • This approach can sometimes reduce false negatives if certain borderline docs are “less similar” to non-publishable than to publishable.
  • If a doc is equidistant, you can either random-tie break or label it as publishable, etc.

Part E: Summary & Recommendations

  1. Logistic Regression

    • Great for combining embeddings + similarity in a single feature vector.
    • Use strong regularization (smaller C) and cross-validation to mitigate overfitting.
    • For 2D boundary plots, train a parallel “2D logistic regression” on the PCA coordinates purely for visualization.
  2. Combining Multiple Embeddings

    • Concatenate or average them.
    • Watch out for dimensional explosion. Possibly do PCA to reduce combined embeddings from 1500D down to ~512.
  3. Topic Modeling

    • LDA or another approach can detect weird content.
    • If you find an entire cluster of words that are obviously not relevant to your research domain, that can be an extra signal.
  4. Dual-Centroid

    • A straightforward way to reduce false negatives if you see borderline docs are not that close to the non-pub centroid, but also not strongly close to publishable. You pick whichever is higher.
  5. Interpreting the Plots

    • For logistic regression or dual-centroid classification, the direct 2D boundary in PCA space is only an approximate view. Don’t rely on it exclusively to measure true performance. Always check the labeled docs to see if predictions match the ground truth.

With these pieces in place, you can experiment in separate notebooks:

  • One for your logistic regression pipeline (embedding + similarity).
  • Another for the dual-centroid approach.
  • A third for topic modeling heuristics.

You’ll then see which method yields the fewest false positives while also reducing false negatives, hopefully improving on your baseline. Good luck!

Question

Topic #0: 0.006*"image" + 0.005*"images" + 0.004*"features" + 0.004*"car" + 0.003*"tags" + 0.003*"feature" + 0.003*"classification" + 0.003*"word" + 0.003*"visual" + 0.003*"label"
Topic #1: 0.008*"entropy" + 0.005*"food" + 0.005*"patterns" + 0.004*"system" + 0.004*"llm" + 0.003*"team" + 0.003*"concept" + 0.003*"caffeine" + 0.003*"diffusion" + 0.003*"behavior"
Topic #2: 0.008*"led" + 0.008*"patterns" + 0.007*"human" + 0.005*"properties" + 0.005*"field" + 0.005*"implications" + 0.005*"particularly" + 0.005*"behavior" + 0.005*"profound" + 0.004*"soil"
Topic #3: 0.005*"datasets" + 0.004*"tasks" + 0.004*"input" + 0.004*"features" + 0.004*"task" + 0.004*"feature" + 0.003*"neural" + 0.003*"image" + 0.003*"001" + 0.003*"language"
Topic #4: 0.008*"engine" + 0.007*"system" + 0.006*"llm" + 0.005*"patterns" + 0.004*"design" + 0.004*"plasticity" + 0.004*"dna" + 0.004*"human" + 0.003*"particularly" + 0.003*"items"

I used your topic modeling code in a completely new and different notebook and got this output. What does this mean? How can we make the model recognize irrelevant claims and other topics that do not belong in a machine learning setting, and then classify based on that?


# Suppose you computed embeddings from 2 different models:
# doc_embeddings_1 (shape: [n_docs, d1])
# doc_embeddings_2 (shape: [n_docs, d2])
# for each doc in the same order
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1) # shape: [n_docs, d1 + d2]

I also want to try this embedding concatenation and see how it performs. You know my existing code and how I already use one embedding for the similarity index and k-means clustering. So, based on that, can you give me code to make it work with my current code? Then give end-to-end code for concatenated_predictions using whichever method is best, and then also a plot to visualize and analyse. Make it consistent with the context of my original code.

Thought

Crafting interpretation guidelines

I'm piecing together specific steps for interpreting LDA topics, pinpointing irrelevant claims, merging embeddings from various models, and then providing the unified code to combine these embeddings and visualize the results.

Crafting a pipeline

I’m working on a method to interpret LDA topics, identify irrelevant ones using a rule-based strategy, merge embeddings from various models, and visualize the results with a final consistent code snippet.

Thought for 9 seconds

Answer

Below is a structured explanation and code on:

  1. How to interpret the LDA topics and use them (optionally) to flag papers that contain irrelevant or “nonsense” content for a machine-learning setting.
  2. Combining embeddings from two different models into a single feature (via concatenation) so you can still do your usual K-Means clustering, similarity-based classification, or logistic regression.
  3. An end-to-end code snippet demonstrating how to incorporate concatenated embeddings into your existing pipeline (including K-Means, similarity measure, and a final visualization).

1. Interpreting the LDA Topics

From your example:

Topic #0: 0.006*"image" + 0.005*"images" + 0.004*"features" + 0.004*"car" ...
Topic #1: 0.008*"entropy" + 0.005*"food" + 0.005*"patterns" + 0.004*"system" ...
Topic #2: 0.008*"led" + 0.008*"patterns" + 0.007*"human" + 0.005*"properties" ...
Topic #3: 0.005*"datasets" + 0.004*"tasks" + 0.004*"input" + 0.004*"features" ...
Topic #4: 0.008*"engine" + 0.007*"system" + 0.006*"llm" + 0.005*"patterns" ...

Each topic is basically a distribution over words (with probabilities). The top words give you a clue about what that topic is about. For instance:

  • Topic #0 might be about images, features, classification. That’s plausibly relevant to ML (especially vision).
  • Topic #1 is partially about entropy, food, patterns, behavior — a bit more random. Possibly some “food” documents ended up in your dataset.
  • Topic #3 looks strongly ML-related: “datasets,” “tasks,” “input,” “features,” “neural,” etc.

If you see a topic with words like “recipe,” “oven,” “salt,” or “cooking,” that’s likely nonsense in an ML context. In your example, some references to “food” or “caffeine” might be suspicious if your domain is purely machine learning.

1.1 Flagging “Irrelevant” or “Nonsense” Topics

After training LDA, you can:

  1. Identify which of the k topics look “irrelevant.” For example, if “Topic #1” strongly references “food” or “caffeine” and has minimal ML terms, you might label that topic as “nonsense.”
  2. Infer each paper’s topic distribution:
    • If the paper’s top topic is “food/caffeine” with a high probability (> 0.6 or so), you can consider it suspicious for an ML conference.
  3. Create a Heuristic:
    • If the topic with the highest probability in the document’s topic distribution is an irrelevant topic, then flag the paper as “non-publishable” or at least “suspect.”

Caution: LDA is just unsupervised clustering of words. Sometimes an “irrelevant” topic might appear if your corpus has unusual domain text, or it might partially overlap with real ML concepts in unexpected ways. So treat this as an additional signal (like a “red flag” score), not an absolute truth.


2. Combining Embeddings from Two Models

You can certainly use two (or more) transformer models to embed each paper. Suppose:

  • doc_embeddings_1 has shape [n_docs, d1]
  • doc_embeddings_2 has shape [n_docs, d2]

Concatenate them:

combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1) # Now shape: [n_docs, d_1 + d_2]

You can then do exactly what you’ve been doing (K-Means, similarity measure, etc.) on combined_embeddings. The biggest difference:

  • Instead of using doc_embeddings everywhere, you swap in combined_embeddings.

Potential Pitfalls:

  • If d1 + d2 is large (e.g., 768 + 768 = 1536), you could risk overfitting with only 15 labeled docs. You might consider applying a small PCA or dimensionality reduction step on combined_embeddings to bring it down to 512 or 256 dimensions before continuing.

3. End-to-End Code with Concatenated Embeddings

Below is a simplified pipeline illustrating how to:

  1. Load or compute your two sets of embeddings (doc_embeddings_1, doc_embeddings_2).
  2. Concatenate them into combined_embeddings.
  3. Perform K-Means on the combined embeddings.
  4. Compute a “non-publishable centroid” from known examples, then measure similarity.
  5. Plot your PCA-based 2D projection + clustering.

Feel free to adjust the details to match your environment.

###############################################################################
# Step 0: Setup & Basic Definitions
###############################################################################
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from collections import Counter

# Labeled sets (example)
labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}  # non-pub
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}  # pub

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

###############################################################################
# Step 1: Suppose you have two sets of embeddings for each doc
#   doc_embeddings_1, doc_embeddings_2
#   shape = [n_docs, d1], [n_docs, d2]
###############################################################################
# For demonstration, let's pretend doc_embeddings_1 and doc_embeddings_2 exist.
# Replace them with your actual arrays from your environment, e.g.:
#   doc_embeddings_1 = ...
#   doc_embeddings_2 = ...
# We'll assume you also have the 'documents' list of dicts as usual,
# i.e. documents[i]['filename'], etc.

###############################################################################
# Step 2: Concatenate them
###############################################################################
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
print("Shape of combined embeddings:", combined_embeddings.shape)

###############################################################################
# Step 3: K-Means on combined embeddings
###############################################################################
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(combined_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = int(labels[i])

cluster_counts = Counter(labels)
print("Cluster distribution:", cluster_counts)

###############################################################################
# Step 4: Compute Non-Publishable Centroid & Similarity (optional)
###############################################################################
# Collect embeddings for the known non-publishable docs
known_nonpub_inds = []
for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0:
        known_nonpub_inds.append(i)

nonpub_embeddings = combined_embeddings[known_nonpub_inds]
nonpub_centroid = np.mean(nonpub_embeddings, axis=0)

# For each doc, compute similarity to the non-publishable centroid
for i, doc in enumerate(documents):
    emb = combined_embeddings[i]
    sim_val = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = float(sim_val)

# Example threshold
similarity_threshold = 0.5
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold)

###############################################################################
# Step 5: Visualization (PCA 2D of the combined embeddings)
###############################################################################
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(combined_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters on Combined Embeddings (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['filename']
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
    # Highlight maybe_nonpub docs
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange',
                    marker='s', s=120, linewidths=1.5)

unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")

plt.show()

Explanation of the Code

  1. Concatenate doc_embeddings_1 and doc_embeddings_2 to get combined_embeddings.
  2. K-Means on combined_embeddings, store cluster labels in doc['cluster_label'].
  3. Non-Publishable Centroid: We average the embeddings of your 5 known “non-publishable” docs. Then, for each doc, we compute similarity_to_nonpub.
  4. Threshold to set doc['maybe_nonpub'] = True if similarity > 0.5 (or whichever threshold you prefer).
  5. The 2D PCA is purely for visualization; the actual feature dimension is d1 + d2.

4. Using LDA to Flag Irrelevant Content

If you suspect a doc is about “food/cooking/recipes” while it claims to be “machine learning,” you can do something like:

# For each doc in df_topics (the LDA pipeline):
#   row['topic_distribution'] => list of (topic_id, prob)
irrelevant_topic_ids = [1]   # Suppose we decided topic #1 is "food" or nonsense
threshold = 0.6              # top topic is an irrelevant one with prob >= 0.6 => suspicious

df_topics['suspect_topic'] = False
for i, row in df_topics.iterrows():
    topics_prob = row['topic_distribution']
    top_topic_id, top_topic_prob = max(topics_prob, key=lambda x: x[1])
    if (top_topic_id in irrelevant_topic_ids) and (top_topic_prob >= threshold):
        # Write back with .at: assigning to `row` would only modify a copy
        df_topics.at[i, 'suspect_topic'] = True

You can incorporate this as an additional dimension in your main pipeline—e.g., a boolean feature has_suspect_topic that you feed into your logistic regression or use as a simple rule.
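A minimal sketch of that augmentation, assuming the suspect_topic column from above and that documents, doc_embeddings, and doc['similarity_to_nonpub'] are available and aligned with df_topics by filename:

import numpy as np

suspect_lookup = dict(zip(df_topics['filename'], df_topics['suspect_topic']))

X_aug = []
for i, doc in enumerate(documents):
    flag = 1.0 if suspect_lookup.get(doc['filename'], False) else 0.0
    X_aug.append(np.concatenate([doc_embeddings[i],
                                 [doc['similarity_to_nonpub']],
                                 [flag]]))
X_aug = np.array(X_aug)   # shape: [n_docs, d + 2] -- embedding + similarity + suspect flag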


Final Thoughts

  1. Topic Modeling: Good for an extra “red flag” if the doc’s top topic is obviously irrelevant. Not foolproof, but helpful.
  2. Embedding Concatenation: Potentially more robust representation, but watch out for overfitting with limited labeled data. Possibly apply a PCA step on the combined embeddings if dimension is too large.
  3. Threshold Tuning: Continues to be crucial for controlling false positives vs. false negatives in the similarity-based approach.
  4. Logistic Regression / MLP on concatenated embeddings is also an option—just be sure to use cross-validation to avoid overfitting.

This should give you a complete sense of how to interpret LDA outputs for “nonsense” detection and how to combine multiple embeddings in your existing pipeline, with an end-to-end example showing K-Means clustering, similarity thresholding, and a PCA visualization. Good luck!

Question

###############################################################################
# DUAL-CENTROID APPROACH
###############################################################################
import numpy as np
# 1) Collect embeddings for known classes
nonpub_embeddings = []
pub_embeddings = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_embeddings.append(doc_embeddings[i])
    elif fname in labeled_files_class1:
        pub_embeddings.append(doc_embeddings[i])

# 2) Compute two centroids
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 3) For each doc, compute similarity to each centroid
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)

    # Option 1: Classify whichever is higher
    if sim_nonpub > sim_pub:
        doc['dual_centroid_pred'] = 0  # non-publishable
    else:
        doc['dual_centroid_pred'] = 1  # publishable

# 4) Visualization in 2D PCA
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10,7))

# color docs by 'dual_centroid_pred'
dual_labels = [doc['dual_centroid_pred'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=dual_labels, cmap='bwr', alpha=0.6)
plt.title("Dual Centroid Classification (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# Mark the known labeled docs
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0 or fname in labeled_files_class1:
        # black edge circle
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5)
        if fname in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i]+0.02, y_coords[i]+0.02, label_text, fontsize=9)

plt.show()

final_df = pd.DataFrame(documents)
final_df[(final_df.dual_centroid_pred != final_df.maybe_nonpub)]

non_publishable_list = final_df[(final_df['dual_centroid_pred'] == 0)].filename.to_list()

This is the code I'm using that you gave; the model I used is SciBERT. This seems to be giving the uncontested best predictions and performance, so I want to try and improve this model. From this code you can see that we only use the centroids once, calculated from the initial labeled examples. I want to try updating the centroids over time using the newly added points so they become better centroids. What do you think of this idea? Yes, I know that if we include false positives, it could make the centroids go wrong and quickly snowball.
Why are we using similarity for assigning to a centroid instead of traditional distance measures? I'm already using the code based on the similarity measure, so now I want to try a distance-based approach as well. So can I get two sets of code, so I can compare their performance?

So give me the ability to adjust the settings, such that I can determine how similar documents have to be to be chosen into a class, how many iterations to run, etc. Give me a guide for this.

Also, for inference, how do I go about that when I want to add new values and find out which class they belong to? Do I use the similarity-based method or k-nearest neighbors, and why should I choose one over the other? Think and give me guidance.

Finally give me four pieces of code:
- existing dual centroid method based on similarity .
- existing dual centroid method but based on some distance measure.
- dual centroid method but that iteratively includes more points, recomputes centroids and repeats until some condition that I have control over. It should only include more points based on high similarity so as to avoid false positives.. THis should be based on similarity measure.
- dual centroid method that iteratively includes more points, recompute centroids and iterates. But this should be based on a distance measure/ distance based approach.

For all of these methods, after running them, the labels should be produced accordingly and the label should be stored in the dataframe. with the respective column names as :
- original_similarity
- original_distance
- iterative_similarity
- iterative_distance

If there are any other approaches to improve, please guide me on that as well

Thought

Building upon dual centroid methods

I’m looking to establish four distinct code blocks identifying invariant and iterative dual centroid classifier methods, using similarity and distance measures on 'nonpub_embeddings' and 'pub_embeddings'. Progress would involve iterative inclusion and centroid recomputations.

Defining the framework

I’m detailing four code blocks to produce 'original_similarity', 'original_distance', 'iterative_similarity', 'iterative_distance' columns. I’ll define an iterative approach with thresholds and maximum iterations, ensuring comprehensive labels for each document.

Deciding on column names

OK, let's finalize 'original_similarity', 'original_distance', 'iterative_similarity', and 'iterative_distance' to classify documents into 'publishable' and 'nonpublishable'. The strategy involves iterative inclusion and centroid recomputations.

Thought for 9 seconds

Answer

Below is a comprehensive guide that addresses your requests:

  1. Why updating centroids iteratively can help (and the risks).
  2. Why we used cosine similarity vs. distance (and how to switch to a distance-based approach).
  3. Inference strategy (adding new documents in the future).
  4. Four separate code examples for the dual-centroid method:
    • (A) Original (one-shot) classification using similarity.
    • (B) Original (one-shot) classification using distance.
    • (C) Iterative classification using similarity, re-computing centroids.
    • (D) Iterative classification using distance, re-computing centroids.

At the end, each code snippet stores its final 0/1 labels into specific columns in documents, namely:

  • doc['original_similarity']
  • doc['original_distance']
  • doc['iterative_similarity']
  • doc['iterative_distance']

This way, you end up with a single DataFrame that has the results of all four approaches.


1. Updating Centroids Iteratively

Currently, your dual-centroid approach uses 5 known non-publishable and 10 known publishable examples to compute one non-publishable centroid and one publishable centroid. This works well if those examples are representative and you don’t suspect labeling errors.

  • Pro: If you discover more truly non-publishable docs from the unlabeled set, including them in the “non-pub” centroid might capture that region of embedding space more accurately (reducing false negatives).
  • Con: A single false positive can drag the non-pub centroid away from its true region, causing a chain reaction of misclassification.

Hence, any iterative approach should have a high confidence threshold (on similarity or low distance) to minimize false positives creeping in.


2. Similarity vs. Distance

  • Cosine Similarity is typical for text embeddings from BERT-like models. A high similarity means the vectors are pointing in roughly the same direction in the high-dimensional space.
  • Distance (e.g., Euclidean) is also valid. For BERT embeddings, some prefer cosine because the magnitude of embeddings can vary, and the direction often carries the semantic content.

If you want a distance-based approach for Euclidean distance, you can do:

d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}

and pick the centroid that yields the lowest distance.


3. Inference with New Documents

Once you have your final centroids (either after one-shot or iterative updates):

  • Similarity Approach:

    1. Compute the embedding for the new doc.
    2. Calculate its cosine similarity to the “non-pub” centroid and to the “pub” centroid.
    3. Whichever similarity is higher -> that class.
  • k-Nearest Neighbors:

    1. Keep all your known labeled examples.
    2. For a new doc, find the k nearest neighbors in the embedding space.
    3. Use majority vote or weighted vote to classify.

When to choose KNN:

  • If you suspect your classes might be multi-modal or complexly shaped, or you have more than a handful of labeled examples, KNN can capture more local structure.
  • If your labeled set is extremely small (only 15 docs), KNN might be very sensitive. The dual-centroid approach is simpler, but also less flexible if your classes are not unimodal.
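For reference, a minimal KNN sketch under those caveats; X_labeled, y_labeled, and new_embedding are hypothetical names for the 15 labeled embeddings, their 0/1 labels, and a new paper's embedding.

from sklearn.neighbors import KNeighborsClassifier

# Cosine metric with distance weighting; k=3 keeps the vote local with only 15 labeled docs
knn = KNeighborsClassifier(n_neighbors=3, metric='cosine', weights='distance')
knn.fit(X_labeled, y_labeled)

pred = knn.predict(new_embedding.reshape(1, -1))[0]         # 0 = non-publishable, 1 = publishable
proba = knn.predict_proba(new_embedding.reshape(1, -1))[0]  # rough confidence from neighbor votes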

4. Four Separate Code Snippets

We’ll assume you already have:

  • A list of dictionaries called documents, where each doc has at least doc['filename'].
  • doc_embeddings is a numpy array of shape [n_docs, d].
  • Two sets:
    labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
    labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf", "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
  • SciBERT or any other embeddings can be used.
  • You want columns in your final DataFrame:
    • original_similarity
    • original_distance
    • iterative_similarity
    • iterative_distance

We will define each snippet in turn. At the end, you can run them all in sequence (or in separate cells) and end up with all 4 label columns in documents.

Helper Functions

import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_dist(a, b):
    return np.sqrt(np.sum((a - b)**2))

A) Original Dual Centroid – Similarity

One-shot approach: pick whichever centroid yields a higher similarity. Store the predicted label in doc['original_similarity'].

############################
# A) ORIGINAL SIMILARITY
############################

# 1) Gather embeddings for known labeled docs
nonpub_indices = []
pub_indices = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_indices.append(i)
    elif fname in labeled_files_class1:
        pub_indices.append(i)

nonpub_embeddings = doc_embeddings[nonpub_indices]
pub_embeddings = doc_embeddings[pub_indices]

# 2) Compute centroids
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

# 3) Classify each doc by comparing cos_sim to each centroid
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)
    # 0 => non-publishable, 1 => publishable
    if sim_nonpub > sim_pub:
        doc['original_similarity'] = 0
    else:
        doc['original_similarity'] = 1

B) Original Dual Centroid – Distance

Same idea, but we pick whichever centroid yields a lower distance. Store the predicted label in doc['original_distance'].

############################
# B) ORIGINAL DISTANCE
############################

# We assume the same centroid_nonpub, centroid_pub from above.
# If you run this cell separately, recompute them:
nonpub_indices = []
pub_indices = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_indices.append(i)
    elif fname in labeled_files_class1:
        pub_indices.append(i)

nonpub_embeddings = doc_embeddings[nonpub_indices]
pub_embeddings = doc_embeddings[pub_indices]

centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

# Classify using Euclidean distance
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    dist_nonpub = euclidean_dist(emb, centroid_nonpub)
    dist_pub = euclidean_dist(emb, centroid_pub)
    # 0 => non-publishable, 1 => publishable
    if dist_nonpub < dist_pub:
        doc['original_distance'] = 0
    else:
        doc['original_distance'] = 1

C) Iterative Dual Centroid – Similarity

We iteratively include new docs whose similarity to a particular centroid surpasses a threshold. For example:

  • If sim_to_nonpub > threshold_nonpub and sim_to_nonpub > sim_to_pub (i.e., it is genuinely closer to non-pub), we add that doc to the nonpub set.
  • If sim_to_pub > threshold_pub and sim_to_pub > sim_to_nonpub, we add that doc to the pub set.
  • Recompute centroids.
  • Stop if no new docs were added or if we reach max_iterations.

Parameters you might want to tune:

  • threshold_nonpub, threshold_pub: e.g. 0.60 or 0.65.
  • max_iterations: e.g. 5.

The code below lumps them into one threshold if you prefer. We store the final label in doc['iterative_similarity'].

############################
# C) ITERATIVE SIMILARITY
############################

def iterative_dual_centroid_similarity(doc_embeddings, documents,
                                       labeled_files_class0, labeled_files_class1,
                                       threshold=0.65, max_iterations=5):
    """
    Iteratively expand the sets of non-publishable and publishable docs based on
    which centroid they're more similar to, provided that similarity > threshold.
    Recompute centroids each iteration.
    """
    # 1) Initialize sets from your labeled data
    nonpub_set = set(labeled_files_class0)
    pub_set = set(labeled_files_class1)

    iteration = 0
    changed = True
    while iteration < max_iterations and changed:
        iteration += 1
        changed = False

        # Compute centroids
        nonpub_embs = []
        pub_embs = []
        for i, doc in enumerate(documents):
            if doc['filename'] in nonpub_set:
                nonpub_embs.append(doc_embeddings[i])
            elif doc['filename'] in pub_set:
                pub_embs.append(doc_embeddings[i])

        # If one class is empty, skip to avoid error
        if len(nonpub_embs) == 0 or len(pub_embs) == 0:
            print("One set is empty! Stopping iterative approach.")
            break

        centroid_nonpub = np.mean(nonpub_embs, axis=0)
        centroid_pub = np.mean(pub_embs, axis=0)

        # 2) For each unlabeled doc, see if we can add it
        for i, doc in enumerate(documents):
            fname = doc['filename']
            # Skip if it's already in either set
            if fname in nonpub_set or fname in pub_set:
                continue

            emb = doc_embeddings[i]
            sim_nonpub = cos_sim(emb, centroid_nonpub)
            sim_pub = cos_sim(emb, centroid_pub)

            # If it's clearly more similar to nonpub (and above threshold)
            if sim_nonpub > sim_pub and sim_nonpub >= threshold:
                nonpub_set.add(fname)
                changed = True
            # If it's clearly more similar to pub (and above threshold)
            elif sim_pub > sim_nonpub and sim_pub >= threshold:
                pub_set.add(fname)
                changed = True

        print(f"Iteration {iteration} done. Non-pub size: {len(nonpub_set)}, Pub size: {len(pub_set)}")
        if not changed:
            print("No changes in this iteration, stopping early.")

    # Final classification:
    # if a doc is in nonpub_set => 0, if in pub_set => 1, else break the tie
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in nonpub_set and fname not in pub_set:
            doc['iterative_similarity'] = 0
        elif fname in pub_set and fname not in nonpub_set:
            doc['iterative_similarity'] = 1
        else:
            # Doc not classified (or possibly in both sets somehow):
            # break ties by whichever centroid is more similar
            emb = doc_embeddings[i]
            sim_nonpub = cos_sim(emb, centroid_nonpub)
            sim_pub = cos_sim(emb, centroid_pub)
            doc['iterative_similarity'] = 0 if sim_nonpub >= sim_pub else 1

# Call the function:
iterative_dual_centroid_similarity(doc_embeddings, documents,
                                   labeled_files_class0, labeled_files_class1,
                                   threshold=0.65, max_iterations=5)

Feel free to tune the threshold. If you set threshold=0.80, you’ll be very conservative about adding new docs. If you set threshold=0.50, you’ll likely add more docs (risking false positives).


D) Iterative Dual Centroid – Distance

Same iterative logic, but now we measure Euclidean distance to each centroid. If a doc is closer to the non-pub centroid by some margin, we add it to that set, provided it’s under some threshold. The threshold means “distance must be less than X” for us to add it confidently.

############################
# D) ITERATIVE DISTANCE
############################

def iterative_dual_centroid_distance(doc_embeddings, documents,
                                     labeled_files_class0, labeled_files_class1,
                                     distance_threshold=1.0, max_iterations=5):
    """
    Iteratively expand the sets of non-publishable and publishable docs based on
    which centroid they are closer to, provided the distance < distance_threshold.
    Recompute centroids each iteration.
    """
    nonpub_set = set(labeled_files_class0)
    pub_set = set(labeled_files_class1)

    iteration = 0
    changed = True
    while iteration < max_iterations and changed:
        iteration += 1
        changed = False

        # Compute centroids
        nonpub_embs = []
        pub_embs = []
        for i, doc in enumerate(documents):
            if doc['filename'] in nonpub_set:
                nonpub_embs.append(doc_embeddings[i])
            elif doc['filename'] in pub_set:
                pub_embs.append(doc_embeddings[i])

        if len(nonpub_embs) == 0 or len(pub_embs) == 0:
            print("One set is empty! Stopping iterative approach.")
            break

        centroid_nonpub = np.mean(nonpub_embs, axis=0)
        centroid_pub = np.mean(pub_embs, axis=0)

        for i, doc in enumerate(documents):
            fname = doc['filename']
            if fname in nonpub_set or fname in pub_set:
                continue

            emb = doc_embeddings[i]
            dist_nonpub = euclidean_dist(emb, centroid_nonpub)
            dist_pub = euclidean_dist(emb, centroid_pub)

            # If it's strictly closer to nonpub and below threshold
            if dist_nonpub < dist_pub and dist_nonpub <= distance_threshold:
                nonpub_set.add(fname)
                changed = True
            # If it's strictly closer to pub and below threshold
            elif dist_pub < dist_nonpub and dist_pub <= distance_threshold:
                pub_set.add(fname)
                changed = True

        print(f"Iteration {iteration} done. Non-pub size: {len(nonpub_set)}, Pub size: {len(pub_set)}")
        if not changed:
            print("No changes in this iteration, stopping early.")

    # Final classification
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in nonpub_set and fname not in pub_set:
            doc['iterative_distance'] = 0
        elif fname in pub_set and fname not in nonpub_set:
            doc['iterative_distance'] = 1
        else:
            # Break the tie by whichever centroid is closer
            emb = doc_embeddings[i]
            dist_nonpub = euclidean_dist(emb, centroid_nonpub)
            dist_pub = euclidean_dist(emb, centroid_pub)
            doc['iterative_distance'] = 0 if dist_nonpub <= dist_pub else 1

# Call the function:
iterative_dual_centroid_distance(doc_embeddings, documents,
                                 labeled_files_class0, labeled_files_class1,
                                 distance_threshold=1.0, max_iterations=5)

Note: You need to choose a “distance threshold” that makes sense in your embedding space. You might try 1.0, 1.5, 2.0, etc., then see if new docs get added. If your embeddings are large in magnitude, you might need a bigger threshold. If your embeddings are normalized (like in some BERT variants), typical Euclidean distances might be < 2.0.


5. Final DataFrame and Label Columns

After running all four methods (A, B, C, D) in sequence, each doc in documents will have:

  • doc['original_similarity']
  • doc['original_distance']
  • doc['iterative_similarity']
  • doc['iterative_distance']

You can then do:

final_df = pd.DataFrame(documents)
print(final_df[['filename', 'original_similarity', 'original_distance',
                'iterative_similarity', 'iterative_distance']])

to see the final classification side by side.


6. Additional Approaches

  • KNN: If you eventually grow your labeled set beyond 15 docs, you can do k-Nearest Neighbors. For each doc, find the top k labeled neighbors in the embedding space, and do a majority vote. This can handle multi-modal distributions better.
  • Hybrid Approach: You could do a single iteration of centroid-based classification, then confirm the borderline docs manually to reduce false positives.
  • Confidence Scores: For iterative approaches, you might only add a doc to a class if it’s much closer to that centroid than the other (e.g., sim_nonpub - sim_pub > 0.1, or dist_pub - dist_nonpub > 0.5). This helps filter out borderline cases (a minimal sketch follows this list).
  • Combine Similarity + Distance: Sometimes you can fuse features (like “cosine similarity to each centroid” plus “Euclidean distance to each centroid”) in a small logistic regression, but that might be overkill for 15 labeled docs.
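As referenced above, here is a minimal sketch of such a margin rule that could slot into the iterative similarity loop; the threshold and margin values are assumptions to tune.

def confident_assignment(sim_nonpub, sim_pub, threshold=0.65, margin=0.10):
    """Return 0 (non-pub), 1 (pub), or None to leave the doc unassigned this round."""
    if sim_nonpub - sim_pub >= margin and sim_nonpub >= threshold:
        return 0
    if sim_pub - sim_nonpub >= margin and sim_pub >= threshold:
        return 1
    return None

# Example: confident_assignment(0.78, 0.61) -> 0; confident_assignment(0.68, 0.66) -> None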

Putting It All Together

  1. Run each snippet (A, B, C, D) in your environment.
  2. Check the final documents and see which approach yields the best results on your 15 labeled examples (lowest false positives/negatives).
  3. Tune thresholds (e.g., similarity > 0.6 or distance < 1.0) and max_iterations.
  4. Inference with new docs can be as simple as “which centroid is it closer to?” or “which centroid is it more similar to?”—or use a KNN approach.

This should give you ample flexibility to compare all four dual-centroid methods (similarity vs. distance, one-shot vs. iterative) and store results in your DataFrame. Good luck!

Question

all 4 methods give the exact same classification report and confusion matrix. So I guess I will stick with the first one. But I have a question. Our labeled dataset is only of size 15, with 5 being class 0 and 10 being class 1. Although the embeddings were generated on their own, we calculated the nonpub_centroid and pub_centroid based on the existing labeled data's embedding.

Since we only do a similarity measure against this, could that mean we are overfitting?
When I finally did a classification report using the test data labels and the labels predicted via maybe_nonpub, I got all perfect scores in the report: F1, precision, accuracy, and recall are all 1.

But we didn't do supervised learning, so the model didn't learn the documents. Does having a centroid based on the existing labeled values still count as overfitting, given our classification report?

I also manually labeled some of the unlabeled data for my own inference. This manually labeled data cannot be used to compute F1, accuracy, precision, or recall, because those papers are officially unlabeled and using them that way could go against the rules of the competition I am participating in, so I can only use it for my own inference. On this manually labeled data the model seems to perform quite well and generalise decently enough, but on the test data I get perfect classification metrics.

For more context, the data given to us is only the public data; the organisers have more private data and will test on that. Guide me through this and think deeply. I have to write a report on this process when I make a submission. Will the judges consider this overfitting?

Just to be safe, can you give me a k-means clustering algorithm that does not base its cluster centroids on the labeled data, and instead just lets k-means do its own thing and find 2 clusters, from which I can split the data into classes? Give me this code and make sure to give me 2D projections via PCA, t-SNE, and Isomap for this method. The visualization should highlight the given labeled data, i.e. those 15 originally labeled examples. Make sure the code works with my existing data structures and code and how it's structured.

Thought

Applying clustering

I’m piecing together a method to cluster data using KMeans, then project it to 2D with PCA, t-SNE, and Isomap. Notably, no supervised classification is employed, and 15 labeled points are highlighted.

Outlining the approach

I’m detailing the plan to visualize 2D data plots using PCA, t-SNE, and Isomap. Each subplot will showcase clusters and highlight 15 labeled data points for clarity.

Thought for 6 seconds

Answer

Below is a detailed discussion of your question about whether centroid-based classification is “overfitting,” followed by a K-Means approach with 2 clusters that ignores the labels entirely when clustering. Then we visualize the K-Means results with three different 2D projections (PCA, t-SNE, and ISOMAP) in a single code snippet, highlighting your 15 labeled examples.


1. Are We Overfitting with Dual-Centroid?

1.1 Why You See Perfect Scores on Your 15 Docs

  • You only have 15 labeled examples: 5 non-publishable (Class 0) and 10 publishable (Class 1).
  • You compute two centroids from these 15 points and then classify exactly these same 15 points.
  • With so few points, each one is almost always closest (or most similar) to its own class centroid, which is exactly what you observed. Hence, the classification report on these same 15 docs gives a perfect score (F1 = 1.0).

Is it Overfitting?

  • Technically, you’re not doing supervised training in the typical sense (e.g., training a classifier and backpropagating). You’re simply using the data to define two points (the centroids). In that sense, it’s not “overfitting” in the normal machine-learning sense.
  • However, because your labeled set is so small, it’s trivial to achieve 100% accuracy on that same set. This doesn’t necessarily reflect how the method will perform on unseen data.

1.2 Will Judges Consider This Overfitting?

  • In many competitions, using only 15 labeled examples to tune a model that performs “perfectly” on those same 15 examples is indeed suspicious. But your approach is unavoidably going to produce a perfect classification on them, because you are directly measuring distance to their own means.
  • The real test is whether your classification logic generalizes to the private test set.

If your manual checks on unlabeled data suggest it’s doing “decently,” that’s a good sign. But keep in mind, your final performance on the private data is what the judges will rely on to see if it truly generalizes or not.
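
One way to get a less flattering (and more informative) number out of only 15 labels is a leave-one-out check: hold out one labeled doc, rebuild both centroids from the remaining 14, and classify the held-out doc. A minimal sketch, reusing doc_embeddings, documents, and the labeled file sets, with cosine similarity from scikit-learn:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

labeled = [(i, 0) for i, d in enumerate(documents) if d['filename'] in labeled_files_class0] + \
          [(i, 1) for i, d in enumerate(documents) if d['filename'] in labeled_files_class1]

correct = 0
for held_idx, held_y in labeled:
    rest = [(i, y) for i, y in labeled if i != held_idx]
    # Centroids built without the held-out doc, so it never defines its own centroid
    c0 = np.mean([doc_embeddings[i] for i, y in rest if y == 0], axis=0)
    c1 = np.mean([doc_embeddings[i] for i, y in rest if y == 1], axis=0)

    emb = np.asarray(doc_embeddings[held_idx]).reshape(1, -1)
    sim0 = cosine_similarity(emb, c0.reshape(1, -1))[0, 0]
    sim1 = cosine_similarity(emb, c1.reshape(1, -1))[0, 0]
    pred = 0 if sim0 >= sim1 else 1
    correct += int(pred == held_y)

print(f"Leave-one-out accuracy on the 15 labeled docs: {correct}/{len(labeled)}")

This is still not a guarantee about the private set, but it is a more honest number to put in your report than the trivially perfect one.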


2. K-Means Without Using Labels (2 Clusters)

You asked for a purely unsupervised K-Means approach that does not use the labeled data to define the cluster centers. Instead, we let K-Means discover 2 clusters from the entire dataset (150 documents). Then we can see if cluster 0 aligns well with non-publishable or cluster 1 aligns well with publishable, etc.

2.1 Step-by-Step Code Explanation

  1. Fit K-Means with k=2.
  2. Assign each doc a cluster label (0 or 1).
  3. Visualize the results in 2D using:
    • PCA
    • t-SNE
    • ISOMAP
      Each of these is a different dimensionality-reduction technique for visualization.
  4. Highlight your 15 labeled docs with black edges (and annotate them as “C0” or “C1” for clarity).

Below is a single code snippet that does all three visualizations in subplots. Adjust as you see fit.


3. Code Snippet

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

# Suppose we have:
# doc_embeddings: shape [n_docs, d]
# documents: list of dicts, each with doc['filename']
# labeled_files_class0 = { "R001.pdf", ..., "R005.pdf" }
# labeled_files_class1 = { "R006.pdf", ..., "R015.pdf" }

# 1) Perform K-Means with 2 clusters, ignoring labels
kmeans_2 = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans_2.fit_predict(doc_embeddings)

# Store cluster labels in documents
for i, doc in enumerate(documents):
    doc['kmeans_label'] = int(kmeans_labels[i])

print("Finished K-Means with 2 clusters. Label distribution:")
unique, counts = np.unique(kmeans_labels, return_counts=True)
for lab, count in zip(unique, counts):
    print(f"Cluster {lab}: {count} documents")

# 2) We'll define a helper function to plot 2D projections
def plot_2d_projection(proj_matrix, title):
    """
    proj_matrix: shape [n_docs, 2]
    title: string for the plot
    """
    x_coords = proj_matrix[:, 0]
    y_coords = proj_matrix[:, 1]

    # color by kmeans_label
    clusters = [doc['kmeans_label'] for doc in documents]
    scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
    plt.title(title)
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")

    # highlight the 15 labeled docs
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in labeled_files_class0 or fname in labeled_files_class1:
            # black edge circle
            plt.scatter(x_coords[i], y_coords[i], c='none',
                        edgecolors='black', marker='o', s=100, linewidths=1.5)
            if fname in labeled_files_class0:
                label_text = "C0"
            else:
                label_text = "C1"
            plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

    # Optionally add a legend for cluster coloring
    unique_clusters = list(set(clusters))
    handles = scatter.legend_elements()[0]
    legend_labels = [f"Cluster {uc}" for uc in unique_clusters]
    plt.legend(handles, legend_labels, title="K-Means Clusters", loc="best")

# 3) Prepare 2D projections: PCA, t-SNE, and ISOMAP
pca = PCA(n_components=2, random_state=42)
pca_2d = pca.fit_transform(doc_embeddings)

tsne = TSNE(n_components=2, random_state=42)
tsne_2d = tsne.fit_transform(doc_embeddings)

iso = Isomap(n_components=2)
iso_2d = iso.fit_transform(doc_embeddings)

# 4) Plot them in subplots
fig = plt.figure(figsize=(18, 5))

plt.subplot(1, 3, 1)
plot_2d_projection(pca_2d, "PCA (2D) of doc_embeddings")

plt.subplot(1, 3, 2)
plot_2d_projection(tsne_2d, "t-SNE (2D) of doc_embeddings")

plt.subplot(1, 3, 3)
plot_2d_projection(iso_2d, "ISOMAP (2D) of doc_embeddings")

plt.tight_layout()
plt.show()

Notes:

  • We do not use the 15 labeled examples for training the clusters. K-Means is purely unsupervised.
  • After the clustering, you can inspect how many of your 5 known class-0 docs ended up in cluster 0 or cluster 1, similarly for your 10 class-1 docs. This can help you decide if cluster 0 corresponds to non-publishable or publishable.
  • If you want a single classification from K-Means, you could say “Cluster 0 => non-publishable, Cluster 1 => publishable” if the majority of your class-0 docs ended up in cluster 0, for example. However, that is a post-hoc labeling approach (sketched below).
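
A small sketch of that post-hoc mapping (reusing doc['kmeans_label'] from the snippet above; kmeans_class is just an illustrative field name):

from collections import Counter, defaultdict

cluster_votes = defaultdict(Counter)   # cluster_id -> Counter over known classes
for doc in documents:
    if doc['filename'] in labeled_files_class0:
        cluster_votes[doc['kmeans_label']][0] += 1
    elif doc['filename'] in labeled_files_class1:
        cluster_votes[doc['kmeans_label']][1] += 1

print("Labeled docs per cluster:", {c: dict(v) for c, v in cluster_votes.items()})

# Map each cluster to the majority class among the labeled docs it contains
cluster_to_class = {c: votes.most_common(1)[0][0] for c, votes in cluster_votes.items()}

for doc in documents:
    # None if a cluster contains no labeled docs at all
    doc['kmeans_class'] = cluster_to_class.get(doc['kmeans_label'])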

4. Final Thoughts

  • Perfect Classification on 15 Docs: A side effect of using those same docs to define your centroids. This is not overfitting in the typical sense, but be aware it tells you very little about true generalization.
  • Private Test: The real test is on the private data. As long as you’ve carefully selected your method and tried not to artificially tune it too heavily to those 15 examples, you should be fine.
  • K-Means 2 Clusters: This is a purely unsupervised baseline ignoring labels. Use the 2D visualizations to see if it forms any meaningful separation. You might find that your known 5 non-publishable docs are mostly in one cluster, and your 10 publishable docs are in another. Or it might mix them.

With this code, you can show a simpler unsupervised approach to the judges, demonstrating you didn’t rely solely on your labeled docs to define the entire classification logic. Good luck!

Question

It is crazy that even with pure k-means I am now getting the exact same result. Why is that?

Until now we have been calculating embeddings using the function below:
def get_document_embedding(processed_sections):
    section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()]
    if not section_texts:
        return np.zeros(768)
    embeddings = embed_model.encode(section_texts, convert_to_numpy=True)
    return np.mean(embeddings, axis=0)

Here we calculate embeddings for all sections and then just average them, which makes me think we lose a lot of information. So I want you to modify this code so that each section's embedding is also stored under that section's name. If a document lacks a section, it's fine to store np.zeros as you already did. We can keep the document-level embedding as well. Now change the k-means code to run k-means on every single section embedding and plot them using the code you gave earlier, and also give the code for the overall document-level embedding visualization. Make sure you modularise the visualization function so we can loop it over the section-level embeddings.

Below are the only sections in the data:
['Unlabeled', 'Abstract', 'Introduction', 'Methods', 'Results',
'Related work', 'Experiments', 'Conclusion', 'Discussion',
'Background', 'References']

There are no other sections occurring in the data.

Thought

Organizing embeddings

I’m mapping out a way to store embeddings for each document section and an overall document embedding. This involves modularizing the visualization for easier access to individual section KMeans results.

Formulating the plan

I’m outlining a modular function to build embeddings for each document section and an overall document embedding. Then, I’ll apply KMeans clustering and visualize 2D embeddings using PCA, t-SNE, or Isomap.

Pulling together final code

I'm composing a comprehensive code snippet. It builds section embeddings and doc-level embeddings, applies KMeans clustering, and visualizes using a modularized approach, maintaining the user's preferred coding style.

Thought for 10 seconds

Answer

Below is a complete solution that:

  1. Stores embeddings per section (one embedding for “Abstract,” one for “Introduction,” etc.) in each document’s data structure.
  2. Computes an overall document-level embedding (by averaging all available section embeddings).
  3. Runs K-Means separately for each section-level embedding (as well as the overall document embedding).
  4. Visualizes each set of embeddings (PCA, t-SNE, ISOMAP) in a modular fashion, highlighting your 15 labeled examples.

We’ll assume you have:

  • A list of dictionaries called documents, where doc['sections'] is a list of (section_name, text) tuples.
  • A pre-trained model embed_model (like SciBERT) that provides an .encode(...) method returning a 768-dim vector (or a different dimension; just adjust accordingly).
  • Your labeled sets:
    labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
    labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                            "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
  • A known set of potential sections:
    SECTIONS = ["Unlabeled", "Abstract", "Introduction", "Methods", "Results",
                "Related work", "Experiments", "Conclusion", "Discussion",
                "Background", "References"]

Feel free to adjust naming or add your own custom logic. This code is meant to be modular so you can run a loop over all sections and produce separate K-Means + 2D visualizations.


1) Building Section Embeddings and Overall Document Embeddings

import numpy as np

def build_section_and_doc_embeddings(documents, embed_model):
    """
    For each document in 'documents', build:
      doc['section_embeddings'] = {
          "Abstract": <768-dim embedding or zeros>,
          "Introduction": <768-dim embedding or zeros>,
          ...
      }
      doc['doc_embedding'] = <768-dim average across all available sections>

    'documents' is a list of dicts, each with:
      doc['sections'] = list of (sec_name, sec_text)
    embed_model.encode() returns a numpy array embedding for given text input.
    """
    SECTIONS = [
        "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
        "Related work", "Experiments", "Conclusion", "Discussion",
        "Background", "References"
    ]

    for doc in documents:
        # Initialize a dict that maps section_name -> embedding
        doc['section_embeddings'] = {}
        # We’ll store per-section embeddings in a dictionary
        # e.g. doc['section_embeddings']['Abstract'] = np.array([...])

        # We might have multiple sections in doc['sections']. Let's gather them in a dict for easy access
        # e.g. actual_sections_found = { "Abstract": text_for_abstract, ... }
        actual_sections_found = {}
        for (sec_name, sec_text) in doc['sections']:
            actual_sections_found[sec_name] = sec_text

        # For each possible section in SECTIONS, build an embedding
        for section_name in SECTIONS:
            if section_name in actual_sections_found:
                raw_text = actual_sections_found[section_name].strip()
                if len(raw_text) == 0:
                    # If empty text, store zeros
                    doc['section_embeddings'][section_name] = np.zeros(
                        embed_model.get_sentence_embedding_dimension())
                else:
                    # Encode this section's text
                    embedding = embed_model.encode(raw_text, convert_to_numpy=True)
                    # embed_model.encode(...) returns a 768-dim array if you're using SciBERT or similar
                    # If you have large text, you might want to do chunking or similar, but here's the simple approach
                    doc['section_embeddings'][section_name] = embedding
            else:
                # If doc doesn't have this section, store zeros
                doc['section_embeddings'][section_name] = np.zeros(
                    embed_model.get_sentence_embedding_dimension())

        # Now, compute doc['doc_embedding'] by averaging all non-zero embeddings
        nonzero_vectors = []
        for sec_name, emb in doc['section_embeddings'].items():
            # If emb is all zeros, skip
            if np.any(emb):  # i.e. not all zeros
                nonzero_vectors.append(emb)

        if len(nonzero_vectors) == 0:
            # If no valid sections, store a zeros vector
            doc['doc_embedding'] = np.zeros(embed_model.get_sentence_embedding_dimension())
        else:
            doc['doc_embedding'] = np.mean(nonzero_vectors, axis=0)

Notes:

  • We use embed_model.encode(raw_text, convert_to_numpy=True) to get a single embedding for that entire section.
  • If a section is missing (or is empty), we store a zeros vector.
  • The overall doc embedding is the mean of all non-zero section embeddings.
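
The comment in the code mentions chunking long sections; here is one minimal way to do it (a sketch: encode_long_text is a hypothetical helper, and the 200-word chunk size is an arbitrary choice meant to keep each chunk under typical model input limits):

import numpy as np

def encode_long_text(embed_model, text, max_words=200):
    """Split a long section into word chunks, encode each chunk, and mean-pool."""
    words = text.split()
    if not words:
        return np.zeros(embed_model.get_sentence_embedding_dimension())
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    chunk_embs = embed_model.encode(chunks, convert_to_numpy=True)
    return np.mean(chunk_embs, axis=0)

You would call this in place of the single embed_model.encode(raw_text, ...) line for sections that exceed your model's input length.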

2) Running K-Means for Each Section + Document-Level

We’ll define a function that:

  1. Gathers embeddings for a given “section_name” across all docs.
  2. Applies K-Means with some k (for instance, k=5).
  3. Stores the cluster label in doc['kmeans_{section_name}'].
  4. Optionally do the same for doc_embedding.

from sklearn.cluster import KMeans

def run_kmeans_for_section(documents, section_name, n_clusters=5, random_state=42):
    """
    Gathers the embedding for 'section_name' from each doc and runs K-Means.
    Then stores the cluster label in doc['kmeans_{section_name}'].
    """
    # 1) Build an array of shape [n_docs, embedding_dim] for this section
    all_embeddings = []
    for doc in documents:
        # E.g. doc['section_embeddings']['Abstract']
        emb = doc['section_embeddings'][section_name]
        all_embeddings.append(emb)
    all_embeddings = np.array(all_embeddings)  # shape: [n_docs, d]

    # 2) K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
    labels = kmeans.fit_predict(all_embeddings)

    # 3) Store the label in the doc
    for i, doc in enumerate(documents):
        doc[f'kmeans_{section_name}'] = int(labels[i])

def run_kmeans_for_doc_embedding(documents, n_clusters=5, random_state=42):
    """
    Gathers doc['doc_embedding'] for each doc, runs K-Means,
    and stores cluster label in doc['kmeans_doc'].
    """
    # 1) Build array of shape [n_docs, d] from doc['doc_embedding']
    all_doc_emb = [doc['doc_embedding'] for doc in documents]
    all_doc_emb = np.array(all_doc_emb)

    # 2) K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
    labels = kmeans.fit_predict(all_doc_emb)

    # 3) Store label
    for i, doc in enumerate(documents):
        doc['kmeans_doc'] = int(labels[i])

3) Visualization: PCA, t-SNE, ISOMAP in a Single Function

We want to pass in a numpy array of shape [n_docs, d] and a list of cluster labels. Then we do:

  • PCA
  • t-SNE
  • ISOMAP

We create subplots for each. We also highlight the labeled docs. This is very similar to your previous code, but more modular so we can re-use it for each section or for doc-level embeddings.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

def visualize_2d_projections(embeddings, cluster_labels, documents,
                             labeled_files_class0, labeled_files_class1,
                             title_prefix=""):
    """
    embeddings: shape [n_docs, d]
    cluster_labels: shape [n_docs], e.g. from KMeans
    documents: the list of doc dicts
    labeled_files_class0, labeled_files_class1: sets for highlight
    title_prefix: string to prefix in the title, e.g. "Abstract" or "doc_embedding"
    """
    # 1) Prepare the 3 transformations
    pca = PCA(n_components=2, random_state=42)
    pca_2d = pca.fit_transform(embeddings)

    tsne = TSNE(n_components=2, random_state=42)
    tsne_2d = tsne.fit_transform(embeddings)

    iso = Isomap(n_components=2)
    iso_2d = iso.fit_transform(embeddings)

    fig = plt.figure(figsize=(18, 5))

    # We'll define a helper closure to plot each sub-figure
    def plot_projection(ax, proj_matrix, method_name):
        x_coords = proj_matrix[:, 0]
        y_coords = proj_matrix[:, 1]
        scatter = ax.scatter(x_coords, y_coords, c=cluster_labels, cmap='tab10', alpha=0.6)
        ax.set_title(f"{title_prefix} - {method_name}")
        ax.set_xlabel("Component 1")
        ax.set_ylabel("Component 2")

        for i, doc in enumerate(documents):
            fname = doc['filename']
            if fname in labeled_files_class0 or fname in labeled_files_class1:
                ax.scatter(x_coords[i], y_coords[i], c='none',
                           edgecolors='black', marker='o', s=100, linewidths=1.5)
                label_text = "C0" if fname in labeled_files_class0 else "C1"
                ax.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

        unique_clusters = list(sorted(set(cluster_labels)))
        handles = scatter.legend_elements()[0]
        legend_labels = [f"Cluster {uc}" for uc in unique_clusters]
        ax.legend(handles, legend_labels, title="K-Means Clusters", loc="best")

    ax1 = fig.add_subplot(1, 3, 1)
    plot_projection(ax1, pca_2d, "PCA")

    ax2 = fig.add_subplot(1, 3, 2)
    plot_projection(ax2, tsne_2d, "t-SNE")

    ax3 = fig.add_subplot(1, 3, 3)
    plot_projection(ax3, iso_2d, "ISOMAP")

    plt.tight_layout()
    plt.show()

4) Putting It All Together

Below is an example workflow that you can adapt to your code:

###############################################################################
# EXAMPLE WORKFLOW
###############################################################################

# 1) Build section embeddings and doc-level embeddings
build_section_and_doc_embeddings(documents, embed_model)

# 2) For each section in our known set, run K-Means + visualize
SECTIONS = [
    "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
    "Related work", "Experiments", "Conclusion", "Discussion",
    "Background", "References"
]

for section_name in SECTIONS:
    # 2a) Run K-Means
    run_kmeans_for_section(documents, section_name, n_clusters=5, random_state=42)

    # 2b) Gather the embeddings array for that section
    section_embs = []
    section_labels = []
    for i, doc in enumerate(documents):
        emb = doc['section_embeddings'][section_name]
        lbl = doc[f'kmeans_{section_name}']
        section_embs.append(emb)
        section_labels.append(lbl)
    section_embs = np.array(section_embs)
    section_labels = np.array(section_labels)

    # 2c) Visualize
    visualize_2d_projections(
        embeddings=section_embs,
        cluster_labels=section_labels,
        documents=documents,
        labeled_files_class0=labeled_files_class0,
        labeled_files_class1=labeled_files_class1,
        title_prefix=f"{section_name} Section"
    )

# 3) Do doc-level K-Means
run_kmeans_for_doc_embedding(documents, n_clusters=5, random_state=42)

# 4) Gather doc-level embeddings + labels
doc_embs = []
doc_labels = []
for i, doc in enumerate(documents):
    emb = doc['doc_embedding']
    lbl = doc['kmeans_doc']
    doc_embs.append(emb)
    doc_labels.append(lbl)
doc_embs = np.array(doc_embs)
doc_labels = np.array(doc_labels)

visualize_2d_projections(
    embeddings=doc_embs,
    cluster_labels=doc_labels,
    documents=documents,
    labeled_files_class0=labeled_files_class0,
    labeled_files_class1=labeled_files_class1,
    title_prefix="Document-Level"
)

What this does:

  • Builds per-section embeddings in doc['section_embeddings'][section_name].
  • Builds a doc-level embedding in doc['doc_embedding'].
  • Runs K-Means for each section (with 5 clusters, but you can set k=2 if you want pure “non-pub vs. pub”—that’s up to you).
  • Visualizes each section’s clustering in 3 different 2D projections (PCA, t-SNE, ISOMAP).
  • Finally, does the same for the doc-level embeddings.

If you want exactly 2 clusters for “publishable vs. non-publishable,” just replace n_clusters=5 with n_clusters=2. Or do both and see how they differ.
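
Either way, a quick cross-tabulation shows how the 15 labeled docs distribute over one section's clusters (a sketch; swap 'kmeans_Abstract' for any other kmeans_{section_name} field or 'kmeans_doc'):

import pandas as pd

rows = []
for doc in documents:
    if doc['filename'] in labeled_files_class0:
        rows.append({'true_class': 0, 'cluster': doc['kmeans_Abstract']})
    elif doc['filename'] in labeled_files_class1:
        rows.append({'true_class': 1, 'cluster': doc['kmeans_Abstract']})

labeled_df = pd.DataFrame(rows)
print(pd.crosstab(labeled_df['true_class'], labeled_df['cluster']))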


5) Summary

  1. Section-Specific Embeddings: You’re no longer losing detail by averaging everything upfront. Each section gets its own embedding.
  2. Doc-Level Embedding: Still useful for a high-level summarization.
  3. Unchanged Labeled Data: You can still highlight your 5 non-pub and 10 pub examples on each of these K-Means cluster plots.
  4. Why You Might See Similar Clustering: If many documents share similar structure or your embedding model lumps them similarly, you might still end up with near-identical clusters. But hopefully, focusing on specific sections (like “Abstract,” “Methods,” etc.) can reveal more nuanced differences.

This modular approach should integrate cleanly with your existing data structures (documents, embed_model). You can tweak n_clusters, or do further analysis on each section’s clusters to see if, for instance, “Abstract” embeddings separate out certain flawed papers more clearly.

Good luck, and feel free to adapt the code as needed!

Question

Even with these, I get the same results for the final classification. So what can we conclude?

Thought

Thought for a second

Answer

Below are some reflections on why you might still be seeing identical final classifications despite introducing section-level embeddings, multiple clustering approaches, etc.:


1. Why the Results Could Be the Same

  1. Homogeneity in the Data
    If most of your papers (both publishable and non-publishable) share similar language and vocabulary in every section—i.e., the “Results” and “Discussion” sections look more or less the same across many papers—then averaging or section-level embeddings might not introduce enough variability to shift the clusters. In other words, the model “sees” roughly the same signals no matter how you segment the text.

  2. Strong Dominant Signal
    Some domain-specific terms or writing style might dominate across all sections. For example, if every paper includes standard ML terminology or references, it might overwhelm the differences that typically differentiate publishable vs. non-publishable content. Thus, whether you look at the “Introduction” or “Methods” or combine them, the embeddings land in roughly the same space.

  3. Limited Label Diversity
    You only have 15 labeled examples (5 non-publishable, 10 publishable). If they are not diverse enough or if the unlabeled set is large and somewhat homogeneous, the clustering and centroid-based methods often yield the same outcome. The data itself may not be providing strong opposing signals to re-separate the clusters.

  4. Robustness of the Embedding Model
    If your chosen embedding model (e.g., SciBERT) consistently encodes the papers in a similar fashion, the different partitionings (section-level vs. entire doc) might just lead to near-identical vectors. For instance, if the “Methods” section heavily overlaps with the “Introduction” content, the average or sum might not differ enough to rearrange the cluster memberships.


2. What These Results Suggest

  1. The Data is Not Contradictory
    The fact that multiple approaches lead to the same classification implies that, from a textual perspective, the boundary between “publishable” and “non-publishable” (as your data is capturing it) is consistent. The data is not easily teased apart differently by slicing or chunking.

  2. Consistency in the Embedding Space
    You might actually have a solid embedding approach. If all these transformations (section-level vs. doc-level) converge to the same outcome, it might mean the representation is stable and not very sensitive to how you slice the document text.

  3. Possibility of Domain Constraints
    Some tasks simply do not vary enough across sections for them to produce differing clusters. If every paper has the same general structure, length, or domain keywords, the embeddings can look very alike regardless of how you chunk or average them.


3. Potential Next Steps

  1. Refine or Add Additional Features

    • Heuristics: If you suspect certain “red flags” (like extremely short Methods sections), encode that as an additional numerical feature (a sketch of such features follows this list).
    • Topic Modeling: Already tried? If you find certain nonsense topics, feed that as an extra signal into classification.
  2. Examine “Borderline” Cases

    • If you have any borderline docs (confidence near 0.5 in a similarity-based approach), inspect them manually. Are they truly borderline or does the text contain something that’s not being captured by the embedding?
  3. Increase Labeled Diversity

    • If feasible, label more papers or ensure your 5 non-publishable examples are truly representative of varied reasons for rejection. If all 5 share an extremely similar writing style, the centroid might not generalize.
  4. Try Another Embedding Model

    • Domain-Focused: If your dataset is heavily domain-specific, perhaps a different or more specialized model might reveal distinctions SciBERT is glossing over.
    • Document Transformers (Longformer, BigBird) can capture longer contexts.
  5. Validate with External Data

    • If you have any external “gold standard” non-publishable or publishable papers, run them through to see if they get classified as expected. This can reveal if your approach is truly robust or just consistent within your 150-paper set.
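
As an illustration of the heuristic-feature idea in step 1 above, here is a minimal sketch; the specific features (section word counts, a count of suspiciously high reported percentages) are only examples, and heuristic_features is a hypothetical helper built on doc['sections']:

import re
import numpy as np

def heuristic_features(doc):
    """A few hand-crafted 'red flag' features to sit alongside the embeddings."""
    section_text = {name: text for name, text in doc['sections']}
    methods_text = section_text.get('Methods', '')
    full_text = " ".join(text for _, text in doc['sections'])

    # Claims of 99% or higher are often worth a second look
    high_claims = len(re.findall(r'\b(?:99|100)(?:\.\d+)?\s*%', full_text))

    return np.array([
        len(methods_text.split()),           # Methods length in words
        len(full_text.split()),              # total length in words
        high_claims,                         # count of >= 99% figures
        int('References' in section_text),   # has a References section at all
    ], dtype=float)

heuristic_matrix = np.array([heuristic_features(doc) for doc in documents])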

4. Conclusion

Getting the same final classification from different chunking/averaging strategies often means:

  • The differences between sections are not large enough to reorder embeddings in a meaningful way, or
  • The embedding model is robust in how it encodes text, leading to stable results.

It does not necessarily mean you are doing something wrong—rather, it suggests your data and chosen embeddings converge on the same representation regardless of how you slice the paper text. This can be interpreted as stability in your approach.

The best way to be sure is to test on truly unseen data or carefully label more documents to see if there are any subtle differences the current pipeline is missing.
