LLM-Based Classification Strategies
Created on: January 10, 2025
Addressing imbalanced data
I’m developing a strategy to classify research papers as publishable or non-publishable, given a 2:1 imbalance ratio and 15 labeled papers. This involves examining text cleaning, representation, and various model approaches like generative, pipeline, and embedding models.
Handling imbalanced data
I’m thinking through how to handle data imbalance and section segmentation, possibly using bag-of-words. Discussing multi-class pipeline approaches, data augmentation, self-supervised methods, and classification strategies.
Thought about research paper classification strategies for 18 seconds
Below is a systematic breakdown of possible approaches and considerations for building a highly accurate and robust framework to classify research papers as “Publishable” or “Non-Publishable,” given the constraints and context you described:
Synthetic Document Generation / Data Augmentation
Document Splitting + Label Inheritance
Active Learning or Semi-Supervised Learning
Oversampling / Undersampling
Class Weighting / Focal Loss (a minimal sketch follows below)
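As a minimal sketch of the class-weighting idea (assuming a scikit-learn classifier; the feature matrix X and labels y below are placeholders for whatever document features you end up using):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Placeholder labels reflecting the ~2:1 imbalance (1 = publishable, 0 = non-publishable)
y = np.array([1] * 10 + [0] * 5)
X = np.random.rand(len(y), 16)  # stand-in features; replace with your real document representations

# Option 1: let scikit-learn balance the classes automatically
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)

# Option 2: compute explicit weights and pass them in
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
clf = LogisticRegression(class_weight={0: weights[0], 1: weights[1]}, max_iter=1000)
clf.fit(X, y)

Focal loss is the analogous idea for neural classifiers: it down-weights easy, well-classified examples so the minority class contributes more to the gradient.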
Splitting the documents by major sections (Abstract, Introduction, Methodology, etc.) can help isolate where certain “red flags” for non-publishability might appear: for instance, an unusually short Methodology section or a missing References section.
Issue: Not all papers have a uniform section structure.
Given the text-length and complexity, large pre-trained language models (PLMs) such as BERT, RoBERTa, DeBERTa, or Longformer (for longer context) can be used:
Longformer / BigBird
Chunk-Level Approach with BERT
Instruction-Finetuned or GPT-based Models Locally
For smaller data scenarios, a simpler pipeline might be surprisingly effective (a short sketch follows this list):
Feature Extraction
Classification Algorithms
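A minimal sketch of that simpler pipeline, assuming TF-IDF features feeding a classical classifier (the toy texts and labels below are placeholders for your preprocessed papers and 15 known labels):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Placeholder corpus; replace with your preprocessed paper texts and known labels
texts = [
    "we evaluate our method on three benchmark datasets and report f1 scores",
    "we achieve 100 percent accuracy with no baseline comparison or dataset description",
]
labels = [1, 0]  # 1 = publishable, 0 = non-publishable

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words="english")),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["a new paper's preprocessed text goes here"]))

With only 15 labels, a heavily regularized linear model over sparse features is often harder to overfit than a fine-tuned transformer.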
Below is an illustration of a pipeline that incorporates many of the above ideas:
Preprocessing & Document Splitting
Embeddings
Aggregation
Address Imbalance
Train Classifier
Active Learning for Additional Labels
While you cannot upload all data to GPT-4 or other LLMs directly for classification, you can still use them to aid certain subtasks:
Section Extraction / Title Identification
Topic Tagging
Keyword Generation
Following these steps in a well-structured pipeline can yield a high-performing system, even with very limited labeled data and heavily imbalanced classes. By integrating data-centric approaches (like augmentation and active learning) with advanced NLP architectures (like Longformer/BERT for large text), you stand a strong chance of building a solution that meets the accuracy and F1 requirements for the hackathon.
Taking a closer look
I’m gathering info on unsupervised classification for 150 papers, considering heuristic-based section detection, hierarchical models, rule-based systems, semantic similarity, fine-tuning, and LLM-generated procedures.
Crafting the plan
Thinking through a Python-based approach with Kaggle libraries, using GPU/TPU. It involves PDF extraction, heuristic detection, unsupervised classification, hierarchical models, and rule-based or LLM-generated pattern generation.
Mapping the process
I’m planning to provide end-to-end Python code for each approach, focusing on unsupervised methods like clustering to spot suspicious patterns. Code blocks will be concise and illustrative.
Assessing documents
I’m pulling together code to chunk and store PDFs, apply unsupervised techniques like clustering with Sentence Transformers and KMeans for outlier detection, and develop a rule-based approach for spotting missing elements.
Assessing similarity
I’m crafting a semantic similarity snippet to match document embeddings, suggesting non-publishable examples if distances fall within a set threshold, and noting the need for GPU specification in Kaggle.
Thought about alternative approaches for classification for 9 seconds
Below is a detailed, end-to-end outline of how you might implement several of the ideas you highlighted—ranging from heuristic-based section detection, a hierarchical model design, rule-based “red flag” detection, unsupervised clustering, and leveraging semantic similarity. The goal is to provide a variety of methods you can try in your Kaggle environment. Where relevant, I’ve included code snippets (in Python) that you can adapt. I’ll also point out potential pitfalls or ways to improve your approach.
Preprocessing & Section Extraction
1.1 Basic Preprocessing vs. Context-Aware Preprocessing
1.2 Heuristic-Based Section Detection
Unsupervised/Clustering Approach
2.1 Embedding Generation (Sentence Transformers)
2.2 K-Means / Hierarchical Clustering
2.3 Interpreting Clusters & Using Minimal Labels
Hierarchical Model for Long Documents
3.1 Chunking into Sections (or Subsections)
3.2 Encoding & Combining Section Embeddings
3.3 Fine-Tuning a Small Classifier Head (Optional)
Rule-Based “Red Flag” Detection
4.1 Designing Heuristics
4.2 Scoring & Combining with Other Methods
Semantic Similarity to Known Non-Publishable Papers
5.1 Building a Similarity Index
5.2 Threshold-Based Classification
Putting It All Together: Example Pipeline
Improving & Checking for Pitfalls
Complete Code Examples
Throughout, I’ll highlight ways to utilize your small set of 15 labeled papers effectively—even if it’s only at the end as a final “label assignment” step.
What you’re currently doing:
def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower()
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)
Suggestions:
Keep characters that can carry meaning in research papers, such as %, =, and possibly parentheses ( ) if relevant.
Many research papers follow a rough structure with headings like Abstract, Introduction, Background/Related Work, Methodology, Results, Discussion, Conclusion, and References.
Heuristic approach:
Below is a conceptual snippet for Kaggle (Python) to detect sections:
import re

def split_into_sections(pdf_text):
    # Possible common headings to look for
    headings = [
        r"(abstract)\b",
        r"(introduction)\b",
        r"(background|related\s?work)\b",
        r"(methodology|methods|experimental\s?setup)\b",
        r"(results|experiments)\b",
        r"(discussion)\b",
        r"(conclusion|conclusions)\b",
        r"(references)\b"
    ]
    # We'll store tuples of (section_name, section_text)
    sections = []
    current_section_name = "Unlabeled"
    current_section_text = []
    lines = pdf_text.split('\n')
    for line in lines:
        line_lower = line.lower().strip()
        matched_heading = None
        for hd in headings:
            if re.match(hd, line_lower):
                matched_heading = line.strip()
                break
        if matched_heading:
            # Save previous section if we have any text
            if current_section_text:
                sections.append((current_section_name, "\n".join(current_section_text)))
            # Start a new section
            current_section_name = matched_heading
            current_section_text = []
        else:
            current_section_text.append(line)
    # Add the last collected section if any
    if current_section_text:
        sections.append((current_section_name, "\n".join(current_section_text)))
    return sections

# Example usage
# pdf_text = "Your extracted text from pdfplumber"
# sections = split_into_sections(pdf_text)
# for sec_name, sec_content in sections:
#     print(sec_name, len(sec_content.split()))
What to do next:
Run each sec_content through your preprocess_text function or a more domain-aware pipeline.
A Sentence Transformers model (e.g., all-mpnet-base-v2, all-distilroberta-v1, or a domain-specific model like allenai/scibert_scivocab_uncased) can create embeddings for text segments. By splitting papers into sections or ~1,000-word chunks:
def chunk_text(text, chunk_size=1000):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield ' '.join(words[i:i+chunk_size])
Then process each chunk with a transformer (like SciBERT) to obtain a vector.
One approach (a “hierarchical” approach) is:
Stack each paper's chunk embeddings into a matrix of shape [num_chunks, embedding_dim], then aggregate (e.g., average) them into a single paper_embedding. Use that paper_embedding for either clustering (unsupervised) or as input to a small neural classifier if you do want to use your labeled data (a minimal sketch of this chunk-and-pool step appears below).
Even if you have only 15 labeled documents, you can still use them, for example to interpret clusters or to fine-tune a small classifier head on top of the aggregated embeddings.
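Picking up the chunk-and-pool idea above, here is a minimal sketch (it assumes the chunk_text helper shown earlier and a Sentence Transformers model; all-mpnet-base-v2 is just one possible choice):

import numpy as np
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("all-mpnet-base-v2")

def get_paper_embedding(text, chunk_size=1000):
    """Chunk a paper, embed each chunk, and mean-pool into a single paper_embedding."""
    chunks = list(chunk_text(text, chunk_size=chunk_size))  # generator defined above
    if not chunks:
        return np.zeros(embed_model.get_sentence_embedding_dimension())
    chunk_embeddings = embed_model.encode(chunks, convert_to_numpy=True)  # [num_chunks, embedding_dim]
    return chunk_embeddings.mean(axis=0)                                  # [embedding_dim]

Mean pooling is the simplest aggregation; weighting chunks by length or by section type is a possible refinement.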
Some potential signals for “Non-Publishable”:
You could create a simple “red flag” score for each paper:
This approach can run in parallel with your unsupervised model. You might ultimately label a paper “non-publishable” if its red-flag score is high, its similarity to the known non-publishable papers is high, or both.
For the similarity side, average the embeddings of the known non-publishable papers into a nonpub_centroid, then compute each document's cosine similarity to that nonpub_centroid. A simple approach: flag any document whose similarity exceeds a threshold chosen by inspecting the scores of your labeled papers.
Overfitting to the 15 labeled docs:
Inconsistent Section Headings:
Large Variation in Paper Length:
GPU/TPU Usage:
Below are some consolidated code snippets (in Python) that you can adapt in a Kaggle notebook. I'll assume you've already installed pdfplumber and sentence-transformers. (If not, you can run !pip install pdfplumber sentence-transformers at the top of your Kaggle notebook.)
import re
import pdfplumber
import torch
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Make sure to download NLTK data if not already present
# nltk.download('punkt')
# nltk.download('stopwords')

def extract_text_from_pdf(pdf_path):
    """Extract all text from a PDF using pdfplumber."""
    pdf_text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                pdf_text += page_text + "\n"
    return pdf_text

def custom_preprocess(text):
    """
    A slightly more cautious preprocessing that:
    - Keeps numbers, percentage signs,
    - Lowercases,
    - Removes some punctuation but keeps essential symbols.
    """
    # Lowercase
    text = text.lower()
    # Keep letters, numbers, whitespace, and symbols like '%', '=', '(', ')', '.', ','
    text = re.sub(r'[^a-z0-9%\=\(\)\.\,\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords (optional)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)

def split_into_sections(pdf_text):
    headings = [
        r"(abstract)\b",
        r"(introduction)\b",
        r"(background|related\s?work)\b",
        r"(methodology|methods|experimental\s?setup)\b",
        r"(results|experiments)\b",
        r"(discussion)\b",
        r"(conclusion|conclusions)\b",
        r"(references)\b"
    ]
    sections = []
    current_section_name = "Unlabeled"
    current_section_text = []
    lines = pdf_text.split('\n')
    for line in lines:
        line_lower = line.strip().lower()
        matched_heading = None
        for hd in headings:
            if re.match(hd, line_lower):
                matched_heading = line.strip()
                break
        if matched_heading:
            if current_section_text:
                sections.append((current_section_name, "\n".join(current_section_text)))
            current_section_name = matched_heading
            current_section_text = []
        else:
            current_section_text.append(line)
    if current_section_text:
        sections.append((current_section_name, "\n".join(current_section_text)))
    return sections
Usage in a Kaggle environment:
# Suppose you have a folder "papers" with PDF files
import os

pdf_dir = "./papers"
documents = []

for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        path = os.path.join(pdf_dir, filename)
        raw_text = extract_text_from_pdf(path)
        sections = split_into_sections(raw_text)

        # Preprocess each section
        processed_sections = []
        for sec_name, sec_text in sections:
            processed_text = custom_preprocess(sec_text)
            processed_sections.append((sec_name, processed_text))

        # Store the document data
        documents.append({
            'filename': filename,
            'sections': processed_sections
        })
Below is a simple example of clustering the average embedding of each paper. We’ll use Sentence Transformers to get embeddings.
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

# Load a pre-trained or domain-specific model (SciBERT, etc.)
model_name = 'all-mpnet-base-v2'  # or 'allenai/scibert_scivocab_uncased'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embed_model = SentenceTransformer(model_name, device=device)

def get_document_embedding(processed_sections):
    """
    Generate embeddings for each section, then average them.
    processed_sections: list of (sec_name, text) after preprocessing.
    """
    section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()]
    if not section_texts:
        # fallback if there's no text
        return np.zeros(768)
    # Encode the entire batch
    embeddings = embed_model.encode(section_texts, convert_to_numpy=True)
    doc_embedding = np.mean(embeddings, axis=0)
    return doc_embedding

# Compute embeddings for all documents
doc_embeddings = []
for doc in documents:
    embedding = get_document_embedding(doc['sections'])
    doc_embeddings.append(embedding)
doc_embeddings = np.array(doc_embeddings)

# K-Means Clustering
num_clusters = 5  # arbitrary choice; you can experiment
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(doc_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = labels[i]

# Inspect the cluster distribution
from collections import Counter
cluster_counts = Counter(labels)
print(cluster_counts)
Next Steps:
Create a few simple functions to detect “red flags,” e.g.:
def count_red_flags(processed_sections):
    flags = 0
    # 1) Check for unrealistic claims: e.g., "100% accuracy"
    text_joined = " ".join(sec_text for _, sec_text in processed_sections)
    if '100% accuracy' in text_joined or '100 percent accuracy' in text_joined:
        flags += 1
    # 2) Check if Methodology section is too short
    for sec_name, sec_text in processed_sections:
        if 'method' in sec_name.lower():
            # Adjust threshold to your taste
            if len(sec_text.split()) < 100:
                flags += 1
    # 3) Check if there's a references section at all
    has_references = any('references' in sec_name.lower() for sec_name, _ in processed_sections)
    if not has_references:
        flags += 1
    return flags

# Example usage
for doc in documents:
    doc['red_flag_score'] = count_red_flags(doc['sections'])

# Then you could set a threshold
threshold = 1
for doc in documents:
    if doc['red_flag_score'] >= threshold:
        doc['suspicious'] = True
    else:
        doc['suspicious'] = False
If you have 5 known non-publishable papers:
known_nonpub_embeddings = []
for doc in documents:
    if doc['filename'] in known_nonpub_filenames:  # e.g. the 5 known non-publishable
        emb = get_document_embedding(doc['sections'])
        known_nonpub_embeddings.append(emb)

nonpub_centroid = np.mean(known_nonpub_embeddings, axis=0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Score each doc
for doc in documents:
    emb = get_document_embedding(doc['sections'])
    sim = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = sim
Then set a threshold:
# For instance, see the distribution of similarity_to_nonpub for known publishable vs. known non-publishable
# This helps you pick a threshold
threshold = 0.7
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > threshold)
Below is a condensed version of how this might look in an actual Kaggle Notebook with minimal commentary:
# Step 0: Install dependencies (if needed) !pip install pdfplumber sentence-transformers import os import pdfplumber import re import torch import nltk import numpy as np from nltk.corpus import stopwords from nltk.tokenize import word_tokenize from sentence_transformers import SentenceTransformer from sklearn.cluster import KMeans from collections import Counter device = 'cuda' if torch.cuda.is_available() else 'cpu' model_name = 'all-mpnet-base-v2' embed_model = SentenceTransformer(model_name, device=device) # Download NLTK data nltk.download('punkt') nltk.download('stopwords') def extract_text_from_pdf(pdf_path): pdf_text = "" with pdfplumber.open(pdf_path) as pdf: for page in pdf.pages: page_text = page.extract_text() if page_text: pdf_text += page_text + "\n" return pdf_text def custom_preprocess(text): text = text.lower() # Keep certain punctuation text = re.sub(r'[^a-z0-9%\=$$\.\,\s]', '', text) tokens = word_tokenize(text) stop_words = set(stopwords.words('english')) tokens = [t for t in tokens if t not in stop_words] return ' '.join(tokens) def split_into_sections(pdf_text): headings = [ r"(abstract)\b", r"(introduction)\b", r"(background|related\s?work)\b", r"(methodology|methods|experimental\s?setup)\b", r"(results|experiments)\b", r"(discussion)\b", r"(conclusion|conclusions)\b", r"(references)\b" ] sections = [] current_section_name = "Unlabeled" current_section_text = [] lines = pdf_text.split('\n') for line in lines: line_lower = line.strip().lower() matched_heading = None for hd in headings: if re.match(hd, line_lower): matched_heading = line.strip() break if matched_heading: if current_section_text: sections.append((current_section_name, "\n".join(current_section_text))) current_section_name = matched_heading current_section_text = [] else: current_section_text.append(line) if current_section_text: sections.append((current_section_name, "\n".join(current_section_text))) return sections def get_document_embedding(processed_sections): section_texts = [sec_text for _, sec_text in processed_sections if sec_text.strip()] if not section_texts: return np.zeros(768) embeddings = embed_model.encode(section_texts, convert_to_numpy=True) return np.mean(embeddings, axis=0) def count_red_flags(processed_sections): flags = 0 text_joined = " ".join(sec_text for _, sec_text in processed_sections) # Example rule: 100% accuracy if '100% accuracy' in text_joined or '100 percent accuracy' in text_joined: flags += 1 # Example rule: short methodology section for sec_name, sec_text in processed_sections: if 'method' in sec_name.lower(): if len(sec_text.split()) < 100: flags += 1 # Example rule: no references has_references = any('references' in sec_name.lower() for sec_name, _ in processed_sections) if not has_references: flags += 1 return flags # ===== MAIN PIPELINE ===== pdf_dir = "./papers" documents = [] for filename in os.listdir(pdf_dir): if filename.endswith(".pdf"): path = os.path.join(pdf_dir, filename) raw_text = extract_text_from_pdf(path) sections = split_into_sections(raw_text) # Preprocess each section processed_sections = [] for sec_name, sec_text in sections: processed_text = custom_preprocess(sec_text) processed_sections.append((sec_name, processed_text)) documents.append({ 'filename': filename, 'sections': processed_sections }) # Compute embeddings for each doc doc_embeddings = [] for doc in documents: emb = get_document_embedding(doc['sections']) doc_embeddings.append(emb) doc_embeddings = np.array(doc_embeddings) # KMeans clustering num_clusters = 5 kmeans = 
KMeans(n_clusters=num_clusters, random_state=42) labels = kmeans.fit_predict(doc_embeddings) for i, doc in enumerate(documents): doc['cluster_label'] = int(labels[i]) doc['red_flag_score'] = count_red_flags(doc['sections']) cluster_counts = Counter(labels) print("Cluster distribution:", cluster_counts) # Suppose we know the filenames of the 5 known non-publishable known_nonpub_filenames = ["nonpub1.pdf", "nonpub2.pdf", "nonpub3.pdf", "nonpub4.pdf", "nonpub5.pdf"] known_nonpub_embeddings = [] for doc in documents: if doc['filename'] in known_nonpub_filenames: known_nonpub_embeddings.append(get_document_embedding(doc['sections'])) nonpub_centroid = np.mean(known_nonpub_embeddings, axis=0) def cosine_similarity(a, b): return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) for doc in documents: emb = get_document_embedding(doc['sections']) sim = cosine_similarity(emb, nonpub_centroid) doc['similarity_to_nonpub'] = float(sim) # Example threshold for suspicious similarity_threshold = 0.7 red_flag_threshold = 1 for doc in documents: doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold) \ or (doc['red_flag_score'] >= red_flag_threshold) # Inspect final results for doc in documents: print(doc['filename'], doc['cluster_label'], doc['red_flag_score'], doc['similarity_to_nonpub'], doc['maybe_nonpub'])
Review the documents flagged as suspicious (maybe_nonpub).
I hope this gives you a more comprehensive roadmap—and enough code to try out each approach in Kaggle. Feel free to tweak thresholds, the choice of embedding model, clustering method, and rule-based patterns to adapt to your specific corpus. Above all, remember that your domain knowledge about “what makes a paper unpublishable” can be translated into powerful heuristics that complement the unsupervised features.
Framing the approach
I’m setting up for a KMeans visualization by reducing dimensionality with PCA, then creating a scatter plot to color or label clusters, highlighting specific examples.
Visualizing similarities
I’m preparing to create a histogram or density curve to show cosine similarity distributions, using colors or shapes to highlight 15 specific documents and their unique characteristics.
Thought for 4 seconds
Below is a structured walk-through of how to visualize the clustering results, the cosine similarity distribution, and the red flag scores for your documents. I’ve also included code snippets that highlight your 15 labeled PDFs (R001.pdf to R015.pdf), with R001–R005 labeled as class 0 and R006–R015 labeled as class 1. Where possible, I’ve included commentary for improvements or alternative approaches.
Since doc_embeddings is likely high-dimensional (e.g., 768 dimensions from a transformer), we’ll reduce the dimensionality to 2D for plotting. You can use PCA, t-SNE, or UMAP. Below, I’ll show PCA for simplicity, but feel free to replace it with t-SNE or UMAP.
import matplotlib.pyplot as plt import numpy as np from sklearn.decomposition import PCA from sklearn.manifold import TSNE # We have: # doc_embeddings (shape: [num_documents, embedding_dim]) # documents (a list of dicts with 'filename' and 'cluster_label') # Let's define our labeled sets labeled_files_class0 = set([f"R00{i}.pdf" for i in range(1,6)]) # R001 to R005 labeled_files_class1 = set([f"R00{i}.pdf" for i in range(6,16)]) # R006 to R015 # (1) PCA for 2D visualization pca = PCA(n_components=2, random_state=42) reduced_pca = pca.fit_transform(doc_embeddings) # (2) Alternatively, you can try t-SNE (comment out PCA if you want this) # tsne = TSNE(n_components=2, random_state=42) # reduced_tsne = tsne.fit_transform(doc_embeddings) # Then use reduced_tsne instead of reduced_pca below. # We'll use reduced_pca here for demonstration. x_coords = reduced_pca[:, 0] y_coords = reduced_pca[:, 1] # Create a figure for the cluster visualization plt.figure(figsize=(10, 7)) # Plot each document as a point, colored by cluster_label clusters = [doc['cluster_label'] for doc in documents] scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6) plt.title("K-Means Clusters (PCA 2D Projection)") plt.xlabel("Principal Component 1") plt.ylabel("Principal Component 2") # Now let's highlight or label the 15 known PDFs for i, doc in enumerate(documents): filename = doc['filename'] # If this doc is one of the 15 labeled if filename in labeled_files_class0 or filename in labeled_files_class1: # We'll put a black edge or a special marker plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5) # Optionally, annotate text # e.g., display 'C0' or 'C1' for class 0 or class 1 if filename in labeled_files_class0: label_text = "C0" else: label_text = "C1" plt.text(x_coords[i]+0.02, y_coords[i]+0.02, label_text, fontsize=9) # Add a legend for the cluster colors # We'll add it by mapping each cluster label to a color patch unique_clusters = list(sorted(set(clusters))) handles = scatter.legend_elements()[0] labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters] plt.legend(handles, labels_for_legend, title="Clusters", loc="best") plt.show()
Each document is colored by its cluster_label. The 15 labeled PDFs (R001-R005 are class 0, R006-R015 are class 1) are drawn with a black edge around the marker, and we label them in text as “C0” or “C1.” If you want a more advanced or non-linear reduction (like t-SNE or UMAP), just uncomment the relevant lines, and refer to reduced_tsne[:, 0] and reduced_tsne[:, 1] for x and y coordinates.
You have code that computes doc['similarity_to_nonpub'] for each document. Let’s plot a histogram (or KDE plot) of these similarity scores and highlight the 15 labeled PDFs.
import seaborn as sns # Extract similarity scores for all docs similarities = [doc['similarity_to_nonpub'] for doc in documents] # Make a list of labeled similarities, plus the label for each similarities_labeled_c0 = [] similarities_labeled_c1 = [] similarities_unlabeled = [] for i, doc in enumerate(documents): sim_val = doc['similarity_to_nonpub'] filename = doc['filename'] if filename in labeled_files_class0: similarities_labeled_c0.append(sim_val) elif filename in labeled_files_class1: similarities_labeled_c1.append(sim_val) else: similarities_unlabeled.append(sim_val) plt.figure(figsize=(10, 6)) # Plot distribution with seaborn's histplot or kdeplot sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents') # Plot labeled class 0 sns.histplot(similarities_labeled_c0, color='red', alpha=0.8, kde=True, label='Class 0 (R001-R005)') # Plot labeled class 1 sns.histplot(similarities_labeled_c1, color='blue', alpha=0.6, kde=True, label='Class 1 (R006-R015)') plt.title("Cosine Similarity to Non-Publishable Centroid") plt.xlabel("Similarity Score") plt.ylabel("Count") plt.legend(loc='best') plt.show()
Ideally, the known non-publishable papers should show higher similarity to the nonpub_centroid than the “Publishable” ones. For a smoother view of the distributions, you can use sns.kdeplot(...) instead of sns.histplot(kde=True, ...).
You mentioned that having or not having a references section wasn’t always accurate. Some alternative heuristics: flag very short papers, papers that never mention a baseline or comparison, papers with no mention of data or datasets, and papers with no mention of common evaluation metrics.
Below is an updated example of a “red flag” function using some of these additional heuristics:
def count_red_flags(processed_sections):
    flags = 0
    full_text = " ".join(sec_text for _, sec_text in processed_sections)
    word_count = len(full_text.split())
    # 1) Very short paper
    if word_count < 1000:
        flags += 1
    # 2) Check if baseline or comparison keywords are absent
    if ("baseline" not in full_text) and ("compare" not in full_text) and ("benchmark" not in full_text):
        flags += 1
    # 3) Check if there's any mention of data or dataset
    if ("data" not in full_text) and ("dataset" not in full_text) and ("corpus" not in full_text):
        flags += 1
    # 4) No mention of common evaluation metrics
    common_metrics = ["accuracy", "precision", "recall", "f1", "auc", "mse", "rmse"]
    if not any(metric in full_text for metric in common_metrics):
        flags += 1
    return flags
Similar to the cosine similarity distribution, you might want to plot a histogram of red_flag_score across all documents. We’ll again highlight the labeled ones.
import seaborn as sns red_flag_scores = [doc['red_flag_score'] for doc in documents] scores_labeled_c0 = [] scores_labeled_c1 = [] scores_unlabeled = [] for doc in documents: score_val = doc['red_flag_score'] filename = doc['filename'] if filename in labeled_files_class0: scores_labeled_c0.append(score_val) elif filename in labeled_files_class1: scores_labeled_c1.append(score_val) else: scores_unlabeled.append(score_val) plt.figure(figsize=(10, 6)) sns.histplot(red_flag_scores, color='gray', alpha=0.4, binwidth=1, label='All Documents') sns.histplot(scores_labeled_c0, color='red', alpha=0.8, binwidth=1, label='Class 0 (R001-R005)') sns.histplot(scores_labeled_c1, color='blue', alpha=0.6, binwidth=1, label='Class 1 (R006-R015)') plt.title("Red Flag Score Distribution") plt.xlabel("Red Flag Score (count of triggered heuristics)") plt.ylabel("Number of Papers") plt.legend(loc='best') plt.show()
This will give you a sense of how many rules each document is triggering, and whether your known class 0 or class 1 docs align with high or low red-flag counts.
To recap: reduce doc_embeddings to 2D, color by cluster, and highlight the labeled docs. By combining these visualizations, you’ll get a much better understanding of how your dataset is structured, how your known “non-publishable” vs. “publishable” examples fit in, and which heuristics or similarity thresholds might be most effective.
The issue you're describing likely stems from how the labeled files (R001 to R015) are matched against the filenames in your documents list. If there's any mismatch in how the filenames are stored (e.g., missing .pdf extensions, inconsistent capitalization, or whitespace issues), some of the labeled files might not be correctly identified, resulting in only a subset being visualized.
Here’s a step-by-step rectification and improved code to ensure all 15 labeled files are visualized properly:
Ensure that the filenames for your labeled files are correctly matched against the filenames in your documents list. Normalize both sets of filenames to avoid mismatches.
# Define labeled files with consistent, zero-padded formatting
labeled_files_class0 = {f"R{i:03d}.pdf" for i in range(1, 6)}   # R001-R005
labeled_files_class1 = {f"R{i:03d}.pdf" for i in range(6, 16)}  # R006-R015
# (Note: f"R00{i}.pdf" would produce "R0010.pdf" for i >= 10, which never matches R010.pdf-R015.pdf.)

# Normalize filenames in the documents list for comparison
for doc in documents:
    doc['normalized_filename'] = doc['filename'].strip().lower()  # Normalize filenames

# Ensure the labeled sets are also normalized
labeled_files_class0 = {f.strip().lower() for f in labeled_files_class0}
labeled_files_class1 = {f.strip().lower() for f in labeled_files_class1}
Update the clustering visualization code to ensure all labeled files are highlighted properly:
# PCA for dimensionality reduction pca = PCA(n_components=2, random_state=42) reduced_pca = pca.fit_transform(doc_embeddings) # Extract coordinates x_coords = reduced_pca[:, 0] y_coords = reduced_pca[:, 1] # Create figure plt.figure(figsize=(10, 7)) clusters = [doc['cluster_label'] for doc in documents] scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6) plt.title("K-Means Clusters (PCA 2D Projection)") plt.xlabel("Principal Component 1") plt.ylabel("Principal Component 2") # Highlight labeled files for i, doc in enumerate(documents): filename = doc['normalized_filename'] if filename in labeled_files_class0 or filename in labeled_files_class1: plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5) # Label the points if filename in labeled_files_class0: label_text = "C0" elif filename in labeled_files_class1: label_text = "C1" plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9) # Add legend for clusters unique_clusters = list(sorted(set(clusters))) handles = scatter.legend_elements()[0] labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters] plt.legend(handles, labels_for_legend, title="Clusters", loc="best") plt.show()
Update the cosine similarity plot to ensure all labeled files are included:
# Extract cosine similarity values similarities = [doc['similarity_to_nonpub'] for doc in documents] # Categorize similarities similarities_labeled_c0 = [] similarities_labeled_c1 = [] for doc in documents: filename = doc['normalized_filename'] sim_val = doc['similarity_to_nonpub'] if filename in labeled_files_class0: similarities_labeled_c0.append(sim_val) elif filename in labeled_files_class1: similarities_labeled_c1.append(sim_val) plt.figure(figsize=(10, 6)) # Plot overall distribution sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents') # Highlight class 0 sns.histplot(similarities_labeled_c0, kde=True, color='red', alpha=0.8, label='Class 0 (R001-R005)') # Highlight class 1 sns.histplot(similarities_labeled_c1, kde=True, color='blue', alpha=0.6, label='Class 1 (R006-R015)') plt.title("Cosine Similarity to Non-Publishable Centroid") plt.xlabel("Similarity Score") plt.ylabel("Count") plt.legend(loc='best') plt.show()
If any of the 15 labeled files still fail to appear in the plots, print the filenames for debugging:
# Check which labeled files are not being matched
missing_files_c0 = labeled_files_class0 - {doc['normalized_filename'] for doc in documents}
missing_files_c1 = labeled_files_class1 - {doc['normalized_filename'] for doc in documents}
print("Missing Class 0 Files:", missing_files_c0)
print("Missing Class 1 Files:", missing_files_c1)
Ensure the filenames in the labeled lists (labeled_files_class0, labeled_files_class1) exactly match the filenames in documents.
Let me know if the updated code resolves the issue or if further adjustments are needed!
Here's how you can adjust the previous visualization code to highlight the documents where maybe_nonpub == True. This involves updating both the clustering visualization and the cosine similarity distribution visualization.
Clustering Visualization with maybe_nonpub Highlighting
We’ll add a special highlight for documents marked as maybe_nonpub == True in the clustering plot. These will be shown with a distinct edge color (e.g., orange).
# PCA for dimensionality reduction pca = PCA(n_components=2, random_state=42) reduced_pca = pca.fit_transform(doc_embeddings) # Extract coordinates x_coords = reduced_pca[:, 0] y_coords = reduced_pca[:, 1] # Create figure plt.figure(figsize=(10, 7)) clusters = [doc['cluster_label'] for doc in documents] scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6) plt.title("K-Means Clusters (PCA 2D Projection)") plt.xlabel("Principal Component 1") plt.ylabel("Principal Component 2") # Highlight labeled files for i, doc in enumerate(documents): filename = doc['normalized_filename'] # Highlight labeled documents (Class 0 and Class 1) if filename in labeled_files_class0 or filename in labeled_files_class1: plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5) if filename in labeled_files_class0: label_text = "C0" elif filename in labeled_files_class1: label_text = "C1" plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9) # Highlight maybe_nonpub documents with a distinct edge color if doc.get('maybe_nonpub', False): plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange', marker='s', s=120, linewidths=1.5) # Add legend for clusters unique_clusters = list(sorted(set(clusters))) handles = scatter.legend_elements()[0] labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters] plt.legend(handles, labels_for_legend, title="Clusters", loc="best") plt.show()
This plot highlights the labeled documents (R001 to R015) with class annotations (C0 and C1), and marks documents with maybe_nonpub == True with orange square edges.
Cosine Similarity Distribution with maybe_nonpub Highlighting
We’ll adjust the cosine similarity plot to highlight the distribution of documents where maybe_nonpub == True in addition to the labeled documents.
# Extract cosine similarity values similarities = [doc['similarity_to_nonpub'] for doc in documents] # Categorize similarities similarities_labeled_c0 = [] similarities_labeled_c1 = [] similarities_maybe_nonpub = [] for doc in documents: filename = doc['normalized_filename'] sim_val = doc['similarity_to_nonpub'] if filename in labeled_files_class0: similarities_labeled_c0.append(sim_val) elif filename in labeled_files_class1: similarities_labeled_c1.append(sim_val) if doc.get('maybe_nonpub', False): similarities_maybe_nonpub.append(sim_val) plt.figure(figsize=(10, 6)) # Plot overall distribution sns.histplot(similarities, kde=True, color='gray', alpha=0.4, label='All Documents') # Highlight class 0 sns.histplot(similarities_labeled_c0, kde=True, color='red', alpha=0.8, label='Class 0 (R001-R005)') # Highlight class 1 sns.histplot(similarities_labeled_c1, kde=True, color='blue', alpha=0.6, label='Class 1 (R006-R015)') # Highlight maybe_nonpub sns.histplot(similarities_maybe_nonpub, kde=True, color='orange', alpha=0.5, label='Maybe Non-Publishable') plt.title("Cosine Similarity to Non-Publishable Centroid") plt.xlabel("Similarity Score") plt.ylabel("Count") plt.legend(loc='best') plt.show()
The red and blue histograms show Class 0 (R001-R005) and Class 1 (R006-R015), respectively; the orange histogram shows documents with maybe_nonpub == True.
To ensure the plots work without errors, verify that every entry in documents has a normalized filename (normalized_filename) and the maybe_nonpub attribute (defaulting to False if not already present):

for doc in documents:
    if 'normalized_filename' not in doc or 'maybe_nonpub' not in doc:
        print("Missing attributes in:", doc)
Let me know if the updated code highlights the maybe_nonpub documents correctly and if further refinements are needed!
Thought for 7 seconds
Below is a detailed answer addressing your specific questions:
Finally, I provide two complete code examples (so you can simply paste them into your workflow):
Practical Tip: Start with a higher threshold (e.g., > 0.60) to keep false positives low. Then you can slowly lower the threshold if you want to catch borderline cases, but watch for error drift.
Below is a consolidated code block illustrating:
Iterative centroid expansion, K-Means clustering, and a small MLP classifier that uses the document embeddings plus similarity_to_nonpub as input features.
Where to paste: Ideally, after you’ve computed doc_embeddings, similarity_to_nonpub, and maybe_nonpub for each document. This can replace or supplement the final classification steps.
############################################################################### # SET 1: Iterative Centroid Expansion + K-Means + MLP ############################################################################### import numpy as np from copy import deepcopy from sklearn.neural_network import MLPClassifier from sklearn.model_selection import cross_val_score import warnings warnings.filterwarnings('ignore') ############################################################################### # 1. Iterative Centroid Expansion ############################################################################### def iterative_nonpub_expansion(documents, known_nonpub_filenames, similarity_threshold=0.60, max_iterations=5): """ Iteratively expands the known non-publishable list by adding any documents with similarity > similarity_threshold. Then recalculates centroid. Arguments: - documents: list of dicts containing doc['filename'], doc['similarity_to_nonpub'], etc. - known_nonpub_filenames: initial set/list of known non-publishable filenames (strings). - similarity_threshold: similarity above which a doc is considered "non-publishable." - max_iterations: maximum number of expansions to run. Returns: - updated_nonpub_filenames: final set of non-publishable filenames after expansion """ updated_nonpub_filenames = set(known_nonpub_filenames) iteration = 0 while iteration < max_iterations: iteration += 1 # Calculate new centroid from the current set nonpub_embeddings = [] for doc in documents: if doc['filename'] in updated_nonpub_filenames: emb = doc['doc_embedding'] # we store doc_embedding for convenience nonpub_embeddings.append(emb) new_centroid = np.mean(nonpub_embeddings, axis=0) # Add docs that exceed the threshold newly_added = False for doc in documents: if doc['filename'] not in updated_nonpub_filenames: # compute similarity to new centroid emb = doc['doc_embedding'] sim = cosine_similarity(emb, new_centroid) if sim > similarity_threshold: updated_nonpub_filenames.add(doc['filename']) newly_added = True if not newly_added: print(f"No new docs added in iteration {iteration}. Stopping.") break else: print(f"Iteration {iteration} added new docs. Current size of non-pub set: {len(updated_nonpub_filenames)}.") return updated_nonpub_filenames # We'll assume you already have doc['doc_embedding'] = get_document_embedding(...) # To store doc_embedding in your existing pipeline, do something like: # for i, doc in enumerate(documents): # doc['doc_embedding'] = doc_embeddings[i] # 1.1: Store embeddings in each doc for convenience for i, doc in enumerate(documents): doc['doc_embedding'] = doc_embeddings[i] # 1.2: Perform initial similarity calculation (already done in your code, typically) # for doc in documents: # sim = cosine_similarity(doc['doc_embedding'], nonpub_centroid) # doc['similarity_to_nonpub'] = float(sim) # 1.3: Iteratively expand known_nonpub_initial = ["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"] expanded_nonpub_filenames = iterative_nonpub_expansion(documents, known_nonpub_initial, similarity_threshold=0.60, max_iterations=5) print("Final set of non-publishable (expanded) = ", expanded_nonpub_filenames) ############################################################################### # 2. K-Means (Already in your code) ############################################################################### # We'll keep your existing KMeans step. No changes needed unless you want to integrate # the newly expanded non-publishable set in some way (like removing them from K-Means, etc.). 
# # For reference: # labels = kmeans.fit_predict(doc_embeddings) # for i, doc in enumerate(documents): # doc['cluster_label'] = int(labels[i]) ############################################################################### # 3. Small MLP for classification ############################################################################### # We'll build a tiny dataset from the 15 labeled documents you have # (including your newly expanded nonpub set if you like). # Then we do cross-validation to evaluate how well it classifies # publishable (class 1) vs non-publishable (class 0). from sklearn.model_selection import StratifiedKFold # Let's define your 15 labeled docs (or any subset you trust): nonpub_set_c0 = set(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"]) pub_set_c1 = set(["R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf", "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"]) X = [] y = [] for doc in documents: fname = doc['filename'] if fname in nonpub_set_c0: X.append(doc['doc_embedding']) y.append(0) elif fname in pub_set_c1: X.append(doc['doc_embedding']) y.append(1) X = np.array(X) y = np.array(y) # If we want to include similarity as an extra feature: # We'll combine doc_embedding with similarity_to_nonpub # Just do a quick horizontal stack: X_aug = [] for doc, label in zip(documents, range(len(documents))): # but we only want those that are in our labeled set fname = doc['filename'] if fname in nonpub_set_c0 or fname in pub_set_c1: embed_768 = doc['doc_embedding'] sim_val = doc['similarity_to_nonpub'] # a single float # create a feature vector of size 769 new_vec = np.concatenate([embed_768, [sim_val]]) X_aug.append(new_vec) # We must track y in the same order y_aug = [] for doc in documents: fname = doc['filename'] if fname in nonpub_set_c0: y_aug.append(0) elif fname in pub_set_c1: y_aug.append(1) X_aug = np.array(X_aug) y_aug = np.array(y_aug) # We'll train an MLP on X_aug, y_aug mlp = MLPClassifier(hidden_layer_sizes=(64,), random_state=42, max_iter=1000) skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) scores = cross_val_score(mlp, X_aug, y_aug, cv=skf, scoring='accuracy') print(f"5-Fold CV Accuracy for MLP: {scores.mean():.2f} +/- {scores.std():.2f}") mlp.fit(X_aug, y_aug) # Now we can apply this MLP to all 150 documents # They each get a predicted label: 0 = non-publishable, 1 = publishable for doc in documents: embed_768 = doc['doc_embedding'] sim_val = doc['similarity_to_nonpub'] test_vec = np.concatenate([embed_768, [sim_val]]) pred_label = mlp.predict(test_vec.reshape(1, -1))[0] doc['mlp_pred_label'] = int(pred_label) # doc['mlp_pred_label'] == 0 => non-publishable, 1 => publishable
The snippet assumes you have already computed doc_embeddings and the initial nonpub_centroid, and the MLP features are built from doc_embeddings and doc['similarity_to_nonpub'].
This second code snippet replaces K-Means with HDBSCAN or Hierarchical Clustering. HDBSCAN requires installing the library:
!pip install hdbscan
We’ll also show how to do the MLP in the same manner. (You can remove the MLP part if you only want to test clustering.)
Where to paste: Replace your K-Means block with the HDBSCAN block. Keep the rest of your pipeline the same.
############################################################################### # SET 2: Using HDBSCAN (Outlier Detection) + Optional MLP ############################################################################### !pip install hdbscan import hdbscan # 1) HDBSCAN to cluster doc_embeddings hdb = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3, metric='euclidean') hdb_labels = hdb.fit_predict(doc_embeddings) # We can store these cluster labels in documents for i, doc in enumerate(documents): doc['hdb_cluster_label'] = int(hdb_labels[i]) # -1 means outlier/no cluster # Let's see how many are labeled as outliers outlier_count = sum(1 for lbl in hdb_labels if lbl == -1) print(f"Number of outliers (no cluster assigned): {outlier_count}") # 2) Inspect cluster distribution label_counts = Counter(hdb_labels) print("HDBSCAN Cluster Distribution:", label_counts) # 3) If you want a quick visualization: import matplotlib.pyplot as plt import numpy as np from sklearn.decomposition import PCA pca = PCA(n_components=2, random_state=42) reduced_pca = pca.fit_transform(doc_embeddings) x_coords = reduced_pca[:,0] y_coords = reduced_pca[:,1] plt.figure(figsize=(10,7)) scatter = plt.scatter(x_coords, y_coords, c=hdb_labels, cmap='tab10', alpha=0.6) plt.title("HDBSCAN Clusters (PCA 2D Projection)") plt.xlabel("PC1") plt.ylabel("PC2") # highlight your known labeled docs labeled_files_class0 = set(["R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"]) labeled_files_class1 = set(["R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf", "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"]) for i, doc in enumerate(documents): if doc['filename'] in labeled_files_class0 or doc['filename'] in labeled_files_class1: plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5) if doc['filename'] in labeled_files_class0: label_text = "C0" else: label_text = "C1" plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9) plt.show() # 4) Optional: Train an MLP or logistic regression using same approach as above # Build labeled X, y from the 15 known docs from sklearn.neural_network import MLPClassifier from sklearn.model_selection import StratifiedKFold X_aug = [] y_aug = [] for doc in documents: fname = doc['filename'] if fname in labeled_files_class0: # Create feature vector from doc['doc_embedding'] + doc['similarity_to_nonpub'] embed_768 = doc_embeddings[[i for i, d in enumerate(documents) if d['filename'] == fname][0]] sim_val = doc['similarity_to_nonpub'] new_vec = np.concatenate([embed_768, [sim_val]]) X_aug.append(new_vec) y_aug.append(0) elif fname in labeled_files_class1: embed_768 = doc_embeddings[[i for i, d in enumerate(documents) if d['filename'] == fname][0]] sim_val = doc['similarity_to_nonpub'] new_vec = np.concatenate([embed_768, [sim_val]]) X_aug.append(new_vec) y_aug.append(1) X_aug = np.array(X_aug) y_aug = np.array(y_aug) mlp = MLPClassifier(hidden_layer_sizes=(64,), random_state=42, max_iter=1000) skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) scores = cross_val_score(mlp, X_aug, y_aug, cv=skf, scoring='accuracy') print(f"[HDBSCAN] 5-Fold CV Accuracy for MLP: {scores.mean():.2f} +/- {scores.std():.2f}") mlp.fit(X_aug, y_aug) for doc in documents: # For each doc, build the same feature vector and predict embed_768 = doc['doc_embedding'] sim_val = doc['similarity_to_nonpub'] test_vec = np.concatenate([embed_768, [sim_val]]) pred_label = mlp.predict(test_vec.reshape(1, -1))[0] doc['mlp_pred_label'] = 
int(pred_label) print("Done: HDBSCAN + MLP classification.")
Note:
In HDBSCAN, a cluster label of -1 indicates an outlier (i.e., no cluster). You can treat outliers as “suspect” if you suspect unusual papers might be “non-publishable,” or you can do a separate check on them.
Ultimately, it’s best to compare the approaches on your small labeled set (the 15 papers). Whichever approach consistently classifies them correctly (via cross-validation) is typically the safest bet to scale to the 150-paper dataset.
Feel free to integrate the above code snippets wherever they fit best in your existing pipeline. Good luck!
Thought for 6 seconds
Below is a deeper discussion of the issues you’re encountering, how to interpret your similarity-based approach, and how you might reduce false negatives (while keeping false positives in check). I’ll also address how you can evaluate different transformers to pick one that best separates publishable and non-publishable papers.
When you calculate the cosine similarity between your paper embeddings and the “non-publishable centroid,” you typically end up with a distribution of similarity scores (e.g., from 0.0 to 1.0). You may also separate out the scores for:
You mentioned that “75% of data is under the threshold” and are thus classified as publishable. This is typical if your threshold is somewhat high. The question is: Are you comfortable with that many being labeled publishable if it means you avoid false positives?
To gauge that, you can do a small precision-recall style analysis on your 15 labeled docs:
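For example, here is a small threshold sweep over the 15 labeled papers (a sketch that assumes each entry in documents carries 'similarity_to_nonpub' and that labeled_files_class0 / labeled_files_class1 are defined as in the earlier snippets; it treats "non-publishable" as the positive class):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true, scores = [], []
for doc in documents:
    fname = doc['filename']
    if fname in labeled_files_class0:        # known non-publishable
        y_true.append(1)
        scores.append(doc['similarity_to_nonpub'])
    elif fname in labeled_files_class1:      # known publishable
        y_true.append(0)
        scores.append(doc['similarity_to_nonpub'])

y_true, scores = np.array(y_true), np.array(scores)

for threshold in np.arange(0.40, 0.81, 0.05):
    y_pred = (scores > threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    f = f1_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")

With only 15 points the numbers will be noisy, but the sweep makes the precision/recall trade-off at each candidate threshold explicit.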
If your main problem is that the model is “missing” some truly non-publishable papers that have similarity below your chosen threshold, here are ways to tackle that:
Dual-Centroid Approach: Instead of a single “non-publishable” centroid, also compute a “publishable” centroid using your known Class 1 docs. Then, for each unlabeled paper, compare its distance (similarity) to both centroids. Whichever it’s closest to (or whichever similarity is higher) is the predicted class. This can sometimes reduce false negatives if those borderline papers are “less similar” to the publishable centroid.
Manual Spot-Check Borderlines: If you have a band around your threshold (say 0.50–0.60) where many papers lie, you could manually examine some of these borderline papers to either confirm they’re publishable or non-publishable. This manual check (even if for a handful of papers) can refine your threshold or expand your labeled set.
Active Learning: Similar to the manual spot-check, you pick the unlabeled docs that are near the threshold. Label them carefully. Recompute the centroid with these newly labeled docs. This way, you’re less likely to incorporate incorrectly labeled examples (which can happen with naive iterative expansion). A small selection sketch follows after this list.
Add a Secondary Feature: If the similarity approach alone leads to false negatives, consider adding a single, carefully chosen rule-based feature. For instance, if a borderline doc is suspiciously short or lacks any mention of “data/method,” maybe that alone can nudge its final classification to non-publishable. You don’t have to rely on an entire rule-based system—just one or two very telling features might help.
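To make the manual spot-check / active-learning idea concrete, here is a small selection sketch (the band limits 0.50 and 0.60 are illustrative, not tuned; it assumes doc['similarity_to_nonpub'] has already been computed):

# Pull out borderline papers near the working threshold for manual review
lower, upper = 0.50, 0.60  # illustrative band; adjust to your score distribution

borderline_docs = [doc for doc in documents
                   if lower <= doc['similarity_to_nonpub'] <= upper]
borderline_docs.sort(key=lambda d: d['similarity_to_nonpub'], reverse=True)

for doc in borderline_docs:
    print(f"{doc['filename']}: similarity={doc['similarity_to_nonpub']:.3f}")

# After manually labeling a few of these, add them to the appropriate labeled set
# and recompute the non-publishable centroid.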
Different transformers (e.g., all-mpnet-base-v2, scibert_scivocab_uncased, all-distilroberta-v1) can produce different embedding spaces. Some points to remember:
all-distilroberta-v1 might produce embeddings whose cosine similarity has a narrower band overall compared to mpnet or scibert.
scibert_scivocab_uncased might do better if the text is very academic. Alternatively, mpnet or a huggingface “sentence-transformers” model might be more robust across general text.
Key: The “best” model will produce the largest separation between known Class 0 and Class 1, or at least a separation that is easy to threshold.
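One hedged way to compare candidate models is to measure, for each model, how far apart the known Class 0 and Class 1 papers fall relative to a non-publishable centroid built from the Class 0 papers. A sketch (the model names are examples; texts_by_filename is an assumed dict mapping each labeled filename to its preprocessed text, and the centroid is built from the same Class 0 docs, so read the gap as a rough comparative signal only):

import numpy as np
from sentence_transformers import SentenceTransformer

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

candidate_models = ["all-mpnet-base-v2", "all-distilroberta-v1", "allenai/scibert_scivocab_uncased"]

for model_name in candidate_models:
    model = SentenceTransformer(model_name)
    emb_c0 = model.encode([texts_by_filename[f] for f in sorted(labeled_files_class0)], convert_to_numpy=True)
    emb_c1 = model.encode([texts_by_filename[f] for f in sorted(labeled_files_class1)], convert_to_numpy=True)
    centroid_c0 = emb_c0.mean(axis=0)
    sim_c0 = np.mean([cos_sim(e, centroid_c0) for e in emb_c0])  # within-class similarity
    sim_c1 = np.mean([cos_sim(e, centroid_c0) for e in emb_c1])  # cross-class similarity
    print(f"{model_name}: class0 sim={sim_c0:.3f}, class1 sim={sim_c1:.3f}, gap={sim_c0 - sim_c1:.3f}")

A larger gap suggests that model's embedding space separates the two classes more cleanly, which usually makes a threshold easier to pick.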
Stay with the Similarity Approach: if it’s giving you 0 false positives but you want to reduce false negatives, consider:
Validate Thresholds with Known Data
Light Touch Rules
Refine the Embedding Model
Minimal Classifier: a small classifier on [768-dim embeddings + 1 similarity feature] might be enough to glean some improvement without overfitting too easily.
Interpretation of Similarity Plot:
Improving False Negatives:
Selecting the Best Transformer:
Why Some Methods Fail:
Recommendation:
Your strong initial results show that cosine similarity (with a well-chosen threshold) can already separate non-publishable from publishable decently. Adding more data—or carefully adjusting your threshold—will likely do more to address the false negatives than fancy expansions or clustering. If you only have 15 labeled docs, applying active learning or a dual-centroid approach can be more robust than heavy iterative expansions that risk picking up accidental false positives.
Focus on incremental improvements around your best baseline. Sometimes the simplest method (similarity + threshold) is both the most interpretable and the most robust, especially with very limited labels.
Thought for 8 seconds
Below is a comprehensive answer that addresses all your requests:
Your labeled data: R001–R005 (Class 0) and R006–R015 (Class 1).
For each document, we assume you already have:
doc_embeddings[i]: a vector (e.g. 768-dimensional) for the i-th document
documents[i]['similarity_to_nonpub']: a float for the similarity to the “non-publishable centroid”
documents[i]['filename']: the PDF filename
Where to store the labeled classes:
labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
We’ll combine the raw embedding (768-dim) + the single similarity feature = 769-dim total.
import numpy as np from sklearn.linear_model import LogisticRegression from sklearn.model_selection import StratifiedKFold, cross_val_score # 1) Build X, y from your 15 labeled docs X_list = [] y_list = [] for i, doc in enumerate(documents): fname = doc['filename'] if fname in labeled_files_class0: # Non-publishable x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]]) X_list.append(x_vec) y_list.append(0) elif fname in labeled_files_class1: # Publishable x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]]) X_list.append(x_vec) y_list.append(1) X = np.array(X_list) y = np.array(y_list) # 2) Logistic Regression with strong regularization # Using 'saga' or 'liblinear' solver can handle small data well. # We also set a high C value for L2 or try a smaller C for stronger reg. logreg = LogisticRegression(solver='liblinear', C=1.0, penalty='l2', random_state=42) # 3) Cross-validation to check for overfitting skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) scores = cross_val_score(logreg, X, y, cv=skf, scoring='accuracy') print(f"[LogReg] 5-Fold CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}") # 4) Fit final model on entire labeled set logreg.fit(X, y) # 5) Predict for all 150 documents for i, doc in enumerate(documents): x_vec = np.concatenate([doc_embeddings[i], [doc['similarity_to_nonpub']]]) pred_label = logreg.predict(x_vec.reshape(1, -1))[0] doc['logreg_pred'] = int(pred_label) # 0=non-publishable, 1=publishable # If you want predicted probabilities: # doc['logreg_prob'] = logreg.predict_proba(x_vec.reshape(1, -1))[0][1]
You have three 2D projections (PCA, t-SNE, etc.). The logistic regression actually lives in a 769-dimensional space (embedding + similarity). Plotting that boundary in 2D is non-trivial. Typically, one of two strategies is used:
Approach (1): train a separate logistic regression directly on the 2D coordinates [pca_x, pca_y] for the labeled docs. Then we can easily plot the boundary in PCA space.
For simplicity, let’s do approach (1)—train a separate logistic regression purely for the 2D PCA coordinates of your labeled docs, and then we can plot the boundary. Note: This is only for illustration, not the actual classifier you use in practice.
############ Visualization of Decision Boundary in PCA ############ import matplotlib.pyplot as plt import numpy as np from sklearn.decomposition import PCA from sklearn.linear_model import LogisticRegression # 1) Prepare PCA (2D) for all docs pca = PCA(n_components=2, random_state=42) reduced_pca = pca.fit_transform(doc_embeddings) x_coords = reduced_pca[:, 0] y_coords = reduced_pca[:, 1] # 2) Build the training set in 2D PCA for the 15 labeled docs X_2d = [] y_2d = [] for i, doc in enumerate(documents): fname = doc['filename'] if fname in labeled_files_class0: # Non-publishable X_2d.append([x_coords[i], y_coords[i]]) y_2d.append(0) elif fname in labeled_files_class1: # Publishable X_2d.append([x_coords[i], y_coords[i]]) y_2d.append(1) X_2d = np.array(X_2d) y_2d = np.array(y_2d) # 3) Train logistic regression on the 2D PCA data logreg_2d = LogisticRegression(solver='liblinear', random_state=42) logreg_2d.fit(X_2d, y_2d) # 4) Plot PCA scatter with the decision boundary plt.figure(figsize=(10,7)) # Color by cluster_label (if using K-Means) clusters = [doc['cluster_label'] for doc in documents] scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6) plt.title("K-Means Clusters (PCA 2D Projection) + LogReg Boundary") plt.xlabel("Principal Component 1") plt.ylabel("Principal Component 2") # Mark labeled docs for i, doc in enumerate(documents): fname = doc['filename'] if fname in labeled_files_class0 or fname in labeled_files_class1: plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black', marker='o', s=100, linewidths=1.5) if fname in labeled_files_class0: label_text = "C0" else: label_text = "C1" plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9) if doc.get('maybe_nonpub', False): plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange', marker='s', s=120, linewidths=1.5) # Plot the boundary in 2D # We do this by creating a grid of (x,y), predict with logreg_2d, and draw contour. xx, yy = np.meshgrid( np.linspace(x_coords.min()-0.5, x_coords.max()+0.5, 200), np.linspace(y_coords.min()-0.5, y_coords.max()+0.5, 200) ) Z = logreg_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape) plt.contour(xx, yy, Z, levels=[0.5], cmap="gray", linewidths=2) # Add cluster legend unique_clusters = list(sorted(set(clusters))) handles = scatter.legend_elements()[0] labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters] plt.legend(handles, labels_for_legend, title="Clusters", loc="best") plt.show()
Note: The boundary you see here is only for the 2D PCA logistic regression, not the main (embedding+similarity) logistic regression. We do this because otherwise we can’t easily show a 769-dimensional boundary in 2D.
Yes, you can combine embeddings from multiple transformer models (e.g., SciBERT, MPNet, DistilRoBERTa). The simplest approach is concatenation:
# Suppose you computed embeddings from 2 different models:
#   doc_embeddings_1 (shape: [n_docs, d1])
#   doc_embeddings_2 (shape: [n_docs, d2])
# for each doc in the same order

combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
# shape: [n_docs, d1 + d2]
Then you can apply the same pipeline (compute similarity, do logistic regression, etc.) on the combined vector. Or you can do more sophisticated merging, like averaging the two embeddings if they’re from the same dimensional space. But typically, concatenation is fine if you can handle the increased dimensionality.
Pros of concatenation: you keep all the information from both models.
Cons of concatenation: the dimensionality grows, which makes clustering and similarity computations slower and can add redundancy.
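If the two models happen to produce vectors of the same dimensionality and you prefer averaging over concatenation, here is a minimal sketch of that alternative (it assumes doc_embeddings_1 and doc_embeddings_2 are both [n_docs, d] and row-aligned; L2-normalizing first keeps one model from dominating the average):

import numpy as np

def l2_normalize(mat):
    # Normalize each row to unit length (avoid division by zero)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return mat / np.clip(norms, 1e-12, None)

# Assumption: both arrays have identical shape [n_docs, d]
averaged_embeddings = (l2_normalize(doc_embeddings_1) +
                       l2_normalize(doc_embeddings_2)) / 2.0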
You can use sklearn.decomposition.LatentDirichletAllocation or gensim to extract topics. Assume you have a CSV, papers_fulltext.csv, with columns filename and full_text.
###############################################################################
# TOPIC MODELING EXAMPLE
###############################################################################
!pip install gensim

import re
import pandas as pd
import nltk
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (tokenizer + stopword list)
nltk.download('punkt')
nltk.download('stopwords')

df_topics = pd.read_csv("papers_fulltext.csv")  # columns: ['filename', 'full_text']

def clean_for_lda(text):
    # Basic cleaning for LDA
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    return tokens

# Preprocess each doc
docs_tokens = [clean_for_lda(t) for t in df_topics['full_text']]

# Gensim requires a dictionary and a bag-of-words corpus
dictionary = Dictionary(docs_tokens)
dictionary.filter_extremes(no_below=3, no_above=0.6)  # filter out rare & frequent tokens
bow_corpus = [dictionary.doc2bow(tokens) for tokens in docs_tokens]

# Train LDA with, say, 5 or 10 topics
num_topics = 5
lda_model = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary,
                     passes=5, random_state=42)

# Inspect top words in each topic
for idx in range(num_topics):
    print(f"Topic #{idx}: {lda_model.print_topic(idx, 10)}")

# Now you can infer the topic distribution of each doc
doc_topics = []
for i, bow in enumerate(bow_corpus):
    topics_prob = lda_model.get_document_topics(bow, minimum_probability=0.0)
    # 'topics_prob' is a list of (topic_id, prob)
    doc_topics.append(topics_prob)

df_topics['topic_distribution'] = doc_topics

# DETECT "BULLSHIT" PAPERS?
# Example: if doc's top topic = 'cooking' but the paper claims to be about 'CV' or 'ML'
# you can do some heuristic check. E.g., if the paper states "NLP" but the top topic is "cooking"
for i, row in df_topics.iterrows():
    # row['topic_distribution'] => list of (topic_id, prob)
    top_topic_id, top_topic_prob = max(row['topic_distribution'], key=lambda x: x[1])
    # If top_topic_id is not in your expected domain topics, mark it suspicious
    # Or do more advanced matching
You could also do a similarity approach on the doc’s topic distribution vs. a known distribution for “real ML topics.” Or if your LDA is picking up nonsense topics (like “oven, pan, salt, recipe”), you can flag those docs as suspicious.
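As a sketch of that idea, assuming the LDA variables from the snippet above plus a reference set of papers you trust to be on-topic (here I simply reuse the labeled publishable files as that hypothetical reference set), you can build a reference topic distribution and measure how far each doc drifts from it:

import numpy as np

def topic_vector(topics_prob, num_topics):
    # Convert gensim's list of (topic_id, prob) into a dense vector
    vec = np.zeros(num_topics)
    for topic_id, prob in topics_prob:
        vec[topic_id] = prob
    return vec

# Assumption: the labeled publishable papers are a reasonable "real ML topics" reference
reference_filenames = set(labeled_files_class1)

topic_matrix = np.array([topic_vector(tp, num_topics)
                         for tp in df_topics['topic_distribution']])
ref_mask = df_topics['filename'].isin(reference_filenames).values
reference_distribution = topic_matrix[ref_mask].mean(axis=0)

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Low similarity to the reference distribution = possible off-topic/nonsense paper
df_topics['topic_sim_to_reference'] = [cos_sim(v, reference_distribution)
                                       for v in topic_matrix]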
Instead of a single "non-publishable" centroid, create two centroids: one from the embeddings of the known non-publishable papers and one from the embeddings of the known publishable papers.
For each unlabeled doc, compute similarity to both centroids. Predict whichever similarity is higher (or whichever distance is smaller).
###############################################################################
# DUAL-CENTROID APPROACH
###############################################################################
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 1) Collect embeddings for known classes
nonpub_embeddings = []
pub_embeddings = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_embeddings.append(doc_embeddings[i])
    elif fname in labeled_files_class1:
        pub_embeddings.append(doc_embeddings[i])

# 2) Compute two centroids
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 3) For each doc, compute similarity to each centroid
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)
    # Option 1: Classify whichever is higher
    if sim_nonpub > sim_pub:
        doc['dual_centroid_pred'] = 0  # non-publishable
    else:
        doc['dual_centroid_pred'] = 1  # publishable

# 4) Visualization in 2D PCA
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(doc_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10, 7))

# color docs by 'dual_centroid_pred'
dual_labels = [doc['dual_centroid_pred'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=dual_labels, cmap='bwr', alpha=0.6)
plt.title("Dual Centroid Classification (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# Mark the known labeled docs
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0 or fname in labeled_files_class1:
        # black edge circle
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if fname in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

plt.show()
Notes:
- Logistic Regression: use regularization (the C parameter) and cross-validation to mitigate overfitting on just 15 labeled docs.
- Combining Multiple Embeddings: concatenation is the simplest way to merge models; just watch the increased dimensionality.
- Topic Modeling: LDA topics are an extra "red flag" signal for off-topic papers, not ground truth.
- Dual-Centroid: compare each doc's similarity to a non-publishable centroid and a publishable centroid, and pick the closer one.
- Interpreting the Plots: the 2D PCA boundary is only an illustration of the higher-dimensional classifier.
With these pieces in place, you can experiment with each of these methods in separate notebooks.
You’ll then see which method yields the fewest false positives while also reducing false negatives, hopefully improving on your baseline. Good luck!
Crafting interpretation guidelines
I'm piecing together specific steps for interpreting LDA topics, pinpointing irrelevant claims, merging embeddings from various models, and then providing the unified code to combine these embeddings and visualize the results.
Crafting a pipeline
I’m working on a method to interpret LDA topics, identify irrelevant ones using a rule-based strategy, merge embeddings from various models, and visualize the results with a final consistent code snippet.
Thought for 9 seconds
Below is a structured explanation and code covering how to interpret your LDA topics, how to flag irrelevant ones, and how to combine embeddings from multiple transformer models in one pipeline.
From your example:
Topic #0: 0.006*"image" + 0.005*"images" + 0.004*"features" + 0.004*"car" ...
Topic #1: 0.008*"entropy" + 0.005*"food" + 0.005*"patterns" + 0.004*"system" ...
Topic #2: 0.008*"led" + 0.008*"patterns" + 0.007*"human" + 0.005*"properties" ...
Topic #3: 0.005*"datasets" + 0.004*"tasks" + 0.004*"input" + 0.004*"features" ...
Topic #4: 0.008*"engine" + 0.007*"system" + 0.006*"llm" + 0.005*"patterns" ...
Each topic is basically a distribution over words (with probabilities). The top words give you a clue about what that topic is about. For instance, Topic #0 above looks like an image/vision topic, while the "food" term in Topic #1 already hints at something off-domain.
If you see a topic with words like “recipe,” “oven,” “salt,” or “cooking,” that’s likely nonsense in an ML context. In your example, some references to “food” or “caffeine” might be suspicious if your domain is purely machine learning.
After training LDA, you can inspect the top words for each topic, decide which topic IDs look irrelevant to your domain, and flag documents whose dominant topic is one of those.
Caution: LDA is just unsupervised clustering of words. Sometimes an “irrelevant” topic might appear if your corpus has unusual domain text, or it might partially overlap with real ML concepts in unexpected ways. So treat this as an additional signal (like a “red flag” score), not an absolute truth.
You can certainly use two (or more) transformer models to embed each paper. Suppose doc_embeddings_1 has shape [n_docs, d_1] and doc_embeddings_2 has shape [n_docs, d_2]. Concatenate them:
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
# Now shape: [n_docs, d_1 + d_2]
You can then do exactly what you've been doing (K-Means, similarity measure, etc.) on combined_embeddings. The biggest difference: wherever you used doc_embeddings, you swap in combined_embeddings.
Potential Pitfalls: the concatenated vectors are higher-dimensional; if that becomes unwieldy, you can apply a dimensionality reduction such as PCA to combined_embeddings to bring it down to 512 or 256 dimensions before continuing.
Below is a simplified pipeline illustrating how to concatenate the two sets of embeddings (doc_embeddings_1, doc_embeddings_2), run K-Means and the non-publishable-centroid similarity check on combined_embeddings, and visualize the result. Feel free to adjust the details to match your environment.
###############################################################################
# Step 0: Setup & Basic Definitions
###############################################################################
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from collections import Counter

# Labeled sets (example)
labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}  # non-pub
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}  # pub

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

###############################################################################
# Step 1: Suppose you have two sets of embeddings for each doc
#   doc_embeddings_1, doc_embeddings_2
#   shape = [n_docs, d1], [n_docs, d2]
###############################################################################
# For demonstration, let's pretend doc_embeddings_1 and doc_embeddings_2 exist.
# You should replace them with your actual arrays from your environment.
# e.g., doc_embeddings_1 = ...
# e.g., doc_embeddings_2 = ...

# We'll assume you also have the 'documents' list of dicts as usual.
# => documents[i]['filename'], etc.

###############################################################################
# Step 2: Concatenate them
###############################################################################
combined_embeddings = np.concatenate([doc_embeddings_1, doc_embeddings_2], axis=1)
print("Shape of combined embeddings:", combined_embeddings.shape)

###############################################################################
# Step 3: K-Means on combined embeddings
###############################################################################
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(combined_embeddings)

for i, doc in enumerate(documents):
    doc['cluster_label'] = int(labels[i])

cluster_counts = Counter(labels)
print("Cluster distribution:", cluster_counts)

###############################################################################
# Step 4: Compute Non-Publishable Centroid & Similarity (optional)
###############################################################################
# We'll collect embeddings for the known non-publishable docs
known_nonpub_inds = []
for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0:
        known_nonpub_inds.append(i)

nonpub_embeddings = combined_embeddings[known_nonpub_inds]
nonpub_centroid = np.mean(nonpub_embeddings, axis=0)

# For each doc, compute similarity
for i, doc in enumerate(documents):
    emb = combined_embeddings[i]
    sim_val = cosine_similarity(emb, nonpub_centroid)
    doc['similarity_to_nonpub'] = float(sim_val)

# Example threshold
similarity_threshold = 0.5
for doc in documents:
    doc['maybe_nonpub'] = (doc['similarity_to_nonpub'] > similarity_threshold)

###############################################################################
# Step 5: Visualization (PCA 2D of the combined embeddings)
###############################################################################
pca = PCA(n_components=2, random_state=42)
reduced_pca = pca.fit_transform(combined_embeddings)
x_coords = reduced_pca[:, 0]
y_coords = reduced_pca[:, 1]

plt.figure(figsize=(10, 7))
clusters = [doc['cluster_label'] for doc in documents]
scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
plt.title("K-Means Clusters on Combined Embeddings (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")

# Highlight labeled files
for i, doc in enumerate(documents):
    filename = doc['filename']
    if filename in labeled_files_class0 or filename in labeled_files_class1:
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                    marker='o', s=100, linewidths=1.5)
        if filename in labeled_files_class0:
            label_text = "C0"
        else:
            label_text = "C1"
        plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)
    # Highlight maybe_nonpub docs
    if doc.get('maybe_nonpub', False):
        plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='orange',
                    marker='s', s=120, linewidths=1.5)

unique_clusters = list(sorted(set(clusters)))
handles = scatter.legend_elements()[0]
labels_for_legend = [f"Cluster {uc}" for uc in unique_clusters]
plt.legend(handles, labels_for_legend, title="Clusters", loc="best")
plt.show()
To recap what this pipeline does:
- Concatenates doc_embeddings_1 and doc_embeddings_2 to get combined_embeddings.
- Runs K-Means on combined_embeddings and stores cluster labels in doc['cluster_label'].
- Computes similarity_to_nonpub for each doc against the non-publishable centroid.
- Sets doc['maybe_nonpub'] = True if similarity > 0.5 (or whichever threshold you prefer).

If you suspect a doc is about "food/cooking/recipes" while it claims to be "machine learning," you can do something like:
# For each doc in df_topics (the LDA pipeline):
#   row['topic_distribution'] => list of (topic_id, prob)
irrelevant_topic_ids = [1]   # Suppose we decided topic #1 is "food" or nonsense
threshold = 0.6              # if doc's top topic is topic #1 with prob >= 0.6 => suspicious

for i, row in df_topics.iterrows():
    topics_prob = row['topic_distribution']
    top_topic_id, top_topic_prob = max(topics_prob, key=lambda x: x[1])
    # Write back with .at[...]: assigning to 'row' would only modify a copy
    if (top_topic_id in irrelevant_topic_ids) and (top_topic_prob >= threshold):
        df_topics.at[i, 'suspect_topic'] = True
    else:
        df_topics.at[i, 'suspect_topic'] = False
You can incorporate this as an additional dimension in your main pipeline, e.g., a boolean feature has_suspect_topic that you feed into your logistic regression or use as a simple rule.
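Here is a minimal sketch of that idea, assuming the logistic regression features from earlier and the suspect_topic column produced above (the suspect_lookup dict is just an illustrative helper, not something defined elsewhere in this pipeline):

import numpy as np

# Hypothetical lookup: filename -> True/False from the LDA 'suspect_topic' step
suspect_lookup = dict(zip(df_topics['filename'], df_topics['suspect_topic']))

X_aug_list = []
for i, doc in enumerate(documents):
    has_suspect_topic = 1.0 if suspect_lookup.get(doc['filename'], False) else 0.0
    # embedding + similarity + suspect-topic flag
    x_vec = np.concatenate([doc_embeddings[i],
                            [doc['similarity_to_nonpub']],
                            [has_suspect_topic]])
    X_aug_list.append(x_vec)

X_aug = np.array(X_aug_list)
# Train the same LogisticRegression on X_aug for the 15 labeled docs as before.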
This should give you a complete sense of how to interpret LDA outputs for “nonsense” detection and how to combine multiple embeddings in your existing pipeline, with an end-to-end example showing K-Means clustering, similarity thresholding, and a PCA visualization. Good luck!
Building upon dual centroid methods
I’m looking to establish four distinct code blocks identifying invariant and iterative dual centroid classifier methods, using similarity and distance measures on 'nonpub_embeddings' and 'pub_embeddings'. Progress would involve iterative inclusion and centroid recomputations.
Defining the framework
I’m detailing four code blocks to produce 'original_similarity', 'original_distance', 'iterative_similarity', 'iterative_distance' columns. I’ll define an iterative approach with thresholds and maximum iterations, ensuring comprehensive labels for each document.
Deciding on column names
OK, let's finalize 'original_similarity', 'original_distance', 'iterative_similarity', and 'iterative_distance' to classify documents into 'publishable' and 'nonpublishable'. The strategy involves iterative inclusion and centroid recomputations.
Thought for 9 seconds
Below is a comprehensive guide that addresses your requests:
At the end, each code snippet stores its final 0/1 labels into specific columns in documents, namely:
doc['original_similarity']
doc['original_distance']
doc['iterative_similarity']
doc['iterative_distance']
This way, you end up with a single DataFrame that has the results of all four approaches.
Currently, your dual-centroid approach uses 5 known non-publishable and 10 known publishable examples to compute one non-publishable centroid and one publishable centroid. This works well if those examples are representative and you don’t suspect labeling errors.
The risk with an iterative (self-training) variant is that one wrongly added document shifts a centroid, which then pulls in more wrong documents. Hence, any iterative approach should have a high confidence threshold (high similarity or low distance) to minimize false positives creeping in.
If you want a distance-based approach, you can compute the Euclidean distance dist(x, c) = ||x - c|| from each document embedding x to each centroid c, and pick the centroid that yields the lowest distance.
Once you have your final centroids (either after one-shot or iterative updates):
- Similarity Approach: classify each document by whichever centroid gives the higher cosine similarity (or the lower distance).
- k-Nearest Neighbors: instead of collapsing each class into a centroid, classify each document by the majority label among its k closest labeled documents.
- When to choose KNN: it can help when a class is spread over several distinct sub-groups that a single centroid would blur together, but with only 15 labeled docs it is also easier to overfit; a sketch follows below.
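As a minimal sketch of the kNN alternative, using scikit-learn's KNeighborsClassifier on the same doc_embeddings and labeled sets (k=3 and the cosine metric are just illustrative choices to tune):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Build the small labeled training set
X_knn, y_knn = [], []
for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0:
        X_knn.append(doc_embeddings[i])
        y_knn.append(0)
    elif doc['filename'] in labeled_files_class1:
        X_knn.append(doc_embeddings[i])
        y_knn.append(1)

knn = KNeighborsClassifier(n_neighbors=3, metric='cosine')
knn.fit(np.array(X_knn), np.array(y_knn))

# Predict for all docs and store alongside the centroid-based labels
knn_preds = knn.predict(doc_embeddings)
for i, doc in enumerate(documents):
    doc['knn_pred'] = int(knn_preds[i])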
We'll assume you already have:
- documents, where each doc has at least doc['filename'].
- doc_embeddings, a numpy array of shape [n_docs, d].
- The labeled sets:

labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
The four snippets below each produce one label column:
- original_similarity
- original_distance
- iterative_similarity
- iterative_distance

We will define each snippet in turn. At the end, you can run them all in sequence (or in separate cells) and end up with all 4 label columns in documents.
import numpy as np
import math

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_dist(a, b):
    return np.sqrt(np.sum((a - b)**2))
One-shot approach: pick whichever centroid yields a higher similarity. Store the predicted label in doc['original_similarity'].
############################
# A) ORIGINAL SIMILARITY
############################

# 1) Gather embeddings for known labeled docs
nonpub_indices = []
pub_indices = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_indices.append(i)
    elif fname in labeled_files_class1:
        pub_indices.append(i)

nonpub_embeddings = doc_embeddings[nonpub_indices]
pub_embeddings = doc_embeddings[pub_indices]

# 2) Compute centroids
centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

# 3) Classify each doc by comparing cos_sim to each centroid
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)
    # 0 => non-publishable, 1 => publishable
    if sim_nonpub > sim_pub:
        doc['original_similarity'] = 0
    else:
        doc['original_similarity'] = 1
Same idea, but we pick whichever centroid yields a lower distance. Store the predicted label in doc['original_distance'].
############################
# B) ORIGINAL DISTANCE
############################

# We assume the same centroid_nonpub, centroid_pub from above
# If you are running this code separately, recompute them:
nonpub_indices = []
pub_indices = []
for i, doc in enumerate(documents):
    fname = doc['filename']
    if fname in labeled_files_class0:
        nonpub_indices.append(i)
    elif fname in labeled_files_class1:
        pub_indices.append(i)

nonpub_embeddings = doc_embeddings[nonpub_indices]
pub_embeddings = doc_embeddings[pub_indices]

centroid_nonpub = np.mean(nonpub_embeddings, axis=0)
centroid_pub = np.mean(pub_embeddings, axis=0)

# Classify using distance
for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    dist_nonpub = euclidean_dist(emb, centroid_nonpub)
    dist_pub = euclidean_dist(emb, centroid_pub)
    # 0 => non-publishable, 1 => publishable
    if dist_nonpub < dist_pub:
        doc['original_distance'] = 0
    else:
        doc['original_distance'] = 1
We iteratively include new docs whose similarity to a particular centroid surpasses a threshold. For example:
- If sim_to_nonpub > threshold_nonpub and sim_to_nonpub > sim_to_pub (i.e., it is really closer to non-pub), we add that doc to the nonpub set.
- If sim_to_pub > threshold_pub and sim_to_pub > sim_to_nonpub, we add that doc to the pub set.
- We then recompute the centroids and repeat, for at most max_iterations.

Parameters you might want to tune:
- threshold_nonpub, threshold_pub: e.g. 0.60 or 0.65.
- max_iterations: e.g. 5.

The code below lumps them into one threshold if you prefer. We store the final label in doc['iterative_similarity'].
############################
# C) ITERATIVE SIMILARITY
############################

def iterative_dual_centroid_similarity(doc_embeddings, documents,
                                       labeled_files_class0, labeled_files_class1,
                                       threshold=0.65, max_iterations=5):
    """
    Iteratively expand the sets of non-publishable and publishable docs
    based on which centroid they're more similar to, provided that
    similarity > threshold. Recompute centroids each iteration.
    """
    # 1) Initialize sets from your labeled data
    nonpub_set = set(labeled_files_class0)
    pub_set = set(labeled_files_class1)

    iteration = 0
    changed = True

    while iteration < max_iterations and changed:
        iteration += 1
        changed = False

        # Compute centroids
        nonpub_embs = []
        pub_embs = []
        for i, doc in enumerate(documents):
            if doc['filename'] in nonpub_set:
                nonpub_embs.append(doc_embeddings[i])
            elif doc['filename'] in pub_set:
                pub_embs.append(doc_embeddings[i])

        # If one class is empty, skip to avoid error
        if len(nonpub_embs) == 0 or len(pub_embs) == 0:
            print("One set is empty! Stopping iterative approach.")
            break

        centroid_nonpub = np.mean(nonpub_embs, axis=0)
        centroid_pub = np.mean(pub_embs, axis=0)

        # 2) For each unlabeled doc, see if we can add it
        for i, doc in enumerate(documents):
            fname = doc['filename']
            # Skip if it's already in either set
            if fname in nonpub_set or fname in pub_set:
                continue

            emb = doc_embeddings[i]
            sim_nonpub = cos_sim(emb, centroid_nonpub)
            sim_pub = cos_sim(emb, centroid_pub)

            # If it's clearly more similar to nonpub (and above threshold)
            if sim_nonpub > sim_pub and sim_nonpub >= threshold:
                nonpub_set.add(fname)
                changed = True
            # If it's clearly more similar to pub (and above threshold)
            elif sim_pub > sim_nonpub and sim_pub >= threshold:
                pub_set.add(fname)
                changed = True

        print(f"Iteration {iteration} done. Non-pub size: {len(nonpub_set)}, Pub size: {len(pub_set)}")
        if not changed:
            print("No changes in this iteration, stopping early.")

    # Final classification
    # If a doc is in nonpub_set => 0, if in pub_set => 1, else break tie
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in nonpub_set and fname not in pub_set:
            doc['iterative_similarity'] = 0
        elif fname in pub_set and fname not in nonpub_set:
            doc['iterative_similarity'] = 1
        else:
            # doc not classified or possibly in both sets somehow
            # you can break ties by whichever centroid is more similar
            emb = doc_embeddings[i]
            sim_nonpub = cos_sim(emb, centroid_nonpub)
            sim_pub = cos_sim(emb, centroid_pub)
            doc['iterative_similarity'] = 0 if sim_nonpub >= sim_pub else 1

# Call the function:
iterative_dual_centroid_similarity(doc_embeddings, documents,
                                   labeled_files_class0, labeled_files_class1,
                                   threshold=0.65, max_iterations=5)
Feel free to tune the threshold. If you set threshold=0.80, you'll be very conservative about adding new docs. If you set threshold=0.50, you'll likely add more docs (risking false positives).
Same iterative logic, but now we measure Euclidean distance to each centroid. If a doc is closer to the non-pub centroid by some margin, we add it to that set, provided it’s under some threshold. The threshold means “distance must be less than X” for us to add it confidently.
############################
# D) ITERATIVE DISTANCE
############################

def iterative_dual_centroid_distance(doc_embeddings, documents,
                                     labeled_files_class0, labeled_files_class1,
                                     distance_threshold=1.0, max_iterations=5):
    """
    Iteratively expand the sets of non-publishable and publishable docs
    based on which centroid they are closer to, provided the distance
    < distance_threshold. Recompute centroids each iteration.
    """
    nonpub_set = set(labeled_files_class0)
    pub_set = set(labeled_files_class1)

    iteration = 0
    changed = True

    while iteration < max_iterations and changed:
        iteration += 1
        changed = False

        # Compute centroids
        nonpub_embs = []
        pub_embs = []
        for i, doc in enumerate(documents):
            if doc['filename'] in nonpub_set:
                nonpub_embs.append(doc_embeddings[i])
            elif doc['filename'] in pub_set:
                pub_embs.append(doc_embeddings[i])

        if len(nonpub_embs) == 0 or len(pub_embs) == 0:
            print("One set is empty! Stopping iterative approach.")
            break

        centroid_nonpub = np.mean(nonpub_embs, axis=0)
        centroid_pub = np.mean(pub_embs, axis=0)

        for i, doc in enumerate(documents):
            fname = doc['filename']
            if fname in nonpub_set or fname in pub_set:
                continue

            emb = doc_embeddings[i]
            dist_nonpub = euclidean_dist(emb, centroid_nonpub)
            dist_pub = euclidean_dist(emb, centroid_pub)

            # If it's strictly closer to nonpub and below threshold
            if dist_nonpub < dist_pub and dist_nonpub <= distance_threshold:
                nonpub_set.add(fname)
                changed = True
            # If it's strictly closer to pub and below threshold
            elif dist_pub < dist_nonpub and dist_pub <= distance_threshold:
                pub_set.add(fname)
                changed = True

        print(f"Iteration {iteration} done. Non-pub size: {len(nonpub_set)}, Pub size: {len(pub_set)}")
        if not changed:
            print("No changes in this iteration, stopping early.")

    # Final classification
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in nonpub_set and fname not in pub_set:
            doc['iterative_distance'] = 0
        elif fname in pub_set and fname not in nonpub_set:
            doc['iterative_distance'] = 1
        else:
            # break tie by whichever centroid is closer
            emb = doc_embeddings[i]
            dist_nonpub = euclidean_dist(emb, centroid_nonpub)
            dist_pub = euclidean_dist(emb, centroid_pub)
            doc['iterative_distance'] = 0 if dist_nonpub <= dist_pub else 1

# Call the function:
iterative_dual_centroid_distance(doc_embeddings, documents,
                                 labeled_files_class0, labeled_files_class1,
                                 distance_threshold=1.0, max_iterations=5)
Note: You need to choose a “distance threshold” that makes sense in your embedding space. You might try 1.0, 1.5, 2.0, etc., then see if new docs get added. If your embeddings are large in magnitude, you might need a bigger threshold. If your embeddings are normalized (like in some BERT variants), typical Euclidean distances might be < 2.0.
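One way to pick a sensible distance_threshold, rather than guessing, is to look at how far your 15 labeled docs actually sit from their own class centroid. This is only a small sketch under that assumption (it reuses centroid_nonpub, centroid_pub, and euclidean_dist from the snippets above):

import numpy as np

# Distances of each labeled doc to its own class centroid
own_class_distances = []
for i, doc in enumerate(documents):
    if doc['filename'] in labeled_files_class0:
        own_class_distances.append(euclidean_dist(doc_embeddings[i], centroid_nonpub))
    elif doc['filename'] in labeled_files_class1:
        own_class_distances.append(euclidean_dist(doc_embeddings[i], centroid_pub))

own_class_distances = np.array(own_class_distances)
print("median:", np.median(own_class_distances),
      "90th percentile:", np.percentile(own_class_distances, 90))

# e.g., start with the 90th percentile as a conservative distance_threshold
candidate_threshold = float(np.percentile(own_class_distances, 90))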
After running all four methods (A, B, C, D) in sequence, each doc in documents will have:
doc['original_similarity']
doc['original_distance']
doc['iterative_similarity']
doc['iterative_distance']
You can then do:
import pandas as pd

final_df = pd.DataFrame(documents)
print(final_df[['filename', 'original_similarity', 'original_distance',
                'iterative_similarity', 'iterative_distance']])
to see the final classification side by side.
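If you want a single consensus label from the four columns, one simple option (illustrative, not required) is a majority vote plus a flag for docs where the methods disagree:

label_cols = ['original_similarity', 'original_distance',
              'iterative_similarity', 'iterative_distance']

# Majority vote: publishable (1) if at least 3 of the 4 methods say so;
# on a 2-2 tie we fall back to the more cautious label (0 = non-publishable)
final_df['votes_publishable'] = final_df[label_cols].sum(axis=1)
final_df['consensus'] = (final_df['votes_publishable'] >= 3).astype(int)
final_df['methods_disagree'] = final_df[label_cols].nunique(axis=1) > 1

print(final_df[['filename', 'consensus', 'methods_disagree']].head(20))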
A few final tips:
- You can require a margin before trusting a prediction, e.g. (sim_nonpub - sim_pub) > 0.1 or (dist_pub - dist_nonpub) > 0.5. This helps filter out borderline cases (a short sketch follows after this list).
- Check the four new columns in documents and see which approach yields the best results on your 15 labeled examples (lowest false positives/negatives).
- Tune the thresholds and max_iterations rather than treating the defaults as final.
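Here is the margin idea as a short sketch (it assumes centroid_nonpub / centroid_pub from snippet A; the 0.1 margin is only an example value to tune):

margin = 0.1

for i, doc in enumerate(documents):
    emb = doc_embeddings[i]
    sim_nonpub = cos_sim(emb, centroid_nonpub)
    sim_pub = cos_sim(emb, centroid_pub)
    gap = abs(sim_nonpub - sim_pub)
    # Only commit to a hard label when the two similarities are clearly separated
    if gap < margin:
        doc['margin_label'] = 'uncertain'
    else:
        doc['margin_label'] = 0 if sim_nonpub > sim_pub else 1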
This should give you ample flexibility to compare all four dual-centroid methods (similarity vs. distance, one-shot vs. iterative) and store results in your DataFrame. Good luck!
Applying clustering
I’m piecing together a method to cluster data using KMeans, then project it to 2D with PCA, t-SNE, and Isomap. Notably, no supervised classification is employed, and 15 labeled points are highlighted.
Outlining the approach
I’m detailing the plan to visualize 2D data plots using PCA, t-SNE, and Isomap. Each subplot will showcase clusters and highlight 15 labeled data points for clarity.
Thought for 6 seconds
Below is a detailed explanation of your question about whether centroid-based classification might be "overfitting," followed by a K-Means approach with 2 clusters that ignores the labels entirely when clustering. Then we visualize the K-Means results with three different 2D projections (PCA, t-SNE, and ISOMAP) in a single code snippet, highlighting your 15 labeled examples.
Is it Overfitting?
With only 5 non-publishable and 10 publishable examples, the centroids inevitably reflect the quirks of those specific papers, so strong agreement with the labeled set by itself proves little. If your manual checks on unlabeled data suggest it's doing "decently," that's a good sign. But keep in mind, your final performance on the private data is what the judges will rely on to see if it truly generalizes or not.
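One cheap sanity check, sketched below, is leave-one-out validation of the centroid rule on the 15 labeled docs: hold one labeled doc out, rebuild the centroids from the other 14, and see whether the held-out doc is still classified correctly. It only uses the doc_embeddings, labeled sets, and cos_sim already defined in this conversation:

import numpy as np

labeled = [(i, 0) for i, d in enumerate(documents) if d['filename'] in labeled_files_class0]
labeled += [(i, 1) for i, d in enumerate(documents) if d['filename'] in labeled_files_class1]

correct = 0
for held_idx, held_label in labeled:
    nonpub, pub = [], []
    for idx, lab in labeled:
        if idx == held_idx:
            continue
        (nonpub if lab == 0 else pub).append(doc_embeddings[idx])
    centroid_np = np.mean(nonpub, axis=0)
    centroid_p = np.mean(pub, axis=0)
    emb = doc_embeddings[held_idx]
    pred = 0 if cos_sim(emb, centroid_np) > cos_sim(emb, centroid_p) else 1
    correct += int(pred == held_label)

print(f"Leave-one-out accuracy on the 15 labeled docs: {correct}/{len(labeled)}")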
You asked for a purely unsupervised K-Means approach that does not use the labeled data to define the cluster centers. Instead, we let K-Means discover 2 clusters from the entire dataset (150 documents). Then we can see if cluster 0 aligns well with non-publishable or cluster 1 aligns well with publishable, etc.
Below is a single code snippet that does all three visualizations in subplots. Adjust as you see fit.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

# Suppose we have:
#   doc_embeddings: shape [n_docs, d]
#   documents: list of dicts, each with doc['filename']
#   labeled_files_class0 = { "R001.pdf", ..., "R005.pdf" }
#   labeled_files_class1 = { "R006.pdf", ..., "R015.pdf" }

# 1) Perform K-Means with 2 clusters, ignoring labels
kmeans_2 = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans_2.fit_predict(doc_embeddings)

# Store cluster labels in documents
for i, doc in enumerate(documents):
    doc['kmeans_label'] = int(kmeans_labels[i])

print("Finished K-Means with 2 clusters. Label distribution:")
unique, counts = np.unique(kmeans_labels, return_counts=True)
for lab, count in zip(unique, counts):
    print(f"Cluster {lab}: {count} documents")

# 2) We'll define a helper function to plot 2D projections
def plot_2d_projection(proj_matrix, title):
    """
    proj_matrix: shape [n_docs, 2]
    title: string for the plot
    """
    x_coords = proj_matrix[:, 0]
    y_coords = proj_matrix[:, 1]

    # color by kmeans_label
    clusters = [doc['kmeans_label'] for doc in documents]
    scatter = plt.scatter(x_coords, y_coords, c=clusters, cmap='tab10', alpha=0.6)
    plt.title(title)
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")

    # highlight the 15 labeled docs
    for i, doc in enumerate(documents):
        fname = doc['filename']
        if fname in labeled_files_class0 or fname in labeled_files_class1:
            # black edge circle
            plt.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                        marker='o', s=100, linewidths=1.5)
            if fname in labeled_files_class0:
                label_text = "C0"
            else:
                label_text = "C1"
            plt.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

    # Optionally add a legend for cluster coloring
    unique_clusters = list(set(clusters))
    handles = scatter.legend_elements()[0]
    legend_labels = [f"Cluster {uc}" for uc in unique_clusters]
    plt.legend(handles, legend_labels, title="K-Means Clusters", loc="best")

# 3) Prepare 2D projections: PCA, t-SNE, and ISOMAP
pca = PCA(n_components=2, random_state=42)
pca_2d = pca.fit_transform(doc_embeddings)

tsne = TSNE(n_components=2, random_state=42)
tsne_2d = tsne.fit_transform(doc_embeddings)

iso = Isomap(n_components=2)
iso_2d = iso.fit_transform(doc_embeddings)

# 4) Plot them in subplots
fig = plt.figure(figsize=(18, 5))

plt.subplot(1, 3, 1)
plot_2d_projection(pca_2d, "PCA (2D) of doc_embeddings")

plt.subplot(1, 3, 2)
plot_2d_projection(tsne_2d, "t-SNE (2D) of doc_embeddings")

plt.subplot(1, 3, 3)
plot_2d_projection(iso_2d, "ISOMAP (2D) of doc_embeddings")

plt.tight_layout()
plt.show()
Notes: K-Means cluster IDs are arbitrary, so check which cluster actually lines up with your labeled non-publishable vs. publishable docs before reading anything into "cluster 0" or "cluster 1." The t-SNE and Isomap projections are used purely for visualization, not for the clustering itself.
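A quick way to do that check, sketched here on top of the snippet above, is to see how the 15 labeled docs fall into the two clusters and map each cluster to the majority class among them:

from collections import Counter

# Count labeled docs per (cluster, true class)
per_cluster = {0: Counter(), 1: Counter()}
for doc in documents:
    if doc['filename'] in labeled_files_class0:
        per_cluster[doc['kmeans_label']]['nonpub'] += 1
    elif doc['filename'] in labeled_files_class1:
        per_cluster[doc['kmeans_label']]['pub'] += 1

print("Labeled docs per cluster:", per_cluster)

# Map each cluster to the majority class of its labeled members (if any)
cluster_to_class = {}
for cluster_id, counts in per_cluster.items():
    if counts:
        cluster_to_class[cluster_id] = 0 if counts['nonpub'] >= counts['pub'] else 1
print("Cluster -> class mapping:", cluster_to_class)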
With this code, you can show a simpler unsupervised approach to the judges, demonstrating you didn’t rely solely on your labeled docs to define the entire classification logic. Good luck!
Organizing embeddings
I’m mapping out a way to store embeddings for each document section and an overall document embedding. This involves modularizing the visualization for easier access to individual section KMeans results.
Formulating the plan
I’m outlining a modular function to build embeddings for each document section and an overall document embedding. Then, I’ll apply KMeans clustering and visualize 2D embeddings using PCA, t-SNE, or Isomap.
Pulling together final code
I'm composing a comprehensive code snippet. It builds section embeddings and doc-level embeddings, applies KMeans clustering, and visualizes using a modularized approach, maintaining the user's preferred coding style.
Thought for 10 seconds
Below is a complete solution that builds per-section embeddings plus a document-level embedding, runs K-Means on each, and visualizes the clusters in 2D (PCA, t-SNE, ISOMAP).
We'll assume you have:
- documents, where doc['sections'] is a list of (section_name, text) tuples.
- An embed_model (like SciBERT) that provides an .encode(...) method returning a 768-dim vector (or a different dimension; just adjust accordingly).
- The labeled sets:

labeled_files_class0 = {"R001.pdf", "R002.pdf", "R003.pdf", "R004.pdf", "R005.pdf"}
labeled_files_class1 = {"R006.pdf", "R007.pdf", "R008.pdf", "R009.pdf", "R010.pdf",
                        "R011.pdf", "R012.pdf", "R013.pdf", "R014.pdf", "R015.pdf"}
SECTIONS = [
    "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
    "Related work", "Experiments", "Conclusion", "Discussion",
    "Background", "References"
]
Feel free to adjust naming or add your own custom logic. This code is meant to be modular so you can run a loop over all sections and produce separate K-Means + 2D visualizations.
import numpy as np

def build_section_and_doc_embeddings(documents, embed_model):
    """
    For each document in 'documents', build:
      doc['section_embeddings'] = {
          "Abstract": <768-dim embedding or zeros>,
          "Introduction": <768-dim embedding or zeros>,
          ...
      }
      doc['doc_embedding'] = <768-dim average across all available sections>

    'documents' is a list of dicts, each with:
      doc['sections'] = list of (sec_name, sec_text)

    embed_model.encode() returns a numpy array embedding for given text input.
    """
    SECTIONS = [
        "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
        "Related work", "Experiments", "Conclusion", "Discussion",
        "Background", "References"
    ]

    for doc in documents:
        # Initialize a dict that maps section_name -> embedding
        doc['section_embeddings'] = {}
        # We'll store per-section embeddings in a dictionary
        # e.g. doc['section_embeddings']['Abstract'] = np.array([...])

        # We might have multiple sections in doc['sections'].
        # Gather them in a dict for easy access,
        # e.g. actual_sections_found = { "Abstract": text_for_abstract, ... }
        actual_sections_found = {}
        for (sec_name, sec_text) in doc['sections']:
            actual_sections_found[sec_name] = sec_text

        # For each possible section in SECTIONS, build an embedding
        for section_name in SECTIONS:
            if section_name in actual_sections_found:
                raw_text = actual_sections_found[section_name].strip()
                if len(raw_text) == 0:
                    # If empty text, store zeros
                    doc['section_embeddings'][section_name] = np.zeros(
                        embed_model.get_sentence_embedding_dimension())
                else:
                    # Encode this section's text
                    embedding = embed_model.encode(raw_text, convert_to_numpy=True)
                    # embed_model.encode(...) returns a 768-dim array if you're using SciBERT or similar
                    # If you have large text, you might want to do chunking or similar,
                    # but here's the simple approach
                    doc['section_embeddings'][section_name] = embedding
            else:
                # If doc doesn't have this section, store zeros
                doc['section_embeddings'][section_name] = np.zeros(
                    embed_model.get_sentence_embedding_dimension())

        # Now, compute doc['doc_embedding'] by averaging all non-zero embeddings
        nonzero_vectors = []
        for sec_name, emb in doc['section_embeddings'].items():
            # If emb is all zeros, skip
            if np.any(emb):  # i.e. not all zeros
                nonzero_vectors.append(emb)

        if len(nonzero_vectors) == 0:
            # If no valid sections, store a zeros vector
            doc['doc_embedding'] = np.zeros(embed_model.get_sentence_embedding_dimension())
        else:
            doc['doc_embedding'] = np.mean(nonzero_vectors, axis=0)
Notes: we call embed_model.encode(raw_text, convert_to_numpy=True) to get a single embedding for that entire section.

We'll define a function that:
- gathers the embedding for a given section from every doc,
- runs K-Means with a chosen number of clusters k (for instance, k=5), and
- stores the cluster label in doc['kmeans_{section_name}'].

A second function does the same thing on the document-level doc_embedding.

from sklearn.cluster import KMeans

def run_kmeans_for_section(documents, section_name, n_clusters=5, random_state=42):
    """
    Gathers the embedding for 'section_name' from each doc and runs K-Means.
    Then stores the cluster label in doc['kmeans_{section_name}'].
    """
    # 1) Build an array of shape [n_docs, embedding_dim] for this section
    all_embeddings = []
    for doc in documents:
        # E.g. doc['section_embeddings']['Abstract']
        emb = doc['section_embeddings'][section_name]
        all_embeddings.append(emb)
    all_embeddings = np.array(all_embeddings)  # shape: [n_docs, d]

    # 2) K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
    labels = kmeans.fit_predict(all_embeddings)

    # 3) Store the label in the doc
    for i, doc in enumerate(documents):
        doc[f'kmeans_{section_name}'] = int(labels[i])


def run_kmeans_for_doc_embedding(documents, n_clusters=5, random_state=42):
    """
    Gathers doc['doc_embedding'] for each doc, runs K-Means,
    and stores cluster label in doc['kmeans_doc'].
    """
    # 1) Build array of shape [n_docs, d] from doc['doc_embedding']
    all_doc_emb = [doc['doc_embedding'] for doc in documents]
    all_doc_emb = np.array(all_doc_emb)

    # 2) K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)
    labels = kmeans.fit_predict(all_doc_emb)

    # 3) Store label
    for i, doc in enumerate(documents):
        doc['kmeans_doc'] = int(labels[i])
We want to pass in a numpy array of shape [n_docs, d] and a list of cluster labels. Then we compute three 2D projections (PCA, t-SNE, ISOMAP) and create subplots for each. We also highlight the labeled docs. This is very similar to your previous code, but more modular so we can re-use it for each section or for doc-level embeddings.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

def visualize_2d_projections(embeddings, cluster_labels, documents,
                             labeled_files_class0, labeled_files_class1,
                             title_prefix=""):
    """
    embeddings: shape [n_docs, d]
    cluster_labels: shape [n_docs], e.g. from KMeans
    documents: the list of doc dicts
    labeled_files_class0, labeled_files_class1: sets for highlight
    title_prefix: string to prefix in the title, e.g. "Abstract" or "doc_embedding"
    """
    # 1) Prepare the 3 transformations
    pca = PCA(n_components=2, random_state=42)
    pca_2d = pca.fit_transform(embeddings)

    tsne = TSNE(n_components=2, random_state=42)
    tsne_2d = tsne.fit_transform(embeddings)

    iso = Isomap(n_components=2)
    iso_2d = iso.fit_transform(embeddings)

    fig = plt.figure(figsize=(18, 5))

    # We'll define a helper closure to plot each sub-figure
    def plot_projection(ax, proj_matrix, method_name):
        x_coords = proj_matrix[:, 0]
        y_coords = proj_matrix[:, 1]

        scatter = ax.scatter(x_coords, y_coords, c=cluster_labels, cmap='tab10', alpha=0.6)
        ax.set_title(f"{title_prefix} - {method_name}")
        ax.set_xlabel("Component 1")
        ax.set_ylabel("Component 2")

        for i, doc in enumerate(documents):
            fname = doc['filename']
            if fname in labeled_files_class0 or fname in labeled_files_class1:
                ax.scatter(x_coords[i], y_coords[i], c='none', edgecolors='black',
                           marker='o', s=100, linewidths=1.5)
                label_text = "C0" if fname in labeled_files_class0 else "C1"
                ax.text(x_coords[i] + 0.02, y_coords[i] + 0.02, label_text, fontsize=9)

        unique_clusters = list(sorted(set(cluster_labels)))
        handles = scatter.legend_elements()[0]
        legend_labels = [f"Cluster {uc}" for uc in unique_clusters]
        ax.legend(handles, legend_labels, title="K-Means Clusters", loc="best")

    ax1 = fig.add_subplot(1, 3, 1)
    plot_projection(ax1, pca_2d, "PCA")

    ax2 = fig.add_subplot(1, 3, 2)
    plot_projection(ax2, tsne_2d, "t-SNE")

    ax3 = fig.add_subplot(1, 3, 3)
    plot_projection(ax3, iso_2d, "ISOMAP")

    plt.tight_layout()
    plt.show()
Below is an example workflow that you can adapt to your code:
###############################################################################
# EXAMPLE WORKFLOW
###############################################################################

# 1) Build section embeddings and doc-level embeddings
build_section_and_doc_embeddings(documents, embed_model)

# 2) For each section in our known set, run K-Means + visualize
SECTIONS = [
    "Unlabeled", "Abstract", "Introduction", "Methods", "Results",
    "Related work", "Experiments", "Conclusion", "Discussion",
    "Background", "References"
]

for section_name in SECTIONS:
    # 2a) Run K-Means
    run_kmeans_for_section(documents, section_name, n_clusters=5, random_state=42)

    # 2b) Gather the embeddings array for that section
    section_embs = []
    section_labels = []
    for i, doc in enumerate(documents):
        emb = doc['section_embeddings'][section_name]
        lbl = doc[f'kmeans_{section_name}']
        section_embs.append(emb)
        section_labels.append(lbl)

    section_embs = np.array(section_embs)
    section_labels = np.array(section_labels)

    # 2c) Visualize
    visualize_2d_projections(
        embeddings=section_embs,
        cluster_labels=section_labels,
        documents=documents,
        labeled_files_class0=labeled_files_class0,
        labeled_files_class1=labeled_files_class1,
        title_prefix=f"{section_name} Section"
    )

# 3) Do doc-level K-Means
run_kmeans_for_doc_embedding(documents, n_clusters=5, random_state=42)

# 4) Gather doc-level embeddings + labels
doc_embs = []
doc_labels = []
for i, doc in enumerate(documents):
    emb = doc['doc_embedding']
    lbl = doc['kmeans_doc']
    doc_embs.append(emb)
    doc_labels.append(lbl)

doc_embs = np.array(doc_embs)
doc_labels = np.array(doc_labels)

visualize_2d_projections(
    embeddings=doc_embs,
    cluster_labels=doc_labels,
    documents=documents,
    labeled_files_class0=labeled_files_class0,
    labeled_files_class1=labeled_files_class1,
    title_prefix="Document-Level"
)
What this does:
- Builds per-section embeddings in doc['section_embeddings'][section_name].
- Builds a document-level embedding in doc['doc_embedding'].
- Runs K-Means on each (with n_clusters=5 here; use k=2 if you want pure "non-pub vs. pub", that's up to you) and produces the three 2D visualizations for every section and for the document level.

If you want exactly 2 clusters for "publishable vs. non-publishable," just replace n_clusters=5 with n_clusters=2. Or do both and see how they differ.
This modular approach should integrate cleanly with your existing data structures (documents, embed_model). You can tweak n_clusters, or do further analysis on each section's clusters to see if, for instance, "Abstract" embeddings separate out certain flawed papers more clearly.
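One quick way to do that further analysis, sketched here on top of the workflow above, is to cross-tabulate each section's cluster labels against your 15 known labels and see which section separates the two classes most cleanly:

from collections import Counter

for section_name in SECTIONS:
    counts = Counter()
    for doc in documents:
        cluster = doc[f'kmeans_{section_name}']
        if doc['filename'] in labeled_files_class0:
            counts[(cluster, 'nonpub')] += 1
        elif doc['filename'] in labeled_files_class1:
            counts[(cluster, 'pub')] += 1
    # Sections where labeled non-pub and pub docs land in different clusters
    # are the most informative ones to inspect further.
    print(section_name, dict(counts))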
Good luck, and feel free to adapt the code as needed!
Thought for a second
Below are some reflections on why you might still be seeing identical final classifications despite introducing section-level embeddings, multiple clustering approaches, etc.:
Homogeneity in the Data
If most of your papers (both publishable and non-publishable) share similar language and vocabulary in every section—i.e., the “Results” and “Discussion” sections look more or less the same across many papers—then averaging or section-level embeddings might not introduce enough variability to shift the clusters. In other words, the model “sees” roughly the same signals no matter how you segment the text.
Strong Dominant Signal
Some domain-specific terms or writing style might dominate across all sections. For example, if every paper includes standard ML terminology or references, it might overwhelm the differences that typically differentiate publishable vs. non-publishable content. Thus, whether you look at the “Introduction” or “Methods” or combine them, the embeddings land in roughly the same space.
Limited Label Diversity
You only have 15 labeled examples (5 non-publishable, 10 publishable). If they are not diverse enough or if the unlabeled set is large and somewhat homogeneous, the clustering and centroid-based methods often yield the same outcome. The data itself may not be providing strong opposing signals to re-separate the clusters.
Robustness of the Embedding Model
If your chosen embedding model (e.g., SciBERT) consistently encodes the papers in a similar fashion, the different partitionings (section-level vs. entire doc) might just lead to near-identical vectors. For instance, if the “Methods” section heavily overlaps with the “Introduction” content, the average or sum might not differ enough to rearrange the cluster memberships.
The Data is Not Contradictory
The fact that multiple approaches lead to the same classification implies that, from a textual perspective, the boundary between “publishable” and “non-publishable” (as your data is capturing it) is consistent. The data is not easily teased apart differently by slicing or chunking.
Consistency in the Embedding Space
You might actually have a solid embedding approach. If all these transformations (section-level vs. doc-level) converge to the same outcome, it might mean the representation is stable and not very sensitive to how you slice the document text.
Possibility of Domain Constraints
Some tasks simply do not vary enough across sections for them to produce differing clusters. If every paper has the same general structure, length, or domain keywords, the embeddings can look very alike regardless of how you chunk or average them.
Refine or Add Additional Features
Examine “Borderline” Cases
Increase Labeled Diversity
Try Another Embedding Model
Validate with External Data
Getting the same final classification from different chunking/averaging strategies often means the signal your embeddings capture is consistent across sections.
It does not necessarily mean you are doing something wrong; rather, it suggests your data and chosen embeddings converge on the same representation regardless of how you slice the paper text. This can be interpreted as stability in your approach.
The best way to be sure is to test on truly unseen data or carefully label more documents to see if there are any subtle differences the current pipeline is missing.