Research Paper Classification Framework
Created: January 10, 2025
Unveiling a solution
I’m piecing together a solution to classify research papers by reading, processing, and analyzing the text data. This approach clarifies their publishability.
Crafting a pipeline
I’m detailing steps for classifying research papers: loading data, converting to TF-IDF, applying topic modeling, training a classifier with labeled data, and considering a hybrid model for accuracy.
Stitching together
I’m mapping out the components for a comprehensive research paper classifier: PDF extraction, text cleaning, topic modeling, embeddings, classifier training, cross-validation, and iterative labeling for a robust hybrid model.
Weighing the choices
I'm curious about comparing traditional supervised learning, transfer learning, semi-supervised learning, unsupervised learning, and hybrid methods. The top pick likely merges pretrained models with a semi-supervised approach.
Piecing together the pipeline
OK, let me see: parsing PDFs, removing stopwords, creating vector documents, applying LDA, training classifiers, and classifying using reconciliation. This blend should ensure a comprehensive, effective approach.
Thought about research paper classification methods for 11 seconds
Below is a comprehensive overview of the possible approaches for classifying research papers as “Publishable” or “Non-Publishable,” followed by a recommended hybrid approach that uses topic modeling plus a semi-supervised classification strategy to avoid overfitting. Finally, you will find a complete sample Python script that demonstrates how to implement such a pipeline, including PDF text extraction, topic modeling, and hybrid (semi-supervised) classification.
When you have very limited labeled data (in your case, 15 labeled papers: 10 Publishable and 5 Non-Publishable) and a relatively large set of unlabeled data (135 papers), these are some approaches you can consider:
Pure Supervised Learning
Unsupervised Topic Modeling
Self-Supervised / Language Model Embeddings
Semi-Supervised Learning (Label Propagation, Self-Training, etc.)
Active Learning
Hybrid Models (Topic Modeling + Supervised/Semi-Supervised Classifier)
Given that you have only 15 labeled papers (10 Publishable, 5 Non-Publishable) and 135 unlabeled papers, a Hybrid approach using Topic Modeling + Semi-Supervised Classification is likely most effective at avoiding overfitting while leveraging the unlabeled data:
1. Extract and clean the text from all 150 PDFs.
2. Fit LDA topic modeling on the full corpus (labeled + unlabeled).
3. Build features by combining each paper's topic distribution with TF-IDF vectors.
4. Train a self-training classifier on the 15 labeled papers, letting it pseudo-label the unlabeled papers.
This pipeline allows you to build a model that is less prone to overfitting (thanks to the extra unlabeled data in both the topic modeling step and the semi-supervised classification step).
Below is an end-to-end Python script demonstrating how you could implement a Topic Modeling + Semi-Supervised (Self-Training) classification pipeline. It uses the following libraries:
- PyPDF2 or pdfminer.six (for PDF text extraction)
- nltk (for tokenization, stopword removal)
- gensim (for LDA topic modeling)
- scikit-learn (for TF-IDF, classification, self-training)
- glob, os (for file handling)
- re (for text cleaning)

Note: Make sure to install these libraries if you haven’t already (e.g., `pip install pypdf2 pdfminer.six nltk gensim scikit-learn`). You may also need to download the NLTK resources: `nltk.download('stopwords')`, `nltk.download('punkt')`.
Folder Structure:
```
.
├── Reference
│   ├── Non-Publishable   (5 PDF files)
│   └── Publishable       (10 PDF files)
└── Papers                (135 PDF files, unlabeled)
```
<details>
<summary><strong>Click to view the sample code</strong></summary>
```python
import os
import glob
import re

import nltk
import PyPDF2
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


# 1. PDF Text Extraction Function
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text() if page.extract_text() else ""
    return text


# 2. Text Preprocessing Function
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove non-alphabetic chars
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords and very short tokens
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words and len(w) > 2]
    return words


# 3. Load all data
def load_data(reference_folder, papers_folder):
    # Labeled data
    labeled_texts = []
    labels = []

    # Non-Publishable
    non_pub_files = glob.glob(os.path.join(reference_folder, 'Non-Publishable', '*.pdf'))
    for file_path in non_pub_files:
        raw_text = extract_text_from_pdf(file_path)
        labeled_texts.append(raw_text)
        labels.append(0)  # 0 for Non-Publishable

    # Publishable
    pub_files = glob.glob(os.path.join(reference_folder, 'Publishable', '*.pdf'))
    for file_path in pub_files:
        raw_text = extract_text_from_pdf(file_path)
        labeled_texts.append(raw_text)
        labels.append(1)  # 1 for Publishable

    # Unlabeled data
    unlabeled_texts = []
    unlabeled_filepaths = []
    unlabeled_files = glob.glob(os.path.join(papers_folder, '*.pdf'))
    for file_path in unlabeled_files:
        raw_text = extract_text_from_pdf(file_path)
        unlabeled_texts.append(raw_text)
        unlabeled_filepaths.append(file_path)

    return labeled_texts, labels, unlabeled_texts, unlabeled_filepaths


# 4. Topic Modeling (LDA) to get topic distributions for each document
def topic_modeling(all_docs, num_topics=5):
    """
    all_docs: list of strings (raw text of each document).
    num_topics: number of latent topics to learn.
    Returns:
        - lda_model (gensim model)
        - corpus (list of bag-of-words representations)
        - dictionary (gensim dictionary)
    """
    # Preprocess all documents
    tokenized_docs = [preprocess_text(doc) for doc in all_docs]

    # Create a dictionary representation of the documents
    dictionary = corpora.Dictionary(tokenized_docs)
    dictionary.filter_extremes(no_below=2, no_above=0.9)  # can be tuned

    # Convert tokenized documents into a bag-of-words corpus
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

    # Train LDA model
    lda_model = LdaModel(corpus=corpus,
                         id2word=dictionary,
                         num_topics=num_topics,
                         random_state=42,
                         passes=5)
    return lda_model, corpus, dictionary


def get_topic_distribution(lda_model, dictionary, text):
    """
    Given an LDA model, dictionary, and raw text of a document,
    returns the topic distribution (as a vector).
    """
    tokens = preprocess_text(text)
    bow = dictionary.doc2bow(tokens)
    topic_probs = lda_model.get_document_topics(bow, minimum_probability=0.0)
    # Convert to a fixed-size vector
    topic_vec = np.zeros(lda_model.num_topics)
    for topic_id, prob in topic_probs:
        topic_vec[topic_id] = prob
    return topic_vec


# 5. Feature Extraction (Topic distribution + TF-IDF) for classification
def build_features(labeled_texts, unlabeled_texts, lda_model, dictionary):
    """
    Build combined feature vectors (Topic distribution + TF-IDF).
    Returns:
        - X_labeled: feature matrix for labeled data
        - X_unlabeled: feature matrix for unlabeled data
    """
    from scipy.sparse import csr_matrix, hstack

    # --- (A) TF-IDF Features ---
    all_texts = labeled_texts + unlabeled_texts
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
    tfidf_matrix = tfidf_vectorizer.fit_transform(all_texts)

    # Split back into labeled/unlabeled
    X_tfidf_labeled = tfidf_matrix[:len(labeled_texts)]
    X_tfidf_unlabeled = tfidf_matrix[len(labeled_texts):]

    # --- (B) Topic Distribution Features ---
    topic_features_labeled = np.array(
        [get_topic_distribution(lda_model, dictionary, txt) for txt in labeled_texts])
    topic_features_unlabeled = np.array(
        [get_topic_distribution(lda_model, dictionary, txt) for txt in unlabeled_texts])

    # --- (C) Combine them (simple concatenation) ---
    # format='csr' so the result supports row slicing later (e.g., in train_test_split)
    X_labeled = hstack([X_tfidf_labeled, csr_matrix(topic_features_labeled)], format='csr')
    X_unlabeled = hstack([X_tfidf_unlabeled, csr_matrix(topic_features_unlabeled)], format='csr')

    return X_labeled, X_unlabeled


def main():
    nltk.download('punkt')
    nltk.download('stopwords')

    reference_folder = 'Reference'
    papers_folder = 'Papers'

    # --- 1. Load Data ---
    labeled_texts, labels, unlabeled_texts, unlabeled_filepaths = load_data(
        reference_folder, papers_folder)

    # --- 2. Combine all docs for topic modeling ---
    all_docs = labeled_texts + unlabeled_texts

    # --- 3. Topic Modeling (Unsupervised) on ALL docs ---
    lda_model, corpus, dictionary = topic_modeling(all_docs, num_topics=5)

    # --- 4. Build Features (TopicDist + TF-IDF) ---
    X_labeled, X_unlabeled = build_features(labeled_texts, unlabeled_texts, lda_model, dictionary)
    y_labeled = np.array(labels)

    # --- 5. Semi-Supervised Classification via Self-Training ---
    base_clf = LogisticRegression(max_iter=1000, solver='lbfgs')
    # threshold=0.8 means we only add pseudo-labels we are 80% confident about
    self_training_clf = SelfTrainingClassifier(base_clf, threshold=0.8)

    # Fit on labeled + unlabeled.
    # We need to pass a combined X and y, where unlabeled examples have y = -1.
    y_unlabeled = np.full(X_unlabeled.shape[0], -1, dtype=int)
    X_combined = np.vstack([X_labeled.toarray(), X_unlabeled.toarray()])
    y_combined = np.concatenate([y_labeled, y_unlabeled])

    self_training_clf.fit(X_combined, y_combined)

    # Now we have a trained semi-supervised classifier.
    # You can test it using cross-validation on the labeled set or do a final evaluation.

    # Example: Evaluate only on your limited labeled data
    # (Better approach: do a smaller train/test split on your 15 labeled samples)
    X_train, X_test, y_train, y_test = train_test_split(
        X_labeled, y_labeled, test_size=0.4, random_state=42, stratify=y_labeled)

    base_clf_evaluation = LogisticRegression(max_iter=1000)
    base_clf_evaluation.fit(X_train, y_train)
    acc_base = base_clf_evaluation.score(X_test, y_test)

    # Evaluate the semi-supervised model on the same test split
    acc_semi = self_training_clf.score(X_test.toarray(), y_test)

    print(f"Supervised-Only LogisticRegression accuracy (on 15-labeled subset): {acc_base:.3f}")
    print(f"Semi-Supervised SelfTrainingClassifier accuracy: {acc_semi:.3f}")

    # For demonstration, let's also see predictions on the unlabeled set
    unlabeled_preds = self_training_clf.predict(X_unlabeled.toarray())

    # Print which unlabeled papers are predicted 'Non-Publishable' (0) or 'Publishable' (1)
    print("Predictions on unlabeled data:")
    for filepath, pred_label in zip(unlabeled_filepaths, unlabeled_preds):
        label_str = 'Publishable' if pred_label == 1 else 'Non-Publishable'
        print(f"{os.path.basename(filepath)} -> {label_str}")


if __name__ == '__main__':
    main()
```

</details>
A few notes on how the script works and how you can extend it:

- PDF text extraction uses PyPDF2.PdfReader; if extraction quality is poor, pdfminer.six is a drop-in alternative.
- Topic modeling trains an LDA model (num_topics=5 as an example) over all 150 documents (labeled + unlabeled); num_topics is a hyperparameter worth tuning.
- Semi-supervised classification uses SelfTrainingClassifier in scikit-learn: labeled and unlabeled feature rows are stacked together, with y=-1 marking the unlabeled examples.
- To iterate further, try different topic counts (num_topics=5, 10, 20, etc.), try LabelPropagation or LabelSpreading from sklearn.semi_supervised, or switch to language-model embeddings (e.g., sentence-transformers) and then do a similar semi-supervised approach, as sketched below.
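For the last point, here is a minimal, hedged sketch of how the feature-building step could be swapped for transformer embeddings. It assumes the sentence-transformers package is installed; the model name all-MiniLM-L6-v2 is only an illustrative choice, and labeled_texts / unlabeled_texts come from load_data() in the main script.

```python
# Sketch only: replace the TF-IDF + LDA features with transformer embeddings.
# Assumes `pip install sentence-transformers`; the model name below is a common
# default, not something prescribed by the pipeline above.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# labeled_texts / unlabeled_texts come from load_data() in the main script
X_labeled_emb = embedder.encode(labeled_texts)      # shape: (n_labeled, dim)
X_unlabeled_emb = embedder.encode(unlabeled_texts)  # shape: (n_unlabeled, dim)

# These dense arrays can replace X_labeled / X_unlabeled in the
# SelfTrainingClassifier step without further changes.
X_combined_emb = np.vstack([X_labeled_emb, X_unlabeled_emb])
```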
Adjusting thresholds
I'm noting that the code might classify all PDFs as publishable due to a low threshold or insufficient data. Exploring data expansion, threshold tuning, label propagation, debugging techniques, and cross-validation for potential improvements.
Thought about model bias and overfitting for a few seconds
Below are several possible reasons why the model might be labeling all 135 unlabeled PDFs as “Publishable” and some practical tips on how to address or debug this issue. The fundamental problem is that you have limited labeled data (only 5 Non-Publishable vs. 10 Publishable) and a larger unlabeled set (135 PDFs) that the model must handle in a semi-supervised manner.
1. Class Imbalance: With 10 “Publishable” versus only 5 “Non-Publishable” examples, the classifier starts out biased toward the majority class.
2. Threshold in Self-Training: If you use SelfTrainingClassifier with a relatively high confidence threshold (e.g., 0.8), the model might only pseudo-label the unlabeled data as the class for which it’s most certain: “Publishable.” If it’s never “confident enough” to assign “Non-Publishable” labels, everything ends up being labeled “Publishable.”
3. Small Labeled Dataset: Fifteen labeled papers give the model very little signal for a reliable decision boundary.
4. Features Not Differentiating: The TF-IDF and topic features may not capture what actually separates publishable from non-publishable papers.
5. Overfitting to the “Publishable” Examples: The model may memorize surface patterns of the 10 publishable papers and treat anything similar-looking as publishable.
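Before changing anything, it helps to confirm which of these factors is actually at play by looking at the raw predicted probabilities, not just the hard labels. A small diagnostic sketch, assuming the self_training_clf and X_unlabeled objects from the pipeline above:

```python
import numpy as np

# Hard predictions: how many papers land in each class?
preds = self_training_clf.predict(X_unlabeled.toarray())
print("Predicted class counts (0=Non-Publishable, 1=Publishable):",
      np.bincount(preds, minlength=2))

# Soft predictions: if every P(Publishable) sits around 0.6-0.8, the model is
# weakly biased toward the majority class rather than genuinely confident.
probs = self_training_clf.predict_proba(X_unlabeled.toarray())[:, 1]
print("P(Publishable) - min/median/max:",
      probs.min(), np.median(probs), probs.max())
```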
Below are steps you can take to gain insight and improve classification results:
1. Train/Test Split of Labeled Data: Hold out part of the 15 labeled papers so you can check whether the model ever predicts “Non-Publishable” correctly.
2. Look at Predicted Probabilities: Check whether P(Publishable) is genuinely high or just hovering above 0.5 (see the sketch after this list).
3. Inspect the Confusion Matrix: On the labeled test split, see which class is being missed (also shown in the sketch below).
4. Lower the Confidence Threshold: The pipeline uses threshold=0.8 in SelfTrainingClassifier, which means the model will only pseudo-label an unlabeled sample if it has at least 80% confidence in the predicted class; try a lower value.
5. Use Label Spreading or Label Propagation: LabelPropagation or LabelSpreading in sklearn.semi_supervised propagate labels in a graph-based manner, which sometimes helps with very few labeled data.
6. Incremental Self-Training: Add pseudo-labels in small, high-confidence batches and re-train, rather than all at once.
7. Oversample “Non-Publishable”: Or weight the minority class more heavily, e.g., class_weight={0: some_factor, 1: 1}.
8. Data Augmentation
9. Actively Label More Data: Manually label a handful of the most uncertain papers, especially likely “Non-Publishable” ones.
10. Refine Topic Modeling: Increase num_topics to 10, 15, or 20 (a num_topics sweep sketch appears at the end of this answer).
11. Use Domain-Specific Terms or Custom Dictionaries
12. Incorporate Additional Signals
13. Try Large Language Model Embeddings: Replace the TF-IDF/LDA features with transformer embeddings, as described earlier.
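For the “Look at Predicted Probabilities” and “Inspect the Confusion Matrix” steps, here is a minimal sketch; it assumes the X_test / y_test split and self_training_clf from the main script’s evaluation step:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Uses the X_test / y_test split from the main script's evaluation step
X_test_dense = X_test.toarray()
y_pred = self_training_clf.predict(X_test_dense)

# Rows = true class, columns = predicted class (0=Non-Publishable, 1=Publishable)
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print(classification_report(y_test, y_pred,
                            target_names=['Non-Publishable', 'Publishable']))

# Per-paper probabilities make it obvious when the model never crosses the
# self-training threshold for the minority class.
for prob, true_label in zip(self_training_clf.predict_proba(X_test_dense)[:, 1], y_test):
    print(f"true={true_label}  P(Publishable)={prob:.2f}")
```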
Below is a minimal snippet showing how to: (a) weight the minority class via class_weight, (b) lower the self-training confidence threshold, and (c) try LabelPropagation as an alternative.
Feel free to integrate it into the earlier pipeline where indicated.
```python
from sklearn.semi_supervised import LabelPropagation, SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

base_clf = LogisticRegression(
    max_iter=1000,
    solver='lbfgs',
    # 2. Apply class_weight to emphasize the minority class
    class_weight={0: 2.0, 1: 1.0}  # double the weight for "Non-Publishable"
)

# 1. Lower the confidence threshold in SelfTraining
self_training_clf = SelfTrainingClassifier(
    base_clf,
    threshold=0.5  # lower the threshold from 0.8 to 0.5
)
self_training_clf.fit(X_combined, y_combined)

# Alternatively, you can try LabelPropagation
label_prop_clf = LabelPropagation()
label_prop_clf.fit(X_combined, y_combined)
```
In short, the key changes are to: (a) add a few more labeled “Non-Publishable” examples if at all possible, (b) set class_weight={0: 2.0, 1: 1.0} to emphasize the minority class, (c) lower threshold=0.5 (in SelfTrainingClassifier), (d) use LabelPropagation or LabelSpreading, and (e) explore richer feature representations or domain-specific cues. By iterating on these steps, especially adding a few more “Non-Publishable” labels or adjusting the classifier’s threshold and class weight, you should see the model begin to produce more reasonable predictions on the unlabeled set.
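Finally, for the “Refine Topic Modeling” step, one way to choose num_topics is a small sweep scored with gensim’s CoherenceModel. Coherence-based selection is a common heuristic rather than something the pipeline above mandates; the sketch assumes all_docs and preprocess_text from the main script are in scope.

```python
# Sweep a few candidate topic counts and score each LDA model with c_v coherence.
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel

tokenized_docs = [preprocess_text(doc) for doc in all_docs]
dictionary = corpora.Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=2, no_above=0.9)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

for k in (5, 10, 15, 20):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   random_state=42, passes=5)
    coherence = CoherenceModel(model=lda, texts=tokenized_docs,
                               dictionary=dictionary, coherence='c_v').get_coherence()
    print(f"num_topics={k}: c_v coherence={coherence:.3f}")
```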