Research Paper Classification Framework

Created: January 10, 2025

Model used: O1 by Chat01

Question

In the academic and research domains, the quality and publishability of research papers play a critical role in advancing knowledge and fostering innovation.

However, the process of determining whether a paper meets the standards for publication can be both time-consuming and subjective, often requiring expert review.

With the increasing volume of research outputs, there is a growing need for automated systems to assist in evaluating the quality and suitability of papers for publication. This challenge not only offers an opportunity to innovate but also holds the potential to streamline the publication process and enhance its objectivity.

The task involves developing a framework that can classify research papers as either "Publishable" or "Non-Publishable" based on the evaluation of their content.

The goal is to create a robust system capable of identifying critical issues, such as inappropriate methodologies, incoherent arguments, or unsubstantiated claims, that affect the suitability of a paper for publication.

For instance, a research paper that applies methodologies or techniques that are not well-suited to the problem being addressed, without adequate justification or adaptation to the context, would be considered unsuitable. Similarly, a paper that presents arguments that are unclear, disorganized, or lack logical coherence, or one that claims results that appear unusually high or unrealistic without sufficient evidence or proper validation, would also fall into the "Non-Publishable" category.

A dataset of 150 research papers is provided for classification, with 15 labeled papers available for reference to guide the development of the framework.

The framework should be designed to accurately classify papers into the appropriate category, ensuring that it can handle a wide range of research topics and maintain consistency across different types of content.

The proposed framework must be capable of systematically analyzing these and other aspects of research papers to ensure a reliable and objective evaluation process.

The solution should demonstrate high accuracy in detecting such issues, ensuring its applicability across a range of research domains and scalability for future use with larger datasets.

What are ALL the possible approaches, and which is the best approach?
Use topic modelling for analyzing the research papers. I have only 15 labelled papers (5 non-publishable, 10 publishable) and 135 unlabelled papers; what should I do, and which approach best avoids overfitting? Combine the necessary models into a hybrid model that addresses all of these problems. I have the unlabelled data in a folder called 'Papers' and the labelled data in a 'Reference' folder. Inside the 'Reference' folder there are two folders: 'Non-Publishable', which has 5 papers, and 'Publishable', which has 10 papers. Everything is in PDF format. Also give me the code to run everything, fully fledged and working.

Thinking

Unveiling a solution

I’m piecing together a solution to classify research papers by reading, processing, and analyzing the text data. This approach clarifies their publishability.

Crafting a pipeline

I’m detailing the steps for classifying research papers: loading data, converting to TF-IDF, applying topic modeling, training a classifier with labeled data, and considering a hybrid model for accuracy.

Stitching together

I’m mapping out the components for a comprehensive research paper classifier: PDF extraction, text cleaning, topic modeling, embeddings, classifier training, cross-validation, and iterative labeling for a robust hybrid model.

Weighing the choices

I'm curious about comparing traditional supervised learning, transfer learning, semi-supervised learning, unsupervised learning, and hybrid methods. The top pick likely merges pretrained models with a semi-supervised approach.

Piecing together the pipeline

OK, let me see: parsing PDFs, removing stopwords, creating vector documents, applying LDA, training classifiers, and classifying using reconciliation. This blend should ensure a comprehensive, effective approach.

Thought about research paper classification methods for 11 seconds

Answer

Below is a comprehensive overview of all possible approaches for classifying research papers as “Publishable” or “Non-Publishable,” followed by a recommended hybrid approach that uses topic modeling plus a semi-supervised classification strategy to avoid overfitting. Finally, you will find a fully fledged sample Python code that demonstrates how to implement such a pipeline, including PDF text extraction, topic modeling, and hybrid (semi-supervised) classification.


1. All Possible Approaches

When you have very limited labeled data (in your case, 15 labeled papers: 10 Publishable and 5 Non-Publishable) and a relatively large set of unlabeled data (135 papers), these are some approaches you can consider:

  1. Pure Supervised Learning

    • What it is: Use only the 15 labeled samples to train a classifier (e.g., Logistic Regression, SVM, Random Forest).
    • Pros: Straightforward to implement.
    • Cons: Very high risk of overfitting due to extremely limited labeled data.
  2. Unsupervised Topic Modeling

    • What it is: Use Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), or other topic modeling techniques to group papers into topics based on word distributions.
    • Pros: Can organize the collection into coherent topics without needing labels, provides interpretable representations (topic distributions).
    • Cons: Does not directly yield publishable vs. non-publishable labels. You still need some classification or rule-based method to assign final labels.
  3. Self-Supervised / Language Model Embeddings

    • What it is: Extract embeddings from large pre-trained language models (e.g., BERT, SciBERT) for each paper and then apply a simple classifier on top.
    • Pros: Leverages massive amounts of linguistic knowledge from pretraining, often better generalization even with fewer labeled samples.
    • Cons: Potentially higher computational overhead, especially during inference. Fine-tuning large language models with only 15 samples can still overfit.
  4. Semi-Supervised Learning (Label Propagation, Self-Training, etc.)

    • What it is: Start with the small labeled dataset, train an initial classifier, then iteratively assign labels to unlabeled data with high confidence and retrain.
    • Pros: Makes use of the large unlabeled set to improve the model, helps mitigate overfitting.
    • Cons: Quality depends on initial classifier performance and the confidence threshold for pseudo-labels.
  5. Active Learning

    • What it is: Iteratively choose the most “informative” unlabeled samples and ask a human expert to label them.
    • Pros: Can drastically reduce labeling effort while improving model performance.
    • Cons: Requires an active loop with a human expert to label additional data when requested.
  6. Hybrid Models (Topic Modeling + Supervised/Semi-Supervised Classifier)

    • What it is: Use topic modeling to reduce dimensionality and extract interpretable features, then feed these topic distributions into a supervised or semi-supervised classifier.
    • Pros:
      • Topic modeling helps capture thematic structure from the entire dataset (both labeled and unlabeled).
      • Classifier uses these topic features plus any additional text-based features.
      • Semi-supervised classification helps avoid overfitting to 15 labeled samples.
    • Cons: Some complexity in combining unsupervised topic modeling with classification, but often yields strong performance.

2. Recommended “Best” Approach

Given you have:

  • Very limited labeled data (5 Non-Publishable + 10 Publishable).
  • A larger unlabeled set (135 papers).
  • Interest in topic modeling (which can help capture domain structure).

A Hybrid approach using Topic Modeling + Semi-Supervised Classification is likely most effective in avoiding overfitting while leveraging the unlabeled data:

  1. Extract text from all PDFs (both labeled and unlabeled).
  2. Preprocess the text (tokenize, remove stopwords, etc.).
  3. Perform Topic Modeling (e.g., LDA via Gensim or scikit-learn) on all 150 papers to learn a topic distribution for each paper.
    • This step uses unsupervised learning, so it benefits from having all 150 documents.
  4. Use the topic distribution (and potentially other features like TF-IDF) as input to a semi-supervised classification approach. For instance:
    • Self-Training: Train an initial classifier on the 15 labeled examples. Then iteratively predict on unlabeled data and add the highest confidence predictions to the training set.
    • Label Propagation or Label Spreading from scikit-learn: These algorithms propagate labels from the small labeled set to the unlabeled set based on feature similarity.
  5. Combine the predictions from the semi-supervised approach with additional heuristics (e.g., check for extremely short papers, or very high/implausible claims without references) if needed.

This pipeline allows you to build a model that is less prone to overfitting (thanks to the extra unlabeled data in both the topic modeling step and the semi-supervised classification step).


3. Fully Fledged Sample Code

Below is an end-to-end Python script demonstrating how you could implement a Topic Modeling + Semi-Supervised (Self-Training) classification pipeline. It uses the following libraries:

  • PyPDF2 or pdfminer.six (for PDF text extraction)
  • nltk (for tokenization, stopword removal)
  • gensim (for LDA topic modeling)
  • scikit-learn (for TF-IDF, classification, self-training)
  • glob, os (for file handling)
  • re (for text cleaning)

Note: Make sure to install these libraries if you haven’t already (e.g., pip install pypdf2 pdfminer.six nltk gensim scikit-learn).
You may also need to download NLTK stopwords: nltk.download('stopwords'), nltk.download('punkt').

Folder Structure:

.
├── Reference
│   ├── Non-Publishable   (5 PDF files)
│   └── Publishable       (10 PDF files)
└── Papers                (135 PDF files, unlabeled)
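As a quick, optional sanity check before running the full script below, you can confirm that the folders contain the expected number of PDFs. This is a minimal sketch that assumes the 'Reference' and 'Papers' folders sit next to the script, exactly as in the tree above.

import glob
import os

# Count the PDFs in each folder to verify the layout described above.
non_pub = glob.glob(os.path.join('Reference', 'Non-Publishable', '*.pdf'))
pub = glob.glob(os.path.join('Reference', 'Publishable', '*.pdf'))
unlabeled = glob.glob(os.path.join('Papers', '*.pdf'))

print(f"Non-Publishable: {len(non_pub)} PDFs (expected 5)")
print(f"Publishable:     {len(pub)} PDFs (expected 10)")
print(f"Unlabeled:       {len(unlabeled)} PDFs (expected 135)")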
Sample code:

import os
import glob
import re
import nltk
import PyPDF2
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


# 1. PDF Text Extraction Function
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            page_text = page.extract_text()
            text += page_text if page_text else ""
    return text


# 2. Text Preprocessing Function
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove non-alphabetic chars
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words and len(w) > 2]
    return words


# 3. Load all data
def load_data(reference_folder, papers_folder):
    # Labeled data
    labeled_texts = []
    labels = []

    # Non-Publishable
    non_pub_files = glob.glob(os.path.join(reference_folder, 'Non-Publishable', '*.pdf'))
    for file_path in non_pub_files:
        raw_text = extract_text_from_pdf(file_path)
        labeled_texts.append(raw_text)
        labels.append(0)  # 0 for Non-Publishable

    # Publishable
    pub_files = glob.glob(os.path.join(reference_folder, 'Publishable', '*.pdf'))
    for file_path in pub_files:
        raw_text = extract_text_from_pdf(file_path)
        labeled_texts.append(raw_text)
        labels.append(1)  # 1 for Publishable

    # Unlabeled data
    unlabeled_texts = []
    unlabeled_files = glob.glob(os.path.join(papers_folder, '*.pdf'))
    unlabeled_filepaths = []
    for file_path in unlabeled_files:
        raw_text = extract_text_from_pdf(file_path)
        unlabeled_texts.append(raw_text)
        unlabeled_filepaths.append(file_path)

    return labeled_texts, labels, unlabeled_texts, unlabeled_filepaths


# 4. Topic Modeling (LDA) to get topic distributions for each document
def topic_modeling(all_docs, num_topics=5):
    """
    all_docs: list of strings (raw text of each document).
    num_topics: number of latent topics to learn.
    Returns:
      - lda_model (gensim model)
      - corpus (list of bag-of-words representations)
      - dictionary (gensim dictionary)
    """
    # Preprocess all documents
    tokenized_docs = [preprocess_text(doc) for doc in all_docs]

    # Create a dictionary representation of the documents.
    dictionary = corpora.Dictionary(tokenized_docs)
    dictionary.filter_extremes(no_below=2, no_above=0.9)  # can be tuned

    # Convert tokenized documents into a bag-of-words
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

    # Train LDA model
    lda_model = LdaModel(corpus=corpus,
                         id2word=dictionary,
                         num_topics=num_topics,
                         random_state=42,
                         passes=5)
    return lda_model, corpus, dictionary


def get_topic_distribution(lda_model, dictionary, text):
    """
    Given an LDA model, dictionary, and raw text of a document,
    returns the topic distribution (as a vector).
    """
    tokens = preprocess_text(text)
    bow = dictionary.doc2bow(tokens)
    topic_probs = lda_model.get_document_topics(bow, minimum_probability=0.0)
    # Convert to a fixed-size vector
    topic_vec = np.zeros(lda_model.num_topics)
    for topic_id, prob in topic_probs:
        topic_vec[topic_id] = prob
    return topic_vec


# 5. Feature Extraction (Topic distribution + TF-IDF) for classification
def build_features(labeled_texts, unlabeled_texts, lda_model, dictionary):
    """
    Build combined feature vectors (Topic distribution + TF-IDF).
    Returns:
      - X_labeled: feature array for labeled data
      - X_unlabeled: feature array for unlabeled data
    """
    # --- (A) TF-IDF Features ---
    all_texts = labeled_texts + unlabeled_texts
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
    tfidf_matrix = tfidf_vectorizer.fit_transform(all_texts)

    # Split back into labeled/unlabeled
    X_tfidf_labeled = tfidf_matrix[:len(labeled_texts)]
    X_tfidf_unlabeled = tfidf_matrix[len(labeled_texts):]

    # --- (B) Topic Distribution Features ---
    topic_features_labeled = []
    for txt in labeled_texts:
        topic_vec = get_topic_distribution(lda_model, dictionary, txt)
        topic_features_labeled.append(topic_vec)
    topic_features_labeled = np.array(topic_features_labeled)

    topic_features_unlabeled = []
    for txt in unlabeled_texts:
        topic_vec = get_topic_distribution(lda_model, dictionary, txt)
        topic_features_unlabeled.append(topic_vec)
    topic_features_unlabeled = np.array(topic_features_unlabeled)

    # --- (C) Combine them (simple concatenation) ---
    from scipy.sparse import csr_matrix, hstack
    # .tocsr() so the matrices support row indexing later (e.g., in train_test_split)
    X_labeled = hstack([X_tfidf_labeled, csr_matrix(topic_features_labeled)]).tocsr()
    X_unlabeled = hstack([X_tfidf_unlabeled, csr_matrix(topic_features_unlabeled)]).tocsr()

    return X_labeled, X_unlabeled


def main():
    nltk.download('punkt')
    nltk.download('stopwords')

    reference_folder = 'Reference'
    papers_folder = 'Papers'

    # --- 1. Load Data ---
    labeled_texts, labels, unlabeled_texts, unlabeled_filepaths = load_data(reference_folder, papers_folder)

    # --- 2. Combine all docs for topic modeling ---
    all_docs = labeled_texts + unlabeled_texts

    # --- 3. Topic Modeling (Unsupervised) on ALL docs ---
    lda_model, corpus, dictionary = topic_modeling(all_docs, num_topics=5)

    # --- 4. Build Features (TopicDist + TF-IDF) ---
    X_labeled, X_unlabeled = build_features(labeled_texts, unlabeled_texts, lda_model, dictionary)
    y_labeled = np.array(labels)

    # --- 5. Semi-Supervised Classification via Self-Training ---
    base_clf = LogisticRegression(max_iter=1000, solver='lbfgs')
    # threshold=0.8 means we only add pseudo-labels we are 80% confident about
    self_training_clf = SelfTrainingClassifier(base_clf, threshold=0.8)

    # Fit on labeled + unlabeled.
    # We need to pass a combined X and y, where unlabeled examples have y = -1
    y_unlabeled = np.full(X_unlabeled.shape[0], -1, dtype=int)
    X_combined = np.vstack([X_labeled.toarray(), X_unlabeled.toarray()])
    y_combined = np.concatenate([y_labeled, y_unlabeled])

    self_training_clf.fit(X_combined, y_combined)

    # Now we have a trained semi-supervised classifier.
    # You can test it using cross-validation on the labeled set or do a final evaluation.

    # Example: Evaluate only on your limited labeled data
    # (Better approach: You might do a smaller train/test split on your 15 labeled samples)
    X_train, X_test, y_train, y_test = train_test_split(
        X_labeled, y_labeled, test_size=0.4, random_state=42, stratify=y_labeled)

    base_clf_evaluation = LogisticRegression(max_iter=1000)
    base_clf_evaluation.fit(X_train, y_train)
    acc_base = base_clf_evaluation.score(X_test, y_test)

    # Evaluate semi-supervised model on the same test split
    acc_semi = self_training_clf.score(X_test.toarray(), y_test)

    print(f"Supervised-Only LogisticRegression accuracy (on 15-labeled subset): {acc_base:.3f}")
    print(f"Semi-Supervised SelfTrainingClassifier accuracy: {acc_semi:.3f}")

    # For demonstration, let's also see predictions on the unlabeled set
    unlabeled_preds = self_training_clf.predict(X_unlabeled.toarray())

    # Print out which unlabeled papers are predicted 'Non-Publishable' (0) or 'Publishable' (1)
    print("Predictions on unlabeled data:")
    for filepath, pred_label in zip(unlabeled_filepaths, unlabeled_preds):
        label_str = 'Publishable' if pred_label == 1 else 'Non-Publishable'
        print(f"{os.path.basename(filepath)} -> {label_str}")


if __name__ == '__main__':
    main()

How the Code Works

  1. Extract text from PDFs using PyPDF2.PdfReader.
  2. Preprocess using NLTK (lowercase, remove non-alpha chars, remove stopwords, tokenize).
  3. Topic Modeling
    • Preprocess each document to get tokens.
    • Build a Gensim Dictionary and convert each doc into a Bag-of-Words corpus.
    • Train an LDA model (num_topics=5 as an example) over all 150 documents (labeled + unlabeled).
    • For each document, retrieve the topic distribution vector of size = num_topics.
  4. Feature Construction
    • Build TF-IDF vectors over all texts (again, labeled + unlabeled), then split them back.
    • Concatenate the topic distribution vectors as additional features to the TF-IDF vector.
  5. Semi-Supervised Classification using SelfTrainingClassifier in scikit-learn:
    • Combine labeled and unlabeled data (with y=-1 for unlabeled).
    • The classifier iteratively labels unlabeled data with high-confidence predictions and retrains.
  6. Evaluation:
    • Shows basic accuracy on the small labeled set (train/test split).
    • Shows how to predict for unlabeled documents and prints out the predicted label for each.

4. Further Notes & Tweaks

  • Topic Model Tuning: Adjust the number of topics (num_topics=5, 10, 20, etc.); a coherence-based sweep is sketched after this list.
  • Stopword Removal: You could add a domain-specific list of words (e.g., “et al.,” “however,” “abstract,” etc.).
  • Dimensionality Reduction: You can consider an additional step (like PCA) on the TF-IDF features before concatenation.
  • Alternative Semi-Supervised Methods:
    • LabelPropagation or LabelSpreading from sklearn.semi_supervised.
    • More advanced self-supervised or consistency-regularization methods (e.g., FixMatch, MixMatch) if you can adapt them to text.
  • Transfer Learning: Could also use embeddings from a pre-trained BERT model (e.g., using sentence-transformers) and then do a similar semi-supervised approach.
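As a rough way to choose num_topics (mentioned in the first bullet above), you can compare topic coherence across a few candidate values. This is a sketch, not part of the original pipeline: it assumes you keep the tokenized_docs, corpus, and dictionary objects built inside topic_modeling() available, and it uses Gensim's CoherenceModel with the 'c_v' measure (higher is generally better).

from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel

def coherence_for(num_topics, corpus, dictionary, tokenized_docs):
    # Train a candidate LDA model and score its topic coherence.
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=42, passes=5)
    cm = CoherenceModel(model=lda, texts=tokenized_docs,
                        dictionary=dictionary, coherence='c_v')
    return cm.get_coherence()

# Example sweep (tokenized_docs/corpus/dictionary come from the topic_modeling step):
# for k in (5, 10, 15, 20):
#     print(k, coherence_for(k, corpus, dictionary, tokenized_docs))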

Conclusion

  • To avoid overfitting with only 15 labeled samples, leverage your unlabeled 135 papers via topic modeling and semi-supervised classification.
  • The hybrid approach of LDA + TF-IDF for feature representation, followed by self-training or label propagation, typically offers robust performance and mitigates overfitting compared to a purely supervised approach.
  • The sample fully fledged code above should get you started on building and running a complete pipeline.

Question

I am getting all 135 unlabelled PDFs classified as Publishable.

Thinking

Adjusting thresholds

I'm noting that the code might classify all PDFs as publishable due to a low threshold or insufficient data. Exploring data expansion, threshold tuning, label propagation, debugging techniques, and cross-validation for potential improvements.

Thought about model bias and overfitting for a few seconds

Answer

Below are several possible reasons why the model might be labeling all 135 unlabeled PDFs as “Publishable” and some practical tips on how to address or debug this issue. The fundamental problem is that you have limited labeled data (only 5 Non-Publishable vs. 10 Publishable) and a larger unlabeled set (135 PDFs) that the model must handle in a semi-supervised manner.


1. Why Are All Unlabeled PDFs Predicted As Publishable?

  1. Class Imbalance

    • You have 10 labeled “Publishable” papers versus 5 “Non-Publishable.” This 2:1 ratio might cause the classifier to learn a bias toward predicting “Publishable.”
    • When using semi-supervised learning, if the initial classifier is biased or not confident in the “Non-Publishable” concept, it could start labeling unlabeled samples as “Publishable” (the majority class). Once those labels are added in self-training, the bias can amplify.
  2. Threshold in Self-Training

    • If you are using SelfTrainingClassifier with a relatively high confidence threshold (e.g., 0.8), the model might only pseudo-label the unlabeled data as the class for which it’s most certain: “Publishable.” If it’s never “confident enough” to assign “Non-Publishable” labels, everything ends up being labeled “Publishable.” The diagnostic sketch after this list shows how to check what the fitted classifier actually did.
  3. Small Labeled Dataset

    • Fifteen total labeled samples is extremely limited, especially for a text classification task that can be quite nuanced. The model might simply not have enough examples of “Non-Publishable” to learn a robust decision boundary.
  4. Features Not Differentiating

    • Depending on how your text is structured, the features (topic distributions + TF-IDF) might not adequately separate “Non-Publishable” from “Publishable.” The LDA topics or TF-IDF features could be capturing general research-related language but not the subtle signals of poor methodology or incoherence.
  5. Overfitting to the “Publishable” Examples

    • With so few “Non-Publishable” training samples, the model may find it harder to learn robust features that distinguish them from the rest.
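To see which of these is actually happening, it helps to inspect what the fitted SelfTrainingClassifier did during training. The sketch below assumes the self_training_clf, X_unlabeled, and y_combined variables from the pipeline in the previous answer; transduction_ holds the labels (including pseudo-labels) used for the final fit.

import numpy as np

# Which pseudo-labels did self-training assign to the 135 unlabeled rows?
# (-1 means the sample never crossed the confidence threshold.)
pseudo = self_training_clf.transduction_[y_combined == -1]
print("Pseudo-label counts:", dict(zip(*np.unique(pseudo, return_counts=True))))

# How confident is the final model that unlabeled papers are Non-Publishable (class 0)?
proba = self_training_clf.predict_proba(X_unlabeled.toarray())
print("Mean P(Non-Publishable):", proba[:, 0].mean())
print("Max  P(Non-Publishable):", proba[:, 0].max())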

2. Practical Tips to Improve or Debug

Below are steps you can take to gain insight and improve classification results:

A. Check Your Labeled Performance & Confusion Matrix

  1. Train/Test Split of Labeled Data

    • Before even doing semi-supervised learning, train a simple classifier on the 15 labeled papers (e.g., logistic regression or SVM) and test with a small cross-validation approach (e.g., leave-one-out or 5-fold if you can).
    • Check if it can differentiate “Publishable” vs. “Non-Publishable” even in a purely supervised setting. If the accuracy or F1 score is already poor, you’ll need more labeled data or better feature engineering. A leave-one-out evaluation sketch follows this list.
  2. Look at Predicted Probabilities

    • For each labeled sample in the training set, ask the model for the predicted probability of being “Non-Publishable” vs. “Publishable.” Are the probabilities extremely skewed (e.g., always 0.99 for Publishable)? If so, you have a bias problem.
  3. Inspect the Confusion Matrix

    • If, on the training set, the classifier is rarely or never predicting “Non-Publishable,” it’s likely ignoring the minority class entirely.
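The sketch below implements the leave-one-out idea from point 1 and prints the confusion matrix from point 3. It assumes the X_labeled and y_labeled variables from the earlier pipeline; class_weight='balanced' is added here as a hedge against the 10-vs-5 imbalance.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import confusion_matrix, f1_score

# Leave-one-out predictions over the 15 labeled papers.
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
y_pred = cross_val_predict(clf, X_labeled, y_labeled, cv=LeaveOneOut())

print("Confusion matrix (rows = true, cols = predicted, class 0 = Non-Publishable):")
print(confusion_matrix(y_labeled, y_pred, labels=[0, 1]))
print("F1 for the Non-Publishable class:", f1_score(y_labeled, y_pred, pos_label=0))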

B. Adjust the Self-Training Strategy

  1. Lower the Confidence Threshold

    • By default, you might have threshold=0.8 in SelfTrainingClassifier. That means the model will only pseudo-label an unlabeled sample if it has at least 80% confidence in the predicted class.
    • Try reducing the threshold to 0.5 or 0.6 to allow more unlabeled samples to get a pseudo-label, which could introduce some “Non-Publishable” examples if the model sees them as a second-likely class.
  2. Use Label Spreading or Label Propagation

    • Instead of self-training, try LabelPropagation or LabelSpreading in sklearn.semi_supervised. These methods propagate labels in a graph-based manner, which sometimes helps with very few labeled data.
  3. Incremental Self-Training

    • Rather than label all unlabeled data at once, do it in small batches. After each batch is labeled, re-train the model. This can sometimes prevent the model from going “all in” on a single class too soon. A minimal batch-wise loop is sketched after this list.
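Below is a minimal batch-wise self-training loop (a hand-rolled alternative to SelfTrainingClassifier) illustrating point 3. It is only a sketch and assumes X_labeled, y_labeled, and X_unlabeled (sparse matrices) from the earlier pipeline; batch_size and n_rounds are arbitrary starting values.

import numpy as np
from scipy.sparse import vstack
from sklearn.linear_model import LogisticRegression

X_train, y_train = X_labeled, y_labeled.copy()
X_pool = X_unlabeled
batch_size = 10   # pseudo-labels accepted per round
n_rounds = 5

for _ in range(n_rounds):
    if X_pool.shape[0] == 0:
        break
    clf = LogisticRegression(max_iter=1000, class_weight='balanced')
    clf.fit(X_train, y_train)

    proba = clf.predict_proba(X_pool)
    confidence = proba.max(axis=1)
    # Accept only the most confident samples this round, not the whole pool at once.
    take = np.argsort(-confidence)[:batch_size]

    X_train = vstack([X_train, X_pool[take]], format='csr')
    y_train = np.concatenate([y_train, clf.classes_[proba[take].argmax(axis=1)]])
    keep = np.setdiff1d(np.arange(X_pool.shape[0]), take)
    X_pool = X_pool[keep]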

C. Class Imbalance Remedies

  1. Oversample “Non-Publishable”

    • Synthetic oversampling (e.g., SMOTE) can be tricky with text data, but you can try simpler methods like random oversampling, repeating the “Non-Publishable” documents in the training set multiple times (a random-oversampling sketch follows this list).
    • Alternatively, apply class weights in your logistic regression or SVM: class_weight={0:some_factor, 1:1}.
  2. Data Augmentation

    • If feasible, create augmented versions of your 5 “Non-Publishable” texts by paraphrasing, slightly altering text structure, or combining multiple paragraphs to artificially enlarge that class.
  3. Actively Label More Data

    • If possible, manually label a small portion of the 135 unlabeled PDFs. Focus especially on ones that might be “Non-Publishable.” Even labeling 5 or 10 more “Non-Publishable” could significantly help the model learn the minority class.
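For point 1, a simple random-oversampling sketch is shown below. It assumes X_labeled and y_labeled from the earlier pipeline, duplicates Non-Publishable rows until the two classes are roughly balanced, and should only ever be applied to training data, never to a held-out test split.

import numpy as np
from scipy.sparse import vstack
from sklearn.utils import resample

minority_idx = np.where(y_labeled == 0)[0]   # Non-Publishable
majority_idx = np.where(y_labeled == 1)[0]   # Publishable

# Randomly repeat minority rows (with replacement) to match the majority count.
extra_idx = resample(minority_idx, replace=True,
                     n_samples=len(majority_idx) - len(minority_idx),
                     random_state=42)

X_balanced = vstack([X_labeled, X_labeled[extra_idx]], format='csr')
y_balanced = np.concatenate([y_labeled, y_labeled[extra_idx]])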

D. Improve Feature Extraction

  1. Refine Topic Modeling

    • If your topic model is too coarse (e.g., 5 topics for potentially very diverse research papers), it might not capture the difference between correct vs. incorrect methodologies. Try increasing num_topics to 10, 15, or 20.
  2. Use Domain-Specific Terms or Custom Dictionaries

    • Extend your stopwords list or explicitly keep domain-relevant words (e.g., “methodology,” “validation,” “proposed approach,” “limitations,” etc.) to highlight differences in quality or coherence.
  3. Incorporate Additional Signals

    • For example, check if the paper references are well-formatted, length of method section, presence of disclaimers or IRB statements, or the ratio of references to pages. Sometimes metadata-based features can help.
  4. Try Large Language Model Embeddings

    • Instead of LDA + TF-IDF, try embeddings from a pretrained model like BERT, SciBERT, or Sentence Transformers. Even with few labeled samples, these richer representations may better capture writing quality or research coherence. A sentence-transformers sketch follows this list.
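For point 4, the sketch below swaps the LDA + TF-IDF features for sentence-transformer embeddings. This is an assumption-laden example: it requires pip install sentence-transformers, uses the common 'all-MiniLM-L6-v2' checkpoint (you could substitute a SciBERT-style model), and truncates each paper because full papers exceed the model's input length.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def embed(texts, max_chars=4000):
    # Embed only the first few thousand characters of each paper; alternatively,
    # embed section by section and average the vectors.
    return np.asarray(model.encode([t[:max_chars] for t in texts], show_progress_bar=False))

# labeled_texts / unlabeled_texts come from load_data() in the earlier pipeline.
X_emb_labeled = embed(labeled_texts)
X_emb_unlabeled = embed(unlabeled_texts)
# These dense matrices can be fed into the same SelfTrainingClassifier or LabelPropagation code.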

3. Example Tweaks in Code

Below is a minimal snippet showing how to:

  1. Adjust threshold
  2. Use LabelPropagation instead of SelfTrainingClassifier
  3. Apply class_weight in LogisticRegression

Feel free to integrate into the earlier pipeline where indicated.

from sklearn.semi_supervised import LabelPropagation
from sklearn.linear_model import LogisticRegression

# 1. Lower the confidence threshold in SelfTraining
from sklearn.semi_supervised import SelfTrainingClassifier

base_clf = LogisticRegression(
    max_iter=1000,
    solver='lbfgs',
    # 2. Apply class_weight to emphasize the minority class
    class_weight={0: 2.0, 1: 1.0}  # double the weight for "Non-Publishable"
)

self_training_clf = SelfTrainingClassifier(
    base_clf,
    threshold=0.5  # Lower threshold from 0.8 to 0.5
)
self_training_clf.fit(X_combined, y_combined)

# Alternatively, you can try LabelPropagation
label_prop_clf = LabelPropagation()
label_prop_clf.fit(X_combined, y_combined)

Explanation of Changes

  1. class_weight={0:2.0, 1:1.0}

    • We double the weight for the “Non-Publishable” (0) class, forcing the model to care more about correctly classifying that class.
  2. threshold=0.5 (in SelfTrainingClassifier)

    • The model now only needs a 50% confidence to assign a pseudo-label, which can help it assign “Non-Publishable” more readily (especially if it’s borderline).
  3. Use LabelPropagation

    • A completely different semi-supervised algorithm that might better handle small labeled sets. You lose the direct threshold parameter, but you can check whether it naturally yields a more balanced pseudo-label assignment for the unlabeled data (see the sketch after this list).
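To check whether LabelPropagation actually assigns both classes, you can inspect its fitted attributes, as in the sketch below (assumes label_prop_clf and y_combined from the snippet above).

import numpy as np

unlabeled_mask = (y_combined == -1)

# transduction_ holds the hard label assigned to every sample after propagation.
assigned = label_prop_clf.transduction_[unlabeled_mask]
print("Label counts on the 135 unlabeled papers:",
      dict(zip(*np.unique(assigned, return_counts=True))))

# label_distributions_ holds the soft class probabilities per sample (columns follow classes_).
soft = label_prop_clf.label_distributions_[unlabeled_mask]
print("Mean probability of Non-Publishable (class 0):", soft[:, 0].mean())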

4. Final Thoughts

  • All unlabeled data predicted as “Publishable” is a classic symptom of having too little minority-class data and an overly confident or biased model.
  • To fix this, (a) gather or label more “Non-Publishable” examples, (b) use methods that penalize misclassification of the minority class (class weighting / oversampling), (c) reduce the pseudo-label confidence threshold so your model actually assigns “Non-Publishable” to unlabeled samples that are borderline, (d) experiment with alternative semi-supervised methods like LabelPropagation or LabelSpreading, and (e) explore richer feature representations or domain-specific cues.

By iterating on these steps—especially adding a few more “Non-Publishable” labels or adjusting the classifier’s threshold and class weight—you should see the model begin to produce more reasonable predictions on the unlabeled set.
