Research Paper Classification Framework
Created: January 10, 2025
Unveiling a solution
I’m piecing together a solution to classify research papers by reading, processing, and analyzing the text data. This approach clarifies their publishability.
Crafting a pipeline
I’m detailing steps for classifying research papers: loading data, converting to TF-IDF, applying topic modeling, training a classifier with labeled data, and considering a hybrid model for accuracy.
Stitching together
I’m mapping out the components for a comprehensive research paper classifier: PDF extraction, text cleaning, topic modeling, embeddings, classifier training, cross-validation, and iterative labeling for a robust hybrid model.
Weighing the choices
I'm curious about comparing traditional supervised learning, transfer learning, semi-supervised learning, unsupervised learning, and hybrid methods. The top pick likely merges pretrained models with a semi-supervised approach.
Piecing together the pipeline
OK, let me see: parsing PDFs, removing stopwords, creating vector documents, applying LDA, training classifiers, and classifying using reconciliation. This blend should ensure a comprehensive, effective approach.
Thought about research paper classification methods for 11 seconds
Below is a comprehensive overview of the possible approaches for classifying research papers as “Publishable” or “Non-Publishable,” followed by a recommended hybrid approach that uses topic modeling plus a semi-supervised classification strategy to avoid overfitting. Finally, you will find a complete sample Python script that demonstrates how to implement such a pipeline, including PDF text extraction, topic modeling, and hybrid (semi-supervised) classification.
When you have very limited labeled data (in your case, 15 labeled papers: 10 Publishable and 5 Non-Publishable) and a relatively large set of unlabeled data (135 papers), these are some approaches you can consider:
Pure Supervised Learning
Unsupervised Topic Modeling
Self-Supervised / Language Model Embeddings
Semi-Supervised Learning (Label Propagation, Self-Training, etc.)
Active Learning
Hybrid Models (Topic Modeling + Supervised/Semi-Supervised Classifier)
Given that you have only 15 labeled papers (10 Publishable, 5 Non-Publishable) and 135 unlabeled papers, a Hybrid approach using Topic Modeling + Semi-Supervised Classification is likely most effective at avoiding overfitting while leveraging the unlabeled data:
1. Extract and clean the text from all 150 PDFs.
2. Fit LDA topic modeling on the full corpus (labeled + unlabeled).
3. Build features by combining each paper's topic distribution with TF-IDF vectors.
4. Train a self-training classifier on the 15 labeled papers, letting it pseudo-label the unlabeled papers.
This pipeline allows you to build a model that is less prone to overfitting (thanks to the extra unlabeled data in both the topic modeling step and the semi-supervised classification step).
Below is an end-to-end Python script demonstrating how you could implement a Topic Modeling + Semi-Supervised (Self-Training) classification pipeline. It uses the following libraries:
- PyPDF2 or pdfminer.six (for PDF text extraction)
- nltk (for tokenization, stopword removal)
- gensim (for LDA topic modeling)
- scikit-learn (for TF-IDF, classification, self-training)
- glob, os (for file handling)
- re (for text cleaning)

Note: Make sure to install these libraries if you haven’t already (e.g., `pip install pypdf2 pdfminer.six nltk gensim scikit-learn`). You may also need to download the NLTK resources: `nltk.download('stopwords')`, `nltk.download('punkt')`.
Folder Structure:
```
.
├── Reference
│   ├── Non-Publishable   (5 PDF files)
│   └── Publishable       (10 PDF files)
└── Papers                (135 PDF files, unlabeled)
```
<details>
<summary><strong>Click to view the sample code</strong></summary>
```python
import os
import glob
import re

import nltk
import PyPDF2
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


# 1. PDF Text Extraction Function
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text() if page.extract_text() else ""
    return text


# 2. Text Preprocessing Function
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove non-alphabetic chars
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords and very short tokens
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words and len(w) > 2]
    return words


# 3. Load all data
def load_data(reference_folder, papers_folder):
    # Labeled data
    labeled_texts = []
    labels = []

    # Non-Publishable
    non_pub_files = glob.glob(os.path.join(reference_folder, 'Non-Publishable', '*.pdf'))
    for file_path in non_pub_files:
        raw_text = extract_text_from_pdf(file_path)
        labeled_texts.append(raw_text)
        labels.append(0)  # 0 for Non-Publishable

    # Publishable
    pub_files = glob.glob(os.path.join(reference_folder, 'Publishable', '*.pdf'))
    for file_path in pub_files:
        raw_text = extract_text_from_pdf(file_path)
        labeled_texts.append(raw_text)
        labels.append(1)  # 1 for Publishable

    # Unlabeled data
    unlabeled_texts = []
    unlabeled_filepaths = []
    unlabeled_files = glob.glob(os.path.join(papers_folder, '*.pdf'))
    for file_path in unlabeled_files:
        raw_text = extract_text_from_pdf(file_path)
        unlabeled_texts.append(raw_text)
        unlabeled_filepaths.append(file_path)

    return labeled_texts, labels, unlabeled_texts, unlabeled_filepaths


# 4. Topic Modeling (LDA) to get topic distributions for each document
def topic_modeling(all_docs, num_topics=5):
    """
    all_docs: list of strings (raw text of each document).
    num_topics: number of latent topics to learn.
    Returns:
        - lda_model (gensim model)
        - corpus (list of bag-of-words representations)
        - dictionary (gensim dictionary)
    """
    # Preprocess all documents
    tokenized_docs = [preprocess_text(doc) for doc in all_docs]

    # Create a dictionary representation of the documents
    dictionary = corpora.Dictionary(tokenized_docs)
    dictionary.filter_extremes(no_below=2, no_above=0.9)  # can be tuned

    # Convert tokenized documents into a bag-of-words corpus
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

    # Train LDA model
    lda_model = LdaModel(corpus=corpus,
                         id2word=dictionary,
                         num_topics=num_topics,
                         random_state=42,
                         passes=5)
    return lda_model, corpus, dictionary


def get_topic_distribution(lda_model, dictionary, text):
    """
    Given an LDA model, dictionary, and raw text of a document,
    returns the topic distribution (as a vector).
    """
    tokens = preprocess_text(text)
    bow = dictionary.doc2bow(tokens)
    topic_probs = lda_model.get_document_topics(bow, minimum_probability=0.0)
    # Convert to a fixed-size vector
    topic_vec = np.zeros(lda_model.num_topics)
    for topic_id, prob in topic_probs:
        topic_vec[topic_id] = prob
    return topic_vec


# 5. Feature Extraction (Topic distribution + TF-IDF) for classification
def build_features(labeled_texts, unlabeled_texts, lda_model, dictionary):
    """
    Build combined feature vectors (Topic distribution + TF-IDF).
    Returns:
        - X_labeled: feature matrix for labeled data
        - X_unlabeled: feature matrix for unlabeled data
    """
    from scipy.sparse import csr_matrix, hstack

    # --- (A) TF-IDF Features ---
    all_texts = labeled_texts + unlabeled_texts
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
    tfidf_matrix = tfidf_vectorizer.fit_transform(all_texts)

    # Split back into labeled/unlabeled
    X_tfidf_labeled = tfidf_matrix[:len(labeled_texts)]
    X_tfidf_unlabeled = tfidf_matrix[len(labeled_texts):]

    # --- (B) Topic Distribution Features ---
    topic_features_labeled = np.array(
        [get_topic_distribution(lda_model, dictionary, txt) for txt in labeled_texts])
    topic_features_unlabeled = np.array(
        [get_topic_distribution(lda_model, dictionary, txt) for txt in unlabeled_texts])

    # --- (C) Combine them (simple concatenation) ---
    # format='csr' so the result supports row slicing later (e.g., in train_test_split)
    X_labeled = hstack([X_tfidf_labeled, csr_matrix(topic_features_labeled)], format='csr')
    X_unlabeled = hstack([X_tfidf_unlabeled, csr_matrix(topic_features_unlabeled)], format='csr')

    return X_labeled, X_unlabeled


def main():
    nltk.download('punkt')
    nltk.download('stopwords')

    reference_folder = 'Reference'
    papers_folder = 'Papers'

    # --- 1. Load Data ---
    labeled_texts, labels, unlabeled_texts, unlabeled_filepaths = load_data(
        reference_folder, papers_folder)

    # --- 2. Combine all docs for topic modeling ---
    all_docs = labeled_texts + unlabeled_texts

    # --- 3. Topic Modeling (Unsupervised) on ALL docs ---
    lda_model, corpus, dictionary = topic_modeling(all_docs, num_topics=5)

    # --- 4. Build Features (TopicDist + TF-IDF) ---
    X_labeled, X_unlabeled = build_features(labeled_texts, unlabeled_texts, lda_model, dictionary)
    y_labeled = np.array(labels)

    # --- 5. Semi-Supervised Classification via Self-Training ---
    base_clf = LogisticRegression(max_iter=1000, solver='lbfgs')
    # threshold=0.8 means we only add pseudo-labels we are 80% confident about
    self_training_clf = SelfTrainingClassifier(base_clf, threshold=0.8)

    # Fit on labeled + unlabeled.
    # We need to pass a combined X and y, where unlabeled examples have y = -1.
    y_unlabeled = np.full(X_unlabeled.shape[0], -1, dtype=int)
    X_combined = np.vstack([X_labeled.toarray(), X_unlabeled.toarray()])
    y_combined = np.concatenate([y_labeled, y_unlabeled])

    self_training_clf.fit(X_combined, y_combined)

    # Now we have a trained semi-supervised classifier.
    # You can test it using cross-validation on the labeled set or do a final evaluation.

    # Example: Evaluate only on your limited labeled data
    # (Better approach: do a smaller train/test split on your 15 labeled samples)
    X_train, X_test, y_train, y_test = train_test_split(
        X_labeled, y_labeled, test_size=0.4, random_state=42, stratify=y_labeled)

    base_clf_evaluation = LogisticRegression(max_iter=1000)
    base_clf_evaluation.fit(X_train, y_train)
    acc_base = base_clf_evaluation.score(X_test, y_test)

    # Evaluate the semi-supervised model on the same test split
    acc_semi = self_training_clf.score(X_test.toarray(), y_test)

    print(f"Supervised-Only LogisticRegression accuracy (on 15-labeled subset): {acc_base:.3f}")
    print(f"Semi-Supervised SelfTrainingClassifier accuracy: {acc_semi:.3f}")

    # For demonstration, let's also see predictions on the unlabeled set
    unlabeled_preds = self_training_clf.predict(X_unlabeled.toarray())

    # Print which unlabeled papers are predicted 'Non-Publishable' (0) or 'Publishable' (1)
    print("Predictions on unlabeled data:")
    for filepath, pred_label in zip(unlabeled_filepaths, unlabeled_preds):
        label_str = 'Publishable' if pred_label == 1 else 'Non-Publishable'
        print(f"{os.path.basename(filepath)} -> {label_str}")


if __name__ == '__main__':
    main()
```

</details>
A few notes on how the script works and how you can extend it:

- PDF text extraction uses PyPDF2.PdfReader; if extraction quality is poor, pdfminer.six is a drop-in alternative.
- Topic modeling trains an LDA model (num_topics=5 as an example) over all 150 documents (labeled + unlabeled); num_topics is a hyperparameter worth tuning.
- Semi-supervised classification uses SelfTrainingClassifier in scikit-learn: labeled and unlabeled feature rows are stacked together, with y=-1 marking the unlabeled examples.
- To iterate further, try different topic counts (num_topics=5, 10, 20, etc.), try LabelPropagation or LabelSpreading from sklearn.semi_supervised, or switch to language-model embeddings (e.g., sentence-transformers) and then do a similar semi-supervised approach, as sketched below.
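For the last point, here is a minimal, hedged sketch of how the feature-building step could be swapped for transformer embeddings. It assumes the sentence-transformers package is installed; the model name all-MiniLM-L6-v2 is only an illustrative choice, and labeled_texts / unlabeled_texts come from load_data() in the main script.

```python
# Sketch only: replace the TF-IDF + LDA features with transformer embeddings.
# Assumes `pip install sentence-transformers`; the model name below is a common
# default, not something prescribed by the pipeline above.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# labeled_texts / unlabeled_texts come from load_data() in the main script
X_labeled_emb = embedder.encode(labeled_texts)      # shape: (n_labeled, dim)
X_unlabeled_emb = embedder.encode(unlabeled_texts)  # shape: (n_unlabeled, dim)

# These dense arrays can replace X_labeled / X_unlabeled in the
# SelfTrainingClassifier step without further changes.
X_combined_emb = np.vstack([X_labeled_emb, X_unlabeled_emb])
```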
Adjusting thresholds
I'm noting that the code might classify all PDFs as publishable due to a low threshold or insufficient data. Exploring data expansion, threshold tuning, label propagation, debugging techniques, and cross-validation for potential improvements.
Thought about model bias and overfitting for a few seconds
Below are several possible reasons why the model might be labeling all 135 unlabeled PDFs as “Publishable” and some practical tips on how to address or debug this issue. The fundamental problem is that you have limited labeled data (only 5 Non-Publishable vs. 10 Publishable) and a larger unlabeled set (135 PDFs) that the model must handle in a semi-supervised manner.
1. Class Imbalance: With 10 “Publishable” versus only 5 “Non-Publishable” examples, the classifier starts out biased toward the majority class.
2. Threshold in Self-Training: If you use SelfTrainingClassifier with a relatively high confidence threshold (e.g., 0.8), the model might only pseudo-label the unlabeled data as the class for which it’s most certain: “Publishable.” If it’s never “confident enough” to assign “Non-Publishable” labels, everything ends up being labeled “Publishable.”
3. Small Labeled Dataset: Fifteen labeled papers give the model very little signal for a reliable decision boundary.
4. Features Not Differentiating: The TF-IDF and topic features may not capture what actually separates publishable from non-publishable papers.
5. Overfitting to the “Publishable” Examples: The model may memorize surface patterns of the 10 publishable papers and treat anything similar-looking as publishable.
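Before changing anything, it helps to confirm which of these factors is actually at play by looking at the raw predicted probabilities, not just the hard labels. A small diagnostic sketch, assuming the self_training_clf and X_unlabeled objects from the pipeline above:

```python
import numpy as np

# Hard predictions: how many papers land in each class?
preds = self_training_clf.predict(X_unlabeled.toarray())
print("Predicted class counts (0=Non-Publishable, 1=Publishable):",
      np.bincount(preds, minlength=2))

# Soft predictions: if every P(Publishable) sits around 0.6-0.8, the model is
# weakly biased toward the majority class rather than genuinely confident.
probs = self_training_clf.predict_proba(X_unlabeled.toarray())[:, 1]
print("P(Publishable) - min/median/max:",
      probs.min(), np.median(probs), probs.max())
```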
Below are steps you can take to gain insight and improve classification results:
1. Train/Test Split of Labeled Data: Hold out part of the 15 labeled papers so you can check whether the model ever predicts “Non-Publishable” correctly.
2. Look at Predicted Probabilities: Check whether P(Publishable) is genuinely high or just hovering above 0.5 (see the sketch after this list).
3. Inspect the Confusion Matrix: On the labeled test split, see which class is being missed (also shown in the sketch below).
4. Lower the Confidence Threshold: The pipeline uses threshold=0.8 in SelfTrainingClassifier, which means the model will only pseudo-label an unlabeled sample if it has at least 80% confidence in the predicted class; try a lower value.
5. Use Label Spreading or Label Propagation: LabelPropagation or LabelSpreading in sklearn.semi_supervised propagate labels in a graph-based manner, which sometimes helps with very few labeled data.
6. Incremental Self-Training: Add pseudo-labels in small, high-confidence batches and re-train, rather than all at once.
7. Oversample “Non-Publishable”: Or weight the minority class more heavily, e.g., class_weight={0: some_factor, 1: 1}.
8. Data Augmentation
9. Actively Label More Data: Manually label a handful of the most uncertain papers, especially likely “Non-Publishable” ones.
10. Refine Topic Modeling: Increase num_topics to 10, 15, or 20 (a num_topics sweep sketch appears at the end of this answer).
11. Use Domain-Specific Terms or Custom Dictionaries
12. Incorporate Additional Signals
13. Try Large Language Model Embeddings: Replace the TF-IDF/LDA features with transformer embeddings, as described earlier.
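For the “Look at Predicted Probabilities” and “Inspect the Confusion Matrix” steps, here is a minimal sketch; it assumes the X_test / y_test split and self_training_clf from the main script’s evaluation step:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Uses the X_test / y_test split from the main script's evaluation step
X_test_dense = X_test.toarray()
y_pred = self_training_clf.predict(X_test_dense)

# Rows = true class, columns = predicted class (0=Non-Publishable, 1=Publishable)
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print(classification_report(y_test, y_pred,
                            target_names=['Non-Publishable', 'Publishable']))

# Per-paper probabilities make it obvious when the model never crosses the
# self-training threshold for the minority class.
for prob, true_label in zip(self_training_clf.predict_proba(X_test_dense)[:, 1], y_test):
    print(f"true={true_label}  P(Publishable)={prob:.2f}")
```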
Below is a minimal snippet showing how to: (a) weight the minority class via class_weight, (b) lower the self-training confidence threshold, and (c) try LabelPropagation as an alternative.
Feel free to integrate it into the earlier pipeline where indicated.
```python
from sklearn.semi_supervised import LabelPropagation, SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

base_clf = LogisticRegression(
    max_iter=1000,
    solver='lbfgs',
    # 2. Apply class_weight to emphasize the minority class
    class_weight={0: 2.0, 1: 1.0}  # double the weight for "Non-Publishable"
)

# 1. Lower the confidence threshold in SelfTraining
self_training_clf = SelfTrainingClassifier(
    base_clf,
    threshold=0.5  # lower the threshold from 0.8 to 0.5
)
self_training_clf.fit(X_combined, y_combined)

# Alternatively, you can try LabelPropagation
label_prop_clf = LabelPropagation()
label_prop_clf.fit(X_combined, y_combined)
```
In short, the key changes are to: (a) add a few more labeled “Non-Publishable” examples if at all possible, (b) set class_weight={0: 2.0, 1: 1.0} to emphasize the minority class, (c) lower threshold=0.5 (in SelfTrainingClassifier), (d) use LabelPropagation or LabelSpreading, and (e) explore richer feature representations or domain-specific cues. By iterating on these steps, especially adding a few more “Non-Publishable” labels or adjusting the classifier’s threshold and class weight, you should see the model begin to produce more reasonable predictions on the unlabeled set.
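Finally, for the “Refine Topic Modeling” step, one way to choose num_topics is a small sweep scored with gensim’s CoherenceModel. Coherence-based selection is a common heuristic rather than something the pipeline above mandates; the sketch assumes all_docs and preprocess_text from the main script are in scope.

```python
# Sweep a few candidate topic counts and score each LDA model with c_v coherence.
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel

tokenized_docs = [preprocess_text(doc) for doc in all_docs]
dictionary = corpora.Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=2, no_above=0.9)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

for k in (5, 10, 15, 20):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   random_state=42, passes=5)
    coherence = CoherenceModel(model=lda, texts=tokenized_docs,
                               dictionary=dictionary, coherence='c_v').get_coherence()
    print(f"num_topics={k}: c_v coherence={coherence:.3f}")
```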