Research Paper Classification Framework
Created: January 10, 2025
In the academic and research domains, the quality and publishability of research papers play a critical role in advancing knowledge and fostering innovation.
However, the process of determining whether a paper meets the standards for publication can be both time-consuming and subjective, often requiring expert review.
With the increasing volume of research outputs, there is a growing need for automated systems to assist in evaluating the quality and suitability of papers for publication. This challenge not only offers an opportunity to innovate but also holds the potential to streamline the publication process and enhance its objectivity.
The task involves developing a framework that can classify research papers as either "Publishable" or "Non-Publishable" based on the evaluation of their content.
The goal is to create a robust system capable of identifying critical issues, such as inappropriate methodologies, incoherent arguments, or unsubstantiated claims, that affect the suitability of a paper for publication.
For instance, a research paper that applies methodologies or techniques that are not well-suited to the problem being addressed, without adequate justification or adaptation to the context, would be considered unsuitable. Similarly, a paper that presents arguments that are unclear, disorganized, or lack logical coherence, or one that claims results that appear unusually high or unrealistic without sufficient evidence or proper validation, would also fall into the "Non-Publishable" category.
A dataset of 150 research papers is provided for classification, with 15 labeled papers available for reference to guide the development of the framework.
The framework should be designed to accurately classify papers into the appropriate category, ensuring that it can handle a wide range of research topics and maintain consistency across different types of content.
The proposed framework must be capable of systematically analyzing these and other aspects of research papers to ensure a reliable and objective evaluation process.
The solution should demonstrate high accuracy in detecting such issues, ensuring its applicability across a range of research domains and scalability for future use with larger datasets.
What are ALL the possible approaches, and which is the best approach?
Use topic modelling for analyzing the research papers. I have only 15 labelled papers (5 Non-Publishable, 10 Publishable) and 135 unlabelled papers. What should I do, and which is the best approach to avoid overfitting? Combine the necessary models into a hybrid model that addresses all of these problems, since that is the best approach. I have the unlabelled data in a folder called 'Papers' and the labelled data in a 'Reference' folder. Inside the 'Reference' folder there are 2 folders: 'Non-Publishable', which has 5 papers, and 'Publishable', which has 10 papers. Everything is in PDF format. Also give me fully working code to run everything.
Below is a comprehensive overview of all possible approaches for classifying research papers as “Publishable” or “Non-Publishable,” followed by a recommended hybrid approach that uses topic modeling plus a semi-supervised classification strategy to avoid overfitting. Finally, you will find a fully fledged sample Python script that demonstrates how to implement such a pipeline, including PDF text extraction, topic modeling, and hybrid (semi-supervised) classification.
When you have very limited labeled data (in your case, 15 labeled papers: 10 Publishable and 5 Non-Publishable) and a relatively large set of unlabeled data (135 papers), these are some approaches you can consider:
1. Pure Supervised Learning
2. Unsupervised Topic Modeling
3. Self-Supervised / Language Model Embeddings
4. Semi-Supervised Learning (Label Propagation, Self-Training, etc.)
5. Active Learning
6. Hybrid Models (Topic Modeling + Supervised/Semi-Supervised Classifier)
Given that you have only 15 labeled papers (10 Publishable, 5 Non-Publishable) and 135 unlabeled papers, a Hybrid approach using Topic Modeling + Semi-Supervised Classification is likely most effective in avoiding overfitting while leveraging the unlabeled data:

- Extract and preprocess the text of all 150 PDFs.
- Train an unsupervised LDA topic model on all documents (labeled + unlabeled).
- Build features by combining TF-IDF vectors with each document's topic distribution.
- Train a semi-supervised (self-training) classifier on the labeled data plus the unlabeled data marked with y = -1.
This pipeline allows you to build a model that is less prone to overfitting (thanks to the extra unlabeled data in both the topic modeling step and the semi-supervised classification step).
Below is an end-to-end Python script demonstrating how you could implement a Topic Modeling + Semi-Supervised (Self-Training) classification pipeline. It uses the following libraries:
- `PyPDF2` or `pdfminer.six` (for PDF text extraction)
- `nltk` (for tokenization, stopword removal)
- `gensim` (for LDA topic modeling)
- `scikit-learn` (for TF-IDF, classification, self-training)
- `glob`, `os` (for file handling)
- `re` (for text cleaning)

Note: Make sure to install these libraries if you haven't already (e.g., `pip install pypdf2 pdfminer.six nltk gensim scikit-learn`). You may also need to download the NLTK resources: `nltk.download('stopwords')`, `nltk.download('punkt')`.
Folder Structure:
```text
.
├── Reference
│   ├── Non-Publishable   (5 PDF files)
│   └── Publishable       (10 PDF files)
└── Papers                (135 PDF files, unlabeled)
```
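Before running the pipeline, it can help to confirm the folders actually match this layout. A minimal sanity-check sketch (assuming the script is run from the directory that contains `Reference` and `Papers`):

```python
import glob
import os

# Count the PDFs in each expected folder; the paths assume the layout shown above.
for folder in ['Reference/Non-Publishable', 'Reference/Publishable', 'Papers']:
    n_pdfs = len(glob.glob(os.path.join(folder, '*.pdf')))
    print(f"{folder}: {n_pdfs} PDF(s)")
```

You should see 5, 10, and 135 files respectively; if any count is zero, check the folder names before running the classification script below.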
```python
import os
import glob
import re

import nltk
import PyPDF2
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


# 1. PDF Text Extraction Function
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text() if page.extract_text() else ""
    return text


# 2. Text Preprocessing Function
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove non-alphabetic chars
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words and len(w) > 2]
    return words


# 3. Load all data
def load_data(reference_folder, papers_folder):
    # Labeled data
    labeled_texts = []
    labels = []

    # Non-Publishable
    non_pub_files = glob.glob(os.path.join(reference_folder, 'Non-Publishable', '*.pdf'))
    for file_path in non_pub_files:
        raw_text = extract_text_from_pdf(file_path)
        labeled_texts.append(raw_text)
        labels.append(0)  # 0 for Non-Publishable

    # Publishable
    pub_files = glob.glob(os.path.join(reference_folder, 'Publishable', '*.pdf'))
    for file_path in pub_files:
        raw_text = extract_text_from_pdf(file_path)
        labeled_texts.append(raw_text)
        labels.append(1)  # 1 for Publishable

    # Unlabeled data
    unlabeled_texts = []
    unlabeled_files = glob.glob(os.path.join(papers_folder, '*.pdf'))
    unlabeled_filepaths = []
    for file_path in unlabeled_files:
        raw_text = extract_text_from_pdf(file_path)
        unlabeled_texts.append(raw_text)
        unlabeled_filepaths.append(file_path)

    return labeled_texts, labels, unlabeled_texts, unlabeled_filepaths


# 4. Topic Modeling (LDA) to get topic distributions for each document
def topic_modeling(all_docs, num_topics=5):
    """
    all_docs: list of strings (raw text of each document).
    num_topics: number of latent topics to learn.
    Returns:
      - lda_model (gensim model)
      - corpus (list of bag-of-words representations)
      - dictionary (gensim dictionary)
    """
    # Preprocess all documents
    tokenized_docs = [preprocess_text(doc) for doc in all_docs]

    # Create a dictionary representation of the documents.
    dictionary = corpora.Dictionary(tokenized_docs)
    dictionary.filter_extremes(no_below=2, no_above=0.9)  # can be tuned

    # Convert tokenized documents into a bag-of-words
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

    # Train LDA model
    lda_model = LdaModel(corpus=corpus,
                         id2word=dictionary,
                         num_topics=num_topics,
                         random_state=42,
                         passes=5)
    return lda_model, corpus, dictionary


def get_topic_distribution(lda_model, dictionary, text):
    """
    Given an LDA model, dictionary, and raw text of a document,
    returns the topic distribution (as a vector).
    """
    tokens = preprocess_text(text)
    bow = dictionary.doc2bow(tokens)
    topic_probs = lda_model.get_document_topics(bow, minimum_probability=0.0)
    # Convert to a fixed-size vector
    topic_vec = np.zeros(lda_model.num_topics)
    for topic_id, prob in topic_probs:
        topic_vec[topic_id] = prob
    return topic_vec


# 5. Feature Extraction (Topic distribution + TF-IDF) for classification
def build_features(labeled_texts, unlabeled_texts, lda_model, dictionary):
    """
    Build combined feature vectors (Topic distribution + TF-IDF).
    Returns:
      - X_labeled: feature array for labeled data
      - X_unlabeled: feature array for unlabeled data
    """
    # --- (A) TF-IDF Features ---
    all_texts = labeled_texts + unlabeled_texts
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
    tfidf_matrix = tfidf_vectorizer.fit_transform(all_texts)

    # Split back into labeled/unlabeled
    X_tfidf_labeled = tfidf_matrix[:len(labeled_texts)]
    X_tfidf_unlabeled = tfidf_matrix[len(labeled_texts):]

    # --- (B) Topic Distribution Features ---
    topic_features_labeled = []
    for txt in labeled_texts:
        topic_vec = get_topic_distribution(lda_model, dictionary, txt)
        topic_features_labeled.append(topic_vec)
    topic_features_labeled = np.array(topic_features_labeled)

    topic_features_unlabeled = []
    for txt in unlabeled_texts:
        topic_vec = get_topic_distribution(lda_model, dictionary, txt)
        topic_features_unlabeled.append(topic_vec)
    topic_features_unlabeled = np.array(topic_features_unlabeled)

    # --- (C) Combine them (simple concatenation) ---
    from scipy.sparse import csr_matrix, hstack
    # Convert to CSR so the matrices support row indexing (needed by train_test_split below).
    X_labeled = hstack([X_tfidf_labeled, csr_matrix(topic_features_labeled)]).tocsr()
    X_unlabeled = hstack([X_tfidf_unlabeled, csr_matrix(topic_features_unlabeled)]).tocsr()
    return X_labeled, X_unlabeled


def main():
    nltk.download('punkt')
    nltk.download('stopwords')

    reference_folder = 'Reference'
    papers_folder = 'Papers'

    # --- 1. Load Data ---
    labeled_texts, labels, unlabeled_texts, unlabeled_filepaths = load_data(reference_folder, papers_folder)

    # --- 2. Combine all docs for topic modeling ---
    all_docs = labeled_texts + unlabeled_texts

    # --- 3. Topic Modeling (Unsupervised) on ALL docs ---
    lda_model, corpus, dictionary = topic_modeling(all_docs, num_topics=5)

    # --- 4. Build Features (TopicDist + TF-IDF) ---
    X_labeled, X_unlabeled = build_features(labeled_texts, unlabeled_texts, lda_model, dictionary)
    y_labeled = np.array(labels)

    # --- 5. Semi-Supervised Classification via Self-Training ---
    base_clf = LogisticRegression(max_iter=1000, solver='lbfgs')
    # threshold=0.8 means we only add pseudo-labels we are 80% confident about
    self_training_clf = SelfTrainingClassifier(base_clf, threshold=0.8)

    # Fit on labeled + unlabeled.
    # We need to pass a combined X and y, where unlabeled examples have y = -1.
    y_unlabeled = np.full(X_unlabeled.shape[0], -1, dtype=int)
    X_combined = np.vstack([X_labeled.toarray(), X_unlabeled.toarray()])
    y_combined = np.concatenate([y_labeled, y_unlabeled])
    self_training_clf.fit(X_combined, y_combined)

    # Now we have a trained semi-supervised classifier.
    # You can test it using cross-validation on the labeled set or do a final evaluation.

    # Example: Evaluate only on your limited labeled data
    # (Better approach: You might do a smaller train/test split on your 15 labeled samples)
    X_train, X_test, y_train, y_test = train_test_split(
        X_labeled, y_labeled, test_size=0.4, random_state=42, stratify=y_labeled)
    base_clf_evaluation = LogisticRegression(max_iter=1000)
    base_clf_evaluation.fit(X_train, y_train)
    acc_base = base_clf_evaluation.score(X_test, y_test)

    # Evaluate semi-supervised model on the same test split
    acc_semi = self_training_clf.score(X_test.toarray(), y_test)

    print(f"Supervised-Only LogisticRegression accuracy (on 15-labeled subset): {acc_base:.3f}")
    print(f"Semi-Supervised SelfTrainingClassifier accuracy: {acc_semi:.3f}")

    # For demonstration, let's also see predictions on the unlabeled set
    unlabeled_preds = self_training_clf.predict(X_unlabeled.toarray())

    # Print out which unlabeled papers are predicted 'Non-Publishable' (0) or 'Publishable' (1)
    print("Predictions on unlabeled data:")
    for filepath, pred_label in zip(unlabeled_filepaths, unlabeled_preds):
        label_str = 'Publishable' if pred_label == 1 else 'Non-Publishable'
        print(f"{os.path.basename(filepath)} -> {label_str}")


if __name__ == '__main__':
    main()
```
Key implementation notes:
- PDF text is extracted with `PyPDF2.PdfReader`.
- The LDA topic model is trained (with `num_topics=5` as an example) over all 150 documents (labeled + unlabeled); `num_topics` can be tuned.
- The semi-supervised step uses `SelfTrainingClassifier` in scikit-learn: labeled and unlabeled samples are passed together, with `y=-1` for unlabeled.

Possible improvements:
- Experiment with `num_topics=5, 10, 20`, etc.
- Try `LabelPropagation` or `LabelSpreading` from `sklearn.semi_supervised`.
- Use transformer embeddings (e.g., `sentence-transformers`) and then do a similar semi-supervised approach (a minimal sketch follows below).
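The embedding route is not part of the script above, but here is a minimal sketch of how it might look, assuming `sentence-transformers` is installed (`pip install sentence-transformers`), reusing `labeled_texts`, `labels`, and `unlabeled_texts` from `load_data()`, and picking the general-purpose `all-MiniLM-L6-v2` model as one possible choice:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Encode each paper's raw text into a dense embedding vector.
# Note: the model truncates long inputs, so this effectively embeds the opening of each paper.
model = SentenceTransformer('all-MiniLM-L6-v2')
X_lab_emb = model.encode(labeled_texts)        # shape: (15, 384)
X_unlab_emb = model.encode(unlabeled_texts)    # shape: (135, 384)

# Same self-training setup as before, now on embeddings instead of TF-IDF + topic features.
X_all = np.vstack([X_lab_emb, X_unlab_emb])
y_all = np.concatenate([np.array(labels), np.full(len(unlabeled_texts), -1)])

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.8)
clf.fit(X_all, y_all)
```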
I am getting all 135 unlabelled PDFs as Publishable.
Below are several possible reasons why the model might be labeling all 135 unlabeled PDFs as “Publishable” and some practical tips on how to address or debug this issue. The fundamental problem is that you have limited labeled data (only 5 Non-Publishable vs. 10 Publishable) and a larger unlabeled set (135 PDFs) that the model must handle in a semi-supervised manner.
1. Class Imbalance: the labeled set contains 10 Publishable papers but only 5 Non-Publishable ones, so the classifier is pulled toward the majority class (a quick class-weight check is sketched after this list).
2. Threshold in Self-Training: with `SelfTrainingClassifier` and a relatively high confidence threshold (e.g., 0.8), the model might only pseudo-label the unlabeled data as the class for which it is most certain: “Publishable.” If it is never “confident enough” to assign “Non-Publishable” labels, everything ends up being labeled “Publishable.”
3. Small Labeled Dataset: only 15 labeled examples give the classifier very little signal to learn from.
4. Features Not Differentiating between the two classes.
5. Overfitting to the “Publishable” Examples.
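To quantify the imbalance and get a starting point for class weights, here is a small optional sketch (it assumes `y_labeled` from the main script; `compute_class_weight` is standard scikit-learn):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y_labeled: 0 = Non-Publishable (5 papers), 1 = Publishable (10 papers)
classes = np.array([0, 1])
weights = compute_class_weight('balanced', classes=classes, y=y_labeled)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)  # -> {0: 1.5, 1: 0.75} for a 5 vs. 10 split
```

The resulting dictionary can be passed directly as `class_weight` to `LogisticRegression` to counteract the skew.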
Below are steps you can take to gain insight and improve classification results:
1. Train/Test Split of Labeled Data
2. Look at Predicted Probabilities (see the diagnostic sketch after this list)
3. Inspect the Confusion Matrix
4. Lower the Confidence Threshold: you currently use `threshold=0.8` in `SelfTrainingClassifier`. That means the model will only pseudo-label an unlabeled sample if it has at least 80% confidence in the predicted class.
5. Use Label Spreading or Label Propagation: `LabelPropagation` or `LabelSpreading` in `sklearn.semi_supervised`. These methods propagate labels in a graph-based manner, which sometimes helps with very few labeled data.
6. Incremental Self-Training
7. Oversample “Non-Publishable”, or re-weight it, e.g., `class_weight={0: some_factor, 1: 1}`.
8. Data Augmentation
9. Actively Label More Data
10. Refine Topic Modeling, e.g., increase `num_topics` to 10, 15, or 20.
11. Use Domain-Specific Terms or Custom Dictionaries
12. Incorporate Additional Signals
13. Try Large Language Model Embeddings
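For items 2 and 3, a small diagnostic sketch (it assumes `X_labeled`, `y_labeled`, and the fitted `self_training_clf` from the main script, e.g., added at the end of `main()`):

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hold out part of the 15 labeled papers to see where the model goes wrong.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_labeled.toarray(), y_labeled, test_size=0.4, random_state=42, stratify=y_labeled)

probs = self_training_clf.predict_proba(X_te)   # column 1 = P(Publishable)
preds = self_training_clf.predict(X_te)

print("P(Publishable) per held-out paper:", np.round(probs[:, 1], 3))
print("Confusion matrix (rows = true, cols = predicted):")
print(confusion_matrix(y_te, preds, labels=[0, 1]))
```

Keep in mind that `self_training_clf` has already seen these labeled papers during fitting, so this is a sanity check of its behaviour rather than an unbiased accuracy estimate.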
Below is a minimal snippet showing how to lower the confidence threshold, apply a class weight that emphasizes the minority class, and optionally switch to `LabelPropagation`.
Feel free to integrate it into the earlier pipeline where indicated.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation, SelfTrainingClassifier

# 1. Apply class_weight to emphasize the minority class
base_clf = LogisticRegression(
    max_iter=1000,
    solver='lbfgs',
    class_weight={0: 2.0, 1: 1.0}  # double the weight for "Non-Publishable"
)

# 2. Lower the confidence threshold in SelfTraining
self_training_clf = SelfTrainingClassifier(
    base_clf,
    threshold=0.5  # Lower threshold from 0.8 to 0.5
)
self_training_clf.fit(X_combined, y_combined)

# Alternatively, you can try LabelPropagation
label_prop_clf = LabelPropagation()
label_prop_clf.fit(X_combined, y_combined)
```
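As a quick follow-up check (reusing `X_unlabeled` from the main script and the two classifiers fitted above), you can compare how many papers each model now flags as Non-Publishable:

```python
import numpy as np

st_preds = self_training_clf.predict(X_unlabeled.toarray())
lp_preds = label_prop_clf.predict(X_unlabeled.toarray())

# Index 0 = Non-Publishable count, index 1 = Publishable count.
print("Self-training counts:", np.bincount(st_preds.astype(int), minlength=2))
print("Label propagation counts:", np.bincount(lp_preds.astype(int), minlength=2))
```

If both models still predict 135 Publishable papers, revisit the class weight and threshold before anything else.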
Key changes to try:
- `class_weight={0: 2.0, 1: 1.0}` (up-weight the minority “Non-Publishable” class)
- `threshold=0.5` (in `SelfTrainingClassifier`)
- Use `LabelPropagation` or `LabelSpreading`

In short, you can (a) re-weight the minority class, (b) lower the self-training threshold, (c) switch to `LabelPropagation` or `LabelSpreading`, (d) label a few more “Non-Publishable” papers, and (e) explore richer feature representations or domain-specific cues. By iterating on these steps, especially adding a few more “Non-Publishable” labels or adjusting the classifier's threshold and class weight, you should see the model begin to produce more reasonable predictions on the unlabeled set.