
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 12 Issue: 10 | Oct 2025

p-ISSN: 2395-0072

www.irjet.net

A Comparative Case Study: NLP-Based Extractive Summarization Techniques Using SpaCy, NLTK, TextRank, and K-Means

Abhijeet R Pandipermbil¹, Prof. Deepali Dhainje²

¹Student, Dept. of Computer Science, Fergusson College, Maharashtra, India

²Professor, Dept. of Computer Science, Fergusson College, Maharashtra, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - In the growing field of Natural Language Processing (NLP), efficient text summarization is needed to condense large volumes of information. This study presents a comparative case study of extractive summarization methods using SpaCy, NLTK, TextRank, and K-Means clustering on domain-specific data drawn from the BBC News and Kaggle article datasets. Using the same preprocessing and evaluation procedure (BLEU scores) throughout, the study examines the strengths, weaknesses, and content-domain suitability of each algorithm. The results indicate that SpaCy and TextRank perform better on politics and technology topics, while K-Means leads by a wide margin on entertainment material. The findings support practical NLP applications by guiding algorithm choice according to content characteristics.

Key Words: Text Summarization, SpaCy, NLTK, TextRank, K-Means, NLP, BLEU Score, Extractive Summarization, Case Study.

1. INTRODUCTION

Text summarization remains a critical area of NLP, enabling efficient content consumption across domains like journalism, academia, and business intelligence. In an era where vast textual data is generated daily, effective summarization tools reduce reading time while preserving key information. Extractive summarization techniques select and rank significant sentences, making them particularly suitable for applications where factual integrity and context retention are critical [1][2]. This paper investigates four summarization techniques (SpaCy, NLTK, TextRank, and K-Means) across categorized datasets. Rather than proposing a novel algorithm, the focus is on applying and analyzing these techniques in realistic case study settings using widely used news corpora.

2. RELATED WORK

Research on summarization generally divides into extractive and abstractive paradigms. Extractive methods, for instance TextRank [3] and K-Means [4], select sentences deemed representative based on relevance scores or clustering mechanisms, respectively. Abstractive summarization, on the other hand, is expected to generate new sentences and therefore needs deeper semantic understanding [5].

SpaCy and NLTK are foundational Python libraries for NLP. SpaCy supports neural pipeline integrations and dependency parsing, while NLTK is versatile for tokenization, POS tagging, and sentence scoring [6][7]. Previous work by Zhang et al. [8] has demonstrated the use of SpaCy in conjunction with deep learning architectures for summarization problems. NLTK has also been used for extractive summarization by Gupta and Kumar [9], with decent results.

The TextRank algorithm ranks sentences by importance using centrality measures on a graph of sentences [3]. It has been shown to be suitable for news summarization, especially for content rich in political or technical discourse [10]. K-Means clustering has also become popular for grouping sentences on the basis of semantic similarity, combined with TF-IDF or GloVe embeddings [4][11].

3. METHODOLOGY

3.1 Datasets

Two publicly downloadable datasets were considered:

1. BBC News Dataset: 4900+ articles in the categories politics, tech, sport, business, and entertainment.
2. Kaggle English Articles Dataset: 3101 articles across similar categories.

From each dataset, 20 articles from every domain were selected so that the domains are equally represented. To understand the structural variance in these datasets, we provide a histogram of average article and summary lengths per category (Figure 1). This offers insight into content density and aids in the interpretation of model performance in Section 4.
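The balanced selection described in Section 3.1 (20 articles per category) and the per-category average lengths behind Figure 1 could be sketched as follows. This is a minimal illustration, not the procedure used in the study: the record fields `category`, `text`, and `summary`, the function names, the seeded random selection, and the use of whitespace-separated word counts as the length measure are all assumptions, since the paper does not state how the 20 articles were chosen or how length was measured.

```python
import random
from collections import defaultdict
from statistics import mean


def balanced_sample(articles, per_category=20, seed=0):
    """Pick up to `per_category` articles from each category.

    Assumption: each article is a dict with a "category" key, and
    articles are drawn at random with a fixed seed for repeatability.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for article in articles:
        by_category[article["category"]].append(article)

    sample = []
    for category in sorted(by_category):
        pool = by_category[category]
        k = min(per_category, len(pool))  # guard against small categories
        sample.extend(rng.sample(pool, k))
    return sample


def average_lengths(articles):
    """Return {category: (mean article words, mean summary words)}.

    Assumption: length is the number of whitespace-separated tokens.
    """
    text_lengths = defaultdict(list)
    summary_lengths = defaultdict(list)
    for article in articles:
        category = article["category"]
        text_lengths[category].append(len(article["text"].split()))
        summary_lengths[category].append(len(article["summary"].split()))
    return {
        category: (mean(text_lengths[category]), mean(summary_lengths[category]))
        for category in text_lengths
    }
```

Calling `average_lengths(balanced_sample(articles))` on a list of such records yields one pair of averages per category, which is the data a histogram like Figure 1 would plot.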

Impact Factor value: 8.315

ISO 9001:2008 Certified Journal
