📰 News NLP Processor - Natural Language Processing for News Articles

🎓 Academic Information

University: Universidad Politécnica de San Luis Potosí
Course: Artificial Intelligence 2
Professor: Manuel Chávez Pérez

👥 Development Team

Emiliano Chequer Enriquez - 180016
Jorge Matadamas Cortes - 180789
Axel Rivera Saldaña - 180606

📋 Project Description

This project implements a Natural Language Processing (NLP) system to extract, analyze, and process news articles from the El Pulso SLP website. The system provides:

✅ Automatic news and title extraction
✅ Text processing with tokenization and stemming
✅ Keyword search in news articles
✅ Relevance analysis using TF-IDF
✅ Word index generation
✅ News storage in text files

🛠️ Technologies Used

Python 3.8+
BeautifulSoup4 - Web scraping
Requests - HTTP requests
NLTK - Natural language processing
Pandas - Data manipulation
NumPy - Numerical operations
Scikit-learn - Machine learning and TF-IDF
Sentence-splitter - Sentence segmentation

📦 Installation

1. Clone the repository

git clone https://github.com/AxelSaldana/news-nlp-processor.git
cd news-nlp-processor

2. Create virtual environment (recommended)

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Download NLTK resources

import nltk
nltk.download('punkt')

🚀 System Usage

Main Execution

python procesador_noticias.py

Programmatic Usage

from procesador_noticias import ProcesadorNoticias

# Create processor instance
procesador = ProcesadorNoticias()

# Get news
procesador.obtener_noticias()
procesador.extraer_contenido_noticias()

# Save to files
procesador.guardar_noticias_txt()

# Create index for searches
procesador.crear_indice()

# Search for specific word
procesador.buscar_palabra("política")

# Relevance analysis
procesador.analizar_relevancia_tfidf("economía", top_n=5)

📁 Project Structure

news-nlp-processor/
├── README.md                           # This file
├── requirements.txt                    # Project dependencies
├── procesador_noticias.py             # Main clean and organized code
├── main.py                            # Original notebook code
├── ProyectoFinalNoticias.ipynb        # Original Colab notebook
└── noticias/                          # Directory where news are saved
    ├── noticia1.txt
    ├── noticia2.txt
    └── ...

🔍 Main Features

1. News Extraction

Automatic connection to El Pulso SLP website
Extraction of news titles and links
Download of complete content for each news article

2. Text Processing

Tokenization: Text separation into individual words
Cleaning: Removal of punctuation marks
Normalization: Conversion to lowercase
Stemming: Word reduction to their root form

3. Search System

Creation of inverted index for fast searches
Keyword search across all news articles
Identification of files containing specific terms

4. TF-IDF Analysis

Calculation of term relevance per document
Ranking of most relevant news for a keyword
Statistical analysis of term frequency

📊 Example Output

=== EXTRACTED NEWS ===
1. SLP Government announces new economic measures
2. Polytechnic University inaugurates new laboratory
3. Tourism sector shows signs of recovery
...

=== WORD SEARCH ===
Enter a word to search: university

The word 'university' was found in:
  - Universidad_Politecnica_inaugura_nuevo_laboratorio_2.txt
  - Convenio_educativo_universidades_5.txt

TOP 5 most relevant files for 'university':
1. Universidad_Politecnica_inaugura_nuevo_laboratorio_2 (Score: 0.8542)
2. Convenio_educativo_universidades_5 (Score: 0.6234)
...

🤝 Contributions

This project was developed as part of the Artificial Intelligence 2 course. Contributions are limited to the development team members.

📄 License

This project is for academic use and developed for educational purposes at Universidad Politécnica de San Luis Potosí.

📞 Contact

For questions about this project, contact any of the development team members mentioned above.

Note: This project uses web scraping from El Pulso SLP website for academic and research purposes. The website's terms of use are respected.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

📰 News NLP Processor - Natural Language Processing for News Articles

🎓 Academic Information

👥 Development Team

📋 Project Description

🛠️ Technologies Used

📦 Installation

1. Clone the repository

2. Create virtual environment (recommended)

3. Install dependencies

4. Download NLTK resources

🚀 System Usage

Main Execution

Programmatic Usage

📁 Project Structure

🔍 Main Features

1. News Extraction

2. Text Processing

3. Search System

4. TF-IDF Analysis

📊 Example Output

🤝 Contributions

📄 License

📞 Contact

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitignore		.gitignore
ProyectoFinalNoticias.ipynb		ProyectoFinalNoticias.ipynb
README.md		README.md
main.py		main.py
procesador_noticias.py		procesador_noticias.py
requirements.txt		requirements.txt

AxelSaldana/news-nlp-processor

Folders and files

Latest commit

History

Repository files navigation

📰 News NLP Processor - Natural Language Processing for News Articles

🎓 Academic Information

👥 Development Team

📋 Project Description

🛠️ Technologies Used

📦 Installation

1. Clone the repository

2. Create virtual environment (recommended)

3. Install dependencies

4. Download NLTK resources

🚀 System Usage

Main Execution

Programmatic Usage

📁 Project Structure

🔍 Main Features

1. News Extraction

2. Text Processing

3. Search System

4. TF-IDF Analysis

📊 Example Output

🤝 Contributions

📄 License

📞 Contact

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages