University: Universidad Politécnica de San Luis Potosí
Course: Artificial Intelligence 2
Professor: Manuel Chávez Pérez
- Emiliano Chequer Enriquez - 180016
- Jorge Matadamas Cortes - 180789
- Axel Rivera Saldaña - 180606
This project implements a Natural Language Processing (NLP) system to extract, analyze, and process news articles from the El Pulso SLP website. The system provides:
- ✅ Automatic news and title extraction
- ✅ Text processing with tokenization and stemming
- ✅ Keyword search in news articles
- ✅ Relevance analysis using TF-IDF
- ✅ Word index generation
- ✅ News storage in text files
- Python 3.8+
- BeautifulSoup4 - Web scraping
- Requests - HTTP requests
- NLTK - Natural language processing
- Pandas - Data manipulation
- NumPy - Numerical operations
- Scikit-learn - Machine learning and TF-IDF
- Sentence-splitter - Sentence segmentation
git clone https://github.com/AxelSaldana/news-nlp-processor.git
cd news-nlp-processor
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt

# In a Python session: download the NLTK tokenizer data
import nltk
nltk.download('punkt')

# Run the script from the shell
python procesador_noticias.py

# Or use the class from Python
from procesador_noticias import ProcesadorNoticias
# Create processor instance
procesador = ProcesadorNoticias()
# Get news
procesador.obtener_noticias()
procesador.extraer_contenido_noticias()
# Save to files
procesador.guardar_noticias_txt()
# Create index for searches
procesador.crear_indice()
# Search for specific word
procesador.buscar_palabra("política")
# Relevance analysis
procesador.analizar_relevancia_tfidf("economía", top_n=5)

news-nlp-processor/
├── README.md # This file
├── requirements.txt # Project dependencies
├── procesador_noticias.py # Main module (cleaned-up, organized code)
├── main.py # Original notebook code
├── ProyectoFinalNoticias.ipynb # Original Colab notebook
└── noticias/ # Directory where news articles are saved
    ├── noticia1.txt
    ├── noticia2.txt
    └── ...
- Automatic connection to El Pulso SLP website
- Extraction of news titles and links
- Download of complete content for each news article
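The title and link extraction step can be illustrated with a small sketch. It is not the project's code: the project lists BeautifulSoup4 and Requests, while this version uses only the standard library's `html.parser` so it runs without dependencies or network access. The markup pattern (`<a>` inside `<h2>`), the class `TitleLinkParser`, the function `extract_titles_and_links`, and the sample HTML are all assumptions made for illustration; the real site's markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleLinkParser(HTMLParser):
    """Collect (title, absolute_url) pairs from <a> tags nested inside <h2> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.items = []       # collected (title, url) pairs
        self._in_h2 = False   # True while inside an <h2> element
        self._href = None     # href of the <a> currently being read
        self._text = []       # text fragments of the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the title
            title = " ".join("".join(self._text).split())
            if title:
                self.items.append((title, urljoin(self.base_url, self._href)))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False


def extract_titles_and_links(html, base_url):
    parser = TitleLinkParser(base_url)
    parser.feed(html)
    return parser.items


# Hypothetical page fragment; example.com stands in for the real site
SAMPLE_HTML = """
<html><body>
  <h2><a href="/noticia/1">Primera noticia</a></h2>
  <h2><a href="https://example.com/noticia/2">Segunda   noticia</a></h2>
  <p><a href="/about">No es noticia</a></p>
</body></html>
"""

results = extract_titles_and_links(SAMPLE_HTML, "https://example.com")
for title, url in results:
    print(title, "->", url)
```

In a live run, the HTML string would come from an HTTP response body, and each collected URL would then be fetched to download the full article text.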
- Tokenization: Text separation into individual words
- Cleaning: Removal of punctuation marks
- Normalization: Conversion to lowercase
- Stemming: Word reduction to their root form
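The four processing steps can be sketched as a single function. This is a minimal illustration, not the project's implementation: `preprocess` and `naive_stem` are hypothetical names, and `naive_stem` is a crude suffix stripper used only so the example runs without NLTK. A real stemmer, such as the `stem` method of NLTK's `SnowballStemmer("spanish")`, could be passed in its place; which stemmer the project actually uses is not stated here.

```python
import re

# Hypothetical stand-in for a real stemmer: strips a few common Spanish suffixes.
SUFFIXES = ("aciones", "ación", "mente", "es", "as", "os", "a", "o", "s")


def naive_stem(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=naive_stem):
    """Normalize, tokenize, clean, and stem a piece of text."""
    text = text.lower()                 # normalization: lowercase everything
    tokens = re.findall(r"\w+", text)   # tokenization + cleaning: punctuation is dropped
    return [stem(token) for token in tokens]


print(preprocess("Política, políticas."))
print(preprocess("¡Las Universidades, la universidad!"))
```

Because singular and plural forms reduce to the same stem, a later search for one form can match documents that contain the other.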
- Creation of inverted index for fast searches
- Keyword search across all news articles
- Identification of files containing specific terms
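An inverted index maps each term to the set of files that contain it, so a keyword lookup does not need to rescan every article. The sketch below shows the idea under simple assumptions: the function names `build_inverted_index` and `search` and the sample token lists are hypothetical, and the tokens are assumed to be already processed.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: dict mapping file name -> list of already-processed tokens."""
    index = defaultdict(set)
    for filename, tokens in documents.items():
        for token in tokens:
            index[token].add(filename)
    return index


def search(index, term):
    """Return the sorted list of files containing the term (empty if none)."""
    return sorted(index.get(term, set()))


# Hypothetical sample data
docs = {
    "noticia1.txt": ["gobierno", "anuncia", "medidas"],
    "noticia2.txt": ["universidad", "inaugura", "laboratorio"],
    "noticia3.txt": ["convenio", "universidad"],
}
index = build_inverted_index(docs)
print(search(index, "universidad"))  # ['noticia2.txt', 'noticia3.txt']
```

The query term should go through the same processing as the documents (lowercasing, stemming) before the lookup, otherwise it may not match the indexed form.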
- Calculation of term relevance per document
- Ranking of most relevant news for a keyword
- Statistical analysis of term frequency
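The relevance ranking can be illustrated with a hand-rolled TF-IDF. This sketch uses the textbook definition, term frequency `count / document length` multiplied by `ln(N / df)`, and is an assumption rather than the project's code: the project lists Scikit-learn, whose `TfidfVectorizer` applies a smoothed IDF and normalization by default, so its scores will differ from these. The names `tfidf_scores` and `top_n` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(documents, term):
    """Return {filename: score} for one term; tf = count / len, idf = ln(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain the term
    df = sum(1 for tokens in documents.values() if term in tokens)
    if df == 0:
        return {}
    idf = math.log(n_docs / df)
    scores = {}
    for name, tokens in documents.items():
        if tokens:  # skip empty documents to avoid dividing by zero
            tf = Counter(tokens)[term] / len(tokens)
            scores[name] = tf * idf
    return scores


def top_n(documents, term, n=5):
    """Rank documents by score, highest first, dropping zero scores."""
    scores = tfidf_scores(documents, term)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked[:n] if score > 0]


# Hypothetical sample data (already tokenized)
docs_tfidf = {
    "a.txt": ["universidad", "universidad", "laboratorio", "nuevo"],
    "b.txt": ["convenio", "universidad", "educativo", "firma"],
    "c.txt": ["turismo", "recuperación"],
}
for name, score in top_n(docs_tfidf, "universidad"):
    print(f"{name} (Score: {score:.4f})")
```

With this unsmoothed formula, a term that appears in every document gets an IDF of zero and therefore ranks nothing, which is one reason library implementations smooth the IDF.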
=== EXTRACTED NEWS ===
1. SLP Government announces new economic measures
2. Polytechnic University inaugurates new laboratory
3. Tourism sector shows signs of recovery
...
=== WORD SEARCH ===
Enter a word to search: university
The word 'university' was found in:
- Universidad_Politecnica_inaugura_nuevo_laboratorio_2.txt
- Convenio_educativo_universidades_5.txt
TOP 5 most relevant files for 'university':
1. Universidad_Politecnica_inaugura_nuevo_laboratorio_2 (Score: 0.8542)
2. Convenio_educativo_universidades_5 (Score: 0.6234)
...
This project was developed as part of the Artificial Intelligence 2 course. Contributions are limited to the development team members.
This project is intended for academic use and was developed for educational purposes at Universidad Politécnica de San Luis Potosí.
For questions about this project, contact any of the development team members mentioned above.
Note: This project scrapes the El Pulso SLP website for academic and research purposes only. The website's terms of use are respected.