
Commit d5ecea6

feed extractor
1 parent 268df1f commit d5ecea6

11 files changed: +498 / -1 lines changed

.claude/settings.local.json

Lines changed: 4 additions & 1 deletion
@@ -8,7 +8,10 @@
 "Bash(python3 -c \"import sys, json; data=json.load(sys.stdin); [print(f''''{b[\"\"feedTitle\"\"][\"\"value\"\"]}: {b[\"\"title\"\"][\"\"value\"\"][:50]}...'''') for b in data[''''results''''][''''bindings'''']]\")",
 "Bash(curl -s \"http://localhost:3030/newsmonitor/query\" --data-urlencode \"query=PREFIX sioc: <http://rdfs.org/sioc/ns#>\nPREFIX dc: <http://purl.org/dc/elements/1.1/>\nSELECT ?feedTitle (COUNT(?post) as ?count)\nWHERE {\n GRAPH <http://hyperdata.it/content> {\n ?post a sioc:Post ;\n sioc:has_container ?feed .\n }\n GRAPH <http://hyperdata.it/feeds> {\n ?feed dc:title ?feedTitle .\n }\n}\nGROUP BY ?feedTitle\" -u admin:admin123)",
 "Bash(python3 -c \"import sys, json; data=json.load(sys.stdin); [print(f''''{b[\"\"feedTitle\"\"][\"\"value\"\"]}: {b[\"\"count\"\"][\"\"value\"\"]} entries'''') for b in data[''''results''''][''''bindings'''']]\")",
-"Bash(curl -s \"http://localhost:3030/newsmonitor/query\" --data-urlencode \"query=PREFIX sioc: <http://rdfs.org/sioc/ns#>\nPREFIX dc: <http://purl.org/dc/elements/1.1/>\nSELECT ?post ?container\nWHERE {\n GRAPH <http://hyperdata.it/content> {\n ?post a sioc:Post .\n OPTIONAL { ?post sioc:has_container ?container }\n }\n}\nLIMIT 5\" -u admin:admin123)"
+"Bash(curl -s \"http://localhost:3030/newsmonitor/query\" --data-urlencode \"query=PREFIX sioc: <http://rdfs.org/sioc/ns#>\nPREFIX dc: <http://purl.org/dc/elements/1.1/>\nSELECT ?post ?container\nWHERE {\n GRAPH <http://hyperdata.it/content> {\n ?post a sioc:Post .\n OPTIONAL { ?post sioc:has_container ?container }\n }\n}\nLIMIT 5\" -u admin:admin123)",
+"Skill(transmissions-processor)",
+"Skill(transmissions-app)",
+"Bash(tree src/apps/feed-finder/)"
 ],
 "deny": [],
 "ask": []

.claude/skills/transmissions-processor/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@ This skill guides you through creating custom processors to extend Transmissions
 
 ## Quick Start Decision Tree
 
+First check for an existing processor that can be used in `src/processors`. If the name is a reasonable match for the required functionality, check the processor's signature in comments in the code.
+
 **Choose your development path:**
 
 ### Core Development (Recommended for reusable processors)
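
Following that advice before building anything new, a quick way to scan for an existing processor is a filename and comment search over `src/processors` (a sketch; only the directory name is taken from the skill text):

```sh
# List the processor sources shipped with Transmissions
ls src/processors/

# Find processors whose filenames or comments mention feeds
grep -ril "feed" src/processors/ | sort
```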

src/apps/feed-finder/about.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
# feed-finder Application

## Runner

```sh
cd ~/hyperdata/transmissions # adjust to your local path
./trans feed-finder
```

## Description

Discovers RSS and Atom feeds from web pages previously cataloged by the link-finder application.

**Pipeline Flow:**
1. Queries SPARQL store for HTML bookmarks with status 200 (from link-finder)
2. Fetches HTML content for each page
3. Extracts feed URLs from `<link rel="alternate">` tags
4. Accumulates discovered feed URLs
5. Saves list to `data/feeds.md`

**Feed Types Detected:**
- RSS 2.0 (`application/rss+xml`)
- RSS 1.0 (`application/rdf+xml`)
- Atom (`application/atom+xml`)
- JSON Feed (`application/feed+json`)
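The `<link rel="alternate">` markup these types appear in can be inspected by hand; a minimal sketch, assuming only standard curl and grep (the URL is a placeholder):

```sh
# Show any alternate-link tags in a page's HTML head
curl -s https://example.com/ | grep -io '<link[^>]*rel="alternate"[^>]*>'
# Typical match:
# <link rel="alternate" type="application/rss+xml" title="Example Feed" href="https://example.com/feed">
```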
## Prerequisites

1. **SPARQL Store**: Running Fuseki instance at `http://localhost:3030/test`
2. **Existing Data**: Bookmarks must exist in `<http://hyperdata.it/content>` graph (created by link-finder)
3. **Network Access**: HTTP access to fetch pages
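A quick check of prerequisite 2 is to count matching bookmarks directly. This is a sketch: the `/test/query` path and `admin:admin123` credentials are assumptions taken from other files in this commit, and the status comparison is kept loose because the exact literal typing may vary.

```sh
# Count bookmarks with status 200 in the content graph
curl -s -u admin:admin123 "http://localhost:3030/test/query" \
  --data-urlencode 'query=
PREFIX bm: <http://purl.org/stuff/bm/>
SELECT (COUNT(?b) AS ?n)
FROM <http://hyperdata.it/content>
WHERE {
  ?b a bm:Bookmark ;
     bm:target ?target ;
     bm:status ?status .
  FILTER (STR(?status) = "200")
}'
```

A non-zero count means link-finder has already populated the graph.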
## Configuration

### SPARQL Query
Queries `<http://hyperdata.it/content>` graph for:
- Bookmarks with `bm:status "200"`
- Content type containing `text/html`

### HTTP Settings
- Timeout: 10 seconds (prevents hanging on slow sites)
- Skips pages with errors or timeouts

### Output
- File: `src/apps/feed-finder/data/feeds.md`
- Format: One feed URL per line

## Performance

- Processing time depends on number of HTML bookmarks in store
- HTTP timeout of 10 seconds per page
- For large datasets, expect ~30-60 minutes per 1000 pages
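As a rough check on that last figure: the 10-second timeout puts a worst case of roughly 2.8 hours on 1000 sequential fetches, so the 30-60 minute estimate assumes most pages answer within about 2-4 seconds.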
## Verification

Check discovered feeds:

```sh
cat src/apps/feed-finder/data/feeds.md
```

Count feeds found:

```sh
wc -l src/apps/feed-finder/data/feeds.md
```

## Example Output

```
https://example.com/feed
https://blog.example.org/rss.xml
https://news.example.net/atom.xml
```

## Troubleshooting

**No feeds found:**
- Verify link-finder has populated the SPARQL store
- Check that pages have status 200 and HTML content type
- Run with `-v` flag for verbose logging

**HTTP timeouts:**
- Increase timeout in config.ttl: `:httpSettings :timeout "20000"`
- Check network connectivity

**SPARQL connection errors:**
- Verify Fuseki is running: `http://localhost:3030/test`
- Check endpoints.json configuration
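A minimal connectivity check for that last case, assuming the dataset name from the Prerequisites section and the credentials used elsewhere in this commit:

```sh
# Should return a small JSON result with "boolean" : true if the endpoint is reachable
curl -s -u admin:admin123 "http://localhost:3030/test/query" --data-urlencode 'query=ASK {}'
```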
## Related Apps

- **link-finder**: Discovers links from markdown files, stores in SPARQL
- **newsmonitor**: Subscribes to feeds and monitors content

## Next Steps

After finding feeds:
1. Review `data/feeds.md` for relevant feeds
2. Use newsmonitor/subscribe to add feeds to monitoring
3. Fetch and process feed content with newsmonitor/fetch-with-storage

src/apps/feed-finder/config.ttl

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# src/apps/feed-finder/config.ttl

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

@prefix : <http://purl.org/stuff/transmissions/> .

# SPARQL Select - Query for HTML bookmarks
:selectLinks a :ConfigSet ;
    :templateFilename "data/select-links.njk" ;
    :endpointSettings "data/endpoints.json" ;
    :graph <http://hyperdata.it/content> .

# ForEach - Iterate over query results
:forEach a :ConfigSet ;
    :remove "true" ;
    :forEach "queryResults.results.bindings" .

# Restructure - Extract URL from SPARQL bindings
:extractUrl a :ConfigSet ;
    :rename (:r1) .
:r1 :pre "currentItem.target.value" ;
    :post "url" .

# HttpClient - Fetch HTML with timeout
:httpSettings a :ConfigSet ;
    :timeout "10000" .

# HTMLFeedExtractor - Extract feed URL from HTML
:feedExtractor a :ConfigSet ;
    :inputField "http.data" ;
    :outputField "feedUrl" ;
    :baseUrlField "url" .

# Choice - Only continue if feed found
:checkFeedExists a :ConfigSet ;
    :testProperty "feedUrl" ;
    :testOperator "exists" ;
    :trueProperty "feedUrl" ;
    :trueValue "feedUrl" ;
    :falseProperty "skip" ;
    :falseValue "true" .

# Accumulate - Collect feed URLs
:accumulator a :ConfigSet ;
    :label "feedList" ;
    :accumulatorType "array" ;
    :sourceField "feedUrl" ;
    :targetField "feedUrls" .

# Restructure - Prepare for template
:prepareTemplate a :ConfigSet ;
    :rename (:o1) .
:o1 :pre "feedUrls" ;
    :post "feeds" .

# Templater - Convert array to newline-separated string
:formatOutput a :ConfigSet ;
    :templateFilename "data/feeds-template.njk" ;
    :dataField "feeds" .

# FileWriter - Save to feeds.md
:fileOutput a :ConfigSet ;
    :destinationFile "src/apps/feed-finder/data/feeds.md" ;
    :contentField "content" .

src/apps/feed-finder/data/endpoints.json

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
[
  {
    "name": "local Fuseki",
    "type": "query",
    "url": "http://localhost:3030/semem/query",
    "credentials": {
      "user": "admin",
      "password": "admin123"
    }
  },
  {
    "name": "local Fuseki",
    "type": "update",
    "url": "http://localhost:3030/semem/update",
    "credentials": {
      "user": "admin",
      "password": "admin123"
    }
  }
]

src/apps/feed-finder/data/feeds-template.njk

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{% for feed in feeds -%}
{{ feed }}
{% endfor %}

src/apps/feed-finder/data/feeds.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Discovered Feeds

This file will be populated with feed URLs discovered from HTML pages.

src/apps/feed-finder/data/select-links.njk

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
PREFIX bm: <http://purl.org/stuff/bm/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?target ?contentType
FROM <{{graph}}>
WHERE {
  ?bookmark a bm:Bookmark ;
            bm:target ?target ;
            bm:status ?status .

  OPTIONAL { ?bookmark bm:contentType ?contentType }

  # Only fetch pages that returned 200 OK
  FILTER (?status = "200"^^xsd:integer)

  # Only fetch HTML pages (text/html content type)
  FILTER (CONTAINS(LCASE(STR(?contentType)), "text/html"))
}
ORDER BY ?target

src/apps/feed-finder/transmissions.ttl

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# src/apps/feed-finder/transmissions.ttl

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

@prefix : <http://purl.org/stuff/transmissions/> .

##################################################################
# Utility Processors : insert into pipe for debugging            #
#                                                                #
:SM a :ShowMessage . # verbose report, continues pipe            #
:SC a :ShowConfig .  # verbose report, continues pipe            #
:DE a :DeadEnd .     # ends the current pipe quietly             #
:H a :Halt .         # kills everything                          #
:N a :NOP .          # no operation (except for showing stage in pipe) #
:UF a :Unfork .      # collapses all pipes but one               #
##################################################################

:feedFinder a :Transmission ;
    :pipe (:p10 :p20 :p30 :p40 :p50 :p60 :p70 :p80 :p85 :p90) .

# Query SPARQL for HTML bookmarks with status 200
:p10 a :SPARQLSelect ;
    :settings :selectLinks .

# Iterate over each result
:p20 a :ForEach ;
    :settings :forEach .

# Extract URL from SPARQL bindings
:p30 a :Restructure ;
    :settings :extractUrl .

# Fetch HTML content
:p40 a :HttpClient ;
    :settings :httpSettings .

# Extract feed URL from HTML
:p50 a :HTMLFeedExtractor ;
    :settings :feedExtractor .

# Only process if feed found
:p60 a :Choice ;
    :settings :checkFeedExists .

# Accumulate feed URLs
:p70 a :Accumulate ;
    :settings :accumulator .

# Restructure for template
:p80 a :Restructure ;
    :settings :prepareTemplate .

# Convert array to newline-separated string
:p85 a :Templater ;
    :settings :formatOutput .

# Write to feeds.md
:p90 a :FileWriter ;
    :settings :fileOutput .
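
To watch these stages execute in order, the `-v` verbose flag mentioned in about.md can be added to the runner command (a sketch; the exact flag placement is an assumption):

```sh
cd ~/hyperdata/transmissions   # adjust to your local path
./trans feed-finder -v         # verbose logging while the pipeline runs
```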
