
Commit d5ecea6

feed extractor
1 parent 268df1f commit d5ecea6

11 files changed: +498 / -1 lines changed

.claude/settings.local.json

Lines changed: 4 additions & 1 deletion
@@ -8,7 +8,10 @@
 "Bash(python3 -c \"import sys, json; data=json.load(sys.stdin); [print(f''''{b[\"\"feedTitle\"\"][\"\"value\"\"]}: {b[\"\"title\"\"][\"\"value\"\"][:50]}...'''') for b in data[''''results''''][''''bindings'''']]\")",
 "Bash(curl -s \"http://localhost:3030/newsmonitor/query\" --data-urlencode \"query=PREFIX sioc: <http://rdfs.org/sioc/ns#>\nPREFIX dc: <http://purl.org/dc/elements/1.1/>\nSELECT ?feedTitle (COUNT(?post) as ?count)\nWHERE {\n GRAPH <http://hyperdata.it/content> {\n ?post a sioc:Post ;\n sioc:has_container ?feed .\n }\n GRAPH <http://hyperdata.it/feeds> {\n ?feed dc:title ?feedTitle .\n }\n}\nGROUP BY ?feedTitle\" -u admin:admin123)",
 "Bash(python3 -c \"import sys, json; data=json.load(sys.stdin); [print(f''''{b[\"\"feedTitle\"\"][\"\"value\"\"]}: {b[\"\"count\"\"][\"\"value\"\"]} entries'''') for b in data[''''results''''][''''bindings'''']]\")",
-"Bash(curl -s \"http://localhost:3030/newsmonitor/query\" --data-urlencode \"query=PREFIX sioc: <http://rdfs.org/sioc/ns#>\nPREFIX dc: <http://purl.org/dc/elements/1.1/>\nSELECT ?post ?container\nWHERE {\n GRAPH <http://hyperdata.it/content> {\n ?post a sioc:Post .\n OPTIONAL { ?post sioc:has_container ?container }\n }\n}\nLIMIT 5\" -u admin:admin123)"
+"Bash(curl -s \"http://localhost:3030/newsmonitor/query\" --data-urlencode \"query=PREFIX sioc: <http://rdfs.org/sioc/ns#>\nPREFIX dc: <http://purl.org/dc/elements/1.1/>\nSELECT ?post ?container\nWHERE {\n GRAPH <http://hyperdata.it/content> {\n ?post a sioc:Post .\n OPTIONAL { ?post sioc:has_container ?container }\n }\n}\nLIMIT 5\" -u admin:admin123)",
+"Skill(transmissions-processor)",
+"Skill(transmissions-app)",
+"Bash(tree src/apps/feed-finder/)"
 ],
 "deny": [],
 "ask": []

.claude/skills/transmissions-processor/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@ This skill guides you through creating custom processors to extend Transmissions
 
 ## Quick Start Decision Tree
 
+First check for an existing processor that can be used in `src/processors`. If the name is a reasonable match for the required functionality, check the processor's signature in comments in the code.
+
 **Choose your development path:**
 
 ### Core Development (Recommended for reusable processors)
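
Following that advice before building anything new, a quick way to scan for an existing processor is a filename and comment search over `src/processors` (a sketch; only the directory name is taken from the skill text):

```sh
# List the processor sources shipped with Transmissions
ls src/processors/

# Find processors whose filenames or comments mention feeds
grep -ril "feed" src/processors/ | sort
```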

src/apps/feed-finder/about.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
# feed-finder Application

## Runner

```sh
cd ~/hyperdata/transmissions # adjust to your local path
./trans feed-finder
```

## Description

Discovers RSS and Atom feeds from web pages previously cataloged by the link-finder application.

**Pipeline Flow:**
1. Queries SPARQL store for HTML bookmarks with status 200 (from link-finder)
2. Fetches HTML content for each page
3. Extracts feed URLs from `<link rel="alternate">` tags
4. Accumulates discovered feed URLs
5. Saves list to `data/feeds.md`

**Feed Types Detected:**
- RSS 2.0 (`application/rss+xml`)
- RSS 1.0 (`application/rdf+xml`)
- Atom (`application/atom+xml`)
- JSON Feed (`application/feed+json`)
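The `<link rel="alternate">` markup these types appear in can be inspected by hand; a minimal sketch, assuming only standard curl and grep (the URL is a placeholder):

```sh
# Show any alternate-link tags in a page's HTML head
curl -s https://example.com/ | grep -io '<link[^>]*rel="alternate"[^>]*>'
# Typical match:
# <link rel="alternate" type="application/rss+xml" title="Example Feed" href="https://example.com/feed">
```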
## Prerequisites

1. **SPARQL Store**: Running Fuseki instance at `http://localhost:3030/test`
2. **Existing Data**: Bookmarks must exist in `<http://hyperdata.it/content>` graph (created by link-finder)
3. **Network Access**: HTTP access to fetch pages
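A quick check of prerequisite 2 is to count matching bookmarks directly. This is a sketch: the `/test/query` path and `admin:admin123` credentials are assumptions taken from other files in this commit, and the status comparison is kept loose because the exact literal typing may vary.

```sh
# Count bookmarks with status 200 in the content graph
curl -s -u admin:admin123 "http://localhost:3030/test/query" \
  --data-urlencode 'query=
PREFIX bm: <http://purl.org/stuff/bm/>
SELECT (COUNT(?b) AS ?n)
FROM <http://hyperdata.it/content>
WHERE {
  ?b a bm:Bookmark ;
     bm:target ?target ;
     bm:status ?status .
  FILTER (STR(?status) = "200")
}'
```

A non-zero count means link-finder has already populated the graph.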
## Configuration

### SPARQL Query
Queries `<http://hyperdata.it/content>` graph for:
- Bookmarks with `bm:status "200"`
- Content type containing `text/html`

### HTTP Settings
- Timeout: 10 seconds (prevents hanging on slow sites)
- Skips pages with errors or timeouts

### Output
- File: `src/apps/feed-finder/data/feeds.md`
- Format: One feed URL per line

## Performance

- Processing time depends on number of HTML bookmarks in store
- HTTP timeout of 10 seconds per page
- For large datasets, expect ~30-60 minutes per 1000 pages
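As a rough check on that last figure: the 10-second timeout puts a worst case of roughly 2.8 hours on 1000 sequential fetches, so the 30-60 minute estimate assumes most pages answer within about 2-4 seconds.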
## Verification

Check discovered feeds:

```sh
cat src/apps/feed-finder/data/feeds.md
```

Count feeds found:

```sh
wc -l src/apps/feed-finder/data/feeds.md
```

## Example Output

```
https://example.com/feed
https://blog.example.org/rss.xml
https://news.example.net/atom.xml
```

## Troubleshooting

**No feeds found:**
- Verify link-finder has populated the SPARQL store
- Check that pages have status 200 and HTML content type
- Run with `-v` flag for verbose logging

**HTTP timeouts:**
- Increase timeout in config.ttl: `:httpSettings :timeout "20000"`
- Check network connectivity

**SPARQL connection errors:**
- Verify Fuseki is running: `http://localhost:3030/test`
- Check endpoints.json configuration
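A minimal connectivity check for that last case, assuming the dataset name from the Prerequisites section and the credentials used elsewhere in this commit:

```sh
# Should return a small JSON result with "boolean" : true if the endpoint is reachable
curl -s -u admin:admin123 "http://localhost:3030/test/query" --data-urlencode 'query=ASK {}'
```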
## Related Apps

- **link-finder**: Discovers links from markdown files, stores in SPARQL
- **newsmonitor**: Subscribes to feeds and monitors content

## Next Steps

After finding feeds:
1. Review `data/feeds.md` for relevant feeds
2. Use newsmonitor/subscribe to add feeds to monitoring
3. Fetch and process feed content with newsmonitor/fetch-with-storage

src/apps/feed-finder/config.ttl

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# src/apps/feed-finder/config.ttl

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

@prefix : <http://purl.org/stuff/transmissions/> .

# SPARQL Select - Query for HTML bookmarks
:selectLinks a :ConfigSet ;
    :templateFilename "data/select-links.njk" ;
    :endpointSettings "data/endpoints.json" ;
    :graph <http://hyperdata.it/content> .

# ForEach - Iterate over query results
:forEach a :ConfigSet ;
    :remove "true" ;
    :forEach "queryResults.results.bindings" .

# Restructure - Extract URL from SPARQL bindings
:extractUrl a :ConfigSet ;
    :rename (:r1) .
:r1 :pre "currentItem.target.value" ;
    :post "url" .

# HttpClient - Fetch HTML with timeout
:httpSettings a :ConfigSet ;
    :timeout "10000" .

# HTMLFeedExtractor - Extract feed URL from HTML
:feedExtractor a :ConfigSet ;
    :inputField "http.data" ;
    :outputField "feedUrl" ;
    :baseUrlField "url" .

# Choice - Only continue if feed found
:checkFeedExists a :ConfigSet ;
    :testProperty "feedUrl" ;
    :testOperator "exists" ;
    :trueProperty "feedUrl" ;
    :trueValue "feedUrl" ;
    :falseProperty "skip" ;
    :falseValue "true" .

# Accumulate - Collect feed URLs
:accumulator a :ConfigSet ;
    :label "feedList" ;
    :accumulatorType "array" ;
    :sourceField "feedUrl" ;
    :targetField "feedUrls" .

# Restructure - Prepare for template
:prepareTemplate a :ConfigSet ;
    :rename (:o1) .
:o1 :pre "feedUrls" ;
    :post "feeds" .

# Templater - Convert array to newline-separated string
:formatOutput a :ConfigSet ;
    :templateFilename "data/feeds-template.njk" ;
    :dataField "feeds" .

# FileWriter - Save to feeds.md
:fileOutput a :ConfigSet ;
    :destinationFile "src/apps/feed-finder/data/feeds.md" ;
    :contentField "content" .

src/apps/feed-finder/data/endpoints.json

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
[
  {
    "name": "local Fuseki",
    "type": "query",
    "url": "http://localhost:3030/semem/query",
    "credentials": {
      "user": "admin",
      "password": "admin123"
    }
  },
  {
    "name": "local Fuseki",
    "type": "update",
    "url": "http://localhost:3030/semem/update",
    "credentials": {
      "user": "admin",
      "password": "admin123"
    }
  }
]

src/apps/feed-finder/data/feeds-template.njk

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{% for feed in feeds -%}
{{ feed }}
{% endfor %}

src/apps/feed-finder/data/feeds.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Discovered Feeds

This file will be populated with feed URLs discovered from HTML pages.

src/apps/feed-finder/data/select-links.njk

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
PREFIX bm: <http://purl.org/stuff/bm/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?target ?contentType
FROM <{{graph}}>
WHERE {
  ?bookmark a bm:Bookmark ;
            bm:target ?target ;
            bm:status ?status .

  OPTIONAL { ?bookmark bm:contentType ?contentType }

  # Only fetch pages that returned 200 OK
  FILTER (?status = "200"^^xsd:integer)

  # Only fetch HTML pages (text/html content type)
  FILTER (CONTAINS(LCASE(STR(?contentType)), "text/html"))
}
ORDER BY ?target

src/apps/feed-finder/transmissions.ttl

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# src/apps/feed-finder/transmissions.ttl

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

@prefix : <http://purl.org/stuff/transmissions/> .

##################################################################
# Utility Processors : insert into pipe for debugging            #
#                                                                #
:SM a :ShowMessage . # verbose report, continues pipe            #
:SC a :ShowConfig .  # verbose report, continues pipe            #
:DE a :DeadEnd .     # ends the current pipe quietly             #
:H a :Halt .         # kills everything                          #
:N a :NOP .          # no operation (except for showing stage in pipe) #
:UF a :Unfork .      # collapses all pipes but one               #
##################################################################

:feedFinder a :Transmission ;
    :pipe (:p10 :p20 :p30 :p40 :p50 :p60 :p70 :p80 :p85 :p90) .

# Query SPARQL for HTML bookmarks with status 200
:p10 a :SPARQLSelect ;
    :settings :selectLinks .

# Iterate over each result
:p20 a :ForEach ;
    :settings :forEach .

# Extract URL from SPARQL bindings
:p30 a :Restructure ;
    :settings :extractUrl .

# Fetch HTML content
:p40 a :HttpClient ;
    :settings :httpSettings .

# Extract feed URL from HTML
:p50 a :HTMLFeedExtractor ;
    :settings :feedExtractor .

# Only process if feed found
:p60 a :Choice ;
    :settings :checkFeedExists .

# Accumulate feed URLs
:p70 a :Accumulate ;
    :settings :accumulator .

# Restructure for template
:p80 a :Restructure ;
    :settings :prepareTemplate .

# Convert array to newline-separated string
:p85 a :Templater ;
    :settings :formatOutput .

# Write to feeds.md
:p90 a :FileWriter ;
    :settings :fileOutput .
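
To watch these stages execute in order, the `-v` verbose flag mentioned in about.md can be added to the runner command (a sketch; the exact flag placement is an assumption):

```sh
cd ~/hyperdata/transmissions   # adjust to your local path
./trans feed-finder -v         # verbose logging while the pipeline runs
```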
