Content Indexing Pipeline
How publisher content is indexed, stored, and refreshed.
Indexing Flow
Publisher RSS/Atom Feed
↓
POST /introspect (validate + preview)
↓
POST /servers (create MCP server record with upstreamType: "content")
↓
POST /index (fetch feed → parse → store in DynamoDB)
↓
Content live at {slug}.mcp.xpay.sh
↓
Every 6 hours: contentRefresher detects new articles
What Gets Indexed
Each article stored in the index includes:
| Field | Description |
|---|---|
| URL | Original article URL on your site |
| Title | Article headline |
| Description | Plain text excerpt (up to 500 characters) |
| Content | Full article as clean markdown (up to 50KB) |
| Published date | When the article was originally published |
| Author | Byline/author name |
| Categories | Topic tags from your feed |
| Source | Whether it came from RSS or sitemap |
Articles are deduplicated by URL — re-indexing the same feed won’t create duplicates.
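The record shape and the URL-based deduplication can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Article` class, `url_key`, and `add_to_index` are hypothetical names, the in-memory dict stands in for the DynamoDB table, SHA-256 is an assumed choice of URL hash, and "50KB" is assumed to mean 50 × 1024 bytes of UTF-8.

```python
import hashlib
from dataclasses import dataclass, field

MAX_DESCRIPTION_CHARS = 500    # limit from the field table
MAX_CONTENT_BYTES = 50 * 1024  # assumption: "50KB" read as 50 * 1024 UTF-8 bytes


@dataclass
class Article:
    url: str
    title: str
    description: str
    content: str
    published: str = ""
    author: str = ""
    categories: list[str] = field(default_factory=list)
    source: str = "rss"  # "rss" or "sitemap"

    def __post_init__(self) -> None:
        self.description = self.description[:MAX_DESCRIPTION_CHARS]
        # Truncate on a byte budget and drop any partial trailing character.
        raw = self.content.encode("utf-8")[:MAX_CONTENT_BYTES]
        self.content = raw.decode("utf-8", errors="ignore")


def url_key(url: str) -> str:
    """Hypothetical dedup key: SHA-256 hex digest of the article URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def add_to_index(index: dict[str, Article], article: Article) -> bool:
    """Store the article unless its URL is already indexed; return True if stored."""
    key = url_key(article.url)
    if key in index:
        return False
    index[key] = article
    return True
```

Calling `add_to_index` twice with the same article stores it once and returns `False` the second time, which mirrors the behavior of re-indexing an unchanged feed.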
Feed Refresh (Scheduled)
The contentRefresher Lambda runs every 6 hours:
- Scans xpay-mcp-proxy-servers for servers with upstreamType: "content" and status: "active"
- Fetches each server's feedUrl
- Parses the feed and compares article URLs against the existing index
- Indexes only new articles (URL hash not in existing set)
- Updates the server record with a lastRefreshed timestamp and an incremented articleCount
The refresher only adds new articles — it does not update or delete existing ones. Deleted articles remain in the index until they expire via TTL (if configured).
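The add-only refresh step might look roughly like the sketch below. It assumes the feed has already been fetched and parsed into a list of URLs; `refresh_server` and `url_key` are hypothetical names, the hash function (SHA-256) is a guess at what "URL hash" means, and `lastRefreshed` is assumed to be Unix seconds.

```python
import hashlib
import time


def url_key(url: str) -> str:
    # Assumption: the "URL hash" is a SHA-256 hex digest; the real hash may differ.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def refresh_server(server: dict, feed_urls: list[str], existing_keys: set[str]) -> list[str]:
    """Return the URLs that are new and update the server record in place."""
    new_urls = []
    for url in feed_urls:
        key = url_key(url)
        if key not in existing_keys:
            existing_keys.add(key)  # also guards against repeats within one feed
            new_urls.append(url)
    # Existing articles are never updated or deleted, so the count only grows.
    server["articleCount"] = server.get("articleCount", 0) + len(new_urls)
    server["lastRefreshed"] = int(time.time())  # assumption: Unix seconds
    return new_urls
```

Because URLs missing from the feed are never looked up, an article deleted on the publisher's site is simply left untouched, consistent with the add-only behavior described above.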
Performance
- Each server refresh takes 1–3 seconds (feed fetch + DynamoDB writes)
- Lambda timeout: 300 seconds (can handle ~100 servers per run)
- At scale (1000+ servers), consider partitioning refreshes across multiple scheduled invocations
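The capacity figure follows from the numbers above: 300 seconds divided by a worst-case 3 seconds per server gives about 100 servers, assuming servers are refreshed one after another. One possible way to partition refreshes is to hash each server ID into a fixed number of buckets and give each scheduled invocation one bucket. The function names and the use of a server ID string are illustrative assumptions, not part of the documented system.

```python
import math
import zlib

LAMBDA_TIMEOUT_S = 300
WORST_CASE_REFRESH_S = 3  # upper end of the 1-3 second range


def servers_per_run(timeout_s: int = LAMBDA_TIMEOUT_S, per_server_s: int = WORST_CASE_REFRESH_S) -> int:
    """Servers one invocation can refresh sequentially at the worst-case rate."""
    return timeout_s // per_server_s


def partitions_needed(total_servers: int) -> int:
    """Minimum number of invocations if servers were spread perfectly evenly."""
    return max(1, math.ceil(total_servers / servers_per_run()))


def partition_for(server_id: str, partitions: int) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(server_id.encode("utf-8")) % partitions
```

For 1000 servers this yields 10 partitions. Hash-based assignment is not perfectly even, so in practice it is worth using a few more partitions than the minimum to leave headroom under the timeout.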
HTML to Markdown Conversion
Article content is converted from HTML to markdown during indexing:
| HTML Element | Markdown Output |
|---|---|
| <h1> – <h6> | # – ###### |
| <strong>, <b> | **bold** |
| <em>, <i> | *italic* |
| <a href="..."> | [text](url) |
| <img> | ![alt](src) |
| <li> | - item |
| <blockquote> | > quote |
| <code> | `code` |
| <pre> | ``` code ``` |
| <p> | Paragraph with blank line |
| <script>, <style>, <nav>, <header>, <footer> | Removed entirely |
HTML entities (those for &, <, >, ", ', and the non-breaking space) are decoded to their literal characters.
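A subset of these rules can be illustrated with Python's standard `html.parser`. This is a simplified sketch, not the converter the pipeline actually uses: `MarkdownSketch` and `html_to_markdown` are hypothetical names, and `<img>` and `<pre>` handling is omitted for brevity.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "header", "footer"}
INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*", "code": "`"}
HEADINGS = {f"h{n}": "#" * n + " " for n in range(1, 7)}


class MarkdownSketch(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decodes entities in text nodes
        self.out: list[str] = []
        self.skip_depth = 0  # > 0 while inside a removed element
        self.href = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADINGS:
            self.out.append(HEADINGS[tag])
        elif tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.out.append("[")
        elif tag == "li":
            self.out.append("- ")
        elif tag == "blockquote":
            self.out.append("> ")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag == "a":
            self.out.append(f"]({self.href})")
        elif tag in HEADINGS or tag in {"p", "blockquote"}:
            self.out.append("\n\n")  # block elements end with a blank line
        elif tag == "li":
            self.out.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()
```

For example, `html_to_markdown('<h2>Title</h2><script>alert(1)</script><p>Tom &amp; <strong>Jerry</strong></p>')` yields a `## Title` heading followed by a blank line and `Tom & **Jerry**`, with the script dropped and the entity decoded.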
Server Record Fields
When a publisher server is created, these additional fields are set on the xpay-mcp-proxy-servers record:
| Field | Type | Description |
|---|---|---|
| upstreamType | String | "content"; tells the MCP Proxy to route to the content server |
| feedUrl | String | The RSS/sitemap URL |
| feedType | String | "rss2.0", "atom", "rss1.0", or "sitemap" |
| articleCount | Number | Total indexed articles |
| lastRefreshed | Number | Timestamp of last successful refresh |
| category | String | "Content & Knowledge" |
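As a rough illustration, the content-specific fields could be assembled as below. The field names and allowed feedType values come from the table; the `content_fields` helper, the initial `articleCount` of 0, and the use of Unix seconds for `lastRefreshed` are assumptions made for the sketch.

```python
from typing import TypedDict

FEED_TYPES = {"rss2.0", "atom", "rss1.0", "sitemap"}


class ContentServerFields(TypedDict):
    upstreamType: str
    feedUrl: str
    feedType: str
    articleCount: int
    lastRefreshed: int
    category: str


def content_fields(feed_url: str, feed_type: str, now: int) -> ContentServerFields:
    """Build the extra fields for a publisher server (hypothetical helper)."""
    if feed_type not in FEED_TYPES:
        raise ValueError(f"unsupported feedType: {feed_type!r}")
    return {
        "upstreamType": "content",
        "feedUrl": feed_url,
        "feedType": feed_type,
        "articleCount": 0,     # assumption: starts at 0 before the first index run
        "lastRefreshed": now,  # assumption: Unix seconds
        "category": "Content & Knowledge",
    }
```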