Content Indexing Pipeline

How publisher content is indexed, stored, and refreshed.

Indexing Flow


Publisher RSS/Atom Feed
        ↓
POST /introspect (validate + preview)
        ↓
POST /servers (create MCP server record with upstreamType: "content")
        ↓
POST /index (fetch feed → parse → store in DynamoDB)
        ↓
Content live at {slug}.mcp.xpay.sh
        ↓
Every 6 hours: contentRefresher detects new articles

What Gets Indexed

Each article stored in the index includes:

Field	Description
URL	Original article URL on your site
Title	Article headline
Description	Plain text excerpt (up to 500 characters)
Content	Full article as clean markdown (up to 50KB)
Published date	When the article was originally published
Author	Byline/author name
Categories	Topic tags from your feed
Source	Whether it came from RSS or sitemap

Articles are deduplicated by URL — re-indexing the same feed won’t create duplicates.
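
As a rough illustration of the size limits and URL-based deduplication described above, here is a minimal Python sketch. It is not the pipeline's code: the SHA-256 key, the `urlHash` field name, the helper names, and the reading of "50KB" as 50 × 1024 bytes are all assumptions made for the example, and an in-memory dict stands in for DynamoDB.

```python
import hashlib

MAX_DESCRIPTION_CHARS = 500      # description limit from the table above
MAX_CONTENT_BYTES = 50 * 1024    # assumption: "50KB" means 50 * 1024 bytes


def url_key(url: str) -> str:
    """Hypothetical dedup key: SHA-256 hex digest of the article URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial multi-byte character, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def build_record(article: dict) -> dict:
    """Apply the size limits and attach the dedup key."""
    return {
        "urlHash": url_key(article["url"]),
        "url": article["url"],
        "title": article.get("title", ""),
        "description": article.get("description", "")[:MAX_DESCRIPTION_CHARS],
        "content": truncate_utf8(article.get("content", ""), MAX_CONTENT_BYTES),
    }


def index_articles(index: dict, articles: list) -> int:
    """Store articles keyed by URL hash; return how many were new."""
    added = 0
    for article in articles:
        record = build_record(article)
        if record["urlHash"] not in index:
            index[record["urlHash"]] = record
            added += 1
    return added
```

Calling `index_articles` twice with the same feed adds nothing the second time, which is the behavior the deduplication rule describes.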

Feed Refresh (Scheduled)

The contentRefresher Lambda runs every 6 hours:

1. Scans xpay-mcp-proxy-servers for servers with upstreamType: "content" and status: "active"
2. Fetches each server’s feedUrl
3. Parses the feed and compares article URLs against the existing index
4. Indexes only new articles (those whose URL hash is not in the existing set)
5. Updates the server record with a new lastRefreshed timestamp and an incremented articleCount

The refresher only adds new articles; it does not update or delete existing ones. Articles deleted at the source remain in the index until they expire via TTL (if configured).
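
The refresh steps can be sketched in Python as follows. This is a simplified model, not the contentRefresher Lambda itself. The `slug` key on the server record, the `fetch_feed` callback, the SHA-256 hash, and the use of epoch seconds for lastRefreshed are assumptions for illustration. Plain dicts stand in for the DynamoDB tables.

```python
import hashlib
import time
from typing import Callable, Optional


def url_hash(url: str) -> str:
    # Assumed hashing scheme; the source only says a URL hash is compared.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def refresh_server(server: dict, feed_articles: list, index: dict,
                   now: Optional[int] = None) -> int:
    """Add unseen articles to one server's index and update its record."""
    added = 0
    for article in feed_articles:
        key = url_hash(article["url"])
        if key in index:
            continue  # existing articles are never updated or deleted
        index[key] = article
        added += 1
    server["articleCount"] = server.get("articleCount", 0) + added
    # Assumption: lastRefreshed is stored as epoch seconds.
    server["lastRefreshed"] = now if now is not None else int(time.time())
    return added


def refresh_all(servers: list, fetch_feed: Callable[[str], list],
                indexes: dict, now: Optional[int] = None) -> dict:
    """Refresh every active content server; return new-article counts by slug."""
    results = {}
    for server in servers:
        if server.get("upstreamType") != "content" or server.get("status") != "active":
            continue
        articles = fetch_feed(server["feedUrl"])
        index = indexes.setdefault(server["slug"], {})
        results[server["slug"]] = refresh_server(server, articles, index, now)
    return results
```

Because the loop only checks whether a key is present, an article whose content changed at the same URL is left untouched, which matches the add-only behavior described above.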

Performance

Each server refresh takes 1–3 seconds (feed fetch + DynamoDB writes)
Lambda timeout: 300 seconds (enough for ~100 servers per run at up to 3 seconds per server)
At scale (1000+ servers), consider partitioning refreshes across multiple scheduled invocations
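
One way to reason about partitioning is to work out how many invocations are needed and assign each server to a shard. The sketch below assumes servers are refreshed one after another at the 3-second upper bound. The shard-by-CRC32 scheme and the function names are illustrative choices, not something the pipeline is known to use.

```python
import math
import zlib

LAMBDA_TIMEOUT_S = 300     # from the figures above
WORST_CASE_REFRESH_S = 3   # upper bound of the 1-3 second range


def servers_per_run(timeout_s: int = LAMBDA_TIMEOUT_S,
                    per_server_s: int = WORST_CASE_REFRESH_S) -> int:
    """How many sequential refreshes fit inside one invocation."""
    return timeout_s // per_server_s


def shard_count(total_servers: int) -> int:
    """Minimum number of scheduled invocations needed to cover all servers."""
    return max(1, math.ceil(total_servers / servers_per_run()))


def shard_for(slug: str, shards: int) -> int:
    """Assign a server to a shard. crc32 is stable across processes,
    unlike Python's built-in hash() for strings."""
    return zlib.crc32(slug.encode("utf-8")) % shards
```

With these numbers, `shard_count(1000)` comes out to 10. Hash-based assignment is not perfectly even, so in practice some headroom above the minimum shard count would be sensible.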

HTML to Markdown Conversion

Article content is converted from HTML to markdown during indexing:

HTML Element	Markdown Output
`<h1>` – `<h6>`	`#` – `######`
`<strong>`, `<b>`	`**bold**`
`<em>`, `<i>`	`*italic*`
`<a href="...">`	`[text](url)`
`<img>`	`![alt](src)`
`<li>`	`- item`
`<blockquote>`	`> quote`
`<code>`	`` `code` ``
`<pre>`	``` code ```
`<p>`	Paragraph with blank line
`<script>`, `<style>`, `<nav>`, `<header>`, `<footer>`	Removed entirely

HTML entities for &, <, >, ", ', and the non-breaking space (for example `&amp;` and `&nbsp;`) are decoded to their literal characters.
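
To make the mapping concrete, here is a small Python sketch built on the standard library's `html.parser`. It is not the converter the pipeline uses, and it covers only part of the table (headings, bold, italic, inline code, links, images, list items, paragraphs, and removed elements). The class and function names are made up for the example.

```python
import re
from html.parser import HTMLParser

SKIPPED = {"script", "style", "nav", "header", "footer"}
INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*", "code": "`"}
HEADINGS = {f"h{n}": "#" * n + " " for n in range(1, 7)}


class MarkdownSketch(HTMLParser):
    """Collects markdown fragments while walking the HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes entities in text
        self.out = []
        self.skip_depth = 0   # > 0 while inside a removed element
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADINGS:
            self.out.append("\n\n" + HEADINGS[tag])
        elif tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag == "a":
            self.href = attrs.get("href") or ""
            self.out.append("[")
        elif tag == "img":
            self.out.append(f"![{attrs.get('alt') or ''}]({attrs.get('src') or ''})")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "p":
            self.out.append("\n\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag == "a" and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None
        elif tag in HEADINGS or tag == "p":
            self.out.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    text = "".join(parser.out)
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

A production converter would also need to handle `<blockquote>`, `<pre>`, nested lists, and malformed markup, which this sketch leaves out.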

Server Record Fields

When a publisher server is created, these additional fields are set on the xpay-mcp-proxy-servers record:

Field	Type	Description
`upstreamType`	String	`"content"` — tells the MCP Proxy to route to the content server
`feedUrl`	String	The RSS/sitemap URL
`feedType`	String	`"rss2.0"`, `"atom"`, `"rss1.0"`, or `"sitemap"`
`articleCount`	Number	Total indexed articles
`lastRefreshed`	Number	Timestamp of last successful refresh
`category`	String	`"Content & Knowledge"`
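
For readers who prefer a typed view, the fields above could be modeled roughly as follows in Python. The field names, types, and allowed values come from the table. The `new_content_fields` helper, the zero initial values, and the reading of lastRefreshed as an integer timestamp are assumptions for illustration only.

```python
from typing import Literal, TypedDict, get_args

FeedType = Literal["rss2.0", "atom", "rss1.0", "sitemap"]


class ContentServerFields(TypedDict):
    upstreamType: Literal["content"]
    feedUrl: str
    feedType: FeedType
    articleCount: int
    lastRefreshed: int   # assumption: integer timestamp
    category: str


def new_content_fields(feed_url: str, feed_type: str) -> ContentServerFields:
    """Hypothetical helper: build the extra fields for a new publisher server."""
    if feed_type not in get_args(FeedType):
        raise ValueError(f"unsupported feedType: {feed_type!r}")
    return {
        "upstreamType": "content",
        "feedUrl": feed_url,
        "feedType": feed_type,
        "articleCount": 0,    # assumed starting value before the first index run
        "lastRefreshed": 0,   # assumed placeholder until the first refresh
        "category": "Content & Knowledge",
    }
```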