Your Slack MCP Isn’t Broken — Your AI Just Can’t Read It
Connecting Slack to an AI agent is easy. Making that data actually useful to an LLM is the hard part. We improved retrieval accuracy by 27% with three changes: thread-level document splitting, noise filtering, and markup cleanup.
How we improved Slack retrieval for AI agents by rethinking document boundaries, not just connections.
If you’re building an AI agent that answers questions from Slack data — whether via MCP tools, RAG pipelines, or any other integration — this post is for you.
The problem
When you connect Slack to an AI agent, you expect the agent to understand your team’s conversations. Here’s what Slack data actually looks like when it reaches the model:
```
[2026-03-20 14:23] U0847XJKL: :tada: shipped! :rocket:
[2026-03-20 14:24] U0923MXPQ: <@U0847XJKL> :+1:
[2026-03-20 14:25] U0847XJKL: <!channel> FYI migration complete
[2026-03-20 14:30] U1029RLST: <!here> should we update the docs?
[2026-03-20 14:31] U0847XJKL: yes pls &amp; also update the <link>
```
:tada: consumes tokens but carries zero information. U0847XJKL can’t be resolved by the model. <!channel> and &amp; are Slack-internal syntax passed through as-is. And the biggest problem: thread discussions get interleaved with unrelated messages, destroying conversational context.
This isn’t a bug in any particular tool or MCP server. Connecting a data source to an LLM is not the same as making that data understandable to an LLM.
Why this matters
Humans naturally ignore emoji reactions, follow threads, and infer meaning from brief messages. LLMs can’t. Every token gets equal weight — :tada: and a critical architecture decision receive the same attention.
Anthropic’s Writing Effective Tools for Agents guide makes the case that tool design directly impacts agent performance. The same principle applies to tool outputs: what the LLM receives matters as much as how it’s connected. A study analyzing 856 tools across 103 MCP servers found that 97.1% had quality issues in their descriptions alone — fixing these improved task success rates by a median of 5.85 percentage points.
The question isn’t “how do we connect Slack?” — that’s solved. The question is what happens to the data between the source and the LLM.
What we did
Three changes to how Slack messages are preprocessed before indexing. Thread splitting had the largest impact. Cleanup and filtering were straightforward engineering, but they compounded.
Thread-level document splitting
Most systems bundle Slack messages chronologically — 200 messages per file, in order. This is how our system worked before:
```
[Message 1] Hey, anyone looked at the auth bug?
[Message 2] :wave:
[Message 3] Let's discuss Q2 roadmap
[Message 4] RE: auth bug - I think it's a session issue...
...
[Message 200] ...
```
Messages 1 and 4 belong to the same thread, but they’re separated by unrelated content. The embedding for this 200-message document can’t represent any single topic well. Searching for “auth bug” returns the entire file, with Q2 roadmap discussions and emoji reactions competing for attention.
The fix was simple: use the thread boundaries Slack already provides. Each thread becomes its own document:
```
Document A: "Auth bug discussion" (thread, 12 messages)
Document B: "Q2 roadmap planning" (thread, 8 messages)
Document C: Standalone messages (noise filtered)
```
A thread is a self-contained conversation. Question, discussion, and conclusion all live in one document. Search returns the relevant thread, and the LLM sees coherent context.
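A minimal sketch of the split, assuming each message is a dict carrying Slack’s standard `ts` and `thread_ts` fields (the function name and document shape are ours, not any particular library’s API):

```python
from collections import defaultdict

def split_into_threads(messages):
    """Group Slack messages into thread-level documents.

    In Slack's API, a thread parent has thread_ts == ts and every
    reply carries the parent's thread_ts; standalone messages have
    no thread_ts at all.
    """
    threads = defaultdict(list)
    standalone = []
    for msg in messages:
        thread_ts = msg.get("thread_ts")
        if thread_ts:
            threads[thread_ts].append(msg)
        else:
            standalone.append(msg)
    # Keep each thread in chronological order so the document reads
    # as a conversation: question, discussion, conclusion.
    documents = [
        sorted(msgs, key=lambda m: float(m["ts"]))
        for msgs in threads.values()
    ]
    return documents, standalone
```

Each element of `documents` becomes one indexable unit; standalone messages still go through the noise filter described in the next section.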
This is consistent with what the research shows. Work on chat disentanglement has demonstrated that treating multiparty chat as a single stream breaks both retrieval and comprehension. LongRAG (2024) found the same for RAG: fixed-size chunking breaks context, and document boundaries need to align with semantic units. Snyk’s Slack RAG write-up reported similar gains from restructuring threads into coherent knowledge objects.
Noise filtering and markup cleanup
For standalone messages (not thread replies — where short responses like “ok” carry meaning in context), we drop emoji-only messages and anything under 5 words.
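As a rough sketch, that filter might look like this (the five-word threshold comes from above; the emoji pattern is a simplification of Slack’s shortcode syntax):

```python
import re

EMOJI_CODE = re.compile(r":[a-z0-9_+\-]+:")

def is_noise(text, in_thread=False):
    """Return True for standalone messages worth dropping.

    Thread replies are always kept: short answers like "ok" carry
    meaning inside a conversation.
    """
    if in_thread:
        return False
    stripped = EMOJI_CODE.sub("", text).strip()
    if not stripped:
        return True  # emoji-only message
    return len(stripped.split()) < 5  # under five words
```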
For all messages, we clean Slack markup before indexing:
| Raw Slack | Cleaned | Why |
| --- | --- | --- |
| :tada: shipped! :rocket: | shipped! | Emoji codes are zero-information tokens |
| <!channel> | @channel | Slack syntax to readable text |
| <!subteam^S1234\|@backend-team> | @backend-team | Resolve subteam mentions |
| &amp; &lt; &gt; | & < > | Decode HTML entities |
These aren’t format changes — they’re pure noise removal. The tokens being eliminated carry no semantic content at all.
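A condensed sketch of the cleanup pass using the patterns from the table (these rules cover only the cases shown above, not Slack’s full formatting grammar):

```python
import html
import re

def clean_slack_markup(text):
    """Rewrite Slack markup into plain text an LLM can read."""
    text = text.replace("<!channel>", "@channel").replace("<!here>", "@here")
    # <!subteam^S1234|@backend-team> -> @backend-team
    text = re.sub(r"<!subteam\^[A-Z0-9]+\|(@?[^>]+)>", r"\1", text)
    # <https://example.com|link text> -> link text; bare <url> -> url
    text = re.sub(r"<(https?://[^|>]+)\|([^>]+)>", r"\2", text)
    text = re.sub(r"<(https?://[^>]+)>", r"\1", text)
    # Strip emoji shortcodes such as :tada:
    text = re.sub(r":[a-z0-9_+\-]+:", "", text)
    # Slack escapes &, <, > as HTML entities; decode them last so the
    # decoded characters can't be mistaken for the markup handled above.
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()
```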
Results
We ran an end-to-end evaluation on real internal Slack data: 3 channels, 86 QA pairs. Each question required information from thread replies — “What did we decide?”, “What was the root cause?”, “How was that resolved?” The kind of questions real users ask their team’s AI assistant.
Same data, two indexing approaches: raw 200-message buckets vs thread-split documents. Retrieved context was passed to an LLM, and we scored whether the answer was factually correct against ground truth.

| Metric | Raw Bucket | Thread-Split |
| --- | --- | --- |
| Correct answers | 51 (59.3%) | 65 (75.6%) |
| Unanswerable | 35 (40.7%) | 20 (23.3%) |
| Avg factual score | 0.593 | 0.744 (+25%) |
| Head-to-head wins | 10 | 23 (70%) |
Thread-split improved correct answer rate by 27% and cut unanswerable queries by 43%.
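Both relative figures follow directly from the table; spelling out the arithmetic:

```python
raw_correct, split_correct = 51, 65              # out of 86 questions
raw_unanswerable, split_unanswerable = 35, 20

# Relative improvement in correct answers: (65 - 51) / 51, about 27%
correct_gain = (split_correct - raw_correct) / raw_correct

# Relative cut in unanswerable queries: (35 - 20) / 35, about 43%
unanswerable_cut = (raw_unanswerable - split_unanswerable) / raw_unanswerable
```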
Where it mattered most
The advantage was most dramatic for questions about conclusions and decisions buried deep in long threads:
Notion KB sync failure — The root cause was in the 24th reply out of 25: two issues (API deprecation + missing retry logic) identified in a single summary message. Thread-split retrieved the entire investigation as one document — from initial report to final diagnosis. Raw bucket returned only the first few messages, where the team was still asking “what happened?”
Slack outage response — The team rolled out a feature flag to display an outage banner, then 5 hours later undid the rollout when the issue resolved. The rollout was in message 21, the undo in message 40. Thread-split captured the full story. Raw bucket cut off before the resolution.
Accidental billing — Three users were overcharged during a free trial. The resolution (refund + free access + follow-up) spanned multiple late replies. Thread-split returned the complete thread. Raw bucket retrieved a different billing thread from a different time period — confidently wrong context.
The pattern is the same in each case: when the answer lives at the end of a conversation, bucket-based chunking loses it.
What we learned
The real engineering in connecting Slack to an AI agent isn’t the connection. It’s what happens to the data afterward.
For Slack, this meant three things: aligning document boundaries with conversations instead of time windows, filtering noise before it reaches the index, and cleaning raw API output into something an LLM can parse.
These principles aren’t Slack-specific. Confluence has HTML macros that turn clean content into parser-hostile markup. Notion database tables look nothing like how humans read them. Google Sheets returns JSON arrays that waste tokens on structure instead of content. Every source has its own version of “data that LLMs struggle to read.”
Connecting data sources is a solved problem. Making that data genuinely useful to an LLM is not.
Next up: how we applied the same principles to Confluence and Notion.
