Data Persistence Strategy for Robo-Hub

Executive Summary

This document outlines the data capture and persistence strategy for Robo-Hub's negotiation system, focusing on leveraging Claude Code's built-in PDF extraction capabilities for automated data structuring.


1. Available Claude Code Skills for Data Extraction

PDF Skill (Built-in to Claude Code)

Claude Code has native PDF processing capabilities that can:

  • ✅ Extract text from PDF documents

  • ✅ Parse structured data (tables, pricing lists)

  • ✅ Analyze images embedded in PDFs

  • ✅ Extract metadata (dates, supplier names, pricing tiers)

CSV/Spreadsheet Processing

  • ✅ Parse pricing catalogs in CSV/Excel format

  • ✅ Extract tabular data automatically

  • ✅ Convert to structured JSON for database insertion

Image Analysis

  • ✅ Extract text from images (OCR)

  • ✅ Analyze vehicle inspection photos

  • ✅ Process invoice scans


Phase 1: Automated PDF Extraction (Using Claude Skills)

When a supplier uploads a tire catalog PDF in chat:

Example Extracted Data from Tire Catalog PDF:


3. Database Schema (Aligned with Robo-Dapp)

New Tables for Robo-Hub


4. Implementation Plan with Claude Skills

Phase 1: MVP (Weeks 1-2)

Objective: Basic persistence with manual data entry

What this enables:

  • ✅ Chat history persists across sessions

  • ✅ Shepherds can reference past conversations

  • ✅ Basic audit trail for negotiations


Phase 2: PDF Extraction with Claude (Weeks 3-4)

Objective: Automated data extraction from supplier documents

What this enables:

  • ✅ Auto-extract pricing from PDF catalogs

  • ✅ Supplier uploads catalog once, data available for all negotiations

  • ✅ AI can compare quotes: "This quote is 12% above their catalog price"

  • ✅ Shepherds see structured comparison tables


Phase 3: AI-Powered Insights (Weeks 5-8)

Objective: Cross-shepherd analytics and recommendations


5. Privacy & Data Governance

Anonymization Strategy

Data Retention Policy


6. Cost Analysis

Storage Costs (AWS S3)

  • PDFs: ~2MB each × 1,000 docs/year = 2GB = $0.05/month

  • Database: ~500MB structured data = $0.10/month (PostgreSQL)

Claude API Costs (PDF Extraction)

  • Input: 2MB PDF = ~5,000 tokens

  • Output: 500 tokens (structured JSON)

  • Cost per extraction: ~$0.05

  • Volume: 100 PDFs/month = $5/month

Total Monthly Cost: ~$5.15/month (MVP scale)


7. Immediate Next Steps

Action Items:

  1. Create database migration with schema above

  2. Implement ConversationService (basic persistence)

  3. Set up Anthropic API key for PDF extraction

  4. Build DocumentExtractionService with Claude integration

  5. Add file upload UI to chat interfaces

  6. Test PDF extraction with real tire catalogs (Phase 2)

  7. Build pricing comparison UI (Phase 3)


8. Questions for Decision

  1. Storage preference: S3 (AWS) or R2 (Cloudflare) or IPFS (decentralized)?

  2. Immediate implementation: Should we start with Phase 1 (basic persistence) now?

  3. PDF extraction priority: Is automated extraction critical for MVP or can it wait?

  4. Privacy level: How anonymous should cross-shepherd analytics be?

Recommendation: Start with Phase 1 (basic chat persistence) this week, add PDF extraction (Phase 2) when you have 5+ pilot suppliers uploading catalogs.

Last updated

Was this helpful?