Data Persistence Strategy for Robo-Hub

Executive Summary

This document outlines the data capture and persistence strategy for Robo-Hub's negotiation system, focusing on leveraging Claude Code's built-in PDF extraction capabilities for automated data structuring.

1. Available Claude Code Skills for Data Extraction

PDF Skill (Built-in to Claude Code)

Claude can process PDFs natively. The implementation in Section 4 sends each file to the Anthropic Messages API as a document content block, which lets it:

✅ Extract text from PDF documents
✅ Parse structured data (tables, pricing lists)
✅ Analyze images embedded in PDFs
✅ Extract metadata (dates, supplier names, pricing tiers)

CSV/Spreadsheet Processing

✅ Parse pricing catalogs in CSV/Excel format
✅ Extract tabular data automatically
✅ Convert to structured JSON for database insertion

Image Analysis

✅ Extract text from images (OCR)
✅ Analyze vehicle inspection photos
✅ Process invoice scans

2. Recommended Architecture (Revised with Skills)

Upload Flow: Automated PDF Extraction (Using Claude Skills)

When a supplier uploads a tire catalog PDF in chat:

// Frontend: Supplier uploads PDF
const handleFileUpload = async (file: File, conversationId: string) => {
  // 1. Upload to temporary storage (S3/R2)
  const fileUrl = await uploadToS3(file);

  // 2. Send to Claude API with PDF skill for extraction
  const extractedData = await extractPDFData(fileUrl);

  // 3. Store both raw file and structured data
  await saveAttachment({
    conversationId,
    fileName: file.name,
    fileUrl,
    fileType: 'pdf',
    extractedData: extractedData.structured,
    rawText: extractedData.fullText
  });

  // 4. AI automatically suggests pricing comparison
  const comparison = await compareWithMarket(extractedData.structured.pricing_tiers);

  return { extractedData, comparison };
};

Example Extracted Data from Tire Catalog PDF:

{
  "supplier": "Cyber Tire & Wheel",
  "document_date": "2025-01-15",
  "catalog_type": "tire_services",
  "pricing_tiers": [
    {
      "tier_name": "Basic Fleet Package",
      "price_per_vehicle_month": 140,
      "includes_tpms": false,
      "warranty_months": 12,
      "features": [
        "Standard tire rotation",
        "Emergency replacement",
        "24hr hotline"
      ]
    },
    {
      "tier_name": "Premium TPMS Package",
      "price_per_vehicle_month": 160,
      "includes_tpms": true,
      "warranty_months": 24,
      "features": [
        "All Basic features",
        "Real-time pressure monitoring",
        "Predictive alerts",
        "Dashboard integration"
      ]
    }
  ],
  "volume_discounts": [
    { "min_vehicles": 50, "discount_percent": 5 },
    { "min_vehicles": 100, "discount_percent": 8 },
    { "min_vehicles": 200, "discount_percent": 12 }
  ],
  "payment_terms": "Net-30 standard, Net-0 available with USDC",
  "service_zones": ["San Francisco", "Oakland", "San Jose"]
}
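
Because the extracted JSON comes from a model, it is worth checking its shape before anything is written to the database. The following is a minimal sketch of such a check. The field names come from the example above; the type names and the `isExtractedCatalog` function are hypothetical and not part of any existing Robo-Hub code.

```typescript
// Hypothetical types mirroring a subset of the example JSON above.
interface PricingTier {
  tier_name: string;
  price_per_vehicle_month: number;
  includes_tpms: boolean;
  warranty_months: number;
  features: string[];
}

interface VolumeDiscount {
  min_vehicles: number;
  discount_percent: number;
}

interface ExtractedCatalog {
  supplier: string;
  pricing_tiers: PricingTier[];
  volume_discounts: VolumeDiscount[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isNonNegativeNumber(value: unknown): value is number {
  return typeof value === "number" && Number.isFinite(value) && value >= 0;
}

function isPricingTier(value: unknown): value is PricingTier {
  if (!isRecord(value)) return false;
  const { tier_name, price_per_vehicle_month, includes_tpms, warranty_months, features } = value;
  return (
    typeof tier_name === "string" &&
    isNonNegativeNumber(price_per_vehicle_month) &&
    typeof includes_tpms === "boolean" &&
    isNonNegativeNumber(warranty_months) &&
    Array.isArray(features) &&
    features.every((f) => typeof f === "string")
  );
}

function isVolumeDiscount(value: unknown): value is VolumeDiscount {
  if (!isRecord(value)) return false;
  const { min_vehicles, discount_percent } = value;
  return isNonNegativeNumber(min_vehicles) && isNonNegativeNumber(discount_percent);
}

// Returns true only when the required fields exist with the expected types.
function isExtractedCatalog(value: unknown): value is ExtractedCatalog {
  if (!isRecord(value)) return false;
  const { supplier, pricing_tiers, volume_discounts } = value;
  return (
    typeof supplier === "string" &&
    Array.isArray(pricing_tiers) &&
    pricing_tiers.every(isPricingTier) &&
    Array.isArray(volume_discounts) &&
    volume_discounts.every(isVolumeDiscount)
  );
}
```

A guard like this rejects common extraction slips, such as a price returned as the string "140" instead of a number, before they reach a NUMERIC column.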

3. Database Schema (Aligned with Robo-Dapp)

New Tables for Robo-Hub

-- Conversations (chat threads between shepherd and supplier)
CREATE TABLE conversations (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  rfq_id UUID REFERENCES rfqs(id),
  shepherd_id UUID NOT NULL,
  supplier_id UUID NOT NULL,
  service_category TEXT NOT NULL, -- 'TIRES', 'PARKING', etc.
  status TEXT DEFAULT 'active', -- 'active', 'agreed', 'archived'
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Messages within conversations
CREATE TABLE messages (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  conversation_id UUID REFERENCES conversations(id) ON DELETE CASCADE,
  sender_type TEXT NOT NULL, -- 'shepherd', 'supplier', 'ai', 'system'
  sender_id UUID,
  message_text TEXT NOT NULL,
  metadata JSONB, -- For AI-generated insights, system notifications
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- File attachments (PDFs, images, spreadsheets)
CREATE TABLE attachments (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  message_id UUID REFERENCES messages(id) ON DELETE CASCADE,
  conversation_id UUID REFERENCES conversations(id) ON DELETE CASCADE,
  file_type TEXT NOT NULL, -- 'pdf', 'image', 'csv', 'xlsx'
  file_name TEXT NOT NULL,
  file_url TEXT NOT NULL, -- S3/R2/IPFS URL
  file_size_bytes BIGINT,

  -- Claude-extracted data
  extracted_text TEXT, -- Full OCR/text extraction
  extracted_data JSONB, -- Structured pricing/catalog data
  extraction_status TEXT DEFAULT 'pending', -- 'pending', 'completed', 'failed'
  extraction_error TEXT,

  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Structured service quotes (extracted from PDFs or manually entered)
CREATE TABLE service_quotes (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  attachment_id UUID REFERENCES attachments(id) ON DELETE SET NULL, -- keep the quote if the source file is deleted
  supplier_id UUID NOT NULL,
  shepherd_id UUID, -- NULL if generic catalog
  service_category TEXT NOT NULL,

  -- Pricing details
  tier_name TEXT,
  price_per_unit NUMERIC(10,2) NOT NULL,
  unit_type TEXT DEFAULT 'per_vehicle_month',

  -- Service specifics (flexible JSONB for category-specific fields)
  includes_tpms BOOLEAN,
  warranty_months INTEGER,
  response_time_minutes INTEGER,
  coverage_zones TEXT[],
  features JSONB,

  -- Discounts
  volume_discounts JSONB, -- [{ min_vehicles: 50, discount_percent: 5 }]

  -- Validity
  valid_from TIMESTAMPTZ DEFAULT NOW(),
  valid_until TIMESTAMPTZ,

  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Negotiation outcomes (for AI training and analytics)
CREATE TABLE negotiation_outcomes (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  conversation_id UUID REFERENCES conversations(id) ON DELETE SET NULL, -- outcomes outlive deleted conversations
  rfq_id UUID REFERENCES rfqs(id),
  shepherd_id UUID NOT NULL,
  supplier_id UUID NOT NULL,
  service_category TEXT NOT NULL,

  -- Pricing journey
  shepherd_initial_budget NUMERIC(10,2),
  supplier_initial_quote NUMERIC(10,2),
  final_agreed_price NUMERIC(10,2),
  discount_percent NUMERIC(5,2),

  -- Negotiation dynamics
  num_messages INTEGER DEFAULT 0,
  duration_hours NUMERIC(8,2),
  outcome TEXT NOT NULL, -- 'agreed', 'rejected_by_shepherd', 'rejected_by_supplier', 'abandoned'

  -- Context for AI learning
  key_factors JSONB, -- e.g., { "tpms_required": true, "urgent": false, "fleet_size": 50 }

  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Cross-shepherd pricing benchmarks (anonymized)
CREATE TABLE pricing_benchmarks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  service_category TEXT NOT NULL,
  zone TEXT NOT NULL,

  -- Fleet characteristics (anonymized)
  fleet_size_range TEXT, -- '1-25', '26-50', '51-100', '100+'

  -- Pricing stats
  avg_price NUMERIC(10,2),
  min_price NUMERIC(10,2),
  max_price NUMERIC(10,2),
  median_price NUMERIC(10,2),
  sample_size INTEGER,

  -- Service specifics
  includes_tpms BOOLEAN,
  service_tier TEXT,

  -- Time window
  calculated_at TIMESTAMPTZ DEFAULT NOW(),
  data_from_date TIMESTAMPTZ,
  data_to_date TIMESTAMPTZ
);

-- Indexes for performance
CREATE INDEX idx_conversations_rfq ON conversations(rfq_id);
CREATE INDEX idx_conversations_participants ON conversations(shepherd_id, supplier_id);
CREATE INDEX idx_messages_conversation ON messages(conversation_id);
CREATE INDEX idx_attachments_conversation ON attachments(conversation_id);
CREATE INDEX idx_quotes_supplier_category ON service_quotes(supplier_id, service_category);
CREATE INDEX idx_benchmarks_category_zone ON pricing_benchmarks(service_category, zone);
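
The volume_discounts column stores tiers in the shape [{ min_vehicles: 50, discount_percent: 5 }]. As an illustration of how those tiers might be applied to a quote, here is a small sketch. It assumes the discount for the highest qualifying tier applies to the whole fleet rather than marginally, which the source does not specify; `discountPercentFor` and `monthlyTotal` are hypothetical helper names.

```typescript
interface VolumeDiscount {
  min_vehicles: number;
  discount_percent: number;
}

// Picks the discount of the highest tier the fleet qualifies for (0 if none).
// Tiers do not need to be pre-sorted.
function discountPercentFor(fleetSize: number, tiers: VolumeDiscount[]): number {
  let bestMin = -Infinity;
  let bestPercent = 0;
  for (const tier of tiers) {
    if (fleetSize >= tier.min_vehicles && tier.min_vehicles > bestMin) {
      bestMin = tier.min_vehicles;
      bestPercent = tier.discount_percent;
    }
  }
  return bestPercent;
}

// Monthly total for the fleet after the volume discount, rounded to cents.
// Assumption: the discount applies to every vehicle in the fleet.
function monthlyTotal(
  pricePerVehicleMonth: number,
  fleetSize: number,
  tiers: VolumeDiscount[]
): number {
  const percent = discountPercentFor(fleetSize, tiers);
  const total = pricePerVehicleMonth * fleetSize * (1 - percent / 100);
  return Math.round(total * 100) / 100;
}
```

With the example catalog's tiers, a 120-vehicle fleet on the $160 package qualifies for the 8% tier.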

4. Implementation Plan with Claude Skills

Phase 1: MVP (Weeks 1-2)

Objective: Basic persistence with manual data entry

// services/ConversationService.ts
export class ConversationService {
  async createConversation(rfqId: string, shepherdId: string, supplierId: string) {
    // Create conversation thread
    const conversation = await db.conversations.create({
      rfq_id: rfqId,
      shepherd_id: shepherdId,
      supplier_id: supplierId,
      service_category: await getRFQCategory(rfqId)
    });

    // Auto-create initial system message
    await this.addMessage(conversation.id, 'system', null,
      `Conversation started for RFQ #${rfqId}`
    );

    return conversation;
  }

  async addMessage(conversationId: string, senderType: string, senderId: string | null, text: string) {
    return await db.messages.create({
      conversation_id: conversationId,
      sender_type: senderType,
      sender_id: senderId,
      message_text: text
    });
  }

  async getConversationHistory(conversationId: string) {
    return await db.messages
      .where({ conversation_id: conversationId })
      .orderBy('created_at', 'asc');
  }
}

What this enables:

✅ Chat history persists across sessions
✅ Shepherds can reference past conversations
✅ Basic audit trail for negotiations

Phase 2: PDF Extraction with Claude (Weeks 3-4)

Objective: Automated data extraction from supplier documents

// services/DocumentExtractionService.ts
import Anthropic from '@anthropic-ai/sdk';

export class DocumentExtractionService {
  private anthropic: Anthropic;

  constructor() {
    this.anthropic = new Anthropic({
      apiKey: process.env.ANTHROPIC_API_KEY
    });
  }

  async extractPricingFromPDF(pdfUrl: string, serviceCategory: string) {
    // 1. Fetch PDF as base64
    const pdfBuffer = await fetch(pdfUrl).then(r => r.arrayBuffer());
    const pdfBase64 = Buffer.from(pdfBuffer).toString('base64');

    // 2. Send to Claude with PDF analysis prompt
    const response = await this.anthropic.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 4096,
      messages: [{
        role: 'user',
        content: [
          {
            type: 'document',
            source: {
              type: 'base64',
              media_type: 'application/pdf',
              data: pdfBase64
            }
          },
          {
            type: 'text',
            text: `Extract pricing data from this ${serviceCategory} service catalog. Return a JSON object with:
            - supplier_name
            - pricing_tiers (array of { tier_name, price_per_vehicle_month, includes_tpms, warranty_months, features })
            - volume_discounts (array of { min_vehicles, discount_percent })
            - payment_terms
            - service_zones

            Only return valid JSON, no markdown formatting.`
          }
        ]
      }]
    });

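    // 3. Parse Claude's response. The first content block should be text;
    // JSON.parse throws if the model returned anything other than bare JSON.
    const firstBlock = response.content[0];
    if (!firstBlock || firstBlock.type !== 'text') {
      throw new Error('Expected a text block in the Claude response');
    }
    const extractedText = firstBlock.text;
    const extractedData = JSON.parse(extractedText);

    return {
      fullText: extractedText, // raw model output (the JSON string), not the PDF's full text
      structured: extractedData
    };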
  }

  async processUploadedDocument(
    file: File,
    conversationId: string,
    messageId: string
  ) {
    // 1. Upload to S3
    const fileUrl = await this.uploadToS3(file);

    // 2. Save attachment record (pending extraction)
    const attachment = await db.attachments.create({
      message_id: messageId,
      conversation_id: conversationId,
      file_type: file.type, // MIME type (e.g. 'application/pdf'); map to 'pdf' if the column stores short names
      file_name: file.name,
      file_url: fileUrl,
      extraction_status: 'pending'
    });

    // 3. Run extraction (awaited inline here; move to a job queue for true background processing)
    try {
      const extracted = await this.extractPricingFromPDF(
        fileUrl,
        await getConversationCategory(conversationId)
      );

      // 4. Update attachment with extracted data
      await db.attachments.update(attachment.id, {
        extracted_text: extracted.fullText,
        extracted_data: extracted.structured,
        extraction_status: 'completed'
      });

      // 5. Create service_quotes records. supplier_id and service_category come
      // from the conversation, because the extraction prompt does not return them.
      const conversation = await db.conversations.findById(conversationId);
      await this.createQuotesFromExtraction(
        attachment.id,
        conversation.supplier_id,
        conversation.service_category,
        extracted.structured
      );

      return { attachment, extracted };
    } catch (error) {
      await db.attachments.update(attachment.id, {
        extraction_status: 'failed',
        extraction_error: error instanceof Error ? error.message : String(error)
      });
      throw error;
    }
  }

  async createQuotesFromExtraction(
    attachmentId: string,
    supplierId: string,
    serviceCategory: string,
    extractedData: any
  ) {
    const quotes = [];

    for (const tier of extractedData.pricing_tiers || []) {
      const quote = await db.service_quotes.create({
        attachment_id: attachmentId,
        supplier_id: supplierId,
        service_category: serviceCategory,
        tier_name: tier.tier_name,
        price_per_unit: tier.price_per_vehicle_month,
        includes_tpms: tier.includes_tpms,
        warranty_months: tier.warranty_months,
        features: tier.features,
        volume_discounts: extractedData.volume_discounts
      });
      quotes.push(quote);
    }

    return quotes;
  }
}

What this enables:

✅ Auto-extract pricing from PDF catalogs
✅ Supplier uploads catalog once, data available for all negotiations
✅ AI can compare quotes: "This quote is 12% above their catalog price"
✅ Shepherds see structured comparison tables
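
Even when a prompt asks for JSON only, a model may occasionally wrap its answer in a markdown code fence, which makes a bare JSON.parse throw. The sketch below shows one way to tolerate that case. `parseModelJson` is a hypothetical helper, not part of the Anthropic SDK, and it handles only a single fence around the whole reply.

```typescript
// Three backticks, built at runtime so this listing contains no literal fence.
const FENCE = "`".repeat(3);

// Matches a reply that is entirely one fenced block, with an optional
// language tag (for example "json") after the opening fence.
const FENCED_REPLY = new RegExp(
  "^" + FENCE + "[a-zA-Z]*\\s*\\n([\\s\\S]*?)\\n?" + FENCE + "$"
);

// Parses a model reply as JSON, unwrapping one surrounding code fence if present.
// Throws an Error with a short preview of the reply when parsing fails.
function parseModelJson(raw: string): unknown {
  const trimmed = raw.trim();
  const match = trimmed.match(FENCED_REPLY);
  const candidate = match && match[1] !== undefined ? match[1].trim() : trimmed;
  try {
    return JSON.parse(candidate);
  } catch {
    throw new Error("Model reply is not valid JSON: " + candidate.slice(0, 80));
  }
}
```

The error message keeps a short preview of the reply so that it can be stored in extraction_error for later inspection.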

Phase 3: AI-Powered Insights (Weeks 5-8)

Objective: Cross-shepherd analytics and recommendations

// services/MarketIntelligenceService.ts
export class MarketIntelligenceService {
  async generatePricingBenchmark(
    serviceCategory: string,
    zone: string,
    fleetSize: number,
    includesTPMS: boolean
  ) {
    // 1. Query historical negotiations.
    // NOTE: the schema above has no zone column on conversations or
    // negotiation_outcomes, so this sketch assumes zone, fleet_size_range,
    // and includes_tpms are all recorded in key_factors.
    const outcomes = await db.negotiation_outcomes
      .where({
        service_category: serviceCategory,
        outcome: 'agreed'
      })
      .whereRaw("key_factors->>'zone' = ?", [zone])
      .whereRaw("key_factors->>'fleet_size_range' = ?", [getFleetSizeRange(fleetSize)])
      .whereRaw("key_factors->>'includes_tpms' = ?", [String(includesTPMS)]);

    // 2. Calculate statistics (NUMERIC columns may arrive as strings, so convert)
    const prices = outcomes.map(o => Number(o.final_agreed_price));
    const benchmark = {
      service_category: serviceCategory,
      zone,
      avg_price: avg(prices),
      min_price: min(prices),
      max_price: max(prices),
      median_price: median(prices),
      sample_size: prices.length,
      includes_tpms: includesTPMS
    };

    // 3. Store benchmark
    await db.pricing_benchmarks.create(benchmark);

    return benchmark;
  }

  async getAIRecommendation(conversationId: string, supplierQuote: number) {
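    const conversation = await db.conversations.findById(conversationId);
    // NOTE: conversations has no zone column in the schema above;
    // conversation.zone assumes one is added.
    const benchmark = await this.getBenchmark(
      conversation.service_category,
      conversation.zone
    );

    // Without a benchmark (or with a zero average) the percentage is undefined
    if (!benchmark || !benchmark.avg_price) {
      return { type: 'neutral', message: 'Not enough market data to benchmark this quote.', avgPrice: null, yourQuote: supplierQuote };
    }

    const percentDiff = ((supplierQuote - benchmark.avg_price) / benchmark.avg_price) * 100;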

    if (percentDiff > 10) {
      return {
        type: 'warning',
        message: `This quote is ${percentDiff.toFixed(1)}% above market average ($${benchmark.avg_price}). Consider negotiating or exploring alternatives.`,
        avgPrice: benchmark.avg_price,
        yourQuote: supplierQuote
      };
    } else if (percentDiff < -10) {
      return {
        type: 'opportunity',
        message: `Great deal! This quote is ${Math.abs(percentDiff).toFixed(1)}% below market average.`,
        avgPrice: benchmark.avg_price,
        yourQuote: supplierQuote
      };
    } else {
      return {
        type: 'neutral',
        message: `This quote is within market range (±10% of avg $${benchmark.avg_price}).`,
        avgPrice: benchmark.avg_price,
        yourQuote: supplierQuote
      };
    }
  }
}
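
The service above calls avg, min, max, median, and getFleetSizeRange without defining them. A minimal sketch of what those helpers might look like follows. The bucket labels match the fleet_size_range comment in the pricing_benchmarks schema; the `summarizePrices` wrapper and its null return for empty input are assumptions made for illustration.

```typescript
// Maps a fleet size to the bucket labels used in pricing_benchmarks.
function getFleetSizeRange(fleetSize: number): string {
  if (fleetSize <= 25) return "1-25";
  if (fleetSize <= 50) return "26-50";
  if (fleetSize <= 100) return "51-100";
  return "100+";
}

function avg(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function median(values: number[]): number {
  // Sort a copy numerically; the default sort compares as strings.
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

interface PriceSummary {
  avg_price: number;
  min_price: number;
  max_price: number;
  median_price: number;
  sample_size: number;
}

// Returns null for an empty sample so callers never store NaN or Infinity.
function summarizePrices(prices: number[]): PriceSummary | null {
  if (prices.length === 0) return null;
  return {
    avg_price: avg(prices),
    min_price: Math.min(...prices),
    max_price: Math.max(...prices),
    median_price: median(prices),
    sample_size: prices.length
  };
}
```

Returning null for an empty sample gives the caller a clear signal to skip writing a benchmark row when no agreed negotiations match the filters.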

5. Privacy & Data Governance

Anonymization Strategy

-- Privacy-preserving view for cross-shepherd analytics
CREATE VIEW anonymous_pricing_data AS
SELECT
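  service_category,
  -- Assumption: zone and fleet_size are recorded in key_factors;
  -- negotiation_outcomes as defined above has no zone or fleet_size column.
  key_factors->>'zone' AS zone,
  CASE
    WHEN (key_factors->>'fleet_size')::int <= 25 THEN '1-25'
    WHEN (key_factors->>'fleet_size')::int <= 50 THEN '26-50'
    WHEN (key_factors->>'fleet_size')::int <= 100 THEN '51-100'
    ELSE '100+'
  END AS fleet_size_range,
  final_agreed_price,
  -- NO shepherd_id or supplier_id exposed
  key_factors->>'includes_tpms' AS includes_tpms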
FROM negotiation_outcomes
WHERE outcome = 'agreed'
  AND created_at > NOW() - INTERVAL '90 days';
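
Removing shepherd_id and supplier_id does not help much if a group contains only one or two negotiations, because a single price can still point to a specific deal. One common safeguard is to suppress groups below a minimum sample size before publishing benchmarks. The sketch below illustrates the idea; the threshold `k`, the row type, and the `groupWithMinimumSample` function are all hypothetical.

```typescript
// One row of the anonymized view, as assumed for this sketch.
interface AnonymousPricingRow {
  service_category: string;
  zone: string;
  fleet_size_range: string;
  final_agreed_price: number;
}

// Groups prices by (category, zone, fleet size range) and drops every group
// with fewer than k rows, so no published statistic rests on a tiny sample.
function groupWithMinimumSample(
  rows: AnonymousPricingRow[],
  k: number
): Map<string, number[]> {
  const groups = new Map<string, number[]>();
  for (const row of rows) {
    const key = [row.service_category, row.zone, row.fleet_size_range].join("|");
    const prices = groups.get(key) ?? [];
    prices.push(row.final_agreed_price);
    groups.set(key, prices);
  }

  const kept = new Map<string, number[]>();
  groups.forEach((prices, key) => {
    if (prices.length >= k) kept.set(key, prices);
  });
  return kept;
}
```

The right value for k is a policy decision tied to the privacy question in Section 8; the sample_size column in pricing_benchmarks can record how many rows backed each published figure.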

Data Retention Policy

// Automated cleanup (run monthly)
export async function cleanupOldData() {
  // Delete archived conversations older than 2 years. Outcomes are kept for analytics,
  // which requires negotiation_outcomes.conversation_id to use ON DELETE SET NULL.
  await db.conversations
    .where('created_at', '<', twoYearsAgo())
    .where('status', 'archived')
    .delete();

  // Delete orphaned attachments
  await db.attachments
    .whereNotExists(
      db.messages.select('id').whereRaw('messages.id = attachments.message_id')
    )
    .delete();
}
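
The cleanup job relies on a twoYearsAgo() helper that is not defined in this document. A minimal sketch of one possible version follows; the optional `now` parameter is an assumption added so the cutoff can be tested deterministically.

```typescript
// Returns the instant exactly two calendar years before `now`, in UTC.
// A February 29 start date rolls over to March 1 in a non-leap target year.
function twoYearsAgo(now: Date = new Date()): Date {
  const cutoff = new Date(now.getTime());
  cutoff.setUTCFullYear(cutoff.getUTCFullYear() - 2);
  return cutoff;
}
```

Working in UTC keeps the cutoff consistent with the TIMESTAMPTZ columns regardless of the server's local time zone.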

6. Cost Analysis

Storage Costs (AWS S3)

PDFs: ~2MB each × 1,000 docs/year = 2GB = $0.05/month
Database: ~500MB structured data = $0.10/month (PostgreSQL)

Claude API Costs (PDF Extraction)

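Input: ~5,000 tokens for a short catalog (token count scales with page count and content, not file size)
Output: 500 tokens (structured JSON)
Cost per extraction: ~$0.05 (a budget figure with headroom; at Claude 3.5 Sonnet list pricing of $3 input / $15 output per million tokens, the estimate above works out to about $0.02)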
Volume: 100 PDFs/month = $5/month

Total Monthly Cost: ~$5.15/month (MVP scale)
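
The per-extraction figure can be reproduced from the token estimates. The sketch below assumes prices of $3 per million input tokens and $15 per million output tokens; these are assumptions that should be checked against current Anthropic pricing for whichever model is used. `extractionCostUsd` and `monthlyCostUsd` are hypothetical names.

```typescript
// Price per million tokens, in USD. The values used in the example are
// assumptions for illustration, not quoted from this document.
interface TokenPricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Cost of a single extraction call in USD.
function extractionCostUsd(
  inputTokens: number,
  outputTokens: number,
  pricing: TokenPricing
): number {
  return (
    (inputTokens * pricing.inputPerMTok + outputTokens * pricing.outputPerMTok) / 1e6
  );
}

// Monthly cost for a given number of documents with the same token profile.
function monthlyCostUsd(
  docsPerMonth: number,
  inputTokens: number,
  outputTokens: number,
  pricing: TokenPricing
): number {
  return docsPerMonth * extractionCostUsd(inputTokens, outputTokens, pricing);
}

const assumedPricing: TokenPricing = { inputPerMTok: 3, outputPerMTok: 15 };
const perExtraction = extractionCostUsd(5000, 500, assumedPricing);
console.log(perExtraction.toFixed(4)); // prints "0.0225"
```

Under these assumptions the ~$0.05 per-extraction budget leaves roughly a factor of two for longer catalogs or retries.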

7. Immediate Next Steps

Action Items:

✅ Create database migration with schema above
✅ Implement ConversationService (basic persistence)
✅ Set up Anthropic API key for PDF extraction
✅ Build DocumentExtractionService with Claude integration
✅ Add file upload UI to chat interfaces
⏳ Test PDF extraction with real tire catalogs (Phase 2)
⏳ Build pricing comparison UI (Phase 3)

8. Questions for Decision

Storage preference: S3 (AWS) or R2 (Cloudflare) or IPFS (decentralized)?
Immediate implementation: Should we start with Phase 1 (basic persistence) now?
PDF extraction priority: Is automated extraction critical for MVP or can it wait?
Privacy level: How anonymous should cross-shepherd analytics be?

Recommendation: Start with Phase 1 (basic chat persistence) this week, add PDF extraction (Phase 2) when you have 5+ pilot suppliers uploading catalogs.
