SocialPulse · Production Architecture

Multi-Platform Social Listening & Competitive Intelligence

A production-scale, multi-tenant blueprint for ingesting, normalizing, and analyzing public conversation across X, Facebook, Instagram, and TikTok — with geo-filtering, autonomous categorization, and entity-level competitive comparison. Tuned for the Kenyan market.

Multi-tenant SaaS Hybrid ingestion PostgreSQL + JSONB pgvector / RAG Event-driven workers Phase 1 → X / Twitter
PHASE 1 — START HERE

Twitter / X first: the highest-signal, lowest-friction wedge

X is the correct first platform for a Kenyan listening product. Kenyans on Twitter (KOT) is the loudest, most text-dense, most quote-able public square in the country — and X is the only one of the four with a usable, paid, official firehose-style API. Nail the end-to-end pipeline on X, then clone the pattern to the harder platforms.

Why X de-risks everything downstream

  • Public-by-default & text-first — no login wall on the data model itself; the payload is the text you actually need for NLP, not a video you must transcribe.
  • An official paid API exists — you can be ToS-compliant from day one and defer the legal/proxy complexity that FB/IG/TikTok force on you.
  • Richest geo & entity signal — bios, place tags, and KOT's habit of @-mentioning brands (Naivas, Quickmart, Carrefour) make entity extraction and Share-of-Voice work immediately.
  • It validates the whole spine — auth → keyword rule → ingest → normalize → dedup → classify → store → dashboard. Once that spine works for X, adding a platform is "write one adapter."

The X ingestion decision (production posture)

X's official API was repriced into tiers. As of 2025 the practical options for a startup:

PathWhat you getCost / limitsVerdict for production
X API — BasicApp-level read, recent search (~7-day), modest monthly post cap~$200/mo, ~10–15k posts/mo readMVP only — caps too low for real listening
X API — ProFull-archive search, higher caps, filtered stream~$5,000/moCompliant production path once revenue justifies it
Managed providers (Apify, Bright Data, TwitterAPI.io-style)Search + profile + timeline via their infra; they absorb proxy/ban riskUsage-priced (per 1k results)Pragmatic bridge: real coverage at MVP cost
Self-hosted stealth scrapingMax coverage, min data costProxy + maintenance + ban riskToS-violating & fragile — avoid as primary
Recommended hybrid: Build the ingestion worker behind a SourceAdapter interface so the same normalization/dedup/NLP pipeline runs regardless of where bytes come from. Start on a managed provider (fast, affordable coverage), keep the official X API adapter ready behind a feature flag, and flip to X API Pro for the streams that need ToS guarantees. You never rewrite the pipeline — you only swap the adapter.
Legal reality, stated plainly: scraping Meta, X, and TikTok violates their Terms of Service, and in some jurisdictions raises CFAA-style and data-protection questions (Kenya's Data Protection Act, 2019 applies once you store personal data of Kenyan residents). The compliant spine is: prefer official APIs, use reputable managed providers for gaps, store only public data, honor deletion, and register as a data controller where required. Treat self-hosted stealth scraping as a last resort with eyes open.

Phase-1 build order (X only)

1
Stand up auth + tenancy + one keyword_rule table (Section 2 schema).
2
Ship the XAdapter: takes a rule, returns raw posts. Provider-backed first.
3
Normalize → unified mentions row; dedup on content hash + platform id.
4
Geo-resolve (place tag → bio → mention text → tenant default) into geo JSONB.
5
Async NLP: sentiment + category + entities (Section 3). Write back to the row.
6
Dashboard: SoV, sentiment time-series, entity matrix — all from one table.

The starter worker that does steps 2–4 ships alongside this blueprint as x_ingestor.py.

1Scraper & Data Ingestion Architecture

A resilient ingestion tier is an adapter pattern behind a queue, not a pile of scrapers. Each platform is one adapter implementing a common contract; everything after the adapter is platform-agnostic.

Keyword rules
(per tenant)
Scheduler
(Celery beat)
Source Adapters
X · FB · IG · TikTok
Raw queue
(Kafka/SQS)
Normalizer
+ dedup
NLP queue
Postgres +
pgvector

1a · Official APIs vs stealth automation vs managed providers

PlatformOfficial APIRealityProduction choice
X / TwitterYes (Basic→Pro→Enterprise)Best official option of the four; archive search behind ProAPI (Pro) or managed provider
FacebookGraph API — only your own Pages/owned assetsNo general public-post search since the old Public Feed/CrowdTangle wind-down (Meta Content Library is research-gated)Meta Content Library if eligible; else managed provider for public Pages
InstagramGraph API — business/creator accounts you manage + hashtag search (limited)Hashtag Search API gives recent public media for a hashtag, with tight rate capsGraph hashtag API + managed provider for breadth
TikTokResearch API (gated) + Business/Display APIResearch API is approval-only and region-limited; otherwise no clean public searchManaged provider; Research API if you qualify

Net: only X gives you a clean pay-to-play public firehose. For the other three, a reputable managed provider is the realistic production source, with official APIs layered in where eligibility allows. Build for this asymmetry — don't assume parity across platforms.

1b · Resilience patterns (assume everything throttles and breaks)

1c · Inferring location when the user never states it

Explicit geo is rare. Resolve location with a confidence-scored cascade, stored as structured geo with a geo_confidence field so the UI can filter on trust level:

TierSignalConfidence
1Native geotag / place object (X place, IG location) → reverse-geocodeHigh (0.9+)
2Profile location field, parsed & gazetteer-matched ("Nai", "254", "Nrb" → Nairobi)Medium (0.6–0.8)
3In-text place mentions via NER + a Kenya gazetteer (Westlands, CBD, Kasarani, Mombasa…)Medium (0.5–0.7)
4Language/slang & currency cues (Sheng, "Ksh", "bob", KOT) → country-level priorLow (0.3–0.4)
5Network graph (who they reply to / who follows them) → cluster centroidLow, but improves with volume

Build a Kenya gazetteer (counties, towns, neighborhoods, malls, landmarks, common misspellings & Sheng aliases) as a Postgres table with trigram + full-text indexes. Combine signals in a small scoring function; never overwrite a high-confidence native tag with a low-confidence inference. Expose geo_confidence as a dashboard filter so a user can choose "Nairobi (confirmed)" vs "Nairobi (inferred)."

2Data Pipeline & Normalization Schema

A Tweet, a TikTok caption, and a Facebook post are the same abstract object: an authored public utterance, at a time, maybe a place, that mentions things and carries engagement. Model that abstraction once; push platform quirks into JSONB.

2a · Mapping disparate models to one canonical "Mention"

Canonical fieldX / TweetTikTokFacebook / IG
contenttweet textvideo description + captionpost message / caption
authoruser objectauthorpage/profile
published_atcreated_atcreateTimecreated_time
engagementlike/retweet/reply/quotedigg/share/comment/playreactions/shares/comments
mediamedia entitiesvideo URL + coverattachments
geoplace / coordinates(rare) region codeplace
rawthe entire untouched provider/API payload (JSONB) — your insurance policy

Normalization rule of thumb: promote to a typed column only what you filter, sort, or join on (tenant, platform, time, geo, sentiment, category). Everything else lives in raw JSONB and gets pulled forward later if it earns a query.

2b · PostgreSQL schema (JSONB + pgvector)

Postgres over Mongo here: you need multi-tenant relational integrity (tenants↔rules↔mentions), and JSONB flexibility, and vector search for semantic dedup/RAG — all in one engine via pgvector. Partition the hot table by time.

-- Tenancy ----------------------------------------------------------
CREATE TABLE tenants (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  name         text NOT NULL,
  plan         text DEFAULT 'mvp',
  created_at   timestamptz DEFAULT now()
);

CREATE TABLE users (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id    uuid REFERENCES tenants(id) ON DELETE CASCADE,
  email        citext UNIQUE NOT NULL,
  role         text DEFAULT 'member'
);

-- What each tenant is listening for --------------------------------
CREATE TABLE keyword_rules (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id    uuid REFERENCES tenants(id) ON DELETE CASCADE,
  label        text,                          -- "Unga prices"
  terms        text[] NOT NULL,            -- ['unga','bei ya unga','maize flour']
  platforms    text[] DEFAULT '{x}',        -- start with X
  geo_filter   jsonb,                         -- {"country":"KE","city":"Nairobi"}
  cursors      jsonb DEFAULT '{}',           -- per-platform since_id / pagination
  is_active    boolean DEFAULT true,
  created_at   timestamptz DEFAULT now()
);

-- The canonical, platform-agnostic record --------------------------
CREATE TABLE mentions (
  id              uuid DEFAULT gen_random_uuid(),
  tenant_id       uuid NOT NULL,
  rule_id         uuid,
  platform        text NOT NULL,          -- x | facebook | instagram | tiktok
  native_post_id  text NOT NULL,          -- id on that platform
  content_hash    text NOT NULL,          -- sha256(normalized text) for dedup
  content         text,
  lang            text,
  author          jsonb,                     -- {handle, display, followers, verified}
  engagement      jsonb,                     -- {likes, shares, comments, views}
  media           jsonb,
  geo             jsonb,                     -- {country, city, lat, lng}
  geo_confidence  real DEFAULT 0,
  sentiment       real,                      -- -1..1
  sentiment_label text,                      -- pos | neu | neg
  categories      text[],                    -- ['RetailPricing','ConsumerSentiment']
  entities        jsonb,                     -- [{name:'Naivas',type:'BRAND',span:[..]}]
  embedding       vector(768),               -- pgvector: semantic dedup + RAG
  raw             jsonb,                     -- untouched source payload
  published_at    timestamptz NOT NULL,
  ingested_at     timestamptz DEFAULT now(),
  PRIMARY KEY (id, published_at)
) PARTITION BY RANGE (published_at);

-- Dedup + the indexes that make the dashboard fast -----------------
CREATE UNIQUE INDEX ux_mentions_dedup
  ON mentions (tenant_id, platform, native_post_id, published_at);
CREATE INDEX ix_mentions_tenant_time ON mentions (tenant_id, published_at DESC);
CREATE INDEX ix_mentions_categories  ON mentions USING GIN (categories);
CREATE INDEX ix_mentions_entities    ON mentions USING GIN (entities jsonb_path_ops);
CREATE INDEX ix_mentions_geo         ON mentions USING GIN (geo);
CREATE INDEX ix_mentions_embedding   ON mentions
  USING hnsw (embedding vector_cosine_ops);
Multi-tenant isolation: enable Postgres Row-Level SecurityCREATE POLICY tenant_isolation ON mentions USING (tenant_id = current_setting('app.tenant_id')::uuid). The app sets app.tenant_id per request, so a query bug can't leak one customer's data into another's dashboard.

2c · Deduplication (two layers)

3AI Categorization & Sentiment Engine

A tiered pipeline: cheap deterministic filters first, small fast models in the middle, an LLM only where ambiguity actually requires it. Running an LLM on every post is how you go bankrupt.

Clean & lang-detect
Rules / regex
(prices, $brands)
Embed
(multilingual)
Fast classifier
sentiment + category
LLM fallback
(low-confidence only)
Write back

3a · The NLP pipeline, stage by stage

3b · Auto-tagging into #RetailPricing, #LocalEvents, #ConsumerSentiment

Define each category as (a) a short natural-language definition, (b) a handful of labeled examples, and (c) a prototype embedding (mean of example embeddings). At ingest:

  1. Cosine-match the post embedding to every prototype; assign categories above threshold (multi-label — a post can be both pricing and sentiment).
  2. If the best matches are ambiguous, send only that post to an LLM with the category definitions for a final multi-label decision.
  3. Periodically retrain prototypes from human-corrected labels (active learning) so the taxonomy sharpens over time.
#RetailPricing#LocalEvents#ConsumerSentiment #ProductQuality#CustomerService#Promotions#Availability

3c · Kenyan context: Sheng, code-switching, sarcasm

Off-the-shelf English sentiment models fail badly on KOT. Concrete handling:

Build a labeling flywheel early. Your durable moat isn't the model — it's a growing, human-verified corpus of Kenyan social text with sentiment/category/entity labels. Capture every analyst correction in the dashboard and feed it back into fine-tuning.

4High-Level Reporting & Comparison Logic

Reporting is aggregation over the canonical table, sliced by entity, category, geo, and time. Because sentiment is stored per entity+aspect, competitive matrices are a GROUP BY — not a second pipeline.

4a · Price comparison & competitive matrices (Naivas vs Carrefour)

  1. Filter to comparable mentions: posts where categories @> '{RetailPricing}' and entities contains the target brands.
  2. Extract structured prices from the deterministic price-regex output, attaching each price to (entity, product/aspect) — e.g. (Naivas, 2kg unga, Ksh 210).
  3. Normalize units (per kg, per litre) so a 2kg vs 1kg quote is comparable; reject outliers via IQR to kill typos/jokes.
  4. Aggregate to a matrix: rows = products, columns = brands, cells = median price + sample size + 7-day trend + crowd-sentiment.
  5. Confidence: every cell carries n and a freshness timestamp — never show a single noisy post as a "price."
-- Share of Voice between competing brands, last 30 days, Nairobi
SELECT e->>'name' AS brand,
       count(*) AS mentions,
       round(avg(sentiment)::numeric, 3) AS avg_sentiment
FROM mentions m, jsonb_array_elements(m.entities) e
WHERE m.tenant_id = current_setting('app.tenant_id')::uuid
  AND m.published_at > now() - interval '30 days'
  AND m.geo->>'city' = 'Nairobi'
  AND e->>'name' = ANY(ARRAY['Naivas','Carrefour','Quickmart'])
GROUP BY brand ORDER BY mentions DESC;

4b · Visualization & aggregation

Pre-aggregate for speed. Don't scan raw mentions for every dashboard load. Roll up into hourly/daily materialized views (or TimescaleDB continuous aggregates) keyed by (tenant, entity, category, geo, bucket). Dashboards read the rollups; the raw table is for drill-down only.

5Tech Stack & Roadmap

Scalable, budget-conscious, boring-on-purpose. Every choice favors one operational surface and a clear upgrade path over premature microservices.

Frontend

  • Next.js (React) + TypeScript, Tailwind, deployed on Vercel/Cloudflare Pages.
  • Recharts / ECharts for time-series & SoV; deck.gl/Mapbox for geo heatmaps.
  • Auth UI via Clerk/Auth.js; org switcher = tenant switcher.

Backend / API

  • FastAPI (Python) — same language as the ML stack; async; great for the NLP services.
  • REST + WebSocket (live mention feed); JWT scoped to tenant.
  • Pydantic models mirror the canonical schema.

Workers / Queue

  • Celery + Redis for MVP (ingest, normalize, NLP, rollups). Beat for schedules.
  • Upgrade path: Kafka / Redpanda when ingest volume outgrows Redis as a broker.
  • Separate queues per stage so NLP backpressure never stalls ingest.

Data

  • PostgreSQL + pgvector (one engine: relational + JSONB + vectors). Add TimescaleDB for time partitioning & continuous aggregates.
  • S3 / Cloudflare R2 raw-payload landing zone.
  • Redis for rate-limit buckets, cache, dedup bloom filters.

AI / LLM orchestration

  • Embeddings: bge-m3 / multilingual-e5 self-hosted (cheap at volume).
  • Sentiment/category: fine-tuned transformer served via vLLM / HF TGI.
  • LLM fallback: hosted API (Claude/GPT) behind a router; cache by content hash; cap spend per tenant.
  • Vector store = pgvector (no extra service until you need one).

Infra / Ops

  • Docker + a managed container host (Fly.io/Render for MVP → ECS/GKE at scale).
  • Managed Postgres (Neon/Supabase/RDS) so you don't babysit a DB early.
  • Observability: OpenTelemetry → Grafana; Sentry for errors; per-adapter health dashboards.

Phased roadmap

1
Weeks 1–4 — X spine. Auth + tenancy + rules + XAdapter + normalize/dedup + basic sentiment + a dashboard with SoV & sentiment time-series for one platform.
2
Weeks 5–8 — Intelligence. Entity linking, category prototypes, Kenyan sentiment fine-tune, price extraction, comparison matrix, geo cascade + gazetteer.
3
Weeks 9–14 — Breadth. Add TikTok & IG via managed-provider adapters (same pipeline), then Facebook Pages. Continuous aggregates + rollup dashboards.
4
Ongoing — Hardening. Circuit breakers, RLS audit, labeling flywheel, cost caps, alerting, and the X API Pro cutover when revenue justifies ToS-guaranteed streams.