Picturam
A PWA that translates speech into semantic images in real time, to support communication with elderly people who are deaf, illiterate, or have difficulty understanding speech. Dual-mode pipeline (local GPU or cloud) with a two-layer cache.
The Challenge
Elderly people with hearing loss, illiteracy or comprehension problems miss much of what caregivers, therapists or family members tell them. Spoken language is a fragile channel that doesn't always reach the listener.
Results
- Dual-mode pipeline: local GPU (RTX 4080) or cloud, config-swappable
- Two-layer cache (exact phrase + concept) cuts latency 10-50x on repeats
- Typical latency: ~200-300 ms when cached, 2-5 s on fresh generation
- Use cases: care homes, speech therapy, family communication
The Solution
I built a PWA that captures speech, transcribes it with Whisper, extracts the key concept with an LLM, and generates a semantic image to accompany the sentence. The server orchestrates two interchangeable modes: local GPU (faster-whisper + Ollama + ComfyUI) or cloud (Deepgram + Gemini + fal.ai). A two-layer cache eliminates redundant work.
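As a rough illustration of the two-layer cache idea (exact phrase first, then concept), here is a minimal in-memory sketch. The class name `TwoLayerCache`, the `normalize` helper, and the two injected callables are hypothetical names chosen for this example; the real project may store, key, and expire entries differently.

```python
from typing import Callable, Dict, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial variants share a key."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


class TwoLayerCache:
    """Hypothetical sketch: layer 1 is keyed by phrase, layer 2 by extracted concept."""

    def __init__(
        self,
        extract_concept: Callable[[str], str],  # stands in for the LLM step
        generate_image: Callable[[str], str],   # stands in for the image model
    ) -> None:
        self._extract_concept = extract_concept
        self._generate_image = generate_image
        self._by_phrase: Dict[str, str] = {}
        self._by_concept: Dict[str, str] = {}

    def image_for(self, phrase: str) -> Tuple[str, str]:
        """Return (image reference, which layer answered)."""
        key = normalize(phrase)
        # Layer 1: the same phrase was seen before, so skip both LLM and image model.
        if key in self._by_phrase:
            return self._by_phrase[key], "phrase"

        # Layer 2: a different phrase with a known concept skips only the image model.
        concept = normalize(self._extract_concept(phrase))
        if concept in self._by_concept:
            image, source = self._by_concept[concept], "concept"
        else:
            image, source = self._generate_image(concept), "generated"
            self._by_concept[concept] = image

        self._by_phrase[key] = image
        return image, source
```

Under these assumptions, a repeated sentence costs a dictionary lookup, while a new sentence about an already-seen concept still pays for the LLM call but avoids image generation, which is typically the slowest stage.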
Motivation
I wanted to push the social use case of generative models: instead of decorative images, images that serve someone who can't hear or read. Along the way, I also wanted to experiment with a pipeline that could run on my local GPU or in the cloud without changing the app.
Challenges
The hardest part was keeping latency usable in a pipeline with three models in series (STT → LLM → image). The two-layer cache and fuzzy matching of known people were the two decisions that made the conversational mode viable.
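One plausible way to do fuzzy matching of known people is to compare a transcribed name against a small registry and accept the closest entry above a similarity threshold. The sketch below uses Python's standard `difflib`; the function name `match_known_person`, the name-to-photo mapping, and the 0.75 cutoff are assumptions for illustration, not the project's actual approach.

```python
from difflib import get_close_matches
from typing import Dict, Optional


def match_known_person(
    heard: str,
    known: Dict[str, str],   # hypothetical registry: display name -> stored photo path
    cutoff: float = 0.75,    # assumed threshold; would need tuning on real transcripts
) -> Optional[str]:
    """Return the photo of the closest known name, or None if nothing is similar enough."""
    # Compare case-insensitively but keep the original keys for the final lookup.
    lowered = {name.lower(): name for name in known}
    hits = get_close_matches(heard.lower(), list(lowered), n=1, cutoff=cutoff)
    if not hits:
        return None
    return known[lowered[hits[0]]]


# Usage with made-up data: a slightly misheard name can still resolve to a stored photo.
people = {"Carmen": "photos/carmen.jpg", "Antonio": "photos/antonio.jpg"}
photo = match_known_person("Karmen", people)
```

A match like this lets the app show an existing photo of a relative instead of generating an image, which removes the slowest stage for sentences that mention familiar people.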
Learnings
I learned to design provider abstractions (STT/LLM/image) that can be swapped without coupling to a specific SDK. I also learned that in real scenarios, such as a care home with flaky Wi-Fi, offline-first stops being optional.
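A minimal sketch of what such provider abstractions could look like, using `typing.Protocol` so the pipeline depends only on method shapes rather than on any SDK. All class and function names here (`SpeechToText`, `Pipeline`, `build_pipeline`, the `Stub*` providers) are hypothetical, and the stubs deliberately call no real service; actual adapters for faster-whisper, Ollama, ComfyUI, Deepgram, Gemini, or fal.ai would sit behind the same interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class ConceptExtractor(Protocol):
    def extract(self, sentence: str) -> str: ...


class ImageGenerator(Protocol):
    def generate(self, concept: str) -> str: ...


@dataclass
class Pipeline:
    """Runs STT -> LLM -> image using whichever providers were injected."""
    stt: SpeechToText
    llm: ConceptExtractor
    image: ImageGenerator

    def run(self, audio: bytes) -> Dict[str, str]:
        sentence = self.stt.transcribe(audio)
        concept = self.llm.extract(sentence)
        return {"sentence": sentence, "concept": concept, "image": self.image.generate(concept)}


# Stand-in providers so the sketch runs without any external dependency.
class StubSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend the audio is already text


class StubLLM:
    def extract(self, sentence: str) -> str:
        return sentence.split()[-1]  # pretend the last word is the key concept


class StubImage:
    def __init__(self, backend: str) -> None:
        self.backend = backend

    def generate(self, concept: str) -> str:
        return f"{self.backend}://{concept}"  # pretend this is an image URL


def build_pipeline(mode: str) -> Pipeline:
    """Pick a provider set from a single config value ('local' or 'cloud')."""
    factories: Dict[str, Callable[[], Pipeline]] = {
        "local": lambda: Pipeline(StubSTT(), StubLLM(), StubImage("local")),
        "cloud": lambda: Pipeline(StubSTT(), StubLLM(), StubImage("cloud")),
    }
    if mode not in factories:
        raise ValueError(f"unknown mode: {mode!r}")
    return factories[mode]()
```

Because `Pipeline` only sees the three protocols, switching between local and cloud is a one-line config change, and a failing provider can be replaced or mocked in tests without touching the orchestration code.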
Context
Most active project in my portfolio in 2026 (44 commits in 60 days). Technically solid MVP, pending a commercial milestone: open demo, care-home pilot, or third-party API.