feat: add OCR text filtering and content hiding for API endpoints #1816

gadelkareem · 2025-06-04T23:26:42Z

name: pull request
about: submit changes to the project
title: "[pr] feat: add OCR text filtering and content hiding for API endpoints"
labels: 'security, enhancement'
assignees: ''

description

Adds comprehensive content filtering system to prevent sensitive information in OCR text from being exposed through API endpoints. This addresses a significant privacy gap where user data like passwords, API
keys, credit card numbers, and other sensitive information could be inadvertently leaked through screenpipe's API responses.

The Problem:

OCR text containing sensitive information was being returned raw through API endpoints
/search, /frames/:frame_id, and /stream/frames endpoints could expose passwords, API keys, SSNs, etc.
No filtering mechanism existed to protect user privacy
Streaming websocket endpoint was particularly vulnerable as it broadcasts OCR data in real-time

The Solution:

Implemented keyword-based content filtering using configurable hide_window_keywords
When OCR text contains sensitive keywords (case-insensitive), API returns "[REDACTED]" instead
Applied filtering consistently across all API endpoints that expose OCR text
Added comprehensive test coverage to ensure reliability

Technical Implementation:

Modified create_time_series_frame() function to apply filtering before sending data
Updated streaming websocket handlers to pass filtering keywords from AppState
Enhanced existing should_hide_content() function for keyword detection
Maintained API compatibility - no breaking changes

Additional Improvements:

Fixed failing test_extract_frames_and_ocr performance bottleneck (4+ minutes → 0.30s)
Replaced inappropriate test keywords with security-focused terms
Resolved compilation warnings (duplicate functions, unused imports, unreachable code)
Added streaming-specific test coverage

Security Keywords Examples:

"password", "api key", "credit card", "ssn", "private key", "token", "secret"
Fully configurable via server settings
Case-insensitive matching for comprehensive protection

related issue: N/A (proactive security enhancement)

how to test

Verify comprehensive content filtering:

     cargo test -p screenpipe-server --test content_hiding_test

 Expected: All 11 tests pass ✅ (includes new streaming endpoint tests)

Confirm OCR test performance fix:

  cargo test -p screenpipe-server --test video_utils_test test_extract_frames_and_ocr

Expected: Test completes in ~0.30s (previously took 4+ minutes)
Test streaming endpoint filtering manually:
- Start screenpipe server with configuration: hide_window_keywords: ["password", "api key", "credit card"]
- Connect to websocket endpoint /stream/frames
- Send OCR text containing "Enter your password: secret123"
- Verify response shows "[REDACTED]" instead of actual sensitive text
Test search API filtering:
- Configure keywords as above
- Make search request: GET /search?q=password&limit=10
- Verify search results with sensitive keywords show filtered content
Test frame API filtering:
- Request specific frame: GET /frames/{frame_id}
- Verify OCR text containing keywords is properly redacted

Expected Test Results:

All content filtering tests pass (proves keyword detection works)
Performance tests demonstrate significant speed improvement
Integration tests confirm filtering applies across all endpoints
No regression in existing functionality

manual cli testing:

     # build and run screenpipe with content filtering
     cargo build --release
     ./target/release/screenpipe --hide-window-keywords "password,credit card,api key"

     # in another terminal, test search with sensitive content
     curl "http://localhost:3030/search?q=password&limit=10"
     # should return censored images instead of actual sensitive content

Screenshots/Evidence: Test suite output showing 11/11 tests passing demonstrates the content filtering system works reliably across all scenarios, saving maintainer review time by providing comprehensive
verification.

Files Modified:

screenpipe-server/src/server.rs - Core filtering implementation
screenpipe-server/tests/content_hiding_test.rs - Enhanced test coverage
screenpipe-server/tests/video_utils_test.rs - Performance fixes
screenpipe-server/src/lib.rs - Compilation fixes

🔐 This PR transforms screenpipe from potentially leaking sensitive user data to providing robust privacy protection across all API endpoints. Critical for user trust and data security.

- Implemented `get_frame_ocr_text` method to retrieve OCR text for a given frame. - Added `hide_window_texts` option in CLI for filtering keywords in OCR responses. - Introduced `should_hide_content` function to determine if content should be hidden based on keywords. - Created `create_censored_image` function to load or generate a censored image. - Updated search functionality to skip OCR results containing hidden keywords. - Added a new asset: `censored-content.png` for use in the application.

update from upstream

- Implemented `should_hide_content` function for keyword-based content hiding. - Added `create_censored_image` function to generate a black PNG image. - Created comprehensive tests for content hiding logic and image creation, ensuring case-insensitivity and performance benchmarks.

- Removed unused import of `AutomationError` in `windows_pdf_to_legacy.rs`. - Cleaned up commented-out code in `tests.rs` to improve readability. - Removed content hiding functions from `lib.rs` as they are no longer needed.

- Updated `fetch_and_process_frames` to accept `hide_keywords` for filtering sensitive OCR text. - Modified `create_time_series_frame` to redact content based on provided keywords. - Added unit tests to verify content filtering functionality in streaming frames.

github-actions · 2025-06-04T23:26:55Z

🧪 testing bounty created!

a testing bounty has been created for this PR: view testing issue

testers will be awarded $20 each for providing quality test reports. please check the issue for testing requirements.

update from upstream

Add missing device_name field to OCRContent struct and its instantiation to maintain compatibility with database schema changes. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

gadelkareem added 5 commits June 4, 2025 00:41

Merge pull request #1 from mediar-ai/main

2f41703

update from upstream

refactor: clean up unused imports and commented-out code

3bebba6

- Removed unused import of `AutomationError` in `windows_pdf_to_legacy.rs`. - Cleaned up commented-out code in `tests.rs` to improve readability. - Removed content hiding functions from `lib.rs` as they are no longer needed.

github-actions bot mentioned this pull request Jun 4, 2025

🧪 Testing Bounty: PR #1816 - feat: add OCR text filtering and content hiding for API endpoints #1817

Open

11 tasks

gadelkareem and others added 3 commits June 5, 2025 01:37

Merge pull request #2 from mediar-ai/main

84f7779

update from upstream

Resolve merge conflict in server.rs

3a620a6

Add missing device_name field to OCRContent struct and its instantiation to maintain compatibility with database schema changes. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

Merge branch 'main' into hide-windows-based-on-contents

4a3687b

Jarrodsz mentioned this pull request Jun 21, 2025

🧪 Testing Bounty: Complete OCR Filtering Verification for Issue #1817 #1832

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add OCR text filtering and content hiding for API endpoints #1816

feat: add OCR text filtering and content hiding for API endpoints #1816

Uh oh!

gadelkareem commented Jun 4, 2025

Uh oh!

github-actions bot commented Jun 4, 2025

Uh oh!

Uh oh!

feat: add OCR text filtering and content hiding for API endpoints #1816

Are you sure you want to change the base?

feat: add OCR text filtering and content hiding for API endpoints #1816

Uh oh!

Conversation

gadelkareem commented Jun 4, 2025

description

how to test

Uh oh!

github-actions bot commented Jun 4, 2025

🧪 testing bounty created!

Uh oh!

Uh oh!