gadelkareem

name: pull request
about: submit changes to the project
title: "[pr] feat: add OCR text filtering and content hiding for API endpoints"
labels: 'security, enhancement'
assignees: ''


description

Adds comprehensive content filtering system to prevent sensitive information in OCR text from being exposed through API endpoints. This addresses a significant privacy gap where user data like passwords, API
keys, credit card numbers, and other sensitive information could be inadvertently leaked through screenpipe's API responses.

The Problem:

  • OCR text containing sensitive information was being returned raw through API endpoints
  • /search, /frames/:frame_id, and /stream/frames endpoints could expose passwords, API keys, SSNs, etc.
  • No filtering mechanism existed to protect user privacy
  • Streaming websocket endpoint was particularly vulnerable as it broadcasts OCR data in real-time

The Solution:

  • Implemented keyword-based content filtering using configurable hide_window_keywords
  • When OCR text contains sensitive keywords (case-insensitive), API returns "[REDACTED]" instead
  • Applied filtering consistently across all API endpoints that expose OCR text
  • Added comprehensive test coverage to ensure reliability
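
As a rough illustration of the keyword check described above, here is a minimal sketch. The real signature of should_hide_content in screenpipe-server may differ; the REDACTED constant and the redact_ocr_text helper are hypothetical names introduced only for this example.

```rust
/// Placeholder returned in place of OCR text that matches a keyword.
/// (Constant name is an assumption; the PR only states the literal value.)
const REDACTED: &str = "[REDACTED]";

/// Returns true when `text` contains any non-empty keyword, ignoring case.
/// Assumed signature; the function in the PR may take different arguments.
fn should_hide_content(text: &str, hide_keywords: &[String]) -> bool {
    let haystack = text.to_lowercase();
    hide_keywords
        .iter()
        .map(|k| k.trim().to_lowercase())
        // An empty keyword would match every string, so skip it.
        .filter(|k| !k.is_empty())
        .any(|k| haystack.contains(k.as_str()))
}

/// Hypothetical helper: returns the text unchanged, or the placeholder
/// when any keyword matches.
fn redact_ocr_text(text: &str, hide_keywords: &[String]) -> String {
    if should_hide_content(text, hide_keywords) {
        REDACTED.to_string()
    } else {
        text.to_string()
    }
}
```

Note that this sketch replaces the whole OCR string rather than masking only the matching span, which is how the bullet above reads; substring matching also means a keyword such as "token" would hide text containing "tokenizer".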

Technical Implementation:

  • Modified create_time_series_frame() function to apply filtering before sending data
  • Updated streaming websocket handlers to pass filtering keywords from AppState
  • Enhanced existing should_hide_content() function for keyword detection
  • Maintained API compatibility - no breaking changes
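
The streaming change can be pictured roughly as follows: keywords are passed into the function that builds a frame, and matching OCR text is replaced before the frame is sent. This is only a sketch under stated assumptions. OcrEntry, StreamFrame, build_stream_frame, and contains_keyword are simplified, hypothetical stand-ins, not the actual types or the real create_time_series_frame signature.

```rust
/// Hypothetical, simplified stand-in for one OCR result attached to a frame.
#[derive(Debug, Clone, PartialEq)]
struct OcrEntry {
    device_name: String,
    text: String,
}

/// Hypothetical, simplified stand-in for the frame sent over the websocket.
#[derive(Debug, Clone, PartialEq)]
struct StreamFrame {
    frame_id: i64,
    entries: Vec<OcrEntry>,
}

/// Case-insensitive substring check against every non-empty keyword.
fn contains_keyword(text: &str, hide_keywords: &[String]) -> bool {
    let haystack = text.to_lowercase();
    hide_keywords
        .iter()
        .map(|k| k.trim().to_lowercase())
        .filter(|k| !k.is_empty())
        .any(|k| haystack.contains(k.as_str()))
}

/// Builds the outgoing frame, redacting OCR text before it leaves the server.
/// Other fields are left unchanged so the response shape stays the same.
fn build_stream_frame(
    frame_id: i64,
    raw_entries: Vec<OcrEntry>,
    hide_keywords: &[String],
) -> StreamFrame {
    let entries = raw_entries
        .into_iter()
        .map(|mut entry| {
            if contains_keyword(&entry.text, hide_keywords) {
                entry.text = "[REDACTED]".to_string();
            }
            entry
        })
        .collect();
    StreamFrame { frame_id, entries }
}
```

Redacting at construction time, rather than in each handler, means every consumer of the frame sees the same filtered text, and an empty keyword list leaves frames unchanged.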

Additional Improvements:

  • Fixed the performance bottleneck in the failing test_extract_frames_and_ocr test (4+ minutes → 0.30s)
  • Replaced inappropriate test keywords with security-focused terms
  • Resolved compilation warnings (duplicate functions, unused imports, unreachable code)
  • Added streaming-specific test coverage

Security Keywords Examples:

  • "password", "api key", "credit card", "ssn", "private key", "token", "secret"
  • Fully configurable via server settings
  • Case-insensitive matching for comprehensive protection
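
The manual CLI test further down passes the keywords as one comma-separated string. One way such a value could be normalized into a keyword list is sketched below; parse_hide_keywords is a hypothetical helper, and whether screenpipe parses the flag this way is an assumption.

```rust
/// Hypothetical helper: splits a comma-separated keyword string into a
/// lowercase, de-duplicated list, dropping blank entries.
fn parse_hide_keywords(raw: &str) -> Vec<String> {
    let mut keywords: Vec<String> = Vec::new();
    for part in raw.split(',') {
        // Lowercasing once here keeps later matching case-insensitive.
        let keyword = part.trim().to_lowercase();
        if !keyword.is_empty() && !keywords.contains(&keyword) {
            keywords.push(keyword);
        }
    }
    keywords
}
```

Dropping blank entries matters because an empty keyword is a substring of every string and would redact all OCR text.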

related issue: N/A (proactive security enhancement)

how to test

  1. Verify comprehensive content filtering:
     cargo test -p screenpipe-server --test content_hiding_test
     Expected: All 11 tests pass ✅ (includes new streaming endpoint tests)
  2. Confirm OCR test performance fix:
     cargo test -p screenpipe-server --test video_utils_test test_extract_frames_and_ocr
     Expected: Test completes in ~0.30s (previously took 4+ minutes)
  3. Test streaming endpoint filtering manually:
    - Start screenpipe server with configuration: hide_window_keywords: ["password", "api key", "credit card"]
    - Connect to websocket endpoint /stream/frames
    - Send OCR text containing "Enter your password: secret123"
    - Verify response shows "[REDACTED]" instead of actual sensitive text
  4. Test search API filtering:
    - Configure keywords as above
    - Make search request: GET /search?q=password&limit=10
    - Verify search results with sensitive keywords show filtered content
  5. Test frame API filtering:
    - Request specific frame: GET /frames/{frame_id}
    - Verify OCR text containing keywords is properly redacted

Expected Test Results:

  • All content filtering tests pass (proves keyword detection works)
  • Performance tests demonstrate significant speed improvement
  • Integration tests confirm filtering applies across all endpoints
  • No regression in existing functionality

manual cli testing:

     # build and run screenpipe with content filtering
     cargo build --release
     ./target/release/screenpipe --hide-window-keywords "password,credit card,api key"

     # in another terminal, test search with sensitive content
     curl "http://localhost:3030/search?q=password&limit=10"
     # should return censored images instead of actual sensitive content

Screenshots/Evidence: Test suite output showing 11/11 content filtering tests passing.
[Screenshot 2025-06-05 at 1 15 33 AM]
[Screenshot 2025-06-05 at 1 16 49 AM]

Files Modified:

  • screenpipe-server/src/server.rs - Core filtering implementation
  • screenpipe-server/tests/content_hiding_test.rs - Enhanced test coverage
  • screenpipe-server/tests/video_utils_test.rs - Performance fixes
  • screenpipe-server/src/lib.rs - Compilation fixes

🔐 This PR transforms screenpipe from potentially leaking sensitive user data to providing robust privacy protection across all API endpoints. Critical for user trust and data security.

- Implemented `get_frame_ocr_text` method to retrieve OCR text for a given frame.
- Added `hide_window_texts` option in CLI for filtering keywords in OCR responses.
- Introduced `should_hide_content` function to determine if content should be hidden based on keywords.
- Created `create_censored_image` function to load or generate a censored image.
- Updated search functionality to skip OCR results containing hidden keywords.
- Added a new asset: `censored-content.png` for use in the application.
- Implemented `should_hide_content` function for keyword-based content hiding.
- Added `create_censored_image` function to generate a black PNG image.
- Created comprehensive tests for content hiding logic and image creation, ensuring case-insensitivity and performance benchmarks.
- Removed unused import of `AutomationError` in `windows_pdf_to_legacy.rs`.
- Cleaned up commented-out code in `tests.rs` to improve readability.
- Removed content hiding functions from `lib.rs` as they are no longer needed.
- Updated `fetch_and_process_frames` to accept `hide_keywords` for filtering sensitive OCR text.
- Modified `create_time_series_frame` to redact content based on provided keywords.
- Added unit tests to verify content filtering functionality in streaming frames.

github-actions bot commented Jun 4, 2025

🧪 testing bounty created!

a testing bounty has been created for this PR: view testing issue

testers will be awarded $20 each for providing quality test reports. please check the issue for testing requirements.

gadelkareem and others added 3 commits June 5, 2025 01:37
Add missing device_name field to OCRContent struct and its instantiation to maintain compatibility with database schema changes.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>