Science Infuse
Data

Data Structure

Note

You can find the data structure in the Prisma schema located at webapp/src/prisma/schema.prisma.
We use PostgreSQL with pgvector extension to store the data.

Types of Documents

Ada can injest multiple types of documents:

  • Pdfs
  • Images
  • Youtube Videos
  • Websites
  • ...

Document Structure

Documents are processed and stored in a three-level hierarchy:

  1. Document Level

    • The top-level entity representing the document
    • Contains basic file information: i.e. name, original path, s3 access...
  2. Chunks Level

    • Documents are broken down into smaller text chunks for easier indexing
    • Each chunk's text is transformed into vector embeddings
    • These embeddings are used for semantic search
  3. Metadata Level

    • Each chunk can have associated metadata
    • Stores contextual information about the chunk: i.e. page number, bounding box

Semantic Search Implementation

The semantic search capability is enabled by:

  1. Breaking documents into manageable chunks
  2. Converting chunk text into vector embeddings
  3. Storing these embeddings in pgvector
  4. Using vector similarity for search operations

On this page