Add a Source to a Corpus.

A Source consists of a set of rules for crawling and indexing Documents in a Corpus.

Path Parameters

corpusId string required

The parent corpus to create the source in.

application/json

Request Body required

corpusId string required

The parent corpus to create the source in.

source object

A source of documents for building a corpus. A source defines how documents are loaded, processed, and embedded.

displayName string

A human-readable name for this source.

loadSpec object required

The specification for how documents are loaded for this source.

maxDocuments int32

The maximum number of documents to ingest. This cannot exceed 200 in general. If you need more documents in a single corpus, please contact the Fixie team.

maxDocumentBytes int32

The maximum size of an individual document in bytes. If unset, a reasonable default will be chosen by Fixie.

relevantDocumentTypes object

The types of documents to keep. Any documents surfaced during loading that don't match this filter will be discarded. If unset, all documents will be kept.

include object

Mime types must be in this set to be kept. Empty implies the universal set. That is, all mime types will be kept save those in the exclude set.

mimeTypes string[]

exclude object

Mime types must not be in this set to be kept. Empty implies the empty set.

mimeTypes string[]
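
The include/exclude semantics described above can be sketched as a small predicate. This is only an illustration of the documented rule: the function name `is_kept` is hypothetical, and exact, case-sensitive string comparison of mime types is an assumption (the reference does not say whether wildcards or case folding are supported).

```python
from typing import Iterable


def is_kept(mime_type: str,
            include: Iterable[str] = (),
            exclude: Iterable[str] = ()) -> bool:
    """Return True if a document with this mime type passes the filter.

    Assumption: mime types are compared as exact strings.
    """
    include_set = set(include)
    exclude_set = set(exclude)
    # Anything in the exclude set is always discarded.
    if mime_type in exclude_set:
        return False
    # An empty include set stands for the universal set.
    return not include_set or mime_type in include_set
```

With both sets empty, every document is kept; with only `include` set, everything outside it is discarded.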

web object

Allows loading documents by crawling the web.

Only one of the web or static fields may be populated when creating a new source.

startUrls string[] required

The list of start URLs to crawl.

maxDepth int32

The maximum depth of links to traverse. If 0 (or unset), there will be no depth limit.

includeGlobPatterns string[]

A set of glob patterns matched against any additional discovered URLs. URLs matching these patterns will be included in the crawl, unless the URL also matches any of the excludeGlobPatterns.

excludeGlobPatterns string[]

A set of glob patterns matched against any additional discovered URLs. URLs matching these patterns will be excluded from the crawl.
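
The precedence between the two pattern lists (exclude wins over include) can be sketched as follows. This is a rough illustration only: `should_crawl` is a hypothetical name, Python's `fnmatch` is used as a stand-in for whatever glob dialect the service actually implements, and the behavior when includeGlobPatterns is empty is not stated in this reference, so the sketch simply reports no match in that case.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def should_crawl(url: str,
                 include_patterns: Sequence[str],
                 exclude_patterns: Sequence[str]) -> bool:
    """Decide whether a discovered URL would be crawled.

    Assumption: fnmatch-style globs, where '*' also matches '/'.
    The real service may use a different glob dialect.
    """
    # A match on any exclude pattern overrides every include pattern.
    if any(fnmatchcase(url, p) for p in exclude_patterns):
        return False
    return any(fnmatchcase(url, p) for p in include_patterns)
```

For example, with an include pattern of `https://example.com/docs/*` and an exclude pattern of `*/private/*`, a URL under `/docs/private/` would be skipped even though it matches the include pattern.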

static object

Allows loading documents from a static source (e.g. a file upload).

Only one of the web or static fields may be populated when creating a new source.

documents object[] required

The documents to load.

Array [

filename string required

The filename of the document.

mimeType string required

The MIME type of the document.

contents bytes required

The contents of the document.

metadata object

The metadata to attach to this document.

publicUrl string

The public URL of the document, if any.

language string

The BCP47 language code of the document, if known.

title string

The title of the document, if known.

description string

The description of the document, if known.

published date-time

The timestamp that the document was published, if known.

]
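
A static document entry might be assembled like this. The helper `make_static_document` is hypothetical, and encoding `contents` as base64 text is an assumption based on the usual JSON representation of `bytes` fields; this reference does not state the encoding explicitly.

```python
import base64
from typing import Optional


def make_static_document(filename: str,
                         mime_type: str,
                         data: bytes,
                         title: Optional[str] = None) -> dict:
    """Build one entry for loadSpec.static.documents.

    Assumption: the bytes-typed 'contents' field is sent as base64 text.
    """
    doc = {
        "filename": filename,
        "mimeType": mime_type,
        "contents": base64.b64encode(data).decode("ascii"),
    }
    # metadata is optional; only attach it when there is something to say.
    if title is not None:
        doc["metadata"] = {"title": title}
    return doc
```

The three required fields (filename, mimeType, contents) are always present; metadata is added only when supplied.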

processSteps object[]

How to process documents for this source. Each step will be run in order over all applicable documents (as defined by each step's relevantDocumentTypes).

If no steps are specified, the service will add default document transformation steps to convert various document types into text easily understood by an LLM.

Array [

stepName string required

The human-readable name of the step.

relevantDocumentTypes object

The types of documents to which this step applies. Leave empty to apply to all documents.

include object

Mime types must be in this set to be kept. Empty implies the universal set. That is, all mime types will be kept save those in the exclude set.

mimeTypes string[]

exclude object

Mime types must not be in this set to be kept. Empty implies the empty set.

mimeTypes string[]

htmlToMarkdown object

Converts HTML documents to Markdown. Use with relevantDocumentTypes set to include only text/html.

unstructuredProcessor object

Converts binary documents to plain text.

]

chunkSpec object

Which documents to chunk and how to chunk them for this source. Defaults will be provided if unset.

inputSelector object

The input documents that should be chunked. Only documents that correspond to UTF-8 encoded text can be chunked. Any other kind of document will fail.

mimeTypeFilter object

Filters documents based on their mime type.

include object

Mime types must be in this set to be kept. Empty implies the universal set. That is, all mime types will be kept save those in the exclude set.

mimeTypes string[]

exclude object

Mime types must not be in this set to be kept. Empty implies the empty set.

mimeTypes string[]

originFilter object

Filters documents based on their origin.

origins object[]

Document origins must match one of these to be kept.

Array [

load boolean

processStep string

]

chunkSize int32

The desired chunk size for each chunk, in tokens. This is a strict maximum, as well as a target. Adjacent chunks will be combined if their total size is under this limit.

maxChunksPerDocument int32

The maximum number of chunks to produce for an individual document.

maxChunksTotal int32

The maximum number of chunks to produce in total. This cannot exceed 5000 in general. If you need more chunks in a single source, please contact the Fixie team.
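
The chunkSize description says adjacent chunks are combined when their total stays within the limit. One way to picture that is a greedy left-to-right merge over chunk sizes. This is a sketch under assumptions, not the service's documented algorithm: `merge_adjacent` is a hypothetical name, sizes are taken as precomputed token counts, every input piece is assumed to already fit within the limit, and a combined size exactly equal to the limit is treated as allowed.

```python
from typing import List


def merge_adjacent(piece_sizes: List[int], chunk_size: int) -> List[int]:
    """Greedily merge neighboring pieces without exceeding chunk_size.

    Assumption: each input piece is already no larger than chunk_size.
    """
    merged: List[int] = []
    for size in piece_sizes:
        # Fold this piece into the previous chunk if the result still fits.
        if merged and merged[-1] + size <= chunk_size:
            merged[-1] += size
        else:
            merged.append(size)
    return merged
```

With a limit of 256 tokens, pieces of 100 and 150 tokens would end up in one chunk of 250, while a following 200-token piece would start a new chunk.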

embedSteps object[]

How to embed chunks for this source. Defaults will be provided if unset.

Array [

stepName string

The human-readable name for this step.

direct object

Embeds chunks directly.

parentChild object

Embeds chunks using a parent-child strategy.

]

description string

A human-readable description of this source.

stats object

The current stats for this source.

schedule object

How to automatically refresh this source. If unset, no automatic refresh jobs will be scheduled.

enablePeriodicRefresh boolean

Whether to enable automatic refresh for this source.
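
Putting the request fields together, a minimal body for a web-crawled source could be built as below. The helper name `build_create_source_body` and the sample values are hypothetical; the endpoint URL and authentication are not shown on this page, so the sketch stops at producing the JSON payload. Only `web` is populated, since only one of web or static may be set.

```python
import json
from typing import Sequence


def build_create_source_body(corpus_id: str,
                             display_name: str,
                             start_urls: Sequence[str],
                             max_depth: int = 0,
                             refresh: bool = False) -> dict:
    """Assemble a request body that crawls the web from start_urls."""
    if not start_urls:
        raise ValueError("startUrls is required for a web source")
    return {
        "corpusId": corpus_id,
        "source": {
            "displayName": display_name,
            "loadSpec": {
                # Only 'web' is set; 'static' must be left out.
                "web": {
                    "startUrls": list(start_urls),
                    # 0 means no depth limit, per the field description.
                    "maxDepth": max_depth,
                },
            },
            "schedule": {"enablePeriodicRefresh": refresh},
        },
    }


body = build_create_source_body(
    "my-corpus-id", "Docs site", ["https://example.com/docs/"], max_depth=2)
payload = json.dumps(body, indent=2)
```

Optional sections such as processSteps, chunkSpec, and embedSteps are omitted here, which per the descriptions above leaves them to the service defaults.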

Responses

200
default

application/json

Schema

Example (from schema)

source object

A source of documents for building a corpus. A source defines how documents are loaded, processed, and embedded.

displayName string

A human-readable name for this source.

created date-time

When this source was created.

modified date-time

When this source was last modified.

corpusId string

The corpus that this source belongs to.

sourceId string

The unique ID of this source.

loadSpec object required

The specification for how documents are loaded for this source.

maxDocuments int32

The maximum number of documents to ingest. This cannot exceed 200 in general. If you need more documents in a single corpus, please contact the Fixie team.

maxDocumentBytes int32

The maximum size of an individual document in bytes. If unset, a reasonable default will be chosen by Fixie.

relevantDocumentTypes object

The types of documents to keep. Any documents surfaced during loading that don't match this filter will be discarded. If unset, all documents will be kept.

include object

Mime types must be in this set to be kept. Empty implies the universal set. That is, all mime types will be kept save those in the exclude set.

mimeTypes string[]

exclude object

Mime types must not be in this set to be kept. Empty implies the empty set.

mimeTypes string[]

web object

Allows loading documents by crawling the web.

Only one of the web or static fields may be populated when creating a new source.

startUrls string[] required

The list of start URLs to crawl.

maxDepth int32

The maximum depth of links to traverse. If 0 (or unset), there will be no depth limit.

includeGlobPatterns string[]

A set of glob patterns matched against any additional discovered URLs. URLs matching these patterns will be included in the crawl, unless the URL also matches any of the excludeGlobPatterns.

excludeGlobPatterns string[]

A set of glob patterns matched against any additional discovered URLs. URLs matching these patterns will be excluded from the crawl.

static object

Allows loading documents from a static source (e.g. a file upload).

Only one of the web or static fields may be populated when creating a new source.

documents object[] required

The documents to load.

Array [

filename string required

The filename of the document.

mimeType string required

The MIME type of the document.

contents bytes required

The contents of the document.

metadata object

The metadata to attach to this document.

publicUrl string

The public URL of the document, if any.

language string

The BCP47 language code of the document, if known.

title string

The title of the document, if known.

description string

The description of the document, if known.

published date-time

The timestamp that the document was published, if known.

]

processSteps object[]

How to process documents for this source. Each step will be run in order over all applicable documents (as defined by each step's relevantDocumentTypes).

If no steps are specified, the service will add default document transformation steps to convert various document types into text easily understood by an LLM.

Array [

stepName string required

The human-readable name of the step.

relevantDocumentTypes object

The types of documents to which this step applies. Leave empty to apply to all documents.

include object

Mime types must be in this set to be kept. Empty implies the universal set. That is, all mime types will be kept save those in the exclude set.

mimeTypes string[]

exclude object

Mime types must not be in this set to be kept. Empty implies the empty set.

mimeTypes string[]

htmlToMarkdown object

Converts HTML documents to Markdown. Use with relevantDocumentTypes set to include only text/html.

unstructuredProcessor object

Converts binary documents to plain text.

]

chunkSpec object

Which documents to chunk and how to chunk them for this source. Defaults will be provided if unset.

inputSelector object

The input documents that should be chunked. Only documents that correspond to UTF-8 encoded text can be chunked. Any other kind of document will fail.

mimeTypeFilter object

Filters documents based on their mime type.

include object

Mime types must be in this set to be kept. Empty implies the universal set. That is, all mime types will be kept save those in the exclude set.

mimeTypes string[]

exclude object

Mime types must not be in this set to be kept. Empty implies the empty set.

mimeTypes string[]

originFilter object

Filters documents based on their origin.

origins object[]

Document origins must match one of these to be kept.

Array [

load boolean

processStep string

]

chunkSize int32

The desired chunk size for each chunk, in tokens. This is a strict maximum, as well as a target. Adjacent chunks will be combined if their total size is under this limit.

maxChunksPerDocument int32

The maximum number of chunks to produce for an individual document.

maxChunksTotal int32

The maximum number of chunks to produce in total. This cannot exceed 5000 in general. If you need more chunks in a single source, please contact the Fixie team.

embedSteps object[]

How to embed chunks for this source. Defaults will be provided if unset.

Array [

stepName string

The human-readable name for this step.

direct object

Embeds chunks directly.

parentChild object

Embeds chunks using a parent-child strategy.

]

description string

A human-readable description of this source.

stats object

The current stats for this source.

status enum

Possible values: [SOURCE_STATUS_UNSPECIFIED, SOURCE_STATUS_INITIALIZING, SOURCE_STATUS_READY, SOURCE_STATUS_UPDATING]

The current status of this source, indicating whether it affects queries.

This field should not be populated in client requests.

lastUpdated date-time

When a job last completed for this source.

This field should not be populated in client requests.

numDocs int32

The total number of documents in this source.

schedule object

How to automatically refresh this source. If unset, no automatic refresh jobs will be scheduled.

enablePeriodicRefresh boolean

Whether to enable automatic refresh for this source.

{
  "source": {
    "displayName": "string",
    "created": "2024-03-07T22:56:08.215Z",
    "modified": "2024-03-07T22:56:08.216Z",
    "corpusId": "string",
    "sourceId": "string",
    "loadSpec": {
      "maxDocuments": 0,
      "maxDocumentBytes": 0,
      "relevantDocumentTypes": {
        "include": {
          "mimeTypes": [
            "string"
          ]
        },
        "exclude": {
          "mimeTypes": [
            "string"
          ]
        }
      },
      "web": {
        "startUrls": [
          "string"
        ],
        "maxDepth": 0,
        "includeGlobPatterns": [
          "string"
        ],
        "excludeGlobPatterns": [
          "string"
        ]
      },
      "static": {
        "documents": [
          {
            "filename": "string",
            "mimeType": "string",
            "contents": "string",
            "metadata": {
              "publicUrl": "string",
              "language": "string",
              "title": "string",
              "description": "string",
              "published": "2024-03-07T22:56:08.216Z"
            }
          }
        ]
      }
    },
    "processSteps": [
      {
        "stepName": "string",
        "relevantDocumentTypes": {
          "include": {
            "mimeTypes": [
              "string"
            ]
          },
          "exclude": {
            "mimeTypes": [
              "string"
            ]
          }
        },
        "htmlToMarkdown": {},
        "unstructuredProcessor": {}
      }
    ],
    "chunkSpec": {
      "inputSelector": {
        "mimeTypeFilter": {
          "include": {
            "mimeTypes": [
              "string"
            ]
          },
          "exclude": {
            "mimeTypes": [
              "string"
            ]
          }
        },
        "originFilter": {
          "origins": [
            {
              "load": true,
              "processStep": "string"
            }
          ]
        }
      },
      "chunkSize": 0,
      "maxChunksPerDocument": 0,
      "maxChunksTotal": 0
    },
    "embedSteps": [
      {
        "stepName": "string",
        "direct": {},
        "parentChild": {}
      }
    ],
    "description": "string",
    "stats": {
      "status": "SOURCE_STATUS_UNSPECIFIED",
      "lastUpdated": "2024-03-07T22:56:08.216Z",
      "numDocs": 0
    },
    "schedule": {
      "enablePeriodicRefresh": true
    }
  }
}
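
A client might read the status out of a response shaped like the example above as follows. The helper names are hypothetical, and treating a missing field as SOURCE_STATUS_UNSPECIFIED is an assumption made for the sketch.

```python
def source_status(response: dict) -> str:
    """Extract source.stats.status from a parsed response body.

    Assumption: a missing field is treated as SOURCE_STATUS_UNSPECIFIED.
    """
    stats = response.get("source", {}).get("stats", {})
    return stats.get("status", "SOURCE_STATUS_UNSPECIFIED")


def is_ready(response: dict) -> bool:
    """True only when the source reports SOURCE_STATUS_READY."""
    return source_status(response) == "SOURCE_STATUS_READY"
```

A newly created source may report SOURCE_STATUS_INITIALIZING, in which case `is_ready` returns False.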

Default error response

application/json

Schema

Example (from schema)

code int32

The status code, which should be an enum value of google.rpc.Code.

message string

A developer-facing error message, which should be in English. Any user-facing error message should be localized and sent in the google.rpc.Status.details field, or localized by the client.

details object[]

A list of messages that carry the error details. There is a common set of message types for APIs to use.

Array [

@type string

The type of the serialized message.

]

{
  "code": 0,
  "message": "string",
  "details": [
    {
      "@type": "string"
    }
  ]
}
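
An error body of this shape could be unpacked as in the sketch below. The numeric-to-name table follows the standard google.rpc.Code enum; the function name `describe_error` and the sample message are hypothetical.

```python
import json
from typing import List, Tuple

# Names of the standard google.rpc.Code enum values.
RPC_CODE_NAMES = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED",
    9: "FAILED_PRECONDITION", 10: "ABORTED", 11: "OUT_OF_RANGE",
    12: "UNIMPLEMENTED", 13: "INTERNAL", 14: "UNAVAILABLE",
    15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}


def describe_error(payload: str) -> Tuple[str, str, List[str]]:
    """Return (code name, message, detail @type values) for an error body."""
    err = json.loads(payload)
    code = err.get("code")
    name = RPC_CODE_NAMES.get(code, f"UNRECOGNIZED({code})")
    # Each detail entry carries an '@type' naming its serialized message.
    detail_types = [d.get("@type", "") for d in err.get("details", [])]
    return name, err.get("message", ""), detail_types
```

Logging the code name alongside the developer-facing message tends to be easier to scan than the bare integer.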
