Get a Job.
Path Parameters
The ID of the corpus owning the job to retrieve.
The ID of the source owning the job to retrieve.
The ID of the job to retrieve.
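As a sketch, these three path parameters are substituted into the request URL. The path template and base URL below are illustrative assumptions for this page, not authoritative values; consult the API's endpoint listing for the real pattern.

```python
# Build the request URL for retrieving a job. Both the base URL and
# the path template here are assumptions for illustration only.
BASE_URL = "https://api.fixie.ai"  # assumed base URL

def job_url(corpus_id: str, source_id: str, job_id: str) -> str:
    """Substitute the three path parameters into the (assumed) path template."""
    return f"{BASE_URL}/corpora/{corpus_id}/sources/{source_id}/jobs/{job_id}"
```

An authenticated HTTP GET against this URL would return the job object described by the schema below.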
- 200
- default
OK
Schema
job object
A complete loading, processing, and embedding pipeline for a particular Source. A job can be started in one of several ways:
1. When a new Source is created.
2. Automatically based on the Source's JobSchedule.
3. Manually by forcing a refresh.
4. When a Document is updated or deleted.
In all cases, the job's loading, processing, and embedding parameters are copied from the Source when the Job is created to ensure unambiguous behavior. In the case of a document being updated or deleted, no loading will occur. Instead, processing and embedding will be performed with a scope limited to the updated document, possibly spawning other jobs to update the document's children. (If an aggregation processing step requires other documents, they'll be read from storage but won't be altered.)
The corpus that this job belongs to.
The source that this job belongs to.
The unique ID of this job.
For document updates and deletions, each job may spawn children to update documents derived from the updated document. If this job is such a child, then this is the parent job ID. (The parent is owned by the same corpus and source.)
Possible values: [JOB_STATE_UNSPECIFIED, JOB_STATE_PENDING, JOB_STATE_RUNNING, JOB_STATE_COMPLETED, JOB_STATE_FAILED, JOB_STATE_CANCELLED]
The current state of the job.
The timestamp that the job was requested.
The timestamp that the job began.
The timestamp that the job completed.
If the job failed (or was cancelled), the error message describing the failure.
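Given the state enum above, a client can poll the job until it reaches a terminal state. A minimal sketch of the terminal-state check (the state names come from the enum; treating exactly these three as terminal is an assumption based on their descriptions):

```python
# States from which the job will not transition again (per the enum
# above): completed, failed, or cancelled.
TERMINAL_STATES = {"JOB_STATE_COMPLETED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}

def is_terminal(state: str) -> bool:
    """True once the job has finished, errored, or been cancelled."""
    return state in TERMINAL_STATES
```

A polling loop would re-fetch the job and stop once `is_terminal(job["state"])` is true, then inspect `errorMessage` for failures.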
loadSpec object
The LoadSpec used during this job. This is the LoadSpec defined for the Source at the time this job was created.
Only one of the load_spec, updated_document_id, or deleted_document_id fields may be populated.
The maximum number of documents to ingest. This cannot exceed 200 in general. If you need more documents in a single corpus, please contact the Fixie team.
The maximum size of an individual document in bytes. If unset, a reasonable default will be chosen by Fixie.
relevantDocumentTypes object
The types of documents to keep. Any documents surfaced during loading that don't match this filter will be discarded. If unset, all documents will be kept.
include object
Mime types must be in this set to be kept. Empty implies the universal set. That is, all mime types will be kept save those in the exclude set.
exclude object
Mime types must not be in this set to be kept. Empty implies the empty set.
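The include/exclude semantics above can be read as a single predicate: a document is kept when the include set is empty or contains its mime type, and the exclude set does not contain it. This is an illustrative interpretation of the description, not code from the API:

```python
def is_kept(mime_type: str, include: set[str], exclude: set[str]) -> bool:
    """Empty include means the universal set; exclusion always wins."""
    in_include = not include or mime_type in include
    return in_include and mime_type not in exclude
```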
web object
Allows loading documents by crawling the web.
Only one of the web or static fields may be populated when creating a new source.
The list of start URLs to crawl.
The maximum depth of links to traverse. If 0 (or unset), there will be no depth limit.
A set of glob patterns matched against any additional discovered URLs. URLs matching these patterns will be included in the crawl, unless the URL matches any of the exclude_glob_patterns.
A set of glob patterns matched against any additional discovered URLs. URLs matching these patterns will be excluded from the crawl.
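The same include-then-exclude rule applies to discovered URLs. A sketch using Python's `fnmatch` as a stand-in for the server's glob matcher (the actual glob dialect is not specified on this page, so treat the matching semantics as an assumption):

```python
from fnmatch import fnmatch

def should_crawl(url: str, include_globs: list[str], exclude_globs: list[str]) -> bool:
    """A discovered URL is crawled if it matches an include pattern and
    no exclude pattern. fnmatch is an assumed stand-in for the real matcher."""
    included = any(fnmatch(url, g) for g in include_globs)
    excluded = any(fnmatch(url, g) for g in exclude_globs)
    return included and not excluded
```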
static object
Allows loading documents from a static source (e.g. a file upload).
Only one of the web or static fields may be populated when creating a new source.
documents object[] required
The documents to load.
The filename of the document.
The MIME type of the document.
The contents of the document.
metadata object
The metadata to attach to this document.
The public URL of the document, if any.
The BCP47 language code of the document, if known.
The title of the document, if known.
The description of the document, if known.
The timestamp that the document was published, if known.
The ID of the document that was updated whose direct children should be reprocessed and whose chunks should be recomputed as part of this job. (The document is owned by the same corpus and source as this job.)
Only one of the load_spec, updated_document_id, or deleted_document_id fields may be populated.
The ID of the document to be deleted whose direct children and chunks should be deleted (or updated in the case of a child created by an aggregation processing step) as part of this job. (The document is owned by the same corpus and source as this job.)
Only one of the load_spec, updated_document_id, or deleted_document_id fields may be populated.
processSteps object[]
The ProcessSteps used during this job. These are the ProcessSteps defined for the Source at the time this job was created. In the case of an updated/deleted document child job, this may be a sublist of the Source's ProcessSteps.
The human-readable name of the step.
relevantDocumentTypes object
The types of documents to which this step applies. Leave empty to apply to all documents.
include object
Mime types must be in this set to be kept. Empty implies the universal set. That is, all mime types will be kept save those in the exclude set.
exclude object
Mime types must not be in this set to be kept. Empty implies the empty set.
Converts HTML documents to Markdown. Use with relevant_document_types set to include only text/html.
Converts binary documents to plain text.
chunkSpec object
The ChunkSpec used during this job. This is the ChunkSpec defined for the Source at the time this job was created.
inputSelector object
The input documents that should be chunked. Only documents that correspond to UTF-8 encoded text can be chunked. Any other kind of document will fail.
mimeTypeFilter object
Filters documents based on their mime type.
include object
Mime types must be in this set to be kept. Empty implies the universal set. That is, all mime types will be kept save those in the exclude set.
exclude object
Mime types must not be in this set to be kept. Empty implies the empty set.
originFilter object
Filters documents based on their origin.
origins object[]
Document origins must match one of these to be kept.
The desired chunk size for each chunk, in tokens. This is a strict maximum, as well as a target. Adjacent chunks will be combined if their total size is under this limit.
The maximum number of chunks to produce for an individual document.
The maximum number of chunks to produce in total. This cannot exceed 5000 in general. If you need more chunks in a single source, please contact the Fixie team.
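The "strict maximum, as well as a target" behavior of chunkSize can be illustrated with a greedy merge over adjacent chunk sizes: two neighbors are combined while their total stays within the limit. This is a sketch of the stated behavior, not the service's actual chunking algorithm:

```python
def combine_adjacent(chunk_sizes: list[int], chunk_size: int) -> list[int]:
    """Greedily merge adjacent chunks whose combined token count
    stays within chunk_size (the strict maximum)."""
    merged: list[int] = []
    for size in chunk_sizes:
        if merged and merged[-1] + size <= chunk_size:
            merged[-1] += size  # combine with the previous chunk
        else:
            merged.append(size)
    return merged
```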
embedSteps object[]
The EmbedSteps used during this job. These are the EmbedSteps defined for the Source at the time this job was created.
The human-readable name for this step.
Embeds chunks directly.
Embeds chunks using a parent-child strategy.
loadResult object
The results of loading.
The timestamp at which loading began.
The timestamp at which loading completed.
The number of documents created.
The number of documents that existed previously and were updated.
The number of documents that existed previously and were not modified.
The number of documents deleted because they're no longer present in the source.
The number of documents omitted due to content size.
The number of documents omitted due to mime type.
processStepResults object[]
The results of each processing step.
The step that produced these results.
The number of documents expected from this step prior to execution.
The timestamp that this processing step began.
The timestamp that this processing step completed.
The total number of documents produced by this step.
The number of documents that failed to be processed.
The number of documents created.
The number of documents that existed previously and were updated.
The number of documents that existed previously for which processing produced the same result as before.
The number of documents deleted because they were previously produced by this step but weren't with the latest input.
chunkResult object
The results of chunking.
The timestamp that chunking began.
The timestamp that chunking completed.
The number of documents expected to be chunked prior to execution.
The number of documents successfully chunked.
The number of documents that failed to be chunked.
The number of chunks created.
The number of chunks that were not modified.
The number of chunks deleted.
embedStepResults object[]
The results of each embedding step.
The step that produced these results.
The timestamp that this embedding step began.
The timestamp that this embedding step completed.
The number of chunks expected to be embedded by this step prior to execution.
The number of chunks successfully embedded.
The number of chunks that failed to be embedded.
The number of vectors created.
The number of vectors that were not modified.
The number of vectors deleted.
{
"job": {
"corpusId": "string",
"sourceId": "string",
"jobId": "string",
"parentJobId": "string",
"state": "JOB_STATE_UNSPECIFIED",
"created": "2024-03-07T22:56:08.272Z",
"started": "2024-03-07T22:56:08.272Z",
"completed": "2024-03-07T22:56:08.272Z",
"errorMessage": "string",
"loadSpec": {
"maxDocuments": 0,
"maxDocumentBytes": 0,
"relevantDocumentTypes": {
"include": {
"mimeTypes": [
"string"
]
},
"exclude": {
"mimeTypes": [
"string"
]
}
},
"web": {
"startUrls": [
"string"
],
"maxDepth": 0,
"includeGlobPatterns": [
"string"
],
"excludeGlobPatterns": [
"string"
]
},
"static": {
"documents": [
{
"filename": "string",
"mimeType": "string",
"contents": "string",
"metadata": {
"publicUrl": "string",
"language": "string",
"title": "string",
"description": "string",
"published": "2024-03-07T22:56:08.273Z"
}
}
]
}
},
"updatedDocumentId": "string",
"deletedDocumentId": "string",
"processSteps": [
{
"stepName": "string",
"relevantDocumentTypes": {
"include": {
"mimeTypes": [
"string"
]
},
"exclude": {
"mimeTypes": [
"string"
]
}
},
"htmlToMarkdown": {},
"unstructuredProcessor": {}
}
],
"chunkSpec": {
"inputSelector": {
"mimeTypeFilter": {
"include": {
"mimeTypes": [
"string"
]
},
"exclude": {
"mimeTypes": [
"string"
]
}
},
"originFilter": {
"origins": [
{
"load": true,
"processStep": "string"
}
]
}
},
"chunkSize": 0,
"maxChunksPerDocument": 0,
"maxChunksTotal": 0
},
"embedSteps": [
{
"stepName": "string",
"direct": {},
"parentChild": {}
}
],
"loadResult": {
"started": "2024-03-07T22:56:08.273Z",
"completed": "2024-03-07T22:56:08.273Z",
"createdDocsCount": 0,
"updatedDocsCount": 0,
"unchangedDocsCount": 0,
"deletedDocsCount": 0,
"sizeFilteredDocsCount": 0,
"typeFilteredDocsCount": 0
},
"processStepResults": [
{
"stepName": "string",
"expectedOutputDocsCount": 0,
"started": "2024-03-07T22:56:08.273Z",
"completed": "2024-03-07T22:56:08.273Z",
"producedDocsCount": 0,
"failedDocsCount": 0,
"createdDocsCount": 0,
"updatedDocsCount": 0,
"unchangedDocsCount": 0,
"deletedDocsCount": 0
}
],
"chunkResult": {
"started": "2024-03-07T22:56:08.273Z",
"completed": "2024-03-07T22:56:08.273Z",
"expectedDocsCount": 0,
"successfulDocsCount": 0,
"failedDocsCount": 0,
"createdChunksCount": 0,
"unchangedChunksCount": 0,
"deletedChunksCount": 0
},
"embedStepResults": [
{
"stepName": "string",
"started": "2024-03-07T22:56:08.273Z",
"completed": "2024-03-07T22:56:08.273Z",
"expectedChunksCount": 0,
"successfulChunksCount": 0,
"failedChunksCount": 0,
"createdVectorsCount": 0,
"unchangedVectorsCount": 0,
"deletedVectorsCount": 0
}
]
}
}
Default error response
Schema
The status code, which should be an enum value of [google.rpc.Code][google.rpc.Code].
A developer-facing error message, which should be in English. Any user-facing error message should be localized and sent in the [google.rpc.Status.details][google.rpc.Status.details] field, or localized by the client.
details object[]
A list of messages that carry the error details. There is a common set of message types for APIs to use.
The type of the serialized message.
{
"code": 0,
"message": "string",
"details": [
{
"@type": "string"
}
]
}