Corpus API Examples: Working with Sources

This page provides examples of how to use the Corpus API to work with sources. Examples include creating sources from URLs and from static documents; refreshing, clearing, and deleting sources; and modifying sources (changing the crawling depth or the pages included/excluded). These examples build on the corpus we created earlier. There are more details about working with Sources in the API docs.

Most, if not all, of what is covered here can also be accomplished using the Fixie Console. Examples are provided in four flavors, each corresponding to a different way of calling the Corpus API: via curl, using the Fixie CLI in the terminal, using the Fixie client in JavaScript, and using the Corpus REST API in Python.

Notes on Authentication

Calling the Corpus API requires a Fixie account. For the examples using curl or the REST API (in JS or Python), you will need to provide your Fixie API key. The examples use <TOKEN> to denote where you need to provide your key. You can find your API key (AKA "API token") on your profile page.

When you use the Fixie CLI, you will need to use the auth command. See the docs for more information.

Creating Sources

There are many ways to create sources. You can create them from URLs, and Fixie will crawl the content and store it. Sources can also be created from static files (PDFs, Word documents, etc.). This section contains various examples of creating sources, including examples that use some of the available options (e.g., crawling depth, included/excluded pages).

Create Source (from URL)

Create a source for the Fixie blog (located at https://fixie.ai/blog). Add this source to the corpus with ID 74e5bc4c-c2d9-4296-a2de-9c448d4bc307.

Some things to note:

  • We are using the corpus ID in the URL and in the body of the request.
  • Start URL(s) are required. In this case, we are only providing one.
  • An include pattern is used to ensure we only crawl content that is part of the blog.
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "source": {
      "displayName": "Fixie.ai blog",
      "loadSpec": {
        "web": {
          "startUrls": [
            "https://fixie.ai/blog"
          ],
          "includeGlobPatterns": [
            "https://fixie.ai/blog/**"
          ]
        }
      }
    }
  }'
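
If you are working in Python, a rough equivalent of the request above can be made with any HTTP client. The sketch below uses the third-party requests library and mirrors the curl payload; it is an illustrative sketch rather than an official Fixie client example. Replace <TOKEN> with your API key.

import requests

CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
API_KEY = "<TOKEN>"  # your Fixie API key

# Same JSON body as the curl example above.
payload = {
    "corpusId": CORPUS_ID,
    "source": {
        "displayName": "Fixie.ai blog",
        "loadSpec": {
            "web": {
                "startUrls": ["https://fixie.ai/blog"],
                "includeGlobPatterns": ["https://fixie.ai/blog/**"],
            }
        },
    },
}

response = requests.post(
    f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()
print(response.json())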

Create Source (Exclude Pages)

Above we saw how to create a source from a URL. In this example, we will create a source from a URL but exclude certain pages. In this case, we will create a source for the Fixie website (located at https://fixie.ai) but exclude the blog (https://fixie.ai/blog).

Just as we did earlier, we will add this source to the corpus with ID 74e5bc4c-c2d9-4296-a2de-9c448d4bc307.

Some things to note:

  • We are using the corpus ID in the URL and in the body of the request.
  • Start URL(s) are required. In this case, we are only providing one (for fixie.ai).
  • An include pattern is used to ensure we only crawl content that is part of the site.
  • An exclude pattern is used to ensure we do not crawl the blog.
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "source": {
      "displayName": "Fixie.ai Website",
      "loadSpec": {
        "web": {
          "startUrls": [
            "https://fixie.ai/"
          ],
          "includeGlobPatterns": [
            "https://fixie.ai/**"
          ],
          "excludeGlobPatterns": [
            "https://fixie.ai/blog/**"
          ]
        }
      }
    }
  }'
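
Since these web-source requests differ only in their displayName and loadSpec, it can be convenient in Python to wrap the call in a small helper. The helper below is our own illustrative sketch (its name and signature are not part of any Fixie library); it builds the same payload as the curl example above, with optional exclude patterns.

import requests

API_KEY = "<TOKEN>"  # your Fixie API key

def create_web_source(corpus_id, display_name, start_urls,
                      include_patterns, exclude_patterns=None):
    # Build the web loadSpec; excludeGlobPatterns is only sent when provided.
    web_spec = {
        "startUrls": start_urls,
        "includeGlobPatterns": include_patterns,
    }
    if exclude_patterns:
        web_spec["excludeGlobPatterns"] = exclude_patterns
    payload = {
        "corpusId": corpus_id,
        "source": {
            "displayName": display_name,
            "loadSpec": {"web": web_spec},
        },
    }
    response = requests.post(
        f"https://api.fixie.ai/api/v1/corpora/{corpus_id}/sources",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
    )
    response.raise_for_status()
    return response.json()

# Fixie website, excluding the blog (mirrors the curl example above).
create_web_source(
    "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "Fixie.ai Website",
    ["https://fixie.ai/"],
    ["https://fixie.ai/**"],
    exclude_patterns=["https://fixie.ai/blog/**"],
)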

Create Source (Increase URLs & Change Crawling Depth)

By default, Fixie will crawl URLs with a depth of 3. That means the URL provided will be crawled, any URLs on that page will be crawled, and any URLs on those pages will be crawled. This behavior can be changed by setting the maxDepth parameter.

In this example, we will revisit the Fixie blog source we created above and increase the number of URLs crawled and change the crawling depth.

Some things to note:

  • We are now providing two start URLs. One is for the current blog (https://fixie.ai/blog) and the other is for the (now deprecated) Medium blog (https://blog.fixie.ai/).
  • We must also have include patterns for each of the URLs.
  • We reduce the crawling depth to prevent external links from being crawled.
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "source": {
      "displayName": "Fixie.ai blog and Medium blog",
      "loadSpec": {
        "web": {
          "startUrls": [
            "https://fixie.ai/blog",
            "https://blog.fixie.ai/"
          ],
          "maxDepth": 2,
          "includeGlobPatterns": [
            "https://fixie.ai/blog/**",
            "https://blog.fixie.ai/**"
          ]
        }
      }
    }
  }'
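
In Python, the maxDepth setting is just another key in the web loadSpec. A brief, unofficial sketch of the same request using the requests library:

import requests

CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"

# Two start URLs, crawl depth limited to 2 (same values as the curl example above).
payload = {
    "corpusId": CORPUS_ID,
    "source": {
        "displayName": "Fixie.ai blog and Medium blog",
        "loadSpec": {
            "web": {
                "startUrls": ["https://fixie.ai/blog", "https://blog.fixie.ai/"],
                "maxDepth": 2,
                "includeGlobPatterns": ["https://fixie.ai/blog/**", "https://blog.fixie.ai/**"],
            }
        },
    },
}

requests.post(
    f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources",
    headers={"Authorization": "Bearer <TOKEN>"},  # your Fixie API key
    json=payload,
).raise_for_status()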

Create Source (from Static File)

In this example, we are going to add a new source based on a static document. This can be a PDF, Word document, etc. For static files, you must provide the document bytes and a MIME type.

Some things to note:

  • We are using the corpus ID in the URL and in the body of the request.
  • The static document must be Base64 encoded. The Fixie CLI will do this for you automatically. For curl, you can encode the file yourself (e.g., with the base64 command-line utility).
  • <BASE64_ENCODED_FILE> is a placeholder for the Base64 encoded PDF. You will need to replace this with the actual Base64 encoded PDF.
  • The correct MIME type must be used for the document. For PDFs, this is application/pdf. For Word (.doc) documents, this is application/msword. Consult a list of common MIME types if you are unsure.
  • We have a sample PDF that is an export of the Wikipedia page on LLMs. There are also sample .doc and .docx files that you can use for testing.

We will add this source to the corpus with ID 74e5bc4c-c2d9-4296-a2de-9c448d4bc307.

curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "source": {
      "displayName": "Wikipedia: Large Language Model",
      "loadSpec": {
        "static": {
          "documents": [
            {
              "filename": "LLM-Wikipedia.pdf",
              "mimeType": "application/pdf",
              "contents": "<BASE64_ENCODED_FILE>"
            }
          ]
        }
      }
    }
  }'
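
To build the Base64-encoded payload programmatically, you can read the file and encode it yourself. The sketch below uses Python's standard base64 module together with the requests library; it assumes the sample LLM-Wikipedia.pdf is in the current directory, so adjust the filename and MIME type for your own document.

import base64
import requests

CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
API_KEY = "<TOKEN>"  # your Fixie API key

# Read the PDF and Base64-encode its bytes.
with open("LLM-Wikipedia.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

payload = {
    "corpusId": CORPUS_ID,
    "source": {
        "displayName": "Wikipedia: Large Language Model",
        "loadSpec": {
            "static": {
                "documents": [
                    {
                        "filename": "LLM-Wikipedia.pdf",
                        "mimeType": "application/pdf",
                        "contents": encoded,
                    }
                ]
            }
        },
    },
}

response = requests.post(
    f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()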

Clearing & Deleting Sources

Let's delete the source we just created. We will need to use both the corpus ID (74e5bc4c-c2d9-4296-a2de-9c448d4bc307) and the source ID (42c505e7-672a-4ed0-9201-934da2e0b17d) in the URL.

Some things to note:

  • We must clear the source before deleting it. This ensures that queries against the corpus don't return chunks from the source while we are removing it.
  • We are using the corpus ID and the source ID in the URL (and, for the clear request, in the body as well).

Clear Source

Sources must be cleared before they can be deleted. Note: if any jobs are running on a source, calling clear will fail with a 409 error. If you don't care about the running jobs, you can use the force option; any running jobs against the source will be cancelled and the source will be cleared.

curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources/42c505e7-672a-4ed0-9201-934da2e0b17d/clear' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "sourceId": "42c505e7-672a-4ed0-9201-934da2e0b17d"
  }'

Delete Source

Once the source is cleared, we can delete it.

curl -L -X DELETE 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources/42c505e7-672a-4ed0-9201-934da2e0b17d' \
  -H 'Authorization: Bearer <TOKEN>'
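
The same clear-then-delete sequence, sketched (unofficially) in Python with the requests library:

import requests

CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
SOURCE_ID = "42c505e7-672a-4ed0-9201-934da2e0b17d"
API_KEY = "<TOKEN>"  # your Fixie API key

source_url = f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources/{SOURCE_ID}"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Clear the source first; this fails with a 409 if other jobs are running on it.
clear = requests.post(
    f"{source_url}/clear",
    headers=headers,
    json={"corpusId": CORPUS_ID, "sourceId": SOURCE_ID},
)
clear.raise_for_status()

# Once cleared, the source can be deleted.
delete = requests.delete(source_url, headers=headers)
delete.raise_for_status()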

Refreshing Sources

Refreshing a source will re-crawl it and update the content. Refresh starts a new job and will fail if there are other jobs currently running for the source.

Refresh Source (from URL)

Let's take the source we created above for the Fixie blog and refresh it. We will need to use both the corpus ID (74e5bc4c-c2d9-4296-a2de-9c448d4bc307) and the source ID (42c505e7-672a-4ed0-9201-934da2e0b17d). Refreshing a source will fail with a 409 error if another job is running on the source. You can use the force option to cancel any running jobs and refresh the source.

curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources/42c505e7-672a-4ed0-9201-934da2e0b17d/refresh' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "sourceId": "42c505e7-672a-4ed0-9201-934da2e0b17d"
  }'
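
And the same refresh call, sketched in Python with the requests library:

import requests

CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
SOURCE_ID = "42c505e7-672a-4ed0-9201-934da2e0b17d"

# Kick off a refresh job for the source; this fails with a 409 if another job is already running.
response = requests.post(
    f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources/{SOURCE_ID}/refresh",
    headers={"Authorization": "Bearer <TOKEN>"},  # your Fixie API key
    json={"corpusId": CORPUS_ID, "sourceId": SOURCE_ID},
)
response.raise_for_status()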