Corpus API Examples: Working with Sources
This page provides examples of how to use the Corpus API to work with sources: creating sources from URLs and from static documents; refreshing, clearing, and deleting sources; and modifying sources (changing the crawling depth or the pages included or excluded). These examples build on the earlier steps for creating a corpus. There are more details about working with sources in the API docs.
Most, if not all, of what is covered here can also be accomplished in the Fixie Console. Each example is provided in four flavors, one for each way of calling the Corpus API: curl, the Fixie CLI in the terminal, the Fixie client in JavaScript, and the Corpus REST API in Python.
Calling the Corpus API requires a Fixie account. For the examples using curl or the REST API (in JS or Python), you will need to provide your Fixie API key. The examples use <TOKEN> to denote where you need to provide your key. You can find your API key (AKA "API token") on your profile page.
When you use the Fixie CLI, you will need to use the auth command. See the docs for more information.
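Rather than pasting the key into every script, one option is to read it from the environment. The sketch below builds the same three headers the REST examples on this page send. The function name `build_headers` and the environment variable name `FIXIE_API_KEY` are assumptions made for this illustration; this page does not specify either.

```python
import os


def build_headers(api_key=None, env_var="FIXIE_API_KEY"):
    """Return the headers used by the REST examples on this page.

    env_var is an illustrative name chosen for this sketch, not a
    documented setting.
    """
    key = api_key or os.environ.get(env_var)
    if not key:
        raise ValueError(f"No API key given and {env_var} is not set")
    return {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {key}",
    }
```

The returned dictionary can be passed as `headers=` in place of the hand-written `headers` dictionaries in the Python examples.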
Creating Sources
There are many ways to create sources. You can create them from URLs and Fixie will crawl the content and store it. Sources can also be created from static files (PDFs, Word documents, etc.). This section contains various examples of creating sources including using some of the options (e.g. crawling depth, included/excluded pages, etc.).
Create Source (from URL)
Create a source for the Fixie blog (located at https://fixie.ai/blog). Add this source to the corpus with ID 74e5bc4c-c2d9-4296-a2de-9c448d4bc307.
Some things to note:
- We are using the corpus ID in the URL and in the body of the request.
- Start URL(s) are required. In this case, we are only providing one.
- Include pattern is used to ensure we only crawl content that is part of the blog.
- curl
- Fixie CLI
- JavaScript
- Python
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
"source": {
"displayName": "Fixie.ai blog",
"loadSpec": {
"web": {
"startUrls": [
"https://fixie.ai/blog"
],
"includeGlobPatterns": [
"https://fixie.ai/blog/**"
]
}
}
}
}'
npx fixie@latest corpus sources add 74e5bc4c-c2d9-4296-a2de-9c448d4bc307 \
'https://fixie.ai/blog' --include-patterns 'https://fixie.ai/blog/**'
// Import the Fixie client
import { FixieClient } from "fixie";
// Set the API key and Create the Fixie client
const API_KEY = "<TOKEN>";
const fixieClient = new FixieClient({ apiKey: API_KEY });
// Set our variables for the Corpus and Source
const CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307";
const DISPLAY_NAME = "Fixie.ai Blog";
const START_URLS = ["https://fixie.ai/blog"];
const INCLUDE_PATTERNS = ["https://fixie.ai/blog/**"];
// Add the source to the corpus
fixieClient.addCorpusSource({ corpusId: CORPUS_ID, startUrls: START_URLS, includeGlobs: INCLUDE_PATTERNS, displayName: DISPLAY_NAME }).then((source) => {
console.log(JSON.stringify(source));
});
import requests
import json
API_KEY = "<TOKEN>"
CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
DISPLAY_NAME = "Fixie.ai Blog"
START_URL = "https://fixie.ai/blog"
INCLUDE_PATTERN = "https://fixie.ai/blog/**"
url = f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources"
payload = json.dumps({
"corpusId": f"{CORPUS_ID}",
"source": {
"displayName": f"{DISPLAY_NAME}",
"loadSpec": {
"web": {
"startUrls": [
f"{START_URL}"
],
"includeGlobPatterns": [
f"{INCLUDE_PATTERN}"
]
}
}
}
})
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
'Authorization': f'Bearer {API_KEY}'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
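The request body has the same shape in every web-source example on this page, so it may be convenient to build it in one place. The following is a minimal sketch, not part of the Fixie client: the helper name `build_web_source_payload` is hypothetical, while the field names (`corpusId`, `source`, `displayName`, `loadSpec`, `web`, `startUrls`, `includeGlobPatterns`, `excludeGlobPatterns`, `maxDepth`) are taken from the request bodies shown on this page.

```python
import json


def build_web_source_payload(corpus_id, display_name, start_urls,
                             include_patterns=None, exclude_patterns=None,
                             max_depth=None):
    """Build the JSON-serializable body for creating a web source."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    web = {"startUrls": list(start_urls)}
    # Optional fields are only added when a value is supplied.
    if include_patterns:
        web["includeGlobPatterns"] = list(include_patterns)
    if exclude_patterns:
        web["excludeGlobPatterns"] = list(exclude_patterns)
    if max_depth is not None:
        web["maxDepth"] = max_depth
    return {
        "corpusId": corpus_id,
        "source": {"displayName": display_name, "loadSpec": {"web": web}},
    }


payload = build_web_source_payload(
    corpus_id="74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    display_name="Fixie.ai Blog",
    start_urls=["https://fixie.ai/blog"],
    include_patterns=["https://fixie.ai/blog/**"],
)
body = json.dumps(payload)  # string suitable for the data= argument
```

Passing lists straight through keeps each URL and pattern as its own JSON array element, which matters once there is more than one start URL.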
Create Source (Exclude Pages)
Above we saw how to create a source from a URL. In this example, we will create a source from a URL but exclude certain pages. In this case, we will create a source for the Fixie website (located at https://fixie.ai) but exclude the blog (https://fixie.ai/blog).
Just as we did earlier, we will add this source to the corpus with ID 74e5bc4c-c2d9-4296-a2de-9c448d4bc307.
Some things to note:
- We are using the corpus ID in the URL and in the body of the request.
- Start URL(s) are required. In this case, we are only providing one (for fixie.ai).
- Include pattern is used to ensure we only crawl content that is part of the site.
- Exclude pattern is used to ensure we do not crawl the blog.
- curl
- Fixie CLI
- JavaScript
- Python
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
"source": {
"displayName": "Fixie.ai Website",
"loadSpec": {
"web": {
"startUrls": [
"https://fixie.ai/"
],
"includeGlobPatterns": [
"https://fixie.ai/**"
],
"excludeGlobPatterns": [
"https://fixie.ai/blog/**"
]
}
}
}
}'
npx fixie@latest corpus sources add 74e5bc4c-c2d9-4296-a2de-9c448d4bc307 \
'https://fixie.ai' --include-patterns 'https://fixie.ai/**' \
--exclude-patterns 'https://fixie.ai/blog/**'
// Import the Fixie client
import { FixieClient } from "fixie";
// Set the API key and Create the Fixie client
const API_KEY = "<TOKEN>";
const fixieClient = new FixieClient({ apiKey: API_KEY });
// Set our variables for the Corpus and Source
const CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307";
const DISPLAY_NAME = "Fixie.ai Website";
const START_URLS = ["https://fixie.ai"];
const INCLUDE_PATTERNS = ["https://fixie.ai/**"];
const EXCLUDE_PATTERNS = ["https://fixie.ai/blog/**"];
// Add the source to the corpus
fixieClient.addCorpusSource({ corpusId: CORPUS_ID, startUrls: START_URLS, includeGlobs: INCLUDE_PATTERNS, excludeGlobs: EXCLUDE_PATTERNS, displayName: DISPLAY_NAME }).then((source) => {
console.log(JSON.stringify(source));
});
import requests
import json
API_KEY = "<TOKEN>"
CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
DISPLAY_NAME = "Fixie.ai Website"
START_URL = "https://fixie.ai"
INCLUDE_PATTERN = "https://fixie.ai/**"
EXCLUDE_PATTERN = "https://fixie.ai/blog/**"
url = f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources"
payload = json.dumps({
"corpusId": f"{CORPUS_ID}",
"source": {
"displayName": f"{DISPLAY_NAME}",
"loadSpec": {
"web": {
"startUrls": [
f"{START_URL}"
],
"includeGlobPatterns": [
f"{INCLUDE_PATTERN}"
],
"excludeGlobPatterns": [
f"{EXCLUDE_PATTERN}"
]
}
}
}
})
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
'Authorization': f'Bearer {API_KEY}'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
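To get a rough feel for how include and exclude patterns combine before starting a crawl, the patterns can be tried locally. The sketch below is only an approximation: it uses Python's `fnmatch` module, whose wildcard rules are not guaranteed to match the glob semantics Fixie applies, and the function name `is_in_scope` is hypothetical. It treats a URL as in scope when it matches at least one include pattern and no exclude pattern.

```python
from fnmatch import fnmatchcase


def is_in_scope(url, include_patterns, exclude_patterns=()):
    """Approximate include/exclude filtering with fnmatch-style wildcards.

    This is a local illustration only; the crawler's own matching rules
    may differ from fnmatch.
    """
    included = any(fnmatchcase(url, p) for p in include_patterns)
    excluded = any(fnmatchcase(url, p) for p in exclude_patterns)
    return included and not excluded


INCLUDE = ["https://fixie.ai/**"]
EXCLUDE = ["https://fixie.ai/blog/**"]

for candidate in ["https://fixie.ai/docs/intro",
                  "https://fixie.ai/blog/some-post",
                  "https://example.com/"]:
    print(candidate, is_in_scope(candidate, INCLUDE, EXCLUDE))
```

Under this approximation the docs page is in scope, the blog post is filtered out by the exclude pattern, and the external URL never matches an include pattern.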
Create Source (Increase URLs & Change Crawling Depth)
By default, Fixie crawls URLs to a depth of 3: the URL provided is crawled, then any URLs on that page, then any URLs on those pages. This behavior can be changed by setting the maxDepth parameter.
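The depth rule described above can be illustrated with a small breadth-first traversal. This is a sketch under stated assumptions, not Fixie's crawler: an in-memory dictionary stands in for real pages and their links, the function name `crawl_within_depth` is hypothetical, and include/exclude patterns are not modeled. The start URLs count as depth 1, matching the description of the default depth of 3.

```python
from collections import deque


def crawl_within_depth(links, start_urls, max_depth=3):
    """Return the pages visited when start URLs count as depth 1.

    links maps each page to the pages it links to (a toy stand-in
    for fetching a page and extracting its URLs).
    """
    seen = set()
    queue = deque()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 1))
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links from pages at the depth limit
        for linked in links.get(url, []):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    return visited


toy_site = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
print(crawl_within_depth(toy_site, ["a"], max_depth=3))
print(crawl_within_depth(toy_site, ["a"], max_depth=2))
```

With a depth of 3 the traversal reaches `d` but not `e`; lowering the depth to 2 stops at the pages linked directly from the start URL.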
In this example, we will revisit the Fixie blog source we created above, add a second start URL so that more URLs are crawled, and change the crawling depth.
Some things to note:
- We are now providing two start URLs. One is for the current blog (https://fixie.ai/blog) and the other is for the (now deprecated) Medium blog (https://blog.fixie.ai/).
- We must also have include patterns for each of the URLs.
- We reduce the crawling depth to prevent external links from being crawled.
- curl
- Fixie CLI
- JavaScript
- Python
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
"source": {
"displayName": "Fixie.ai blog and Medium blog",
"loadSpec": {
"web": {
"startUrls": [
"https://fixie.ai/blog",
"https://blog.fixie.ai/"
],
"maxDepth": 2,
"includeGlobPatterns": [
"https://fixie.ai/blog/**",
"https://blog.fixie.ai/**"
]
}
}
}
}'
npx fixie@latest corpus sources add 74e5bc4c-c2d9-4296-a2de-9c448d4bc307 \
'https://fixie.ai/blog' 'https://blog.fixie.ai' --include-patterns 'https://fixie.ai/blog/**' \
'https://blog.fixie.ai/**' --max-depth 2
// Import the Fixie client
import { FixieClient } from "fixie";
// Set the API key and Create the Fixie client
const API_KEY = "<TOKEN>";
const fixieClient = new FixieClient({ apiKey: API_KEY });
// Set our variables for the Corpus and Source
const CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307";
const DISPLAY_NAME = "Fixie.ai Blog and Medium Blog";
const START_URLS = ["https://fixie.ai/blog", "https://blog.fixie.ai"];
const INCLUDE_PATTERNS = ["https://fixie.ai/blog/**", "https://blog.fixie.ai/**"];
const MAX_DEPTH = 2;
// Add the source to the corpus
fixieClient.addCorpusSource({ corpusId: CORPUS_ID, startUrls: START_URLS, includeGlobs: INCLUDE_PATTERNS, maxDepth: MAX_DEPTH, displayName: DISPLAY_NAME }).then((source) => {
console.log(JSON.stringify(source));
});
import requests
import json
API_KEY = "<TOKEN>"
CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
DISPLAY_NAME = "Fixie.ai Blog and Medium Blog"
START_URLS = ["https://fixie.ai/blog", "https://blog.fixie.ai"]
INCLUDE_PATTERNS = ["https://fixie.ai/blog/**", "https://blog.fixie.ai/**"]
MAX_DEPTH = 2
url = f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources"
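payload = json.dumps({
"corpusId": f"{CORPUS_ID}",
"source": {
"displayName": f"{DISPLAY_NAME}",
"loadSpec": {
"web": {
# START_URLS and INCLUDE_PATTERNS are already lists, so pass them directly.
# Wrapping them in an f-string would send one string holding the list's repr.
"startUrls": START_URLS,
"maxDepth": MAX_DEPTH,
"includeGlobPatterns": INCLUDE_PATTERNS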
}
}
}
})
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
'Authorization': f'Bearer {API_KEY}'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
Create Source (from Static File)
In this example we are going to add a new source based on a static document. This can be a PDF, Word Document, etc. For static files you must provide bytes and a MIME type.
Some things to note:
- We are using the corpus ID in the URL and in the body of the request.
- The static document must be Base64 encoded. The Fixie CLI will automatically do this for you. For curl, you will need to encode the file yourself with a Base64 tool.
- <BASE64_ENCODED_FILE> is a placeholder for the Base64 encoded PDF. You will need to replace this with the actual Base64 encoded PDF.
- The correct MIME type must be used for the document. For PDFs, this is application/pdf. For Word documents, this is application/msword. See a list of common MIME types for more information.
- We have a sample PDF that is an export of the Wikipedia page on LLMs. There are also sample .doc and .docx files that you can use for testing.
We will add this source to the corpus with ID 74e5bc4c-c2d9-4296-a2de-9c448d4bc307.
- curl
- Fixie CLI
- JavaScript
- Python
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
"source": {
"displayName": "Wikipedia: Large Language Model",
"loadSpec": {
"static": {
"documents": [
{
"filename": "LLM-Wikipedia.pdf",
"mimeType": "application/pdf",
"contents": "<BASE64_ENCODED_FILE>"
}
]
}
}
}
}'
npx fixie@latest corpus sources upload 74e5bc4c-c2d9-4296-a2de-9c448d4bc307 \
'application/pdf' 'LLM-Wikipedia.pdf'
// This works in Node.js but not in the browser. To run in the browser, you need to get the static file into a Blob.
// Import the Fixie client and fs
import { FixieClient } from "fixie";
import fs from 'fs';
// Set the API key and Create the Fixie client
const API_KEY = "<TOKEN>";
const fixieClient = new FixieClient({ apiKey: API_KEY });
// Set our variables for the Corpus and Source
const CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307";
const STATIC_FILE = "LLM-Wikipedia.pdf";
const MIME_TYPE = "application/pdf";
const DISPLAY_NAME = "Wikipedia page for Large Language Model";
// Get our file as a blob
let blob = new Blob([fs.readFileSync(STATIC_FILE)]);
// Add our file to an array of files to be added to the corpus
let files = [{filename: STATIC_FILE, mimeType: MIME_TYPE, contents: blob}];
// Add the source to the corpus
fixieClient.addCorpusFileSource({ corpusId: CORPUS_ID, files: files, displayName: DISPLAY_NAME }).then((source) => {
console.log(JSON.stringify(source));
});
import requests
import json
import base64
API_KEY = "<TOKEN>"
CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
STATIC_FILE = "LLM-Wikipedia.pdf"
MIME_TYPE = "application/pdf"
DISPLAY_NAME = "Wikipedia page for Large Language Model"
url = f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources"
# Base64 encode our static file
encoded_static_file = ""
with open(STATIC_FILE, "rb") as static_file:
    encoded_static_file = base64.b64encode(static_file.read()).decode('utf-8')
# Create the request
payload = json.dumps({
"corpusId": f"{CORPUS_ID}",
"source": {
"displayName": f"{DISPLAY_NAME}",
"loadSpec": {
"static": {
"documents": [
{
"filename": STATIC_FILE,
"mimeType": MIME_TYPE,
"contents": encoded_static_file
}
]
}
}
}
})
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
'Authorization': f'Bearer {API_KEY}'
}
# Send the request and log results
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
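When several files are uploaded, it can help to build each entry of the `documents` array with a small helper that Base64 encodes the bytes and fills in the MIME type. The sketch below is one way to do that using only the standard library; the helper name `build_static_document` is hypothetical, and guessing the MIME type from the file extension with `mimetypes` is an assumption of this example rather than something the API requires.

```python
import base64
import mimetypes


def build_static_document(filename, data, mime_type=None):
    """Build one entry for the "documents" array of a static source.

    data is the raw file content as bytes. If mime_type is omitted,
    it is guessed from the filename extension.
    """
    resolved_type = mime_type or mimetypes.guess_type(filename)[0]
    if resolved_type is None:
        raise ValueError(f"Could not determine a MIME type for {filename}")
    return {
        "filename": filename,
        "mimeType": resolved_type,
        "contents": base64.b64encode(data).decode("ascii"),
    }


# Example with in-memory bytes; real code would read the file in "rb" mode.
document = build_static_document("LLM-Wikipedia.pdf", b"%PDF-1.4 sample bytes")
```

The resulting dictionary can be placed in the `documents` list of the payload shown above. Passing `mime_type` explicitly is the safer choice for extensions the local MIME table may not know.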
Clearing & Deleting Sources
Let's delete the source we just created. We will need to use both the corpus ID (74e5bc4c-c2d9-4296-a2de-9c448d4bc307) and the source ID (42c505e7-672a-4ed0-9201-934da2e0b17d) in the URL.
Some things to note:
- We must clear the source before deleting it. This is to ensure that we don't get back chunks of data from the source if the corpus is queried at the time we are removing the source.
- We are using the corpus ID and the source ID in the URL and in the body of the request.
Clear Source
Sources must be cleared before they can be deleted. Note: if any jobs are running on a source, calling clear will fail with a 409 error. If you don't care about the jobs that are running, you can use the force option: any jobs running against the source will be cancelled and the source will be cleared.
- curl
- Fixie CLI
- JavaScript
- Python
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources/42c505e7-672a-4ed0-9201-934da2e0b17d/clear' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
"sourceId": "42c505e7-672a-4ed0-9201-934da2e0b17d"
}'
npx fixie@latest corpus sources clear 74e5bc4c-c2d9-4296-a2de-9c448d4bc307 \
42c505e7-672a-4ed0-9201-934da2e0b17d
// Import the Fixie client
import { FixieClient } from "fixie";
// Set the API key and Create the Fixie client
const API_KEY = "<TOKEN>";
const fixieClient = new FixieClient({ apiKey: API_KEY });
// Set our variables for the Corpus and Source
const CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307";
const SOURCE_ID = "42c505e7-672a-4ed0-9201-934da2e0b17d";
// Clear the source
fixieClient.clearCorpusSource({ corpusId: CORPUS_ID, sourceId: SOURCE_ID }).then((source) => {
console.log(JSON.stringify(source));
});
import requests
import json
API_KEY = "<TOKEN>"
CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
SOURCE_ID = "42c505e7-672a-4ed0-9201-934da2e0b17d"
url = f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources/{SOURCE_ID}/clear"
payload = json.dumps({
"corpusId": f"{CORPUS_ID}",
"sourceId": f"{SOURCE_ID}"
})
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
'Authorization': f'Bearer {API_KEY}'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
Delete Source
Once the source is cleared, we can delete it.
- curl
- Fixie CLI
- JavaScript
- Python
curl -L -X DELETE 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources/42c505e7-672a-4ed0-9201-934da2e0b17d' \
-H 'Authorization: Bearer <TOKEN>'
npx fixie@latest corpus sources delete 74e5bc4c-c2d9-4296-a2de-9c448d4bc307 \
42c505e7-672a-4ed0-9201-934da2e0b17d
// Import the Fixie client
import { FixieClient } from "fixie";
// Set the API key and Create the Fixie client
const API_KEY = "<TOKEN>";
const fixieClient = new FixieClient({ apiKey: API_KEY });
// Set our variables for the Corpus and Source
const CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307";
const SOURCE_ID = "42c505e7-672a-4ed0-9201-934da2e0b17d";
// Delete the source
fixieClient.deleteCorpusSource({ corpusId: CORPUS_ID, sourceId: SOURCE_ID }).then((source) => {
console.log(JSON.stringify(source));
});
import requests
API_KEY = "<TOKEN>"
CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
SOURCE_ID = "42c505e7-672a-4ed0-9201-934da2e0b17d"
url = f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources/{SOURCE_ID}"
payload={}
headers = {
'Authorization': f'Bearer {API_KEY}'
}
response = requests.request("DELETE", url, headers=headers, data=payload)
print(response.text)
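Because clearing must succeed before deleting, the two calls are naturally chained. The following sketch shows that ordering with the HTTP call abstracted away: `send` is a hypothetical callable (method, URL, JSON body) that returns the HTTP status code, which could wrap `requests.request` as in the examples above. The function name and the returned messages are illustrative only.

```python
def clear_then_delete(send, base_url, corpus_id, source_id):
    """Clear a source and delete it only if the clear succeeded.

    send(method, url, body) is a caller-supplied function that performs
    the request and returns the HTTP status code.
    """
    source_url = f"{base_url}/corpora/{corpus_id}/sources/{source_id}"
    body = {"corpusId": corpus_id, "sourceId": source_id}

    status = send("POST", f"{source_url}/clear", body)
    if status == 409:
        # A job is still running on the source, so nothing was cleared.
        return "clear blocked by a running job"
    if not 200 <= status < 300:
        return f"clear failed with status {status}"

    status = send("DELETE", source_url, None)
    if not 200 <= status < 300:
        return f"delete failed with status {status}"
    return "deleted"
```

Stopping on a 409 leaves the decision to wait or to force the clear with the caller, rather than deleting a source that still holds data.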
Refreshing Sources
Refreshing a source re-crawls it and updates the content. Refresh starts a new job and will fail if there are other jobs currently running for the source.
Refresh Source (from URL)
Let's take the source we created above for the Fixie blog and refresh it. We will need to use both the corpus ID (74e5bc4c-c2d9-4296-a2de-9c448d4bc307) and the source ID (42c505e7-672a-4ed0-9201-934da2e0b17d). Refreshing a source will fail with a 409 error if another job is running on the source. You can use the force option to cancel any running jobs and refresh the source.
- curl
- Fixie CLI
- JavaScript
- Python
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources/42c505e7-672a-4ed0-9201-934da2e0b17d/refresh' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
"sourceId": "42c505e7-672a-4ed0-9201-934da2e0b17d"
}'
npx fixie@latest corpus sources refresh 74e5bc4c-c2d9-4296-a2de-9c448d4bc307 \
42c505e7-672a-4ed0-9201-934da2e0b17d
// Import the Fixie client
import { FixieClient } from "fixie";
// Set the API key and Create the Fixie client
const API_KEY = "<TOKEN>";
const fixieClient = new FixieClient({ apiKey: API_KEY });
// Set our variables for the Corpus and Source
const CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307";
const SOURCE_ID = "42c505e7-672a-4ed0-9201-934da2e0b17d";
// Refresh the source
fixieClient.refreshCorpusSource({ corpusId: CORPUS_ID, sourceId: SOURCE_ID }).then((source) => {
console.log(JSON.stringify(source));
});
import requests
import json
API_KEY = "<TOKEN>"
CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
SOURCE_ID = "42c505e7-672a-4ed0-9201-934da2e0b17d"
url = f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources/{SOURCE_ID}/refresh"
payload = json.dumps({
"corpusId": CORPUS_ID,
"sourceId": SOURCE_ID
})
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
'Authorization': f'Bearer {API_KEY}'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
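If cancelling running jobs is not desirable, an alternative to the force option is to wait and try the refresh again when a 409 comes back. The sketch below shows that retry loop in isolation. `send_refresh` is a hypothetical zero-argument callable that issues the refresh request and returns the HTTP status code, and the attempt count and delay are arbitrary values chosen for illustration.

```python
import time


def refresh_with_retry(send_refresh, attempts=5, delay_seconds=30.0,
                       sleep=time.sleep):
    """Retry a refresh while the API answers 409 (another job is running).

    Returns a (status, attempts_used) tuple. Any status other than 409
    is returned immediately, whether it is a success or another error.
    """
    for attempt in range(1, attempts + 1):
        status = send_refresh()
        if status != 409:
            return status, attempt
        if attempt < attempts:
            sleep(delay_seconds)  # wait for the running job to finish
    return 409, attempts
```

The `sleep` parameter exists so the loop can be exercised without real waiting, for example by passing a function that records the requested delays.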