Corpus API Examples: Working with Sources

This page provides examples of how to use the Corpus API to work with sources. Examples include creating sources from URLs and from static documents; refreshing, clearing, and deleting sources; and modifying sources (changing the crawling depth or the pages included/excluded). These examples build on the corpus we created earlier. There are more details about working with Sources in the API docs.

Most, if not all, of what is covered here can also be accomplished using the Fixie Console. Examples are provided in four flavors, each corresponding to a different way of calling the Corpus API: via curl, using the Fixie CLI in the terminal, using the Fixie client in JavaScript, and using the Corpus REST API in Python.

Notes on Authentication

Calling the Corpus API requires a Fixie account. For the examples using curl or the REST API (in JS or Python), you will need to provide your Fixie API key. The examples use <TOKEN> to denote where you need to provide your key. You can find your API key (AKA "API token") on your profile page.

When you use the Fixie CLI, you will need to use the auth command. See the docs for more information.

Creating Sources

There are many ways to create sources. You can create them from URLs, and Fixie will crawl the content and store it. Sources can also be created from static files (PDFs, Word documents, etc.). This section contains various examples of creating sources, including examples that use some of the available options (e.g., crawling depth, included/excluded pages).

Create Source (from URL)

Create a source for the Fixie blog (located at https://fixie.ai/blog). Add this source to the corpus with ID 74e5bc4c-c2d9-4296-a2de-9c448d4bc307.

Some things to note:

  • We are using the corpus ID in the URL and in the body of the request.
  • Start URL(s) are required. In this case, we are only providing one.
  • An include pattern is used to ensure we only crawl content that is part of the blog.
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "source": {
      "displayName": "Fixie.ai blog",
      "loadSpec": {
        "web": {
          "startUrls": [
            "https://fixie.ai/blog"
          ],
          "includeGlobPatterns": [
            "https://fixie.ai/blog/**"
          ]
        }
      }
    }
  }'
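
If you are working in Python, a rough equivalent of the request above can be made with any HTTP client. The sketch below uses the third-party requests library and mirrors the curl payload; it is an illustrative sketch rather than an official Fixie client example. Replace <TOKEN> with your API key.

import requests

CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
API_KEY = "<TOKEN>"  # your Fixie API key

# Same JSON body as the curl example above.
payload = {
    "corpusId": CORPUS_ID,
    "source": {
        "displayName": "Fixie.ai blog",
        "loadSpec": {
            "web": {
                "startUrls": ["https://fixie.ai/blog"],
                "includeGlobPatterns": ["https://fixie.ai/blog/**"],
            }
        },
    },
}

response = requests.post(
    f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()
print(response.json())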

Create Source (Exclude Pages)

Above we saw how to create a source from a URL. In this example, we will create a source from a URL but exclude certain pages. In this case, we will create a source for the Fixie website (located at https://fixie.ai) but exclude the blog (https://fixie.ai/blog).

Just as we did earlier, we will add this source to the corpus with ID 74e5bc4c-c2d9-4296-a2de-9c448d4bc307.

Some things to note:

  • We are using the corpus ID in the URL and in the body of the request.
  • Start URL(s) are required. In this case, we are only providing one (for fixie.ai).
  • An include pattern is used to ensure we only crawl content that is part of the site.
  • An exclude pattern is used to ensure we do not crawl the blog.
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "source": {
      "displayName": "Fixie.ai Website",
      "loadSpec": {
        "web": {
          "startUrls": [
            "https://fixie.ai/"
          ],
          "includeGlobPatterns": [
            "https://fixie.ai/**"
          ],
          "excludeGlobPatterns": [
            "https://fixie.ai/blog/**"
          ]
        }
      }
    }
  }'
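
Since these web-source requests differ only in their displayName and loadSpec, it can be convenient in Python to wrap the call in a small helper. The helper below is our own illustrative sketch (its name and signature are not part of any Fixie library); it builds the same payload as the curl example above, with optional exclude patterns.

import requests

API_KEY = "<TOKEN>"  # your Fixie API key

def create_web_source(corpus_id, display_name, start_urls,
                      include_patterns, exclude_patterns=None):
    # Build the web loadSpec; excludeGlobPatterns is only sent when provided.
    web_spec = {
        "startUrls": start_urls,
        "includeGlobPatterns": include_patterns,
    }
    if exclude_patterns:
        web_spec["excludeGlobPatterns"] = exclude_patterns
    payload = {
        "corpusId": corpus_id,
        "source": {
            "displayName": display_name,
            "loadSpec": {"web": web_spec},
        },
    }
    response = requests.post(
        f"https://api.fixie.ai/api/v1/corpora/{corpus_id}/sources",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
    )
    response.raise_for_status()
    return response.json()

# Fixie website, excluding the blog (mirrors the curl example above).
create_web_source(
    "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "Fixie.ai Website",
    ["https://fixie.ai/"],
    ["https://fixie.ai/**"],
    exclude_patterns=["https://fixie.ai/blog/**"],
)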

Create Source (Increase URLs & Change Crawling Depth)

By default, Fixie will crawl URLs with a depth of 3. That means the URL provided will be crawled, any URLs on that page will be crawled, and any URLs on those pages will be crawled. This behavior can be changed by setting the maxDepth parameter.

In this example, we will revisit the Fixie blog source we created above and increase the number of URLs crawled and change the crawling depth.

Some things to note:

  • We are now providing two start URLs. One is for the current blog (https://fixie.ai/blog) and the other is for the (now deprecated) Medium blog (https://blog.fixie.ai/).
  • We must also have include patterns for each of the URLs.
  • We reduce the crawling depth to prevent external links from being crawled.
curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "source": {
      "displayName": "Fixie.ai blog and Medium blog",
      "loadSpec": {
        "web": {
          "startUrls": [
            "https://fixie.ai/blog",
            "https://blog.fixie.ai/"
          ],
          "maxDepth": 2,
          "includeGlobPatterns": [
            "https://fixie.ai/blog/**",
            "https://blog.fixie.ai/**"
          ]
        }
      }
    }
  }'
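
In Python, the maxDepth setting is just another key in the web loadSpec. A brief, unofficial sketch of the same request using the requests library:

import requests

CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"

# Two start URLs, crawl depth limited to 2 (same values as the curl example above).
payload = {
    "corpusId": CORPUS_ID,
    "source": {
        "displayName": "Fixie.ai blog and Medium blog",
        "loadSpec": {
            "web": {
                "startUrls": ["https://fixie.ai/blog", "https://blog.fixie.ai/"],
                "maxDepth": 2,
                "includeGlobPatterns": ["https://fixie.ai/blog/**", "https://blog.fixie.ai/**"],
            }
        },
    },
}

requests.post(
    f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources",
    headers={"Authorization": "Bearer <TOKEN>"},  # your Fixie API key
    json=payload,
).raise_for_status()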

Create Source (from Static File)

In this example, we are going to add a new source based on a static document. This can be a PDF, Word document, etc. For static files, you must provide the document bytes and a MIME type.

Some things to note:

  • We are using the corpus ID in the URL and in the body of the request.
  • The static document must be Base64 encoded. The Fixie CLI will do this for you automatically. For curl, you can encode the file yourself (e.g., with the base64 command-line utility).
  • <BASE64_ENCODED_FILE> is a placeholder for the Base64 encoded PDF. You will need to replace this with the actual Base64 encoded PDF.
  • The correct MIME type must be used for the document. For PDFs, this is application/pdf. For Word (.doc) documents, this is application/msword. Consult a list of common MIME types if you are unsure.
  • We have a sample PDF that is an export of the Wikipedia page on LLMs. There are also sample .doc and .docx files that you can use for testing.

We will add this source to the corpus with ID 74e5bc4c-c2d9-4296-a2de-9c448d4bc307.

curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "source": {
      "displayName": "Wikipedia: Large Language Model",
      "loadSpec": {
        "static": {
          "documents": [
            {
              "filename": "LLM-Wikipedia.pdf",
              "mimeType": "application/pdf",
              "contents": "<BASE64_ENCODED_FILE>"
            }
          ]
        }
      }
    }
  }'
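
To build the Base64-encoded payload programmatically, you can read the file and encode it yourself. The sketch below uses Python's standard base64 module together with the requests library; it assumes the sample LLM-Wikipedia.pdf is in the current directory, so adjust the filename and MIME type for your own document.

import base64
import requests

CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
API_KEY = "<TOKEN>"  # your Fixie API key

# Read the PDF and Base64-encode its bytes.
with open("LLM-Wikipedia.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

payload = {
    "corpusId": CORPUS_ID,
    "source": {
        "displayName": "Wikipedia: Large Language Model",
        "loadSpec": {
            "static": {
                "documents": [
                    {
                        "filename": "LLM-Wikipedia.pdf",
                        "mimeType": "application/pdf",
                        "contents": encoded,
                    }
                ]
            }
        },
    },
}

response = requests.post(
    f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()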

Clearing & Deleting Sources

Let's delete the source we just created. We will need to use both the corpus ID (74e5bc4c-c2d9-4296-a2de-9c448d4bc307) and the source ID (42c505e7-672a-4ed0-9201-934da2e0b17d) in the URL.

Some things to note:

  • We must clear the source before deleting it. This ensures that queries against the corpus don't return chunks from the source while we are removing it.
  • We are using the corpus ID and the source ID in the URL (and, for the clear request, in the body as well).

Clear Source

Sources must be cleared before they can be deleted. Note: if any jobs are running on a source, calling clear will fail with a 409 error. If you don't care about the running jobs, you can use the force option; any running jobs against the source will be cancelled and the source will be cleared.

curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources/42c505e7-672a-4ed0-9201-934da2e0b17d/clear' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "sourceId": "42c505e7-672a-4ed0-9201-934da2e0b17d"
  }'

Delete Source

Once the source is cleared, we can delete it.

curl -L -X DELETE 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources/42c505e7-672a-4ed0-9201-934da2e0b17d' \
  -H 'Authorization: Bearer <TOKEN>'
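
The same clear-then-delete sequence, sketched (unofficially) in Python with the requests library:

import requests

CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
SOURCE_ID = "42c505e7-672a-4ed0-9201-934da2e0b17d"
API_KEY = "<TOKEN>"  # your Fixie API key

source_url = f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources/{SOURCE_ID}"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Clear the source first; this fails with a 409 if other jobs are running on it.
clear = requests.post(
    f"{source_url}/clear",
    headers=headers,
    json={"corpusId": CORPUS_ID, "sourceId": SOURCE_ID},
)
clear.raise_for_status()

# Once cleared, the source can be deleted.
delete = requests.delete(source_url, headers=headers)
delete.raise_for_status()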

Refreshing Sources

Refreshing a source will re-crawl it and update the content. Refresh starts a new job and will fail if there are other jobs currently running for the source.

Refresh Source (from URL)

Let's take the source we created above for the Fixie blog and refresh it. We will need to use both the corpus ID (74e5bc4c-c2d9-4296-a2de-9c448d4bc307) and the source ID (42c505e7-672a-4ed0-9201-934da2e0b17d). Refreshing a source will fail with a 409 error if another job is running on the source. You can use the force option to cancel any running jobs and refresh the source.

curl -L -X POST 'https://api.fixie.ai/api/v1/corpora/74e5bc4c-c2d9-4296-a2de-9c448d4bc307/sources/42c505e7-672a-4ed0-9201-934da2e0b17d/refresh' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "corpusId": "74e5bc4c-c2d9-4296-a2de-9c448d4bc307",
    "sourceId": "42c505e7-672a-4ed0-9201-934da2e0b17d"
  }'
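
And the same refresh call, sketched in Python with the requests library:

import requests

CORPUS_ID = "74e5bc4c-c2d9-4296-a2de-9c448d4bc307"
SOURCE_ID = "42c505e7-672a-4ed0-9201-934da2e0b17d"

# Kick off a refresh job for the source; this fails with a 409 if another job is already running.
response = requests.post(
    f"https://api.fixie.ai/api/v1/corpora/{CORPUS_ID}/sources/{SOURCE_ID}/refresh",
    headers={"Authorization": "Bearer <TOKEN>"},  # your Fixie API key
    json={"corpusId": CORPUS_ID, "sourceId": SOURCE_ID},
)
response.raise_for_status()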