Getting Started

Installation / Setup

pip install fetchfox-sdk

You will need Python 3.8 or later.
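
To confirm that the interpreter meets this requirement before installing, a quick standard-library check is enough. This is a minimal sketch; the helper name `meets_minimum_python` is ours and is not part of the SDK.

```python
import sys

MINIMUM_PYTHON = (3, 8)  # taken from the requirement stated above


def meets_minimum_python(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if the given (or currently running) interpreter is new enough."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); patch level does not matter here.
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not meets_minimum_python():
        raise SystemExit("Python 3.8 or later is required.")
    print("Python version OK")
```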

Getting an API Key

You will need an API key to use the FetchFox SDK. If you want something that runs locally, check out our open-source core project.

If you are logged in to FetchFox, you can get your API key here: https://fetchfox.ai/settings/api-keys

Configuration

If you export the FETCHFOX_API_KEY environment variable, the SDK will use that. You can also provide the key when initializing the SDK like this:

import fetchfox_sdk
fox = fetchfox_sdk.FetchFox(api_key="YOUR_API_KEY_HERE")
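
In your own scripts it can help to fail early with a clear message when no key is configured. The sketch below uses only the standard library. `resolve_api_key` is a hypothetical helper, not part of the SDK, and the precedence it applies (explicit argument first, then the environment variable) is this sketch's choice rather than documented SDK behavior.

```python
import os


def resolve_api_key(explicit_key=None, environ=None):
    """Pick an API key: explicit argument first, then FETCHFOX_API_KEY.

    The precedence is an assumption of this sketch, not documented SDK behavior.
    """
    if environ is None:
        environ = os.environ
    key = explicit_key or environ.get("FETCHFOX_API_KEY")
    if not key:
        raise RuntimeError(
            "No API key found: set FETCHFOX_API_KEY or pass one explicitly."
        )
    return key
```

The returned value can then be passed as `api_key=` when constructing `FetchFox`, as shown above.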

Example: Extracting Information from GitHub

Simple Extraction

FetchFox can extract structured information from websites. The most basic usage is to give it a URL and a template that describes the structure of the items you want to extract.

import fetchfox_sdk
fox = fetchfox_sdk.FetchFox()

# Here we'll extract some data from a page:
items = \
    fox.extract(
        "https://github.com/torvalds/linux",
        {
            "forks": "How many forks does this repository have?",
            "stars": "How many stars does this repository have?",
            "name": "What is the full name of the repository?"
        },
        mode='single')

# This may take 15 seconds or so.
print(items[0])

The above will output something like this:

{
    "forks": "55.4k",
    "stars": "189k",
    "name": "torvalds/linux",
    "_url": "https://github.com/torvalds/linux"
}
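
The counts in the sample output above are abbreviated strings such as "55.4k", not numbers. If you need numeric values downstream, a small post-processing step can convert them. This is a sketch only: `parse_count` is a hypothetical helper of ours, not part of the SDK, the suffix table is an assumption, and the result is only as precise as the abbreviated text.

```python
from decimal import Decimal

# Suffix multipliers as commonly displayed on web pages (assumed, not exhaustive).
_SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(text):
    """Convert strings like "55.4k" or "1,234" to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    multiplier = 1
    if cleaned and cleaned[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    # Decimal avoids float rounding surprises in products like 55.4 * 1000.
    return int(Decimal(cleaned) * multiplier)


# Values copied from the sample output above.
item = {"forks": "55.4k", "stars": "189k", "name": "torvalds/linux"}
forks = parse_count(item["forks"])
stars = parse_count(item["stars"])
```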

Extracting Multiple Items

FetchFox can extract multiple items from one page.


items = \
    fox.extract(
        "https://github.com/torvalds/linux/commits/master/",
        {
            "title": "What is the title of the commit?",
            "sha": "What is the hash of the commit?",
        },
        mode="multiple")

# This may take ~15 seconds before showing results.
print("Recent Commits:")
for item in items.limit(10):
    print(f"  {item.title}")
    print(f"  {item.sha}\n")


The above will extract commit titles and hashes.

Pagination

If you specify max_pages, FetchFox will use AI to load subsequent pages after your starting URL.


items = \
    fox.extract(
        "https://github.com/torvalds/linux/commits/master/",
        {
            "title": "What is the title of the commit?",
            "sha": "What is the hash of the commit?",
        },
        mode="multiple",
        max_pages=5)

This works like the previous example, but loads many more results.

Extraction Modes: Single and Multiple

You may have noticed the mode parameter being used in extractions. This controls how many items will be yielded per page.

You can specify "single" or "multiple". If you don't provide this parameter, FetchFox will use AI to guess, based on your template and the contents of the page.

Following URLs

URLs are just another thing that you can extract from a page. Simply include a url field in your item template and describe how to find the URLs.

When you produce items with a url field, FetchFox can load those URLs in the next step of a workflow.


items = \
    fox.extract(
        "https://github.com/torvalds/linux/commits/master/",
        {
            "title": "What is the title of the commit?",
            "sha": "What is the hash of the commit?",
            "url": "What is the link to the commit?" # Added this field
        },
        mode="multiple")
# Extracts URLs like: https://github.com/torvalds/linux/commit/f31529...

Now, we can extend that workflow by chaining another step. This will load the pages for the individual commits and extract information from those pages.


items2 = \
    items.extract(  # Note that we're extending `items` from before.
        {
            "username": "Who committed this commit?",
            "summary": "Summarize the extended description briefly."
        },
        mode='single')

Workflows

With FetchFox, you can chain operations together. This creates workflows. Execution of workflows is managed on our backend.

Let's extend the examples above to look at the authors of the ten most recent commits and see how many GitHub followers they have.

To accomplish this, we'll use the following steps:

  1. Load the list of commits and get a URL for each individual commit
  2. Load each individual commit and get a URL for the author
  3. Remove any duplicate authors (the same user may have made multiple commits)
  4. Load each unique author's page and extract their follower count

items = fox \
    .extract(
        "https://github.com/torvalds/linux/commits/master/",
        {
            "url": "What is the link to the commit?"
        },
        mode="multiple",
        limit=10) \
    .extract(
        {
            "username": "Who committed this commit?",
            "url": "Link to the committing user. Looks like github.com/$USERNAME"
        },
        mode='single') \
    .unique(['url']) \
    .extract(
        {
            "follower_count": "How many followers does the user have?"
        },
        mode='single')

# This one takes a bit longer, since more pages are being loaded.
for item in items:
    print(f"  {item.username} has {item.follower_count} followers")

The above will print output similar to this:

torvalds has 229k followers
[...]
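
Step 3 above removes duplicate authors by their url field. The idea can be illustrated on local dictionaries in plain Python. This is only a sketch: `unique_by` is a hypothetical helper, the sample data is made up, and keeping the first occurrence is our assumption. It says nothing about how the SDK implements `unique`.

```python
def unique_by(items, keys):
    """Yield the first item seen for each combination of the given keys."""
    seen = set()
    for item in items:
        fingerprint = tuple(item.get(key) for key in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield item


# Made-up sample data standing in for the results of step 2.
authors = [
    {"username": "alice", "url": "https://github.com/alice"},
    {"username": "bob", "url": "https://github.com/bob"},
    {"username": "alice", "url": "https://github.com/alice"},
]
unique_authors = list(unique_by(authors, ["url"]))
```

Deduplicating before the final extraction step means each author page only needs to be loaded once.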

Filter and Export

FetchFox can also filter items given natural language instructions.

Let's look at the list of commits, find some that are related to networking, and export them to a file.


items = \
    fox.extract(
        "https://github.com/torvalds/linux/commits/master/",
        {
            "title": "What is the title of the commit?",
            "sha": "What is the hash of the commit?",
        },
        mode="multiple",
        max_pages=5) \
    .filter("Only show me commits that pertain to networking.") \
    .limit(10)

items.export("networking_commits.jsonl", overwrite=True)


The above will produce a JSONL file with one item per line. Pretty-printed for readability, an item looks like this:

{
    "title": "Merge tag 'net-6.14-rc6' of [...]",
    "sha": "f315296c92fd4b7716bdea17f727ab431891dc3b",
    "_url": "https://github.com/torvalds/linux/commits/master/"
}
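
An exported JSONL file can be read back with the standard library, since each non-empty line is a standalone JSON object. The sketch below writes a small made-up file first so that it runs without calling the API. `read_jsonl` is our own helper name, and the sample records are placeholders, not real commits.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Load a JSONL file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Placeholder records standing in for exported items.
sample = [
    {"title": "Example networking commit", "sha": "0000000"},
    {"title": "Another example commit", "sha": "1111111"},
]
path = os.path.join(tempfile.mkdtemp(), "networking_commits.jsonl")
with open(path, "w", encoding="utf-8") as handle:
    for record in sample:
        handle.write(json.dumps(record) + "\n")

commits = read_jsonl(path)
```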

More About Workflows

Workflows are lazy, carry results with them, and may be run concurrently. See concepts and more examples.
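
As a loose analogy for laziness, Python generators also defer work until results are requested. The sketch below is plain Python and does not use the SDK. `load_pages` and the example.com URLs are made up for illustration, and nothing here describes how FetchFox workflows are implemented.

```python
from itertools import islice

calls = []  # records when the "expensive" work actually happens


def load_pages(urls):
    """Pretend to load each URL, noting the moment the work is done."""
    for url in urls:
        calls.append(url)
        yield {"url": url}


urls = [f"https://example.com/page/{n}" for n in range(1, 6)]

# Building the pipeline defines the work but performs none of it yet.
pipeline = islice(load_pages(urls), 2)
calls_before = len(calls)

# Iterating triggers the work, and only for the two items requested.
results = list(pipeline)
calls_after = len(calls)
```

Here `calls_before` is 0 and `calls_after` is 2: no pages are "loaded" until the results are consumed, and the limit stops the remaining three from being loaded at all.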