Getting Started¶
Installation / Setup¶
pip install fetchfox-sdk
You will need Python 3.8 or later.
Getting an API Key¶
You will need an API key to use the FetchFox SDK. If you want something that runs locally, check out our open-source core project.
If you are logged in to FetchFox, you can get your API key here: https://fetchfox.ai/settings/api-keys
Configuration¶
If you export the FETCHFOX_API_KEY environment variable, the SDK will use that.
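For example, in your shell before running your script:
export FETCHFOX_API_KEY="YOUR_API_KEY_HERE"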
You can also provide the key when initializing the SDK like this:
fox = fetchfox_sdk.FetchFox(api_key="YOUR_API_KEY_HERE")
Example: Extracting Information from GitHub¶
Simple Extraction¶
FetchFox can extract structured information from websites. The most basic usage is to simply give it a URL and a template that describes the structure of the items you want to extract.
import fetchfox_sdk
fox = fetchfox_sdk.FetchFox()
# Here we'll extract some data from a page:
items = \
    fox.extract(
        "https://github.com/torvalds/linux",
        {
            "forks": "How many forks does this repository have?",
            "stars": "How many stars does this repository have?",
            "name": "What is the full name of the repository?"
        },
        mode='single')
# This may take 15 seconds or so.
print(items[0])
The above will output something like this:
{
"forks": "55.4k",
"stars": "189k",
"name": "torvalds/linux",
"_url": "https://github.com/torvalds/linux"
}
Extracting Multiple Items¶
FetchFox can extract multiple items from one page.
items = \
    fox.extract(
        "https://github.com/torvalds/linux/commits/master/",
        {
            "title": "What is the title of the commit?",
            "sha": "What is the hash of the commit?",
        },
        mode="multiple")
# This may take ~15 seconds before showing results.
print("Recent Commits:")
for item in items.limit(10):
print(f" {item.title}")
print(f" {item.sha}\n")
The above will extract commit titles and hashes.
Pagination¶
If you specify max_pages, FetchFox will use AI to load subsequent pages after your starting URL.
fox.extract(
    "https://github.com/torvalds/linux/commits/master/",
    {
        "title": "What is the title of the commit?",
        "sha": "What is the hash of the commit?",
    },
    mode="multiple",
    max_pages=5)
This works like the previous example, but loads many more results.
Extraction Modes: Single and Multiple¶
You may have noticed the mode parameter being used in extractions. This controls how many items will be yielded per page.
You can specify single or multiple. If you don't provide this parameter, FetchFox will use AI to guess, based on your template and the contents of the page.
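For example, here is a minimal sketch (reusing the commits page and template from above) that omits mode and lets FetchFox decide:
# Omitting `mode`: FetchFox will infer single vs. multiple from the
# template and the contents of the page.
items = \
    fox.extract(
        "https://github.com/torvalds/linux/commits/master/",
        {
            "title": "What is the title of the commit?",
            "sha": "What is the hash of the commit?",
        })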
Following URLs¶
URLs are just another thing that you can extract from a page. Simply include a url field in your item template and describe how to find the URLs.
When you produce items with a url field, FetchFox can load those URLs in the next step of a workflow.
items = \
    fox.extract(
        "https://github.com/torvalds/linux/commits/master/",
        {
            "title": "What is the title of the commit?",
            "sha": "What is the hash of the commit?",
            "url": "What is the link to the commit?"  # Added this field
        },
        mode="multiple")
# Extracts URLs like: https://github.com/torvalds/linux/commit/f31529...
Now, we can extend that workflow by chaining another step. This will load the pages for the individual commits and extract information from those pages.
items2 = \
    items.extract(  # Note that we're extending `items` from before.
        {
            "username": "Who committed this commit?",
            "summary": "Summarize the extended description briefly."
        },
        mode='single')
Workflows¶
With FetchFox, you can chain operations together to create workflows. Execution of workflows is managed on our backend.
Let's extend the examples above to look at the authors of the ten most recent commits and see how many GitHub followers they have.
To accomplish this, we'll use the following steps:
- Load the list of commits and get a URL for each individual commit
- Load each individual commit and get a URL for the author
- Remove any duplicate authors (the same user may have made multiple commits)
- Load each unique author's page and extract their follower count
items = fox \
    .extract(
        "https://github.com/torvalds/linux/commits/master/",
        {
            "url": "What is the link to the commit?"
        },
        mode="multiple",
        limit=10) \
    .extract(
        {
            "username": "Who committed this commit?",
            "url": "Link to the committing user. Looks like github.com/$USERNAME"
        },
        mode='single') \
    .unique(['url']) \
    .extract(
        {
            "follower_count": "How many followers does the user have?"
        },
        mode='single')
# This one takes a bit longer, since more pages are being loaded.
for item in items:
print(f" {item.username} has {item.follower_count} followers")
The above will print output similar to this:
torvalds has 229k followers
[...]
Filter and Export¶
FetchFox can also filter items given natural language instructions.
Let's look at the list of commits and find some that are related to networking, then export their extended descriptions to a file.
items = \
    fox.extract(
        "https://github.com/torvalds/linux/commits/master/",
        {
            "title": "What is the title of the commit?",
            "sha": "What is the hash of the commit?",
        },
        mode="multiple",
        max_pages=5) \
    .filter("Only show me commits that pertain to networking.") \
    .limit(10)
items.export("networking_commits.jsonl", overwrite=True)
The above will produce a JSONL file with lines like this:
{"title": "Merge tag 'net-6.14-rc6' of [...]", "sha": "f315296c92fd4b7716bdea17f727ab431891dc3b", "_url": "https://github.com/torvalds/linux/commits/master/"}
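If you want to read the exported file back in, here is a minimal sketch using only the Python standard library (assuming the filename used above):
import json

# Each line of the JSONL file is one JSON object.
with open("networking_commits.jsonl") as f:
    commits = [json.loads(line) for line in f]

for commit in commits:
    print(commit["sha"], commit["title"])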
More About Workflows¶
Workflows are lazy, carry results with them, and may be run concurrently. See concepts and more examples.
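For example, here is a minimal sketch of that laziness, reusing the commits example from above: building the workflow is immediate, and pages are only loaded when you consume the results.
# Building the workflow does not load any pages yet.
commits = fox.extract(
    "https://github.com/torvalds/linux/commits/master/",
    {"title": "What is the title of the commit?"},
    mode="multiple")

# Pages are only loaded once the results are consumed,
# e.g. by iterating, calling limit(), or exporting.
for item in commits.limit(5):
    print(item.title)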