More Examples¶
Starting an Extraction with Multiple URLs¶
import fetchfox_sdk

fox = fetchfox_sdk.FetchFox()

# List of repositories to track
repo_list = [
    "https://github.com/torvalds/linux",
    "https://github.com/microsoft/vscode",
    "https://github.com/facebook/react",
]

# Initialize a workflow with multiple URLs directly
repos_stats = fox.extract(
    repo_list,
    {
        "name": "What is the full name of this repository?",
        "stars": "How many stars does this repository have?"
    },
    mode='single'
)

results = list(repos_stats)
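Each result item exposes the fields named in the extraction template. As a minimal sketch of consuming them (attribute-style access on result items is the same pattern used with post.url later in this section):
# Print the extracted fields for each repository.
# (Field names come from the template above.)
for repo in results:
    print(f"{repo.name}: {repo.stars} stars")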
Lazy and Composable Workflows¶
import fetchfox_sdk
from pprint import pprint

fox = fetchfox_sdk.FetchFox(quiet=True)

top_posts = \
    fox.extract(
        "https://news.ycombinator.com",
        {"url":
            "Find me all the URLs of the comments pages. "
            "They'll all look like https://news.ycombinator.com/item?id=$SOMETHING"},
        limit=10)
# top_posts is a workflow.
# It will not be executed until the results are needed
# Here, we'll take a look at what was retrieved:
print("Found Post URLs:")
for post in top_posts:
    print(f"  {post.url}")
# top_posts is now carrying those results with it.
# We can derive multiple workflows from top_posts, and
# now they'll all inherit the results we already have.
####
# First workflow derived from top_posts:
####
user_urls_for_posters_of_top_ten_posts = \
    top_posts.extract(
        {"url":
            "The link to the profile of the user who submitted this post. "
            "The URL will look like https://news.ycombinator.com/user?id=$USERNAME. "
            "ONLY include the URL for the post's author. "
            "Do not include profiles for any commenters."},
        mode='single')
# This is the information we want to extract for each of the posters:
poster_info_template = {
    "username": "What is the username of this user?",
    "karma_points": "What is the number of 'karma' points this user has?",
    "created_date": "What is the 'created' date for this user?"
}

poster_infos = \
    user_urls_for_posters_of_top_ten_posts.extract(
        poster_info_template,
        mode='single')
####
# Second workflow derived from top_posts:
####
# A post can either be a link, or have a textual body.
links_and_usernames_from_top_ten_posts = \
    top_posts.extract({
        "url":
            "If the main content of the post is an external link to an article, "
            "provide it here. If the post is a text post (which has its own "
            "content and NO external link), simply provide the post URL.",
        "username": "The username of the poster."
    },
    mode='single')

summaries_of_post_content = \
    links_and_usernames_from_top_ten_posts.extract({
        "content_summary":
            "Briefly summarize the main content of the article. "
            "Ignore all comments and extra information. "
            "There is only one article or post."
    },
    mode="single")
print("\n")
print("#####")
print("Posters info:")
print("#####")
print("")

for poster_info in poster_infos:
    pprint(dict(poster_info))

print("#####")
print("Summaries:")
print("#####")
print("")

for summary in summaries_of_post_content:
    pprint(dict(summary))
# NOTE: This example loads ~20 pages and processes a lot of text,
# so it may take more than a minute to finish.
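Because top_posts already carries its results, iterating it a second time reuses the cached items instead of re-running the workflow. A minimal sketch of that behavior, relying only on the caching described above:
# Re-iterating reuses the cached results; no new pages are fetched.
cached_urls = [post.url for post in top_posts]
print(f"Cached {len(cached_urls)} post URLs.")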
Concurrency¶
Simple Concurrency with Futures¶
Using Workflow.results_future() will give you a standard Python concurrent.futures.Future.
The simplest way to run multiple workflows concurrently is to request their futures and then use the results.
top_posts_on_hn = \
    fox.extract(
        "https://news.ycombinator.com",
        {"title": "Find me all the titles of the posts."},
        limit=10)

top_posts_on_reddit = \
    fox.extract(
        "https://old.reddit.com",
        {"title": "Find me all the titles of the posts."},
        limit=10)

top_posts_on_slashdot = \
    fox.extract(
        "https://news.slashdot.org/",
        {"title": "Find me all the titles of the posts."},
        limit=10)
# The easiest way to run workflows concurrently is to use futures:
top_posts_on_hn_future = top_posts_on_hn.results_future()
top_posts_on_reddit_future = top_posts_on_reddit.results_future()
top_posts_on_slashdot_future = top_posts_on_slashdot.results_future()
# Requesting the futures starts all 3 workflows running concurrently.
# These variables hold standard Python concurrent.futures.Future objects.
# If you don't want to do anything until all of them are finished, you can
# simply collect the results by calling .result() on each future.
# Each call blocks until that workflow finishes.
hn_posts_results = top_posts_on_hn_future.result()
reddit_posts_results = top_posts_on_reddit_future.result()
slashdot_posts_results = top_posts_on_slashdot_future.result()
# The futures resolve directly to a list of result items.
print(hn_posts_results[0])
# After those futures complete, the results are *also* available
# via the workflows, which may be used normally.
print(top_posts_on_hn[1])
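Since these are ordinary concurrent.futures.Future objects, the standard library's waiting helpers also work with them. A sketch using only the standard library to handle each workflow as it finishes, rather than in a fixed order:
import concurrent.futures

all_futures = [
    top_posts_on_hn_future,
    top_posts_on_reddit_future,
    top_posts_on_slashdot_future,
]

# Process each set of results as soon as its workflow completes:
for future in concurrent.futures.as_completed(all_futures):
    items = future.result()
    print(f"A workflow finished with {len(items)} items.")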