The Quest for Bread

I'm Sarah Crowle. And I like bread.

So much, in fact, that I created this website. Previous incarnations of this website displayed a randomized picture of bread every time you refreshed any of the pages. While looking to re-implement this functionality for this new version of picsofbread.com, I realized that the pool of bread images for randomization was actually rather small in practice. This was a problem. How could I call my site the home of bread pics on the internet if I couldn't even show more than 10-15 images?

Enter the breadchive.

So I thought about it for a while, and I began to dream of a perfect archive of bread images. Thousands of them, all Creative Commons, and all ready to be beamed directly to visitors' computers.

The problem was...

Where the hell do I get like 100000 pictures of bread?

The obvious answer would be to curate them myself. But, like anyone else, I have no time for that. So I turned to my best friend, automation. Looking into APIs for accessing large collections of CC images, I found many, but none really fit my needs: the dataset was too small, the API too limited, or the licensing wasn't free enough.

Then, completely by accident, I discovered the Creative Commons API while analyzing the network traffic on the new CC Search page (I've used the old one many times before, and I thought it might hold the key to my bread). So I screwed around with it for a while (completely missing the documentation that I linked earlier; seriously, CC, you should really promote your API more, it's quite nice), and I downloaded about 250 pages of search results for bread.
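The download loop itself is nothing fancy. Here's a rough sketch rather than my exact script, using the same endpoint and query parameters that show up later in this post; the str(response.content) part is only my guess at how the responses ended up mangled (more on that in a second):

import requests

base_url = "https://api.creativecommons.engineering/image/search"

for page in range(1, 251):
    # same endpoint and query parameters that appear later in this post
    response = requests.get(base_url, params={"q": "bread", "page": page, "pagesize": 500})
    with open("response_%d.txt" % page, "w") as f:
        # saving str(response.content), i.e. the repr of the raw bytes, is my
        # best guess at why the saved JSON came out mangled
        f.write(str(response.content))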

Processing the bread results

Next, of course, was processing all of these results. The API returns JSON (as you'd expect these days), but maybe I was saving the response contents wrong, because the JSON was all malformed and mangled. So I wrote this really terrible script (over a couple of hours of troubleshooting):

import re
import glob

# matches the literal unicode escape sequences (like "\xf0") left in the saved text
regex = r"(\\x..)"

for file in glob.glob("*.txt"):
    with open(file, "r") as f:
        file_contents = f.read()
        # strip out the escape sequences (mostly emoji bytes)
        for match in re.findall(regex, file_contents):
            print(match)
            file_contents = file_contents.replace(match.strip(), "")
        # un-escape the single and double quotes that got mangled on save
        with open(file + ".cleaned", "w") as new_f:
            new_f.write(file_contents.replace("\\'", "'").replace('\\"', '"'))

Yeesh, that's ugly. If you can't figure out what's going on (and I wouldn't really blame you), it's stripping out unicode escape sequences (stupid emojis), and then fixing some weirdness with backslashes and single quotes being escaped. I'm not too worried about breaking the data (I only really need the links for later), so I do it in a really dumb way.
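A bread emoji, for instance, ends up in the saved text as the literal characters \xf0\x9f\x8d\x9e (its UTF-8 bytes), and those four escape sequences are exactly what the regex picks up:

>>> import re
>>> re.findall(r"(\\x..)", r"fresh \xf0\x9f\x8d\x9e from the oven")
['\\xf0', '\\x9f', '\\x8d', '\\x9e']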

Props to Python I guess for letting this code actually work.

Finally, I beautify the JSON responses for my own sanity using this bash script:

#!/bin/bash

for file in *.txt.cleaned; do
    # chop off the last character and the first two, then pretty-print with the json CLI
    cat "$file" | sed 's/.$//' | cut -c 3- | json | tee "$file.pretty.json"
done

Yeah, yeah, useless use of cat or whatever. I don't really care, it looks nicer this way.

Anyway, now that we have this nice valid JSON, we can, of course, do some stuff with it. Like check and see what sort of providers there are for the images (so that we can download them later).

import glob
import json

# tally how many results each provider accounts for
providers = {}
for file in glob.glob("*.pretty.json"):
    with open(file, "r") as f:
        file_obj = json.load(f)
        for result in file_obj["results"]:
            if result["provider"] not in providers:
                providers[result["provider"]] = 1
            else:
                providers[result["provider"]] += 1

print(providers)

(For the record, most of them are from flickr.)
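Since the plan is to download the actual images eventually, that should just be a matter of walking the result objects and fetching each link. A minimal sketch, assuming each result's url field points straight at the image file:

import glob
import json
import os

import requests

os.makedirs("bread_pics", exist_ok=True)

for file in glob.glob("*.pretty.json"):
    with open(file, "r") as f:
        results = json.load(f)["results"]
    for result in results:
        # assuming "url" is a direct link to the image itself
        image = requests.get(result["url"])
        if image.status_code != 200:
            continue
        # crude filename: whatever comes after the last slash in the image URL
        filename = result["url"].rsplit("/", 1)[-1]
        with open(os.path.join("bread_pics", filename), "wb") as out:
            out.write(image.content)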

But there's a problem: Why do I only have 3600 results or so?

{'flickr': 3579, 'svgsilh': 9, 'behance': 37, 'met': 19, 'digitaltmuseum': 1, 'geographorguk': 2}

I'd set the page size to 500, and I grabbed 200 pages, which should be 100000 pictures. So what's up?

I decided to check if I really was getting the 500 results per page that I was asking for. After all, the CC API did say that was the maximum.

>>> import json
>>> with open("response_159.txt.cleaned.pretty.json", "r") as json_f:
...     response = json.loads(json_f.read())
... 
>>> len(response["results"])
18

Well, that explains it. 18 is not equal to 500. I think, anyway. Maybe the 500 maximum isn't inclusive, though. Let's try 499.

>>> import requests
>>> response = requests.get("https://api.creativecommons.engineering/image/search?page=1&pagesize=499&shouldPersistImages=false&q=bread&provider=&li=&lt=")
>>> len(response.json()["results"])
486

Ta-da! Still not 500, but I'll take it. I don't really know if 500 not working is a bug, or if I can't read. Honestly, it could be either.

But... that also means I'm gonna be downloading 200 more pages of data. sigh.

And so I did... well, sorta. You see, the API started throwing 400s after about 10 pages. That's only about 5000 results. D'oh.
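For the curious, the second pass boils down to the same loop with pagesize=499, writing the raw bytes this time and bailing out as soon as the API stops cooperating. A sketch:

import requests

base_url = "https://api.creativecommons.engineering/image/search"

page = 1
while True:
    response = requests.get(base_url, params={"q": "bread", "page": page, "pagesize": 499})
    if response.status_code != 200:
        # somewhere around page 10 the API starts answering with 400s
        print("giving up at page %d (HTTP %d)" % (page, response.status_code))
        break
    with open("response_%d.txt" % page, "wb") as f:
        # write the raw bytes this time, so the JSON doesn't get mangled again
        f.write(response.content)
    page += 1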

Conclusion

So for now, I've got about 5000 pics of bread. At least, until I get ahold of some more HDD space...
