Local development
Running the scrapers
The scrapers are bundled with a docker-compose.yml file that will allow you to run them on your machine.
To scrape all recently updated data, run:
docker compose run --rm scrapers
To run a particular scrape or pass arguments to pupa, append your command to the end of the previous command, like:
# Scrape board reports from the last week
docker compose run --rm scrapers pupa update lametro bills window=7
Populate a local Councilmatic instance
If you’d like to scrape data into a Councilmatic instance for easy viewing, first run your local instance of LA Metro Councilmatic as normal.
Then, in your local scraper repository, run your scrapes using the docker-compose.councilmatic.yml file:
docker compose -f docker-compose.councilmatic.yml run --rm scrapers
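As with the default compose file, you can append a specific pupa command to run a narrower scrape into your Councilmatic database. This assumes docker-compose.councilmatic.yml defines the same scrapers service and accepts appended commands the same way; the arguments are described under pupa update below.
# Scrape only board reports from the past week into your Councilmatic instance
docker compose -f docker-compose.councilmatic.yml run --rm scrapers pupa update lametro bills window=7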
Useful pupa commands
pupa update
usage: pupa update [-h] [--scrape] [--import] [--nonstrict] [--fastmode] [--datadir SCRAPED_DATA_DIR] [--cachedir CACHE_DIR]
[-r SCRAPELIB_RPM] [--timeout SCRAPELIB_TIMEOUT] [--no-verify] [--retries SCRAPELIB_RETRIES]
[--retry_wait SCRAPELIB_RETRY_WAIT_SECONDS]
module
update pupa data
positional arguments:
module path to scraper module
options:
-h, --help show this help message and exit
--scrape only run scrape post-scrape step
--import only run import post-scrape step
--nonstrict skip validation on save
--fastmode use cache and turn off throttling
--datadir SCRAPED_DATA_DIR
data directory
--cachedir CACHE_DIR cache directory
-r SCRAPELIB_RPM, --rpm SCRAPELIB_RPM
scraper rpm
--timeout SCRAPELIB_TIMEOUT
scraper timeout
--no-verify skip tls verification
--retries SCRAPELIB_RETRIES
scraper retries
--retry_wait SCRAPELIB_RETRY_WAIT_SECONDS
scraper retry wait
Tip
Running a scrape with --fastmode will disable request throttling, resulting in a faster scrape. This is great for local development, especially for narrow scrapes, e.g.,
pupa update --fastmode lametro events window=1
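If you are running the scrapers through Docker, the same flags can be appended to the compose command described above, e.g.,
# Fast, narrow scrape via Docker
docker compose run --rm scrapers pupa update --fastmode lametro events window=1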
Additional arguments
bills
window
(default: 28) - How far back to scrape, in days. Scrapes all matters, if 0.
matter_ids
(default: None) - Comma-separated list of MatterIds from the Legistar API. Scrapes all matters updated within window, if None.
events
window
(default: None) - How far back to scrape, in days.
Examples
# Scrape board reports from the past week
pupa update lametro bills window=7
# Scrape specific board reports
pupa update lametro bills matter_ids=10340,10084
# Scrape events from past 30 days
pupa update lametro events window=30
pupa clean
usage: pupa clean [-h] [--window WINDOW] [--max MAX] [--report] [--yes]
Removes database objects that haven't been seen in recent scrapes
options:
-h, --help show this help message and exit
--window WINDOW objects not seen in this many days will be deleted from the database
--max MAX max number of objects to delete without triggering failsafe
--report generate a report of what objects this command would delete without making any changes to the database
--yes assumes an answer of 'yes' to all interactive prompts
Examples
# Log which objects will be deleted without making changes to the database
pupa clean --report
# Remove objects that haven't been seen for 30 days
pupa clean --window 30
# Remove a maximum of 100 objects
pupa clean --max 100
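These options can be combined. For example, a non-interactive cleanup might look like the following; whether these particular values are appropriate depends on your scrape schedule.
# Remove objects unseen for 30 days, capped at 100 deletions, without prompting
pupa clean --window 30 --max 100 --yes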
Writing tests
tktktk