LA Metro Scrapers Documentation
Welcome to the documentation for the LA Metro Scrapers! Here, you’ll find information about local development, deployment, and an overview of each scraper (and decisions that we’ve made about them).
How do they work?
At a high level, the scrapers retrieve information from Metro instances of the Legistar interface, also known as InSite, and the Legistar API (endpoints at https://webapi.legistar.com/metro/*).
See the relevant scraper documentation for more on where the data comes from and how it is parsed.
How are they run?
The scrapers are run by Airflow and populate LA Metro Councilmatic instances, outlined below.
Scraper image tag | Airflow instance | Metro instance |
---|---|---|
main | https://la-metro-dashboard-heroku.datamade.us/home | https://la-metro-councilmatic-staging.herokuapp.com/ |
deploy | https://la-metro-dashboard-heroku-prod.datamade.us/home | https://boardagendas.metro.net |
See Deployment for more on how scraper image tags are built.
When do they run?
See the Airflow dashboard for information about the latest and next scraper runs.
Scrape schedules are written in UTC!
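- Subtract 7 hours to convert to Los Angeles time during daylight saving time (8 hours during standard time).
- Subtract 5 hours to convert to Chicago time during daylight saving time (6 hours during standard time).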
Mental math getting you down? Try World Time Buddy!
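If you would rather let Python do the conversion, the following is a minimal sketch using the standard library's zoneinfo module (Python 3.9+). The helper name to_local and the sample date are illustrative and not part of the scrapers; the 03:05 Saturday slot is taken from the person_scrape schedule on this page.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_local(utc_moment, zone_name):
    """Convert a timezone-aware UTC datetime to the named IANA time zone."""
    return utc_moment.astimezone(ZoneInfo(zone_name))


# Saturday 03:05 UTC, on an arbitrary summer date chosen for illustration.
scheduled = datetime(2024, 7, 6, 3, 5, tzinfo=timezone.utc)

# Friday 20:05 in Los Angeles and Friday 22:05 in Chicago on this date.
print(to_local(scheduled, "America/Los_Angeles"))
print(to_local(scheduled, "America/Chicago"))
```

Because zoneinfo applies daylight saving rules for the given date, the same call gives the correct offset in both summer and winter.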
windowed_bill_scrape
Scrape bills with a window of 0.05 at 5, 20, 35, and 50 minutes past the hour. This generally takes somewhere between a few seconds and a few minutes, depending on the volume of updates.
At 5, 20, 35, and 50 minutes past the hour, Sunday through Thursday
At 5, 20, 35, and 50 minutes past the hour, between 12:00 AM and 08:59 PM, only on Friday
At 5, 20, 35, and 50 minutes past the hour, between 06:00 AM and 11:59 PM, only on Saturday
fast_windowed_bill_scrape
Scrape bills with a window of 1 at 35 and 50 minutes past the hour. This generally takes somewhere between a few seconds and a few minutes, depending on the volume of updates.
At 35 and 50 minutes past the hour, between 09:00 PM and 11:59 PM, only on Friday
At 35 and 50 minutes past the hour, between 12:00 AM and 05:59 AM, only on Saturday
fast_full_bill_scrape
Scrape all bills quickly at 5 past the hour. This generally takes less than 30 minutes.
At 5 minutes past the hour, between 09:00 PM and 11:59 PM, only on Friday
At 5 minutes past the hour, between 12:00 AM and 05:59 AM, only on Saturday
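To see how the three bill schedules above fit together, here is a small sketch that encodes them as plain Python data and looks up which scrape is listed for a given UTC minute. This is only one reading of the schedule descriptions on this page, not the actual Airflow DAG definitions; the names BILL_SCHEDULES, cron_weekday, and bill_scrape_at are hypothetical.

```python
from datetime import datetime, timezone

QUARTER_SLOTS = {5, 20, 35, 50}

# Each entry: (scrape name, minutes, hours, weekdays), all in UTC.
# Weekdays use the cron convention: 0 = Sunday ... 6 = Saturday.
# Assumption: transcribed from the schedule descriptions above.
BILL_SCHEDULES = [
    ("windowed_bill_scrape", QUARTER_SLOTS, range(0, 24), {0, 1, 2, 3, 4}),
    ("windowed_bill_scrape", QUARTER_SLOTS, range(0, 21), {5}),
    ("windowed_bill_scrape", QUARTER_SLOTS, range(6, 24), {6}),
    ("fast_windowed_bill_scrape", {35, 50}, range(21, 24), {5}),
    ("fast_windowed_bill_scrape", {35, 50}, range(0, 6), {6}),
    ("fast_full_bill_scrape", {5}, range(21, 24), {5}),
    ("fast_full_bill_scrape", {5}, range(0, 6), {6}),
]


def cron_weekday(moment):
    """Map datetime.weekday() (Monday = 0) to the cron convention (Sunday = 0)."""
    return (moment.weekday() + 1) % 7


def bill_scrape_at(moment):
    """Return the bill scrape listed for this UTC minute, or None if there is none."""
    day = cron_weekday(moment)
    for name, minutes, hours, days in BILL_SCHEDULES:
        if moment.minute in minutes and moment.hour in hours and day in days:
            return name
    return None


# Friday 21:05 UTC falls in the fast window, so the full bill scrape is listed.
print(bill_scrape_at(datetime(2024, 7, 5, 21, 5, tzinfo=timezone.utc)))
```

Under this reading, no bill scrape is listed for the 20-minute slot between Friday 09:00 PM and Saturday 05:59 AM, since only the 5, 35, and 50 minute slots appear in the fast schedules.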
windowed_event_scrape
Scrape events with a window of 0.05 at 0, 15, 30, and 45 minutes past the hour. This generally takes somewhere between a few seconds and a few minutes, depending on the volume of updates.
At 0, 15, 30, and 45 minutes past the hour, Sunday through Thursday
At 0, 15, 30, and 45 minutes past the hour, between 12:00 AM and 08:59 PM, only on Friday
At 0, 15, 30, and 45 minutes past the hour, between 06:00 AM and 11:59 PM, only on Saturday
fast_windowed_event_scrape
Scrape events with a window of 1 at 30 and 45 minutes past the hour. This generally takes somewhere between a few seconds and a few minutes, depending on the volume of updates.
At 30 and 45 minutes past the hour, between 09:00 PM and 11:59 PM, only on Friday
At 30 and 45 minutes past the hour, between 12:00 AM and 05:59 AM, only on Saturday
fast_full_event_scrape
Scrape all events quickly on the hour. This generally takes less than 30 minutes.
Between 09:00 PM and 11:59 PM, only on Friday
Between 12:00 AM and 05:59 AM, only on Saturday
person_scrape
Scrape all people and committees. Run in lieu of full scrape on Fridays, when all bills and events are scraped once an hour.
- At 03:05 AM, only on Saturday (UTC), which is Friday evening in Los Angeles
What do they depend on?
The scrapers have a couple of key dependencies.
- pupa is the framework for scraping and organizing data according to the Open Civic Data standard. Our scrapers are subclasses of pupa.Scraper, and we use the pupa CLI to run scrapes.
  - See Useful pupa commands for more on the CLI.
- python-legistar-scraper is a Python wrapper for InSite and the Legistar API that we use to retrieve data. Our scrapers also inherit from the relevant LegistarScraper subclasses in this library.