ETL pattern for API source
The direction for software-as-a-service providers is to provide APIs to access their data instead of structured file exports. Which is a pity, as it means every SaaS system requires a bespoke data extract solution. I inherited a solution that had an adverse pattern I'd like to talk about, but first, a few of the pain points that come with API sources:
- Metering - Providers generally restrict how much you can consume over time (where "how much" and "time" are highly dependent on the source). Google Analytics, depending on the product, rejects your connections after so many calls. Twitter will throttle you based on consumption, depending on its maddening, inconsistent set of APIs (v1 vs v2), endpoints, and product tier (free standard, paid premium or enterprise). See the backoff sketch after this list for one way to cope.
- Data availability - you have no idea whether the data you pulled today will be available tomorrow. We had pulled 5 years of data out of Google Analytics that contained a variety of dimensions, two of which were ga:userAgeBracket and ga:userGender. In talking to our client, they wanted just one more data element added to the mix. We made the change and boom goes the dynamite: "Some data in this report may have been removed when a threshold was applied." That error message means you're requesting a level of granularity that could de-anonymize users. Ok, fine, we rolled back the change, but no: that was no longer a valid combination, ever. We also ran into a situation where some of the data just wasn't available pre-2020. Yes, a month earlier the same code had pulled 6 years' worth of data, but no more.
- Oops - Something happened when we created the data for reporting (a data merge introduced duplicates, the client wanted a different format, etc.) and now we need to do it again, except instead of the month originally allocated, we have a week to fix all this up. Which bumps right into the Metering and Data availability points. Ouch town, population you.
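To make the metering point concrete, here is a minimal sketch of retrying with exponential backoff when an API throttles you. The URL, parameters, and 429 handling are assumptions for illustration; real providers differ in status codes, headers, and quota windows, so treat this as a pattern rather than a recipe.

import time
import requests

def get_with_backoff(url, params=None, max_retries=5):
    """Call a metered API, backing off when it throttles us.

    Illustrative sketch: the endpoint and the 429 behavior are
    assumptions; check your provider's documented limits.
    """
    delay = 1  # seconds; doubles on each throttled attempt
    for attempt in range(max_retries):
        response = requests.get(url, params=params)
        if response.status_code == 429:  # Too Many Requests
            # Back off exponentially; a real client should also honor
            # any Retry-After header the provider sends.
            time.sleep(delay)
            delay *= 2
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError('Gave up after {0} throttled attempts'.format(max_retries))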
Preferred pattern
What I inherited wasn't bad, it just hadn't taken those possible pain points into consideration. In a classic data warehouse, you have a raw zone with immutable source data sitting somewhere on cheap storage. The same lesson applies here: when you pull from an API, land that data to disk in some self-defining format - json/xml/csv, don't care.
Then write your process so that it can consume that landed file and produce the exact same results as a live data pull.
import json
import os

def get_data(source_date):
    """Get data for a given date.

    :param source_date: An ISO 8601 formatted date aka yyyy-MM-dd
    :return: A dictionary of data
    """
    source_file = '/dbfs/mnt/datalake/raw/entity/data_provider_{0}.json'.format(source_date)
    raw_data = {}
    if os.path.exists(source_file):
        # Cache hit: reuse the raw payload landed by a previous run.
        with open(source_file, 'r', encoding='utf-8') as f:
            raw_data = json.load(f)
    else:
        # Cache miss: call the API (analytics is the Google Analytics
        # Reporting client, built elsewhere), then land the raw response
        # to disk before returning it.
        raw_data = analytics.reports().batchGet(body='json-goes-here').execute()
        with open(source_file, 'w', encoding='utf-8') as f:
            json.dump(raw_data, f, ensure_ascii=False)
    return raw_data
This simple method is responsible for getting my data by date. If the file exists in my file system, then I reuse the results of a previous run to satisfy the data request. Otherwise, we make a call to the API and, before we finish, write the results to disk so that we're ready in case Oops happens downstream of the call.
Using this approach, we were able to reprocess a year's worth of cached data in about 10 minutes, compared to about 4.5 hours of data trickling out of the source API.
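For context, that reprocessing driver is just a loop over dates; because get_data prefers the landed file, a rerun never touches the API for dates it has already seen. A minimal sketch, where process_day is a hypothetical stand-in for whatever downstream shaping you do:

from datetime import date, timedelta

def reprocess(start_date, end_date):
    """Replay every day in [start_date, end_date], reusing cached pulls.

    Sketch only: process_day stands in for your downstream transform.
    """
    current = start_date
    while current <= end_date:
        raw_data = get_data(current.isoformat())  # cache hit = no API call
        process_day(raw_data)                     # hypothetical downstream transform
        current += timedelta(days=1)

# e.g. reprocess(date(2019, 1, 1), date(2019, 12, 31)) replays a year
# from the raw zone without burning any API quota.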