# HG changeset patch
# User Aurélien Campéas
# Date 1592556459 -7200
#      Fri Jun 19 10:47:39 2020 +0200
# Node ID ac84ebf657f1c92412de60e5110226a4d8c270e9
# Parent  0000000000000000000000000000000000000000
initial time series refinery draft

diff --git a/timeseries.md b/timeseries.md
new file mode
--- /dev/null
+++ b/timeseries.md
@@ -0,0 +1,380 @@

# Notes on data refinement, for a time-series information system

These are reflections stemming from several years of software-building
activity in the context of energy commodity markets.

How do we reconcile heavily automated data processing with human
expertise for data aggregation and supervision? A common workflow is to
pull data from huge data warehouses into Excel files, where the actual
magic happens, and where all proper IT guidelines (versioning, testing,
etc.) are ignored. We built an information system that keeps human
intervention in the loop, with less and less Excel involved.


## The issues with data

It is common knowledge that predictive models, the holy grail of the new
wave of "digitalisation", are completely dependent on the availability
of good data as input (to train the model, and also to run it).

The issues with good data are:

* how to find it and get it,

* once obtained, how to store and archive it for further use,

* once stored, how to clean it up,

* once cleaned, how to assemble it into proper model variables.


### Finding

The finding part is best left to domain specialists. There are companies
that sell domain-specific data for a price and make it available through
low-level means (big files on an ftp server) or something more pleasant
like a polished high-level api.

Sometimes there is no other choice but to write a scraper to extract the
valuable parts of some js-riddled web page, or to jump through a lot of
hoops just to get at data produced internally by other teams and
business units.

Such is life that just getting at the raw data is already an endeavour.
However there is nothing novel in the methods used to get it, and this
is not our core topic.


### Aspects of Storage

Time series have a well-defined shape: they are a list of tuples of a
timestamp and a value. Additional metadata, such as a frequency or a
name, can be attached, but these are secondary.

Hence we need some kind of dedicated time series store.

Moreover we need not only store the list of (`timestamp`, `value`)
pairs but also remember all past "versions" of a series, by adding a
second timestamp dimension, which yields a (`version-timestamp`,
`value-timestamp`, `value`) triplet.

In data science, we might want to build new models with the data that
was available at the time of older models, to compare their power on an
equal basis. We might want to understand the view we had in the past on
some variables. This second time dimension has many important uses.
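To make this second time dimension concrete, here is a minimal sketch,
in pandas, of two successive versions of a made-up series and of an
"as of" lookup (the names, values and helper function are purely
illustrative, not tshistory code):

    import pandas as pd

    # version ingested on June 1st: three daily points
    v1 = pd.Series([1.0, 2.0, 3.0],
                   index=pd.date_range('2020-6-1', periods=3, freq='D'))

    # version ingested on June 2nd: one corrected point, one new point
    v2 = pd.Series([1.0, 2.0, 3.1, 4.0],
                   index=pd.date_range('2020-6-1', periods=4, freq='D'))

    # the store remembers both, keyed by their version timestamp
    versions = {
        pd.Timestamp('2020-6-1 09:00'): v1,
        pd.Timestamp('2020-6-2 09:00'): v2,
    }

    def get_asof(versions, revision_date):
        # latest version whose version timestamp is <= revision_date
        known = [stamp for stamp in sorted(versions)
                 if stamp <= revision_date]
        return versions[known[-1]] if known else None

    # "what did the series look like on June 1st at noon?" -> v1
    print(get_asof(versions, pd.Timestamp('2020-6-1 12:00')))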
With respect to the storage, we should be less concerned with the
implementation details (technology choice, underlying database backend,
etc.) than with the `api` used to handle the time series. The api point
is important inasmuch as it provides true `data independence` from the
underlying storage device.

Having a clean and consistent api allows:

* swapping the underlying storage for something more adequate, while
  minimizing the risk of disruption for the data users

* growing a data api point on top of an existing application

* connecting several stores together in a uniform way (or `federating`
  time series stores)

We have determined that the following three api points are necessary
for a minimally functioning time series store:

* `get(<name>, <revision-date>) -> <series>`

* `update(<name>, <series>, <author>) -> <diff>`

* `revision_dates(<name>) -> [<timestamps>]`

Since series are handled by name, a good naming convention is very
important. A name can encode a lot of things, and the structure of a
name should be thought out carefully when starting to populate a time
series store.

In some sense, the name is the first and only mandatory piece of
metadata associated with a series. Optional metadata can also be planned
and used for search purposes, but a good name can work miracles when
searching within a time series catalog.


### Cleanup

Data quality is a vast topic, and its methods are numerous. The field of
automatic anomaly detection is vast enough that it is best left to
domain specialists.

There is one cheap option to consider when it is possible, e.g. on
low-frequency data: expert input to fix known-bogus values, also known
as _supervised_ cleanup.

If we already have a versioned time series store, we just need to make
sure that any new version that fixes bogus values can be distinguished
from the upstream versions, using some version metadata.


### Assembly

Having a lot of series at hand won't do any good if:

* they come in heterogeneous units (e.g. MWh vs GWh, € vs $, etc.),

* they are partially redundant or partially complementary (e.g. several
  series over the same variables, from different sources, and with
  different time spans),

* they represent a low-level component and need to be aggregated with
  others (e.g. regional values needed to build a national view).

We have seen people using Excel or Python or R to fix these issues and
produce the series needed for their views and models. However keeping
these computed series outside the store has annoying consequences:

* they are transient,

* they are ad hoc and less easily testable,

* they cannot be easily communicated.

A solution is to augment the store with a formula system allowing the
construction of new series out of the existing ones. It is not very
complicated to provide such "virtual" series on top of a store.


## Addressing the issues: the `tshistory` package

We spent the last few years building a time series information system to
tackle these issues. We will provide a high-level understanding of how
we manage them in a reasonably efficient manner.


### An opinionated series store

We defined the following performance properties for our store:

* compact storage

* fast data reads (of any version)

* decent insertion performance

Having observed data ingestion processes for several years, we started
from an important observation: most time series ingestions add new
points at the end, a few edit existing points near the end, and random
edits all over the place are very rare events.

We took the opportunity to design a data layout that exploits this
observation, as illustrated by the sketch below.
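For instance, a typical day-over-day ingestion reduces to a handful of
tail points once we only keep what is new or changed (a pandas sketch
with made-up numbers, not the actual tshistory diffing code):

    import pandas as pd

    # what the store already has for a (made-up) daily series
    stored = pd.Series([10.0, 11.0, 12.0, 13.0],
                       index=pd.date_range('2020-6-1', periods=4, freq='D'))

    # what the provider sends today: the same history, a revised last
    # point and one brand new point at the end
    incoming = pd.Series([10.0, 11.0, 12.0, 13.5, 14.0],
                         index=pd.date_range('2020-6-1', periods=5, freq='D'))

    # align the stored series on the incoming index, then keep only the
    # points that are new or whose value changed
    previous = stored.reindex(incoming.index)
    diff = incoming[previous.isna() | (incoming != previous)]

    print(diff)  # -> only the two tail points (13.5 and 14.0)

Storing such small chunks, linked to what came before them, is much
cheaper than rewriting the whole series at each version.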
So we did not go with the very simple run-of-the-mill data schema:

    create table series (
        series_name text,
        revision_date timestamptz,
        value_date timestamptz,
        value double precision,
        primary key (series_name, revision_date, value_date)
    )

We also noted that being able to query data in the db with point-level
granularity brings little benefit. Hence we store whole contiguous
blocks (or chunks) of series at once.

So we went with a slightly more complicated data layout. We have three
tables: one to register all the series, and then, for each series, one
for the versions and one for the series chunks.

The `registry` table enumerates all the series we have and provides
direct access to the relevant series revision table:

    create table registry (
        seriesname text primary key,
        tablename text unique,
        metadata json
    )

The `revision` table provides an index over all the versions of a given
series:

    create table _revision (
        revision_date timestamptz unique not null,
        metadata json,
        snapshot integer references _snapshot(id)
    )

As one can see, the meat really comes last, in the snapshot table. Let's
look at its simplified design:

    create table _snapshot (
        id serial primary key,
        chunk_start timestamptz,
        chunk_end timestamptz,
        chunk bytea,
        parent integer references _snapshot(id)
    )

All the storage magic happens there. Let's describe what happens when we
append new data to an existing series:

* the snapshot of the latest revision is retrieved (it will be the
  parent of the new chunk)

* the new timestamps and values are bundled together in binary form
  (and compressed)

* a new snapshot record is created and linked to its parent

* in the revision table, a new record is created and linked to the new
  snapshot record

Of course the real-world version also addresses the following concerns:

* insertion of a very big chunk at once

* edition (including erasure) of existing points (not a pure append)

These are well handled, but to know how, you will have to dig a bit into
the sources :)

This gives us a fast path for data retrieval: a recursive query provides
all the needed chunks, and we chop off the excess bits if needed. Going
from a `pandas` time series to this format and back is very fast, and
the disk footprint is small compared with the vanilla design.
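As an illustration of that retrieval path, here is a toy sketch (plain
Python with made-up in-memory records, not the actual schema or query)
of rebuilding a series by walking the chunks through their parent links,
which is what the recursive query does on the database side:

    import pandas as pd

    def chunk(start, values):
        # small helper to build a contiguous daily chunk
        return pd.Series(values,
                         index=pd.date_range(start, periods=len(values),
                                             freq='D'))

    # stand-in for the snapshot table: each record holds a chunk of the
    # series and a link to its parent chunk
    snapshots = {
        1: {'parent': None, 'chunk': chunk('2020-6-1', [1.0, 2.0])},
        2: {'parent': 1, 'chunk': chunk('2020-6-3', [3.0, 4.0])},
        3: {'parent': 2, 'chunk': chunk('2020-6-5', [5.0])},
    }

    def rebuild(snapshots, head):
        # walk the parent chain from the head snapshot, collect the
        # chunks, then glue them back together in chronological order
        chunks = []
        while head is not None:
            record = snapshots[head]
            chunks.append(record['chunk'])
            head = record['parent']
        return pd.concat(list(reversed(chunks)))

    # the revision table tells us which snapshot is the head of a version
    print(rebuild(snapshots, head=3))

Chopping off the excess bits then simply amounts to slicing the
reconstructed series to the requested date range.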
A last point on the store: the insertion api tries to help you. As
already shown, the api point is:

* `update(<name>, <series>, <author>) -> <diff>`

Its purpose is to create a new version of the named series in which:

* all new or changed points become part of the new version,

* all timestamps associated with a `nan` value are treated as data point
  erasures,

* what exactly changed with respect to the current series version is
  returned.

We quickly discovered how tedious it is to determine whether the new
series chunk we are adding to the store contains already existing
points. So we made it a feature to accept already existing data points
and simply ignore them if they bring no change to the series.

Consider inserting from Excel or from various data ingestion scripts:
such a behaviour is a big time and trouble saver, at a small cost (we
must read the existing series within the update range to compute a
meaningful `diff` of what really has to be ingested, and that is what
the `update` call returns).

Of course the api provides a bit more than `get`, `update` and
`revision_dates`: we have `history` (to retrieve many versions at once),
`staircase` (to constrain series values to a maximum distance from their
revision dates), and more, e.g. for metadata handling and searching.


### Human supervision

Early on we were asked to handle the situation where data ingested from
providers contains obvious (that is, obvious to the domain experts) bad
data points, and the experts wanted to:

* fix them quickly

* be able to see their hand-fixes easily (e.g. with a special color
  coding in the ui)

* have the fixes be properly versioned (so, like any normal version)

* have their fixes be automatically overridden whenever the upstream
  source provides new values for the same points (such an update would
  typically be understood as a correction by the provider towards saner
  values)

To handle this we modelled the problem as we would with a code
repository (think git here) with two branches:

* the upstream branch, containing pristine upstream versions

* the master branch, containing upstream plus the occasional expert edit

Correctly updating the expert-edited master branch from upstream
consists of always applying to master the `diff` returned by the
`update` made to upstream. Hence any expert-edited value is replaced if
upstream also updates it. This is morally equivalent to merging upstream
into master after each commit to upstream.


### Series Assembly with a Series Formula Language

#### Basic idea

Rather than augmenting the api with an ever-expanding set of fixed
transformations, we sat and watched domain experts do their work. It
appeared that moving the series store out of Excel was only the first
step. The next obvious thing to do was to also steal one of Excel's core
features: a formula system.

It turns out that programming languages for such a purpose abound, and
we settled on a `lisp` variant for the following reasons:

* a slightly unusual but regular and easy-to-learn surface syntax

* very cheap to implement

Any engineer mastering Excel formulae will be at ease with lisp series
formulas.


#### Whetting your appetite

Here's how it looks:

    (series "banana.spot-price.$")

This returns the spot price of bananas in USD, using the `series`
`operator` with the series name as its first and only argument.

Another one:

    (mul (series "banana.spot-price.$") (series "$-to-€"))

This produces the same as above, except for a time-dependent conversion
into euros.

Such formulas are given a name and treated, through the api, exactly
like data-bearing series.

There are a few api points to manage them, such as:

    register_formula(<name>, <formula>)
    formula(<name>) -> <formula>

Since the underlying series have revisions, so do formulas: the
revisions of a formula are the union of the revisions of its underlying
series. Formulas can be made out of formulas.

Here are two more interesting examples:

    (add (series "be.hydro.production.mwh")
         (series "de.hydro.production.mwh")
         (series "fr.hydro.production.mwh")
         ...
         (series "cz.hydro.production.mwh"))

This builds an aggregate hydro power production at the scale of the EU.
Or this:

    (priority (series "fr.hydro.production.old-realised.mwh")
              (series "fr.hydro.production.realised.mwh")
              (series "fr.hydro.production.short-fcst.mwh")
              (series "fr.hydro.production.longterm-fcst.mwh"))

Out of several series of history and forecasts with differing horizons,
we can build a synthetic series using the best available values.


#### How it works

First we need a parser. Fortunately `lisp` is so simple that a parser
can be implemented in a few days. Then an interpreter: python being so
dynamic, an interpreter can be written trivially by walking down the
parse tree and calling matching python functions.
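To give the flavour, here is a toy sketch of such a parser and
interpreter (illustrative only: the tokenizer is naive, the fake store
and operator set are made up, and the real engine is richer):

    import operator
    from functools import reduce

    import pandas as pd

    def tokenize(text):
        # split on parentheses and whitespace
        # (naive: assumes no spaces inside series names)
        return text.replace('(', ' ( ').replace(')', ' ) ').split()

    def parse(tokens):
        # build a nested list out of the token stream
        token = tokens.pop(0)
        if token == '(':
            expr = []
            while tokens[0] != ')':
                expr.append(parse(tokens))
            tokens.pop(0)  # drop the closing ')'
            return expr
        if token.startswith('"'):
            return token.strip('"')  # a series name
        try:
            return float(token)
        except ValueError:
            return token  # an operator name

    # the operators are plain python functions
    def series(name):
        # stand-in: the real operator reads from the store
        return FAKE_STORE[name]

    def mul(*args):
        return reduce(operator.mul, args)

    def add(*args):
        return reduce(operator.add, args)

    OPERATORS = {'series': series, 'mul': mul, 'add': add}

    def evaluate(expr):
        # walk the parse tree, calling the matching python function
        if isinstance(expr, list):
            func, *args = expr
            return OPERATORS[func](*(evaluate(arg) for arg in args))
        return expr

    FAKE_STORE = {
        'banana.spot-price.$': pd.Series(
            [3.0, 3.1, 3.2],
            index=pd.date_range('2020-6-1', periods=3, freq='D')),
        '$-to-€': pd.Series(
            [.9, .9, .89],
            index=pd.date_range('2020-6-1', periods=3, freq='D')),
    }

    formula = '(mul (series "banana.spot-price.$") (series "$-to-€"))'
    print(evaluate(parse(tokenize(formula))))

The real engine goes further, in particular by checking formulas against
the operators' type annotations, as explained below.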
All the operators we've seen (such as `series`, `mul`, `add` or
`priority`) are plain python functions with static type annotations. The
annotations help us detect bugs in end-user-provided formulas before
they get a chance to run.


## Conclusion: the new data refinery is not Excel any more

Well, Excel is still there, and we built a client for it, from which
series can be pulled and also updated/pushed. Some use it to manage
entirely hand-crafted series.

???

Profit !!!