initial time series refinery draft
1 files changed, 380 insertions(+), 0 deletions(-)

A => timeseries.md
A => timeseries.md +380 -0
@@ 0,0 1,380 @@ 
+# Notes on data refinement, for a time-series information system
+
+These are reflections stemming from a stream of software-building
+activity in the context of energy commodity markets, spanning several
+years.
+
+How can we reconcile heavily automated data processing with human expertise
+for data aggregation and supervision? A common workflow is to pull the data
+from huge data powerhouses into Excel files, where the actual magic happens
+and all proper IT guidelines (versioning, testing, etc.) are ignored. We
+built an information system that keeps human intervention in the loop, with
+less and less Excel involved.
+
+
+## The issues with data
+
+It is common knowledge that predictive models, the holy grail
+of the new wave of "digitalisation", are completely dependent
+on the availability of good data as input (to train the model,
+and also to run it).
+
+The issues with good data are:
+* how to find it and get it,
+* once obtained, how to store and archive it for further use,
+* once stored, how to clean it up,
+* once cleaned, how to assemble it into proper model variables.
+
+
+### Finding
+
+The finding part is better left to domain specialists. There are companies
+that sell domain-specific data and make it available through low-level
+means (big files on an ftp server) or something more pleasant, like a
+polished high-level api.
+
+Sometimes there is no other choice but to write a scraper to chop off
+the valuable parts of some js-riddled web page. Or jump through a lot of
+hoops to just get at data produced internally by other teams and business
+units.
+
+Such is life: just getting at the raw data is already an endeavour.
+However, there is nothing novel in the methods used to obtain it, and this
+is not our core topic.
+
+
+### Aspects of Storage
+
+Time series have a well-defined shape: they are a list of (timestamp, value)
+tuples. Additional metadata, such as a frequency or a name, can be attached,
+but these are secondary.
+
+Hence we need some kind of dedicated time series store.
+
+Moreover we need not only store the list of (`timestamp`, `value`) pairs but
+also remember all past "versions" of a series, by adding a second timestamp
+dimension, which yields (`version-timestamp`, `value-timestamp`, `value`)
+triplets.
+
+In data science, we might want to build new models using only the data that
+was available at the time of older models, to compare their predictive power
+on an equal footing. We might want to understand the view we had in the past
+on some variables. This second time dimension has many important uses.
+
+With respect to the storage, we should be less concerned with the
+implementation details (technology choice, underlying database backend, etc.)
+than with the `api` used to handle the time series. The api matters inasmuch
+as it provides true `data independence` from the underlying storage device.
+
+Having a clean and consistent api allows:
+
+* swapping the underlying storage for something more adequate, while
+  minimizing the risk of disruption for the data users
+
+* growing a data api point on top of an existing application
+
+* connecting several stores together in a uniform way (or `federating`
+  time series stores)
+
+We have determined that the following three api points are necessary for a
+minimally functioning time series store (a sketch in code follows the list):
+
+* get(<name>, <revision-timestamp>) -> <series>
+
+* update(<name>, <series>, <revision-timestamp>) -> <series>
+
+* revision_dates(<name>) -> <list[<timestamp>]>
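+
+As an illustration, here is a minimal sketch of such an interface in Python
+(names and signatures here are illustrative, not the actual `tshistory` api):
+
+  from datetime import datetime
+  from typing import List, Optional
+
+  import pandas as pd
+
+
+  class TimeSeriesStore:
+      """Illustrative interface -- not the actual tshistory api."""
+
+      def get(self, name: str,
+              revision_date: Optional[datetime] = None) -> pd.Series:
+          """Return the series as of `revision_date` (latest if None)."""
+          raise NotImplementedError
+
+      def update(self, name: str, series: pd.Series,
+                 revision_date: Optional[datetime] = None) -> pd.Series:
+          """Store a new version and return the effective diff."""
+          raise NotImplementedError
+
+      def revision_dates(self, name: str) -> List[datetime]:
+          """Return all known version timestamps of the series."""
+          raise NotImplementedError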
+
+Since series are handled by name, a good naming convention is very
+important. A name can encode a lot of things, and the structure of a name
+should be well thought out when starting to populate a time series store.
+
+In some sense, the name is the first and only mandatory piece of metadata
+associated with a series. Optional metadata can also be planned and used
+for search purposes, but a good name can work wonders when searching
+within a time series catalog.
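+
+For instance, a dotted, hierarchical convention going from context to measure
+to unit (as in the series names used later in this document) already makes a
+catalog quite searchable:
+
+ fr.hydro.production.realised.mwh
+ fr.hydro.production.short-fcst.mwh
+ banana.spot-price.$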
+
+
+### Cleanup
+
+Data quality is a vast topic, and its methods are numerous. The field
+of automatic anomaly detection is vast enough that it is best left to
+domain specialists.
+
+There is however a cheap option to consider when it is practical, e.g. on
+low-frequency data: expert input to fix known-bogus values, also known as
+_supervised_ cleanup.
+
+If we already have a versioned time series store, we just have to make sure
+that any new version fixing bogus values can be told apart from the upstream
+versions, using some version metadata.
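+
+Sticking with the illustrative api sketched earlier, and assuming the `update`
+api point accepts per-version metadata (an assumption made for this example,
+not the actual api), a supervised fix could look like:
+
+  import pandas as pd
+
+  def supervised_fix(store, name, stamp, value):
+      # `store` follows the illustrative interface sketched earlier; the
+      # `metadata` keyword below is an assumption, not the real api
+      fix = pd.Series([value], index=[pd.Timestamp(stamp, tz='UTC')])
+      return store.update(name, fix, metadata={'supervision': 'manual'})
+
+  # e.g. supervised_fix(store, 'fr.hydro.production.realised.mwh',
+  #                     '2020-01-01', 42.0)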
+
+
+### Assembly
+
+Having a lot of series at hand won't do any good if:
+
+* they come in heterogeneous units (e.g. MWh vs GWh, € vs $, etc.)
+
+* they are either partially redundant or partially complementary (e.g. several
+  series over the same variables, from different sources, and with different
+  time spans)
+
+* they represent a low-level component and need to be aggregated with others
+  (e.g. regional values needed to build a national view)
+
+We have seen people using Excel or Python or R to fix these issues and produce
+the series needed for their views and models. However, keeping these
+computed series outside the store has annoying consequences:
+
+* they are transient,
+
+* they are ad-hoc and less easily testable,
+
+* they cannot be easily communicated.
+
+A solution would be to augment the store with a formula system allowing the
+construction of new series out of the existing ones. It is not very complicated
+to provide "virtual" series on top of a store.
+
+
+## Addressing the issues: the `tshistory` package
+
+So we spent the last few years building a time series information system
+to tackle these issues. We will provide a high-level understanding of
+how we manage them in a reasonably efficient manner.
+
+
+### An opinionated series store
+
+We defined the following performance properties for our store:
+
+* compact storage
+
+* fast data reads (of any version)
+
+* decent insertion performance
+
+Having observed data ingestion processes for several years, we started from
+an important observation: most time series updates append new points at the
+end, a few are edits of existing points near the end, and random edits all
+over the place are very rare events.
+
+We took the opportunity to have a data layout that exploits this observation.
+
+So we don't have the very simple run-of-the-mill data schema:
+
+  create table series (
+      series_name text,
+      revision_date timestamptz,
+      value_date timestamptz,
+      value double precision,
+      primary key (series_name, revision_date, value_date)
+  )
+
+We also noted that being able to query data in the db with point-level
+granularity brings little benefit. Hence we store whole contiguous blocks
+(or chunks) of series at once.
+
+So we went with a slightly more complicated data layout. We have three tables,
+one to register all the series, and then for each series, one for the versions and
+one for the series chunks.
+
+The `registry` table is there to enumerate all the series we have, and provide
+direct access to the relevant series revision table.
+
+  create table registry (
+      seriesname text primary key,
+      tablename text unique,
+      metadata json
+  )
+
+The `revision` table provides an index over all the versions of a given series:
+
+  create table <tablename>_revision (
+      revision_date timestamptz unique not null,
+      metadata json,
+      snapshot integer references <tablename>_snapshot(id)
+  )
+
+As one can see the meat really comes last, in the snapshot table. Let's look at
+its simplified design:
+
+  create table <tablename>_snapshot (
+      id serial primary key,
+      chunk_start timestamptz,
+      chunk_end timestamptz,
+      chunk bytea,
+      parent integer references <tablename>_snapshot(id)
+  )
+
+All the storage magic happens there. Let's describe what happens when we
+append new data to an existing series (a code sketch follows the list):
+
+* the snapshot attached to the latest revision is retrieved: it will be the parent
+
+* the series time stamps and value arrays are bundled together in binary form
+  (and compressed)
+
+* a new record is created and linked to its parent
+
+* in the revision table, a new record is created and linked to the new
+  snapshot record
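+
+Here is a minimal sketch of that append path over the simplified schema above,
+assuming a PostgreSQL connection through `psycopg2` and a naive csv-plus-zlib
+chunk payload (the actual tshistory format and code are different and more
+efficient):
+
+  import zlib
+
+  import pandas as pd
+  import psycopg2
+
+  def serialize_chunk(series: pd.Series) -> bytes:
+      # bundle timestamps and values in binary form, then compress
+      # (illustrative payload; the real format differs)
+      return zlib.compress(series.to_frame('value').to_csv().encode('utf-8'))
+
+  def append(dsn, tablename, series, revision_date):
+      with psycopg2.connect(dsn) as cn, cn.cursor() as cursor:
+          # 1. the snapshot of the latest revision becomes the parent
+          cursor.execute(
+              f'select snapshot from "{tablename}_revision" '
+              'order by revision_date desc limit 1')
+          row = cursor.fetchone()
+          parent = row[0] if row else None
+          # 2./3. serialize the new chunk and link it to its parent
+          cursor.execute(
+              f'insert into "{tablename}_snapshot" '
+              '(chunk_start, chunk_end, chunk, parent) '
+              'values (%s, %s, %s, %s) returning id',
+              (series.index[0].to_pydatetime(),
+               series.index[-1].to_pydatetime(),
+               psycopg2.Binary(serialize_chunk(series)),
+               parent))
+          snapshot_id = cursor.fetchone()[0]
+          # 4. register the new revision, pointing at the new snapshot
+          cursor.execute(
+              f'insert into "{tablename}_revision" (revision_date, snapshot) '
+              'values (%s, %s)',
+              (revision_date, snapshot_id))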
+
+Of course we address, in the real-world version, the following concerns:
+
+* insertion of a very big chunk at once
+
+* edits (including erasures) of existing points (not a pure append)
+
+These are well handled, but to find out how you will have to dig a bit into the sources :)
+
+This gives us a fast path for data retrieval: a recursive query will provide
+all the needed chunks, and we will chop off the excess bits if needed.
+Going from a `pandas` time series to this format, and back, is a very fast
+process, and the disk footprint is small compared with the vanilla design.
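+
+The retrieval side could look like the following sketch (again illustrative,
+not the actual query), which walks the snapshot parent chain with a single
+recursive query and lets the caller decompress and concatenate the chunks:
+
+  def read_chunks(cursor, tablename, snapshot_id):
+      # the real query also prunes chunks falling outside the requested range
+      cursor.execute(
+          f'''
+          with recursive allchunks(id, parent, chunk) as (
+              select id, parent, chunk
+              from "{tablename}_snapshot"
+              where id = %s
+            union all
+              select s.id, s.parent, s.chunk
+              from "{tablename}_snapshot" as s
+              join allchunks on allchunks.parent = s.id
+          )
+          select chunk from allchunks
+          ''',
+          (snapshot_id,))
+      return [row[0] for row in cursor.fetchall()]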
+
+A last point on the store: the insertion api tries to help you. As already shown,
+the api is as follows:
+
+* update(<name>, <series>) -> <series>
+
+The purpose of this is to create a new version of the named series with:
+
+* all new or changed points made part of the new version
+
+* all timestamps bearing a `nan` value treated as data point erasures
+
+* the return value being exactly what has changed with respect to the
+  current series version
+
+We quickly discovered how tedious it is to determine whether the new series
+chunk we are adding to the store contains already existing points. So we made
+it a feature to accept already existing data points and simply ignore them if
+they bring no change to the series.
+
+Consider inserting from Excel or various data ingestion scripts: such a behaviour
+is a big time and trouble saver, at a small cost (we must read the existing
+series within the update range to compute a meaningful `diff` of what really
+has to be ingested, and that diff is what the `update` call returns).
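+
+The gist of that `diff` computation, in `pandas` terms, could look like this
+simplified illustration of the behaviour described above (not the actual
+code):
+
+  import pandas as pd
+
+  def compute_diff(current: pd.Series, incoming: pd.Series) -> pd.Series:
+      # keep only the incoming points that are new or that actually change
+      # the stored value; nans are kept, since they mean "erase this point"
+      current = current.reindex(incoming.index)
+      unchanged = (current == incoming) | (current.isnull() & incoming.isnull())
+      return incoming[~unchanged]
+
+  # example: only the genuinely new point survives the diff
+  current = pd.Series([1., 2., 3.],
+                      index=pd.date_range('2020-1-1', periods=3, freq='D'))
+  incoming = pd.Series([2., 3., 4.],
+                       index=pd.date_range('2020-1-2', periods=3, freq='D'))
+  compute_diff(current, incoming)
+  # -> a series with the single point (2020-01-04, 4.0)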
+
+Of course the api provides a bit more than `get`, `update` and `revision_dates` ...
+we have `history` (to retrieve many versions at once), `staircase` (to constrain
+series values to a maximum distance from their revision dates), and more,
+e.g. for metadata handling and searching.
+
+
+### Human supervision
+
+Early on we were asked to handle the situation where data ingested from
+providers contains obvious (that is, obvious to the domain experts) bad data
+points, and the experts wanted to:
+
+* fix them quickly
+
+* be able to see their hand-fixes easily (e.g. special color-coding in the ui)
+
+* have the fixes be properly versioned (so, like any normal version)
+
+* have their fixes be automatically overridden whenever the upstream source
+  provides new values for the same points (such an update would typically be
+  understood as a correction by the provider towards saner values)
+
+To handle this we modelled the problem as we would with a code repository (think git
+here) with two branches:
+
+* the upstream branch, containing pristine upstream versions
+
+* the master branch, containing upstream + the occasional expert edit
+
+The process of correctly updating the expert-edited master branch from
+upstream consists of always applying the `diff` of the `update` made to
+upstream. Hence any expert-edited value is replaced if upstream also updated
+it. This is morally equivalent to merging upstream into master after each
+commit to upstream.
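+
+A sketch of that process, using the illustrative api from earlier (the branch
+naming scheme is an assumption made for this example):
+
+  def ingest_from_upstream(store, name, incoming, revision_date=None):
+      # 1. update the pristine upstream branch and collect the effective diff
+      diff = store.update('upstream.' + name, incoming, revision_date)
+      # 2. apply that very diff to the expert-edited master branch: any
+      #    hand-fixed point that upstream also corrected is overridden,
+      #    while untouched expert fixes survive
+      if len(diff):
+          store.update(name, diff, revision_date)
+      return diff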
+
+
+### Series Assembly with a Series Formula Language
+
+#### Basic idea
+
+Rather than augment the api with an ever-expanding set of fixed
+transformations, we sat and watched domain experts do their work. It appeared
+that moving the series out of Excel and into the store was only the first
+step. The next obvious thing to do was also to steal one of Excel's core
+features: a formula system.
+
+It turns out programming languages for such a purpose abound, and we settled on
+a `lisp` variant for the following reasons:
+
+* a slightly unusual but regular and easy-to-learn surface syntax
+
+* very cheap to implement
+
+Any engineer mastering Excel formulae would be at ease with lisp series formulas.
+
+
+#### Whetting your appetite
+
+Here's how it looks:
+
+ (series "banana.spot-price.$")
+
+This returns the spot price of bananas in USD, using the `series` operator
+with the series name as its first and only argument.
+
+Another one:
+
+ (mul (series "banana.spot-price.$") (series "$-to-€"))
+
+This will produce the same as above, except for a time-dependent conversion
+into euros.
+
+Such formulas are given a name and treated, through the api, exactly like
+data-bearing series.
+
+There are a few api points to manage them, such as:
+
+ register_formula(<name>, <formula>)
+ formula(<name>) -> <formula>
+
+Since the underlying series have revisions, so do formulas. The revisions of
+a formula are the union of the revisions of the underlying series. Formulas
+can be made out of formulas.
+
+Here are two more interesting examples:
+
+ (add (series "be.hydro.production.mwh")
+      (series "de.hydro.production.mwh")
+      (series "fr.hydro.production.mwh")
+      ...
+      (series "cz.hydro.production.mwh"))
+
+This builds an aggregate hydro production at the scale of the EU.
+Or this:
+
+ (priority (series "fr.hydro.production.old-realised.mwh")
+           (series "fr.hydro.production.realised.mwh")
+           (series "fr.hydro.production.short-fcst.mwh")
+           (series "fr.hydro.production.longterm-fcst.mwh"))
+
+Out of several series of history and forecasts with differing horizons, we
+can make a synthetic series using the best available values.
+
+
+#### How it works
+
+First we need a parser. Luckily, `lisp` is so simple that a parser can be
+implemented in a few days. Then we need an interpreter. Luckily, python is so
+dynamic that an interpreter can be written trivially by walking down the
+parse tree and calling matching python functions.
+
+All the operators we've seen (such as `series`, `mul`, `add` or `priority`)
+are plain python functions with static type annotations. The annotations
+help us detect bugs in end-user provided formulas before they have a chance
+to run.
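+
+A deliberately tiny sketch of the idea follows (the real implementation is
+more elaborate, notably about type checking and error reporting):
+
+  import pandas as pd
+
+  def tokenize(source: str) -> list:
+      return source.replace('(', ' ( ').replace(')', ' ) ').split()
+
+  def parse(tokens: list):
+      token = tokens.pop(0)
+      if token == '(':
+          expr = []
+          while tokens[0] != ')':
+              expr.append(parse(tokens))
+          tokens.pop(0)  # drop the closing ')'
+          return expr
+      return token.strip('"') if token.startswith('"') else token
+
+  # operators are plain python functions, looked up by name
+  def op_series(store, name: str) -> pd.Series:
+      return store.get(name)
+
+  def op_add(store, *serieslist: pd.Series) -> pd.Series:
+      return sum(serieslist[1:], serieslist[0])
+
+  def op_mul(store, s1: pd.Series, s2: pd.Series) -> pd.Series:
+      return s1 * s2
+
+  OPERATORS = {'series': op_series, 'add': op_add, 'mul': op_mul}
+
+  def evaluate(store, expr):
+      # walk the parse tree, calling the matching python function at each node
+      if isinstance(expr, list):
+          func = OPERATORS[expr[0]]
+          return func(store, *(evaluate(store, arg) for arg in expr[1:]))
+      return expr
+
+  # usage:
+  # evaluate(store, parse(tokenize(
+  #     '(mul (series "banana.spot-price.$") (series "$-to-€"))')))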
+
+
+## Conclusion: the new data refinery is not Excel any more
+
+Well, Excel is still there, and we built a client for it, from which
+series can be pulled and also updated/pushed. Some use it to manage
+entirely hand-crafted series.
+
+???
+
+Profit !!!