Sharing Big Water Data
This post explores why we need data consistency across utilities and how we might maintain it with shared workflows.
In use-case workflows, there are two processing stages before analysing water data.
The first of these is basic cleaning: taking the raw data, removing unreadable records, duplicates and erroneous data spikes, and putting timestamps into the correct sequence.
These processes can be automated, except for the removal of erroneous data ‘spikes’, which are highly diverse.
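The automatable part of basic cleaning can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `basic_clean`, the (timestamp, value) row layout and the "keep the first duplicate" rule are all assumptions. Spike removal is deliberately left out.

```python
import math
from datetime import datetime


def basic_clean(rows):
    """Drop unreadable rows and duplicate timestamps, then sort by time.

    `rows` is an iterable of (timestamp_text, value_text) pairs (assumed layout).
    """
    seen = set()
    cleaned = []
    for ts_text, value_text in rows:
        try:
            ts = datetime.fromisoformat(ts_text)
            value = float(value_text)
        except (TypeError, ValueError):
            continue  # unreadable record: skip it
        if math.isnan(value):
            continue  # treat NaN as unreadable
        if ts in seen:
            continue  # duplicate timestamp: keep the first occurrence (assumed rule)
        seen.add(ts)
        cleaned.append((ts, value))
    cleaned.sort(key=lambda row: row[0])  # restore time order
    return cleaned


raw = [
    ("2024-01-01T00:10:00", "1.2"),
    ("2024-01-01T00:00:00", "1.0"),
    ("2024-01-01T00:10:00", "1.2"),  # duplicate
    ("not a time", "3.4"),           # unreadable timestamp
    ("2024-01-01T00:05:00", "??"),   # unreadable value
]
cleaned = basic_clean(raw)
```

Even in this small sketch there are choices to make, such as which duplicate to keep, and teams that choose differently end up with different data sets.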
The second stage is the organization of the data. This involves labelling events of interest so they can be filtered in or out – and found easily in future analysis.
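As a rough sketch of how labelled events allow filtering, the Python below stores each label as a time interval and excludes the readings that fall inside chosen labels. The `EventLabel` class, the `filter_out` helper and the label names are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EventLabel:
    label: str       # e.g. "sensor_fault" (hypothetical label name)
    start: datetime  # first timestamp covered by the event
    end: datetime    # last timestamp covered by the event


def filter_out(series, labels, exclude):
    """Return the readings that lie outside every interval labelled in `exclude`."""
    spans = [(lab.start, lab.end) for lab in labels if lab.label in exclude]
    return [
        (t, v) for t, v in series
        if not any(start <= t <= end for start, end in spans)
    ]


series = [
    (datetime(2024, 1, 1, 0, 0), 1.0),
    (datetime(2024, 1, 1, 0, 5), 9.9),
    (datetime(2024, 1, 1, 0, 10), 9.8),
    (datetime(2024, 1, 1, 0, 15), 1.1),
]
labels = [
    EventLabel("sensor_fault", datetime(2024, 1, 1, 0, 5), datetime(2024, 1, 1, 0, 10)),
]
usable = filter_out(series, labels, exclude={"sensor_fault"})
```

Because the labels are stored separately from the readings, the same labelled events can be filtered in for one analysis and out for another.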
The One Source of Truth
The One Source of Truth (OSoT) is a concept embraced by water utilities. The idea is that for consistency there must be a single source of data that everyone uses for their analysis.
The raw monitoring data (SCADA or IoT) has been suggested for this. But the raw data requires basic processing before it can be used. For consistency, this basic processing needs a standardized protocol.
For example, the protocol should indicate whether missing values require simple deletion of the record, or deletion then interpolation across the gap.
What method of interpolation? How large is the gap? A protocol must agree on a number of options.
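To make those options concrete, here is a minimal Python sketch in which the interpolation method (linear) and the largest gap to fill are explicit parameters. The function name `fill_gaps`, the regular sampling step and the 20-minute limit are illustrative assumptions, not a proposed standard.

```python
from datetime import datetime, timedelta


def fill_gaps(series, step, max_gap):
    """Linearly interpolate across gaps no longer than `max_gap`.

    `series` is a time-sorted list of (datetime, float) pairs sampled
    every `step`. Longer gaps are left unfilled.
    """
    if not series:
        return []
    filled = [series[0]]
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        gap = t1 - t0
        if step < gap <= max_gap:
            t = t0 + step
            while t < t1:
                frac = (t - t0) / gap  # position within the gap, 0..1
                filled.append((t, v0 + frac * (v1 - v0)))
                t += step
        filled.append((t1, v1))
    return filled


start = datetime(2024, 1, 1)
series = [
    (start, 1.0),
    (start + timedelta(minutes=15), 4.0),   # 15-minute gap: filled
    (start + timedelta(minutes=60), 10.0),  # 45-minute gap: left alone
]
filled = fill_gaps(series, step=timedelta(minutes=5), max_gap=timedelta(minutes=20))
```

Two teams running this with different `max_gap` values would produce different "clean" data from the same raw source, which is why the options need to be agreed in advance.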
Furthermore, agreement on basic processing should extend across the industry to support benchmarking and machine learning (ML) projects.
Data silos should be a thing of the past
Historically, engineering teams used localised spreadsheets for data processing and this led to data silos.
If engineering teams label data events independently, they duplicate work and potentially create differences in their data sets.
For consistency, all events need to be documented and labelled once for use by all teams. The labelling work should be divided among the teams that will share the data.
Ideally, one team labels, another checks.
Collaboration for machine learning
Collaboration between utilities is highly desirable.
Using ML to automate the classification of diverse data signatures requires many (thousands of) labelled examples of each signature. Even the largest utilities are unlikely to have enough.
If participating utilities agree on a common set of defined labels for all events that matter to their use-cases, many more ML opportunities will open up.
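A shared label set is what makes pooling possible. As a hedged sketch, the Python below checks each utility's labels against an agreed vocabulary before counting the pooled examples per label. The label names, the utility names and the `pool_label_counts` helper are hypothetical.

```python
from collections import Counter

# Hypothetical agreed vocabulary; a real one would come from the shared protocol.
AGREED_LABELS = {"burst", "sensor_fault", "maintenance"}


def pool_label_counts(datasets):
    """Count labelled examples per label across utilities.

    `datasets` maps a utility name to its list of event labels.
    Labels outside the agreed set are rejected rather than silently pooled.
    """
    counts = Counter()
    for utility, labels in datasets.items():
        unknown = set(labels) - AGREED_LABELS
        if unknown:
            raise ValueError(
                f"{utility} uses labels outside the agreed set: {sorted(unknown)}"
            )
        counts.update(labels)
    return counts


pooled = pool_label_counts({
    "utility_a": ["burst", "burst", "sensor_fault"],
    "utility_b": ["burst", "maintenance"],
})
```

Without the agreed set, one utility's "burst" and another's "pipe_break" would be counted as different signatures, and neither might reach the number of examples needed for training.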
Consistent cloud data cleaning and labelling
SensorClean is FSA Data’s cloud software that allows water engineers to easily clean, visualize and share data with team members.
With SensorClean, engineers can be confident that data cleaning and labelling are performed only once, in a documented and consistent manner, for use by all engineering teams in the water utility.
In conclusion…
FSA Data is keen to participate in developing standardized basic cleaning and labelling protocols across the water industry.
Please contact us to discuss this important issue.