Data Cleaning In The Multi-Million Dollar Workflow

AdobeStock_299935912.jpg

Decision-making associated with proposed water network infrastructure can have significant economic implications. Decisions lead to costs - often in the order of tens of millions of dollars - which are passed on to communities.

Adopting a decision-making process that uses all available historical water network data will lead to best outcomes for communities. This means leveraging the 25 years of water network data collected via SCADA monitoring. This network data is an evidence-base and provides real information about:

  • water impacts of population growth

  • changes in community water behaviours over time

  • variability in behaviour between communities; and importantly

  • a baseline for future monitoring of the impacts of climate change.

All of these items need to be considered in decision-making about proposed water network infrastructure.

Analysis and reporting tools for water infrastructure engineers commonly consist of a GIS tool, a hydraulic model and Microsoft Excel and Word. These tools are used as part of the decision-making process shown below in Figure 1.

Social and environmental impacts of proposed infrastructure are not shown in Figure 1 but are also assessed during decision-making.

Workflow.png

Figure 1 General economic workflow to make decisions about proposed water infrastructure

Cleaned SCADA monitoring data provides a rich historical record for estimation of hydraulic modelling parameters. Hydraulic models that use modelling parameters estimated by using all spatially observed historical data will generate more realistic model simulations. Estimating infrastructure sizing on this basis is likely to result in refinement to design options with potential for significant cost savings.

In Figure 1, Data Cleaning is shaded grey because it is often underutilised. Why is this so?

Because the tools are not well suited to the scale of modern water data sets. More specifically:

  • Excel allows only 1 million rows in a Worksheet leading to inefficient processes for cleaning data sets often containing 5 to 10 times that many rows; and

  • historical monitoring data is almost always messy (data gaps, shifts and spikes, duplicates, unreadable items etc).

What happens in the spreadsheet cleaning process?

A typical SCADA flow sensor with an average of 1800 observations per day recorded over 9 years will have around 6 million rows. However, because of Excel’s 1 million row limitation the file cannot be opened. The 6 million rows of data must therefore be cut into 6 smaller chunks using a text file editor prior to opening in Excel. But the problem does not end there.

Excel can identify duplicates and NaN’s easily. But visualising data is problematic across 6 different Worksheets. Furthermore, if out-of-sequence data or data gaps span across Worksheets, then prepare for pain.

Once data duplicates, NaN’s and out of sequence timestamps are dealt with, water network domain knowledge is required to address potential sensor errors and abnormal network events. This is essential because including days where large pipe failure events have occurred can lead to significant overestimates of community water consumption. All such ‘bad data’ should be labelled and filtered during further analysis.

Water engineers producing water consumption profiles for subsystems, will still need to go through this process for any additional inflow sensors, reservoir water level sensors and outflow sensors in the subsystem of interest.

Why does this matter?

If we take the approach that “there’s no value in analysing the historical data” then we miss a significant opportunity to improve models with best-evidence data and potentially save considerable sums of money.

In data science, the rule of thumb that ‘more data is better’ holds true when analysing historical water network data.

FSA Data’s cloud software SensorClean allows for quick cleaning, labelling and visualization of water network big data sets. Labelling of bad data prepares datasets for machine learning. Insights from cleaned data leads to best evidence decision making, potentially saving communities significant sums of money.

Please contact us for more information.

Previous
Previous

What’s Labelling Got To Do With It?

Next
Next

Water Network Boundaries And Data Mining