Data sources
A data leader needs to be able to explain to decision makers outside the data profession where data is sourced and the many ways its quality can be affected. From there, it becomes clearer where action is needed to improve and secure data quality in pursuit of business objectives.
“Analytical users of data are typically trying to extract insight from data created by other business processes,” said Simon Case, head of data at Equal Experts. “By analysing the transactions, logs and other exhaust from transactional processes, we gain valuable visibility and insights about our customers and our operations. But this means that we are usually second-class citizens of the data. As downstream users, our needs are typically not the first to be met and developers prioritise the running of the business process over ours.”
Case continued, “Take clickstream and application log data for example. This is typically data that has been collected from a range of sources – different webpages and services will report and log different things – sometimes with analytics users in mind, sometimes not. It is horribly messy, and often not exactly trusted by anyone, but it is also arguably some of the most useful and valuable data a company has as it is packed with potential insights about your customers and their behaviour.”
When the foundations for assuring data quality are in place – a baseline of data literacy, a collaborative data culture and strict data compliance – it becomes easier to spot and handle unclean data, and an organisation can take advantage of more data sources. However, securing the investment of time and money to achieve this requires trust and understanding from senior decision makers, which is why telling the story of your data quality journey is so important.
Data process
There are two main reasons why data quality is such a challenge. Firstly, “Data must often undergo significant manipulation and transformation before it is useful for analysis,” said Case. “Data pipelines can become very complex.” Secondly, “Changes can happen without data users being told – we commonly see changes in data types, field names and more, which lead to failures in pipelines.”
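To make that second failure mode concrete, here is a minimal, hypothetical sketch in Python of a schema check that surfaces a renamed field or a changed data type before records reach downstream consumers. The field names and expected types are assumptions for illustration, not a real schema.

```python
# A sketch of catching the upstream changes Case describes: comparing an
# incoming record against the schema the pipeline expects, so a silent
# rename or type change surfaces as an explicit violation rather than a
# failure deep inside the pipeline. The schema below is hypothetical.

EXPECTED_SCHEMA = {
    "customer_id": int,
    "event_name": str,
    "occurred_at": str,  # ISO 8601 timestamp, parsed later in the pipeline
}

def check_schema(record: dict) -> list[str]:
    """Return human-readable schema violations for one record."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")  # e.g. a silent rename upstream
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems

# A renamed field and a changed type both show up as clear violations.
print(check_schema({"customerId": 42, "event_name": "login", "occurred_at": 17}))
```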
Case notes that developers of upstream services should always understand the needs of the data team, but also that data quality issues are inevitable and call for safeguards:
- Observe your data flows: Instrument the data pipelines so you can monitor data entering and progressing through to analytics users. Measure and display the flows so you can tell if something untoward has happened – such as data not arriving, or a subtle change in a data type. In a mature environment, you may even consider adding anomaly detection to find things that do not look right but have not yet led to obvious data quality issues (a monitoring sketch follows this list).
- Adopt continuous delivery approaches: Data pipelines are complex beasts, typically involving orchestration, scripts, data manipulation and infrastructure. Many data quality issues are the result of implementation details in the pipeline itself rather than problems at source. Developing in small increments, with testing and integration at each deployment, will catch these issues early (a testing sketch also follows this list).
- Expect to iterate: Improving data quality is typically an iterative process. Initial samples of the data reveal common issues, which are resolved in early versions of the pipeline, while rarer issues only surface once the data is in regular use. You should expect to keep learning about your data.
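To make the observability point concrete, here is a minimal sketch of the kind of flow monitoring described above: it records row counts per pipeline stage and flags batches that deviate sharply from recent history. The stage name, window size and tolerance are illustrative assumptions, not a prescription.

```python
# A sketch of pipeline observability: track a simple metric (row count)
# per stage and alert when a batch deviates sharply from the recent
# average -- e.g. data not arriving, or a sudden drop after an upstream
# change. All names and thresholds here are hypothetical.

from statistics import mean

class FlowMonitor:
    """Tracks row counts per pipeline stage and flags unusual batches."""

    def __init__(self, window: int = 20, tolerance: float = 0.5):
        self.window = window        # how many recent batches to compare against
        self.tolerance = tolerance  # allowed relative deviation from the mean
        self.history: dict[str, list[int]] = {}

    def record(self, stage: str, row_count: int) -> None:
        counts = self.history.setdefault(stage, [])
        baseline = counts[-self.window:]
        # Only alert once there is enough history to form a baseline.
        if len(baseline) >= 5:
            avg = mean(baseline)
            if abs(row_count - avg) > self.tolerance * avg:
                print(f"ALERT [{stage}]: {row_count} rows vs recent average {avg:.0f}")
        counts.append(row_count)

monitor = FlowMonitor()
for count in [1000, 980, 1020, 1010, 990, 40]:  # the final batch is suspicious
    monitor.record("clickstream_ingest", count)
```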
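And to illustrate the continuous delivery point, a sketch of shipping each small pipeline change alongside tests for the transformation it touches, so regressions are caught at deployment rather than in production. The normalise_event step and its rules are hypothetical.

```python
# A sketch of the testing discipline behind continuous delivery for
# pipelines: every small transformation ships with its own tests,
# runnable with pytest. The function below is a made-up example step.

def normalise_event(raw: dict) -> dict:
    """One small pipeline step: lower-case the event name, reject blanks."""
    name = (raw.get("event_name") or "").strip().lower()
    if not name:
        raise ValueError("event_name is missing or empty")
    return {**raw, "event_name": name}

def test_normalises_case_and_whitespace():
    assert normalise_event({"event_name": "  Login "})["event_name"] == "login"

def test_rejects_empty_event_name():
    try:
        normalise_event({"event_name": "   "})
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for empty event_name")
```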
Trust is pivotal
As mentioned previously, data quality and trust are closely linked. Put simply, people need to make informed decisions, and the data informing those decisions must be trustworthy.
“If data quality is not a focus, and the data becomes untrusted, then it will not be used for decision making and will lose business value,” said Case. “I genuinely think that a key part to getting data quality right is to acknowledge that data is complex and will change outside of our control. So, we need to have these foundations of observability and the development processes which allow us to change our pipelines with confidence when we identify quality issues.”
Ultimately, data is there to provide insights, which lead to efficiencies, new revenue opportunities, improved internal and external satisfaction rates and much more. To achieve this, data quality must be monitored, maintained and examined regularly to ensure it provides the greatest possible value. Organisations of all shapes and sizes can achieve this, but it requires a level of precision and understanding that should not be rushed, or quality issues will arise further down the line.