Open ways to avoid bad data apples


Bad data is poorly structured, mis-formatted and plain ugly. It's a problem any analyst will be familiar with and one that never seems to get fixed for good. But could a new generation of tools, including those being developed around open source, not-for-profit models, hold the answer?

Open Knowledge International, a not-for-profit organisation, believes it has created a solution to the problem of badly presented data. According to Vitor Batista, a lead engineer, a data set should go through four types of validation before analysis. Basic validation checks that the file can be loaded or opened and has not been corrupted. Structural validation checks that it is a proper table: that all rows have the same number of columns and that there are no blank rows or columns. Content validation tests the values themselves, checking that data sits within constraints, for example that numbers are greater than a certain figure. Advanced validation lets users create custom checks for their own data set.
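To make the structural stage concrete, here is a minimal sketch in Python using only the standard library. It is not how Good Tables itself is implemented; the function name `structural_errors` and the sample data are assumptions made for illustration. It flags blank header cells, blank rows and rows whose column count differs from the header.

```python
import csv
import io


def structural_errors(csv_text):
    """Return (row_number, message) tuples for structural problems in CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [(0, "file is empty")]

    errors = []
    header = rows[0]
    expected = len(header)

    # A blank header cell suggests an empty or unnamed column.
    for index, name in enumerate(header, start=1):
        if not name.strip():
            errors.append((1, f"blank header in column {index}"))

    # Data rows start at row 2 because row 1 is the header.
    for number, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            errors.append((number, "blank row"))
        elif len(row) != expected:
            errors.append(
                (number, f"expected {expected} columns, found {len(row)}")
            )
    return errors


# Hypothetical sample: row 3 is blank and row 4 is missing a column.
sample = "id,name,amount\n1,apple,3\n\n2,pear\n3,plum,5\n"
for number, message in structural_errors(sample):
    print(number, message)
```

Returning a list of problems, rather than stopping at the first one, lets an analyst see everything that needs fixing in a single pass.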

One tool that can carry out these checks is Good Tables, a free website that lets users validate tabular data before they work with it. Batista explained that it is intended to be used with Tableau and can validate tabular files such as CSV and Excel spreadsheets.

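Good Tables can also make sure that data fits a specific schema. But what is that exactly? A schema describes the fields a table should contain, the type of each field and any constraints on its values. It is closely related to a data dictionary, which the IBM Dictionary of Computing defines as a centralised repository of information about data, such as meaning, relationships to other data, origin, use and format.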
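As an illustration of checking data against a schema, the sketch below validates CSV rows against a small Python dictionary. The layout of `SCHEMA` (a list of fields with a name, a type and optional constraints) is an assumption for this example, as are the field names and the `schema_errors` helper; it is not the Good Tables API.

```python
import csv
import io

# Hypothetical schema: the field names, types and constraints are illustrative.
SCHEMA = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
        {"name": "amount", "type": "integer", "constraints": {"minimum": 1}},
    ]
}

# Map each schema type to the Python callable used to parse a raw cell.
CASTS = {"integer": int, "number": float, "string": str}


def schema_errors(csv_text, schema):
    """Return (row_number, message) tuples for rows that break the schema."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = [field["name"] for field in schema["fields"]]
    if reader.fieldnames != expected:
        return [(1, f"header {reader.fieldnames} does not match schema {expected}")]

    errors = []
    for number, row in enumerate(reader, start=2):
        for field in schema["fields"]:
            raw = row[field["name"]]
            try:
                value = CASTS[field["type"]](raw)
            except (TypeError, ValueError):
                errors.append(
                    (number, f"{field['name']}: {raw!r} is not {field['type']}")
                )
                continue
            minimum = field.get("constraints", {}).get("minimum")
            if minimum is not None and value < minimum:
                errors.append(
                    (number, f"{field['name']}: {value} is below minimum {minimum}")
                )
    return errors


# Hypothetical sample: row 3 breaks the minimum, row 4 has the wrong type.
sample = "id,name,amount\n1,apple,3\n2,pear,0\n3,plum,many\n"
for number, message in schema_errors(sample, SCHEMA):
    print(number, message)
```

Keeping the schema as plain data, separate from the checking code, means the same description can be shared alongside the file and reused by other tools.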

Batista said: “Data validation is super-useful if you are a data producer or data consumer, which are usually the same people.” He placed Good Tables in the context of a project that Open Knowledge International has been working on called Frictionless Data.

Batista explained that this concept is about the workflow of getting data from the source to the starting point of analysis. “We think there is a lot of friction considering the data quality is there. The data quality itself is not the whole story. I need metadata, I need to know the data dictionary, the license of the data, where it comes from, what’s the source, who’s the author. This kind of information needs to be together with the data so you can understand it,” he said.

Rufus Pollock, the president and founder of Open Knowledge International, explained in a video that a lot of time is spent collecting and preparing data, leaving very little time to turn it into insight. He said: “This friction stops us getting insight and solving important problems.” The aim is to eliminate the friction of getting data from tool A to tool B.

Pollock said they can put data into data packages, something akin to shipping containers in the physical world. They take data from a spreadsheet and put it inside a virtual package, making it more efficient to load the data into the tools that users already have. With the data inside a data package, users can validate the data automatically, store and search it in standard ways, import it to their specific tool or export it from the tool automatically.
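The packaging idea can be sketched as a small descriptor file that travels with the data. The example below is loosely modelled on the kind of JSON descriptor used by Frictionless Data; the exact keys, the `build_descriptor` helper and all the values are assumptions for illustration and should be checked against the published specification before use.

```python
import json


def build_descriptor(name, csv_path, fields, licence, source):
    """Bundle metadata about one CSV file into a single descriptor dictionary."""
    return {
        "name": name,
        "licenses": [{"name": licence}],
        "sources": [{"title": source}],
        "resources": [
            {"name": name, "path": csv_path, "schema": {"fields": fields}}
        ],
    }


# Hypothetical values: the file name, licence and source are placeholders.
descriptor = build_descriptor(
    name="fruit-sales",
    csv_path="fruit.csv",
    fields=[
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
        {"name": "amount", "type": "integer"},
    ],
    licence="CC0-1.0",
    source="Example market survey",
)

# Serialise to JSON text, as it would be saved next to the CSV file.
text = json.dumps(descriptor, indent=2)
print(text)

# Any tool that reads JSON can recover the same metadata without loss.
restored = json.loads(text)
```

Because the descriptor is plain JSON, the licence, source and schema stay attached to the data as it moves between tools.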

Batista concluded by saying: “When I have this common language, the tools communicate between each other so we can start moving data around between the tools without losing information from the data, without losing this metadata.” He explained his long-term vision for Good Tables by saying: “The idea is to build materials for people working in data who are technical, but are not developers.”

Vitor Batista was speaking at the offices of the Open Data Institute. The ODI has made a recording of the presentation available.


DataIQ is a trading name of IQ Data Group Limited
10 York Road, London, SE1 7ND
