Improving your data science capability from level 2 to levels 3/4
Background
Data science has been a focus for investment by organisations seeking to mature their data and analytics capability for the last five years. Definitions of data science vary, but common points of reference include:
- high theoretical basis
- experimentation
- deep data mining
- model-focused, not business outcome-focused.
Demand for data scientists has grown in line with the perception of data science as a “must-have” capability. Definitions of the skill set a data scientist needs compared to a conventional data analyst also vary (see the DataIQ Leaders briefing paper, “Data scientist or data analyst?” for more), but there has been a consistent, strong demand for this type of practitioner.
Since 2014, job listings on Indeed which include the terms artificial intelligence (or AI) and machine learning – both key domains for data scientists – have risen by 485 per cent. Over the same period, the number of candidates searching for positions has risen by 178 per cent.
This has a number of consequences:
- with 2.3 open positions for every available candidate, data scientists rightly feel they are in high demand, which can lead to a sense of entitlement
- data analysts can feel overshadowed and undervalued as a result, leading to disruption within an existing analytics centre
- the working practices of data scientists are not necessarily a good fit with that analytics centre.
What typifies data science is its scientific test-and-learn approach. This is based in experimentation to prove or disprove a hypothesis, with repeated tests required before a robust model emerges. Unlike conventional project development, even when using an agile methodology, where increments are delivered until the entire programme is complete, data science may operate in cycles where only a small outcome is achieved for a long period before a significant breakthrough or innovation results.
As a consequence, organisations which have hired a data scientist or data science team may experience issues with their performance including:
- misaligned expectations versus delivery
- frustration with communication levels
- disruption to IT and data management
- disruption to existing data analytics
- no apparent return on investment.
Understanding the problems with data science
Difficulties with the performance of data science are often identified only informally – individual managers and business executives may experience problems with understanding, timescales or results that are not always communicated in a structured way. Equally, C-suite oversight may be limited or patchy, while data scientists or teams themselves can experience their own frustration and disappointment.
DataIQ Leaders CARBON captures some of these issues in a more structured way during the capability assessment. An organisation which scores at level 2 (“repeatable”) will typically provide these types of answer:
CARBON data science assessment – typical level 2 responses

Q: How does data science support innovation in your organisation, such as process improvement, product and service development?
A: Requests for data science to support innovation and product/service development are ad hoc and inconsistent.
Q: Is there a defined career development path towards becoming a data science leader in your organisation?
A: We only have limited career progression mapped out for individual data scientists.
Q: How well do data scientists understand the needs of the wider organisation?
A: Any understanding of the business develops informally among data scientists.
Q: To what extent is suitable technology provided to the data science function?
A: Technology provision for data science is limited by the IT constraints of the function it serves.
As these responses show, running data science in this way does not create an effective engagement between the individual or team and the business in either direction; does not support those individuals or teams with the right tools and opportunities; and falls well short of optimal delivery.
A different approach: The data lab
A growing number of analytically-led organisations are optimising their data science resources through the creation of a data lab. Reference examples include Aviva’s global data science practice Quantum and Rolls-Royce R2 Data Labs. Headcounts in these examples are high (c.500 at Aviva), although this often reflects the centralisation of existing practice areas into a single location – scale is not in itself a determinant of a successful data lab, although there is a degree of critical mass required.
When considering the creation of a data lab, a number of critical success factors should be considered:
Operating principles: the focus of the work carried out within the data lab should be research and development, not optimisation of business as usual and operations. The goal is to find breakthrough ideas, not incremental gains – these are better achieved within conventional data analytics centres. Invention, not innovation, is the operating principle that typifies a data lab. Given this approach, data scientists in the lab should have freedom to fail – projects are part of an iterative process, rather than specifically goal-oriented. However, this is not an academic research institute, so data science teams should not be allowed to pursue pure “blue-skies” projects – the focus must be on aligning invention to business benefits.
Staffing: it is axiomatic that a data lab will be staffed with high academic achievers, notably PhDs and MScs. One consequence of this is that pay grades and job hierarchies within the data lab need to be decoupled from those of the core organisation. This can be politically challenging, but establishing this precedent from the outset will avoid internal conflicts later. Similarly, there should be clear career progression maps within the data lab itself that allow practitioners to advance without necessarily moving away from their area of activity (ie, becoming more senior without having to take on management responsibilities). As well as data scientists, however, the data lab will also need to employ data engineers who are capable of translating successful models into production-capable outputs. It also needs project managers who act as the translators into and out of the business, ensuring alignment of the practice with the parent organisation.
Technology: data science typically operates on an alternative technology stack to business as usual processes. A major constraint on the success of data labs is the ability of IT to support and deliver this environment – a major improvement in maturity can result from giving the data lab greater autonomy in its technology selection and deployment, although it is still rare to allow it to operate in a completely standalone way, not least because this can make productionisation of successful models even more of a challenge.
Data access and governance: one of the most difficult resources to deliver into a data lab is access to data in a way that is timely, yet governed. The fundamental principle of data science is exploration of data sets without pre-conditioning in order to identify meaningful patterns. Yet this can run counter to data governance principles, especially data protection and privacy. Data minimisation, anonymisation, aggregation and access control are essential dimensions of data management within the data lab (as elsewhere). This can cause some conflict between data scientists and data governors, but a risk-based approach to what data is accessible and to what extent it can be freely moved and viewed is critical.
Tasking: managing the in/out process of tasks on which the data lab works is central to the success of the operation. A balance needs to be struck between the agile working practices increasingly being adopted by conventional data analytics, involving regular updates on incremental projects, and the more long term-focused data science methodology. The latter is based in an iterative mindset where tests are repeated and evidence continuously reviewed, which provides a point of commonality. The critical difference is the level of successful output and the timescales expected. Research by Rexer Analytics found that only 13 per cent of data science projects reach production – a hit rate of just over one-in-ten. Managing expectations within the parent organisation to this level, while also encouraging data scientists to beat their own target, can lead to a productive, engaged practice.
Building success in the data lab
If the critical success factors outlined above are considered from the outset and carefully managed, then a data lab can become what it is intended to be – a driver of invention and breakthrough outputs. If initial piloting proves successful, then scaling of the data lab can be undertaken.
Sustaining this practice can be more achievable if a number of additional actions are considered:
Creating a data factory: while the ultimate goal of the data lab is to develop breakthrough ideas, the data scientists working in it do not want to be burdened with ongoing, repetitive tasks. Establishing a separate, parallel data factory which is responsible for the data engineering and productionisation of outputs, especially by automating many of these activities, helps to ensure data scientists do not become disillusioned and retain their appetite for experimentation.
Developing business engagement: in the early stages of a data lab, engagement with lines of business is likely to be constrained, with the focus on specific tasks. As confidence and success grow, the data lab should be encouraged to build much stronger links with the business and to position itself as an open door to invention. This will require careful management, especially to arbitrate between competing demands and to ensure no one function attempts to claim an excessive degree of this resource.
Financial outputs: a specific policy of patenting new inventions may form part of the basis of the operating principles of the data lab. (This is common practice among data labs operated by commercial organisations.) Creating and protecting IP in this way adds significantly to the value of the parent organisation. There are also tax credits which can be earned from invention of this sort.
Measuring success
Advancing the maturity of data science in an organisation is likely to happen more rapidly through the creation of a data lab. Even so, few such practices are able to reach and sustain level 5 (“Optimised”), not least because of ongoing changes to the market, organisational structure and even business appetite to fund this area.
Realistically, however, a data lab should be aiming to achieve level 4 maturity (“Managed”) as measured by DataIQ Leaders CARBON. At this stage, typical answers to the same set of questions above are likely to be:
CARBON data science assessment – typical level 4 responses

Q: How does data science support innovation in your organisation, such as process improvement, product and service development?
A: Innovation and product/service development often uses data science input, but feedback about the effectiveness of the input is rare.
Q: Is there a defined career development path towards becoming a data science leader in your organisation?
A: We have a clear pathway and defined job roles which allow for career progression to the top level, understood by all.
Q: How well do data scientists understand the needs of the wider organisation?
A: Data scientists are offered the opportunity to engage with line of business executives on a regular basis.
Q: To what extent is suitable technology provided to the data science function?
A: Data science has all the technology it has requested, subject to an approvals process, but no rapid commissioning pathway exists.