
Understanding Data

Uwais Iqbal, 2022-11-02


Data is the fuel that powers Legal AI. It’s often called the ‘new oil’ of the digital economy, a comparison that captures how data will come to underpin and fuel many different applications in the future.

Data Modes

Data can come in different forms, shapes and sizes. One way of describing data is in terms of its modality, that is, the form in which the information is expressed. Common modalities include:

  1. Text
  2. Image
  3. Speech
  4. Numeric

Unimodal Data

In most circumstances, a dataset is unimodal, meaning it contains only a single modality. A dataset consisting of legal documents would be a unimodal textual dataset.

Multimodal Data

Datasets can also be multimodal and contain multiple modalities. An image captioning dataset is multimodal since it pairs images with text in the form of captions.

Data Similarity

Data can also be described in terms of how similar the contents of a dataset are. There are two terms that are used often:

  1. Homogeneous Data
  2. Heterogeneous Data

Homogeneous Data

Homogeneous data refers to a dataset whose samples are similar to one another. For example, a dataset consisting solely of employment agreements would be considered homogeneous data since the contracts are all very similar in kind.

Heterogeneous Data

Heterogeneous data refers to a dataset whose samples are very different from one another. For example, a dataset from an eDiscovery use case consisting of emails, letters, contracts and other documents would be considered heterogeneous data since the samples differ widely in kind.
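As a rough illustration of the distinction, the sketch below labels a dataset by the share of its most common document type. The sample file names, the type labels and the 0.9 threshold are all assumptions made for this example; in practice the line between homogeneous and heterogeneous is a judgment call rather than a fixed number.

```python
from collections import Counter

# Hypothetical samples: (file name, document type). The type labels are
# assumed to be known already; real datasets may need manual labelling.
employment_set = [
    ("contract_01.docx", "employment agreement"),
    ("contract_02.docx", "employment agreement"),
    ("contract_03.docx", "employment agreement"),
]
ediscovery_set = [
    ("msg_114.eml", "email"),
    ("scan_007.pdf", "letter"),
    ("contract_01.docx", "contract"),
    ("msg_115.eml", "email"),
]


def describe_similarity(samples, threshold=0.9):
    """Label a dataset by the share of its most common document type.

    The threshold is an illustrative assumption, not an established cut-off.
    """
    if not samples:
        raise ValueError("dataset is empty")
    counts = Counter(doc_type for _, doc_type in samples)
    top_share = counts.most_common(1)[0][1] / len(samples)
    return "homogeneous" if top_share >= threshold else "heterogeneous"


print(describe_similarity(employment_set))
print(describe_similarity(ediscovery_set))
```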

Using Data

Like oil, before data can be used there is a series of steps that need to be taken to ensure that the data is prepared and ready to be consumed. To help appreciate this better, data can be broken down into three different states:

  1. Unstructured Data
  2. Semi-structured Data
  3. Structured Data

Unstructured Data

Unstructured data is how data is typically found in the wild. It’s oil that is still in the ground or vegetables that are still in the garden. Some work needs to be put in before it can be used.

A typical example of unstructured data is a set of contracts that live within a document management system.

Semi-structured Data

Semi-structured data is data that has undergone some refinement but still has some rough edges. It’s oil that has been extracted from the ground but still needs to be purified or vegetables that have been picked from the garden but still need to be chopped and peeled before they can be used.

An example of semi-structured data is a set of contracts with some associated metadata about the parties. It has some structure, but work is still needed before the data can properly be used.
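To make this concrete, here is a minimal sketch of what one semi-structured contract record might look like: some metadata is captured as fields, while the rest of the information is still locked inside free text. The file name, party names and the list of required fields are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical record: the parties are captured as metadata, but everything
# else still lives in the raw contract text.
contract_record = {
    "file_name": "supply_agreement_001.pdf",
    "parties": ["Acme Ltd", "Globex LLP"],
    "text": (
        "This Agreement is made on 1 March 2021 between Acme Ltd "
        "and Globex LLP ..."
    ),
}


def missing_fields(record, required=("parties", "effective_date", "governing_law")):
    """Return the fields that are needed but not yet captured in the record."""
    return [field for field in required if field not in record]


print(json.dumps(contract_record, indent=2))
print(missing_fields(contract_record))  # → ['effective_date', 'governing_law']
```

The fields reported as missing are the "rough edges": they exist somewhere in the text but have not yet been pulled out into a usable form.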

Structured Data

Structured data is data that is now ready to be used to fuel applications. It’s oil that has been purified and is now ready for consumption. It’s vegetables that have now been peeled, chopped and diced and are ready to be used in cooking.

An example of structured data is a set of contracts with all of the relevant fields extracted into a table. Structured data usually comes in the form of a CSV file or a database table and is the starting point for creating higher-level insights from data in the form of analytics or statistics.
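The sketch below shows why structured data is a convenient starting point for analytics. It reads a small CSV table of extracted contract fields using Python's standard csv module and computes two simple statistics. The column names and all the values are invented for this example.

```python
import csv
import io
from collections import Counter

# Hypothetical table of extracted contract fields, held inline as CSV text
# so that the example is self-contained.
CSV_TEXT = """contract_id,party_a,party_b,governing_law,term_months
C-001,Acme Ltd,Globex LLP,England and Wales,12
C-002,Acme Ltd,Initech Inc,New York,24
C-003,Umbrella plc,Globex LLP,England and Wales,36
"""


def load_contracts(csv_text):
    """Parse CSV text into a list of dictionaries, one per contract."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def summarise(rows):
    """Count contracts per governing law and average the term length."""
    law_counts = Counter(row["governing_law"] for row in rows)
    average_term = sum(int(row["term_months"]) for row in rows) / len(rows)
    return law_counts, average_term


rows = load_contracts(CSV_TEXT)
law_counts, average_term = summarise(rows)
print(law_counts.most_common())
print(average_term)  # → 24.0
```

Because every contract has the same named fields, questions such as "which governing law is most common?" become a few lines of code rather than a manual review exercise.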

Data Value Chain

Just like with oil, there is a value chain with data. At every step in the chain, the value of the data increases as it goes from unstructured to semi-structured to structured data. The use case of extracting fields from contracts captures precisely this value chain of data. The goal is to go from unstructured data in the form of contracts to structured data in the form of a table of extracted fields that enables contract analytics.
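A toy version of this field extraction step can be sketched with regular expressions. This is only an illustration under strong assumptions: the contract text and the patterns are made up, and they only work because the sample sentence follows a fixed wording. Real contracts vary far more and typically call for more robust techniques than hand-written patterns.

```python
import re

# Hypothetical unstructured input: raw contract text.
CONTRACT_TEXT = (
    "This Agreement is made on 1 March 2021 between Acme Ltd and Globex LLP. "
    "This Agreement shall be governed by the laws of England and Wales."
)

# Illustrative patterns that assume the exact wording used above.
PATTERNS = {
    "effective_date": r"made on (\d{1,2} \w+ \d{4})",
    "party_a": r"between (.+?) and ",
    "party_b": r"between .+? and (.+?)\.",
    "governing_law": r"governed by the laws of (.+?)\.",
}


def extract_fields(text):
    """Turn raw contract text into one structured row (None if not found)."""
    row = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        row[field] = match.group(1) if match else None
    return row


row = extract_fields(CONTRACT_TEXT)
for field, value in row.items():
    print(f"{field}: {value}")
```

Running the same function over every contract in a collection would produce one row per contract, which is the table of extracted fields described above.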

Data in Legal AI

While most data in firms and organisations is unstructured, there should be active efforts to work through the value chain to create structured data that can be used to unlock business value and deliver insights through Legal AI.

There is an adage in machine learning: "Garbage in, Garbage out". Any AI model is only as good as the data it's trained on. If you're cooking with old and worn-out ingredients, the dish won't be anywhere near as good as you want it to be.

Before venturing into the shiny world of Legal AI, it's important to have a data strategy that enables long-term benefit and value from a firm's data.