Deduplication and master indexing: How EDDIE is revolutionizing the use of data

Deduplication and master indexing are time-consuming tasks that can instead be carried out by Artificial Intelligence tools. EDDIE is a revolutionary product from Deep Data Insight, designed to bring real-world savings to any organisation that works with large amounts of data. In this article, we look at this functionality in detail and explain the types of data activities, including deduplication and master indexing, to which EDDIE can bring massive savings.

Why is data a problem?

We tend to think of data as a commodity these days. It is a fuel that powers organisations from hospitals to schools; from banks to supply chains. The general consensus is that more data is good! It allows a business to make more informed decisions, to influence more people…to sell more. In fact, companies can be valued on the amount of data that they own.

However, it is increasingly important to talk about how the data is stored and used rather than merely how much is there. Think about the value of a well-kept filing cabinet as opposed to reams of paper strewn around the floor of an office. There may be more data on the floor…but it is going to be tough to use the data without it being categorized.

Now consider an ‘intelligent filing cabinet’ that answers questions when asked, and has automatically organized, error-proofed and consolidated all the data it holds. This is what EDDIE can do.

The analogy works in the digital world, too. Organizations are guilty of treating the same data in different ways from department to department. Typos will creep in. Some data will be wrongly assumed to be of no use, and discarded. All of this becomes a problem when trying to access, compare and activate data. 

Deduplication and Master Indexing

Every time a piece of data is shared, it has technically been duplicated: it now exists in more than one place. The more often this happens, the more times each copy is changed…and the more the data will differ from location to location.

So, when it comes to consolidating the data, things have quickly become very difficult.

Why do organizations typically end up with duplicated data? A real-world example would be in healthcare, where a patient’s information will exist across a number of sites. Assuming the patient’s name hasn’t changed, but their email address, phone number or address has, it is critical that all of the records are deduplicated against each other, to consolidate around one authoritative piece of data.
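The patient-record scenario can be sketched in a few lines of Python. This is a hypothetical illustration, not EDDIE’s actual logic: the field names and the “keep the most recently updated copy” policy are assumptions made for the example.

```python
# Hypothetical sketch: consolidating duplicate patient records.
# The field names and the "keep the latest update" rule are assumptions
# for illustration only.
from datetime import date

records = [
    {"name": "Jane Doe", "email": "jane@old.example", "updated": date(2020, 1, 5)},
    {"name": "Jane Doe", "email": "jane@new.example", "updated": date(2023, 6, 1)},
    {"name": "Sam Lee",  "email": "sam@site.example", "updated": date(2022, 3, 9)},
]

def deduplicate(records):
    """Group records by name and keep only the most recently updated copy."""
    master = {}
    for rec in records:
        key = rec["name"]
        if key not in master or rec["updated"] > master[key]["updated"]:
            master[key] = rec
    return list(master.values())

consolidated = deduplicate(records)
```

After running this, the two “Jane Doe” records collapse into one, keeping the newer email address.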

Another example would be a marketing department that has purchased lots of (very expensive!) data for its sales activities. This needs to be added to its customer relationship management software without creating new records where none are needed.

The term ‘Master Indexing’ refers to the process of bringing two sets of data together to produce one single consolidated set. Master Indexing works by creating a new, third repository, into which the deduplicated data is stored. So, an overall data set might consist of hundreds of separate fields; Master Indexing will ensure that not only have the two sets of data been checked against each other, but that they all end up in the right place and refer to a unique global data record.
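The idea of a third, consolidated repository with a unique global identifier per record can be sketched as follows. The matching key (a lowercased name) and the source data are simplifications invented for this example; a real system would match on many fields at once.

```python
# Hypothetical sketch of master indexing: two source data sets are merged
# into a third, consolidated repository, and every consolidated record is
# tied to a unique global identifier.
import itertools

crm = [{"name": "Acme Ltd", "phone": "0117 000000"}]
purchased = [{"name": "acme ltd", "email": "info@acme.example"}]

def build_master_index(*sources):
    master, ids = {}, itertools.count(1)
    for source in sources:
        for rec in source:
            key = rec["name"].lower()  # simplistic match key for illustration
            entry = master.setdefault(key, {"global_id": next(ids)})
            # Fold this record's fields into the consolidated entry.
            entry.update({k: v for k, v in rec.items() if k != "name"})
            entry.setdefault("name", rec["name"])
    return list(master.values())

index = build_master_index(crm, purchased)
```

Here the CRM record and the purchased record are recognised as the same company, so the master index contains one record carrying both the phone number and the email address.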

There are enormous savings to be realised from using Artificial Intelligence to help your Deduplication and Master Indexing tasks.


Another great way in which technology can speed up data-cleaning is through Question-and-Answer (Q&A) functionality. A good system, such as EDDIE from DDI, will be able to apply a number of questions to the data once it has been cleansed, and extract the answer from the data.

Consider, then, an insurance company storing millions of incident records on its system from individuals who may have had multiple claims. A Q&A function will not only ensure that the data can be trusted, but can also be asked to respond to a question, such as the date on which an incident took place.
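As a toy stand-in for that kind of question answering: a real system like EDDIE uses NLP models, but even a regular expression conveys the idea of pulling one answer (here, an incident date) out of free text. The claim text and date format are invented for the example.

```python
# Toy stand-in for Q&A extraction: a real system uses NLP models, but a
# regular expression shows the idea of extracting an answer (an incident
# date) from free text. The record below is invented for illustration.
import re

record = ("Claim 4471: the policyholder reported that the incident "
          "took place on 12/03/2021 at the junction.")

def answer_incident_date(text):
    """Return the first DD/MM/YYYY date found in the text, or None."""
    match = re.search(r"\b(\d{2}/\d{2}/\d{4})\b", text)
    return match.group(1) if match else None
```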

Deduplication and Master Indexing: String, fuzzy and neural matching

It’s great to hear that technology has a solution for this problem…but what is actually taking place here?

String matching is the name we give to identifying the same pieces of data across multiple sources. A super-charged, ultra-fast ‘spot the difference’ if you like.

Fuzzy matching is the next step on. If our two or more sets of data – as is so often the case – are not an exact match, they could be described as being ‘fuzzy around the edges’. Fuzzy matching allows for this, and will make highly accurate and reliable assumptions to allow the data to be consolidated. A typical example of this would be where a typo has crept into one element of data – maybe a name is misspelled but the other data matches.
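A minimal fuzzy-matching sketch, using Python’s standard difflib rather than any production matcher, shows how a misspelled name can still be recognised. The 0.8 similarity threshold is an arbitrary choice for illustration.

```python
# Minimal fuzzy-matching sketch using the standard library's difflib.
# Production systems use more sophisticated algorithms; the 0.8 threshold
# here is an arbitrary illustrative choice.
from difflib import SequenceMatcher

def fuzzy_match(a, b, threshold=0.8):
    """Return True if two strings are similar enough to be the same entity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

fuzzy_match("Jonathan Smith", "Jonathon Smith")  # typo-tolerant match
fuzzy_match("Jonathan Smith", "Maria Garcia")    # clearly different people
```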

Think about a legal application for this technology. ‘Legal precedents’ are an enormous part of a country’s legal system; they ensure that consistency is maintained over time. However, since legal precedents usually span decades, the amount of data needing to be checked can be enormous…with differing uses of language, too. Fuzzy matching would accommodate these differences, whilst producing reports that can be relied upon.

Finally, neural matching. Neural matching is even further advanced. It bases its decisions on entire sentences rather than just words, so that the context can be understood. It is a way of further ensuring data integrity across multiple sources. It works by building ‘knowledge graphs’ that link up the various nodes of data produced, and reduces those outliers – the ‘noise’ – that can pollute the overall picture.
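The key shift with neural matching is comparing whole sentences rather than single words. As a hedged stand-in for learned sentence embeddings (which a real neural system would use), this sketch compares sentences with a simple bag-of-words vector and cosine similarity, just to show sentence-level comparison in action.

```python
# Stand-in for neural sentence comparison: real neural matching uses
# learned embeddings, but a bag-of-words vector with cosine similarity
# illustrates comparing whole sentences rather than single words.
import math
from collections import Counter

def sentence_similarity(s1, s2):
    """Cosine similarity between simple word-count vectors of two sentences."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

sentence_similarity("the claim was settled in March",
                    "the claim was settled during March")  # high similarity
```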

The core technology for all of this is called ‘NLP’ – Natural Language Processing. It is a branch of Artificial Intelligence that enables not only words to be matched, but context and sentiment as well. 

Deduplication and Master Indexing examples

If this all sounds rather baffling, let’s have a look at an easy-to-follow example that combines all of these technologies.

Let’s consider all of the versions of the Bible that have been published over the centuries.

We get our system – EDDIE – to digest all of the ‘data’, using deduplication and master indexing to pick one canonical version of each publication.

We want an answer to a particular question, so we ask: “Why did Cain kill Able?”

Using its Q&A NLP models, EDDIE understands that we really want to know why Abel was killed – despite the misspelling – and discerns the answer from the millions of characters of text.

Deduplication and Master Indexing…and EDDIE

EDDIE is a product of the Deep Data Insight ‘AI Factory’. It is a modular system that can be used to exactly meet a client’s individual needs. It is capable of reducing the time taken to correlate, consolidate or investigate an organisation’s data.

Key to EDDIE is that it saves you time. We can safely estimate that EDDIE will be faster than a human by a factor of 5 or 6. And of course, once the data consolidation job has been done, every subsequent interrogation will be virtually instantaneous.

EDDIE has been used across multiple sectors including governments, healthcare, insurance, legal and supply chain. It can work with digital or hand-written text, and with tables and graphs too. 

EDDIE not only saves time and money…it creates value from your data.

Take a closer look at how EDDIE performs deduplication and master indexing here…

Take a look at the latest posts from Deep Data Insight here
