Machine Learning

Data Entry Errors – the Human Component and Intelligent Cleansing


Manual data entry still represents a significant portion of data collection processes across a variety of industries, and any company that relies on humans for data collection will see data entry errors. These errors may appear insignificant, such as a code entered as “AVCD” instead of “ABCD”, but they can be very costly.

Remote data entry can be done via hand-held devices (collecting information on wastewater, construction and surveying, sales of physical devices, etc.). However, most English-language data entry is still done on a QWERTY keyboard.

We recently worked with a client whose database stores millions of transactions keyed by manually entered codes representing physical parts that were ordered. This database of parts is critical to understanding the current state of their inventory. However, some of the codes were entered incorrectly due to human error and a lack of rules-based validation on the front end.

The most common mistakes in data entry are transcription errors (mistyping or misreading characters) and transposition errors (swapping adjacent characters). In our case the codes were mistyped rather than misspelled words, so an English dictionary was no use as a source of truth.

In an effort to correct these codes, we started with a string distance algorithm, Levenshtein distance. However, Levenshtein distance treats every substitution equally, so “ABC” is the same distance from “ABP” as from “ABX”, even though “X” sits much closer to “C” than “P” does on a keyboard. We needed a measure that accounts for physical distance on the keyboard, which captures typos or “fat-fingering”.
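Base R’s adist() returns the Levenshtein distance and makes the limitation easy to see (a small illustration, not part of the original analysis):

```r
# Levenshtein distance treats all substitutions equally: both of these
# comparisons score 1, even though X is a neighboring key to C on a
# QWERTY keyboard while P is across the board.
adist("ABC", "ABP")  # 1
adist("ABC", "ABX")  # 1
```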

In order to account for the distance between keys, we mapped each key on a QWERTY keyboard to Cartesian coordinates. For example, G and V are each 1 away from F. We then used Euclidean distance to calculate how far each key is from every other key.

We didn’t find any readily available R code online to do this, so we made our own mapping.

```r
# Cartesian coordinates for each key on a QWERTY keyboard
# (x = column, y = row, with the top letter row at y = 0)
keys_cart <- data.frame(
  key = toupper(c('q', 'w', 'e', 'r', 't', 'y', 'u', 'i', 'o', 'p',
                  'a', 'z', 's', 'x', 'd', 'c', 'f', 'b', 'm', 'j',
                  'g', 'h', 'k', 'l', 'v', 'n')),
  x = c(0,1,2,3,4,5,6,7,8,9, 0,0,1,1,2,2,3,4,6,6, 4,5,7,8,3,5),
  y = c(0,0,0,0,0,0,0,0,0,0, 1,2,1,2,1,2,1,2,2,1, 1,1,1,1,2,2)
)
```
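With the mapping in place, the key-to-key distance is a straightforward Euclidean calculation. A minimal helper might look like this (key_dist is our own naming for illustration, not from the original analysis):

```r
# Coordinate mapping repeated here so the snippet runs on its own
keys_cart <- data.frame(
  key = toupper(c('q','w','e','r','t','y','u','i','o','p',
                  'a','z','s','x','d','c','f','b','m','j',
                  'g','h','k','l','v','n')),
  x = c(0,1,2,3,4,5,6,7,8,9,0,0,1,1,2,2,3,4,6,6,4,5,7,8,3,5),
  y = c(0,0,0,0,0,0,0,0,0,0,1,2,1,2,1,2,1,2,2,1,1,1,1,1,2,2)
)

# Euclidean distance between two keys, looked up in keys_cart
key_dist <- function(a, b) {
  p <- keys_cart[keys_cart$key == toupper(a), ]
  q <- keys_cart[keys_cart$key == toupper(b), ]
  sqrt((p$x - q$x)^2 + (p$y - q$y)^2)
}

key_dist("F", "G")  # 1, adjacent keys
key_dist("F", "V")  # 1
key_dist("C", "P")  # ~7.28, far across the keyboard
```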

We started with a list of bad codes and a list of good codes. For each bad code, we used Levenshtein distance to find the closest candidate good codes.
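As a sketch of that candidate search using base R’s adist() (the code values below are illustrative, not the client’s data):

```r
bad_code   <- "1XVFB"
good_codes <- c("1XGFB", "1XPFB", "9QQQQ")

# Levenshtein distance from the bad code to every good code
d <- drop(adist(bad_code, good_codes))

# Keep the good codes at the minimum edit distance
candidates <- good_codes[d == min(d)]
candidates  # "1XGFB" "1XPFB" -- both are a single substitution away
```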

We can see that the possible good codes are only 1 character different from the bad code. Once we take keyboard distance into account, though, the code 1XGFB is the closest to 1XVFB, given the small Euclidean distance between G and V on a QWERTY keyboard.
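One way to sketch that tie-break: among candidates that are a single substitution away, score each by the keyboard distance between the characters that differ (sub_score is a hypothetical helper of our own, not from the original analysis):

```r
# keys_cart and key_dist repeated from above so this snippet runs on its own
keys_cart <- data.frame(
  key = toupper(c('q','w','e','r','t','y','u','i','o','p',
                  'a','z','s','x','d','c','f','b','m','j',
                  'g','h','k','l','v','n')),
  x = c(0,1,2,3,4,5,6,7,8,9,0,0,1,1,2,2,3,4,6,6,4,5,7,8,3,5),
  y = c(0,0,0,0,0,0,0,0,0,0,1,2,1,2,1,2,1,2,2,1,1,1,1,1,2,2)
)
key_dist <- function(a, b) {
  p <- keys_cart[keys_cart$key == toupper(a), ]
  q <- keys_cart[keys_cart$key == toupper(b), ]
  sqrt((p$x - q$x)^2 + (p$y - q$y)^2)
}

# Score a candidate that is one substitution away from the bad code
# by the keyboard distance between the substituted characters
sub_score <- function(bad, cand) {
  b <- strsplit(bad, "")[[1]]
  g <- strsplit(cand, "")[[1]]
  i <- which(b != g)  # position of the single differing character
  key_dist(b[i], g[i])
}

sub_score("1XVFB", "1XGFB")  # V -> G: sqrt(2), neighboring keys
sub_score("1XVFB", "1XPFB")  # V -> P: ~6.32, far across the keyboard
```

The candidate with the smallest score wins, so 1XGFB is chosen over 1XPFB.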

This solution worked well for identifying human keyboard entry errors and can be generalized to many other data entry problems.