
Identity Resolution in Criminal Justice Data:
An Application of NORA
Queen E. Booker
Minnesota State University, Mankato,
150 Morris Hall
Mankato, Minnesota
Queen.booker@mnsu.edu
Abstract. Identifying aliases is an important component of the criminal justice system. Accurately identifying a person of interest or someone who has been arrested can significantly
reduce the costs within the entire criminal justice system. This paper examines the problem
domain of matching and relating identities, examines traditional approaches to the problem,
and applies the identity resolution approach described by Jeff Jonas [1] and relationship
awareness to the specific case of client identification for the indigent defense office. The
combination of identity resolution and relationship awareness offered improved accuracy in
matching identities.
Keywords: Pattern Analysis, Identity Resolution, Text Mining.
1 Introduction
Appointing counsel for indigent clients is a complex task with many constraints and
variables. The manager responsible for assigning the attorney is limited by the number of attorneys at his/her disposal. If the manager assigns an attorney to a case with
which the attorney has a conflict of interest, the office loses the funds already invested
in the case by the representing attorney. Additional resources are needed to bring the
next attorney “up to speed.” Thus, it is in the best interest of the manager to be able to
accurately identify the client, the victim and any potential witnesses to minimize any
conflict of interest. As the number of cases grows, however, the manager often simply
selects the next attorney on the list when assigning the case. This type of assignment
can lead to a high number of withdrawals due to a late-identified conflict of interest.
Costs to the office increase due to additional incarceration expenses while the client is
held in custody as well as the sunk costs of prior and repeated attorney representation
regardless of whether the client is in or out of custody.
These problems are further exacerbated when insufficient systems are in place to
manage the data that could be used to make assignments easier. The data on the defendant is separately maintained by the various criminal justice agencies including the
indigent defense service agency itself. This presents a challenge as the number of
cases increases but without a concomitant increase in staff available to make the assignments. Thus those individuals responsible for assigning attorneys want not only
the ability to better assign attorneys, but also to do so in a more expedient fashion.
E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 19–26, 2009.
springerlink.com
© Springer-Verlag Berlin Heidelberg 2009
The aggregate data from all the information systems in the criminal justice process
have been shown to improve the attorney assignment process [2].
Criminal justice systems have many disparate information systems, each with its
own data sets. These include systems concerned with arrests, court case scheduling, and the
prosecuting attorney's office, to name a few. In many cases, relationships are non-obvious. It is not unusual for a repeat offender to provide an alternative name that is not
validated prior to sending the arrest data to the indigent defense office. Likewise it is not
unusual for potential witnesses to provide alternative names in an attempt to protect
their identities. And further, it is not unusual for a victim to provide yet another name in
an attempt to hide a previous interaction with the criminal justice process. Detecting
aliases becomes harder as the indigent defense problem grows in complexity.
2 Problems with Matching
Matching identities or finding aliases is a difficult process to perform manually. The
process relies on institutional knowledge and/or visual cues. For example, if an
arrest report is accompanied by a picture, the manager or attorney can easily ascertain
the person’s identity. But that is generally not the case. Arrest reports are generally sent as text
with the defendant’s name, demographic information, arrest charges, victim, and any
witness information. With institutional knowledge, the manager or an attorney can
review the information on the report and identify the person by the use of a previous
alias or by other pertinent information on the report. So, in essence, humans can
identify many aliases, and an information system can as well, because the enterprise contains all the necessary knowledge. But the knowledge and the
process are trapped across isolated operational systems within the criminal justice
agencies.
One approach to improving the indigent defense agency problem is to amass information from as many available data sources as possible, clean the data, and find
matches to improve the defense process. Traditional algorithms aren't well suited for
this process. Matching is further encumbered by the poor quality of the underlying
data. Lists containing subjects of interest commonly have typographical errors,
data from defendants who intentionally misspell their names to frustrate data
matching efforts, and legitimate natural variability (Mike versus Michael and 123
Main Street versus 123 S. Maine Street). Dates are often a problem as well. Months
and days are sometimes transposed, especially in international settings. Numbers
often have transposition errors or might have been entered with a different number of
leading zeros.
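The data quality problems listed above can be illustrated with a minimal normalization sketch. It is not taken from NORA or from the defense agency system; the nickname table, the function names, and the choice to keep both readings of an ambiguous date are assumptions made for illustration only.

```python
import re
from datetime import date

# Hypothetical nickname table (an assumption); a production system would
# rely on a much larger, empirically derived list.
NICKNAMES = {"mike": "michael", "rob": "robert", "rick": "richard"}


def normalize_first_name(name):
    """Trim, lowercase, and map a known nickname to its canonical form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def normalize_number(value):
    """Keep digits only and drop leading zeros, so '00123' equals '123'."""
    digits = re.sub(r"\D", "", value)
    return digits.lstrip("0") or "0"


def candidate_dates(year, first, second):
    """Return every valid reading of a date whose month and day may be swapped."""
    readings = set()
    for month, day in ((first, second), (second, first)):
        try:
            readings.add(date(year, month, day))
        except ValueError:
            pass  # this reading is not a real calendar date
    return readings
```

Two records can then be compared on their normalized values, and two dates can be treated as compatible when their sets of candidate readings intersect.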
2.1 Current Identity Matching Approaches
Organizations typically employ three general types of identity matching systems:
merge/purge and match/merge, binary matching engines, and centralized identity
catalogues. Merge/purge and match/merge is the process of combining two or more
lists or files, simultaneously identifying and eliminating duplicate records. This process was developed by direct marketing organizations to eliminate duplicate customer
records in mailing lists. Binary matching engines test an identity in one data set for its
presence in a second data set. These matching engines are also sometimes used to
compare one identity with another single identity (versus a list of possibilities), with
the output often expected to be a confidence value pertaining to the likelihood that the
two identity records are the same. These systems were designed to help organizations
recognize individuals with whom they had previously done business or, alternatively,
recognize that the identity under evaluation is known as a subject of interest (that is,
on a watch list), thus warranting special handling [1]. Centralized identity catalogues
are systems that collect identity data from disparate and heterogeneous data sources and
assemble it into unique identities, while retaining pointers to the original data source
and record for the purpose of creating an index.
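A binary matching engine of the kind described above can be sketched as follows. This is an illustrative assumption, not a description of any commercial engine: the confidence value is simply the average string similarity over shared fields, computed with the standard library's `difflib.SequenceMatcher`, and the field names are hypothetical.

```python
from difflib import SequenceMatcher


def confidence(record_a, record_b):
    """Average string similarity over the fields that both records contain."""
    shared = [key for key in record_a if key in record_b]
    if not shared:
        return 0.0
    scores = [
        SequenceMatcher(None, str(record_a[k]).lower(), str(record_b[k]).lower()).ratio()
        for k in shared
    ]
    return sum(scores) / len(scores)


def best_match(record, dataset):
    """Test one identity against a second data set.

    Returns (best candidate, confidence), or (None, 0.0) when nothing overlaps.
    """
    best, best_score = None, 0.0
    for candidate in dataset:
        score = confidence(record, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Because each comparison involves only two records at a time, such an engine cannot use attributes that were observed on a third record of the same person.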
Each of the three types of identity matching systems uses either probabilistic or deterministic matching algorithms. Probabilistic techniques rely on training data sets to
compute attribute distribution and frequency, looking for both common and uncommon patterns. These statistics are stored and used later to determine confidence levels
in record matching. As a result, any record containing similar but uncommon data
might be considered a record of the same person with a high degree of probability. These
systems lose accuracy when the underlying data's statistics deviate from the original
training set and must frequently be retrained to maintain their level of accuracy. Deterministic techniques rely on pre-coded expert rules to define when records should be
matched. One rule might be that if the names are close (Robert versus Rob) and the
social security numbers are the same, the system should consider the records as
matching identities. These systems often have complex rules based on itemsets such
as name, birthdate, zipcode, telephone number, and gender. However, these systems
fail as data becomes more complex.
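The probabilistic idea described above, that agreement on an uncommon value is stronger evidence than agreement on a common one, can be sketched in a few lines. The weighting formula (the negative logarithm of the observed frequency) and the `floor` parameter are assumptions chosen for illustration; they are not the statistics used by any particular product.

```python
import math
from collections import Counter


def train(values):
    """Compute the relative frequency of each attribute value in a training set."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def agreement_weight(value, frequencies, floor=1e-6):
    """Weight for two records agreeing on `value`; rarer values weigh more.

    Values never seen in training fall back to `floor`, an assumed minimum
    frequency, and therefore receive the largest weight.
    """
    return -math.log(max(frequencies.get(value, floor), floor))
```

The sketch also shows why such systems need retraining: the weights depend entirely on the frequencies observed in the training set, so they become misleading once the live data drifts away from it.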
3 NORA
Jeff Jonas introduced a system called NORA, which stands for non-obvious relationship awareness. He developed the system specifically to solve Las Vegas casinos'
identity matching problems. NORA accepts data feeds from numerous enterprise
information systems, and builds a model of identities and relationships between identities (such as shared addresses or phone numbers) in real time. If a new identity
matched or related to another identity in a manner that warranted human scrutiny
(based on basic rules, such as good guy connected to very bad guy), the system would
immediately generate an intelligence alert. The system approach for the Las Vegas
casinos closely fits the needs of the criminal justice system. The data needed to
identify aliases and relationships for conflict of interest concerns comes from multiple
data sources (arresting agency, probation offices, court systems, prosecuting attorney's
office, and the defense agency itself), and the ability to successfully identify a client is
needed in real time to reduce costs to the defense office.
The NORA system requirements were:
• Sequence neutrality. The system needed to react to new data in real time.
• Relationship awareness. Relationship awareness was designed into the identity resolution process so that newly discovered relationships could generate
real-time intelligence. Discovered relationships also persisted in the database,
which is essential to generate alerts beyond one degree of separation.
• Perpetual analytics. When the system discovered something of relevance
during the identity matching process, it had to publish an alert in real time to
secondary systems or users before the opportunity to act was lost.
• Context accumulation. Identity resolution algorithms evaluate incoming records against fully constructed identities, which are made up of the accumulated attributes of all prior records. This technique enabled new records to
match to known identities in toto, rather than relying on binary matching that
could only match records in pairs. Context accumulation improved accuracy
and greatly improved the handling of low-fidelity data that might otherwise
have been left as a large collection of unmatched orphan records.
• Extensible. The system needed to accept new data sources and new attributes
through the modification of configuration files, without requiring that the
system be taken offline.
• Knowledge-based name evaluations. The system needed detailed name
evaluation algorithms for high-accuracy name matching. Ideally, the algorithms would be based on actual names taken from all over the world and
developed into statistical models to determine how and how often each name
occurred in its variant form. This empirical approach required that the system
be able to automatically determine the culture that the name most likely
came from because names vary in predictable ways depending on their cultural origin.
• Real time. The system had to handle additions, changes, and deletions from
real-time operational business systems. Processing times are so fast that
matching results and accompanying intelligence (such as if the person is on a
watch list or the address is missing an apartment number based on prior observations) could be returned to the operational systems in sub-seconds.
• Scalable. The system had to be able to process records on a standard transaction server, adding information to a repository that holds hundreds of millions of
identities [1].
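Of the requirements above, context accumulation is the easiest to illustrate in code. The following is a minimal sketch under stated assumptions, not NORA's implementation: the `Identity` class, the attribute names, and the rule "match when at least two attribute values agree" are all hypothetical.

```python
from collections import defaultdict


class Identity:
    """A constructed identity holding every attribute value seen so far."""

    def __init__(self, identity_id):
        self.identity_id = identity_id
        self.attributes = defaultdict(set)  # attribute name -> observed values

    def absorb(self, record):
        """Accumulate the record's attribute values into this identity."""
        for key, value in record.items():
            self.attributes[key].add(value)

    def shares(self, record, keys):
        """Count the attributes on which the record agrees with any prior value."""
        return sum(1 for k in keys if k in record and record[k] in self.attributes[k])


def resolve(record, identities, keys=("name", "phone", "address", "dob"), threshold=2):
    """Match the record against whole identities; create a new one if none fits."""
    for identity in identities:
        if identity.shares(record, keys) >= threshold:
            identity.absorb(record)
            return identity
    new_identity = Identity(len(identities) + 1)
    new_identity.absorb(record)
    identities.append(new_identity)
    return new_identity
```

Suppose a first record carries a name, phone, and address, and a second carries the same name and phone plus a date of birth. A third record carrying only that address and date of birth agrees with each earlier record on a single attribute, so pairwise matching would leave it as an orphan, but it agrees with the accumulated identity on two attributes and is therefore resolved to it.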
Like the gaming industry, the defense attorney’s office has relatively low daily transactional volumes. Although it receives booking reports on an ongoing basis, initial
court appearances are handled by a specific attorney, and the assignments are made
daily, usually the day after the initial court appearance. The attorney at the initial
court appearance is not the officially assigned attorney, allowing the manager a window of opportunity from booking to assigning the case to accurately identify the client. But the analytical component of accurate identification involves numerous
records with accurate linkages including aliases as well as past relationships and
networks as related to the case. The legal profession has rules and regulations that
define what constitutes a conflict of interest. Lawyers must follow these rules to maintain their license to practice, which makes the assignment process even more critical [3].
NORA’s identity resolution engine is capable of performing in real time against
extraordinary data volumes. The gaming industry's requirement of fewer than 1 million
affected records a day means that a typical installation might involve a single Intel-based server and any one of several leading SQL database engines. This performance
establishes an excellent baseline for application to the defense attorney data since the
NORA system demonstrated that the system could handle multibillion-row databases
consisting of hundreds of millions of constructed identities and ingest new identities
at a rate of more than 2,000 identity resolutions per second; such ultra-large deployments require 64 or more CPUs and multiple terabytes of storage, and move the performance bottleneck from the analytic engine to the database engine itself. While the
defense attorney dataset is not quite as large, the processing time on the casino data
suggests that NORA would be able to accurately and easily handle the defense attorney’s needs in real-time.
4 Identity Resolution
Identity resolution is an operational intelligence process, typically powered by an
identity resolution engine, whereby organizations can connect disparate data sources
with a view to understanding possible identity matches and non-obvious relationships
across multiple data sources. It analyzes all of the information relating to individuals
and/or entities from multiple sources of data, and then applies likelihood and
probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities. These engines are used to
uncover risk, fraud, and conflicts of interest. Identity resolution is designed to assemble i identity records from j data sources into k constructed, persistent identities. The
term "persistent" indicates that matching outcomes are physically stored in a database
at the moment a match is computed.
Accurately evaluating the similarity of proper names is undoubtedly one of the
most complex (and most important) elements of any identity matching system. Dictionary-based approaches fail to handle the complexities of names, including common
names such as Robert Johnson. These approaches fail even more often when cultural influences in naming are involved.
Soundex is an improvement over traditional dictionary approaches. It uses a phonetic algorithm for indexing names by their sound when pronounced in English. The
basic aim is for names with the same pronunciation to be encoded to the same string
so that matching can occur despite minor differences in spelling. Such systems' attempts to neutralize slight variations in name spelling by assigning some form of
reduced "key" to a name (by eliminating vowels or eliminating double consonants)
frequently fail because of external factors—for example, different fuzzy matching
rules are needed for names from different cultures.
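For reference, the phonetic key described above can be computed as in the following sketch of the standard American Soundex rules (first letter kept, consonants mapped to digits, adjacent equal codes collapsed, result padded to four characters). Treating any character outside the code table as a vowel is a simplifying assumption of this sketch.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex key of a name ('' if it has no letters)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    key = first.upper()
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets the run, so a repeated code counts again
    return (key + "000")[:4]
```

Robert and Rupert both reduce to R163, and Smith and Smyth both reduce to S530, which is the intended behavior. Because the letter groupings reflect English pronunciation, the same key can merge or separate names from other cultures in ways that do not reflect how those names actually vary.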
Jonas found that the deterministic method is essential for eliminating dependence
on training data sets. As such, the system no longer needed periodic reloads to account for statistical changes to the underlying universe of data. However, he also
identifies many common conditions in which deterministic techniques fail. Specifically,
certain attributes were so overused that it made more sense to ignore them than to use
them for identity matching and detecting relationships. For example, two people with
the first name of "Rick" who share the same social security number are probably the
same person—unless the number is 111-11-1111. Two people who have the same
phone number probably live at the same address—unless that phone number is a
travel agency's phone number. He refers to such values as generic because the overuse diminishes the usefulness of the value itself. It's impossible to know all of these
generic values a priori—for one reason, they keep changing—thus probabilistic-like
techniques are used to automatically detect and remember them.
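One simple way to detect and remember generic values automatically is to count how many distinct identities have presented each value. The sketch below is an assumption about how this could be done, not Jonas's algorithm; the `GenericTracker` class and its `max_identities` threshold are hypothetical.

```python
from collections import defaultdict


class GenericTracker:
    """Remember which identities used each value and flag overused values."""

    def __init__(self, max_identities=3):
        # Assumed threshold: a value shared by more identities than this
        # is treated as generic.
        self.max_identities = max_identities
        self._seen = defaultdict(set)  # (attribute, value) -> identity ids

    def observe(self, attribute, value, identity_id):
        """Record that an identity presented this value for this attribute."""
        self._seen[(attribute, value)].add(identity_id)

    def is_generic(self, attribute, value):
        """True once the value has been seen on too many distinct identities."""
        return len(self._seen.get((attribute, value), ())) > self.max_identities
```

Because the counts are updated as records arrive, a value such as a travel agency's phone number becomes generic on its own once enough unrelated identities have used it, with no list maintained by hand.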
His identity resolution system uses a hybrid matching approach that combines deterministic expert rules with a probabilistic-like component to detect generics in real
time (to avoid the drawback of training data sets). The result is expert rules that look
something like this:
If the name is similar
AND there is a matching unique identifier
THEN match
UNLESS this unique identifier is generic
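The rule above translates almost directly into code. The following sketch is illustrative only: the field names, the use of `difflib.SequenceMatcher` for name similarity, and the 0.8 threshold are assumptions, and the set of generic values is passed in by the caller.

```python
from difflib import SequenceMatcher


def names_similar(name_a, name_b, threshold=0.8):
    """Crude name similarity test; the threshold is an assumed value."""
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold


def should_match(record_a, record_b, generic_values, id_field="unique_id"):
    """IF the name is similar AND a unique identifier matches THEN match,
    UNLESS that identifier is generic."""
    if not names_similar(record_a["name"], record_b["name"]):
        return False
    identifier = record_a.get(id_field)
    if identifier is None or identifier != record_b.get(id_field):
        return False
    return identifier not in generic_values
```

With `generic_values = {"111-11-1111"}`, two records named Rick Jones that share that number are not matched, while the same two records sharing a number outside the set are.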
In his system, a unique identifier might include social security or credit-card numbers,
or a passport number, but wouldn't include such values as phone number or date of
birth. The term "generic" here means the value has become so widely used (across a
predefined number of discrete identities) that one can no longer use this same value to
disambiguate one identity from another [1]. However, the approach used in this study for
the defense data included a merged itemset that combined date of birth, gender, and
ethnicity code, because practical or legal constraints prevented the use of the
social security number for identification. Thus, an identifier was developed from a
merged itemset after using the SUDA algorithm to identify infrequent itemsets based
on data mining [4].
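A merged-itemset identifier of this kind can be sketched as a composite key. The key format and the function names below are assumptions made for illustration; the sketch does not implement SUDA, and it only measures how often the composite key is unique within a set of records.

```python
from collections import Counter
from datetime import date


def merged_identifier(dob, gender, ethnicity_code):
    """Combine date of birth, gender, and ethnicity code into one key.

    The 'YYYY-MM-DD|G|E' layout is an assumed format.
    """
    return f"{dob.isoformat()}|{gender.strip().upper()}|{ethnicity_code.strip().upper()}"


def unique_fraction(records):
    """Fraction of (dob, gender, ethnicity_code) records whose key occurs once."""
    keys = [merged_identifier(*record) for record in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)
```

A key that is unique for most records can stand in for a withheld identifier, while records whose key collides still need the name and relationship rules to be told apart.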
The actual deterministic matching rules for NORA as well as the defense attorney
system are much more elaborate in practice because they must explicitly address
fuzzy matching to scrub and clean the data as well as address transposition errors in
numbers, malformed addresses, and other typographical errors. The current defense
attorney agency model has thirty-six rules. Once the data is “cleansed” it is stored and
indexed to provide user-friendly views of the data that make it easy for the user to
find specific information when performing queries and ad hoc reporting. Then, a data-mining algorithm using a combination of binary regression and logit models is run to
update patterns for assigning attorneys based on the day’s outcomes [5]. The algorithm identifies patterns for the outcomes and tree structure for attorney and defendant
combinations where the attorney “completed the case.” [6]
Although matching accuracy is highly dependent on the available data, using the
techniques described here achieves the goals of identity resolution, which essentially
boil down to accuracy, scalability, and sustainability even in extremely large transactional environments.
5 Relationship Awareness
According to Jonas, detecting relationships is vastly simplified when a mechanism for
doing so is physically embedded into the identity matching algorithm. Stating the
obvious, before analyzing meaningful relationships, the system must be able to resolve unique identities. As such, identity resolution must occur first. Jonas argued
that it was computationally efficient to observe relationships at the moment the
identity record is resolved because in-memory residual artifacts (which are required to
match an identity) comprise a significant portion of what's needed to determine relevant relationships. Relevant relationships, much like matched identities, were then
persisted in the same database.
Notably, some relationships are stronger than others; a relationship score that's assigned with each relationship pair captures this strength. For example, living at the
same address three times over 10 years should yield a higher score than living at the
same address once for three months.
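The source does not give the scoring formula, so the following is a purely hypothetical sketch of a score with the stated property: repeated and longer co-residence scores higher. The formula, the function name, and the parameter names are all assumptions.

```python
import math


def relationship_score(occurrences, total_months):
    """Hypothetical strength of a shared-address relationship.

    Grows linearly with the number of separate occurrences and
    logarithmically with the total duration in months.
    """
    if occurrences < 0 or total_months < 0:
        raise ValueError("occurrences and total_months must be non-negative")
    return occurrences * math.log1p(total_months)
```

Under this assumed formula, sharing an address three times over 10 years (120 months) scores well above sharing it once for three months, which is the ordering the text calls for.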
As identities are matched and relationships detected, NORA evaluates user-configurable rules to determine whether any new insight warrants publishing
an intelligence alert to a specific system or user. One simplistic way to do this is via
conflicting roles. A typical rule for the defense attorney might be notification any
time a client role is associated with a role of victim, witness, co-defendant, or previously
represented relative, for example. In this case, associated might mean zero degrees of
separation (they're the same person) or one degree of separation (they're roommates).
Relationships are maintained in the database to one degree of separation; higher degrees are determined by walking the tree. Although the technology supports searching
for any degree of separation between identities, higher orders include many insignificant leads and are thus less useful.
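The conflicting-roles rule and the tree walk described above can be sketched as a breadth-first search over stored one-degree relationships. The graph representation (a dictionary mapping each identity to its directly related identities), the role labels, and the function names are assumptions made for illustration.

```python
from collections import deque

# Roles that conflict with the client role, following the example rule in the text.
CONFLICTING_ROLES = {"victim", "witness", "co-defendant", "previously represented relative"}


def degrees_of_separation(graph, start, target, max_degree=3):
    """Walk one-degree relationships breadth-first.

    Returns the number of degrees between start and target, or None when
    the target is not reachable within max_degree.
    """
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if distance == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


def conflict_alerts(client, roles, graph, max_degree=1):
    """List (identity, degree, conflicting roles) for identities near the client."""
    alerts = []
    for identity, identity_roles in roles.items():
        conflicting = identity_roles & CONFLICTING_ROLES
        if not conflicting:
            continue
        degree = degrees_of_separation(graph, client, identity, max_degree)
        if degree is not None:
            alerts.append((identity, degree, sorted(conflicting)))
    return alerts
```

Keeping `max_degree` at 1 by default mirrors the observation that higher orders of separation produce many insignificant leads.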
6 Comparative Results
This research is an ongoing process to improve the attorney assignment process in the
defense attorney offices. As economic times get harder, crime increases, and as crime
increases, so does the number of people who require representation by the public defense
offices. The ability to quickly identify conflicts of interest reduces the amount of
time a person stays in the system and also reduces the time needed to process the case.
The original system built to work with the alias/identity matching was called the Court
Appointed Counsel System, or CACS. CACS identified 83% more conflicts of interest than the indigent defense managers during the initial assignments [2]. Using the
merged itemset and an algorithm using NORA’s underlying technology, this figure
improved from 83% to 87%. But the real improvement came in the processing time.
The key to the success of these systems is the ability to update and provide accurate
data at a moment’s notice. Utilizing NORA’s underlying algorithms improved the
updating and matching process significantly, allowing for new data to be entered and
analyzed within a couple of hours as opposed to the days it took to process using the
CACS algorithms. Further, the merged itemset approach helped to provide a unique
identifier in 90% of the cases, significantly increasing automated relationship identifications. The ability to handle real-time transactional data with sustained accuracy will
continue to be of "front and center" importance as organizations seek competitive
advantage.
The identity resolution technology applied here provides evidence that such technologies can be applied not only to simple fraud detection but also to improving business decision making and intelligence support for entities whose purpose is to make
expedient decisions regarding individual identities.
References
1. Jonas, J.: Threat and Fraud Intelligence, Las Vegas Style. IEEE Security & Privacy 4(06),
28–34 (2006)
2. Booker, Q., Kitchens, F.K., Rebman, C.: A Rule Based Decision Support System Prototype
for Assigning Felony Court Appointed Counsel. In: Proceedings of the 2004 Decision Sciences Annual Meeting, Boston, MA (2004)
3. Gross, L.: Are Differences Among the Attorney Conflict of Interest Rules Consistent with
Principles of Behavioral Economics. Georgetown Journal of Legal Ethics 19, 111 (2006)
4. Manning, A.M., Haglin, D.J., Keane, J.A.: A Recursive Search Algorithm for Statistical
Disclosure Assessment. Data Mining and Knowledge Discovery (accepted, 2007)
5. Kitchens, F.L., Sharma, S.K., Harris, T.: Cluster Computers for e-Business Applications.
Asian Journal of Information Systems (AJIS) 3(10) (2004)
6. Forgy, C.: Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem. Artificial Intelligence 19 (1982)