Learn the Ins and Outs of Text Analytics

Our team has spent more than 30 years researching linguistics and building some of the world's most advanced and functional knowledge discovery tools. Here, you'll find definitions for some of the most common terms we come across day to day – and which you'll certainly find as you research text analytics solutions for eDiscovery, contract analysis and more.

Artificial Intelligence (AI)

Simulation of human intelligence by machines, typically used for predictive behavior and probability modeling. In text analytics, the term's definition is extremely murky, with different users deploying it for everything from predictive search and entity extraction to contextualization and normalization. "AI" is differentiated from natural or aided intelligence, which refer to a machine's ability to augment human skills, rather than possessing them itself.

Bag of Words

The model used by less sophisticated natural language processing and information retrieval engines. In the bag of words approach, individual words within a text are decontextualized, with grammar and order disregarded.
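
As a minimal illustration (not any particular vendor's implementation), a bag-of-words model amounts to counting word occurrences and throwing the order away:

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Lowercase, pull out word tokens, and count them; grammar,
    # punctuation and word order are all discarded.
    return Counter(re.findall(r"[a-z]+", text.lower()))

print(bag_of_words("The deal closed; the deal terms were favorable."))
# Counter({'the': 2, 'deal': 2, 'closed': 1, 'terms': 1, 'were': 1, 'favorable': 1})
```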

 

Boolean Search

Refers to a system of logic developed by an early computer pioneer, George Boole. In Boolean searching, an "and" operator between two words results in a search for documents containing both of the words. An "or" operator between two words creates a search for documents containing either of the target words. A "not" operator between two words creates a search result containing documents with the first word but excluding those with the second.
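
The three operators map directly onto set operations; here is a toy sketch over an invented three-document collection:

```python
docs = {
    1: "merger agreement signed by both parties",
    2: "agreement terminated after material breach",
    3: "merger talks stalled indefinitely",
}

def containing(word: str) -> set:
    # The set of document IDs whose text contains the word.
    return {doc_id for doc_id, text in docs.items() if word in text.split()}

print(containing("merger") & containing("agreement"))  # AND -> {1}
print(containing("merger") | containing("agreement"))  # OR  -> {1, 2, 3}
print(containing("agreement") - containing("merger"))  # NOT -> {2}
```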

Classification

The process by which documents or entities are assigned to groups or taxonomies. Classification becomes more precise through training – human checks on machine-generated results.
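
For illustration only (scikit-learn is an assumed dependency, not part of any platform described here), a classifier can be trained on a handful of human-labeled documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Human-labeled examples: the "training" the definition refers to.
train_docs = ["invoice due in thirty days",
              "payment overdue on invoice",
              "employment offer with salary and benefits",
              "offer letter and start date"]
train_labels = ["finance", "finance", "hr", "hr"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["reminder: payment overdue"]))  # -> ['finance']
```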

 

Clustering

The algorithmic grouping of documents based on extracted concepts and weights. Clustering is rarely fully automated; such a process typically creates results that do not match a user's intuition about how documents should be grouped together. Modern text analytics platforms allow for dynamic drill-down, supervised clustering and other semi-automated (but highly precise and efficient) processes.
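
A bare-bones unsupervised sketch, again assuming scikit-learn purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["car engine repair invoice",
        "car engine service estimate",
        "fruit market price report",
        "fruit supplier price list"]

X = TfidfVectorizer().fit_transform(docs)   # extracted terms and weights
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# The engine documents and the fruit documents should land in
# separate clusters, e.g. [0, 0, 1, 1].
print(list(zip(labels, docs)))
```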

Computer Forensics

Computer investigation and analysis techniques used to identify and preserve legal evidence. Applications include computer crime or misuse, theft of trade secrets, theft or destruction of intellectual property, and fraud. Computer forensics specialists use many methods to capture computer system data and to recover deleted, encrypted or damaged file information.

 

Content

Written or recorded information, typically in the form of electronic documents (e.g., Word files, emails or PDFs), images (e.g., JPG or PNG files), video (e.g., mp4 files) or audio files (e.g., mp3 files).

 

Core Content

The most central information present in a document or piece of content. An example is the monetary terms and length of a business contract.

 

Corpus

A collection of related documents or texts. Examples include Wikipedia and the collected works of Shakespeare.

 

Culling

Reducing the size of the set of electronic documents using mutually defined criteria (dates, keywords, custodians, etc.) to decrease volume while increasing relevancy of the information.

 

Data

Facts and statistics collected together for reference or analysis. Data should not be confused with information, which is a collection or analysis of data that brings with it insight and understanding.

 

Deduplication

The identification and removal of identical (or very nearly identical) documents within a corpus. Deduplication is particularly important in the field of eDiscovery, in which the hosting and analysis of duplicate documents can create significant monetary and time drains.
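
A common first pass is exact-match deduplication using a cryptographic hash (see "hash" below); a minimal sketch:

```python
import hashlib

def fingerprint(text: str) -> str:
    # SHA-256 of the whitespace- and case-normalized text serves as
    # the document's digital thumbprint.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

corpus = ["Quarterly Report FINAL", "quarterly   report final", "Board minutes"]
seen, unique = set(), []
for doc in corpus:
    digest = fingerprint(doc)
    if digest not in seen:   # keep only the first copy of each fingerprint
        seen.add(digest)
        unique.append(doc)
print(unique)  # ['Quarterly Report FINAL', 'Board minutes']
# Near-duplicates require fuzzier techniques (e.g., shingling or MinHash).
```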

 

De-NISTing

Removing the operating system files, program files and other non-user-created data from a collection. The NIST (National Institute of Standards and Technology) list contains more than 40 million known files; filtering custodian hard drives against this list is effective because these files are usually irrelevant to a case, yet often make up a sizable portion of a collected set of electronically stored information (ESI).

 

Document Profile

A linguistic signature that describes a document's content at an abstract level. Document profiles are created during the ingestion process, during which relevant terminology in a document is identified and structured.

Early Case Assessment (ECA)

An early step in eDiscovery, in which a corpus is roughly analyzed and irrelevant or duplicate content is culled.

 

eDiscovery

A legal process in which information in an electronic form – as opposed to paper – is captured, processed and reviewed prior to litigation or settlement.

 

EDRM

Acronym for the Electronic Discovery Reference Model, a conceptual view of the eDiscovery process developed at Duke University.

 

Email Threading

A process within an eDiscovery workflow in which emails between parties are isolated and presented chronologically. Seemingly simple, email threading is actually rather complex due to forwards, replies, reply-alls, duplicates and attachments.
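
A naive sketch of the core idea – grouping on a normalized subject line and sorting by date – though real systems also rely on message IDs, reference headers and content analysis:

```python
import re
from collections import defaultdict
from datetime import datetime

emails = [
    {"subject": "Re: Q3 budget",  "sent": datetime(2024, 1, 3)},
    {"subject": "Q3 budget",      "sent": datetime(2024, 1, 1)},
    {"subject": "Fwd: Q3 budget", "sent": datetime(2024, 1, 5)},
]

def thread_key(subject: str) -> str:
    # Strip any run of "Re:"/"Fw:"/"Fwd:" prefixes so variants share a key.
    return re.sub(r"^(?:(?:re|fwd?):\s*)+", "", subject, flags=re.I).lower()

threads = defaultdict(list)
for msg in emails:
    threads[thread_key(msg["subject"])].append(msg)

for key, msgs in threads.items():
    msgs.sort(key=lambda m: m["sent"])   # present each thread chronologically
    print(key, "->", [m["subject"] for m in msgs])
```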

 

Entity

A distinct item referenced in a piece of content. Examples include people, places, companies, products and brands, as well as pattern-based entities, like addresses, phone numbers and email addresses.
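
Pattern-based entities in particular lend themselves to regular expressions; a rough sketch with deliberately simplified patterns:

```python
import re

text = "Reach Jane Roe at jane.roe@example.com or +1 (212) 555-0147."

# Simplified patterns for two pattern-based entity types.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]*\w", text)
phones = re.findall(r"\+?\d[\d\s().-]{7,}\d", text)

print("emails:", emails)  # ['jane.roe@example.com']
print("phones:", phones)  # ['+1 (212) 555-0147']
```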

 

ESI (Electronically Stored Information)

Data found in hard drives, CDs, online social networks, PDAs, smartphones, voice mail and other electronic data stores. Electronically stored information, for the purposes of the Federal Rules of Civil Procedure (FRCP), is information created, manipulated, communicated, stored and best utilized in digital form, requiring the use of computer hardware and software.

Hash

An algorithm that creates a value used to verify duplicate electronic documents. A hash value serves as a digital thumbprint.

Information

A collection or analysis of data that brings with it insight and understanding.

 

Ingestion

The first step in a text analytics workflow, during which documents are collected from disparate sources (e.g., cloud storage, ECM systems and local drives) and converted into a common text representation.
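
A toy sketch of the collection half, assuming a local folder of plain-text files (real pipelines also convert PDFs, Office documents, audio transcripts and more into the same representation):

```python
from pathlib import Path

def ingest(root: str) -> dict:
    # Map each file path to its extracted text – the common text
    # representation the rest of the pipeline works on.
    return {str(path): path.read_text(errors="ignore")
            for path in Path(root).rglob("*.txt")}

corpus = ingest(".")  # e.g., scan the current directory
print(f"ingested {len(corpus)} documents")
```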

Keyword Search

In eDiscovery, keyword search is the process of examining electronic documents in a collection or system by matching a keyword or keywords against instances in different documents. Keyword searches can only be run on electronic files in their native format, in searchable PDF, or in files that have been associated with an OCR text file. Standard keyword searches will return a positive result only if the exact keyword or a close derivative is specified. Search derivatives returned by litigation support search engines commonly include stemming, which returns grammatical variations on a word: a search for "related," for example, would also return "relating," "relates" and "relate."
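
The stemming behavior can be reproduced with NLTK's Porter stemmer (an assumed dependency, used purely for illustration):

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["related", "relating", "relates", "relate"]:
    print(word, "->", stemmer.stem(word))
# All four reduce to the stem "relat", so a search on any one of them
# can be expanded to match the others.
```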

Latent Semantic Indexing (LSI)

A largely deprecated method used to analyze content. LSI utilizes the so-called "bag of words" approach, in which words within a document are treated as discrete items, devoid of context or order.
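
For reference, LSI amounts to a truncated singular value decomposition of a weighted term-document matrix; in scikit-learn terms (assumed here for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["car engine repair", "automobile motor service", "fresh fruit market"]
X = TfidfVectorizer().fit_transform(docs)  # bag-of-words term weights
lsi = TruncatedSVD(n_components=2, random_state=0)
print(lsi.fit_transform(X).round(2))       # each document in a 2-D latent space
```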

 

Linguistics

The scientific study of language and its structure. Modern text analytics platforms leverage linguistic study and theory as core elements of their engines, as opposed to the purely mathematical approach presented by LSI.

 

Litigation Hold

A notice or communication from legal counsel to an organization that suspends the normal disposition or processing of records, such as backup tape recycling. A litigation hold will be issued as a result of current or anticipated litigation, audit, government investigation or other such matter to avoid evidence spoliation.

 

Language Mores

Specific language use guidelines and mores utilized within a region, dialect or industry group. Examples include business "jargon" and the New York-centric use of the term "on line," as opposed to "in line," to describe waiting in a queue.

Machine Learning

A field of computer science focused on giving computers the ability to "learn" without being explicitly programmed. "Learning" is accomplished through the repetition of like tasks and low-touch human training and checks.

 

Metadata

Data about data: information describing how, when and by whom a particular document or data set was created, edited, formatted and processed within an application or environment. Access to metadata provides important evidence, such as blind copy (bcc) recipients and the dates a file or email message was created and/or modified – information that is lost when an electronic document is converted to paper form for production. Files may include such metadata as an access date, file path, size or name.
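
Even the file system maintains a thin layer of metadata; a self-contained sketch:

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Create a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    f.write(b"draft contract")
    name = f.name

stat = Path(name).stat()
print({"path": name,
       "size_bytes": stat.st_size,
       "modified_utc": datetime.fromtimestamp(stat.st_mtime,
                                              tz=timezone.utc).isoformat()})
os.unlink(name)  # clean up
```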

 

Morphology

The study of the structure and construction of words, including roots and affixes (prefixes, suffixes and infixes).

Natural Intelligence (NI)

Insight and understanding that evolves and changes based on environment and context. NI can be clearly differentiated from artificial intelligence (or "AI"), which – even in its most complex forms – relies upon a series of largely static Boolean statements and rules. In computing, natural intelligence refers to a machine's ability to augment human performance by automating manual tasks while still providing a human-like work product.

 

Natural Language Processing (NLP)

A field of computer science focusing on translating human language – with its imprecision, ambiguity and evolving structure – into a machine-readable and -indexable format, without loss of context or meaning. NLP is critical to text analytics, as it enables platforms to structure the unstructured information contained in written documents, video and audio files.

 

Normalization

The transformation of text and entities into canonical forms for further analysis and insight extraction. An example is the normalization of "JFK," "John F. Kennedy," "Jack Kennedy" and "President Kennedy" into the canonical "President John F. Kennedy." Properly normalized, all examples of the individual entities within a corpus would be accessible via a search for any of the others – and presented distinct from references to John F. Kennedy International Airport.
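
At its simplest, normalization is a lookup from surface forms to a canonical form; a hypothetical alias table for the Kennedy example:

```python
# A hypothetical alias table; production systems derive these mappings
# from context rather than a hand-built dictionary.
ALIASES = {
    "jfk": "President John F. Kennedy",
    "john f. kennedy": "President John F. Kennedy",
    "jack kennedy": "President John F. Kennedy",
    "president kennedy": "President John F. Kennedy",
    "jfk airport": "John F. Kennedy International Airport",
}

def normalize(entity: str) -> str:
    # Fall back to the surface form when no canonical form is known.
    return ALIASES.get(entity.lower(), entity)

for mention in ["JFK", "Jack Kennedy", "President Kennedy"]:
    print(mention, "->", normalize(mention))
```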

Privilege

A special and exclusive legal advantage or right (for example, attorney work product and certain communications between an individual and his or her attorney, which are protected from disclosure).

 

PST

A Personal Storage Table (.pst) is a file format used to store copies of messages, calendar events, and other items within Microsoft software such as Microsoft Exchange Client, Windows Messaging, and Microsoft Outlook.

Relativity

A leading eDiscovery platform, with more than 150,000 users worldwide. It integrates directly with the ayfie Inspector text analytics platform.

Semantic Analysis

The process of extracting the core meaning of words, phrases, clauses and sentences within the context of a document and corpus.

 

Spoliation

The alteration, deletion or partial destruction of records which may be relevant to ongoing or anticipated litigation, government investigation or audit. Failure to preserve information that may become evidence is also spoliation.

 

Structured Content

Content treated as data, presented in a predictable form (like a table, XML or character-delimited file) and typically contextualized through the use of metadata. Structured content is extremely simple for content analytics applications to utilize, as it is designed for machine – not human – consumption and analysis.
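
A quick demonstration of why machines find structured content easy: a character-delimited record parses directly into named fields, no interpretation required:

```python
import csv
import io

# A character-delimited file, inlined as a string for the example.
raw = "party,effective_date,value_usd\nAcme Corp,2024-01-15,250000\n"

for row in csv.DictReader(io.StringIO(raw)):
    print(row["party"], row["effective_date"], row["value_usd"])
```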

 

Synonym Detection

The identification of synonyms within a document or corpus.

Technology-Assisted Review (TAR)

The use of text analytics or similar software to streamline and speed the eDiscovery review, analysis and production processes. TAR typically includes deduplication, search, analytics and predictive coding. Despite fears in the early 2000s, TAR has become all but universally accepted by judges.

 

Text Analytics

The use of purpose-built software and methodologies to extract insights and value from structured and unstructured content bases. Text analytics platforms typically rely on one of two core methodologies: mathematics-based (focusing on AI, LSI and machine learning) and language-based (focusing on semantic analysis, content and human-like understanding), though there is decided overlap between the schools.

 

Thread

A chain of email conversation consisting of the initiating email and all emails related to it, including the replies and forwards between the senders and recipients in the chain.

Unstructured Content

Content intended for human consumption, typically presented in the form of phrases, sentences and paragraphs. Unstructured content is significantly more difficult for machines to understand and analyze than structured content, as human language is prone to ambiguity, lack of clarity, repetition and imprecision. Natural language processing (NLP) must be applied to ready unstructured content for machine analysis.