Become familiar with the terms and jargon you often hear in text analytics conversations.
Our team has spent more than 30 years researching linguistics and building some of the world’s most advanced and functional knowledge discovery tools. Below you will find definitions for some of the most common terms we encounter day to day. You will undoubtedly run into these words and phrases as you research text analytics solutions for eDiscovery, contract analysis and more.
Artificial Intelligence (AI)
Area in computer science that tries to make machines act intelligently, with the goal of achieving or surpassing human intelligence. AI is typically used for predictive behavior and probability modeling. In text analytics, the term's definition is extremely murky, with different users deploying it for everything from predictive search and entity extraction to contextualization and normalization.
Bag of Words
The model used by less sophisticated natural language processing and information retrieval engines. In the bag of words approach, a text is assumed to be a collection of words without any order and without grammar. It is like all the words of the text being thrown in a bag and thereby losing the context in which they originally appeared in the text.
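A minimal sketch of the idea, using only the Python standard library. The function name `bag_of_words` and the sample sentences are illustrative assumptions, not part of any particular engine. It shows how two sentences with opposite meanings collapse into the same bag once word order is discarded.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens, discarding order and grammar."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Two sentences with opposite meanings (hypothetical examples).
first = bag_of_words("The dog bit the man")
second = bag_of_words("The man bit the dog")

# Once order is thrown away, the two texts are indistinguishable.
print(first == second)  # True
```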
Boolean Search
Refers to a system of logic developed by the 19th-century mathematician George Boole. In Boolean searching, an "and" operator between two words results in a search for documents containing both of the words. An "or" operator between two words creates a search for documents containing either of the target words. A "not" operator before a word creates a search result not containing this word.
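The three operators can be sketched with Python sets. This is a simplified illustration under stated assumptions: the tiny `docs` collection and the helper names `search_and`, `search_or` and `search_not` are hypothetical, and a production search engine would use an inverted index and a query parser rather than scanning every document.

```python
import re

# Hypothetical three-document collection, keyed by document id.
docs = {
    1: "The contract was signed in March",
    2: "The invoice references the contract",
    3: "Payment of the invoice is overdue",
}

# Map each document id to its set of lowercase word tokens.
index = {doc_id: set(re.findall(r"\w+", text.lower())) for doc_id, text in docs.items()}


def search_and(a: str, b: str) -> set:
    """Documents containing both words."""
    return {d for d, words in index.items() if a in words and b in words}


def search_or(a: str, b: str) -> set:
    """Documents containing either word."""
    return {d for d, words in index.items() if a in words or b in words}


def search_not(a: str) -> set:
    """Documents that do not contain the word."""
    return {d for d, words in index.items() if a not in words}


print(search_and("contract", "invoice"))  # {2}
print(search_or("contract", "invoice"))   # {1, 2, 3}
print(search_not("contract"))             # {3}
```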
Classification
The process by which documents or entities are assigned to groups (classes) or taxonomies. Classification becomes more precise through training – human checks on machine-generated results.
Clustering
The algorithmic grouping of documents based on extracted concepts and weights. Clustering is rarely fully automated; such a process typically creates results that do not match a user's intuition about how documents should be grouped together. Modern text analytics platforms allow for dynamic drill-down, supervised clustering and other semi-automated (but highly precise and efficient) processes.
Computer Forensics
Computer investigation and analysis techniques to determine legal evidence. Areas of application include investigations in computer crime or misuse, theft of trade secrets, theft of or destruction of intellectual property and fraud. Computer forensics specialists use many methods to capture computer system data and recover deleted, encrypted or damaged file information.
Content
Written or recorded information, typically in the form of electronic documents (e.g., Word files, emails or PDFs), images (e.g., JPG or PNG files), video (e.g., mp4 files) or audio files (e.g., mp3 files).
The most central information present in a document or piece of content. Examples are the monetary terms and length of a business contract.
Corpus
A collection of related documents or texts. Examples include Wikipedia and the collected works of Shakespeare.
Culling
Reducing the size of a set of electronic documents using mutually defined criteria (dates, keywords, custodians, etc.) to decrease volume while increasing relevancy of the information.
Custodian
The Electronic Discovery Reference Model (EDRM) defines a custodian as: "Person having administrative control of a document or electronic file; for example, the custodian of an email is the owner of the mailbox which contains the message."
Data
Facts and statistics collected together for reference or analysis. Data should not be confused with information, which is a collection or analysis of data that brings with it insight and understanding.
Deduplication
The identification and removal of identical (or very nearly identical) documents within a corpus. Deduplication is particularly important in the field of eDiscovery, in which the hosting and analysis of duplicate documents can create significant monetary and time drains.
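One common way to find identical documents is to compare content hashes. The sketch below is an assumption about how this might be done, not a description of any specific product: it treats documents as duplicates when their text matches after lowercasing and collapsing whitespace. The names `fingerprint` and `deduplicate` are hypothetical, and detecting "very nearly identical" documents would need a similarity measure that is not shown here.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing runs of whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents: list) -> list:
    """Keep the first occurrence of each distinct fingerprint, in order."""
    seen = set()
    unique = []
    for doc in documents:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique


corpus = ["Hello  World", "hello world", "Goodbye"]
print(deduplicate(corpus))  # ['Hello  World', 'Goodbye']
```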
De-NISTing
Removing operating system files, program files and other non-user-created data from a collection. The NIST (National Institute of Standards and Technology) list contains more than 40 million known files. Filtering custodians' hard drive files against this list can be effective because these files are usually irrelevant to a case but often make up a sizable portion of a collected set of electronically stored information (ESI).
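Filtering against a list of known files is typically done by comparing file hashes. The following is a simplified sketch under stated assumptions: `known_system_hashes` stands in for the real NIST list (it is built here from made-up bytes), the file names and contents are hypothetical, and SHA-1 is used purely for illustration.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the given bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


# Stand-in for a known-file hash list (hypothetical content).
known_system_hashes = {sha1_hex(b"fake operating system file")}

# Hypothetical collected files: name -> raw bytes.
collected = {
    "kernel32.dll": b"fake operating system file",
    "memo.docx": b"user-authored memo",
}


def denist(files: dict, known: set) -> dict:
    """Drop every file whose hash appears in the known-file list."""
    return {name: data for name, data in files.items() if sha1_hex(data) not in known}


print(list(denist(collected, known_system_hashes)))  # ['memo.docx']
```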
Document Profile
A linguistic signature that describes a document's content at an abstract level. Document profiles are created during the ingestion process, during which relevant terminology in a document is identified and structured.
Early Case Assessment
An early step in eDiscovery, in which a corpus is roughly analyzed and irrelevant or duplicate content is culled.
eDiscovery
A legal process in which information in electronic form – as opposed to paper – is captured, processed and reviewed prior to litigation or settlement.
EDRM
Acronym for the Electronic Discovery Reference Model, a conceptual view of the eDiscovery process developed at Duke University.
Email Threading
A process within an eDiscovery workflow in which emails between parties are isolated and presented chronologically. Though seemingly simple, email threading is actually rather complex due to forwards, replies, reply-alls, duplicates and attachments.
Entity
A distinct item referenced in a piece of content. Examples include people, places, companies, products and brands, as well as pattern-based entities, like addresses, phone numbers and email addresses.
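Pattern-based entities are the easiest kind to illustrate, because they can be approximated with regular expressions. The patterns below are deliberately simple assumptions (a rough email shape and a US-style ten-digit phone number); they are not exhaustive and would miss or over-match many real-world formats. The name `extract_entities` is hypothetical.

```python
import re

# Simplified, assumed patterns; real extractors handle far more variation.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")


def extract_entities(text: str) -> dict:
    """Return all email-like and phone-like strings found in the text."""
    return {"email": EMAIL.findall(text), "phone": PHONE.findall(text)}


sample = "Contact jane.doe@example.com or call (555) 123-4567."
print(extract_entities(sample))
# {'email': ['jane.doe@example.com'], 'phone': ['(555) 123-4567']}
```

Entities such as people, companies and places generally cannot be captured this way and call for dictionaries or linguistic analysis instead.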
ESI (Electronically Stored Information)
Data found in hard drives, CDs, online social networks, PDAs, smart phones, voice mail and other electronic data stores. Electronically stored information, for the purpose of the Federal Rules of Civil Procedure (FRCP), is information created, manipulated, communicated, stored and best utilized in digital form, requiring the use of computer hardware and software.
Information
A collection or analysis of data that brings with it insight and understanding.
Ingestion
The first step in a text analytics workflow, during which documents are collected from disparate sources (e.g., cloud storage, ECM systems and local drives) and converted into a common text representation.
Keyword Search
In eDiscovery, keyword search is a process of examining electronic documents in a collection or system by matching a keyword or keywords with instances in different documents. Keyword searches can only be done on electronic files in their native format, in searchable PDF or in files that have been associated with an OCR text file. Standard keyword searches will return a positive result only if the exact keyword or a close derivative is specified. Search derivatives returned by litigation support search engines commonly include stemming. Stemming returns grammatical variations on a word; for example, a search for "related" would also return "relating," "relates" and "relate."
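The stemming behavior can be sketched with a crude suffix-stripping function. This is an illustrative assumption only: `crude_stem` is a hypothetical toy, not the Porter stemmer or the algorithm used by any litigation support engine, and the sample documents are made up. It is just enough to show why a search for "related" also finds "relating" and "relates".

```python
import re

SUFFIXES = ("ing", "ed", "es", "s", "e")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keyword_search(query: str, documents: dict) -> list:
    """Return ids of documents containing a word with the same stem as the query."""
    target = crude_stem(query.lower())
    hits = []
    for doc_id, text in documents.items():
        stems = {crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())}
        if target in stems:
            hits.append(doc_id)
    return hits


# Hypothetical documents.
documents = {
    1: "The parties are relating the events",
    2: "This clause relates to payment",
    3: "An unrelated matter",
}

print(keyword_search("related", documents))  # [1, 2]
```

Document 3 is not returned because prefixes are left untouched, so "unrelated" reduces to a different stem.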
Latent Semantic Indexing (LSI)
A largely deprecated method used to analyze content. LSI utilizes the so-called "bag of words" approach, in which words within a document are treated as discrete items, devoid of context or order.
Linguistics
The scientific study of language and its structure. Modern text analytics platforms leverage linguistic study and theory as core elements of their engines, as opposed to the purely mathematical approach presented by LSI.
Litigation Hold
A notice or communication from legal counsel to an organization that suspends the normal disposition or processing of records, such as backup tape recycling. A litigation hold will be issued as a result of current or anticipated litigation, audit, government investigation or other such matter to avoid evidence spoliation.
Local Grammars
Local grammars are a formalism used in computational linguistics to precisely describe (natural language) phrases and their meaning. They avoid problems of over-generalization typically found in general grammars. Very often, idioms (e.g., he kicks the bucket) and technical jargon (e.g., he kills the process) have syntactic characteristics that can be more easily and precisely described as local constraints. Local grammars typically make use of extensive dictionaries describing many morphological, syntactic and semantic aspects of words.
Machine Learning
A field of computer science focused on giving computers the ability to "learn" without being explicitly programmed. "Learning" is accomplished through the repetition of like tasks and low-touch human training and checks.
Metadata
Data about data. Metadata provides information about a document or other data managed within an application or environment. It describes how, when and by whom a particular set of data was created, edited, formatted and processed. Access to metadata provides important evidence, such as blind copy (bcc) recipients, the date a file or email message was created and/or modified and other similar information. Such information is lost when an electronic document is converted to paper form for production. Files may include such metadata as an access date, file path, size or name.
Morphology
In linguistics, the study of the structure and construction of words, including roots and morphemes (prefixes, suffixes, infixes and other affixes) and how words of a language are related to each other.
Natural Language Processing (NLP)
A field of computer science focusing on translating human language – with its imprecision, ambiguity and evolving structure – into a machine-readable and -indexable format, without loss of context or meaning. NLP is critical to text analytics, as it enables platforms to structure the unstructured information contained in written documents, video and audio files.
Normalization
The transformation of text and entities into canonical forms for further analysis and insight extraction. An example is the normalization of "JFK," "John F. Kennedy," "Jack Kennedy" and "President Kennedy" into the canonical "President John F. Kennedy." Properly normalized, all examples of the individual entities within a corpus would be accessible via a search for any of the others – and presented distinct from references to John F. Kennedy International Airport.
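At its simplest, the Kennedy example can be sketched as a lookup from known aliases to a canonical form. This is a minimal illustration under stated assumptions: the `ALIASES` table and the `normalize` function are hypothetical, and the lookup only matches a whole mention exactly. A real platform would rely on surrounding context to decide, for instance, whether a bare "JFK" refers to the president or the airport.

```python
CANONICAL = "President John F. Kennedy"

# Hypothetical alias table, keyed by lowercase mention.
ALIASES = {
    "jfk": CANONICAL,
    "john f. kennedy": CANONICAL,
    "jack kennedy": CANONICAL,
    "president kennedy": CANONICAL,
}


def normalize(mention: str) -> str:
    """Map a known alias to its canonical form; leave other mentions unchanged."""
    key = " ".join(mention.lower().split())
    return ALIASES.get(key, mention)


print(normalize("Jack Kennedy"))
# President John F. Kennedy
print(normalize("John F. Kennedy International Airport"))
# John F. Kennedy International Airport
```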
Privilege
A special and exclusive legal advantage or right (for example, attorney work product and certain communications between an individual and his or her attorney, which are protected from disclosure).
PST File Format
A Personal Storage Table (.pst) is a file format used to store copies of messages, calendar events and other items within Microsoft software such as Microsoft Exchange Client, Windows Messaging and Microsoft Outlook.
The process of extracting the core meaning of words, phrases, clauses and sentences within the context of a document and corpus.
Spoliation
The alteration, deletion or partial destruction of records which may be relevant to ongoing or anticipated litigation, government investigation or audit. Failure to preserve information that may become evidence is also spoliation.
Structured Content
Content treated as data, presented in a predictable form (like a table, XML or character-delimited file) and typically contextualized through the use of metadata. Structured content is extremely simple for content analytics applications to utilize, as it is designed for machine – not human – consumption and analysis.
Synonymy
The relationship between two words or phrases that (more or less) have the same meaning in a given document or corpus.
Technology Assisted Review (TAR)
The use of text analytics or similar software to streamline and speed the eDiscovery review, analysis and production processes. TAR typically includes deduplication, search, analytics and predictive coding. Despite fears in the early 2000s, TAR has become widely accepted by judges.
Text Analytics
The use of purpose-built software and methodologies to extract insights and value from structured and unstructured content bases. Text analytics platforms typically rely on one of two core methodologies: mathematics-based (focusing on AI, LSI and machine learning) and language-based (focusing on semantic analysis, content and human-like understanding), though there is decided overlap between the schools.
Thread
A chain of email conversation consisting of the initiating email and all emails related to it, including the replies and forwards between senders and recipients in the chain.
Unstructured Content
Content intended for human consumption, typically presented in the form of phrases, sentences and paragraphs. Unstructured content is significantly more difficult for machines to understand and analyze than structured content, as human language is prone to ambiguity, lack of clarity, repetition and imprecision. Natural language processing (NLP) must be applied to ready unstructured content for machine analysis.