Learn the Ins and Outs of Text Analytics

Our team has spent more than 30 years researching linguistics and building some of the world's most advanced and functional knowledge discovery tools. Here, you'll find definitions for some of the most common terms we come across day to day – and which you'll certainly find as you research text analytics solutions for eDiscovery, contract analysis and more.

Artificial Intelligence (AI)

Simulation of human intelligence by machines, typically used for predictive behavior and probability modeling. In text analytics, the term's definition is extremely murky, with different users deploying it for everything from predictive search and entity extraction to contextualization and normalization. "AI" is differentiated from natural or aided intelligence, which refer to a machine's ability to augment human skills, rather than possessing them itself.

Bag of Words

The model used by less sophisticated natural language processing and information retrieval engines. In the bag of words approach, individual words within a text are decontextualized, with grammar and order disregarded.
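
As a minimal illustration (not any particular vendor's implementation), a bag-of-words model amounts to counting word occurrences and throwing the order away:

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Lowercase, pull out word tokens, and count them; grammar,
    # punctuation and word order are all discarded.
    return Counter(re.findall(r"[a-z]+", text.lower()))

print(bag_of_words("The deal closed; the deal terms were favorable."))
# Counter({'the': 2, 'deal': 2, 'closed': 1, 'terms': 1, 'were': 1, 'favorable': 1})
```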

 

Boolean Search

Refers to a system of logic developed by an early computer pioneer, George Boole. In Boolean searching, an "and" operator between two words results in a search for documents containing both of the words. An "or" operator between two words creates a search for documents containing either of the target words. A "not" operator between two words creates a search result containing documents with the first word but excluding those with the second.
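
The three operators map directly onto set operations; here is a toy sketch over an invented three-document collection:

```python
docs = {
    1: "merger agreement signed by both parties",
    2: "agreement terminated after material breach",
    3: "merger talks stalled indefinitely",
}

def containing(word: str) -> set:
    # The set of document IDs whose text contains the word.
    return {doc_id for doc_id, text in docs.items() if word in text.split()}

print(containing("merger") & containing("agreement"))  # AND -> {1}
print(containing("merger") | containing("agreement"))  # OR  -> {1, 2, 3}
print(containing("agreement") - containing("merger"))  # NOT -> {2}
```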

Classification

The process by which documents or entities are assigned to groups or taxonomies. Classification becomes more precise through training – human checks on machine-generated results.
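
For illustration only (scikit-learn is an assumed dependency, not part of any platform described here), a classifier can be trained on a handful of human-labeled documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Human-labeled examples: the "training" the definition refers to.
train_docs = ["invoice due in thirty days",
              "payment overdue on invoice",
              "employment offer with salary and benefits",
              "offer letter and start date"]
train_labels = ["finance", "finance", "hr", "hr"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["reminder: payment overdue"]))  # -> ['finance']
```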

 

Clustering

The algorithmic grouping of documents based on extracted concepts and weights. Clustering is rarely fully automated; such a process typically creates results that do not match a user's intuition about how documents should be grouped together. Modern text analytics platforms allow for dynamic drill-down, supervised clustering and other semi-automated (but highly precise and efficient) processes.
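
A bare-bones unsupervised sketch, again assuming scikit-learn purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["car engine repair invoice",
        "car engine service estimate",
        "fruit market price report",
        "fruit supplier price list"]

X = TfidfVectorizer().fit_transform(docs)   # extracted terms and weights
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# The engine documents and the fruit documents should land in
# separate clusters, e.g. [0, 0, 1, 1].
print(list(zip(labels, docs)))
```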

Computer Forensics

Computer investigation and analysis techniques used to identify and preserve legal evidence. Applications include computer crime or misuse, theft of trade secrets, theft or destruction of intellectual property, and fraud. Computer forensics specialists use many methods to capture computer system data and to recover deleted, encrypted or damaged file information.

 

Content

Written or recorded information, typically in the form of electronic documents (e.g., Word files, emails or PDFs), images (e.g., JPG or PNG files), video (e.g., mp4 files) or audio files (e.g., mp3 files).

 

Core Content

The most central information present in a document or piece of content. An example is the monetary terms and length of a business contract.

 

Corpus

A collection of related documents or texts. Examples include Wikipedia and the collected works of Shakespeare.

 

Culling

Reducing the size of the set of electronic documents using mutually defined criteria (dates, keywords, custodians, etc.) to decrease volume while increasing relevancy of the information.

 

Data

Facts and statistics collected together for reference or analysis. Data should not be confused with information, which is a collection or analysis of data that brings with it insight and understanding.

 

Deduplication

The identification and removal of identical (or very nearly identical) documents within a corpus. Deduplication is particularly important in the field of eDiscovery, in which the hosting and analysis of duplicate documents can create significant monetary and time drains.
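
A common first pass is exact-match deduplication using a cryptographic hash (see "hash" below); a minimal sketch:

```python
import hashlib

def fingerprint(text: str) -> str:
    # SHA-256 of the whitespace- and case-normalized text serves as
    # the document's digital thumbprint.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

corpus = ["Quarterly Report FINAL", "quarterly   report final", "Board minutes"]
seen, unique = set(), []
for doc in corpus:
    digest = fingerprint(doc)
    if digest not in seen:   # keep only the first copy of each fingerprint
        seen.add(digest)
        unique.append(doc)
print(unique)  # ['Quarterly Report FINAL', 'Board minutes']
# Near-duplicates require fuzzier techniques (e.g., shingling or MinHash).
```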

 

De-NISTing

Removing the operating system files, program files and other non-user-created data from a collection. The NIST (National Institute of Standards and Technology) list contains more than 40 million known files; filtering custodian hard drives against this list is effective because these files are usually irrelevant to a case, yet often make up a sizable portion of a collected set of electronically stored information (ESI).

 

Document Profile

A linguistic signature that describes a document's content at an abstract level. Document profiles are created during the ingestion process, during which relevant terminology in a document is identified and structured.

Early Case Assessment (ECA)

An early step in eDiscovery, in which a corpus is roughly analyzed and irrelevant or duplicate content is culled.

 

eDiscovery

A legal process in which information in an electronic form – as opposed to paper – is captured, processed and reviewed prior to litigation or settlement.

 

EDRM

Acronym for the Electronic Discovery Reference Model, a conceptual view of the eDiscovery process developed at Duke University.

 

Email Threading

A process within an eDiscovery workflow in which emails between parties are isolated and presented chronologically. Seemingly simple, email threading is actually rather complex due to forwards, replies, reply-alls, duplicates and attachments.
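
A naive sketch of the core idea – grouping on a normalized subject line and sorting by date – though real systems also rely on message IDs, reference headers and content analysis:

```python
import re
from collections import defaultdict
from datetime import datetime

emails = [
    {"subject": "Re: Q3 budget",  "sent": datetime(2024, 1, 3)},
    {"subject": "Q3 budget",      "sent": datetime(2024, 1, 1)},
    {"subject": "Fwd: Q3 budget", "sent": datetime(2024, 1, 5)},
]

def thread_key(subject: str) -> str:
    # Strip any run of "Re:"/"Fw:"/"Fwd:" prefixes so variants share a key.
    return re.sub(r"^(?:(?:re|fwd?):\s*)+", "", subject, flags=re.I).lower()

threads = defaultdict(list)
for msg in emails:
    threads[thread_key(msg["subject"])].append(msg)

for key, msgs in threads.items():
    msgs.sort(key=lambda m: m["sent"])   # present each thread chronologically
    print(key, "->", [m["subject"] for m in msgs])
```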

 

Entity

A distinct item referenced in a piece of content. Examples include people, places, companies, products and brands, as well as pattern-based entities, like addresses, phone numbers and email addresses.
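
Pattern-based entities in particular lend themselves to regular expressions; a rough sketch with deliberately simplified patterns:

```python
import re

text = "Reach Jane Roe at jane.roe@example.com or +1 (212) 555-0147."

# Simplified patterns for two pattern-based entity types.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]*\w", text)
phones = re.findall(r"\+?\d[\d\s().-]{7,}\d", text)

print("emails:", emails)  # ['jane.roe@example.com']
print("phones:", phones)  # ['+1 (212) 555-0147']
```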

 

ESI (Electronically Stored Information)

Data found in hard drives, CDs, online social networks, PDAs, smartphones, voice mail and other electronic data stores. Electronically stored information, for the purposes of the Federal Rules of Civil Procedure (FRCP), is information created, manipulated, communicated, stored and best utilized in digital form, requiring the use of computer hardware and software.

Hash

An algorithm that creates a value used to verify duplicate electronic documents. A hash value serves as a digital thumbprint.

Information

A collection or analysis of data that brings with it insight and understanding.

 

Ingestion

The first step in a text analytics workflow, during which documents are collected from disparate sources (e.g., cloud storage, ECM systems and local drives) and converted into a common text representation.
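
A toy sketch of the collection half, assuming a local folder of plain-text files (real pipelines also convert PDFs, Office documents, audio transcripts and more into the same representation):

```python
from pathlib import Path

def ingest(root: str) -> dict:
    # Map each file path to its extracted text – the common text
    # representation the rest of the pipeline works on.
    return {str(path): path.read_text(errors="ignore")
            for path in Path(root).rglob("*.txt")}

corpus = ingest(".")  # e.g., scan the current directory
print(f"ingested {len(corpus)} documents")
```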

Keyword Search

In eDiscovery, keyword search is the process of examining electronic documents in a collection or system by matching a keyword or keywords against instances in different documents. Keyword searches can only be run on electronic files in their native format, in searchable PDF, or in files that have been associated with an OCR text file. Standard keyword searches will return a positive result only if the exact keyword or a close derivative is specified. Search derivatives returned by litigation support search engines commonly include stemming, which returns grammatical variations on a word: a search for "related," for example, would also return "relating," "relates" and "relate."
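
The stemming behavior can be reproduced with NLTK's Porter stemmer (an assumed dependency, used purely for illustration):

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["related", "relating", "relates", "relate"]:
    print(word, "->", stemmer.stem(word))
# All four reduce to the stem "relat", so a search on any one of them
# can be expanded to match the others.
```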

Latent Semantic Indexing (LSI)

A largely deprecated method used to analyze content. LSI utilizes the so-called "bag of words" approach, in which words within a document are treated as discrete items, devoid of context or order.
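
For reference, LSI amounts to a truncated singular value decomposition of a weighted term-document matrix; in scikit-learn terms (assumed here for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["car engine repair", "automobile motor service", "fresh fruit market"]
X = TfidfVectorizer().fit_transform(docs)  # bag-of-words term weights
lsi = TruncatedSVD(n_components=2, random_state=0)
print(lsi.fit_transform(X).round(2))       # each document in a 2-D latent space
```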

 

Linguistics

The scientific study of language and its structure. Modern text analytics platforms leverage linguistic study and theory as core elements of their engines, as opposed to the purely mathematical approach presented by LSI.

 

Litigation Hold

A notice or communication from legal counsel to an organization that suspends the normal disposition or processing of records, such as backup tape recycling. A litigation hold will be issued as a result of current or anticipated litigation, audit, government investigation or other such matter to avoid evidence spoliation.

 

Language Mores

Specific language use guidelines and mores utilized within a region, dialect or industry group. Examples include business "jargon" and the New York-centric use of the term "on line," as opposed to "in line," to describe waiting in a queue.

Machine Learning

A field of computer science focused on giving computers the ability to "learn" without being explicitly programmed. "Learning" is accomplished through the repetition of like tasks and low-touch human training and checks.

 

Metadata

Data about data: information describing how, when and by whom a particular document or data set was created, edited, formatted and processed within an application or environment. Access to metadata provides important evidence, such as blind copy (bcc) recipients and the dates a file or email message was created and/or modified – information that is lost when an electronic document is converted to paper form for production. Files may include such metadata as an access date, file path, size or name.
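
Even the file system maintains a thin layer of metadata; a self-contained sketch:

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Create a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    f.write(b"draft contract")
    name = f.name

stat = Path(name).stat()
print({"path": name,
       "size_bytes": stat.st_size,
       "modified_utc": datetime.fromtimestamp(stat.st_mtime,
                                              tz=timezone.utc).isoformat()})
os.unlink(name)  # clean up
```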

 

Morphology

The study of the structure and construction of words, including roots and affixes (prefixes, suffixes and infixes).

Natural Intelligence (NI)

Insight and understanding that evolves and changes based on environment and context. NI can be clearly differentiated from artificial intelligence (or "AI"), which – even in its most complex forms – relies upon a series of largely static Boolean statements and rules. In computing, natural intelligence refers to a machine's ability to augment human performance by automating manual tasks while still providing a human-like work product.

 

Natural Language Processing (NLP)

A field of computer science focusing on translating human language – with its imprecision, ambiguity and evolving structure – into a machine-readable and -indexable format, without loss of context or meaning. NLP is critical to text analytics, as it enables platforms to structure the unstructured information contained in written documents, video and audio files.

 

Normalization

The transformation of text and entities into canonical forms for further analysis and insight extraction. An example is the normalization of "JFK," "John F. Kennedy," "Jack Kennedy" and "President Kennedy" into the canonical "President John F. Kennedy." Properly normalized, all examples of the individual entities within a corpus would be accessible via a search for any of the others – and presented distinct from references to John F. Kennedy International Airport.
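
At its simplest, normalization is a lookup from surface forms to a canonical form; a hypothetical alias table for the Kennedy example:

```python
# A hypothetical alias table; production systems derive these mappings
# from context rather than a hand-built dictionary.
ALIASES = {
    "jfk": "President John F. Kennedy",
    "john f. kennedy": "President John F. Kennedy",
    "jack kennedy": "President John F. Kennedy",
    "president kennedy": "President John F. Kennedy",
    "jfk airport": "John F. Kennedy International Airport",
}

def normalize(entity: str) -> str:
    # Fall back to the surface form when no canonical form is known.
    return ALIASES.get(entity.lower(), entity)

for mention in ["JFK", "Jack Kennedy", "President Kennedy"]:
    print(mention, "->", normalize(mention))
```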

Privilege

A special and exclusive legal advantage or right (for example, attorney work product and certain communications between an individual and his or her attorney, which are protected from disclosure).

 

PST

A Personal Storage Table (.pst) is a file format used to store copies of messages, calendar events, and other items within Microsoft software such as Microsoft Exchange Client, Windows Messaging, and Microsoft Outlook.

Relativity

A leading eDiscovery platform, with more than 150,000 users worldwide. It integrates directly with the ayfie Inspector text analytics platform.

Semantic Analysis

The process of extracting the core meaning of words, phrases, clauses and sentences within the context of a document and corpus.

 

Spoliation

The alteration, deletion or partial destruction of records which may be relevant to ongoing or anticipated litigation, government investigation or audit. Failure to preserve information that may become evidence is also spoliation.

 

Structured Content

Content treated as data, presented in a predictable form (like a table, XML or character-delimited file) and typically contextualized through the use of metadata. Structured content is extremely simple for content analytics applications to utilize, as it is designed for machine – not human – consumption and analysis.
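
A quick demonstration of why machines find structured content easy: a character-delimited record parses directly into named fields, no interpretation required:

```python
import csv
import io

# A character-delimited file, inlined as a string for the example.
raw = "party,effective_date,value_usd\nAcme Corp,2024-01-15,250000\n"

for row in csv.DictReader(io.StringIO(raw)):
    print(row["party"], row["effective_date"], row["value_usd"])
```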

 

Synonym Detection

The identification of synonyms within a document or corpus.

Technology-Assisted Review (TAR)

The use of text analytics or similar software to streamline and speed the eDiscovery review, analysis and production processes. TAR typically includes deduplication, search, analytics and predictive coding. Despite fears in the early 2000s, TAR has become all but universally accepted by judges.

 

Text Analytics

The use of purpose-built software and methodologies to extract insights and value from structured and unstructured content bases. Text analytics platforms typically rely on one of two core methodologies: mathematics-based (focusing on AI, LSI and machine learning) and language-based (focusing on semantic analysis, content and human-like understanding), though there is decided overlap between the schools.

 

Thread

A chain of email conversation consisting of the initiating email and all emails related to it, including the replies and forwards between the senders and recipients in the chain.

Unstructured Content

Content intended for human consumption, typically presented in the form of phrases, sentences and paragraphs. Unstructured content is significantly more difficult for machines to understand and analyze than structured content, as human language is prone to ambiguity, lack of clarity, repetition and imprecision. Natural language processing (NLP) must be applied to ready unstructured content for machine analysis.