Cleaning Data for Effective Data Science

Accuracy

In a classification model, there are numerous metrics that might express the "goodness" of a model. Accuracy is often the default metric used, and is simply the number of right answers divided by the number of data points. For example, consider this hypothetical confusion matrix:

Predict/Actual	Human	Octopus	Penguin
Human	5	0	2
Octopus	3	3	3
Penguin	0	1	11

There are 28 observations of organisms, and 19 were classified accurately, hence the accuracy is approximately 68%. Other commonly used metrics are precision, recall, and F1 score.

Related concepts: F1 score, precision, recall

ActiveMQ

Apache ActiveMQ is an open source message broker. As with other message brokers, the aggregations of messages sent among systems is often a fruitful domain for data science analysis.

BeautifulSoup

Beautiful Soup is a Python library for parsing and processing HTML and XML documents, and also for handling not-quite-grammatical HTML that often occurs on the World Wide Web. Beautiful Soup is often useful for acquiring data via web scraping.

Berkeley DB

Berkeley DB is an open source library for providing key/value storage systems.

Big Data

The concept of "big data" is one that shifts with time, as computing and storage capabilities increase. Generally, big data is simply data that is too large to handle using "traditional" and simple tools. What tools are traditional or simple, in turn, varies with organization, project, and over time. As s rough guideline, data that can fit inside the memory on a single available server or workstation is "small data," or at most "medium-sized data."

As of 2021, a reasonably powerful single system might have 256 GiB, so big data is at least tens or hundreds of gigabytes ($10^9$) in size. Within a few years of this writing, the threshhold for big data will be at least terabytes ($10^{12}$), and already today some data sets reach into exabytes ($10^{18}$).

Big-endian (see Endianness)

Data arranged into "words" (typically 32-bits), or other units, where the largest magnitude component (typically a byte) is stored in the last position.

BSON (Binary JSON)

BSON is a binary-encoded serialization of JSON-like documents.

caret (Classification And REgression Training)

The R package caret is a rich collection of functions for data splitting, pre-processing, feature selection, resampling, and variable importance estimation.

Cassandra

Apache Cassandra is an open source distributed database system that uses the Cassandra Query Language (CQL), rather than standard SQL for queries. CQL and SQL are largely similar, but vary in specific details.

Categorical variable (see NOIR)

Related concepts: continuous variable, interval variable, nominal variable, ordinal variable, ratio variable

chardet

The chardet module in Python, and analogous versions in other programming languages, applies a colleciton of heuristics to a sequence of bytes thought likely to encode text. If the protocol or format you encounter explicitly declares an encoding, try that first. As a fallback, chardet can often make reasonable guesses based on letter and n-gram frequencies that occur in a different language, and which byte values are permitted by a given encoding.

Chimera

In Greek mythology, a chimera is an animal combining elements of several dramatically disparate animals; most commonly these include the head of a lion, the body of a goat, and the tail of a snake. In adapted uses as a generic but evocative adjective, anything that combines surprisingly juxtaposed elements together can be called chimerical; or metaphorically, the thing might be called a chimera.

Column

A single kind of data item that may have, and usually has, many exemplars, one per row (a.k.a. sample, observation, record, etc.). A column consists of ordered data items of the same data type but varying values. A number of synonyms are used for "columns" with slightly varying focus. Features emphasize the way that columns are used by machine learning algorithms. Field focuses on the data format used to store the data items. Measurement is used most often when a column collects empirical observations, often using some particular instrument. Variable is used when thinking of equational relationships among different columns (e.g. independent versus dependent).

Overall, columns and rows form columnar or tabular data.

Synonyms: feature, field, measurement, variable

Comma-separated values (CSV)

A representation of columnar data in which each line of text is separated by a newline character (or carriage return, or CR/LF). Within each line, data values are separated by commas. Values separated by other delimiters, such as tab or |, are also often informally called CSV (the acronym, not the full words).

Variations on the format use several quoting and escaping conventions. String data items containing commas internally need to be either quoted (usually with quote characters) or escaped (usually with backslash); but if so, those characters in turn need special behaviors.

Continuous variable (see NOIR)

Related concepts: categorical variable, interval variable, nominal variable, ordinal variable, ratio variable

Coreutils (GNU Core Utilities)

A collection of shell-oriented utilities for processing text and data. The subset of these tools that was formerly contained in the separate textutils package, in particular, are relevant to processing textual data sources. These tools include cat, cut, fmt, fold, head, sort, tail, tee, tr, uniq, wc. Other command-line tools like grep, sed, shuf, and awk are also widely used in interaction with these tools.

Corpus (pl. corpora)

Corpus is a term from linguistics, but used also in related natural language processing (NLP). It simply refers to a large "body" (the Latin root) of text covering a similar domain, such as a common publisher, genre, or dialect. In general, some sort of modeling or statistical analysis may apply to a particular body of text, and by extension to texts of a similar domain.

CouchDB

Apache CouchDB is an open-source document-oriented database. Internally, data in CouchDB is represented in JSON format.

CrateDB

CrateDB is an open-source document-oriented database. CrateDB occupies an overlapping space with MongoDB or CouchDB, but emphasizes real-time performance.

Curse of dimensionality

The phrase "curse of dimensionality" was coined by Richard E. Bellman in 1957. It applies to a number of different numeric or scientific fields. In relation to machine learning, in particular, the problem is that as the number of dimensions increases, the size of the parameter space they occupy increases even faster. Even very large data sets will occupy only a tiny portion of that parameter space defined by the dimensions. Models are fairly uniformly poor at predicting or characterizing regions of parameter space where they have few or no observations to train on.

A very rough rule of thumb is that you wish to have fewer than ⅒ as many dimensions/features as you do observations. However, even very large data sets perform best if feature engineering, dimensionality reduction, and/or feature selection can be used to reduce their parameter space to hundreds of dimensions (i.e. not thousands, often tens are better than hundreds).

However, as a flip side of the curse of dimensionality, we also sometimes see a "blessing of dimensionality." Linear models especially can perform very poorly with only a few dimensions to work with. The very same types of models can become very good if it is possible to obtain or construct additional (synthetic) features. Generally, this blessing occurs when models move from, e.g. 5 to 10 features, not when they move from 100 to 200 features.

As John von Neumann famously quipped: “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

Data artifact

An unintended alteration of data, generally as a consequence of hardware of software bugs. Some artifact can be caused by flaws in data collection instruments; others result from errors in transcription, collation, or data transfer. Data artifacts are often only detectable as anomolies in a data set.

Data frame

A data frame (sometimes "dataframe") is an abstraction of tabular data provided by a variety of programming languages and software library. At heart, a data frame bundles together multiple data-type homogeneous series or arrays (columns), enforcing a few regularities:

All columns in a data frame have the same number of data items within them (some might be explicitly a "missing" sentinel).
Each column has data items of the same data type.
Data may be selected by indicating collections of rows and collections of columns.
Predicates may be used to select row sets based on properties of data on a given row.
Operations on columns are expressed in a vectorized way, operating conceptually on all elements of a column simultaneously.
Both columns and rows may have names; in some libraries rows are only named by index position, but all name columns descriptively.

Popular data frames libraries include Python Pandas and Vaex, R data.table and tibble, Scala DataFrame, and Julia DataFrames.jl.

data.frame

The data frame library that is included with a standard R distribution. The R standard data.frame is the oldest data frame object for R, and remains widely used. However, either the Tidyverse tibble or the data.table library are generally preferable for new development, having been refined based on experience with data.frame.

See also: data frame, data.table, tibble

data.table

A popular data frame library for R. Philosophically, data.table tries to perform filtering, aggregation, and grouping all with standard arguments to its indexing operation. The data.table library has a somewhat different attitude than the Tidyverse, but is generally interoperable with it.

Data set

A data set is simply a collection of related data. Often, if the data is tabular, it will consist of a table; but it may be a number of related tables. In related data that is arranged in hierarchical or other formats, one or more files (in varying formats) may constitute the data set. Often, but not always, a data set is distributed as a single archive file containing all relevant components of it.

Denormalization

Denormalization is the duplication of data within a database system to allow for more "locality" of data to queries performed. This will result in larger storage size, but in many cases also in faster performance of read queries. Denormalization potentially introduces data integrity problems where data in different locations falls out of sync.

DMwR (Data Mining with R)

The R package DMwR includes functions and data accompanying the book Data Mining with R, learning with case studies by Luis Torgo, CRC Press 2010. A wide variety of utilities are included, but from the perspective of this book, it is mentioned because of its inclusion of a SMOTE implementation.

DOM (Document Object Model)

The Document Object Model (DOM) is a language-neutral application programming interface (API) for working with XML or HTML document. While the specification gives a collection of method names that might be implemented in any language, the inspiration and style is especially inspired by JavaScript.

Domain-specific knowledge

Much of data science, including even that part of it concerning this book topic, cleaning data, can be driven by "the shape of the data itself." Certain data items may follow patterns or stand out as anamolous on a purely numeric or analytic basis. However, in many cases, accurate judgements about which data is important, or which is of greater importance, lies not in the data themselves but in knowledge we have about the domains the data describe.

Domain-specific knowledge—or just "domain knowledge"—is what informs us of those distinctions that the data alone cannot reveal. Not all domain knowledge is extremely technical, the term might refer to topics that are more "common sense" as well. For example, it is general knowledge that outdoor temperatures in the northern hemisphere are usually higher in July than in January. A data set that conflicted with this background knowledge would be suspicious even if the individual data values were all, in themselves, in a reasonable numeric range. Bringing that very common domain knowledge to a problem is important, where applicable.

Equally, some domain knowledge requires deep subject-area expertise. Data in a psychological survey might show particular population distributions of subscales from the Minnesota Multiphasic Personality Inventory (MMPI). Some distributions might be implausible and indicate likely data integrity or sample bias problems, but a specialized knowledge is needed to judge that. Or radio astronomy data might show particular emission frequency bands from distant objects. A specialized knowledge is needed to determine whether that is consistent with expectations of Hubble red-shift distances or might be data errors. Likewise in many domains.

Eagerness

In computer programming and computer science, sometimes the words "lazy" and "eager" are used to distinguish approaches to solving a larger problem. Commonly, for example, an algorithm might transform a large data set. An eager program will process all the data at once. In contrast, a lazy program will only perform an individual transformation when that specific result is needed.

Elasticsearch

Elasticsearch is a search engine based on the Lucene library. As a part of implementing a search engine, Elasticsearch contains a document-oriented database or data store.

Endianness

Endianness in computer representations of numbers is typically either big-endian or little-endian. This refers to the scaled magnitude of composite values stored in a particular order. Most typically, the composite values are bytes, and they are arranged into "words" of 16-bits, 32-bits, 64-bits, or 128-bits (i.e. 2, 4, 8, or 16 bytes per word).

For example, suppose we wish to store an (unsigned) integer value in a contiguous 32-bit word. Computer systems and filesystems typically have an addressing resolution of one byte, not of individual bits directly, so this is 4 such slots in which scaled values may be stored. For example, we wish to store the number 1,908,477,236.

First, we can notice that since each byte stores values 0-255, this is a reasonable way to describe that number:

$$1,908,477,236 = (52 \times 2^0) + (13 \times 2^8) + (193 \times 2^{16}) + (113 \times 2^{24})$$

Storing values in each of the 4 bytes in the word could use either of these approaches:

Byte-order	Byte 1	Byte 2	Byte 3	Byte 4
Little-endian	52	13	193	113
Big-endian	113	193	13	52

Historically, most CPUs used only one of big-endian and little-endian word representation, but most modern CPUs offer switchable bi-endianess. Likewise, many libraries such as NumPy allow flexibility in reading and writing data of different endianness in storage format.

Formats other than computer words used to store numeric values may also be endian. Notably, different date formats can be big-endian, little-endian, or indeed middle-endian. For example, ISO-8601 date format prescribes the big-endianness, e.g. 2020-10-31. The year represents the largest magnitude, month next largest, and day number the smallest resolution of a date. The extension to time components is similar.

In contrast, a common United States date format can read, e.g. October 31, 2020. A spelled out month name indirectly represents a number here (numbers are also used with the same endianness and different delimiter, e.g. 10/31/2020). From an endianness perspective, this is middle-endian. The largest magnitude (year) is placed at the end, the next largest magnitude (month) at the start, and the smallest magnitude (day) in the middle. Clearly, a different middle-endian format is also possible, but is not widely used (e.g. 2020 31 Oct).

Much of the world outside of the United States uses a little-endian date representation, such as 31/10/2020. While the specific values in the representation of October 31 would disambiguate the endianness used, for dates such as October 11 or November 10, this is not the case.

F1 Score

In a classification model, there are numerous metrics that might express the "goodness" of a model. F1 score blends recall and precision avoiding the extremes that occur in certain models, and is often a balanced metric. F1 score is derived as:

$$\text{F1} = 2 \times \cfrac{precision \times recall}{precision + recall}$$

Related concepts: accuracy, precision, recall

Feature (see Column)

Synonyms: column, field, measurement, variable

Field (see Column)

Synonyms: column, feature, measurement, variable

Fuzzy

Fuzzy is a Python library for analyzing phonetic similarity in English texts.

GDBM (GNU dbm)

GDBM is an open source library for providing key/value storage systems.

General Decimal Arithmetic Specification

The General Decimal Arithmetic Specification is a standard for implementation of arbitrary precision base-10 arithmetic and numeric representation. It incorporates configurable "contexts" such as rounding rules in effect. The Python standard library decimal module, in particular, is an implementation of this standard.

Gensim

Gensim is an open-source Python library for natural language processing (NLP), specifically around unsupervised topic modeling. Gensim contains an implementation of the word2vec algorithm and a few closely related variants of it.

Gibibyte (GiB)

Metric prefixes are standardized in the International System of Units (SI), by the International Bureau of Weights and Measures (BIPM). Orders of magnitude—powers of 10—are indicated by prefixes ranging from yotta- ($10^{24}$) down to yocto- ($10^{-24}$). In partcular, the multipliers of $10^3$ (kilo-), $10^6$ (mega-), and $10^9$ (giga-) are almost right for dealing with typical quantities seen in computer storage.

However, for both historical and practical reasons, bytes of memory or storage are typically expressed as multiples of $2^{10}$ (1024) rather than of $10^3$ (1000). These numbers are relatively close, but while it is common to misname $2^{10}$, $2^{20}$, and $2^{30}$ as kilobyte, megabyte, and gigabyte, these are wrong. Since 1998, the International Electrotechnical Commission (IEC) has standardized the use of kibibyte (GiB), mebibyte (MiB), gibibyte (GiB) for accurate description of these powers of 2. For larger sizes, we also have tebibyte (TiB), pebibyte (PiB), exbibyte (EiB), zebibyte (ZiB), and yobibyte (YiB).

ggplot2

A popular book, The Grammar of Graphics (Statistics and Computing), by Leland Wilkinson (ISBN: 978-0387245447), first published in 2000, introduced a way of thinking about graphs and data visualizations that breaks down a graph into components that can be expressed independently. Changing one such orthogonal component may change the entire appearance of a graph, but will still reflect the same underlying data in a different manner.

The R library ggplot2 attempts to translate the concepts of that book into concrete APIs, and has been widely adopted by the R community. The Python libraries ggplot, to a strong degree, and Bokeh and Altair, to a somewhat lesser extent, also try to emulate Wilkinson's "grammar." Altair is in turn, built on top of Vega-Lite and Vega which have a similar goal as JavaScript libraries.

Glob

A common and simple pattern-matching language that is most frequently used to identify collections of filenames. Both the Bash shell and libraries in many programming languages support this syntax.

GQL (Graph Query Language)

Graph Query Language is a (pending) standard for querying graph databases, based on the Cypher language developed by Neo4j for their product.

Gremlin

Gremlin is a graph query language, distinct from GQL. Queries in Gremlin emphasize a "fluent programming" and functional style of description of nodes and classes of interest.

Halting problem

The halting problem is probably the most famous result in the theory of computation. Alan Turing proved in 1936 that there cannot exist any general purpose algorithm that answers the question "Will this program ever terminate?" For some programs it is provable, of course, but in the general case it is not. Even running a program for any finite amount of time, N steps, does not answer the question, since it might yet terminate at step N+1.

In slightly more informal parlance, saying that a given task is "equivalent to the halting problem" is an idiomatic way of saying that it cannot be solved. At times the phrase is used as a speculation about the difficulty of a problem, but at other times a mathematical proof is known that shows that solving the novel problem would imply a solution to the halting problem. Within this book, the phrase is used only in the strict sense, but with an affection for the jargon of computer science.

h5py

H5py is a Python library for working with hierarchical data sets stored in the HDF5 format.

HDF Compass

HDF Compass is an open source GUI tool for examing the content of HDF5 data files.

Hierarchical data format (HDF5)

The Hierarchical Data Format (HDF5) is an open source file format that supports large, complex, heterogeneous data. HDF5 uses a hierarchical structure that allows you to organize data within a file in nested groups. The "leaf" of a hierarchy is a dataset. An HDF5 file may contain arbitrary and domain-specific metadata about each dataset or group. Since many HDF5 files contain (vastly) more data than will fit in computer memory, tools that work with HDF5 generally provide a means of lazily reading content so that most data remains solely on disk unless or until it is needed.

Hyperparameter

In machine learning models, a general model type is often pre-configured before it is trained on actual data. Hyperparameters may comprise multipliers, numeric limits, recursion depths, algorithm variations, or other differences that still make up the same kind of model. Models can perform dramatically differently with different hyperparameters.

Idempotent

Idempotence is a useful concept in mathematics, computer science, and generally in programming. It means that calling the same function again on its own output will continue to produce the same answer. This is related to the even fancier concept in mathematics of an attractor.

Imager

Imager reads and writes many image formats and can perform a variety of analysis processing actions on such images programmatically within R. Images within the library are treated as 4-dimensional vectors with two spatial dimensions, one time dimension, and one color dimension. By including time as a dimension, imager can work with video as well.

imbalanced-learn

Imbalanced-learn is an open source Python software library for sensitive oversampling data. It implements the SMOTE (Synthetic Minority Oversampling TEchnique), ADASYN (Adaptive Synthetic), variations of those algorithms, as well as undersampling techniques. In the main, imbalanced-learn emulates the APIs of scikit-learn.

Imputation

The process of replacing missing data points with values that are likely, or at least plausible to allow machine learning or statistical tools to process all observations.

Interval variable (see NOIR)

Related concepts: categorical variable, continuous variable, nominal variable, ordinal variable, ratio variable

ISO-8601

ISO-8601 (Data elements and interchange formats – Information interchange – Representation of dates and times) is an international standard for the representation of dataes and times. For example, generating one while writing this entry, using Python:

>>> from datetime import datetime
>>> datetime.now().isoformat()
'2020-11-23T14:43:09.083771'

jq

jq is a flexible and powerful tool for command-line filtering, searching, and formatting JSON, including JSON Lines.

JSON (Javascript Object Notation)

JSON is a language-independent and human readable format for representation of the data structures and scalar values typically encountered in programming languages. It is widely used both as a data storage format and as a message format to communicate among services.

Jupyter

Project Jupyter is an open source library, written primarily in Python, but supporting numerous programming languages, to create, view, run, and edit "notebooks" for literate programming. This book was written using Jupyter Lab, and its notebooks can be obtained at the book's repository. In literate programming, code and documentation are freely interspersed while both rendering as formatted documents and running as executable code. Whereas R Markdown achieves similar goals using lightly annotated plain text, Jupyter uses JSON as the storage format for its notebooks.

Jupyter supports both the somewhat older "notebook" interface and the more recent "JupyterLab" interface. Both work with the same underlying notebook documents.

Kafka

Apache Kafka is an open source stream processor. As with other stream processors, and related message brokers, the aggregations of messages sent among systems is often a fruitful domain for data science analysis.

Kdb+

Kdb+ is a column-store database that was designed for rapid transactions. It is widely used within high-frequency trading.

Laziness

In computer programming and computer science, sometimes the words "lazy" and "eager" are used to distinguish approaches to solving a larger problem. Commonly, for example, an algorithm might transform a large data set. An eager program will process all the data at once. In contrast, a lazy program will only perform an individual transformation when that specific result is needed.

LDBM (Lightning Memory-Mapped Database)

LDBM is an open source library for providing key/value storage systems.

Lemmatization

Canonicalization of words to their grammatical roots for natural language processing purposes. In contrast to stemming, lemmatization will look at the context a word occurs in to try to derive both the simplified form and the part of speech.

For example, the English word "dog" is used both as a noun for the animal, and occasionally as a verb meaning "annoy." A lemmatization might produce:

we[PRON] dog[VERB] the[DET] dog[NOUN]

Related concept: stemming

Little-endian (see Endianness)

Data arranged into "words" (typically 32-bits), or other units, where the largest magnitude component (typically a byte) is stored in the earliest position.

MariaDB

MariaDB is a popular open source relational database managment system (RDBMS). It uses standard SQL for queries and interaction, and implements a few custom features on top of those required by SQL standards. At a point when the GPL-licensed MySQL was purchased by Oracle, its creator Michael (Monty) Widenius forked the project to create MariaDB. Widenius' elder daughter is named 'My' and his younger daughter 'Maria'.

MariaDB is API and ABI compatible with MySQL, but it adds a few features such as additional storage engines.

Matplotlib

Matplotlib is a powerful and versatile open soure plotting library for Python. For historical reasons, its API originally resembled MATLAB's, but a more object oriented approach is now encouraged. Numerous higher-level libraries and abstractions are built on top of Matplotlib, including Basemap, Cartopy, Geoplot, ggplot, holoviews, Seaborn, Pandas, and others.

Measurement (see Column)

Synonyms: column, feature, field, variable

Memcached

Software that keeps key/value associative arrays in memory for purposes of caching or proxying slower server responses. Although contents of a memcached server are transient, snapshotted contents may be useful to analyze for data science purposes.

Metaphone

Metaphone is an algorithm for phonetic canonicalization of English word, published by Lawrence Philips in 1990. The same author later published Double Metaphone, then Metaphone 3, each of which successively better take advantage of known patterns in words derived from non-English languages. Metaphone, and its followups, are more precise than the early Soundex developed for the same purpose.

Mojibake

Mojibake is the nonsensical text that generally results from trying to decode text using a character encoding different from that used to encode it. Often this will produce individual characters that belong to a given language or alphabet, but in combinations that make no sense (sometimes to humorous effect). The word comes from Japanese, meaning roughly "character transformation."

MonetDB

MonetDB is an open source column-oriented database management system that supports SQL and several other query languages or extensions.

MongoDB

MongoDB is a popular document-oriented database management system. It uses JSON-like storage of its underlying data, and both queries and responses use JSON documents. MongoDB uses a distinct query language that reflects its mostly hierarchical arrangement of data into linked documents.

MySQL

MySQL is a widely popular open source relational database management system (RDBMS). It uses standard SQL for queries and interaction, and implements a few custom features on top of those required by SQL standards. At a point when the GPL-licensed MySQL was purchased by Oracle, its creator Michael (Monty) Widenius forked the project to create MariaDB. Widenius' elder daughter is named 'My' and his younger daughter 'Maria'.

Neo4j

Neo4j is an open source graph database and database management system.

netcdf4-python

netcdf4-python is a Python interface to the netCDF C library.

Network Common Data Form (NetCDF)

NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is built on top of HDF5.

NLTK (Natural Language Toolkit)

NLTK is a suite of tools for natural language processing (NLP) in Python. It includes numerous corpora, tools for lexical analysis, for named entity recognition, a part of speech tagger, stemmers and lemmatizers, and a variety of other tools for NLP.

Node.js

Node.js is an open source, standalone JavaScript interpreter that runs outside of embedded JavaScript in web browsers. It can be used at the command line in the manner of scripting languages, with an interactive shell, or as a means to run server processes. The Node.js environment comes with an excellent package manager called npm (Node Package Manager) that allows you to install additional libraries easily (much like pip or conda for Python, RubyGems for Ruby, Cabal for Haskell, Pkg.jl: for Julia, Maven for Java, and so on).

Nominal variable (see NOIR)

Related concepts: categorical variable, continuous variable, interval variable, ratio variable

NOIR (Nominal, Ordinal, Interval, Ratio)

The acronym NOIR is sometimes used as a mnemonic for different feature types. This is the French word for "black" but is especially associated, in English, with a style of "dark" literature or film. The acronym stands for Nominal / Ordinal / Interval / Ratio.

Nominal or ordinal variables simply record which of a finite number of possible labels a data item records. This is sometimes called the classes of the variable.

Ordinal variables express a scale from low to high in the data values, but the spacing in the data may have little to no relationship to the underlying phenomenon. For example, perhaps a foot race records the first place, second place, third place, etc. winners, but not the times taken by each. 1st place crossed the line before 2nd place; but we have no information on whether it was milliseconds sooner or hours sooner. Likewise between 2nd and 3rd position, which might differ significantly from the first gap.

The last variable types are continuous variables, but interval and ratio variables are importantly different. The difference is in whether there is a "natural zero" in the data. The domain zero need not always be numeric zero, but commonly it is. Acidity or alkalinity measured on pH scale has a natural zero of 7, and generally values between 0 and 14 (although those are not sharp physical limits). If we used pH measure as a feature, we might re-center to numeric zero to express actual ratios (albeit, log ratios for this measure). It is reasonable to treated pH as a ratio variable.

As an example of an interval that is not a ratio, a newspaper article claimed that the temperature on a certain winter day, in some city, was twice as hot as in average years based on an artifact of the Fahrenheit scale in which a difference was between 25℉ and 50℉. This is nonsense as a ratio. It is perfectly useful to talk about the mean temperature or the standard deviation in temperature, but the numeric ratio is meaningless (in Celsius or Fahrenheit; in Kelvin or Rankine its minimally meaningful, but rarely used to describe temperatures in the range that occur on the surface of the earth). In contrast, the ratio variable of rainfall has a natural zero which is also numeric zero. Zero inches (or centimeters) of rain means there was none. 2 inches of rain is twice as much water falling as 1 inch of rain is.

NumPy

NumPy is an open source Python library for fast and vectorized computations on multi-dimensional arrays. Nearly all Python libraries that perform numeric or scientific computation rely on NumPy as an underlying support library. This includes tools in machine learning, modeling, statistics, visualization, and so on.

Observation (see Row)

Synonyms: record, row, sample, tuple

Ontology

Ontology in philosophy is the study of "what there is." In data science an ontology describes not only what class/subclass and class/instance relationship exist among entities, but also the kinds of features an entity has. Perhaps most importantly, an ontology can describe the kinds of relationships that can exist among various entities.

When different kinds of observations can be made, describing the particular collection of features that pertain to that observation, and the particular data types and ranges of permissible values each can take on, is an element of the ontology of the data. Different tables, or data subsets, may have different features sets and hence a different ontological role.

Ontology can be important for categorical data especially. Some labels may be instances of other labels, for example with varying degrees of specificity. If one categorical variable indicates the entity is "mammal", another that it is "feline", and another that it is "house cat" those are all possibly descriptions of the identical entity under different taxonomic levels, and hence part of the ontology of the domain.

The relationships among entities can sometimes be derived from the data themselves, but often requires domain knowledge. These relationships can often inform the kinds of models or statistical analysis that make sense. For example, if the entity underlying a collection of data is a medical patient, parts of the ontology of the domain might concern whether several different features observed were collected with the same instrument, or from the same blood sample, or whether the observations were made on the same day. Even though the features might measure very different quanties, the relationships "same-day" or "same-instrument" can inform analysis.

Ordinal variable (see NOIR)

Related concepts: categorical variable, continuous variable, interval variable, ratio variable

OrientDB

OrientDB is an open source, multi-model database management system. It supports graph, document, key/value, and object models. Querying may use either Gremlin or SQL.

Orthonormal basis

Within a highly dimensional space, specifically a parameter space, the location of an observation point is simply a parameterized sum of each of the dimensions. For example, if we measure 3 features in an observation as having values $a$, $b$, and $c$, we can express those measurements in 3-D parameter space, with orthogonal unit vectors $\vec{x}$, $\vec{y}$, and $\vec{z}$ as:

$$ observation = a\vec{x} + b\vec{y} + c\vec{z} $$

However, the choice to represent the observation using those particular unit vectors $\vec{x}$, $\vec{y}$, and $\vec{z}$ is somewhat arbitrary. As long as we choose any orthonormal basis—that is, N mutually perpendicular unit vectors—we can equally well represent all the relationships among observations. For example:

$$ a\vec{x} + b\vec{y} + c\vec{z} = a′\vec{x′} + b′\vec{y′} + c′\vec{z′} $$

Decompositions are a means of selecting an alternate orthonormal basis that distributes the data within the parameter space in a more useful way. Usually this means in a way concentrating variance within the initial components (lowest numbered axes).

Pandas

Pandas is a widely popular, open source, Python library for working with data frames. The name derives from the econometrics term "panel data." Pandas is built on top of NumPy, but adds numerous additional capabilities. One of the great strengths of Pandas is working with time-series data. But as with the underlying NumPy array library and other data frame libraries, most operations on columns are fast and vectorized.

Parameter space

The parameter space of a set of observations with N features is simply an N-dimensional space in which each observation occupies a single point. By default, the vector bases that define the location of a point correspond directly with the features themselves. For example, in analyzing weather data we might define "temperature" as the X-axis, "humidity" as the Y-axis, and "barametric pressure" as the Z-axis. Some portion of that 3-D space has points within it, and they form some pattern or shape that models might analyze and make predictions about.

Under decompositions of the features, we might choose a new orthonormal basis in which to represent the same data points in a rotated or mirrored N-dimensional space.

Parquet

Apache Parquet is an open source, column-oriented data storage format that originated in the Hadoop ecosystem, but is widely supported in other programming languages as well.

PDF (Portable Document Format)

Portable Document Format is a widely used format used to accurately represent the appearance of documents in a cross-platform, cross-device manner. For example, the same document will look nearly identical on a computer monitor, a personal printer, or from a professional press. Fonts, text, images, colors, and lines are some of the elements PDF renders to a page, whether displayed or printed. PDF was developed by Adobe, but is currently governed by the open and freely usable standard ISO 32000-2.

Pillow (forked from PIL)

The Python Imaging Library reads and writes many image formats and can perform a variety of processing actions on such images programmatically within Python.

Poppler

An open source viewing and processing library for Portable Document Format (PDF). In particular, Poppler contains numerous command-line tools for converting PDF files to other formats, including text. Poppler is a fork of Xpdf that aims to incorporate additional capabilities.

PostgreSQL

PostgreSQL is a widely popular open source relational database management system (RDBMS). It uses standard SQL for queries and interaction, and implements custom features and numerous custom data types on top of those required by SQL standards.

Precision

In a classification model, there are numerous metrics that might express the "goodness" of a model. Precision is also called "positive predictive value" and is the fraction of relevant observations among the predicted observations. More informally, precision answers the question "given it was predicted, how likely is the prediction to be accurate?"

For example, consider this hypothetical confusion matrix:

Predict/Actual	Human	Octopus	Penguin
Human	5	0	2
Octopus	3	3	3
Penguin	0	1	11

In a binary problem, this can be expressed as:

$$\text{Precision} = \frac{true\: positive}{true\: positive + false\: positive}$$

For a multiclass problem, as in the confusion matrix, each label has its own precision. Given the 8 true humans in the data set, 5 of them were correctly identified. However, 2 non-humans were also so identified. I.e.:

$$\text{Precision}_{human} = \frac{5}{5 + 2} \approx 71\%$$

An overall precision for a model is often given by averaging (weighted or unweighted) the precision for each label.

Related concepts: accuracy, F1 score, recall

PyTables

PyTables is a Python library for working with hierarchical data sets stored in the HDF5 format.

Query planner

When a query is formulated against a database, whether using SQL or another querying language, the database management system (DBMS) will internally create a set of planned steps involved in executing that query. Many DBMSs can expose these plans prior to executing them; users can use this information to judge the efficiency of database access (and possibly modify queries or refactor the databases themselves).

A query planner will make decisions about which indices to use, in what order, the style of search and comparisons across data that may live in many tables or documents, and other aspects of how a query may be executed efficiently. When accessing big data sets, the quality of a query planner can often differentiate different DBMSs.

R Markdown

R Markdown is a format and technology for literate programming. In literate programming, code and documentation are freely interspersed while both rendering as formatted documents and running as executable code. Whereas Jupyter notebooks, which have many of the same qualities, are stored as JSON documents, R Markdown is purely an extension of the easily human readable and editable Markdown format which lightly annotates plain text with regular punctuation characters to describe specific visual and conceptual elements. With R Markdown, code segments are also included as plain text by indicating their sections with a textual annotation.

RabbitMQ

RabbitMQ is an open source message broker. As with other message brokers, the aggregations of messages sent among systems is often a fruitful domain for data science analysis.

Ratio variable (see NOIR)

Related concepts: categorical variable, continuous variable, interval variable, nominal variable, ordinal variable

Recall

In a classification model, there are numerous metrics that might express the "goodness" of a model. Recall is also called "sensitivity." It is the fraction of true occurences that are identified by a model.

For example, consider this hypothetical confusion matrix:

Predict/Actual	Human	Octopus	Penguin
Human	5	0	2
Octopus	3	3	3
Penguin	0	1	11

In a binary problem, this can be expressed as:

$$\text{Recall} = \frac{true\: positive}{true\: positive + false\: negative}$$

For a multiclass problem, as in the confusion matrix, each label has its own recall. There are 8 true humans in the data set, 5 of them were correctly identified. However, 3 humans failed to be identified (in the whimsical example, all were predicted to be octopi). I.e.:

$$\text{Recall}_{human} = \frac{5}{5 + 3} \approx 62\%$$

An overall recall for a model is often given by averaging (weighted or unweighted) the recall for each label.

Related concepts: accuracy, F1 score, precision

Record (see Row)

Synonyms: observation, row, sample, tuple

Redis (Remote Dictionary Server)

Redis is an open source, in-memory key/value database. Redis supports numerous data types and data structures, including strings, lists, maps, sets, sorted sets, HyperLogLogs, bitmaps, streams, and spatial indices.

Relational database management system (RDBMS)

An RDBMS is a system to store data and implement the relational model developed by E. F. Codd in 1970. Under this relational model, data is stored in tables, with each row constituting a tuple of values, the keys to those values named by the columns of the table. The term "relational" in the name pertains to the fact that data in one table may be related to data in other tables by declaring foreign key relations and/or by performing joins in the query syntax.

For several decades, all RDBMSs have supported the SQL querying language, sometimes with optional extension syntax related to their additional features or data types. Often, but not quite always, RDBMSs are used on multi-user distributed servers, with transactions used to orchestrates write actions among those multiple users.

Popular RDBMSs include PostgreSQL, MySQL, SQLite, Oracle, Microsoft SQL Server, IBM DB2, and others.

Requests

Requests is a full-featured, open source HTTP access library for Python. It is not included in the Python standard library, but is ubiquitious and generally preferred to tools included with minimal Python distributions.

REST (Representational State Transfer)

REST is a software educational style that normatively describes patterns of interactions between HTTP servers and clients. The adjective RESTful is also frequently used. Under this style, the HTTP methods GET, POST, PUT, and DELETE are clearly separated by their intended functions. A main emphasis of the style is statelessness: each request must contain all information needed to elicit a response, and that response should not be dependent on the sequence of prior actions that client made.

rhdf5

Rhdf5 is an R library for working with hierarchical data sets stored in the HDF5 format.

rjson

Rjson is an R library for working with JavaScript Object Notation.

ROSE (Random Over-Sampling Examples)

ROSE is an R package that creates synthetic samplings in the presence of class imbalance. It serves a similar purpose to SMOTE oversampling.

Row

A collection of data consisting of multiple named data items pertaining to the same entity. Depending on the context, the entity can be defined in various ways. For an object in the physical world, for example, it is common in scientific, and other, procedures to take a number of different measurements of that same object, and a row will describe that object. In simulations or other mathematical modeling, a row may contain the results of synthetic sampling of possible values. Considered from the point of view of the actual storage of the data, a focus on the tuple or record structure of the row are more emphasized.

The named data items collected about a single row are generally indicated in the columns of the data. Each column may have a different data type within it, but each different row within that column will share the data type but not generally the data value.

Synonyms: observation, record, sample, tuple

rvest

The rvest package for R is used to scrape and extract data from HTML web pages.

Sample (see Row)

Synonyms: observation, record, row, tuple

Scikit-learn

Scikit-learn is a wide-ranging open source Python library for many machine learning (ML) and data science tasks. It implements a large number of ML models (both supervised and unsupervised), metrics, sampling techniques, decompositions, clustering algorithms, and other tools useful for data science. Throughout its capabilities, scikit-learn maintains a common API; many additional libraries have chosen to implement identical or compatible APIs was well.

Scipy.stats

Scipy.stats is a Python module in the NumPy ecosystem that implements many probability distributions and statistical functions.

Scrapy

Scrapy is a Python library for spidering and analyzing collections of web pages, including a high-performance engine to coordinate retrievals of many pages.

Seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

SeqKit

SeqKit is a toolkit for maninipulating files in the FASTA and FASTQ formats that are used for storing nucleotide and protein sequences.

Signed integer

An integer represented in computer bits of some specific length. In signed integers, one bit is reserved to hold the sign (negative or positive) of an integer. The largest integer that can be represented, for N bits storing a number, is $2^{N-1}-1$. The smallest integer that can be represented is $-2^{N-1}$

Sizes of integers in many programming languages match sizes of memory units in modern CPUs, and can be 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit. Other bit lengths are rarely defined. In data formats and databases, sizes might be defined by a number of decimal digits rather than binary bits. Some programming languages like Python, TCL, and Mathematica in their default integers, and numerous other programming languages using specific libraries, allow for arbitrary-precision integers that have no size bound. They do this by dynamically allocating more bits to store larger numbers as needed.

Solr

Apache Solr is a search engine based on the Lucene library. As a part of implementing a search engine, Solr contains a document-oriented database or data store.

spaCy

SpaCy is an open-source software library for advanced natural language processing. It is focused on production use and integrates with deep-learning frameworks.

SPARQL Protocol and RDF Query Language

Had J. B. S. Haldane lived later, he might have commented that Free Software developers have "an inordinate fondness for recursive acronyms" (YAML, GNU, etc.). SPARQL is a query language for RDF (Resource Description Framework), or the "semantic web." It has been implemented for a variety of programming languages. SPARQL expresses queries in the form of "subject-predicate-object" triples. This has some similarity to key/value stores, but more to graph databases.

Sphering (see whitening)

Normalization of data under a decomposition.

Synonym: whitening

SQLAlchemy

SQLAlchemy is a Python library that provides an "object-relational mapping" between the tabular and relational structure of RDBMS tables and an object-oriented interface. SQLAlchemy can use drivers for all popular SQL databases, and exposes a variety of methods for manipulating their data within Python.

SQLite

SQLite is a small, fast, self-contained, high-reliability, full-featured, SQL database engine that stores multiple data tables in single files. Binding to access SQLite (version 3) are available for all popular programming languages. The library also comes with a command line tool and shell for manipulation of data using only SQL.

State machine

A "finite-state machine, "finite automaton", or simply "state machine," is a model of computation in which focus moves among a finite number of states or nodes based on a specific sequence of input.

STDOUT / STDERR / STDIN

In Unix-like command shells there are three special files/streams called "standard output", "standard error" and "standard input." They are ubiquitously abbreviated as "STDOUT", "STDERR", and "STDIN" respectively. Composed command-line tools treat these streams in special ways, and they are utilized widely. In particular, STDOUT is usually "data" output while STDERR is usually "status" output, even though they may appear interspersed in terminal sessions.

Stemming

Canonicalization of words to their grammatical roots for natural language processing purposes. In contrast to lemmatization, stemming only treats words individually without their context, and hence can be less accurate.

Related concept: lemmatization

Structured data

While the term "unstructured data" is often used, it is somewhat of a misnomer. "Loosely structured" or "semi-structured" would be more accurate. For example, the paradigmatic example of textual data is at very least structured by the particular sequence in which words occur. Quite likely it is further organized by sequences belonging to chapters, separate messages, or other such units (themselves likely structured by sequence), and moreover usually a variety of metadata such as author identity, subject line, forum, thread, and so on, also pertain to the text itself.

Tab-separated values (TSV; see Comma-separated values)

Delimted files where tabs are used as the line delimiter.

Tabula

Tabula-java is the underlying engine for the GUI tool Tabula. Other bindings include tabula-extractor for Ruby, tabula-py for Python, tabulizer for R, and tabula-js for Node.js. The engine and the tools that utilize it provide interfaces to extract tabular data represented in PDF documents.

Taxonomy

Taxonomy is, in some sense, a special aspect of ontology, it describes the hierarchical relationships among categories of entities. Some labels may be instances of other labels, for example with varying degrees of specificity. If one categorical variable indicates the entity is "mammal", another that it is "feline", and another that it is "house cat" those are all possibly descriptions of the identical entity under different taxonomic levels, and hence part of the ontology of the domain.

While taxonomy is largely narrower than ontology, taxonomy also tends to indicate a focus on the more global level of the domain, not a narrow region of that domain. When one speaks of a taxonomy, it generally indicates an interest in all the relationships among all the classes of entities, and an expectation that those relationships will be tree-like and hierarchical. One might describe ontological features of a single entity, or a small collection of entities, but a taxonomy will normally describe the entire domain of all possible entities.

tibble

The R library tibble is an implementation of the data frame abstraction, but one that tries to do less than other libraries. Quoting from the official documentation:

Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code.

See also: data.frame, data.table

Tidyverse

The Tidyverse is a collection of R packages that share a common philosophy of API design and that are designed to work well together. Core libraries of the Tidyverse are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats. A variety of other optional packages are also designed to work well with the base collection.

At core, the Tidyverse has an attitude of making data into "tidy" forms, in the sense discussed at more length in chapter 1. As well, the tools within the Tidyverse lend themselves to composition by piping data between methods in a "fluent programming" style.

Tuple (see Row)

Synonyms: observation, record, row, sample

Unsigned integer

An integer represented in computer bits of some specific length. In unsigned integers, no bits are reserved to hold the sign (negative or positive) of an integer, and hence only number zero through a maximum size can be represented. For N bits storing a number, the largest number representable is $2^N-1$.

Sizes of integers in many programming languages match sizes of memory units in modern CPUs, and can be 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit. Other bit lengths are rarely defined. In data formats and databases, sizes might be defined by a number of decimal digits rather than binary bits. Some programming languages like Python, TCL, and Mathematica in their default integers, and numerous other programming languages using specific libraries, allow for arbitrary-precision integers that have no size bound. They do this by dynamically allocating more bits to store larger numbers as needed.

Variable (see Column)

Synonyms: column, feature, field, measurement

Web 0.5

The term "Web 0.5" is a neologism and back-construction from the term "Web 2.0." The latter became popular as a term in the late 2000s. Whereas Web 2.0 was meant as an evolution of the World Wide Web into highly interactive, highly dynamic, visually rich content, Web 0.5 is meant to hearken back to the static, compact, and text-oriented web pages that were developed in the early 1990s. The writer Danny Yee publicized this term, to the minor extent it is used.

Web 0.5 web pages are intended primarily for human readership, in contrast to RESTful web services that are primarily intended to communicate data among computer servers and applications. Their simplicity, however, also makes them easily accessible to web scraping techniques, where relevant.

Whitening

Normalization of data under a decomposition. Transformations such as Principle Component Analysis (PCA) reduce the variance of each subsequent component successively. Whitening is simply rescaling the data within each component to a common scale and center.

Synonym: sphering

XML (eXtensible Markup Language)

XML is a markup language that defines a grammar for representing documents and ancillary schema languages for defining dialects within that broad grammar. The content of XML is always text, and is in-principle human readable while also enforcing a strict structure for automated processing. In essence, XML defines a hierarchical format in which arbitrary elements may be arranged.

XML is used widely in domains such as internal formats for office applications, for representing geospatial data, for message passing among cooperating services, for scientific data, and for many other application uses.

Xpdf

An open source viewing and processing library for Portable Document Format. In particular, Xpdf contains several command-line tools for converting PDF files to other formats, including text. The fork Poppler aims to incorporate additional capabilities that the Xpdf authors consider out of scope for that project.

YAML

YAML is, light-heartedly, an acronym for either "YAML Ain't Markup Language" or "Yet Another Markup Language." It is intended as a highly human-readable and human-writable format that can represent most of the data structures and data types widely used in programming languages. Libraries supporting reading and writing YAML from or to native data structures are available for numerous programming languages.

Glossary