All language is but a poor translation.
–Franz Kafka
Sometimes data lives in formats that take extra work to ingest. For common and explicitly data-oriented formats, common libraries already have readers built into them. Data frame libraries, for example, read a huge number of different file types. At worst, slightly less common formats have their own more specialized libraries that provide a relatively straightforward path between the original format and the general purpose data processing library you wish to use.
A greater difficulty often arises because a given format is not per se a data format, but exists for a different purpose. Nonetheless, there is often data somehow embedded or encoded in the format that we would like to utilize. For example, web pages are generally designed for human readers and rendered by web browsers with "quirks modes" that deal with not-quite-HTML, as is so often necessary. Portable Document Format (PDF) documents are similar in being intended for human readers, and yet they also often contain tabular or other data that we would like to process as data scientists. Of course, in both cases, we would rather have the data itself in some separate, easily ingestible format; but reality does not always live up to our hopes. Image formats likewise are intended for presentation of pictures to humans; but we sometimes wish to characterize or analyze collections of images in some data science or machine learning manner. There is a bit of a difference between Hypertext Markup Language (HTML) and PDF on one hand, and images on the other hand. With the former, we hope to find tables or numeric lists that are incidentally embedded inside a textual document. With images, we are interested in the format itself as data: what is the pattern of pixel values, and what does that tell us about characteristics of the image as such?
Still other formats are indeed intended as data formats, but they are unusual enough that common readers for the formats will not be available. Generally, custom text formats are manageable, especially if you have some documentation of what the rules of the format are. Custom binary formats are usually more work, but possible to decode if the need is sufficiently pressing and other encodings do not exist. Mostly such custom formats are legacy in some way, and a one-time conversion to more widely used formats is the best process.
Before we get to the sections of this chapter, let us run our standard setup code.
from src.setup import *
%load_ext rpy2.ipython
%%capture --no-stdout err
%%R
library(imager)
library(tidyverse)
library(rvest)
Important letters which contain no errors will develop errors in the mail.
–Anonymous
Concepts:
A great deal of interesting data lives on web pages, and often, unfortunately, we do not have access to the same data in more structured data formats. In the best cases, the data we are interested in at least lives within HTML tables inside of a web page; however, even where tables are defined, often the content of the cells has more than only the numeric or categorical values of interest to us. For example, a given cell might contain commentary on the data point or a footnote providing a source for the information. At other times, of course, the data we are interested in is not in HTML tables at all, but structured in some other manner across a web page.
In this section, we will first use the R library rvest to extract some tabular data, then use BeautifulSoup in Python to work with some non-tabular data. This shifting tool choice is not because one tool or the other is uniquely capable of doing the task we use it for, nor even is one necessarily better than the other at it. I simply want to provide a glimpse into a couple different tools for performing a similar task.
In the Python world, the framework Scrapy is also widely used; it does both more and less than BeautifulSoup. Scrapy can actually pull down web pages and navigate dynamically amongst them, while BeautifulSoup is only interested in the parsing aspect; it assumes you have used some other tool or library (such as Requests) to actually obtain the HTML resource to be parsed. For what it does, BeautifulSoup is somewhat friendlier and is remarkably good at handling malformed HTML. In the real world, what gets called "HTML" is often only loosely conformant to any actual format standards, and hence web browsers, for example, are quite sophisticated (and complicated) in providing reasonable rendering of only vaguely structured tag soups.
At the time of this writing, in 2020, the Covid-19 pandemic is ongoing, and the exact contours of the disease worldwide are changing on a daily basis. Given this active change, the current situation is too much of a moving target to make a good example (and too politically and ethically laden). Let us look at some data from a past disease though to illustrate web scraping. While there are surely other sources for similar data we could locate, and some are most likely in immediately readable formats, we will collect our data from the Wikipedia article on the 2009 flu pandemic.
A crucial fact about web pages is that they can be, and often are, modified by their maintainers. There are times when the Wayback Machine (https://archive.org/web/) can be used to find specific historical versions. Data that is available at a given URL at one point in time may not remain available there in the future. Even where a web page maintains the same underlying information, it may change details of its format in ways that would break our scripts for processing the page. On the other hand, many changes represent exactly the updates in data values that are of interest to us, and the dynamic nature of a web page is exactly its greatest value. These are tradeoffs to keep in mind when scraping data from the web.
Wikipedia has a great many virtues, and one of them is its versioning of its pages. While a default URL for a given topic has a friendly and straightforward spelling that can often even be guessed from the name of a topic, Wikipedia also provides a URL parameter in its query strings that identifies an exact version of the web page that should remain bitwise identical for all time. There are a few exceptions to this permanence; for example, if an article is deleted altogether it may become inaccessible. Likewise if a template is part of a renaming, as unfortunately occurred during the writing of this book, a "permanent" link can break. Let us examine the Wikipedia page we will attempt to scrape in this section.
# Same string composed over two lines for layout
# XXXX substituted for actual ID because of discussed breakage
url2009 = ("https://en.wikipedia.org/w/index.php?"
           "title=2009_flu_pandemic&oldid=XXXX")
The particular part of that previous page that we are interested in is an infobox about halfway down the article. It looks like this in my browser:
Image: Wikipedia Infobox in Article "2009 Flu Pandemic"
Constructing a script for web scraping inevitably involves a large amount of trial-and-error. In concept, it might be possible to manually read the underlying HTML before processing it, and correctly identify the positions and types of the elements of interest. In practice, it is always quicker to eyeball the partially filtered or indexed elements, and refine the selection through repetition. For example, in this first pass, I determined that the "cases by region" table was number 4 on the web page by enumerating through earlier numbers and visually ruling them out. As rendered by a web browser, it is not always apparent which element is a table; it is also not necessarily the case that an element rendered visually above another actually occurs earlier in the underlying HTML.
This first pass also already performs a little bit of cleanup in value names. Through experimentation, I determined that some region names contain an HTML <br/> which, if simply stripped during table parsing, would leave no space between words. In order to address that, I replace the HTML break with a space, then need to reconstruct an HTML object from the string and select the table again.
page <- read_html(url2009)
table <- page %>%
    html_nodes("table") %>%
    .[[4]] %>%
    str_replace_all("<br>", " ") %>%
    minimal_html() %>%
    html_node("table") %>%
    html_table(fill = TRUE)
head(table, 3)
This code produced the following (before the template change issue):
          2009 flu pandemic data  2009 flu pandemic data  2009 flu pandemic data
1                            Area         Confirmed deaths                   <NA>
2               Worldwide (total)                   14,286                   <NA>
3         European Union and EFTA                    2,290                   <NA>
Although the first pass still has problems, all the data is basically present, and we can clean it up without needing to query the source further. Because of the nested tables, the same header is incorrectly deduced for each column. The more accurate headers are relegated to the first row. Moreover, an extraneous column that contains footnotes was created (it has content in some rows below those shown by head()). Because of the commas in numbers over a thousand, integers were not inferred. Let us convert the data.frame to a tibble and clean it up.
data <- as_tibble(table, .name_repair = ~ c("Region", "Deaths", "drop")) %>%
    select(-drop) %>%
    slice(2:12) %>%
    mutate(Deaths = as.integer(gsub(",", "", Deaths)),
           Region = as.factor(Region))
data
And this might give us a helpful table like:
# A tibble: 11 x 2
   Region                                     Deaths
   <fct>                                       <int>
 1 Worldwide (total)                           14286
 2 European Union and EFTA                      2290
 3 Other European countries and Central Asia     457
 4 Mediterranean and Middle East                1450
 5 Africa                                        116
 6 North America                                3642
 7 Central America and Caribbean                 237
 8 South America                                3190
 9 Northeast Asia and South Asia                2294
10 Southeast Asia                                 393
11 Australia and Pacific                          217
Obviously this is a very small example that could easily be typed in manually. The general techniques shown might be applied to a much larger table. More significantly, they might also be used to scrape a table on a web page that is updated frequently. 2009 is strictly historical, but other data is updated every day, or even every minute, and a few lines like the ones shown could pull down current data each time it needs to be processed.
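As a rough sketch of such a recurring pull in Python (not the approach used above), pandas.read_html can grab every table on a page in a couple of lines. The table index and any cleanup would need to be re-verified against the actual page, and the XXXX placeholder stands in for the version ID exactly as discussed earlier.

import pandas as pd

# Sketch only: pull every <table> from the pinned page and inspect them.
# XXXX is the placeholder version ID discussed above; the index of the
# "cases by region" table must be found by trial-and-error, as with rvest.
url2009 = ("https://en.wikipedia.org/w/index.php?"
           "title=2009_flu_pandemic&oldid=XXXX")
tables = pd.read_html(url2009)
print(f"Found {len(tables)} tables")
print(tables[3].head())   # hypothetical index; verify by inspection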
For our processing of a non-tabular source, we will use Wikipedia as well. Again, a topic that is of wide interest and not prone to deletion is chosen. Likewise, a specific historical version is indicated in the URL, just in case the page changes its structure by the time you read this. In a slightly self-referential way, we will look at the article that lists HTTP status codes in a term/definition layout. A portion of that page renders in my browser like this:
Image: HTTP Status Codes, Wikipedia Definition List
Numerous other codes are listed in the article that are not in the screenshot. Moreover, there are section divisions and other descriptive elements or images throughout the page. Fortunately, Wikipedia tends to be very regular and predictable in its use of markup. The URL we will examine is:
url_http = ("https://en.wikipedia.org/w/index.php?"
            "title=List_of_HTTP_status_codes&oldid=947767948")
The first thing we need to do is actually retrieve the HTML content. The Python standard library module urllib is perfectly able to do this task. However, even its official documentation recommends using the third-party package Requests for most purposes. There is nothing you cannot do with urllib, but often the API is more difficult to use, and is unnecessarily complicated for historical/legacy reasons. For simple things, like what is shown in this book, it makes little difference; for more complicated tasks, getting in the habit of using Requests is a good idea. Let us open a page and check the status code returned.
import requests
resp = requests.get(url_http)
resp.status_code
200
The raw HTML we retrieved is not especially easy to work with. Even apart from the fact it is compacted to remove extra whitespace, the general structure is a "tag soup" with various things nested in various places, and in which basic string methods or regular expressions do not help us very much in identifying the parts we are interested in. For example, here is a short segment from somewhere in the middle.
pprint(resp.content[43400:44000], width=55)
(b'e_ref-44" class="reference"><a href="#cite_note-' b'44">[43]</a></sup></dd>\n<dt><span class=' b'"anchor" id="412"></span>412 Precondition Failed' b' (<a class="external mw-magiclink-rfc" rel="nofo' b'llow" href="https://tools.ietf.org/html/rfc7232"' b'>RFC 7232</a>)</dt>\n<dd>The server does not meet' b' one of the preconditions that the requester put' b' on the request header fields.<sup id="cite_ref-' b'45" class="reference"><a href="#cite_note-45">&#' b'91;44]</a></sup><sup id="cite_ref-46" class=' b'"reference"><a href="#cite_note-46">[45]' b'</a></sup></dd>\n<dt><span class="anchor" id="413' b'"></span>413 Payload Too')
What we would like is to make the tag soup beautiful instead. The steps in doing so are first creating a "soup" object from the raw HTML, then using methods of that soup to pick out the elements we care about for our data set. As with the R and rvest version—as indeed, with any library you decide to use—finding the right data in the web page will involve trial-and-error.
from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.content)
As a start at our examination, we noticed that the status codes themselves are each contained within an HTML <dt> element. Below we display the first and last few of the elements identified by this tag. Everything so identified is, in fact, a status code, but I only know that from manual inspection of all of them (fortunately, eyeballing fewer than 100 items is not difficult; doing so with a million would be infeasible). However, if we look back at the original web page itself, we will notice that two AWS custom codes at the end are not captured because the page formatting is inconsistent for those. In this section, we will ignore those, having determined they are not general purpose anyway.
codes = soup.find_all('dt')
for code in codes[:5] + codes[-5:]:
    print(code.text)
100 Continue
101 Switching Protocols
102 Processing (WebDAV; RFC 2518)
103 Early Hints (RFC 8297)
200 OK
524 A Timeout Occurred
525 SSL Handshake Failed
526 Invalid SSL Certificate
527 Railgun Error
530
It would be nice if each <dt> were matched with a corresponding <dd>. If it were, we could just read all the <dd> definitions and zip them together with the terms. Real-world HTML is messy. It turns out (and I discovered this while writing, not by planning the example) that there are sometimes several, and potentially zero, <dd> elements following each <dt>. Our goal then will be to collect all of the <dd> elements that follow a <dt> until other tags occur.
In the BeautifulSoup API, the empty space between elements is a node of plain text that contains exactly the characters (including whitespace) inside that span. It is tempting to use the method node.find_next_siblings() for this task. We could succeed doing this, but that method will fetch too much, including all subsequent <dt> elements after the current one. Instead, we can use the property .next_sibling to get each one, and stop when needed.
def find_dds_after(node):
    dds = []
    sib = node.next_sibling
    while True:                 # Loop until a break
        # Last sibling within page section
        if sib is None:
            break
        # Text nodes have no element name
        elif not sib.name:
            sib = sib.next_sibling
            continue
        # A definition node
        if sib.name == 'dd':
            dds.append(sib)
            sib = sib.next_sibling
        # Finished the <dd> definition nodes
        else:
            break
    return dds
The custom function I wrote above is straightforward, but special to this purpose. Perhaps it is extensible to similar definition lists one finds in other HTML documents. BeautifulSoup provides numerous useful APIs, but they are building blocks for constructing custom extractors rather than foreseeing every possible structure in an HTML document. To understand it, let us look at a couple of the status codes.
for code in codes[23:26]:
    print(code.text)
    for dd in find_dds_after(code):
        print("   ", dd.text[:40], "...")
400 Bad Request
    The server cannot or will not process th ...
401 Unauthorized (RFC 7235)
    Similar to 403 Forbidden, but specifical ...
    Note: Some sites incorrectly issue HTTP  ...
402 Payment Required
    Reserved for future use. The original in ...
The HTTP 401 response contains two separate definition blocks. Let us apply the function across all the HTTP code numbers. What is returned is a list of definition blocks; for our purpose we will join the text of each of these with a newline. In fact, we construct a data frame with all the information of interest to us in the next cells.
data = []
for code in codes:
    # All codes are 3 character numbers
    number = code.text[:3]
    # parenthetical is not part of status
    text, note = code.text[4:], ""
    if " (" in text:
        text, note = text.split(" (")
        note = note.rstrip(")")
    # Compose description from list of strings
    description = "\n".join(t.text for t in find_dds_after(code))
    data.append([int(number), text, note, description])
From the Python list of lists, we can create a Pandas DataFrame for further work on the data set.
(pd.DataFrame(data, columns=["Code", "Text", "Note", "Description"])
   .set_index('Code')
   .sort_index()
   .head(8))
                     Text              Note  Description
Code
100              Continue                    The server has received the request headers an...
101   Switching Protocols                    The requester has asked the server to switch p...
102            Processing  WebDAV; RFC 2518  A WebDAV request may contain many sub-requests...
103            Checkpoint                    Used in the resumable requests proposal to res...
103           Early Hints          RFC 8297  Used to return some response headers before fi...
200                    OK                    Standard response for successful HTTP requests...
201               Created                    The request has been fulfilled, resulting in t...
202              Accepted                    The request has been accepted for processing, ...
Clearly, the two examples this book walked through in some detail are not general to all the web pages you may wish to scrape data from. Organization into tables and into definition lists are certainly two common uses of HTML to represent data, but many other conventions might be used. Particular domain-specific (or, more likely, page-specific) class and id attributes on elements are also a common way to mark the structural role of different data elements. Libraries like rvest, BeautifulSoup, and Scrapy all make identification and extraction of HTML by element attributes straightforward as well. Simply be prepared to try many variations on your web scraping code before you get it right. Generally, your iteration will be a narrowing process; as long as each stage includes the information desired, refinement becomes a matter of removing the parts you do not want.
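As a small sketch of such attribute-based selection with BeautifulSoup, consider the following; the HTML snippet and its class and id names are invented placeholders for illustration, not taken from any page discussed above.

from bs4 import BeautifulSoup

# Invented HTML snippet where class/id attributes mark the data roles
html = """
<div id="prices">
  <span class="product">Widget</span><span class="price">3.99</span>
  <span class="product">Gadget</span><span class="price">12.50</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
container = soup.find(id="prices")
products = [s.text for s in container.find_all("span", class_="product")]
prices = [float(s.text) for s in container.find_all("span", class_="price")]
print(list(zip(products, prices)))
# [('Widget', 3.99), ('Gadget', 12.5)]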
Another approach that I have often used for web scraping is to use the command-line web browsers lynx and links. Install either or both with your system package manager. These tools can dump HTML contents as text which is, in turn, relatively easy to parse if the format is simple. There are many times when just looking for patterns of indentation, vertical space, particular keywords, or similar text processing, will get the data you need more quickly than the trial-and-error of parsing libraries like rvest or BeautifulSoup. Of course, there is always a certain amount of eyeballing and retrying of commands. For people who are well versed in text processing tools, this approach is worth considering.
The two similar text-mode web browsers share a -dump switch that outputs non-interactive text to STDOUT. Both of them also have a variety of other switches that tweak the rendering of the text in different ways. The output from the two tools is similar, but the rest of your scripting will need to pay attention to the minor differences. Each of these browsers will do a very good job of dumping 90% of web pages as text that is easy to process. Of the problem 10% (a hand-waving percentage, not a real measure), often one or the other tool will produce something reasonable to parse. In certain cases, one of these browsers may produce useful results and the other will not. Fortunately, it is easy simply to try both for a given task or site. Let us look at the output from each tool against a portion of the HTTP response code page. Obviously, I experimented to find the exact line ranges of output that would correspond. You can see that only incidental formatting differences exist in this friendly HTML page. First with lynx:
%%bash
base='https://en.wikipedia.org/w/index.php?title='
url="$base"'List_of_HTTP_status_codes&oldid=947767948'
lynx -dump $url | sed -n '397,406p'
   requester put on the request header fields.^[170][44]^[171][45]

   413 Payload Too Large ([172]RFC 7231)
          The request is larger than the server is willing or able to
          process. Previously called "Request Entity Too Large".^[173][46]

   414 URI Too Long ([174]RFC 7231)
          The [175]URI provided was too long for the server to process.
          Often the result of too much data being encoded as a
          query-string of a GET request, in which case it should be
And the same part of the page again, but this time with links:
%%bash
base='https://en.wikipedia.org/w/index.php?title='
url="$base"'List_of_HTTP_status_codes&oldid=947767948'
links -dump $url | sed -n '377,385p'
   requester put on the request header fields.^[44]^[45]

   413 Payload Too Large (RFC 7231)
         The request is larger than the server is willing or able to
         process. Previously called "Request Entity Too Large".^[46]

   414 URI Too Long (RFC 7231)
         The URI provided was too long for the server to process. Often
         the result of too much data being encoded as a query-string of a GET
The only differences here are one space difference in indentation of the definition element and some difference in the formatting of footnote links in the text. In either case, it would be easy enough to define some rules for the patterns of terms and their definitions. Something like this:
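For instance, a minimal sketch in Python, assuming the links-style dump above has been saved to a hypothetical file named status_codes.txt; it treats shallowly indented lines that begin with a three-digit code as terms, and the more deeply indented lines that follow as their definitions.

import re

# Term lines look like "   413 Payload Too Large (RFC 7231)";
# definition lines that follow are indented more deeply.
term_pat = re.compile(r'^ {1,4}(\d{3}) (\S.*)$')

entries, current = {}, None
with open('status_codes.txt') as fh:        # hypothetical filename
    for line in fh:
        match = term_pat.match(line)
        if match:
            current = int(match.group(1))
            entries[current] = [match.group(2).strip(), ""]
        elif current is not None and line.strip():
            entries[current][1] += line.strip() + " "

for code, (title, definition) in sorted(entries.items()):
    print(code, title, "::", definition[:40], "...")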
Let us wave goodbye to the Scylla of HTML, as we pass by, and fall into the Charybdis of PDF.
This functionary grasped it in a perfect agony of joy, opened it with a trembling hand, cast a rapid glance at its contents, and then, scrambling and struggling to the door, rushed at length unceremoniously from the room and from the house.
–Edgar Allan Poe
Concepts:
There are a great many commercial tools to extract data which has become hidden away in PDF (portable document format) files. Unfortunately, many organizations—government, corporate, and others—issue reports in PDF format but do not provide data formats more easily accessible to computer analysis and abstraction. This is common enough to have provided impetus for a cottage industry of tools for semi-automatically extracting data back out of these reports. This book does not recommend use of proprietary tools about which there is no guarantee of maintenance and improvement over time; as well, of course, those tools cost money and are an impediment to cooperation among data scientists who work together on projects without necessarily residing in the same "licensing zone."
There are two main elements that are likely to interest us in a PDF file. An obvious one is tables of data, and those are often embedded in PDFs. Otherwise, a PDF can often simply be treated as a custom text format, as we discuss in a section below. Various kinds of lists, bullets, captions, or simply paragraph text, might have data of interest to us.
There are two open source tools I recommend for extraction of data from PDFs. One of these is the command-line tool pdftotext, which is part of the Xpdf and derived Poppler software suites. The second is a Java tool called tabula-java. Tabula-java is in turn the underlying engine for the GUI tool Tabula, and also has language bindings for Ruby (tabula-extractor), Python (tabula-py), R (tabulizer), and Node.js (tabula-js). Tabula creates a small web server that allows interaction within a browser to do operations like creating lists of PDFs and selecting regions where tables are located. The Python and R bindings also allow direct creation of data frames or arrays, with the R binding incorporating an optional graphical widget for region selection.
For this discussion, we do not use any of the language bindings, nor the GUI tools. For one-off selection of one-page data sets, the selection tools could be useful, but for automation of recurring document updates or families of similar documents, scripting is needed. Moreover, while the various language bindings are perfectly suitable for scripting, we can be somewhat more language agnostic in this section by limiting ourselves to the command-line tool of the base library.
As an example for this section, let us use a PDF that was output from the preface of this book itself. There may have been small wording changes by the time you read this, and the exact formatting of the printed book or ebook will surely be somewhat different from this draft version. However, this nicely illustrates tables rendered in several different styles that we can try to extract as data. There are three tables, in particular, which we would like to capture.
Image: Page 5 of Book Preface
On page 5 of the draft preface, a table is rendered by both Pandas and tibble, with corresponding minor presentation differences. On page 7 another table is included that looks somewhat different again.
Image: Page 7 of Book Preface
Running tabula-java requires a rather long command line, so I have created a small bash script to wrap it on my personal system:
#!/bin/bash
# script: tabula
# Adjust for your personal system path
TPATH='/home/dmertz/git/tabula-java/target'
JAR='tabula-1.0.4-SNAPSHOT-jar-with-dependencies.jar'
java -jar "$TPATH/$JAR" $@
Extraction will sometimes automatically recognize tables per page with the --guess option, but you can get better control by specifying a portion of a page where tabula-java should look for a table. We simply output to STDOUT in the following code cells, but outputting to a file is just another option switch.
%%bash
tabula -g -t -p5 data/Preface-snapshot.pdf
[1]:,,Last_Name,First_Name,Favorite_Color,Age
"",Student_No,,,,
"",1,Johnson,Mia,periwinkle,12.0
"",2,Lopez,Liam,blue-green,13.0
"",3,Lee,Isabella,<missing>,11.0
"",4,Fisher,Mason,gray,NaN
"",5,Gupta,Olivia,sepia,NaN
"",6,Robinson,Sophia,blue,12.0
Tabula does a good, but not perfect, job. The Pandas style of setting the name of the index column below the other headers threw it off slightly. There is also a spurious first column that contains mostly empty strings, but whose header is the output cell number. However, these small defects are very easy to clean up, and we have a very nice CSV of the actual data in the table.
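As a sketch of that cleanup, assuming the CSV above has been written to a hypothetical file named tabula-p5.csv, a few lines of Pandas suffice.

import pandas as pd

# Pandas names the blank second header field 'Unnamed: 1'
df = pd.read_csv('tabula-p5.csv')           # hypothetical filename
df = (df
      .drop(columns=['[1]:'])               # spurious cell-number column
      .rename(columns={'Unnamed: 1': 'Student_No'})
      .iloc[1:]                             # drop the stray index-name row
      .set_index('Student_No'))
print(df)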
Remember from just above, however, that page 5 actually had two tables on it. Tabula-java only captured the first one, which is not unreasonable, but is not all the data we might want. Slightly more custom instructions (with the region of interest determined by moderate trial-and-error) can capture the second one.
%%bash
tabula -a'%72,13,90,100' -fTSV -p5 data/Preface-snapshot.pdf
First       Last        Age
<chr>       <chr>       <dbl>
Mia         Johnson     12
Liam        Lopez       13
Isabella    Lee         11
Mason       Fisher      NaN
Olivia      Gupta       NaN
Sophia      Robinson    12
To illustrate the output options, we chose tab-delimited rather than comma-separated for the output. A JSON output is also available. Moreover, by adjusting the left margin (given here as a percentage, though typographic points are also an option), we can eliminate the unnecessary row numbers. As before, the ingestion is good but not perfect. The tibble formatting of data type markers is superfluous for us. Discarding the two rows with unnecessary data is straightforward.
Finally for this example, let us capture the table on page 7 that does not have any of those data frame library extra markers. This one is probably more typical of the tables you will encounter in real work. For the example, we use points rather than page percentage to indicate the position of the table.
%%bash
tabula -p7 -a'120,0,220,500' data/Preface-snapshot.pdf
Number,Color,Number,Color
1,beige,6,alabaster
2,eggshell,7,sandcastle
3,seafoam,8,chartreuse
4,mint,9,sepia
5,cream,10,lemon
The extraction here is perfect, although the table itself is less than ideal in that it arranges the number/color pairs in two side-by-side column groups with repeated headers. However, that is likewise easy enough to modify using data frame libraries.
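For instance, a sketch of that reshaping with Pandas might look like this; the CSV text is pasted inline rather than read from a file so the example is self-contained.

import pandas as pd
from io import StringIO

csv_text = """\
Number,Color,Number,Color
1,beige,6,alabaster
2,eggshell,7,sandcastle
3,seafoam,8,chartreuse
4,mint,9,sepia
5,cream,10,lemon
"""
# Pandas de-duplicates the repeated headers as 'Number.1' and 'Color.1'
wide = pd.read_csv(StringIO(csv_text))
left = wide[['Number', 'Color']]
right = (wide[['Number.1', 'Color.1']]
         .set_axis(['Number', 'Color'], axis=1))
colors = pd.concat([left, right]).set_index('Number').sort_index()
print(colors)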
The tool tabula-java, as the name suggests, is only really useful for identifying tables. In contrast, pdftotext creates a best-effort purely text version of a PDF. Most of the time this is quite good. From that, standard text processing and extraction techniques usually work well, including those that parse tables. However, since an entire document (or a part of it selected by pages) is output, that lets us work with other elements like bullet lists, raw prose, or other identifiable data elements of a document.
%%bash
# Start with page 7, tool writes to .txt file
# Use layout mode to preserve horizontal position
pdftotext -f 7 -layout data/Preface-snapshot.pdf
# Remove 25 spaces from start of lines
# Wrap other lines that are too wide
sed -E 's/^ {,25}//' data/Preface-snapshot.txt | fmt -s | head -20
•   Missing data in the Favorite Color field should be substituted with
    the string <missing>.
•   Student ages should be between 9 and 14, and all other values are
    considered missing data.
•   Some colors are numerically coded, but should be dealiased. The
    mapping is:

    Number   Color      Number   Color
      1      beige        6      alabaster
      2      eggshell     7      sandcastle
      3      seafoam      8      chartreuse
      4      mint         9      sepia
      5      cream       10      lemon

Using the small test data set is a good way to test your code. But try
also manually adding more rows with similar, or different, problems in
them, and see how well your code produces a reasonable result.
The tabular part in the middle would be simple to read as a fixed width format. The bullets at top or the paragraph at bottom might be useful for other data extraction purposes. In any case, it is plain text at this point, which is easy to work with.
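A sketch of that fixed-width read with Pandas follows; the table lines are pasted inline for self-containment, whereas in practice you would slice them out of data/Preface-snapshot.txt.

import pandas as pd
from io import StringIO

# The fixed-width table portion of the pdftotext dump, pasted inline
fwf_text = """\
  1      beige        6      alabaster
  2      eggshell     7      sandcastle
  3      seafoam      8      chartreuse
  4      mint         9      sepia
  5      cream       10      lemon
"""
table = pd.read_fwf(StringIO(fwf_text),
                    names=['Number', 'Color', 'Number2', 'Color2'])
print(table)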
Let us turn now to analyzing images, mostly for their metadata and overall statistical characteristics.
As the Chinese say, 1001 words is worth more than a picture.
–John McCarthy
Concepts:
For certain purposes, raster images are themselves the data sets of interest to us. "Raster" just means rectangular collections of pixel values. The field of machine learning around image recognition and image processing is far outside the scope of this book. The few techniques in this section might be useful to get your data ready to the point of developing input to those tools, but no further than that. Also not considered in this book are other kinds of high-level recognition of the content of images. For example, optical character recognition (OCR) tools might recognize an image as containing various strings and numbers as rendered fonts, and those values might be the data we care about.
If you have the misfortune of having data that is only available in printed and scanned form, you most certainly have my deep sympathy. Scanning the images using OCR is likely to produce noisy results with many misrecognitions. Detecting those is addressed in chapter 4 (Anomaly Detection); essentially, you will get either wrong strings or wrong numbers when these errors happen, and ideally the errors will be identifiable. However, the specifics of those technologies are not within the current scope.
For this section, we merely want to present tools to read in images as numeric arrays, and perform a few basic processing steps that might be used in your downstream data analysis or modeling. Within Python, the library Pillow is the go-to tool (a backward-compatible successor to PIL, which is deprecated). Within R, the imager library seems to be most widely used for the general purpose tasks of this section. As a first task, let us examine and describe the raster images used in the creation of this book.
from PIL import Image, ImageOps

for fname in glob('img/*'):
    try:
        with Image.open(fname) as im:
            print(fname, im.format, "%dx%d" % im.size, im.mode)
    except IOError:
        pass
img/(Ch03)Luminance values in Confucius drawing.png PNG 2000x2000 P
img/Flu2009-infobox.png PNG 607x702 RGBA
img/Konfuzius-1770.jpg JPEG 566x800 RGB
img/UMAP.png PNG 2400x2400 RGBA
img/(Ch02)108 ratings.png PNG 3600x2400 RGBA
img/DQM-with-Lenin-Minsk.jpg MPO 3240x4320 RGB
img/(Ch02)Counties of the United States.png PNG 4800x3000 RGBA
img/PCA Components.png PNG 3600x2400 RGBA
img/(Ch01)Student score by year.png PNG 3600x2400 RGBA
img/HDFCompass.png PNG 958x845 RGBA
img/(Ch01)Visitors per station (max 32767).png PNG 3600x2400 RGBA
img/t-SNE.png PNG 4800x4800 RGBA
img/dog_cat.png PNG 6000x6000 RGBA
img/Parameter space for two features.png PNG 3600x2400 RGBA
img/Whitened Components.png PNG 3600x2400 RGBA
img/(Ch01)Lengths of Station Names.png PNG 3600x2400 RGBA
img/preface-2.png PNG 945x427 RGBA
img/DQM-with-Lenin-Minsk.jpg_original MPO 3240x4320 RGB
img/PCA.png PNG 4800x4800 RGBA
img/Excel-Pitfalls.png PNG 551x357 RGBA
img/gnosis-traffic.png PNG 1064x1033 RGBA
img/Film_Awards.png PNG 1587x575 RGBA
img/HTTP-status-codes.png PNG 934x686 RGBA
img/preface-1.png PNG 988x798 RGBA
img/GraphDatabase_PropertyGraph.png PNG 2000x1935 RGBA
We see that mostly PNG images were used, with a smaller number of JPEGs. Each has certain spatial dimensions, by width then height, and each is either RGB, or RGBA if it includes an alpha channel. Other images might be HSV format. Converting between color spaces is easy enough using tools like Pillow and imager, but it is important to be aware of which model a given image uses. Let us read one in, this time using R.
%%R
library(imager)
confucius <- load.image("img/Konfuzius-1770.jpg")
print(confucius)
plot(confucius)
Image. Width: 566 pix Height: 800 pix Depth: 1 Colour channels: 3
Let us analyze the contours of the pixels.
We can work on getting a feel for the data, which at heart is simply an array of values, with some tools the library provides. In the case of imager which is built on CImg, the internal representation is 4-dimensional. Each plane is an X by Y grid of pixels (left-to-right, top-to-bottom). However, the format can represent a stack of images—for example, an animation—in the depth dimension. The several color channels (if the image is not grayscale) are the final dimension of the array. The Confucius example is a single image, so the third dimension is of length one. Let us look at some summary data about the image.
%%R
grayscale(confucius) %>%
    hist(main="Luminance values in Confucius drawing")
%%R
# Save histogram to disk
png("img/(Ch03)Luminance values in Confucius drawing.png", width=1200)
grayscale(confucius) %>%
    hist(main="Luminance values in Confucius drawing")
Perhaps we would like to look at the distribution only of one color channel instead.
%%R
B(confucius) %>%
    hist(main="Blue values in Confucius drawing")
%%R
# Save histogram to disk
png("img/(Ch03)Blue values in Confucius drawing.png", width=1200)
B(confucius) %>%
    hist(main="Blue values in Confucius drawing")
The histograms above simply utilize the standard R histogram function. There is nothing special about the fact that the data represents an image. We could perform whatever statistical tests or summarizations we wanted on the data to make sure it makes sense for our purpose; a histogram is only a simple example to show the concept. We can also easily transform the data into a tidy data frame. As of this writing, there is an "impedance error" in converting directly to a tibble, so the below cell uses an intermediate data.frame format. Tibbles are often, but not always, drop-in replacements when functions were written to work with data.frame objects.
%%R
data <- as.data.frame(confucius) %>%
    as_tibble %>%
    # channels 1, 2, 3 (RGB) as factor
    mutate(cc = as.factor(cc))
data
# A tibble: 1,358,400 x 4
       x     y cc    value
   <int> <int> <fct> <dbl>
 1     1     1 1     0.518
 2     2     1 1     0.529
 3     3     1 1     0.518
 4     4     1 1     0.510
 5     5     1 1     0.533
 6     6     1 1     0.541
 7     7     1 1     0.533
 8     8     1 1     0.533
 9     9     1 1     0.510
10    10     1 1     0.471
# … with 1,358,390 more rows
With Python and PIL/Pillow, working with image data is very similar. As in R, the image is an array of pixel values with some metadata attached to it. Just for fun, we use a variable name with Chinese characters to illustrate that such is supported in Python.
# Courtesy name: Zhòngní (仲尼)
# "Kǒng Fūzǐ" (孔夫子) was coined by 16th century Jesuits
仲尼 = Image.open('img/Konfuzius-1770.jpg')
data = np.array(仲尼)
print("Image shape:", data.shape)
print("Some values\n", data[:2, :, :])
Image shape: (800, 566, 3)
Some values
 [[[132  91  69]
  [135  94  74]
  [132  91  71]
  ...
  [148  98  73]
  [142  95  69]
  [135  89  63]]

 [[131  90  68]
  [138  97  75]
  [139  98  78]
  ...
  [147 100  74]
  [144  97  71]
  [138  92  66]]]
In the Pillow format, images are stored as 8-bit unsigned integers rather than as floating-point numbers in the [0.0, 1.0] range. Converting between these is easy enough, of course, as is other normalization. For example, for many neural network tasks, the preferred representation is values centered at zero with a standard deviation of one. The array used to hold Pillow images is 3-dimensional, since it does not have provision for stacking multiple images in the same object.
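As a sketch of that conversion and normalization with NumPy (standardizing over the whole array; per-channel statistics are another common choice):

import numpy as np
from PIL import Image

# 8-bit unsigned integers from Pillow; values within 0..255
data = np.array(Image.open('img/Konfuzius-1770.jpg'))
print(data.dtype, data.min(), data.max())

# Rescale to floats in [0.0, 1.0], then center at zero with
# standard deviation of one
scaled = data / 255.0
standardized = (scaled - scaled.mean()) / scaled.std()
print(standardized.mean().round(6), standardized.std().round(6))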
It might be useful to perform manipulation of image data before processing. The below example is contrived, and similar to one used in the library tutorial. The idea in the next few code lines is that we will mask the image based on the values in the blue channel, but then use that to selectively zero-out red values. The result is not visually attractive for a painting, but one can imagine it might be useful for e.g. medical imaging or false-color radio astronomy images (I am also working around making a transformation that is easily visible in a monochrome book as well as in full color).
The convention used in the .paste() method is a bit odd. The rule is: where the mask is 255, the value is copied as is; where the mask is 0, the current value is preserved (intermediate values blend). The overall effect in the color version is that, in the mostly red-tinged image, the greens dominate at the edges where the image had been most red. In grayscale it mostly just darkens the edges.
# split the Confucius image into individual bands
source = 仲尼.split()
R, G, B = 0, 1, 2
# select regions where blue is less than 100
mask = source[B].point(lambda i: 255 if i < 100 else 0)
source[R].paste(0, None, mask)
im = Image.merge(仲尼.mode, source)
im.save('img/(Ch03)Konfuzius-bluefilter.jpg')
ImageOps.scale(im, 0.5)
The original in comparison.
ImageOps.scale(仲尼, 0.5)
Another example we mentioned is that transformation of the color space might be useful. For example, rather than looking at the colors red, green, and blue, it might be that hue, saturation, and lightness are better features for your modeling needs. This is a deterministic transformation of the data, but one that emphasizes different aspects. It is somewhat analogous to decompositions like principal component analysis, which is discussed in chapter 7 (Feature Engineering). Here we convert from an RGB to HSL representation of the image.
%%R
confucius.hsv <- RGBtoHSL(confucius)
data <- as.data.frame(confucius.hsv) %>%
    as_tibble %>%
    # channels 1, 2, 3 (HSV) as factor
    mutate(cc = as.factor(cc))
data
# A tibble: 1,358,400 x 4
       x     y cc    value
   <int> <int> <fct> <dbl>
 1     1     1 1      21.0
 2     2     1 1      19.7
 3     3     1 1      19.7
 4     4     1 1      19.7
 5     5     1 1      19.7
 6     6     1 1      19.7
 7     7     1 1      19.7
 8     8     1 1      19.7
 9     9     1 1      19.7
10    10     1 1      20
# … with 1,358,390 more rows
Both the individual values and the shape of the space have changed in this transformation. The transformation is lossless, beyond minor rounding issues. A summary by channel will illustrate this.
%%R
data %>%
    mutate(cc = recode(cc, `1`="Hue", `2`="Saturation", `3`="Value")) %>%
    group_by(cc) %>%
    summarize(Mean = mean(value), SD = sd(value))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
  cc           Mean     SD
  <fct>       <dbl>  <dbl>
1 Hue        34.5   59.1
2 Saturation  0.448  0.219
3 Value       0.521  0.192
Let us look at perhaps the most important aspect of images to data scientists.
Photographic images may contain metadata embedded inside them. Specifically, the Exchangeable Image File Format (Exif) specifies how such metadata can be embedded in JPEG, TIFF, and WAV formats (the last is an audio format). Digital cameras typically add this information to the images they create, often including details such as timestamp and latitude/longitude location.
Some of the data fields within an Exif mapping are textual, numeric, or tuples; others are binary data. Moreover, the keys in the mapping are ID numbers that are not meaningful to humans directly; this mapping is a published standard, but some equipment makers may introduce their own IDs as well. The binary fields contain a variety of types of data, encoded in various ways. For example, some cameras may attach small preview images as Exif metadata; but simpler fields are also encoded.
The below function will utilize Pillow to return two dictionaries, one for the textual data, the other for the binary data. Tag IDs are expanded to human-readable names, where available. Pillow uses "camel case" for these names, but other tools have different variations on capitalization and punctuation within the tag names. The casing used by Pillow is what I like to call Bactrian case, as opposed to Dromedary case, both of which differ from Python's usual "snake case" (e.g. BactrianCase versus dromedaryCase versus snake_case).
from PIL.ExifTags import TAGS

def get_exif(img):
    txtdata, bindata = dict(), dict()
    for tag_id in (exifdata := img.getexif()):
        # Lookup tag name from tag_id if available
        tag = TAGS.get(tag_id, tag_id)
        data = exifdata.get(tag_id)
        if isinstance(data, bytes):
            bindata[tag] = data
        else:
            txtdata[tag] = data
    return txtdata, bindata
Let us check whether the Confucius image has any metadata attached.
get_exif(仲尼) # Zhòngní, i.e. Confucius
({}, {})
We see that this image does not have any such metadata. Let us look instead at a photograph taken of the author next to a Lenin statue in Minsk.
# Could continue using multi-lingual variable names by
# choosing `Ленин`, `Ульянов` or `Мінск`
dqm = Image.open('img/DQM-with-Lenin-Minsk.jpg')
ImageOps.scale(dqm, 0.1)