Cleaning Data for Effective Data Science

About the Book

It is something of a truism in data science, data analysis, or machine learning that most of the effort needed to achieve your actual purpose lies in cleaning your data. Written in David’s signature friendly and humorous style, this book discusses in detail the essential steps performed in every production data science or data analysis pipeline and prepares you for data visualization and modeling results.

The book dives into the practical application of tools and techniques needed for data ingestion, anomaly detection, value imputation, and feature engineering. It also offers long-form exercises at the end of each chapter to practice the skills acquired.

You will begin with the ingestion of data from sources and formats such as JSON, CSV, SQL RDBMSes, HDF5, NoSQL databases, image files, and binary serialized data structures. Further, the book provides numerous example data sets and data files, which are available for download and independent exploration.

Moving on from formats, you will impute missing values, detect unreliable data and statistical anomalies, and generate synthetic features that are necessary for successful data analysis and visualization goals.
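As a rough illustration of what those steps can look like in practice, here is a minimal sketch using only the Python standard library. It is not taken from the book: the `readings` values are made up, and the choices of median imputation and a cutoff of five median absolute deviations are assumptions picked for the example.

```python
import statistics

# Hypothetical sensor readings (invented for this sketch); None marks a missing value.
readings = [10.1, 9.8, None, 10.3, 55.0, 10.0, None, 9.9]

# Impute missing values with the median of the observed values.
observed = [r for r in readings if r is not None]
median = statistics.median(observed)
imputed = [median if r is None else r for r in readings]

# Flag statistical anomalies: values far from the median, measured in
# units of the median absolute deviation (MAD). The factor 5 is an assumption.
mad = statistics.median(abs(r - median) for r in observed)
anomalies = [r for r in imputed if abs(r - median) > 5 * mad]

# A simple synthetic feature: whether each value was originally missing.
was_missing = [r is None for r in readings]

print("imputed:", imputed)
print("anomalies:", anomalies)
print("was_missing:", was_missing)
```

The median and MAD are used here rather than the mean and standard deviation because a single extreme value distorts them far less; other imputation and detection strategies may suit other data better.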

By the end of this book, you will have acquired a firm understanding of the data cleaning process necessary to perform real-world data science and machine learning tasks.

Upcoming Events

May 10, 2021: Data Ingestion of Hierarchical and Other Data Formats: Cleaning Data for Effective Data Science
June 15, 2021: Beginning Machine Learning with scikit-learn: Understanding Fundamental Concepts and Exploring the API for Supervised Learning
June 22, 2021: Cleaning Data for Effective Data Science: Explorations in Anomaly Detection (Cape Town Machine Learning Meetup)
June 21-25, 2021: DataTalk.Club Book of the week (Slack discussion)
August 2021: A new book by David that is fun, challenging, and beautiful: The Puzzling Quirks of Regular Expressions

Note on this Online Version

Most, but not all, of the published Packt title is available at this website. Some mentions of code setup or of running code cells do not apply to this static HTML version, which is divided into short sections. A few other words that concern the book context may be missing as well.

In other words, I encourage you to buy the book itself!

That said, it's also worth noting that all the code cells used in writing this book within JupyterLab are available at the BitBucket repository for the book.

Acknowledgements

I give great thanks to those people who have helped make this book better.

First and foremost, I am thankful for the careful attention and insightful suggestions of my technical editor Lucy Wan, and technical reviewer Miki Tebeka. Other colleagues and friends who have read and provided helpful comments on parts of this book, while it was in progress, include Micah Dubinko, Vladimir Shulyak, Laura Richter, Alessandra Smith, Mary Ann Sushinsky, Tim Churches, and Paris Finley.

The text in front of you is better for their kindnesses and intelligence; all errors and deficits remain mine entirely.

I also thank the thousands of contributors who have created the Free Software I used in the creation of this book, and in so much other work I do. No proprietary software was used by the author at any point in the production of this book. The operating system, text editors, plot creation tools, fonts, programming languages, shells, command-line tools, and all other software used belongs to our human community rather than to any exclusive private entity.