Data Wrangling

Definition

Data Wrangling is a broad term for the processes involved in preparing data for analysis. It can include acquiring, enriching, combining, subsetting, sampling, and cleaning data, as well as changing its format and shape.
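
As a rough illustration of what changing the shape of data, subsetting, and sampling can look like in practice, here is a minimal sketch using the pandas library for Python. The table, its column names, and its values are invented for the example and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical wide-format table: one row per clinic, one column per year.
visits_wide = pd.DataFrame({
    "clinic": ["North", "South", "East"],
    "2022": [120, 95, 60],
    "2023": [135, 80, 72],
})

# Change the shape: wide to long, so each row is one clinic-year pair.
visits_long = visits_wide.melt(
    id_vars="clinic", var_name="year", value_name="visits"
)

# Subset: keep only the rows for 2023.
visits_2023 = visits_long[visits_long["year"] == "2023"]

# Sample: draw two rows at random; random_state makes the draw repeatable.
sample = visits_long.sample(n=2, random_state=0)

print(visits_long)
print(visits_2023)
print(sample)
```

The long format is often easier to filter, group, and plot than the wide one, which is why reshaping tends to come early in a wrangling workflow.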

Some common steps involved with Data Wrangling are:

Discovering and gathering the data needed
Merging data from different sources, if necessary
Fixing flaws in the data entries
Extracting the necessary data and putting it in the proper structure
Storing it in the proper format for further use

Examples

Typical examples include merging data from different sources and fixing flaws or errors in data entries.
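
The sketch below shows one way merging and fixing flawed entries might look with pandas. Both source tables, the column names (`patient_id`, `state`, `glucose`), and the specific flaws are hypothetical, chosen only to illustrate the idea; real data will need its own cleaning rules.

```python
import pandas as pd

# Hypothetical source 1: a registry with stray whitespace, mixed case,
# and a duplicated row.
registry = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "state": [" ny", "CA", "CA", "tx "],
})

# Hypothetical source 2: lab results where numbers were stored as text
# and one value is a placeholder.
labs = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "glucose": ["98", "n/a", "110"],
})

# Fix flaws: normalize the state codes, then drop exact duplicate rows.
registry["state"] = registry["state"].str.strip().str.upper()
registry = registry.drop_duplicates()

# Fix flaws: convert text to numbers; unparseable entries become NaN.
labs["glucose"] = pd.to_numeric(labs["glucose"], errors="coerce")

# Merge the two sources on their shared key.
combined = registry.merge(labs, on="patient_id", how="left")

# Store the result in a common format for further use (CSV text here;
# pass a file path to to_csv to write a file instead).
csv_text = combined.to_csv(index=False)
print(csv_text)
```

Converting the placeholder to a missing value rather than dropping the row keeps the patient in the merged table, so the gap stays visible to whoever analyzes the data next.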

Tools

Tidyverse is a collection of open source R packages, several of which can be used for data wrangling and cleaning.

Pandas is an open source Python library for data manipulation and analysis.

OpenRefine is a user-friendly, point-and-click tool for working with messy data.

Relevant Literature

This short Coursera video (What is Data Wrangling?) provides an excellent overview of the data wrangling process and common tasks involved when preparing data for analysis and publication.

Data Science for Practicing Clinicians: Data Wrangling is a Data Carpentry lesson that provides hands-on experience with installing and using dplyr, a core Tidyverse package for the R programming language. Basic instructions for filtering, summarizing, parsing, and cleaning data are provided.

The book Practical Data Wrangling (2017) by Allan Visochek provides information on data wrangling techniques in Python.
