Data Wrangling

Definition

Data Wrangling is a broad term for the processes involved in preparing data for analysis. It can include acquiring and enriching data; changing its format and shape; combining, subsetting, and sampling it; and cleaning it.

Some common steps involved with Data Wrangling are: 

  • Discovering and gathering the data needed
  • Merging data from different sources, if necessary 
  • Fixing flaws in the data entries 
  • Extracting the necessary data and putting it in the proper structure
  • Storing it in the proper format for further use
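
The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a recommended pipeline: the two CSV sources, the column names, and the function names (gather, merge, fix, extract, store) are all hypothetical and were chosen to mirror the list one step at a time.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two separate sources.
PATIENTS_CSV = """patient_id,name,age
1, Ana Lopez ,34
2,Ben Okafor,n/a
3,Chen Wei,51
"""

VISITS_CSV = """patient_id,visit_date
1,2023-01-15
3,2023-02-02
4,2023-02-10
"""


def gather(text):
    """Step 1: read a source into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def merge(patients, visits):
    """Step 2: combine the sources on the shared patient_id key."""
    visits_by_id = {row["patient_id"]: row for row in visits}
    merged = []
    for patient in patients:
        visit = visits_by_id.get(patient["patient_id"], {})
        merged.append({**patient, "visit_date": visit.get("visit_date")})
    return merged


def fix(rows):
    """Step 3: trim stray whitespace and turn unparseable ages into None."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        cleaned.append({
            **row,
            "name": row["name"].strip(),
            "age": int(age) if age.isdigit() else None,
        })
    return cleaned


def extract(rows):
    """Step 4: keep only patients with a visit, and only the needed fields."""
    return [
        {"name": row["name"], "age": row["age"], "visit_date": row["visit_date"]}
        for row in rows
        if row["visit_date"] is not None
    ]


def store(rows):
    """Step 5: serialize to JSON text (a file would be used in practice)."""
    return json.dumps(rows, indent=2)


cleaned = fix(merge(gather(PATIENTS_CSV), gather(VISITS_CSV)))
result = extract(cleaned)
output = store(result)
print(output)
```

In this sketch the merge keeps every patient and attaches a visit date where one exists, so the visit for the unknown patient 4 is dropped and the patient without a visit is filtered out at the extraction step. Real projects would make those choices deliberately for their own data.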

Examples

Typical examples include merging data from different sources and fixing flaws or errors in data entries.

Tools

Tidyverse is a collection of open source R packages, several of which can be used for data wrangling and cleaning.  

Pandas is an open source Python library for data manipulation and analysis.
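
As a rough illustration of what reshaping, subsetting, and sampling look like in Pandas, the sketch below converts a small wide-format table to long format, drops a missing value, filters by year, and draws a reproducible sample. The table, its column names, and the variable names are invented for this example, and it assumes the pandas package is installed.

```python
import pandas as pd

# Hypothetical wide-format table: one row per site, one column per year.
wide = pd.DataFrame({
    "site": ["A", "B", "C"],
    "2021": [10, 7, None],
    "2022": [12, 9, 4],
})

# Reshape from wide to long: one row per (site, year) observation.
tidy = wide.melt(id_vars=["site"], var_name="year", value_name="count")

# Clean: drop missing observations and give each column a sensible type.
tidy = tidy.dropna(subset=["count"]).astype({"year": int, "count": int})

# Subset: keep only the 2022 observations.
recent = tidy[tidy["year"] == 2022].reset_index(drop=True)

# Sample: draw two rows reproducibly by fixing the random seed.
sample = tidy.sample(n=2, random_state=0)

print(recent)
```

The same operations are available in the Tidyverse (for example through tidyr and dplyr) and through the OpenRefine interface; the choice of tool usually depends on the language and workflow already in use.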

OpenRefine is a user-friendly, point-and-click tool for working with messy data.

Relevant Literature

This short Coursera video (What is Data Wrangling?) provides an excellent overview of the data wrangling process and common tasks involved when preparing data for analysis and publication.

Data Science for Practicing Clinicians: Data Wrangling is a Data Carpentry lesson that provides hands-on experience with installing and using dplyr, a core package in Tidyverse in the R programming language. Basic instructions for filtering, summarizing, parsing, and cleaning data are provided.

The book Practical Data Wrangling (2017) by Allan Visochek provides information on data wrangling techniques in Python.

 
