Data visualization

From Wikipedia, the free encyclopedia - View original article

Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual communication. It is not owned by any one field, but rather finds interpretation across many (e.g. it is viewed as a modern branch of descriptive statistics by some, but also as a grounded theory development tool by others). It involves the creation and study of the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".[1]

A primary goal of data visualization is to communicate information clearly and efficiently to users via the information graphics selected, such as tables and charts. Effective visualization helps users in analyzing and reasoning about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users will look up a specific measure of a variable, while charts of various types are used to show patterns or relationships in the data for one or more variables.

Data visualization is both an art and a science. The rate at which data is generated has increased, driven by an increasingly information-based economy. Data created by internet activity and an expanding number of sensors in the environment, such as satellites and traffic cameras, are referred to as "Big Data". Processing, analyzing and communicating this data presents a variety of ethical and analytical challenges for data visualization. The field of data science and practitioners called data scientists have emerged to help address these challenges.


A data visualization of Wikipedia as part of the World Wide Web, demonstrating hyperlinks

According to Friedman (2008) the "main goal of data visualization is to communicate information clearly and effectively through graphical means. It doesn’t mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex data set by communicating its key-aspects in a more intuitive way. Yet designers often fail to achieve a balance between form and function, creating gorgeous data visualizations which fail to serve their main purpose — to communicate information".[2]

Indeed, Fernanda Viegas and Martin M. Wattenberg have suggested that an ideal visualization should not only communicate clearly, but stimulate viewer engagement and attention.[3]

Data visualization is closely related to information graphics, information visualization, scientific visualization, exploratory data analysis and statistical graphics. In the new millennium, data visualization has become an active area of research, teaching and development. According to Post et al. (2002), it has united scientific and information visualization.[4] Brian Willison has demonstrated that data visualization has also been linked to enhancing agile software development and customer engagement.[5]

KPI Library has developed the “Periodic Table of Visualization Methods,” an interactive chart displaying various data visualization methods. It includes six types of data visualization methods: data, information, concept, strategy, metaphor and compound.[6]

In February 2014, at the University of Waterloo Stratford Campus Inspiration Day, University of Toronto professor Nadia Amoroso demonstrated how data visualization techniques increase the understanding of Big Data sets in order to communicate a story.[7]

Characteristics of effective graphical displays

Charles Joseph Minard's 1861 diagram of Napoleon's March - an early example of an information graphic.

Professor Edward Tufte explained that users of information displays are executing particular analytical tasks such as making comparisons or determining causality. The design principle of the information graphic should support the analytical task, showing the comparison or causality.[8]

In his 1983 book The Visual Display of Quantitative Information, Edward Tufte defines 'graphical displays' and principles for effective graphical display in the following passage: "Excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency. Graphical displays should:

Graphics reveal data. Indeed graphics can be more precise and revealing than conventional statistical computations."[9]

For example, the Minard diagram shows the losses suffered by Napoleon's army in the 1812-1813 period. Six variables are plotted: the size of the army, its location on a two-dimensional surface (x and y), time, direction of movement, and temperature. This multivariate display on a two-dimensional surface tells a story that can be grasped immediately while identifying the source data to build credibility. Tufte wrote in 1983 that: "It may well be the best statistical graphic ever drawn."[9]

Not applying these principles may result in misleading graphs, which distort the message or support an erroneous conclusion. According to Tufte, chartjunk refers to extraneous interior decoration of the graphic that does not enhance the message, or gratuitous three-dimensional or perspective effects. Needlessly separating the explanatory key from the image itself, requiring the eye to travel back and forth from the image to the key, is a form of "administrative debris." The ratio of "data to ink" should be maximized, erasing non-data ink where feasible.[9]
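
As a rough illustration of the data-to-ink idea, the sketch below treats a chart as a set of elements, each with an estimated amount of ink and a flag saying whether it encodes data. The element names, the ink amounts and the two helper functions are assumptions made up for this example; they are not measurements of any real graphic and are not taken from the cited book.

```python
# Hypothetical chart elements: name -> (ink units, encodes data?).
# All names and numbers are invented for illustration.
chart = {
    "bars": (40.0, True),
    "value labels": (10.0, True),
    "heavy grid lines": (30.0, False),
    "3D shading": (20.0, False),
}


def data_ink_ratio(elements):
    """Share of the total ink that is spent on elements encoding data."""
    total = sum(ink for ink, _ in elements.values())
    data = sum(ink for ink, is_data in elements.values() if is_data)
    return data / total if total else 0.0


def erase_non_data_ink(elements):
    """Keep only the elements that encode data."""
    return {name: item for name, item in elements.items() if item[1]}


before = data_ink_ratio(chart)
after = data_ink_ratio(erase_non_data_ink(chart))
print(f"before: {before:.2f}, after: {after:.2f}")
```

In practice some non-data ink (for example a light axis line) may still be worth keeping, which is why the text says "where feasible".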

Quantitative messages

A time series illustrated with a line chart demonstrating trends in U.S. federal spending and revenue over time.
A scatterplot illustrating correlation between two variables (inflation and unemployment) measured at points in time.

Author Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data and the associated graphs used to help communicate the message:

  1. Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.
  2. Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the sales persons.
  3. Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
  4. Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period. A bar chart can show comparison of the actual versus the reference amount.
  5. Frequency distribution: Shows the number of observations of a particular variable for a given interval, such as the number of years in which the stock market return is between intervals such as 0-10%, 11-20%, etc. A histogram, a type of bar chart, may be used for this analysis.
  6. Correlation: Comparison between observations represented by two variables (X,Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.
  7. Nominal comparison: Comparing categorical subdivisions in no particular order, such as the sales volume by product code. A bar chart may be used for this comparison.
  8. Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.[10][11]
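
The pairing of message types and graphs in the list above can be written down as a simple lookup table. The sketch below is only one possible encoding; the names `SUGGESTED_GRAPH` and `suggest_graph` are hypothetical, and the values restate the suggestions from the list rather than adding new ones.

```python
# Message type -> graph suggested for it in the list above.
SUGGESTED_GRAPH = {
    "time-series": "line chart",
    "ranking": "bar chart",
    "part-to-whole": "pie chart or bar chart",
    "deviation": "bar chart",
    "frequency distribution": "histogram",
    "correlation": "scatter plot",
    "nominal comparison": "bar chart",
    "geographic or geospatial": "cartogram",
}


def suggest_graph(message):
    """Return the graph paired with a message type, ignoring case."""
    key = message.strip().lower()
    if key not in SUGGESTED_GRAPH:
        raise KeyError(f"unknown message type: {message!r}")
    return SUGGESTED_GRAPH[key]


print(suggest_graph("Correlation"))
```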

Analysts reviewing a set of data may consider whether some or all of the messages and graphic types above are applicable to their task and audience. The process of trial and error to identify meaningful relationships and messages in the data is part of exploratory data analysis.
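
One small step of such exploration is the frequency distribution from item 5: counting how many observations fall into each interval. The sketch below assumes fixed-width intervals and uses invented return figures; the function name `frequency_distribution` is hypothetical.

```python
from collections import Counter

# Hypothetical annual returns in percent, invented for illustration.
returns = [3.0, 7.5, 12.0, 15.5, 18.0, 4.2, 22.0, 9.9, 11.1, 27.3]


def frequency_distribution(values, width):
    """Count the values falling in each interval [k*width, (k+1)*width)."""
    counts = Counter(int(v // width) for v in values)
    return {(k * width, (k + 1) * width): counts[k] for k in sorted(counts)}


dist = frequency_distribution(returns, 10)
for (low, high), n in dist.items():
    # One '#' per observation gives a crude text histogram.
    print(f"{low:>3} to {high:<3} {'#' * n}")
```

Changing the interval width and looking again is a typical trial-and-error move: a pattern that is invisible at one width may stand out at another.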

Visual perception and data visualization

A human can distinguish differences in line length, shape orientation, and color (hue) readily without significant processing effort; these are referred to as "pre-attentive attributes." For example, it may require significant time and effort ("attentive processing") to identify the number of times the digit "5" appears in a series of numbers; but if that digit is different in size, orientation, or color, instances of the digit can be noted quickly through pre-attentive processing.[12]

Effective graphics take advantage of pre-attentive processing and attributes and the relative strength of these attributes. For example, since humans can more easily process differences in line length than surface area, it may be more effective to use a bar chart (which takes advantage of line length to show comparison) rather than pie charts (which use surface area to show comparison).[13]
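
A minimal sketch of length encoding, assuming a plain-text display: each value is drawn as a bar whose length is proportional to the value, so categories can be compared by length alone. The sales figures and the function name `text_bar_chart` are invented for this example.

```python
# Hypothetical sales figures, invented for illustration.
sales = {"North": 120, "South": 75, "East": 30, "West": 98}


def text_bar_chart(data, max_width=40):
    """Render one bar per category; bar length is proportional to value."""
    largest = max(data.values())
    label_width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        bar = "#" * round(value / largest * max_width)
        lines.append(f"{label:<{label_width}} {bar} {value}")
    return lines


for line in text_bar_chart(sales):
    print(line)
```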


Data visualization involves specific terminology, some of which is derived from statistics. For example, author Stephen Few defines two types of data, which are used in combination to support a meaningful analysis or visualization:

A table contains quantitative data organized into rows and columns with categorical labels. It is primarily used to look up specific values. In the example above, the table might have categorical column labels representing the name (a qualitative variable) and age (a quantitative variable), with each row of data representing one person (the sampled experimental unit or category subdivision).

A graph is primarily used to show relationships among data and portrays values encoded as visual objects (e.g., lines, bars, or points). Numerical values are displayed within an area delineated by one or more axes. These axes provide scales (quantitative and categorical) used to label and assign values to the visual objects.[14]

Analytical activities of data visualization users

Low-level user analytic activities while interacting with an instance of data visualization are presented in the following table:[15] (pro forma abstracts in fourth column are templates that capture the essence of the task[16][17])

| No. | Task | General description | Pro forma abstract | Examples |
|---|---|---|---|---|
| 1 | Retrieve Value | Given a set of specific cases, find attributes of those cases. | What are the values of attributes {X, Y, Z, ...} in the data cases {A, B, C, ...}? | What is the mileage per gallon of the Audi TT? How long is the movie Gone with the Wind? |
| 2 | Filter | Given some concrete conditions on attribute values, find data cases satisfying those conditions. | Which data cases satisfy conditions {A, B, C...}? | What Kellogg's cereals have high fiber? What comedies have won awards? Which funds underperformed the SP-500? |
| 3 | Compute Derived Value | Given a set of data cases, compute an aggregate numeric representation of those data cases. | What is the value of aggregation function F over a given set S of data cases? | What is the average calorie content of Post cereals? What is the gross income of all stores combined? How many manufacturers of cars are there? |
| 4 | Find Extremum | Find data cases possessing an extreme value of an attribute over its range within the data set. | What are the top/bottom N data cases with respect to attribute A? | What is the car with the highest MPG? What director/film has won the most awards? What Robin Williams film has the most recent release date? |
| 5 | Sort | Given a set of data cases, rank them according to some ordinal metric. | What is the sorted order of a set S of data cases according to their value of attribute A? | Order the cars by weight. Rank the cereals by calories. |
| 6 | Determine Range | Given a set of data cases and an attribute of interest, find the span of values within the set. | What is the range of values of attribute A in a set S of data cases? | What is the range of film lengths? What is the range of car horsepowers? What actresses are in the data set? |
| 7 | Characterize Distribution | Given a set of data cases and a quantitative attribute of interest, characterize the distribution of that attribute’s values over the set. | What is the distribution of values of attribute A in a set S of data cases? | What is the distribution of carbohydrates in cereals? What is the age distribution of shoppers? |
| 8 | Find Anomalies | Identify any anomalies within a given set of data cases with respect to a given relationship or expectation, e.g. statistical outliers. | Which data cases in a set S of data cases have unexpected/exceptional values? | Are there exceptions to the relationship between horsepower and acceleration? Are there any outliers in protein? |
| 9 | Cluster | Given a set of data cases, find clusters of similar attribute values. | Which data cases in a set S of data cases are similar in value for attributes {X, Y, Z, …}? | Are there groups of cereals w/ similar fat/calories/sugar? Is there a cluster of typical film lengths? |
| 10 | Correlate | Given a set of data cases and two attributes, determine useful relationships between the values of those attributes. | What is the correlation between attributes X and Y over a given set S of data cases? | Is there a correlation between carbohydrates and fat? Is there a correlation between country of origin and MPG? Do different genders have a preferred payment method? Is there a trend of increasing film length over the years? |

This taxonomy of user activities can be used on two occasions: while discovering user requirements for a particular data visualization project, and during evaluation of a data visualization technique. The taxonomy can also be organized by three poles of activities: retrieving values, finding data points, and arranging data points. This organization is displayed in the following diagram:

Analytic activities of data visualization users
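
The first six low-level tasks in the table above map onto short operations over a list of records. The sketch below is illustrative only: the car records and all variable names are invented, and a real visualization tool would expose these tasks through interaction rather than code.

```python
# Hypothetical car records, invented for illustration.
cars = [
    {"name": "Car A", "mpg": 31.0, "weight": 1200},
    {"name": "Car B", "mpg": 24.5, "weight": 1500},
    {"name": "Car C", "mpg": 38.2, "weight": 1050},
    {"name": "Car D", "mpg": 19.8, "weight": 1900},
]

# 1. Retrieve value: an attribute of one specific case.
mpg_of_b = next(c["mpg"] for c in cars if c["name"] == "Car B")

# 2. Filter: cases that satisfy a condition.
efficient = [c["name"] for c in cars if c["mpg"] > 30]

# 3. Compute derived value: an aggregate over the cases.
mean_mpg = sum(c["mpg"] for c in cars) / len(cars)

# 4. Find extremum: the case with the highest value of an attribute.
best = max(cars, key=lambda c: c["mpg"])["name"]

# 5. Sort: order the cases by an attribute.
by_weight = [c["name"] for c in sorted(cars, key=lambda c: c["weight"])]

# 6. Determine range: the span of an attribute's values.
weights = [c["weight"] for c in cars]
weight_range = (min(weights), max(weights))

print(mpg_of_b, efficient, best, by_weight, weight_range)
```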

Examples of diagrams used for data visualization



Network

Bar Chart
  • length
  • color
  • time

  • width
  • color
  • time (flow)

  • size
  • color

Gantt Chart
  • color
  • time (flow)

Scatter Plot (3D)
  • position x
  • position y
  • position z
  • color

Data visualization scope

There are different approaches to the scope of data visualization. One common focus is on information presentation, as presented by Friedman (2008). Along these lines, Friendly (2008) presumes two main parts of data visualization: statistical graphics, and thematic cartography.[1] In the same vein, the article "Data Visualization: Modern Approaches" (2007) gives an overview of seven subjects of data visualization:[18]

All these subjects are closely related to graphic design and information representation.

On the other hand, from a computer science perspective, Frits H. Post (2002) categorized the field into a number of sub-fields:[4]

For different types of visualizations and their connection to infographics, see infographics.

Related fields

Data science process flowchart

Data acquisition

Data acquisition is the sampling of the real world to generate data that can be manipulated by a computer. Sometimes abbreviated DAQ or DAS, data acquisition typically involves acquisition of signals and waveforms and processing the signals to obtain desired information. The components of data acquisition systems include appropriate sensors that convert any measurement parameter to an electrical signal, which is acquired by data acquisition hardware.
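
The sampling step can be sketched in a few lines, assuming the sensor output is available as a function of time. Everything here is a stand-in: `sensor_voltage` is a hypothetical 1 Hz sine wave rather than a real sensor, and `sample_signal` is an invented helper, not the interface of any data acquisition hardware.

```python
import math


def sample_signal(signal, sample_rate_hz, duration_s):
    """Read signal(t) at evenly spaced times and return (t, value) pairs."""
    count = int(sample_rate_hz * duration_s)
    return [(i / sample_rate_hz, signal(i / sample_rate_hz)) for i in range(count)]


def sensor_voltage(t):
    """Stand-in for a sensor output: a 1 Hz sine wave, in volts (assumed)."""
    return math.sin(2 * math.pi * 1.0 * t)


samples = sample_signal(sensor_voltage, sample_rate_hz=8, duration_s=1.0)
for t, v in samples:
    print(f"t={t:.3f} s  v={v:+.3f} V")
```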

Data analysis

Data analysis is the process of studying and summarizing data with the intent to extract useful information and develop conclusions. Data analysis is closely related to data mining, but data mining tends to focus on larger data sets with less emphasis on making inference, and often uses data that was originally collected for a different purpose. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and inferential statistics (or confirmatory data analysis, CDA), where EDA focuses on discovering new features in the data, and CDA on confirming or falsifying existing hypotheses.

Types of data analysis are:

Data governance

Data governance encompasses the people, processes and technology required to create a consistent, enterprise view of an organisation's data in order to:

Data management

Data management comprises all the academic disciplines related to managing data as a valuable resource. The official definition provided by DAMA is that "Data Resource Management is the development and execution of architectures, policies, practices, and procedures that properly manage the full data lifecycle needs of an enterprise." This definition is fairly broad and encompasses a number of professions that may not have direct technical contact with lower-level aspects of data management, such as relational database management.

Data mining

Data mining is the process of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations and financial analysts, but is increasingly being used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods.

It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data"[19] and "the science of extracting useful information from large data sets or databases."[20] In relation to enterprise resource planning, according to Monk (2006), data mining is "the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making".[21]
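
As a toy illustration of looking for patterns in transaction data, the sketch below counts which pairs of items appear together in several transactions. Co-occurrence counting is just one simple technique chosen for this example, not the method of the cited authors; the transactions and the function name `frequent_pairs` are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]


def frequent_pairs(baskets, min_count):
    """Return item pairs that occur together in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("bread", "milk") and ("milk", "bread") the same key.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}


print(frequent_pairs(transactions, min_count=2))
```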

Data transforms

Data transforms is the process of automating the transformation of both real-time and offline data from one format to another. Standards and protocols provide the specifications and rules, and the transformation usually occurs in a processing pipeline for aggregation, consolidation or interoperability. The primary users are systems-integration organizations and compliance personnel.
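
A minimal sketch of a format-to-format transform, assuming CSV input with a header row and JSON output. The function name `csv_to_json` and the sample input are invented for this example; a production pipeline would also handle type conversion, validation and encoding rules defined by the relevant standard.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Values stay as strings; no type inference is attempted here.
    return json.dumps(rows, indent=2)


# Hypothetical input, invented for illustration.
source = "name,age\nAda,36\nGrace,45\n"
print(csv_to_json(source))
```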

Data presentation architecture

A data visualization from social media

Data presentation architecture (DPA) is a skill-set that seeks to identify, locate, manipulate, format and present data in such a way as to optimally communicate meaning and proffer knowledge.

Historically, the term data presentation architecture is attributed to Kelly Lautt:[22] "Data Presentation Architecture (DPA) is a rarely applied skill set critical for the success and value of Business Intelligence. Data presentation architecture weds the science of numbers, data and statistics in discovering valuable information from data and making it usable, relevant and actionable with the arts of data visualization, communications, organizational psychology and change management in order to provide business intelligence solutions with the data scope, delivery timing, format and visualizations that will most effectively support and drive operational, tactical and strategic behaviour toward understood business (or organizational) goals. DPA is neither an IT nor a business skill set but exists as a separate field of expertise. Often confused with data visualization, data presentation architecture is a much broader skill set that includes determining what data on what schedule and in what exact format is to be presented, not just the best way to present data that has already been chosen (which is data visualization). Data visualization skills are one element of DPA."


DPA has two main objectives:


With the above objectives in mind, the actual work of data presentation architecture consists of:

Related fields

DPA work shares commonalities with several other fields, including:

References


  1. ^ a b Michael Friendly (2008). "Milestones in the history of thematic cartography, statistical graphics, and data visualization".
  2. ^ Vitaly Friedman (2008) "Data Visualization and Infographics" in: Graphics, Monday Inspiration, January 14th, 2008.
  3. ^ Fernanda Viegas and Martin Wattenberg, "How To Make Data Look Sexy", April 19, 2011.
  4. ^ a b Frits H. Post, Gregory M. Nielson and Georges-Pierre Bonneau (2002). Data Visualization: The State of the Art. Research paper TU Delft, 2002.
  5. ^ Brian Willison, "Visualization Driven Rapid Prototyping", Parsons Institute for Information Mapping, 2008
  6. ^ Lengler, Ralph; Eppler, Martin J. "Periodic Table of Visualization Methods". Retrieved 15 March 2013.
  7. ^ "Inspiration Day at the University of Waterloo Stratford, Campus". Retrieved April 10, 2014. 
  8. ^ Edward Tufte-Presentation-August 2013
  9. ^ a b c Tufte, Edward (1983). The Visual Display of Quantitative Information. Cheshire, Connecticut: Graphics Press. ISBN 0961392142. 
  10. ^ Stephen Few-Perceptual Edge-Selecting the Right Graph for Your Message-2004
  11. ^ Stephen Few-Perceptual Edge-Graph Selection Matrix
  12. ^ Stephen Few-Tapping the Power of Visual Perception-September 2004
  13. ^ Stephen Few-Tapping the Power of Visual Perception-September 2004
  14. ^ Stephen Few-Selecting the Right Graph for Your Message-September 2004
  15. ^ Robert Amar, James Eagan, and John Stasko (2005) "Low-Level Components of Analytic Activity in Information Visualization"
  16. ^ William Newman (1994) "A Preliminary Analysis of the Products of HCI Research, Using Pro Forma Abstracts"
  17. ^ Mary Shaw (2002) "What Makes Good Research in Software Engineering?"
  18. ^ "Data Visualization: Modern Approaches". in: Graphics, August 2nd, 2007
  19. ^ W. Frawley and G. Piatetsky-Shapiro and C. Matheus (Fall 1992). "Knowledge Discovery in Databases: An Overview". AI Magazine: pp. 213–228. ISSN 0738-4602. 
  20. ^ D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press, Cambridge, MA. ISBN 0-262-08290-X. 
  21. ^ Ellen Monk, Bret Wagner (2006). Concepts in Enterprise Resource Planning, Second Edition. Thomson Course Technology, Boston, MA. ISBN 0-619-21663-8. 
  22. ^ The first formal, recorded, public usages of the term data presentation architecture were at the three formal Microsoft Office 2007 Launch events in Dec, Jan and Feb of 2007-08 in Edmonton, Calgary and Vancouver (Canada) in a presentation by Kelly Lautt describing a business intelligence system designed to improve service quality in a pulp and paper company. The term was further used and recorded in public usage on December 16, 2009 in a Microsoft Canada presentation on the value of merging Business Intelligence with corporate collaboration processes.
