Why Visualized Data Analysis Is Something You Should Have in Your Toolkit

By Chen Qiang, nicknamed Chensha at Alibaba.

In this age where just about everyone needs to put on the hat of a data analyst, almost all Alibaba employees, and employees elsewhere, are involved in data collection, processing, and consumption. Data visualization is the link that connects data processing to consumption, and its quality is crucial. A well-made visualization can deliver compelling insights, whereas a poor one can make all your hard work worthless. Today, Chen Qiang, a senior product manager with Alibaba, will introduce data visualization tools and explain how to select an effective chart.

In my area of enterprise data management, the quality of the data visualization I see from day to day, especially in PowerPoint slides and reports, varies greatly. Today, I will share my general thoughts and suggestions about how to use data visualization tools effectively.

In particular, we will offer suggestions about the following processes:

Selecting easy-to-use tools
Processing appropriate datasets
Selecting appropriate chart types
Learning from examples

Data Visualization Tools

First, I would like to talk about the three goals of data visualization: accuracy, clarity, and elegance. Charts that meet these three goals are called efficient data visualization charts:

Accuracy: Convey the feature information of the data accurately, without omissions or redundancy, and without leaking confidential information or misleading your viewers.
Clarity: The less time it takes to obtain the chart feature information, the better.
Elegance: Charts should be attractive rather than flashy, and should follow the principle of parallelism: charts in the same scenario should follow the same standards.

In addition, these three goals are ranked by importance: accuracy > clarity > elegance. In principle, you should be following all of the underlying principles.

The question is, how can we select convenient tools to help us meet these goals? Efficient data visualization can be roughly divided into two types based on purpose, a distinction drawn from a video interview with Cole Nussbaumer Knaflic:

Interpretative

The feature information or value of the target data is known, and the main purpose is to convey insights and explanations to others. Therefore, refined presentation is your main goal.

Exploratory

The feature information or value of the target data is unknown. The main purpose is to analyze and explore data. Therefore, fast and efficient data interaction is your main goal.

Before you begin, think about your intentions. Generally, you cannot do everything at once. Flexible data transformation and refined data representation do not conflict, but it is difficult to balance them, and each tool leans toward one or the other. Simple and easy-to-use BI tools can help us quickly complete a visualization, but the result may be unsatisfactory. In theory, professional chart code libraries can be adjusted down to any detail, but they have obvious disadvantages in efficiency and ease of use, which raises the barrier to entry.

In principle, tools are not good or bad. "Good" and "bad" are only relative terms. In fact, many tools can help produce attractive and effective visualized content while being easy-to-use. I conducted a qualitative evaluation of the common types of representative tools from a personal perspective. These are my results below.

Here I highlighted Excel and Tableau. As the most popular and well-known BI software in the industry, Tableau is easy to use and produces excellent visualization results, helping you analyze and explore data. Microsoft Office Excel is a product that is seriously underestimated by most people. It can easily fine-tune various charts and draw beautiful data charts.

If you do not have any preexisting inclinations, Microsoft Office Excel is very suitable as your first data visualization tool for in-depth research.

Selection of Effective Charts

Visualization Process

Putting aside professional theory, we can simply divide the creation of a chart into three steps:

Select the data to be delivered.
Search for visualization methods.
Implement the methods and improve the details.

For Step 1, you need to consider the information conveyed by the data itself as well as your own insight, and list this information clearly. This helps you choose an appropriate visualization method. Next, I will introduce the general methodology, which goes beyond simply choosing between a column chart and a pie chart.

Visual Encoding

First, let's look at two figures:

When visual encoding (color) is used to convey information, it is easier for people to understand the features of the original data.

For example, in a common column chart, we use the "height" and "relative position" of columns to convey two sets of data information. Column charts are often easier to understand than spreadsheet data that are not encoded.

We need to get familiar with two important concepts: visual encoding and visual channel. If the human brain is regarded as an information decoding system, visualization is the encoding process of information or data. After the information is visually encoded, the content is transmitted to the brain through the eyes, and the brain decodes the information and obtains knowledge.

With so many visual channels available for charts, how can we select appropriate channels for mapping data? Both the data types and the expressive power of the visual channels need to be considered. I'll explain the basic theory in detail here.

Data Types

Generally, data is divided into three types: category, order, and number. Apples and bananas belong to a category, dates belong to an order, and 5,000 in profit is a numeric value. In many commercial visualization tools, ordered and categorized data is called dimension data, whereas numeric data is called measurement (metric) data.

The visual channels that are applicable to dimension data and measurement data are very different. For example, color hue applies to dimension data, but not to measurement data. Select the correct visual channel to transmit information more efficiently.

Expressive Power of Visual Channels

In the book titled Data Visualization by Professor Chen Wei of Zhejiang University, four indicators are used to judge the expressive power of a channel:

Accuracy
Identifiability
Separability
Visual prominence

A deep understanding of these standards can help us comprehend the reasons behind some suggestions for visualized chart modifications.

These indicators give a scientific basis to several rules of thumb from practice. For example:

Some professional designers are strongly against the use of pie charts, because human perception of area and angle does not scale linearly with the underlying values.
Column charts that use length to map data are generally the best options for visualization, because the perceived length and the actual length are linearly related.
It is also not recommended to use 3D effects in conventional statistical charts for business, because volume seriously affects the accuracy of perception.

Separability refers to the idea that multiple visual channels cannot be used without restrictions. Every time a channel is added to map data, the impact on existing encoding methods must be considered. In particular, the addition of size can affect the effect of other visual channels.

Let's take a column chart as an example. The column chart in the following figure uses width to map a measurement field, but the width affects the effectiveness of the length. Simultaneous use of the two channels tends to make the area become the visual channel, which will impact the overall effect of the chart.

A colleague asked me why I did not add the "rounded corner" function to the columns in the column chart. In fact, it is also due to this reason. Excessively rounded corners will cause a loss of accuracy in the length, damaging the overall expressiveness of a chart.
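The width-versus-length interference described above can be checked with simple arithmetic. The following is a minimal sketch with made-up numbers: the bar labels, heights, and widths are hypothetical, and the assumption is that viewers judge a bar by its area once both dimensions vary.

```python
# Hypothetical data: (label, value mapped to height, value mapped to width).
bars = [("A", 10, 1), ("B", 10, 3), ("C", 20, 1)]


def perceived_area(height, width):
    """Area a viewer sees when both height and width encode data."""
    return height * width


areas = {label: perceived_area(h, w) for label, h, w in bars}

# A and B share the same height-encoded value, yet B covers three times the ink.
# B even looks larger than C, whose height-encoded value is twice as big.
for label, h, w in bars:
    print(f"{label}: height={h} width={w} area={areas[label]}")
```

Here bar B reads as the largest mark even though C carries the largest height value, which is the loss of accuracy the text warns about.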

Academically, there is a long list of expressive priorities for visual encoding of data. Here, I'll simplify these concepts and provide only one list of recommended visual channels. Theoretically, these channels can be used together. Of course, you should select the best solution based on your actual situation.

Another effective practice is to not build a visual solution from scratch. Instead, add visual channels to the most basic statistical chart types, and then make continuous attempts to achieve the expected results. Not all charts can use all visual channels. For example, an administrative map does not have the length channel.

Design Principles

For chart design, good visual encoding is the most important point. In addition, data screening is also a challenge. Too much information will make the chart look chaotic, causing cognitive overload. For visualized content built in JavaScript, data interaction is also a focus of attention.

Cognitive Load

Generally, articles on visualization measure the load by using the data-ink ratio. Unreasonable design will convey excessive, redundant, or meaningless information to the audience.
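For reference, the data-ink ratio is usually attributed to Edward Tufte and is commonly written as follows. This is the general definition, not a formula taken from this article:

```latex
\[
\text{data-ink ratio} = \frac{\text{ink used to present data}}{\text{total ink used in the chart}}
\]
```

A ratio closer to 1 means that less ink is spent on decoration and more on the data itself.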

First, we need to pay attention to whether too much data is presented or visual channels are misused.

Second, we can use the Gestalt principles to simplify or optimize our chart elements and reduce the cognitive load.

The Gestalt principles consist of eight items. Here, I focus on the most important items among them: the principle of proximity, principle of similarity, and principle of closure.

The Principle of Proximity

People tend to perceive elements that are physically close to each other as a whole.

Let me show this simply in a line of dots:

... ........ .......

You may naturally think that these are three groups. By using this psychological phenomenon, we have built a typical grouped column chart.
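As a rough illustration of how a grouped column chart relies on proximity, the sketch below computes bar positions so that bars within a group touch while groups are separated by a gap. The function name and the default widths are assumptions made for this example, not part of any charting library.

```python
def grouped_bar_positions(n_groups, n_series, bar_width=1.0, group_gap=1.0):
    """Return the left edge of every bar, one list per group.

    Bars inside a group touch each other; groups are separated by group_gap.
    """
    positions = []
    group_span = n_series * bar_width + group_gap
    for g in range(n_groups):
        start = g * group_span
        positions.append([start + s * bar_width for s in range(n_series)])
    return positions


positions = grouped_bar_positions(n_groups=3, n_series=2)
print(positions)  # [[0.0, 1.0], [3.0, 4.0], [6.0, 7.0]]
```

The gap between groups is what lets the eye read three clusters instead of six unrelated bars.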

The principle of proximity can also be used to guide users in reading tabular data, as in this example from Cole Nussbaumer Knaflic's blog:

The Principle of Similarity

Objects with similar attributes such as color, size, and shape, tend to be considered to be a whole or be correlated.

This psychological phenomenon, coupled with color hue, can easily promote visual prominence and allow us to quickly notice the processed target data. The preceding example that includes number 5 shows the principle of similarity.

From my personal experience, color is the best visual channel for applying this principle. A legend works in the same way: viewers match each legend entry to the chart elements that share its color.

We can further use this effect to help users interpret charts.

The Principle of Closure

People tend to consider elements that are enclosed together as a single group.

The principle of closure is often used for annotation. The use of a small amount of "ink" can make the target area more visually prominent. Let's further process the preceding case of the principle of similarity to describe the effect of the principle of closure.

Flexible use of the Gestalt principles and visual encoding of features is an important skill for data visualization in specific charts.

When we see a chart and point out the defects of the chart based on our aesthetic experience, we may as well reflect on which psychological principle this defect violates or whether the data-ink ratio is imbalanced.

Experience

I have a lot of practical experience in visualization. Here I want to further emphasize the enormous impact of position and color.

Position

Position is a rich concept. Every element in a chart is placed in a position. You must be careful with the position attribute of elements. You need to think about the positions of axis labels, the position of the description text, the position of the title, the position of the legend, or the relative position of the graph itself. You can adjust the chart structure to make it easier for users to understand the information you want to convey.

Alignment: The "principle of continuity" in psychology uses alignment to create an invisible path, which leads people to interpret information more easily. We must pay attention to the use of alignment at all times.
Sorting: You must always sort data. Columns in a column chart should never be randomly placed. Sorting is the most important application for the position attributes of elements. Conveying data in an irregular manner puts an extra interpretation burden on users.
Reference: Positions are relative. If you want to accurately interpret data positions, you must use a reference system. The reference system can be either the x-axis or the y-axis, or a relative reference between two points or columns. In short, you must have a reference point.
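The sorting advice above can be sketched in a few lines. The region names and figures are hypothetical, and the text bars only stand in for a real column chart.

```python
# Hypothetical sales figures by region, in arbitrary source order.
sales = {"North": 120, "South": 340, "East": 90, "West": 210}

# Sort descending by value so the longest bar comes first.
ranked = sorted(sales.items(), key=lambda item: item[1], reverse=True)

# Render a crude text bar chart; one '#' per 10 units.
for region, value in ranked:
    print(f"{region:<6}{'#' * (value // 10)} {value}")
```

For ordered dimensions such as dates, sorting by the dimension itself is usually the more natural choice than sorting by value.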

Color

Color is the most important and the most easily abused visual encoding method. Color has three variants: hue, saturation, and brightness. Color variation also stimulates people's emotions. Therefore, the use of colors must be carefully considered by chart creators. Here are some practical tips:

Try not to use red and green together, because many color-blind viewers cannot distinguish between the two colors. This is also why the first color in the default palette of most chart libraries and software is blue.
Dimension data uses hue, whereas measurement (metric) data uses saturation and brightness.
Use the least number of colors possible, while preserving the completeness of information.
Colors must be consistent with the context. For example, if the preceding chart uses "green" to represent Hong Kong, the following charts should not use "yellow" or "red" to represent Hong Kong. When "predicted data" uses green, blue should be used for "actual data" to maintain order and reduce users' cognitive load.
Colors can attract more attention than other visual channels, which means that colors fatigue people more easily. Make sure that the colors you use are "ordered." A sequence that is multi-colored and has multiple shades of colors is not acceptable.
Before designing, you can refer to some brand design manuals, which are generally called visual manuals. Using brand colors makes it easier to win the favor of target users. However, not all brand colors are applicable. Therefore, we need to consider their effect before using them.
Consider your audience when choosing a color. Due to cultural and religious differences, the emotional impact of the same color varies greatly between groups. For example, Chinese people like red, whereas Western people do not necessarily like it. Hospitals and the financial industry are also typically color-sensitive. When preparing a chart for stock traders, do not use green as the primary color.
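The rule that dimension data uses hue while measurement data uses saturation and brightness can be sketched with Python's standard colorsys module. The palette values, function names, and lightness range below are illustrative assumptions, not recommendations from the article.

```python
import colorsys

# Illustrative palette with distinct hues and no red/green pair (an assumption).
PALETTE = ["#4c78a8", "#f58518", "#b279a2", "#9d755d"]


def categorical_colors(categories):
    """One hue per dimension value; a category always keeps the same color."""
    if len(categories) > len(PALETTE):
        raise ValueError("too many categories for this palette")
    return dict(zip(categories, PALETTE))


def sequential_color(value, vmin, vmax, hue=0.6, saturation=0.6):
    """Map a metric to lightness within one hue (larger value = darker).

    Assumes vmax > vmin.
    """
    t = (value - vmin) / (vmax - vmin)   # normalize to 0..1
    lightness = 0.9 - 0.6 * t            # 0.9 (light) down to 0.3 (dark)
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return "#" + "".join(f"{round(c * 255):02x}" for c in (r, g, b))


city_colors = categorical_colors(["Hong Kong", "Beijing", "Hangzhou"])
profit_shades = [sequential_color(v, 0, 5000) for v in (0, 2500, 5000)]
print(city_colors)
print(profit_shades)
```

Reusing one mapping such as city_colors across all charts in a report also keeps colors consistent with the context, as the tips above suggest.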

Recommendations on selection of statistical charts

Plenty of resources are available online. Before selecting a chart, figure out what your data is meant to convey. Data analysis is varied, but its goals can be summarized as no more than four: comparison, focus, induction, and deduction. Choose a concrete chart type based on the goal.

The following figure shows widely circulated suggestions about charts:

The UK's Financial Times has also published relevant suggestions:

Download

The source image of the suggestions from the UK's Financial Times can be downloaded here (zoom in to read the text clearly): https://alitech-public.oss-cn-beijing.aliyuncs.com/1567064473032/shuju%20fenxi.png

Data Preparation

Adjust Data Structure for Visualization

Generally, before creating a data chart, you need to work through the processes of data collection and processing. Consider MaxCompute, which is familiar to Alibaba developers, as an example. The following figure shows a simple process:

To meet design specifications and remain maintainable and robust, most data warehouses do not allow upper-layer applications to make intrusive, customized changes to the warehouse design. However, different types of applications require different data formats. In visualization, adjusting the data before final chart production is very common, especially when charts are built with BI software. The adjustments include but are not limited to the following:

1. Conversion of rows and columns

For a clustered column chart for data comparison and analysis, different tools have different configuration methods for data interaction. The row and column data in the table needs to be flexibly converted to meet the corresponding software requirements.
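A row-to-column conversion of this kind might look like the following sketch. The field names and figures are hypothetical, and real BI tools or libraries offer their own pivot functions; this version uses only the standard library.

```python
from collections import defaultdict

# Hypothetical "long" rows as they might come from a warehouse table.
long_rows = [
    {"quarter": "Q1", "product": "A", "sales": 100},
    {"quarter": "Q1", "product": "B", "sales": 80},
    {"quarter": "Q2", "product": "A", "sales": 120},
    {"quarter": "Q2", "product": "B", "sales": 95},
]


def pivot(rows, index, columns, values):
    """Turn long rows into one wide row per index value."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[index]][row[columns]] = row[values]
    return [{index: key, **cells} for key, cells in wide.items()]


wide_rows = pivot(long_rows, index="quarter", columns="product", values="sales")
print(wide_rows)
```

The long form suits tools that take one series field, while the wide form suits tools that expect one column per series in a clustered column chart.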

2. Readable conversion

The data in the original table may contain only feature data such as "ID", "XXX code", and English content. To make the final visual processing effective, and make the chart easy to interpret, we need more additional data for association processing. For example:

Locate the "dimension table" associated with the "fact table" and obtain the name and other information for each ID.
Ensure all terms are in the target language of the chart.
Find easy-to-recognize data such as "short name" and "nickname".
Convert time data into time-type fields that fit the business scenario, such as "quarter", "fiscal year", "week", and "trading day".
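The readable conversions listed above can be sketched as a lookup against a dimension table plus a few derived time fields. The table contents and field names are hypothetical. Fiscal years and trading days depend on business calendars and are not covered here.

```python
from datetime import date

# Hypothetical dimension table: product ID -> short display name.
product_dim = {101: "Phone", 102: "Laptop"}

# Hypothetical fact rows that carry only IDs and raw dates.
fact_rows = [
    {"product_id": 101, "order_date": date(2019, 2, 14), "amount": 300},
    {"product_id": 102, "order_date": date(2019, 8, 29), "amount": 900},
]


def to_readable(row):
    """Attach a display name and calendar-based time fields to one fact row."""
    d = row["order_date"]
    iso_year, iso_week, _ = d.isocalendar()
    return {
        "product": product_dim.get(row["product_id"], "Unknown"),
        "quarter": f"{d.year} Q{(d.month - 1) // 3 + 1}",
        "week": f"{iso_year}-W{iso_week:02d}",
        "amount": row["amount"],
    }


readable_rows = [to_readable(r) for r in fact_rows]
print(readable_rows)
```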

3. Conversion for business scenarios

This type of conversion needs to be combined with certain specific scenarios. Generally, the original table provides only raw data, while specific scenarios provide data conversion rules. The following is a typical example:

Divide "age" into segments. The original table records only the users' birthdays, which are then processed into the "18-24 years" and "25-30 years" range fields. Such processing helps users interpret and create visualized content.
Customers are divided into new and old customers. Both "new" and "old" are concepts that are relative to time. Data about "new customers" and "old customers" are not stored in original data tables in the data warehouse. You need to analyze the data based on the current time window to customize the "new customers" and "old customers" fields.
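Both conversions above can be sketched as small functions. The segment boundaries outside the two ranges named in the text, the 90-day window, and all function names are assumptions made for illustration.

```python
from datetime import date, timedelta


def age_on(birthday, today):
    """Full years elapsed between birthday and today."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


def age_segment(age):
    """Bucket an age; only the two middle ranges come from the text."""
    if age < 18:
        return "under 18"
    if age <= 24:
        return "18-24 years"
    if age <= 30:
        return "25-30 years"
    return "over 30"


def customer_type(first_order, today, window_days=90):
    """'new' if the first order falls inside the window (window is assumed)."""
    return "new" if today - first_order <= timedelta(days=window_days) else "old"


today = date(2019, 9, 1)
segment = age_segment(age_on(date(1995, 10, 12), today))
kind = customer_type(date(2019, 7, 15), today)
print(segment, kind)  # 18-24 years new
```

Because "new" and "old" depend on the current time window, these fields need to be recomputed whenever the reference date changes.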

Remove Abnormal Data

Abnormal data is unavoidable in raw data. Erroneous (dirty) data, test data, and unreasonable data can be collectively referred to as abnormal data. If such data is not removed, it directly degrades the final visualization, which in turn affects the quality of analysis and the efficiency of decision-making. Before you create a chart, make sure that you have completed this step.

Unreasonable data is also relative to the specific analysis scenario. For example, suppose we set a metric to measure the performance of telemarketers: an average of three successful deals per week means that the telemarketer is excellent. For this scenario, we need to remove "interns", or more specifically employees with too little work experience, from the relevant data.
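A filter for the telemarketer scenario might look like the sketch below. The record fields (role, weekly_deals, is_test) and the rules are hypothetical and would come from the actual business scenario.

```python
# Hypothetical telemarketer records; field names are illustrative.
records = [
    {"name": "Ann", "role": "staff", "weekly_deals": 4, "is_test": False},
    {"name": "Bob", "role": "intern", "weekly_deals": 1, "is_test": False},
    {"name": "QA01", "role": "staff", "weekly_deals": 99, "is_test": True},
    {"name": "Cy", "role": "staff", "weekly_deals": -2, "is_test": False},
]


def is_abnormal(rec):
    """Flag test rows, impossible values, and rows outside the analysis scope."""
    return rec["is_test"] or rec["weekly_deals"] < 0 or rec["role"] == "intern"


clean = [r for r in records if not is_abnormal(r)]
removed = [r["name"] for r in records if is_abnormal(r)]
print([r["name"] for r in clean], removed)
```

Keeping the list of removed rows makes it easier to explain later why a record does not appear in the chart.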

Process Special Values in a Refined Manner

For the effect of visualization, we need to focus on ambiguous and "extreme data." The existence of such data can sometimes affect the presentation of our content.

Empty, null, 0

These three data values are typically ambiguous data. In some scenarios, they express the same meaning, whereas in other scenarios they represent quite different information. I use an exam as a case to compare the differences among these three data values:

0: John Smith took the English exam and got 0 points.
Empty: John Smith did not take the English exam.
Null: John Smith does not have an English exam.

When creating visualizations, make sure that these values are represented accurately.
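One way to keep the three cases apart in code is to use explicit markers instead of overloading 0. The sketch below is only an illustration: the Missing enum, the labels, and the choice to exclude missing scores from the average are all assumptions.

```python
from enum import Enum


class Missing(Enum):
    ABSENT = "did not take the exam"   # the "empty" case in the text
    NOT_APPLICABLE = "no such exam"    # the "null" case in the text


def chart_label(score):
    """Text shown on the chart for one score."""
    if score is Missing.ABSENT:
        return "absent"
    if score is Missing.NOT_APPLICABLE:
        return "n/a"
    return str(score)


def average(scores):
    """Mean of the scores actually earned; missing markers are skipped."""
    taken = [s for s in scores if not isinstance(s, Missing)]
    return sum(taken) / len(taken) if taken else None


scores = [0, 80, Missing.ABSENT, Missing.NOT_APPLICABLE]
labels = [chart_label(s) for s in scores]
mean_score = average(scores)
print(labels, mean_score)  # ['0', '80', 'absent', 'n/a'] 40.0
```

Whether an absent student should count as 0 in an average is a business decision; the point is that the chart should make the chosen rule visible.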

Extreme Data

Extreme data refers to an extremely uneven distribution of data, such as "Among 100 pieces of data about sales distribution of a product, one piece of data shows 100,000, while the remaining 99 pieces are between 0 and 1,000." In a chart that faithfully represents the actual data, it will be difficult to see the feature information of most of the sample data. For such a business scenario, you need to take corresponding steps, such as removing or truncating the extreme data and adding explanatory text.
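The scenario above can be reproduced with a small sketch that caps the extreme value and records which points were capped so that they can be annotated. The generated values and the cap of 1,000 are assumptions chosen to match the example in the text.

```python
# The scenario from the text: 99 values between 0 and 1,000 plus one at 100,000.
values = [i * 10 for i in range(99)] + [100_000]


def cap_extremes(data, cap):
    """Clip values above cap and report which indexes were clipped."""
    capped = [min(v, cap) for v in data]
    clipped = [i for i, v in enumerate(data) if v > cap]
    return capped, clipped


capped, clipped = cap_extremes(values, cap=1_000)
print(max(values), max(capped), clipped)  # 100000 1000 [99]
```

Any capped point should be labeled with its true value in the chart text so that the chart stays accurate.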

Converge Data to a Reasonable Level

When using BI tools for visual representations, you need to pay attention to the size of raw data. Generally, the performance of the server where software services are located is not unlimited. An appropriate size of data can help you achieve the best performance from interactive presentation.

When the raw data volume is too large, you can remove some fields and aggregate the data according to actual scenarios.
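Dropping unneeded fields and aggregating before the data reaches the BI tool can be sketched as follows. The order rows and field names are hypothetical; in practice this step would usually run in the data warehouse rather than in Python.

```python
from collections import defaultdict

# Hypothetical order-level rows; the chart only needs totals per city.
orders = [
    {"order_id": 1, "city": "Hangzhou", "amount": 120},
    {"order_id": 2, "city": "Beijing", "amount": 200},
    {"order_id": 3, "city": "Hangzhou", "amount": 80},
]


def aggregate(rows, key, value):
    """Sum one numeric field per distinct key, discarding all other fields."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


city_totals = aggregate(orders, key="city", value="amount")
print(city_totals)  # {'Hangzhou': 200, 'Beijing': 200}
```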

In addition, most tools support "derived fields." As far as possible, solidify such derived data in the "materialized" or "entity table" phase, which can help improve performance. For derived data that needs to be computed by business intelligence (BI) software, the computing process determines its performance:

Computing in the process of storing data as a physical table or a physical data model > Computing through online analytical processing (OLAP) > In-memory computing > Computing that leads to data transfer from internal memory to external memory

Case Study

This article has briefly introduced some data visualization techniques, but these are far from complete. To achieve the best practices of the data visualization field, you will need to acquire a large amount of knowledge and apply this knowledge flexibly.

Here is an excellent case of optimizing a visualized chart. The original material is from the English blog of Cole Nussbaumer Knaflic. If you are interested, you can check out her blog for more useful information. Mike Bostock, the author of D3, has also produced a large amount of high-quality visualized content, which is worth learning from.

Case study: Improvement of Visualization for Capital Budget Data

The Original Chart

At first glance, there are no mistakes in this chart, but the blog author has identified the following areas for improvement:

The readability of the y-axis label can be improved.
The x-axis does not distinguish between the past and the future.

In fact, there are other hidden improvements that have not been pointed out:

As mentioned above, visual channels affect each other's effectiveness. The small squares in this legend are too small, which affects the interpretation of colors. Therefore, it is difficult to distinguish between the blue major projects and the light blue proposed allowance. Imagine the more extreme legends that you have encountered.
The color saturation of the black axis label is too high, which affects the visual prominence of the chart itself. A label with a light color makes it easier for users to focus on the data itself.
The measurement data description is missing.

Improved Version 1

Here, the author removes the square bullets in the legend and colors the text. This solves the problem of legend recognition.

At this point, the author is stuck and does not know how to improve the legend further. However, the data-ink ratio is clearly too low: many relatively bright colors are used, and improved version 1 uses both the color and pattern visual channels. There is a lot of ink, yet no additional data information is conveyed.

Therefore, the author made various analyses in the improved version 2. It is worth learning to remove interference by using black and white as shown in the following figure. Some design students may be familiar with this method. Colors can interfere with a designer's judgment. Product directors usually use black and white when drawing a prototype.

Improved Version 2

The budgets of the three major projects decreased significantly from 2018 to 2019, and then decreased slowly over time:

Other projects also show a downward trend:

The budget for new projects is increasing significantly:

The other two analyses are similar:

The author analyzes some data features in the improved version 2. Obviously, these features are not represented in visualization, and the chart still has room for improvement.

Improved Version 3

In improved version 3, the author uses the customer's brand colors. This method is usually very useful, unless the customer's brand colors are too many and too bright.

In addition, the author guides the user perspective to the existing allowance and the proposed allowance.

We have now seen the pattern in the two groups of data. How can we convey this pattern to users more easily and make it easier to understand through visualization? The author improves the chart label format.

Improved Version 4

At this step, the improvement is quite successful. It seems like the addition of all the previous information alone is enough, but the author has further insight. Users need to pay attention to the relative difference between the two lines. Therefore, the improved version is also updated as follows:

Improved Version 5

Next is the final process. The author uses a suitable format to add personal insights and interpretation into the chart.

Final Version

The author inserts the previously ignored data into the chart in an appropriate way, and uses annotations following the principle of similarity to establish associations between the graphics and the text.