Introduction to Accounting Data Analytics and Visualization
Income Statement
Balance Sheet
Statement of Cash Flows
Statement of Shareholders' Equity
Control risk refers to the risk of material misstatement occurring as a result of insufficient controls or failure to uphold the controls.
The inherent risk is the risk of misstatements occurring for reasons other than control weaknesses.
The detection risk is the risk that a material misstatement will not be discovered by the audit procedures.
Analytical Thinking
Analytical mindset = thinking carefully and methodically to achieve organizational objectives by providing actionable insight.
Standard Costing, Financial Statement Articulation, and Auditing
1. Standard costing and variance analysis
Inventory is recorded at its standard rather than its actual cost. The benefit is that at the end of the fiscal period, you can gain a detailed understanding of why actual profit either exceeded or failed to meet expectations, so that managers know where to direct their attention to improve performance.
e.g., when looking at the inputs to make a product…
Activity variances allow you to quantify the impact of making and selling more or fewer units than expected. Price and quantity variances assume that you perfectly predicted the actual production and sales volume. Price variances look at what the impact on profit would be if you purchased the inputs at higher or lower prices than expected. Quantity variances look at what the impact on profit would be if you used more or fewer inputs than expected to make each unit of your finished good.
Together, these variances add up to the overall difference between budgeted profit and actual profit.
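As a sketch in standard variance-analysis notation (AP/SP are actual/standard price, AQ/SQ are actual/standard quantity; the notation is mine, not from these notes):

$$\text{Price variance} = (AP - SP) \times AQ$$
$$\text{Quantity variance} = (AQ - SQ) \times SP$$

A favorable variance arises when the actual price or quantity comes in below standard.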
2. Financial Statement Articulation
Investors and creditors may want to know how the company performed so they can determine whether to invest more or pull out some of their investment. Regulators may want to know so they can determine the extent of the tax liability.
The balance sheet identifies the value of assets at the end of the fiscal period, and the extent to which those assets are owned by shareholders versus creditors.
The statement of cash flows explains the extent to which the cash balance changed as a result of the company’s operations relative to its financing and investing activities.
The statement of shareholders' equity gives a detailed explanation of how equity changed as a result of investments by owners, distributions to owners, and net income.
The income statement explains the extent to which the company’s equity increased or decreased from last year as a result of the company’s operations. It compares the revenue paid by customers relative to the expenses incurred by the company to generate the goods and services that were sold during the period.
3. Auditing
In an audit, the hope is that the company’s financial statements are free of material misstatements. The audit risk model is a framework that helps auditors determine where to focus their efforts and how much effort to exert to gather evidence about the amounts shown on the financial statements.
The audit risk model asserts that the total amount of audit risk is equal to the control risk times the inherent risk times the detection risk.
If the audit risk is low, then less effort needs to be exerted to gather evidence than if the audit risk is higher. If the control risk or the inherent risk is high, auditors will have to spend more time and incur more costs to gather enough evidence to ensure that the financial statements are not materially misstated.
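In formula form (the worked numbers below are hypothetical, for illustration only):

$$\text{Audit Risk} = \text{Inherent Risk} \times \text{Control Risk} \times \text{Detection Risk}$$

For example, if the target audit risk is 5%, inherent risk is 80%, and control risk is 50%, the acceptable detection risk is $0.05 / (0.80 \times 0.50) = 12.5\%$, so the auditor can tolerate a relatively high detection risk and gather somewhat less evidence.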
Deductive reasoning: a top-down approach based on established rules; the process of evaluating an inference using a set of true premises.
eg. We always follow the basic rules of accounting:
- Assets = Liabilities + Owner’s Equity
- Finished Goods (FG) beginning balance + COGM − COGS = FG ending balance
It's like playing Sudoku, where there are fixed rules.
Inductive reasoning: a bottom-up approach with no fixed rules to rely on; general rules are often inferred from observing historical patterns. When evaluating an inference, the premises are not intended to be valid; they only provide strong evidence for the conclusion.
eg.
- Bad debt expense = average amount of uncollected A/R
For instance, suppose we observe that during the past 5 years, bad debts have accounted for about 3% of credit sales each month. We can then establish a general rule that each month's bad debt expense will amount to 3% of monthly credit sales.
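As a quick illustration (the sales figure is hypothetical): if monthly credit sales are \$200,000, the rule estimates

$$\text{Bad debt expense} = 0.03 \times \$200{,}000 = \$6{,}000$$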
Conclusion:
Data analytics can help with deductive reasoning by confirming that the general premise is true in a specific context.
Data analytics can also help with inductive reasoning by identifying patterns that can be used to create a more general rule, which can then be applied in other contexts.
Advanced Analytics and the Art of Persuasion
Summarize your analysis, keeping the main point clear while allowing readers to understand the details.
1. Use a familiar structure
e.g., use scatter plots, line graphs, bar charts, and pie charts rather than beautiful charts and graphs that are difficult to understand.
Use newer chart types sparingly (like bullet charts and parallel coordinate plots).
2. Acknowledge limitations
Every analysis has limitations. Taking a little time to acknowledge the limitations of your analysis shows that you're trying to avoid biased results; it also shows that you're willing to work with others and to continue learning.
You may have to do some additional analysis, which will probably lead to greater insight and make your results more convincing to others.
FACT Framework for making data-driven decisions using big data.
1. Frame the question
The first and most important step in the FACT framework is to frame a question in a way that can be answered using your tools and skillset, and that furthers the organization's goals.
Identify a problem — As you learn more about data and how to analyze it, you’ll be able to better identify opportunities for applying data and analytic technologies in accounting settings as well as many other settings.
Once you’ve identified the problem, it’s important to frame a question about it in a way that will encourage a careful, methodical approach so that you get to actionable insight.
e.g., What factors caused a decline in sales for regions in which sales were lower than expected?
Q1 → identifying regions with lower-than-expected sales implies a classification algorithm.
Q2 → a regression algorithm that will quantify the relationship between sales and the factors that explain sales.
Conclude:
Your ability to frame a question is important because it encourages a careful, methodical approach. If a careful, methodical approach is followed, the answer is more likely to be trusted and acted upon.
2. Assemble the Data
Steps: finding data; extracting, transforming, and loading the data (ETL); and data wrangling.
Datasets: www.data.gov, Google, Amazon, Microsoft, Kaggle, Data.world, and Data Is Plural.
(Gather/find/identify data) You could create a survey to gather your own data, create a web crawler to gather data dispersed across multiple websites, or start measuring something that hasn't previously been measured.
(ETL: Extract, Transform, Load) Data repositories store data in a variety of formats (e.g., a graph database format). Typically, you want to transform the data into a table format so that you can perform visualization or analysis.
The most commonly known data repository is a relational database, which stores data in multiple tables. Structured Query Language, or SQL, is often used to extract the desired subset of data, which is then returned in a table format. Sometimes data is stored in a relational database but accessed using an Application Programming Interface, or API, which has a structure more like a website address than a SQL query.
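A minimal VBA sketch of the extract step (the provider string, file path, sheet, and table name are all hypothetical):

Sub ExtractSalesData()
    ' Query a relational data source with SQL and dump the result into a table format
    Dim conn As Object, rs As Object
    Set conn = CreateObject("ADODB.Connection")
    conn.Open "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\sales.accdb"
    Set rs = conn.Execute("SELECT Region, SaleDate, NetTotal FROM Sales WHERE Region = 'West'")
    Sheet1.Range("A2").CopyFromRecordset rs ' paste the returned rows starting at A2
    rs.Close
    conn.Close
End Sub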
Once the data is in a table format, you'll most likely need to go through a data wrangling (or data munging) process, which includes cleaning data, combining it with other data, and cleaning it again. For instance, the managerial accountant trying to identify factors behind a decrease in sales for under-performing regions may find incomplete observations in the weather data (e.g., no record of the max temperature) and will need to decide how to deal with those observations. During the data wrangling process, you may also realize that some of the data you found has too many errors and decide you need to find a different dataset.
Conclude:
Assembling the right data is important and requires a lot of work.
3. Calculate Results
Summary statistics: min, max, mean, median, standard deviation
It helps prepare you for using advanced analytics.
Advanced analytics are able to consider lots of factors at once and identify those factors that are most influential.
Advanced analytics algorithms are typically divided into groups based on their purpose. These groups include classification, regression, clustering, natural language processing, variable reduction, image recognition, and speech recognition. The main idea behind each algorithm is to evaluate data to find patterns, then create a model that can be used to explain the relationships in the data.
Once the model is created, the accuracy of the model is evaluated using a variety of diagnostic measures. One idea behind most metrics is a loss function, which describes how much accuracy is lost by using the model. The more inaccurate the model, the greater the loss and the less useful it is.
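For instance, one common loss function (an illustration; the notes don't name a specific one) is mean squared error:

$$L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

The further the model's predictions $\hat{y}_i$ are from the actual values $y_i$, the greater the loss.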
To evaluate a model's accuracy, we commonly use training and testing datasets. The training dataset is used to create the model, and the test dataset is used to evaluate its accuracy. Two cautions: 1. Stay rooted to the main question; you might get distracted by other relationships or find a relationship that doesn't really exist in the data. 2. Beware of fictitious relationships; you need a hypothesis to support your findings. If you don't have a hypothesis, you will likely still find patterns in the data. For instance, the consumption of ice cream may appear to be related to the level of theft, and you could inaccurately conclude that ice cream consumption causes theft. However, it's more likely that warm weather causes higher ice cream consumption and also causes more people to be outside, leading to more theft.
Conclude:
Have a question and a hypothesized relationship in mind so that you don’t get distracted and so that you can get a gut check on the accuracy of your results.
4. Tell Others About the Results
Sharing results in a way that's accessible to your audience refers both to the technology used to communicate and to the language and images used to convey the message. If you are sharing for a one-time meeting, a printed version may be better. If you are going to share with a wider audience, use an online version.
Balance main ideas with nuance. Too many details will confuse and lose your audience, too few details may lead your audience to overgeneralize the results.
Do not use unfamiliar terms.
Translate to monetary value whenever possible.
Suggest courses of action to take whenever possible.
Observe and listen to your audience. They may notice points you might have missed.
What is Data?
Factual information (such as measurements or statistics) used as a basis of reasoning, discussion, or calculation.
Data is an intangible asset. It is useful to have lots of the same type of data (large sample size would mean more accurate analysis) and a large variety of data (gives you more insights into the data).
Characteristics that Make Data Useful for Decision Making
1. Relevance → Data should be related to the question being asked, e.g., looking at historical revenue and cost of goods sold.
2. Accessible → The harder it is to get the data into the analysis software, the less likely it is that it will be used.
It may be in the wrong format; it may be password protected and you don't have the password; it may require a costly subscription or a costly measurement tool; or it may take a long time to get the data.
3. Accuracy → eg. If the measure of net income that you are using is one that has been recorded inaccurately, or one that was corrupted in the data extraction, or transformation process, then the usefulness decreases. Unfortunately, inaccurate data can be worse than no data.
4. Completeness → Sometimes, a dataset is incomplete because it’s hard to collect data from all the observations.
Survivorship bias refers to situations when only data for successful outcomes exists or is included in analysis. This is an important consideration in financial settings when failed companies are not included in the analysis because data doesn’t exist for the period being analyzed. This can skew the results to be more positive than what they really are. Sometimes, data is deliberately left out. Cherry-picking data refers to the practice of choosing only the most beneficial data. This can occur when people only analyze a subset of the data that supports the decisions for which they are in favour. So they leave out other relevant data that gives a more realistic description.
Conclude:
You may find that the most relevant data is prohibitively costly, or that the most accessible data is incomplete. Thus, getting useful data is an iterative process of starting with a question, looking for data, and then revising the scope or focus of the question based on the limitations of the data.
Structured vs. Unstructured Data
Structured data can be stored in structures beyond relational tables, including document stores, key-value stores, graph databases, and many more. These structures optimize data for purposes other than reducing storage space.
XBRL data uses standardized tags so that financial information can be broken down and compared across different companies by both humans and computers.
Unstructured Data:
e.g., text and images.
To structure text, we first need to convert it to numeric values.
Properties of a Tidy Dataframe
Data Types
Strings, numeric, and date types.
Strings are combined into larger strings using the CONCATENATE function in Excel (e.g., =CONCATENATE(A1, " ", B1), or equivalently =A1 & " " & B1).
Numeric data is often displayed in the form of currency and often includes commas separating every three digits.
Conclude:
It’s important to understand the different data types so that you can be sure that the data is encoded in a way that will allow you to process it correctly.
Data Dictionaries
Characteristics:
- Brief overview of the data
Where did the data come from? How do you access it? What does each row represent? Why was it gathered?
Special information that may be necessary to refresh the data, such as where to get a password or how to obtain an authentication token. If the dataset is available for others to use, there may also be instructions on how to cite it.
2. Table of column names and descriptions
Like a legend, describing what each column name means.
There should also be a column for notes, indicating anything special about the data.
XBRL has a taxonomy that is essentially a data dictionary.
Conclude:
Data dictionaries are really important for keeping track of what information is in a dataset.
Wide Data vs. Long Data
Wide data benefits:
- Efficient — row names and column data are not repeated
- Easy to update — repeated data (as in long format) makes changes and updates difficult
- Easier for humans to quickly process
- More common for use in algorithms
Wide data is frequently the shape of data required for regression analysis and machine learning algorithms that depend on having all features of an observation on the same row.
Long data benefits:
- Easier for visualizations
- Better for interactive dashboards — allows you to drill down by slicing the data using fewer filters. For example, Pivot tables in Excel work best with long DataFrames.
You can change a long DataFrame into a wide DataFrame by just selecting the PivotTable button.
If you select the Design tab and then turn off subtotals and grand totals, you'll have a tidy, wide DataFrame that you can quickly copy, paste, and use for other analyses.
To make it interactive, we can add slicers, a timeline, and charts.
Merging Data
Join — adding more columns by matching two datasets on the values in one or more columns.
Stack/union — stacking rows when the columns have the same names and types.
Left join → a left join means that we only add in rows from the right DataFrame if the value in the join column exists in the left DataFrame. In Excel you can use VLOOKUP, but it has limitations: first, the lookup value has to be in the first column of the table; second, the table may need to be sorted by the lookup value (for approximate matches). If you learn to use INDEX-MATCH, you don't have to create new tables that are ordered correctly (see the sketch below).
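As a sketch (the sheet and column layout are hypothetical), pulling the value in the third column of a Weather table for the date in A2:

=VLOOKUP($A2, Weather!$A:$C, 3, FALSE)
=INDEX(Weather!$C:$C, MATCH($A2, Weather!$A:$A, 0))

The INDEX-MATCH version works even if the lookup column is not the first column of the table and the table is not sorted.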
The values that you're joining on have to be exactly the same. If the net loss line item changes to net income, a left join will not work because those labels are not the same. This is why XBRL tags are so important: those labels don't change, so if you use them, you should be able to match up the right numbers no matter what year of data you're using.
A different way of combining data is stacking rows on top of each other. It's essentially copying and pasting rows from one DataFrame to the bottom of another.
Data Automation
Whenever you have repetitive tasks, you should consider training your computer to do it for you.
e.g., if you have data in lots of files that need to be stacked into a single DataFrame in one file.
Use Visual Basic for Applications (VBA) in Excel; a sketch follows.
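A minimal VBA sketch of such a stacking macro (it assumes, hypothetically, a summary sheet named Master and that every other sheet has data in the same five columns below a header row):

Sub StackSheets()
    Dim ws As Worksheet, master As Worksheet
    Dim lastRow As Long, destRow As Long
    Set master = Worksheets("Master")
    destRow = 2 ' leave row 1 of Master for the headers
    For Each ws In Worksheets
        If ws.Name <> master.Name Then
            ' Find the last used row in column A of this sheet
            lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
            ' Copy everything below the header row to the bottom of Master
            ws.Range(ws.Cells(2, 1), ws.Cells(lastRow, 5)).Copy master.Cells(destRow, 1)
            destRow = destRow + lastRow - 1
        End If
    Next ws
End Sub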
Visualization Distributions
Histograms and density plots.
If the distribution is tall and skinny, the data may not be useful for providing insights because it stays the same regardless of what is going on; there is no variation. If the distribution is flat and wide, there may be some outliers that need to be addressed to make the data useful.
Summary statistics, a.k.a. descriptive statistics, use quantitative values to describe the distribution of data. Statistics include the mean, standard deviation, and percentiles (min, median, max).
Missing data — if there are too many missing values, the column of data will not be useful. We can estimate missing values using simple methods like averages and medians, or using more complex methods.
e.g., we might keep only columns with no more than 20% missing values.
Conclude:
This covered the importance of looking at the distribution of data, how transformations can help, and the importance of considering missing values in data.
Visualizing Data Relationships
It is important to understand the shape of each column of data in relation to the others.
A common transformation is to scale the columns by dividing by total assets.
Another way to consider the relationship of data is by looking at the correlation.
A correlation of 0.85 suggests a strong relationship.
A correlation of −0.24 suggests a weak one; its small magnitude indicates there is not a strong linear relationship, as can be seen from the scatterplot.
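For reference, the correlation being described here is the Pearson correlation coefficient:

$$r_{xy} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}$$

It always falls between −1 and 1, and it only measures linear association, which is why the scatterplot is still worth looking at.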
Conclude:
Not only is it important to understand the shape of each column of data independently, it is also important to understand how the columns of data are related to each other.
Summary
Bigger is not always better. Relevant, accessible, complete, and accurate information that is well organized is better.
Data visualizations are an important part of analyzing data
Why Visualize Data?
Data visualization relates to the assembly of data and the calculation part of the FACT model. It essentially encodes the data in a way that allows our wonderful brains to serve as the calculation tool that uncovers the patterns.
Data visualizations can convey relationships without requiring an understanding of statistical terms, and they convey lots of nuance.
The directed purpose of data visualization is related to the "Tell" part of the FACT model. Data visualizations help to effectively communicate patterns in a way that doesn't rely much on language or statistical knowledge.
Visual Perception Principles
Point 1: Data visualizations can have too much information. Oftentimes, if we rely only on the default settings of charting tools or embellish charts too much, the main point may be hidden. Carefully consider the main elements of the visualization that will help communicate the relationships; focus on making those elements stand out, and then remove distractions.
Point 2: We naturally focus on what stands out and doesn’t conform.
Point 3: This principle is especially important when choosing what type of chart to use, which means you need to be familiar with what your audience is expecting. Some charts may seem easy for one audience to understand but not for others. For example, a candlestick chart may have too much information for users who are unfamiliar with it.
Conclude:
It’s important to keep the purpose of your chart in mind because it’s easy to use features just because they’re artistically appealing. For instance, using many different colors may be more visually appealing, but it also may hide the main point. Similarly, three-dimensional shapes, interesting textures, and creative backgrounds may enhance the artistic appeal, but they often can hide a main point by adding too many distractions.
Keep in mind the principles of removing distractions, our tendency to focus on what stands out, and our ability to recognize familiar patterns. Factors that make this difficult include multidimensional data, the need for context, unfamiliarity with the data, and the desire to fit too many points onto one chart.
Data Visualization Building Blocks
Four fundamental building blocks (position, size, color, and shape) can be used to build effective data visualizations.
1. Shape
Circles are often used to represent a total amount, e.g., pie charts, sunburst charts, network plots, bubble plots, dot plots, Venn diagrams, chord diagrams, and radial charts.
Lines are often used to communicate connection, e.g., time series plots, parallel coordinate plots, network plots, and dendrograms.
Rectangles are often used to communicate relative value, e.g., bar charts show relative as well as absolute values, and box plots show the relative spread of observations.
When considering how to use shapes, it’s often the case that fewer is better. That way, when providing context, you can use a different shape to draw attention to a specific part of the plot.
2. Position
Always order the elements of your data visualization in a meaningful way.
Order the observations in descending order.
Ordering chart characteristics alphabetically is sometimes ideal, but that's much less common. One instance in which it makes sense is if you're visualizing the individual operating expenses for many companies using a grouped bar chart.
If you are showing a time series chart, then it makes sense to order the observations by date.
If you're using a pie chart, ordering alphabetically is almost never helpful. Instead, the largest values usually start at the top and move clockwise from largest to smallest.
3. Color
Using too much color can easily distract the attention from the main point.
Oftentimes, a brighter color should be used to highlight the main point, while the rest of the chart should have a more muted color. A lot of information can be conveyed by using different shades of the same color. Typically, darker shades are used to convey more extreme values.
Color palettes have familiar meanings, which should be considered because they can help make a chart memorable and can quickly convey a main idea. For instance, in the United States, red, yellow, and green correspond to stop, slow down, and go, or bad, medium, and good. If shades of red, yellow, and green are used to keep track of companies on a multi-line chart, readers may think that the red company has some undesirable characteristic. If that's not the case, you'll probably want to use a color palette that doesn't include red and green. Another consideration is to avoid colors that are hard to differentiate for those who are colorblind, like red and green. A final color palette consideration is that some colors naturally go well together and are more pleasing.
4. Size
Larger sizes typically correspond to larger values. For instance, a scatter plot of net income values can be sized by the return on investment percentage, to quickly communicate the return for every dollar invested.
When making relative comparisons, if the range of values is too large, then it can hide some variation in smaller values.
One rule of thumb: if the data you're trying to direct attention to is more than 10 times smaller than the largest value, then either transform the data, eliminate the largest values, or use an additional chart for the two different groups.
Conclude:
Most of the time, less is more effective, because our working memory can only handle a limited amount of information. Most important is directing attention to the main point.
Basic Chart Data
www.quandl.com — find financial data
Scatter Plots
It's apparent that Tesla has much more variation in the open and close prices; given the ranges on the x-axis and y-axis, it is quite different from Toyota and General Motors. There is also a lot more variation in Tesla's intraday open and close prices.
They allow you to see the form of the relationship, whether there are outliers, and the amount of variation in the data. Scatter plots are also the basis for many other plots, such as bubble plots and some maps. The main weakness of scatter plots is that they do not do well if there are lots of overlapping points, because those points are hidden.
Bar Charts
A bar chart is good at communicating relative comparisons and distributions among data. It is good at showing the frequency with which values are distributed across a range.
This bar chart communicates not only the two ends of the spectrum but also the relative values of all the observations in between.
A special bar chart called the histogram is great for communicating how data is distributed. Histograms are pretty simple and easy to interpret, so they're used a lot. The main limitation of bar charts is that they take up a lot of space.
Box and Whisker Plots
Box and whisker plots, also known as box plots, are effective at visually communicating statistical information about distributions.
Pros:
Outliers are more visible than in a histogram. Box plots also show some of the key percentile values.
Cons:
First, it takes more training to understand how to interpret them. Second, there is not as much nuance in a box and whisker plot.
Conclude:
Box plots are effective for quickly communicating statistical information about a distribution. It takes more cognitive effort to visualize aspects of their shape, but they are nice and compact, allowing multiple distributions to be compared at once.
Line Charts
Effective way to quickly direct attention to how values change over time.
(top image) Tesla's closing stock price has more variation and more growth.
(top image) It shows that, on average, the other companies are much less variable and have a much lower closing price over this time period. However, whenever you aggregate things together, some detail is removed.
Conclude:
Line charts are commonly used to show how things change over time. They’re also effective because you don’t have to think very hard to recognize the change in patterns.
Maps
Maps are excellent ways to quickly and memorably communicate location.
Choropleth map in Excel: A choropleth map fills in geographic boundaries with multiple colors, or multiple color shades to communicate different values.
It's pretty clear that the US has the highest average volume of shares traded on a daily basis. What do you notice about the shades of red in this map? They all look the same even though the underlying values differ.
This is another instance of how it’s hard to see variation when the largest value is more than ten times bigger than the smallest value.
Conclude:
Maps are really good for quickly communicating location.
Financial Chart Data
e.g., waterfall charts, candlestick charts, treemaps, and sunburst charts.
index-match is better than vlookup.
Waterfall Charts
They are great for quickly visualizing the path to net income and which items have the largest impact on it.
One limitation of Excel's built-in waterfall chart is that it doesn't correctly show negative summary values.
Candlestick Charts
Candlestick charts are used to quickly communicate intraday and interday patterns in stock market prices.
Treemaps and Sunburst Charts
Treemaps use rectangles to direct attention to relative sizes.
Treemaps are similar to bar charts in that they are useful for directing attention to relative values.
Treemaps are particularly useful for hierarchical or nested data.
Assets make up half of the total area.
The innermost ring is the main category of the balance sheet.
The second ring shows current and non-current assets and liabilities.
The outermost ring shows the individual line items.
The order is based on the largest values to the smallest value in a clockwise direction.
Sunburst charts show more levels and make efficient use of space without distracting the reader or presenting too much information.
Both charts show hierarchical data.
However, neither chart is effective at showing negative numbers. When dealing with negative numbers in treemaps, we can subtract them from a summary line item. An alternative approach is to adjust the range of rectangle sizes so that they correspond to the largest and smallest numbers; however, that isn't an option in Excel.
Sparklines and Facets
A faceted chart is a useful alternative to a multiline chart that has lots of overlapping lines.
Facets are an alternative way to show multiple charts in a small amount of space.
It is easier to identify and compare trends because the lines are not overlapping.
Both sparklines and facets are useful ways to show multiple plots in a small amount of space. Sparklines work well when embedded in tables; facets work best as stand-alone charts.
Charts to Use Sparingly
Pie chart — it is hard to interpret the areas, and estimates might not be accurate; you have to hover over the pie to evaluate what each slice represents.
Bar chart — easier to read than a pie chart.
Pareto chart — a dual-axis chart.
Stacked vs. grouped bar charts.
Sparklines
Tableau
Getting Started with Tableau
Tableau can automatically convert original column names to a more presentable format in the field name.
Tableau is quick and easy to use.
Scatter Plots in Tableau — 1
Columns — x-axis
Rows — y-axis
Scatter Plots in Tableau — 2
If we click on the Analytics tab in the far left sidebar, we can see three sections of analytics: Summarize, Model, and Custom.
Bar Charts and Histograms in Tableau
Histogram:
Observation:
1) There appear to be more positive returns. 2) Tesla's returns are much more varied than those of Toyota and General Motors.
Histograms enable us to group the observations for a measure into bins, and then show the frequency with which observations occur.
This is a blessing and a curse: a blessing because it takes so little time and looks so nice; a curse because it's easy to create charts that are more complex than necessary.
Box Plots and Line Charts in Tableau
Great for exploring the data.
Adding Dimensions in Tableau
Scatter plots are amazing because they can show up to six dimensions (x-axis location, y-axis location, color, size, shape, and motion), and possibly even more with other features, as with 3D scatter plots. However, showing six dimensions can be overwhelming, so be judicious when deciding the number of dimensions to show.
Facets and Groups in Tableau
Facets graph:
Using both facets and groups can be very helpful for providing context without providing so much detail that the main point gets overwhelmed.
Plotting several small charts in the same space can be a really useful way to show context. This has a variety of names such as facets, trellises, panel charts, and small multiples. One way to reduce the amount of detail is to group variables together.
Context is an important part of data visualizations because it helps provide some kind of benchmark for evaluating the data.
Data Joins in Tableau
A left join means that we’ll keep all of the original data, but we’ll only bring in data from the new table that matches on the fields that are in the original data.
In conclusion, you can see how easy it is to combine multiple data sources in Tableau. The hard part is getting the data ready to join before connecting to it in Tableau.
Tableau Analytics — Forecasts
Forecast:
To create a trend line in Tableau, we should first create a scatter plot.
In conclusion, Tableau has some pretty useful tools not only to describe the past and diagnose how a company got to where it is, but also to predict the future to some extent and to make prescriptions about how to successfully navigate it.
Tableau Analytics — Clusters and Confidence Intervals
The band indicates the range in which you would expect to find approximately 95% (default percentage that we can change) of the observations.
Communicating Tableau Analyses
You can easily remove a graph you do not want on the dashboard by clicking on the ‘cross’.
Edit:
Filter:
Floating window
Allows you to share your work as a PDF.
View in slideshow
In a Tableau dashboard, the funnel icon on a chart's floating menu allows us to use almost any chart as a filter.
Conclude:
Data visualization tools are useful not only for exploring data but also for effectively communicating a story.
Week 6
Framing a Question: Larry’s Commissary
Forecasting sales is important:
To make sure we can get the correct amount of inventory.
To let us know how many employees should be working on a given day
To help ensure that employees are not working on non-value-added tasks.
Assembling Data
Sales data, weather data, and game data
Create a data dictionary for each dataset. These data dictionaries are included as separate tabs in the Excel workbook.
Combine the datasets using VLOOKUP.
Data Analysis ToolPak and Descriptive Statistics
Creating histograms:
If the mean is greater than the median, there may be some extremely large values.
The kurtosis indicates how close the data is to a normal distribution.
Correlation
Correlation is the extent to which two variables are linearly related to each other.
How do you use a correlation matrix?
First, find your dependent variable.
Multicollinearity: when two independent variables are highly correlated, which is not really desirable because they're basically explaining the same thing.
Correlation tells us how variables are related to each other. Correlation is the extent to which two variables are linearly related.
Linear Models
Multiple regression is powerful: it creates a linear function with parameters estimated from relationships in historical data.
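In symbols, a model using the variables from this example might look like the following (a sketch; the coefficients $b_0, b_1, b_2$ are estimated from the historical data):

$$\widehat{\text{netTotal}} = b_0 + b_1 \cdot \text{tmax} + b_2 \cdot \text{netTotal\_364}$$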
Relationship between NetTotal and Quantity:
Relationship between NetTotal and tmax:
Relationship between NetTotal and netTotal_364:
Linear models can be useful to make predictions about the future as long as they meet the following criteria:
1. There is a strong correlation between the variable we are trying to forecast and the variable we are using to make the forecast.
2. The variable that is being used to make the forecast can be reliably estimated ahead of time.
Note:
One way to determine if the model is effective is if it provides a more accurate prediction than just using the overall average.
When creating a linear model, we should try to capture as much variation as we can.
Simple Regression
Regression analysis can help solve these problems.
Add trendline:
Add R² to the chart:
A deterministic relationship means that if you know what x is, you will know with certainty what y is.
Q) If you had to select one of these variables to predict netTotal, which would you choose?
Although quantity explains 94 percent of the variation in netTotal, we can't use it: since we're trying to forecast netTotal seven days in advance, we won't know quantity until the same time that we know netTotal.
netTotal_364's R² is much higher than tmax's, so choose netTotal_364.
dependent variable: netTotal
independent variables: quantity, netTotal_364, and tmax
Note:
The main objective of regression analysis is to find the parameters for a linear function using historical data.
R-squared is the percentage of variation explained by the regression line.
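In formula form (the standard definition, not spelled out in these notes), R-squared compares the variation left unexplained by the model to the total variation around the mean:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$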
Regression Diagnostics 1: Regression Summary, ANOVA, and Coefficient Estimates
The standard error represents, roughly, the typical distance between the regression estimate and the actual value.
ANOVA stands for Analysis of Variance:
The ANOVA table tells us how R² is calculated; its purpose is to provide more information about R-squared.
Significance F indicates whether the model does a good job of predicting values.
If the Significance F in the ANOVA table is less than .05, we can conclude that the model does a better job of explaining variation than simply using the mean.
coefficient:
The t-stat is the coefficient estimate divided by its standard error.
If the absolute value of the t-stat is ≥ 2, the p-value will be around 0.05 or less (statistically significant).
If the p-value of a coefficient is less than 0.05, there is less than a 5% chance that we would see such an estimate if the true coefficient were zero.
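In formula form (standard definition):

$$t = \frac{\hat{b}}{SE(\hat{b})}$$

where $\hat{b}$ is the coefficient estimate and $SE(\hat{b})$ is its standard error.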
In conclusion, using the regression tool and Excel’s data analysis ToolPak provides a lot more insight than the trend line feature of Excel scatter plots.
Multiple Regression
Simple regression is when the dependent variable is regressed on a single independent variable.
Multiple regression is when the dependent variable is regressed on multiple independent variables, resulting in a single linear model with a slope coefficient for each independent variable.
Note:
We don’t have to choose only one variable to use as an independent variable. We can combine their effect into a single linear model.
In regression analysis, if an independent variable is insignificant, we should exclude it from the final model to preserve simplicity.
If two independent variables explain the same variation in the dependent variable, the coefficient estimates will report only the unique effect.
Regression Diagnostics 2: Predicted Values, Residuals, and Standardized Residuals
Regression model output:
Residuals can identify whether assumptions about regression have been met.
Residuals can help to identify ways to fine tune the model.
Residuals can identify observations that are not predicted well.
Regression Diagnostics 3: Line Fit Plots, Adjusted R Square, and Heat Maps for P-Values
These line fit plots don’t form a straight line because our model has more than one independent variable. If we were to plot the predicted values in three dimensions, then we’d see something that looks like a straight sheet of paper in three-dimensional space.
Whenever you include an additional independent variable, R-squared will increase, because by chance there is always a little bit of correlation.
If an independent variable does not explain enough of the variation in the dependent variable, the adjusted R-squared starts going down.
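The usual formula (standard, not shown in these notes) makes the penalty explicit, where $n$ is the number of observations and $k$ the number of independent variables:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$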
In conclusion, multiple regression is powerful but can quickly get complex. The diagnostic plots and metrics help to identify a simple but accurate linear model. Just as scatter plots reveal more nuanced information about the correlation between variables, line fit plots help to identify areas in which the model does not do a good job of making predictions. The adjusted R-squared helps to keep the model simple by imposing a penalty for independent variables that don't explain variation in the dependent variable.
Making a Forecast with a Linear Model
it’s very unlikely that the forecast(point estimate) will exactly equal the actual sales. So this is where a confidence interval is very helpful.
Lower/upper bound: we calculate these by multiplying the forecast value by the lower and upper factors.
The coefficient on daysSinceLastPurchase is a small positive value (0.01), meaning sales increase as a greater number of new customers come to eat at Larry's Commissary. daysSinceLastPurchase isn't necessarily driving revenue up; rather, it reflects the proportion of new customers eating at Larry's Commissary on a given day.
In conclusion, a linear model is easy to communicate and easy to use to make forecasts. When combined with human interpretation, it’s also nice because it separates out the impact of each independent variable. Thus, it can help managers by suggesting strategies on what actions to take to increase sales or what actions to avoid to decrease sales.
Here are some questions to consider that may help you in communicating the results:
Polynomial Regression Models
In conclusion, whenever you see a curvilinear relationship between an independent variable and the dependent variable, you should consider adding in an additional term to the regression function to allow the regression algorithm the freedom to fit parameters to the curve.
If it’s clear that there are two bends in the relationship, then you’d want to also add in a cubed parameter. Even though we’re creating a polynomial regression model, we’re doing it in such a way that it’s a linear combination of terms which keeps things relatively simple.
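In symbols (a sketch of the idea):

$$\hat{y} = b_0 + b_1 x + b_2 x^2$$

with an additional $b_3 x^3$ term when there are two bends. The model stays a linear combination of terms, so ordinary regression can still fit it.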
Categorical Variables
Nominal variables are categorical variables that do not have a natural order.
Ordinal variables are categorical variables that have a natural order, e.g., high, medium, and low; or gold, silver, and bronze.
Note:
The categorical variable is also known as a qualitative variable.
When dealing with categorical variables, they must first be converted to numeric values before they can be used in regression.
Nominal variables are categorical variables that are not naturally rank ordered.
Multiple Indicator Variables
When performing a regression analysis with a nominal categorical variable (one without a natural rank order), convert the labels into multiple indicator columns whose values can only be 0 or 1 (see the sketch below).
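For example (a hypothetical day-type variable with levels Weekday, Saturday, and Sunday), two 0/1 indicator columns are created and Weekday becomes the baseline:

$$\hat{y} = b_0 + b_1 \cdot \text{Saturday} + b_2 \cdot \text{Sunday}$$

Here $b_1$ and $b_2$ measure how Saturdays and Sundays differ from weekdays.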
Interaction Terms
Investigate the impact of interaction terms using two continuous variables
In conclusion, the main idea of interaction effects is to capture the possibility that the impact of an independent variable is not the same for every level of another independent variable. Ultimately, this provides a lot of flexibility for multiple regression models to incorporate complex relationships.
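In symbols (a sketch): an interaction term is the product of two independent variables,

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 (x_1 \cdot x_2)$$

where $b_3$ captures how the effect of $x_1$ changes with the level of $x_2$.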
Regression Summary
Optimization with Excel Solver
There are many types of problems for which the Solver can be used. For instance, cost accounting often has a segment on finding the optimal production schedule given a specified product mix, resource usage, and beginning and ending inventory levels.
Solver Constraints and Reports
Logit Transformation
sales above 4000
Odds are identified as ratios:
The odds are 3.45-to-1 that sales will be below 4000 rather than above 4000.
Equivalently, the odds of sales being above 4000 rather than below are only about 0.29 (1/3.45).
Note: the range of odds is asymmetric. If the probability is between 0 and 0.5, the odds are between 0 and 1; if the probability is above 0.5, the odds range from 1 to infinity.
The logit is the log of the odds ratio:
Transform from logit value to probability:
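In formula form (standard definitions, using the numbers from the example above):

$$\text{odds} = \frac{p}{1-p}, \qquad \text{logit} = \ln\!\left(\frac{p}{1-p}\right), \qquad p = \frac{e^{\text{logit}}}{1+e^{\text{logit}}}$$

For instance, if the odds of sales being above 4000 are $1/3.45 \approx 0.29$, the probability is $0.29 / 1.29 \approx 0.22$.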
Simple Logistic Regression
In conclusion, logistic regression is a commonly used tool for creating a linear model to predict which category an observation falls into.
Logistic regression is useful for modeling binary outcomes of true/false, yes/no, win/lose, above/below, etc. The defining feature is that the dependent variable is modeled as a logit value, or the log of the odds ratio.
Logistic Regression Accuracy
In conclusion, logistic regression is a really useful tool for classifying data and can include many independent variables.
A simple yet powerful way to evaluate the accuracy of a logistic regression model is the number of observations accurately predicted.
Excel Macro VBA
Recording Macros
If we want to stack data, we can manually copy and paste. Alternatively, we can record a macro.
Step 1: Go to 'Developer' > 'Record Macro'.
Step 2: Once you click ‘ok’ the macro will start recording and you can do whatever you want.
e.g., let's say you want the macro to record you changing a conditional formatting color.
Step 3: Once you are done, click ‘stop recording’
Step 4: Test your Macro
If users don't know the shortcut keys, they can access the macro in other ways:
1. Go to the Developer tab and click Macros
2. Create a macro button
Then assign a macro to the button.
The button can be any shape.
Lastly, save the file as an Excel Macro-Enabled Workbook (.xlsm).
Basics of VB Editor
Going to the code
The Visual Basic Editor is an integrated development environment, or IDE.
The Visual Basic Editor, or VB Editor, helps you navigate, organize, and create VBA code.
Basics of VBA
For Loops, Variables, Index Numbers, and Last Rows
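A minimal VBA sketch tying these ideas together (the sheet layout is hypothetical; it assumes data starts in row 2 of column A):

Sub LoopThroughRows()
    Dim lastRow As Long, i As Long
    ' Find the last used row in column A
    lastRow = Cells(Rows.Count, 1).End(xlUp).Row
    ' Loop from the first data row to the last row
    For i = 2 To lastRow
        ' Write the row's index number into column B as a demonstration
        Cells(i, 2).Value = i
    Next i
End Sub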
Programming Hints
Start small — test your code on a small amount of data first, then gradually use more data.
Copy and paste — find code online, copy it, and learn from it.
VB Hint 1:
Learn to comment and uncomment large blocks of code rather than deleting the code.
VB Hint 2:
Learn to read error messages.
Once you click ‘Debug’ it takes you to the line of code that caused that error message to pop up.
VB Hint 3:
Step through the code line by line.
VB Hint 4:
Send messages to yourself in the Immediate window (e.g., with Debug.Print).
Code hints in Macro:
Hint 1:
Use message boxes to communicate a message to the user.
e.g., let the user know the macro is done (see the sketch below).
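For example:

Sub NotifyWhenDone()
    ' ... the macro's work happens here ...
    MsgBox "Done! The macro has finished running."
End Sub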
Hint 2:
Use an input box. It is useful when you want to get information from the user while the macro is running.
Let's say you only want to copy 3 rows of data (see the sketch below).
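A minimal sketch (the Output sheet and three-column layout are hypothetical; input validation is omitted):

Sub CopySomeRows()
    Dim n As Long
    ' Ask the user how many rows of data to copy (e.g., 3)
    n = CLng(InputBox("How many rows of data should be copied?"))
    ' Copy that many rows (columns A:C, below the header) to the Output sheet
    Range(Cells(2, 1), Cells(1 + n, 3)).Copy Worksheets("Output").Range("A2")
End Sub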
Hint 3:
Screen updating: redrawing the screen after every operation takes time, so one way to speed things up is to turn screen updating off at the beginning and turn it back on at the end so you can see the results.
The macro then takes much less time to run. This matters as you start dealing with larger amounts of data.
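For example:

Sub FastMacro()
    Application.ScreenUpdating = False ' stop redrawing the screen while working
    ' ... the slow, repetitive work goes here ...
    Application.ScreenUpdating = True  ' turn redrawing back on to see the results
End Sub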
Hint 4:
Provide constant updates while the program is running.
Sometimes a macro is running but you don't know that it's running; sometimes it's frozen and you're not sure whether it's frozen or just taking a long time to complete.
Let's say you want to see the status as the number of iterations completed.
Or you could show it as a percentage (a sketch covering both follows):
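A sketch using the status bar (the loop body is a placeholder):

Sub ShowProgress()
    Dim i As Long, n As Long
    n = 1000
    For i = 1 To n
        ' ... work on iteration i ...
        Application.StatusBar = "Iteration " & i & " of " & n & _
            " (" & Format(i / n, "0%") & " complete)"
    Next i
    Application.StatusBar = False ' give control of the status bar back to Excel
End Sub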
Conditional Statements
if-else statement
If the user doesn't indicate the number of rows:
If the user indicates copying only 2 rows:
To specify more than two conditions, use ElseIf (see the sketch below):
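A minimal sketch covering all three cases (the messages are placeholders):

Sub CheckRowInput()
    Dim reply As String
    reply = InputBox("How many rows should be copied?")
    If reply = "" Then
        MsgBox "No number of rows was indicated."
    ElseIf CLng(reply) <= 2 Then
        MsgBox "Copying only " & reply & " rows."
    Else
        MsgBox "Copying " & reply & " rows."
    End If
End Sub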
Macro for Creating Multiple Histograms
Choose data to create a histogram
After you are done, stop the macro recording.
Edit the Histogram VBA code:
Go to Macros and select the macro, or open the Visual Basic Editor.
Note: The command to exit a for loop is ‘Exit For’.
Clustering Overview
https://www.consumerfinance.gov/
Clustering → unsupervised learning
Regression → supervised learning
K means clustering — This clustering algorithm requires that the user identify the number of clusters to produce from the data. The algorithm then creates K clusters that are grouped on how close they are to the center of each cluster.
Density-based — This clustering algorithm creates clusters based on the proximity of data points to other data points. It requires the user to specify the acceptable distance between each data point and the others, as well as the number of points required to create a cluster.
If you need to divide the data into two or three groups, then K-means is probably better. On the other hand, if you want to let the data dictate the number of groups, then density-based clustering is probably better.
K-Means Clustering in Excel
K-means clustering is a really useful algorithm when you're not sure how to classify your data.
Centroid → a fancy name for the center of a cluster.
TempCluster assigns cluster 1 when Dist1 < Dist2 and cluster 2 otherwise (in Excel: =IF(Dist1 < Dist2, 1, 2)).
The four centroid numbers are calculated by averaging the observations assigned to each cluster.
Note:
K-means clustering can be used when we are not sure how to classify each observation.
When performing k-means clustering, the first thing we should do is choose observations to represent the initial center of each cluster.
When performing k-means clustering, the user must explicitly specify the number of clusters that should be created.
K-Means Clustering Macro
It is a common type of unsupervised machine learning algorithm. It’s unsupervised because it doesn’t require that the observations include a dependent variable.
Using Macro button to redo the clustering:
Next, create the scatterplot manually:
Use the color button to update the cluster color:
The code used basic ideas: for loops, variable assignment, conditional statements, and commonly used mathematical operations like addition, subtraction, and squaring values.
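A minimal VBA sketch of one k-means iteration with k = 2 (the layout is hypothetical: x and y in columns A:B starting at row 2, cluster assignments written to column C, centroids in F1:G1 and F2:G2; rerun until the assignments stop changing):

Sub KMeansIteration()
    Dim lastRow As Long, i As Long
    Dim d1 As Double, d2 As Double
    Dim sumX1 As Double, sumY1 As Double, n1 As Long
    Dim sumX2 As Double, sumY2 As Double, n2 As Long
    lastRow = Cells(Rows.Count, 1).End(xlUp).Row
    For i = 2 To lastRow
        ' Squared distance from this observation to each centroid
        d1 = (Cells(i, 1) - Range("F1")) ^ 2 + (Cells(i, 2) - Range("G1")) ^ 2
        d2 = (Cells(i, 1) - Range("F2")) ^ 2 + (Cells(i, 2) - Range("G2")) ^ 2
        ' Assign the observation to the nearest centroid
        If d1 < d2 Then
            Cells(i, 3) = 1
            sumX1 = sumX1 + Cells(i, 1): sumY1 = sumY1 + Cells(i, 2): n1 = n1 + 1
        Else
            Cells(i, 3) = 2
            sumX2 = sumX2 + Cells(i, 1): sumY2 = sumY2 + Cells(i, 2): n2 = n2 + 1
        End If
    Next i
    ' Recompute each centroid as the average of its assigned observations
    If n1 > 0 Then Range("F1") = sumX1 / n1: Range("G1") = sumY1 / n1
    If n2 > 0 Then Range("F2") = sumX2 / n2: Range("G2") = sumY2 / n2
End Sub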
Clustering On a Larger Scale
A challenge with adding more columns is that columns with larger values will dominate how the clusters are formed. To solve this, scale the data so each column ranges from 0 to 1.
That way, no single column dominates (the scaling formula follows).
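The standard min-max scaling formula for each column:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$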
Conclude:
These concepts are applicable to other data analytics languages like Python and R, as well as to the IDEs often used with these languages, like Spyder and RStudio.
The scikit-learn library for Python and the cluster package for R can perform k-means clustering and regression much more quickly than Excel.
*******************************THE END***************************