With the amount of data generated today, processing it correctly matters more than ever. The data analysis process can be divided into five basic steps.
Step one - Defining the question:
In any data analysis process, the first step is to define the question, in other words, the objective of the analysis. In data analysis terminology, this is called the problem statement. Setting the objective involves defining a hypothesis and finding ways to test it.
Examples of a good problem statement may be based on professional or personal accomplishments. Your problem statement may be about spending your time off more effectively. But problem statement examples in business are targets to solve specific business needs like increasing sales targets, establishing businesses online or reducing employee turnover - HARAPPA
Defining the hypothesis might start with a simple question: “What business problem needs to be solved?” However, this question is only a starting point; it does not yet get to the core of the issue.
To form and test a sound hypothesis, a data analyst needs to know the business and its goals. If the analyst does not know the business well enough, the result may be poorly formed hypotheses and misleading outcomes.
For example, suppose the business transports goods to customers. The primary question might be “Why is transportation so costly?”. This is where the analyst's business knowledge comes into play. Secondary questions might be “What is the truck occupancy?”, “How often do we ship to particular places?”, “How do we plan the transportation routes - does one truck deliver multiple shipments?”.
In order to define the right questions, the business metrics and key performance indicators (KPIs) need to be taken into account.
Key performance indicators (KPIs) refer to a set of quantifiable measurements used to gauge a company’s overall long-term performance. KPIs specifically help determine a company's strategic, financial, and operational achievements, especially compared to those of other businesses within the same sector - Investopedia
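The secondary questions from the transportation example can be turned into simple, measurable numbers. The following is a minimal sketch using pandas; the shipment log, its column names, and the truck capacities are all hypothetical, and each truck is assumed to make a single trip.

```python
import pandas as pd

# Hypothetical shipment log; every column name and value is made up for illustration.
shipments = pd.DataFrame({
    "truck_id":    ["T1", "T1", "T2", "T2", "T3"],
    "destination": ["Brno", "Brno", "Ostrava", "Brno", "Plzen"],
    "load_kg":     [4000, 3000, 2500, 2500, 9000],
})
# Assumed capacity of each truck in kilograms.
capacity_kg = pd.Series({"T1": 10000, "T2": 10000, "T3": 10000})

# "What is the truck occupancy?" -> share of capacity used per truck.
load_per_truck = shipments.groupby("truck_id")["load_kg"].sum()
occupancy = load_per_truck / capacity_kg

# "How often do we ship to particular places?" -> shipments per destination.
shipments_per_destination = shipments["destination"].value_counts()

# "Does one truck deliver multiple shipments?" -> shipments per truck.
shipments_per_truck = shipments.groupby("truck_id").size()

print(occupancy)
print(shipments_per_destination)
print(shipments_per_truck)
```

Which of these numbers becomes an actual KPI depends on the business; the sketch only shows that each question maps to a quantity that can be computed and tracked.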
Step two - Data collection:
To test the hypothesis, you need a sufficient amount of data that is relevant and well aggregated. The data analyst also needs to determine whether the data should be quantitative (numeric) or qualitative (descriptive).
Examples of quantitative data are sales data, transportation data, and so on. Qualitative data, on the other hand, includes customers' reviews and other descriptions.
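In a tabular dataset, a rough first split between the two kinds of data can be made by column type. This is only a sketch with hypothetical column names; numeric-looking columns such as postal codes or IDs would still need a manual check, since they are not really quantitative.

```python
import pandas as pd

# Hypothetical dataset mixing numeric and free-text columns.
orders = pd.DataFrame({
    "order_value": [120.5, 89.0, 240.0],
    "delivery_days": [2, 5, 3],
    "customer_review": ["Fast delivery", "Package arrived damaged", "Great service"],
})

# Numeric columns are candidates for quantitative analysis.
quantitative = orders.select_dtypes(include="number")
# Remaining columns (here free text) are candidates for qualitative analysis.
qualitative = orders.select_dtypes(exclude="number")

print(list(quantitative.columns))
print(list(qualitative.columns))
```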
Data can also be divided into three groups by source:
First-party data,
Second-party data,
Third-party data.
First-party data:
This data is collected by the company itself, directly from its customers. Sources include transaction data, CRM (customer relationship management) systems, the company's ERP (enterprise resource planning) system, surveys, and feedback. First-party data is mostly well structured and well organized.
Second-party data:
Second-party data is the first-party data of another organization. It is very useful for enriching both your own first-party data and your analysis. This data is also mostly well structured and organized. A good example is social media activity or app activity.
It can provide greater value than third-party data, which is usually available to anyone who wants to buy it - Signal
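Enriching first-party data with second-party data often comes down to joining two tables. The sketch below assumes both organizations share a common `customer_id` key, which is a simplification; the tables and column names are hypothetical.

```python
import pandas as pd

# Hypothetical first-party data: the company's own transaction records.
first_party = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "total_spent": [250.0, 90.0, 410.0],
})
# Hypothetical second-party data: app activity shared by a partner organization.
second_party = pd.DataFrame({
    "customer_id": [1, 3, 4],
    "app_sessions": [12, 30, 7],
})

# A left join keeps every first-party customer and adds partner data where available.
enriched = first_party.merge(second_party, on="customer_id", how="left")

print(enriched)
```

Customers missing from the partner data keep their row, with an empty value in `app_sessions`, so the gap stays visible for the cleaning step.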
Third-party data:
Third-party data is collected by an organization that focuses on aggregating data from numerous sources. In most cases this data is not organized or well structured; it is collected primarily for industry reports or research.
Unlike first-party data, third-party data usually comes not from the direct relationship between a customer and a company - Signal
The last part is to choose a data management platform (DMP), which helps you identify and collect data from numerous sources. Many platforms provide this service.
A data management platform (DMP) is a unifying platform to collect, organize and activate first-, second- and third-party audience data from any source, including online, offline, mobile, and beyond. It is the backbone of data-driven marketing and allows businesses to gain unique insights into their customers - Lotame
Step three - Data cleaning:
The next step after the data collection is to clean the data. This part is very important because it determines if the data analyst works with high-quality data.
Even though the process is not strictly defined, there are some common steps that need to be done to clean the data appropriately:
Deleting duplicate values/rows or unnecessary data,
Correcting missing data or errors,
Removing outliers that might cause misleading outcomes,
Verifying and questioning the collected data.
A data analyst usually spends 70-90% of the time on this task. This might sound like a lot, but correctly identifying inaccurate data and focusing on errors has a large impact on the outcomes of the analysis. In some cases, not spending enough time on this step can send you back to the beginning of the whole cleaning process. So don't rush it.
Cleaning large datasets manually can be very tricky. Luckily, you can use tools to clean them more efficiently, for example the Python library pandas or R libraries.
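The common cleaning steps listed above could look roughly like this in pandas. This is a minimal sketch on made-up data: filling missing values with the median and removing outliers with the 1.5 * IQR rule are assumptions chosen for illustration, not the only valid choices.

```python
import pandas as pd

# Hypothetical raw data; names and values are made up for illustration.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4, 5, 6],
    "amount":   [100.0, 120.0, 120.0, None, 110.0, 95.0, 5000.0],
    "note":     ["", "", "", "", "", "", ""],  # unnecessary column
})

# 1. Delete duplicate rows and unnecessary data.
clean = raw.drop_duplicates().drop(columns=["note"])

# 2. Handle missing data; here a missing amount is filled with the median.
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# 3. Keep only values inside the 1.5 * IQR range (an assumed rule of thumb).
q1, q3 = clean["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range]

# 4. Verify the result with simple sanity checks.
assert clean["order_id"].is_unique
assert clean["amount"].notna().all()

print(clean)
```

Whether an extreme value is an error or a genuine observation is a business question, so in practice outliers are worth inspecting before they are dropped.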
Step four - Analyzing the data:
There are plenty of methods to analyze the data. The most common and well-known are:
Univariate analysis (one-dimensional),
Time-series analysis,
Regression analysis.
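Each of the three methods above can be tried in a few lines. The sketch below uses pandas and NumPy on hypothetical monthly figures that were deliberately constructed to be perfectly linear, so the fitted line is exact; real data would be noisier.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data; values are made up for illustration.
data = pd.DataFrame(
    {
        "ad_spend": [10, 12, 14, 16, 18, 20],
        "sales":    [105, 125, 145, 165, 185, 205],
    },
    index=pd.date_range("2023-01-01", periods=6, freq="MS"),
)

# Univariate analysis: summarize a single variable.
sales_summary = data["sales"].describe()

# Time-series analysis: smooth the series with a 3-month rolling mean.
rolling_sales = data["sales"].rolling(window=3).mean()

# Regression analysis: fit sales = slope * ad_spend + intercept by least squares.
slope, intercept = np.polyfit(
    data["ad_spend"].to_numpy(), data["sales"].to_numpy(), deg=1
)

print(sales_summary)
print(rolling_sales)
print(slope, intercept)
```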
More important, however, is the goal of the analysis:
Descriptive analysis (describes what has already happened),
Diagnostic analysis (describes why something has happened),
Predictive analysis (determines future trends based on historical data),
Prescriptive analysis (helps to make recommendations for the future).
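The four goals can be illustrated on one small dataset. This is a deliberately naive sketch on hypothetical revenue figures: the "forecast" simply repeats the last monthly change, and the "recommendation" is a simple rule, both assumptions made only to show how the goals differ.

```python
import pandas as pd

# Hypothetical revenue by month and region.
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "region":  ["North", "South", "North", "South", "North", "South"],
    "revenue": [100, 80, 102, 60, 104, 40],
})

# Descriptive: what happened? Total revenue per month.
monthly_total = sales.groupby("month", sort=False)["revenue"].sum()

# Diagnostic: why did it happen? Change per region from January to March.
by_region = sales.pivot(index="month", columns="region", values="revenue")
change_by_region = by_region.loc["Mar"] - by_region.loc["Jan"]

# Predictive: what might happen next? Naive forecast repeating the last monthly change.
last_change = monthly_total.iloc[-1] - monthly_total.iloc[-2]
next_month_forecast = monthly_total.iloc[-1] + last_change

# Prescriptive: what should we do? A simple rule pointing at the weakest region.
region_to_investigate = change_by_region.idxmin()

print(monthly_total)
print(change_by_region)
print(next_month_forecast, region_to_investigate)
```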
Step five - Presenting the analysis results:
The last step of the data analysis is to share the results and insights with the stakeholders. This means interpreting the outcomes and presenting them so that the whole audience is able to understand them.
The insights need to be very clear and should not be biased. The best way to present the data is to use reports, dashboards, and interactive visualizations.
Do not forget to mention any gaps in the analysis and to highlight insights that remain open to interpretation.
Matěj Srna