If you poll 100 data scientists and ask what they spend most of their time doing, 99 of them will say cleaning data.
This time-consuming process isn’t limited to unstructured pools of information. Time series data that has context also needs to be cleaned. Some simple steps taken when capturing this information can simplify the process, reducing the time and cost of analysis.
Clean data is information that has had all major anomalies removed. Anomalies can be caused by several different events. For example, during machine startups and stops, sensors may produce values well outside the typical range seen during normal operation. When a sensor is replaced, it may need to be calibrated to ensure proper operation. If it cannot be calibrated, noting in the data which part was changed can help correct the readings at a future time.
In addition to anomalies, a significant amount of time can be spent normalizing data from multiple systems. It is important to define how measurements should be reported across machines and to make any context adjustments at the moment the information is captured.
For example:
Is a temperature reported in Fahrenheit or Celsius?
Is it reported as an integer or a floating point?
How many decimal places are required?
If the value can be negative, what format is used to record the negative number?
What is the maximum and minimum value for the given sensor?
What values indicate failure for a given sensor?
Finally, time is a critical parameter. How it is calibrated, verified, and recorded, and how errors are reported, are all valuable to the analysis process.
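As a sketch of how these answers might be recorded, here is one possible data plan entry for a single measurement type. The field names and limits are hypothetical, chosen only to illustrate the idea:

# One data plan entry for a hypothetical pressure measurement. Every value in this
# specification answers one of the questions above, once, for the whole company.
PRESSURE_SPEC = {
    "unit": "PSI",                 # the single unit allowed for pressure
    "dtype": "float32",            # numerical format for every stored sample
    "decimal_places": 4,           # resolution required by the analysis
    "signed": False,               # pressure readings are never negative here
    "min_valid": 0.0,              # anything below this is an anomaly
    "max_valid": 1024.0,           # anything above this is an anomaly
    "failure_values": [-1.0],      # sentinel values the sensor emits on failure
    "timestamp": "UTC, ISO 8601, millisecond resolution",
}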
How much time do you spend cleaning data?
Clean Data by Design
Designing a data capture solution so that the information is as clean and organized as possible from the beginning is the best way to ensure its quality.
The first step is to create a data plan document. The goal of this document is to define formats for all information types that could be captured. This ensures a level of consistency across the company, simplifies comparisons across systems, and eliminates the need for significant data normalization efforts.
This plan does not explicitly define all data that will be collected or every sensor type that will be used. Attempting to create a very specific plan may force engineers to work outside the specification. A general blueprint that can be applied consistently across all systems is preferred.
Additionally, the data plan should define the units used for all measurements, including metric vs. imperial and the specific units within those systems. For example, one unit for measuring pressure should be defined: Pascal or PSI, but not both. A conversion sketch follows the list below.
Here is a simple list of measurements:
Measuring distance (inches, yards, centimeters, meters)
Measuring pressure (Pascal, PSI)
Measuring temperature (Celsius, Fahrenheit, Kelvin)
Measuring mass (Gram, Kilogram, Pound, Ton)
Measuring volume (Milliliter, Liter, Gallon)
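To keep every system reporting in the plan’s single unit, values arriving in other units can be converted at capture time rather than during analysis. A minimal sketch in Python, assuming pressure is standardized on PSI and temperature on Celsius; the function names are illustrative, while the conversion factors are standard:

PSI_PER_PASCAL = 1.0 / 6894.757  # standard conversion: 1 PSI is about 6894.757 Pa

def pressure_to_psi(value, unit):
    """Normalize a pressure reading to the plan's single pressure unit (PSI)."""
    if unit == "PSI":
        return value
    if unit == "Pa":
        return value * PSI_PER_PASCAL
    raise ValueError(f"Unit {unit!r} is not allowed by the data plan")

def temperature_to_celsius(value, unit):
    """Normalize a temperature reading to the plan's single temperature unit."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    raise ValueError(f"Unit {unit!r} is not allowed by the data plan")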
The numerical format for each sample value should be defined in the data plan. An effort should be made to reduce the number of formats supported. Selecting a slightly larger numerical format so all values can be stored in that format may simplify future programming.
For example, the document may define that all pressure measurements will be made in pounds per square inch and stored as a 32-bit floating point value, allowing for a maximum value of 2^10 with a step size of 0.5, and a step size of 0.0005 for numbers with an absolute value less than 1. It is possible this value could fit into a 16-bit floating point value, but 32-bit is chosen because that level of accuracy is required for distance measurements and keeping a single format simplifies the system.
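The trade-off between 16-bit and 32-bit storage can be checked directly. A short sketch using NumPy to inspect the resolution of each format at the magnitudes in the example above; the specific probe values are only illustrative:

import numpy as np

# Spacing between adjacent representable values at two magnitudes of interest.
for dtype in (np.float16, np.float32):
    near_max = np.spacing(np.asarray(1000.0, dtype=dtype))  # near the 2**10 maximum
    near_one = np.spacing(np.asarray(0.9, dtype=dtype))     # just below 1
    print(dtype.__name__, near_max, near_one)

# float16 resolves roughly 0.5 near 1000 and roughly 0.00049 just below 1, so it
# barely meets the pressure requirement. float32 leaves margin for measurements,
# such as distance, that need finer resolution in the same shared format.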
A data structure is used to define the context information that will be associated with a specific sample type, and it can be defined in advance. In many programming languages, this is referred to as defining a class and its associated attributes.
For example, a single sensor value class can be defined as follows:
CLASS SENSOR SINGLE INPUT
Name: String
Vendor: String
Model Number: String
Machine: String
Parameter: Must select one => Meter, Pascal, Celsius, Kilogram, Liter
Value: 32-bit Floating Point
TimeCoarse: Year, Month, Day, Hour
TimeFine: Minute, Second, Millisecond
ExpansionPointer: Address for Future expansion
Because it is impossible to define the perfect structure, an extendable structure allows for future expansion of the description.
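In an implementation language, this pseudocode might become a small record type. A minimal Python sketch using only the fields above; the Unit enumeration stands in for the parameter selection, and an optional extension dictionary stands in for the expansion pointer:

from dataclasses import dataclass, field
from enum import Enum

class Unit(Enum):
    """The single allowed unit per measurement type, as defined by the data plan."""
    METER = "meter"
    PASCAL = "pascal"
    CELSIUS = "celsius"
    KILOGRAM = "kilogram"
    LITER = "liter"

@dataclass
class SensorSingleInput:
    name: str
    vendor: str
    model_number: str
    machine: str
    parameter: Unit              # must select exactly one unit
    value: float                 # stored as a 32-bit float at the storage layer
    time_coarse: str             # year, month, day, hour
    time_fine: str               # minute, second, millisecond
    extension: dict = field(default_factory=dict)  # room for future attributes

The extension dictionary plays the role of the expansion pointer: new context can be attached to future samples without changing the structure that every existing record already uses.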
Adding Additional Context
A data scientist looking at information that has only simple contextualization can eventually eliminate most of the anomalies. This can be done by assuming the machine operated normally for a long period of time, which provides a solid baseline.
One method is to compute the standard deviation of the data and remove any samples that fall more than one or two standard deviations from the mean.
Another strategy is to eliminate zeros or readings that are extreme relative to the mean. These are great strategies for analyzing the performance of a system operating in a steady state, but much of the data removed this way may still be valuable if it has context.
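A minimal sketch of both strategies using pandas; the column name and the two-standard-deviation threshold are only illustrative defaults:

import pandas as pd

def remove_statistical_outliers(df, column="value", n_std=2.0):
    """Keep samples within n_std standard deviations of the mean."""
    mean, std = df[column].mean(), df[column].std()
    return df[(df[column] - mean).abs() <= n_std * std]

def remove_zeros(df, column="value"):
    """Drop exact-zero readings, which often indicate a sensor that was not active."""
    return df[df[column] != 0]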
Rather than throwing away data based on a value's position relative to the mean, if the minimum and maximum valid readings are defined for a given sensor, samples recorded beyond those two limits can be removed. This requires that the min and max limits for the sensor be included in the sensor structure as fixed attributes, like the model number.
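With the limits stored alongside each sample, the filter becomes a direct comparison instead of a statistical guess. A sketch, assuming hypothetical min_valid and max_valid attributes have been added to the sensor structure:

def remove_out_of_range(df, min_valid, max_valid, column="value"):
    """Keep only samples inside the range the sensor designer documented."""
    return df[(df[column] >= min_valid) & (df[column] <= max_valid)]

# Example: a pressure sensor documented as valid between 0 and 1024 PSI.
# clean = remove_out_of_range(samples, min_valid=0.0, max_valid=1024.0)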
Another option is to identify a root cause for the data anomalies by working with the development team. They can help determine the cause of the anomalous sensor readings, but it can be a slow process. The design team can also make root causes easier to find by documenting the minimum and maximum expected sensor values for normal operation during the development process.
Lastly, sensors can produce extreme values during power on, machine startup, or an emergency stop. If an additional class attribute is created to mark these events in time, that information can be used to refine the data.
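A sketch of how such event markers might be used, assuming hypothetical event timestamps are recorded alongside the samples; the window length and column name are illustrative:

import pandas as pd

def drop_event_windows(df, events, window="5min", time_column="timestamp"):
    """Drop samples captured within a window after each startup, stop, or e-stop event."""
    keep = pd.Series(True, index=df.index)
    for event_time in events:
        start = pd.Timestamp(event_time)
        end = start + pd.Timedelta(window)
        keep &= ~df[time_column].between(start, end)
    return df[keep]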
In summary, adding some additional context to data can be useful. This data is available on the machine and is often overlooked. Collecting it consistently across all products or systems being monitored can provide additional value when comparing systems or customers.
What Could Go Wrong?
Planning for and creating clean data is a significant investment. Many may ask whether it is required and, if it is skipped, what can go wrong. The first thing to be aware of is that data analytics is an iterative process. Typically, it focuses on simple conclusions that are then tested.
These tests are done by making decisions based on the original data and measuring the impact of those changes. Iterative processes are slow and may take some time to generate measurable results, and there is room for error if the data is analyzed incorrectly. Errors in the analysis can result in poor decisions that hurt the performance of the company.
As an example, consider the error that could result from sampling data too infrequently. Assume a company is collecting data on its leased machines in the field, and the system records the number of pallets stacked each month. In the graph below, the number of units produced drops by 50% prior to a customer returning a machine.
This may be an indication that a user is going to terminate their lease and return the machine. It may be worth notifying the salesperson when the number of units drops off, so they can reach out to the customer before the customer decides to return the machine.
In this example, the data is too coarse. If it were collected per day or per hour, it would show that customers use the product at the same level right up until they return the system. The falloff appears only because leases terminate in the middle of the month. Additional context (the lease return date) or an increased sampling frequency would eliminate this incorrect conclusion.
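The effect is easy to reproduce. A small sketch using pandas, with made-up numbers purely for illustration: a customer stacks a constant 10 pallets per day and returns the machine on the 15th of the final month.

import pandas as pd

# Constant usage of 10 pallets per day, with the lease ending mid-month.
days = pd.date_range("2024-01-01", "2024-04-15", freq="D")
daily = pd.Series(10, index=days)

monthly = daily.groupby(daily.index.to_period("M")).sum()
print(monthly)
# January through March each report roughly 300 pallets; April reports only 150,
# an apparent drop of about 50% even though daily usage never changed.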
It is also common for a data analytics effort to identify and focus on information that is already well understood. This can help create confidence in the data but does not lead to improvements in company performance. At its core, data mining is the process of finding relevant patterns in data.
Unfortunately, there are also many irrelevant patterns in the data. Spending time on them is costly, and the process of leveraging any analysis involves making a change and testing the result, which consumes time. Additional context can help the data scientist eliminate irrelevant data or reach solutions faster.
In conclusion, clean data is required for a company to operate efficiently because it reduces the amount of time spent eliminating anomalies. Extra sample context gives data scientists more information that can be leveraged or ignored at the time of analysis.
It is very hard to add additional context to data after the fact. Finally, creating a data plan before deploying a data capture or industrial IoT strategy can reduce costs and improve results.