It has become a bit of a cliché in data science blog posts to say that about 80 percent of a data scientist’s time is spent cleaning up the data (and the other 20 percent is spent complaining about cleaning data), but it is a fairly apt description of my life in this field.
When we use the word “cleaning,” it implies that the data is dirty or contaminated (and often it is), but really what we are trying to do is make sure that the data represents reality. If you use non-representative data for further analysis (machine learning, reporting, or even manual data exploration), your results will be incomplete, inaccurate, misleading, or simply wrong.
Ensuring your data is telling you the truth is a big part of what makes applied data science hard. There are many factors that contribute to how much work is required to get your data into a usable state. These factors include, but are not limited to, where your data is coming from, how complex the system you are trying to represent is, and what you are doing with the data.
Where you are getting your data from can have a big impact on what it is going to look like. Depending on your application, you may be using data from a public source or an existing data collection system, or you may be building your own system from scratch. In some cases, the collection system will be tightly controlled, while in others it will allow for incomplete or erroneous entries, or may even be completely free form.
If you are relying on human inputs, you can expect things to get a little messy: missing information, letters in numeric fields, malformed structured fields (dates, times, SKUs, etc.), approximations, and, in some cases, attempts to game the system.
At Raven, some customers will share their existing data with us. Sometimes this is data that operators manually entered into a log based on how long they thought an operation took. We find data like this is often rounded to the nearest 15 or 30 minutes, and when an operator estimates and rounds, one starts to question the veracity of the data.
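One quick sanity check for this kind of rounding is to measure how many logged durations land exactly on a 15-minute boundary. The data below is hypothetical, a minimal sketch of the idea:

```python
# Hypothetical durations (in minutes) from a manually entered log.
durations = [30, 45, 60, 30, 15, 90, 30, 45, 60, 15, 30, 47]

# Fraction of entries that are exact multiples of 15 minutes.
on_boundary = sum(1 for d in durations if d % 15 == 0) / len(durations)

# If nearly everything lands on a boundary, the values were probably
# estimated and rounded rather than measured.
print(f"{on_boundary:.0%} of entries are multiples of 15 minutes")
```

Truly measured durations would scatter across the full range of minutes; a distribution dominated by round numbers is a strong hint that you are looking at estimates.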
Even when you can connect directly to whatever it is that you are measuring and design the system yourself, you might encounter issues. Inconsistencies can arise from power failures, Wi-Fi outages, faulty or damaged equipment, sensor fusion issues, or edge cases that were not considered. You must devise reasonable and consistent methods for handling these anomalies.
It’s also possible that user experience issues will make it challenging to capture the absolute truth straight from the source. At Raven, we fuse on/off data collected from machines with operator input explaining the reasons for downtime. We need to capture the operator’s intent while not becoming a nuisance. We sacrifice perfect raw data for ease of use (and user acceptance) and rely on some data cleaning (both automated and manual) to figure out the truth after the fact.
It’s almost a guarantee that some data munging will be required to go from your raw data to a usable representation of reality.
It should go without saying that the more complex the system, the more challenging it is to accurately represent. But even apparently simple systems can be surprisingly intricate when you start to consider the different ways things can happen.
A big part of getting the data into the right shape is having a pretty good idea of what that general shape is. You need to know about your system in order to describe it properly. When you know your system you can make better decisions about what needs to be included, what can be ignored, and what assumptions can be made to make life easier.
For example, when we look at manufacturing data from a machine, we often have data that tells us when a machine goes from active to idle or vice versa. This machine activity information is accompanied by operator-supplied tags that explain why the machine was idle; each idle period is typically tagged once. We typically encounter a couple of scenarios:
The machine is in full production for an hour, goes idle for five minutes, then goes back to full production. The operator tags that they were filling the blanks during the idle time, so we may infer that the entire five-minute idle period was for the filling operation. With the given information, this seems like a reasonable assumption; or
The machine is in full production and then goes idle. The operator tags tool sharpening. The machine is idle for five minutes, then goes active for about two minutes, idle for about two minutes, and repeats this pattern three times before finally returning to active for an extended period. If we accept this information at face value, we may say the machine was in tool sharpening for five minutes, followed by three two-minute periods of untagged downtime with short production runs in between.
With some domain knowledge, the data scientist would know that it takes at least three minutes to produce a part, so these short uptimes were not real production (they were likely test cuts to ensure that the newly sharpened tool is calibrated properly). In reality, the tool sharpening event took 17 minutes and those short periods of production should actually be considered idle time.
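That domain rule is simple enough to encode directly: any “active” interval shorter than the minimum cycle time cannot be real production, so relabel it as idle and merge it into the surrounding downtime. The sketch below uses hypothetical interval data and function names; the three-minute minimum cycle time comes from the example above:

```python
# Intervals are (state, minutes). MIN_CYCLE is the assumed shortest
# time in which the machine can actually produce a part.
MIN_CYCLE = 3

raw = [
    ("active", 60), ("idle", 5), ("active", 2), ("idle", 2),
    ("active", 2), ("idle", 2), ("active", 2), ("idle", 2),
    ("active", 45),
]

def merge_false_uptime(intervals, min_cycle=MIN_CYCLE):
    """Relabel 'active' periods too short to produce a part as 'idle',
    then merge consecutive intervals that share a state."""
    relabeled = [
        ("idle" if state == "active" and minutes < min_cycle else state,
         minutes)
        for state, minutes in intervals
    ]
    merged = []
    for state, minutes in relabeled:
        if merged and merged[-1][0] == state:
            merged[-1] = (state, merged[-1][1] + minutes)
        else:
            merged.append((state, minutes))
    return merged

print(merge_false_uptime(raw))
# → [('active', 60), ('idle', 17), ('active', 45)]
```

The five-minute idle period, the three two-minute test cuts, and the two-minute gaps between them collapse into the single 17-minute tool sharpening event described above.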
Seemingly simple data can easily be misleading if you don’t understand how it was collected, and how the system you are modelling actually works.
Once you have a representative set of data, structure it so that it is easy to use and manipulate. The tools used for visualization, machine learning, or other processes may dictate the form of your data, or you may be trying to minimize disk usage.
Transforming your data into its final form may happen in conjunction with other steps, but it is important to make sure that you don’t lose any fidelity when you change form (or that any fidelity losses are acceptable).
The level of fidelity may depend on your application. If you work with population data, you may only need country-level numbers, while other applications may require province/state-level metrics. Knowing your system and what you are trying to do (or may wish to do) with your data is important for making these decisions.
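The key point is that aggregation only goes one way: province-level numbers can always be rolled up to the country level, but never recovered from a country total. A minimal sketch with made-up population figures:

```python
# Hypothetical province-level counts. Rolling these up to the country
# level is easy, but once only the country totals are stored, the
# province-level detail is gone for good.
province_population = {
    ("Canada", "Saskatchewan"): 1_200_000,
    ("Canada", "Ontario"): 14_800_000,
    ("Canada", "Quebec"): 8_600_000,
}

country_population = {}
for (country, _province), count in province_population.items():
    country_population[country] = country_population.get(country, 0) + count

print(country_population)  # → {'Canada': 24600000}
```

If there is any chance a future application will need the finer grain, keep the detailed data and aggregate on demand rather than at storage time.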
Clean data is the foundation for further analysis. If you are not representing reality with your data, you are not going to get the desired results from your analysis. At Raven, we are trying to provide real-time feedback to operators, supervisors, and managers. We work with data that represents what is actually happening on the shop floor, so our real-time feedback is not just noise – it adds value.
This is why data scientists spend most of their time wrangling data. If the data is clean and formatted properly, the rest of the analysis is easier.