If you are familiar with IoT and like the cool pictures of “things” lighting up all over the world, this article is for you.
The primary value of an IoT system is its ability to perform analytics on the acquired data and extract useful insights. But as anyone who has done it or is doing it knows, building a pipeline for scalable analytics that copes with the volume, velocity and variety of IoT data is no easy task.
Kinds of Data
We encounter three kinds of data on an IoT platform:
Metrics – Sensor data such as temperature, humidity, vibration, and so on. This is typically a high volume of small readings arriving at high velocity.
Diagnostics – Data that gives an indication of the overall health of a machine, a system or a process. Diagnostic data often comprises, but is not limited to, log data. It may indicate that a process has started to fail and needs further analysis to determine the root cause.
Transactions – Data related to interactions between systems and human beings. Together with metrics and diagnostic data, this completes what I call M2M2P (Machine to Machine to Person) data. It may include an adjustment to the parameters of a machine, the completion of a planned maintenance task, and so on. Transaction data is likely to be less frequent than metrics and diagnostics, but it may be more complex and more insightful when combined with the other kinds of data.
The real value of IoT lies in bringing these different types of data together. That is also one of its key challenges. It is a challenge that involves data acquisition, combination and integration, data transformation, data processing and computation, data storage, and, more often than not overlooked, data security. Here is a typical IoT pipeline:
A variety of protocols enable the receipt of events from IoT devices, especially at the lower levels of the stack. The more popular and widely supported protocols for transferring data in IoT applications are MQTT, XMPP and Constrained Application Protocol (CoAP). Time-series data is captured as events take place around these devices. This use of real-time information provides a complete record for each device, as it happens.
Whether the goal is preemptive troubleshooting, monitoring or improving customer service, answers to business problems almost always require combining data from multiple sensors, often across multiple physical locations, with data from other sources such as ERP, manufacturing, PLM and other home-grown systems. You can combine the device-generated data with metadata about the device, or with other datasets such as weather or traffic data, for use in subsequent analysis. Combining data also lets you add checks, such as averaging readings across multiple devices, so that you avoid acting on data from a single device.
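To make the combining step concrete, here is a minimal sketch in plain Python. The device IDs, site names and field names (`device_id`, `temperature_c`, `site`) are made up for illustration. In a real system the metadata would come from an asset registry or ERP system rather than a hard-coded dictionary.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical device metadata; in practice this would be loaded from
# an asset registry, ERP system or similar source.
DEVICE_METADATA = {
    "dev-1": {"site": "plant-a", "model": "T100"},
    "dev-2": {"site": "plant-a", "model": "T100"},
    "dev-3": {"site": "plant-b", "model": "T200"},
}


def enrich(reading, metadata=DEVICE_METADATA):
    """Return a copy of the reading with the device's metadata merged in."""
    return {**reading, **metadata.get(reading["device_id"], {})}


def average_by_site(readings, metadata=DEVICE_METADATA):
    """Average temperature per site, so no single device drives a decision."""
    by_site = defaultdict(list)
    for reading in readings:
        joined = enrich(reading, metadata)
        if "site" in joined:  # skip devices with no known metadata
            by_site[joined["site"]].append(joined["temperature_c"])
    return {site: mean(values) for site, values in by_site.items()}


readings = [
    {"device_id": "dev-1", "temperature_c": 20.0},
    {"device_id": "dev-2", "temperature_c": 22.0},
    {"device_id": "dev-3", "temperature_c": 30.0},
]
site_averages = average_by_site(readings)
print(site_averages)
```

Averaging per site is only one possible check. A median or an outlier filter may suit noisy sensors better.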
Enrich and/or transform data
Data from devices in its raw form may not be suited for analytics. Values may be missing, requiring an enrichment step, or their representation may need to be transformed. You may also need to convert the data into another format. To avoid “drowning in sensor data”, certain events must be filtered out to keep storage and processing requirements manageable. When processing diagnostic data, it is also typical to filter out all of the “normal” data in order to focus on the “true diagnostic” data that suggests a problem needing further investigation. However, transformations and enrichments can be expensive and may add significant latency to the overall pipeline. It is therefore best to design the pipeline so that transformations do not have to be rerun when a sequence of events is replayed.
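The sketch below shows one way these steps could look in plain Python: fill a missing value with the device's last known reading, convert Fahrenheit to Celsius, and drop readings that fall inside a “normal” range. The field names, the carry-forward fill strategy and the 0 to 50 °C range are assumptions chosen for illustration, not a recommendation.

```python
def to_celsius(fahrenheit):
    """Convert a Fahrenheit reading to Celsius."""
    return (fahrenheit - 32.0) * 5.0 / 9.0


def clean_events(events, normal_range=(0.0, 50.0)):
    """Fill gaps, convert units and keep only out-of-range readings."""
    low, high = normal_range
    last_seen = {}  # last known temperature per device, in Celsius
    kept = []
    for event in events:
        device = event["device_id"]
        value = event.get("temperature_f")
        if value is None:
            # Enrichment: carry the last known value forward, if there is one.
            if device not in last_seen:
                continue
            celsius = last_seen[device]
        else:
            # Transformation: normalize the unit.
            celsius = to_celsius(value)
            last_seen[device] = celsius
        # Filtering: drop "normal" readings to reduce downstream load.
        if low <= celsius <= high:
            continue
        kept.append({"device_id": device, "ts": event["ts"], "temperature_c": celsius})
    return kept


events = [
    {"device_id": "dev-1", "ts": 1, "temperature_f": 68.0},   # normal, dropped
    {"device_id": "dev-1", "ts": 2, "temperature_f": None},   # filled, still normal
    {"device_id": "dev-1", "ts": 3, "temperature_f": 212.0},  # abnormal, kept
    {"device_id": "dev-2", "ts": 4, "temperature_f": None},   # nothing to fill from
]
abnormal = clean_events(events)
print(abnormal)
```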
Aggregate and compute data
By adding computation to your pipeline, you can apply streaming analytics to data while it is still in the processing pipeline. Calculated metrics are typically written to a persistent data store. They can then be used to suit business requirements: triggering real-time notifications, updating real-time dashboards or generating events for downstream applications. For most applications, it makes sense to store the computed or aggregated data alongside the “raw” data. Time-series data is typically kept in the persistent data store because, in the business domains related to IoT, it is often essential to understand how data changes over time.
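As a rough illustration of a calculated metric, the following sketch groups events into fixed (tumbling) time windows per device and averages each window. The event shape (`device_id`, `ts` in seconds, `value`) and the 60-second window are assumptions. The `store` dictionary simply stands in for a persistent data store that holds raw and aggregated data side by side.

```python
from collections import defaultdict
from statistics import mean


def tumbling_window_averages(events, window_seconds=60):
    """Average values per (device, window start) over fixed-size windows."""
    buckets = defaultdict(list)
    for event in events:
        # Align the timestamp to the start of its window.
        window_start = event["ts"] - event["ts"] % window_seconds
        buckets[(event["device_id"], window_start)].append(event["value"])
    return {key: mean(values) for key, values in sorted(buckets.items())}


events = [
    {"device_id": "dev-1", "ts": 0, "value": 10.0},
    {"device_id": "dev-1", "ts": 30, "value": 20.0},
    {"device_id": "dev-1", "ts": 60, "value": 40.0},
]

# Keep the raw events next to the aggregates, as suggested above.
store = {
    "raw": events,
    "per_minute_avg": tumbling_window_averages(events),
}
print(store["per_minute_avg"])
```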
Stream processing technologies are required to combine and correlate data from different sources, often in real time. Failures, exceptions and service level breaches can be identified as they happen, allowing notifications to be sent out immediately and automated actions and responses to be triggered. Storm and Spark are typical technology choices because they excel at managing high-volume streams and performing operations over them, such as event correlation, rolling metric calculations and aggregate statistics. They also leave the door open to implementing any “custom” algorithms that may be required.
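To show what a rolling metric with a real-time alert looks like in principle, here is a small sketch in plain Python. It does not use the Storm or Spark APIs. The window size, the threshold and the `(timestamp, value)` stream format are all assumptions for illustration.

```python
from collections import deque


class RollingMean:
    """Mean over the most recent `size` values."""

    def __init__(self, size):
        self.values = deque(maxlen=size)  # old values fall off automatically

    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


def detect_breaches(stream, size=3, threshold=80.0):
    """Return (timestamp, rolling mean) pairs where the mean exceeds the threshold."""
    rolling = RollingMean(size)
    alerts = []
    for ts, value in stream:
        average = rolling.update(value)
        if average > threshold:
            # A real pipeline would send a notification or emit an event here.
            alerts.append((ts, average))
    return alerts


stream = [(1, 70.0), (2, 75.0), (3, 80.0), (4, 90.0), (5, 100.0)]
alerts = detect_breaches(stream)
print(alerts)
```

Alerting on a rolling mean rather than on individual readings trades a little latency for fewer false alarms caused by a single spike.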
It is critical that the platform can extract deep predictive analytics from historical data combined with real-time events. Integrating a low-latency, high-throughput database system into the platform is therefore extremely important. In addition to massive scale, IoT workflows often require both fast lookups and fast writes, which calls for a persistent store that responds quickly to high volumes of reads and writes. Common data stores are HDFS and NoSQL databases like Cassandra. Relationships are often captured using graph databases.
Security must be a key overarching objective of any analytics platform. First, virtually anything connected to the Internet has the potential to be hacked, no matter how unlikely that seems. Internet of Things devices often lack systematic protections against viruses or spam. Beyond that, privacy and security need to be covered by the IoT data processing platform itself. Protecting the privacy of users is key, from data masking to support for encryption.
Second, any device connected to the Internet can potentially be monitored. If you put cameras in every room so that you can watch for your teenager or an intruder remotely from your office, those same cameras can potentially be watched by hackers.
Third, devices can be subjected to a Denial of Service attack, in which case they will not perform as expected.
Fourth, devices can be taken over. Once their control plane has been compromised, an attacker can control them.
458 percent increase in vulnerability scans of IoT devices in the last two years.
88 percent said they lack full confidence in the security of their business partners' IoT devices.
Approximately 68 percent plan to invest in IoT security in 2016.
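As one small example of the data masking mentioned above, the sketch below replaces a sensitive field with a keyed hash (HMAC-SHA256) from Python's standard library. The same user always maps to the same token, but the original value is not stored. The field name `user_id` and the hard-coded key are assumptions for illustration. A real deployment would load the key from a secrets manager and would still need encryption in transit and at rest.

```python
import hashlib
import hmac


def mask_identifier(identifier, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (pseudonymization)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_event(event, secret_key, sensitive_fields=("user_id",)):
    """Return a copy of the event with the sensitive fields masked."""
    masked = dict(event)
    for field in sensitive_fields:
        if field in masked:
            masked[field] = mask_identifier(str(masked[field]), secret_key)
    return masked


# Demo key only; never hard-code a real key.
DEMO_KEY = b"demo-key"
event = {"user_id": "alice", "temperature_c": 21.5}
masked = mask_event(event, DEMO_KEY)
print(masked)
```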
Last but not least, accuracy of data is of utmost importance because it affects not only today's queries but also the predictive analytics derived from the data in the future, and every downstream application. Take telemetry data from vehicles. If the time order of the data is not fully aligned and accurate, the analysis can produce different results.
As you can see, there is a lot going on for those of us working behind those “cool, connected everything” pictures that make a splash every day.
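The vehicle telemetry point can be illustrated with a short sketch. Here three speed readings arrive out of order, and the speed changes computed between consecutive readings differ depending on whether events are processed in arrival order or in event-time order. The field names `event_ts` and `speed_kmh` are made up for this example.

```python
def count_out_of_order(events):
    """Count events whose timestamp is older than one already seen."""
    count = 0
    latest = None
    for event in events:
        if latest is not None and event["event_ts"] < latest:
            count += 1
        else:
            latest = event["event_ts"]
    return count


def order_by_event_time(events):
    """Return the events sorted by when they happened, not when they arrived."""
    return sorted(events, key=lambda event: event["event_ts"])


def speed_changes(events):
    """Difference in speed between each pair of consecutive events."""
    speeds = [event["speed_kmh"] for event in events]
    return [later - earlier for earlier, later in zip(speeds, speeds[1:])]


arrived = [
    {"event_ts": 1, "speed_kmh": 10},
    {"event_ts": 3, "speed_kmh": 30},
    {"event_ts": 2, "speed_kmh": 20},  # arrived late
]
late_events = count_out_of_order(arrived)
changes_as_arrived = speed_changes(arrived)
changes_in_order = speed_changes(order_by_event_time(arrived))
print(late_events, changes_as_arrived, changes_in_order)
```

Processed as they arrived, the readings suggest the vehicle sped up sharply and then slowed down. In event-time order they show a steady acceleration.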