Scaling Support for Data Collection and Storage of Real-Time Sensor Data

Summary: We present the real-time, 24/7 data-flow architecture for our depression monitoring system. Heterogeneous sensors such as infrared motion sensors, wireless weight scales, accelerometers on the bed, acoustic sensors on a smartphone or in the environment, anonymity-aware cameras, and contact sensors are deployed in a home. The architecture is extensible, and new sensor types can be added as needed. Each stream is generated by a device and is often preprocessed at the device (compressed, summarized, or denoised) before being sent to and inserted into the main database. In addition to the main database, an archiving database saves the original high-data-rate periodic raw data. We separate the raw data from the processed data because filtering and selecting dense data is computationally expensive, and the behavior modules often do not need information at that resolution. However, the archived data allows new preprocessing techniques to be applied when advances are made in knowledge discovery from low-level patterns, or whenever it becomes useful to revisit the raw data.


Generic database: Heterogeneity makes data stream storage in the main database complex, because we must support low-data-rate devices such as motion detectors and contact switches together with high-data-rate devices such as microphones, ECGs, and video cameras. These real-time streams must also be supplemented with other data such as current medications, medical history, depression screening questionnaires, and clinicians' assessments. Traditional approaches for sensor networks, such as logging <timestamp,value> pairs into a SQL relational database, can be extremely wasteful in this case. Other systems, such as Aurora, target continuous data feeds but do not handle heterogeneity. Instead, we use a generic interface to the database, while the underlying storage method depends on the characteristics of the stream type. Aperiodic data and supplementary data are stored in SQL tables. High-frequency data is stored in file bundles using variable-bitrate lossy compression, and the time-block segments are indexed in the SQL database. Each stream carries metadata about the source of the feed, the signal type, and the type of compression to be used.
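The split between SQL rows and indexed file bundles can be sketched as follows. This is a minimal illustration, not the system's implementation: the StreamStore class, its method names, and the table layout are hypothetical, samples are assumed to be 16-bit integers, and lossless zlib stands in for the variable-bitrate lossy codec described above.

```python
import sqlite3
import struct
import tempfile
import zlib
from pathlib import Path


class StreamStore:
    """Hypothetical store: one interface, storage chosen per stream type."""

    def __init__(self, bundle_dir):
        self.bundle_dir = Path(bundle_dir)
        self.db = sqlite3.connect(":memory:")
        self.db.executescript("""
            CREATE TABLE streams (id INTEGER PRIMARY KEY, source TEXT,
                                  signal TEXT, periodic INTEGER, compression TEXT);
            CREATE TABLE events (stream_id INTEGER, ts REAL, value REAL);
            CREATE TABLE blocks (stream_id INTEGER, t_start REAL,
                                 t_end REAL, path TEXT);
        """)

    def register(self, source, signal, periodic, compression="none"):
        # Per-stream metadata: feed source, signal type, compression type.
        cur = self.db.execute(
            "INSERT INTO streams (source, signal, periodic, compression) "
            "VALUES (?, ?, ?, ?)",
            (source, signal, int(periodic), compression))
        return cur.lastrowid

    def insert_event(self, stream_id, ts, value):
        # Aperiodic, low-rate data goes straight into a SQL table.
        self.db.execute("INSERT INTO events VALUES (?, ?, ?)",
                        (stream_id, ts, value))

    def insert_block(self, stream_id, t_start, rate_hz, samples):
        # High-rate data goes to a compressed file; only the time range
        # and file path are indexed in SQL.
        raw = struct.pack(f"<{len(samples)}h", *samples)
        path = self.bundle_dir / f"s{stream_id}_{int(t_start)}.bin"
        path.write_bytes(zlib.compress(raw))
        t_end = t_start + len(samples) / rate_hz
        self.db.execute("INSERT INTO blocks VALUES (?, ?, ?, ?)",
                        (stream_id, t_start, t_end, str(path)))

    def blocks_between(self, stream_id, t0, t1):
        # Return paths of blocks whose time range overlaps [t0, t1).
        rows = self.db.execute(
            "SELECT path FROM blocks WHERE stream_id = ? "
            "AND t_end > ? AND t_start < ? ORDER BY t_start",
            (stream_id, t0, t1))
        return [row[0] for row in rows]


store = StreamStore(tempfile.mkdtemp())
motion = store.register("hallway motion sensor", "motion", periodic=False)
mic = store.register("bedroom microphone", "audio", periodic=True,
                     compression="zlib")
store.insert_event(motion, 1000.0, 1.0)
store.insert_block(mic, 1000.0, 100.0, [0, 12, -7, 3] * 50)  # 2 s at 100 Hz
hits = store.blocks_between(mic, 1001.0, 1001.5)
```

In this sketch a time-range query touches only the small SQL index; the bulky sample data is read from disk only for the blocks that overlap the requested interval.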

Data compression techniques: Because human behavior is circadian and often repetitious, we will investigate methods such as wavelet compression and discrete cosine transforms to store data efficiently in these file bundles. We plan to investigate a combination of lossy and lossless compression techniques, guided by the needs of depression monitoring, to balance detection accuracy against memory requirements. Lossy compression can affect the disease inference modules. For instance, in some signals, infrequent high-frequency data represents an anomaly, which aggressive lossy compression could remove. The choice of compression must therefore weigh inherent sensor noise, data sensitivity, and redundancy against one another. Statistics for all stream types are stored in their own tables and updated whenever values are inserted. Data aging will be handled automatically by removing older and less relevant data, summarizing it, or down-sampling it as necessary.
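Updating per-stream statistics on every insert, and down-sampling aged data, can be sketched as below. The RunningStats class and the downsample helper are hypothetical names introduced for illustration; Welford's online algorithm is one common choice for incremental mean and standard deviation and is an assumption here, not a method named in the project description.

```python
import math


class RunningStats:
    """Incremental mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        # Called once per inserted value; no stored history is needed.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def downsample(samples, factor):
    """Age data by replacing each run of `factor` samples with its mean."""
    return [sum(chunk) / len(chunk)
            for chunk in (samples[i:i + factor]
                          for i in range(0, len(samples), factor))]


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)

aged = downsample([1, 2, 3, 4, 5, 6], 2)
```

Because the statistics are updated in constant time per value, they stay current even for streams whose raw samples have already been compressed or aged out.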
 

Behavioral modules: After the streams are stored in the database, several behavioral modules can run concurrently, detecting anomalies or characteristics in the streams that suggest an increased risk of depression. For instance, the sleeping module will monitor the number of sleep interruptions, movement levels, and sleeping intervals, and compare these trends to the patient's normal behavior. We designed the MedStream interface to connect the inference logic to the database while hiding the complexities of the storage implementation. For example, the interface can return instantaneous values (interpolating as necessary for sparse datasets) or time segments.
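The two query styles mentioned above, instantaneous values with interpolation and time segments, can be sketched as follows. The StreamView class and its method names are hypothetical and are not the actual MedStream API; linear interpolation and clamping to the nearest sample outside the recorded range are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


class StreamView:
    """Hypothetical read-only view over one stream of (timestamp, value) pairs."""

    def __init__(self, samples):
        pairs = sorted(samples)
        self._ts = [t for t, _ in pairs]
        self._vs = [v for _, v in pairs]

    def value_at(self, t):
        """Instantaneous value at time t, linearly interpolated if needed."""
        if not self._ts:
            raise ValueError("empty stream")
        i = bisect_left(self._ts, t)
        if i == 0:
            return self._vs[0]          # before first sample: clamp
        if i == len(self._ts):
            return self._vs[-1]         # after last sample: clamp
        if self._ts[i] == t:
            return self._vs[i]          # exact hit, no interpolation
        t0, t1 = self._ts[i - 1], self._ts[i]
        v0, v1 = self._vs[i - 1], self._vs[i]
        return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

    def segment(self, t_start, t_end):
        """All samples with t_start <= timestamp <= t_end."""
        lo = bisect_left(self._ts, t_start)
        hi = bisect_right(self._ts, t_end)
        return list(zip(self._ts[lo:hi], self._vs[lo:hi]))


# Sparse weight readings: (day, kilograms).
weight = StreamView([(0.0, 70.0), (10.0, 72.0), (20.0, 71.0)])
midpoint = weight.value_at(5.0)
recent = weight.segment(5.0, 20.0)
```

A behavioral module written against such a view does not need to know whether the samples came from a SQL table or from a decompressed file bundle.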
 

Behavioral summary: Additionally, statistics such as mean, standard deviation, trend, and circadian probability distribution functions (PDFs) will be made available. We will engineer the database to answer real-time queries and return statistical summary information quickly, while accepting slower responses for historical queries. The database will dynamically pre-compute and cache the results of queries from the behavior modules, based on how frequently those queries arrive. The modules often run at different intervals depending on the behaviors being monitored. For instance, weight modules run once a week, sleeping assessments run daily, and movement and other modules run much more frequently. Many activity events, such as sleeping, can be represented as a PDF. This function will be generated from historical bed-sensor data on when the person goes to sleep and wakes up. Fast generation of the PDF in turn makes anomaly detection based on likelihood estimation fast. We intend to use OLAP (online analytical processing) hypercubes for structuring the data. OLAP cubes use a hierarchy in which summary data is stored in subcubes with fast lookup times, while drill-down procedures into the cube retrieve more detailed information. We will store the multi-dimensional data in these cubes to improve our ability to find relationships between attributes in our dataset. Challenges specific to medical sensing remain in our application, such as heterogeneous data, sparsity, and compression, and we intend to address them by creating and linking different cubes. In addition, we will implement a push-style event system. When rare but important events need to be monitored, a module can subscribe to the database for notification of the event. For instance, a template or pattern for arrhythmia can be given to the database, which then takes responsibility for notifying the cardiac monitoring module. This relieves the behavioral modules of expensive polling and can also improve responsiveness for high-risk events such as falls, suicide attempts, and unconsciousness.
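The push-style subscription described above can be sketched as follows. The EventBus class, its method names, the stream name, and the threshold value are all hypothetical; a simple predicate on each inserted value stands in for the template or pattern matching that a real arrhythmia or fall detector would require.

```python
class EventBus:
    """Hypothetical push-style notifier evaluated at insert time."""

    def __init__(self):
        self._subs = {}  # stream name -> list of (predicate, callback)

    def subscribe(self, stream, predicate, callback):
        # A module registers interest once instead of polling repeatedly.
        self._subs.setdefault(stream, []).append((predicate, callback))

    def insert(self, stream, ts, value):
        # Storage of the value would happen here; afterwards every
        # matching subscriber is notified immediately.
        for predicate, callback in self._subs.get(stream, ()):
            if predicate(value):
                callback(stream, ts, value)


alerts = []
bus = EventBus()

# Assumed threshold: acceleration magnitude above 2.5 g suggests a fall.
bus.subscribe("accelerometer",
              lambda g: g > 2.5,
              lambda stream, ts, value: alerts.append((stream, ts, value)))

bus.insert("accelerometer", 1.0, 1.0)
bus.insert("accelerometer", 2.0, 0.9)
bus.insert("accelerometer", 3.0, 3.1)   # triggers the callback
bus.insert("weight", 4.0, 300.0)        # no subscribers on this stream
```

In this sketch the cost of checking for the event is paid once per inserted value, and the subscribing module does no work until it is notified.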