Finally, the boundaries between candidate sub-actions are adjusted to obtain the final sub-actions. Sub-actions discovered in this way are consistent and semantically meaningful, as the accompanying figure illustrates. Our key assumption is that all video clips of an action share the same sequence of sub-actions.

The goal is to design an approach that automatically finds the appropriate number of sub-actions for each action in an unsupervised manner. Sub-actions should correspond to different semantic parts and be consistent across video clips of the same action. Moreover, the sub-actions in an action should occur in a specific order. Since the number of sub-actions in an action is unknown, we first cluster the segments in each video of an action into a fixed number of parts, which serve as candidate sub-actions.

Second, similar candidate sub-actions are merged through hierarchical agglomerative clustering. Finally, the sub-actions are optimized in an EM manner. Temporal segments within a video are represented by key frames.
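The merging step can be sketched as follows. This is a minimal illustration and not the exact procedure of the paper: it assumes each candidate sub-action is summarized by a mean feature vector, that only temporally adjacent candidates may merge (so that the order of sub-actions is preserved), and that merging stops at a hypothetical distance `threshold`. The function name and its parameters are introduced here for illustration only.

```python
from math import dist


def merge_candidates(features, counts, threshold):
    """Greedily merge temporally adjacent candidate sub-actions.

    features:  mean feature vectors, one per candidate, in temporal order
    counts:    number of segments each candidate currently covers
    threshold: assumed stopping distance; merging stops once the closest
               adjacent pair is farther apart than this value
    Returns the groups of original candidate indices that were merged.
    """
    feats = [list(f) for f in features]
    sizes = list(counts)
    groups = [[i] for i in range(len(feats))]
    while len(feats) > 1:
        # Only adjacent pairs are compared, so temporal order is preserved.
        gaps = [dist(feats[i], feats[i + 1]) for i in range(len(feats) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > threshold:
            break
        total = sizes[i] + sizes[i + 1]
        # Replace the pair by its size-weighted mean.
        feats[i] = [(a * sizes[i] + b * sizes[i + 1]) / total
                    for a, b in zip(feats[i], feats[i + 1])]
        sizes[i] = total
        groups[i] += groups[i + 1]
        del feats[i + 1], sizes[i + 1], groups[i + 1]
    return groups
```

For example, with three candidates whose first two feature vectors are close, `merge_candidates([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]], [1, 1, 1], 1.0)` groups the first two candidates and leaves the third on its own.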

The number at the top of each frame is the ground-truth index of the sub-action within the action. This action has two sub-actions. However, the first sub-action is initially broken into two parts. The first two parts in (b) are then merged.

However, in the first clip, one segment is incorrectly merged with the first part. The partitions are updated iteratively. Qualitative and quantitative results are shown below.

Figure 3: Temporal action detection results on THUMOS'14.

In this paper, we propose a robust and effective framework that substantially improves the performance of human action recognition from depth maps. The key contribution is the introduction of the Sub-action Motion History Image (SMHI) and the Static History Image (SHI) for a depth sequence.

We evenly subdivide the normalized motion energy into a set of segments, whose corresponding frame indices are used to partition a video into different sub-action segments.
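One way to read this partitioning step is sketched below. The text does not define the motion energy, so the sketch assumes it is the sum of absolute differences between consecutive depth frames, accumulated over time and normalized by its total. The function names, the flat-list frame layout, and the parameter `num_parts` are illustrative assumptions.

```python
def motion_energy(frames):
    """Energy of each frame transition, as a sum of absolute differences.

    frames: list of frames, each a flat list of depth values (assumed layout).
    """
    return [sum(abs(a - b) for a, b in zip(prev, cur))
            for prev, cur in zip(frames, frames[1:])]


def partition_by_energy(energy, num_parts):
    """Frame indices where the normalized cumulative energy reaches k / num_parts.

    energy[t] is the energy between frame t and frame t + 1, so a crossing
    at position t yields a boundary at frame index t + 1.
    """
    total = sum(energy)
    if total == 0:
        raise ValueError("sequence contains no motion")
    boundaries = []
    cumulative = 0
    level = 1
    for t, e in enumerate(energy):
        cumulative += e
        # Compare cumulative / total with level / num_parts without division.
        # A single large jump may cross several levels at once.
        while level < num_parts and cumulative * num_parts >= level * total:
            boundaries.append(t + 1)
            level += 1
    return boundaries
```

Because the split is made on accumulated energy rather than on frame count, stretches with little motion are absorbed into a neighbouring segment and boundaries concentrate where the motion occurs.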

The Local Binary Patterns (LBP) descriptor is then computed from the SMHI and SHI to represent an action. We evaluate the proposed framework on the MSR Action3D dataset.
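For readers unfamiliar with LBP, a minimal sketch of the basic 3x3 operator is given below. The text does not state which LBP variant, neighbourhood, or block layout is used, so this sketch assumes the plain 8-neighbour operator with a single 256-bin histogram over the whole template. The function name is illustrative.

```python
def lbp_histogram(image):
    """256-bin histogram of basic 8-neighbour LBP codes over interior pixels.

    image: 2D list of intensities, e.g. an SMHI or SHI template.
    """
    # Neighbour offsets in clockwise order, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the center sets its bit.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist
```

Under these assumptions, an action descriptor could be formed by concatenating the histograms of the SMHI and SHI templates, one pair per sub-action segment.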



