This paper proposes a new spatio-temporal feature description method for human motion analysis in videos. First, co-occurrences of SOEs (Spatio-temporal Orientation Energies) between two different regions of the whole-body area are defined as a low-level feature vector. Second, visual words are selected from the low-level feature vectors obtained from various human motion videos and are defined as templates. The L2 distances between these templates and a low-level feature vector are then defined as a mid-level feature vector. Experimental results show that the proposed features recognize human motions well when used with a one-vs-all SVM.
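The two feature levels described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy values, and in particular the use of an outer product of two per-region SOE histograms as the co-occurrence are assumptions made only for illustration. Only the second step, taking the L2 distance from a low-level vector to each template, follows directly from the description.

```python
import math


def cooccurrence_feature(soe_a, soe_b):
    """Hypothetical low-level feature for one pair of body regions.

    Assumption: the co-occurrence is modelled as the flattened outer
    product of the two regions' SOE histograms. The abstract does not
    specify the exact definition.
    """
    return [a * b for a in soe_a for b in soe_b]


def mid_level_feature(low_level, templates):
    """Mid-level feature: L2 distance from the low-level vector to each template."""
    return [math.dist(low_level, t) for t in templates]


# Toy SOE histograms for two regions (made-up values, two orientation bins each).
region_a = [0.5, 0.5]
region_b = [1.0, 0.0]
low = cooccurrence_feature(region_a, region_b)

# Two toy templates ("visual words") with the same dimension as `low`.
templates = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
mid = mid_level_feature(low, templates)
print(mid)
```

The resulting vector has one entry per template, so its dimension is set by the number of selected visual words rather than by the size of the low-level descriptor. Vectors of this kind would then be passed to the one-vs-all SVM for classification.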