Abstract: Precisely localizing temporal intervals for each action segment in long raw videos is essential challenge in practical video content analysis (e.g., activity detection or video caption ...