People make fast, spontaneous, and consistent judgments of social situations, even in complex physical contexts with multi-body dynamics (e.g., pushing, lifting, or carrying). What mental computations make such judgments possible? Do people rely on low-level perceptual cues, or on abstract concepts of agency, action, and force? We describe a new experimental paradigm, Flatland, for studying social inference in physical environments using automatically generated interactive scenarios. We show that human interpretations of events in Flatland can be explained by a computational model that combines inverse hierarchical planning with a physical simulation engine to reason about objects and agents. This model outperforms cue-based alternatives that rely on hand-coded features (multinomial logistic regression) or learned features (LSTM). Our results suggest that humans may combine intuitive physics and hierarchical planning to interpret the complex interactive scenarios encountered in daily life.
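To give a concrete sense of the inverse-planning idea, here is a minimal, hypothetical sketch (not the paper's actual model): an agent on a one-dimensional track moves left or right, and we infer which of two candidate goals it is pursuing by Bayesian inversion of an assumed noisily rational (softmax) action policy. The goal positions, rationality parameter, and environment are all illustrative assumptions.

```python
import math

# Toy illustration of Bayesian inverse planning (illustrative only):
# infer an agent's goal from observed actions, assuming the agent
# chooses actions via a softmax over negative distance-to-goal.

GOALS = {"left": 0, "right": 4}  # hypothetical goal positions on a 1-D track
BETA = 2.0                        # rationality: higher = more deterministic agent

def action_likelihood(pos, action, goal, beta=BETA):
    """P(action | position, goal) under a softmax over negative distance-to-goal."""
    def utility(a):
        next_pos = pos + (1 if a == "right" else -1)
        return -abs(goal - next_pos)
    z = sum(math.exp(beta * utility(a)) for a in ("left", "right"))
    return math.exp(beta * utility(action)) / z

def infer_goal(start, actions, prior=None):
    """Posterior over goals given an observed action sequence (Bayes' rule)."""
    prior = prior or {g: 1.0 / len(GOALS) for g in GOALS}
    posterior = dict(prior)
    pos = start
    for a in actions:
        for name, target in GOALS.items():
            posterior[name] *= action_likelihood(pos, a, target)
        pos += 1 if a == "right" else -1
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

# Two rightward moves from the middle strongly favor the "right" goal.
posterior = infer_goal(start=2, actions=["right", "right"])
```

The paper's model operates over far richer state (forces, objects, hierarchical subgoals), but the same inversion logic applies: candidate goal structures are scored by how well a rational planner pursuing them would reproduce the observed behavior.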