Artificial intelligence (AI) models are increasingly adopted in clinical practice, yet their generalizability outside controlled validation settings remains unclear. We aimed to evaluate the real-world performance of an FDA-cleared commercial pulmonary embolism (PE) detection model and to identify technical, demographic, and clinical factors associated with performance variation, in order to inform post-production monitoring and deployment strategies.