Abstract: Multimodal feature fusion for natural human-computer interaction requires complex intelligent architectures that can cope with unexpected errors and mistakes made by users. These architectures must react to events that occur simultaneously, and possibly redundantly, on different input media. In this paper, generic architectures based on intelligent agents are proposed for multimedia multimodal dialog protocols. Global agents are decomposed into their relevant components, and each component is modeled separately using timed colored Petri nets. The elementary models are then linked together to obtain the full architecture, which makes the architecture easier to maintain, understand, and modify. For validation purposes, the proposed multi-agent architectures are applied to a practical example.