Large artificial intelligence (AI) models have garnered significant attention for their remarkable, often "superhuman", performance on standardized benchmarks. However, when these models are deployed in high-stakes verticals such as healthcare, education, and law, they often reveal notable limitations. For instance, they are brittle to minor variations in input data, make contextually uninformed decisions in critical settings, and undermine user trust by confidently producing or reproducing inaccuracies.
These challenges in applying large models necessitate cross-disciplinary innovations to align the models' capabilities with the needs of real-world applications. We introduce a framework that addresses this gap through a layer-wise abstraction of innovations aimed at meeting users' requirements with large models. Through multiple case studies, we illustrate how researchers and practitioners across various fields can operationalize this framework.
Beyond modularizing the pipeline of transforming large models into useful "vertical systems", we also highlight the dynamism that exists within different layers of the framework. Finally, we discuss how our framework can guide researchers and practitioners to (i) optimally situate their innovations, (ii) uncover overlooked opportunities, and (iii) facilitate cross-disciplinary communication of critical challenges.
Framework Overview
The following figure provides an overview of the proposed framework for situating innovations, opportunities, and challenges in advancing vertical systems with large AI models; the framework is read from bottom to top. Large models form the base of vertical systems. These models need scaffolding to demonstrate properties such as robustness, interpretability, efficiency, and privacy before they are useful in vertical-specific systems. Vertical-specific adaptations are then required to deliver value within a given vertical: curating data, designing or adapting modeling approaches, conducting vertical-centric evaluations, and interfacing the model's outputs with users. General problems in designing interfaces and interactions between the system and its users are designated as vertical-user intermediaries. The dynamism between the framework layers (depicted by ↓ and ↑) is also noteworthy. Over time, vertical-agnostic properties, especially those applicable to many vertical systems, could become ingrained in future models as development strategies evolve. Similarly, modeling for vertical adaptation could become less prominent as large models become efficiently adaptable, as exemplified by the success of in-context learning with large language models. Finally, vertical-specific insights for interfacing systems with users and general interfacing techniques influence each other over time.

Examples Grounded in the Framework
Vertical-User Intermediaries
Trust calibration : How to best communicate models' capabilities and limitations?
Algorithm aversion or appreciation exists across many verticals. Introducing uncertainty expressions ("I'm not sure, but...") in AI-generated responses reduces over-reliance and helps calibrate users' trust - benefiting healthcare assistance, educational tutoring, and generative information retrieval.
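A minimal sketch of how such hedging might be applied, assuming the serving stack exposes a scalar confidence score for each response; the thresholds and phrasings below are illustrative placeholders, not part of the framework itself.

```python
# Sketch: prepend calibrated uncertainty phrasing to a model response.
# Assumes a confidence score in [0, 1] is available; the thresholds and
# wording are illustrative placeholders.

def add_uncertainty_expression(response: str, confidence: float) -> str:
    """Wrap a raw model response with hedging language tied to confidence."""
    if confidence >= 0.9:
        return response
    if confidence >= 0.6:
        return f"I'm fairly confident, but please verify: {response}"
    return f"I'm not sure, but one possibility is: {response}"

print(add_uncertainty_expression("The rash is consistent with contact dermatitis.", 0.55))
```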
Feedback loops : Capturing user feedback for iterative refinement of models
Designing interfaces that capture real-time, in-situ, and implicit user feedback will unlock iterative refinement of underlying systems across many verticals, enabling continuous improvement based on actual user interactions.
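One way such feedback capture might look in practice is a lightweight event log that records implicit signals (accepted, edited, regenerated, abandoned) alongside the model output for later analysis and refinement; the event names and fields below are assumptions for illustration.

```python
import json
import time
from dataclasses import dataclass, asdict

# Sketch: log implicit, in-situ feedback events for later refinement.
# Event types and fields are illustrative; a real system would define
# them with the product and privacy stakeholders of the given vertical.

@dataclass
class FeedbackEvent:
    session_id: str
    model_output_id: str
    event: str            # e.g. "accepted", "edited", "regenerated", "abandoned"
    edit_distance: int    # 0 if the user kept the output verbatim
    timestamp: float

def log_feedback(event: FeedbackEvent, path: str = "feedback.jsonl") -> None:
    """Append one feedback record as a JSON line for offline analysis or tuning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")

log_feedback(FeedbackEvent("s1", "out-42", "edited", edit_distance=17, timestamp=time.time()))
```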
Dynamic interfaces : Designing interfaces that engage users across Need for Cognition (NFC) levels
Users with higher NFC levels benefit from complex interfaces, while users with lower NFC levels struggle with them. This pattern spans many verticals and persists with recent generative AI technologies, calling for adaptive interface design.
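A toy sketch of adaptive interface selection follows; the normalized NFC score, its source (e.g., a short questionnaire or interaction signals), and the cutoff are all assumptions made for illustration.

```python
# Sketch: choose an interface variant from a user profile.
# The NFC score and the 0.5 cutoff are placeholders; in practice the
# signal might come from a validated questionnaire or interaction data.

def select_interface(nfc_score: float) -> dict:
    """Return interface settings keyed to a normalized NFC score in [0, 1]."""
    if nfc_score >= 0.5:
        return {"layout": "expanded", "show_reasoning": True, "controls": "advanced"}
    return {"layout": "minimal", "show_reasoning": False, "controls": "basic"}

print(select_interface(0.8))
print(select_interface(0.3))
```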
Vertical Adaptation in Healthcare
Data : Curating data for tuning with medical dialogues
Curating the specialized datasets of diagnostic conversations needed to accurately tune multimodal models for clinical use requires close involvement of domain experts.
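One possible record schema for such expert-curated diagnostic dialogues is sketched below; the field names are assumptions and would need to be co-designed with the clinicians doing the curation.

```python
from dataclasses import dataclass, field

# Sketch: one possible schema for an expert-reviewed diagnostic dialogue.
# Field names are illustrative assumptions, not a prescribed format.

@dataclass
class DialogueTurn:
    speaker: str                                     # "patient" or "clinician"
    text: str
    image_refs: list = field(default_factory=list)   # e.g. de-identified scan IDs

@dataclass
class DiagnosticDialogue:
    case_id: str
    turns: list                     # list[DialogueTurn]
    expert_diagnosis: str           # gold label assigned during curation
    reviewer_id: str                # clinician who verified the dialogue
    quality_flags: list = field(default_factory=list)

example = DiagnosticDialogue(
    case_id="case-001",
    turns=[DialogueTurn("patient", "I've had chest pain for two days.")],
    expert_diagnosis="unstable angina (example label)",
    reviewer_id="clinician-07",
)
```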
Modeling : Enabling patient history-taking for diagnoses
Off-the-shelf LLMs lack capabilities required for accurate diagnosis, such as patient history-taking. In radiology, this involves equipping multimodal LLMs with temporal modeling capabilities so they can reason over a patient's history.
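As a rough illustration of what history-aware prompting could involve at the interface to the model, the sketch below orders a patient's prior studies chronologically and folds them into a single query; the `multimodal_llm` client referenced in the final comment is a hypothetical placeholder, not a real API.

```python
from dataclasses import dataclass
from datetime import date

# Sketch: order a patient's prior studies chronologically and fold them
# into one prompt, so the model can reason over change across time.
# Study fields are illustrative; real records would be richer.

@dataclass
class Study:
    acquired_on: date
    modality: str        # e.g. "chest X-ray"
    report_summary: str
    image_path: str

def build_history_prompt(studies: list, question: str) -> str:
    ordered = sorted(studies, key=lambda s: s.acquired_on)
    lines = [
        f"[{s.acquired_on.isoformat()}] {s.modality}: {s.report_summary}"
        for s in ordered
    ]
    return "Prior studies (oldest first):\n" + "\n".join(lines) + f"\n\nQuestion: {question}"

prompt = build_history_prompt(
    [Study(date(2023, 1, 5), "chest X-ray", "small left pleural effusion", "img1.png"),
     Study(date(2024, 2, 9), "chest X-ray", "effusion resolved", "img2.png")],
    "Has the effusion progressed?",
)
print(prompt)
# response = multimodal_llm.generate(prompt, images=[...])  # hypothetical client call
```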
Evaluation : Evaluating data quality by clinical standards
Generic natural language generation metrics fail to capture clinically pertinent differences. Meaningful, clinically grounded metrics are needed to guide future research and measure clinical accuracy.
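One direction such metrics might take is scoring overlap of clinically salient findings rather than surface tokens. The sketch below uses a keyword-based extractor purely as a stand-in; a real pipeline would rely on a clinical NER or report-labeling model, and negation handling is deliberately omitted here.

```python
# Sketch: compare generated and reference reports on extracted clinical
# findings instead of n-gram overlap. The keyword extractor is a stand-in
# for a proper clinical labeler; negation handling is omitted for brevity.

FINDINGS = {"pneumothorax", "pleural effusion", "cardiomegaly", "consolidation"}

def extract_findings(report: str) -> set:
    text = report.lower()
    return {f for f in FINDINGS if f in text}

def finding_f1(generated: str, reference: str) -> float:
    gen, ref = extract_findings(generated), extract_findings(reference)
    if not gen and not ref:
        return 1.0
    tp = len(gen & ref)
    precision = tp / len(gen) if gen else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(finding_f1("Mild cardiomegaly, no pleural effusion noted.",
                 "Cardiomegaly present; lungs clear."))
```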
Interfacing : Clinician-AI collaboration for error correction
Collaboration loses effectiveness when experts either overly rely on AI predictions or are excessively critical of them. Proper interfacing is crucial for effective human-AI teamwork in clinical settings.
Vertical Adaptation in Education
Data : Curating data that supports better pedagogy
The central challenge is curating data that captures diverse pedagogical strategies and covers the broad range of topics, learner demographics, and instructional modes encountered in real classrooms.
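A small sketch of how curation coverage might be audited across the dimensions named above (topics, learner demographics, instructional modes); the field names and example records are assumptions.

```python
from collections import Counter

# Sketch: audit a curated tutoring corpus for coverage across the
# dimensions that matter pedagogically. Field names are illustrative.

corpus = [
    {"topic": "fractions", "grade_band": "3-5", "mode": "worked example"},
    {"topic": "fractions", "grade_band": "3-5", "mode": "Socratic dialogue"},
    {"topic": "photosynthesis", "grade_band": "6-8", "mode": "worked example"},
]

def coverage(records: list, dimension: str) -> Counter:
    """Count how many curated examples exist per value of one dimension."""
    return Counter(r[dimension] for r in records)

for dim in ("topic", "grade_band", "mode"):
    print(dim, dict(coverage(corpus, dim)))
```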
Modeling : Enabling efficient learner-tuned pedagogy
Rather than inefficient fine-tuning, it's advantageous to allow learners to specify desired attributes across pedagogical dimensions and have the model reflect them for personalized instruction.
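A minimal sketch of this prompting-based approach, where learner-specified attributes are reflected in the instructions given to the model rather than baked in via fine-tuning; the attribute names and the `llm_generate` call in the final comment are placeholders.

```python
# Sketch: personalize tutoring via in-context instructions rather than
# fine-tuning. Attribute names are illustrative; `llm_generate` stands
# in for whichever LLM client the system actually uses.

def build_tutor_prompt(attributes: dict, question: str) -> str:
    prefs = "\n".join(f"- {k}: {v}" for k, v in attributes.items())
    return (
        "You are a tutor. Adapt your teaching to these learner preferences:\n"
        f"{prefs}\n\n"
        "Guide the learner with hints before revealing full solutions.\n\n"
        f"Learner question: {question}"
    )

prompt = build_tutor_prompt(
    {"explanation_depth": "step-by-step", "tone": "encouraging", "examples": "sports"},
    "Why does dividing by a fraction make the number bigger?",
)
print(prompt)
# answer = llm_generate(prompt)  # hypothetical call to the underlying model
```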
Evaluation : Evaluating engagement and motivation
Evaluation should be grounded in learning science, prioritizing learner motivation and engagement rather than just "giving the right answer", so that educational effectiveness is what gets measured.
Interfacing : Grounding responses in what the learner "sees"
It is crucial that interfaces support learner-tutor interactions in which the conversation is grounded in what the student "sees" - a core principle of student-centric pedagogy.
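In practice, this can mean passing the learner's current on-screen context along with their message; the sketch below is an assumption about how such state might be serialized into the prompt.

```python
# Sketch: ground the tutor's response in the learner's current view by
# serializing on-screen state into the prompt. The state fields are
# illustrative placeholders.

def grounded_prompt(screen_state: dict, learner_message: str) -> str:
    context = (
        f"Problem shown on screen: {screen_state['problem']}\n"
        f"Learner's work so far: {screen_state['work_so_far']}\n"
    )
    return context + f"\nLearner says: {learner_message}\nRespond with reference to the work above."

print(grounded_prompt(
    {"problem": "Solve 2x + 3 = 11", "work_so_far": "2x = 14"},
    "Is this right so far?",
))
```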
Vertical-Agnostic Properties
Robustness : Are models robust to realistic variations in input?
Models must handle plausible variations in user-provided inputs across multiple modalities, as it is unreasonable to assume users will constrain their inputs to the model's training distribution.
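A lightweight way to probe this is to perturb inputs and check whether outputs stay consistent. In the sketch below, the perturbations (character swaps, casing) and the toy model are illustrative stand-ins for a vertical's real input variations and deployed system.

```python
import random

# Sketch: probe robustness by applying realistic input perturbations and
# measuring output consistency. The perturbations here (typos, casing)
# are examples only, and `toy_model` stands in for the real system.

def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def consistency_rate(model, prompt: str, n: int = 10, seed: int = 0) -> float:
    """Fraction of perturbed inputs for which the model output is unchanged."""
    rng = random.Random(seed)
    baseline = model(prompt)
    variants = [add_typo(prompt, rng) for _ in range(n)] + [prompt.upper()]
    agree = sum(model(v) == baseline for v in variants)
    return agree / len(variants)

toy_model = lambda text: "flagged" if "fever" in text.lower() else "not flagged"
print(consistency_rate(toy_model, "Patient reports fever and cough."))
```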
Privacy : Do models carefully handle sensitive data/PII?
Models must carefully handle personally identifiable information already encoded during pre-training, as well as sensitive data provided by users (e.g., patients, students) during iterative refinement.
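A simplified sketch of scrubbing obvious identifiers before user-provided data is stored or fed back for refinement; the regex patterns are illustrative only, and production systems, especially for clinical or student data, typically rely on dedicated de-identification or NER tooling plus human review.

```python
import re

# Sketch: redact obvious identifiers before logging or reusing user data.
# These regexes are illustrative; real de-identification needs dedicated
# tooling and review, particularly for clinical or student records.

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with bracketed type labels."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or 555-123-4567."))
```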
Interpretability : Can models provide interpretable predictions?
Large AI models must provide interpretable predictions that foster transparency, enabling users to understand and trust the reasoning behind AI-generated outputs.
Multimodal Large Language Models
How can modalities beyond language (visual, audio, sensor data) be reliably processed?
Multimodal models could interpret radiology scans alongside diagnostic questions, analyze raw ECG signals for health analytics, provide voice-based tutoring, and review lengthy codebases. However, questions remain about how well multimodal LLMs can reason over non-textual data.
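As an illustration of the interfacing question, the sketch below packages an image and a diagnostic question into a single multimodal request; `MultimodalClient` in the final comment is a hypothetical placeholder rather than a specific SDK, and the image path is a dummy value.

```python
import base64
from pathlib import Path

# Sketch: bundle an image and a textual question into one multimodal
# request. `MultimodalClient` is a hypothetical placeholder, not a real
# SDK; the point is only the shape of the request.

def build_request(image_path: str, question: str) -> dict:
    """Encode the image as base64 and pair it with the question."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "inputs": [
            {"type": "image", "data": image_b64},
            {"type": "text", "data": question},
        ]
    }

# "chest_xray.png" is a placeholder path for illustration.
request = build_request("chest_xray.png", "Is there evidence of a pneumothorax?")
# response = MultimodalClient().generate(**request)  # hypothetical client
```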