Introduction
We offer comprehensive consulting to build and optimize your complete AI software stack, covering heterogeneous computing hardware (CPU/GPU/NPU), operator software stack development, AI compiler performance tuning, and AI inference systems and model optimization. Our tailored solutions improve the efficiency of clients' AI model inference and hardware execution, help build a robust and scalable foundational AI platform, and strengthen clients' ability to convert technical capabilities into tangible business value.
Optimizing the Operator Development Workflow for a Major AI Platform
In collaboration with a major Chinese AI platform provider, we tackled their core engineering challenge: a large and diverse set of operators developed in multiple languages, which drove up the cost and complexity of the technology stack. We also overhauled the client's operator precision and performance testing frameworks and engineering processes, which streamlined regression validation, shortened test execution times, and improved overall development and delivery efficiency.
Based on a systematic diagnosis and the specific characteristics of the client's AI platform software, we implemented improvements in five areas:
1. Optimized the operator development technology stack, making the developer experience more consistent and coherent while balancing rapid delivery against peak performance.
2. Optimized the C++-based operator development framework, reducing the complexity of developing sophisticated operators and thereby increasing code reuse and overall development efficiency.
3. Streamlined the cross-language operator test mechanism, resolving fragmented test case creation and enabling end-to-end validation of individual operators.
4. Enabled effective operator fusion testing by designing and implementing a Domain-Specific Language (DSL) for graph construction and validation, which simplified the creation of complex graph fusion test cases and made them significantly more effective.
5. Accelerated CI (Continuous Integration) and test execution for operators by improving the concurrent execution performance of operator tests and shortening performance test case runtimes, which significantly improved the overall efficiency of the CI pipeline.
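To make the idea of a graph-construction DSL for fusion testing more concrete, here is a minimal sketch in Python. It is an illustration only, not the client's DSL: the names `FusionGraph`, `check_fusion`, and `fused_mul_add_relu`, the three element-wise operators, and the tolerance values are all assumptions made for this example. The sketch builds a small operator graph, evaluates it node by node as the unfused reference, and compares that result with a single fused function.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    op: str                                    # operator name, or "input"
    inputs: list = field(default_factory=list)  # names of upstream nodes


class FusionGraph:
    """Hypothetical graph-construction DSL: each call adds one validated node."""

    # Illustrative element-wise reference operators on lists of floats.
    OPS = {
        "mul": lambda a, b: [x * y for x, y in zip(a, b)],
        "add": lambda a, b: [x + y for x, y in zip(a, b)],
        "relu": lambda a: [max(x, 0.0) for x in a],
    }

    def __init__(self):
        self.nodes = {}

    def input(self, name):
        return self._add(name, "input")

    def op(self, name, op, *inputs):
        if op not in self.OPS:
            raise ValueError(f"unknown operator: {op}")
        return self._add(name, op, *inputs)

    def _add(self, name, op, *inputs):
        # Validation step: unique node names, and every input already defined.
        if name in self.nodes:
            raise ValueError(f"duplicate node name: {name}")
        for node in inputs:
            if node.name not in self.nodes:
                raise ValueError(f"undefined input: {node.name}")
        self.nodes[name] = Node(name, op, [n.name for n in inputs])
        return self.nodes[name]

    def run(self, output, feeds):
        """Evaluate the graph one operator at a time (the unfused reference)."""
        cache = dict(feeds)

        def evaluate(name):
            if name not in cache:
                node = self.nodes[name]
                if node.op == "input":
                    raise KeyError(f"missing feed for input: {name}")
                args = [evaluate(i) for i in node.inputs]
                cache[name] = self.OPS[node.op](*args)
            return cache[name]

        return evaluate(output.name)


def fused_mul_add_relu(x, w, b):
    """Stand-in for a fused kernel that computes relu(x * w + b) in one pass."""
    return [max(xi * wi + bi, 0.0) for xi, wi, bi in zip(x, w, b)]


def check_fusion(graph, output, fused_fn, feeds, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if the fused function matches the unfused graph result."""
    reference = graph.run(output, feeds)
    fused = fused_fn(**feeds)
    return len(reference) == len(fused) and all(
        math.isclose(r, f, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, f in zip(reference, fused)
    )


# A complete fusion test case: a few lines to build the graph, one to check it.
g = FusionGraph()
x, w, b = g.input("x"), g.input("w"), g.input("b")
y = g.op("y", "relu", g.op("s", "add", g.op("m", "mul", x, w), b))
feeds = {"x": [1.0, -2.0, 3.0], "w": [2.0, 2.0, -1.0], "b": [0.5, 0.5, 0.5]}
print("fusion matches reference:", check_fusion(g, y, fused_mul_add_relu, feeds))
```

In a sketch like this, the graph definition and its reference evaluation live in one place, so a fusion test case shrinks to the graph description plus one comparison call.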
The optimized operator development process has improved both the end-to-end developer experience and delivery efficiency. The refactored C++ framework allows more flexible code composition, substantially increasing code reuse. The streamlined automated testing process for operators has also improved testing efficiency and feedback speed, with the average number of lines of code per test case dropping from 200 to 50.
Optimizing the AI Compiler and Inference Stack for a Custom NPU
We partnered with a major NPU company to optimize the architecture, performance, and engineering efficiency of their in-house AI compiler and inference engine.
Working with the client's experts, we performed an in-depth performance analysis of their AI compiler and inference engine based on the specific characteristics of their in-house NPU. We then redesigned and modeled the core architecture, ensuring that the compiler and inference engine remained independent while still working together to boost overall performance. Pairing with the client's development team, we refactored, developed, and tested the core architecture and code following high-performance coding practices. The engagement spanned 9 months and culminated in a smooth and successful launch of the new version.
The new version of the AI compiler is more extensible, meeting the platform's long-term evolution needs. The refactored AI inference engine delivered an overall performance increase of more than threefold, far exceeding the project's original performance targets and achieving the shared goal of enhancing the client's product competitiveness.
AI Paradigm Innovation and Research

System Software Performance Engineering and Optimization
When businesses expand, they often face critical performance issues: slowing systems, spiraling resource costs, and instability during peak demand. Our “System Software Performance Engineering & Optimization” service addresses this by systematically building performance into your processes, pinpointing software and hardware bottlenecks, and implementing end-to-end optimization strategies. The goal is to boost the performance and resource efficiency of large-scale systems comprehensively.

Software Architecture Design & Refactoring
In the lifecycle of large-scale software systems, enterprises often face challenges such as architecture decay that hinders scalability, accumulated technical debt that slows development cycles, and tight coupling that cannot support business growth. Our "Software Architecture Design & Refactoring" service addresses these issues directly. Guided by the client's business requirements and grounded in key architectural technologies, we provide full-stack technical architecture solutions and consulting, from Domain-Driven Design (DDD) to component-based architecture, from architectural styles to quality assurance reviews, and from core design principles to performance optimization.