GPU clusters, the workhorses of modern artificial intelligence, often sit idle between training jobs, representing a significant cost for operators. Now a new platform called InferenceSense aims to turn those dark cycles into revenue streams by running AI inference directly on unused hardware. The approach, pioneered by researchers behind the widely adopted vLLM inference engine, could reshape how neocloud providers monetize their infrastructure and potentially lower costs for AI developers.
The core idea is simple: instead of letting expensive GPUs sit dormant, InferenceSense allows neocloud operators to offer their spare capacity for AI inference tasks, splitting the resulting revenue. This differs from existing spot GPU markets, where cloud vendors rent out raw compute power. InferenceSense provides a complete inference stack, streamlining the process for engineers and maximizing the utilization of available resources. The launch of InferenceSense marks a shift towards more efficient GPU utilization and a potential new economic model for the AI infrastructure landscape.
FriendliAI, the company behind InferenceSense, was founded in 2021 by Byung-Gon Chun, the researcher whose work on continuous batching laid the foundation for vLLM. Chun, a former professor at Seoul National University, authored the “Orca” paper introducing continuous batching – a technique that admits and retires requests at each decoding step rather than waiting for an entire batch to finish, improving GPU utilization over traditional static batching. This technique is now considered industry standard and is central to vLLM’s performance.
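The intuition behind continuous batching can be shown with a toy scheduler (a simplified sketch of the scheduling idea only, not Orca's or vLLM's actual implementation). With static batching, a short request waits for the longest request in its batch; with continuous batching, a finished request's slot is refilled on the very next decoding step:

```python
from collections import deque

def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a finished request's slot is refilled immediately."""
    pending = deque(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.popleft())  # admit new requests mid-stream
        steps += 1                            # one decoding step for all active requests
        active = [n - 1 for n in active if n > 1]
    return steps

lengths = [3, 100, 5, 100, 4, 2]  # wildly uneven output lengths
print(static_batch_steps(lengths, 2))      # 204 steps
print(continuous_batch_steps(lengths, 2))  # 108 steps
```

With uneven output lengths, the same six requests finish in roughly half the decoding steps, which is exactly the kind of win the technique delivers in real serving workloads.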
How InferenceSense Works
InferenceSense integrates with existing Kubernetes deployments, the container orchestration system widely used by neocloud operators. Operators allocate a pool of GPUs to a Kubernetes cluster managed by FriendliAI, specifying conditions for reclaiming the hardware when needed. The platform then spins up isolated containers to serve inference workloads on open-weight models like DeepSeek, Qwen, Kimi, GLM, and MiniMax. FriendliAI handles demand aggregation through direct clients and inference aggregators such as OpenRouter, optimizing models and managing the serving stack. Operators receive a real-time dashboard displaying model usage, token processing, and accrued revenue.
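The lend-and-reclaim arrangement might look something like the following sketch. This is purely illustrative: the class and field names are hypothetical and not part of any published InferenceSense API; the point is that the operator, not the platform, defines the reclaim rules:

```python
# Hypothetical sketch of operator-defined lending/reclaim rules; all names
# are illustrative, not part of any published InferenceSense API.
from dataclasses import dataclass

@dataclass
class GpuPool:
    total: int
    reserved_floor: int  # GPUs the operator always keeps for its own jobs
    lent: int = 0        # GPUs currently serving inference containers

    def lend_idle(self, idle: int) -> int:
        """Offer idle GPUs to the inference platform, above the reserved floor."""
        lendable = max(0, min(idle, self.total - self.reserved_floor) - self.lent)
        self.lent += lendable
        return lendable

    def reclaim(self, needed: int) -> int:
        """Pull GPUs back for a training job, draining inference containers first."""
        reclaimed = min(needed, self.lent)
        self.lent -= reclaimed
        return reclaimed

pool = GpuPool(total=64, reserved_floor=8)
print(pool.lend_idle(40))  # 40 idle GPUs offered for inference
print(pool.reclaim(16))    # a training job arrives; 16 GPUs drained back
```

In a real deployment these rules would presumably be expressed through Kubernetes scheduling primitives (node pools, taints, priority-based preemption) rather than application code, but the contract is the same: inference only ever runs on capacity the operator has explicitly lent out.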
The key differentiator for InferenceSense lies in its focus on token throughput rather than raw capacity. FriendliAI claims its engine delivers two to three times the throughput of a standard vLLM deployment, though the exact figure varies depending on the workload. This increased efficiency translates to higher potential revenue for operators.
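The economics follow directly from the throughput claim. A back-of-envelope calculation (the throughput and price figures below are illustrative assumptions, not FriendliAI's numbers) shows why tokens per second, not GPU count, drives operator revenue:

```python
def revenue_per_gpu_hour(tokens_per_sec: float, usd_per_million_tokens: float) -> float:
    """Back-of-envelope: tokens served in an hour times the market token price."""
    return tokens_per_sec * 3600 / 1_000_000 * usd_per_million_tokens

# Illustrative figures only: a vLLM-class baseline vs. a ~2.5x-throughput engine
baseline = revenue_per_gpu_hour(2_000, 0.50)
optimized = revenue_per_gpu_hour(5_000, 0.50)
print(f"${baseline:.2f}/GPU-hr vs ${optimized:.2f}/GPU-hr")  # $3.60 vs $9.00
```

At the same market price per token, a 2.5x throughput gain is a 2.5x revenue gain per GPU-hour, which is the whole pitch to operators deciding whether idle hardware is worth lending out.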
A Different Approach to GPU Monetization
Traditional spot GPU markets, offered by providers like CoreWeave, Lambda Labs, and RunPod, involve renting out hardware. InferenceSense, however, leverages hardware already owned by neocloud operators, allowing them to define participation rules and scheduling agreements with FriendliAI. This distinction is crucial: spot markets monetize capacity, while InferenceSense monetizes the actual processing of AI tasks – the tokens.
FriendliAI’s inference engine is built using C++ and custom GPU kernels, diverging from the Python-based frameworks common in competing stacks. The company has also developed its own model representation layer, along with implementations of speculative decoding, quantization, and KV-cache management, all contributing to its performance gains. This optimized stack allows operators to potentially earn more revenue per unused cycle than by simply offering raw compute capacity.
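Of the techniques listed, speculative decoding is perhaps the least self-explanatory. The toy below illustrates the idea only (the "models" are stand-in arithmetic functions, and this is in no way FriendliAI's implementation): a cheap draft model proposes several tokens at once, and the expensive target model verifies them in a single pass, keeping the longest agreed prefix plus one corrected token.

```python
# Toy greedy speculative decoding. With greedy decoding, verification reduces
# to a prefix match between the draft's tokens and the target model's choices.
def draft_model(ctx):   # cheap approximation: doubles the last token
    return ctx[-1] * 2

def target_model(ctx):  # "real" model: doubles the last token modulo 10
    return (ctx[-1] * 2) % 10

def speculative_step(context, k=4):
    """Draft k tokens, keep the longest prefix the target agrees with,
    plus one token from the target (so at least 1 token per step)."""
    draft, ctx = [], list(context)
    for _ in range(k):                  # one cheap call per drafted token
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in draft:                     # verify drafts against the target
        want = target_model(ctx)
        if t != want:
            accepted.append(want)       # target's correction ends the step
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_model(ctx))  # all drafts accepted: bonus token
    return accepted

print(speculative_step([1], k=4))  # [2, 4, 8, 6]: three drafts accepted, one corrected
```

The output is identical to what the target model would produce decoding alone; the win is that several of its steps were verified in one pass rather than generated one by one, which is why the technique shows up in throughput-focused serving stacks.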
Implications for AI Engineers and Neoclouds
For AI engineers, the emergence of platforms like InferenceSense adds another layer to the neocloud versus hyperscaler decision-making process. While price and availability have traditionally been the primary considerations, the ability of neoclouds to monetize idle capacity could incentivize them to offer more competitive token pricing. Chun believes that “when we have more efficient suppliers, the overall cost will proceed down,” and that InferenceSense can contribute to making AI models more affordable.
It’s still early days, and widespread adoption of InferenceSense won’t immediately shift infrastructure decisions. However, engineers tracking inference costs should monitor whether increased neocloud adoption leads to downward pressure on API pricing for models like DeepSeek and Qwen over the next 12 months. The platform represents a potentially significant step towards a more efficient and cost-effective AI infrastructure ecosystem.
The launch of InferenceSense signals a growing focus on maximizing the utilization of existing AI infrastructure. As the demand for AI continues to rise, solutions that can unlock untapped capacity and lower costs will be crucial for fostering innovation and accessibility. The coming months will be key to observing how neocloud providers embrace this new approach and its impact on the broader AI landscape.
What are your thoughts on the potential of platforms like InferenceSense to reshape the AI infrastructure market? Share your insights in the comments below.