IndexCache: New Technique Cuts LLM Compute Costs by 75% for Long Contexts
Researchers at Tsinghua University and Z.ai have unveiled IndexCache, a novel sparse attention technique that accelerates inference for large language models (LLMs) handling long context windows, delivering compute cost reductions of up to 75%.
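The excerpt does not describe IndexCache's actual algorithm, but the general idea behind sparse attention for long contexts is that each new token scores every cached key and then attends to only a small top-k subset, so per-token attention cost scales with k rather than the full context length. The following minimal NumPy sketch illustrates that generic pattern only; it is not IndexCache itself, and `topk_sparse_attention`, its parameters, and the example sizes are all illustrative assumptions:

```python
# Generic top-k sparse attention over a cached KV store -- an illustration of
# the broad technique, NOT the IndexCache algorithm (which this article does
# not detail). Attending to k << n cached entries cuts per-token attention
# cost from O(n) to roughly O(k) plus the cost of scoring.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def topk_sparse_attention(q, K, V, k=8):
    """Attend a single query to only its k highest-scoring cached keys.

    q: (d,) query vector; K: (n, d) cached keys; V: (n, d) cached values.
    """
    scores = K @ q / np.sqrt(q.shape[-1])   # (n,) scaled similarity to every key
    idx = np.argpartition(scores, -k)[-k:]  # indices of the top-k keys (unsorted)
    weights = softmax(scores[idx])          # normalize over the sparse set only
    return weights @ V[idx]                 # (d,) attention output from k entries

# Usage: a 4096-token cache, but the new token attends to just 8 entries.
rng = np.random.default_rng(0)
d, n = 64, 4096
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
out = topk_sparse_attention(q, K, V, k=8)
print(out.shape)  # (64,)
```

In practice, the expensive part of this pattern is selecting the top-k entries; production systems replace the exhaustive scoring above with an index or clustering structure over the KV cache, which is presumably where a technique like IndexCache would differ from this naive sketch.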