
Executive Editor Carl Franzen wrote in VentureBeat this week: “…a new paper released by Google Research suggests that we may have been overthinking it. The researchers found that simply repeating the input query—literally copying and pasting the prompt so it appears twice—consistently improves performance across major models including Gemini, GPT-4o, Claude, and DeepSeek.
“The paper, titled ‘Prompt Repetition Improves Non-Reasoning LLMs,’ released last month just before the holidays, presents a finding that is almost suspiciously simple: for tasks that don’t require complex reasoning steps, stating the prompt twice yields significantly better results than stating it once.
Even better, because of how transformer architecture works, this ‘one weird trick’ comes with virtually zero penalty in terms of generation speed.”
So now we can double the length of every prompt we send to an LLM to get better results. Somehow, that doesn’t seem like an efficiency improvement.
Moreover, isn’t Google Research’s finding on the face of it a further indictment of brute force transformer architecture?
New hybrid AI designs allow executives to adopt a smarter, human-focused approach to AI agents.
This post explores the limitations of Large Language Models (LLMs) and how Graph RAG complements LLMs and reduces business risks in enterprise agentic AI implementations. Graph RAG succeeds by taking the heavy lifting away from statistical AI, which often struggles when forced to handle business tasks that require facts, logic and human judgment together rather than supposition.
In a graph RAG architecture, the LLM acts as a front end to interpret language. Meanwhile, the back end uses a human-in-the-loop, agent-assisted graph model to manage facts, business rules, and decisions. This approach creates a clear, richly connected, disambiguated data environment that both humans and AI agents can use and manage.
The result is a trustworthy, lower risk AI system built specifically for business needs.
GraphRAG business AI workflow
Graph RAG users can ask questions in natural language and receive logically validated answers. Semantic metadata stored and managed in a graph database keeps data and knowledge disambiguated and rules-governed. See Andreas Blumauer’s Talk to Your Graph webinar for examples of how Graph RAG works in practice. This diagram outlines the main techniques and benefits of the approach.

Familiar shortfalls of transformer architecture
The shortfalls of conventional LLMs many users are familiar with haven’t gone away. Here’s a short refresher on what the fundamental issues with transformer (i.e., parallel processing, self-supervised) LLMs have been.
- Built-in processing inefficiency
Every time you double the length of the text you give the model, the work done by the attention mechanism roughly quadruples. Because of this quadratic problem, processing long books or complex legal documents becomes expensive. Processing slows as volume increases.
- Memory (context window) limits
Every transformer has a limit on how much information it can see at one time. Once you go past that limit, the model effectively forgets the beginning of the conversation.
- Massive energy consumption
Because these models are so complex, training them requires enough electricity to power thousands of homes for months. This makes them environmentally unfriendly and very expensive for anyone but the hyperscalers to build.
- A prediction rather than a thinking machine
A transformer doesn’t actually know facts, and it is a guessing machine rather than a reasoning machine. It predicts the most likely next word based on patterns.
- Never enough data
To work well, a transformer needs to read almost everything on the internet. While humans can learn what a “dog” is from seeing two or three examples, a transformer needs to see thousands of pictures or descriptions of dogs to understand the concept reliably.
- A black box model
Even the scientists who build these models don’t fully understand why they make certain decisions. Because there are billions of internal connections, it is very hard to diagnose and fix a specific error or ensure the model is always being fair and unbiased.
- Relative model stasis
Once a transformer has finished training, the model is frozen in time. It doesn’t learn from its mistakes or update itself with today’s news unless it goes through a long and expensive retraining process.
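To make the quadratic claim in the first item concrete, here is a minimal Python sketch. It assumes plain full self-attention, in which every token is scored against every other token. The function name `attention_pair_count` is hypothetical, and the count ignores everything else a real model does.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of token-to-token scores in full self-attention (assumed model)."""
    # Every token is compared with every token, so the count is n * n.
    return num_tokens * num_tokens


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(f"{n} tokens: {attention_pair_count(n)} pairwise scores")
```

Each doubling of the input length multiplies the pair count by four, which is the growth pattern the first item describes.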
How providers have addressed these shortfalls to date
Major LLM providers such as OpenAI, Google, and Anthropic have not abandoned the transformer architecture. Instead, they have upgraded it using a set of engineering workarounds.
In this sense, LLM providers perpetuate the old architecture, but have incrementally improved performance, at least over the short term.
1. Solving the quadratic speed problem
To stop models from slowing down as text gets longer, providers use FlashAttention and Sparse Attention.
- How it works: Instead of every word looking at every other word (which is slow), the model uses math tricks to ignore unimportant connections or process data in blocks.
- The Result: This is why you can now paste 100-page PDFs into Claude or GPT-4o and get a response in seconds rather than minutes.
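The "process data in blocks" idea can be sketched in a few lines of Python. This is not the FlashAttention kernel itself. It is a simplified illustration, for a single query and scalar values, of the running-softmax rescaling that lets attention be accumulated one block at a time while giving the same answer as the all-at-once computation. The function names are hypothetical, and both functions assume a non-empty list of scores.

```python
import math


def attention_full(scores, values):
    """Softmax-weighted average computed over all scores at once."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_blockwise(scores, values, block_size):
    """Same result, visiting the scores one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    running_out = 0.0  # unnormalized weighted sum of values seen so far
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale what was accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


if __name__ == "__main__":
    scores = [0.5, 2.0, -1.0, 3.5, 0.0]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(attention_full(scores, values))
    print(attention_blockwise(scores, values, block_size=2))
```

The two functions agree up to floating-point rounding. The blockwise version never needs the full list of weights in memory at once, which is the property that block-based attention implementations exploit.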
2. Expanding context windows
Providers have moved from context limits of roughly 4,000 tokens to 100,000 or even 1 million tokens (in the case of Google’s Gemini). They have done this using techniques such as RoPE scaling and KV caching.
- How the scaling and caching work: LLM providers stretch the model’s sense of position so it doesn’t get confused by positions far beyond those it saw in training. They also cache the intermediate results for the parts of the conversation the model has already read, so it doesn’t have to re-read the whole prompt every time you ask a follow-up question.
- The Result: More extensive conversations, though models still occasionally forget details buried in the center of a long prompt.
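The caching half of this can be illustrated with a toy sketch. The class below is hypothetical and stores placeholder tuples instead of real key/value tensors. It only shows the bookkeeping idea: when a follow-up message extends a conversation that has already been processed, only the new tokens need to be computed.

```python
class ToyKVCache:
    """Toy stand-in for a key/value cache (hypothetical, illustration only)."""

    def __init__(self):
        self.tokens = []    # tokens already processed
        self.entries = []   # placeholders for cached key/value tensors
        self.processed = 0  # total tokens actually computed so far

    def _compute_entry(self, token, position):
        # A real model would run the token through every layer here.
        self.processed += 1
        return (position, token)

    def extend(self, conversation_tokens):
        """Process a conversation and return how many tokens were newly computed."""
        cached = len(self.tokens)
        # Reuse the cache only if the new input starts with what is cached.
        if conversation_tokens[:cached] != self.tokens:
            self.tokens, self.entries = [], []
            cached = 0
        for position in range(cached, len(conversation_tokens)):
            token = conversation_tokens[position]
            self.entries.append(self._compute_entry(token, position))
            self.tokens.append(token)
        return len(conversation_tokens) - cached


if __name__ == "__main__":
    cache = ToyKVCache()
    first_turn = ["What", "is", "Graph", "RAG", "?"]
    follow_up = first_turn + ["Give", "an", "example", "."]
    print(cache.extend(first_turn))  # all tokens of the first turn
    print(cache.extend(follow_up))   # only the tokens added by the follow-up
```

If the conversation prefix changes, the sketch discards the cache and starts over, a simplification of how real serving systems handle cache misses.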
3. Using RAG and database queries and retrievals for reliability
Because LLMs on their own tend to hallucinate unpredictably, companies use Retrieval-Augmented Generation (RAG).
- How it works: Instead of asking the model to rely on its memory, the system first searches a trusted database (like the web or your files), finds the right information, and feeds it to the model as an open book to read from.
- The Result: The model acts more like a researcher in a library and less like a person guessing from memory.
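The retrieve-then-read flow described above can be sketched as follows. Everything here is an assumption made for illustration: the two documents are invented, the keyword-overlap scoring stands in for a real vector or database search, and the final prompt would be sent to whichever LLM the system uses.

```python
import re

# Hypothetical trusted documents, invented for this example.
DOCUMENTS = {
    "refund-policy": "Refunds are issued within 14 days of a returned order.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}


def tokenize(text):
    """Lowercase a text and split it into a set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Return the ids of the documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(question_words & tokenize(item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_k]]


def build_prompt(question, documents, top_k=1):
    """Put the retrieved text in front of the question as an open book."""
    doc_ids = retrieve(question, documents, top_k)
    context = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("How many days do refunds take?", DOCUMENTS))
```

The model never has to recall the refund policy from memory. It only has to read the passage the retrieval step placed in its prompt.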
How semantic graph RAG boosts RAG’s potential
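The fact that repeating the prompt improves performance is perhaps the ultimate evidence that LLM transformers are statistical engines, not logical thinkers. A causal transformer reads the prompt left to right, and each token can draw only on the tokens before it, so an instruction is processed before the model has seen the rest of the input. Repeating the prompt gives the second copy a view of the whole first copy, which is why a purely mechanical change improves results.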
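One way to picture the repetition effect is a small Python sketch. It assumes a causal, left-to-right attention pattern in which each token can see only itself and the tokens before it. The function name and the sample prompt are hypothetical, and the sketch models only visibility, not how a real model uses it.

```python
def visible_fraction(prompt_tokens, repeats):
    """For each token in the last copy of the prompt, return the share of
    distinct prompt tokens it can see under causal attention."""
    sequence = prompt_tokens * repeats
    distinct = set(prompt_tokens)
    last_copy_start = len(prompt_tokens) * (repeats - 1)
    fractions = []
    for position in range(last_copy_start, len(sequence)):
        # Causal attention: a token sees itself and everything before it.
        seen = set(sequence[:position + 1])
        fractions.append(len(seen) / len(distinct))
    return fractions


if __name__ == "__main__":
    prompt = ["Answer", "briefly", ":", "capital", "of", "France", "?"]
    print(visible_fraction(prompt, repeats=1))
    print(visible_fraction(prompt, repeats=2))
```

With a single copy, the early tokens see only a fraction of the prompt. With two copies, every token in the second copy sees the entire prompt.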
Relying on such a system to manage complex business data is like hiring a talented but distracted assistant who needs everything repeated twice to stay on task. It is the major flaw of an architecture that prioritizes pattern matching over actual comprehension, leading to an efficiency drain that no enterprise should settle for.
Instead of doubling down on these statistical glitches, the path forward lies in the hybrid Graph RAG model. By relying on a structured knowledge graph and the deterministic logic of inferred connections, rules, and descriptions, we stop asking the transformer to be all things to all people.
This architecture uses the LLM for what it does best—interpreting human language—while leaving the critical management and use of facts and rules to a system that doesn’t need to be told things twice. The result is a more elegant, efficient, and trustworthy AI that serves the business rather than just mimicking it.
While vector-only RAG is a basic tool for finding similar-sounding text snippets, semantic graph RAG is a sophisticated system that understands the meaningful connections between facts.
| Feature | Vector-only RAG | Semantic graph RAG |
| --- | --- | --- |
| Logic | “Find things that sound similar.” | “Follow links between known facts.” |
| Data type | Unstructured text snippets. | Entities and defined relationships. |
| Accuracy | Good for general FAQ. | Best for complex, multi-step business logic. |
| Discovery | Limited to what’s in the snippet. | Can discover hidden patterns across data. |
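The "follow links between known facts" row can be sketched in Python. The triples below are invented for illustration, and the breadth-first search is a deliberately simple stand-in for the query and inference capabilities of a real graph database. The point is only that a multi-step answer comes from explicit, inspectable links, not from text similarity.

```python
from collections import deque

# Hypothetical facts stored as (subject, relationship, object) triples.
TRIPLES = [
    ("Acme GmbH", "subsidiary_of", "Acme Holdings"),
    ("Acme Holdings", "headquartered_in", "Vienna"),
    ("Vienna", "located_in", "Austria"),
    ("Austria", "member_of", "EU"),
]


def build_index(triples):
    """Map each subject to its outgoing (relationship, object) links."""
    index = {}
    for subject, relationship, obj in triples:
        index.setdefault(subject, []).append((relationship, obj))
    return index


def find_path(index, start, goal, max_hops=4):
    """Return the chain of facts linking start to goal, or None if there is none."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue
        for relationship, obj in index.get(node, []):
            if obj not in visited:
                visited.add(obj)
                queue.append((obj, path + [(node, relationship, obj)]))
    return None


if __name__ == "__main__":
    index = build_index(TRIPLES)
    for fact in find_path(index, "Acme GmbH", "EU"):
        print(fact)
```

The returned chain of facts can be handed to the LLM as context, and each step can be audited by a person, which a list of similar-sounding snippets does not allow.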
For further information:
1. On transformer limitations & scaling
Source: Training Compute-Optimal Large Language Models (The “Chinchilla” Paper)
Abstract: This paper investigates the relationship between model size and the amount of training data. The researchers discovered that most modern LLMs (like GPT-3) were actually under-trained because they lacked enough high-quality data relative to their size. It highlights the data plateau concern: we are running out of human-generated text to feed these models. It provides the mathematical scaling laws that dictate how much power and data are needed to see performance gains, validating the critique of massive resource demands.
- Citation: Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556.
- URL: https://arxiv.org/abs/2203.15556
2. On reasoning & neurosymbolic AI
Source: Deep Learning is Hitting a Wall. What’s Next?
Abstract: Gary Marcus, a leading cognitive scientist, argues that simply adding more data to transformers will never lead to true understanding. He advocates for Neurosymbolic AI—merging the pattern recognition of deep learning with the rule-based logic of classical AI. This source focuses on why LLMs struggle with compositional tasks (like complex math or logic) that require structured, step-by-step thinking rather than just predicting the next word.
- Citation: Marcus, G. (2022). Deep Learning is Hitting a Wall. Nautilus.
- URL: https://nautil.us/deep-learning-is-hitting-a-wall-238440/
3. On Spiking Neural Networks (SNNs)
Source: Training Spiking Neural Networks Using Lessons From Deep Learning
Abstract: This research explores the potential of neuromorphic computing to solve the energy crisis in AI. Unlike standard transformers that require constant floating-point math operations, SNNs communicate via discrete signals (spikes). This paper outlines how these models can achieve near-zero power consumption when idle. It also addresses the training gap—the difficulty of getting these brain-like models to perform as well as transformers on language tasks.
- Citation: Eshraghian, J. K., et al. (2023). Training Spiking Neural Networks Using Lessons From Deep Learning. Proceedings of the IEEE.
- URL: https://arxiv.org/abs/2109.12894





