
The Hidden Parameter That Cut Our LLM Response Times by 68%

  • Tech Team
  • Nov 6
  • 7 min read

TL;DR: We wanted to understand what actually matters when deploying LLMs in production beyond marketing claims.


We benchmarked GPT-OSS 120B across Cerebras, Groq, Azure, and AWS in October 2025. Vendor/press claims show Cerebras attaining up to ~3,000 tokens/sec on wafer-scale inference (peak hardware), while independent single-request benchmarks put AWS Bedrock at ~230 tokens/sec for steady per-session throughput. Crucially, tuning Bedrock's `reasoning_effort` parameter materially cut end-to-end times in our runs (113s → 22s). Different metrics measure different things (vendor claims vs. independent benchmarks vs. aggregate throughput), and understanding which is which matters for production decisions.


This piece breaks down what we found, why variability and compliance matter, and how we approach infrastructure decisions at Promtior.



The Speed Demos You See Aren't the Whole Story

 

Earlier this year, demos in AI communities kept showing Groq and Cerebras smashing benchmarks, with time-to-first-token graphs that made other providers look slow. The narrative was compelling: faster = better. And who wouldn't want 3,000+ tokens per second?

 

We wanted to understand whether speed actually changes what you can build, or whether it's primarily good marketing. So we tested the same model (GPT-OSS 120B) across four providers (AWS Bedrock, Cerebras, Groq, Azure AI) with identical LangGraph agent architectures.

 

At Promtior, we believe in choosing the right tool for each job. We mix OpenAI, Anthropic, Qwen, Nova (reasoning models, fast models, big models, small models) across the same architecture. This test wasn't about crowning a winner; it was about understanding trade-offs.

 

What we found surprised us. Speed matters, but consistency, hidden parameters, and compliance ended up mattering more in practice.



The Setup: LangGraph ReAct Agents

 

We implemented ReAct (Reasoning + Acting) agents using LangGraph across all four providers with identical tool sets (math functions, knowledge base search, text utilities). We ran them asynchronously on the same queries to avoid sequential bias.
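For context, the fan-out looked roughly like the sketch below. It's a minimal harness, not our full benchmark code: `agents` is assumed to be a dict mapping provider names to compiled LangGraph agents, and the real runs also tracked streaming behavior, not just latency.

```python
import asyncio
import time

async def time_agent(name, agent, query):
    """Run one provider's agent on one query and record wall-clock latency."""
    start = time.perf_counter()
    await agent.ainvoke({"messages": [("user", query)]})
    return name, time.perf_counter() - start

async def benchmark(agents, queries):
    """Fan each query out to every provider at once to avoid sequential bias."""
    results = []
    for query in queries:
        runs = await asyncio.gather(
            *(time_agent(name, agent, query) for name, agent in agents.items())
        )
        results.append({name: latency for name, latency in runs})
    return results
```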

 


Why LangGraph for This Test?

 

We needed to swap providers without rewriting agent logic. LangGraph sits at the right abstraction level: lower than LangChain's high-level agents (which are fine for POCs but break when clients need custom routing or state management), but higher than raw API calls (which would mean manually managing state, tool routing, streaming events, and checkpointing).

 

In practice, swapping providers meant changing only the LLM initialization (`ChatCerebras` vs `ChatBedrockConverse`) while keeping the graph structure, tools, and state management identical. This let us isolate provider differences (speed, streaming, parameter support) without conflating them with architectural changes.
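A minimal sketch of what that looked like. The tool bodies here are stand-ins for our real tool set, and the model identifiers are our assumptions; they may differ by account or region.

```python
from langchain_core.tools import tool
from langchain_cerebras import ChatCerebras
from langchain_aws import ChatBedrockConverse
from langgraph.prebuilt import create_react_agent

@tool
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

tools = [add, word_count]  # the real set also included knowledge-base search

# The only line that changed between providers:
llm = ChatCerebras(model="gpt-oss-120b")
# llm = ChatBedrockConverse(model="openai.gpt-oss-120b-1:0")  # AWS Bedrock variant

agent = create_react_agent(llm, tools)
```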

 

The trade-off: we still hit provider-specific quirks (streaming behavior, hidden parameters like `reasoning_effort`), but debugging was easier because the agent logic stayed constant. LangChain 1.0 standardized tool calling formats, which made the multi-provider testing significantly smoother than it would have been previously.

 

We measured:

1. Latency consistency across multiple runs

2. Streaming support and behavior

3. Production gotchas that don't show up in demos

 

 

 

The Numbers: Speed, Cost, and Variability

 

Important: Different metrics measure different things. Vendor/press numbers are peak hardware claims; independent benchmarks are per-request steady-state measurements; aggregate metrics are cluster-level throughput. We've labeled each accordingly.

 

Here's what we measured in October 2025, combining our tests with independent benchmarks from Artificial Analysis:

 

| Provider | Input (USD/M tokens) | Output (USD/M tokens) | Tokens/Second | Metric Type | Streaming? |
| --- | --- | --- | --- | --- | --- |
| Cerebras | $0.35 | $0.75 | ~3,429 | Vendor peak hardware claim (Cerebras) | ✅ Yes |
| Groq | $0.15 | $0.75 | ~477 | Provider-reported operational speed (Groq) | ✅ Yes |
| Azure | $0.15 | $0.60 | ~384 | Independent benchmark (per-request steady-state) | ✅ Yes |
| AWS | $0.15 | $0.60 | ~232 | Independent benchmark (per-request steady-state) | ❌ Limited* |

 

*At the time of our October 2025 testing, Bedrock's streaming support for GPT-OSS models was limited or not supported in some runtimes and SDKs (per AWS's documentation); behavior differs by API/region and is changing quickly.

 

Key findings:

- Cerebras reports peak hardware speeds up to ~3,000 tokens/sec (vendor measurements), though observed production throughput varied significantly in our tests: sometimes 3,000 tokens/second, sometimes 600

- Independent steady-state benchmarks show AWS Bedrock producing ~230 tokens/sec per request, a useful metric for single-session latency expectations

- AWS and Azure offered the most competitive pricing for input/output

- Groq balanced speed and cost well with more consistency than Cerebras

 


Higher bars indicate faster average speed. Error bars show observed variance from our testing (Cerebras & AWS only). Note: Cerebras numbers are vendor peak claims; AWS/Azure are independent per-request benchmarks.

 

 

 

What Actually Mattered: Consistency Over Raw Speed

 

In our testing, speed variance affected user experience more than we expected. Cerebras would occasionally drop from 3,000 tokens/second to 600, a 5x swing. For a chatbot demo, that's the difference between "wow, this is instant" and "why is it stuttering?"

 

AWS, on the other hand, delivered consistent 230 tokens/second performance. Predictable and reliable. In production environments, consistency often matters more than peak performance.

 

As one of our engineers noted: "I'd rather explain to a client why something takes 4 seconds consistently than explain why it took 1 second yesterday and 8 seconds today."

 

 

 


Streaming: Critical for Demos, Less Important for Multi-Agent Systems

 

For user-facing demos, streaming is essential. Seeing tokens arrive in real-time changes how people perceive responsiveness.

 

But in multi-agent systems, streaming matters less internally. Most production architectures today aren't single-model chatbots. They're orchestrations: Agent A calls Agent B, waits for a full response, then routes to Agent C. Streaming between agents isn't useful because you need the complete message to make routing decisions.

 

Streaming becomes important again at the final step when sending a response back to the user. If your architecture has 3-4 agent hops before that, the speed gains from streaming get diluted.
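One way to express this in LangGraph is the sketch below: intermediate hops run to completion, and only the tokens from the user-facing node get forwarded. The node name `respond` is hypothetical; substitute whatever your graph calls its final step.

```python
async def stream_final_answer(graph, user_input, final_node="respond"):
    """Yield only the tokens produced by the user-facing node of the graph."""
    async for token, metadata in graph.astream(
        {"messages": [("user", user_input)]},
        stream_mode="messages",
    ):
        # Upstream agents (routing, tool calls) finish silently; we only
        # forward tokens emitted by the last hop back to the client.
        if metadata.get("langgraph_node") == final_node:
            yield token.content
```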

 

In our experience, streaming is table stakes for demos but less critical for production agents.

 

 

 

The Hidden Parameter: AWS's `reasoning_effort`

 

At the time of our testing in October 2025, AWS Bedrock was migrating to their new Converse API. The new reasoning models (including GPT OSS 120B) support a parameter called `reasoning_effort`, popularized by GPT-5 but quietly introduced by labs like Qwen from Alibaba months earlier.

 

This parameter controls how much compute the model allocates to internal reasoning. Set it to `low`, and the model still gives you quality answers (this is a 120B-parameter MoE, not a small model); it just doesn't spend extra compute on extended chain-of-thought. Set it to `high`, and it writes verbose internal reasoning chains before answering.

 

Think of it like Claude Sonnet vs Claude Opus for a simple task. Sonnet gives you a great answer fast. Opus gives you a great answer with three paragraphs explaining its methodology first. Both are capable models, one just allocates more compute to thinking out loud.

 

The challenge: By default, most providers set this to `medium` or even `high`. Ask the model "what's 2+2?" and it might write six paragraphs of reasoning about the philosophy of math before answering. Too much reasoning can even be harmful, especially in smaller models.

 

The documentation gap: AWS's documentation barely mentioned this parameter. Their tutorials were silent. We found one line buried deep in the low-level API docs:

 

"You can also send extra params through `additional_model_request_fields`."

 

No examples. No guidance. No list of what parameters actually exist.

 

We had to brute-force test parameter names. We tried `thinking` (used by some Claude models), `reasoning_effort` (used by GPT-OSS 120B), and variations until we confirmed that the last one worked when passed like this:

 

additional_model_request_fields={"reasoning_effort": "low"}

 

The possible options are `low`, `medium`, and `high`. GPT-OSS 20B also supports `minimal`.
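Wired into the chat model, that translates to something like the sketch below (using `ChatBedrockConverse` from `langchain-aws`; the model ID and region are our assumptions and may differ for your account):

```python
from langchain_aws import ChatBedrockConverse

llm = ChatBedrockConverse(
    model="openai.gpt-oss-120b-1:0",   # Bedrock model ID; may vary by region/account
    region_name="us-east-1",
    # Pass-through for fields the Converse API doesn't expose directly:
    additional_model_request_fields={"reasoning_effort": "low"},
)

print(llm.invoke("What's 2+2?").content)  # short answer, no essay first
```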

 

The impact: In a project with several multi-agentic orchestrators and subagents, our initial tests were taking 113 seconds per response with the default `reasoning_effort` setting. After tuning `reasoning_effort` to match query complexity, we dropped to 22 seconds on average. Same model, same query, 68% faster.
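"Tuning to match query complexity" in practice meant choosing the effort level per subagent or per query before the call. The heuristic below is a simplified, hypothetical stand-in for what we did; the cues and thresholds are illustrative only.

```python
def pick_reasoning_effort(query: str) -> str:
    """Illustrative heuristic: only pay for extended reasoning when it helps."""
    reasoning_cues = ("compare", "plan", "trade-off", "step by step", "why")
    if any(cue in query.lower() for cue in reasoning_cues):
        return "high"      # genuine multi-step reasoning
    if len(query) < 80:
        return "low"       # lookups, rewrites, short factual questions
    return "medium"
```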


 


A single parameter change delivered 68% faster responses in our multi-agentic orchestrator system for queries that didn't need extended reasoning chains.

 

 

 

Compliance Considerations for Regulated Industries

 

Cerebras and Groq offered impressive speed and competitive pricing. However, at the time of our evaluation, we couldn't easily find prominent documentation about HIPAA compliance, SOC 2, or financial data certifications.

 

AWS and Azure, on the other hand, had extensive certification documentation readily available. For projects in healthcare or fintech, this documentation availability is often a requirement, not just a preference.

 

 

 

What We Learned About Provider Selection

 

Note: These are observations from our October 2025 testing. Every project has unique requirements.

 

Our testing revealed that different providers excelled in different areas, and the "right" choice depends heavily on the specific context:

 

Speed and performance varied significantly. Cerebras delivered the highest peak speeds but with notable variance. AWS and Azure provided more predictable performance. Groq balanced speed with consistency better than the extremes.

 

Compliance and documentation availability differed across providers. Established cloud platforms had extensive certification documentation readily accessible. Newer specialized providers focused more on performance benchmarks.

 

Ecosystem integration matters. Some providers fit naturally into existing cloud infrastructure and tooling. Others require more custom integration work but may offer performance advantages.

 

Cost structures varied beyond just per-token pricing. Factors like minimum commitments, egress fees, and operational overhead affect total cost of ownership.

 

Key considerations we found important:

  • Different use cases (demos, production agents, automations) have different priority hierarchies

  • Parameter tuning can significantly impact performance, but documentation varies in quality

  • Client ecosystem and existing infrastructure often constrain or guide provider selection

  • Security and compliance requirements can be non-negotiable filters

 

 

Why the Infrastructure Layer is Fragmenting

 

Five years ago, you picked OpenAI or Google, and that was it. Today, we're mixing providers within the same architecture. Small models for routing. Fast models for drafting. Reasoning models for decisions. Open-source models for cost.

 

We believe the successful providers will be the ones that compose well: easy APIs, clear docs, predictable behavior. That's what enables building production systems efficiently.

 

At Promtior, our approach is: test everything, assume nothing, and always leave room to swap providers. The landscape changes every quarter, and the best infrastructure choice today might not be the best one in six months.

 

 

 

Final Thoughts

 

Cerebras showed 15x higher average speed than AWS in our October 2025 testing. But it also showed more variance. AWS delivered more consistent performance. With proper parameter tuning, we could improve AWS response times significantly without sacrificing reliability.

 

The key takeaway: match your infrastructure to your specific requirements. Demos, healthcare applications, multi-agent systems, and automations each have different priorities around speed, consistency, compliance, and cost.

 

And when evaluating providers, check the documentation thoroughly. Important parameters might be documented in unexpected places, especially in API implementations this new, or set to defaults that don't match your use case.

 

 


Joaquin Bonifacino - AI Lab Lead, Promtior

 

Want to talk infrastructure, agent orchestration, or hidden AWS parameters?

 

 
