RYS-XLarge: Model Comparison
⚖️ Comparison · Mar 10, 2026 · 7 min read


RYS-XLarge vs Competitors: Which Should You Choose?

RYS-XLarge (a clever layer-duplication technique applied to a 72B model) is best for researchers and tinkerers exploring LLM internals on consumer hardware, while frontier models like Qwen2-72B, Llama-3.1-70B, and Claude-3.5-Sonnet still dominate for practical production use. This article compares the “no-weights-changed” RYS-XLarge that topped the mid-2024 Hugging Face Open LLM Leaderboard with its predecessor base model and the leading alternatives, focusing on the questions that matter most: whether the upgrade is worth it, how it stacks up against the competition, price/performance, and migration effort.

Overview

In March 2026, independent researcher David Noel Ng published “LLM Neuroanatomy,” detailing how he reached #1 on the Hugging Face Open LLM Leaderboard without training, merging, or modifying any weights. By identifying a block of seven middle layers in a 72-billion-parameter model that appear responsible for abstract reasoning, he duplicated those layers and stitched them back into the model. The resulting RYS-XLarge model outperformed thousands of fine-tuned and merged competitors on the six-benchmark suite (IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-PRO).

This technique is fundamentally different from conventional scaling: it leverages an internal “neuroanatomy” insight that early layers act as input translators, late layers as output translators, and middle layers perform pure abstract reasoning in a language-agnostic representation. The discovery originated from two observations: (1) LLMs can perform complex reasoning when both input and output are in Base64, and (2) the unusual layer-alternating architecture of the Goliath-120B merge.
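The surgery itself is mechanically simple. As a rough illustration (this is not Ng's actual code; the 80-layer stack, the layer names, and the start index are assumptions for the sketch), duplicating a middle block amounts to slicing and re-concatenating the model's ordered layer list:

```python
# Hypothetical sketch of the layer-duplication ("stitching") step.
# Real decoder models keep their blocks in an ordered list (e.g. the
# `model.model.layers` ModuleList in Hugging Face transformers); plain
# strings stand in for layer modules here.

def duplicate_middle_layers(layers, start, count):
    """Return a new layer list with layers[start:start+count] repeated in place."""
    block = layers[start:start + count]
    return layers[:start + count] + block + layers[start + count:]

# A toy 80-layer stack, typical of a 72B-class decoder (count assumed).
layers = [f"layer_{i}" for i in range(80)]

# Duplicate a 7-layer "reasoning block" in the middle.
# The exact indices are not given in the article; start=40 is illustrative.
stitched = duplicate_middle_layers(layers, start=40, count=7)

print(len(stitched))                # 87 layers after surgery
print(stitched[40], stitched[47])  # the block now appears twice, back to back
```

No gradient steps are involved: the duplicated block reuses the original weights verbatim, so the only change is the forward pass running those layers twice.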

Feature Comparison Table

| Model | Context Window | Price (input/output per M tokens) | Standout Capability | Best For |
|---|---|---|---|---|
| RYS-XLarge (72B base + 7 duplicated middle layers) | Check latest | Free (open weights) | Layer duplication for enhanced middle-layer reasoning without any training | Research, mechanistic interpretability, running on 2× gaming GPUs |
| Base 72B predecessor | Check latest | Free (open weights) | Standard transformer performance | General open-source use |
| Qwen2-72B-Instruct | 128K | Check latest official pricing | Strongest overall benchmark scores in the 2024 leaderboard era | High-performance open-source tasks |
| Llama-3.1-70B | 128K | Free (open weights) / hosted pricing varies | Excellent reasoning and tool-use balance | Production open-source deployments |
| Claude-3.5-Sonnet (Anthropic) | 200K | $3 / $15 | Superior real-world reasoning and safety | Enterprise applications, complex tasks |

Detailed Analysis

Worth upgrading from the predecessor?
The improvement is meaningful but highly specialized. Ng started with an existing 72B model and duplicated seven specific middle layers that his custom “brain scanner” (a Transformer interpretability tool) identified as the core reasoning block. No weights were changed and no gradient steps were taken. The resulting RYS-XLarge model climbed to the top of the Hugging Face Open LLM Leaderboard at a time when the leaderboard featured intense competition from fine-tuned models with names like Nous-Hermes, Dolphin, and NeuralBeagle14-7B.

For users already running the base 72B model, the upgrade is essentially free in terms of compute cost (just a one-time architectural edit) and delivers measurable gains on the exact six benchmarks used by the leaderboard. However, the gain is narrow: it primarily boosts performance on abstract reasoning tasks that benefit from deeper middle-layer computation. It is not a general intelligence leap and may not translate to better chat, coding, or creative writing performance outside those benchmarks.

vs the competition
RYS-XLarge’s achievement is impressive given the constraint of zero weight modification and running on only two gaming GPUs. At the time, the leaderboard was the “Colosseum” for open-weight models. Qwen2-72B-Instruct ultimately set a high bar with an average score of 43.02 on the revised, harder benchmarks. Llama-3.1-70B offered a strong balance of reasoning and efficiency. Closed models like Claude-3.5-Sonnet remained superior for real-world reliability, safety, and long-context understanding.

The RYS technique demonstrates that architectural surgery on middle layers can outperform many heavily fine-tuned models, validating the hypothesis that middle layers perform language-agnostic reasoning. Yet it still trails the very best fine-tuned or natively trained 70–72B models on the full spectrum of capabilities. The method is more proof-of-concept for mechanistic interpretability than a new state-of-the-art architecture.

Price/performance verdict
RYS-XLarge offers exceptional price/performance for anyone who can run a 72B-class model locally. Because no new training is required and weights are simply duplicated and re-stitched, the only cost is the modest increase in inference compute from the extra seven layers. On two gaming GPUs it remains practical, making it one of the most cost-effective ways to push benchmark scores in a research setting.
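As a back-of-the-envelope check on that inference overhead (the article does not state the base model's layer count; 80 is a typical figure for 72B-class decoders and is assumed here), seven extra layers add under ten percent to per-token decoder compute:

```python
# Rough inference-overhead estimate for the stitched model.
# Assumption: an 80-layer base stack (not stated in the article).
base_layers = 80
extra_layers = 7

# Extra decoder passes per token, relative to the unmodified model.
overhead = extra_layers / base_layers
print(f"Extra per-token compute: {overhead:.1%}")  # roughly 8.8% more FLOPs
```

That single-digit-percent slowdown is the entire ongoing cost of the technique, which is why it remains practical on two gaming GPUs.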

In contrast, training or heavily fine-tuning a 72B model requires substantial cloud spend. Hosted API options (Qwen, Llama via providers, Claude) carry per-token pricing that quickly exceeds the one-time engineering effort of the RYS method for heavy users. For pure research or experimentation on consumer hardware, RYS-XLarge is hard to beat on price/performance. For production workloads where reliability and ecosystem support matter more, the extra cost of frontier models or well-supported open models is usually justified.

Migration effort
Switching to RYS-XLarge from its 72B predecessor is relatively low effort: it requires implementing the layer-duplication script (the blog post implies the procedure without publishing ready-made code), validating that the seven identified middle layers are duplicated correctly, and reloading the modified architecture. No retraining or quantization changes are needed beyond what the base model already uses.
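A sanity check for the "validate the duplication" step might compare the stitched layer order against the expected slice-and-repeat pattern. This is an illustrative stdlib sketch (integers stand in for per-layer weight tensors; the start index and count are assumptions, not values from the post):

```python
def is_valid_duplication(orig, stitched, start, count):
    """True iff `stitched` equals `orig` with orig[start:start+count] repeated once."""
    expected = orig[:start + count] + orig[start:start + count] + orig[start + count:]
    return stitched == expected

orig = list(range(80))                      # stand-ins for per-layer weight blobs
good = orig[:47] + orig[40:47] + orig[47:]  # block duplicated in place
bad = orig + orig[40:47]                    # same layers, but appended at the end

print(is_valid_duplication(orig, good, 40, 7))  # True
print(is_valid_duplication(orig, bad, 40, 7))   # False
```

In a real migration the comparison would be over weight tensors (e.g. `torch.equal` on each pair), but the positional check is the part that catches stitching mistakes.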

Migrating from a different competitor (e.g., Qwen2-72B or Llama-3.1-70B) is more involved. Users must port prompts, adjust for any behavioral differences caused by the duplicated reasoning layers, and re-test downstream applications. The biggest migration friction is that RYS-XLarge is currently a single research artifact rather than a polished, maintained family of models with multiple sizes and instruction-tuned variants. Expect additional engineering to integrate it into existing serving stacks like vLLM or Hugging Face Text Generation Inference.

Use Case Recommendations

Best for researchers and interpretability enthusiasts
If your goal is understanding Transformer internals or experimenting with “LLM Neuroanatomy,” RYS-XLarge is a must-try. The technique opens a new avenue for mechanistic interpretability without needing massive training clusters.

Best for startups
Startups focused on cost efficiency and local deployment should evaluate RYS-XLarge if they already have the infrastructure to run 70B+ models on-prem. The performance-per-GPU advantage on two gaming cards can meaningfully reduce inference costs compared to running larger or more heavily quantized alternatives.

Best for enterprise
Enterprise users needing reliability, safety, long context, and vendor support should stick with Claude-3.5-Sonnet or well-supported open models like Llama-3.1-70B. The RYS approach, while clever, remains experimental and lacks the ecosystem, documentation, and risk mitigation that enterprises require.

Best for hobbyists with gaming GPUs
Anyone with two high-end consumer GPUs should experiment with RYS-XLarge. Achieving leaderboard-topping performance without training is a remarkable demonstration of what can be done in a basement.

Verdict

RYS-XLarge is a fascinating and worthwhile experiment that proves the value of studying LLM “neuroanatomy.” It is a must-upgrade for researchers and anyone deeply interested in how Transformers reason internally. For most practical applications, it remains a wait-and-see (or skip) choice until the technique is scaled, generalized, and packaged into an easy-to-use model family.

The real value of this work may not be the leaderboard position itself but the insight that middle layers can be surgically enhanced. It suggests a future where model improvement comes as much from architectural understanding as from more compute and data. For now, the technique delivers impressive benchmark wins on a budget, but frontier closed and open models still provide broader capability and easier deployment.

Sources


All technical specifications, pricing, and benchmark data in this article are sourced directly from official announcements. Competitor comparisons use publicly available data at time of publication. We update our coverage as new information becomes available.

Original Source

dnhkng.github.io
