Rethinking Product Search: Why Smaller Models and Better Context Win
research
Oct 17, 2025
Introduction
At Onton, we recently ran a large-scale experiment to understand how well modern text-embedding models handle the messy, real-world language of e-commerce. The goal was simple: when a shopper types “cheap couch” or “modern walnut lamp with brass base,” how do we ensure the right products actually appear first?
Method
We curated a dataset of about a thousand products spanning twenty home goods brands and dozens of categories, including lamps, chairs, paintings, tables, beds, and more. Each product came with its official description and image. To simulate realistic search behavior, we generated one hundred queries based on an analysis of real-world user queries: a blend of highly specific, long-tail phrases and a handful of broader, intent-driven ones. After embedding both queries and product descriptions with various models, we measured how closely they aligned, then manually validated whether the top results truly matched the search intent.
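To make the evaluation setup concrete, here is a minimal sketch of the embed-and-rank loop, assuming the sentence-transformers library and the public E5-small checkpoint; the product texts and query below are illustrative placeholders rather than items from our dataset.

```python
# Minimal sketch of the embed-and-rank loop (illustrative, not our production pipeline).
# E5 models expect "query:" / "passage:" prefixes on the input text.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-small-v2")

# Hypothetical stand-ins for the real catalog and query set.
products = [
    "passage: Walnut side table with brass legs and a 45 cm round top.",
    "passage: Linen three-seat sofa in light grey with oak feet.",
]
query = "query: modern walnut lamp with brass base"

product_emb = model.encode(products, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# Cosine similarity between the query and every product description.
scores = util.cos_sim(query_emb, product_emb)[0]
ranked = sorted(zip(products, scores.tolist()), key=lambda x: x[1], reverse=True)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```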
Results
What we found was both fascinating and frustrating. Many product descriptions, especially those scraped from online catalogs, are heavily SEO-optimized but semantically hollow. A supposedly “modern outdoor table” might read more like a lifestyle ad than an actual description, emphasizing warm summer evenings with friends, while never once mentioning the material, shape, or color of the table. The text was rich in adjectives, but poor in meaning.
This disconnect revealed an essential truth: in e-commerce, product descriptions are written for search-ranking algorithms, not necessarily for accuracy. Product images, by contrast, hold a wealth of factual and structural information (color, shape, materials, patterns, proportions) that text alone rarely conveys. To bridge that gap, we experimented with what we call augmented descriptions. Using GPT-4o, we generated short, factual product descriptions derived from images (describing shape, texture, and key visual cues) and merged them with the original text, pruning away redundant marketing language. We tested several prompt formulations, and the best results came from those that explicitly instructed the model to “describe the physical object and omit lifestyle language.” This approach dramatically improved retrieval quality and underscored the importance of metadata enrichment: not just retrieval-augmented generation as usually practiced, but what we’ve started thinking of as contextual RAG, expanding an original document with richer, orthogonal context rather than simply appending snippets of related text.
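As a rough illustration of this augmentation step, the sketch below uses the OpenAI Python client to describe a product image and prepend the result to the catalog text; the model name, prompt wording, and helper function are simplified assumptions, not our exact prompts or pipeline.

```python
# Sketch of image-grounded description augmentation (assumes the openai Python client;
# the prompt text and helper name are illustrative placeholders).
from openai import OpenAI

client = OpenAI()

def augment_description(image_url: str, original_text: str) -> str:
    """Generate a short factual description from the product image and
    prepend it to the original catalog text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Describe the physical object in this image: shape, "
                            "materials, color, and proportions. Omit lifestyle language."
                        ),
                    },
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        max_tokens=120,
    )
    visual_description = response.choices[0].message.content
    # Merge: factual visual cues first, then the original (pruned) catalog text.
    return f"{visual_description}\n{original_text}"
```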
When we compared embedding models on this dataset, the results told a subtle story. The top performer overall was E5-base, with near-perfect separation between relevant and irrelevant product pairs (AUC ≈ 0.985, Cohen’s d ≈ 3.21). Yet its smaller sibling, E5-small, performed almost identically, achieving an AUC of 0.981 and Cohen’s d of 3.01, while being dramatically faster and lighter. In practice, this means that E5-small retrieves roughly 98–99 percent of the same relevant items as E5-base, but with a fraction of the compute cost.
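For readers who want to reproduce these separation metrics, both can be computed from the cosine similarities of labeled relevant and irrelevant query-product pairs. The sketch below assumes NumPy and scikit-learn; the similarity values are made-up placeholders.

```python
# How the separation metrics are computed from labeled pairs (placeholder values).
import numpy as np
from sklearn.metrics import roc_auc_score

# Cosine similarities for manually labeled pairs (hypothetical numbers).
relevant = np.array([0.88, 0.91, 0.84, 0.90])      # query matches product
irrelevant = np.array([0.41, 0.35, 0.48, 0.39])    # query does not match

scores = np.concatenate([relevant, irrelevant])
labels = np.concatenate([np.ones(len(relevant)), np.zeros(len(irrelevant))])

# ROC AUC: probability that a relevant pair scores above an irrelevant one.
auc = roc_auc_score(labels, scores)

# Cohen's d: mean gap between the two groups in pooled-standard-deviation units.
pooled_std = np.sqrt((relevant.var(ddof=1) + irrelevant.var(ddof=1)) / 2)
cohens_d = (relevant.mean() - irrelevant.mean()) / pooled_std

print(f"AUC = {auc:.3f}, Cohen's d = {cohens_d:.2f}")
```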
Interestingly, many higher-dimensional models, including some that boast superior benchmarks elsewhere, struggled on our dataset. We suspect they overfit to the “noise” in SEO-heavy descriptions, learning to emphasize sentiment and tone rather than objective features. Lower-dimensional models like E5-small, by contrast, seemed to act as natural regularizers: their compact representation spaces forced them to focus on the most stable semantic signals. In e-commerce, that restraint turns out to be an asset. E5-small’s success shouldn’t be surprising. Built on a contrastive pretraining objective that directly optimizes for sentence-level similarity, it’s inherently tuned for tasks like search and retrieval. Its 384-dimensional embeddings strike a balance between expressiveness and efficiency, aligning well with the intrinsic complexity of product attributes. While E5-base remains the gold standard for ultimate recall, E5-small’s agility makes it an ideal production workhorse, especially when latency and cost matter.
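If you want to see the size and latency trade-off for yourself, a quick comparison like the one below encodes the same batch of descriptions with both checkpoints; it assumes the intfloat E5 models on the Hugging Face Hub, and actual timings will depend on your hardware.

```python
# Quick footprint/latency comparison sketch (timings are hardware-dependent).
import time
from sentence_transformers import SentenceTransformer

docs = ["passage: Oak dining table with tapered legs."] * 256

for name in ["intfloat/e5-small-v2", "intfloat/e5-base-v2"]:
    model = SentenceTransformer(name)
    start = time.perf_counter()
    emb = model.encode(docs, batch_size=64, normalize_embeddings=True)
    elapsed = time.perf_counter() - start
    # e5-small produces 384-dim vectors, e5-base 768-dim.
    print(f"{name}: dim={emb.shape[1]}, {elapsed:.2f}s for {len(docs)} docs")
```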
Conclusion
The broader lesson is that good metadata matters as much as good models. Even the most sophisticated embeddings are only as strong as the text they’re given. The key is not just retrieval-augmented generation as we usually think of it, but contextual retrieval. Our findings suggest that in e-commerce, better search doesn’t come from larger models or higher-dimensional vectors, but from clearer context. Sometimes the simplest step (like actually describing what a lamp looks like) does more for accuracy than doubling model size or fine-tuning embedding models. The future of product search will belong not only to those who collect the most data, but to those who give their models the clearest view of the world they’re meant to understand.
Research by: Aditri Bhagirath