2025-01-14 17:38:49
Crowdsourcing subjective value is a great idea, but when no thought has been put into how to do it, the results are neither useful nor healthy.
In this article we discuss the problems with the classic five star rating system and why it needs to be abolished in favor of better expressions of public sentiment.
## Part 1: What's In A Star?
Let's start with a thought experiment:
- Do you buy products on Amazon that have 3 stars?
- Do you take rides from Uber drivers that have 3 stars?
- Do you download apps on the App Store that have 3 stars?
- Do you visit places on Google Maps that have 3 stars?
"Not on purpose," is probably your answer. But why not? After all, three out of five is more than half; it's 60 percent! That is solidly "above average" by any usual measure. The arithmetic holds up, but in practice nobody treats a 3 star product as being more than half as good as a 5 star product.
The funny thing about the 5 star rating system is that nobody really knows exactly what the stars mean, mostly because they aren't ever explained. They are just a weak abstraction of a score out of 5, which itself is an abstraction of a score out of 100, and it is used to average all ratings for a thing into an average star rating. Presumably, 5 is the best and 1 is the worst, and somewhere in between is... what exactly? Mediocre? Passable? Acceptable? Average? Who knows! This is the first problem with this system; while the top and bottom of the scale may be implied (ambiguity still isn't great), the middle of the scale is entirely ambiguous.
Most of the time, people start with a default supposition that a product ought to be 5 stars, and then they subtract stars according to their own perceived deficiencies of the product. On the flip side, people who are dissatisfied with a product tend to start at 1 star and perhaps award it a couple more if they are feeling charitable.
This dualistic approach highlights another issue with this kind of rating system. Despite the pure quantitative mechanism, the feelings and disposition of the rater ultimately assign meaning to the numeric scale.
Ultimately, any 5 star rating system ends up being a gamut ranging from hate it to love it with a lot of ambiguity in the middle.
To put it simply, the stars mean nothing. But to be more precise, the stars are the *average* of the *numeric expression* of the public sentiment about a thing.
OK, public sentiment. That's what we want, right? We want to know what the public thinks about the product so we can make a good purchase decision.
Except, unfortunately, the 5 star rating system does an incredibly imprecise job of capturing what the public actually thinks about a product. Let's dive into why this is. But first, to conclude our thought experiment, I'd like to establish an informal consensus that we instinctually seek out the highest rated items and avoid ratings less than 4.5 out of 5, with infrequent exceptions. If this sounds like you, then let's proceed.
## Part 2: Trash or Treasure
How does one turn their feelings into a number? This is not something people are good at, or do naturally, or are even required to do very often. It is also completely subjective and arbitrary. Does "happy" mean 5 or 100 or 902,581? It really depends on what you are trying to measure.
However, you've probably done this on a survey. Normally, a range of options is given with accompanying numbers. You may have heard a guiding explanation such as "Rate one through five, with one being 'least likely' and five being 'most likely' to recommend to a friend." Such explanations are useful in assisting a person in expressing their personal sentiments as a numeric range, and anecdotally I seem to provide more nuanced answers when I have guidance as to what each value actually means to me. However, such explanations are missing from most places where five star rating systems can make or break one's livelihood as a seller, developer, or musician. Users are normally given no guidance as to how a particular star value should correspond to their sentiment.
Accordingly, this manifests in an all too common star rating smiley. Like the famed "smiley face curve" equalizer setting popularized in the 1970s, in which the frequencies of a song are boosted from the midrange outward toward the bass and treble to make a smile shape (whose midrange is the lowest point), the star rating smile is a common sight on product reviews: anecdotally, the most common ratings appear to be 5 and 1, then 4 and 2, and lastly 3.
The fact that the most common ratings are five stars and one star indicates that people most often do not think about their experience as a gradient between good and bad, but rather simply label their sentiment as only "good" or "bad" with little room for nuance. Since raters receive no guidance for expressing their sentiment numerically, this should be wholly unsurprising. But I also feel that this binary outcome is a very natural expression that requires minimal mental energy to produce, and is therefore the most efficient expression of sentiment. I suspect that in another universe where we have a commonplace 15 star rating system but humans are the same, the most common ratings are 15 stars and 1 star respectively.
Other factors influence this binary good/bad paradigm too. People know instinctively that leaving a bad review is bad for business, and often this is their vengeful response to any sense that they have been mistreated, misled, or swindled by the seller. Oftentimes the sheer excitement of a new product compels people to write reviews only minutes after they have received it, precluding them from providing an experienced and time-tested perspective on how the product performs.
These quirks of human behavior can skew five star ratings severely. But the behaviors themselves are not bad! They are totally natural and expected. The real problem is how the five star rating system fails to capture these commonplace human sentiments accurately.
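To make the "smile" concrete, here is a minimal sketch with two made-up review histograms (the counts are invented for illustration, not real data). Both average out to exactly 3 stars, yet one is a polarized smile and the other is a uniform shrug, which shows how much the average alone hides.

```python
from statistics import mean

# Hypothetical review histograms: star value -> number of reviews.
polarized = {1: 40, 2: 8, 3: 4, 4: 8, 5: 40}   # the "smile" shape
lukewarm = {1: 0, 2: 10, 3: 80, 4: 10, 5: 0}   # nearly everyone picks 3

def expand(histogram):
    """Turn a {stars: count} histogram into a flat list of ratings."""
    return [stars for stars, count in histogram.items() for _ in range(count)]

def extreme_share(histogram):
    """Fraction of reviews that are either 1 star or 5 stars."""
    total = sum(histogram.values())
    return (histogram[1] + histogram[5]) / total

for name, histogram in [("polarized", polarized), ("lukewarm", lukewarm)]:
    print(name, "mean:", mean(expand(histogram)),
          "extreme share:", extreme_share(histogram))
```

The two products would display the same star average even though the underlying sentiment could hardly be more different.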
We need to build a rating system that works for people rather than forcing people to fit into a poorly executed rating system.
## Part 3: Feast or Famine
It is natural that when presented with a plethora of options, humans will be drawn to the options that are labeled as "the best". When the "best" possible is five stars, then five stars naturally becomes the [Schelling point](https://en.wikipedia.org/wiki/Focal_point_(game_theory)) for all buyers and sellers, where the vast majority of economic activity is concentrated.
When all ratings are seen from a five-based perspective, fours look barely adequate. Threes look unacceptable. Twos are abject. And ones are complete disasters.
You may have heard about how [Uber would suspend drivers whose ratings fell below 4.6](https://www.inc.com/minda-zetlin/uber-rider-ratings-deactivation-lyft-rideshare.html). The tendency of five star rating systems is to create a gradient of sentiment where 1 to 4.5 stars is "bad" and 4.5 to 5 stars is "good".
One rarely discussed outcome of this dynamic is that good ratings are nearly impossible to compare to each other. As alluded to in the introductory thought experiment, the narrow range between 4.5 and 5 leaves very little room to distinguish between a good product and an excellent product; both are simply "good" or "not bad". As highly rated products are usually the products people are most interested in, it is unfortunate that a favorable rating would actually provide less signal than a bad rating as to the product's relative quality in the spectrum of good ratings.
Meanwhile, bad ratings have wide berth to compare numerically. The astronomical range between 1 and 4.5 is a very spacious gradient in which to assess how bad one product is versus another, except nobody actually cares because nobody is going to buy it anyway!
In the five star rating system, the better a product is, the *less* information we get to compare it to other similarly good products. This is a critical design flaw, because the better a product is and the more people review it, the *more* information we should have about it versus other similar products. It would be better if the range of "good" ratings was wider so that good ratings could be compared to each other.
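As a rough back-of-the-envelope sketch, we can compute how much of the scale is left over for "good" ratings. The 4.5 cutoff is the informal threshold from the thought experiment, not a measured value, and `good_share` is just an illustrative helper name.

```python
def good_share(scale_min, scale_max, good_threshold):
    """Fraction of the whole scale that lies at or above the 'good' threshold."""
    return (scale_max - good_threshold) / (scale_max - scale_min)

# Five stars with an informal "good" cutoff of 4.5 on a 1-to-5 scale.
five_star = good_share(1, 5, 4.5)

# For comparison, a 0-to-1 scale where everything from the midpoint up is "good".
half_split = good_share(0, 1, 0.5)

print(f"five star: {five_star:.1%} of the scale is 'good'")
print(f"midpoint split: {half_split:.1%} of the scale is 'good'")
```

Under these assumptions only one eighth of the five star scale is available to tell good products apart, versus half of the scale when the cutoff sits at the midpoint.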
I'd like to emphasize that the *loss of precision* as a product becomes more desirable is a truly ludicrous mechanic of the five star rating system. This poorly conceived rating paradigm completely pervades our digital economic systems and determines the success of millions of producers. Let's be clear about the stakes. Assigning subjective value to things is completely overlooked for what it really is: an absolutely critical and monumentally influential economic activity. Because the most important economic activity is how we assign value to things with money, the second most important economic activity is how we *inform* our economic value assignments with subjective value assignments such as these.
We need to fix how we rate things. It could literally change how entire economies function.
## Part 4: Quality or Quantity
Valve has obviously put thought into how they handle ratings for video games sold through their extremely popular and long-lived Steam platform.
Reviews are not allowed until a certain number of hours of the game have been played. The review must assign a thumbs-up or thumbs-down, and then provide a minimum amount of text.
Then, rather than simply providing an average of these binary ratings, Steam averages them over a recent period of time, creating a dynamic where if a game developer releases a new update, the reviews written more recently are sure to influence the overall rating of the game rather than old ratings that haven't taken the new update into consideration.
Finally, the aggregated rating itself is displayed as "Overwhelmingly Negative", "Very Negative", "Mostly Negative", "Mixed", "Mostly Positive", "Very Positive", or "Overwhelmingly Positive".
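The recency idea can be sketched in a few lines. This is not Valve's actual algorithm: the 30 day window, the function name, and the sample reviews are all assumptions made for illustration, and the mapping from a percentage to a text label is left out because the real thresholds are not described here.

```python
from datetime import datetime, timedelta

def positive_share(reviews, now, window_days=None):
    """Share of thumbs-up among reviews, optionally limited to a recent window.

    `reviews` is a list of (timestamp, thumbs_up) tuples.
    Returns None when no reviews fall inside the window.
    """
    if window_days is not None:
        cutoff = now - timedelta(days=window_days)
        reviews = [review for review in reviews if review[0] >= cutoff]
    if not reviews:
        return None
    return sum(1 for _, thumbs_up in reviews if thumbs_up) / len(reviews)

now = datetime(2025, 1, 14)
old = now - timedelta(days=400)    # written before a hypothetical big update
fresh = now - timedelta(days=10)   # written after it

reviews = (
    [(old, False)] * 8 + [(old, True)] * 2
    + [(fresh, True)] * 9 + [(fresh, False)] * 1
)

all_time = positive_share(reviews, now)
recent = positive_share(reviews, now, window_days=30)
print("all time:", all_time, "recent:", recent)
```

In this made-up history the all-time share is dragged down by complaints that predate the update, while the recent window reflects what players think of the game as it is now.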
Ask yourself, which product would you be more likely to try?
- 2.5/5 stars
- "Mixed Reviews"
Ironically, the Steam rating system gives users less flexibility to express their sentiment as a number, and yet provides more depth, nuance, and balance to the resulting ratings. This, ladies and gentlemen, is what it looks like when somebody gives a damn about how things work. It is a great system.
Another interesting emergent behavior that accompanies many Steam reviews, as well as many reviews on other sites, is the user-generated pros/cons list. These lists are helpful and offer qualitative labels that help to explain the quantitative rating assignment, although these labels are not mechanically related to the rating system itself.
Steam has another mechanism to express qualitative judgements in the form of labels, although they had to be restricted because they were being abused in undesirable or nefarious ways. Now the labeling system does more to tell you about what the game is rather than what people think about it, which is fine, but I feel like it is a missed opportunity.
Therefore, I'd like to propose a simple system that combines thumbs up/down and labeling to be used in the context of nostr's review system.
## Part 5: QTS
I call this new review system QTS, or the "Qualitative Thumb System". [Originally I developed QTS when working at Arcade Labs.](https://github.com/ArcadeLabsInc/arcade/wiki/ArcadeSocial)
[This PR for a new nostr review mechanism](https://github.com/nostr-protocol/nips/pull/879) allows for a lot of flexibility in how you apply ratings to things, so QTS is simply a method of applying rating values that creates a better human-oriented review system. QTS is a way of using reviews.
In essence, QTS capitalizes on our very human instinct to assign a "good" or "bad" label by limiting the quantitative assessment to a thumbs-up or thumbs-down. Then, QTS provides qualitative labels, each naming a positive sentiment about a different aspect of the thing being rated.
First, the user chooses thumbs-up or thumbs-down as their overall assessment. If they do nothing else, this is sufficient to capture their sentiment. However, labels should be provided which the user can check or toggle on to increase their rating further.
The initial thumbs up is worth 0.5, and each label is worth (0.5 / number of labels). The minimum rating is 0 (thumbs down, no labels), and the maximum rating is 1 (thumbs up and all labels). Any rating 0.5 or above is trending toward good and below 0.5 is trending toward bad.
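The arithmetic above fits in one small function. This is a minimal sketch rather than a reference implementation; the function and parameter names are my own for illustration, and it assumes at least one label is offered.

```python
def qts_score(thumbs_up, labels_selected, labels_available):
    """Compute a QTS rating between 0 and 1.

    thumbs_up: True for thumbs-up, False for thumbs-down.
    labels_selected: how many labels the reviewer toggled on.
    labels_available: how many labels were offered (assumed to be at least 1).
    """
    if labels_available < 1:
        raise ValueError("at least one label must be available")
    if not 0 <= labels_selected <= labels_available:
        raise ValueError("labels_selected must be between 0 and labels_available")
    base = 0.5 if thumbs_up else 0.0
    return base + 0.5 * labels_selected / labels_available

print(qts_score(True, 0, 5))    # thumbs up, no labels
print(qts_score(True, 5, 5))    # thumbs up, every label
print(qts_score(False, 0, 5))   # thumbs down, no labels
```

The reviewer only ever supplies a thumb and some checkboxes; the number is produced entirely by the system.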
The labels should be applicable in the context of the thing. So, for example, I might provide the following labels for place reviews in a Google Maps-style app:
- Convenient
- Clean
- Affordable
- Memorable
- Inviting
These labels could possibly describe any place. It's OK if a place doesn't have all of these qualities. Zero labels and a thumbs up is still a "good" rating. Each label selected is essentially a "cherry on top" and its absence may indicate that either it isn't applicable OR the place failed to earn it.
Likewise, it is possible that you may give a place a thumbs-down and apply labels; this would result in a rating higher than 0 but still in the bad gradient (below 0.5).
Here are the key benefits I want to highlight of QTS:
- The 5 star rating system forces a user to do the work of translating their sentiment into a quantity. With QTS, the user never has to translate their feelings into a number! They only express good or bad and pick labels, and the QTS mechanism does the work of translating this into a computation-friendly value.
- The 5 star rating system generally results in a "bad" range from 1 to 4.5 and a "good" range from 4.5 to 5. QTS balances this with a "bad" rating at 0, a "good" rating at 0.5, and an "excellent" rating at anything above 0.5 (up to 1.0). This creates the maximum possible gradient between good and bad, which makes it easier to compare similar ratings. Contrast this with the 5 star rating system, which actually _loses precision_ as a product becomes more desirable.
- It is also helpful that similar QTS ratings may have different labels, which will allow people to make easy qualitative assessments that do not depend on users generating their own pros/cons list.
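To illustrate the last point, here is a small sketch that aggregates a handful of invented reviews for two hypothetical places, using the place labels from the example above. Averaging the scores and tallying the labels are my own assumptions about how a client might summarize reviews; QTS itself only defines the per-review score.

```python
from collections import Counter

LABELS = ("Convenient", "Clean", "Affordable", "Memorable", "Inviting")

def summarize(reviews):
    """Return (mean score, label tally) for a list of (thumbs_up, labels) reviews."""
    scores = []
    tally = Counter()
    for thumbs_up, labels in reviews:
        chosen = set(labels) & set(LABELS)   # ignore labels outside the set
        base = 0.5 if thumbs_up else 0.0
        scores.append(base + 0.5 * len(chosen) / len(LABELS))
        tally.update(chosen)
    return sum(scores) / len(scores), tally

# Invented reviews for two hypothetical places.
cafe = [(True, ["Clean", "Inviting"]), (True, ["Clean"]), (False, [])]
diner = [(True, ["Affordable", "Convenient"]), (True, ["Affordable"]), (False, [])]

cafe_mean, cafe_tally = summarize(cafe)
diner_mean, diner_tally = summarize(diner)
print("cafe:", round(cafe_mean, 3), cafe_tally.most_common())
print("diner:", round(diner_mean, 3), diner_tally.most_common())
```

Both places end up with the same average, but the label tallies say why people liked each one, with no hand-written pros/cons list required.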
## Part 6: Implementing QTS
Here is how it works:
You give the user the option to rate a thumbs up or thumbs down.
Then, you also give your user the option to choose from a predefined set of positive labels. You can have any number of labels but try to keep it below 10 so as not to overwhelm your users. Keep the labels general enough that they could potentially apply to any thing being rated.
For example, if you were providing labels for Amazon.com, some good labels would be:
- Good Value
- Good Quality
- As Described
- Durable
- Right Size
These labels are general enough that they could apply to almost any product. It is important to create labels that are general so that when comparing product ratings you are comparing the same labels. It is possible however that something like Amazon.com could define a different QTS label subset for each product category, and then the labels could be more specific to that category.
For example, a product category of Candles could have "Long Burning", "Good smell", "Safe", etc. These labels are much more specific, but appropriate for the Candle product category. The main point is that products which should be compared should use the same QTS label set.
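One way to keep comparable products on the same label set is a simple lookup keyed by category. The sketch below is only an assumption about how a client might organize this: the category keys, the fallback behavior, and the function names are hypothetical, and the candle set contains only the three labels named above.

```python
# Hypothetical label sets; "default" is used when a category has no set of its own.
LABEL_SETS = {
    "default": ("Good Value", "Good Quality", "As Described", "Durable", "Right Size"),
    "candles": ("Long Burning", "Good smell", "Safe"),
}

def labels_for(category):
    """Return the label set for a category, falling back to the default set."""
    return LABEL_SETS.get(category, LABEL_SETS["default"])

def validate_selection(category, selected):
    """Return the selected labels, deduplicated and in canonical order.

    Raises ValueError if any selected label is not in the category's set,
    so every review in a category is scored against the same labels.
    """
    allowed = labels_for(category)
    unknown = [label for label in selected if label not in allowed]
    if unknown:
        raise ValueError(f"labels not allowed for {category!r}: {unknown}")
    return [label for label in allowed if label in selected]

print(validate_selection("candles", ["Safe", "Long Burning", "Safe"]))
```

Rejecting labels from outside the set is what keeps two candle reviews comparable with each other.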
Some poor examples of labels would be:
- Orange (not really relevant to the product's assessment)
- Easy to Lift (only relevant to certain products)
- Made in USA (not really relevant to the product's assessment)
- Cheap (not descriptive enough and could be interpreted as negative)
A score is derived as follows:
- a thumbs-down is a score of 0.00
- a thumbs-up is a score of 0.50
- a label is worth 0.50 ÷ the number of labels available. So, if there are 3 labels to pick from, each label is worth about 0.1667 (exactly one sixth). The labels should all have the same value.
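The rules above can be enumerated exactly for the three-label case. This is an illustrative sketch using exact fractions to avoid rounding; `score_levels` is a name I made up. One thing it surfaces: a thumbs-down with every label selected lands on exactly 0.5, the same value as a bare thumbs-up, so an implementer may want to decide how to treat that tie.

```python
from fractions import Fraction

def score_levels(label_count):
    """Map each (thumbs_up, labels_selected) pair to its exact QTS score."""
    per_label = Fraction(1, 2) / label_count
    levels = {}
    for thumbs_up in (False, True):
        base = Fraction(1, 2) if thumbs_up else Fraction(0)
        for selected in range(label_count + 1):
            levels[(thumbs_up, selected)] = base + selected * per_label
    return levels

levels = score_levels(3)
for (thumbs_up, selected), score in sorted(levels.items(), key=lambda item: item[1]):
    thumb = "up" if thumbs_up else "down"
    print(f"thumbs {thumb}, {selected} label(s): {score} ({float(score):.4f})")
```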
Here is an example by calvadev being used on Shopstr:
https://github.com/nostr-protocol/nips/pull/879#issuecomment-2502210146
You can adjust the weights however you want. The fundamental thing that QTS prescribes is that a thumbs up gives a 50% score, and labels each contribute an equal share up to another 50%.
## Conclusion
With nostr we have a great opportunity to improve the economic information available to the planet. A more efficient market based on higher quality information will improve civilization in ways we may not expect, but definitely deserve!
If you like this post, be sure to give it a thumbs up ✌😁