By Ian Cutress & Andrei Frumusanu

Does anyone remember our articles regarding unscrupulous benchmark behavior back in 2013? At the time we called the industry out on the fact that most vendors were increasing thermal and power limits to boost their scores in common benchmark software. Fast forward to 2018, and it is happening again.

Benchmarking Bananas: A Recap

cheat: verb, to act dishonestly or unfairly in order to gain an advantage.

AnandTech has a long and rich history of exposing benchmark cheating on smartphones. It is quite apt that this story goes full circle, as the one to tip off Brian on Samsung’s cheating behaviour on the Exynos Galaxy S4 a few years back was Andrei, who now writes for us.

When we exposed one vendor, it led to a cascade of discussions and a few more articles investigating more vendors involved in the practice, and eventually to Futuremark delisting several devices from its benchmark database. Scandal was high on the agenda, and the results were bad for both companies and end users: devices found cheating tarnished their brands, and consumers could not take any benchmark data from those companies as valid. Even reviewers were misled. It was a deep rabbit hole that should not have been approached – how could a reviewer or customer trust the number coming out of the phone if it was not in a standard ‘mode’?

So thankfully, ever since then, vendors have backed off quite a bit on the practice. For several years after 2013, it appeared that a significant proportion of devices on the market were behaving within expected parameters. There are some minor exceptions, mostly from Chinese vendors, and these come in several flavors. Meizu has a reasonable attitude to this: when a benchmark is launched, the device puts up a prompt to confirm entering a benchmark power mode, so at least they’re open and transparent about it. Some other phones have ‘Game Modes’ as well, which focus on either raw performance or extended battery life.

Going Full Circle, At Scale

So today we are publishing two front page pieces. This one is a sister article to our piece addressing Huawei’s new GPU Turbo; while that feature comes with overzealous marketing claims, the technology is sound. Through the testing for that article, we stumbled upon this issue, completely orthogonal to GPU Turbo, which needs to be published. We also wanted to address something that Andrei has come across while spending more time with this year’s devices, including the newly released Honor Play.

The Short Detail

As part of our phone comparison analysis, we often employ additional power and performance testing on our benchmarks. While testing out the new phones, the Honor Play had some odd results. Compared to the Huawei P20 devices tested earlier in the year, which have the same SoC, the results were also quite a bit worse and equally weird.

Within our P20 review, we had noted that the P20’s performance had regressed compared to the Mate 10. Since we had encountered similar issues on the Mate 10 which were resolved with a firmware update pushed to me, we didn’t dwell too much on the topic and concentrated on other parts of the review.

Looking back at it now after some re-testing, it seems quite blatant as to what Huawei and seemingly Honor had been doing: the newer devices come with a benchmark detection mechanism that enables a much higher power limit for the SoC with far more generous thermal headroom. Ultimately, on certain whitelisted applications, the device performs far higher than a user might expect from other similar non-whitelisted titles. This consumes power, pushes the efficiency of the unit down, and reduces battery life.
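To make the idea of whitelist-based detection concrete, here is a minimal sketch in Python. It is purely illustrative and is not Huawei's implementation: the profile values, the package names, and the assumption that detection keys on an application identifier are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerProfile:
    """A set of limits applied to the SoC while an app is in the foreground."""
    name: str
    soc_power_limit_w: float   # hypothetical sustained power limit, in watts
    throttle_temp_c: float     # hypothetical temperature at which throttling starts


# Hypothetical numbers, chosen only to show a "normal" and a "boosted" profile.
DEFAULT_PROFILE = PowerProfile("default", soc_power_limit_w=4.5, throttle_temp_c=45.0)
BENCHMARK_PROFILE = PowerProfile("benchmark", soc_power_limit_w=9.0, throttle_temp_c=60.0)

# Hypothetical whitelist: identifiers of publicly distributed benchmark apps.
BENCHMARK_WHITELIST = frozenset({"com.example.benchmark.public"})


def select_profile(package_name: str) -> PowerProfile:
    """Return the boosted profile only for whitelisted apps."""
    if package_name in BENCHMARK_WHITELIST:
        return BENCHMARK_PROFILE
    return DEFAULT_PROFILE


# The public build is recognised; a build with a different identifier is not,
# so it runs under the same limits as any regular game.
print(select_profile("com.example.benchmark.public").name)
print(select_profile("com.example.benchmark.internal").name)
```

Under these assumptions, the same workload would land in two different power envelopes depending only on how the application is identified, which is the kind of behaviour the re-testing pointed to.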

This has knock-on effects, such as on trust in how the device works. The end result is that a single performance number is higher, which is good for marketing, but is unrealistic for any user of the device. The efficiency of the SoC also decreases (depending on the chip), as the chip is pushed well outside its standard operating window. It makes the SoC, one of the differentiating points of the device, look worse, all for the sake of a high benchmark score. Here's the example of benchmark detection mode on and off on the Honor Play:

GFXBench T-Rex Offscreen Power Efficiency (Total Device Power)

AnandTech                                  Mfc. Process   FPS      Avg. Power (W)   Perf/W Efficiency
Honor Play (Kirin 970) BM Detection Off    10FF           66.54    4.39             15.17 fps/W
Honor Play (Kirin 970) BM Detection On     10FF           127.36   8.57             14.86 fps/W
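The efficiency column is simply frames per second divided by average power. The short Python sketch below recomputes it from the FPS and power figures in the table; the variable and function names are our own, and recomputing from the rounded published inputs can differ from the published efficiency value in the second decimal place.

```python
# FPS and average total device power, copied from the table above.
results = {
    "BM Detection Off": {"fps": 66.54, "avg_power_w": 4.39},
    "BM Detection On": {"fps": 127.36, "avg_power_w": 8.57},
}


def perf_per_watt(fps: float, avg_power_w: float) -> float:
    """Efficiency in frames per second per watt."""
    return fps / avg_power_w


off = results["BM Detection Off"]
on = results["BM Detection On"]

# How much each quantity grows when benchmark detection kicks in.
fps_ratio = on["fps"] / off["fps"]
power_ratio = on["avg_power_w"] / off["avg_power_w"]

for mode, r in results.items():
    print(f"{mode}: {perf_per_watt(r['fps'], r['avg_power_w']):.2f} fps/W")
print(f"FPS ratio (on/off):   {fps_ratio:.2f}x")
print(f"Power ratio (on/off): {power_ratio:.2f}x")
```

Both ratios come out a little above 1.9x, with power growing slightly faster than frame rate, which is why perf/W ends up marginally lower with detection on.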

We’ll go more into the benchmark data on the next page.

We did approach Huawei about this during the IFA show last week, and obtained a few comments worth putting here. Another element to the story is that Huawei’s new benchmark behavior very much exceeds anything we’ve seen in the past. We use custom editions of our benchmarks (from their respective developers) so we can test with this ‘detection’ on and off, and the massive differences in performance between the publicly available benchmarks and the internal versions that we’re using for testing are absolutely astonishing.

Huawei’s Response

As usual with investigations like this, we offered Huawei an opportunity to respond. We met with Dr. Wang Chenglu, President of Software at Huawei’s Consumer Business Group, at IFA to discuss this issue, which is purely a software play from Huawei. We covered a number of topics in a non-interview format, which are summarized here.

Dr. Wang asked if these benchmarks are the best way to test smartphones as a whole, as he personally feels that these benchmarks are moving away from real world use. A single benchmark number, stated Huawei’s team, does not show the full experience. We also discussed the validity of the current set of benchmarks, and the need for standardized benchmarks. Dr. Wang expressed his preference for a standardized benchmark that is more like the user experience, and they want to be a part of any movement towards such a benchmark.

I explained that we work with these benchmark companies, such as Kishonti (GFXBench) and Futuremark (3DMark), as well as others, to help steer them towards tests that are more representative. We explained that employing a benchmarking mode to game test results is not a solution to what they see as a misrepresentation of user experience with these benchmarks. This is especially valid when the chip ends up with lower efficiency. To be honest about the test: the only way for it to better relate to user experience is to run it in the standard power envelope that every regular game runs in.

Huawei stated that they have been working with industry partners for over a year to find the best tests closest to the user experience. They like the fact that for items like call quality, there are standardized real-world tests that measure these features that are recognized throughout the industry, and every company works towards a better objective result. But in the same breath, Dr. Wang also expresses that in relation to gaming benchmarking that ‘others do the same testing, get high scores, and Huawei cannot stay silent’.

He states that it is much better than it used to be, and that Huawei ‘wants to come together with others in China to find the best verification benchmark for user experience’. He also states that ‘in the Android ecosystem, other manufacturers also mislead with their numbers’, citing one specific popular smartphone manufacturer in China as the biggest culprit, and that it is becoming ‘common practice in China’. Huawei wants to open up to consumers, but has trouble when competitors continually post unrealistic scores.

Ultimately Huawei states that they are trying to face off against their major Chinese competition, which they say is difficult when other vendors put their best ‘unrealistic’ score first. They feel that the way forward is standardization on benchmarks, that way it can be a level field, and they want the media to help with that. But in the interim, we can see that Huawei has also been putting its unrealistic scores first too.

Our response to this is that Huawei needs to be a leader, not a follower on this issue. I explained that the benchmarks we use (GFXBench) are well understood and are ‘standard’, and as real world as possible, but there are benchmarks we don’t use (AnTuTu) because they don’t mean anything. We also use benchmarks such as SPEC, which are very standard in this space, to evaluate an SoC and device. 

The discussion then pivoted towards the decline in trust in Huawei’s benchmark numbers in presentations as a result of this. We already take the data with a large grain of salt, but now we have no reason to listen to them as we do not know which values are in this ‘benchmark’ mode.

Huawei’s reaction to this is that they will ensure that future benchmark data in presentations is independently verified by third parties at the time of the announcement. This was the best bit of news.

Our Reaction

While not explicitly stated in a clear line, Huawei is admitting to doing what they are doing, citing specific vendors in China as the primary reason for it.

We understand the impact that higher marketing numbers have; however, this is the worst way to get them. Rather than calling out the competition for bad practices, Huawei is trying to beat them at their own game, and it’s a game in which everyone loses. For a company the size of Huawei, brand image is a big part of what the company is, and trying to mislead customers just for a high score will backfire. It has backfired.

Huawei’s comments about standardized benchmarking are not new – we’ve heard it since time immemorial in the PC space, and several years ago, Arm was similarly discussing it with the media. Since then the situation has gotten better: the canned benchmark companies speak with game developers to develop real-world scenarios, but they also want to push the boundaries.

The only thing that hasn’t happened in the mobile space compared to the PC space on this is proper in-game benchmark modes that output data properly. This is something that is going to have to be vendor driven, as our interactions with big gaming studios on in-game benchmarks typically fall flat. Any frame rate testing on mobile requires additional software, which can require root; however, Huawei recently disabled the ability to root their phones. We're told, though, that at some point in the future Huawei will be re-enabling rooting for registered developers.

Overall, while it’s positive that Huawei is essentially admitting to these tactics, we believe the reasons for doing so are flimsy at best. The best way to implement this sort of ‘mode’ is to make it optional, rather than automatic, as some vendors in China already do. But Huawei needs to lead from the front if it ever wants to approach Samsung in unit sales.

Huawei did not go into how the benchmarking detection will be addressed in current and future devices. We will come back to the issue for the Mate 20 launch on October 16th.


84 Comments

  • Cicerone - Friday, September 7, 2018

    But sometimes Kirin 970 is on the same level with 2016 Exynos 8890 found on Samsung S7.
  • shogun18 - Tuesday, September 4, 2018

    > I think it's important for users to know that the Kirin 970 has a significantly weaker GPU than the S845

    How so? If some popular game needs 10,000 shader OPS to run at 800x600 at 30 frames/sec what difference does it make if one SoC can pump out 8000 (admittedly synthetic - are you really going to tell me you're going to notice 24FPS vs 30? pahlease), or 15,000 or another 40,000? Ok, so does OPS/Watt actually matter in anybody's evaluation metric? No. Does anyone choose a phone based on this one lets me run X game for 30 minutes before running out of batt but I can get 40 minutes with this other one because in "game mode" the manufacturer took liberties with wattage?
  • cfenton - Tuesday, September 4, 2018

    What modern phone runs at 800x600? Also, faster GPUs can get closer to 60fps, which is definitely a noticeable improvement over 30fps.

    If all you're playing is Candy Crush, then it doesn't matter what GPU you have, but if you're playing Fortnite or the upcoming Elder Scrolls game, then GPU performance is important. If two phones are roughly the same price, but one of them has 3x the GPU power with no downsides, I'm going to go with the faster one every time.
  • shogun18 - Tuesday, September 4, 2018

    The human eye in games like Fortnite etc can only process a very limited frame rate. So anything over 30 is basically pointless. Plus factor in using a 27+ monitor(s) vs a piddly-ass phone screen with lousy (by comparison to "gaming" monitors) refresh characteristics the benchmark is even less useful.

    eg. https://www.pcgamer.com/how-many-frames-per-second...
  • cfenton - Tuesday, September 4, 2018

    That article makes it very clear that people can tell the difference between 60fps and 30fps. Its claim is that it's only an improvement in smoothness, not an improvement in our ability to track changes. A higher frame rate won't improve my ability to pick out movement.

    60fps looks better than 30fps. If I can choose between the two, at the same resolution, I'm always going to pick 60fps. Will it make me better at the game? No. Does it make the game look and feel better? Yes.
  • techconc - Monday, September 10, 2018

    @shogun18 - I always find it amusing when people present "evidence" to support their position only to find out the evidence they are producing very clearly refutes their position. The article very clearly states:
    "Certainly 60 Hz is better than 30 Hz, demonstrably better." - Professor Thomas Busey

    From my own perspective, I would suggest to you that games need to have a 30 fps at minimum to be playable and to appear to be somewhat fluid. 60 fps is clearly better, but not "twice as good". You can see the difference though. On my iPad, I can do 120 fps on games like World of Tanks Blitz and can even notice that difference. For some games, reaction time is critical and network performance also plays a role in this. However, higher frame rates can indeed provide a competitive advantage.
  • shogun18 - Tuesday, September 11, 2018

    did you BOTHER to read to the end let alone comprehend what was being put forth? The human brain is SLOW! It's massively parallel but it's SLOW. Just like our ears are crap compared to other creatures who actually have good hearing. If you're playing FPS on a phone you're an idiot to begin with. Fluidity or more properly the perception of same doesn't make your performance better. Your reaction time is also completely shit compared to the theoretical frame rate you think you are perceiving. Anyone who cares about game play on a phone is a moron.
  • Reflex - Tuesday, September 4, 2018

    Buyers should be able to value whatever they wish when making their purchasing decisions. Lying to them denies them the right to make decisions based on the criteria that matter most to them, whether it be nice cameras, great screens, excellent call quality, or yes, 'geekmarks' or whatever.

    It's not for you to determine what is most important to a customer, nor is it ethical to lie about one of those or other items in order to trick people who value them into buying your product.
  • boozed - Tuesday, September 4, 2018

    Funny you should say that, considering the reason for the existence of this website.
  • Samus - Wednesday, September 5, 2018

    You need to put a performance metric on things somehow. Cars have horsepower and torque, batteries have volts and milliamps, and food has protein and carbs.

    Unfortunately these metrics do not come from the SoC manufacturer, but the phone vendor. That therein lies the problem. "Overclocking" or boosting a SoC beyond reasonable thermal design limitations is blatant cheating if it can't be sustained throughout, say, a game, that the benchmark is momentarily mimicking.

    At the end of the day, this is really an Android problem too, because of the freedom the OS gives phone vendors to manipulate the kernel, scheduler, and frequency curve of the CPU/GPU. This kind of flexibility didn't exist (and still doesn't exist) in other mobile operating systems.

    So imagine if this were happening in the PC space. Where vendors were selling overclocked systems WITHOUT SAYING they were overclocked. Where vendors were manipulating the real-world benefits of a GPU with software that faked benchmark results.

    I would liken it to what happened with the game console clones of the 80's, when there were third-party Atari's, Intellivisions, etc, that had custom CPU's running at higher frequencies. In that case, it actually hurt developers more than consumers (but still hurt consumers) because developers couldn't even depend on a performance metric for the platform they were developing for. This is partially why there were virtually no third party developers (Activision and Hudsonsoft - who later developed their own console simply to have some control over the hardware environment! - were effectively the first cross-platform developers.)
