Hi D3us,
Thank you again for the amazing job you did! My 15" MacBook Pro Late
2011 is working like new again!!!
The whole issue became so clear to me from your explanation!
And the problem is actually so logical now that I have heard your ideas...
All because of one poor choice by Apple in the production process.
Here's my attempt to help all other curious people understand what's going on with those 2011 machines (and who knows which other models).
The reason why so many of these 2011 MacBook Pro machines fail/die on us:
The logic boards, which are created in an amazing (to me, at least) way, contain components ranging from very small to relatively large, including BGA (http://en.wikipedia.org/wiki/Ball_grid_array) components.
These components (including the little tin balls) are all positioned perfectly on the contacts of the BGA/PCB (http://en.wikipedia.org/wiki/Printed_circuit_board), and then the whole thing goes through some kind of reflow oven. Think of the kind of conveyor belt oven Pizza Hut uses.
Goal: Heat the entire PCB including all components to approx 240ºC to get all components soldered to the PCB, while staying within the safety limits of ALL components and the standardised solder timings/temps.
That "approx" is where the problem resides.
All the components on the PCB are different and have different max temperature (+duration @ that temp) specifications.
Differently sized components can influence the heat penetration in the reflow oven, which can result in effective temperature differences at the solder balls. And remember: Some of these components are BGAs, like the failing GPUs = These have small solder balls between the chip and PCB.
So Apple + the suppliers of the components had to "choose" a common "safe" max temperature that would serve the purpose of a) creating a good connection and b) not damaging any of the components.
And here lies some irony:
By trying not to damage some component(s), they kind of completely destroyed the reliability of this range of machines...
Let's say they chose to move the PCBs through that oven at max 240ºC (the melting temperature of tin is 232ºC) for “some” short period of time.
The AMD BGA in our MacBooks probably has about 700-800 solder balls.
Now imagine those tiny balls heating up, but not long enough...
They sit between the chip and the PCB, so they might need more time to get fully heated.
Some balls might turn completely into liquid phase, but some might not entirely turn into liquid phase and just go soft only!
Then you get the situation where probably most balls melted completely, forming a full soldering connection, while some balls might just have gotten soft and gotten "pressed flat" between the GPU and the PCB, making an electrical "pressure" contact, instead of a real soldering contact.
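To make that picture a bit more concrete, here is a minimal Python sketch of the idea. It is only an illustration of the reasoning above, not a real thermal model: the "lag" of each ball behind the oven temperature, the ball count of 750 and the split between 740 and 10 balls are all made-up assumptions. Only the 232ºC melting point and the 240ºC oven temperature come from the story above.

```python
# Illustrative only: every lag value below is a made-up number, not a measurement.
TIN_MELTING_POINT_C = 232.0  # melting temperature of tin, as quoted above


def classify_ball(peak_temp_c, melting_point_c=TIN_MELTING_POINT_C):
    """Return 'soldered' if the ball reached its melting point, else 'pressure contact'."""
    return "soldered" if peak_temp_c >= melting_point_c else "pressure contact"


def count_joints(oven_temp_c, temp_lags_c):
    """Count joint types, given how far each ball lags behind the oven temperature."""
    counts = {"soldered": 0, "pressure contact": 0}
    for lag in temp_lags_c:
        counts[classify_ball(oven_temp_c - lag)] += 1
    return counts


# Hypothetical board: 750 balls, most lag 5 degrees behind the oven,
# and a handful (say, deep under the chip) lag 10 degrees.
lags = [5.0] * 740 + [10.0] * 10

print(count_joints(240.0, lags))  # {'soldered': 740, 'pressure contact': 10}
print(count_joints(245.0, lags))  # {'soldered': 750, 'pressure contact': 0}
```

With these assumed numbers, a few degrees more in the oven is the difference between ten "pressure" joints and none, which is the whole point of the story.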
So you might say that Apple may have been operating too close to the “edge” in its choice of heating temperature for the ovens.
And there you go, problem created, but well hidden!
Immediate tests after production always showed perfect functionality, because the "pressure connections" made a fine electrical connection at that moment.
But then 2 main factors start to change that picture:
- TIME
- TEMPERATURE
As time passes and as you heat up your GPU nice and hot (basically by normal use of your nice, expensive Mac, even just by watching a movie), on each usage cycle, the GPU expands+shrinks a tiny bit.
Imagine this happening for a few years and also imagine how oxidation just loves to crawl in between those pressure joints.
Especially as these unibody models are a huge chunk of aluminium that causes plenty of condensation inside when moving the device from a cold car into a warm room, etc.
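To get a feeling for how often that expand+shrink cycle happens, here is a tiny back-of-the-envelope calculation in Python. The figures (4 heat-up/cool-down cycles per day, 3 years of use) are purely hypothetical assumptions for illustration; real usage will differ per person.

```python
def thermal_cycles(cycles_per_day, years, days_per_year=365):
    """Rough number of heat-up/cool-down cycles over the given number of years."""
    return cycles_per_day * days_per_year * years


# Hypothetical usage: 4 cycles per day, for 3 years.
print(thermal_cycles(4, 3))  # 4380
```

So even with modest assumptions, a weak "pressure" joint gets flexed thousands of times before the machine is a few years old.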
So there you go.
A wrong choice of oven temperature and "baking" duration creates imperfect solder joints, then time + high GPU temperatures ruin those bad joints further, and our GPUs fail.
Why mainly the GPU and not other components?
Usually, the CPU doesn't operate at 100% of its capacity during its (normal) operational life, especially with those fast quad-core CPUs nowadays.
Insufficient memory and other slower hardware components (usually) form a bottleneck anyway, which prevents the CPU from spending lots and lots of hours at 100% load.
The GPU, on the other hand, is a component that gets its rectangular butt kicked quite often, and quite severely. Apparently, the GPU is utilised at high percentages much more often than the CPU.
That causes the 30W GPU to undergo many more "severe" temperature changes than the 45W CPU will experience.
Still, CPUs are also reported to have failed on these boards (and restored successfully after a "simple" reflow procedure).
So what was done to my MacBook was a REFLOW, not a reballing.
A reflow is basically repeating the process that Apple was supposed to be doing correctly in their oven in the first place, but then at the RIGHT temperature where all solder balls actually DO flow 100% completely.
The end result?
What Apple would have supplied if they had raised the temperature just a little bit higher, or maybe kept it there for a little bit longer.
I paid a visit to D3us last Saturday. He reflowed my GPU.
So no new chip, no reballing, just making the original solder balls melt properly (for the first time ever).
And the result?
My Mac works like new!! And personally I feel that this whole story is super logical and the solution is just very practical.
And, UNLESS Apple changed the protocols for soldering drastically after the 2011 model, all following models might be up for similar problems after several years of service...? If they soldered everything according to the same protocols from 2011 on, I'm not surprised that Apple isn't responding to this very quickly. Maybe they are researching their behinds off, to get a picture of what's really happening here. Which might lead to a scary conclusion.
Since my Mac was out of warranty already, I didn't really care whether Apple would pay back the €125 I spent or not.
This was such a small price for bringing my Mac back better than ever before. And as I use my Mac to make a living, I really couldn't be bothered by spending €125 to get this thing working as it should.
My advice: think about the logic in this story, and consider this procedure, as it works in almost all cases. Of course, some component could be literally "broken", in addition to the soldering being bad. Then that component would have to be replaced too.
If it's impossible to find out which component is broken, it's always possible to find a refurbished second-hand logic board and reflow that too.
Costs are still always lower than buying a nice, new replacement board from Apple Service which has the same problems built in!
The story above also explains why the replacement boards keep showing the same defects, as they are all produced in the same way, with imperfect soldering connections. It also explains why nobody can find a successful "software fix". It also explains the variety in related symptoms, as it's of course unknown which of the 700-800 tin balls didn't melt/flow properly!
Most of all: Applying this fix turns our dead Macs back into the reliable machines they're supposed to be. The hardware components are basically all just fine. The choice of combination of components is also just fine. Macs are in fact just fine concerning hardware.
Great even.
But if you do a bad job soldering good parts together, well... then this weird behaviour is really not a surprise, right?
Fixing the boards in this way, which Apple will NEVER be able to do in an affordable way (for them, on such a large scale), seems the most practical solution to me.
What Apple would officially have to go through to fix this RIGHT is this (and then ask yourself whether you see that happening):
- create new logic boards
- produce them with different production standards, deviating from what they primarily decided was "safe for all components", heating them up to a higher temperature
- send every 2011 Mac owner (and any other who reports similar weird problems) a nice new logic board...
Again: Do you see that happening?
Excuse my mistakes in English, it's not my native language.
I just hope that this story will help people get an idea of what's happening with their machine and what might be the most cost-effective, durable solution.
D3us: Thumbs up and thanks again!!