sudb 16 minutes ago [-]
I've managed to successfully use the ANE to accelerate text-to-speech models on iOS (as an aside - this was much more straightforward than the equivalent on Android).
Really wish this author would blog more, this piece is incredible and includes the code.
Also ModernBERT is amazing if you haven’t used it before, worth spending time with - have used it myself for classification tasks and it’s very impressive.
It does not seem to cover the Neural Accelerators, Apple's equivalent of Tensor Cores. They were only released with the M5 platform, and they are probably the most important part to cover.
sakras 5 hours ago [-]
Neural accelerators are easy to use from Metal. They kick in automatically if you do a matmul using Metal Performance Primitives and you use bf16 or smaller (they don't seem to work in fp32).
wmf 12 hours ago [-]
Those are part of the GPU not the Neural Engine.
carbocation 14 hours ago [-]
This scans very much as AI-written.
thx67 13 hours ago [-]
This is obvious Claude slop writing, the author would be advised to use vale [1] with samples of their own writing as a guide.
> Performance begins with the roofline. On the M1 the engine holds about 12 fp16 TFLOP/s of compute against a DRAM-bandwidth ceiling. The roofline has a ridge point near 141 FLOP per byte, a 2 MB working-set threshold, a 0.23 ms floor under any single dispatch, and efficiency near 0.37 picojoules per FLOP at the compute optimum. On a 256-channel 3x3 convolution it runs about 3.8 times faster than the same chip’s GPU and 9 times more energy-efficient. The roofline pairs the engine’s throughput ceilings with its measured power.
> Reaching the engine is not the same as running an arbitrary graph on it. The operations the engine executes are distinct from the ones a capability bit only advertises. A feature attested in the hardware tables or accepted by the compiler frontend counts only once a compile-and-run confirms it, and several advertised operations, three-dimensional convolution among them, never lower to the engine at all. Weight compression on the direct path cuts bandwidth, not only stored size. On the unentitled engine, int4 lookup-table weights run about 2.37 times faster than fp16, and structured sparsity 1.55 to 1.64 times faster at 0.43 times the bytes.
Please no. The author would be advised to write their own original thoughts.
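For readers who have not met the roofline model the quoted passage leans on, here is a minimal sketch. Attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity, and the ridge point is where the two meet. The quote states the peak (about 12 TFLOP/s) and the ridge (about 141 FLOP per byte) but no bandwidth figure, so the bandwidth below is only what those two numbers imply. Treating the 0.23 ms figure as a lower bound on dispatch time is likewise an assumption, and the function names are illustrative.

```python
PEAK_FLOPS = 12e12           # ~12 fp16 TFLOP/s, from the quoted passage
RIDGE_FLOP_PER_BYTE = 141.0  # ridge point, from the quoted passage
DISPATCH_FLOOR_S = 0.23e-3   # per-dispatch floor, from the quoted passage

# Implied by peak / ridge; this number is NOT stated in the quote.
IMPLIED_BANDWIDTH = PEAK_FLOPS / RIDGE_FLOP_PER_BYTE  # bytes per second


def attainable_flops(intensity):
    """Roofline ceiling for a kernel doing `intensity` FLOP per byte moved."""
    return min(PEAK_FLOPS, IMPLIED_BANDWIDTH * intensity)


def estimated_time_s(flops, bytes_moved):
    """Rough dispatch time: roofline-limited work, never below the floor.

    Modelling the floor as a max() rather than an additive cost is an
    assumption made for this sketch.
    """
    intensity = flops / bytes_moved
    return max(DISPATCH_FLOOR_S, flops / attainable_flops(intensity))


if __name__ == "__main__":
    print(f"implied bandwidth: {IMPLIED_BANDWIDTH / 1e9:.1f} GB/s")
    for intensity in (10, 141, 1000):
        print(intensity, f"{attainable_flops(intensity) / 1e12:.2f} TFLOP/s")
```

Below the ridge a kernel is limited by memory traffic and above it by compute, which is why a single ridge number summarizes the whole curve.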
thx67 12 hours ago [-]
It was a joke, nothing could save this "paper". I don't think the author wrote anything. They pointed Claude at a directory and said "write a paper".
dkdcdev 14 hours ago [-]
why?
saagarjha 9 hours ago [-]
It has many technical mistakes besides the odd writing style
labcomputer 13 hours ago [-]
1. It uses non-idiomatic terminology in several places.
2. It repeats the same finding over and over (141 flops per byte, for example), without going deeper.
3. I stopped reading about a quarter of the way through because it felt like it was never going to stop teasing me about what it was going to tell me and actually tell me it.
4. It seems to assume the reader has a lot of context that isn't explicitly laid out (and which the reader wouldn't get just from reading the prior work, which is cited).
For example, I understand some of what it is saying because I used some similar techniques to benchmark things in the past (running at multiple scales to estimate overhead + marginal gains with a linear regression), but I wouldn't expect anyone who hasn't personally done that to follow the prose.
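The benchmarking technique mentioned above (timing at several scales, then fitting a line to separate fixed overhead from marginal cost) can be sketched roughly as follows. This is a generic least-squares fit, not the paper's code; the function name and the timings are invented for illustration.

```python
def fit_overhead(sizes, times):
    """Ordinary least squares for time = overhead + marginal * size.

    Returns (overhead, marginal): the intercept is the fixed cost per
    dispatch and the slope is the extra time per unit of work.
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    marginal = sxy / sxx
    overhead = mean_y - marginal * mean_x
    return overhead, marginal


# Synthetic timings in milliseconds, made up for this example:
# exactly 0.5 ms of overhead plus 0.1 ms per unit of work.
sizes = [1, 2, 4, 8, 16]
times_ms = [0.6, 0.7, 0.9, 1.3, 2.1]

overhead_ms, marginal_ms = fit_overhead(sizes, times_ms)

if __name__ == "__main__":
    print(f"overhead ~ {overhead_ms:.3f} ms, marginal ~ {marginal_ms:.3f} ms/unit")
```

With real measurements the points will not sit exactly on a line, so it helps to repeat each size several times and check the residuals before trusting the intercept.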
dylan604 12 hours ago [-]
> 4. It seems to assume the reader has a lot of context that isn't explicitly laid out (and which the reader wouldn't get just from reading the prior work, which is cited).
I've had this complaint since well before LLMs were used. People writing about subjects they know a lot about tend to assume that only other knowledgeable readers will read it. Or the piece was never edited by a real editor who would enforce rules like spelling out acronyms on first use, or who would force in additional information when too many details have been left out on the assumption they would already be known.
There's plenty of this type of writing to have trained the bots that way.
hbn 14 hours ago [-]
Cmd-F for "AI" has 1000+ hits!
thomspoon 13 hours ago [-]
The burden of proof should be with the beholder. Must be so easy to scream AI when you don’t want to read an article.
thx67 13 hours ago [-]
You obviously haven't read it, because it is clunky garbage.
> 19.4 Pacing compiles after a failure
> A failed compile is not free of side effects on the shared compile service. A compile that fails restarts the service, which takes a few seconds to come back, and failures that keep arriving faster than the service can restart between them keep it from making progress, so unrelated compiles slow down until the failures stop. The effect is a function of how fast failures arrive, not how many occur: failures spaced out past the restart interval cause no degradation at all. On detecting a failed compile, wait at least one restart interval, roughly 15 seconds, before the next compile, so a burst of failures cannot accumulate. No hard failure-count cap is needed.
The whole document is less nutritious than a wonderbread miracle whip sandwich.
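For what it is worth, the rule in the quoted section 19.4 boils down to a few lines. The sketch below is one way to read it, assuming the caller wraps each compile in a function that raises on failure. `CompilePacer` and `compile_fn` are hypothetical names, not anything from the paper or from Apple's tooling; the 15 second default is the figure from the quote.

```python
import time


class CompilePacer:
    """Delay the next compile until one restart interval after a failure."""

    def __init__(self, restart_interval_s=15.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.restart_interval_s = restart_interval_s
        self._clock = clock      # injectable so the pacing can be tested
        self._sleep = sleep
        self._last_failure = None

    def run(self, compile_fn):
        """Call compile_fn(), first waiting out any recent failure."""
        if self._last_failure is not None:
            elapsed = self._clock() - self._last_failure
            remaining = self.restart_interval_s - elapsed
            if remaining > 0:
                self._sleep(remaining)
        try:
            result = compile_fn()
        except Exception:
            # Record when the failure happened, then let the caller see it.
            self._last_failure = self._clock()
            raise
        self._last_failure = None
        return result
```

There is deliberately no failure counter here: as the quote says, only the spacing between failures matters, so a timestamp is the only state the pacer keeps.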
nielsbot 7 hours ago [-]
you forgot the bologna and iceberg lettuce
natpalmer1776 10 hours ago [-]
Personally I'm not in the habit of printing and eating articles I read, but in the unlikely event that I did, I find it even less likely that I would be concerned with its nutritional content. (/s)
throwa356262 13 hours ago [-]
Is there a non-slop version of this information available?
I am reading up on GPU / ML micro architecture and am looking for some good sources.
jval43 5 hours ago [-]
There was this article recently, which I personally found interesting:
I did however struggle to run a diffusion model on the ANE - but found that mlx-swift and iPhone GPU sufficed: https://www.duration.ai/blog/generating-images-with-a-2020-i...
[1] https://vale.sh/
https://news.ycombinator.com/item?id=47208573 Inside the M4 Apple Neural Engine, Part 1: Reverse Engineering (maderix.substack.com) 376 points | 3 months ago | 122 comments