sankalp's blog

Is Anthropic's Claude 3.5 Sonnet All You Need? - Vibe Check

Intro

Comparison of AI model performance

Source

I have been subbed to Claude Opus for a few months (yes, I am an earlier believer than you people). A couple of days back, I was working on a project and opened the Anthropic chat. Then I realised it was showing "Sonnet 3.5 - Our most intelligent model", and that was seriously a major surprise.

I have been playing with it for a couple of days now. I wrote some code ranging from Python, HTML, CSS, and JS to PyTorch and JAX. I made it do some editing and proof-reading. So far it's been smooth sailing. Maybe we haven't hit a wall yet (OK, I am not important enough to comment on this, but you gotta remember it's my blog).

But why a vibe check, aren't benchmarks enough? Oversimplifying here, but I think you cannot trust benchmarks blindly. There can be benchmark data leakage or overfitting to benchmarks, plus we don't know whether our benchmarks are accurate enough for the SOTA LLMs.

You need to play around with new models and get a feel for them; understand them better. Become one with the model. The next few sections are all about my vibe check and the collective vibe check from Twitter.

Benchmarks

Before we get to the vibe check, let's have a look at the benchmarks (sorry, formality).

The h̶i̶p̶s̶ benchmarks don't lie. It does feel much better at coding than GPT-4o (can't trust benchmarks for it haha) and noticeably better than Opus. Don't underestimate "noticeably better" - it can make the difference between single-shot working code and non-working code with some hallucinations. I had some JAX code snippets that weren't working with Opus' help, but Sonnet 3.5 fixed them in one shot.

I frankly don't get why people were even using GPT-4o for code. I realised in the first 2-3 days of usage that it sucked at even mildly complex tasks, and I stuck to GPT-4/Opus.

Anyway, coming back to Sonnet: Nat Friedman tweeted that we may need new benchmarks because it scores 96.4% (0-shot chain of thought) on GSM8K (a grade-school math benchmark). (Nat, please hire me)

The GPQA improvement to 59.4% is noticeable. GPQA, or Graduate-Level Google-Proof Q&A Benchmark, is a challenging dataset of multiple-choice questions in physics, chemistry, and biology crafted by domain experts. It's difficult, basically. The diamond subset has 198 questions.

Vibe check

Knowledge

Underrated thing, but the data cutoff is April 2024. That means better coverage of recent events, music/movie recommendations, cutting-edge code documentation, and recent research papers. Let's gooooo.


Code generation

It was immediately clear to me that it was better at code. It's much faster at streaming too. Much less back-and-forth is required compared to GPT-4/GPT-4o. More accurate code than Opus. It does not get stuck like GPT-4o.

Screenshot of a tweet by Jeremy Howard

Source

Yohei (BabyAGI creator) remarked the same.

Teknium tried to make a prompt engineering tool and he was happy with Sonnet.

Update 25th June: It's SOTA (state of the art) on LmSys Arena. You can check here.

Laziness

So far, my observation has been that it can be lazy at times, or it doesn't understand what you are saying. This sucks. It almost feels like they are changing the quantisation of the model in the background. I have to start a new chat or give more specific, detailed prompts.

My mutual Tokenbender also noticed the same.

Sometimes you will notice silly errors on problems that require arithmetic/mathematical thinking (think data-structure and algorithm problems), much like GPT-4o. Try CoT here - "think step by step" - or give more detailed prompts.
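A minimal sketch of that workaround, if you're calling the model programmatically (the helper function is my own, not an official pattern):

```python
def with_cot(task: str) -> str:
    """Prepend an explicit chain-of-thought instruction to a task prompt."""
    return "Think step by step before giving your final answer.\n\n" + task


prompt = with_cot("What data structure gives O(1) amortized appends and O(1) random access?")
print(prompt.splitlines()[0])
```

In my experience this alone fixes a decent fraction of the silly arithmetic slips.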

Agentic capabilities

As pointed out by Alex here, Sonnet passed 64% of tests on their internal evals for agentic capabilities as compared to 38% for Opus.

Maybe next-gen models are gonna have agentic capabilities baked into the weights. RIP agent-based startups.

Cursor and Aider have both integrated Sonnet and report SOTA capabilities.

Tips and Tricks - "Make It Better" and "List of assumptions"

Several people have noticed that Sonnet 3.5 responds well to the "Make It Better" prompt for iteration.


Jeremy Howard mentioned another trick here: If you've got any favorite trick questions for LLMs, try it on Sonnet 3.5 with this in your prompt: "Before you answer, make a list of wrong assumptions people sometimes make about the concepts included in the question."
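Both tricks are just prompt prefixes/suffixes, so they're trivial to wrap in helpers. A sketch (the prompt strings come from the tweets above; the function names are my own):

```python
# Jeremy Howard's trick: make the model surface wrong assumptions first.
ASSUMPTIONS_PREFIX = (
    "Before you answer, make a list of wrong assumptions people sometimes "
    "make about the concepts included in the question.\n\n"
)


def assumption_check(question: str) -> str:
    """Prefix a trick question with the wrong-assumptions instruction."""
    return ASSUMPTIONS_PREFIX + question


def make_it_better(conversation: list[str]) -> list[str]:
    """The iteration trick: just ask the model to improve its last output."""
    return conversation + ["Make it better."]


print(assumption_check("What is the smallest integer whose square is between 15 and 30?"))
```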

Artifacts

Anthropic also released an Artifacts feature, which essentially lets you interact with code, long documents, and charts in a UI window on the right side. You talk with Sonnet on the left, and it carries on the work/code in the Artifacts window.


It's an excellent UX choice. It separates the flows for code and chat, and you can iterate between versions. It was so good that the DeepSeek people made an in-browser environment too.

There's also tooling for HTML, CSS, JS, TypeScript, and React. You can essentially write code and render the program in the UI itself. This further lowers the barrier for non-technical people. You can iterate and see results in real time in a UI window. I am never writing frontend code again for my side projects.

I tried making a simple portfolio for Sam Alternativeman. Link to sequence of prompts.

With the help of the creative coding library p5.js, I was able to make A* visualization, Hilbert curves, and Perlin noise using the Artifacts feature. Each took no more than five minutes.

I also made a visualization for Q-learning.

Alex Albert created an entire demo thread.

Criticisms

Simon Willison pointed out here that it's still hard to export the hidden dependencies that Artifacts uses. Hopefully Anthropic releases this soon.

Vision Test

They claim that Sonnet is their strongest vision model (and it is). I did the Frieren-eating-a-gigantic-burger vibe test. Left is Opus, right is Sonnet 3.5.

Opus performance
Sonnet performance

Sonnet 3.5 was correctly able to identify the hamburger.

Reasoning

Wow! 😮 claude-3.5 is an extremely impressive overall model! It achieves the top score in **every category**, and substantially improves in reasoning! See for yourself with our interactive leaderboard: https://t.co/F8tIK27ANm pic.twitter.com/KanapZmF5k

— Colin White (@crwhite_ml) June 20, 2024

Sonnet 3.5 is able to answer some questions and puzzles it wasn't able to solve earlier - like Nathan Lambert's question: what is DPO?


It was able to solve the question "What is the smallest integer whose square is between 15 and 30?" in one shot. Check the thread below for more discussion on the same.

There are still issues though - check this thread.


Update 25th June: Teortaxes pointed out that Sonnet 3.5 is not as good at instruction following. It still fails on tasks like counting the 'r's in "strawberry". Note that LLMs are known to perform poorly on this task due to the way tokenization works.

btw, people who bleat "every model is bad at character counting ackshually!": you're bad at taking the hint, about as bad as Sonnet here

This is not the "how many letters 'r' in the word 'strawberry" test, this is "will you honestly think about the task line by line" test

— Teortaxes▶️ (@teortaxesTex) June 24, 2024
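For reference, the character-level ground truth the model should reach if it honestly goes letter by letter instead of over tokens:

```python
# A model reasoning over characters rather than tokens should get this.
word = "strawberry"
print(word.count("r"))  # 3
```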

Personality

Sonnet 3.5 is very polite and sometimes feels like a yes-man (this can be a problem for complex tasks, so be careful). It honestly rizzed me up when I was proof-reading a previous blog post I wrote.

oh my god sonnet stop pic.twitter.com/ZTm9fFyZWK

— sankalp (@dejavucoder) June 21, 2024

Sonnet is SOTA on EQ-Bench too (which measures emotional intelligence and creativity) and 2nd on "Creative Writing". It could make for good therapist apps.


Here's a demonstration by Anthropic's Amanda Askell:

I asked Claude to write a poem from a personal perspective. I thought this part was surprisingly sad. pic.twitter.com/oCxsEg0g4z

— Amanda Askell (@AmandaAskell) June 22, 2024

Conclusion

This concludes my quick vibe-check post. The overall vibe check is positive. I am mostly happy I got a more intelligent SOTA code-gen buddy. This is close to AGI for me. Also, Sam Altman, can you please drop Voice Mode and GPT-5 soon?!