sankalp's blog

Is Anthropic's Claude 3.5 Sonnet All You Need? - Vibe Check

Intro

Comparison of AI model performance

Source

I have been subbed to Claude Opus for a few months (yes, I am an earlier believer than you people). A couple of days back, I was working on a project and opened the Anthropic chat. Then I realised it was showing "Sonnet 3.5 - Our most intelligent model", and that was seriously a major surprise.

I have been playing with it for a couple of days now. I wrote some code ranging from Python, HTML, CSS, and JS to PyTorch and JAX. I made it do some editing and proof-reading. So far it's been smooth sailing. Maybe we haven't hit a wall yet (OK, I am not important enough to comment on this, but you gotta remember it's my blog).

But why a vibe check, aren't benchmarks enough? Oversimplifying here, but I think you cannot trust benchmarks blindly. There can be benchmark data leakage or overfitting to benchmarks, plus we don't know whether our benchmarks are accurate enough for the SOTA LLMs.

You need to play around with new models and get a feel for them; understand them better. Become one with the model. The next few sections are all about my vibe check and the collective vibe check from Twitter.

Benchmarks

Before we get to the vibe check, let's have a look at the benchmarks (sorry, formality).

The h̶i̶p̶s̶ benchmarks don't lie. It does feel much better at coding than GPT-4o (can't trust benchmarks for it haha) and noticeably better than Opus. Don't underestimate "noticeably better" - it can make the difference between single-shot working code and non-working code with some hallucinations. I had some JAX code snippets that weren't working with Opus' help, but Sonnet 3.5 fixed them in one shot.

I frankly don't get why people were even using GPT-4o for code. I realised in the first 2-3 days of usage that it sucked at even mildly complex tasks, and I stuck to GPT-4/Opus.

Anyway, coming back to Sonnet: Nat Friedman tweeted that we may need new benchmarks because it scores 96.4% (0-shot chain of thought) on GSM8K (a grade-school math benchmark). (Nat, please hire me)

The GPQA improvement to 59.4% is noticeable. GPQA, or Graduate-Level Google-Proof Q&A Benchmark, is a challenging dataset of multiple-choice questions in physics, chemistry, and biology crafted by domain experts. It's difficult, basically. The diamond subset has 198 questions.

Vibe check

Knowledge

Underrated thing, but the data cutoff is April 2024. That means better coverage of recent events, music/movie recommendations, cutting-edge code documentation, and recent research papers. Let's gooooo.


Code generation

It was immediately clear to me that it was better at code. It's much faster at streaming too. Much less back-and-forth is required compared to GPT-4/GPT-4o. More accurate code than Opus. It does not get stuck like GPT-4o.

Screenshot of a tweet by Jeremy Howard

Source

Yohei (BabyAGI creator) remarked the same.

Teknium tried to make a prompt engineering tool and he was happy with Sonnet.

Update 25th June: It's SOTA (state of the art) on LmSys Arena. You can check here.

Laziness

So far, my observation has been that it can be lazy at times, or it doesn't understand what you are saying. This sucks. It almost feels like they are changing the quantisation of the model in the background. I have to start a new chat or give more specific, detailed prompts.

My mutual Tokenbender also noticed the same.

Sometimes you will notice silly errors on problems that require arithmetic/mathematical thinking (think data-structure and algorithm problems), much like GPT-4o. Try CoT here - "think step by step" - or give more detailed prompts.
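A minimal sketch of that workaround, if you're calling the model programmatically (the helper function is my own, not an official pattern):

```python
def with_cot(task: str) -> str:
    """Prepend an explicit chain-of-thought instruction to a task prompt."""
    return "Think step by step before giving your final answer.\n\n" + task


prompt = with_cot("What data structure gives O(1) amortized appends and O(1) random access?")
print(prompt.splitlines()[0])
```

In my experience this alone fixes a decent fraction of the silly arithmetic slips.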

Agentic capabilities

As pointed out by Alex here, Sonnet passed 64% of tests on their internal evals for agentic capabilities as compared to 38% for Opus.

Maybe next-gen models are gonna have agentic capabilities baked into the weights. RIP agent-based startups.

Cursor and Aider have both integrated Sonnet and report SOTA capabilities.

Tips and Tricks - "Make It Better" and "List of assumptions"

Several people have noticed that Sonnet 3.5 responds well to the "Make It Better" prompt for iteration.


Jeremy Howard mentioned another trick here: If you've got any favorite trick questions for LLMs, try it on Sonnet 3.5 with this in your prompt: "Before you answer, make a list of wrong assumptions people sometimes make about the concepts included in the question."
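Both tricks are just prompt prefixes/suffixes, so they're trivial to wrap in helpers. A sketch (the prompt strings come from the tweets above; the function names are my own):

```python
# Jeremy Howard's trick: make the model surface wrong assumptions first.
ASSUMPTIONS_PREFIX = (
    "Before you answer, make a list of wrong assumptions people sometimes "
    "make about the concepts included in the question.\n\n"
)


def assumption_check(question: str) -> str:
    """Prefix a trick question with the wrong-assumptions instruction."""
    return ASSUMPTIONS_PREFIX + question


def make_it_better(conversation: list[str]) -> list[str]:
    """The iteration trick: just ask the model to improve its last output."""
    return conversation + ["Make it better."]


print(assumption_check("What is the smallest integer whose square is between 15 and 30?"))
```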

Artifacts

Anthropic also released an Artifacts feature, which essentially lets you interact with code, long documents, and charts in a UI window on the right side. You talk with Sonnet on the left, and it carries on the work/code in the Artifacts window.


It's an excellent UX choice. It separates the flows for code and chat, and you can iterate between versions. It was so good that the DeepSeek people made an in-browser environment too.

There's also tooling for HTML, CSS, JS, TypeScript, and React. You can essentially write code and render the program in the UI itself. This further lowers the barrier for non-technical people. You can iterate and see results in real time in a UI window. I am never writing frontend code again for my side projects.

I tried making a simple portfolio for Sam Alternativeman. Link to sequence of prompts.

With the help of the creative coding library p5.js, I was able to make A* visualization, Hilbert curves, and Perlin noise using the Artifacts feature. Each took no more than five minutes.

I also made a visualization for Q-learning.

Alex Albert created an entire demo thread.

Criticisms

Simon Willison pointed out here that it's still hard to export the hidden dependencies that Artifacts uses. Hopefully Anthropic releases this soon.

Vision Test

They claim that Sonnet is their strongest vision model (and it is). I did the Frieren-eating-a-gigantic-burger vibe test. Left is Opus, right is Sonnet 3.5.

Opus performance
Sonnet performance

Sonnet 3.5 was correctly able to identify the hamburger.

Reasoning

Wow! 😮 claude-3.5 is an extremely impressive overall model! It achieves the top score in **every category**, and substantially improves in reasoning! See for yourself with our interactive leaderboard: https://t.co/F8tIK27ANm pic.twitter.com/KanapZmF5k

— Colin White (@crwhite_ml) June 20, 2024

Sonnet 3.5 is able to answer some questions and puzzles it wasn't able to solve earlier - like Nathan Lambert's question: what is DPO?


It was able to solve the question "What is the smallest integer whose square is between 15 and 30?" in one shot. Check the thread below for more discussion on the same.

There are still issues though - check this thread.


Update 25th June: Teortaxes pointed out that Sonnet 3.5 is not as good at instruction following. It still fails on tasks like counting the 'r's in "strawberry". Note that LLMs are known to perform poorly on this task due to the way tokenization works.

btw, people who bleat "every model is bad at character counting ackshually!": you're bad at taking the hint, about as bad as Sonnet here

This is not the "how many letters 'r' in the word 'strawberry" test, this is "will you honestly think about the task line by line" test

— Teortaxes▶️ (@teortaxesTex) June 24, 2024
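For reference, the character-level ground truth the model should reach if it honestly goes letter by letter instead of over tokens:

```python
# A model reasoning over characters rather than tokens should get this.
word = "strawberry"
print(word.count("r"))  # 3
```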

Personality

Sonnet 3.5 is very polite and sometimes feels like a yes-man (this can be a problem for complex tasks, so be careful). It honestly rizzed me up when I was proof-reading a previous blog post I wrote.

oh my god sonnet stop pic.twitter.com/ZTm9fFyZWK

— sankalp (@dejavucoder) June 21, 2024

Sonnet is SOTA on EQ-Bench too (which measures emotional intelligence and creativity) and 2nd on "Creative Writing". It could make for good therapist apps.


Here's a demonstration by Anthropic's Amanda Askell:

I asked Claude to write a poem from a personal perspective. I thought this part was surprisingly sad. pic.twitter.com/oCxsEg0g4z

— Amanda Askell (@AmandaAskell) June 22, 2024

Conclusion

This concludes my quick vibe-check post. The overall vibe check is positive. I am mostly happy I got a more intelligent SOTA code-gen buddy. This is close to AGI for me. Also, Sam Altman, can you please drop Voice Mode and GPT-5 soon?!