54 blog posts published by month since the start of 2024.

Posts year-to-date: 0 (4 posts by this month last year)
Average posts per month since 2024: 2.3

Post details (2024 to today)

| Title | Author | Date | Word count | HN points |
| --- | --- | --- | --- | --- |
| Understanding performance benchmarks for LLM inference | Philip Kiely | Jan 12, 2024 | 1459 | - |
| How to benchmark image generation models like Stable Diffusion XL | Philip Kiely | Jan 31, 2024 | 1374 | - |
| New in January 2024 | Baseten | Jan 31, 2024 | 580 | - |
| Introduction to quantizing ML models | Abu Qader, Philip Kiely | Jan 31, 2024 | 1679 | 1 |
| Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT | Pankaj Gupta, Philip Kiely | Feb 06, 2024 | 1623 | - |
| The best open source large language model | Philip Kiely | Feb 09, 2024 | 1920 | - |
| Why GPU utilization matters for model inference | Marius Killinger, Philip Kiely | Feb 20, 2024 | 816 | - |
| 40% faster Stable Diffusion XL inference with NVIDIA TensorRT | Pankaj Gupta, Justin Yi, Philip Kiely | Feb 22, 2024 | 2403 | - |
| New in February 2024 | Baseten | Feb 29, 2024 | 634 | - |
| The benefits of globally distributed infrastructure for model serving | Phil Howes, Philip Kiely | Mar 01, 2024 | 603 | - |
| Announcing our Series B | Tuhin Srivastava | Mar 04, 2024 | 629 | 2 |
| FP8: Efficient model inference with 8-bit floating point numbers | Pankaj Gupta, Philip Kiely | Mar 07, 2024 | 1021 | 2 |
| High performance ML inference with NVIDIA TensorRT | Justin Yi, Philip Kiely | Mar 12, 2024 | 1076 | - |
| 33% faster LLM inference with FP8 quantization | Pankaj Gupta, Philip Kiely | Mar 14, 2024 | 1876 | - |
| Benchmarking fast Mistral 7B inference | Abu Qader, Pankaj Gupta, Justin Yi, Philip Kiely | Mar 14, 2024 | 1571 | - |
| Using fractional H100 GPUs for efficient model serving | Matt Howard, Vlad Shulman, Pankaj Gupta, Philip Kiely | Mar 28, 2024 | 1086 | - |
| New in March 2024 | Baseten | Mar 28, 2024 | 553 | - |
| Continuous vs dynamic batching for AI inference | Matt Howard, Philip Kiely | Apr 05, 2024 | 1350 | - |
| Streaming real-time text to speech with XTTS V2 | Het Trivedi, Philip Kiely | Apr 18, 2024 | 1318 | - |
| CI/CD for AI model deployments | Vlad Shulman, Samiksha Pal, Sid Shanker, Philip Kiely | Apr 30, 2024 | 914 | - |
| New in April 2024 | Baseten | May 01, 2024 | 552 | - |
| Comparing tokens per second across LLMs | Philip Kiely | May 09, 2024 | 769 | - |
| Control plane vs workload plane in model serving infrastructure | Colin McGrath, Matt Howard, Philip Kiely | May 29, 2024 | 870 | - |
| What I learned as a forward-deployed engineer working at an AI startup | Het Trivedi | May 31, 2024 | 1353 | - |
| New in May 2024 | Baseten | Jun 03, 2024 | 598 | - |
| How latent consistency models work | Rachel Rapp | Jun 04, 2024 | 1140 | - |
| Comparing few-step image generation models | Rachel Rapp | Jun 14, 2024 | 1087 | - |
| Introducing Baseten Chains | Bola Malek, Marius Killinger, Sid Shanker, Rachel Rapp, Mike Bilodeau | Jun 27, 2024 | 1132 | 9 |
| Baseten Chains explained: building multi-component AI workflows at scale | Marius Killinger, Rachel Rapp | Jul 02, 2024 | 2424 | - |
| Using asynchronous inference in production | Samiksha Pal, Helen Yang, Rachel Rapp | Jul 11, 2024 | 950 | - |
| How to serve 10,000 fine-tuned LLMs from a single GPU | Pankaj Gupta, Philip Kiely | Jul 23, 2024 | 1895 | - |
| Ten reasons to join Baseten | Dustin Michaels, Philip Kiely | Jul 25, 2024 | 1230 | - |
| Deploying custom ComfyUI workflows as APIs | Het Trivedi, Rachel Rapp | Jul 25, 2024 | 1144 | 1 |
| Introducing automatic LLM optimization with TensorRT-LLM Engine Builder | Abu Qader, Philip Kiely | Aug 01, 2024 | 939 | 2 |
| Compound AI systems explained | Rachel Rapp | Aug 06, 2024 | 1338 | - |
| Introducing Baseten Self-hosted | Anupreet Walia, Rachel Rapp | Aug 08, 2024 | 670 | - |
| SPC hackathon winners build with Llama 3.1 on Baseten | Philip Kiely | Aug 16, 2024 | 615 | - |
| How to double tokens per second for Llama 3 with Medusa | Abu Qader, Philip Kiely | Aug 20, 2024 | 1462 | 2 |
| The best open-source image generation model | Philip Kiely | Aug 29, 2024 | 1409 | - |
| How to build function calling and JSON mode for open-source and fine-tuned LLMs | Bryce Dubayah, Philip Kiely | Sep 12, 2024 | 1339 | 1 |
| Introducing function calling and structured output for open-source and fine-tuned LLMs | Bryce Dubayah, Philip Kiely | Sep 12, 2024 | 604 | - |
| Building high-performance compound AI applications with MongoDB Atlas and Baseten | Philip Kiely | Sep 17, 2024 | 1425 | - |
| Introducing Baseten Hybrid: control and flexibility in your cloud and ours | Mike Bilodeau, Rachel Rapp | Sep 26, 2024 | 633 | - |
| Baseten partners with Google Cloud to deliver high-performance AI infrastructure to a broader audience | Mike Bilodeau, Rachel Rapp | Sep 26, 2024 | 688 | - |
| Export your model inference metrics to your favorite observability tool | Helen Yang, Nicolas Gere-lamaysouette, Philip Kiely | Oct 05, 2024 | 493 | - |
| Evaluating NVIDIA H200 GPUs for LLM inference | Pankaj Gupta, Philip Kiely | Oct 23, 2024 | 1294 | - |
| Introducing canary deployments on Baseten | Sid Shanker, Jonathan Rochette, Raymond Cano, Rachel Rapp | Nov 01, 2024 | 932 | - |
| Create custom environments for deployments on Baseten | Samiksha Pal, Raymond Cano, Sid Shanker, Rachel Rapp | Nov 15, 2024 | 621 | - |
| Introducing Custom Servers: Deploy production-ready model servers from Docker images | Tianshu Cheng, Bola Malek, Rachel Rapp | Dec 09, 2024 | 807 | - |
| Generally Available: The fastest, most accurate, and cost-efficient Whisper transcription | William Gao, Derrick Yang, Tianshu Cheng, Rachel Rapp | Dec 12, 2024 | 1145 | - |
| A quick introduction to speculative decoding | Pankaj Gupta, Justin Yi, Philip Kiely | Dec 20, 2024 | 1139 | - |
| Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference | Justin Yi, Abu Qader, Bryce Dubayah, Rachel Rapp | Dec 20, 2024 | 904 | - |
| How we built production-ready speculative decoding with TensorRT-LLM | Pankaj Gupta, Justin Yi, Philip Kiely | Dec 20, 2024 | 2729 | - |
| New observability features: activity logging, LLM metrics, and metrics dashboard customization | Suren Atoyan, Aaron Relph, Marius Killinger, Sid Shanker, Rachel Rapp | Dec 23, 2024 | 540 | - |
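To slice this table further (posts per author, words per month, and so on), the rows are easy to parse. A minimal sketch, assuming the pipe-delimited layout above; `parse_row` is an illustrative helper, not part of any published tooling, and the "-" sentinel in the HN points column is read as missing.

```python
from datetime import datetime

# One data row from the table above, in its pipe-delimited form.
row = ("| FP8: Efficient model inference with 8-bit floating point numbers "
       "| Pankaj Gupta, Philip Kiely | Mar 07, 2024 | 1021 | 2 |")

def parse_row(line: str) -> dict:
    # Split on the pipes and drop the empty cells outside the outer delimiters.
    title, authors, date_str, words, hn = [c.strip() for c in line.split("|")[1:-1]]
    return {
        "title": title,
        "authors": [a.strip() for a in authors.split(",")],
        "date": datetime.strptime(date_str, "%b %d, %Y").date(),
        "word_count": int(words),
        # "-" marks posts with no Hacker News score recorded.
        "hn_points": None if hn == "-" else int(hn),
    }

print(parse_row(row))
```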