64 blog posts published since the start of 2024, broken down by month below.

Posts year-to-date: 10 (20 posts by this month last year)
Average posts per month since 2024: 4.0 (64 posts over the 16 months from January 2024 through April 2025)
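These summary numbers follow directly from the publication dates in the table below. As a minimal sketch of the arithmetic, assuming a `dates` list transcribed from the table (truncated here for brevity) and an assumed snapshot date of April 11, 2025:

```python
from collections import Counter
from datetime import date

# Illustrative input: one date per post, transcribed from the table below
# (truncated here; the full list has all 64 entries).
dates = [
    date(2024, 1, 12),
    date(2024, 1, 31),
    date(2025, 4, 11),
]
today = date(2025, 4, 11)  # assumed snapshot date for these stats

# Posts year-to-date: everything published in the current calendar year.
ytd = sum(1 for d in dates if d.year == today.year)

# "By this month last year": last year's posts through the same month.
last_year_ytd = sum(
    1 for d in dates if d.year == today.year - 1 and d.month <= today.month
)

# Average posts per month since January 2024, counting the current month.
months_elapsed = (today.year - 2024) * 12 + today.month
avg_per_month = round(len(dates) / months_elapsed, 1)

# Monthly breakdown ("posts published by month").
by_month = Counter(d.strftime("%Y-%m") for d in dates)

print(ytd, last_year_ytd, avg_per_month)
```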

Post details (2024 to today)

| Title | Author(s) | Date | Word count | HN points |
|---|---|---|---|---|
| Introducing Baseten Self-hosted | Anupreet Walia, Rachel Rapp | Aug 08, 2024 | 670 | - |
| How to benchmark image generation models like Stable Diffusion XL | Philip Kiely | Jan 31, 2024 | 1374 | - |
| Comparing tokens per second across LLMs | Philip Kiely | May 09, 2024 | 769 | - |
| How latent consistency models work | Rachel Rapp | Jun 04, 2024 | 1140 | - |
| Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT | Pankaj Gupta, Philip Kiely | Feb 06, 2024 | 1623 | - |
| New in February 2024 | Baseten | Feb 29, 2024 | 634 | - |
| How to serve 10,000 fine-tuned LLMs from a single GPU | Pankaj Gupta, Philip Kiely | Jul 23, 2024 | 1895 | - |
| Streaming real-time text to speech with XTTS V2 | Het Trivedi, Philip Kiely | Apr 18, 2024 | 1318 | - |
| Continuous vs dynamic batching for AI inference | Matt Howard, Philip Kiely | Apr 05, 2024 | 1350 | - |
| High performance ML inference with NVIDIA TensorRT | Justin Yi, Philip Kiely | Mar 12, 2024 | 1076 | - |
| FP8: Efficient model inference with 8-bit floating point numbers | Pankaj Gupta, Philip Kiely | Mar 07, 2024 | 1021 | 2 |
| The best open source large language model | Philip Kiely | Feb 09, 2024 | 1920 | - |
| New in January 2024 | Baseten | Jan 31, 2024 | 580 | - |
| Using fractional H100 GPUs for efficient model serving | Matt Howard, Vlad Shulman, Pankaj Gupta, Philip Kiely | Mar 28, 2024 | 1086 | - |
| 40% faster Stable Diffusion XL inference with NVIDIA TensorRT | Pankaj Gupta, Justin Yi, Philip Kiely | Feb 22, 2024 | 2403 | - |
| Ten reasons to join Baseten | Dustin Michaels, Philip Kiely | Jul 25, 2024 | 1230 | - |
| Why GPU utilization matters for model inference | Marius Killinger, Philip Kiely | Feb 20, 2024 | 816 | - |
| New in March 2024 | Baseten | Mar 28, 2024 | 553 | - |
| Compound AI systems explained | Rachel Rapp | Aug 06, 2024 | 1338 | - |
| What I learned as a forward-deployed engineer working at an AI startup | Het Trivedi | May 31, 2024 | 1353 | - |
| Introducing Baseten Chains | Bola Malek, Marius Killinger, Sid Shanker, Rachel Rapp, Mike Bilodeau | Jun 27, 2024 | 1132 | 9 |
| The benefits of globally distributed infrastructure for model serving | Phil Howes, Philip Kiely | Mar 01, 2024 | 603 | - |
| 33% faster LLM inference with FP8 quantization | Pankaj Gupta, Philip Kiely | Mar 14, 2024 | 1876 | - |
| Using asynchronous inference in production | Samiksha Pal, Helen Yang, Rachel Rapp | Jul 11, 2024 | 950 | - |
| Introduction to quantizing ML models | Abu Qader, Philip Kiely | Jan 31, 2024 | 1679 | 1 |
| New in April 2024 | Baseten | May 01, 2024 | 552 | - |
| Benchmarking fast Mistral 7B inference | Abu Qader, Pankaj Gupta, Justin Yi, Philip Kiely | Mar 14, 2024 | 1571 | - |
| SPC hackathon winners build with Llama 3.1 on Baseten | Philip Kiely | Aug 16, 2024 | 615 | - |
| Understanding performance benchmarks for LLM inference | Philip Kiely | Jan 12, 2024 | 1459 | - |
| Comparing few-step image generation models | Rachel Rapp | Jun 14, 2024 | 1087 | - |
| Introducing automatic LLM optimization with TensorRT-LLM Engine Builder | Abu Qader, Philip Kiely | Aug 01, 2024 | 939 | 2 |
| Deploying custom ComfyUI workflows as APIs | Het Trivedi, Rachel Rapp | Jul 25, 2024 | 1144 | 1 |
| New in May 2024 | Baseten | Jun 03, 2024 | 598 | - |
| CI/CD for AI model deployments | Vlad Shulman, Samiksha Pal, Sid Shanker, Philip Kiely | Apr 30, 2024 | 914 | - |
| Announcing our Series B | Tuhin Srivastava | Mar 04, 2024 | 629 | 2 |
| Control plane vs workload plane in model serving infrastructure | Colin McGrath, Matt Howard, Philip Kiely | May 29, 2024 | 870 | - |
| Baseten Chains explained: building multi-component AI workflows at scale | Marius Killinger, Rachel Rapp | Jul 02, 2024 | 2424 | - |
| How to double tokens per second for Llama 3 with Medusa | Abu Qader, Philip Kiely | Aug 20, 2024 | 1462 | 2 |
| The best open-source image generation model | Philip Kiely | Aug 29, 2024 | 1409 | - |
| How to build function calling and JSON mode for open-source and fine-tuned LLMs | Bryce Dubayah, Philip Kiely | Sep 12, 2024 | 1339 | 1 |
| Introducing function calling and structured output for open-source and fine-tuned LLMs | Bryce Dubayah, Philip Kiely | Sep 12, 2024 | 604 | - |
| Building high-performance compound AI applications with MongoDB Atlas and Baseten | Philip Kiely | Sep 17, 2024 | 1425 | - |
| Introducing Baseten Hybrid: control and flexibility in your cloud and ours | Mike Bilodeau, Rachel Rapp | Sep 26, 2024 | 633 | - |
| Baseten partners with Google Cloud to deliver high-performance AI infrastructure to a broader audience | Mike Bilodeau, Rachel Rapp | Sep 26, 2024 | 688 | - |
| Export your model inference metrics to your favorite observability tool | Helen Yang, Nicolas Gere-lamaysouette, Philip Kiely | Oct 05, 2024 | 493 | - |
| Evaluating NVIDIA H200 GPUs for LLM inference | Pankaj Gupta, Philip Kiely | Oct 23, 2024 | 1294 | - |
| Introducing canary deployments on Baseten | Sid Shanker, Jonathan Rochette, Raymond Cano, Rachel Rapp | Nov 01, 2024 | 932 | - |
| Create custom environments for deployments on Baseten | Samiksha Pal, Raymond Cano, Sid Shanker, Rachel Rapp | Nov 15, 2024 | 621 | - |
| Introducing Custom Servers: Deploy production-ready model servers from Docker images | Tianshu Cheng, Bola Malek, Rachel Rapp | Dec 09, 2024 | 807 | - |
| Generally Available: The fastest, most accurate, and cost-efficient Whisper transcription | William Gao, Derrick Yang, Tianshu Cheng, Rachel Rapp | Dec 12, 2024 | 1145 | - |
| A quick introduction to speculative decoding | Pankaj Gupta, Justin Yi, Philip Kiely | Dec 20, 2024 | 1139 | - |
| Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference | Justin Yi, Abu Qader, Bryce Dubayah, Rachel Rapp | Dec 20, 2024 | 904 | - |
| How we built production-ready speculative decoding with TensorRT-LLM | Pankaj Gupta, Justin Yi, Philip Kiely | Dec 20, 2024 | 2729 | - |
| New observability features: activity logging, LLM metrics, and metrics dashboard customization | Suren Atoyan, Aaron Relph, Marius Killinger, Sid Shanker, Rachel Rapp | Dec 23, 2024 | 540 | - |
| Driving model performance optimization: 2024 highlights | Pankaj Gupta | Jan 14, 2025 | 1530 | - |
| Private, secure DeepSeek-R1 in production in US & EU data centers | Amir Haghighat, Philip Kiely | Feb 11, 2025 | 1274 | - |
| Testing Llama 3.3 70B inference performance on NVIDIA GH200 in Lambda Cloud | Pankaj Gupta, Philip Kiely | Feb 11, 2025 | 1033 | - |
| Baseten Chains is now GA for production compound AI systems | Marius Killinger, Tyron Jung, Rachel Rapp | Feb 12, 2025 | 1123 | - |
| How multi-node inference works for massive LLMs like DeepSeek-R1 | Phil Howes, Philip Kiely | Feb 15, 2025 | 1303 | - |
| Announcing Baseten’s $75M Series C | Tuhin Srivastava | Feb 26, 2025 | 739 | - |
| How we built high-throughput embedding, reranker, and classifier inference with TensorRT-LLM | Michael Feil, Philip Kiely | Mar 28, 2025 | 2035 | - |
| Introducing Baseten Embeddings Inference: The fastest embeddings solution available | Michael Feil, Rachel Rapp | Mar 28, 2025 | 782 | - |
| The best open-source embedding models | Philip Kiely | Apr 07, 2025 | 1254 | - |
| Building performant embedding workflows with Chroma and Baseten | Philip Kiely | Apr 11, 2025 | 570 | - |
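The monthly breakdown can be recomputed directly from the table above. Here is one possible sketch; the `posts.txt` filename and the assumption that the pipe-delimited table is saved verbatim to that file are illustrative, not part of the original data:

```python
from collections import Counter
from datetime import datetime

# Assumed setup: the pipe-delimited table above, saved verbatim as posts.txt.
with open("posts.txt") as f:
    rows = [line.split("|") for line in f if line.count("|") >= 4]

# Drop the header and separator rows; column 3 holds the publication date.
data_rows = [r for r in rows if r[3].strip() not in ("Date", "---")]

# Count posts per calendar month.
by_month = Counter(
    datetime.strptime(r[3].strip(), "%b %d, %Y").strftime("%Y-%m")
    for r in data_rows
)

for month, count in sorted(by_month.items()):
    print(f"{month}: {count}")
```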