Title | Author | Date | Word count | HN points |
--- | --- | --- | --- | --- |
New observability features: activity logging, LLM metrics, a… | Suren Atoyan, Aaron Relph, Marius Killinger, Sid Shanker, Rachel Rapp | Dec 23, 2024 | 540 | - |
How we built production-ready speculative decoding with Tens… | Pankaj Gupta, Justin Yi, Philip Kiely | Dec 20, 2024 | 2729 | - |
Introducing our Speculative Decoding Engine Builder integrat… | Justin Yi, Abu Qader, Bryce Dubayah, Rachel Rapp | Dec 20, 2024 | 904 | - |
A quick introduction to speculative decoding | Pankaj Gupta, Justin Yi, Philip Kiely | Dec 20, 2024 | 1139 | - |
Generally Available: The fastest, most accurate, and cost-ef… | William Gao, Derrick Yang, Tianshu Cheng, Rachel Rapp | Dec 12, 2024 | 1145 | - |
Introducing Custom Servers: Deploy production-ready model se… | Tianshu Cheng, Bola Malek, Rachel Rapp | Dec 09, 2024 | 807 | - |
Create custom environments for deployments on Baseten | Samiksha Pal, Raymond Cano, Sid Shanker, Rachel Rapp | Nov 15, 2024 | 621 | - |
Introducing canary deployments on Baseten | Sid Shanker, Jonathan Rochette, Raymond Cano, Rachel Rapp | Nov 01, 2024 | 932 | - |
Evaluating NVIDIA H200 GPUs for LLM inference | Pankaj Gupta, Philip Kiely | Oct 23, 2024 | 1294 | - |
Export your model inference metrics to your favorite observa… | Helen Yang, Nicolas Gere-lamaysouette, Philip Kiely | Oct 05, 2024 | 493 | - |
Baseten partners with Google Cloud to deliver high-performan… | Mike Bilodeau, Rachel Rapp | Sep 26, 2024 | 688 | - |
Introducing Baseten Hybrid: control and flexibility in your … | Mike Bilodeau, Rachel Rapp | Sep 26, 2024 | 633 | - |
Building high-performance compound AI applications with Mong… | Philip Kiely | Sep 17, 2024 | 1425 | - |
Introducing function calling and structured output for open-… | Bryce Dubayah, Philip Kiely | Sep 12, 2024 | 604 | - |
How to build function calling and JSON mode for open-source … | Bryce Dubayah, Philip Kiely | Sep 12, 2024 | 1339 | 1 |
The best open-source image generation model | Philip Kiely | Aug 29, 2024 | 1409 | - |
How to double tokens per second for Llama 3 with Medusa | Abu Qader, Philip Kiely | Aug 20, 2024 | 1462 | 2 |
SPC hackathon winners build with Llama 3.1 on Baseten | Philip Kiely | Aug 16, 2024 | 615 | - |
Introducing Baseten Self-hosted | Anupreet Walia, Rachel Rapp | Aug 08, 2024 | 670 | - |
Compound AI systems explained | Rachel Rapp | Aug 06, 2024 | 1338 | - |
Introducing automatic LLM optimization with TensorRT-LLM Eng… | Abu Qader, Philip Kiely | Aug 01, 2024 | 939 | 2 |
Deploying custom ComfyUI workflows as APIs | Het Trivedi, Rachel Rapp | Jul 25, 2024 | 1144 | 1 |
Ten reasons to join Baseten | Dustin Michaels, Philip Kiely | Jul 25, 2024 | 1230 | - |
How to serve 10,000 fine-tuned LLMs from a single GPU | Pankaj Gupta, Philip Kiely | Jul 23, 2024 | 1895 | - |
Using asynchronous inference in production | Samiksha Pal, Helen Yang, Rachel Rapp | Jul 11, 2024 | 950 | - |
Baseten Chains explained: building multi-component AI workfl… | Marius Killinger, Rachel Rapp | Jul 02, 2024 | 2424 | - |
Introducing Baseten Chains | Bola Malek, Marius Killinger, Sid Shanker, Rachel Rapp, Mike Bilodeau | Jun 27, 2024 | 1132 | 9 |
Comparing few-step image generation models | Rachel Rapp | Jun 14, 2024 | 1087 | - |
How latent consistency models work | Rachel Rapp | Jun 04, 2024 | 1140 | - |
New in May 2024 | Baseten | Jun 03, 2024 | 598 | - |
What I learned as a forward-deployed engineer working at an … | Het Trivedi | May 31, 2024 | 1353 | - |
Control plane vs workload plane in model serving infrastruct… | Colin McGrath, Matt Howard, Philip Kiely | May 29, 2024 | 870 | - |
Comparing tokens per second across LLMs | Philip Kiely | May 09, 2024 | 769 | - |
New in April 2024 | Baseten | May 01, 2024 | 552 | - |
CI/CD for AI model deployments | Vlad Shulman, Samiksha Pal, Sid Shanker, Philip Kiely | Apr 30, 2024 | 914 | - |
Streaming real-time text to speech with XTTS V2 | Het Trivedi, Philip Kiely | Apr 18, 2024 | 1318 | - |
Continuous vs dynamic batching for AI inference | Matt Howard, Philip Kiely | Apr 05, 2024 | 1350 | - |
New in March 2024 | Baseten | Mar 28, 2024 | 553 | - |
Using fractional H100 GPUs for efficient model serving | Matt Howard, Vlad Shulman, Pankaj Gupta, Philip Kiely | Mar 28, 2024 | 1086 | - |
Benchmarking fast Mistral 7B inference | Abu Qader, Pankaj Gupta, Justin Yi, Philip Kiely | Mar 14, 2024 | 1571 | - |
33% faster LLM inference with FP8 quantization | Pankaj Gupta, Philip Kiely | Mar 14, 2024 | 1876 | - |
High performance ML inference with NVIDIA TensorRT | Justin Yi, Philip Kiely | Mar 12, 2024 | 1076 | - |
FP8: Efficient model inference with 8-bit floating point num… | Pankaj Gupta, Philip Kiely | Mar 07, 2024 | 1021 | 2 |
Announcing our Series B | Tuhin Srivastava | Mar 04, 2024 | 629 | 2 |
The benefits of globally distributed infrastructure for mode… | Phil Howes, Philip Kiely | Mar 01, 2024 | 603 | - |
New in February 2024 | Baseten | Feb 29, 2024 | 634 | - |
40% faster Stable Diffusion XL inference with NVIDIA TensorR… | Pankaj Gupta, Justin Yi, Philip Kiely | Feb 22, 2024 | 2403 | - |
Why GPU utilization matters for model inference | Marius Killinger, Philip Kiely | Feb 20, 2024 | 816 | - |
The best open source large language model | Philip Kiely | Feb 09, 2024 | 1920 | - |
Unlocking the full power of NVIDIA H100 GPUs for ML inferenc… | Pankaj Gupta, Philip Kiely | Feb 06, 2024 | 1623 | - |