Vibe-aware playlists: match the mood/context of the request #2

Open
opened 2026-02-22 18:54:50 +00:00 by antialias · 0 comments

Problem

Right now playlist generation just pulls the top-N tracks by cosine similarity to the overall taste profile. But "play me some music" at 7am making breakfast is a very different request from "play me some music" at 10pm unwinding. The system has no concept of vibe, mood, or context — it always generates the same style of playlist.

Scenarios

  • "Play something chill in the living room" → should favor mellow, ambient, downtempo
  • "Play something upbeat for a party" → should favor high-energy, danceable tracks
  • "Play something for focus/work" → should favor instrumental, minimal vocals
  • "Play me some new stuff" → should heavily weight discovery over known tracks
  • "Play my favorites" → should heavily weight known tracks, minimal discovery
  • "Play something like Danger Mouse" → should anchor recommendations around a specific artist's embedding rather than the global taste profile

Design questions

  • **CLAP is multimodal** — it supports text embeddings too. We could embed the user's request as text ("chill evening music") and use that as the query vector instead of (or blended with) the taste profile. This is potentially very powerful since CLAP was trained on audio-text pairs.
  • **Text-conditioned recommendations**: `similarity = α * cosine(track, taste_profile) + (1-α) * cosine(track, text_embedding)` where α controls the balance between personal taste and requested vibe.
  • **Genre/mood filtering** — simpler approach: tag tracks with mood/genre and filter before ranking. But this loses the nuance of embedding space.
  • **Time-of-day heuristics** — could automatically adjust vibe based on time, but this feels brittle and presumptuous.
  • **Context from speaker** — "Kitchen stereo" at dinner time vs "Study speaker" during work hours could inform defaults.
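The text-conditioned blend above can be sketched in plain NumPy. This is a minimal sketch assuming unit-normalized embeddings; the array names (`track_embs`, `taste_profile`, `text_emb`) are illustrative, not the project's actual identifiers:

```python
import numpy as np

def blended_scores(track_embs, taste_profile, text_emb, alpha=0.5):
    """score = alpha * cos(track, taste) + (1 - alpha) * cos(track, text).

    All vectors are assumed L2-normalized, so cosine similarity
    reduces to a dot product.
    """
    taste_sim = track_embs @ taste_profile  # cosine(track, taste_profile)
    text_sim = track_embs @ text_emb        # cosine(track, text_embedding)
    return alpha * taste_sim + (1 - alpha) * text_sim

# Toy example: 3 unit-norm tracks in a 2-D embedding space.
tracks = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
taste = np.array([1.0, 0.0])  # user's history leans toward "axis 0" music
vibe = np.array([0.0, 1.0])   # the text query points at "axis 1" music

scores = blended_scores(tracks, taste, vibe, alpha=0.5)
order = np.argsort(-scores)   # best match first → track 2 (splits the difference)
```

With α = 1 this degrades gracefully to today's taste-only ranking; with α = 0 it becomes pure text-to-audio retrieval.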

Key insight

CLAP's text encoder is the killer feature here. We already have the model loaded for audio embedding. Using `model.get_text_features()` to embed the user's request and blending it with the taste profile could give us vibe-aware recommendations with zero additional infrastructure. The playlist generation endpoint already accepts parameters — we just need a `vibe` or `query` string parameter.
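The endpoint change could look roughly like this. It's a hypothetical sketch: `embed_text` is a deterministic stub standing in for CLAP's `model.get_text_features()` (so the snippet runs without loading the model), and `generate_playlist`, its parameters, and the blend weight are illustrative names, not the project's actual API:

```python
import numpy as np

def embed_text(vibe: str, dim: int = 8) -> np.ndarray:
    """Stub for the CLAP text encoder (model.get_text_features).

    Seeded from the request text so the sketch is runnable and repeatable;
    the real call would embed the text into the same space as the tracks.
    """
    rng = np.random.default_rng(sum(ord(c) for c in vibe))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def generate_playlist(track_embs, taste_profile, vibe=None, alpha=0.5, n=10):
    """Return top-n track indices; with a vibe, blend taste and text similarity."""
    taste_sim = track_embs @ taste_profile
    if vibe is None:
        scores = taste_sim  # today's behavior: taste-only ranking
    else:
        text_sim = track_embs @ embed_text(vibe, dim=track_embs.shape[1])
        scores = alpha * taste_sim + (1 - alpha) * text_sim
    return np.argsort(-scores)[:n]
```

Because `vibe=None` reproduces the current taste-only ranking, the new parameter is strictly additive — existing callers are unaffected.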

Impact

This is what makes the system feel magical rather than mechanical. "Play me something chill" should actually play chill music, not just "music you've listened to before, sorted by similarity."


Reference: antialias/haunt-fm#2