Noise2Music

Anonymous

Abstract

We introduce Noise2Music, in which a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models are trained and applied in succession to generate high-fidelity music: a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text.

We explore two options for the intermediate representation: a spectrogram and lower-fidelity audio. We find that the generated audio not only faithfully reflects key elements of the text prompt, such as genre, tempo, instruments, mood, and era, but also goes beyond them to ground fine-grained semantics of the prompt. Pretrained large language models play a key role: they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
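The data flow of the two-stage cascade described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: `embed_text`, `toy_sampler`, `generator`, `cascader`, and the sizes `INTERMEDIATE_LEN` and `OUTPUT_LEN` are all hypothetical names and values, and `toy_sampler` is a trivial stand-in that starts from noise and refines it toward a conditioning-dependent target rather than a real diffusion sampler.

```python
import hashlib
import random
from typing import List, Optional

# Hypothetical toy sizes; the real representations are far larger.
INTERMEDIATE_LEN = 8   # length of the intermediate representation
OUTPUT_LEN = 32        # length of the final "audio" signal


def embed_text(prompt: str, dim: int = 4) -> List[float]:
    """Stand-in for a pretrained language-model text encoder (assumption)."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def toy_sampler(target: List[float], steps: int, rng: random.Random) -> List[float]:
    """Placeholder for a diffusion sampler: start from Gaussian noise and
    iteratively move toward a target derived from the conditioning."""
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        x = [0.5 * xi + 0.5 * ti for xi, ti in zip(x, target)]
    return x


def generator(text_emb: List[float], rng: random.Random, steps: int = 10) -> List[float]:
    """Stage 1: text embedding -> intermediate representation."""
    target = [text_emb[i % len(text_emb)] for i in range(INTERMEDIATE_LEN)]
    return toy_sampler(target, steps, rng)


def cascader(intermediate: List[float], rng: random.Random,
             text_emb: Optional[List[float]] = None, steps: int = 10) -> List[float]:
    """Stage 2: intermediate representation (and optionally text) -> audio."""
    factor = OUTPUT_LEN // len(intermediate)
    # Nearest-neighbour upsampling of the intermediate representation.
    target = [v for v in intermediate for _ in range(factor)]
    if text_emb is not None:
        # Optional text conditioning, modelled here as a small offset.
        bias = sum(text_emb) / len(text_emb)
        target = [t + 0.1 * bias for t in target]
    return toy_sampler(target, steps, rng)


def generate(prompt: str, seed: int = 0) -> List[float]:
    """Run the two stages in succession for one text prompt."""
    rng = random.Random(seed)
    text_emb = embed_text(prompt)
    intermediate = generator(text_emb, rng)
    return cascader(intermediate, rng, text_emb=text_emb)


audio = generate("Uplifting orchestral music.")
print(len(audio))
```

The point of the sketch is the ordering: the text embedding conditions the generator, and the generator's output (optionally together with the same embedding) conditions the cascader.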

Table 1

Cherry-picked examples of 30-second music clips generated from text prompts.
Index | Prompt | Waveform Model | Spectrogram Model
0 | A female vocalist sings this upbeat Latin pop. The song has an upbeat rhythm with a dance groove. The drumming is lively, the percussion instruments add layers and density to the music, the bass line is simple and steady, the keyboard accompaniment adds a nice melody.
1 | The song fits the hip-hop/pop genre from the early 2000's.
2 | This is a psychedelic rock music piece. It could also be playing in the background at a hippie coffee shop.
3 | The piece has a sensual atmosphere. The rhythmic background is provided by a mild acoustic drum beat with heavy use of latin percussive elements.
4 | The bass drums are overloading the speakers of the recording. This audio contains a lot of people playing a complex snare roll groove in synchronicity along with people playing bass drums.
5 | This audio contains acoustic drums playing a groove with a lot of cymbal hits.
6 | The sampled drums go for a usual hip-hop beat, nothing standing out and together with the sub-woofer bass drive the pulse of the music.
7 | The trumpets play a blaring descant, and other trumpets play a percussive harmonic layer long with a tuba playing the lower register.
8 | A choir sings in crescendo with a slow marching rise from harmonic instruments. The song is a vintage pop classic and sounds like fanfare.
9 | It is captivating, intense, mellifluous, engaging, and fervent. This music is an enthralling Sitar instrumental.
10 | Solo latin percussion music featuring congas, cowbells, and timbales, performing complex rhythms.
11 | A darbuka plays a simple beat. A variety of middle-eastern percussion instruments are played in the background.
12 | Someone is playing a rock/funk riff on an e-guitar with a lot of low end. This song may be playing during a guitar lesson.
13 | The song is soulful and heavily inspired by black gospel music. A female singer with a soulful voice sings this heartfelt melody.
14 | The singer sings in a way that is calm and mellow, despite the message of the song suggesting that she is pleading for something. The song is a calm soulful R&B song, which has neo soul elements. The song has a slow jam style to it, and is emotional and romantic.
15 | The drums feature a light accompaniment, the piano has small interventions here and there. The jazz organ plays in low volume somewhere in the background. The atmosphere is like a dim light in a bar late at night before closing hours when everybody has left home.
16 | This is an emotional pop ballad containing a female vocal that's raspy, emotional, raw and gritty.
17 | A male vocalist sings this mellow melody. This song is Latin Pop. The tempo is slow with keyboard harmony, clave rhythm, steady bass line, simple drumming and latin percussion instruments.
18 | There's a sustained silky synth sound. The singer sounds cheerful. The song is a funky, soulful disco song.
19 | The dance-pop music features a male voice singing a repetitive melody. The music incites the listener to dance.
20 | This song is modern Latin Pop. The tempo is medium with acoustic guitar rhythm, mandolin harmony and a bright and lively and tuba accompaniment.
21 | There is a string orchestra playing an ominous tune that is full of suspense. This piece could be used in the soundtrack of a horror movie, especially during the scenes where a character is walking through a dangerous zone.
22 | There is a fuzzy synth bass playing a groovy bass line with a mellow sounding keyboard playing alongside it. There is a loud electronic drum beat in the rhythmic background.
23 | It sounds energetic and like something you would hear in clubs.
24 | The singer has a smooth voice and a mellow style of delivery that borders on seductive and charming. The guitarist rapidly plays staccato based licks on the guitar, and the drummer plays a soulful jazzy groove. The song is a soulful, upbeat and groovy R&B song.
25 | This is a live performance of a southern rock piece. This piece could be playing in the background at a rock bar.
26 | It's an emotive and cinematic piece. This is a sextet of french horn players. The piece sounds inspiring and uplifting and would be fitting for a movie soundtrack in a moment where the lead character has just accomplished something great or saved the day.
27 | The song is a Latin tune with new age vibes. The music is slow tempo with beautiful acoustic guitar accompaniment. The song is inspiring with a story telling element.
28 | The snare is struck at every third count.
29 | The Regional Mexican song features solo flute melody over wooden percussive elements, groovy piano melody and groovy bass. It sounds fun, happy and it is uplifting and energetic - like something you would dance to in some latin bar.
30 | The music is youthful, groovy, pulsating, electrifying, buoyant, thumping, psychedelic, trance like and trippy.
31 | There's a crunchy and funky electric guitar being used to play chords in a funky rhythm. The song is a dance-pop song with elements of funk, disco and pop rock. The bassline is groovy and upbeat and the drumming is also centred around eighth notes and a disco groove.
32 | This is a groovy reggae song with a good vibe for dancing. The electric guitar stabs are on the off-beats and help create a bounce to the track. The vocalist is relaxed and there is an echo effect applied to her vocal.
33 | This song is latin contemporary pop/hip-hop. The chaotic chatter in the background and reverb make this latin song unconventional. A male vocalist sings this energetic latin hip-hop.
34 | It's a contemporary R&B song with a slow jams vibe to it. The song is a smooth, soulful, slow love song. It feels sexy, sensual and romantic and would be suitable for an intimate night with a partner.
35 | The music features a doubled male voice singing. In the background a wooden percussion instrument can be heard, similar to the claves in sound. The drums play a light rhythm and, together with the bass guitar, hold the groove of the music.
36 | Live, amateur recording of record scratching over an instrumental old-school hip hop beat.
37 | It's a hip hop song, but it heavily draws from R&B and soul. The song is soulful, smooth and melodic.
38 | There is a simple tune being played on a ukulele. The atmosphere of the piece is easygoing. This piece could be used in the background of wholesome social media content.
39 | This is an electronic/downtempo house music piece.
40 | The song is a retro pop favourite for all ages. Male children singing this cheerful vocal harmony. The song is medium tempo with a groovy bass line, guitar accompaniment and enthusiastic drumming rhythm.
41 | The song is groovy and energetic. The song is medium tempo with an iconic bass groove, steady drumming, keyboard string section harmony and shakers playing percussively. The song is a classic pop instrumental section of a song.
42 | The singers sing in harmony on the chorus, where all the warm and bright instruments kick in. The song is a vibrant pop, medium tempo song with a warm tone to it. It feels like a summer song, as it has a fun and carefree vibe to it.
43 | The song is a modern pop song with hip hop influences. A female singer sings this soulful melody with backup singers in vocal harmony. The tempo is medium, with slick drumming rhythm, groovy bass line and ethereal harmony tones.
44 | This piece could be used in the soundtrack of a drama movie during a scene of serenity or mourning. There is no singer.
45 | This song contains someone playing a hybrid of a guitar/harp.
46 | This is an Ethiopian traditional music piece. There are traditional Ethiopian instruments such as washing and masenqo in the melodic background.
47 | The song is slow tempo with a vocal four part harmony created by a choral section. The music is highly relaxing and pleasing.
48 | Quirky, cheeky, jazzy music featuring a lead trombone melody, a triangle playing a swing pattern, a cheesy organ and a medium tempo.
49 | The song is a very popular song cover and movie soundtrack. The song sounds very operatic with a slow tempo.

Table 2

Ablation study on music attributes.
We compare audio examples generated from variations of a prompt, each formed by changing one musical aspect of the prompt (genre, instrument, tempo, era, mood, activity, or vocal traits).
Ablation aspect | Prompt | Waveform Model | Spectrogram Model
genre 0 | Moody, melancholy medium-tempo standard jazz song that is good for late-night listening.
genre 0 | Moody, melancholy medium-tempo fusion jazz song that is good for late-night listening.
genre 0 | Moody, melancholy medium-tempo blues song that is good for late-night listening.
genre 1 | Classical music to listen to while focusing on homework.
genre 1 | EDM music to listen to while focusing on homework.
genre 2 | Classical opera music with lead male vocals.
genre 2 | Indian pop music with lead male vocals.
genre 2 | Reggae music with lead male vocals.
instrument 0 | A romantic love song played by a band with a lead piano.
instrument 0 | A romantic love song played by a band with a lead saxophone.
instrument 1 | A sad song played by a symphony orchestra.
instrument 1 | A sad song played by a solo pianist.
instrument 1 | A sad song played by a solo acoustic guitarist.
tempo 0 | Slow-paced progressive rock instrumental video game music with electric guitars, keyboards, bass and drums.
tempo 0 | Rapidly-paced progressive rock instrumental video game music with electric guitars, keyboards, bass and drums.
tempo 1 | Fast electronic dance music with a cool and chic vibe.
tempo 1 | Up-beat electronic dance music with a cool and chic vibe.
tempo 1 | Slow electronic dance music with a cool and chic vibe.
mood 0 | Funky, bright and happy pop song with smooth, soulful male vocals.
mood 0 | Romantic pop song with smooth, soulful male vocals.
mood 1 | Uplifting orchestral music.
mood 1 | Suspenseful orchestral music.
vocal traits 0 | Funky, bright and happy pop song with smooth, soulful male vocals.
vocal traits 0 | Funky, bright and happy pop song with smooth, soulful female vocals.
vocal traits 1 | A mid-tempo R&B song featuring a catchy hook with smooth male vocals.
vocal traits 1 | A mid-tempo R&B song featuring a catchy hook with smooth female vocals.
vocal traits 2 | A melancholy country song with strong, powerful female vocals.
vocal traits 2 | A melancholy country song with soft female vocals.
era 0 | Song that sounds like rock music from the 70's, with bass, drum, guitar and vocals.
era 0 | Song that sounds like rock music from the 80's, with bass, drum, guitar and vocals.
era 0 | Song that sounds like rock music from the 90's, with bass, drum, guitar and vocals.
era 0 | Song that sounds like rock music from the 2000's, with bass, drum, guitar and vocals.
era 1 | Song that sounds like pop dance music from the 70's, with female vocals.
era 1 | Song that sounds like pop dance music from the 80's, with female vocals.
era 1 | Song that sounds like pop dance music from the 90's, with female vocals.
era 1 | Song that sounds like pop dance music from the 2000's, with female vocals.
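
Prompt variations of the kind listed in Table 2 can be produced by holding a prompt fixed and swapping a single aspect. The sketch below is one possible way to do this, assuming a simple string template; the helper `make_variations` and the template mechanism are hypothetical, and the prompts in the table may well have been written by hand.

```python
from typing import List


def make_variations(template: str, aspect: str, values: List[str]) -> List[str]:
    """Fill one named placeholder in the template with each value in turn,
    leaving the rest of the prompt unchanged."""
    return [template.format(**{aspect: value}) for value in values]


# The "era 0" group from Table 2, expressed as a template with one slot.
TEMPLATE = "Song that sounds like rock music from the {era}, with bass, drum, guitar and vocals."
ERAS = ["70's", "80's", "90's", "2000's"]

prompts = make_variations(TEMPLATE, "era", ERAS)
for prompt in prompts:
    print(prompt)
```

Because only the ablated aspect changes between prompts, differences between the generated clips in a group can be attributed to that aspect.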