Yes, we're starting with the transcription task. CNNs for local prediction are interesting, and we're also curious about capturing the temporal structure of music with something recurrent. It seems like a time series model that understands something about Western music should help with music transcription, just as language models help with speech transcription.

The style transfer stuff comes later, and as you observe, we'll probably need some new ideas to make that work well. I haven't thought about this deeply yet, but my intuition is that instrumental timbre is an audio analog of visual texture, so a reasonably direct "port" of style transfer to the audio domain might let us construct demos that, for example, rewrite a cello recording to sound like a trombone.

Let us know when your dataset is complete! I love jazz.