Introduction

In this paper, we propose StylePitcher, the first general-purpose pitch curve generation model that learns to follow singing styles from reference audio. We formulate pitch generation as a masked infilling task: given surrounding pitch context and musical scores, StylePitcher learns to generate missing pitch segments that naturally continue the style patterns from context.
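The masked infilling setup described above can be illustrated with a small data-preparation sketch. It is a hypothetical illustration only: the function name `make_infilling_example`, the per-frame feature layout, the zero fill value, and the single contiguous mask are all assumptions, not details of StylePitcher itself, whose features and masking strategy are not specified on this page.

```python
def make_infilling_example(f0, notes, mask_start, mask_len, fill_value=0.0):
    """Build one masked-infilling example from a pitch curve (hypothetical layout).

    f0    : per-frame pitch values (e.g. Hz) of a singing excerpt
    notes : per-frame score notes (e.g. MIDI numbers), same length as f0
    Returns (model_input, mask, target):
      model_input : per-frame tuples (context_pitch, note, mask_flag)
      mask        : per-frame booleans, True where pitch is hidden
      target      : the hidden pitch values the model should generate
    """
    if len(f0) != len(notes):
        raise ValueError("f0 and notes must have the same number of frames")

    # Mark one contiguous segment as missing (clipped to the excerpt length).
    mask_end = min(mask_start + mask_len, len(f0))
    mask = [mask_start <= i < mask_end for i in range(len(f0))]

    # Hide the pitch inside the mask; the surrounding frames stay as context.
    context = [fill_value if m else p for p, m in zip(f0, mask)]

    # The score stays visible everywhere, including inside the masked span.
    model_input = [
        (c, n, 1.0 if m else 0.0) for c, n, m in zip(context, notes, mask)
    ]
    target = [p for p, m in zip(f0, mask) if m]
    return model_input, mask, target


f0 = [220.0, 221.0, 222.0, 223.0, 224.0, 225.0]
notes = [57, 57, 57, 57, 57, 57]
model_input, mask, target = make_infilling_example(f0, notes, mask_start=2, mask_len=3)
```

In this sketch the unmasked frames carry the style cues, while the score channel tells the model which notes the missing segment should follow.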

Once trained, StylePitcher serves as a plug-and-play module for diverse applications: enhancing pitch correction with style preservation, enabling style transfer in SVS systems, and improving expressiveness in SVC.

Fig.1 Illustration of the methods. Subscripts *off* and *in* denote features from off-key and in-key singing in the APC task; *ref* and *tgt* refer to reference and target content for SVS and SVC tasks.

Generation Examples

This section presents audio examples from the baselines and our methods for the following applications:

Automatic Pitch Correction (APC)

Given a detuned singing voice (Off-key Singing) and a target note sequence (Target Notes), APC aims to generate an in-tune pitch contour that aligns with the target notes while preserving the singer’s original style. We compare our methods, i.e., StylePitcher and StylePitcher (w/o smooth), with Diff-Pitcher.
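To make "off-key" concrete, one common way to quantify how far a sung pitch is from its target note is the deviation in cents, using the standard equal-temperament conversion (A4 = MIDI note 69 = 440 Hz). The sketch below is illustrative and not part of any of the compared systems; the function names and the convention of marking unvoiced frames with a non-positive F0 are assumptions.

```python
import math


def midi_to_hz(midi_note):
    """Equal-temperament frequency of a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12)


def cents_off_target(f0_hz, target_midi):
    """Per-frame deviation of sung pitch from the target note, in cents.

    Positive values are sharp, negative values are flat. Frames with
    f0 <= 0 are treated as unvoiced (an assumption) and yield None.
    """
    deviations = []
    for f0, note in zip(f0_hz, target_midi):
        if f0 <= 0:
            deviations.append(None)
        else:
            deviations.append(1200.0 * math.log2(f0 / midi_to_hz(note)))
    return deviations


# A frame exactly on A4, a frame one semitone sharp, and an unvoiced frame.
deviations = cents_off_target([440.0, midi_to_hz(70), 0.0], [69, 69, 69])
```

A purely note-snapping corrector would drive these deviations toward zero everywhere, which also flattens vibrato and slides; preserving style means keeping such expressive deviations while removing the systematic detuning.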

Off-key Singing | Target Notes | Diff-Pitcher | StylePitcher | StylePitcher (w/o smooth)

Zero-shot Singing Voice Synthesis (SVS) with Style Transfer

Given target lyrics, a musical score, and a reference singing voice, style transfer for SVS aims to synthesize vocals that match the target content while resembling the singing style of the reference. In the following examples, the singers in the reference audio were never seen by the model during training, i.e., a zero-shot setting. We compare our methods, i.e., StylePitcher and StylePitcher (w/o smooth), with StyleSinger. All methods support both parallel and non-parallel style transfer.

Parallel Style Transfer

In the parallel style transfer setting, the content of the reference singing voice is the same as the target lyrics and musical scores. Thus, the Reference audio here provides both the singing content (lyrics and musical scores) and the singing styles for re-synthesis.

Reference | StyleSinger | StylePitcher | StylePitcher (w/o smooth)

Non-Parallel Style Transfer

In the non-parallel style transfer setting, the content of the Reference singing voice differs from the target lyrics and notes.

Example 1

Reference | StyleSinger | StylePitcher | StylePitcher (w/o smooth)

Example 2

Reference | StyleSinger | StylePitcher | StylePitcher (w/o smooth)

Style-informed Singing Voice Conversion (SVC)

While most existing SVC methods use the unchanged F0 from the target audio to generate vocals with the reference singer’s timbre, StylePitcher can modify the F0 so that the output also captures the singing style of the reference singer. In the following examples, the Target audio provides the singing content, and the Reference audio provides the timbre and style to convert to. We compare our methods, i.e., StylePitcher and StylePitcher (w/o smooth), with In-house SVC.

Target | Reference | In-house SVC | StylePitcher | StylePitcher (w/o smooth)