Speech Synth!

Paul Pridham

I've always loved those old robotic speech synths, and I could never find any nice, easy code for implementing one, so I decided to create my own... whee! I based it on the SP0256 speech synth chip. I found a WAV of the complete allophone set, split it up into individual allophones (ugh) using the SP0256 specification as a guide, and wrote some generic "synth" code that takes a list of speech opcodes and streams the appropriate allophones to a buffer. There's an Allegro layer on top for output, but it's not necessary.

When it's all completed you'll be able to easily make a custom speech synth using whatever allophone/phoneme set you want. I'll probably look into some more advanced features such as changing pitch/speed as well as resampling for different sample rates and sizes (currently a robotic but charming 8000 Hz at 8 bits).

Anyway, here's an example of the output: We Control the Universe

Whee, now I can put robot voices in everything I...

miran

The link doesn't work...

spellcaster

Heh great (link works for me)
But why didn't you send me my allophone files yet?

Paul Pridham

Oops! Sorry Lenny. I'll get that ready for you. Do you need the SP0256 spec as well?

Torbjörn Josefsson

ooooh! the coolness! wish I had a game I could use it in!

spellcaster

Paul just sent me the stuff... this is pretty cool. It's quite easy to get it running, and submitting new "voices" is simple as well.
Imagine the possibilities... speech synth for all your Allegro programs...
Chris, is the AAA site for 2003 up already? We have a worthy candidate for "Best add-on lib" already.

KaBlammyman

That is hella tite...Now I want a real robot. Honestly, that speech thing is cool. Is it hard to design one of those things?

Paul Pridham

Well, depends on the type of synth. Mine is a simple concatenative synth using allophones. A more advanced method is to use diphones, which are transitions between phonemes (whereas allophones are the distinct sounds of a particular phoneme).

Non-concatenative speech synths are called formant synths. They use parameters to simulate the "tube" that shapes the speech, and apply those parameters as a filter to a "buzzer" to generate the sound. Pretty cool, but over my head currently. If you're interested in this, do a search for Linear Predictive Coding.
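To make the "buzzer through a filter" idea concrete, here is a minimal sketch in C. It is not taken from any synth mentioned in this thread: the function name `buzz_through_filter`, the `MAX_ORDER` limit and the coefficient layout are all assumptions made for illustration. The buzzer is a plain impulse train, and the "tube" is an all-pole filter of the kind LPC estimates from recorded speech.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 16   /* assumed upper bound on the filter order */

/* Hypothetical sketch of the source-filter idea: an impulse-train "buzzer"
   fed through an all-pole filter, y[n] = x[n] - sum_k a[k] * y[n-1-k].
   The coefficients a[] stand in for the "tube" parameters; in LPC they
   would be estimated from recorded speech, frame by frame. */
void buzz_through_filter(float *out, size_t n, size_t pulse_period,
                         const double *a, size_t order)
{
    double hist[MAX_ORDER] = {0.0};   /* hist[k] holds y[n-1-k] */

    assert(out != NULL && a != NULL);
    assert(pulse_period > 0 && order > 0 && order <= MAX_ORDER);

    for (size_t i = 0; i < n; i++) {
        /* Buzzer: one unit pulse every pulse_period samples. */
        double y = (i % pulse_period == 0) ? 1.0 : 0.0;

        /* Filter: feed back the previous outputs. */
        for (size_t k = 0; k < order; k++)
            y -= a[k] * hist[k];

        /* Shift the output history by one sample. */
        for (size_t k = order - 1; k > 0; k--)
            hist[k] = hist[k - 1];
        hist[0] = y;

        out[i] = (float)y;
    }
}
```

The pulse period sets the pitch (at 8000 Hz, a period of 80 samples gives a 100 Hz buzz), and swapping the coefficient set over time is what would turn one vowel into another. A real formant synth also needs a noise source for unvoiced sounds, which this sketch leaves out.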

I am still fiddling with my synth, and once I have a bit more done with it I'll post it up. I want to make another allophone set (based on the C64 SAM speech synth), and a minimal diphone set as well to see how smooth I can make the phoneme transitions sound.

dudaskank

Does this work for other languages? (Portuguese ^_^?)

If not, is it hard to add multi-language support?

Maybe a config file for these allophones?

^__^

23yrold3yrold

Quote:

Chris, is the AAA site for 2003 up already?

Bug Tom. The new Pixelate site should be up soon (yeah, right!) so I guess I'll mention it.

Paul Pridham

dudaskank: Portuguese or any language is no problem. If you have the phonemes/allophones for it, then all you have to do is load them up and start using it! You can even record your own.

Heck, I was thinking the other day how I might invent a Goblin language, or use some squeaky-toy samples I recorded to make a rat language. It's no problem to write a config file format for this ... in fact, I've already got one and I will include it as an example when I put the code online.

The whole thing is really simple, actually. The speech synth is less than 300 lines of code. The only thing you really need is a good set of samples for creating the speech.

Matthew Leverton

How large are the samples, and how expensive is it to generate speech in real time?

Paul Pridham

Well, the first set of samples I made are 8-bit @ 8 kHz, with 60 allophones amounting to 122K total. The set I'm working on now is 8-bit @ 22 kHz, so it may be around triple the size of the other set. I'll probably make a semi-diphoned set after this one. Using diphones, you can expect even more samples. Big systems like Festival have many megabytes of diphone samples, and generally use the same technique with excellent results.
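The "around triple" estimate can be checked with a line of arithmetic. The sketch below assumes the same recordings, bit depth and channel count at both rates, so the size scales linearly with the sample rate. The function name is made up, and the test values assume "22 kHz" means 22050 Hz and "122K" means roughly 122000 bytes.

```c
#include <assert.h>

/* Estimate the size of the same recordings resampled to a new rate.
   Assumes the bit depth and channel count stay the same, so the size
   scales linearly with the sample rate. */
long scaled_size_bytes(long old_bytes, long old_rate_hz, long new_rate_hz)
{
    assert(old_bytes >= 0 && old_rate_hz > 0 && new_rate_hz > 0);
    /* Widen before multiplying so the product cannot overflow a 32-bit long. */
    return (long)((long long)old_bytes * new_rate_hz / old_rate_hz);
}
```

Under those assumptions, `scaled_size_bytes(122000, 8000, 22050)` comes out at 336262 bytes, a factor of about 2.76.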

As for the overhead of this type of speech "synth" (really, it's barely synth at all), it's negligible. All it's doing is streaming samples to you in the order you requested them. Cheap as it is, hey, it works well! If the extra effort is put into creating an allophone/diphone set to suit your needs, then it's all you may need. For my purposes, which is cheap and dirty robot voices for games and my own amusement, it seems too good to be true.
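For anyone wondering why the overhead is negligible, here is a rough sketch of what "streaming samples in the order you requested them" amounts to. It is not the Speechy source: the `Allophone` struct and `concat_allophones` function are hypothetical, and the samples are assumed to be 8-bit unsigned mono PCM already loaded into memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: one allophone = a block of 8-bit unsigned PCM. */
typedef struct {
    const unsigned char *data;
    size_t len;
} Allophone;

/* Copy the allophones named by opcodes[] back to back into out.
   Returns the number of bytes written; stops early if out is full
   and skips opcodes that are outside the set. */
size_t concat_allophones(const Allophone *set, size_t set_size,
                         const int *opcodes, size_t n_opcodes,
                         unsigned char *out, size_t out_cap)
{
    size_t written = 0;

    assert(set != NULL && opcodes != NULL && out != NULL);

    for (size_t i = 0; i < n_opcodes; i++) {
        if (opcodes[i] < 0 || (size_t)opcodes[i] >= set_size)
            continue;                       /* unknown opcode: ignore it */

        const Allophone *a = &set[opcodes[i]];
        if (a->len > out_cap - written)
            break;                          /* no room for this allophone */

        memcpy(out + written, a->data, a->len);
        written += a->len;
    }
    return written;
}
```

The whole job is one `memcpy` per allophone, so the cost is tiny next to actually playing the audio.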

The Formant/LPC type synths are more CPU intensive, but I'm not sure how much. Heck, the C64 had a very good synth on it, done completely in software. It's something I'm going to look at in the future, as I'd love to have my own Formant synth. There's such a synth (or two) in MAME, emulating a speech chip for games such as Gauntlet. It would be a good place to look for starters.

DanTheKat

If only my sound was installed, and I had a microphone.

Hmm, "All your base are belong to us", heh heh heh heh heh heh heh.......

Paul Pridham

I was just working on a new set of allophones, and something.. strange.. happened. I've been.. persuaded to upload this message for the members of Allegro.cc. It's in all of our best interests if we.. don't resist: message from our new master

Here's a transcript:

"Hello people of allegro dot cc. This is the voice of your new master. Bow down before me, and I will spare your lives. Defy me, and my robot armies will destroy you. Hahahaheheheheeeheeehaaa!"

Matt Smith

You've beaten me to it

I started doing the exact same thing with probably the exact same SP0256 file.

For some very good diphone sets try Embrola, but the licensing is non-commercial only.

Hmm, no "embrola" on Google.
D'oh, it's Mbrola.

Paul Pridham

Heh, that's kind of funny. What do you have so far... are you using the same concatenative technique? If you want I can send you what I've got currently, which includes 2 allophone sets I've made. Might save you some effort.

I was thinking it might be cool to reverse-engineer the C64 SAM synth using VICE, but I imagine that would be a lot of work. I haven't tried such a thing.

I installed Mbrola the other day. Yep, it seems pretty cool, but damn, those diphone sets are pretty big. Plus they're in a format I'd have to figure out, not nicely separated into diphones.

Matt Smith

I fell asleep with that "my robot armies will destroy you" message playing in a loop. I just woke up from a dream about a dalek ::)

I never even got as far as chopping up the phonemes myself. I would love a copy of yours.

My gameplan was to make an editor first, for building dictionaries of phonetically spelled words and phrases. Then I was going to record a phoneme set for copyright-free distribution with my animation program.

Diphone sets will always be big, there's no way around it really. I was going to use a 'SP0256 squared' format of 4096 diphones for upward compatibility.

Ultimately I wanted to do a formant synth too, so there would be more flexibility in setting parameters like pitch, speed, volume and mood. The editor I'm planning will allow singing along with a tune or just the natural sing-song of spoken speech e.g. say("hello","cheerfully"); say ("hello","sarcastically"); would use different sing-songs.

Paul Pridham

Quote:

I fell asleep with that "my robot armies will destroy you" message playing in a loop. I just woke up from a dream about a dalek

Haha, how does a guy fall asleep to that?

Here's a copy of my 'phones: http://www3.sympatico.ca/ppridham/misc/sounds/phones.zip. This includes SAM as well as the SP0256. Too bad allophone concatenation is unintelligible to most (so I've been hearing, anyway). :-/

Quote:

Have you searched around for any text-to-speech code? There're a few decent public domain C-code versions floating around. There's even one specifically for the SP0256. I imagine that this combined with a dictionary for text-to-phonemically-challenged words would cover the bases.

Quote:

Diphone sets will always be big, there's no way around it really. I was going to use a 'SP0256 squared' format of 4096 diphones for upward compatibility.

Wow, that's more than the big systems like Festival use... I think they sit around 1200, since not all phoneme-phoneme transitions are common or required. I doubt you'll ever top that 4096 set out. I also happened across a program called Diphone Studio (no linky at the moment) that looks promising for creating the diphone database... you may want to check it out.

I was hoping it might be possible to make a very minimal set of 400 or less from SAM, perhaps sticking to pared down rules based on transitions to/from either front or back vowels. The whole idea of creating a diphone set seems pretty daunting, though. :-/

For the formant stuff, I was wanting to snarf Frank Palazzolo's TMS5220 code from MAME and adapt it as a generic speech synth (Elf needs food... badly). One day when I wrap my head around the theory I'll try rolling my own.

Your plan sounds pretty keen, can't wait to see what comes of it!

Matt Smith

The idea for 4096 diphones was simplicity and upward compatibility in the API and dictionaries. With a smaller physical diphone set, there would be a table to translate 4096 logical diphones.
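One way to picture that translation table, as a sketch only: it assumes 64 allophone codes (which is what 4096 = 64 x 64 implies), and every name in it (`diphone_map`, `logical_diphone`, `NO_DIPHONE` and so on) is invented for illustration rather than taken from any posted code.

```c
#include <assert.h>

#define N_PHONES   64                       /* assumed: 64 x 64 = 4096 */
#define N_LOGICAL  (N_PHONES * N_PHONES)
#define NO_DIPHONE (-1)

/* Translation table: logical diphone index -> physical sample index,
   or NO_DIPHONE when no recording exists for that transition. */
static short diphone_map[N_LOGICAL];

/* A logical diphone is just the ordered pair (from, to) packed into one index. */
int logical_diphone(int from, int to)
{
    assert(from >= 0 && from < N_PHONES && to >= 0 && to < N_PHONES);
    return from * N_PHONES + to;
}

/* Mark every transition as unrecorded. */
void diphone_map_clear(void)
{
    for (int i = 0; i < N_LOGICAL; i++)
        diphone_map[i] = NO_DIPHONE;
}

/* Point one logical diphone at a physical sample. */
void diphone_map_set(int from, int to, short physical)
{
    diphone_map[logical_diphone(from, to)] = physical;
}

/* Returns the physical sample index, or NO_DIPHONE so the caller can
   fall back to plain allophone concatenation. */
int diphone_lookup(int from, int to)
{
    return diphone_map[logical_diphone(from, to)];
}
```

The table costs 8K of shorts no matter how many diphones are actually recorded, and several logical diphones can share one physical sample, so the dictionaries never need to change when the sample set grows.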

My plan was to release a simple system like yours first. If I was you I would release it as soon as a replacement phoneme set is recorded.

I can't promise anything, as I have too many projects on the go right now, but I'll try and record a copyright free phoneme set although my microphone is crap, my voice is crap, and I only have Sound Recorder. It would be better if someone with decent audio gear did it.

piccolo

I have no idea what diphones or allophones are.
I'm guessing the SP0256 speech synth chip is a PROM with speech synth code in it, right?

My speech synth is much simpler; it works like a keyboard.

Matt Smith

A NEW PHONEEM SET SI RACORDED!!!!!1

-> HERE <-

piccolo: It isn't a PROM; it synthesises speech with a little microcode processor, noise and tone generators and a filter. When this chip came out in the early 80s, PROMs were enormously expensive and only 8K.

Paul Pridham: SEND ME TEH COAD!!!!! I haven't tested the phonemes yet, so I don't know what they sound like strung together.

Paul Pridham

Hahaha! That is some funny stuff right there. I ran your phonemes... and they make the perfect gimp voice.

Here's my code: http://www3.sympatico.ca/ppridham/misc/sounds/speechy.zip. Should be pretty straightforward, but beware the "main2.c" that tests stuff is somewhat hackish.

By the way, you forgot to record WH.

piccolo

Can you host mine so people can use the code? It's in Borland C++ 5.02 and runs at a prompt.
This was my first program, and my first and last program put to the side. Note: the only reason I stopped making it is that I wanted to do it side by side with my 2nd program "The Game" (which I am working on now), and I could not get Borland C++ 5.02 to work with Allegro, so I moved to MSVC++ 6, leaving my poor "first" behind. But because of my new knowledge from you guys like 23year I am much more powerful: I now know how to use a thing called strings and I know how to add super GUIs to my stuff. I know for sure I can convert my speech program over to MSVC and make it 10000 times better in a snap :-X but after my game.

Post an answer and I will edit my post with the link you give me.

psundlin

Here is my little speech thingy editor:
http://hem.passagen.se/peter95

It is not so powerful, but you can create and load/save files. There is no wav/voc saving yet. The source code is included. The source is a bit messy. As always.

Paul Pridham

Heh, it's cool to see that there are more of us out there.

Matt Smith

I've jiggered them a bit. They sound less gimpy now (I blocked the gap between my front teeth for dh1). Some of them still aren't quite right yet, but that can wait until I have a better editor.

matt-0256-2.zip

I'm surprised how much it sounds like me. This suggests that everyone should record their own set and get their gf to make one too.

I hacked at the speech string too, it's more intelligible with either allophone set

  char *str="hh1 eh ll ow pa3 pp iy pp el pa3 ax tt1 pa3 ax ll ll eh pa2 gg2 rr2 ow pa3 dd2 ao pa2 tt2 pa3 ss iy pa3 ss iy pa5 pa5 "
    "dh1 ih ss pa3 ih zz pa3 dh1 ax pa3 vv oy ss pa3 ax vv pa3 yy2 or pa3 nn1 uw2 pa3 mm aa ss pa2 tt2 er2 pa5 pa5 "
    "bb2 aw pa3 dd2 aw nn1 pa3 bb2 iy ff or pa3 mm iy pa5 ae nn1 dd1 pa3 ay pa3 ww ih ll pa3 ss pa2 pp xr pa3 "
    "yy2 or pa3 ll ay vv zz pa5 pa5 dd2 iy ff ay pa3 mm iy pa5 ae nn1 dd1 pa3 mm ay pa3 rr1 ow bb2 ao pa2 tt2 pa3 "
    "ar mm iy zz pa3 ww ih ll pa3 dd2 iy ss pa2 tt2 rr2 oy pa3 yy2 uw2 pa5 pa5 "
    "hh1 ao hh1 ao hh1 eh hh1 eh hh1 eh hh1 iy hh1 ae hh1 aa hh1 aa aa hh1 ao";
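A speech string like this has to be turned into allophone indices before anything can be played. The sketch below shows one way to do that; it is not the parser from speechy.c. `parse_speech` and `find_name` are made-up names, and the name table is assumed to be supplied by the caller (for instance, read from the allophone set's .txt file).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Look up one token of length len in names[]; returns its index or -1. */
static int find_name(const char *const *names, size_t n_names,
                     const char *tok, size_t len)
{
    for (size_t i = 0; i < n_names; i++)
        if (strlen(names[i]) == len && strncmp(names[i], tok, len) == 0)
            return (int)i;
    return -1;
}

/* Split str on spaces and write the matching indices to codes[].
   Unknown tokens are skipped. Returns the number of codes written. */
size_t parse_speech(const char *str, const char *const *names, size_t n_names,
                    int *codes, size_t max_codes)
{
    size_t count = 0;

    assert(str != NULL && names != NULL && codes != NULL);

    while (*str != '\0' && count < max_codes) {
        while (*str == ' ')
            str++;                              /* skip separators */

        size_t len = strcspn(str, " ");         /* length of the next token */
        if (len == 0)
            break;                              /* only spaces were left */

        int code = find_name(names, n_names, str, len);
        if (code >= 0)
            codes[count++] = code;

        str += len;
    }
    return count;
}
```

Because the scanner never writes to the input, it works directly on a string literal, which `strtok` would not.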

Some of the "phonemes" in the SP-0256 set are definitely diphones, is this why it's called an allophone set rather than phonemes?

Cheradenine Zakalwe

Matt: I downloaded your first set of phonemes... You manage to make "F***-Off" sound very convincing!!!

Paul Pridham

Well, I think they are called allophones because they aren't really true diphones, but special phoneme cases... like the front/back vowel related transitions for a particular phoneme.

By the way, I made a bit "nicer" of a front end for this, I'll post it all up later. I just need to make a few changes so that you can specify which allophone set to use.

piccolo

Come on guys, tell me how to do the link thing. Where do I upload? :'( I want to show off my code too, you know.

Paul Pridham

Well... you have to have somewhere to upload it to! I'm using my internet account's measly 5MB of webspace to store stuff. I think you may be able to start up a page on Geocities or somewhere and put files online as well, though I've never tried those free webspace providers.

piccolo

Thanks, I'll try it. Yahoo should be good 8-)

Paul Pridham

OK, I've made a little front-end demo for "Speechy," and you can specify an allophone set to use from the command line. You get to type in allophones, press ENTER to play them, TAB to save the speech to "out.raw", DEL to clear the whole line, and ESC to quit. I've included the allophone sets I've made. Matt's set should work with this as well: copy it into a folder under the "allophones" folder like the other sets, then make a copy of spo256.txt and call it something else for Matt's set. Also, change the name in the copy of spo256.txt to the directory you placed the new allophones under.

Anyway, here she blows: http://www3.sympatico.ca/ppridham/misc/sounds/speechy.zip

If you want to convert the out.raw to a WAV, load it up in Goldwave or somesuch thing and save it back as a WAV. Make sure that the sampling rate you use matches that in the associated .txt file.

One thing I have noticed is that I don't think a full diphone set is needed to make the speech intelligible. Certain "fricatives" sound pretty intelligible when mixed with the various vowel and diphthong sounds... for instance: V, DH, Z, CH, SH, S, etc. Basically, any unvoiced sounds seem to stand well alone.

Diphones would need to be made for most vowel-to-vowel and voiced consonant-to-vowel transitions, although I think that many of these could be made into specialized allophones or generic diphones, rather than a plethora of every possible diphone transition.

Matt Smith

I'm thinking that separating the voiced and fricative parts into separate samples would help in various ways. It would probably double the size of a phoneme/allophone set but would make a diphone set much smaller because of all the duplicates. It would also let the two parts be mixed and matched for greater variety of voices.

I'm working on my general purpose animation editor now, as that makes a good basis for a "voice tracker" too. All the allophones will need loop points so they can be synced to frame rates.

Cheradenine Zakalwe

Quote:

I'm working on my general purpose animation editor now, as that makes a good basis for a "voice tracker" too. All the allophones will need loop points so they can be synced to frame rates.

I noticed you using Shockwave Flash on parts of your site, Matt (i.e. the News link). Would this be useful in those sorts of situations?

Matt Smith

I'm so transparent

The Flash Editor is nearly good enough to use but would be a pain as you would have to manually create a key frame for each allophone and drag each one into place.

Ideally I'd like to generate FLA files from my editor for post-production in Flash, but only SWF is an open format, so I'll have to write them directly.

Thomas Fjellstrom

Quote:

I'm so transparent

Funny you should say that, Matt. Most of your head disappears when your eyes glow.

Cheradenine Zakalwe

Quote:

I'm so transparent

Naah! Just a case of putting Pi and root 2 together and getting... err..

Quote:

The Flash Editor is nearly good enough to use but would be a pain as you would have to manually create a key frame for each allophone and drag each one into place.

Right! Never used shockwave myself but I get what you mean.. look forward to seeing what you come up with...

dudaskank

Quote:

If you want to convert the out.raw to a WAV

Why not save directly to WAV? Only change this piece of code in main2.c ^__^

if(save)
{
  PACKFILE *pfp;
  int bps = speak->bits/8 * ((speak->stereo) ? 2 : 1);   /* bytes per sample frame */
  int i, s;
  pfp = pack_fopen("out.wav", F_WRITE);
  if (pfp) {                                     /* skip saving if the file can't be opened */
    pack_fputs("RIFF", pfp);                     /* RIFF header */
    pack_iputl(36+length2, pfp);                 /* size of RIFF chunk */
    pack_fputs("WAVE", pfp);                     /* WAV definition */
    pack_fputs("fmt ", pfp);                     /* format chunk */
    pack_iputl(16, pfp);                         /* size of format chunk */
    pack_iputw(1, pfp);                          /* PCM data */
    pack_iputw((speak->stereo) ? 2 : 1, pfp);    /* mono/stereo data */
    pack_iputl(speak->freq, pfp);                /* sample frequency */
    pack_iputl(speak->freq*bps, pfp);            /* avg. bytes per sec */
    pack_iputw(bps, pfp);                        /* block alignment */
    pack_iputw(speak->bits, pfp);                /* bits per sample */
    pack_fputs("data", pfp);                     /* data chunk */
    pack_iputl(length2, pfp);                    /* actual data length */
    if (speak->bits == 8) {
      pack_fwrite(speak->data, length2, pfp);    /* write the data */
    }
    else {
      for (i=0; i < (int)speak->len * ((speak->stereo) ? 2 : 1); i++) {
        s = ((signed short *)speak->data)[i];
        pack_iputw(s^0x8000, pfp);               /* flip the top bit: unsigned -> signed 16-bit */
      }
    }
    pack_fclose(pfp);
  }
  save=FALSE;
}

^__^

Paul Pridham

Go right ahead. You've got the source code.

dudaskank

My copy is changed ^__^

Is it possible to "translate" a real string, like hello world, into the allophones? I mean, you type hello world, and not h eh l l oh pa4....

^__^

Paul Pridham

Yes, there is public domain code available to convert text to speech. Some code that should be easy to adapt can be found here: http://www.wps.com/products/Story-Teller/technical/T2A/

Anomalous

There is also the Microsoft speech API... really quite versatile, text-to-speech, speech-to-text, incorporates easily with TAPI. Good stuff.

Matt Smith

Is it versatile enough to work in Linux? ::)

Cheradenine Zakalwe

Well I'll give you ONE guess!

Matt Smith

I bet it doesn't have stuff like THIS!

These have a 1-to-1 relationship with the SP-0256 allophone set. This could plainly be improved by making the di/triphones use multiple frames, but it's a start.

http://www.the-good-stuff.freeserve.co.uk/allegro/speech/mouths1-0256.png

download the demo

Unzip into speechy dir

gcc -o mouthdemo.exe mouthdemo.c speechy.c voice_mgr.c -lalleg

Paul Pridham

Heh, whoa... that's awesome. Can't wait to see it. Just promise you won't be pasting those luscious red lips over that hairy Matt Smith avatar.

CGamesPlay

MattSmith:
The demo is missing allophones/spo256-2.txt

[edit]
I copied the original spo256.txt file, but I get some kinda error when I try to kill the program.

cl mouthtest.c speechy.c voice_mgr.c /link alleg.lib

Matt Smith

aha, you need to either

make an allophones/spo256-2/ dir and unzip matt-0256-2.zip in there

or edit mouthtest.c and change spo256-2.txt to spo256.txt

because Paul has removed my allophone set from his download.

Paul Pridham

Err... sorry, I never added your allophone set Matt, because they're not mine and didn't want to make any assumptions. Should be a simple fix-up though.

Matt Smith

really? I wasn't sure. I thought I got a copy of mine back when I downloaded yours because you added the !wh.wav ( I just checked, and it WAS there when I downloaded )

Anyway, feel free to distribute my set. I did it originally to remove the risk of GI suing for theirs, although I suppose that's a grey area legally as it's synthesised not sampled. Technically the copyright in the recording would then be owned by whoever made the file you used.

I think some more rerecording and trickery will be needed in order to synchronise the voice to the animation, rather than the other way around as it is in this demo.

hehe, look what's coming next Windows Demo 88K zipped, requires alleg40.dll

http://www.the-good-stuff.freeserve.co.uk/allegro/speech/screendump.png

fresh in. Try the Evil Britney version.

[Edit] typos in both URLs, now fixed ::)

Johan Halmén

Mac OS has a speech extension with lots of different voices. They sound really natural, actually too natural. Not that robotic sound. Or there are some robot voices, too. The Chipmunk Basic interpreter has a built-in function that takes a string as an argument and reads it out in plain English.

When will we see the speech routines included in Allegro?

Zaphos

Quote:

When will we see the speech routines included in Allegro?

I'm putting five cents on never! This is pretty clearly add-on library material.

Cool work, Paul and Matt. Heh, this is interesting stuff that I hadn't really heard of before now. Very fun!

Paul Pridham

Haha... Matt, you're a crazy man.

Matt Smith

OH NOES , TEH BABIES AR SINGIGN !!!!!!1

well, only 1 baby singing a scale so far.

complete source & data

To do

multiple voices
looping samples
text to speech (algorithmic + dictionary)
editor for composing speech strings with singing/intonation

EDIT:

try this example of speech with embedded intonation

  char *str="D4 hh1 eh ll A3 ow C4 ow pa3 pp D4 iy A3 pp el pa3 C4 ax B3 tt1 pa3 a3 ax ll ll D4 eh pa2 gg2 rr2 C4 ow pa3 dd2 A3 ao pa2 tt2 pa3 C4 ss A3 iy pa3 C4 ss A3 iy pa5 pa5 "
    "D4 dh1 ih ss pa3 C4 ih zz pa3 B3 dh1 ax pa3 D4 vv C4 oy ss pa3 C4 ax vv pa3 B3 yy2 D4 or pa3 C4 nn1 E4 uw2 pa3 C4 mm E4 aa ss pa2 tt2 A3 er2 pa5 pa5 "
    "bb2 D4 aw pa3 C4 dd2 aw nn1 pa3 B3 bb2 iy ff D4 or pa3 C4 mm iy pa5 B3 ae nn1 dd1 pa3 ay pa3 ww C4 ih ll pa3 ss pa2 D4 pp xr pa3 "
    "C4 yy2 or pa3 D4 ll ay C4 ay vv zz pa5 pa5 "
    "dd2 iy ff D4 ay pa3 C4 mm iy pa5 B3 ae nn1 dd1 pa3 mm ay pa3 rr1 D4 ow bb2 C4 ao pa2 tt2 pa3 "
    "ar mm B3 iy zz pa3 ww ih ll pa3 dd2 C4 iy ss pa2 tt2 rr2 E4 oy D4 oy pa3 C4 yy2 B3 uw2 pa5 pa5 "
    "C5 hh1 ao A4 hh1 ao E4 hh1 ao C4 hh1 ao "
    "C#5 hh1 eh A#4 hh1 eh F4 hh1 eh C#4 hh1 eh "
    "D5 hh1 iy B4 hh1 iy F#4 hh1 iy D4 hh1 iy "
    "D#5 hh1 aa C5 hh1 aa G4 hh1 aa E4 hh1 ao C5 ao B4 ao A#4 ao A4 ao G#5 ao";

Obviously the evil laugh needs some work on the arpeggiation.
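For anyone curious how pitch markers like "D4" or "C#5" could be told apart from allophones and turned into a playback rate, here is a sketch. It is only a guess at one workable scheme, not the code from the download: `note_number` and `pitch_ratio` are hypothetical names, the numbering follows the MIDI convention (C4 = 60, A4 = 69), and the idea that a note changes the sample playback rate is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Parse a note token such as "C4" or "F#4" into a MIDI-style number
   (C4 = 60, A4 = 69). Returns -1 if the token is not a note, so tokens
   like "hh1" or "pa3" can be told apart from pitch markers. */
int note_number(const char *tok)
{
    static const int offsets[7] = { 9, 11, 0, 2, 4, 5, 7 };  /* A B C D E F G */

    assert(tok != NULL);

    if (tok[0] < 'A' || tok[0] > 'G')
        return -1;                      /* allophone names are lower case */

    int semitone = offsets[tok[0] - 'A'];
    size_t i = 1;
    if (tok[i] == '#') {                /* optional sharp */
        semitone++;
        i++;
    }
    if (!isdigit((unsigned char)tok[i]) || tok[i + 1] != '\0')
        return -1;                      /* need exactly one octave digit */

    return 12 * (tok[i] - '0' + 1) + semitone;
}

/* Playback-rate ratio between two notes in equal temperament:
   each semitone multiplies the rate by 2^(1/12). */
double pitch_ratio(int from_note, int to_note)
{
    const double semitone = 1.0594630943592953;   /* 2^(1/12) */
    double ratio = 1.0;

    for (int i = from_note; i < to_note; i++)
        ratio *= semitone;
    for (int i = from_note; i > to_note; i--)
        ratio /= semitone;
    return ratio;
}
```

With a scheme like this, a sample recorded at a reference note would be played back at `pitch_ratio(reference, note)` times its normal rate. That also shortens or lengthens the sound, which is part of why sung allophones need loop points.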

Thread #241388. Printed from Allegro.cc
