From: Bill Cox
Subject: Re: [Accessibility] Can you help write a free version of HTK?
Date: Mon, 12 Jul 2010 01:24:02 -0700

Hi, Eric.  You make some good points below.

On Fri, Jul 9, 2010 at 2:06 PM, Eric S. Johansson <address@hidden> wrote:
> The consensus of this august body was that all of the speech recognition
> toolkits out there (Julius, HTK, Sphinx) were all designed to keep graduate
> students busy, but not designed for use in the real world. I did take a look
> at Simon; it looks like it's the closest of the bunch, but I estimate it is
> somewhere between five and eight years away from being useful (i.e. on
> parity with NaturallySpeaking).

I agree that these tools are oriented toward grad-student research.  I
also agree that a major rewrite may be required to compete with
Naturally Speaking.  However, I disagree that we have to be
competitive with Naturally Speaking to program productively by voice.
I used Dragon Dictate in 1996, and later Naturally Speaking.  I found
that Dragon Dictate was about as good as Naturally Speaking in terms
of productivity for writing code.  The main problem was that Naturally
Speaking made me pause between commands, like Dragon Dictate, and only
did continuous recognition for dictating text.

Simon's approach can reduce the active vocabulary at any moment to
perhaps a couple hundred words or fewer, apparently enabling high
accuracy.  If we could have continuous command recognition on top of
that, we could easily beat my old productivity.  I read that there's a
newer tool called Vocola which enables continuous command recognition
with Naturally Speaking.
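
To make that concrete, here's a rough, untested sketch of the kind of
restricted command grammar such systems rely on, using CMU Sphinx's
JSGF support from Python.  The grammar contents, model paths, and the
use of pocketsphinx here are my own guesses for illustration, not
anything taken from Simon or Vocola:

    # Build a tiny JSGF command grammar; with only a handful of
    # phrases active, near-perfect command accuracy becomes plausible.
    from pocketsphinx import Decoder

    GRAMMAR = "\n".join([
        "#JSGF V1.0;",
        "grammar editing;",
        "public <commands> = <command>+ ;",  # "+" permits chained commands
        "<command> = save buffer | kill line | next window | undo that ;",
    ])

    with open("commands.gram", "w") as f:
        f.write(GRAMMAR)

    # Point the decoder at the grammar instead of a large-vocabulary
    # language model.  The model paths below are guesses; adjust for
    # your installation.
    config = Decoder.default_config()
    config.set_string("-hmm", "/usr/share/pocketsphinx/model/en-us")
    config.set_string("-dict", "/usr/share/pocketsphinx/model/en-us.dict")
    config.set_string("-jsgf", "commands.gram")
    decoder = Decoder(config)
    # Then feed audio through decoder.start_utt() / process_raw(), and
    # read decoder.hyp() for the recognized command string.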

> Based on my experience in OSSRI, you could shorten that
> timeline if you had around $10-$15 million to spend and pay for full-time
> developer efforts, but you're not looking at anything any faster than three
> years. Speech recognition is an unbelievably hard problem that doesn't work
> very well, but works well enough to keep people trying. This is why there is
> little or no competition in the market (high cost, low results).

It's a hard problem, I agree.  Perhaps I'm somewhat wishful in my
thinking, but no matter how I do the calculation, I estimate we have
many times more potential volunteers than such a project will require.
I think the main trick is finding advisors who have deep knowledge of
how to make good recognition engines, and effectively organising
volunteers.  I think you would probably agree that Naturally Speaking
is not the only good recognition engine ever sold.  There should be
experts around from failed or abandoned efforts who could help as
advisors.  Give me one or two of those guys, a dozen motivated
volunteer voice coders, and three years, and I think we could get
there.

> It may seem like I'm trying to drag things down, but mostly I try to keep
> people from making the same old mistakes I've lived through multiple times
> in the past. What I believe is necessary to support disabled people is not
> going to be pleasant for those driven by OSS ideology. For example:
>
> Handicap accessibility trumps politics.

I'm using Google's cloud-based Gmail service to write this e-mail.  I
typically review my e-mails with a closed-source binary TTS called
Voxin.  I've been contacted on Skype twice today, and I've watched a
couple of Flash videos.  I think we are in violent agreement on this
point.  People with disabilities need solutions, not a philosophy.

Let's look at where we are.  In the early 1990s a tiny company wrote
Dragon Dictate, using the signal-processing hardware in the sound card
to make speech recognition on PCs useful for the first time.  Their
market was exclusively people with physical impairments.  I discovered
them in 1996, when I needed them to remain a programmer.  There may
have been some new code written by the community to get around the
crap we get from Nuance, but it seems that the tools they ship haven't
improved programming by voice significantly in well over a decade.
Instead, they focus on helping us write e-mails faster.  How nice.
Look at where the real innovation in this area is coming from.  Is it
from Nuance, or the user community?  For future innovation, where
should we look?

> If a disabled person is kept from working because of ideology, then the
> ideology is wrong.

Agreed.

> I use NaturallySpeaking because for a fair number of tasks, it works,
> and works far better than typing. I'm not even going to try an
> open-source equivalent, because that's still too much work, and it
> burns hands that I need for other tasks so I can feed myself (cooking
> and making money). If someone were to tell me they had a fully
> featured programming-by-voice package for a thousand dollars, complete
> with a restrictive license, I would use it without a second thought,
> except for how I'd get the money. I wouldn't lose a second of sleep
> over the licensing as long as it let me make money to live.

I was there.  I bought every version of Dragon Dictate and Naturally
Speaking published from 1996 until about 2003.  I also bought every
microphone that seemed promising for improving recognition rates.  By
the way, what do people feel is the best microphone nowadays?

I do a ton of volunteer work for Vinux, a Linux distribution based on
Ubuntu and customised for the needs of the visually impaired.  People
often post e-mails saying, "Today I'm switching my main machine to
Vinux!"  I generally suggest that dual-booting, or running Vinux in a
virtual machine, is the way to go.  Vinux is not as productive an
environment as either Windows with JAWS or the Mac for the blind, at
least not yet.  However, we aim to be better than either.  To get
there as rapidly as possible, I would like volunteers to continue
using what works best for them.  Except Sina.  He should switch to
100% Vinux today!

> From my perspective, OSS ideology blinds developers and organizations to
> the real problem: keeping disabled developers and others operating
> computers at a level equivalent to TAB usability.

I agree.  I see this happening.  There are certain mailing lists where
you get flamed badly if you suggest people could be more productive
with proprietary tools.  Frankly, it's a bit scary discussing this on
a gnu.org list.

However, FOSS seems to be the only way that we can organise many
volunteers from around the globe to work together to write and improve
accessibility tools.  This isn't about ideology or politics or
freedom.  It's about people like us who are fed up with being second
class citizens, and tired of begging for access to new technology.
This is about programmers like us taking control over the future of
accessibility, because we're not going to get what we need otherwise.

> This tells me that any OSS
> accessibility interface should work from the application in towards the
> accessibility tool. For example, any tools used to make applications
> accessible should be built first using existing core technology such as
> NaturallySpeaking. Developing recognition engines should be dead last
> because they have the smallest impact on employability or usability.

Why not do both in parallel?  There are so many of us, yet each of us
has unique gifts and skills.  Most of us should do as you suggest, and
work at the application level to improve accessibility.  I think some
of us should become SR and TTS experts and work on the next
generation.  Actually, if I didn't have to work so hard with glue and
tape to make Vinux work, SR and TTS are the sort of thing I'd probably
do well at.

When I do simple estimates, I just can't see how we could lack the
volunteers to do this.  I can't believe that 99.9% of us with RSI or
visual impairments are the sort of people to sit on our butts and do
nothing.  From what I've seen, a fair percentage of us happen to be
decent programmers, and are the sort who refuse to believe we have
limitations.

> We should be putting more effort into building appropriate speech-level user
> interfaces instead of replicating the same cruel mistakes and useless hacks
> of the past 15 to 20 years. Instead of trying to get people to speak the
> keyboard, or building interfaces which have been proven to destroy people's
> voices, we should be spending our time looking at other solutions: enabling
> applications without any application modifications, or solving the command
> discovery problem. Both of these can reduce vocal and cognitive load, which
> is a good thing. I've seen too many people who tried to use speech
> recognition in inappropriate ways (i.e. programming by voice using macros)
> end up doubly disabled, both in the hands and the throat. Talk about well
> and truly screwed.

Perhaps I have a strong voice, but I spoke non-stop to my computer for
10 hours a day for over three years, and found that all I had to do
was sip water constantly.  I programmed by voice using macros,
eventually writing over 1,600 of them, mostly to control emacs.  I
think it was the best way to continue my career without giving in to
my typing limitations.

I am very interested in ideas like the ones you suggest for enabling
applications without modifications, and in anything that reduces vocal
and cognitive load.  We need new ideas, and I agree with your point
about not needing another useless type-by-voice project.  Part of the
problem is that many of these projects are funded by well-meaning
institutions, but implemented by people interested in research and
their own careers.  I think the code we write would be far better
focused on our own needs.

> I've worked out a few models of how to produce better speech interfaces.
> Given my hands don't work well and I can't write code anymore, I have not
> been able to implement prototypes.

Sorry, but I have to ask: if you can dictate e-mail, why can't you
write code?  Anyway, you don't have to type code to contribute.  I
would like to hear more about your models.  I want to put together an
e-mail list to discuss programming by voice, and the direction we
should take in implementing and improving the tools we need.  Your
input is welcome!  Would it be better to host that e-mail list in
Vinux land, or in gnu.org land?  Regardless, I would like to work in
Vinux to enable programming by voice at some basic level, and then I'd
like to get lots of voice coders on board to make it better.

> I'll spare you the descriptions and only say that
> I have talked about them with people involved in the speech recognition
> world and gotten double thumbs up on the ideas.

Let's hear them!  The new list may be a good place to discuss them.

> The current accessibility toolkits are doomed to fail because there is a
> 15ish year history of that model failing. They count on application
> developers to do things they have no financial interest in doing.

Well said!  However, you and I don't need financial interests.  We
already have selfish interests in building a future full of highly
accessible technology.

> In a
> speech recognition world, the number of applications explicitly integrated
> with NaturallySpeaking is virtually unchanged since NaturallySpeaking
> version 4. The number of incidentally integrated applications (through the
> use of "standard edit controls") has dropped, because more people are using
> multi-platform toolkits that don't follow standard practices or use standard
> edit controls. There is exactly one OSS application which was enabled for
> speech recognition, but it has fallen into disrepair because, I've been
> told, "it would encourage the use of proprietary packages". Nice way to
> treat the disabled.

Yep.  The current state of affairs sucks.  I think it's time we gave
up on Nuance and friends and fixed it ourselves.

> I would like to see accessibility start focusing on the edges, the tools
> where people work. I used Buzzword, a Flash-based word processor, because it
> works better and faster, with better recognition, than any open-source word
> processor. I'm even considering going back to Microsoft Word, because it
> specifically is supported and enabled. Why not make something like
> OpenOffice or AbiWord work with speech recognition? That lets people make an
> open-source choice at a level that matters to them. All the other crap can
> come later, once they understand the benefits of open-source applications.

Until we've got a good continuous large-vocabulary FOSS solution, the
best we can do is use Naturally Speaking through Wine to talk to
OpenOffice and other apps.  There are a couple of projects working
towards this, and I would like to work with them in Vinux.  However,
if I were to lose my ability to type, I would not settle for a hacked
solution like that.  I would do what I did before: install Windows,
and buy the latest Naturally Speaking.  This is why it's so important
to succeed with a decent large-vocabulary FOSS project.

However, we don't have to match NS right away.  Enabling decently
productive native voice programming may be enough to attract many
volunteers who would help advance the tools.

> I also suggest looking to history. Look at all the things that have failed
> repeatedly. I can give you a very long list that's very discouraging, but
> the nice thing about the list is that it forces you to think differently.
> Don't try to impose a GUI interface on speech recognition. Build a user
> interface which has discoverability.

Sorry, I didn't understand that.  What does it mean to impose a GUI
interface, or to build an interface which has discoverability?

> Don't try to force a disabled user to work on a
> single machine. Embrace the fact that your applications, data, etc. run on a
> different machine. Remember that with speech recognition, you don't need to
> just enter data; you also need to edit it.

I think the blind have similar issues.  Typically, they build their
interface to the world on one or two machines, which today are usually
Windows machines.  It would be outstanding to be able to use any
machine.  There are several solid speech recognition engines, but few
have the other features needed to make writing documents by voice
effective.

>> http://xvoice.sourceforge.net/
>>
>> Xvoice was in fact used by programmers with typing impairments up
>> until the day IBM stopped selling licenses to ViaVoice for Linux.
>> When IBM did that, those programmers lost the ability to program by
>> voice natively in Linux.  IBM derailed programming by voice in Linux
>> for a decade, and we still have not recovered.  In case you didn't
>> know, Microsoft owns HTK, not Cambridge University.  So, every Linux
>> project that depends on HTK can be killed at any time by Microsoft.
>
> That's not exactly what happened.

Granted, I suffer from a strong case of ignorance when it comes to
xvoice, and to everything that's happened in voice coding over the
last ten years.  However, I can learn!

> In the first place, programming by voice
> has never really been practical.

Coding by voice, I was at least at 80% of the productivity I had while
typing.  It's not easy to do, but we can help make it easier.

> Creating code by voice became more practical
> with the Voice Coder project. Not wonderful, but better than straight
> dictation, except it ruins your ability to dictate comments. IBM had nothing
> to do with ruining your ability to program by voice. It was that we couldn't
> get any attention from anyone in the open-source community to help us with
> the problem. We have a solution, and it does some really nice things, but I
> think the problem needs to be solved by going in a different direction.
>
> As a person who actually tried to use the IBM product, it was a stinking
> pile of crap with a boatload of errors that IBM had no interest in fixing.
> When I posted a list of failures, that message was censored from the list. I
> sent it to a bunch of people who asked questions after seeing the list, and
> the second time it got through. As far as I'm concerned, it wasn't useful;
> it was a cruel joke that ate hundreds of hours of my life and my hands, a
> loss I didn't need at the time.

I also tried ViaVoice, and gave up on it quickly.  It had a good core
recognition engine, but that was it.  It simply was not useful for
programming by voice in Windows, at least when I tried it.  Naturally
Speaking was also a steaming pile, but the steam was a bit less nasty.
It was actually barely usable if you held your nose.  The core
recognition engine in Naturally Speaking is great, but last I checked,
NS still sucked for coding.  However, it is doable.

> As for the whole HTK thing, I really don't care. I use NaturallySpeaking; if
> Nuance stops selling it, I can keep using my license.

If you stop upgrading to the current tools and stick with old ones,
your life will slowly become more and more difficult, and the world
will become less and less accessible.

We have this problem big-time in Linux in the vision impaired
community.  PulseAudio trashed speech output, so almost everybody just
stopped upgrading their systems to the latest code.  As a result, the
major distros stopped getting feedback from the vision impaired to let
them know when they broke accessibility further, and the major
distros' accessibility began to degrade rapidly.

If I can point at a single accomplishment I'm proud of so far in the
accessibility area, it's porting Vinux back to the latest stable
branch of the most widely used Linux distro, and working with lots of
people to fix all sorts of stuff that had broken.  Now we're able to
work with recent Linux code, and help make accessibility better.
We're attracting some good developers, many of whom are fully blind,
to help us improve accessibility at a far more rapid pace.  Except for
Sina, who still hasn't gotten me that remote support app!

> If anything gets in
> the way, I go to court to get a remedy. I suspect I would not be the only
> disabled person working with the courts, either. If you take the same
> approach to HTK (i.e. mirror it in case of legal disaster), you can move on
> with your life and deal with the problem when it comes up. I believe the
> courts look favorably on innovative solutions that solve disability problems
> without impairing normal commercial activity.

If I could just ship HTK, in violation of its license, and thereby
gain a wide enough user base of typing-impaired programmers to help us
write the next generation of code, I'd do it.  Unfortunately, that
option is not realistic.  I'd lose the support of the Simon developers
(not that I have any!), and never get Simon into Ubuntu, Fedora,
Debian, or any other major distro.  Realistically, we'll need to deal
with this license issue.  Simon seems to be far ahead of other FOSS
projects in delivering functionality that would enable programming by
voice in Linux.  I think the fastest way to get there is to deal with
the license issue, and then use Simon to program by voice in emacs.

> Also, I know people don't want to hear this, but programming by voice is
> independent of the speech recognition engine. If you build on top of the
> Dragonfly SDK, you don't care whether you are using Microsoft or Nuance for
> your speech recognition engine.

Actually, I love hearing that!  However, I think it is only mostly
true.  For the next generation of speech-enabled tools, I think we're
going to need more functionality from the speech engine.  For example,
I've never seen Naturally Speaking switch vocabularies based on the
application with focus.  Is this possible with NS?  I've also never
seen it do continuous command recognition, though it sounds like a
FOSS bolt-on application can parse the result of continuous speech
recognition and execute commands continuously.  Is that what Vocola
does?
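
From what I've read of Dragonfly's documentation, per-application
vocabulary switching is exactly what its grammar contexts give you,
even when NS itself won't do it.  Here's a rough, untested Python
sketch; the command names and key bindings are ones I made up for
illustration:

    # A Dragonfly grammar that is active only when Emacs has the
    # focus -- per-application vocabulary switching via AppContext.
    from dragonfly import (Grammar, MappingRule, AppContext,
                           Key, Text, Dictation)

    class EmacsRule(MappingRule):
        mapping = {
            "save buffer":  Key("c-x, c-s"),   # sends Ctrl+X Ctrl+S
            "kill line":    Key("c-k"),
            "other window": Key("c-x, o"),
            "say <text>":   Text("%(text)s"),  # insert free dictation
        }
        extras = [Dictation("text")]

    # The context restricts this vocabulary to Emacs windows, so it
    # never collides with commands meant for other applications.
    grammar = Grammar("emacs commands",
                      context=AppContext(executable="emacs"))
    grammar.add_rule(EmacsRule())
    grammar.load()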

> If you want to really support disabled people,
> help build applications using dragonfly and once you solve the problem for
> disabled users, then go build a speech recognition engine.   Remember,
> handicap accessibility trumps politics. If we can't work, it's bloody
> useless.

Handicap accessibility trumps politics.  100% agreed.  But wait until
we've solved all the problems using NS, and then replace it?  If I
felt we could solve all the problems using NS in Windows, I wouldn't
bother trying to work with people to replace it.  However, as you
pointed out, progress over the last several years has basically
sucked, though it seems some free stuff like Vocola has moved forward.

> Let me reinforce this. Typing impaired people (a really bad nomenclature,
> since I'm also driving impaired, door opening impaired, preparing food
> impaired, hugging other people impaired...) don't care about licenses. They
> care about being able to participate online, work, write, etc.

Agreed.  When I couldn't type, I was also beer-chugging impaired.  It
hurt to lift the damned bottle.  I was also shampooing impaired and
zipper impaired, and my girlfriend (now wife) kept trying to hold my
hand, and I would try not to grimace.

> Full native-language dictation is the most important feature. If your speech
> recognition package can't be used to create a message like this e-mail, then
> you have failed. Completely and totally failed.

I don't agree.  Solid large-vocabulary speech recognition would be
nice, but it's low on the list of what we need.  We have seen good
solutions in Naturally Speaking, ViaVoice, Windows Speech Recognition,
and other commercial tools.  Companies seem to think the real money is
in convincing average Windows users to use speech recognition to write
e-mails.  I can do that on Windows, on a Mac, and even on my Google
Nexus One with fair accuracy.  NS happens to do the related stuff
better, like letting me correct text by voice, but solid recognition
seems to be widespread technology.  So long as we're not trying to be
FOSS purists, we can continue to use these tools until we have better
FOSS tools.  In the meantime, we can help make programming by voice
easier and more productive.

> How about this. Let's start with something simple, like fixing Emacs vr-mode
> so we can use NaturallySpeaking with Emacs on multiple platforms. If you
> can't get a useful tool for disabled programmers working, then something is
> seriously wrong, and I don't believe you have the interests of disabled
> programmers in mind. Harsh words, but right now I can't use Emacs, and I go
> to a proprietary editor because that's the only choice I have if I'm going
> to work.

OK, let's fix programming by voice with emacs.  I'm on board with
that.  I did program by voice with emacs, but all that work I did was
not useful to other voice coders.  I think that the Simon concept of
scenarios may provide a way for us to write emacs macros that we can
share.  I also like what I read on the Vocola.net site.  We should
dive into that and see what can be leveraged.  We should form an
e-mail list with experts in coding by voice - meaning the people who
actually do it - and see if we can come up with a worthwhile plan or
roadmap.

> Maybe this other example might help. When the Free Software Foundation
> first started up, Emacs ran on a bunch of proprietary platforms. It showed
> people the benefits of open source. Then came a whole bunch of other
> components in the GNU tool chain. Eventually, thanks to Linux, a TAB was
> able to use a completely free system, or a broadly functional, not-so-free
> system. Right now, we are back at the beginning. We don't even have the
> basic Emacs equivalent in handicap accessibility applications. Let's start
> with Emacs again and gradually add speech recognition enhancements
> throughout the entire system.

Sounds good to me!

Thanks for all the insights.
Bill


