CMUdict

The Carnegie Mellon Pronouncing Dictionary.

The Carnegie Mellon University Pronouncing Dictionary (CMUdict), created by the Speech Group in the School of Computer Science at CMU, is "an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words".

Usage

var cmudict = require( '@stdlib/datasets/cmudict' );

cmudict( [options] )

Returns datasets from the Carnegie Mellon Pronouncing Dictionary (CMUdict).

var data = cmudict();
/* returns
    {
        'dict': {...},
        'phones': {...},
        'symbols': [...],
        'vp': {...}
    }
*/

The function accepts the following options:

data: dataset name. The following names are recognized:
- dict: the main pronouncing dictionary.
- phones: manners of articulation for each sound.
- symbols: complete list of ARPABET symbols used by the dictionary.
- vp: verbal pronunciations of punctuation marks.

To return only the main pronouncing dictionary, set the data option to dict.

var opts = {
    'data': 'dict'
};

var data = cmudict( opts );
/* returns
    {
        'A': 'AH0',
        'A(1)': 'EY1',
        'A\'S': 'EY1 Z',
        // ...
    }
*/
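
The sample output above suggests that alternate pronunciations of a word are stored under keys with a parenthesized numeric suffix, such as A(1). The following is a minimal sketch of how those variants might be gathered for a single word. The pronunciations helper is hypothetical and not part of this package, and it assumes that variants are numbered consecutively starting at 1 and that keys are upper case.

```javascript
// Hypothetical helper (not part of @stdlib/datasets/cmudict): collects every
// pronunciation of a word. Assumes variants are keyed as WORD(1), WORD(2), ...
// with no gaps in the numbering.
function pronunciations( dict, word ) {
    var base = word.toUpperCase();
    var key = base;
    var out = [];
    var i = 0;
    while ( Object.prototype.hasOwnProperty.call( dict, key ) ) {
        out.push( dict[ key ] );
        i += 1;
        key = base + '(' + i + ')';
    }
    return out;
}

// Sample entries copied from the output shown above:
var sample = {
    'A': 'AH0',
    'A(1)': 'EY1',
    'A\'S': 'EY1 Z'
};

console.log( pronunciations( sample, 'a' ) );
// => [ 'AH0', 'EY1' ]
```

With the real dataset, the first argument would be the object returned when the data option is set to dict.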

To return only sound articulation manners, set the data option to phones.

var opts = {
    'data': 'phones'
};

var data = cmudict( opts );
/* returns
    {
        'AA': 'vowel',
        'AE': 'vowel',
        'AH': 'vowel',
        // ...
    }
*/
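
Because the phones dataset is keyed by bare symbols (AA, AE, ...), while dictionary pronunciations attach a stress digit to vowels (AH0, EY1, ...), a lookup likely needs to strip that digit first. The sketch below illustrates one way to do so. The manners helper is hypothetical, and the sample object contains only the three entries shown above, so any other symbol yields null here.

```javascript
// Hypothetical helper (not part of @stdlib/datasets/cmudict): maps each phone
// in a pronunciation to its manner of articulation. A trailing stress digit
// (0, 1, or 2) is stripped before the lookup.
function manners( phones, pronunciation ) {
    return pronunciation.split( ' ' ).map( function lookup( symbol ) {
        var base = symbol.replace( /[012]$/, '' );
        if ( Object.prototype.hasOwnProperty.call( phones, base ) ) {
            return phones[ base ];
        }
        return null; // symbol not present in the provided table
    });
}

// Truncated sample copied from the output shown above:
var sample = {
    'AA': 'vowel',
    'AE': 'vowel',
    'AH': 'vowel'
};

console.log( manners( sample, 'AH0' ) );
// => [ 'vowel' ]
```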

To return only ARPABET symbols used by the dictionary, set the data option to symbols.

var opts = {
    'data': 'symbols'
};

var data = cmudict( opts );
/* returns
    [
        'AA',
        'AA0',
        'AA1',
        // ...
    ]
*/

To return only the verbal pronunciations of punctuation marks, set the data option to vp.

var opts = {
    'data': 'vp'
};

var data = cmudict( opts );
/* returns
    {
        '!exclamation-point': 'EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T',
        '"close-quote': 'K L OW1 Z K W OW1 T',
        '"double-quote': 'D AH1 B AH0 L K W OW1 T',
        // ...
    }
*/

Notes

Vowels carry a lexical stress marker as a trailing digit (0: no stress, 1: primary stress, 2: secondary stress).
The phoneme set is based on the ARPABET symbol set developed for speech recognition.
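
As a rough illustration of the stress markers, the following sketch extracts the stress pattern from a pronunciation string. The stressPattern helper is hypothetical and not part of this package. It assumes symbols are separated by single spaces and that only vowels end in a digit, so the length of the result also equals the number of vowel sounds.

```javascript
// Hypothetical helper (not part of @stdlib/datasets/cmudict): returns the
// stress digit of every vowel in a pronunciation, in order of appearance.
function stressPattern( pronunciation ) {
    var symbols = pronunciation.split( ' ' );
    var out = [];
    var m;
    var i;
    for ( i = 0; i < symbols.length; i++ ) {
        m = /([012])$/.exec( symbols[ i ] );
        if ( m ) {
            out.push( parseInt( m[ 1 ], 10 ) );
        }
    }
    return out;
}

// Pronunciation of "exclamation point", taken from the vp output shown earlier:
console.log( stressPattern( 'EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T' ) );
// => [ 2, 0, 1, 0, 2 ]
```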

Examples

var cmudict = require( '@stdlib/datasets/cmudict' );

var opts = {};

opts.data = 'phones';
console.dir( cmudict( opts ) );

opts.data = 'symbols';
console.dir( cmudict( opts ) );

opts.data = 'dict';
console.dir( cmudict( opts ) );

CLI

Usage

Usage: cmudict [options]

Options:

  -h,    --help                Print this message.
  -V,    --version             Print the package version.
         --data name           Dataset name: dict, phones, symbols, vp.

Notes

If the --data option is set to a supported dataset name, the CLI prints the contents of the respective dataset file as plain text. Otherwise, the output format is newline-delimited JSON (NDJSON).

Examples

$ cmudict --data symbols
AA
AA0
AA1
AA2
...

License

The data files (databases) and their contents are licensed under a BSD-2-Clause license. The software is licensed under the Apache License, Version 2.0.