Spam Assassin
The Spam Assassin public mail corpus.
Usage
var corpus = require( '@stdlib/datasets/spam-assassin' );
corpus()
Returns the Spam Assassin public mail corpus.
var data = corpus();
// returns [{...},{...},...]
Each array element has the following fields:
- id: message id (relative to message group)
- group: message group
- checksum: object containing checksum info
- text: message text (including headers)
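Because the text field includes headers, it can be useful to separate them from the message body. The following is a minimal sketch, not part of the package: splitMessage is a hypothetical helper, and it assumes the standard e-mail convention that the first blank line ends the header block. The sample string is made up for illustration and is not taken from the corpus.

```javascript
// Hypothetical helper (not part of the package): split a raw message
// into its header block and body at the first blank line.
function splitMessage( text ) {
    var match = /\r?\n\r?\n/.exec( text );
    if ( match === null ) {
        // No blank line found: treat the whole text as headers.
        return { 'headers': text, 'body': '' };
    }
    return {
        'headers': text.slice( 0, match.index ),
        'body': text.slice( match.index + match[ 0 ].length )
    };
}

// Made-up sample message (assumption, for illustration only):
var sample = 'From: alice@example.com\nSubject: Hello\n\nFirst line of the body.\n';
var parts = splitMessage( sample );

console.log( 'Headers:\n%s', parts.headers );
console.log( 'Body:\n%s', parts.body );
```

With the real corpus, the same helper could be applied as splitMessage( data[ i ].text ). Messages with unusual formatting may need more careful parsing.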
The message group may be one of the following:
- easy-ham-1: easier to detect non-spam e-mails (2500 messages)
- easy-ham-2: easier to detect non-spam e-mails collected at a later date (1400 messages)
- hard-ham-1: harder to detect non-spam e-mails (250 messages)
- spam-1: spam e-mails (500 messages)
- spam-2: spam e-mails collected at a later date (1396 messages)
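As an illustration of working with the group field, the sketch below tallies messages per group. countByGroup is a hypothetical helper, and the sample array is a small stand-in with placeholder values that only mimics the documented element shape.

```javascript
// Hypothetical helper (not part of the package): count messages per group.
function countByGroup( data ) {
    var counts = {};
    var g;
    var i;
    for ( i = 0; i < data.length; i++ ) {
        g = data[ i ].group;
        counts[ g ] = ( counts[ g ] || 0 ) + 1;
    }
    return counts;
}

// Stand-in sample with placeholder values (assumption, not real corpus data):
var sample = [
    { 'id': 'a', 'group': 'spam-1', 'checksum': { 'type': 'MD5', 'value': 'placeholder' }, 'text': 'Subject: one\n\nbody' },
    { 'id': 'b', 'group': 'spam-1', 'checksum': { 'type': 'MD5', 'value': 'placeholder' }, 'text': 'Subject: two\n\nbody' },
    { 'id': 'a', 'group': 'easy-ham-1', 'checksum': { 'type': 'MD5', 'value': 'placeholder' }, 'text': 'Subject: three\n\nbody' }
];
var counts = countByGroup( sample );

console.log( counts );
```

Calling countByGroup( corpus() ) on the full corpus should yield one entry per group, which can be compared against the message counts listed above.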
The checksum object contains the following fields:
- type: checksum type (e.g., MD5)
- value: checksum value
Examples
var corpus = require( '@stdlib/datasets/spam-assassin' );
var data;
var i;
data = corpus();
for ( i = 0; i < data.length; i++ ) {
    console.log( 'Character Count: %d', data[ i ].text.length );
}
CLI
Usage
Usage: spam-assassin [options]

Options:

  -h,    --help                Print this message.
  -V,    --version             Print the package version.
         --format fmt          Output format: 'txt' or 'ndjson'.
Notes
- The CLI supports two output formats: plain text (txt) and newline-delimited JSON (NDJSON). The default output format is txt.
Examples
$ spam-assassin
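When the output format is set to ndjson (for example, via spam-assassin --format ndjson), the output can be consumed one line at a time. The sketch below is a minimal illustration: parseNDJSON is a hypothetical helper, and it assumes that each non-empty output line is a single JSON object. The sample string is made up and the fields it contains are an assumption about the record shape.

```javascript
// Hypothetical helper (not part of the package): parse newline-delimited JSON.
function parseNDJSON( str ) {
    var lines = str.split( /\r?\n/ );
    var out = [];
    var i;
    for ( i = 0; i < lines.length; i++ ) {
        // Skip blank lines, such as the one after a trailing newline.
        if ( lines[ i ].trim() !== '' ) {
            out.push( JSON.parse( lines[ i ] ) );
        }
    }
    return out;
}

// Made-up sample output (assumption, for illustration only):
var sample = '{"id":"a","group":"spam-1"}\n{"id":"b","group":"easy-ham-1"}\n';
var records = parseNDJSON( sample );

console.log( 'Parsed %d records', records.length );
```

For the full corpus, reading the output as a stream and parsing each line as it arrives would avoid holding the entire output in memory.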
License
The data files (databases) are licensed under an Open Data Commons Public Domain Dedication & License 1.0 and their contents are licensed under Creative Commons Zero v1.0 Universal. The software is licensed under Apache License, Version 2.0.