Paste: sentences split
Author: kobi
Mode: factor
Date: Sat, 11 Sep 2010 16:31:03
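! general-utils is not part of the standard library; trim-spaces, replace1
! and replace-subseq are presumably defined there.
USING: accessors assocs fry general-utils grouping kernel
locals math.order regexp sequences sorting splitting ;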
IN: sentences

! Regexp for dotted abbreviations: single word characters separated by periods.
: abbrev-pattern ( -- regexp )
    R/ ((\w\.)+\w)|((\w\.){2,}) / ;

! Order slices by their start index in the underlying string.
: sort-slices-by-from ( slices -- slices' )
    [ [ from>> ] bi@ <=> ] sort ;
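
! Split wherever a character is followed by a run of . ? ! or :, then rejoin
! each piece with the match that ended it. Abbreviations are not handled here.
: naive-sentence-split ( str -- lines )
    R/ .[.?!:]+/ [ re-split ] [ all-matching-slices ] bi-curry bi append
    sort-slices-by-from
    2 <groups>
    [ concat trim-spaces ] map
    harvest ;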

! Collect every abbreviation that occurs in the string.
: abbreviations ( string -- abbrevs )
    abbrev-pattern all-matching-subseqs [ trim-spaces ] map ;

! Pair each abbreviation with a masked copy in which "。" stands in for "."
: (abbrev-replace-pairs) ( abbrevs -- pairs )
    dup [ { CHAR: . } "。" replace1 ] map zip ;

! Swap the two elements of every pair, turning mask pairs into unmask pairs.
: reverse-pairs ( pairs -- sriap )
    [ values ] [ keys ] bi zip ;

! Apply each { old new } replacement to the text in turn.
: replace-all ( text pairs -- text' )
    [ first2 replace-subseq ] each ;
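
! Mask the abbreviations, split naively, then restore the abbreviations in
! each resulting sentence.
:: sentence-split ( text -- sentences )
    text abbreviations
    (abbrev-replace-pairs) :> pairs
    text pairs replace-all
    naive-sentence-split
    pairs reverse-pairs '[ _ replace-all ] map ;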