The cleaning script wrongly removes some numbers #8

@jiejie1993

`ORIGINAL| 2021-02-03 16:09:00 天润乳业公告,拟在新疆生产建设兵团第十二师222团投资建设10000头规模化奶牛示范牧场项目,222团予以提供牧场运营所需配套资源,222团保证二十年内免费提供本项目使用的设施农业用地,免征土地租赁费,并保障项目生产经营所需水、电等基础配套设施。 |

CLEANED| 02-03 16:09:00 天润乳业公告,拟在新疆生产建设兵团第十二师222团投资建设头规模化奶牛示范牧场项目,222团予以提供牧场运营所需配套资源,222团保证二十年内免费提供本项目使用的设施农业用地,免征土地租赁费,并保障项目生产经营所需水、电等基础配套设施。 |`
As shown above, the year "2021" (together with the hyphen after it) and the number "10000" were deleted from the text. Which config option causes this deletion?
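To pin down exactly which characters were dropped, regardless of which filter is responsible, the two strings can be diffed. This is a minimal sketch using Python's standard `difflib`; `deleted_spans` is a hypothetical helper name, not part of the cleaning script, and the two strings are shortened excerpts of the ORIGINAL and CLEANED text above.

```python
from difflib import SequenceMatcher

# Shortened excerpts of the ORIGINAL and CLEANED text from this issue.
original = "2021-02-03 16:09:00 天润乳业公告,拟在新疆生产建设兵团第十二师222团投资建设10000头规模化奶牛示范牧场项目"
cleaned = "02-03 16:09:00 天润乳业公告,拟在新疆生产建设兵团第十二师222团投资建设头规模化奶牛示范牧场项目"


def deleted_spans(before: str, after: str) -> list[str]:
    """Return the substrings that are present in `before` but missing in `after`."""
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    return [
        before[i1:i2]
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag == "delete"
    ]


spans = deleted_spans(original, cleaned)
print(spans)
```

Note that "222" survives in both places while "2021-" and "10000" do not, which suggests the rule involved depends on the length or position of the digit run rather than removing every number.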

My config file is:

    basic:
      batch_size: 3000
      input: Astock_all_converted.jsonl
      is_jsonl: true
      num_workers: 32
      output: Astock_all.jsonl
      result_key: target
      source_key: target
    extractors:
      ContentExtractor:
        save_key: pageContent
      TimeExtractor:
        save_key: pagePublishTime
      TitleExtractor:
        save_key: pageTitle
    filters:
      SimplifiedFilter:
        config_file: t2s.json
      SymbolFilter:
        filter_control: true
        filter_emoji: true
      TextCleaner:
        filter_extraspace: true
        filter_personal: true
        filter_url: true
      TextIntegrityChecker:
        do_end_clip: true
        double_mark_check: true
        end_mark_check: true
        length_check: true
        min_length: 16
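One way to narrow down the responsible option is to rerun the cleaning with one boolean filter flag switched off at a time and check when the numbers stop disappearing. The sketch below only generates those config variants from the `filters` section above. It assumes that setting a flag to `false` disables the corresponding rule, which is not confirmed here, and `one_flag_off_variants` is a hypothetical helper, not part of the cleaning script. How each variant is fed back into the script is left out because the invocation is not shown in this issue.

```python
import copy

# The `filters` section copied from the config in this issue.
FILTERS = {
    "SimplifiedFilter": {"config_file": "t2s.json"},
    "SymbolFilter": {"filter_control": True, "filter_emoji": True},
    "TextCleaner": {
        "filter_extraspace": True,
        "filter_personal": True,
        "filter_url": True,
    },
    "TextIntegrityChecker": {
        "do_end_clip": True,
        "double_mark_check": True,
        "end_mark_check": True,
        "length_check": True,
        "min_length": 16,
    },
}


def one_flag_off_variants(filters):
    """Yield (label, variant) pairs, each with exactly one True flag set to False."""
    for filter_name, options in filters.items():
        for key, value in options.items():
            if value is True:  # skip non-boolean options such as min_length
                variant = copy.deepcopy(filters)
                variant[filter_name][key] = False
                yield f"{filter_name}.{key}", variant


variants = list(one_flag_off_variants(FILTERS))
for label, _variant in variants:
    print(label)
```

Each variant can be written out as its own config file and run on a single-line input containing the sample above. The variant whose output keeps "2021" and "10000" points at the flag to inspect.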
