How many distinct characters does SynthText in the Wild Dataset contain?

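I downloaded SynthText from the official Wild Dataset site.

Then I read the official readme.txt, but I could not find how many distinct characters the dataset contains. I also searched on Google without success.

As the example image below shows, some symbols such as . : - appear. So the dataset contains letters (26) + digits (10) + some symbols (?).

[example image]

Does anyone know?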

Solution

I wrote my own code to count the symbols.

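import logging
import os
import re
import sys

import cv2
import numpy as np
import scipy.io as sio


class _Skip(Exception):
    """Raised to skip an image whose file is missing."""


def get_characters(basedir, imagedirname='SynthText', skip_missing=False):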

    class Symbols:
        def __init__(self):
            self.symbols = set()

        def update(self,data):
            self.symbols = self.symbols.union(data)

        def __len__(self):
            return len(self.symbols)

        def __str__(self):
            return ''.join(self.symbols)

    symbols = Symbols()

    def csvgenerator(annodir,imagedir,cbb,wBB,imname,txts,symbols,**kwargs):
        image_num = kwargs.get('image_num')
        i = kwargs.get('i')

        imgpath = os.path.join(imagedir,imname)

        # check that the file exists before trying to read it
        if not os.path.exists(imgpath):
            if not skip_missing:
                raise FileNotFoundError('{} was not found'.format(imgpath))
            else:
                logging.warning('Missing image: {}'.format(imgpath))
                raise _Skip()

        # h and w are unused when only counting symbols
        img = cv2.imread(imgpath)
        h, w, _ = img.shape


        # Convert txts to a flat list of words.
        # For some reason txts looks like
        # ['Lines:\nI lost\nKevin ', 'will                ', 'line\nand            ',
        #  'and\nthe             ', '(and                ', 'the\nout             ',
        #  'you                 ', "don't\n pkg          "]
        # Entries contain odd padding and newlines, so len(txts) differs from
        # the number of word boxes in wBB until the entries are split on whitespace.
        txts = ' '.join(txts.tolist()).split()
        text_num = len(txts)

        if wBB.ndim == 2:
            # a single word box has shape (2,4); expand it to (2,4,1)
            wBB = np.expand_dims(wBB,2)

        assert text_num == wBB.shape[2],'The length of text and wordBB must be same,but got {} and {}'.format(
            text_num,wBB.shape[2])

        # replace non-alphanumeric characters with *
        alltexts_asterisk = ''.join([re.sub(r'[^A-Za-z0-9]','*',text) for text in txts])
        assert len(alltexts_asterisk) == cbb.shape[2], \
            'The length of characters and cbb must be same,but got {} and {}'.format(
                len(alltexts_asterisk), cbb.shape[2])
        for b in range(text_num):
            text = txts[b]

            symboltext = re.sub(r'[A-Za-z0-9]+','',text)

            symbols.update(symboltext)

        sys.stdout.write('\r{},and number is {}...{:0.1f}% ({}/{})'.format(symbols,len(symbols),100 * (float(i + 1) / image_num),i + 1,image_num))
        sys.stdout.flush()

    _gtmatRecognizer(csvgenerator,basedir,imagedirname,customLog=True,symbols=symbols)

    print()
    print('symbols are {},and number is {}'.format(symbols,len(symbols)))


def _gtmatRecognizer(generator, basedir, imagedirname='SynthText', customLog=False, **kwargs):
    """
        convert gt.mat to https://github.com/MhLiao/TextBoxes_plusplus/blob/master/data/example.xml

        <annotation>
            <folder>train_images</folder>
            <filename>img_10.jpg</filename>
            <size>
                <width>1280</width>
                <height>720</height>
                <depth>3</depth>
            </size>
            <object>
                <difficult>1</difficult>
                <content>###</content>
                <name>text</name>
                <bndbox>
                    <x1>1011</x1>
                    <y1>157</y1>
                    <x2>1079</x2>
                    <y2>160</y2>
                    <x3>1076</x3>
                    <y3>173</y3>
                    <x4>1011</x4>
                    <y4>170</y4>
                    <xmin>1011</xmin>
                    <ymin>157</ymin>
                    <xmax>1079</xmax>
                    <ymax>173</ymax>
                </bndbox>
            </object>
            .
            .
            .

        </annotation>

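        :param generator: callable invoked once per image
        :param basedir: str, directory that contains 'SynthText' (and 'licence.txt')
        :param imagedirname: (Optional) str, name of the image directory that holds 'gt.mat'
        :param customLog: (Optional) bool, if True the generator prints its own progress
        :return: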
        """
    logging.basicConfig(level=logging.INFO)

    imagedir = os.path.join(basedir,imagedirname)
    gtpath = os.path.join(imagedir,'gt.mat')

    annodir = os.path.join(basedir,'Annotations')

    if not os.path.exists(gtpath):
        raise FileNotFoundError('{} was not found'.format(gtpath))

    if not os.path.exists(annodir):
        # create Annotations directory
        os.mkdir(annodir)

    """
    ref: http://www.robots.ox.ac.uk/~vgg/data/scenetext/readme.txt
    gts = dict;
        __header__: bytes
        __version__: str
        __globals__: list
        charBB: object ndarray, shape = (1, image num).
                Character level bounding box. shape = (2=(x,y), 4=(top left,...: clockwise), BBox char num)
        wordBB: object ndarray, shape = (1, image num).
                Word level bounding box. shape = (2=(x,y), 4=(top left,...: clockwise), BBox word num)
        imnames: object ndarray, shape = (1, image num). Each entry has shape (1,).
        txt: object ndarray, shape = (1, image num).
             Text. shape = (text instance num); one instance may contain several words
    """
    logging.info('Loading {} now.\nIt may take a while.'.format(gtpath))
    gts = sio.loadmat(gtpath)
    logging.info('Loaded\n')

    charBB = gts['charBB'][0]
    wordBB = gts['wordBB'][0]
    imnames = gts['imnames'][0]
    texts = gts['txt'][0]

    image_num = imnames.size

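    for i, (cbb, wBB, imname, txts) in enumerate(zip(charBB, wordBB, imnames, texts)):
        # each image name is stored as a one-element array
        imname = imname[0]

        try:
            generator(annodir, imagedir, cbb, wBB, imname, txts, i=i, image_num=image_num, **kwargs)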
        except _Skip:
            pass

        if not customLog:
            sys.stdout.write('\rGenerating... {:0.1f}% ({}/{})'.format(100 * (float(i + 1) / image_num), i + 1, image_num))
        sys.stdout.flush()


    print()
    logging.info('Finished!!!')
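
The core of the counting step can be tried in isolation, without downloading gt.mat. The following is a minimal sketch, not part of the answer's code: `collect_symbols` is a hypothetical helper name, and the sample strings are taken from the comment in `csvgenerator` above. It splits each padded, multi-line entry on whitespace, strips the alphanumeric runs, and accumulates whatever is left in a set.

```python
import re


def collect_symbols(txt_entries):
    """Return the set of non-alphanumeric characters found in the entries."""
    symbols = set()
    for entry in txt_entries:
        # One entry may hold several words separated by spaces or newlines.
        for word in entry.split():
            # Remove runs of letters and digits; the remainder are symbols.
            symbols.update(re.sub(r'[A-Za-z0-9]+', '', word))
    return symbols


# Sample entries in the padded format that gt.mat uses for 'txt'.
sample = ['Lines:\nI lost\nKevin ', 'will                ',
          '(and                ', "don't\n pkg          "]
found = collect_symbols(sample)
print(sorted(found))
```

Because whitespace is consumed by `split()`, the space character can never end up in the set, which matches the result reported below.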

In the end I got the symbol count: 32. The character set appears to be the ASCII printable characters without the space.

INFO:root:Loading ~/data/text/SynthText/SynthText/gt.mat now.
It may take a while.
INFO:root:Loaded

}&|%_(],$^{+?#@/-`).<=;~['>:\!"*,and number is 32...100.0% (858750/858750)
INFO:root:Finished!!!

symbols are }&|%_(],$^{+?#@/-`).<=;~['>:\!"*,and number is 32
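
The claim that these are the printable ASCII characters minus the space can be checked against Python's standard `string` module. This is a small sketch, not part of the original answer. The total at the end rests on an assumption the scan above does not verify: that all 52 upper- and lower-case letters and all 10 digits also occur in the dataset.

```python
import string

# Symbols reported by the run above (copied from the log line).
reported = r'''}&|%_(],$^{+?#@/-`).<=;~['>:\!"*'''

# string.punctuation holds the 32 printable ASCII characters that are
# neither alphanumeric nor whitespace.
assert set(reported) == set(string.punctuation)
print(len(set(reported)))  # 32

# Assumption: every ASCII letter (52) and digit (10) also appears.
total = len(string.ascii_letters) + len(string.digits) + len(set(reported))
print(total)  # 94
```

Under that assumption a recognizer trained on SynthText would need 94 character classes, or 68 if letter case is ignored (26 + 10 + 32).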
