Determining the number of columns in a newspaper article

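Suppose the following newspaper article needs to be analyzed for its number of columns (the answer should be 3 text columns). I tried to retrieve the number of columns using the cv2 library and Python, and found the following suggestion on StackOverflow: Detect number of rows and columns in table image with OpenCV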

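However, the table in that solution is well structured, so the number of columns and rows can be extracted easily there. Based on that solution, this is what I came up with: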

import numpy as np
from imutils import contours
import cv2

# Load image,grayscale,Gaussian blur,Otsu's threshold
image = cv2.imread('example_newspaper_article.png')
gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray,(5,5),0)
thresh = cv2.threshold(blur,240,255,cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Find contours and remove text inside cells
cnts = cv2.findContours(thresh,cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    area = cv2.contourArea(c)
    if area < 10000:
        cv2.drawContours(thresh,[c],-1,(255,255,255),30)

# Invert image
invert = thresh
offset,old_cY,first = 10,0,True
visualize = cv2.cvtColor(invert,cv2.COLOR_GRAY2BGR)

# Find contours,sort from top-to-bottom and then sum up column/rows
cnts = cv2.findContours(invert,cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
(cnts,_) = contours.sort_contours(cnts,method="top-to-bottom")
for c in cnts:
    # Find centroid
    M = cv2.moments(c)
    cX = int(M["m10"] / M["m00"])
    cY = int(M["m01"] / M["m00"])

    # New row
    if (abs(cY) - abs(old_cY)) > offset:
        if first:
            row,table = [],[]
            first = False
        old_cY = cY
        table.append(row)
        row = []
    # Cell in same row
    if ((abs(cY) - abs(old_cY)) <= offset) or first:
        row.append(1)
    # Uncomment to visualize
    #cv2.circle(visualize,(cX,cY),10,(36,255,12),-1)
    #cv2.imshow('visualize',visualize)
    #cv2.waitKey(200)

print('Rows: {}'.format(len(table)))
print('Columns: {}'.format(len(table[1])))

cv2.imshow('invert',invert)
cv2.imshow('thresh',thresh)
cv2.waitKey()

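I thought that increasing the thickness parameter of the drawContours method would help somewhat, but unfortunately it does not solve the problem. The result looks like this: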

Would drawing rectangles over the text regions be more useful? Does anyone know a solution and can help me? Thanks in advance!

Solution

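Whenever I have a task like this, I count pixels along the y-axis and try to find (large) differences between neighbouring pixel columns. This would be my complete pipeline: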

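1. Convert the image to grayscale; apply an inverse binary threshold using Otsu's method to get white pixels on a black background.
2. Apply a morphological closing, here with a large vertical line kernel, to connect all pixels that lie in the same pixel column.
3. Count the white pixels in each pixel column; calculate the absolute difference between neighbouring pixel columns.
4. Find the peaks in that "signal", either manually or with scipy.signal.find_peaks. The peaks mark the start and end of each text column, so the number of text columns is half the number of peaks.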
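To see why the column count is half the number of peaks, here is a minimal sketch on a made-up toy "page" (the array size and block positions are assumptions for illustration only, and plain non-zero counting stands in for the peak finder). It uses NumPy alone and assumes the image is already thresholded and vertically closed:

```python
import numpy as np

# Hypothetical toy "page": 20 rows x 30 pixel columns, white (1) text blocks
# on a black (0) background, as if already thresholded and vertically closed.
page = np.zeros((20, 30), dtype=np.uint8)
for start, stop in [(2, 8), (12, 18), (22, 28)]:   # three text columns
    page[:, start:stop] = 1

# Count white pixels in every pixel column (sum over the rows, axis 0)
profile = page.sum(axis=0)

# Absolute difference between neighbouring pixel columns; cast to a signed
# integer type first so that the subtraction cannot wrap around
edges = np.abs(np.diff(profile.astype(int)))

# Every text column contributes one rising and one falling edge
n_edges = int(np.count_nonzero(edges))
n_cols = n_edges // 2

print(n_edges, n_cols)   # prints: 6 3
```

Note that a text block touching the left or right image border would produce only one edge, so this halving rule assumes some margin around the text.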

Here is the full code, including some visualization:

import cv2
import matplotlib.pyplot as plt     # Only for visualization output
import numpy as np
from scipy import signal
from skimage import io              # Only for web grabbing images

# Read image from web (attention: RGB order here,scikit-image)
image = io.imread('https://i.stack.imgur.com/jbAeZ.png')

# Convert image to grayscale
gray = cv2.cvtColor(image,cv2.COLOR_RGB2GRAY)

# Inverse binary threshold by Otsu's
thr = cv2.threshold(gray,0,255,cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]

# Morphological closing with large vertical line kernel
thr_mod = cv2.morphologyEx(thr,cv2.MORPH_CLOSE,np.ones((image.shape[0],1)))

# Count white pixels along y-axis
y_count = np.sum(thr_mod / 255,0)

# Calculate absolute difference between neighbouring x-axis values
y_count_diff = np.abs(np.diff(y_count))

# Find peaks in that "signal"
peaks = signal.find_peaks(y_count_diff,distance=50)[0]

# Number of columns is half the number of found peaks
n_cols = peaks.shape[0] // 2        # np.int was removed in newer NumPy releases

# Text output
print('Number of columns: ' + str(n_cols))

# Some visualization output
plt.figure(0)
plt.subplot(221)
plt.imshow(image)
plt.title('Original image')

plt.subplot(222)
plt.imshow(thr_mod,cmap='gray')
plt.title('Thresholded,morphologically closed image')

plt.subplot(223)
plt.plot(y_count)
plt.plot(peaks,y_count[peaks],'r.')
plt.title('Summed white pixels along y-axis')

plt.subplot(224)
plt.plot(y_count_diff)
plt.plot(peaks,y_count_diff[peaks],'r.')
plt.title('Absolute difference in summed white pixels')

plt.tight_layout()
plt.show()

Text output:

Number of columns: 3

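Visualization output:

Limitations: if the image is skewed, for example, you may get poor results. If many (large) images cut across the text columns, the results may also be bad. In general, you will need to adapt details of the given implementation to your actual needs (no further examples are given here).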

----------------------------------------
System information
----------------------------------------
Platform:    Windows-10-10.0.16299-SP0
Python:      3.8.5
Matplotlib:  3.3.1
NumPy:       1.19.1
OpenCV:      4.4.0
SciPy:       1.5.2
----------------------------------------
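As one possible adaptation of the projection idea above (a variant sketch, not the answerer's method), text columns can be counted as runs of "occupied" pixel columns instead of peaks in the difference signal. The function name, `min_fill`, and `min_width` are hypothetical and their default values are assumptions that would need tuning on real scans. Padding the profile lets text that touches the image border still count as a full run:

```python
import numpy as np

def count_columns_by_runs(binary, min_fill=0.05, min_width=5):
    """Count text columns as runs of 'occupied' pixel columns.

    binary    : 2-D array, non-zero where there is ink
    min_fill  : fraction of rows that must contain ink for a pixel
                column to count as occupied (assumed value)
    min_width : runs narrower than this many pixels are ignored,
                e.g. thin separator rules (assumed value)
    """
    ink = np.asarray(binary) > 0
    fill = ink.mean(axis=0)                 # fraction of ink per pixel column
    occupied = fill >= min_fill
    # Pad with False so runs touching the border still get both edges
    padded = np.concatenate(([False], occupied, [False]))
    change = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(change == 1)    # index of first pixel of each run
    stops = np.flatnonzero(change == -1)    # index one past the last pixel
    widths = stops - starts
    return int(np.count_nonzero(widths >= min_width))

# Made-up test page: three text blocks touching the borders, one thin
# vertical rule between the first two, and a single noise pixel.
page = np.zeros((40, 100), dtype=np.uint8)
for start, stop in [(0, 28), (36, 64), (72, 100)]:
    page[:, start:stop] = 255
page[:, 31:33] = 255        # 2-pixel-wide separator rule, ignored by min_width
page[5, 68] = 255           # isolated speck, ignored by min_fill

n_cols = count_columns_by_runs(page)
print("Number of columns:", n_cols)
```

This variant does not address skew; a skewed page would still smear the projection and would need to be deskewed first.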

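You could also prepare the image differently before searching for the columns. For example, you could first connect the text horizontally (with some morphological operation). This gives you contours of a certain height (the headline text merges vertically into taller contours, while the text in the columns merges into one contour per line). Then search for all contours and draw a bounding rectangle on those that are taller than a value you set (it can be calculated or set manually). After that, perform the morphological operation again with a bigger kernel (horizontal and vertical) so that all the remaining text is connected closely together.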

Here is the example code:

import cv2
import numpy as np

img = cv2.imread("columns.png")  # read image
gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)   # grayscale transform
thresh = cv2.threshold(gray,0,255,cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU)[1]  # Otsu threshold
kernel = np.ones((5,10),dtype=np.uint8)  # kernel for first closing procedure (connect blobs in x direction)
closing = cv2.morphologyEx(thresh,cv2.MORPH_CLOSE,kernel)  # closing
cv2.imwrite("closing1.png",closing)
contours = cv2.findContours(closing,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_NONE)[0]  # search for contours

heights = []  # all of contours heights

for cnt in contours:
    x,y,w,h = cv2.boundingRect(cnt)  # bounding rectangles height,width and coordinates
    heights.append(h)  # append height of one contours

boundary = np.mean(heights,axis=0)  # mean of heights will serve as boundary, but
# this will probably not be the case on other samples - you would need to write
# a function to determine this boundary or set it manually

# iterate through contours
for cnt in contours:
    x,y,w,h = cv2.boundingRect(cnt)  # bounding rectangles height,width and coordinates
    if h > boundary:  # condition - contour must be higher than height boundary
        cv2.rectangle(closing,(x,y),(x+w,y+h),(0,0,0),-1)  # draw filled rectangle on the closing image

cv2.imwrite("closing1-filled.png",closing)

kernel = np.ones((25,25),dtype=np.uint8)  # kernel for second closing (connect blobs in x and y direction)
closing = cv2.morphologyEx(closing,cv2.MORPH_CLOSE,kernel)  # closing again

cv2.imwrite("closing2.png",closing)


contours = cv2.findContours(closing,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_NONE)[0]   # search for contours again


# iterate through contours
print("Number of columns: ",len(contours))  # this is the number of columns
for cnt in contours:
    x,y,w,h = cv2.boundingRect(cnt)  # these are the height,width and coordinates of the columns
    cv2.rectangle(img,(x,y),(x+w,y+h),(0,255,0),3)  # draw bounding rectangle on original image (the green color is an arbitrary choice)

cv2.imwrite("result.png",img)


cv2.imshow("img",img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Result:

Number of columns:  3

Step 1:

Step 2:

Step 3:
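The code comments above note that the mean height will probably not work as a boundary on other samples and that a function would be needed to determine it. One possible sketch of such a function is shown below. The helper name `height_boundary`, the use of the median, and the factor of 1.5 are all assumptions, not part of the original answer. The median is less affected by a few very tall headline contours than the mean is:

```python
import numpy as np

def height_boundary(heights, factor=1.5):
    """Hypothetical helper: boundary = factor * median contour height."""
    heights = np.asarray(heights, dtype=float)
    if heights.size == 0:
        raise ValueError("no contour heights given")
    return factor * float(np.median(heights))

# Made-up heights: many text lines of about 12 px and two tall headline blobs
heights = [12, 11, 13, 12, 12, 14, 11, 60, 75]
boundary = height_boundary(heights)
tall = [h for h in heights if h > boundary]

print(boundary, tall)   # prints: 18.0 [60, 75]
```

In the example code, this value could replace the `np.mean(heights,axis=0)` boundary, with `factor` adjusted per document type.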
