A Pythonic way to reduce the number of subclasses


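Background: I am working on an NLP problem. I need to extract different kinds of features depending on the type of text document. My current setup has a FeatureExtractor base class that is subclassed once per document type. Each subclass computes its own set of features and returns a pandas DataFrame as output.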

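All of these subclasses are in turn called by a wrapper class named FeatureExtractionRunner, which invokes every subclass, computes the features for the document, and returns the output for all document types.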

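Problem: This pattern of computing features leads to a lot of subclasses. Currently I have about 14 subclasses, because I have 14 types of docs, and the number may grow further. That is too many classes to maintain. Is there an alternative way of doing this, with fewer subclasses?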

Here is some representative sample code for what I described:

from abc import ABCMeta, abstractmethod

class FeatureExtractor(metaclass=ABCMeta):
    # Base feature extractor class
    def __init__(self, document):
        self.document = document

    @abstractmethod
    def doc_to_features(self):
        # Subclasses must override this and return their features
        raise NotImplementedError
    
    
class ExtractorTypeA(FeatureExtractor):
    #do some feature calculations.....
    
    def _calculate_shape_features(self):
        return None
    
    def _calculate_size_features(self):
        return None
    
    def doc_to_features(self):
        # Calls all the fancy feature calculation methods; they read
        # self.document themselves, so no argument is passed
        f1 = self._calculate_shape_features()
        f2 = self._calculate_size_features()
        # Do some calculations on the document and return a pandas DataFrame by merging f1, f2, etc.
        data = "dataframe-1"  # placeholder for the merged DataFrame
        return data
    
    
class ExtractorTypeB(FeatureExtractor):
    #do some feature calculations.....
    
    def _calculate_some_fancy_features(self):
        return None
    
    def _calculate_some_more_fancy_features(self):
        return None
    
    def doc_to_features(self):
        # Calls all the fancy feature calculation methods; they read
        # self.document themselves, so no argument is passed
        f1 = self._calculate_some_fancy_features()
        f2 = self._calculate_some_more_fancy_features()
        # Do some calculations on the document and return a pandas DataFrame (merge f1, f2, etc.)
        data = "dataframe-2"  # placeholder for the merged DataFrame
        return data
    
class ExtractorTypeC(FeatureExtractor):
    #do some feature calculations.....
    
    def doc_to_features(self):
        #do some calculations on the document and return a pandas dataframe
        data = "dataframe-3"
        return data

class FeatureExtractionRunner:
    #a class to call all types of feature extractors 
    def __init__(self,document,*args,**kwargs):
        self.document = document
        self.type_a = ExtractorTypeA(self.document)
        self.type_b = ExtractorTypeB(self.document)
        self.type_c = ExtractorTypeC(self.document)
        #more of these extractors would be there
        
    def call_all_type_of_extractors(self):
        type_a_features = self.type_a.doc_to_features()
        type_b_features = self.type_b.doc_to_features()
        type_c_features = self.type_c.doc_to_features()
        #more such extractors would be there....
        
        return [type_a_features,type_b_features,type_c_features]
        
        
all_type_of_features = FeatureExtractionRunner("some document").call_all_type_of_extractors()

Answer

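To answer the question first: you could avoid subclassing entirely, which also saves you from writing an __init__ method each time. Or you could get rid of the classes altogether and turn them into a bunch of functions. You could even merge all the classes into a single one. Note that none of these approaches makes the code any simpler or more maintainable; in effect they only change its shape to some extent.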
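As a rough illustration of the "bunch of functions" option, here is a minimal sketch. The EXTRACTORS registry, the register decorator, and the two placeholder extractors are hypothetical names that do not appear in the question, and the functions return plain dicts instead of pandas DataFrames to keep the example dependency-free.

```python
from typing import Callable, Dict, List

# Hypothetical registry: document-type name -> extractor function.
EXTRACTORS: Dict[str, Callable[[str], dict]] = {}


def register(doc_type: str):
    """Decorator that records an extractor function under a document type."""
    def decorator(func: Callable[[str], dict]) -> Callable[[str], dict]:
        EXTRACTORS[doc_type] = func
        return func
    return decorator


@register("type_a")
def extract_type_a(document: str) -> dict:
    # Placeholder feature; real code would compute shape/size features here.
    return {"n_chars": len(document)}


@register("type_b")
def extract_type_b(document: str) -> dict:
    # Placeholder feature; real code would compute the "fancy" features here.
    return {"n_words": len(document.split())}


def run_all_extractors(document: str) -> List[dict]:
    """Call every registered extractor, in registration order."""
    return [extract(document) for extract in EXTRACTORS.values()]


all_features = run_all_extractors("some document")
```

Adding a 15th document type then means adding one decorated function, and the runner does not need to change. The amount of feature logic stays the same, though, which is the point made above: the code changes shape rather than shrinking.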

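IMHO, this situation is a perfect example of inherent problem complexity. By that I mean that the domain (NLP) and the specific use case (document feature extraction) are complex in and of themselves.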

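For example, featureX and featureY may be entirely different things that cannot be computed together, so each ends up with a method of its own. Likewise, the process of merging those features into a data frame may differ from the process of merging the fancy features. In my view, having many functions/classes in this situation is perfectly reasonable, and keeping them separate is both logical and maintainable.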

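That said, a real reduction in code may be possible if you can combine some of the feature calculation methods into more generic functions. It is hard to tell from here whether that is possible.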
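If some feature calculations do turn out to be shared across document types, one possible shape for that refactoring is a single extractor class configured with a list of feature functions. This is only a sketch under that assumption: ConfigurableExtractor, size_features, and shape_features are made-up names, and the features themselves are stand-ins, not the ones from the question.

```python
from typing import Callable, Dict, Sequence

FeatureFunc = Callable[[str], Dict[str, float]]


def size_features(document: str) -> Dict[str, float]:
    # Hypothetical generic feature: character and word counts.
    return {
        "n_chars": float(len(document)),
        "n_words": float(len(document.split())),
    }


def shape_features(document: str) -> Dict[str, float]:
    # Hypothetical generic feature: share of uppercase characters.
    n_upper = sum(ch.isupper() for ch in document)
    return {"upper_ratio": n_upper / max(len(document), 1)}


class ConfigurableExtractor:
    """One class for all document types; behavior comes from the function list."""

    def __init__(self, feature_funcs: Sequence[FeatureFunc]):
        self.feature_funcs = list(feature_funcs)

    def doc_to_features(self, document: str) -> Dict[str, float]:
        features: Dict[str, float] = {}
        for func in self.feature_funcs:
            # Merge each function's output into one flat feature dict.
            features.update(func(document))
        return features


# Two "document types" expressed as configuration instead of subclasses.
extractor_a = ConfigurableExtractor([shape_features, size_features])
extractor_c = ConfigurableExtractor([size_features])
features_a = extractor_a.doc_to_features("Some Document")
```

This only pays off when the feature functions really are reusable. If every document type needs its own unrelated calculations, the list of functions just mirrors the 14 subclasses.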
