I'm stumped by a problem with Python/scikit-learn pipelines. I get an error saying that the shape of the data passing through the pipeline does not match the expected shape.
The specific error, a sample of what I'm sending into TFIDF, and the relevant code follow below, in that order.
In my debugging trace, the part shown in bold below, you can see that the TFIDF step returns a shape of (1, 1) rather than (529, 1). If I run all of these steps (including TFIDF) outside a pipeline, TFIDF returns (529, 1). I'd like to use a pipeline so that I can run a grid search.
Thanks for your help. Let me know if you need any clarification.
blocks[0,:] has incompatible row dimensions. Got blocks[0,6].shape[0] == 4, expected 794.
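For context, here is a hedged sketch of how a message of this kind can arise. When any branch returns a sparse matrix, `FeatureUnion` stacks the branch outputs side by side with `scipy.sparse.hstack`, which requires every block to have the same number of rows. The shapes below are made up for illustration and are not taken from the question's data; the exact wording of the message may differ between SciPy versions.

```python
import numpy as np
from scipy import sparse

# Two hypothetical branch outputs with different row counts.
wide_block = sparse.csr_matrix(np.ones((5, 3)))   # 5 rows
short_block = sparse.csr_matrix(np.ones((1, 1)))  # 1 row

try:
    sparse.hstack([wide_block, short_block])
    stacked_ok = True
except ValueError as exc:
    # Horizontal stacking refuses blocks whose row counts disagree.
    stacked_ok = False
    print("hstack failed:", exc)
```

So an error like the one above suggests that one branch of the union produced fewer rows than the others.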
X['Subject'].head()
5 FW: Customer PO 345 \\ HAC 73054 and 7345
8 Insured return request, o# 35693
10 Issue with a new Feature - QAR
13 FW: ABC / TSS PO catchup
15 WTM request - 1deaSe sales orders for CDE PO TSSe9-1r9
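As a possible explanation for the (1, 1) shape, here is a minimal sketch, assuming the text step receives a one-column DataFrame (which is what `X[self.keys]` produces when `keys` is `['Subject']`). Iterating over a DataFrame yields its column labels rather than its rows, so `TfidfVectorizer` would see a single document, the string "Subject". The rows are a few of the sample subjects shown above; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

subjects = [
    "Insured return request, o# 35693",
    "Issue with a new Feature - QAR",
    "FW: ABC / TSS PO catchup",
    "WTM request - 1deaSe sales orders for CDE PO TSSe9-1r9",
]
frame = pd.DataFrame({"Subject": subjects})

# One-column DataFrame: iteration yields the column label "Subject",
# so the vectorizer sees exactly one document.
as_frame = TfidfVectorizer().fit_transform(frame[["Subject"]])

# 1-D Series: iteration yields one string per row.
as_series = TfidfVectorizer().fit_transform(frame["Subject"])

print(as_frame.shape)       # (1, 1)
print(as_series.shape[0])   # 4
```

If this is what happens inside the pipeline, selecting the text column as a Series (for example `X['Subject']` instead of `X[['Subject']]`) would give the vectorizer one document per row.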
from sklearn.base import BaseEstimator, TransformerMixin

# FeatureSelector selects a list of columns from a data
# frame and returns them to a pipeline step for processing.
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys, description=""):
        self.keys = keys
        self.description = description

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        print("Keys out", self.keys)
        return X[self.keys]

# Custom transformer to help with debugging.
class Debugger(BaseEstimator, TransformerMixin):
    def __init__(self, stepName=""):
        self.stepName = stepName

    def transform(self, data):
        print("Step Name", self.stepName)
        # Only DataFrames have .columns; encoder/vectorizer output does not.
        if hasattr(data, "columns"):
            print("Contents of data", data.columns)
        print("Shape of Pre-processed Data:", data.shape)
        #print(pd.DataFrame(data).head())
        return data

    def fit(self, data, y=None, **fit_params):
        # No need to fit anything, because this is not an actual transformation.
        return self
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

# df is the source DataFrame, loaded elsewhere.
y = df['Assigned To']
X = df

# Construct Pipelines
ohe_pipe = Pipeline([
    ("feature_selector", FeatureSelector(df.select_dtypes(exclude=['int64', 'object']).columns.to_list())),
    ('feature selector debugger', Debugger()),
    ("ohe", OneHotEncoder()),
    ('ohe debugger', Debugger())],
    verbose=True)

text_pipe = Pipeline([
    ("feature_selector", FeatureSelector(df.select_dtypes(include='object').columns.to_list())),
    ('feature selector debugger', Debugger()),
    ("tfidf_vectorizer", TfidfVectorizer()),
    ('tfidf debugger', Debugger())],
    verbose=True)

knn_pipe = Pipeline([
    ("feature_union", FeatureUnion([
        ("ohe_pipe", ohe_pipe),
        ("text_pipe", text_pipe)
    ])),
    ("classifier", KNeighborsClassifier())
])

knn_grid = GridSearchCV(
    estimator=knn_pipe,
    param_grid={
        'classifier__n_neighbors': list(range(5, 20, 2)),  # odd values 5..19
        'classifier__weights': ['distance'],
        'classifier__leaf_size': list(range(20, 40)),
        'feature_union__text_pipe__tfidf_vectorizer__min_df': [.01, .02, .03, .04, .05],
        'feature_union__text_pipe__tfidf_vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'feature_union__text_pipe__tfidf_vectorizer__stop_words': ['english'],
        'feature_union__text_pipe__tfidf_vectorizer__lowercase': [True],
        'feature_union__ohe_pipe__ohe__sparse_output': [False]
    },
    scoring={
        'accuracy': make_scorer(accuracy_score)
        # ,
        # 'f1': make_scorer(f1_score),
        # 'precision': make_scorer(precision_score),
        # 'recall': make_scorer(recall_score),
        # 'roc_auc': make_scorer(roc_auc_score)
    },
    cv=3,
    refit='accuracy',
    error_score='raise')

knn_grid.fit(X, y)
Keys out ['Issue Classification', 'Application', 'Case Submitter']
[Pipeline] .. (step 1 of 4) Processing feature_selector, total= 0.0s
Step Name
Shape of Pre-processed Data: (529, 3)
[Pipeline] (step 2 of 4) Processing feature selector debugger, total= 0.0s
[Pipeline] ............... (step 3 of 4) Processing ohe, total= 0.0s
Step Name
Shape of Pre-processed Data: (529, 66)
[Pipeline] ...... (step 4 of 4) Processing ohe debugger, total= 0.0s
Keys out ['Subject']
[Pipeline] .. (step 1 of 4) Processing feature_selector, total= 0.0s
Step Name
Shape of Pre-processed Data: (529, 1)
[Pipeline] (step 2 of 4) Processing feature selector debugger, total= 0.0s
[Pipeline] .. (step 3 of 4) Processing tfidf_vectorizer, total= 0.0s
Step Name
**Shape of Pre-processed Data: (1, 1)
[Pipeline] .... (step 4 of 4) Processing tfidf debugger, total= 0.0s**
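One possible way to keep everything inside a pipeline, sketched under assumptions rather than offered as the definitive fix: `ColumnTransformer` hands a transformer a 1-D column when the column is given as a single string instead of a list, which is the input form `TfidfVectorizer` expects. The toy DataFrame, its values, and the variable names below are hypothetical; only the column names 'Subject', 'Application', and 'Assigned To' come from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the real frame.
toy = pd.DataFrame({
    "Subject": ["PO catchup request", "return request", "issue with feature",
                "sales orders request", "PO return issue", "feature request"],
    "Application": ["A", "B", "A", "B", "A", "B"],
    "Assigned To": ["x", "y", "x", "y", "x", "y"],
})
X_toy = toy.drop(columns="Assigned To")
y_toy = toy["Assigned To"]

features = ColumnTransformer([
    # A list of columns gives 2-D input, as OneHotEncoder expects.
    ("ohe", OneHotEncoder(handle_unknown="ignore"), ["Application"]),
    # A single string gives 1-D input, as TfidfVectorizer expects.
    ("tfidf", TfidfVectorizer(), "Subject"),
])

model = Pipeline([
    ("features", features),
    ("classifier", KNeighborsClassifier(n_neighbors=3)),
])
model.fit(X_toy, y_toy)

# The combined feature matrix should have one row per input row.
n_rows = model.named_steps["features"].transform(X_toy).shape[0]
print(n_rows)
```

With this layout the grid-search keys would be prefixed differently, for example `features__tfidf__min_df` instead of `feature_union__text_pipe__tfidf_vectorizer__min_df`.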
Adam
Asked May 17, 2024