pd.merge_asof fails with 'ValueError: left keys must be sorted' on second run

Suggestion : 1

Resolved by cleaning the dataframe before merging:

df_stationManager = df_stationManager.dropna()
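
If only the join keys contain nulls, a gentler variant is to restrict dropna to those columns via its subset argument (a sketch; the two key names come from the question's dataframe):

# Drop only rows where a join key is missing, keeping rows that are
# merely incomplete in other columns
df_stationManager = df_stationManager.dropna(
    subset=['date_time_stamp_open', 'date_time_stamp_close'])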

You will also have to sort the values before each merge:

pd_open = pd.merge_asof(df_stationManager.sort_values('date_time_stamp_open'),
                        df_gpgga.sort_values('date_time_stamp'),
                        left_on='date_time_stamp_open',
                        right_on='date_time_stamp',
                        direction='nearest')

Suggestion : 2

The original question: "Hi, I'm trying to merge two datasets on the closest matching date_times. I have two time stamps for open and closed events. The merge_asof runs fine on the open date, but returns 'ValueError: left keys must be sorted' on the second date_time. I sort by the relevant date_time on both occasions. First dataframe:"


   idtbl_station_manager      date_time_stamp fld_station_number  \
0                   1121  2017-09-19 15:41:24            AM00571
1                   1122  2017-09-19 15:41:24            AM00572
2                   1123  2017-09-19 15:41:24            AM00573

  fld_grid_number fld_status  fld_station_number_int  \
0     VOY-024-001     CLOSED                     571
1     VOY-024-002     CLOSED                     572
2     VOY-024-003     CLOSED                     573

                      fld_activities date_time_stamp_open fld_lat_open  \
0  Drift Net, CTD - Overside, Dredge  2017-04-13 07:23:35
1  Drift Net, CTD - Overside, Dredge  2017-04-13 10:15:07   4649.028 S
2  Drift Net, CTD - Overside, Dredge  2017-04-13 13:15:42   4648.497 S

  fld_lon_open date_time_stamp_close fld_lat_close fld_lon_close
0  03759.143 E   2017-04-13 09:51:18    4647.361 S   03759.142 E
1  03759.143 E   2017-04-13 12:11:00    4647.344 S   03759.143 E
2                2017-04-13 15:09:26    4647.344 S   03759.143 E
Second dataframe:

   idtbl_gpgga      date_time_stamp    fld_utc   fld_lat fld_lat_dir  \
0      1179828  2017-04-04 02:00:04  000005.00  3354.138           S
1      1179829  2017-04-04 02:00:05  000006.00  3354.138           S
2      1179830  2017-04-04 02:00:07  000008.00  3354.138           S

    fld_lon fld_lon_dir  fld_gps_quality fld_nos fld_hdop  fld_alt  \
0  1825.557           E                1      10      0.9     21.6
1  1825.557           E                1      10      0.9     21.6
2  1825.557           E                1      10      0.9     21.6

  fld_unit_alt  fld_alt_geoid fld_unit_alt_geoid fld_dgps_age  fld_dgps_id
0            M           31.9                  M                         0
1            M           31.9                  M                         0
2            M           31.9                  M                         0
# First we grab the open time lats and lons
# Sort by the date_times used for the merge
df_stationManager.sort_values("date_time_stamp_open", inplace=True)
df_gpgga.sort_values("date_time_stamp", inplace=True)

# merge_asof used to get closest match on datetime
pd_open = pd.merge_asof(df_stationManager, df_gpgga,
                        left_on='date_time_stamp_open',
                        right_on='date_time_stamp',
                        direction="nearest")
pd_open["fld_lat_open"] = pd_open["fld_lat"] + ' ' + pd_open["fld_lat_dir"]
pd_open["fld_lon_open"] = pd_open["fld_lon"] + ' ' + pd_open["fld_lon_dir"]

# Now we grab the close time lats and lons
# Sort by the date_times used for the merge
df_stationManager.sort_values("date_time_stamp_close", inplace=True)
df_gpgga.sort_values("date_time_stamp", inplace=True)

# merge_asof used to get closest match on datetime
pd_close = pd.merge_asof(df_stationManager, df_gpgga,
                         left_on='date_time_stamp_close',
                         right_on='date_time_stamp',
                         direction="nearest")
pd_close["fld_lat_close"] = pd_close["fld_lat"] + ' ' + pd_close["fld_lat_dir"]
pd_close["fld_lon_close"] = pd_close["fld_lon"] + ' ' + pd_close["fld_lon_dir"]
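Per the accepted fix, the close key needed both cleaning and sorting: merge_asof requires join keys that are non-null and monotonically increasing on both sides. A hedged sketch of the second merge with both conditions enforced (assuming rows with a missing close key can be discarded):

# Remove rows with a missing close key, then sort; merge_asof requires
# non-null, monotonically increasing join keys on both sides
df_close_ready = (df_stationManager
                  .dropna(subset=['date_time_stamp_close'])
                  .sort_values('date_time_stamp_close'))
df_gpgga = df_gpgga.sort_values('date_time_stamp')

pd_close = pd.merge_asof(df_close_ready, df_gpgga,
                         left_on='date_time_stamp_close',
                         right_on='date_time_stamp',
                         direction='nearest')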

Suggestion : 3

pandas.merge_asof performs an "asof" merge. This is similar to a left join, except that we match on the nearest key rather than on equal keys. Both DataFrames must be sorted by the key. In the failing call, the left DataFrame's key values are not sorted; they must be sorted before merge_asof will accept them.
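
For instance, a minimal sketch of nearest-key matching on toy frames (the data below is illustrative, not from the question):

import pandas as pd

trades = pd.DataFrame({'time': [1, 5, 10], 'trade': ['x', 'y', 'z']})
quotes = pd.DataFrame({'time': [2, 6, 9], 'quote': [1.0, 2.0, 3.0]})

# Both frames are sorted on 'time'; each trade row is paired with the
# quote whose time is nearest, not exactly equal
print(pd.merge_asof(trades, quotes, on='time', direction='nearest'))
#    time trade  quote
# 0     1     x    1.0
# 1     5     y    2.0
# 2    10     z    3.0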

Raise code

        # we require sortedness and non-null values in the join keys
        if not Index(left_values).is_monotonic:
            side = "left"
            if isna(left_values).any():
                raise ValueError(f"Merge keys contain null values on {side} side")
            else:
                raise ValueError(f"{side} keys must be sorted")

        if not Index(right_values).is_monotonic:
            side = "right"
            if isna(right_values).any():
                raise ValueError(f"Merge keys contain null values on {side} side")
            else:
                raise ValueError(f"{side} keys must be sorted")
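
Those are the same two checks you can run yourself before merging; a quick diagnostic sketch using the question's close key (is_monotonic_increasing is the modern spelling of the is_monotonic test above):

key = df_stationManager['date_time_stamp_close']
print("has nulls:", key.isna().any())             # trips the null-values error
print("is sorted:", key.is_monotonic_increasing)  # trips the must-be-sorted error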

Error code:

import pandas as pd

left = pd.DataFrame({
    'a': [10.0, 20.0, 3.0, 12.0, 15.0],  # values are not monotonic
    'left_val': ['a', 'b', 'c', 'd', 'e']
})
right = pd.DataFrame({
    'a': [1.0, 5.0, 10.0, 12.0],
    'right_val': [1, 6, 11, 15]
})

pd.merge_asof(left, right)

Fix code:

import pandas as pd

left = pd.DataFrame({
    'a': [10.0, 20.0, 3.0, 12.0, 15.0],
    'left_val': ['a', 'b', 'c', 'd', 'e']
})
right = pd.DataFrame({
    'a': [1.0, 5.0, 10.0, 12.0],
    'right_val': [1, 6, 11, 15]
})

pd.merge_asof(left.sort_values('a'), right)
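
With merge_asof's default direction='backward', each sorted left key matches the nearest right key less than or equal to it, so the fixed call should yield something like:

print(pd.merge_asof(left.sort_values('a'), right))
#       a left_val  right_val
# 0   3.0        c          1
# 1  10.0        a         11
# 2  12.0        d         15
# 3  15.0        e         15
# 4  20.0        b         15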

Suggestion : 4

This suggestion answers a related question about using CountVectorizer inside a scikit-learn ColumnTransformer pipeline.

You can utilize make_column_transformer and do something like the following. remainder covers the remaining features, to which you can apply other transformations. By default, remainder is set to 'drop', which means that the remaining features without any transformations will be dropped:

from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

preprocess = make_column_transformer((CountVectorizer(), 'text_feat'),
                                     remainder='passthrough')
make_pipeline(preprocess).fit_transform(X)

A few tips on your code: while transforming features, you do not need to (read: shouldn't) pass y (i.e. the target). The issue in your code is that you are passing a list of text features instead of the column name. If you change your code slightly, you should get the same results:

# pass the column name as a string, not a list
preprocessor = ColumnTransformer(
    transformers=[('text', text_transformer, 'text_feat')])

# or wrap CountVectorizer directly in the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[('text', CountVectorizer(), 'text_feat')])

# second pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

X_test = pipeline.fit_transform(X)
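
For completeness, a self-contained sketch of the corrected pipeline; the toy DataFrame and its 'text_feat' column are illustrative stand-ins for the question's data:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy data: a single text column
X = pd.DataFrame({'text_feat': ['red fish', 'blue fish', 'one fish two fish']})

# Passing the column name as a string hands CountVectorizer the 1-D
# sequence of documents it expects
preprocessor = ColumnTransformer(
    transformers=[('text', CountVectorizer(), 'text_feat')])
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

X_test = pipeline.fit_transform(X)
print(X_test.shape)  # (3, 5): 3 documents, 5 vocabulary terms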