dask.dataframe.Series.str.extract

dask.dataframe.Series.str.extract#

dataframe.Series.str.extract(pat: str, flags: int = 0, expand: bool = True) → DataFrame | Series | Index#

Extract capture groups in the regex pat as columns in a DataFrame.

This docstring was copied from pandas.core.strings.accessor.StringMethods.extract.

Some inconsistencies with the Dask version may exist.

For each subject string in the Series, extract groups from the first match of regular expression pat.

Parameters:

patstr: Regular expression pattern with capturing groups.
flagsint, default 0 (no flags): Flags from the re module, e.g. re.IGNORECASE, that modify regular expression matching for things like case, spaces, etc. For more details, see re.
expandbool, default True: If True, return DataFrame with one column per capture group. If False, return a Series/Index if there is one capture group or DataFrame if there are multiple capture groups.

Returns:

DataFrame or Series or Index: A DataFrame with one row for each subject string, and one column for each group. Any capture group names in regular expression pat will be used for column names; otherwise capture group numbers will be used. The dtype of each result column is always object, even when no match is found. If expand=False and pat has only one capture group, then return a Series (if subject is a Series) or Index (if subject is an Index).

See also

Examples

A pattern with two groups will return a DataFrame with two columns. Non-matches will be NaN.

>>> s = pd.Series(["a1", "b2", "c3"])
>>> s.str.extract(r"([ab])(\d)")
    0    1
0    a    1
1    b    2
2  NaN  NaN

A pattern may contain optional groups.

>>> s.str.extract(r"([ab])?(\d)")
1
  a  1
  b  2
NaN  3

Named groups will become column names in the result.

>>> s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")
letter digit
0      a     1
1      b     2
2    NaN   NaN

A pattern with one group will return a DataFrame with one column if expand=True.

>>> s.str.extract(r"[ab](\d)", expand=True)
     0
0    1
1    2
2  NaN

A pattern with one group will return a Series if expand=False.

>>> s.str.extract(r"[ab](\d)", expand=False)
0      1
1      2
2    NaN
dtype: str