Welcome toVigges Developer Community-Open, Learning,Share
Welcome To Ask or Share your Answers For Others


0 votes
in Technique[技术] by (71.8m points)

pyspark - Null value for column when selecting using dot or bracket notation but not when using UDF

I'm trying to clean up some nested data and extract the fields I care about.

My schema for the nested value is:

 |-- maritalstatus: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- text_: string (nullable = true)
 |    |-- text__extensions: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- extension: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

I want to extract the text_ field as its own column.

I have tried: df.select(col("maritalstatus.text_")).show() and df.select(col("maritalstatus")["text_"]).show() but it returns:

| null|
| null|
| null|

When I define a UDF as:

def getMaritalStatus(ms):
    return ms.text_
gms = udf(getMaritalStatus, StringType())

and do df.select(gms(col("maritalstatus")).show() it returns the data that i expect

Interestingly, I have another nested struct field that has a similar structure but with numbers as keys instead of names, and I am able to use the df.select(col("birthdate")["0"]).show() notation

Birthdate's schema:

 |-- birthdate: struct (nullable = true)
 |    |-- 0: date (nullable = true)
 |    |-- 1: integer (nullable = true)

Is there anyway to extract maritalstatus.text_ without using the UDF? I've heard UDF is not as performant as other approaches?

cassandra table structure:

CREATE TABLE keyspace.patient (
    id text PRIMARY KEY,
    active boolean,
    active_extensions list<text>,
    address list<frozen<address>>,
    birthdate frozen<tuple<date, int>>,
    birthdate_extensions list<text>,
    communication list<frozen<patient_communication>>,
    contact list<frozen<patient_contact>>,
    contained list<frozen<tuple<text, text, text>>>,
    deceasedboolean boolean,
    deceasedboolean_extensions list<text>,
    deceaseddatetime frozen<tuple<timestamp, text, int>>,
    deceaseddatetime_extensions list<text>,
    extension list<text>,
    gender text,
    gender_extensions list<text>,
    generalpractitioner list<text>,
    identifier list<frozen<identifier>>,
    implicitrules text,
    implicitrules_extensions list<text>,
    language text,
    language_extensions list<text>,
    link list<frozen<patient_link>>,
    managingorganization text,
    maritalstatus frozen<codeableconcept>,
    meta frozen<meta>,
    modifierextension list<text>,
    multiplebirthboolean boolean,
    multiplebirthboolean_extensions list<text>,
    multiplebirthinteger int,
    multiplebirthinteger_extensions list<text>,
    name list<frozen<humanname>>,
    photo list<frozen<attachment>>,
    telecom list<frozen<contactpoint>>,
    text_ frozen<narrative>
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

and the codeableconcept is

CREATE TYPE keyspace.codeableconcept (
    extension list<text>,
    text_ text,
    text__extensions list<text>,
    id text,
    coding list<frozen<coding>>

Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

#using pyspark sql you can query nested data using "." and for array use explode function.




Welcome to Vigges Developer Community for programmer and developer-Open, Learning and Share