Ingesting Data From Files With Apache Spark, Part 1

In this post, a data expert teaches us how to take in large data sets using Apache Spark.

By Jean-Georges Perrin · Apr. 08, 19 · Tutorial

Abstract of Complex Ingestion from CSV, from Spark in Action, 2nd Ed., by Jean-Georges Perrin.

CSV[1] is probably the most popular data-exchange format around. Due to its age and wide use, this format has many variations on its core structure: separators aren’t always commas, some records may span over multiple lines, there are various ways of escaping the separator, and many more creative considerations. When your customer tells you “I’ll send you a CSV file,” you can certainly nod, smile, and slowly start to freak out on the inside.

Fortunately for you, Apache Spark offers a variety of options for ingesting those CSV files. Ingesting CSV is easy and schema inference is a powerful feature.
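To see how little configuration the simple case needs, here is a minimal sketch (not from the book) that ingests a plain, comma-separated file with default options; the file name data/simple.csv is hypothetical:

 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;

 public class SimpleCsvToDataframeApp {

   public static void main(String[] args) {
     // Local session, as in the book's examples
     SparkSession spark = SparkSession.builder()
         .appName("Simple CSV to Dataframe")
         .master("local")
         .getOrCreate();

     // Defaults apply: comma separator, double-quote character, no multiline records
     Dataset<Row> df = spark.read().format("csv")
         .option("header", "true")      // the first line holds the column names
         .option("inferSchema", true)   // let Spark guess the column types
         .load("data/simple.csv");

     df.show(5);
     df.printSchema();
   }
 }

With a well-behaved file, those two options are often all you need; the rest of this article is about what to do when the file isn't so well behaved.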

Let’s have a look at more advanced examples, with more options, that illustrate the complexity of CSV files in the outside world. You’ll first look at the file you’ll ingest and understand its specifications. You’ll then have a look at the desired result, and finally build the mini-application that achieves it. This pattern repeats for each format.

Figure 1 illustrates the process you’re going to implement.


Figure 1. Spark ingests a complex CSV-like file with non-default options. After ingesting the file, the data is in a dataframe, from which you can display records and the schema; in this case, the schema is inferred by Spark.

Input CSV File to Be Processed

Listing 1 shows an excerpt of a CSV file with two records and a header row. Note that CSV has become a generic term: nowadays, the C means more “character” than comma. You’ll find files where values are separated by semicolons, tabs, pipes (|), and more. For the purist, the acronym may matter, but for Spark, they all fall into the same category.

A few observations:

  • The file isn’t comma-separated but semicolon-separated.
  • I manually added the end-of-paragraph symbol (¶) to show the end of each line; it isn’t in the file.
  • If you look at the record with id 4, there’s a semicolon in the title, which would break the parsing, so this field is surrounded by stars. Keep in mind that this example is a bit of a stretch, designed to illustrate some of Spark’s features.
  • If you look at the record with id 6, you’ll see that the title is split over two lines: there’s a carriage return after Language? and before An.

Listing 1. Complex CSV file (abstract of books.csv):

id;authorId;title;releaseDate;link ¶
 4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n ¶
 6;2;*Development Tools in 2006: any Room for a 4GL-style Language? ¶
 An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1 ¶

Desired Output

Listing 2 shows a possible output. I added the paragraph mark to illustrate the new line, as long records aren’t easy to read.

Listing 2. Desired output after ingestion of the complex CSV file:

Excerpt of the dataframe content:
 +---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+¶
 | id|authorId|                                                                                    title|releaseDate|                  link|¶
 +---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+¶
 …
 |  4|       1|   Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)|    10/4/16|http://amzn.to/2kYhL5n|¶
 …
 |  6|       2|Development Tools in 2006: any Room for a 4GL-style Language? ¶
 An independent study by...|   12/28/16|http://amzn.to/2vBxOe1|¶ //  ❷
 …
 +---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+¶
 only showing top 7 rows

 Dataframe's schema:
 root
  |-- id: integer (nullable = true)                              ❸
  |-- authorId: integer (nullable = true)                        ❸
  |-- title: string (nullable = true)
  |-- releaseDate: string (nullable = true)                      ❶
  |-- link: string (nullable = true)

1. See that our release date is seen as a string, not a date!

2. The line break that was in your CSV file is still here.

3. The datatype is an integer: in CSV files, everything is a string, but Spark makes an educated guess!

Code

To achieve the result in listing 2, you’ll have to code something similar to listing 3: first get a session, then configure and run the parsing operation in one call using method chaining. Finally, show some records and display the schema of the dataframe.

Listing 3. ComplexCsvToDataframeApp.java:

package net.jgp.books.spark.ch07.lab_100_csv_ingestion;

 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;

 public class ComplexCsvToDataframeApp {

   public static void main(String[] args) {
     ComplexCsvToDataframeApp app = new ComplexCsvToDataframeApp();
     app.start();
   }

   private void start() {
     SparkSession spark = SparkSession.builder()
         .appName("Complex CSV to Dataframe")
         .master("local")
         .getOrCreate();

     Dataset<Row> df = spark.read().format("csv")  ❶
         .option("header", "true")                 ❷
         .option("multiline", true)                ❸
         .option("sep", ";")                       ❹
         .option("quote", "*")                     ❺
         .option("dateFormat", "M/d/y")            ❻
         .option("inferSchema", true)              ❼
         .load("data/books.csv");

     System.out.println("Excerpt of the dataframe content:");
     df.show(7, 90);
     System.out.println("Dataframe's schema:");
     df.printSchema();
   }
 }

1. The format we want to ingest is CSV.

2. The first line of your CSV file is a header line.

3. Some of our records span multiple lines. Note that the value for this option can be either a string or a Boolean, which makes it easier to load values from a configuration file (see the sketch after these notes).

4. The separator between values is a semicolon (;).

5. The quote character is a star (*).

6. The date format matches the month/day/year format, as commonly used in the United States (see below).

7. Spark infers (guesses) the schema (see below).
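To illustrate note 3, here is a minimal sketch (not from the book) that loads the parser options from a hypothetical csv-ingestion.properties file containing, for example, sep=;, quote=*, and multiline=true. Because java.util.Properties only yields strings, Spark accepting option values as strings is exactly what makes this convenient. The imports go at the top of the class from listing 3; the rest replaces the reader configuration inside start():

 import java.io.FileInputStream;
 import java.io.IOException;
 import java.util.HashMap;
 import java.util.Map;
 import java.util.Properties;

 // Load the parser options from the (hypothetical) properties file
 Properties props = new Properties();
 try (FileInputStream in = new FileInputStream("csv-ingestion.properties")) {
   props.load(in);
 } catch (IOException e) {
   throw new RuntimeException("Could not read the ingestion options", e);
 }

 // Copy them into a Map<String, String>, which Spark's reader accepts in bulk
 Map<String, String> options = new HashMap<>();
 for (String key : props.stringPropertyNames()) {
   options.put(key, props.getProperty(key));
 }

 Dataset<Row> df = spark.read().format("csv")
     .options(options)                // sep, quote, multiline, ... as strings
     .option("header", "true")
     .option("inferSchema", true)
     .load("data/books.csv");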

As you probably guessed, you need to know what your file looks like (separator character, escape character, and so on) before you can configure the parser. Spark won’t infer those; they’re part of the contract that comes with your CSV files (or that, most of the time, you have to guess).

The schema inference feature is a pretty neat one; but, as you can see here, it didn’t infer that the releaseDate column was a date. One way to tell Spark that it’s a date is to specify a schema.
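Here is a minimal sketch of that idea (not from the book, but reusing the file and options from listing 3): an explicit schema in which releaseDate is declared as a date, so Spark parses it with the dateFormat option instead of leaving it as a string. The spark variable is the session built in listing 3.

 import org.apache.spark.sql.types.DataTypes;
 import org.apache.spark.sql.types.StructField;
 import org.apache.spark.sql.types.StructType;

 // Declare the five columns explicitly; releaseDate becomes a real date
 StructType schema = DataTypes.createStructType(new StructField[] {
     DataTypes.createStructField("id", DataTypes.IntegerType, true),
     DataTypes.createStructField("authorId", DataTypes.IntegerType, true),
     DataTypes.createStructField("title", DataTypes.StringType, true),
     DataTypes.createStructField("releaseDate", DataTypes.DateType, true),
     DataTypes.createStructField("link", DataTypes.StringType, true) });

 Dataset<Row> df = spark.read().format("csv")
     .option("header", "true")
     .option("multiline", true)
     .option("sep", ";")
     .option("quote", "*")
     .option("dateFormat", "M/d/y")   // used to parse the DateType column
     .schema(schema)                  // replaces inferSchema
     .load("data/books.csv");

 df.printSchema();                    // releaseDate: date (nullable = true)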

And that’s where we’re going to stop for this section. For more, check out part 2. If you’re interested in some more general information about the book, check it out on liveBook here and see this slide deck.


[1] For more information, look at Wikipedia’s page on CSV at https://en.wikipedia.org/wiki/Comma-separated_values.  In the history section, you’ll learn that CSV has been around for quite some time.

Get the full book, Spark in Action, 2nd edition, published by Manning.


Published at DZone with permission of Jean-Georges Perrin, DZone MVB. See the original article here.
