The Aspose.Words component enables reading, editing and printing Word documents, and converting them to other formats, without using Microsoft Office Automation.
By using Aspose.Words instead of Microsoft Office Automation, we get a faster, more secure, more stable and more scalable solution. In this article, I will explain how to extract all email addresses and other valuable information from a selected Word document.
Please download the sample Visual Studio 2015 project used in this article from AsposeWords Get Emails From Word Example. The zip package also contains a sample Word document, SampleDocWithEmails.docx, which you can use to test the project.
To get all emails from a Microsoft Word document, follow this process: Open Visual Studio and create a new Windows Forms project. To start working with Aspose.Words, the first step is to add a reference to it in your project, as in the image below:
Now, find the Aspose.Words.dll file in the location where you installed Aspose.Words (by default it should be in the /Program Files (x86) folder). After you select the Aspose.Words.dll file, click OK and your project's References list should look like this:
Now you can use the Aspose.Words component in your project. First, create a user interface like the one in the next image:
The idea of the project is pretty simple: the user first clicks the “Select Document” button to find a Word document. Then, a click on the “Get Emails From Document” button reads the document, extracts the email addresses and shows the result in the “Emails found” text box on the right side.
The code for the “Select Document” button shows an open file dialog, so the user can find the document:
    private void btnSelectFile_Click(object sender, EventArgs e)
    {
        OpenFileDialog docDialog = new OpenFileDialog();
        DialogResult result = docDialog.ShowDialog(); // Show the dialog.
        if (result == DialogResult.OK) // Test result.
        {
            tbFileName.Text = docDialog.FileName;
        }
    }
After a file is selected, a click on the “Get Emails From Document” button processes the Word document and extracts the wanted data:
    // Requires: using Aspose.Words; and using System.Linq; (for Distinct)
    private void btnGetEmails_click(object sender, EventArgs e)
    {
        // Check if file is selected and exists, exit procedure if not
        if (!validate())
        {
            return;
        }

        // Create new instance of Word document
        Aspose.Words.Document doc = new Document(tbFileName.Text);

        // Read text from document
        string docText = doc.GetText();

        // Get emails to list
        List<string> emails = getEmailsFromString(docText);

        // Show found emails on form (Distinct removes exact duplicates)
        tbEmails.Text = String.Join(Environment.NewLine, emails.Distinct().ToArray());
    }
Note the validate() call at the top of the handler. We use it first to ensure that a file is selected and that it exists; if not, it tells the user what to correct:
    private bool validate()
    {
        if (tbFileName.Text == "")
        {
            MessageBox.Show("Please select Word document first.", "Warning", MessageBoxButtons.OK);
            return false;
        }
        else if (!File.Exists(tbFileName.Text))
        {
            MessageBox.Show("Selected file does not exist", "Warning", MessageBoxButtons.OK);
            return false;
        }
        return true;
    }
To get emails from the text, I use .NET regular expressions. To keep the main function clean, email extraction is encapsulated in the getEmailsFromString() function:
    private List<string> getEmailsFromString(string text)
    {
        List<string> emails = new List<string>();
        Regex emailPattern = new Regex(@"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*", RegexOptions.IgnoreCase);

        // find emails
        MatchCollection emailsFound = emailPattern.Matches(text);
        foreach (Match email in emailsFound)
        {
            emails.Add(email.Value);
        }
        return emails;
    }
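To try the pattern without a form or a Word document, a small console sketch like the following may help. It uses the same regular expression as above; the class name EmailExtractionDemo, the method GetUniqueEmails and the sample text are assumptions made for illustration only. It also removes duplicates case-insensitively, which differs slightly from the plain Distinct() call in the button handler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class EmailExtractionDemo
{
    // Same pattern as in getEmailsFromString().
    private static readonly Regex EmailPattern = new Regex(
        @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        RegexOptions.IgnoreCase);

    // Returns each address once, treating "A@x.com" and "a@x.com" as the same.
    // The first spelling that appears in the text is the one that is kept.
    public static List<string> GetUniqueEmails(string text)
    {
        return EmailPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample text, not taken from SampleDocWithEmails.docx.
        string sample = "Write to john.doe@example.com or JANE@example.org. " +
                        "Backup contact: John.Doe@example.com.";

        foreach (string email in GetUniqueEmails(sample))
        {
            Console.WriteLine(email);
        }
    }
}
```

Whether addresses that differ only in letter case should count as duplicates is a design choice; the original handler keeps both spellings.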
And that’s all! The project is now ready to test. Start the project and find a Word document on disk, or use the sample SampleDocWithEmails.docx file. A click on the “Get Emails From Document” button will find all email addresses in the document and display them in the text box on the right side.
Conclusion
As you can see, with the help of the Aspose.Words component, reading and extracting data from a Word document is an easy and fast process. You can use the same approach to get other kinds of valuable information from a document, such as URLs, phone numbers and any other formatted data. As an exercise, you can try to change the project code to process all documents in a selected folder.
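One possible starting point for that exercise is sketched below. It is not part of the sample project: the class FolderScanSketch, the method GetMatchesFromFolder and the readText parameter are hypothetical names. The text reader is passed in as a delegate so the sketch compiles without Aspose.Words; in the real project it could be something like path => new Aspose.Words.Document(path).GetText().

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class FolderScanSketch
{
    // Collects unique regex matches from every .docx file directly inside `folder`
    // (subfolders are not searched).
    // `readText` turns a file path into plain text; it is an assumption of this sketch.
    public static List<string> GetMatchesFromFolder(
        string folder, Regex pattern, Func<string, string> readText)
    {
        // Case-insensitive set, so the same value is stored only once.
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        foreach (string path in Directory.EnumerateFiles(folder, "*.docx"))
        {
            string text = readText(path);
            foreach (Match match in pattern.Matches(text))
            {
                found.Add(match.Value);
            }
        }

        // Sort for a stable, readable result.
        return found.OrderBy(v => v, StringComparer.OrdinalIgnoreCase).ToList();
    }
}
```

Because the pattern is a parameter, the same method could be reused for URLs or phone numbers by passing a different Regex; suitable patterns for those are left to the reader.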
Aspose.Words.Document is the main class, which represents a Word document. It contains over a hundred properties and methods that can be used to read and manipulate a Word document.
Happy Coding!