
Update on SWF Indexing Issues

After Tuesday’s announcement that SWF files are now getting fully indexed, I thought I’d run a little experiment and put up a few test SWF files.

It’s difficult to accurately deduce exactly what is happening, but I thought I’d write down what I’ve tried and what the results are so far.

 

What did I use?

I created a Flash 9 SWF exported from Flash CS3, added some component instances in a variety of ways, added an input text field and set up a function that triggers a PHP script on my server and subsequently sends me an email with the value of the input text field.

Embed methods

I embedded the SWF in three ways: with object/embed tags, with SWFObject, and with the standard publish output from Flash CS3 (i.e. AC_FL_RunContent).

Result: all three SWF files got hits from the Google search bot and triggered an email. No arguments were sent to the script, either through POST or GET.

What is getting indexed?

I set up four Button component instances: one added manually on the Stage, one added programmatically to the DisplayList, one instantiated but never added to the DisplayList, and one added to the DisplayList outside the Stage bounds so that it is not visible to the user.

Result: of these four, only two had a MouseEvent.CLICK triggered: the one added manually and the one added programmatically within the visible bounds of the Stage.

Trace statements

Added some trace statements throughout the code to see if those would get picked up.

Result: trace statements do not appear to be getting indexed.

 

Preliminary conclusion

My test SWF didn’t have a lot of static text and had no dynamically loaded text to be indexed. I’m working on an updated version of the test SWF so I can look into what exactly gets indexed in that case, and how.

This morning I got a comment on my previous blog post from my brother Kristof, saying he noticed Google was now indexing URLs to his band’s photographs and music files, which he references from his Flash content. From what I can see, Google followed a reference to an XML file and indexed that file, which contains the URLs.

This is what Google says: “We currently do not attach content from external resources that are loaded by your Flash files. If your Flash file loads an HTML file, an XML file, another SWF file, etc., Google will separately index that resource.”

http://googlewebmastercentral.blogspot.com/2008/06/improved-flash-indexing.html

 
Why? Adobe, please tell me why this is a good thing and how this would help SEO of Flash content. It makes no sense whatsoever to index calls to .xml files and server-side scripts referenced from an SWF and link to those URLs.

Just to make this clear: if you do a filetype:swf search in Google, no dynamically loaded data will show up. What happens instead is that the URLs you use in your SWF get crawled separately. You’ll increasingly start seeing the .xml, .php, etc. files used by your SWF show up in the search results, linking not to the SWF file that uses them but to the .xml, .php, etc. file itself.

In short:

- Google follows URLRequest links and indexes XML and other files referenced in your SWF that return text (though not in the context of the SWF, i.e. it links to those URLs directly rather than to the SWF that uses them)
- Only instances that are added to the DisplayList and visible on the Stage get triggered
- When using URLVariables, no values seem to get sent along with the URLRequest

 

Remaining issues

These are two things I’ve seen happen that could be troublesome:

- URLs of files loaded by SWF content are getting exposed in search results (and not in reference to the SWF that uses them)
- Server-side scripts referenced in the SWF are getting hits from search bots, potentially causing unwanted behavior.
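
The second issue can be reduced on the server side. As a minimal sketch, written in Python rather than the PHP used for the test script, the handler could refuse to send an email unless the request looks like a real form submission. The field name `message`, the function names, and the list of crawler markers are assumptions for illustration; they are not taken from the test setup.

```python
from typing import Mapping

# Substrings used to recognise crawler User-Agent headers.
# This list is an assumption for illustration, not an exhaustive registry.
CRAWLER_MARKERS = ("googlebot", "slurp", "bot", "spider", "crawl")


def looks_like_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a crawler marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def should_send_email(method: str, user_agent: str,
                      params: Mapping[str, str],
                      required_field: str = "message") -> bool:
    """Decide whether a request to the mail script should trigger an email.

    Rejects requests that are not POST, that identify as a crawler, or
    that lack a non-empty value for the required field.
    """
    if method.upper() != "POST":
        return False
    if looks_like_crawler(user_agent):
        return False
    return bool(params.get(required_field, "").strip())
```

Since the search bot hits in the test arrived without any POST or GET values, the check for a non-empty field alone would have suppressed those emails. The User-Agent check is an extra safeguard and is easily spoofed.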

 
I really want to see Adobe, Google and Yahoo! urgently come out with additional information for developers on how to prevent unwanted files from getting indexed, how the indexing works for the various search engines, and how each of them handles things like following URLRequests.
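
Until such documentation exists, robots.txt is the one existing mechanism that should apply to URLs crawled separately from the SWF. The sketch below uses Python's standard `urllib.robotparser` to check a set of rules before deploying them. The paths, the `example.com` host, and the assumption that the crawler treats URLs discovered inside a SWF like any other URL are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: block the data files and the mail script,
# leave everything else (including the SWF itself) crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /data/
Disallow: /scripts/sendmail.php
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs: the SWF, an XML file it loads, and the script it calls.
for url in (
    "http://www.example.com/flash/test.swf",
    "http://www.example.com/data/photos.xml",
    "http://www.example.com/scripts/sendmail.php",
):
    print(url, parser.can_fetch("Googlebot", url))
```

robots.txt rules apply per host, so files served from another domain need the rules on that domain.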

 


 
Creative Commons License This work, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 Belgium License.


  1. July 6th, 2008 at 12:18 | #1

    It looks so messed up..
    Thanks for your efforts

  2. July 6th, 2008 at 22:42 | #2

    Howdy Peter, thanks for the tests.

    “Why? Adobe, please tell me why this is a good thing and how this would help SEO of Flash content.”

    I think this is in Google’s realm. Adobe created a “headless Player”, presumably with some automation and logging APIs. But the spidering policy itself, like the databasing and ranking aspects, is something that each search engine itself determines.

    There’s still a lot I don’t understand about the Google implementation (why worry about “copyright” text, eg?). This decision on dynamic files and data seems a tricky one — Ajax apps would face similar questions — I’d defer to Google on this one.

    jd/adobe

  3. July 6th, 2008 at 23:34 | #3

    Thanks for your thoughts John, I agree — the issue here is essentially with Google, wish Adobe had worked with them to publish some documentation before having this new indexing behavior going live.

    Google did a blog post on Tuesday so I assume the Flash Player team already knew how things were going to be handled.

    The issue of Google indexing URLs and linking to those instead of the SWF (even if that is just temporary) is pretty serious, and I think it’s fair to say that as a result there is not much, if any, improvement in the ability to do SEO for Flash content at this point.

    Is there any resource where we can flag possible issues at Google? I would guess it’s out of the hands of Adobe at this point.

    I hope this issue doesn’t repeat itself with Yahoo!, with possibly different indexing behavior to have to deal with.

  4. July 6th, 2008 at 23:59 | #4

    First, thanks for doing this research. We’re planning on doing some SWF indexing tests ourselves as well so it’s good to know what has already been tried.

    As for your remark on “…come out with additional information for developers on how to prevent unwanted files getting indexed…”: well, there’s the standard robots.txt file that all search engine spiders adhere to. Use robots.txt to disallow stuff spiders shouldn’t touch.

  5. July 7th, 2008 at 00:15 | #5

    Thanks Erki, I wanted to raise the issue of robots.txt as well — if you have an SWF that references XML or scripts on a different domain you will essentially need to exclude your SWF from being indexed if you don’t want those scripts to show up.

    That means it won’t do the old static text indexing behavior either and that Flash content won’t show up at all in search results.

  6. July 7th, 2008 at 01:49 | #6

    “Google did a blog post on Tuesday so I assume the Flash Player team already knew how things were going to be handled.”

    That wasn’t the impression I got the day before the announcement, although I imagine someone at Adobe saw a draft of the Google description before it was later released.

    The “deep-linking” aspect is still too complex for me to have an opinion on… I’d do better with looking at specific searches, figuring how people actually try to find applications, and whether they want to be dumped to a specific state within that application or whether they just need to find that application itself. More clients worry about SEO for discoverability, I’d wager, rather than for data-extraction. But there are many different SEO priorities within different constituencies — the meaning of the word “seo” varies with the ear which hears it. Difficult area.

    (The most interesting angle for me in this whole thing is actually about the Headless Player, its automation, and what we might be able to do with such capabilities in the future.)

    jd/adobe

  7. July 7th, 2008 at 02:08 | #7

    Peter.

    Thanks for your investigations. Your article is the 1st decent write up I have seen with actual results.

    I have asked the Adobe engineers on a certain beta program to comment on any specifics they can give about what is/isn’t possible with the player (e.g. can the player index swf content embedded with js libraries such as swfobject). I am hoping for at least a bit of enlightenment…

  8. July 7th, 2008 at 16:48 | #8

    Thanks! This is the first place I’ve actually seen someone trying to diagnose what Google is doing, rather than just complain about lazy “security by obscurity” flash dev…

    I’m really interested in your testing results and hoping if Google doesn’t ever come out and tell us the how or why about what they’re doing, with more tests and community efforts we can create some guidelines for designing swf to receive correct indexing.

    Hoping as well that Google realizes that just giving out our back end xml and php files they’re not really giving better data in searches, but out of context code that will not really help users. (except of course those who want to sniff around for files and weren’t informed enough to be using existing tools already)

    Please keep us informed on your research.

  9. July 7th, 2008 at 19:09 | #9

    Hey Peter, great post. I’m also curious about robots.txt. Specifically in your brother’s case, I wonder if he could modify the robots.txt file so it can’t look in his music directory or the XML files. I wonder if that would keep them from being indexed.

    =Ryan
    rstewart@adobe.com

  10. Benny
    July 7th, 2008 at 20:52 | #10

    This is what Google says: “We currently do not attach content from external resources that are loaded by your Flash files. If your Flash file loads an HTML file, an XML file, another SWF file, etc., Google will separately index that resource.”

    Why? Adobe, please tell me why this is a good thing and how this would help SEO of Flash content.

    Well, it’s a good thing if your Flash site maintains state (deep linking). If every dynamically loaded PHP file were seen as part of the main movie, then we would lose the state benefits.

    As I see things Google & Adobe are on the right track, now we - as developers/designers - have to do our bit:

    1. (If needed) prevent the indexing of XML, images, etc. with robots.txt
    2. Wrap the XML files in a server-side scripting page such as PHP and call that from our movie
    3. Next, we should always implement deep linking (and of course Flash detection).
    4. Give Google and other SEs a hand by providing a sitemap.xml

    Now I guess there will be a (small) number of sites that don’t have (or don’t want to use) the availability of server side scripting. I think they should still implement 1,3 and 4. Instead of step 2 they could still do SEO the ‘old’ way by using something like SWFObject (although I would prefer a better solution, see following proposal).

    Maybe if the SE’s would process the link info delivered by the SE-FlashPlayer as proposed next, then we all could be happy:

    1) The SE-Flash player reports an XML file being loaded in.
    2) The SE checks if that XML is excluded from direct access in robots.txt
    \-> if so, then index the content as being an integral part of the loading SWF
    \-> if not, then index the linked page separately

  11. Benny
    July 7th, 2008 at 21:22 | #11

    I was just thinking of another option to get a bit of control over what should and what shouldn’t be indexed. In HTML, in addition to robots.txt, we have the robots meta tag. If we had an equivalent of that in Flash/Flex that would be passed on or followed by the special Search Engine (Proxy) Flash Player, then we would have much more control. We could even tell the Proxy Player to report the loaded data as internal data, e.g. “index_internal”, in addition to the already standard “noindex”, “nofollow”, etc.

    Where to add the robots info? I think there are several options, e.g. in AS3 we could extend URLRequest with a property “robots:Array” or we could add it to URLRequest.data or maybe as a URLRequest.requestHeader, …

    ?

  12. July 8th, 2008 at 07:52 | #12

    We are also in urgent need of some place to put robots info / crawling restrictions.

    At work we make use of remote calls to xml files e.g. for our live event tickers . In addition to our reporting we also include near-time stats and position data. This data is licensed to us under the condition that it is not (easily) publicly available.

    The application and the data is then pushed to our customers sites (e.g. around 100 newspapers, publishers and portals for EURO2008). Since these sites are not ours we do not have access to the robots.txt files. Using META tags is also not an option since we are using XML files.

    Switching over to serving the data files only from servers that are under our control is (currently) not an option.

    So where do we put the robots infos? Google/Adobe? I guess there are more having the exact same problems. Some clarification would definitely help

    All this would not be an issue if instead of indexing the xml file etc. separately. Google would use the information in order to link back to the originating swf file.

  13. July 12th, 2008 at 22:56 | #13

    When you say that Google will start indexing XML files, I wonder how Google will be able to interpret the text. For example, what I experienced is that a title tag or h1, h2, h3, … tags carry more weight when your content gets indexed, and ul/ol tags are simply lists of things. But when you don’t use HTML, there is no more standard meaning in your content (!)

    To prove my point, I also have a funny (quite nicely working) example/experiment I’d like to share here. For one particular website, I used XHTML pages to put my content in (and not XML files as usual). Surprisingly, parsing the XHTML was easier than I thought.
    Result: the whole site is fully indexed by Google (yippie !).
    You can check this example by typing “site:www.alternativ.be” in Google (even all the images have been indexed too).

    When structuring dynamic content it’s important to have tags that are semantically correct. For example, a title in your Flash site should be treated differently than body-text and so on, and so on…

    Side note: the Alternativ website itself was built in ActionScript 2 and the content management system with Flex 2 with ActionScript 3 and I must say that with E4X, extracting content from any XHTML page is a breeze.

  14. Mitchell Thomas
    July 18th, 2008 at 01:23 | #14

    Nice article Peter. I wonder what is the use of indexing a swf file that is dependent upon the data (i.e. flashvars) that get passed to it? I’m specifically talking about flash video players and widgets here - but even a flash website that is programmed to respond to deep linking, could potentially have the same problem. It seems that there are a lot of swf files out there that are not very useful outside of the context of the HTML page they are embedded in. And even if the swf can stand by itself one of the reasons we embed them in HTML pages is so the swf can be viewed at the size it was designed to be viewed at. I realize you can set the scaleMode to noScale in actionscript - but what if your application requires scaling?

    As for your tests: have you tried filling in the title/description in the Document properties to see how that affects indexing? The title and description are part of the compiled SWF, so they should be readable by the Googlebot.
